# Using Snorkel with biomedical literature and PubAnnotation

This tutorial shows how to use Snorkel to extract related diseases and genes from PubMed abstracts, using PubAnnotation.

The overall flow of this tutorial is the following:
1. Use Stanford CoreNLP to parse an initial set of 5 documents (for which we have labels from DisGeNET).
2. Create labeling functions.
3. Run labeling functions on the small corpus.
4. Compare against gold labels.
5. Iterate, refining the labeling functions.
6. Annotate a new corpus and upload it to a PubAnnotation project.

NOTE: Much of the code is adapted from the Snorkel intro [tutorial](https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro). This is a work in progress, and this warning will remain until the example works properly end to end.


## Initializing a SnorkelSession

First, we initialize a SnorkelSession, which manages a connection to a database automatically for us, and will enable us to save intermediate results. If we don't specify any particular database (see commented-out code below), then it will automatically create a SQLite database in the background for us:

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

# Here, we just set a global variable related to automatic testing; you can safely ignore this!
max_docs = 5 if 'CI' in os.environ else float('inf')



## Fetching the corpus from PubAnnotation

Next, we load and pre-process the corpus of documents.


In [29]:
from pubannotationutils.getfrompubannotation import getCorpusPubAnnotation
getC = getCorpusPubAnnotation('data/test_get.csv')

This step is important because the code above fetches two things:
1. All the text for the specified PMIDs.
2. All available annotations for the specified project. The PMIDs and projects are both listed in test_get.csv, which contains one PMID,project_name pair per line.
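
As a minimal sketch of how such an input file could be assembled, the snippet below writes one PMID,project_name pair per line. The PMIDs, the project name, the output path, and the helper name write_pmid_project_csv are all hypothetical placeholders, and the absence of a header row is an assumption based on the description above.

```python
import os
import tempfile


def write_pmid_project_csv(path, pairs):
    """Write one 'PMID,project_name' pair per line (assumed: no header row)."""
    with open(path, "w") as handle:
        for pmid, project in pairs:
            handle.write("{0},{1}\n".format(pmid, project))


# Hypothetical placeholders; substitute real PMIDs and a real project name.
example_pairs = [
    ("12345678", "example_project"),
    ("23456789", "example_project"),
]

# Written to a temporary location so the sketch does not overwrite real data.
example_path = os.path.join(tempfile.gettempdir(), "example_get.csv")
write_pmid_project_csv(example_path, example_pairs)
```

In principle, a file written this way could be passed to getCorpusPubAnnotation in place of data/test_get.csv.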

In this example we get 5 PMID abstracts and all their annotations to use later as gold labels. The call to getCorpusPubAnnotation produces 3 different files:
1. test_get_text.txt, which includes each PMID and its text.
2. test_get_denotations.txt, which includes all PubAnnotation denotations.
3. test_get_relations.txt, which includes all PubAnnotation relations.

We will begin by loading the test_get_text.txt file for parsing:

In [30]:
from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/test_get_text.txt', max_docs=max_docs)
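
Before parsing, it can be worth sanity-checking the text file. The sketch below assumes each line holds a PMID, a tab, and the abstract text, which is the layout a TSV preprocessor would expect; the helper name parse_tsv_lines and the sample lines are hypothetical and only stand in for the real file.

```python
def parse_tsv_lines(lines, max_docs=float("inf")):
    """Split 'PMID<TAB>text' lines into (pmid, text) pairs.

    Raises ValueError on a non-blank line without a tab, so a malformed
    file fails loudly here rather than later during parsing.
    """
    docs = []
    for line in lines:
        if len(docs) >= max_docs:
            break
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        pmid, text = line.split("\t", 1)
        docs.append((pmid, text))
    return docs


# Hypothetical sample lines standing in for data/test_get_text.txt.
sample_lines = [
    "12345678\tHypothetical abstract text about a gene.\n",
    "\n",
    "23456789\tAnother hypothetical abstract.\n",
]
docs = parse_tsv_lines(sample_lines)
```

To check the real file, an open file handle can be passed instead of the sample list, since the function only iterates over lines.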

### Running a CorpusParser

We'll use an NLP preprocessing tool to split our documents into sentences and tokens, and to provide annotations (part-of-speech tags, dependency parse structure, lemmatized word forms, etc.) for these sentences.

Let's run it single-threaded first:

In [31]:
from snorkel.parser import CorpusParser

corpus_parser = CorpusParser()
%time corpus_parser.apply(doc_preprocessor)

Clearing existing...
Running UDF...
CPU times: user 218 ms, sys: 164 ms, total: 382 ms
Wall time: 2min 29s


We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [32]:
from snorkel.models import Document, Sentence

print "Documents:", session.query(Document).count()
print "Sentences:", session.query(Sentence).count()

Documents: 5
Sentences: 49


## Defining a Candidate schema

We now define the schema of the relation mention we want to extract (which is also the schema of the candidates). This must be a subclass of Candidate, and we define it using a helper function. Here we'll define a binary Disease-Gene mention which connects two Span objects of text. Note that this function will create the table in the database backend if it does not exist:


In [33]:
from snorkel.models import candidate_subclass
Spouse = candidate_subclass('DiseaseGene', ['disease', 'gene'])
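
To make the idea of a binary mention concrete, here is an illustrative plain-Python sketch that does not use Snorkel's own classes. The names TextSpan, DiseaseGeneMention, find_span, and span_text are invented for this example, and the inclusive end offset is a choice made for the sketch rather than a statement about Snorkel's internals.

```python
from collections import namedtuple

# A span is a pair of character offsets into a sentence (end is inclusive here).
TextSpan = namedtuple("TextSpan", ["char_start", "char_end"])

# A binary mention connects exactly two spans: one disease and one gene.
DiseaseGeneMention = namedtuple("DiseaseGeneMention", ["disease", "gene"])


def find_span(sentence, phrase):
    """Return the span of the first occurrence of phrase in sentence."""
    start = sentence.index(phrase)
    return TextSpan(start, start + len(phrase) - 1)


def span_text(sentence, span):
    """Recover the text covered by a span."""
    return sentence[span.char_start:span.char_end + 1]


sentence = "Mutations in BRCA1 are associated with breast cancer."
mention = DiseaseGeneMention(
    disease=find_span(sentence, "breast cancer"),
    gene=find_span(sentence, "BRCA1"),
)
```

The mention stores only offsets, so the same sentence can carry several candidate pairs without duplicating its text.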