# Finding causal genotype-phenotype relations with ddlite: extraction

## Introduction
In this example **ddlite** app, we'll build a system to identify causal relationships between genotypes and phenotypes from raw journal articles. For an end-to-end example, see **GeneTaggerExample_Extraction.ipynb** and **GeneTaggerExample_Learning.ipynb**.

In [1]:
%load_ext autoreload
%autoreload 2

import cPickle, os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))

data_dir = 'gene_phen_relation_example/{}'.format(os.environ.get('docs', 'data'))

from snorkel.snorkel import *

## Processing the input data
We have already downloaded the raw HTML for 2800 relevant article pages from PubMed. These can be found in the `data` folder. We can use ddlite's `DocParser` to read in the article text. It uses CoreNLP via ddlite's `SentenceParser` to parse each sentence. Parsing can take a little while, so if the example has already been run, we'll reload the saved sentences instead of parsing again.

In [2]:
pkl_f = 'gene_phen_relation_example/gene_phen_saved_sents_v3.pkl'
try:
    with open(pkl_f, 'rb') as f:
        sents = cPickle.load(f)
except:
    print "Getting data from directory {}".format(data_dir)
    dp = DocParser(data_dir, TextReader())
    %time sents = dp.parseDocSentences()
    with open(pkl_f, 'wb') as f:
        cPickle.dump(sents, f)

print sents[0]

Getting data from directory gene_phen_relation_example/data
CPU times: user 34.6 s, sys: 1.12 s, total: 35.7 s
Wall time: 6min 44s
Sentence(words=[u'KORA-gen', u'-', u'Resource', u'for', u'population', u'genetics', u',', u'controls', u'and', u'a', u'broad', u'spectrum', u'of', u'disease', u'phenotypes', u'.'], lemmas=[u'kora-gen', u'-', u'Resource', u'for', u'population', u'genetics', u',', u'control', u'and', u'a', u'broad', u'spectrum', u'of', u'disease', u'phenotype', u'.'], poses=[u'NN', u':', u'NNP', u'IN', u'NN', u'NNS', u',', u'NNS', u'CC', u'DT', u'JJ', u'NN', u'IN', u'NN', u'NNS', u'.'], dep_parents=[3, 3, 0, 6, 6, 3, 3, 3, 8, 12, 12, 8, 15, 15, 12, 3], dep_labels=[u'compound', u'punct', u'ROOT', u'case', u'compound', u'nmod', u'punct', u'dep', u'cc', u'det', u'amod', u'conj', u'case', u'compound', u'nmod', u'punct'], sent_id=0, doc_id=0, text=u'KORA-gen - Resource for population genetics, controls and a broad spectrum of disease phenotypes.', token_idxs=[0, 9, 11, 20, 24, 35,
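
The printed `Sentence` stores the dependency parse as two parallel lists, `dep_parents` and `dep_labels`. Judging from the output above, where `Resource` is the third token, carries the label `ROOT`, and has parent `0`, the parent indices appear to be 1-based with `0` marking the root. The sketch below works under that assumption. The `Sentence` namedtuple is a hypothetical stand-in that copies three fields from the output, not ddlite's own class, and `head_word` and `path_to_root` are illustrative helpers.

```python
from collections import namedtuple

# Hypothetical stand-in for the parsed sentence; only three printed fields are kept.
Sentence = namedtuple('Sentence', ['words', 'dep_parents', 'dep_labels'])

sent = Sentence(
    words=['KORA-gen', '-', 'Resource', 'for', 'population', 'genetics', ',',
           'controls', 'and', 'a', 'broad', 'spectrum', 'of', 'disease',
           'phenotypes', '.'],
    dep_parents=[3, 3, 0, 6, 6, 3, 3, 3, 8, 12, 12, 8, 15, 15, 12, 3],
    dep_labels=['compound', 'punct', 'ROOT', 'case', 'compound', 'nmod',
                'punct', 'dep', 'cc', 'det', 'amod', 'conj', 'case',
                'compound', 'nmod', 'punct'],
)


def head_word(sent, i):
    """Return the head word of token i, or None if token i is the root.

    Assumes dep_parents is 1-based and uses 0 for the root.
    """
    parent = sent.dep_parents[i]
    return None if parent == 0 else sent.words[parent - 1]


def path_to_root(sent, i):
    """Return the words on the dependency path from token i up to the root."""
    path = [sent.words[i]]
    while sent.dep_parents[i] != 0:
        i = sent.dep_parents[i] - 1  # convert the 1-based parent to a list index
        path.append(sent.words[i])
    return path


root_word = sent.words[sent.dep_labels.index('ROOT')]
disease_path = path_to_root(sent, sent.words.index('disease'))
print(root_word)
print(disease_path)
```

Paths like this can be a useful signal later on, for example when checking how a gene mention and a phenotype mention are connected within a sentence.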

## Extracting relation candidates with matchers
Extracting candidates for relations in ddlite is done with `Matcher` objects. Here, we'll use two `DictionaryMatch` objects. We have access to fairly comprehensive gene and phenotype dictionaries. Let's load them in and create the matchers.

In [3]:
# Schema is: ENSEMBL_ID | NAME | TYPE (refseq, canonical, non-canonical)
genes = [line.rstrip().split('\t')[1] for line in open('gene_phen_relation_example/dicts/ensembl_genes.tsv')]
genes = [g for g in genes if len(g) > 3]  # keep only names longer than 3 characters

# Schema is: HPO_ID | NAME | TYPE (exact, lemma)
phenos = [line.rstrip().split('\t')[1] for line in open('gene_phen_relation_example/dicts/pheno_terms.tsv')]

GM = DictionaryMatch(label='G', dictionary=genes, ignore_case=False)
PM = DictionaryMatch(label='P', dictionary=phenos)
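
To make the idea of dictionary matching concrete, here is a minimal sketch of how a matcher could tag token spans whose text appears in a term list. This is not ddlite's implementation. The function `dictionary_spans`, its `max_len` parameter, and the toy sentence are all assumptions made for illustration, and the `ignore_case` flag only mirrors the name of the option used above.

```python
def dictionary_spans(words, dictionary, ignore_case=True, max_len=4):
    """Return (start, end) token spans, end exclusive, whose text is in `dictionary`.

    Hypothetical helper: tries every n-gram of up to `max_len` tokens.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    terms = set(norm(t) for t in dictionary)
    spans = []
    for start in range(len(words)):
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            if norm(' '.join(words[start:end])) in terms:
                spans.append((start, end))
    return spans


# Toy sentence, already tokenized
words = ['Mutations', 'in', 'BRCA1', 'are', 'associated', 'with',
         'breast', 'cancer', '.']

# Case-sensitive matching for gene symbols, case-insensitive for phenotypes
gene_spans = dictionary_spans(words, ['BRCA1'], ignore_case=False)
pheno_spans = dictionary_spans(words, ['Breast cancer', 'cancer'])
missed = dictionary_spans(words, ['brca1'], ignore_case=False)

print(gene_spans)
print(pheno_spans)
```

Note that overlapping terms such as `breast cancer` and `cancer` both match here. A real matcher has to decide how to treat such overlaps, and this sketch simply keeps them all.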

If we wanted more than one `Matcher` for, say, genes, we could combine several `Matcher` objects with a `MultiMatcher`. For now, we'll use a single `DictionaryMatch` for each class. We'll use these to extract our candidate relations from the sentences into a `Relations` object. Matchers alone will likely give high recall but poor precision, because not every genotype-phenotype mention pair in the same sentence represents a causal pairing. The `Relations` object can then be used in a `CandidateModel`, which lets us learn a model that predicts whether each candidate pair is causal.

In [4]:
R = Relations(sents, GM, PM)
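
The recall and precision trade-off is easier to see with a toy version of candidate generation. The sketch below assumes that candidates are formed by pairing every gene mention with every phenotype mention in the same sentence. That is an assumption about the general approach, not a description of how `Relations` is implemented. The function `relation_candidates` and the placeholder mention names are hypothetical.

```python
from itertools import product


def relation_candidates(mentions_by_sentence):
    """Pair each gene mention with each phenotype mention in the same sentence.

    `mentions_by_sentence` is a list of (gene_mentions, pheno_mentions) tuples,
    one per sentence. Returns (sentence_index, gene, phenotype) triples.
    """
    candidates = []
    for sent_idx, (genes, phenos) in enumerate(mentions_by_sentence):
        for gene, pheno in product(genes, phenos):
            candidates.append((sent_idx, gene, pheno))
    return candidates


# Placeholder mentions for three sentences
toy_mentions = [
    (['GENE_A', 'GENE_B'], ['phenotype_x', 'phenotype_y']),  # 2 x 2 = 4 pairs
    (['GENE_C'], []),                                        # no phenotype, no pair
    ([], ['phenotype_z']),                                   # no gene, no pair
]

cands = relation_candidates(toy_mentions)
for cand in cands:
    print(cand)
```

A sentence with two genes and two phenotypes already yields four candidates, even if the text asserts only one causal link. Nothing that co-occurs is missed, but many candidates are false positives, which is why a learned model is needed to filter them.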

We can visualize contexts for our extractions too. This may help in writing labeling functions in a learning task.

In [5]:
R[2].render()

Finally, we can pickle the extracted candidates from our `Relations` object for later use.

In [6]:
R.dump_candidates('gene_phen_relation_example/gene_phen_saved_relations_v4.pkl')