# Phenotype extraction

In this notebook, we demo the first module of the GWASdb system, which extracts the phenotypes that are studied in each paper.

Before starting, make sure you have downloaded all the datasets: the phenotype ontologies, the GWAS Catalog database, and the open-access GWAS papers.

## Preparations

We start by configuring Jupyter and setting up our environment.

In [97]:
%load_ext autoreload
%autoreload 2

import sys
import cPickle
import numpy as np
import sqlalchemy

# set the paths to snorkel and gwasdb
sys.path.append('../snorkel-tables')
sys.path.append('../src')
sys.path.append('../src/crawler')

# set up the directory with the input papers
abstract_dir = '../data/db/papers'

# set up matplotlib
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,4)

# create a Snorkel session
from snorkel import SnorkelSession
session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Load corpus

Our system will read PubMed papers that have been previously identified as GWAS-related. We load this corpus below.

In [3]:
from extractor.parser import GWASXMLAbstractParser

xml_parser = GWASXMLAbstractParser(
    path=abstract_dir,
    doc='./*',
    title='.//front//article-title//text()',
    abstract='.//abstract//p//text()',
    par1='.//body/p[1]//text()',
    id='.//article-id[@pub-id-type="pmid"]/text()',
    keep_xml_tree=True)

`GWASXMLAbstractParser` is a custom parser that we wrote. For each paper, it extracts the title and either the abstract or, if there is no abstract, the first paragraph of the body.
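The fallback described above can be sketched with the standard library alone. This is only an illustration, not the actual `GWASXMLAbstractParser`: the function name `extract_title_and_text` and the helper `_text` are hypothetical, and the sketch assumes the same element names (`front`, `article-title`, `abstract`, `body`, `p`) as the XPath expressions passed to the parser.

```python
import xml.etree.ElementTree as ET

def _text(element):
    """Concatenate all text inside an element and collapse whitespace."""
    return ' '.join(''.join(element.itertext()).split())

def extract_title_and_text(xml_string):
    """Return (title, text); text is the abstract or, failing that, the first body paragraph.

    Hypothetical helper: it mimics the behaviour described in the text,
    not the real parser implementation.
    """
    root = ET.fromstring(xml_string)

    front = root.find('.//front')
    title_el = front.find('.//article-title') if front is not None else None
    title = _text(title_el) if title_el is not None else ''

    # Prefer the abstract paragraphs when an abstract is present
    abstract = root.find('.//abstract')
    paragraphs = [_text(p) for p in abstract.iter('p')] if abstract is not None else []

    # Otherwise fall back to the first paragraph directly under <body>
    if not paragraphs:
        body = root.find('.//body')
        first_p = body.find('p') if body is not None else None
        paragraphs = [_text(first_p)] if first_p is not None else []

    return title, ' '.join(paragraphs)
```

A paper without an `<abstract>` element therefore still yields some text, as long as its body starts with a paragraph.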

In [4]:
from snorkel.parser import SentenceParser
from snorkel.parser import CorpusParser
from snorkel.models import Corpus

# this splits documents into sentences and parses each sentence with Stanford CoreNLP
sent_parser = SentenceParser(timeout=600000)

try:
    corpus = session.query(Corpus).filter(Corpus.name == 'GWAS Corpus').one()
except sqlalchemy.orm.exc.NoResultFound:
    # the corpus has not been parsed yet, so parse and store it
    cp = CorpusParser(xml_parser, sent_parser)
    %time corpus = cp.parse_corpus(name='GWAS Corpus', session=session)
    session.add(corpus)
    session.commit()

print 'Loaded corpus of %d documents' % len(corpus)

Loaded corpus of 589 documents


## Candidate extraction

The first stage is to generate a large set of candidate phenotypes, which may or may not be correct. After that, we will train classifiers to predict which ones are correct.

### Extract candidates

We first load our phenotype ontologies, which will be used to generate candidates.

In [7]:
from db.kb import KnowledgeBase
from extractor.util import make_ngrams

# collect phenotype list
kb = KnowledgeBase()

# efo phenotypes
efo_phenotype_list0 = kb.get_phenotype_candidates(source='efo-matching', peek=False) # TODO: remove peeking
efo_phenotype_list = list(make_ngrams(efo_phenotype_list0))
# snomed keywords
snomed_phenotype_list = kb.get_phenotype_candidates(source='snomed')
# mesh diseases
mesh_phenotype_list0 = kb.get_phenotype_candidates(source='mesh')
mesh_phenotype_list = list(make_ngrams(mesh_phenotype_list0))
# mesh chemicals
chem_phenotype_list = kb.get_phenotype_candidates(source='chemical')
# regex matches
rgx = u'[A-Za-z\u2013-]+ (disease|trait|phenotype|outcome|response|quantitative trait|measurement|response|side effects)s?'

We define matchers and an extractor that generate candidates based on these ontologies.

In [99]:
from snorkel.candidates import Ngrams
from snorkel.matchers import DictionaryMatch, Union, RegexMatchSpan
from extractor.matcher import PhenotypeMatcher
from extractor.util import change_name

# Define a candidate space
ngrams = Ngrams(n_max=7)

# Define a matcher for each ontology
efo_phen_matcher = PhenotypeMatcher(d=efo_phenotype_list, ignore_case=True, mod_fn=change_name)
snom_phen_matcher = PhenotypeMatcher(d=snomed_phenotype_list, ignore_case=True, mod_fn=change_name)
mesh_phen_matcher = PhenotypeMatcher(d=mesh_phenotype_list, ignore_case=True, mod_fn=change_name)
chem_phen_matcher = DictionaryMatch(d=chem_phenotype_list, longest_match_only=True, ignore_case=True)
regex_phen_matcher = RegexMatchSpan(rgx=rgx)

# The phenotype matcher is the union of these
# phen_matcher = Union(efo_phen_matcher, mesh_phen_matcher, chem_phen_matcher, regex_phen_matcher)
phen_matcher = Union(efo_phen_matcher, snom_phen_matcher, mesh_phen_matcher, chem_phen_matcher, regex_phen_matcher)

# Define the extractor
from snorkel.candidates import CandidateExtractor
from snorkel.models import candidate_subclass

Phenotype = candidate_subclass('SnorkelPhenotype', ['phenotype'])
phen_extractor = CandidateExtractor(Phenotype, ngrams, phen_matcher)

NameError: name 'efo_phenotype_list' is not defined

In [100]:
rgx = u'[A-Za-z\u2013-]+ (disease|trait|phenotype|outcome|response|quantitative trait|measurement|response|side effects)s?'
regex_phen_matcher = RegexMatchSpan(rgx=rgx)

In [5]:
from snorkel.models import candidate_subclass

Phenotype = candidate_subclass('SnorkelPhenotype', ['phenotype'])
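To make the idea behind the matcher union concrete, here is a minimal Snorkel-free sketch. It enumerates all n-grams of up to 7 tokens and keeps those that appear in a dictionary (ignoring case) or that fully match the regular expression. The functions `ngrams` and `match_candidates` are hypothetical, the full-match behaviour is an assumption about how `RegexMatchSpan` is applied, and the name normalization done by `change_name` is left out.

```python
import re

# Same pattern as in the notebook, with the duplicated 'response' alternative removed
RGX = (u'[A-Za-z\u2013-]+ (disease|trait|phenotype|outcome|response|'
       u'quantitative trait|measurement|side effects)s?')

def ngrams(tokens, n_max=7):
    """Yield every contiguous span of 1..n_max tokens as a string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def match_candidates(tokens, dictionary, rgx=RGX, n_max=7):
    """Keep n-grams found in the dictionary (case-insensitive) or matching rgx in full.

    Assumption: the regex must cover the whole span, hence the trailing '$'.
    """
    lowered = set(d.lower() for d in dictionary)
    pattern = re.compile(rgx + u'$')
    return [g for g in ngrams(tokens, n_max)
            if g.lower() in lowered or pattern.match(g)]
```

For the tokens of "loci for pulmonary fibrosis and Crohn disease" and a dictionary containing "Pulmonary fibrosis", this keeps "pulmonary fibrosis" (dictionary hit) and "Crohn disease" (regex hit).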

Finally, we extract the candidates.

In [48]:
from snorkel.models import CandidateSet

try:
    phen_c = session.query(CandidateSet).filter(CandidateSet.name == 'Phenotype Candidates').one()
except sqlalchemy.orm.exc.NoResultFound:
    # no stored candidate set yet, so run the extractor
    sentences = [s for doc in corpus for s in doc.sentences]
    print '%d sentences loaded' % len(sentences)
    %time phen_c = phen_extractor.extract(sentences, 'Phenotype Candidates', session)
    session.add(phen_c)
    session.commit()

print '%d candidates extracted' % len(phen_c)

71476 candidates extracted


In [47]:
phen_c[0].phenotype.parent.position

0

We would like to remove nested candidates as well as obviously wrong candidates.

In [50]:
from extractor.candidates import deduplicate, filter_cand

# we drop short candidates and candidates that don't occur within the first 3 sentences
# TODO: add stopwords: genome, association, population, analysis
def filter_fn(cand, attrib='phenotype'):
    txt = getattr(cand, attrib).get_span()
    sent_n = getattr(cand, attrib).parent.position
    # keep spans of at least 5 characters found in sentences 0, 1, or 2
    return len(txt) >= 5 and sent_n <= 2

# try:
#     new_phen_c = session.query(CandidateSet).filter(CandidateSet.name == 'Filtered Phenotype Candidates').one()
# except:
new_phen_c = CandidateSet(name='Filtered Phenotype Candidates')
for cand in filter_cand(deduplicate(phen_c), filter_fn=filter_fn):
    new_phen_c.append(cand)
session.add(new_phen_c)
session.commit()
    
print len(phen_c) - len(new_phen_c), 'candidates dropped, now we have', len(new_phen_c)
phen_c = new_phen_c

64008 candidates dropped, now we have 7468
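The two clean-up steps can be illustrated on plain tuples. This is a sketch under assumptions: `drop_nested` is a hypothetical stand-in for `deduplicate` (whose exact behaviour is not shown in this notebook) and treats a span as nested when another span in the same sentence fully contains it; `keep_candidate` restates the rule in `filter_fn`.

```python
def drop_nested(spans):
    """Remove duplicate spans and spans contained in a longer span of the same sentence.

    spans: iterable of (sentence_id, char_start, char_end) tuples.
    Assumption: this is what 'removing nested candidates' means here.
    """
    unique = sorted(set(spans))

    def contained(inner, outer):
        return (inner != outer and inner[0] == outer[0]
                and outer[1] <= inner[1] and inner[2] <= outer[2])

    return [s for s in unique if not any(contained(s, o) for o in unique)]

def keep_candidate(text, sentence_position, min_chars=5, max_position=2):
    """Keep spans of at least min_chars characters within the first three sentences."""
    return len(text) >= min_chars and sentence_position <= max_position
```

For example, a span covering "cancer" inside "pancreatic cancer" is dropped by the first step, while a four-letter span such as "risk" is dropped by the second.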


### Candidate recall statistics

We say that a mention of an aggregate phenotype is correct if it matches the name of the GWAS Catalog phenotype, or the name or a synonym of any equivalent EFO phenotype. This gives us a rough estimate of precision and recall.

In [37]:
# from db.kb import KnowledgeBase
# from nltk.stem import PorterStemmer
# from extractor.util import change_name

# kb = KnowledgeBase() # reload
# gold_set_agg_phens = frozenset \
# ([ 
#     (doc.name, phen.id) for doc in corpus.documents 
#                         for phen in kb.phen_by_pmid(doc.name, source='gwas_catalog')
# ])

# # map phenotype names to their id (EFO syn -> GWC id)
# agg_phen2id = dict()
# for doc in corpus.documents:
#     for phen in kb.phen_by_pmid(doc.name, source='gwas_catalog'):
#         for eq_phen in phen.equivalents:
#             for syn in [phen.name] + [eq_phen.name] + eq_phen.synonyms.split('|'):
#                 syn_name = change_name(syn)
#                 if syn_name not in agg_phen2id: agg_phen2id[syn_name] = set()
#                 agg_phen2id[syn_name].add(phen.id)

# # map ids to phenotypes (GWC id -> GWC phen obj)                
# agg_id2phen = \
# {
#     phen.id : phen for doc in corpus.documents
#                    for phen in kb.phen_by_pmid(doc.name, source='gwas_catalog')
# }

from extractor.util import gold_agg_phen_stats
gold_agg_phen_stats(phen_c, gold_set_agg_phens, agg_phen2id)

AttributeError: 'Span' object has no attribute 'get_attributes'
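As a rough illustration of the statistic described in this section, the sketch below computes recall over gold (document, phenotype id) pairs and precision over mentions. The function `recall_precision` and its argument layout are hypothetical and do not reflect the signature of `gold_agg_phen_stats`; lower-casing stands in for the `change_name` normalization.

```python
def recall_precision(mentions, gold_pairs, name2ids, normalize=lambda s: s.lower()):
    """Return (recall, precision) for candidate mentions against a gold set.

    mentions:   iterable of (doc_name, mention_text)
    gold_pairs: set of (doc_name, phenotype_id)
    name2ids:   dict mapping a normalized phenotype name or synonym to a set of ids
    All names here are illustrative assumptions.
    """
    found = set()
    n_mentions = 0
    n_correct = 0
    for doc_name, text in mentions:
        n_mentions += 1
        ids = name2ids.get(normalize(text), set())
        hits = set((doc_name, i) for i in ids if (doc_name, i) in gold_pairs)
        if hits:
            n_correct += 1
        found |= hits
    recall = len(found) / float(len(gold_pairs)) if gold_pairs else 0.0
    precision = n_correct / float(n_mentions) if n_mentions else 0.0
    return recall, precision
```

Recall here counts gold pairs recovered by at least one mention, whereas precision counts mentions that hit at least one gold pair, so a phenotype mentioned many times is not double-counted in recall.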

In [89]:
# FOR DEBUGGING WHY SPANS AREN'T MATCHED
from extractor.util import change_name

doc_id = '23583980'
ngrams = Ngrams(n_max=7)
print id2doc[doc_id].sentences[0]
for span in ngrams.apply(id2doc[doc_id].sentences[0]):
    print span.get_span()
    if phen_matcher._f(span):    
        phen_name = span.get_span()
        print phen_name, change_name(phen_name)
        print '...', phen_name == 'fibrosis', phen_name in phenotype_list, change_name(phen_name) in phenotype_list, phen_name in efo_phen_matcher.d, change_name(phen_name) in efo_phen_matcher.d
        phen_id = phen2id.get(change_name(phen_name), None)
        print phen_id
        if not phen_id or phen_id not in gold_dict_phen[span.context.document.name]:
            print span.context.document.name, phen_id
            print gold_dict_phen[span.context.document.name]
        
        print

Sentence(Document('23583980', Corpus (GWAS Corpus)), 0, u'Genome-wide association study identifies multiple susceptibility loci for pulmonary fibrosis.')
Genome-wide association study identifies multiple susceptibility loci
Genome-wide association study identifies multiple susceptibility loci genom wide associ studi identifi multipl suscept loci
... False False False False False
None
23583980 None
set([u'http://www.ebi.ac.uk/efo/EFO_0004244'])

association study identifies multiple susceptibility loci for
association study identifies multiple susceptibility loci for associ studi identifi multipl suscept loci for
... False False False False False
None
23583980 None
set([u'http://www.ebi.ac.uk/efo/EFO_0004244'])

study identifies multiple susceptibility loci for pulmonary
study identifies multiple susceptibility loci for pulmonary studi identifi multipl suscept loci for pulmonari
... False False False False False
None
23583980 None
set([u'http://www.ebi.ac.uk/efo/EFO_0004244'])

identifi

In [None]:
print [ph for ph in efo_phenotype_list0 if 'alpha' in ph]
print [ph for ph in efo_phen_matcher.d if 'alpha' in ph]

In [83]:
query_word = 'trait in'
from db import db_session
from db.schema import *

phenotypes = db_session.query(Phenotype).filter(Phenotype.source=='snomed').all()
# phenotypes == kb.get_phenotype_candidates_cheating()
phenotype_names = set()
for phenotype in phenotypes:
    if phenotype.name:
        phenotype_names.add((phenotype.name))
        # synonyms may be missing (None) for some phenotypes
        synonyms = phenotype.synonyms.split('|') if phenotype.synonyms else []
        if query_word in synonyms or query_word == phenotype.name:
            print phenotype.name, phenotype.ontology_ref
        phenotype_names.update(synonyms)

In [91]:
# print len(phenotype_names)
[(word, change_name(word)) for word in snomed_phenotype_list if change_name(word) == change_name('borderline personality')]

[(u'borderline personality disorder', u'borderlin person')]

## Learning the correctness of our candidates

Next, we will train machine learning models to identify which phenotype candidates are actually correct.

### Generating a labeled set of examples

We first split the data into an (unlabeled) training set, on which we will train a model using unsupervised risk estimation, and a labeled set, which we will later split into dev and test sets.

In [76]:
session.rollback()
# session.query(CandidateSet).filter(CandidateSet.name == 'Phenotype Training Candidates').delete()
# session.query(CandidateSet).filter(CandidateSet.name == 'Phenotype Dev/Test Candidates').delete()
session.query(CandidateSet).filter(CandidateSet.name == 'Phenotype Labeled Candidates').delete()

# session.delete(train_c)
# session.rollback()

1

In [74]:
frac_test = 0.5

# initialize the new sets
train_c = CandidateSet(name='Phenotype Training Candidates')
devtest_c = CandidateSet(name='Phenotype Dev/Test Candidates')

# choose a random subset for the labeled set
n_test = int(len(phen_c) * frac_test)  # np.random.choice needs an integer size
test_idx = set(np.random.choice(len(phen_c), size=(n_test,), replace=False))

# add to the sets
for i, c in enumerate(phen_c):
    if i in test_idx:
        devtest_c.append(c)
    else:
        train_c.append(c)

# save the results
session.add(train_c)
session.add(devtest_c)
session.commit()

print 'Initialized %d training and %d dev/testing candidates' % (len(train_c), len(devtest_c))



Initialized 3734 training and 3734 dev/testing candidates
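The split itself is just sampling indices without replacement. A minimal sketch using only the standard library follows; the function `split_indices` and the fixed seed are illustrative assumptions (the notebook uses `np.random.choice` without a seed, so its split is not reproducible).

```python
import random

def split_indices(n, frac_held_out, seed=0):
    """Split range(n) into (kept, held_out) index lists, sampling without replacement.

    The seed is an assumption added so that the split can be reproduced.
    """
    rng = random.Random(seed)
    n_held = int(n * frac_held_out)
    held = set(rng.sample(range(n), n_held))
    kept = [i for i in range(n) if i not in held]
    return kept, sorted(held)
```

With `n=7468` and `frac_held_out=0.5` this gives 3734 indices in each part, matching the sizes printed above.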


We will label a small number of dev/test candidates.

In [77]:
n_labeled = 300 # number of candidates to label

# sample from the dev/test candidates only, so that no training candidate gets labeled
random_idx = np.random.choice(len(devtest_c), size=(n_labeled,), replace=False)
labeled_c = CandidateSet(name='Phenotype Labeled Candidates')
for i in random_idx:
    labeled_c.append(devtest_c[i])

We may use the Snorkel viewer to label a set of examples.

In [78]:
from snorkel.viewer import SentenceNgramViewer
sv = SentenceNgramViewer(labeled_c, session, annotator_name="Snorkel Phenotype Annotations")

<IPython.core.display.Javascript object>

This will display the viewer.

In [79]:
sv

We now further split the labeled set into dev and test.

In [80]:
frac_dev = 0.2

# initialize the new sets
dev_c = CandidateSet(name='Phenotype Dev Candidates')
test_c = CandidateSet(name='Phenotype Test Candidates')

# choose a random subset for the dev set
# (note: n_dev is a fraction of all candidates, not only of the dev/test candidates)
n_dev = int(len(phen_c) * frac_dev)
dev_idx = set(np.random.choice(len(devtest_c), size=(n_dev,), replace=False))

# add to the sets
for i, c in enumerate(devtest_c):
    if i in dev_idx:
        dev_c.append(c)
    else:
        test_c.append(c)

# save the results
session.add(dev_c)
session.add(test_c)
session.commit()

print 'Initialized %d dev and %d test candidates' % (len(dev_c), len(test_c))



Initialized 1493 dev and 2241 test candidates


In [139]:
from snorkel.models.annotation import AnnotationKeySet, AnnotationKey

key_set = session.query(AnnotationKeySet).filter(AnnotationKeySet.name == "Snorkel Phenotype Annotations").first()
print key_set.keys[0].labels[0].candidate

SnorkelPhenotype(Span("detection", parent=604, chars=[113,121], words=[16,16]))


### Feature extraction

Next, we generate features based on our training set.

In [85]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()
%time F_train = feature_manager.create(session, train_c, 'Phenotype Train Features')

Generating annotations for 3734 candidates...
Loading sparse Feature matrix...


In [86]:
F_train.get_candidate(0)

SnorkelPhenotype(Span("association", parent=6962, chars=[12,22], words=[1,1]))

In [88]:
F_train.get_key(1)

AnnotationKey (DDL_LEMMA_SEQ_[association])

### Labeling functions

Following the data programming approach, we define a set of labeling functions. First, we need to preload data from our phenotype dictionaries; we also define common stopwords that we will try to filter out.

In [91]:
import re, string
from nltk.stem import PorterStemmer
from db.kb import KnowledgeBase
punctuation = set(string.punctuation)
stemmer = PorterStemmer()

# load set of dictionary phenotypes
kb = KnowledgeBase()
phenotype_list = kb.get_phenotype_candidates() # TODO: load disease names from NCBI
phenotype_list = list(phenotype_list)
phenotype_set = set(phenotype_list)

# load stopwords
with open('../data/phenotypes/snorkel/dicts/manual_stopwords.txt') as f:
    stopwords = {line.strip() for line in f}
stopwords.update(['analysis', 'age', 'drug', 'community', 'detect', 'activity', 'genome',
                  'genetic', 'phenotype', 'response', 'population', 'parameter', 'diagnosis',
                  'level', 'survival', 'maternal', 'paternal', 'clinical', 'joint', 'related',
                  'status', 'risk', 'protein', 'association', 'signal', 'pathway', 'genotype', 'scale',
                  'human', 'family', 'heart', 'general', 'chromosome', 'susceptibility', 'select', 
                  'medical', 'system', 'trait', 'suggest', 'confirm', 'subclinical', 'receptor', 
                  'class', 'adult', 'affecting', 'increase'])
from nltk.corpus import stopwords as nltk_stopwords
stopwords.update(nltk_stopwords.words('english'))
stopwords = {stemmer.stem(word) for word in stopwords}

In [89]:
# we also define a few helpers
def get_phenotype(entity, stem=False):
    phenotype = entity.get_span()
    if stem: phenotype = stemmer.stem(phenotype)
    return phenotype.lower()

def stem_list(L):
    return [stemmer.stem(l.lower()) for l in L]

def span(c):
    return c if isinstance(c, TemporarySpan) else c[-1]

Now we define the functions themselves.

In [201]:
from snorkel.annotations import LabelManager
from snorkel.lf_helpers import *

label_manager = LabelManager()

# positive LFs
def LF_first_sentence(m):
    return +10 if span(m).parent.position == 0 else 0
def LF_from_regex(m):
    if span(m).parent.position == 0 and not regex_phen_matcher._f(span(m)) and not LF_bad_words(m): return +5
    else: return 0
def LF_with_acronym(m):
    post_txt = ''.join(right_text(m, attr='words', window=5))
    return +1 if re.search(r'\([A-Z]{2,4}\)', post_txt) else 0
def LF_many_words(m):
    return +1 if len(span(m).get_span().split()) >= 3 else 0
def LF_start_of_sentence(m):
    return +1 if m.get_word_start() <= 3 else 0

LFs_pos = [LF_first_sentence, LF_with_acronym, LF_from_regex, LF_many_words]

# negative LFs
def LF_bad_words(m):
    bad_words = ['disease', 'single', 'map', 'genetic variation', '( p <']
    return -100 if any(span(m).get_span().lower().startswith(b) for b in bad_words) else 0
def LF_short(m):
    txt = span(m).get_attrib_span('words', 3)
    return -50 if len(txt) < 5 else 0
def LF_no_nouns(m):
    return -10 if not any(t.startswith('NN') for t in span(m).get_attrib_tokens('pos_tags')) else 0
def LF_pvalue(m):
    txt = span(m).get_span().lower()
    return -100 if 'p <' in txt or 'p =' in txt else 0
def LF_not_true_phen(m):
    indicator_ngrams = ['factor for']
    return -1 if any(ngram in get_left_ngrams(m) for ngram in indicator_ngrams) else 0
def LF_not_first_sentences(m):
    return -1 if span(m).parent.position > 1 else 0
def LF_stopwords(m):
    txt = span(m).get_span()
    txt = ''.join(ch for ch in txt if ch not in punctuation)
    words = txt.lower().split()
    return -50 if all(word in stopwords for word in words) or \
                  all(stemmer.stem(word) in stopwords for word in words) or \
                  all(change_name(word) in stopwords for word in words) else 0


LFs_neg = [LF_bad_words, LF_short, LF_no_nouns, LF_pvalue, LF_not_true_phen, LF_not_first_sentences, LF_stopwords]
LFs = LFs_pos + LFs_neg

try:
    %time L_train = label_manager.load(session, train_c, 'Phenotype LF Labels')
except sqlalchemy.orm.exc.NoResultFound:
    %time L_train = label_manager.create(session, train_c, 'Phenotype LF Labels', f=LFs)

CPU times: user 230 ms, sys: 9.25 ms, total: 239 ms
Wall time: 250 ms


In [124]:
session.query(AnnotationKeySet).filter(AnnotationKeySet.name == 'Phenotype LF Labels').delete()

1

In [126]:
L_train.lf_stats()

|  | conflicts | coverage | j | overlaps |
|---|---|---|---|---|
| LF_first_sentence | 1.620246 | 2.739689 | 0 | 2.739689 |
| LF_with_acronym | 0.022496 | 0.036154 | 1 | 0.027852 |
| LF_from_regex | 0.784681 | 1.289502 | 2 | 1.289502 |
| LF_many_words | 0.030262 | 0.114087 | 3 | 0.065613 |
| LF_bad_words | 0.508838 | 1.874665 | 4 | 1.874665 |
| LF_short | 0.0 | 0.0 | 5 | 0.0 |
| LF_no_nouns | 0.728441 | 3.211034 | 6 | 3.211034 |
| LF_pvalue | 0.0 | 0.0 | 7 | 0.0 |
| LF_not_true_phen | 0.0 | 0.0 | 8 | 0.0 |
| LF_not_first_sentences | 0.029727 | 0.339047 | 9 | 0.247188 |
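For readers unfamiliar with these columns, the sketch below computes coverage, overlaps, and conflicts from a small label matrix using their conventional definitions (fraction of candidates an LF labels, fraction it labels together with another LF, and fraction where another LF disagrees in sign). The function `lf_summary` is hypothetical, and it is not claimed to reproduce the numbers printed above, which `lf_stats()` derives in its own way.

```python
def lf_summary(label_matrix):
    """Per-LF coverage, overlap, and conflict fractions.

    label_matrix: list of rows, one per candidate; each row holds one LF output
    per column, with 0 meaning 'abstain'. Only the sign of a non-zero output is
    used when checking for conflicts (an assumption of this sketch).
    """
    n = float(len(label_matrix))
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlap = conflict = 0
        for row in label_matrix:
            if row[j] == 0:
                continue
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlap += 1
            if any((v > 0) != (row[j] > 0) for v in others):
                conflict += 1
        stats.append({'coverage': covered / n,
                      'overlaps': overlap / n,
                      'conflicts': conflict / n})
    return stats
```

An LF with zero coverage, such as `LF_short` or `LF_pvalue` above, never fires on the training candidates and therefore contributes nothing to training.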


### Training machine learning models

In [202]:
from snorkel.learning import NaiveBayes

gen_model = NaiveBayes()
gen_model.train(L_train, n_iter=10000, rate=1e-2)

Training marginals (!= 0.5):	3734
Features:			11
Begin training for rate=0.01, mu=1e-06
	Learning epoch = 0	Gradient mag. = 1.256793
	Learning epoch = 250	Gradient mag. = 1.739181
	Learning epoch = 500	Gradient mag. = 1.055386
	Learning epoch = 750	Gradient mag. = 0.656981
	Learning epoch = 1000	Gradient mag. = 0.420766
	Learning epoch = 1250	Gradient mag. = 0.277369
	Learning epoch = 1500	Gradient mag. = 0.187874
	Learning epoch = 1750	Gradient mag. = 0.130459
	Learning epoch = 2000	Gradient mag. = 0.092794
	Learning epoch = 2250	Gradient mag. = 0.067774
	Learning epoch = 2500	Gradient mag. = 0.051165
	Learning epoch = 2750	Gradient mag. = 0.040311
	Learning epoch = 3000	Gradient mag. = 0.033423
	Learning epoch = 3250	Gradient mag. = 0.029205
	Learning epoch = 3500	Gradient mag. = 0.026701
	Learning epoch = 3750	Gradient mag. = 0.025238
	Learning epoch = 4000	Gradient mag. = 0.024379
	Learning epoch = 4250	Gradient mag. = 0.023859
	Learning epoch = 4500	Gradient mag. = 0.023529
	Learn

In [168]:
gen_model.save(session, 'Phenotype Generative Params2')

IntegrityError: (sqlite3.IntegrityError) columns feature_key_id, set_id are not unique [SQL: u'INSERT INTO parameter (feature_key_id, set_id, value) VALUES (?, ?, ?)'] [parameters: ((84834, 1, 0.9994068723080292), (84835, 1, 0.9984628473423227), (84836, 1, 0.9989635411868477), (84837, 1, 0.9984939281413469), (84838, 1, 1.0104212279198124), (84839, 1, 0.9985003753122208), (84840, 1, 1.0040560162893801), (84841, 1, 0.9985003753122208)  ... displaying 10 of 11 total bound parameter sets ...  (84843, 1, 0.9987718730481357), (84844, 1, 0.9992988026859801))]

In [129]:
train_marginals = gen_model.marginals(L_train)

In [203]:
gen_model.w

array([ 1.25701185,  0.59526402,  1.27784439,  0.79873538,  9.99546033,
        0.98503744,  9.99735828,  0.98503744,  0.98503744,  3.30193975,
        1.24502444])
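These weights can be read as per-LF contributions to a candidate's score. The sketch below assumes that the score is the dot product of the LF outputs with the weights, taken in the order of the `LFs` list, and that a probability is obtained by passing the score through a logistic function. The first assumption is consistent with the score of about 18.959 printed later in this notebook for a candidate whose LF outputs are `[10, 0, 5, 0, ...]`; the second is an assumption about how marginals are derived. The function names are hypothetical.

```python
import math

def log_odds(lf_outputs, weights):
    """Weighted sum of labeling-function outputs (assumed scoring rule)."""
    return sum(l * w for l, w in zip(lf_outputs, weights))

def odds_to_prob(odds):
    """Logistic function mapping log-odds to a probability (assumption)."""
    return 1.0 / (1.0 + math.exp(-odds))

# Weights as printed above (rounded to 8 decimals)
w = [1.25701185, 0.59526402, 1.27784439, 0.79873538, 9.99546033,
     0.98503744, 9.99735828, 0.98503744, 0.98503744, 3.30193975,
     1.24502444]

# A candidate for which only LF_first_sentence (+10) and LF_from_regex (+5) fire
lf_outputs = [10, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0]
score = log_odds(lf_outputs, w)    # 10 * 1.25701185 + 5 * 1.27784439
probability = odds_to_prob(score)
```

Because the LFs return magnitudes such as 10 or -100 rather than plain ±1 votes, a single firing LF can dominate the score regardless of its learned weight.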

### Look at results on the test set

We start by creating features and labels for each element of the test set.

In [146]:
from snorkel.annotations import FeatureManager
from snorkel.annotations import LabelManager

feature_manager = FeatureManager()
label_manager = LabelManager()

# try:
#     %time L_test = label_manager.load(session, test_c, 'Phenotype LF Test Labels')
#     %time F_test = feature_manager.load(session, test_c, 'Phenotype Test Features')
# except sqlalchemy.orm.exc.NoResultFound:
#     %time L_test = label_manager.create(session, test_c, 'Phenotype LF Test Labels', f=LFs)
#     session.commit()
# #     %time F_test = feature_manager.create(session, test_c, 'Phenotype Test Features')
# #     session.commit()

%time Y_test = label_manager.load(session, test_c, 'Snorkel Phenotype Annotations')

CPU times: user 127 ms, sys: 56.1 ms, total: 183 ms
Wall time: 403 ms


In [167]:
session.rollback()
from snorkel.models import ParameterSet
session.query(ParameterSet).filter(ParameterSet.name == 'Phenotype Generative Params').delete()

  "Session's state has been changed on "


1

In [159]:
gen_model.score(L_test, Y_test, test_c)

## Classify all the papers

We now have a classifier that can score phenotype mentions in the text. Let's apply this classifier to assign a phenotype to each of our papers.

### Analyze / Visualize

If a mention occurs in the title, it's probably correct, so we can take it.

Question: which papers did not have any disease mentions in the title?

In [204]:
from snorkel.annotations import LabelManager

label_manager = LabelManager()

# delete existing labels
session.rollback()
session.query(AnnotationKeySet).filter(AnnotationKeySet.name == 'Phenotype LF All Labels').delete()
%time L_all = label_manager.create(session, phen_c, 'Phenotype LF All Labels', f=LFs)

# try:
#     %time L_all = label_manager.load(session, phen_c, 'Phenotype LF All Labels')
# except sqlalchemy.orm.exc.NoResultFound:
#     %time L_all = label_manager.create(session, phen_c, 'Phenotype LF All Labels', f=LFs)

Generating annotations for 7468 candidates...
Loading sparse Label matrix...
CPU times: user 1min 18s, sys: 550 ms, total: 1min 18s
Wall time: 1min 20s


In [205]:
learner = gen_model

preds = learner.predict(L_all)
results = [c for p, c in zip(preds, phen_c) if p > 0 and c[0].parent.position == 0]
doc_set = {c[0].parent.document.name for c in results}
missing_docs = {doc.name for doc in corpus.documents} - doc_set
docs = sorted(list(missing_docs))
print len(docs)
for d in missing_docs:
    print d, kb.paper_by_pmid(d).title

33
23935956 Genome wide association analysis of a founder population identified TAF3 as a gene for MCHC in humans.
22359512 Genome-wide association study identifies novel loci associated with circulating phospho- and sphingolipid concentrations.
23056639 A genome-wide association study of circulating galectin-3.
22044751 Heritability and genome-wide association analysis of renal sinus fat accumulation in the Framingham Heart Study.
23776548 Genetic loci for retinal arteriolar microcirculation.
24489884 Genome-wide association study of proneness to anger.
20921969 Genome-wide association study of antipsychotic-induced QTc interval prolongation.
21935397 Genome-wide population-based association study of extremely overweight young adults--the GOYA study.
24847357 Genome wide association study of SNP-, gene-, and pathway-based approaches to identify genes influencing susceptibility to Staphylococcus aureus infections.
25226531 Common variation near ROBO2 is associated with expressive vocab

Let's now visualize what we found.

In [206]:
scores = learner.odds(L_all)
score_dict = { doc.name : list() for doc in corpus.documents }
for s, c in zip(scores, phen_c):
    score_dict[c[0].parent.document.name].append((s,c))

results = dict()
for pmid, preds in score_dict.items():
    if preds: 
        best_c = sorted(preds, reverse=True)[0][1]
        results[best_c[0].parent.document.name] = best_c
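
# --- illustrative sketch (not part of the pipeline) ---
# The selection above amounts to grouping scored candidates by document and
# taking the highest-scoring one. A Snorkel-free version on plain tuples is
# shown below; `best_per_document` and its (doc_name, score, candidate) input
# layout are hypothetical names chosen for this example.
from collections import defaultdict

def best_per_document(scored):
    """Map each document name to its highest-scoring candidate."""
    by_doc = defaultdict(list)
    for doc_name, score, candidate in scored:
        by_doc[doc_name].append((score, candidate))
    # compare on the score only, so candidates never need to be orderable
    return dict((doc_name, max(items, key=lambda sc: sc[0])[1])
                for doc_name, items in by_doc.items())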
    

In [207]:
# doc_set = {c.context.document.name for c in results}
# missing_docs = {doc.name for doc in corpus.documents} - doc_set
# docs = sorted(list(missing_docs))
# print len(docs)
for d in corpus.documents:
    print d, kb.paper_by_pmid(d.name).title
    print unicode(results.get(d.name, None)), [LF(results.get(d.name)) for LF in LFs]
    try:
        print sorted(score_dict[d.name], reverse=True)[:5]
    except UnicodeEncodeError:
        print 'Unicode error'
    print

Document 25086665 Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer.
SnorkelPhenotype(Span("pancreatic cancer", parent=6962, chars=[74,90], words=[8,9])) [10, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[(18.959340425512902, SnorkelPhenotype(Span("pancreatic cancer", parent=6962, chars=[74,90], words=[8,9]))), (0.0, SnorkelPhenotype(Span("pancreatic cancer", parent=6963, chars=[96,112], words=[14,15]))), (-3.3019397536672925, SnorkelPhenotype(Span("ratio", parent=6964, chars=[98,102], words=[16,16]))), (-43.291881522112206, SnorkelPhenotype(Span("association", parent=6962, chars=[12,22], words=[1,1]))), (-62.251221947625105, SnorkelPhenotype(Span("association", parent=6963, chars=[38,48], words=[5,5])))]

Document 23349640 Susceptibility loci associated with specific and shared subtypes of lymphoid malignancies.
SnorkelPhenotype(Span("Malignancies", parent=5094, chars=[77,88], words=[10,10])) [10, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[(18.959340425512902, SnorkelPh


### Save results

In [211]:
import string
from nltk.corpus import stopwords as nltk_stopwords

punctuation = set(string.punctuation)
nltk_stopword_set = set(nltk_stopwords.words('english'))

def clean_stopwords(txt):
    # drop English stopwords and stand-alone punctuation tokens
    words = [w for w in txt.split()
             if w not in nltk_stopword_set and w not in punctuation]
    return ' '.join(words)

with open('phenotypes.extracted.tsv', 'w') as f:
    for d in corpus.documents:
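        # pick the top three results
        best = sorted(score_dict[d.name], reverse=True)[:3]
        # skip documents for which no candidates were extracted
        if not best: continue
        # report all three if the third is in the title and scores within 1 of the second;
        # report the top two if the second is in the title and scores above 5;
        # otherwise report only the best one
        if len(best) == 3 and best[2][1][0].parent.position == 0 and best[1][0] - best[2][0] < 1: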
            phen = '|'.join(set([clean_stopwords(r[0].get_span()) for s,r in best[:3]]))
        elif len(best) >= 2 and best[1][1][0].parent.position == 0 and best[1][0] > 5:
            phen = '|'.join(set([clean_stopwords(r[0].get_span()) for s,r in best[:2]]))                
        else:
            phen = clean_stopwords(best[0][1][0].get_span())
        out_str = u'%s\t%s\t\n' % (d.name, phen)        
        f.write(out_str.encode("UTF-8"))
        