# Gene NER using PySysrev and Human Review (Part III)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This third part of the series details how we can evaluate our model.

In this notebook we:

1. **Perform k-fold cross validation** on our model
2. **Evaluate Model** on Gene Hunter text to test performance
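
The cells in this notebook hold out a single 20% slice of the articles. A k-fold variant would rotate that held-out slice so every article is tested exactly once. The following is only a minimal sketch of such a split at the article level: the helper name `kfold_article_splits` and the sample abstracts are hypothetical and are not part of PySysrev or spaCy.

```python
import random

def kfold_article_splits(articles, k=5, seed=0):
    """Yield (train_articles, test_articles) set pairs, one pair per fold."""
    # Deduplicate, drop missing texts, and fix the order so the shuffle is reproducible.
    unique = sorted(set(a for a in articles if a is not None))
    random.Random(seed).shuffle(unique)
    # Fold i takes every k-th article starting at offset i.
    folds = [unique[i::k] for i in range(k)]
    for i in range(k):
        test = set(folds[i])
        train = set(unique) - test
        yield train, test

# Hypothetical stand-ins for abstract texts.
abstracts = ["abstract %d" % i for i in range(10)]
splits = list(kfold_article_splits(abstracts, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Splitting on unique article text rather than on individual annotations keeps all annotations of one abstract on the same side of the split. A fresh model would be trained on each `train` set and scored on the matching `test` set.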

We start by getting the training annotations from the Gene Hunter project ([sysrev.com/p/3144](https://sysrev.com/p/3144)) below. This process is described in [Part I](https://s3.amazonaws.com/sysrev-blog/NERGenes_Processing.html).
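
Judging from how the cells in this notebook index `TRAIN_DATA` (`x[0]` for the text, `x[1]['entities']` for character offsets), each item appears to follow spaCy's `(text, {'entities': [(start, end, label)]})` training format. The sketch below illustrates that assumed shape with a made-up sentence; it is not actual Gene Hunter data.

```python
# Hypothetical example item in the assumed (text, annotations) format.
sample = ("Mutations in BRCA1 raise cancer risk.",
          {"entities": [(13, 18, "GENE")]})

text, annotations = sample
# Recover the annotated strings by slicing the text with each (start, end) offset.
spans = [text[start:end] for start, end, label in annotations["entities"]]
print(spans)
```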

In [24]:
from __future__ import unicode_literals, print_function
import random

import spacy
import PySysrev

# Each item is indexed below as (text, annotations)
TRAIN_DATA = PySysrev.processAnnotations(project_id=3144, label='GENE')

# Hold out 20% of the unique abstracts as a test set
uniq_articles = list(set([x[0] for x in TRAIN_DATA]))
test_size = int(0.2 * len(uniq_articles))
test_articles = set(uniq_articles[0:test_size])  # set gives fast membership tests

# Blank English pipeline with a single NER component for the GENE label
nlp = spacy.blank('en')
nlp.meta['name'] = 'gene'

ner = nlp.create_pipe('ner')
ner.add_label('GENE')

nlp.add_pipe(ner)
optimizer = nlp.begin_training()

# Training items are those whose text is not in the held-out set
train_items = [item for item in TRAIN_DATA if item[0] not in test_articles]

epochs = 20

for itn in range(epochs):
    random.shuffle(train_items)
    losses = {}
    text = [item[0] for item in train_items]         # training texts
    annotations = [item[1] for item in train_items]  # training annotations

    # One update per epoch, with the full training set as a single batch
    nlp.update(text, annotations, sgd=optimizer, drop=0.2, losses=losses)
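
The loop above passes the whole training set to `nlp.update` as one batch per epoch. A common alternative is to update on smaller batches, which gives the optimizer many steps per epoch. spaCy ships its own helper for this (`spacy.util.minibatch`); the pure-Python `minibatches` function below is a hypothetical stand-in that only illustrates the idea, using made-up training pairs.

```python
import random

def minibatches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-ins for (text, annotations) training pairs.
train_items = [("abstract %d" % i, {"entities": []}) for i in range(10)]

random.Random(0).shuffle(train_items)
for batch in minibatches(train_items, size=4):
    texts = [text for text, _ in batch]
    annotations = [ann for _, ann in batch]
    # In the notebook, nlp.update(texts, annotations, ...) would be called here.
    print(len(texts))
```

With ten items and a batch size of four, the loop sees batches of 4, 4, and 2 items.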



In [59]:
import plotly as py
import plotly.graph_objs as go



In [35]:
from __future__ import division

def get_metrics(test_or_train, model):
    """Print sensitivity and specificity of `model` on the test or train split."""
    if test_or_train == 'test':
        section = [x for x in TRAIN_DATA if x[0] in test_articles]
    elif test_or_train == 'train':
        section = [x for x in TRAIN_DATA if x[0] not in test_articles]
    else:
        raise ValueError("test_or_train must be 'test' or 'train'")
    true_genes = 0      # annotated gene mentions
    pred_genes = 0      # predicted entities whose text matches an annotated gene
    true_non_genes = 0  # tokens that are not annotated genes
    pred_non_genes = 0  # of those tokens, the ones not predicted as genes
    nlp2 = spacy.load('en_core_web_sm')  # used here only to tokenize the text
    for txt in section:
        if txt[0] is None:
            continue
        doc = model(txt[0])
        predict_annotations = [str(x) for x in list(doc.ents)]
        entities = txt[1]['entities']
        true_annotations = [txt[0][x[0]:x[1]] for x in entities]
        # Matching is by entity text, not by character offsets
        pred_genes += len([value for value in predict_annotations if value in true_annotations])
        true_genes += len(true_annotations)
        doc2 = nlp2(txt[0])
        for token in doc2:
            if str(token) not in true_annotations:
                true_non_genes += 1
                if str(token) not in predict_annotations:
                    pred_non_genes += 1
    print("Sensitivity: ", pred_genes / true_genes)
    print("Specificity: ", pred_non_genes / true_non_genes)
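
In terms of a confusion matrix, `get_metrics` reports sensitivity as TP / (TP + FN) and specificity as TN / (TN + FP): `pred_genes` plays the role of TP, `true_genes` of TP + FN, `pred_non_genes` of TN, and `true_non_genes` of TN + FP. The sketch below restates those two ratios on made-up counts; the function name `sensitivity_specificity` and the numbers are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    # Sensitivity: share of true genes that the model recovered.
    sensitivity = tp / float(tp + fn)
    # Specificity: share of non-gene tokens that the model left unlabeled.
    specificity = tn / float(tn + fp)
    return sensitivity, specificity

# Hypothetical counts: 3 of 4 genes found, 95 of 96 non-gene tokens left alone.
sens, spec = sensitivity_specificity(tp=3, fn=1, tn=95, fp=1)
print("Sensitivity: ", sens)
print("Specificity: ", spec)
```

Because most tokens in an abstract are not genes, TN tends to dominate the specificity ratio, so specificity can sit close to 1 even when sensitivity is modest.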

In [54]:
get_metrics('test', nlp)

Sensitivity:  0.332777314429
Specificity:  0.996592151231


In [55]:
get_metrics('train', nlp)

Sensitivity:  0.517674616695
Specificity:  0.99625218954


In [57]:
# Collect the text of every annotated gene mention
all_genes = []
for txt in TRAIN_DATA:
    if txt[0] is not None:
        entities = txt[1]['entities']
        genes = [txt[0][x[0]:x[1]] for x in entities]
        all_genes.extend(genes)

# Start from a blank model and add each gene string to its vocabulary
new_nlp = spacy.blank('en')
for gene in all_genes:
    new_nlp.vocab[gene]  # looking up a string creates its lexeme
print(new_nlp.vocab.length)
new_nlp.meta['name'] = 'gene'

ner = new_nlp.create_pipe('ner')
ner.add_label('GENE')

new_nlp.add_pipe(ner)
optimizer = new_nlp.begin_training()

# Training items are those whose text is not in the held-out set
train_items = [item for item in TRAIN_DATA if item[0] not in test_articles]

epochs = 20

for itn in range(epochs):
    random.shuffle(train_items)
    losses = {}
    text = [item[0] for item in train_items]         # training texts
    annotations = [item[1] for item in train_items]  # training annotations

    new_nlp.update(text, annotations, sgd=optimizer, drop=0.2, losses=losses)

2782


KeyboardInterrupt: 

In [28]:
from spacy import displacy
from IPython.core.display import display, HTML

doc = nlp("The aim of our study was to assess the possible relationships among heme oxygenase (HMOX), bilirubin UDP-glucuronosyl transferase (UGT1A1) promoter gene variations, serum bilirubin levels, and Fabry disease (FD).")
# jupyter=False makes render return the HTML string instead of displaying it directly
html_ner_prediction = displacy.render(doc, style='ent', jupyter=False)

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

In [29]:
doc = nlp("Differential Requirement of Human Cytomegalovirus UL112-113 Protein Isoforms for Viral Replication.")
html_ner_prediction = displacy.render(doc, style='ent', jupyter=False)

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

In [30]:
doc = nlp("Furthermore, our results demonstrate that miR-365 functions as an upstream regulator of MDM2/p53 expression, cell cycle progression and apoptosis in trophoblasts")
html_ner_prediction = displacy.render(doc, style='ent', jupyter=False)

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

In [31]:
doc = nlp("showed that malat1\xa0M5 interacts with the C-terminal domain of SP1 by RNA immunoprecipitation (RIP) assay coupled with UV cross-linking")
html_ner_prediction = displacy.render(doc, style='ent', jupyter=False)

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

In [49]:
new_nlp.vocab['wasdddsup']

<spacy.lexeme.Lexeme at 0x7f36b74102d0>

In [56]:
spacy.blank('en').vocab.length

476