# Gene NER using PySysrev and Human Review (Part III)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This third part of the series details how we evaluate our model.

In this notebook we:

1. **Evaluate Model** on Gene Hunter text to test performance
2. **Demonstrate** our model in action on example sentences

We start from our processed data and hold out 20% of it as a test set, training on the remaining 80%. We train for 20 epochs with a dropout rate of 0.2. The Sysrev [Gene Hunter project](https://sysrev.com/p/3144) has ~1200 annotated articles; we are careful to split our train and test sets to avoid shared articles.

In [156]:
from __future__ import unicode_literals, print_function, division
import spacy, sklearn, PySysrev, random, sys

GENE_DATA   = PySysrev.processAnnotations(project_id=3144, label='GENE')
train, test = GENE_DATA[:int(0.8 * len(GENE_DATA))], GENE_DATA[-int(0.2 * len(GENE_DATA)):]
print("{} train instances {} test instances".format(len(train),len(test)))

984 train instances 246 test instances
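
The cell above splits by position in the list. To make the "no shared articles" requirement explicit, one option is to group instances by article before splitting. The following is a minimal sketch, not the notebook's method: it assumes each instance can be paired with an article identifier, and both `split_by_article` and `article_ids` are hypothetical names (nothing here is part of PySysrev).

```python
import random
from collections import defaultdict


def split_by_article(examples, article_ids, test_fraction=0.2, seed=0):
    """Split examples so that no article contributes to both train and test.

    `article_ids` is a hypothetical list, parallel to `examples`, giving the
    article each instance came from.
    """
    groups = defaultdict(list)
    for example, article_id in zip(examples, article_ids):
        groups[article_id].append(example)

    # Shuffle whole articles, not individual instances.
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(round(test_fraction * len(ids))))

    test_ids = set(ids[:n_test])
    train = [ex for aid in ids[n_test:] for ex in groups[aid]]
    test = [ex for aid in ids[:n_test] for ex in groups[aid]]
    return train, test, test_ids


# Toy data: seven instances drawn from five articles.
toy_examples = [("a1 p1", {"entities": []}), ("a1 p2", {"entities": []}),
                ("a2 p1", {"entities": []}), ("a3 p1", {"entities": []}),
                ("a3 p2", {"entities": []}), ("a4 p1", {"entities": []}),
                ("a5 p1", {"entities": []})]
toy_ids = [1, 1, 2, 3, 3, 4, 5]

train_toy, test_toy, held_out = split_by_article(toy_examples, toy_ids)
print("{} train instances {} test instances".format(len(train_toy), len(test_toy)))
```

Because whole articles are assigned to one side, the test fraction is measured in articles, so the instance counts will only approximate an 80/20 split.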


### Create Model
Now that we have training and testing data, we can create a gene identification model.

In [157]:
# create nlp model
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
ner.add_label('GENE')
nlp.add_pipe(ner)
optimizer = nlp.begin_training()

# train model
epochs = 20
for itn in range(epochs):
    sys.stdout.write("{} ".format(itn))
    losses = {}  # per-epoch loss accumulator, filled in by nlp.update
    text, annotations = zip(*train)  # unzip text/annotations
    nlp.update(text, annotations, sgd=optimizer, drop=0.2, losses=losses)

print("done!")

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 done!
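
The loop above passes the entire training set to a single update per epoch. A common alternative is to shuffle the data each epoch and update on small batches. The sketch below shows only the batching logic in plain Python; `minibatches` and `epoch_batches` are hypothetical helpers, the batch size of 4 is arbitrary, and the `nlp.update` call is left as a comment because it needs the model built above.

```python
import random


def minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def epoch_batches(train_data, size, rng):
    """Return the batches for one epoch over a shuffled copy of the data."""
    shuffled = list(train_data)
    rng.shuffle(shuffled)
    return list(minibatches(shuffled, size))


# Toy stand-in for the (text, annotations) pairs used in the notebook.
toy_train = [("text {}".format(i), {"entities": []}) for i in range(10)]
rng = random.Random(0)

batches = epoch_batches(toy_train, size=4, rng=rng)
for batch in batches:
    texts, annotations = zip(*batch)
    # In the notebook, the update for this batch would go here:
    # nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
```

Whether this improves the scores reported below is untested here; it is one of the tuning options worth trying alongside the number of epochs and the dropout rate.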


### Evaluate model
Now that we have a model, let's evaluate its performance. spaCy provides `spacy.scorer.Scorer`, which we use to count **true positives** (genes correctly identified as genes), **false negatives** (genes incorrectly identified as non-genes) and **false positives** (non-genes incorrectly classified as genes). These counts allow us to calculate **Recall**, the proportion of genes that were captured by the model, and **Precision**, the proportion of gene identifications that were correct relative to all gene identifications.

Recall is important because we don't want to miss any genes.  The gene NER will be used to identify relationships between genes, diseases, chemicals and more.  Randomly missing some genes is ok, but a low recall may indicate that the model misses specific sets of genes in a *biased* manner.  This would make the model much less useful in identifying gene relationships.

Precision is a little less important than recall. Ideally the model doesn't misclassify large numbers of non-genes as genes. In practice, this might happen with tokens that are used as gene names but also have other meanings (like FAT).
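
To make the two definitions concrete, here is a small worked calculation from raw counts. The helper names `recall` and `precision` are illustrative only; the counts are the TEST-row values (TP, FN, FP) reported in the results table of this notebook.

```python
def recall(tp, fn):
    """Proportion of true genes that the model found: TP / (TP + FN)."""
    total = tp + fn
    return tp / float(total) if total else 0.0


def precision(tp, fp):
    """Proportion of predicted genes that were correct: TP / (TP + FP)."""
    total = tp + fp
    return tp / float(total) if total else 0.0


# Counts from the TEST row of the results table in this notebook.
test_tp, test_fn, test_fp = 203, 175, 113

print("recall    = {:.3f}".format(recall(test_tp, test_fn)))
print("precision = {:.3f}".format(precision(test_tp, test_fp)))
```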

In [158]:
# creates a spacy_io 'Scorer' object with accuracy metrics
def evaluate(data):
    scorer = spacy.scorer.Scorer()
    for text,entity_offsets in data: 
        gold_value    = spacy.gold.GoldParse(nlp.make_doc(text),entities=entity_offsets.get('entities'))
        pred_value    = nlp(text)
        try: 
            scorer.score(pred_value,gold_value)
        except ValueError:
            pass  # spacy has rare error w/ score fn - https://github.com/explosion/spaCy/issues/2661
    return scorer

# build evaluations
tests_eval, train_eval = evaluate(test).ner, evaluate(train).ner
mat = [["","TP","FN","FP","Recall","Precision"],
       ["TEST", tests_eval.tp, tests_eval.fn, tests_eval.fp, tests_eval.recall, tests_eval.precision],
       ["TRAIN",train_eval.tp, train_eval.fn, train_eval.fp, train_eval.recall, train_eval.precision]]

from IPython.display import HTML, display
import tabulate
display(HTML("<div style='margin: auto;width:50%'>{}</div>".format(tabulate.tabulate(mat,tablefmt='html'))))


| | TP | FN | FP | Recall | Precision |
|---|---|---|---|---|---|
| TEST | 203 | 175 | 113 | 0.537037037037 | 0.642405063291 |
| TRAIN | 1390 | 289 | 272 | 0.827873734366 | 0.836341756919 |


In [159]:
import plotly as py
trace1 = py.graph_objs.Bar(x=['gene recall', 'precision'],y=[tests_eval.recall, tests_eval.precision],name='Test')
trace2 = py.graph_objs.Bar(x=['gene recall', 'precision'],y=[train_eval.recall, train_eval.precision],name='Train')

fig = py.graph_objs.Figure(data=[trace1, trace2], layout=py.graph_objs.Layout(barmode='group'))
py.plotly.iplot(fig, filename='grouped-bar')

Above we see that recall and precision on the test set are moderate (about 0.54 and 0.64). That is not perfect, but we only have a few thousand paragraph annotations. We may be able to do much better with additional model tuning, but we can also improve our models by continuing to run the Gene Hunter review. The evaluation on the training data is significantly stronger (about 0.83 for both metrics) than on the test data. This may indicate overfitting on specific genes, or possibly that distinct gene names occur in the test set without any training examples.
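
One way to probe the second explanation is to measure how many annotated gene mentions in the test set never appear as annotations in the training set. The sketch below assumes the data are `(text, {'entities': [(start, end, label), ...]})` pairs, as the evaluation code suggests; `gene_mentions` and `unseen_fraction` are hypothetical helpers, and the toy sentences are made up for illustration.

```python
def gene_mentions(data):
    """Collect lower-cased surface forms of all annotated entities."""
    mentions = []
    for text, annotation in data:
        for start, end, _label in annotation.get("entities", []):
            mentions.append(text[start:end].lower())
    return mentions


def unseen_fraction(train_data, test_data):
    """Fraction of test mentions whose surface form is absent from training."""
    seen = set(gene_mentions(train_data))
    test_mentions = gene_mentions(test_data)
    if not test_mentions:
        return 0.0
    unseen = [m for m in test_mentions if m not in seen]
    return len(unseen) / float(len(test_mentions))


# Toy examples in the assumed format (character offsets are end-exclusive).
toy_train_genes = [("TP53 regulates MDM2",
                    {"entities": [(0, 4, "GENE"), (15, 19, "GENE")]})]
toy_test_genes = [("BRCA1 binds TP53",
                   {"entities": [(0, 5, "GENE"), (12, 16, "GENE")]})]

toy_unseen = unseen_fraction(toy_train_genes, toy_test_genes)
print("unseen test mentions: {:.0%}".format(toy_unseen))
```

On the real data this would be called as `unseen_fraction(train, test)`. A high value would support the unseen-names explanation, while a low value would point more toward overfitting.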

Now, we look at specific sentences to see how our model performs in detecting gene terms. 

### Specific examples
In the examples below we consider a few basic sentences.
In the first example the model is able to extract "HMOX" and "UGT1A1" correctly and exclude the rest of the words.

In [167]:
doc = nlp("The aim of our study was to assess the possible relationships among heme oxygenase (HMOX), bilirubin UDP-glucuronosyl transferase (UGT1A1) promoter gene variations, serum bilirubin levels, and Fabry disease (FD).")
html_ner_prediction = spacy.displacy.render(doc, style='ent')
display(HTML("<div style='background-color: lightblue; padding:10px'>{}</div>".format(html_ner_prediction)))

The model is also able to detect an unconventional gene name that contains a hyphen.

In [166]:
doc = nlp("Differential Requirement of Human Cytomegalovirus UL112-113 Protein Isoforms for Viral Replication.")
html_ner_prediction = spacy.displacy.render(doc, style='ent')
display(HTML("<div style='background-color: lightblue; padding:10px'>{}</div>".format(html_ner_prediction)))

However, the model also has flaws. The sentence below contains two gene names, "MDM2" and "p53", but because they are separated by a slash instead of a space, the model is unable to identify either gene.

In [169]:
doc = nlp("Furthermore, our results demonstrate that miR-365 functions as an upstream regulator of MDM2/p53 expression, cell cycle progression and apoptosis in trophoblasts")
html_ner_prediction = spacy.displacy.render(doc, style='ent')
display(HTML("<div style='background-color: lightblue; padding:10px'>{}</div>".format(html_ner_prediction)))


[W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.
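
If the slash is indeed the cause, one simple workaround to try is padding slashes with spaces before the text reaches the model, so that each gene name becomes its own token. This is a sketch of a possible pre-processing step, not a confirmed fix: `pad_slashes` is a hypothetical helper, and because it changes character offsets, any annotated training text would need its entity offsets adjusted in the same way.

```python
import re

# A slash with a word character directly on both sides, e.g. "MDM2/p53".
_SLASH_BETWEEN_WORDS = re.compile(r"(?<=\w)/(?=\w)")


def pad_slashes(text):
    """Insert spaces around slashes that join two word-like strings."""
    return _SLASH_BETWEEN_WORDS.sub(" / ", text)


padded = pad_slashes("an upstream regulator of MDM2/p53 expression")
print(padded)
# In the notebook one could then inspect: nlp(padded).ents
```

spaCy's tokenizer can also be customised with additional infix patterns, which would avoid changing the raw text. Which option works better for this model is not evaluated here.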



Other times, the model is only able to find one of the genes in the sentence. "SP1" is also a gene, but only "malat1" is highlighted.

In [170]:
doc = nlp("showed that malat1\xa0M5 interacts with the C-terminal domain of SP1 by RNA immunoprecipitation (RIP) assay coupled with UV cross-linking")
html_ner_prediction = spacy.displacy.render(doc, style='ent')

display(HTML("<div style='background-color: lightblue; padding:10px'>{}</div>".format(html_ner_prediction)))

Overall, our trained model shows promising results in the test and train metrics, as well as in specific identification tasks. To improve performance, we could tune hyperparameters such as the number of epochs and the dropout rate. With our current working model, the next step is to turn it into a web application with an API, which is documented in the next post.