# Error Analysis (1)

In the previous notebook, you saw that our current NLP solution does not classify every document correctly. Although a perfect score is unrealistic, we can certainly get closer to it.

This notebook shows you how to analyze errors related to named entity recognition and guides you step by step through improving recall. We will talk about how to improve precision tomorrow.

## 1. Locate the errors

In [None]:
#first import libraries
import urllib.request
import os
import codecs
import zipfile
import pandas as pd
from IPython.display import display, HTML,IFrame
import sklearn.metrics

Reuse the classes and functions that we created in the previous notebook.
Note: we are going to use *read_doc_annotations* (returns a map with the document name as the key and the annotations as the value) instead of *read_annotations* (returns a list of documents' annotations), so that we can list the errors together with the corresponding document name.

In [None]:
from nlp_pneumonia_utils import Annotation
from nlp_pneumonia_utils import AnnotatedDocument
from nlp_pneumonia_utils import read_brat_annotations
from nlp_pneumonia_utils import read_doc_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import mark_text
from nlp_pneumonia_utils import pneumonia_html_markup

Next, we tweak the function **calculate_prediction_metrics** so that it lists the errors (the differences between the gold and predicted labels) instead of calculating the metrics:

Question before we move on:

Why do we only care about *false negatives* for now?


In [None]:
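def list_false_negatives(gold_docs, prediction_function):
    """Return a map of document name -> gold document for every false negative."""
    fn_docs = {}
    for doc_name, gold_doc in gold_docs.items():
        gold_label = gold_doc.positive_label
        pred_label = prediction_function(gold_doc.text)
        # false negative: the gold standard says positive, the classifier says negative
        if gold_label == 1 and pred_label == 0:
            fn_docs[doc_name] = gold_doc
    return fn_docs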
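
To see what this function returns, here is a minimal sketch on toy data. `ToyDoc` is a hypothetical stand-in for `AnnotatedDocument` that carries only the two attributes used above (`text` and `positive_label`), `toy_predict` is a made-up prediction rule, and the document names and texts are invented for illustration. The function is repeated so the cell runs on its own.

```python
from collections import namedtuple

# Hypothetical stand-in for AnnotatedDocument: only the attributes used here.
ToyDoc = namedtuple('ToyDoc', ['text', 'positive_label'])


def list_false_negatives(gold_docs, prediction_function):
    # Same logic as the function above, repeated so this cell is self-contained.
    fn_docs = {}
    for doc_name, gold_doc in gold_docs.items():
        if gold_doc.positive_label == 1 and prediction_function(gold_doc.text) == 0:
            fn_docs[doc_name] = gold_doc
    return fn_docs


def toy_predict(text):
    # Toy rule (an assumption for this example): positive only if 'pneumonia' appears.
    return 1 if 'pneumonia' in text else 0


# Invented documents, one of each outcome that can occur here.
toy_docs = {
    'doc_a.txt': ToyDoc('Findings consistent with pneumonia.', 1),  # true positive
    'doc_b.txt': ToyDoc('Right lower lobe consolidation.', 1),      # false negative
    'doc_c.txt': ToyDoc('Lungs are clear.', 0),                     # true negative
}

toy_fn = list_false_negatives(toy_docs, toy_predict)
true_positives = sum(
    1 for d in toy_docs.values()
    if d.positive_label == 1 and toy_predict(d.text) == 1
)
# Recall = TP / (TP + FN), so each false negative we fix raises recall.
toy_recall = true_positives / (true_positives + len(toy_fn))
print(sorted(toy_fn), toy_recall)
```

Only the positive document that the rule missed is returned, and recall is computed from nothing but true positives and false negatives.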


## 2. Display errors

Now we put everything together to display errors:

In [None]:
class KeywordClassifier(object):
    def __init__(self):
        self.keywords = set()
    def predict(self, text):
        prediction = 0
        for keyword in self.keywords:
            if keyword in text:
                prediction = 1
        return prediction
    
keyword_classifier = KeywordClassifier()
# let's load in some manual keywords...
keyword_classifier.keywords.add('pneumonia')
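
One thing to keep in mind while reviewing the errors: `keyword in text` is a case-sensitive substring test, so a report that says "Pneumonia" or "PNEUMONIA" is not matched by the keyword `'pneumonia'`. The sketch below shows the effect and one possible workaround. `CaseInsensitiveKeywordClassifier` is a hypothetical variant written for illustration, not part of `nlp_pneumonia_utils`, and the sample sentence is invented.

```python
class KeywordClassifier(object):
    # Repeated from the cell above so this example runs on its own.
    def __init__(self):
        self.keywords = set()

    def predict(self, text):
        prediction = 0
        for keyword in self.keywords:
            if keyword in text:
                prediction = 1
        return prediction


class CaseInsensitiveKeywordClassifier(KeywordClassifier):
    # Hypothetical variant: lower-case both sides before the substring test.
    def predict(self, text):
        lowered = text.lower()
        return 1 if any(k.lower() in lowered for k in self.keywords) else 0


# Invented sample sentence with a capitalized keyword.
sample = 'IMPRESSION: Pneumonia in the left lower lobe.'

strict = KeywordClassifier()
strict.keywords.add('pneumonia')

relaxed = CaseInsensitiveKeywordClassifier()
relaxed.keywords.add('pneumonia')

print(strict.predict(sample))   # the capital 'P' is not matched
print(relaxed.predict(sample))  # matched after lower-casing
```

Whether lower-casing is appropriate depends on the data; it is shown here only as one way a false negative of this kind could be removed.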

In [None]:
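# index of the next false-negative document to display
i = 0
annotated_docs = read_doc_annotations('pneumonia_brat_full_set1.zip')
print('Total Annotated Documents : {0}'.format(len(annotated_docs)))

# collect the documents that the keyword classifier missed
fn = list_false_negatives(annotated_docs, keyword_classifier.predict)

docs = list(fn.keys())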

Show one document at a time (re-run the cell below to step to the next one):

In [None]:
if i<len(docs):
    print (docs[i])

    anno_doc=fn[docs[i]]

    display(HTML(pneumonia_html_markup(anno_doc).replace('\n', '<br>')))
    
    i+=1

## 3. More efficient review:
Not convenient to read? Let's try a snippet view instead. For that we need to write another function to replace "*pneumonia_html_markup*".

Although we are measuring document-level annotations, we will focus the error analysis on the mention level ("**SPAN_POSITIVE_PNEUMONIA_EVIDENCE**"), because that is where the errors originate.



In [None]:

def snippets_markup(annotated_doc_map):
    # table header: one column for the document name, one for the snippets
    html = ["<html>", "<table width=100% >",
            "<col style=\"width:25%\"><col style=\"width:75%\">",
            "<tr><th style=\"text-align:center\">document name</th><th style=\"text-align:center\">Snippets</th></tr>"]
    for doc_name, anno_doc in annotated_doc_map.items():
        html.extend(snippet_markup(doc_name, anno_doc))
    html.append("</table>")
    html.append("</html>")
    return ''.join(html)


def snippet_markup(doc_name, anno_doc):
    # helper from pyConTextNLP that wraps a character span in a colored <span>
    from pyConTextNLP.display.html import __insert_color
    html = []
    color = 'blue'
    window_size = 50
    # first row: the document name with an empty snippet cell
    html.append("<tr>")
    html.append("<td style=\"text-align:left\">{0}</td>".format(doc_name))
    html.append("<td></td>")
    html.append("</tr>")
    for anno in anno_doc.annotations:
        if anno.type == 'SPAN_POSITIVE_PNEUMONIA_EVIDENCE':
            # make sure the snippet is cut inside the text boundaries
            begin = max(anno.start_index - window_size, 0)
            end = min(anno.end_index + window_size, len(anno_doc.text))
            # render a highlighted snippet; both offsets are relative to the snippet start
            cell = __insert_color(anno_doc.text[begin:end],
                                  [anno.start_index - begin, anno.end_index - begin], color)
            # add the snippet to the table
            html.append("<tr>")
            html.append("<td></td>")
            html.append("<td style=\"text-align:left\">{0}</td>".format(cell))
            html.append("</tr>")
    return html

Let's try it out:

In [None]:
display(HTML(snippets_markup(fn)))
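
The window arithmetic inside `snippet_markup` can be checked on a plain string. The sketch below is a simplified stand-in: `highlight_snippet` is a hypothetical helper that takes the character offsets directly instead of an annotation object and builds the `<span>` tag itself rather than calling pyConTextNLP. The sample text is invented.

```python
def highlight_snippet(text, start, end, window_size=50, color='blue'):
    # Clamp the window to the text boundaries.
    begin = max(start - window_size, 0)
    stop = min(end + window_size, len(text))
    snippet = text[begin:stop]
    # Offsets of the mention relative to the start of the snippet.
    rel_start = start - begin
    rel_end = end - begin
    return (snippet[:rel_start]
            + '<span style="color:{0}">'.format(color)
            + snippet[rel_start:rel_end]
            + '</span>'
            + snippet[rel_end:])


# Invented sample text; the mention offsets are looked up rather than hard-coded.
text = 'There is a new infiltrate in the right lower lobe.'
start = text.index('infiltrate')
end = start + len('infiltrate')

print(highlight_snippet(text, start, end, window_size=10))
```

Both offsets are measured from the start of the snippet, which keeps the highlight correct even when the mention sits at the very beginning or end of the document and the window is clamped.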

## 4. Now what?