# Overview
- Process training set documents with pyConTextNLP
- Measure system performance
- Review false positives and false negatives
- Edit knowledge base to reduce errors
- Iteratively improve system
- Run system on final test set

In [None]:
# Let's import some packages
import os
import pyConTextNLP
from pyConTextNLP import pyConTextGraph
from pyConTextNLP import itemData
import sklearn.metrics
import pandas as pd
import networkx as nx

from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display, HTML, Image
import ipywidgets
# And also our utilities for this class

from nlp_pneumonia_utils import read_doc_annotations, list_errors, clearPyConTextRegularExpressions
from nlp_pneumonia_utils import mark_document_with_html, pneumonia_annotation_html_markup
from nlp_pneumonia_utils import classify_pneumonia_document, markup_context_document
# from nlp_pneumonia_utils import DocumentClassifier


# System Evaluation and Error Analysis

In the previous notebook, we developed an NLP system to extract mentions of pneumonia, check whether they are modified by certain semantic modifiers, and use that information to classify a document as being either **positive** or **negative** for pneumonia.

In this notebook, we will use an annotated dataset to compare our system with a human gold standard. By comparing with expert annotations, we can see instances where our system is making mistakes and we can modify it to have better performance. The two types of errors to look for are:

1. **False positives** - documents which our system classified as positive for pneumonia but the human annotators did not
2. **False negatives** - documents which the annotators marked as positive for pneumonia but our system missed

To measure our system's performance, we'll calculate:
- **Precision** - the proportion of documents we classify as positive that actually have pneumonia. A system with many *false positives* will have *low precision*
- **Recall** - the proportion of documents with pneumonia that our system identifies. A system with many *false negatives* will have *low recall*
- **F1** - the harmonic mean of precision and recall
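
As a rough illustration of the three metrics above, here is a minimal sketch that computes them from raw counts. The function name `precision_recall_f1` and the counts are made up for this example and are not part of `nlp_pneumonia_utils`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts of true positives,
    false positives, and false negatives."""
    # Guard against division by zero when there are no predictions or no positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print("Precision: {:.2f}  Recall: {:.2f}  F1: {:.2f}".format(precision, recall, f1))
```

In this example, the two false positives lower precision and the four false negatives lower recall. F1 falls between the two, closer to the lower value.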

## Workflow
1. Process our training set with pyConText and calculate metrics
2. Look at documents which our system classified incorrectly
3. Make changes to our knowledge base or code
4. Re-run on the training set
5. Once we have a score we're happy with, run our system on the test set to get a final score

# I. Process the training set and calculate metrics

## Read in our knowledge base

In [None]:
clearPyConTextRegularExpressions()
full_targets_path = 'file:///' + os.path.join(os.getcwd(), "KB/pneumonia_targets.tsv")
modifier_file_path = 'file:///' + os.path.join(os.getcwd(), "KB/pneumonia_modifiers.tsv")

targets = pyConTextNLP.itemData.instantiateFromCSVtoitemData(full_targets_path)
modifiers = pyConTextNLP.itemData.instantiateFromCSVtoitemData(modifier_file_path)

In [None]:
# Remember, these are the classes which tell us a mention of pneumonia
# doesn't mean the patient actually has pneumonia
non_positive_categories = ["definite_negated_existence",
                 "probable_negated_existence",
                 "probable_existence",
                "indication",
                 ]
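
To make the role of these categories concrete, here is a simplified sketch of how they might drive a document-level decision. It is not the code in `nlp_pneumonia_utils`. It assumes the rule is "a document is positive if at least one pneumonia mention has no modifier from a non-positive category", and it represents each mention as a plain list of modifier category names. The function `classify_mentions` and the example data are hypothetical.

```python
def classify_mentions(mentions, non_positive_categories):
    """Return 1 if any mention is free of non-positive modifiers, else 0.

    `mentions` is a hypothetical structure: one list of modifier category
    names per pneumonia mention found in a document.
    """
    blocked = set(non_positive_categories)
    for modifier_categories in mentions:
        # A mention counts as positive evidence only if none of its
        # modifiers belong to a non-positive category
        if not blocked.intersection(modifier_categories):
            return 1
    return 0


example_categories = ["definite_negated_existence",
                      "probable_negated_existence",
                      "probable_existence",
                      "indication"]

# One negated mention only: the document is classified as negative
negated_only = [["definite_negated_existence"]]
# One mention given as an indication, one asserted: classified as positive
mixed = [["indication"], ["definite_existence"]]

print(classify_mentions(negated_only, example_categories))
print(classify_mentions(mixed, example_categories))
```

Under this assumed rule, adding a category to the list can only turn positive classifications into negative ones, never the reverse.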

## Read in our expert annotations

In [None]:
#Read in the training documents and annotations
annotated_doc_map = read_doc_annotations('pneumonia_data/training_v2')

Let's see how our system performs.

In [None]:
# Process the corpus with list_errors, which returns the false negatives,
# the false positives, and the processed documents
fns, fps, context_docs = list_errors(gold_docs=annotated_doc_map, 
                      modifiers=modifiers, 
                      targets=targets, 
                      non_positive_categories=non_positive_categories,
                      print_prediction_metrics=True)

In [None]:
fps

In [None]:
fns

What if we didn't use any modifiers?

In [None]:
# Re-run with an empty modifier list; we only want the printed metrics here
_ = list_errors(gold_docs=annotated_doc_map, 
                      modifiers=[], 
                      targets=targets, 
                      non_positive_categories=non_positive_categories,
                      print_prediction_metrics=True)

# II. Error Analysis
Let's now take a closer look at the documents which we classified incorrectly. We identified the error documents and saved them in the lists `fps` and `fns`. We'll look at both the human annotations and the results of our system.
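
For intuition, error lists like these can be derived by comparing gold labels with predicted labels, document by document. The sketch below uses made-up document names and labels. The helper `find_errors` is hypothetical and does not reflect how `list_errors` is actually implemented.

```python
def find_errors(gold_labels, predicted_labels):
    """Return (false_negatives, false_positives) as sorted lists of document names.

    Both arguments map a document name to 1 (positive) or 0 (negative).
    A document missing from the predictions is treated as negative.
    """
    false_negatives = sorted(
        name for name, gold in gold_labels.items()
        if gold == 1 and predicted_labels.get(name, 0) == 0)
    false_positives = sorted(
        name for name, gold in gold_labels.items()
        if gold == 0 and predicted_labels.get(name, 0) == 1)
    return false_negatives, false_positives


# Made-up labels, for illustration only
gold = {"doc_a": 1, "doc_b": 0, "doc_c": 1, "doc_d": 0}
predicted = {"doc_a": 1, "doc_b": 1, "doc_c": 0, "doc_d": 0}

example_fns, example_fps = find_errors(gold, predicted)
print("False negatives:", example_fns)
print("False positives:", example_fps)
```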

In [None]:
colors = {
    "evidence_of_pneumonia": "orange",
    "definite_negated_existence": "red",
    "probable_negated_existence": "indianred",
    "ambivalent_existence": "forestgreen",
    "probable_existence": "forestgreen",
    "definite_existence": "green",
    "historical": "goldenrod",
    "indication": "pink",
    "acute": "gold"  # "golden" is not a valid HTML/CSS color name
}

## Display false negatives
Now we can display the **false negatives** with expert annotations.

In [None]:
# This function lets us iterate through all documents and view the markup
def view_annotation_markup(anno_docs):
    @interact(i=ipywidgets.IntSlider(min=0, max=len(anno_docs)-1))
    def _view_markup(i):
        report_html = pneumonia_annotation_html_markup(anno_docs[i])
        report_html = report_html.replace('\n', '<br>')
        display(HTML(report_html))

In [None]:
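# Collect the names and gold-standard documents for the false negatives.
# Note: unpacking zip(*...) raises a ValueError if `fns` is empty.
fn_names, fn_docs = zip(*[(k, v) for k, v in annotated_doc_map.items() if k in fns])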

### Gold standard

In [None]:
# Pick one of the false negative documents to review
idx = 0
fn_name = fn_names[idx]
fn_gold_doc = fn_docs[idx]
fn_nlp_doc = context_docs[fn_name]

In [None]:
view_annotation_markup([fn_gold_doc])

### NLP System Output

In [None]:
print("Document name: {}".format(fn_name))
display(HTML(mark_document_with_html(
    fn_nlp_doc, 
    colors=colors, default_color="black")))

In [None]:
# Debug with our classification function
classify_pneumonia_document(fn_nlp_doc, non_positive_categories, 1)

## Display false positives
Let's do the same thing with false positives.

In [None]:
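# Collect the names and gold-standard documents for the false positives.
# Note: unpacking zip(*...) raises a ValueError if `fps` is empty.
fp_names, fp_docs = zip(*[(k, v) for k, v in annotated_doc_map.items() if k in fps])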

In [None]:
# Pick one of the false positive documents to review
idx = 0
fp_name = fp_names[idx]
fp_gold_doc = fp_docs[idx]
fp_nlp_doc = context_docs[fp_name]

### Gold standard

In [None]:
view_annotation_markup([fp_gold_doc])

In [None]:
print(fp_name)
display(HTML(mark_document_with_html(fp_nlp_doc, colors = colors, default_color="black")))

In [None]:
# Look at the document markups
fp_nlp_doc.getDocumentGraph()

In [None]:
# Debug with our classification function
classify_pneumonia_document(fp_nlp_doc, non_positive_categories, 1)

### Discussion
What are the causes of false positives? What are the causes of false negatives?

# Improve system
As you identify errors in the output, you can start to make changes to the system. Some changes you can make are:

- Add terms to targets.tsv
- Add terms to modifiers.tsv
- Add semantic classes to `non_positive_categories`

### Discussion

For each of the options above, will a change reduce **false positives** or **false negatives**? Will that affect **precision** or **recall**?

### Discussion

As you make changes to improve either precision or recall, what happens to the other metric?

# Evaluate on Test Set
Once you feel like you have a good system, you will evaluate your NLP system on a new set of documents which you haven't seen before. This is important to show how **generalizable** your system is - can it work well on documents which you hadn't reviewed when developing the system?

**Once you run your system on your test set, you can't make any more changes!** It's important that you only evaluate using the test documents *one time*. This is the score you would report in a publication. You can review the errors after, but you should consider your system to be frozen.

In [None]:
# Read in the test documents and annotations
test_annotated_doc_map = read_doc_annotations('pneumonia_data/test_v2/')

In [None]:
# Run this cell only when your system is final: the test set should be scored one time
test_fns, test_fps, test_context_docs = list_errors(gold_docs=test_annotated_doc_map, 
                      modifiers=modifiers, 
                      targets=targets, 
                      non_positive_categories=non_positive_categories,
                      print_prediction_metrics=True)