# FHI Module 7 Hands-on: Breast Cancer Workbook

## Time to get your hands dirty

You've learned what we need to do and how the tool works. Now it is time for you to make it actually work.

You are welcome to spend your time however you'd like, but here are a few ideas for improving your system:
* Improve targets. Are there any false negatives that your system is missing? Are there regular expressions that would help?
* Improve modifiers. Not all modifiers typically used in practice are in the modifiers starter file. Are there some to add? Do some existing modifiers cause problems in your processing? They can be changed or removed.
* Improve document classification rules. This is **optional**, because the default rules are ready to go. If you are interested, feel free to read the comments in the file to see how it works.
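
Before adding a regular expression to the targets file, it can help to try it on a few phrases in plain Python. The sketch below uses only the standard `re` module. The pattern and the sample phrases are made up for illustration; they are not taken from the starter knowledge base or the training set.

```python
import re

# Hypothetical candidate pattern: "breast" followed by a cancer-like term,
# including the abbreviation "ca". Not part of the starter targets file.
candidate = re.compile(r"\bbreast\s+(?:cancer|carcinoma|ca)\b", re.IGNORECASE)

# Made-up phrases for a quick sanity check.
samples = [
    "Mother had breast cancer at age 50.",
    "Family history of Breast CA in maternal aunt.",
    "Breast exam normal; no cancer history.",
]

# Record whether the pattern fires on each phrase.
matches = {text: bool(candidate.search(text)) for text in samples}
for text, hit in matches.items():
    print(f"{'MATCH' if hit else 'no match':8} | {text}")
```

A pattern that fires on the third phrase would be a sign that it is too loose and likely to create false positives.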

## 1. Let's go

In [None]:
# import packages that we will need
from nlp_pneumonia_utils import read_doc_annotations
from DocumentClassifier import DocumentClassifier
from nlp_pneumonia_utils import list_errors
from visual import Vis
from visual import snippets_markup
from visual import view_pycontext_output
from visual import display_doc_text
# packages for interaction
from IPython.display import display, HTML
import ipywidgets

## 2. Load our training set

In [2]:
pos_doc_type='FAM_BREAST_CA_DOC'
annotated_doc_map = read_doc_annotations(archive_file='data/bc_train.zip', pos_type=pos_doc_type)
print('Total Annotated Documents : {0}'.format(len(annotated_doc_map)))

Reading annotations from file : data/bc_train.zip
Opening local file : data/bc_train.zip
Total Annotated Documents : 60


## 3. Read in our Knowledge Base files
**The targets file is seeded with only "breast cancer" and "breast carcinoma"**  ([target rule file](/edit/KB/fam_bc_targets.yml))  
**The modifier file has all modifiers available in the pyConText distribution, but the family history modifiers are not complete** ([modifier rule file](/edit/KB/fam_bc_modifiers.yml)) 

In [3]:
TARGETS_FILE_PATH = 'KB/fam_bc_targets.yml'
MODIFIERS_FILE_PATH = 'KB/fam_bc_modifiers.yml'
FEATURE_INFERENCER_FILE_PATH = 'KB/fam_bc_featurer_inferences.csv'
DOC_INFERENCER_FILE_PATH = 'KB/fam_bc_doc_inferences.csv'
# build the classifier from the knowledge base files
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})
# clear saved predictions in case files/regular expressions have been updated
classifier.reset_saved_predictions()

## 4. Let's classify some documents
The function *list_errors* wraps several steps together. It compares the classifier's conclusions against the reference standard (manually annotated documents) and returns the false positive documents (with pyConText markups), the false negative documents (with manual annotations), and the measurements (precision, recall, and F1).


For the detailed implementation of this *list_errors* function, you can check the code in [nlp_pneumonia_utils](/edit/nlp_pneumonia_utils.py).
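
To make the comparison concrete, here is a minimal sketch of how document-level errors can be collected. It is not the implementation in `nlp_pneumonia_utils`. The function name `split_errors`, the dictionary layout, the `NEG` label, and the document names are all assumptions made for illustration.

```python
def split_errors(gold, predicted, pos_label):
    """Return (false_negatives, false_positives) as sorted lists of document names.

    gold and predicted map document name -> label. Hypothetical helper,
    not part of the workshop utilities.
    """
    # Positive in the reference standard, but not predicted positive.
    false_negatives = sorted(
        name for name, label in gold.items()
        if label == pos_label and predicted.get(name) != pos_label
    )
    # Not positive in the reference standard, but predicted positive.
    false_positives = sorted(
        name for name, label in gold.items()
        if label != pos_label and predicted.get(name) == pos_label
    )
    return false_negatives, false_positives


# Made-up labels for four documents; "NEG" is an assumed negative label.
gold = {"doc1": "FAM_BREAST_CA_DOC", "doc2": "NEG",
        "doc3": "FAM_BREAST_CA_DOC", "doc4": "NEG"}
predicted = {"doc1": "FAM_BREAST_CA_DOC", "doc2": "FAM_BREAST_CA_DOC",
             "doc3": "NEG", "doc4": "NEG"}

fn, fp = split_errors(gold, predicted, "FAM_BREAST_CA_DOC")
print("False negatives:", fn)
print("False positives:", fp)
```

Reading the false negative documents usually suggests missing targets, while the false positive documents tend to point at missing or misbehaving modifiers.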

In [4]:
%%time
print('****************')
print('Performance for Classifier :')
current_false_negatives, current_false_positives, measurements, confusion_matrix_df = classifier.eval(annotated_doc_map)
print(measurements)
display(confusion_matrix_df)
print('****************')

****************
Performance for Classifier :
Start to evaluate against reference standards...
Precision : 0.958
Recall :    0.719
F1:         0.821
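
The three scores follow directly from the confusion matrix counts. As a worked check, the sketch below uses TP = 23, FP = 1, and FN = 9. These counts are an assumption, chosen because they reproduce the printed scores; they are not read from the notebook's confusion matrix, so compare them with the table displayed in your own run. The helper name `precision_recall_f1` is also hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts that are consistent with the scores printed above.
precision, recall, f1 = precision_recall_f1(tp=23, fp=1, fn=9)
print(f"Precision : {precision:.3f}")
print(f"Recall :    {recall:.3f}")
print(f"F1:         {f1:.3f}")
```

With recall well below precision, the quickest gains are likely to come from reducing false negatives, for example by adding target variants.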
