# FHI Module 7 Hands-on: Breast Cancer Workbook

## Time to get your hands dirty

You've learned what we need to do and how the tool works. Now it is time for you to make it actually work.

You are welcome to spend your time however you'd like, but here are a few ideas for improving your system:
* Improve targets.  Are there any False Negatives your system is missing?  Are there regular expressions that would help?
* Improve modifiers.  Not all modifiers typically used in practice are in the modifiers starter file.  Are there some to add?  Do some existing modifiers cause problems in your processing?  They can be changed or removed.
* Improve document classification rules.  This is **optional**, because the default rules are ready to go. If you are interested, feel free to read the comments in the file to see how it works.
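
To make the regular expression idea concrete, here is a minimal sketch using Python's standard `re` module. The pattern and the sample sentences are our own illustration (they are not taken from the starter rule files), so treat it as a starting point to test before putting anything similar into your targets file.

```python
import re

# Hypothetical pattern (an assumption, not the contents of the targets file):
# it covers "breast cancer", "breast carcinoma" and the abbreviation "breast ca".
BREAST_CA_PATTERN = re.compile(r"\bbreast\s+(?:cancer|carcinoma|ca)\b", re.IGNORECASE)

samples = [
    "The patient has a sister with breast cancer.",
    "daughter has h/o breast ca and possible pulmonary fibrosis",
    "One sister: Breast carcinoma and colon cancer.",
    "Mother had colon cancer.",
]

# True where the pattern finds a mention, False otherwise
matches = [bool(BREAST_CA_PATTERN.search(text)) for text in samples]

for text, hit in zip(samples, matches):
    print("MATCH   " if hit else "NO MATCH", "|", text)
```

The trailing `\b` keeps the short form `ca` from matching inside longer words, which is the kind of side effect worth checking whenever you add an abbreviation.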

## 1. Let's go

In [1]:
# import packages that we will need
from nlp_pneumonia_utils import read_doc_annotations
from DocumentClassifier import DocumentClassifier
from nlp_pneumonia_utils import list_errors
from visual import Vis
from visual import snippets_markup
from visual import view_pycontext_output
from visual import display_doc_text
# packages for interaction
from IPython.display import display, HTML
import ipywidgets

## 2. Load our training set

In [2]:
pos_doc_type='FAM_BREAST_CA_DOC'
annotated_doc_map = read_doc_annotations(archive_file='data/bc_train.zip', pos_type=pos_doc_type)
print('Total Annotated Documents : {0}'.format(len(annotated_doc_map)))

Reading annotations from file : data/bc_train.zip
Opening local file : data/bc_train.zip
Total Annotated Documents : 60


## 3. Read in our Knowledge Base files
**The targets file is seeded with two target terms: "breast cancer" and "breast carcinoma"**  ([target rule file](/edit/KB/fam_bc_targets.yml))  
**The modifier file has all modifiers available with the pyConText distribution, but the family history modifiers are not complete** ([modifier rule file](/edit/KB/fam_bc_modifiers.yml)) 

In [3]:
TARGETS_FILE_PATH = 'KB/fam_bc_targets.yml'
MODIFIERS_FILE_PATH = 'KB/fam_bc_modifiers.yml'
FEATURE_INFERENCER_FILE_PATH = 'KB/fam_bc_featurer_inferences.csv'
DOC_INFERENCER_FILE_PATH = 'KB/fam_bc_doc_inferences.csv'
# clear just in case files/regular expressions have been updated
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})
classifier.reset_saved_predictions()

## 4. Let's classify some documents
The function *list_errors* wraps up several functions together. It compares the classifier's conclusions against the reference standard (manually annotated documents), and returns the false positive documents (with pyConText markups), the false negative documents (with manual annotations), and the measurements (precision, recall and F1).


For the detailed implementation of this *list_errors* function, you can check the code in [nlp_pneumonia_utils](/edit/nlp_pneumonia_utils.py).
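
If you would like a feel for what such a comparison involves without opening the file, the sketch below shows the general idea on toy data. It is not the code in `nlp_pneumonia_utils`; the function name `compare_to_reference` and the dictionary inputs are assumptions made for illustration only.

```python
def compare_to_reference(predicted, reference):
    """Compare predicted labels with reference labels (illustrative helper).

    Both arguments map a document name to 1 (positive) or 0 (negative).
    Returns (false_negatives, false_positives, measurements).
    """
    false_negatives = sorted(doc for doc, gold in reference.items()
                             if gold == 1 and predicted.get(doc, 0) == 0)
    false_positives = sorted(doc for doc, gold in reference.items()
                             if gold == 0 and predicted.get(doc, 0) == 1)
    true_positives = sum(1 for doc, gold in reference.items()
                         if gold == 1 and predicted.get(doc, 0) == 1)

    predicted_positives = true_positives + len(false_positives)
    actual_positives = true_positives + len(false_negatives)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return false_negatives, false_positives, {"precision": precision, "recall": recall, "f1": f1}


# Toy data with made-up document names
toy_reference = {"doc_a": 1, "doc_b": 1, "doc_c": 0, "doc_d": 0}
toy_predicted = {"doc_a": 1, "doc_b": 0, "doc_c": 1, "doc_d": 0}

fns, fps, scores = compare_to_reference(toy_predicted, toy_reference)
print("False negatives:", fns)
print("False positives:", fps)
print(scores)
```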

In [4]:
%%time
print('****************')
print('Performance for Classifier :')
current_false_negatives, current_false_positives, measurements,confusion_matrix_df = classifier.eval(annotated_doc_map)
print(measurements)
display(confusion_matrix_df)
print('****************')

****************
Performance for Classifier :
Start to evaluate against reference standards...
Precision : 0.958
Recall :    0.719
F1:         0.821


| Actual \ Predicted | 1 | 0 |
|---|---|---|
| 1 | 23 | 9 |
| 0 | 1 | 27 |


****************
CPU times: user 15.9 s, sys: 19.1 ms, total: 15.9 s
Wall time: 15.9 s
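
As a quick sanity check, you can reproduce the three measurements by hand from the confusion matrix counts printed above. This sketch assumes the usual definitions of precision, recall and F1, with rows as actual labels and columns as predicted labels.

```python
import math

# Counts read off the confusion matrix above
tp, fn = 23, 9   # actual positive documents: predicted 1, predicted 0
fp, tn = 1, 27   # actual negative documents: predicted 1, predicted 0

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision : {precision:.4f}")
print(f"Recall    : {recall:.4f}")
print(f"F1        : {f1:.4f}")

# Compare with the values reported by the classifier, allowing for rounding
reported = {"precision": 0.958, "recall": 0.719, "f1": 0.821}
computed = {"precision": precision, "recall": recall, "f1": f1}
matches_report = all(math.isclose(computed[k], reported[k], abs_tol=0.0005) for k in reported)
print("Agrees with the reported values:", matches_report)
```

With only one false positive and nine false negatives, recall is where most of the room for improvement lies.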


## 5. Development of your system: can you improve the performance?
* We have found the tools below for highlighting and graphing False Positives and False Negatives to be very useful.  We've provided them in case they help you as well.

Instructions:
1. run the system and calculate performance
2. review false negatives and positives and make changes to the target file or the modifier file
3. repeat
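
When you repeat this loop, it helps to know whether a rule change fixed errors or introduced new ones. The helper below is a small sketch for that; the name `diff_errors` is ours, and it assumes your false negatives (or false positives) can be iterated as document names, for example the keys of `current_false_negatives`.

```python
def diff_errors(before, after):
    """Compare two collections of error document names (illustrative helper)."""
    before, after = set(before), set(after)
    return {
        "fixed": sorted(before - after),        # errors that disappeared
        "introduced": sorted(after - before),   # errors that are new
        "remaining": sorted(before & after),    # errors still present
    }


# Made-up document names standing in for two consecutive runs
previous_fns = ["doc_1", "doc_2", "doc_3"]
latest_fns = ["doc_2", "doc_4"]

change = diff_errors(previous_fns, latest_fns)
print(change)
```

Saving a copy of the error list before each rule change gives you the `before` argument for the next comparison.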

### 5.1 Review the False Negatives - we have provided two viewers below

There are three reasons that our pipeline gets false negative errors:

1. The term is not in your targets file. If so, you will need to add it to KB/fam_bc_targets.yml
2. Our context rules didn't identify the **family context**. If so, you will need to add the missing rule to KB/fam_bc_modifiers.yml
3. Our context rules **excluded** the target concept. If so, you will need to locate the context rule, then remove or modify it in your KB/fam_bc_modifiers.yml
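
The three cases above can be turned into a rough triage routine. The sketch below uses plain regular expressions as stand-ins for your rule files; both patterns and the function name `triage_false_negative` are assumptions for illustration and do not reflect the actual contents of the starter files.

```python
import re

# Hypothetical stand-ins for the rule files (assumptions, not the real rules)
TARGET_PATTERN = re.compile(r"\bbreast\s+(?:cancer|carcinoma)\b", re.IGNORECASE)
FAMILY_PATTERN = re.compile(r"\b(?:sister|mother|daughter|aunt)s?\b", re.IGNORECASE)


def triage_false_negative(snippet):
    """Suggest which of the three cases most likely applies to a snippet."""
    if not TARGET_PATTERN.search(snippet):
        return "case 1: target term missing from the targets file"
    if not FAMILY_PATTERN.search(snippet):
        return "case 2: family context missing from the modifiers file"
    return "case 3: look for a modifier rule that excludes the target"


snippets = [
    "daughter is a poor historian, has h/o breast ca and possible pulmonary fibrosis",
    "significant only for a maternal niece with breast cancer at the age of 78",
    "The patient has a sister with breast cancer and two sisters with liver cancer",
]

triage = [triage_false_negative(s) for s in snippets]
for snippet, verdict in zip(snippets, triage):
    print(verdict, "|", snippet)
```

A verdict here is only a hint about where to look first. The pyConText viewers below show what the rules actually did.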

## False Negative Viewer - reference standard snippet annotations

In [5]:
# keep only the false negative documents, then show their reference standard snippets
fn_docs = dict((k, v) for k, v in annotated_doc_map.items() if k in current_false_negatives)
display(HTML(snippets_markup(fn_docs, 'FAM_BREAST_CA')))

| Document name | Snippets |
|---|---|
| 16709 | . FAMILY HISTORY: The patient has a sister with breast cancer and two sisters with liver cancer. Additionally |
| 24880 | y is significant only for a maternal niece with breast cancer at the age of 78. Physical Exam: Vitals - 97 |
| 24389 | .  Family History: Family History: One sister: Breast cancer and colon cancer. One aughter: Breast cancer an |
| | : Breast cancer and colon cancer. One aughter: Breast cancer and colon cancer, thrombocytosis. Other daughte |
| 35010 | |
| 17641 | H: unclear, daughter is a poor historian, has h/o breast ca and possible pulmonary fibrosis, may also have d |


If you are sure the target terms have been included in the targets file, then these false negative errors must be caused by modifiers that excluded these targets. Let's take a look at what the pyConText output looks like:

## False Negative Viewer - pyConText annotations

In [6]:
# set up the visualizer for pyConText output
vis=Vis(MODIFIERS_FILE_PATH)
fn_docs = dict((k,v) for k, v in classifier.saved_markups_map.items() if k in current_false_negatives)
view_pycontext_output(fn_docs,vis)


### 5.2 Review the false positives
For False Positives, it's most useful to see a pyConText graph, since some modifiers may need to be adjusted so that targets are used properly in classification.

In [7]:
fp_docs = dict((k,v) for k, v in classifier.saved_markups_map.items() if k in current_false_positives)
view_pycontext_output(fp_docs,vis)


### 5.3 pyConText playground
After you change your target and modifier rules, type a sentence into the cell below and make sure the rules do what you think they do.


In [11]:
# Refresh the classifier with updated rules
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})

# Avoid the name "str", which would shadow Python's built-in str type
sentence = '''his sister was dx breast cancer 20 years ago'''
res = classifier.predict(sentence)
print("Positive" if res == 1 else "Negative")
view_pycontext_output(classifier.get_last_context_doc(), vis)

Negative
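
Once you have a few sentences you care about, it can be handy to check them all in one go after every rule change. The sketch below assumes a prediction function that returns 1 for a positive sentence, as `classifier.predict` does in the cell above. The name `run_playground` and the stand-in `toy_predict` are hypothetical and exist only so the example runs on its own.

```python
def run_playground(predict, sentences):
    """Return (sentence, label) pairs, with label 'Positive' or 'Negative'."""
    results = []
    for sentence in sentences:
        label = "Positive" if predict(sentence) == 1 else "Negative"
        results.append((sentence, label))
    return results


def toy_predict(sentence):
    """Stand-in predictor (an assumption): replace with classifier.predict."""
    text = sentence.lower()
    return 1 if "sister" in text and "breast cancer" in text else 0


test_sentences = [
    "his sister was dx breast cancer 20 years ago",
    "patient has no family history of cancer",
]

checks = run_playground(toy_predict, test_sentences)
for sentence, label in checks:
    print(label, "|", sentence)
```

In the notebook you would call `run_playground(classifier.predict, test_sentences)` after refreshing the classifier with your updated rules.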


## 6. Test Set Evaluation 
* We've been waiting for the test set.  It will not be available until the morning of the final class session.
* At that time, you can uncomment this code and make any changes to it as instructed by the class instructors:

In [12]:
%%time
annotated_doc_map = read_doc_annotations(archive_file='img/bc_test.zip', pos_type=pos_doc_type)

classifier.reset_saved_predictions()
print('****************')
print('Performance for Classifier on test set:')
current_false_negatives, current_false_positives, measurements,confusion_matrix_df = classifier.eval(annotated_doc_map)
print(measurements)
display(confusion_matrix_df)
print('****************')

Reading annotations from file : img/bc_test.zip
Opening local file : img/bc_test.zip
****************
Performance for Classifier on test set:
Start to evaluate against reference standards...
Precision : 1.000
Recall :    0.682
F1:         0.811


| Actual \ Predicted | 1 | 0 |
|---|---|---|
| 1 | 15 | 7 |
| 0 | 0 | 18 |


****************
CPU times: user 7.67 s, sys: 6.75 ms, total: 7.68 s
Wall time: 7.68 s


<br/><hr/>This material is presented as part of the Foundations of Healthcare Informatics Course, 2017 Fall, BMI, University of Utah. It is revised from the <a href="https://github.com/UUDeCART/decart_rule_based_nlp">material</a> of the DeCART  Summer Program (Data, exploration, Computation, and Analytics Real-world Training for the Health Sciences) at the University of Utah in 2017. <br/><br/>Original presenters : Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson.<br/>
Revised by: Jianlin Shi and Dr. Wendy Chapman<br/>
<img align="left" src="https://wiki.creativecommons.org/images/1/10/Cc.org_cc_by_license.jpg" alt="Except where otherwise noted, this website is licensed under a Creative Commons Attribution 3.0 Unported License.">