In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

### Read the set of labels into a dataframe:

In [None]:
labels = pd.read_csv('labels.csv')

In [None]:
labels.head()

### Start by assessing the radiologist's performance:
* Assess the _accuracy_ of the radiologist by looking at the percentage of cases that they labeled correctly
* Next, look at the true positive and true negative rates of the radiologist by generating a _confusion matrix_

In [None]:
# Accuracy: fraction of cases where the radiologist agrees with the perfect labeler
radiologist_accuracy = (labels.perfect_labeler == labels.radiologist).mean()

In [None]:
radiologist_accuracy

In [None]:
# Rows are true labels, columns are the radiologist's labels, both ordered as given in `labels`
confusion_matrix(labels.perfect_labeler.values, labels.radiologist.values, labels=["cancer", "benign"])
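
The true positive and true negative rates mentioned above can be read off the confusion matrix. The following is a minimal sketch on made-up toy labels, not on `labels.csv`; the names `truth` and `reader` are placeholders for `labels.perfect_labeler.values` and `labels.radiologist.values`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not taken from labels.csv).
truth = np.array(["cancer", "cancer", "cancer",
                  "benign", "benign", "benign", "benign", "benign"])
reader = np.array(["cancer", "cancer", "benign",
                   "benign", "benign", "benign", "cancer", "benign"])

# With labels=["cancer", "benign"], row 0 holds the true cancers and
# row 1 the true benigns, so the flattened order is TP, FN, FP, TN.
tp, fn, fp, tn = confusion_matrix(
    truth, reader, labels=["cancer", "benign"]
).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Note that this ordering depends on passing `labels=["cancer", "benign"]`; with a different label order the four counts come out in a different sequence.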

### Now look at the algorithm's performance compared to the perfect labeler:
* The algorithm doesn't create a binary label; it returns a _probability_ of cancer instead. Choose a probability cut-off to use for the algorithm's labeling of cancer vs. benign. _(Hint: 0.5 is a reasonable starting place)_
* Start by assessing _accuracy_ again
* Generate a confusion matrix
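
One possible way to carry out these steps is sketched below. It runs on a small made-up dataframe rather than `labels.csv`, and it assumes the probability column is called `algorithm`; that column name and the `algorithm_label` column are assumptions, so substitute whatever `labels.head()` shows.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy stand-in for labels.csv; values and the `algorithm` column name are assumed.
toy = pd.DataFrame({
    "perfect_labeler": ["cancer", "cancer", "cancer", "benign", "benign", "benign"],
    "algorithm": [0.90, 0.55, 0.45, 0.58, 0.30, 0.10],
})

threshold = 0.5
# Probabilities at or above the cut-off become "cancer", the rest "benign".
toy["algorithm_label"] = np.where(toy["algorithm"] >= threshold, "cancer", "benign")

# Accuracy: fraction of cases where the thresholded label matches the reference.
algorithm_accuracy = (toy["perfect_labeler"] == toy["algorithm_label"]).mean()

# Rows are true labels, columns are predicted labels: TP, FN, FP, TN when flattened.
tp, fn, fp, tn = confusion_matrix(
    toy["perfect_labeler"], toy["algorithm_label"], labels=["cancer", "benign"]
).ravel()

print(algorithm_accuracy, tp, fn, fp, tn)
```

Whether a probability exactly equal to the cut-off counts as cancer (`>=`) or benign (`>`) is a design choice; the sketch uses `>=`.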

What happens now if you change the threshold cut-off for your algorithm's classification to 0.4? What if you raise it to 0.6? How do accuracy, fp, fn, tp, and tn change?
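
To compare cut-offs side by side, the thresholding can be wrapped in a small helper and run at 0.4, 0.5, and 0.6. This is a sketch on made-up toy data; `evaluate_at_threshold` is a hypothetical helper and `algorithm` is an assumed column name, neither comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy stand-in for labels.csv; values and the `algorithm` column name are assumed.
toy = pd.DataFrame({
    "perfect_labeler": ["cancer", "cancer", "cancer", "benign", "benign", "benign"],
    "algorithm": [0.90, 0.55, 0.45, 0.58, 0.30, 0.10],
})


def evaluate_at_threshold(truth, probability, threshold):
    """Return accuracy and confusion-matrix counts for one cut-off (hypothetical helper)."""
    truth = np.asarray(truth, dtype=str)
    predicted = np.where(np.asarray(probability) >= threshold, "cancer", "benign")
    tp, fn, fp, tn = confusion_matrix(
        truth, predicted, labels=["cancer", "benign"]
    ).ravel()
    return {
        "threshold": threshold,
        "accuracy": (tp + tn) / len(truth),
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
    }


# One row per cut-off, so the statistics can be compared directly.
sweep = pd.DataFrame(
    [evaluate_at_threshold(toy["perfect_labeler"], toy["algorithm"], t)
     for t in (0.4, 0.5, 0.6)]
)
print(sweep)
```

On this toy data, lowering the cut-off trades false negatives for false positives and raising it does the reverse; the real `labels.csv` may show a different balance.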

### Finally, let's compare our algorithm to the radiologist
* A "perfect labeler" might not exist in the real world, and in fact, it often does not
* In AI for medical imaging, using a radiologist's labels as the "true" label is often the standard of practice. In both academic settings and the regulated industry landscape, algorithm performance is judged against an expert human

* Repeat the steps above using a set threshold for your algorithm (again, 0.5 is perfectly reasonable) but now computing accuracy, tp, tn, fp, fn against the radiologist. 
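
A sketch of the same computation with the radiologist as the reference is shown below. The only change is which column is passed as the "true" label. The data are made up, the values are arbitrary, and `algorithm` is an assumed column name.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy stand-in for labels.csv; values and the `algorithm` column name are assumed.
toy = pd.DataFrame({
    "perfect_labeler": ["cancer", "cancer", "cancer", "benign", "benign", "benign"],
    "radiologist":     ["cancer", "cancer", "benign", "benign", "benign", "benign"],
    "algorithm": [0.90, 0.55, 0.45, 0.58, 0.30, 0.10],
})

threshold = 0.5
toy["algorithm_label"] = np.where(toy["algorithm"] >= threshold, "cancer", "benign")

# The radiologist's read now plays the role of the reference label.
accuracy_vs_radiologist = (toy["radiologist"] == toy["algorithm_label"]).mean()
tp, fn, fp, tn = confusion_matrix(
    toy["radiologist"], toy["algorithm_label"], labels=["cancer", "benign"]
).ravel()

# Same predictions scored against the perfect labeler, for comparison.
accuracy_vs_perfect = (toy["perfect_labeler"] == toy["algorithm_label"]).mean()

print(accuracy_vs_radiologist, accuracy_vs_perfect, tp, fn, fp, tn)
```

The toy values were chosen only to show that the two references can give different numbers for the same predictions; they say nothing about what the real data will show.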

## Reflection: 
* In the exercise above, you assessed the performance of a human and of an algorithm against a 'perfect labeler', and also against each other.
* Does accuracy seem like the appropriate statistic to use when evaluating these labels? Why or why not? 
* In what clinical settings does it seem more or less acceptable to have a high level of FNs? FPs? 
* How did changing the algorithm's threshold change the different performance statistics?
* How did your opinion of the algorithm's performance change when you started comparing it to a radiologist instead of the perfect labeler? What does this mean for a real-world scenario when a perfect labeler doesn't exist, and we only have a radiologist's read to base our performance on? 