In [1]:
from IPython.display import HTML
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
HTML(urlopen('https://gist.githubusercontent.com/mattlewissf/83989910849fdb4a04a72d431e84053f/raw/cefa015a9065665faccd0219774c7087be7d21a8/skeleton.css').read().decode('utf-8'))

#### MIMIC Deep Dive - Understanding How to Evaluate our Classifier
**[Intro](#intro)**   
**[30 Day Readmission](#30_day_readmission)**  
**[The MIMIC Dataset](#mimic_dataset)**  
**[Setting up the database](#setting_up_db)** 

Finally, we've reached a point where we can start to understand how well our model predicts 30-day readmissions. But before we start importing classifiers, it is important to understand how to evaluate what our model gives us back, and how to answer a few key questions. 

Creating a model that can predict the likelihood of patient readmission based on a discharge event is a problem of **classification**. Classification can be understood as two distinct tasks: creating clusters within an existing set of data, or 'unsupervised learning'; or classifying new data into an already understood class, or 'supervised learning'.  Our goal is the latter, as we want to be able to predict whether a given patient (with a given index admission) should be considered 'a likely readmitter' or not. 

<br></br> 

Our model will push each patient into one of two groups (readmit / no readmit). Ignoring for now how we define 'likely' in our readmission (we'll get back to it), we can start with a pretty basic question: how do we know if our model has done a good job? 

<br></br> 


#### Classification and how we measure success

Since our output is a binary measure of whether or not we think a patient will be readmitted within 30 days, and we are comparing our predictions on test data to outcomes where we know the result, we might initially be tempted to look at something like the 'accuracy' of our model. In this case, **accuracy** might be defined as: 

$$ accuracy =  \frac{\text{(true positives + true negatives)}}{\text{all cases}}$$

<br></br>

However, accuracy isn't going to be super useful for us here. While an accuracy of 95% sounds pretty great, it doesn't tell us all that much. For example, a 95% accuracy measure might mean that we have a data set where one outcome happens only 5% of the time - and all that we've done to earn that 95% accuracy is to never predict the less likely outcome. To quote a [loftier source](http://www.umich.edu/~ners580/ners-bioe_481/lectures/pdfs/1978-10-semNucMed_Metz-basicROC.pdf): 
> "Though accuracy provides a single simple number for diagnostic performance, it is often too simple and must be interpreted with
considerable caution. The limitations of this index force us to introduce some complexity into our evaluation scheme: We must sort out the effect of disease prevalence, and we must score separately the various kinds of right and wrong diagnostic decisions."
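
To make the point concrete, here is a small sketch using made-up labels (not MIMIC data). It assumes 5 of 100 hypothetical patients are readmitted, and scores a "model" that never predicts readmission:

```python
# Hypothetical toy labels: 1 = readmitted, 0 = not readmitted (5% prevalence).
y_true = [1] * 5 + [0] * 95
# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the known outcome.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / float(len(y_true))

# Number of actual readmissions the model managed to flag.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)  # 0.95
print(caught)    # 0
```

The accuracy looks excellent even though not a single readmission was identified, which is exactly the trap described above.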





More useful for our purposes are **sensitivity** (when the answer is yes, how often does the model predict yes?) and **specificity** (when the answer is no, how often does the model predict no?). Additionally, we want to know how often we incorrectly characterized an example. One way to display this information is called a **confusion matrix** <sub>[1](#bib)</sub>: 

![Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.](http://i.imgur.com/cEQmS0f.png)
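
As a rough sketch of how the four cells of the matrix turn into sensitivity and specificity, the snippet below tallies them by hand. The function name `confusion_counts` and the labels are hypothetical, chosen only for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # readmitted, and we predicted readmission
        elif t == 0 and p == 1:
            fp += 1  # not readmitted, but we predicted readmission
        elif t == 0 and p == 0:
            tn += 1  # not readmitted, and we predicted no readmission
        else:
            fn += 1  # readmitted, but we predicted no readmission
    return tp, fp, tn, fn

# Hypothetical outcomes and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / float(tp + fn)  # true positive rate
specificity = tn / float(tn + fp)  # true negative rate

print(sensitivity)  # 0.75
```

In practice `sklearn.metrics.confusion_matrix` produces the same tally; the hand-rolled version is only meant to show what the cells count.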




#### Thresholds and weights

In a perfect world, there would be a clear-cut boundary that lets our model separate the two groups. But that's not going to be the case with our model. Our model gives us a probability for each patient record - but where is the threshold at which we should consider a person to be a 'likely readmit'? 

The answer is that it depends. If we are very concerned with minimizing false positives, for example, we might set a relatively high threshold for inclusion in the 'readmit' group - however, this will lead to an increase in false negatives. If we were more concerned with grabbing every likely positive (perhaps for an online advertising campaign), our threshold might be considerably lower. 
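
The trade-off can be seen in a few lines. This is a minimal sketch with invented probabilities and outcomes; `classify` and the two thresholds (0.3 and 0.7) are arbitrary choices for illustration, not values from our model:

```python
def classify(probabilities, threshold):
    """Label a record 1 (readmit) when its predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities and the matching true outcomes.
probs = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
y_true = [0, 0, 1, 0, 0, 1, 1, 1]

results = {}
for threshold in (0.3, 0.7):
    y_pred = classify(probs, threshold)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    results[threshold] = (fp, fn)  # (false positives, false negatives)

print(results[0.3])  # (2, 0)
print(results[0.7])  # (0, 2)
```

Raising the threshold removes the false positives in this toy data, but only by turning two real readmissions into false negatives.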

<br></br>

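The most common way of representing performance across the trade-off between sensitivity and specificity is to plot an **ROC curve**, which has the false positive rate (FPR, equal to 1 - specificity) on the x-axis and the true positive rate (TPR, another name for sensitivity) on the y-axis: 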

![](http://i.imgur.com/7dXmYt0.png)








The graph above shows an ROC curve with totally random performance - anything below the dotted line (to the right and down) is worse than guessing randomly. We'll be creating ROC curves for our model. 
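
An ROC curve is built by sweeping the threshold and recording one (FPR, TPR) point per setting. The sketch below does this by hand on invented scores; `roc_points` is a hypothetical helper written for this illustration, not part of any library:

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by using every distinct score as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing positive: the (0, 0) corner.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / float(negatives), tp / float(positives)))
    return points

# Hypothetical predicted probabilities and true outcomes.
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
y_true = [0, 0, 1, 0, 0, 1, 1, 1]

points = roc_points(y_true, scores)
print(points[0])   # (0.0, 0.0)
print(points[-1])  # (1.0, 1.0)
```

Plotting these points with FPR on the x-axis and TPR on the y-axis gives the curve. `sklearn.metrics.roc_curve` computes the same kind of sweep more efficiently.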

<br></br>

Here's a ROC curve based on just a bit of our data, using a **something** classifier

{add image here when you get back}

<br></br>


Since the ROC curve is a two-dimensional description of our classifier's performance, we'll want to calculate the Area Under the Curve (AUC) for the ROC curves that we produce. The AUC reduces the curve to a single number that will be our metric for the general effectiveness of our predictor, with an ideal value of 1.0 and a value of 0.5 for random guessing. 
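
One simple way to compute the area is the trapezoidal rule over the (FPR, TPR) points. The following is a sketch under that assumption; `auc_trapezoid` and the example points are hypothetical:

```python
def auc_trapezoid(points):
    """Area under a curve given (x, y) points sorted by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Each pair of neighbouring points contributes one trapezoid.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points as (FPR, TPR) pairs, from (0, 0) to (1, 1).
roc = [(0.0, 0.0), (0.0, 0.75), (0.5, 0.75), (0.5, 1.0), (1.0, 1.0)]
chance = [(0.0, 0.0), (1.0, 1.0)]  # the diagonal of a random guesser

auc_model = auc_trapezoid(roc)
auc_chance = auc_trapezoid(chance)

print(auc_model)   # 0.875
print(auc_chance)  # 0.5
```

`sklearn.metrics.auc` and `sklearn.metrics.roc_auc_score` cover the same calculation for real model output.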



#### Classifiers

Since we're here to learn, we'll be trying out a lot of different classifiers for our model and seeing which gives us the best predictions. All of these are packaged with scikit-learn, so besides computation time, there's not a lot of extra effort required to try them. We'll look into these more when we get our results.

- RandomForestClassifier
- AdaBoostClassifier
- GradientBoostingClassifier
- DecisionTreeClassifier
- LogisticRegression
- LogisticRegressionCV


#### Test, train, split

We've extracted our features into a pandas dataframe, and now we can easily set up our X and y for this model:

In [None]:
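# Separate the feature matrix X from the target vector y.
# df is the pandas DataFrame of extracted features from the previous step.
sk_features = df.columns[1:-1]  # every column except the first and the last
X = df[sk_features]
y = df["readmit_30"]  # 30-day readmission label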

We want to split the data into test and training sets in order to set up our model. We could do this in a simple way (the "holdout method") by just taking some part of both X and y and making those the test and training sets (we'd probably use [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). However, we're going to go a bit beyond this and use **k-fold cross validation** on our data. The k-fold method is similar to the holdout method, but we divide the data set into k subsets and repeat the holdout method k times. Each time, a different one of the k subsets is used as the test set, and the other k - 1 subsets are combined into the training set. 

<br></br>
<img src='http://www.mdpi.com/sensors/sensors-12-12489/article_deploy/html/images/sensors-12-12489f7-1024.png' width="900" height="400" align='float' margin-right=50px />
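
To make the mechanics concrete, here is a plain-Python sketch of how the row indices could be divided into k contiguous, unshuffled folds. The generator `k_fold_indices` is a hypothetical illustration of the idea, not the library implementation:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        # Everything outside the current test fold becomes training data.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Ten hypothetical rows split into five folds.
folds = list(k_fold_indices(10, 5))
print(folds[0][1])  # [0, 1]
print(folds[4][1])  # [8, 9]
```

Every row lands in the test set exactly once and in the training set the other k - 1 times, which is what lets us evaluate on all of the data without ever testing on rows the model was fit on.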

<br></br>

We'll use scikit-learn's very convenient [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) class to help us with this:

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)  # 10 folds is a generally recommended value
kf.get_n_splits(X)  # returns the number of folds (10)

We'll then use each of these separate folds to fit our classifier to the data. 

#####  **Next  |   [Fitting and Plotting]()**

#### Sources

- Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874