# DSCI6003 4.1 Lab


### 1: KFold CV

We will implement K-fold cross-validation **on the training dataset** of the loan dataset.

We're going to use the FICO Loan dataset. We want to predict whether or not you get approved for a loan at an interest rate of 12% or lower, given the FICO Score, Loan Length and Loan Amount. Here's the code to load the data:

```python
import pandas as pd
df = pd.read_csv('data/loanf.csv')
y = (df['Interest.Rate'] <= 12).values
X = df[['FICO.Score', 'Loan.Length', 'Loan.Amount']].values
```

`sklearn` has its own implementation of K-fold cross-validation,
`cross_val_score()` (in `sklearn.model_selection`; older releases exposed it as
`sklearn.cross_validation.cross_val_score()`).
However, to ensure you have an understanding of K-fold, you will implement it
here using the more general `KFold` class in `sklearn`.

<br>

1. Randomly split the **training dataset** into **k** folds.

2. For each fold, use that fold as the test set and combine the remaining
   **k - 1** folds into the training set. All of this happens
   **on the training dataset**. Outside of the k-fold, there should be
   another set which will be referred to as the **hold-out set**.

3. Train your model on your constructed training set and evaluate it on the corresponding test fold.

4. Repeat steps __2__ and __3__ _k_ times.

5. Average your error metric over the _k_ folds.

6. Compare the MSE for a simple **single** test/train split to your K-fold cross-validated error from step `5.`.

7. Plot a learning curve and a test vs training error curve.
   (You might want to use: [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html) which is scikit-learn's built-in
   function for K-fold cross validation).  See [Illustration of Learning Curves](http://www.astro.washington.edu/users/vanderplas/Astr599/notebooks/18_IntermediateSklearn) for more details.
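
The steps above can be sketched roughly as follows. This is only one possible reading of the exercise, not a reference solution: it assumes a recent scikit-learn where `KFold` lives in `sklearn.model_selection`, it uses synthetic data from `make_classification` in place of the loan file so that it runs anywhere, and `kfold_error` is a hypothetical helper name. The error metric is the misclassification rate, which for 0/1 labels equals the MSE between predicted and true labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the loan data (an assumption, not the real file).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Set the hold-out set aside first; K-fold only ever sees the training part.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)


def kfold_error(X, y, n_splits=5, seed=0):
    """Average misclassification rate of LogisticRegression over k folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        # Train on k - 1 folds, evaluate on the fold that was left out.
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        fold_errors.append(float(np.mean(predictions != y[test_idx])))
    return float(np.mean(fold_errors)), fold_errors


cv_error, fold_errors = kfold_error(X_train, y_train)
print("K-fold error: %.3f" % cv_error)
```

The hold-out set is never touched inside `kfold_error`, so it stays available for a final, unbiased check of whichever model you settle on.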

<div style="background: yellow; padding:10px">
Once you find the optimal hyperparameters, retrain on **ALL** of your data to create your final model.
</div>

### 2: ROC Curves 

One of the best ways to evaluate how a classifier performs is an [ROC curve](http://en.wikipedia.org/wiki/Receiver_operating_characteristic).

![](images/roc_curve.png)

To understand what is actually happening with an ROC curve, we can create one ourselves.  Here is pseudocode to compute the points that it plots.

The `probabilities` are values in (0,1) returned from Logistic Regression. The standard default threshold is 0.5: values below 0.5 are predicted as the negative class and values of 0.5 and above are predicted as the positive class.

The `labels` are the true values.

```
function ROC_curve(probabilities, labels):
    Sort instances by their prediction strength (the probabilities)
    For every instance in increasing order of probability:
        Set the threshold to be the probability
        Set everything above the threshold to the positive class
        Calculate the True Positive Rate (aka sensitivity or recall)
        Calculate the False Positive Rate (1 - specificity)
    Return three lists: TPRs, FPRs, thresholds
```

Recall that the true positive **rate** is

```
 number of true positives     number correctly predicted positive
-------------------------- = -------------------------------------
 number of positive cases           number of positive cases
```

and the false positive **rate** is

```
 number of false positives     number incorrectly predicted positive
--------------------------- = ---------------------------------------
  number of negative cases           number of negative cases
```

We are going to be implementing the `roc_curve` function.

Here's some example code that you should be able to use to plot the ROC curve with your function. This uses a fake dataset.

```python
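from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# train_test_split lives in sklearn.model_selection in current scikit-learn
# (it was in sklearn.cross_validation in old releases).
from sklearn.model_selection import train_test_split
# Your own implementation from roc_curve.py (see Part 3).
from roc_curve import roc_curve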

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=2, n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

tpr, fpr, thresholds = roc_curve(probabilities, y_test)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity, Recall)")
plt.title("ROC plot of fake data")
plt.show()
```

### 3: ROC Curve Implementation

1. Write an ROC curve function to compute the above in `roc_curve.py`.

    It should take as input the predicted probabilities and the true labels.

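2. Run the above code to verify that it's working correctly. You can also validate your correctness against [scikit-learn's built-in function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html). Note that scikit-learn's version takes the true labels as its first argument and returns the FPRs before the TPRs.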

3. Let's see how the ROC curve looks on a real dataset. We're going to use the FICO Loan dataset. We want to predict whether or not you get approved for a loan at an interest rate of 12% or lower, given the FICO Score, Loan Length and Loan Amount. Here's the code to load the data:

    ```python
    import pandas as pd
    df = pd.read_csv('data/loanf.csv')
    y = (df['Interest.Rate'] <= 12).values
    X = df[['FICO.Score', 'Loan.Length', 'Loan.Amount']].values
    ```

    Make sure to split your data into training and testing using sklearn's [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html).
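
For the validation step, a comparison against scikit-learn might look like the sketch below. It assumes a recent scikit-learn (imports from `sklearn.model_selection`), reuses the fake dataset rather than the loan file, and the `sk_` prefixed names are only there to avoid clashing with your own `roc_curve`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score
from sklearn.metrics import roc_curve as sk_roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=2, n_samples=1000,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

# scikit-learn takes the labels first and returns FPR before TPR.
sk_fpr, sk_tpr, sk_thresholds = sk_roc_curve(y_test, probabilities)

# Area under the curve, computed two ways as a sanity check.
area_from_curve = auc(sk_fpr, sk_tpr)
area_direct = roc_auc_score(y_test, probabilities)
print("AUC: %.3f" % area_direct)
```

Plotting `sk_fpr` against `sk_tpr` on the same axes as your own curve is a quick visual check; the two should overlap, although scikit-learn may drop some intermediate points.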

### 4: Case Study -- Graduate School Admissions

The data we will be using is the graduate school admissions data we saw before.

* `admit`: whether or not the applicant was admitted to graduate school
* `gpa`: undergraduate GPA
* `GRE`: GRE test score
* `rank`: prestige of undergraduate school (1 is highest prestige, à la Harvard)

Remember, we will use the GPA, GRE, and rank of the applicants to try to predict whether or not they will be accepted into graduate school.

#### 5: Treating the data with a model

Now we're ready to fit our data with Logistic Regression and, today, evaluate it with an ROC curve.  Remember the following from earlier in class and use sklearn to fit a logistic regression again:

* Use sklearn's [KFold cross validation](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html) and [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to calculate the average accuracy, precision and recall.

    Hint: Use sklearn's implementation of these scores in [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

* The `rank` column is numerical, but as it has 4 buckets, we could also consider it to be categorical. Use pandas' [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html) to binarize the column.
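
A rough sketch of both bullets together is shown below. The rows are randomly generated stand-ins, not the real admissions data, and the lowercase column names (`gre`, `gpa`, `rank`, `admit`) are assumptions; it also assumes a recent scikit-learn for `sklearn.model_selection` and the `zero_division` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import KFold

# Made-up rows standing in for the admissions data (assumption).
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    'gre': rng.randint(300, 800, size=n),
    'gpa': rng.uniform(2.0, 4.0, size=n).round(2),
    'rank': rng.randint(1, 5, size=n),
})
# Fake target loosely tied to GPA so that both classes appear.
df['admit'] = (df['gpa'] + rng.normal(0, 0.5, size=n) > 3.2).astype(int)

# One indicator column per rank value; drop the first to avoid redundancy.
rank_dummies = pd.get_dummies(df['rank'], prefix='rank').iloc[:, 1:]
features = pd.concat([df[['gre', 'gpa']], rank_dummies], axis=1)
X = features.astype(float).values
y = df['admit'].values

scores = {'accuracy': [], 'precision': [], 'recall': []}
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    predicted = model.predict(X[test_idx])
    scores['accuracy'].append(accuracy_score(y[test_idx], predicted))
    scores['precision'].append(
        precision_score(y[test_idx], predicted, zero_division=0))
    scores['recall'].append(recall_score(y[test_idx], predicted))

# Average each metric over the five folds.
averages = {name: float(np.mean(values)) for name, values in scores.items()}
print(averages)
```

Dropping one dummy column is a design choice: the dropped rank becomes the baseline that the other rank coefficients are measured against.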

6. Make a plot of the ROC curve (using your function defined in Part 3).

7. Is it possible to pick a threshold where TPR > 60% and FPR < 40%? What is the threshold?

    *Note that even if it appears to be in the middle of the graph it doesn't make the threshold 0.5.*
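
One way to answer the threshold question is to filter the arrays returned by your ROC function with a boolean mask. The numbers below are made up purely for illustration and are not output from the admissions model.

```python
import numpy as np

# Hypothetical ROC output (made-up numbers, not real results).
thresholds = np.array([0.2, 0.3, 0.45, 0.6, 0.8])
tpr = np.array([1.0, 0.9, 0.75, 0.5, 0.2])
fpr = np.array([1.0, 0.6, 0.25, 0.1, 0.0])

# Keep only the thresholds that satisfy both constraints at once.
acceptable = (tpr > 0.6) & (fpr < 0.4)
candidate_thresholds = thresholds[acceptable]
print(candidate_thresholds)  # [0.45]
```

If `candidate_thresholds` comes back empty, no threshold meets both constraints for that classifier.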

### 6: Using the Youden Index

Youden's Index (sometimes called the J statistic) is similar to the F1 score in that it is a single number that describes the performance of a classifier.

$$J = \text{Sensitivity} + \text{Specificity} - 1$$

where

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

![](http://i.stack.imgur.com/ysM0Z.png)

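The J statistic can in principle range from -1 to 1, but for any classifier that does at least as well as chance it falls between 0 and 1:

* 0 indicating that the classifier does no better than random
* 1 indicating that the test performed perfectly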

It can be thought of as an improvement on the F1 score since it takes into account all of the cells in a confusion matrix.  It can also be used to find the optimal threshold for a given ROC curve.
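
Since sensitivity is the TPR and specificity is 1 - FPR, the index reduces to J = TPR - FPR, so the optimal threshold is the point on the ROC curve where that difference is largest. A small sketch with made-up numbers (not results from any dataset in this lab):

```python
import numpy as np

# Hypothetical ROC output (made-up numbers for illustration only).
thresholds = np.array([0.2, 0.3, 0.45, 0.6, 0.8])
tpr = np.array([1.0, 0.9, 0.75, 0.5, 0.2])
fpr = np.array([1.0, 0.6, 0.25, 0.1, 0.0])

# J = sensitivity + specificity - 1 = TPR + (1 - FPR) - 1 = TPR - FPR
j_scores = tpr - fpr
best_index = int(np.argmax(j_scores))
best_threshold = float(thresholds[best_index])
best_j = float(j_scores[best_index])
print(best_threshold, best_j)  # 0.45 0.5
```

Geometrically, `best_j` is the largest vertical distance between the ROC curve and the diagonal line of a random classifier.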