
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Model Evaluation & Unbalanced Classes Lab

_Authors: Matt Speck (DC), Matt Brems (DC)_

##  Overview

In this lab, we will explore several metrics for evaluating classification models, write a function that generates an ROC curve by hand, and explore methods for dealing with unbalanced classes. The area under the ROC curve (AUC ROC) is one of the most widely used metrics for evaluating a classification model. scikit-learn provides functions for calculating it, but one of the best ways to understand the ROC curve is to write it yourself.

Since the AUC ROC, like the other metrics in this lab, evaluates a model, we first need a model to evaluate. We'll use a logistic regression to predict whether or not a credit card client will default on their payment. The dataset and data dictionary are provided in the repository.

## Required Objectives

1. [Train a logistic regression on the `defaultcc.csv` dataset](#train-logit)
1. [Create a confusion matrix for your logistic regression](#confusion)
1. [Write a function that will plot an ROC curve](#roc-func)
1. [Write a function that takes in either 'sensitivity' or 'specificity' and a value of sensitivity/specificity, and returns the sensitivity, specificity, and threshold that generates the input value of sensitivity/specificity.](#sens-spec)
1. [Exacerbate unbalanced classes by artificially dropping 70% of the values in the underrepresented class.](#very-unbalanced)
1. [Explain which is worse in our case: false positives or false negatives](#fp-vs-fn)
1. [Build a new logistic regression model based on the unbalanced classes. Generate a confusion matrix based on this new model and evaluate.](#second-logit)
1. [Compare logistic regression models](#compare-logits)
1. [Try two methods of accounting for unbalanced classes: undersampling and changing the threshold](#two-methods)

## Bonus Objectives

2. [BONUS: Build out your function to approximate the area under the ROC curve.](#bonus-1)
2. [BONUS: Try accounting for unbalanced classes through oversampling until you get results that are 50% positive and 50% negative. Generate a confusion matrix and briefly summarize your results.](#bonus-2)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, auc, roc_curve
from random import seed

<a id='read-data'></a>

## Read in the defaultcc data from the repo. You may want to examine page 3 of the .pdf for a data dictionary.

In [None]:
df = pd.read_csv('defaultcc.csv', header=1) # The row at index 1 has the column names for this csv file

<a id='train-logit'></a>

## Fit a logistic regression model predicting whether or not someone will default on their credit card. 

<a id='confusion'></a>

## Generate a confusion matrix. Write a few sentences summarizing your results.
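As a rough illustration of what the confusion matrix counts, here is a minimal sketch that tallies the four cells by hand. The helper name `confusion_counts` and the toy labels are made up for this example and do not come from `defaultcc.csv`; it assumes the classes are coded as 0 and 1.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actual 1, predicted 1
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actual 0, predicted 0
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actual 0, predicted 1
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actual 1, predicted 0
    return tn, fp, fn, tp

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)  # share of actual 1s predicted as 1
specificity = tn / (tn + fp)  # share of actual 0s predicted as 0
```

With 0/1 labels, `confusion_matrix(y_true, y_pred).ravel()` from scikit-learn returns the counts in the same `tn, fp, fn, tp` order, so the two approaches can be checked against each other.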

<a id='roc-func'></a>

## Write a function that will create an ROC curve for you. Here's a strategy you might consider:

0. In order to even begin, you'll need some fit model. Build a logistic regression model with X and y as defined above.

1. We want to look at all values of your "threshold": any observation whose predicted probability (from `.predict_proba()`, not `.predict()`, which returns class labels) is above your threshold falls in the "positive class," and any observation below your threshold falls in the "negative class." Start the threshold at 0.

2. At this value of your threshold, calculate the sensitivity and specificity. Store these values.

3. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.

4. At this value of your threshold, calculate the sensitivity and specificity. Store these values.

5. Repeat steps 3 and 4 until you get to the threshold of 1.

6. Plot the values of sensitivity and 1 - specificity.
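The steps above can be sketched roughly as follows. This is one possible shape for the threshold sweep, not the expected solution: the function name `roc_points`, the default `step`, and the toy data are assumptions for illustration. In the lab, `y_prob` would be the predicted probability of class 1 from your fitted model.

```python
import numpy as np

def roc_points(y_true, y_prob, step=0.01):
    """Sweep thresholds from 0 to 1 and return
    (thresholds, sensitivity, specificity) as arrays."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_steps = int(round(1 / step))
    thresholds = np.linspace(0.0, 1.0, n_steps + 1)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    sens, spec = [], []
    for t in thresholds:
        # Predict the positive class when the probability reaches the threshold.
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        sens.append(tp / n_pos)
        spec.append(tn / n_neg)
    return thresholds, np.array(sens), np.array(spec)

# Toy data, invented for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
thresholds, sens, spec = roc_points(y_true, y_prob, step=0.25)
```

To draw the curve, plot `1 - spec` on the x-axis against `sens` on the y-axis, for example with `plt.plot(1 - spec, sens)`.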

<a id='sens-spec'></a>

## Either build out your function from above or create a new function that uses the above function to take in the string "sensitivity" or "specificity" and a value of sensitivity/specificity, and returns the sensitivity, specificity, and threshold that generates the input value of sensitivity/specificity.

<a id='sens-spec-example'></a>
### For example, function("sensitivity", 0.95) might return "Sensitivity: 95%, Specificity: 90%, Threshold: 50%."
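One way such a function might look is sketched below. The names `rates_at` and `threshold_for` are hypothetical, and unlike the two-argument example above, this version also takes the labels and predicted probabilities as arguments so that it stays self-contained. It returns the threshold whose metric lands closest to the requested value, since an exact match may not exist on the threshold grid.

```python
import numpy as np

def rates_at(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) at a single threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_prob) >= threshold
    sens = np.mean(y_pred[y_true == 1])    # true positives / actual positives
    spec = np.mean(~y_pred[y_true == 0])   # true negatives / actual negatives
    return float(sens), float(spec)

def threshold_for(metric, target, y_true, y_prob, step=0.01):
    """Return (sensitivity, specificity, threshold) for the threshold
    whose `metric` is closest to `target`; ties keep the lowest threshold."""
    if metric not in ("sensitivity", "specificity"):
        raise ValueError("metric must be 'sensitivity' or 'specificity'")
    idx = 0 if metric == "sensitivity" else 1
    thresholds = np.linspace(0.0, 1.0, int(round(1 / step)) + 1)
    best = None
    for t in thresholds:
        rates = rates_at(y_true, y_prob, t)
        gap = abs(rates[idx] - target)
        if best is None or gap < best[0]:
            best = (gap, rates[0], rates[1], float(t))
    return best[1:]

# Toy data, invented for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
result_spec = threshold_for("specificity", 1.0, y_true, y_prob, step=0.25)
result_sens = threshold_for("sensitivity", 1.0, y_true, y_prob, step=0.25)
```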

<a id='very-unbalanced'></a>

## Note that the defaultcc data has approximately 20% of observations in class 1 and 80% in class 0. Set a seed of 48 and artificially drop 70% of the values marked 1. This will ensure we have very unbalanced classes.
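A minimal sketch of the dropping step, using NumPy's random generator on row positions. The helper name `drop_fraction_of_class` is hypothetical, and using `np.random.default_rng(48)` is only one of several ways to "set a seed of 48"; the toy label vector stands in for the real target column.

```python
import numpy as np

def drop_fraction_of_class(y, label=1, frac=0.7, seed=48):
    """Return the sorted row positions to keep after randomly
    dropping `frac` of the rows whose value equals `label`."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    label_idx = np.flatnonzero(y == label)
    n_drop = int(round(frac * label_idx.size))
    dropped = rng.choice(label_idx, size=n_drop, replace=False)
    return np.setdiff1d(np.arange(y.size), dropped)

# Toy target with 20% ones, invented for illustration only.
y = np.array([1] * 20 + [0] * 80)
keep = drop_fraction_of_class(y)
y_small = y[keep]
```

With a DataFrame, the same positions could be applied as `df.iloc[keep]`. A pandas-only alternative is to draw the rows to drop with `DataFrame.sample(frac=0.7, random_state=48)` on the class-1 subset and pass their index to `df.drop`.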

<a id='fp-vs-fn'></a>

## Which is worse in our particular use-case - false positives or false negatives? Why? (If you feel there's not one clear answer, defend your conclusion.)

<a id='second-logit'></a>

## Build the same logistic regression model based on the unbalanced classes. Generate a confusion matrix based on this new model. What do you notice?

<a id='compare-logits'></a>

## Using your function, plot the ROC of both models. How do they compare? Summarize your results.

<a id='two-methods'></a>

## Try two methods of accounting for unbalanced classes. For each, generate a confusion matrix and briefly summarize your results.

1. Undersample the 0s until 50% of your observations have a "positive outcome" and 50% of your observations have a "negative outcome."
2. Change your threshold for classifying observations as positive or negative to 10%.
3. Repeat step 2 with a threshold of 90%, and compare the results with the 10% threshold.

### Method 1: Undersampling
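As a sketch of undersampling, the snippet below keeps every minority row and draws an equally sized random subset of the majority rows without replacement. The helper name `undersample_majority`, the reuse of seed 48, and the toy labels are assumptions for illustration.

```python
import numpy as np

def undersample_majority(y, majority=0, seed=48):
    """Return sorted row positions giving a 50/50 class split by
    randomly discarding rows from the majority class."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    # Keep only as many majority rows as there are minority rows.
    kept_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    return np.sort(np.concatenate([kept_maj, min_idx]))

# Toy target with 6% ones, invented for illustration only.
y = np.array([0] * 94 + [1] * 6)
balanced_idx = undersample_majority(y)
y_balanced = y[balanced_idx]
```

The trade-off is that most of the majority-class rows are thrown away, so the refit model sees far less data.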

### Method 2: Changing Threshold
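Changing the threshold means classifying from predicted probabilities rather than from the default 0.5 cutoff. A minimal sketch, with a hypothetical helper `classify_at` and invented probabilities:

```python
import numpy as np

def classify_at(y_prob, threshold):
    """Label an observation 1 when its predicted probability
    of the positive class is at least `threshold`."""
    return (np.asarray(y_prob) >= threshold).astype(int)

# Invented probabilities; in the lab these would come from the fitted model.
y_prob = np.array([0.05, 0.12, 0.30, 0.55, 0.93])
pred_10 = classify_at(y_prob, 0.10)  # low bar: most rows become positive
pred_90 = classify_at(y_prob, 0.90)  # high bar: few rows become positive
```

For a fitted scikit-learn `LogisticRegression` with 0/1 labels, the positive-class probabilities are in `model.predict_proba(X)[:, 1]`, where `model` and `X` stand for whatever names you used.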

<a id='bonus-1'></a>

## BONUS: Build out your function to approximate the area under the ROC curve. I recommend using [step size] * [height of the curve in the middle of that step size] to create rectangles and then summing the areas of those rectangles.
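The rectangle idea can be sketched as below, under one assumption worth stating: the height "in the middle" of each step is taken as the average of the curve's height at the two ends of the step, which amounts to connecting the ROC points with straight lines. The function name `approx_auc` and the example points are invented for illustration.

```python
import numpy as np

def approx_auc(fpr, tpr):
    """Approximate the area under an ROC curve given its points.

    fpr is 1 - specificity (x-axis); tpr is sensitivity (y-axis).
    """
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # Sort by fpr, breaking ties by tpr, so the curve runs left to right.
    order = np.lexsort((tpr, fpr))
    fpr, tpr = fpr[order], tpr[order]
    widths = np.diff(fpr)                    # step size along the x-axis
    mid_heights = (tpr[:-1] + tpr[1:]) / 2   # height at the middle of each step
    return float(np.sum(widths * mid_heights))

# Example points listed from threshold 0 up to threshold 1.
fpr = [1.0, 0.5, 0.5, 0.0, 0.0]
tpr = [1.0, 1.0, 0.5, 0.5, 0.0]
area = approx_auc(fpr, tpr)
```

A useful sanity check is that a perfect classifier should give an area of 1 and the diagonal "random guess" line should give 0.5.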

<a id='bonus-2'></a>

## BONUS: Try accounting for unbalanced classes through oversampling until you get results that are 50% positive and 50% negative. Generate a confusion matrix and briefly summarize your results.
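A rough sketch of oversampling: keep every row and add randomly repeated minority rows (sampling with replacement) until both classes have the same count. The helper name `oversample_minority`, the seed, and the toy labels are assumptions for illustration.

```python
import numpy as np

def oversample_minority(y, minority=1, seed=48):
    """Return row positions giving a 50/50 class split by
    repeating randomly chosen rows from the minority class."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    n_extra = maj_idx.size - min_idx.size
    # Draw the extra minority rows with replacement, so rows repeat.
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return np.concatenate([maj_idx, min_idx, extra])

# Toy target with 6% ones, invented for illustration only.
y = np.array([0] * 94 + [1] * 6)
over_idx = oversample_minority(y)
y_over = y[over_idx]
```

If you hold out a test set, it is generally safer to oversample only the training rows, since duplicated rows that appear in both sets would make the evaluation look better than it is.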