<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/model_selection_auc_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ROC / AUC


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

print(f'numpy: {np.__version__}')
print(f'pandas: {pd.__version__}')

numpy: 1.17.5
pandas: 0.25.3


## ROC/AUC for Binary Classification

We will analyze employee retention data. Our goal is to identify the employees who are likely to leave in the future.

In [0]:
hr_retention_url = 'https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/data/model-selection/hr_retention.csv'
df = pd.read_csv(hr_retention_url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 7 columns):
s          12000 non-null float64
lpr        12000 non-null float64
np         12000 non-null int64
anh        12000 non-null int64
tic        12000 non-null int64
newborn    12000 non-null int64
left       12000 non-null int64
dtypes: float64(2), int64(5)
memory usage: 656.4 KB


Variable definitions:

- **s**: The satisfaction level on a scale of 0 to 1
- **lpr**: Last project evaluation by a client on a scale of 0 to 1
- **np**: The number of projects the employee worked on in the last 12 months
- **anh**: Average number of hours worked by the employee in the last 12 months
- **tic**: Time the employee has spent at the company, measured in years
- **newborn**: 1 if the employee had a newborn within the last 12 months, 0 otherwise
- **left**: 1 if the employee left the company, 0 if they are still working here. This is our response variable

In [0]:
df.head()

,s,lpr,np,anh,tic,newborn,left
0,0.38,0.53,2,157,3,0,1
1,0.8,0.86,5,262,6,0,1
2,0.11,0.88,7,272,4,0,1
3,0.72,0.87,5,223,5,0,1
4,0.37,0.52,2,159,3,0,1


In [0]:
X = df.drop(['left'], axis=1)
y = df['left']
print('labels distribution:', np.bincount(y) / y.size)

test_size = 0.2
random_state = 43
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, 
                                                    random_state=random_state, 
                                                    stratify=y)

labels distribution: [0.83333333 0.16666667]


### Preprocessing

We will perform generic preprocessing tasks, including:

- Standardize numeric columns (zero mean, unit variance)
- One-hot-encode categorical columns. The ***newborn*** variable is treated as a categorical variable.


In [0]:
column_trans = make_column_transformer(
    (StandardScaler(), ['s', 'lpr', 'np', 'anh', 'tic']),
    (OneHotEncoder(), ['newborn']),
    remainder='passthrough'
)

## Model Building

In [0]:
rf = RandomForestClassifier(max_depth = 4, random_state=43)
pipe = make_pipeline(column_trans, rf)

cross_val_score(pipe, X, y, cv=3, scoring='accuracy').mean()

0.9473333333333334

After training our model, we need to evaluate whether it's any good. The most straightforward and intuitive metric for a supervised classifier's performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that affects only 1 in a million people, a completely bogus screening test that always reports "negative" will be 99.9999% accurate. Our own labels are imbalanced too: predicting that nobody leaves would already score about 83% accuracy. ROC curves are less sensitive to class imbalance than accuracy: the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
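To make the screening-test argument concrete, here is a minimal sketch on made-up data (10 positives out of 10,000, not the HR dataset). The helper `auc_from_scores` is a hypothetical name introduced for illustration; it computes AUC as the probability that a random positive is scored above a random negative, with ties counted as half.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win, which is why a constant score lands on 0.5.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Hypothetical screening data: 10 sick patients out of 10,000.
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# The "bogus" test gives everyone the same score, so it always says negative.
bogus_scores = np.zeros(y_true.size)
bogus_pred = (bogus_scores >= 0.5).astype(int)

accuracy = (bogus_pred == y_true).mean()
auc = auc_from_scores(y_true, bogus_scores)
print(f'accuracy: {accuracy:.4f}')
print(f'AUC: {auc:.2f}')
```

The accuracy looks excellent only because negatives dominate the data, while the AUC shows that the scores carry no ranking information at all.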


---



The **ROC curve (Receiver Operating Characteristic)** is a commonly used way to visualize the performance of a binary classifier, and AUC (Area Under the ROC Curve) summarizes that performance in a single number. Most machine learning algorithms can produce probability scores that tell us how strongly the model believes a given observation is positive. Turning these scores into yes or no predictions requires setting a threshold: cases with scores above the threshold are classified as positive, and the rest as negative. Different threshold values can lead to different results:

- A higher threshold is more conservative about labeling a case as positive. This makes it less likely to produce false positives (observations that have a negative label but get classified as positive by the model), but more likely to miss cases that are in fact positive (lower true positive rate)
- A lower threshold produces positive labels more liberally, so it creates more false positives but also generates more true positives
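The trade-off described in the two bullets can be sketched on a handful of invented scores. The labels, scores, and the helper name `rates_at` are assumptions made for illustration only; scores at or above the threshold are treated as positive.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for 8 observations.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.20, 0.35, 0.55, 0.45, 0.65, 0.70, 0.90])

def rates_at(threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

for threshold in (0.3, 0.5, 0.8):
    tpr, fpr = rates_at(threshold)
    print(f'threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}')
```

Raising the threshold from 0.3 to 0.8 removes every false positive in this toy example, but it also misses two of the three true positives.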

A quick refresher on terminology:

\begin{align}
[\text{true positive rate}]
&= \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} \\
&= \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false negatives}]}
\end{align}

The true positive rate is also known as **recall** or **sensitivity**.

\begin{align}
[\text{false positive rate}]
&= \frac{[\text{# negative data points with positive predictions}]}{[\text{# all negative data points}]} \\
&= \frac{[\text{# false positives}]}{[\text{# false positives}] + [\text{# true negatives}]}
\end{align}
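As a quick arithmetic check of the two definitions, the sketch below plugs invented confusion-matrix counts into them. The function name `rates_from_counts` and the counts themselves are hypothetical.

```python
def rates_from_counts(tp, fn, fp, tn):
    """Compute (true positive rate, false positive rate) from raw counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return tpr, fpr

# Hypothetical counts: 100 actual positives and 200 actual negatives.
tpr, fpr = rates_from_counts(tp=80, fn=20, fp=30, tn=170)
print(f'TPR={tpr:.2f}  FPR={fpr:.2f}')
```

Note that each rate is normalized by its own class size, which is what makes both measures insensitive to how many positives and negatives the dataset contains.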

The ROC curve is created by plotting the true positive rate (when it's actually a yes, how often does it predict yes?) on the y axis against the false positive rate (when it's actually a no, how often does it predict yes?) on the x axis at various cutoff settings, giving us a picture of the whole spectrum of the trade-off we're making between the two measures.
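The construction described above can be sketched by sweeping every distinct score as a cutoff and recording the resulting rates. This is a plain NumPy illustration on invented data, with hypothetical helper names `roc_points` and `area_under`; it is not the notebook's own implementation.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep each distinct score as a cutoff and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Start above the highest score so the curve begins at (0, 0).
    cutoffs = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr, tpr = [], []
    for cutoff in cutoffs:
        y_pred = scores >= cutoff
        tpr.append(np.sum(y_pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(y_pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def area_under(fpr, tpr):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Hypothetical labels and predicted probabilities for 8 observations.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.20, 0.35, 0.55, 0.45, 0.65, 0.70, 0.90])

fpr, tpr = roc_points(y_true, scores)
roc_auc = area_under(fpr, tpr)
print(f'AUC: {roc_auc:.4f}')
```

In practice, `roc_curve` and `roc_auc_score` from `sklearn.metrics` compute the same quantities, and `plt.plot(fpr, tpr)` draws the curve with the false positive rate on the x axis.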

If all this true/false positive terminology is confusing, consider reading the material at the following link: [Blog: Simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)