<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Classification Metrics I

_Authors: Matt Brems (DC), Riley Dallas (AUS)_

---

## Importing libraries
---

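We'll need the following libraries for today's lecture:
1. `pandas`
2. `KNeighborsClassifier` from `sklearn`'s `neighbors` module
3. The `load_breast_cancer` function from `sklearn`'s `datasets` module
4. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
5. `StandardScaler` from `sklearn`'s `preprocessing` module
6. The `confusion_matrix` function from `sklearn`'s `metrics` module

The cell below imports each `sklearn` module under a short alias (for example, `skms` for `model_selection`), so the functions are called as `skms.train_test_split(...)` and so on. It also imports `numpy`, `seaborn`, and `matplotlib`.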

In [9]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn.datasets as skds
import sklearn.preprocessing as skpp
import sklearn.model_selection as skms
import sklearn.neighbors as sknb
import sklearn.metrics as skm

%matplotlib inline

## Create dataset
---

Similar to `load_iris` from this morning, we'll call the `load_breast_cancer()` function to create our dataset.

In [25]:
data = skds.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.DataFrame(data.target, columns=['cancer_type'])

## Create `X` and `y`
---

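The dataset labels benign tumors as 1 and malignant tumors as 0. This is the reverse of the usual convention, where the class you most care about detecting (here, malignant) is labeled 1. The order of `data.target_names` matches the labels: index 0 is malignant and index 1 is benign.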

In [26]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

In [30]:
X = df
y = target['cancer_type']
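
If you would rather have malignant as the positive class, one option is to flip the 0/1 encoding before modeling. The snippet below is only a sketch of that idea and is not applied in the rest of this notebook, which keeps the dataset's original labels. The name `is_malignant` is a hypothetical label chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Original encoding: 0 = malignant, 1 = benign
y_original = pd.Series(data.target, name="cancer_type")

# Flipped encoding: 1 = malignant, 0 = benign ("is_malignant" is a made-up name)
y_flipped = (1 - y_original).rename("is_malignant")

print(y_original.value_counts().to_dict())
print(y_flipped.value_counts().to_dict())
```

Flipping the labels changes which cells of a confusion matrix count as true positives and false negatives, so any metric computed afterwards has to use the same encoding.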

## Train/Test Split
---

In the cell below, split `X` and `y` into training and testing sets.

**Note:** We'll want a stratified split, so the training and testing sets keep roughly the same class proportions as the full dataset.

In [31]:
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, random_state=42, stratify=target)

In [32]:
y_train.value_counts()

1    267
0    159
Name: cancer_type, dtype: int64
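
To see what stratification buys us, the class proportions of the full data, the training set, and the testing set can be compared side by side. This is a minimal, self-contained sketch that reloads the data and repeats the split with the same `random_state`; the `proportions` table is a hypothetical helper, not part of the lesson.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="cancer_type")

# Same split settings as the lesson, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Share of each class in the full data and in each split
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(proportions.round(3))
```

With `stratify` set, the three columns should be nearly identical; without it, the split proportions can drift from the full data by chance.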

## Scaling our features
---

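Because KNN classifies points by distance, features measured on larger scales would dominate the distance calculation, so we'll want to scale our training and testing sets. We fit the scaler on the training set only and then apply that same transformation to the testing set.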

In [33]:
ss = skpp.StandardScaler()
X_train = pd.DataFrame(ss.fit_transform(X_train), columns=X_train.columns)

In [35]:
X_train.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.659096 | 0.217205 | 1.61062 | 1.633339 | 0.576312 | 0.523545 | 0.645326 | 1.198745 | -9.4e-05 | -0.124425 | ... | 1.567319 | -0.075879 | 1.607223 | 1.384969 | 0.412628 | 0.461629 | 0.642584 | 0.701835 | -0.556084 | 0.388781 |
| 1 | -0.338165 | -1.389968 | -0.401667 | -0.387017 | -1.985604 | -1.257886 | -0.8205 | -0.949158 | -1.684127 | -0.96426 | ... | -0.53772 | -1.613244 | -0.580788 | -0.52916 | -1.6004 | -0.871596 | -0.726165 | -0.900606 | -0.923646 | -0.797233 |
| 2 | 0.874457 | -0.651659 | 1.01037 | 0.761353 | 1.694102 | 2.359914 | 1.657179 | 2.389453 | 4.483419 | 1.570465 | ... | 1.259163 | -0.683527 | 1.364776 | 1.053712 | 0.978433 | 0.856293 | 0.491059 | 2.096751 | 1.767211 | 1.165217 |
| 3 | 0.920109 | -0.498594 | 0.88618 | 0.806211 | 0.358755 | 0.012174 | 0.465964 | 0.918425 | 0.039744 | -0.919986 | ... | 0.75945 | -0.09809 | 0.721243 | 0.625763 | 0.408208 | -0.095834 | 0.274268 | 1.065079 | 0.345973 | -0.157501 |
| 4 | 2.263981 | 0.58636 | 2.301943 | 2.408951 | 0.771362 | 1.747791 | 1.928079 | 2.64949 | 0.079581 | -0.190837 | ... | 2.385598 | 0.014555 | 2.639868 | 2.425295 | -0.131075 | 0.816827 | 0.90319 | 1.921083 | -0.262035 | 0.088673 |


In [36]:
X_test = pd.DataFrame(ss.transform(X_test), columns=X_test.columns)

In [37]:
X_test.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.378111 | -0.58413 | -0.376001 | -0.450976 | 1.236483 | 0.156651 | -0.620108 | -0.474851 | 1.151568 | 0.477434 | ... | -0.364903 | -0.629584 | -0.394313 | -0.444823 | 0.669009 | -0.358532 | -0.617998 | -0.501782 | 0.243362 | 0.079595 |
| 1 | 1.116982 | 0.307243 | 1.084884 | 0.989985 | 0.56881 | 0.512138 | 0.38556 | 1.040903 | 0.688006 | -0.289072 | ... | 1.015553 | -0.047321 | 0.936752 | 0.853167 | 0.699951 | 0.724327 | 0.239125 | 1.239257 | 0.226516 | 0.050759 |
| 2 | 0.252453 | -0.043904 | 0.225077 | 0.109028 | -0.457457 | -0.099414 | -0.36529 | -0.000544 | 0.278768 | -0.589309 | ... | 0.009881 | 0.07643 | 0.068731 | -0.112313 | -0.047088 | 0.28958 | -0.444109 | 0.517236 | 0.076428 | 0.034205 |
| 3 | -0.341019 | -0.241987 | -0.295692 | -0.453002 | 1.934164 | 1.190797 | -0.503213 | 0.125313 | -0.3369 | 1.349091 | ... | -0.239975 | -0.212322 | -0.2243 | -0.35422 | 0.585022 | 0.24703 | -0.698781 | -0.067825 | -0.505544 | 0.259019 |
| 4 | 0.149737 | 0.899241 | 0.098403 | 0.043622 | -0.690017 | -0.705264 | -0.062228 | 0.116992 | -0.63387 | -1.192552 | ... | -0.085897 | 0.952204 | -0.137798 | -0.204707 | -0.051509 | -0.603964 | -0.046583 | 0.323704 | -0.666353 | -0.839419 |
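
The fit-on-train, transform-on-test pattern above can also be expressed as a pipeline, which is handy with `cross_val_score` because the scaler is then refit inside each fold instead of seeing the held-out rows. This is a sketch of one possible setup, not what the lesson itself does; `make_pipeline` and the choice of `cv=5` are assumptions added for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Scaling and KNN bundled together; the scaler is fit only on the
# training portion of each cross-validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Default scoring for a classifier is accuracy
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```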


## Instantiating and fitting our model
---

In the cells provided, create and fit an instance of `KNeighborsClassifier`. You can use the default parameters.

In [38]:
knn = sknb.KNeighborsClassifier()

In [40]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

## Predictions
---

Use our newly fitted KNN model to create predictions from the scaled `X_test`.

In [43]:
y_hat = knn.predict(X_test)

## Confusion Matrix
---

We'll create a confusion matrix using the `confusion_matrix` function from `sklearn`'s `metrics` module.

In [44]:
cm = skm.confusion_matrix(y_test, y_hat)

In [45]:
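# Rows are actual classes and columns are predicted classes, both in label order:
# 0 (malignant), then 1 (benign). So cm[0] holds the counts for actual malignant tumors.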
cm

array([[50,  3],
       [ 0, 90]], dtype=int64)
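
Reading the four cells depends on which class you call positive. The sketch below copies the matrix from the output above and names each cell under the assumption that malignant (label 0) is the positive class, which is the clinically interesting choice but not `sklearn`'s default.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class, in label order (0, 1)
cm = np.array([[50, 3],
               [0, 90]])

# Assumption: malignant (label 0) is the positive class
tp = cm[0, 0]  # actual malignant, predicted malignant
fn = cm[0, 1]  # actual malignant, predicted benign
fp = cm[1, 0]  # actual benign, predicted malignant
tn = cm[1, 1]  # actual benign, predicted benign

print(tp, fn, fp, tn)  # 50 3 0 90
```

If benign (label 1) is treated as positive instead, the same matrix unpacks as `tn, fp, fn, tp = cm.ravel()`, and the roles of the cells swap accordingly.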

## Confusion DataFrame
---

The confusion matrix we just created doesn't show which class each row and column refers to, so let's drop it into a pandas `DataFrame` with labels.

In [48]:
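# confusion_matrix orders rows and columns by sorted label value:
# index 0 is malignant (label 0) and index 1 is benign (label 1).
cm_df = pd.DataFrame(cm, columns=['pred_malignant', 'pred_benign'], index=['actual_malignant', 'actual_benign'])

In [49]:
cm_df

| | pred_malignant | pred_benign |
|---|---|---|
| actual_malignant | 50 | 3 |
| actual_benign | 0 | 90 |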
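
Because `confusion_matrix` orders rows and columns by sorted label value, and `data.target_names` is stored in that same label order, the row and column labels can be generated from the names rather than typed by hand. This guards against attaching the labels in the wrong order. The sketch below hard-codes the matrix and the names so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Values copied from the notebook output
cm = np.array([[50, 3],
               [0, 90]])
target_names = ["malignant", "benign"]  # same order as data.target_names

# Build labels in label order so they always line up with the matrix
cm_df = pd.DataFrame(
    cm,
    index=[f"actual_{name}" for name in target_names],
    columns=[f"pred_{name}" for name in target_names],
)
print(cm_df)
```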


## Calculate recall
---

<details>
    <summary>Need a hint?</summary>
    Recall is another name for sensitivity. There are no p's in "sensitivity", so its formula uses the P's: TP / P, where P = TP + FN.
</details>
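
One way to check your answer is `recall_score` from `sklearn.metrics`. The result depends on which class is positive: by default `recall_score` treats label 1 (benign here) as positive, so `pos_label=0` is needed to get recall for malignant tumors. The sketch below rebuilds label arrays that match the confusion matrix counts, so the arrays are a stand-in for `y_test` and `y_hat`, not the real ones.

```python
import numpy as np
from sklearn.metrics import recall_score

# Stand-in arrays consistent with the confusion matrix [[50, 3], [0, 90]]:
# 53 actual malignant (0), of which 50 predicted 0 and 3 predicted 1;
# 90 actual benign (1), all predicted 1.
y_true = np.array([0] * 53 + [1] * 90)
y_pred = np.array([0] * 50 + [1] * 3 + [1] * 90)

recall_malignant = recall_score(y_true, y_pred, pos_label=0)  # 50 / 53
recall_benign = recall_score(y_true, y_pred)                  # default pos_label=1

print(round(recall_malignant, 3))  # 0.943
print(recall_benign)               # 1.0
```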

## How many Type I errors are there?
---

<details>
    <summary>Need a hint?</summary>
    Type I = False positive
</details>

## How many Type II errors are there?
---
<details>
    <summary>Need a hint?</summary>
    Type II = False negative
</details>
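
Both error counts can be tallied directly from the actual and predicted labels. The sketch below assumes malignant (label 0) is the positive class, so a Type I error is a benign tumor predicted malignant and a Type II error is a malignant tumor predicted benign. The arrays are stand-ins rebuilt from the confusion matrix counts, and `POSITIVE` is a hypothetical constant.

```python
import numpy as np

# Stand-in arrays consistent with the confusion matrix [[50, 3], [0, 90]]
y_true = np.array([0] * 53 + [1] * 90)
y_pred = np.array([0] * 50 + [1] * 3 + [1] * 90)

POSITIVE = 0  # assumption: malignant is the positive class

# Type I: predicted positive, actually negative (false positive)
type_1 = int(np.sum((y_pred == POSITIVE) & (y_true != POSITIVE)))

# Type II: predicted negative, actually positive (false negative)
type_2 = int(np.sum((y_pred != POSITIVE) & (y_true == POSITIVE)))

print(type_1, type_2)  # 0 3
```

Setting `POSITIVE = 1` swaps the two counts, which is why the positive class has to be stated before the errors are named.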

## Which error is worse (Type I vs Type II)?
---

## Calculate the sensitivity
---

<details>
    <summary>Need a hint?</summary>
    There are no p's in "sensitivity", so the calculation uses the P's: TP / P, where P = TP + FN.
</details>

## Calculate the specificity
---

<details>
    <summary>Need a hint?</summary>
    There is a p in "specificity", so there are no P's in the calculation: TN / N, where N = TN + FP.
</details>
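
Specificity can be computed from the confusion matrix once the positive class is fixed. The helper below is a hypothetical function written for this illustration; it takes the row/column index of the positive class and returns TN / (TN + FP) for a 2x2 matrix.

```python
import numpy as np

def specificity(cm, positive_index):
    """Return TN / (TN + FP) for a 2x2 confusion matrix.

    Rows are actual classes, columns are predicted classes, and
    positive_index (0 or 1) says which class counts as positive.
    """
    negative_index = 1 - positive_index
    tn = cm[negative_index, negative_index]  # actual negative, predicted negative
    fp = cm[negative_index, positive_index]  # actual negative, predicted positive
    return tn / (tn + fp)

# Confusion matrix copied from the notebook output
cm = np.array([[50, 3],
               [0, 90]])

print(specificity(cm, positive_index=0))            # malignant positive: 1.0
print(round(specificity(cm, positive_index=1), 3))  # benign positive: 0.943
```

Note that specificity with one class as positive equals sensitivity with the other class as positive, since the two classes simply trade roles.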