<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Classification Metrics I

_Authors: Matt Brems (DC), Riley Dallas (AUS)_

---

## Importing libraries
---

We'll need the following libraries for today's lecture:
1. `pandas`
2. `KNeighborsClassifier` from `sklearn`'s `neighbors` module
3. The `load_breast_cancer` function from `sklearn`'s `datasets` module
4. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
5. `StandardScaler` from `sklearn`'s `preprocessing` module
6. The `confusion_matrix` function from `sklearn`'s `metrics` module

In [2]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

%matplotlib inline



## Create dataset
---

Similar to `load_iris` from this morning, we'll call the `load_breast_cancer()` function to create our dataset.

In [3]:
cancer = load_breast_cancer()

In [5]:
cancer.data.shape

(569, 30)

In [6]:
df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')

In [7]:
df['cancer_type'] = cancer.target

In [8]:
df.head(2)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | cancer_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |


## Create `X` and `y`
---

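The dataset labels malignant tumors as 0 and benign tumors as 1. This is contrary to the usual convention, where the class you care most about detecting (here, malignant) is labeled 1. We keep the original labels below, so keep track of which class is being treated as "positive" when reading the metrics.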

In [9]:
target = 'cancer_type'
X = df[[k for k in df.columns if k!=target]]
y = df[target]
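If you would rather follow the usual convention, one option is to flip the labels so that malignant becomes 1. The sketch below is illustrative only: `y_malignant` and the column name `is_malignant` are hypothetical, and the rest of this notebook keeps the dataset's original labels.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
y = pd.Series(cancer.target, name='cancer_type')

# Original encoding: 0 = malignant, 1 = benign.
# Subtracting from 1 flips it: 1 = malignant, 0 = benign.
y_malignant = (1 - y).rename('is_malignant')

print(y.value_counts().to_dict())
print(y_malignant.value_counts().to_dict())
```

If you do flip the labels, every downstream metric (recall, specificity, and so on) is then reported with malignant as the positive class.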

## Train/Test Split
---

In the cell below, split `X` and `y` into training and testing sets.

**Note:** we'll want a stratified split, so that the training and testing sets keep roughly the same proportion of benign and malignant tumors.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=df[target])

In [18]:
y_train.value_counts()

1    267
0    159
Name: cancer_type, dtype: int64
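To see what stratifying does, you can compare the class proportions in the full data, the training set, and the testing set. This is a minimal, self-contained sketch that rebuilds the data from scratch; `proportions` is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target, name='cancer_type')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Share of each class in the full data, the training set, and the testing set.
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_train.value_counts(normalize=True),
    'test': y_test.value_counts(normalize=True),
})
print(proportions.round(3))
```

With `stratify=y`, the three columns should be nearly identical; without it, the split is purely random and the proportions can drift.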

## Scaling our features
---

Because KNN is a distance-based model, features measured on larger scales would dominate the distance calculation, so we'll want to scale our training and testing sets.

In [19]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)

In [21]:
X_test_ss = ss.transform(X_test)
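Note that the scaler is fit on the training set only and then reused on the testing set. The sketch below, which rebuilds the split from scratch, checks the effect: the training columns end up with mean 0 and standard deviation 1, while the testing columns are transformed with the training statistics and so are generally not exactly standardized.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)  # learn means/stds from the training set only
X_test_ss = ss.transform(X_test)        # reuse those training statistics

# Training columns are standardized (up to floating point error).
print(np.allclose(X_train_ss.mean(axis=0), 0))
print(np.allclose(X_train_ss.std(axis=0), 1))

# Largest absolute column mean in the transformed testing set.
print(np.abs(X_test_ss.mean(axis=0)).max())
```

Fitting the scaler on the testing data as well would leak information about the test set into the model.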

## Instantiate and fit our model
---

In the cells provided, create and fit an instance of `KNeighborsClassifier`. You can use the default parameters.

In [22]:
knn = KNeighborsClassifier()

In [23]:
knn.fit(X_train_ss, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
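We imported `cross_val_score` earlier but have not used it yet. One way it might be used here is to estimate accuracy on the training data before touching the testing set. This is a sketch under assumptions: the `make_pipeline` helper from `sklearn.pipeline` and the names `pipe` and `scores` are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Putting the scaler and the model in one pipeline means the scaler is
# re-fit on the training portion of each fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# For classifiers, the default score is accuracy.
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```

Passing the unscaled `X_train` to the pipeline, rather than `X_train_ss`, keeps each validation fold out of the scaler's fit.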

## Predictions
---

Use our newly fitted KNN model to create predictions from `X_test_ss`.

In [24]:
y_hat_ss = knn.predict(X_test_ss)

## Confusion Matrix
---

We'll create a confusion matrix using the `confusion_matrix` function from `sklearn`'s `metrics` module.

In [26]:
help(confusion_matrix)

Help on function confusion_matrix in module sklearn.metrics.classification:

confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
    Compute confusion matrix to evaluate the accuracy of a classification
    
    By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}`
    is equal to the number of observations known to be in group :math:`i` but
    predicted to be in group :math:`j`.
    
    Thus in binary classification, the count of true negatives is
    :math:`C_{0,0}`, false negatives is :math:`C_{1,0}`, true positives is
    :math:`C_{1,1}` and false positives is :math:`C_{0,1}`.
    
    Read more in the :ref:`User Guide <confusion_matrix>`.
    
    Parameters
    ----------
    y_true : array, shape = [n_samples]
        Ground truth (correct) target values.
    
    y_pred : array, shape = [n_samples]
        Estimated targets as returned by a classifier.
    
    labels : array, shape = [n_classes], optional
        List of labels to index the m

In [27]:
y_hat_ss[:5]

array([1, 0, 1, 1, 0])

In [29]:
cm = confusion_matrix(y_test, y_hat_ss)
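As the help text says, rows are actual labels and columns are predicted labels, both ordered by label value. For a binary problem, a common idiom is to unpack the four counts with `.ravel()`. The labels below are toy values for illustration only, not the breast cancer data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

cm_toy = confusion_matrix(y_true, y_pred)
print(cm_toy)

# With labels 0 and 1, ravel() returns the counts in the order
# tn, fp, fn, tp, where label 1 is treated as the positive class.
tn, fp, fn, tp = cm_toy.ravel()
print(tn, fp, fn, tp)
```

In this notebook label 1 is benign, so under this default reading a "positive" prediction means the model predicted benign.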

## Confusion DataFrame
---

The confusion matrix we just created is an unlabeled array, which makes it hard to read, so let's drop it into a pandas `DataFrame` with labeled rows and columns.

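In [30]:
# confusion_matrix orders rows and columns by label value: 0 (malignant) first, then 1 (benign)
cm_df = pd.DataFrame(cm, columns = ['pred_malignant', 'pred_benign'], index=['actual_malignant', 'actual_benign'])

In [31]:
cm_df

| | pred_malignant | pred_benign |
|---|---|---|
| actual_malignant | 50 | 3 |
| actual_benign | 0 | 90 |


## Calculate recall
---

<details>
    <summary>Need a hint?</summary>
    Recall = Sensitivity. There are no p's in the word "sensitivity", so the formula uses p's: TP / P.
</details>

Because we kept the dataset's original labels, label 1 (benign) is the positive class here, so the actual positives are in the second row of `cm_df`.

In [38]:
cm_df.iloc[1,:]

pred_malignant     0
pred_benign       90
Name: actual_benign, dtype: int64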

In [37]:
sum(cm_df.iloc[1,:])

90

In [36]:
cm_df.iloc[1,1]

90

In [39]:
sensitivity = cm_df.iloc[1,1]/(sum(cm_df.iloc[1,:]))

In [40]:
sensitivity

1.0

In [41]:
from sklearn.metrics import classification_report
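The `classification_report` import is not used in the cells of this notebook. One way it might be used, together with `recall_score`, is sketched below. To keep the example self-contained, it rebuilds label vectors from the confusion matrix counts shown earlier (50, 3, 0, 90) rather than reusing `y_test` and `y_hat_ss`; in the notebook itself you would pass those two variables instead.

```python
from sklearn.metrics import classification_report, recall_score

# Rebuild label vectors that reproduce the confusion matrix counts:
# 53 actual 0s (50 predicted 0, 3 predicted 1) and 90 actual 1s (all predicted 1).
y_true = [0] * 53 + [1] * 90
y_pred = [0] * 50 + [1] * 3 + [1] * 90

# Default: label 1 (benign) is the positive class.
recall_benign = recall_score(y_true, y_pred)
# pos_label=0 treats label 0 (malignant) as the positive class instead.
recall_malignant = recall_score(y_true, y_pred, pos_label=0)
print(recall_benign, round(recall_malignant, 3))

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_pred, target_names=['malignant', 'benign']))
```

The recall of 1.0 computed by hand above corresponds to `recall_benign`; the three malignant tumors predicted as benign only show up once malignant is treated as the positive class.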

## How many Type I errors are there?
---

<details>
    <summary>Need a hint?</summary>
    Type I = False positive
</details>

## How many Type II errors are there?
---
<details>
    <summary>Need a hint?</summary>
    Type II = False negative
</details>

## Which error is worse (Type I vs Type II)?
---

## Calculate the sensitivity
---

<details>
    <summary>Need a hint?</summary>
    There are no p's in the word "sensitivity", so the formula uses p's: TP / P
</details>

## Calculate the specificity
---

<details>
    <summary>Need a hint?</summary>
    There is a p in the word "specificity", so the formula uses no p's: TN / N
</details>
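Both formulas from the hints can be computed directly from the four cells of the confusion matrix. The sketch below hard-codes the counts from the matrix computed earlier in this notebook; the helper `sensitivity_specificity` and its `positive` argument are hypothetical names introduced for illustration.

```python
import numpy as np

# Counts from the confusion matrix computed earlier in this notebook
# (rows = actual label 0, 1; columns = predicted label 0, 1).
cm = np.array([[50, 3],
               [0, 90]])

def sensitivity_specificity(cm, positive=1):
    """Return (sensitivity, specificity) for a 2x2 confusion matrix.

    `positive` is the row/column index of the class treated as positive.
    """
    negative = 1 - positive
    tp = cm[positive, positive]
    fn = cm[positive, negative]
    tn = cm[negative, negative]
    fp = cm[negative, positive]
    # Sensitivity = TP / P, specificity = TN / N.
    return tp / (tp + fn), tn / (tn + fp)

# sklearn's default view: label 1 (benign) is the positive class.
print(sensitivity_specificity(cm, positive=1))
# Alternative view: label 0 (malignant) is the positive class.
print(sensitivity_specificity(cm, positive=0))
```

The two numbers swap roles when the positive class changes, which is why it matters to state which class is "positive" before reporting either metric.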