| Predicted (rows) / Actual (columns) | True | False |
|---|---|---|
| True | Correct! | **Type 1 Error** |
| False | **Type 2 Error** | Correct! |

- **True Positive**: A positive class observation (1) is correctly classified as positive by the model.
- **False Positive**: A negative class observation (0) is incorrectly classified as positive (a Type 1 error).
- **True Negative**: A negative class observation (0) is correctly classified as negative.
- **False Negative**: A positive class observation (1) is incorrectly classified as negative (a Type 2 error).
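
The four outcomes above can be counted directly from paired lists of actual and predicted labels. The sketch below is a minimal pure-Python illustration; the function name `count_outcomes` and the toy labels are made up for this example and are not part of the lesson's dataset.

```python
def count_outcomes(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, truly positive
        elif p == positive:
            fp += 1          # predicted positive, truly negative (Type 1 error)
        elif a == positive:
            fn += 1          # predicted negative, truly positive (Type 2 error)
        else:
            tn += 1          # predicted negative, truly negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Toy labels, invented for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = count_outcomes(actual, predicted)
print(counts)
```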

Load logistic regression, the train/test split function, NumPy, and pandas.

In [195]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import numpy as np
import pandas as pd

Return to the Wisconsin breast cancer data. Clean it up as we did before.

In [196]:
column_names = ['id',
                'clump_thickness',
                'cell_size_uniformity',
                'cell_shape_uniformity',
                'marginal_adhesion',
                'single_epithelial_size',
                'bare_nuclei',
                'bland_chromatin',
                'normal_nucleoli',
                'mitoses',
                'class']

bcw = pd.read_csv('../assets/datasets/breast-cancer-wisconsin.csv',
                 names=column_names, na_values=['?'])

bcw.head(10)

Unnamed: 0,id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2
5,1017122,8,10,10,8,7,10.0,9,7,1,4
6,1018099,1,1,1,1,2,10.0,3,1,1,2
7,1018561,2,1,2,1,2,1.0,3,1,1,2
8,1033078,2,1,1,1,2,1.0,1,1,5,2
9,1033078,4,2,1,1,2,1.0,2,1,1,2


In [197]:
bcw.dropna(inplace=True)
print(bcw.shape)
bcw.head(8)

(683, 11)


Unnamed: 0,id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2
5,1017122,8,10,10,8,7,10.0,9,7,1,4
6,1018099,1,1,1,1,2,10.0,3,1,1,2
7,1018561,2,1,2,1,2,1.0,3,1,1,2


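Create a percentage score across the predictor columns for simplicity in this lesson. Each of the nine predictors is scored from 1 to 10, so dividing the row sum by 90 gives a value between 0.1 and 1.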

In [198]:
# Let's select everything from our column_names, minus the "class" and "id" columns
subset_mask = list(set(column_names) - set(['class', 'id']))   # difference set operation
subset_mask

['cell_size_uniformity',
 'marginal_adhesion',
 'clump_thickness',
 'cell_shape_uniformity',
 'normal_nucleoli',
 'single_epithelial_size',
 'mitoses',
 'bare_nuclei',
 'bland_chromatin']

In [199]:
bcw[subset_mask].sum(axis=1)/90 # axis=1 sums across the columns, giving one total per row

0      0.177778
1      0.455556
2      0.166667
3      0.455556
4      0.188889
5      0.777778
6      0.233333
7      0.155556
8      0.166667
9      0.166667
10     0.122222
11     0.133333
12     0.311111
13     0.155556
14     0.666667
15     0.400000
16     0.155556
17     0.166667
18     0.566667
19     0.188889
20     0.555556
21     0.600000
22     0.144444
24     0.133333
25     0.366667
26     0.144444
27     0.166667
28     0.133333
29     0.133333
30     0.133333
         ...   
669    0.677778
670    0.588889
671    0.177778
672    0.144444
673    0.200000
674    0.122222
675    0.177778
676    0.133333
677    0.155556
678    0.111111
679    0.122222
680    0.911111
681    0.700000
682    0.188889
683    0.111111
684    0.111111
685    0.111111
686    0.111111
687    0.166667
688    0.144444
689    0.188889
690    0.133333
691    0.533333
692    0.133333
693    0.155556
694    0.155556
695    0.122222
696    0.644444
697    0.511111
698    0.544444
dtype: float64
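
As a quick sanity check, the score for the first row can be reproduced by hand. The sketch below copies the nine predictor values for id 1000025 from the table printed earlier and assumes each predictor tops out at 10, which is where the divisor of 90 comes from.

```python
# Predictor values for the first row (id 1000025), in column order
# from clump_thickness through mitoses, copied from the printed table.
first_row = [5, 1, 1, 1, 2, 1.0, 3, 1, 1]

# Assumption: each of the 9 predictors has a maximum value of 10.
max_total = len(first_row) * 10

score = sum(first_row) / max_total
print(round(score, 6))
```

The result agrees with the first entry of the series above (0.177778).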

In [200]:
bcw['metrics_pct'] = bcw[subset_mask].sum(axis=1)/90.
bcw['class'] = bcw['class'].map(lambda x: 0 if x == 2 else 1) # Here we're shifting 2 & 4 to 0 (healthy) and 1 (cancer)

In [201]:
# Notice our class and new metrics_pct
bcw.head(10)

Unnamed: 0,id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,metrics_pct
0,1000025,5,1,1,1,2,1.0,3,1,1,0,0.177778
1,1002945,5,4,4,5,7,10.0,3,2,1,0,0.455556
2,1015425,3,1,1,1,2,2.0,3,1,1,0,0.166667
3,1016277,6,8,8,1,3,4.0,3,7,1,0,0.455556
4,1017023,4,1,1,3,2,1.0,3,1,1,0,0.188889
5,1017122,8,10,10,8,7,10.0,9,7,1,1,0.777778
6,1018099,1,1,1,1,2,10.0,3,1,1,0,0.233333
7,1018561,2,1,2,1,2,1.0,3,1,1,0,0.155556
8,1033078,2,1,1,1,2,1.0,1,1,5,0,0.166667
9,1033078,4,2,1,1,2,1.0,2,1,1,0,0.166667


In [202]:
print('Patients with cancer:', np.sum(bcw[['class']].values))

Patients with cancer: 239


Split into a 67% training set and a 33% testing set
>```
>X = metrics_pct (predictor)
>Y = class (non-cancer:0 vs cancer:1)
>```

In [203]:
metrics_pct = np.array(bcw.metrics_pct.values)
metrics_pct = metrics_pct[:, np.newaxis]

# stratify keeps the class proportions the same in the training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(metrics_pct, bcw[['class']].values, 
                                                    test_size=0.33, stratify=bcw[['class']].values,
                                                    random_state=77)  
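
To make the idea of stratification concrete, here is a rough pure-Python sketch of a stratified split. It is not scikit-learn's implementation: the helper name `stratified_split` and the per-class rounding rule are assumptions made for illustration. The class counts (239 cancer, 444 healthy) come from the 683 cleaned rows above.

```python
import random

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling each class separately (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Assumption: round each class's share of the test set independently.
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 239 patients with cancer (1) and 444 without (0), as in the cleaned data.
labels = [1] * 239 + [0] * 444
train_idx, test_idx = stratified_split(labels, test_size=0.33)

n_test_pos = sum(labels[i] for i in test_idx)
n_test_neg = len(test_idx) - n_test_pos
print(len(train_idx), len(test_idx), n_test_pos, n_test_neg)
```

Under this rounding rule the test set holds 79 cancer and 147 healthy observations, which matches the class supports shown in the classification report later in this lesson.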

Fit the logistic regression on the training data

In [204]:
logreg = LogisticRegression(random_state=77)
# ravel() flattens the (n, 1) label column into the 1-D array that fit() expects
logreg.fit(X_train, Y_train.ravel())
Y_pred = logreg.predict(X_test)


Look at the confusion matrix

In [205]:
from sklearn.metrics import confusion_matrix

# labels=[1, 0] puts the positive class (cancer) in the first row and column
conmat = np.array(confusion_matrix(Y_test, Y_pred, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['has_cancer', 'is_healthy'],
                         columns=['predicted_cancer','predicted_healthy'])

print(confusion)

            predicted_cancer  predicted_healthy
has_cancer                69                 10
is_healthy                 4                143


Calculate true positives, false positives, true negatives, and false negatives from the confusion matrix

In [206]:
# .loc selects by label: row label first, then column label (.ix was removed in pandas 1.0)
TP = confusion.loc['has_cancer', 'predicted_cancer']
FP = confusion.loc['is_healthy', 'predicted_cancer']
TN = confusion.loc['is_healthy', 'predicted_healthy']
FN = confusion.loc['has_cancer', 'predicted_healthy']

print(list(zip(['True Positives','False Positives','True Negatives','False Negatives'],
               [TP, FP, TN, FN])))

[('True Positives', 69), ('False Positives', 4), ('True Negatives', 143), ('False Negatives', 10)]


## Check

- People with cancer:  ??
- People without cancer: ??
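
One way to work through this check is to add up the confusion matrix by actual class. The sketch below hard-codes the four counts printed above so that it runs on its own.

```python
# Counts taken from the confusion matrix printed above.
TP, FP, TN, FN = 69, 4, 143, 10

actual_cancer = TP + FN     # everyone who truly has cancer, however they were classified
actual_healthy = TN + FP    # everyone who is truly healthy, however they were classified
total = actual_cancer + actual_healthy
print(actual_cancer, actual_healthy, total)
```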

Calculate the accuracy with the accuracy_score() function from sklearn

In [207]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(Y_test, Y_pred)
print(acc)

0.938053097345


Show that the accuracy is equivalent to: (True Positives + True Negatives) / Total

In [208]:
print((TP + TN) / float(len(Y_test)))

0.938053097345


Create the classification report with the classification_report() function

In [209]:
from sklearn.metrics import classification_report

cls_rep = classification_report(Y_test, Y_pred)
print(cls_rep)

             precision    recall  f1-score   support

          0       0.93      0.97      0.95       147
          1       0.95      0.87      0.91        79

avg / total       0.94      0.94      0.94       226
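
The `avg / total` row appears to be a support-weighted average of the per-class scores (this is an assumption about the older report format shown here). The sketch below recomputes it from the confusion matrix counts; the helper name `weighted_average` is made up for this example.

```python
# Counts taken from the confusion matrix printed earlier.
TP, FP, TN, FN = 69, 4, 143, 10

# Support is the number of true observations in each class.
support = {0: TN + FP, 1: TP + FN}
precision = {0: TN / (TN + FN), 1: TP / (TP + FP)}
recall = {0: TN / (TN + FP), 1: TP / (TP + FN)}

def weighted_average(per_class, support):
    """Average per-class scores, weighting each class by its support."""
    total = sum(support.values())
    return sum(per_class[c] * support[c] for c in support) / total

avg_precision = weighted_average(precision, support)
avg_recall = weighted_average(recall, support)
print(round(avg_precision, 2), round(avg_recall, 2))
```

Both values round to 0.94, in line with the report. The support-weighted recall works out to (TP + TN) / Total, which is the accuracy.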



Show that the precision (for 1 vs 0) is equivalent to: True Positives / (True Positives + False Positives)

In [210]:
# 1 vs. 0
print(float(TP) / (TP + FP))

# 0 vs. 1
print(float(TN) / (TN + FN))

0.945205479452
0.934640522876


Show that the recall (for 1 vs 0) is equivalent to: True Positives / (True Positives + False Negatives)

In [211]:
## Of the observations that truly belong to a class, what fraction did we "recall" (classify correctly)?
# 1 vs. 0
print(float(TP) / (TP + FN))

# 0 vs. 1
print(float(TN) / (TN + FP))

0.873417721519
0.972789115646


Show that the F1-score is equivalent to: 2 * (Precision * Recall) / (Precision + Recall)

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

In [212]:
# 1 vs. 0
pos_precision = float(TP) / (TP + FP)
pos_recall = float(TP) / (TP + FN)
print(2. * (pos_precision * pos_recall) / (pos_precision + pos_recall))

# 0 vs. 1
neg_precision = float(TN) / (TN + FN)
neg_recall = float(TN) / (TN + FP)
print(2. * (neg_precision * neg_recall) / (neg_precision + neg_recall))

0.907894736842
0.953333333333
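
The F1-score can also be read as the harmonic mean of precision and recall, and it simplifies to an expression in the raw counts. The sketch below checks all three forms for the positive class, using the counts from the confusion matrix above.

```python
from statistics import harmonic_mean

# Counts for the positive class, taken from the confusion matrix above.
TP, FP, FN = 69, 4, 10

precision = TP / (TP + FP)
recall = TP / (TP + FN)

f1_formula = 2 * precision * recall / (precision + recall)
f1_harmonic = harmonic_mean([precision, recall])   # same quantity, by definition
f1_counts = 2 * TP / (2 * TP + FP + FN)            # algebraic simplification
print(round(f1_formula, 6), round(f1_harmonic, 6), round(f1_counts, 6))
```

All three agree with the 0.907894736842 printed above. Because the harmonic mean is pulled toward the smaller of its inputs, a model cannot earn a high F1-score by doing well on only one of precision or recall.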
