## Week 8 Assignment: Advanced Validation with Breast Cancer Dataset
<li> Author: Melanie Klein
<li> Class: CSC 570R

In [93]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
import math

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [94]:
data = pd.read_csv("AdvancedValidationDatasets/breast_cancer.csv")

In [95]:
data.head()

Unnamed: 0.1,Unnamed: 0,id number,clump_thickness,uniformity_of_cell_size,uniformity_of_cell_shape,marginal_adhesion,epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,malignant
0,0,1000025,5,1,1,1,2,1,3,1,1,0
1,1,1002945,5,4,4,5,7,10,3,2,1,0
2,2,1015425,3,1,1,1,2,2,3,1,1,0
3,3,1016277,6,8,8,1,3,4,3,7,1,0
4,4,1017023,4,1,1,3,2,1,3,1,1,0


In [96]:
#drop fields that aren't useful in prediction
data = data.drop(['Unnamed: 0', 'id number'], axis=1)

In [97]:
#Get an overview of the data
data.describe()

Unnamed: 0,clump_thickness,uniformity_of_cell_size,uniformity_of_cell_shape,marginal_adhesion,epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,malignant
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.440629,3.437768,2.866953,1.589413,0.344778
std,2.815741,3.051459,2.971913,2.855379,2.2143,3.665507,2.438364,3.053634,1.715078,0.475636
min,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,0.0
25%,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0
50%,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,0.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,5.0,4.0,1.0,1.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0


In [98]:
#Get value counts for dependent variable
data.malignant.value_counts()

0    458
1    241
Name: malignant, dtype: int64

In [99]:
#Setting the feature we want to predict
y = data.pop("malignant")

In [100]:
#Randomly divide dataset into a training set and a testing set - holdout method for validation
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=.2, random_state=42)
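
As a rough illustration of what a holdout split does, the sketch below shuffles row indices and sets aside 20% of them for testing, using only the standard library. The helper name `holdout_indices` is hypothetical and this is not scikit-learn's implementation: the shuffled order will differ from `train_test_split` even with the same seed. It only shows that rounding the test share up gives 140 test rows out of 699, which is consistent with the 140 test examples reported further down.

```python
import math
import random

def holdout_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)  # round the test share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_indices(699)
print(len(train_idx), len(test_idx))
```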

In [101]:
### Grid Search for hyperparameter tuning
n_estimators = [300,400,500]
max_features = ['auto', 'sqrt','log2']
min_samples_split = [3,5,7]


rfc = RandomForestClassifier(n_jobs=1)

estimator = GridSearchCV(rfc,
                         dict(n_estimators=n_estimators,
                              max_features=max_features,
                              min_samples_split=min_samples_split
                              ), cv=None, n_jobs=-1, scoring='roc_auc')                              

In [102]:
estimator.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': ['auto', 'sqrt', 'log2'], 'n_estimators': [300, 400, 500], 'min_samples_split': [3, 5, 7]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)
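
To make the size of this search concrete, the sketch below enumerates the same grid with `itertools.product`. It only illustrates what an exhaustive grid search iterates over and is not how `GridSearchCV` is implemented. The variable `n_folds` is an assumption: older scikit-learn releases used 3 folds when `cv=None` and newer ones use 5, so check the default for the installed version.

```python
from itertools import product

param_grid = {
    "n_estimators": [300, 400, 500],
    "max_features": ["auto", "sqrt", "log2"],
    "min_samples_split": [3, 5, 7],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3  # assumption: depends on the scikit-learn version when cv=None
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

Each candidate is fit once per fold, and with `refit=True` the best candidate is then fit one more time on the full training set.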

In [103]:
estimator.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=7,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [104]:
best_rfc = estimator.best_estimator_

### Measure performance of model

In [105]:
#Accuracy: Ratio of correct predictions out of all predictions
accuracy = accuracy_score(y_test, best_rfc.predict(X_test))
print ("Accuracy: ", accuracy)

Accuracy:  0.964285714286


In [106]:
#Precision: The number of true positives with respect to the total number of predicted positives
#Recall: The number of true positives with respect to the total number of actual positives
print (classification_report(y_test, best_rfc.predict(X_test)))

             precision    recall  f1-score   support

          0       0.97      0.98      0.97        95
          1       0.95      0.93      0.94        45

avg / total       0.96      0.96      0.96       140



Precision:   Of all the examples the model predicted as malignant, 95% of them were actually malignant.

Recall:  The model identified 93% of the actually malignant examples as such.
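
The definitions of precision and recall can be checked by hand. The counts below are not printed anywhere in the notebook. They are reconstructed as the whole-number confusion matrix that is consistent with the rounded report (95 benign and 45 malignant test examples) and with the reported accuracy, so treat them as an assumption.

```python
# Assumed counts for the malignant (positive) class, reconstructed from the report
tp, fn = 42, 3   # 45 actual malignant examples
tn, fp = 93, 2   # 95 actual benign examples

precision = tp / (tp + fp)                    # of predicted malignant, share truly malignant
recall = tp / (tp + fn)                       # of truly malignant, share detected
accuracy = (tp + tn) / (tp + tn + fp + fn)    # share of all predictions that are correct

print("precision %.2f, recall %.2f, accuracy %.4f" % (precision, recall, accuracy))
```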

In [107]:
#AUC: Area under the ROC curve, which plots the true positive rate (recall, or TP/(TP + FN)) against the false positive rate (FP/(FP + TN)) across decision thresholds
roc = roc_auc_score(y_test, best_rfc.predict_proba(X_test)[:,1])
print ("AUC Score: ", roc)

AUC Score:  0.995789473684
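
One way to read the AUC is as the probability that a randomly chosen malignant example receives a higher predicted score than a randomly chosen benign one, with ties counting as half. That pairwise statistic equals the area under the ROC curve. The sketch below computes it directly with the standard library. The function name `pairwise_auc` and the toy labels and scores are made up for illustration and are not taken from the dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): one benign example outranks one malignant example
toy_labels = [0, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.3, 0.7, 0.6, 0.8, 0.9]
print(pairwise_auc(toy_labels, toy_scores))
```

In the toy data, 8 of the 9 malignant/benign pairs are ordered correctly, so a single misranked pair is enough to pull the AUC below 1.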


In [108]:
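#Using k-fold cross validation to get a more stable performance estimate than a single holdout split
#Note: with no scoring argument, cross_val_score uses the classifier's default metric (accuracy)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(best_rfc, data, y, cv=10)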

In [109]:
scores

array([ 0.92957746,  0.95714286,  0.97142857,  0.91428571,  0.98571429,
        0.97142857,  0.98571429,  0.98571429,  0.98550725,  1.        ])
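
A quick arithmetic check helps identify which metric these fold scores are. Accuracy on a fold of n examples is always a whole number of correct predictions divided by n, whereas AUC generally is not. Assuming the ten folds of 699 rows hold 69 to 71 examples each (an assumption about how the folds were cut, not something printed above), the sketch below looks for a fold size that turns each score into a whole count. The helper name `matching_fold_sizes` is hypothetical.

```python
scores = [0.92957746, 0.95714286, 0.97142857, 0.91428571, 0.98571429,
          0.97142857, 0.98571429, 0.98571429, 0.98550725, 1.0]

def matching_fold_sizes(score, sizes=(69, 70, 71), tol=1e-5):
    """Return (fold size, correct count) pairs for which score * size is a whole number."""
    matches = []
    for n in sizes:
        correct = score * n
        if abs(correct - round(correct)) < tol:
            matches.append((n, round(correct)))
    return matches

for s in scores:
    print(s, matching_fold_sizes(s))
```

Every score matches a whole count (for example, 66 of 71 for the first fold), which is what accuracy looks like. This fits the behavior of `cross_val_score`, which falls back to the estimator's default `score` method (accuracy for classifiers) when no `scoring` argument is given.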

In [110]:
#Calculating the mean score from the k-fold cross validation (accuracy, the default metric for a classifier)
mean_score = scores.mean()
std_dev = scores.std()
std_error = std_dev / math.sqrt(scores.shape[0])
#2.262 is the two-sided 95% critical value of the t-distribution with 9 degrees of freedom (10 folds - 1)
ci =  2.262 * std_error
lower_bound = mean_score - ci
upper_bound = mean_score + ci

print("Score is %f +/-  %f" % (mean_score, ci))
print('Approximate 95 percent confidence interval for the mean score: %f to %f' % (lower_bound, upper_bound))

Score is 0.968651 +/-  0.018612
Approximate 95 percent confidence interval for the mean score: 0.950039 to 0.987264
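
The interval above divides by n when computing the standard deviation (NumPy's default, `ddof=0`). A common alternative for a small sample of fold scores uses the sample standard deviation, dividing by n - 1, which widens the interval slightly. The sketch below is a standard-library version under that assumption. It reuses the t critical value 2.262 for 9 degrees of freedom from the cell above, and the helper name `t_interval` is hypothetical. Fold scores from cross validation are not fully independent, so either interval is only approximate.

```python
import math
import statistics

scores = [0.92957746, 0.95714286, 0.97142857, 0.91428571, 0.98571429,
          0.97142857, 0.98571429, 0.98571429, 0.98550725, 1.0]

def t_interval(values, t_crit=2.262):
    """Mean and half-width of a t-based interval using the sample standard deviation."""
    mean = statistics.mean(values)
    std_error = statistics.stdev(values) / math.sqrt(len(values))  # stdev divides by n - 1
    return mean, t_crit * std_error

mean_score, half_width = t_interval(scores)
print("Score is %f +/- %f" % (mean_score, half_width))
```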


### Model Performance
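#### Holdout vs. k-fold cross validation
<li>The AUC for the 20% holdout (0.9958) cannot be compared directly with the 10-fold cross validation average (0.968651): `cross_val_score` was called without a `scoring` argument, so the cross validation scores are accuracies, not AUCs. The comparable holdout figure is the accuracy, 0.9643, which is slightly below the cross validation average and well inside its interval (0.950039 to 0.987264).
<li>The two validation methods therefore agree, and these numbers show no sign that the holdout estimate is optimistic. The 10-fold average is still likely the better indication of how the model would perform out in the wild, because it is averaged over ten different test folds rather than one.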

#### Interpreting performance measures: Accuracy, AUC, Precision, and Recall
Based on the evaluation on the 20% holdout set, the accuracy of this model was 0.964.  The model had a precision of 95% when predicting the presence of a malignant tumor.  In other words, 95% of the examples predicted to be malignant were actually malignant.  The model had a recall (or true positive rate) of 93%, the lowest of the holdout evaluation measures.  In other words, of all the examples in which a malignant tumor was actually present, the model detected 93% of them.  This is the most concerning result, as it means about 7% of the malignant cases in the test set were missed.

When it comes to something like cancer detection, having a high true positive rate (recall) and a low false positive rate are both very important.  You definitely don't want someone with cancer to go untreated; likewise, it is costly to perform additional tests or treatment on someone who doesn't actually need it.  AUC summarizes this trade-off across all decision thresholds, so the higher the AUC, the better.  While the AUC on the holdout set was over 99%, the 93% recall and the cross-validated score of about 97% indicate that there is still room to improve this model's ability to perform well in the real world.