### Advanced Validation with Breast Cancer Data
UIS CSC 570R - Data Science Essentials<br>
2017 Fall<br>
Jason Burrell<br>

Advanced validation with breast cancer data, based on https://github.com/mbernico/CS570/blob/master/module_2/Advanced%20Validation.ipynb

In [28]:
import math

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.grid_search and sklearn.cross_validation were removed in scikit-learn 0.20;
# GridSearchCV, train_test_split, and cross_val_score now live in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV, train_test_split
# Alias so the later cross_validation.cross_val_score call keeps working.
from sklearn import model_selection as cross_validation
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [7]:
data_orig = pd.read_csv('../data/breast_cancer.csv')

In [10]:
data = data_orig.drop(['Unnamed: 0', 'id number'], axis=1)
y = data.pop('malignant')

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=.2, random_state=42)

In [24]:
### Grid Search
n_estimators = list(range(100, 121))
# 'auto' is dropped: for classifiers it was an alias of 'sqrt', and recent
# scikit-learn releases no longer accept it.
max_features = ['sqrt', 'log2']
min_samples_split = list(range(2, 6))

rfc = RandomForestClassifier(n_jobs=-1)
# Exhaustively search every combination of the three parameter lists.
# cv=None falls back to scikit-learn's default number of folds.
estimator = GridSearchCV(rfc,
                         dict(n_estimators=n_estimators,
                              max_features=max_features,
                              min_samples_split=min_samples_split),
                         cv=None, n_jobs=-1)


In [25]:
estimator.fit(X_train, y_train)
best_rfc = estimator.best_estimator_

In [26]:
best_rfc

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            n_estimators=117, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [27]:
def measure_model(model, X_test, y_test):
    """Print AUC, accuracy, and the per-class report for a fitted classifier."""
    y_pred = model.predict(X_test)
    # AUC needs the predicted probability of the positive (malignant) class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    accuracy = accuracy_score(y_test, y_pred)
    print('AUC = %f\nAccuracy = %f' % (auc, accuracy))
    print(classification_report(y_test, y_pred))

measure_model(best_rfc, X_test, y_test)

AUC = 0.996023
Accuracy = 0.971429
             precision    recall  f1-score   support

          0       0.98      0.98      0.98        95
          1       0.96      0.96      0.96        45

avg / total       0.97      0.97      0.97       140



In [29]:
scores = cross_validation.cross_val_score(best_rfc, data, y, cv=10)
scores

array([ 0.92957746,  0.97142857,  0.97142857,  0.91428571,  0.97142857,
        0.98571429,  0.97142857,  0.98571429,  0.98550725,  1.        ])

In [35]:
mean_score = scores.mean()
std_dev = scores.std()  # NumPy default ddof=0 (population standard deviation)
std_error = std_dev / math.sqrt(scores.shape[0])
# 2.262 is the two-sided 95% t critical value for 9 degrees of freedom (10 folds).
ci = 2.262 * std_error
lower_bound = mean_score - ci
upper_bound = mean_score + ci

print("Score is %f +/-  %f (%f - %f)" % (mean_score, ci, lower_bound, upper_bound))

Score is 0.968651 +/-  0.018043 (0.950609 - 0.986694)
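
The interval above pairs a t critical value with `scores.std()`, which uses NumPy's default `ddof=0`. A t-based interval is usually built from the sample standard deviation (`ddof=1`) instead. The block below is a minimal, dependency-free sketch of that variant, using the fold scores copied from the output above. The helper name `t_interval` is made up for illustration, and because cross-validation folds share training data the result is best read as a rough guide rather than an exact 95% interval.

```python
import math
import statistics

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.92957746, 0.97142857, 0.97142857, 0.91428571, 0.97142857,
               0.98571429, 0.97142857, 0.98571429, 0.98550725, 1.0]

T_CRIT_95_DF9 = 2.262  # two-sided 95% t critical value, 9 degrees of freedom


def t_interval(values, t_crit):
    """Return (mean, margin) using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    std_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, t_crit * std_error


mean_score, margin = t_interval(fold_scores, T_CRIT_95_DF9)
print("Score is %f +/- %f (%f - %f)"
      % (mean_score, margin, mean_score - margin, mean_score + margin))

# Holdout accuracy reported earlier, for comparison with the interval.
holdout_accuracy = 0.971429
inside = mean_score - margin <= holdout_accuracy <= mean_score + margin
print("Holdout accuracy inside interval:", inside)
```

With `n - 1` in the denominator the margin comes out slightly wider than the 0.018043 printed above, and the holdout accuracy still falls inside the interval.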


### Results

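The 10-fold mean accuracy (0.9687) was slightly lower than the single holdout accuracy (0.9714), but the holdout value lies well inside the 95% interval of the K-fold scores (0.9506 to 0.9867). The gap is therefore within fold-to-fold variation and does not by itself indicate over-fitting. Note that `cross_val_score` reports accuracy by default for classifiers, so the K-fold score is not directly comparable to the holdout AUC of 0.9960.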

### Description of the model's performance

Include AUC, Accuracy, Precision, and Recall in your discussion.

The AUC was 0.9960, which is very high. Accuracy on the holdout set was 97% (0.9714). Precision was 98% for benign growths and 96% for malignant growths: of the growths the model labelled benign, 98% were benign, and of those it labelled malignant, 96% were malignant. Recall was also 98% for benign and 96% for malignant growths, meaning the model found 98% of all benign growths and 96% of all malignant growths in the test set. The support-weighted averages of precision and recall were both 97%.
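
To make the definitions of accuracy, precision, and recall concrete, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: the notebook never prints the matrix, so they are inferred from the report above (140 test samples with accuracy 0.971429 implies 4 errors, and the rounded recalls of 0.98 and 0.96 are consistent with two errors in each class). All variable names are illustrative.

```python
# Confusion-matrix counts inferred from the classification report
# (an assumption, not notebook output): 95 benign and 45 malignant samples.
true_benign, false_malignant = 93, 2   # actual benign: predicted benign / malignant
false_benign, true_malignant = 2, 43   # actual malignant: predicted benign / malignant

total = true_benign + false_malignant + false_benign + true_malignant

# Accuracy: share of all predictions that were correct.
accuracy = (true_benign + true_malignant) / total

# Precision: of the samples predicted as a class, how many really were that class.
precision_benign = true_benign / (true_benign + false_benign)
precision_malignant = true_malignant / (true_malignant + false_malignant)

# Recall: of the samples that really were a class, how many the model found.
recall_benign = true_benign / (true_benign + false_malignant)
recall_malignant = true_malignant / (true_malignant + false_benign)

print("Accuracy = %f" % accuracy)
print("benign:    precision %.2f, recall %.2f" % (precision_benign, recall_benign))
print("malignant: precision %.2f, recall %.2f" % (precision_malignant, recall_malignant))
```

In the notebook itself, the actual counts could be read from `confusion_matrix(y_test, best_rfc.predict(X_test))`, which is already imported but not called.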