# Predicting Breast Cancer

### Paulo C. Rios Jr. | Oct 23, 2017

## Enhancements

1. Apply Cross-Validation with 5 folds.
2. Apply a confusion matrix. 
3. Get the scores for precision and recall in the validation and in the test sets.
4. Use `zip` to pair each feature with its importance.
5. Sort the features by importance, identifying the top 5.

## 1. Import packages

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [3]:
from sklearn.ensemble import RandomForestClassifier

In [4]:
from sklearn.model_selection import cross_val_score

In [5]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score,precision_score, recall_score
from sklearn.metrics import f1_score

In [6]:
from sklearn import datasets

## 2. Reading and Browsing the data

In [7]:
# Load the Breast Cancer Wisconsin (Diagnostic) dataset
wisconsin = datasets.load_breast_cancer()

In [8]:
wisconsin.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [9]:
type(wisconsin.DESCR)

str

In [10]:
print(wisconsin.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

In [11]:
wisconsin.target_names

array(['malignant', 'benign'], 
      dtype='<U9')

In [12]:
wisconsin.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'], 
      dtype='<U23')

In [13]:
wisconsin.target[:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [14]:
wisconsin.data

array([[  1.79900000e+01,   1.03800000e+01,   1.22800000e+02, ...,
          2.65400000e-01,   4.60100000e-01,   1.18900000e-01],
       [  2.05700000e+01,   1.77700000e+01,   1.32900000e+02, ...,
          1.86000000e-01,   2.75000000e-01,   8.90200000e-02],
       [  1.96900000e+01,   2.12500000e+01,   1.30000000e+02, ...,
          2.43000000e-01,   3.61300000e-01,   8.75800000e-02],
       ..., 
       [  1.66000000e+01,   2.80800000e+01,   1.08300000e+02, ...,
          1.41800000e-01,   2.21800000e-01,   7.82000000e-02],
       [  2.06000000e+01,   2.93300000e+01,   1.40100000e+02, ...,
          2.65000000e-01,   4.08700000e-01,   1.24000000e-01],
       [  7.76000000e+00,   2.45400000e+01,   4.79200000e+01, ...,
          0.00000000e+00,   2.87100000e-01,   7.03900000e-02]])

In [15]:
wisconsin_X = wisconsin.data

In [16]:
wisconsin_X_df = pd.DataFrame(wisconsin_X)

In [17]:
wisconsin_X_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [18]:
wisconsin_X_df.columns = list(wisconsin.feature_names)

In [19]:
wisconsin_X_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
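
The two steps above (build the frame, then assign column names) can also be done in one call. The sketch below additionally attaches the diagnosis as a readable label so the class balance can be checked; the column name `diagnosis` and the variable `wisconsin_df` are hypothetical names chosen for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn import datasets

wisconsin = datasets.load_breast_cancer()

# Name the columns at construction time instead of assigning them afterwards.
wisconsin_df = pd.DataFrame(wisconsin.data, columns=wisconsin.feature_names)

# "diagnosis" is a hypothetical column name.
# target_names is ['malignant', 'benign'], so target 0 -> malignant, 1 -> benign.
wisconsin_df["diagnosis"] = wisconsin.target_names[wisconsin.target]

# Count how many samples fall in each class.
class_counts = wisconsin_df["diagnosis"].value_counts()
print(wisconsin_df.shape)
print(class_counts)
```

The counts show that the classes are moderately imbalanced, which is worth keeping in mind when reading the accuracy figures later on.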


## 3. Identify X and y

In [20]:
y = wisconsin.target

In [21]:
y[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [22]:
X = wisconsin.data

## 4. Train and Test Split

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, y_train, y_test = train_test_split(
                                        X,
                                        y, 
                                        test_size=0.2, 
                                        random_state=1)
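
Because the two classes are not equally frequent, one possible variation is a stratified split, which keeps the benign/malignant ratio roughly the same in both partitions. This is a sketch of an alternative, not what the notebook does: the `stratify=y` argument is an addition, so the resulting rows differ from the split used in the rest of this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

wisconsin = datasets.load_breast_cancer()
X, y = wisconsin.data, wisconsin.target

# stratify=y is an assumption added for illustration; the notebook's split omits it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)

# Fraction of benign samples (label 1) overall, in train, and in test.
print(y.mean(), y_train.mean(), y_test.mean())
```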

## 5. Apply Random Forest Classifier

In [25]:
rf_model = RandomForestClassifier()

In [26]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

## 6. Model Cross-Validation

In [27]:
cv_scores_rf = cross_val_score(rf_model, X_train, y_train, 
                               cv=5, scoring="accuracy")
cv_scores_rf

array([ 0.92307692,  0.96703297,  0.96703297,  0.94505495,  0.97802198])

In [28]:
cv_scores_rf_mean =  np.mean(cv_scores_rf)
cv_scores_rf_mean

0.95604395604395598
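
Enhancement 3 also asks for precision and recall on the validation folds, which `cross_val_score` with a single `scoring` string does not provide. A minimal sketch using `cross_validate` with several metrics follows. Setting `random_state=1` and `n_estimators=10` on the forest is an assumption made here for repeatability (the notebook's model leaves `random_state` unset), so the numbers will not match the scores above exactly.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

wisconsin = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    wisconsin.data, wisconsin.target, test_size=0.2, random_state=1)

# random_state on the forest is an assumption added for repeatability.
rf_model = RandomForestClassifier(n_estimators=10, random_state=1)

# One 5-fold run that records accuracy, precision and recall per fold.
cv_results = cross_validate(rf_model, X_train, y_train, cv=5,
                            scoring=["accuracy", "precision", "recall"])

for metric in ["accuracy", "precision", "recall"]:
    fold_scores = cv_results["test_" + metric]
    print(metric, np.round(fold_scores, 3), "mean:", round(fold_scores.mean(), 3))
```

Precision and recall here are computed for label 1 (benign), which is scikit-learn's default positive class for binary targets.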

## 7. Model Test

In [29]:
y_test_pred = rf_model.predict(X_test)

In [30]:
y_test.shape

(114,)

In [31]:
# Cross-tabulate actual vs. predicted labels (0 = malignant, 1 = benign)
pd.crosstab(y_test, y_test_pred,
            rownames=['Actual'], 
            colnames=['Predicted'])

| Actual / Predicted | 0 | 1 |
|---|---|---|
| 0 | 38 | 4 |
| 1 | 1 | 71 |


In [32]:
confusion_matrix(y_test, y_test_pred)

array([[38,  4],
       [ 1, 71]])

In [33]:
accuracy_score_test = accuracy_score(y_test, y_test_pred)
accuracy_score_test

0.95614035087719296

In [34]:
precision_score_forest = precision_score(y_test, y_test_pred)
precision_score_forest

0.94666666666666666

In [35]:
recall_score_forest = recall_score(y_test, y_test_pred)
recall_score_forest

0.98611111111111116

In [36]:
f1_score_forest = f1_score(y_test, y_test_pred)
f1_score_forest

0.96598639455782309
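
The four scores above can be re-derived by hand from the confusion matrix, which makes it explicit that they treat label 1 (benign) as the positive class. The sketch below uses only the counts reported in the matrix; the variables `precision_malignant` and `recall_malignant` are hypothetical names for the same quantities with label 0 (malignant) taken as positive.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted.
cm = np.array([[38, 4],
               [1, 71]])

# With label 1 (benign) as the positive class, scikit-learn's default:
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()                     # 109 / 114
precision = tp / (tp + fp)                          # 71 / 75
recall = tp / (tp + fn)                             # 71 / 72
f1 = 2 * precision * recall / (precision + recall)  # 142 / 147

# The same matrix read with label 0 (malignant) as the positive class.
precision_malignant = tn / (tn + fn)                # 38 / 39
recall_malignant = tn / (tn + fp)                   # 38 / 42

print(accuracy, precision, recall, f1)
print(precision_malignant, recall_malignant)
```

Recall for the malignant class (38 of 42, about 0.905) is noticeably lower than the benign recall reported above. In scikit-learn the malignant-class figures can be obtained by passing `pos_label=0` to `precision_score` and `recall_score`.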

## 8. Accuracy Comparison: Validation vs. Test

In [41]:
comparison = {"Validation": [cv_scores_rf_mean],
             "Test": [accuracy_score_test]}
pd.DataFrame(comparison, index = ["Accuracy"])

| | Test | Validation |
|---|---|---|
| Accuracy | 0.95614 | 0.956044 |


## 9. Feature Importance

In [61]:
rf_model.feature_importances_

array([ 0.00427376,  0.00606282,  0.0041747 ,  0.01138347,  0.00458349,
        0.00528397,  0.02127959,  0.15720394,  0.00354267,  0.00187591,
        0.00706011,  0.00412941,  0.05172357,  0.01128251,  0.00304439,
        0.00412494,  0.00552121,  0.00786442,  0.        ,  0.00437375,
        0.22910854,  0.01513797,  0.16460837,  0.10752186,  0.00549105,
        0.01692416,  0.07547822,  0.05777345,  0.00461518,  0.00455257])

In [62]:
# View a list of the features and their importance scores
imp_list = list(zip(wisconsin.feature_names, 
                    rf_model.feature_importances_))
imp_df = pd.DataFrame(imp_list, columns = ["Features", "Importance"])
imp_df.sort_values(by = "Importance", ascending = False)

| | Features | Importance |
|---|---|---|
| 20 | worst radius | 0.229109 |
| 22 | worst perimeter | 0.164608 |
| 7 | mean concave points | 0.157204 |
| 23 | worst area | 0.107522 |
| 26 | worst concavity | 0.075478 |
| 27 | worst concave points | 0.057773 |
| 12 | perimeter error | 0.051724 |
| 6 | mean concavity | 0.02128 |
| 25 | worst compactness | 0.016924 |
| 21 | worst texture | 0.015138 |
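
Enhancement 5 asks for the top 5 features specifically. One way to pull them out is `DataFrame.nlargest`, sketched below. To keep the example independent of a particular fit, the importances are copied from the `feature_importances_` output above; the variable name `top5` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

feature_names = datasets.load_breast_cancer().feature_names

# Importances copied from the feature_importances_ output of the fitted model above.
importances = np.array([
    0.00427376, 0.00606282, 0.0041747, 0.01138347, 0.00458349,
    0.00528397, 0.02127959, 0.15720394, 0.00354267, 0.00187591,
    0.00706011, 0.00412941, 0.05172357, 0.01128251, 0.00304439,
    0.00412494, 0.00552121, 0.00786442, 0.0, 0.00437375,
    0.22910854, 0.01513797, 0.16460837, 0.10752186, 0.00549105,
    0.01692416, 0.07547822, 0.05777345, 0.00461518, 0.00455257])

# Pair each feature with its importance, then keep the five largest.
imp_df = pd.DataFrame(list(zip(feature_names, importances)),
                      columns=["Features", "Importance"])
top5 = imp_df.nlargest(5, "Importance")
print(top5)
```

Since the model was fitted without a fixed `random_state`, a re-run of the notebook may rank the features somewhat differently.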
