### Practicing with Classification Algorithms

Today I'm going to play with one of the datsets that comes with SciKit-Learn. Specifically, this is the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

From [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer), this datset has 569 samples, which includes 30 features.

This data is:
>Dictionary-like object, the interesting attributes are: ‘data’, the data to learn, ‘target’, the classification labels, ‘target_names’, the meaning of the labels, ‘feature_names’, the meaning of the features, and ‘DESCR’, the full description of the dataset, ‘filename’, the physical location of breast cancer csv dataset (added in version 0.20).

Let's take a look!

#### Load the data

In [39]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
from sklearn.datasets import load_breast_cancer

In [4]:
cancer = load_breast_cancer()

### Initial view

What is the data measuring? The target is the dependent variable - whether the tumor is cancerous or not. The independent variables are the features of these tumors.

By classifying these features, can we determine and predict which tumors will be malignant and which will be benign?

In [5]:
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')

In [6]:
cancer.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [7]:
cancer.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [8]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

### Designate X and y

As I mentioned above, the tumor features are the independent variables, while the tumor designation (target) is the dependent variable. With this in mind, let's label the features as the X variable, and the target as the y variable. 

I'm going to split the data with `train_test_split`, and I'll evaluate the final model this way. But to compare the four models against each other, I'm going to use cross validation scores.

I'm scoring the k-folds with "negative mean squared error", so I can see which model gives the fewest errors. 

In [21]:
X = cancer.data

y = cancer.target

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [118]:
from sklearn.model_selection import cross_val_score

### Pick best model

We're going to run the data through a variety of classification algorithms, and compare their cross validation scores to find the best model. If necessary, we can later use Grid Search to find the best parameters.

Models to use:
- Decision Tree
- Random Forest
- SVM
- KNN

In [23]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
tree_clf = DecisionTreeClassifier()

In [119]:
# cross val function

def display_cross_val(model):
    scores = cross_val_score(model, X_train, y_train, scoring = 'neg_mean_squared_error', cv=10)
    return scores.mean(), scores.std()

In [120]:
tree_rmse_scores = display_cross_val(tree_clf)

In [121]:
from sklearn.ensemble import RandomForestClassifier

In [122]:
forest_clf = RandomForestClassifier()

In [123]:
forest_rmse_scores = display_cross_val(forest_clf)

In [124]:
from sklearn.svm import SVC

In [125]:
svm_clf = SVC()

In [126]:
svm_rmse_scores = display_cross_val(svm_clf)

In [127]:
from sklearn.neighbors import KNeighborsClassifier

In [128]:
knn_clf = KNeighborsClassifier()

In [129]:
knn_rmse_scores = display_cross_val(knn_clf)

In [130]:
# View scores (mean and standard deviation)

print('Decision Tree Scores: ', tree_rmse_scores)
print('\n')
print('Random Forest Scores: ', forest_rmse_scores)
print('\n')
print('SVC Scores: ', svm_rmse_scores)
print('\n')
print('KNN Scores: ', knn_rmse_scores)

Decision Tree Scores:  (-0.06880585038479775, 0.0419493000555671)


Random Forest Scores:  (-0.05003282634861582, 0.04132597396871558)


SVC Scores:  (-0.3805011489222016, 0.004985860228816247)


KNN Scores:  (-0.09230222124958969, 0.040631734705542925)


Looks like the Random Forest model comes out on top!

### Apply model to test data

Let's use the testing data to once again evaluate the RF model we've created. To see how the model does, we'll take a look at the confusion matrix. From there, we can tweak the parameters on the model and see if the recall score - reducing the number of false negatives - changes at all.

For this particular task, we want our model to be very highly sensitive, so as to correctly identify as many malignant tumors as possible, even if it raises the risk of false positives.

In [88]:
from sklearn.metrics import confusion_matrix

In [138]:
forest_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [139]:
forest_pred = tree_clf.predict(X_test)

In [140]:
confusion_matrix(y_test, forest_pred)

array([[ 61,   6],
       [  7, 114]])

What does the confusion matrix tell us? 61 malignant tumors were correctly identified, while 6 tumors were overlooked. In the second row, we see that 7 tumors were misidentified as malignant, while 114 tumors were correctly identified as benign.

The recall score is the number of true positives over the number of true positives plus false negatives. This dataset incorrectly has benign tumors as positives, so we have to calculate the score by hand.

In [141]:
print('Recall Score for original model: ', 61/68)

Recall Score for original model:  0.8970588235294118


That's pretty good! Would altering the parameters of the random forest model help this at all?

### Find best hyperparameters

We can do this really easy with **Grid Search**.

In [135]:
from sklearn.model_selection import GridSearchCV

In [137]:
# Grid Search #1

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
   ]

In [148]:
grid_search = GridSearchCV(forest_clf, param_grid, cv=5, scoring='neg_mean_squared_error')

In [151]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [152]:
grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}

Something that I'm not showing you: When I run the grid search again with the same parameter options, it often returns a totally different combination of features. I'm pretty sure this means that there are multiple parameter combinations that improve the model by the same amount. 

Let's train the model with these new parameters!


In [153]:
forest_grid = RandomForestClassifier(max_features=6, n_estimators=30)

In [154]:
forest_grid.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=6, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [155]:
forest_grid_pred = forest_grid.predict(X_test)

In [156]:
confusion_matrix(y_test, forest_grid_pred)

array([[ 63,   4],
       [  5, 116]])

In [157]:
print('Recall Score for model #2 [max_features: 6, n_estimators: 30): ', 63/68)

Recall Score for model #2 [max_features: 6, n_estimators: 30):  0.9264705882352942


Just the small tweak did make a difference! Our model now performs with a 92.6% recall rate.

### Summary

This notebook took a look at a fairly simple dataset and tried out a number of classification algorithms. The cross validation scores showed us that Random Forest performed best, although Decision Tree and KNN weren't that far behind. SVM, not so much.

After I picked the model I wanted to use, I fit it to the training data and explored whether some different hyperparameters would improve the model. Because the Real World application of such a model would value sensitivity over accuracy, we're more concerned with model's recall rate than sheer accuracy. 