
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/training_models/breast_cancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting notebook generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

# Diagnose Breast Tumors With K-Nearest Neighbors

In this notebook, you'll use a K-Nearest Neighbors classifier to help diagnose breast tumors.

The [Breast Cancer][1] dataset is used for multivariate binary classification between benign and malignant tumors. It contains 569 samples with 30 features each. The features were computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image.

![Breast cancer logo](images/breast-cancer.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Package setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

In [3]:
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

scikit-learn version: 0.22.2.post1


## Step 1: Loading the data

In [4]:
dataset = load_breast_cancer()

# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 294 | 12.72 | 13.78 | 81.78 | 492.1 | 0.09667 | 0.08393 | 0.01288 | 0.01924 | 0.1638 | 0.061 | ... | 88.54 | 553.7 | 0.1298 | 0.1472 | 0.05233 | 0.06343 | 0.2369 | 0.06922 | 1 | benign |
| 421 | 14.69 | 13.98 | 98.22 | 656.1 | 0.1031 | 0.1836 | 0.145 | 0.063 | 0.2086 | 0.07406 | ... | 114.1 | 809.2 | 0.1312 | 0.3635 | 0.3219 | 0.1108 | 0.2827 | 0.09208 | 1 | benign |
| 172 | 15.46 | 11.89 | 102.5 | 736.9 | 0.1257 | 0.1555 | 0.2032 | 0.1097 | 0.1966 | 0.07069 | ... | 125.0 | 1102.0 | 0.1531 | 0.3583 | 0.583 | 0.1827 | 0.3216 | 0.101 | 0 | malignant |
| 161 | 19.19 | 15.94 | 126.3 | 1157.0 | 0.08694 | 0.1185 | 0.1193 | 0.09667 | 0.1741 | 0.05176 | ... | 146.6 | 1495.0 | 0.1124 | 0.2016 | 0.2264 | 0.1777 | 0.2443 | 0.06251 | 0 | malignant |
| 24 | 16.65 | 21.38 | 110.0 | 904.6 | 0.1121 | 0.1457 | 0.1525 | 0.0917 | 0.1995 | 0.0633 | ... | 177.0 | 2215.0 | 0.1805 | 0.3578 | 0.4695 | 0.2095 | 0.3613 | 0.09564 | 0 | malignant |
| 97 | 9.787 | 19.94 | 62.11 | 294.5 | 0.1024 | 0.05301 | 0.006829 | 0.007937 | 0.135 | 0.0689 | ... | 68.81 | 366.1 | 0.1316 | 0.09473 | 0.02049 | 0.02381 | 0.1934 | 0.08988 | 1 | benign |
| 83 | 19.1 | 26.29 | 129.1 | 1132.0 | 0.1215 | 0.1791 | 0.1937 | 0.1469 | 0.1634 | 0.07224 | ... | 141.3 | 1298.0 | 0.1392 | 0.2817 | 0.2432 | 0.1841 | 0.2311 | 0.09203 | 0 | malignant |
| 336 | 12.99 | 14.23 | 84.08 | 514.3 | 0.09462 | 0.09965 | 0.03738 | 0.02098 | 0.1652 | 0.07238 | ... | 87.38 | 576.0 | 0.1142 | 0.1975 | 0.145 | 0.0585 | 0.2432 | 0.1009 | 1 | benign |
| 325 | 12.67 | 17.3 | 81.25 | 489.9 | 0.1028 | 0.07664 | 0.03193 | 0.02107 | 0.1707 | 0.05984 | ... | 88.7 | 574.4 | 0.1384 | 0.1212 | 0.102 | 0.05602 | 0.2688 | 0.06888 | 1 | benign |
| 154 | 13.15 | 15.34 | 85.31 | 538.9 | 0.09384 | 0.08498 | 0.09293 | 0.03483 | 0.1822 | 0.06207 | ... | 97.67 | 677.3 | 0.1478 | 0.2256 | 0.3009 | 0.09722 | 0.3849 | 0.08633 | 1 | benign |


## Step 2: Preparing the data

### Question

Compute the number of features in the dataset and store it in the `num_features` variable.

In [5]:
num_features = len(df_breast_cancer.iloc[:, 0:-2].columns)

In [6]:
print(f'Number of features: {num_features}')

assert num_features == 30

Number of features: 30


### Question

To evaluate the class distribution, count the benign and malignant tumors and store the results in the `num_benign` and `num_malignant` variables respectively.

In [7]:
class_count = df_breast_cancer['class'].value_counts()
class_count

benign       357
malignant    212
Name: class, dtype: int64

In [8]:
num_benign = class_count['benign']
num_malignant = class_count['malignant']

In [9]:
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')

assert num_benign == 357
assert num_malignant == 212

Benign count: 357. Malignant count: 212
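Because the two classes are not equally represented, it helps to know what accuracy a trivial classifier would reach. The following is a minimal sketch that reuses the counts printed above; `baseline_acc` is a name introduced here for illustration only.

```python
# Class counts reported above for the Breast Cancer dataset
num_benign = 357
num_malignant = 212
num_samples = num_benign + num_malignant

# A "classifier" that always predicts the majority class (benign)
# is right exactly as often as that class occurs in the data.
baseline_acc = max(num_benign, num_malignant) / num_samples

print(f'Share of benign samples: {num_benign / num_samples:.1%}')
print(f'Majority-class baseline accuracy: {baseline_acc:.1%}')
```

Any accuracy reported later in the notebook is best read against this baseline rather than against 50%.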


In [10]:
# Store input and labels
x = dataset.data
y = dataset.target

print(f'x: {x.shape}. y: {y.shape}')

x: (569, 30). y: (569,)


### Question

Split the dataset into training and test sets, holding out 25% of the samples for the test set. Use variables `x_train`, `y_train`, `x_test` and `y_test`.

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

In [12]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (426, 30)
assert y_train.shape == (426, )
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)

x_train: (426, 30). y_train: (426,)
x_test: (143, 30). y_test: (143,)
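The expected shapes follow from simple arithmetic on the 569 samples. This sketch assumes that a fractional `test_size` is rounded up to a whole number of test samples and that the training set receives the rest, which is consistent with the shapes asserted above.

```python
import math

n_samples = 569
test_size = 0.25

# Assumption: the fractional test size is rounded up,
# and the training set receives the remaining samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f'n_train: {n_train}. n_test: {n_test}')
```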


### Question

Scale features by standardization while preventing information leakage from the test set.

In [13]:
x_scaler = StandardScaler()
x_train = x_scaler.fit_transform(x_train)
x_test = x_scaler.transform(x_test)

In [14]:
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6

mean_train: 3.135728505232367e-16. std_train: 1.0
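To make the leakage point concrete, here is a hand-written sketch of standardization on a single hypothetical feature. The values and the `standardize` helper are invented for illustration. The mean and standard deviation come from the training values only and are then reused unchanged on the test values.

```python
import statistics

# Hypothetical one-feature example: raw values for a training and a test set
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [11.0, 20.0]

# Statistics are computed on the training set only
train_mean = statistics.mean(train_values)
# Population standard deviation (divides by n), which is what StandardScaler uses
train_std = statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Center and scale values with the given, already computed, statistics."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_values, train_mean, train_std)
# The test set is transformed with the training statistics, never its own
test_scaled = standardize(test_values, train_mean, train_std)

print(f'train mean: {statistics.mean(train_scaled):.3f}, '
      f'train std: {statistics.pstdev(train_scaled):.3f}')
print(f'test mean: {statistics.mean(test_scaled):.3f}')
```

The scaled training values have zero mean and unit standard deviation by construction, while the scaled test values generally do not. That is expected and is not a sign of an error.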


## Step 3: Creating a classifier

### Question

Create a `KNeighborsClassifier` instance using only one nearest neighbor, store it into the `model` variable, and fit the training data.

In [15]:
print(y_train)

[1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1
 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 1
 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1 1
 1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 1 0 1
 0 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1
 0 0 1 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1
 1 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0
 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 1
 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1
 0 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1
 0 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 1 0 1 1 1 0 0 1 1
 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1]


In [38]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
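For intuition about what the fitted model does at prediction time, here is a minimal from-scratch sketch of a nearest-neighbors vote using Euclidean distance and equal weights, which matches the `p=2` and `weights='uniform'` settings shown above. The function `knn_predict` and the toy data are hypothetical, and the sketch ignores the indexing structures and tie-breaking rules of the real implementation.

```python
import math
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample, paired with its label
    distances = [(math.dist(x, x_new), label) for x, label in zip(x_train, y_train)]
    distances.sort(key=lambda pair: pair[0])
    # Labels of the k closest samples
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical two-feature toy data forming two well-separated groups
toy_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
toy_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(toy_x, toy_y, (0.15, 0.15), k=3)
pred_b = knn_predict(toy_x, toy_y, (0.95, 1.0), k=3)
print(pred_a, pred_b)
```

Since every prediction depends on raw distances, features with large numeric ranges would dominate the vote, which is why the features were standardized in step 2.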

## Step 4: Evaluating the classifier

In [39]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 97.18%
Test accuracy: 98.60%


### Question

Display precision, recall and f1-score for the classifier on test data. Interpret the results.

In [40]:
y_train_pred = model.predict(x_train)

from sklearn.metrics import precision_score, recall_score, f1_score
# scikit-learn metrics expect the true labels first, then the predictions
print('precision : ', precision_score(y_train, y_train_pred))
print('recall    : ', recall_score(y_train, y_train_pred))
print('f1-score  : ', f1_score(y_train, y_train_pred))

precision :  0.9598540145985401
recall    :  0.9962121212121212
f1-score  :  0.9776951672862454


In [41]:
y_test_pred = model.predict(x_test)

from sklearn.metrics import precision_score, recall_score, f1_score
# scikit-learn metrics expect the true labels first, then the predictions
print('precision : ', precision_score(y_test, y_test_pred))
print('recall    : ', recall_score(y_test, y_test_pred))
print('f1-score  : ', f1_score(y_test, y_test_pred))

precision :  0.9789473684210527
recall    :  1.0
f1-score  :  0.9893617021276596
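When interpreting these metrics, keep in mind that scikit-learn's metric functions take the true labels first and the predictions second, and by default treat label `1` (benign in this dataset) as the positive class. The sketch below derives the three metrics by hand from counts of true positives, false positives and false negatives. The helper `precision_recall_f1` and the toy labels are hypothetical, and the helper does not guard against division by zero.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)   # share of positive predictions that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels: 1 = benign, 0 = malignant
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 0, 0]  # two malignant tumors predicted as benign

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(f'precision: {precision:.3f}, recall: {recall:.3f}, f1: {f1:.3f}')
```

In this toy case every benign tumor is found (recall of 1.0), but precision drops because two malignant tumors were labelled benign. For a diagnostic task that second kind of error is usually the more costly one, so it can be worth also computing the metrics with `positive=0`.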


In [42]:
print(pd.concat([pd.DataFrame(y_test_pred), pd.DataFrame(y_test)], axis=1))

     0  0
0    1  1
1    1  1
2    1  1
3    0  0
4    1  1
5    1  1
6    1  1
7    1  1
8    0  0
9    1  1
10   1  1
11   1  1
12   1  1
13   1  1
14   1  1
15   1  1
16   1  1
17   0  0
18   0  0
19   0  0
20   1  1
21   1  1
22   0  0
23   0  0
24   0  0
25   0  0
26   1  1
27   1  1
28   1  1
29   0  0
..  .. ..
113  1  1
114  0  0
115  1  1
116  1  1
117  1  1
118  0  0
119  1  1
120  1  1
121  0  0
122  1  1
123  1  1
124  1  1
125  0  0
126  0  0
127  1  1
128  0  0
129  1  1
130  1  1
131  1  1
132  1  1
133  1  1
134  0  0
135  0  0
136  1  1
137  1  1
138  0  0
139  1  1
140  1  1
141  0  0
142  1  1

[143 rows x 2 columns]


### Question

Go back to step 3 and try to find the best value for `k`, the number of nearest neighbors.
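One possible way to run this search without repeatedly peeking at the test set is cross-validation on the training data. The sketch below is only one option under stated assumptions: the candidate list `k_values`, the 5-fold setting and `random_state=0` are arbitrary choices made for illustration, not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
# random_state is fixed here only to make the sketch reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Candidate values for k: an arbitrary choice for illustration
k_values = list(range(1, 16, 2))
cv_scores = {}
for k in k_values:
    # Scaling lives inside the pipeline, so each fold is scaled
    # with statistics from its own training part only
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipeline, x_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f'Best k: {best_k} (mean CV accuracy: {cv_scores[best_k]:.4f})')

# Refit on the whole training set and evaluate once on the held-out test set
best_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
best_model.fit(x_train, y_train)
test_acc = best_model.score(x_test, y_test)
print(f'Test accuracy: {test_acc * 100:.2f}%')
```

Odd values of `k` are used so that a two-class vote cannot end in a tie. The selected value may change with a different split or fold count, so small differences between neighboring values of `k` should not be over-interpreted.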