<!--NAVIGATION-->

<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/katas/algorithms/KNN_BreastCancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Kata: Diagnose Breast Tumors with K-Nearest Neighbors

In this kata, you'll use a K-Nearest Neighbors classifier to help diagnose breast tumors.

The [Breast Cancer][1] dataset is used for multivariate binary classification between benign and maligant tumors. There are 569 total samples with 30 features each. Features were computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image.

![](images/breast-cancer-logo.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Package setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

### Question

Import the needed packages.

In [3]:
# Import ML packages (edit this list if needed)
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Step 1: Loading the data

In [4]:
dataset = load_breast_cancer()

# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,class
329,16.26,21.88,107.5,826.8,0.1165,0.1283,0.1799,0.07981,0.1869,0.06532,...,113.7,975.2,0.1426,0.2116,0.3344,0.1047,0.2736,0.07953,0,malignant
408,17.99,20.66,117.8,991.7,0.1036,0.1304,0.1201,0.08824,0.1992,0.06069,...,138.1,1349.0,0.1482,0.3735,0.3301,0.1974,0.306,0.08503,0,malignant
112,14.26,19.65,97.83,629.9,0.07837,0.2233,0.3003,0.07798,0.1704,0.07769,...,107.0,709.0,0.08949,0.4193,0.6783,0.1505,0.2398,0.1082,1,benign
350,11.66,17.07,73.7,421.0,0.07561,0.0363,0.008306,0.01162,0.1671,0.05731,...,83.61,542.5,0.09958,0.06476,0.03046,0.04262,0.2731,0.06825,1,benign
51,13.64,16.34,87.21,571.8,0.07685,0.06059,0.01857,0.01723,0.1353,0.05953,...,96.08,656.7,0.1089,0.1582,0.105,0.08586,0.2346,0.08025,1,benign
357,13.87,16.21,88.52,593.7,0.08743,0.05492,0.01502,0.02088,0.1424,0.05883,...,96.74,694.4,0.1153,0.1008,0.05285,0.05556,0.2362,0.07113,1,benign
440,10.97,17.2,71.73,371.5,0.08915,0.1113,0.09457,0.03613,0.1489,0.0664,...,90.14,476.4,0.1391,0.4082,0.4779,0.1555,0.254,0.09532,1,benign
267,13.59,21.84,87.16,561.0,0.07956,0.08259,0.04072,0.02142,0.1635,0.05859,...,97.66,661.5,0.1005,0.173,0.1453,0.06189,0.2446,0.07024,1,benign
502,12.54,16.32,81.25,476.3,0.1158,0.1085,0.05928,0.03279,0.1943,0.06612,...,86.67,552.0,0.158,0.1751,0.1889,0.08411,0.3155,0.07538,1,benign
509,15.46,23.95,103.8,731.3,0.1183,0.187,0.203,0.0852,0.1807,0.07083,...,117.7,909.4,0.1732,0.4967,0.5911,0.2163,0.3013,0.1067,0,malignant


## Step 2: Preparing the data

### Question

Compute the number of features of the dataset into the `num_features` variable.

In [5]:
num_features = len(dataset.feature_names)

In [6]:
print(f'Number of features: {num_features}')

assert num_features == 30

Number of features: 30


### Question

In order to evaluate class distribution, compute the number of benign and malignant tumors into the `num_benign` and `num_malignant` variables respectively.

In [7]:
grouped = df_breast_cancer.groupby('class')
num_benign = len(grouped.get_group('benign'))
num_malignant = len(grouped.get_group('malignant'))


In [8]:
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')

assert num_benign == 357
assert num_malignant == 212

Benign count: 357. Malignant count: 212


In [9]:
# Store input and labels
x = dataset.data
y = dataset.target

print(f'x: {x.shape}. y: {y.shape}')

x: (569, 30). y: (569,)


### Question

Split the dataset into training and test sets with a 25% ratio. Use variables `x_train`, `y_train`, `x_test` and `y_test`.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25)

In [11]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (426, 30)
assert y_train.shape == (426, )
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)

x_train: (426, 30). y_train: (426,)
x_test: (143, 30). y_test: (143,)


### Question

Scale features by standardization while preventing information leakage from the test set.

In [12]:
scaler = StandardScaler().fit(x_train)

x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [13]:
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6

mean_train: -3.057891272832982e-17. std_train: 1.0


## Step 3: Creating a classifier

### Question

Create a `KNeighborsClassifier` instance using only one nearest neighbor, store it into the `model` variable, and fit the training data.

In [14]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

## Step 4: Evaluating the classifier

In [15]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 100.00%
Test accuracy: 97.20%


### Question

Display precision, recall and f1-score for the classifier on test data. Interpret the results.

In [16]:
from sklearn import metrics
y_pred = model.predict(x_test)
precision = metrics.precision_score(y_test, y_pred, average=None)
print(f'Precision: {precision}')

# Using scikit-learn's recall_score() function
recall = metrics.recall_score(y_test, y_pred, average=None)
print(f'Recall: {recall}')
f1_score = metrics.f1_score(y_test, y_pred, average=None)
print(f'f1_score: {f1_score}')

Precision: [0.96491228 0.97674419]
Recall: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]


### Question

Go back to step 3 and try to find the best value for the `k` number of nearest neighbors.

In [17]:
for i in range(8):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(x_train, y_train)
    # Compute accuracy on training and test sets
    train_acc = model.score(x_train, y_train)
    test_acc = model.score(x_test, y_test)
    print(f'-------Data for k = {i}%-------')
    print(f'Training accuracy: {train_acc * 100:.2f}%')
    print(f'Test accuracy: {test_acc * 100:.2f}%')
    y_pred = model.predict(x_test)
    precision = metrics.precision_score(y_test, y_pred, average=None)
    print(f'Precision: {precision}')
    f1_score = metrics.f1_score(y_test, y_pred, average=None)
    print(f'f1_score: {f1_score}')
    

-------Data for k = 0%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 1%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 2%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 3%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 4%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 5%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419]
f1_score: [0.96491228 0.97674419]
-------Data for k = 6%-------
Training accuracy: 100.00%
Test accuracy: 97.20%
Precision: [0.96491228 0.97674419