<!--NAVIGATION-->

<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/katas/algorithms/KNN_BreastCancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Kata: Diagnose Breast Tumors with K-Nearest Neighbors

In this kata, you'll use a K-Nearest Neighbors classifier to help diagnose breast tumors.

The [Breast Cancer][1] dataset is used for multivariate binary classification between benign and maligant tumors. There are 569 total samples with 30 features each. Features were computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image.

![](images/breast-cancer-logo.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Package setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

### Question

Import the needed packages.

In [3]:
# Import ML packages (edit this list if needed)
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Step 1: Loading the data

In [4]:
dataset = load_breast_cancer()

# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,class
427,10.8,21.98,68.79,359.9,0.08801,0.05743,0.03614,0.01404,0.2016,0.05977,...,83.69,489.5,0.1303,0.1696,0.1927,0.07485,0.2965,0.07662,1,benign
297,11.76,18.14,75.0,431.1,0.09968,0.05914,0.02685,0.03515,0.1619,0.06287,...,85.1,553.6,0.1137,0.07974,0.0612,0.0716,0.1978,0.06915,0,malignant
113,10.51,20.19,68.64,334.2,0.1122,0.1303,0.06476,0.03068,0.1922,0.07782,...,72.62,374.4,0.13,0.2049,0.1295,0.06136,0.2383,0.09026,1,benign
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0,malignant
244,19.4,23.5,129.1,1155.0,0.1027,0.1558,0.2049,0.08886,0.1978,0.06,...,144.9,1417.0,0.1463,0.2968,0.3458,0.1564,0.292,0.07614,0,malignant
452,12.0,28.23,76.77,442.5,0.08437,0.0645,0.04055,0.01945,0.1615,0.06104,...,85.07,523.7,0.1208,0.1856,0.1811,0.07116,0.2447,0.08194,1,benign
283,16.24,18.77,108.8,805.1,0.1066,0.1802,0.1948,0.09052,0.1876,0.06684,...,126.9,1031.0,0.1365,0.4706,0.5026,0.1732,0.277,0.1063,0,malignant
504,9.268,12.87,61.49,248.7,0.1634,0.2239,0.0973,0.05252,0.2378,0.09502,...,69.05,300.2,0.1902,0.3441,0.2099,0.1025,0.3038,0.1252,1,benign
267,13.59,21.84,87.16,561.0,0.07956,0.08259,0.04072,0.02142,0.1635,0.05859,...,97.66,661.5,0.1005,0.173,0.1453,0.06189,0.2446,0.07024,1,benign
507,11.06,17.12,71.25,366.5,0.1194,0.1071,0.04063,0.04268,0.1954,0.07976,...,76.08,411.1,0.1662,0.2031,0.1256,0.09514,0.278,0.1168,1,benign


## Step 2: Preparing the data

### Question

Compute the number of features of the dataset into the `num_features` variable.

In [5]:
num_features = len(dataset.feature_names)

In [6]:
print(f'Number of features: {num_features}')

assert num_features == 30

Number of features: 30


### Question

In order to evaluate class distribution, compute the number of benign and malignant tumors into the `num_benign` and `num_malignant` variables respectively.

In [7]:
grouped = df_breast_cancer.groupby('class')
num_benign = len(grouped.get_group('benign'))
num_malignant = len(grouped.get_group('malignant'))


In [8]:
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')

assert num_benign == 357
assert num_malignant == 212

Benign count: 357. Malignant count: 212


In [9]:
# Store input and labels
x = dataset.data
y = dataset.target

print(f'x: {x.shape}. y: {y.shape}')

x: (569, 30). y: (569,)


### Question

Split the dataset into training and test sets with a 25% ratio. Use variables `x_train`, `y_train`, `x_test` and `y_test`.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25)

In [11]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (426, 30)
assert y_train.shape == (426, )
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)

x_train: (426, 30). y_train: (426,)
x_test: (143, 30). y_test: (143,)


### Question

Scale features by standardization while preventing information leakage from the test set.

In [12]:
scaler = StandardScaler().fit(x_train)

x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [13]:
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6

mean_train: 2.80214036637786e-16. std_train: 1.0


## Step 3: Creating a classifier

### Question

Create a `KNeighborsClassifier` instance using only one nearest neighbor, store it into the `model` variable, and fit the training data.

In [14]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

## Step 4: Evaluating the classifier

In [15]:
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Training accuracy: 100.00%
Test accuracy: 96.50%


### Question

Display precision, recall and f1-score for the classifier on test data. Interpret the results.

In [25]:
from sklearn import metrics
y_pred = model.predict(x_test)
precision = metrics.precision_score(y_test, y_pred, average=None)
print(f'Precision: {precision}')

# Using scikit-learn's recall_score() function
recall = metrics.recall_score(y_test, y_pred, average=None)
print(f'Recall: {recall}')
f1_score = metrics.f1_score(y_test, y_pred, average=None)
print(f'f1_score: {f1_score}')

Precision: [0.95555556 0.96938776]
Recall: [0.93478261 0.97938144]
f1_score: [0.94505495 0.97435897]


### Question

Go back to step 3 and try to find the best value for the `k` number of nearest neighbors.

In [26]:
for i in range(8):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(x_train, y_train)
    # Compute accuracy on training and test sets
    train_acc = model.score(x_train, y_train)
    test_acc = model.score(x_test, y_test)
    print(f'-------Data for k = {i}%-------')
    print(f'Training accuracy: {train_acc * 100:.2f}%')
    print(f'Test accuracy: {test_acc * 100:.2f}%')
    y_pred = model.predict(x_test)
    precision = metrics.precision_score(y_test, y_pred, average=None)
    print(f'Precision: {precision}')
    f1_score = metrics.f1_score(y_test, y_pred, average=None)
    print(f'f1_score: {f1_score}')
    

-------Data for k = 0%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 1%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 2%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 3%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 4%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 5%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776]
f1_score: [0.94505495 0.97435897]
-------Data for k = 6%-------
Training accuracy: 100.00%
Test accuracy: 96.50%
Precision: [0.95555556 0.96938776