
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/classic-datasets/Breast_Cancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


# Kata: Breast Cancer Dataset

| Learning type | Activity type | Objective |
| - | - | - |
| Supervised | Binary classification | Predict if a tumor is benign or malignant |


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Complete the cells beginning with `# YOUR CODE HERE` and run the subsequent cells to check your code.

## About the dataset

The [Breast Cancer][1] dataset is used for multivariate binary classification. There are 569 total samples with 30 features each. Features were computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image.

![](images/breast-cancer-logo.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
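
The sample and feature counts quoted above can be checked directly against the copy of the dataset bundled with scikit-learn. The snippet below is a minimal sketch; the variable names `n_samples`, `n_features` and `class_counts` are chosen here for illustration only.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# The data matrix holds one row per sample and one column per feature
n_samples, n_features = dataset.data.shape
print(n_samples, n_features)

# Map each integer target to its class name and count the occurrences
class_counts = Counter(str(dataset.target_names[t]) for t in dataset.target)
print(dict(class_counts))
```

Note that scikit-learn encodes `malignant` as 0 and `benign` as 1, which is worth keeping in mind when labelling samples later on.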

## Package setup

In [11]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import keras.optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
#from nose.tools import assert_equal, assert_true

# Display plots inline and render figures at retina resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Set Seaborn aesthetic parameters to defaults
sns.set()

In [38]:
import numpy as np
from collections import Counter

class KNN:
    """k-nearest-neighbours predictor using the Minkowski distance of order p."""

    def __init__(self, k=1):
        self.k = k
        self.p = 2  # Euclidean distance by default

    def _minkowski_dist(self, v1, v2):
        return sum(abs(e1 - e2) ** self.p for e1, e2 in zip(v1, v2)) ** (1 / self.p)

    def fit(self, X, labels, p=2):
        # KNN is a lazy learner: fitting only stores the training data
        self.p = p
        self.X = X
        self.Y = labels
        return self

    def predict(self, query, method='mean'):
        # 'mean' averages the neighbours' labels, 'mode' takes a majority vote
        assert method in ['mean', 'mode']
        dist_label = []
        for i in range(len(self.X)):
            dist_label.append((self._minkowski_dist(query, self.X[i]), self.Y[i]))

        # Labels of the k training points closest to the query
        neighbours = [x[1] for x in sorted(dist_label, key=lambda x: x[0])[:self.k]]

        if method == 'mean':
            return np.average(neighbours, axis=0)
        elif method == 'mode':
            count = Counter(neighbours)
            return max(count, key=count.get)
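
The same nearest-neighbour vote can be written with vectorized NumPy operations instead of a Python loop over the training set. The following is a sketch under stated assumptions, not a replacement for the class above: `knn_predict`, `X_toy` and `y_toy` are hypothetical names introduced for illustration, and the toy points are made up.

```python
import numpy as np


def knn_predict(X_train, y_train, query, k=3, p=2):
    """Majority vote among the k training points closest to `query`."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    query = np.asarray(query, dtype=float)
    # Minkowski distance of order p from the query to every training row
    dists = np.sum(np.abs(X_train - query) ** p, axis=1) ** (1 / p)
    # Indices of the k smallest distances (stable sort keeps ties in input order)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Most frequent label among the neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# Two small, well-separated clusters (illustrative data)
X_toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9], [0.9, 1.0]]
y_toy = [0, 0, 1, 1, 1]
print(knn_predict(X_toy, y_toy, [0.05, 0.0], k=1))
print(knn_predict(X_toy, y_toy, [1.0, 0.95], k=3))
```

One difference in behaviour: when two labels are equally frequent, this sketch returns the smaller label, whereas the `Counter`-based version returns the label it encountered first.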


## Utility functions

In [37]:
def plot_loss_acc(history):
    """Plot training and (optionally) validation loss and accuracy"""

    loss = history.history['loss']
    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(10, 10))

    plt.subplot(2, 1, 1)
    plt.plot(epochs, loss, '.--', label='Training loss')
    final_loss = loss[-1]
    title = 'Training loss: {:.4f}'.format(final_loss)
    plt.ylabel('Loss')
    if 'val_loss' in history.history:
        val_loss = history.history['val_loss']
        plt.plot(epochs, val_loss, 'o-', label='Validation loss')
        final_val_loss = val_loss[-1]
        title += ', Validation loss: {:.4f}'.format(final_val_loss)
    plt.title(title)
    plt.legend()

    acc = history.history['acc']

    plt.subplot(2, 1, 2)
    plt.plot(epochs, acc, '.--', label='Training acc')
    final_acc = acc[-1]
    title = 'Training accuracy: {:.2f}%'.format(final_acc * 100)
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    if 'val_acc' in history.history:
        val_acc = history.history['val_acc']
        plt.plot(epochs, val_acc, 'o-', label='Validation acc')
        final_val_acc = val_acc[-1]
        title += ', Validation accuracy: {:.2f}%'.format(final_val_acc * 100)
    plt.title(title)
    plt.legend()
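
`plot_loss_acc` reads the `acc` and `val_acc` keys, which are only present when the model is compiled with a metric named `acc`. Compiling with `metrics=['accuracy']` stores the values under `accuracy` and `val_accuracy` instead. A small lookup helper can tolerate both spellings; `pick_metric` and the two example dictionaries below are hypothetical and only illustrate the idea.

```python
def pick_metric(history_dict, candidates=("acc", "accuracy")):
    """Return the first metric series found under any of the candidate keys."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError("none of {} found in history".format(candidates))


# Made-up history dictionaries, shaped like two different compile() setups
old_style = {"loss": [0.6, 0.4], "acc": [0.70, 0.85]}
new_style = {"loss": [0.6, 0.4], "accuracy": [0.70, 0.85]}
print(pick_metric(old_style)[-1])
print(pick_metric(new_style)[-1])
```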

## Step 1: Loading the data

### Question

* Load the Breast Cancer dataset included with scikit-learn. Store it in a variable named `dataset`. 
* Display 10 random samples with feature names, target and class.

In [36]:
# YOUR CODE HERE
dataset = load_breast_cancer()

df = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
df['target'] = dataset['target']
# In scikit-learn's encoding, target 0 is 'malignant' and target 1 is 'benign'
df['class'] = dataset['target_names'][dataset['target']]
df

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target | class |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 | malignant |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 | malignant |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 | malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 | malignant |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 | malignant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | 0 | malignant |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 | malignant |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | 0 | malignant |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | 0 | malignant |
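
The question asks for 10 random samples, whereas the cell above displays the whole frame. One way to draw them is `DataFrame.sample`. This is a minimal sketch: the `random_state` value and the subset of columns printed are arbitrary choices made here for readability.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target
# target_names is ['malignant', 'benign'], so 0 maps to malignant, 1 to benign
df["class"] = dataset.target_names[dataset.target]

# Draw 10 rows at random; random_state only makes the draw repeatable
sample = df.sample(n=10, random_state=0)
print(sample[["mean radius", "mean texture", "target", "class"]])
```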


## Step 2: Using a simple model

In [39]:
# Use scikit-learn's built-in logistic regression classifier as a baseline
# (the accuracy below is measured on the training data itself)
model = LogisticRegression()
model.fit(dataset.data, dataset.target)
accuracy = model.score(dataset.data, dataset.target)
print('Accuracy: {:.2f}%'.format(accuracy * 100))

Accuracy: 94.55%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
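
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the second option, assuming a `StandardScaler` placed in front of the classifier in a pipeline, looks like this. The names `scaled_model` and `scaled_accuracy` are introduced here for illustration, and the resulting accuracy is not quoted because it was not part of the original run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()

# Standardize each feature to zero mean and unit variance before fitting
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(dataset.data, dataset.target)

# Still measured on the training data, like the cell above
scaled_accuracy = scaled_model.score(dataset.data, dataset.target)
print('Accuracy: {:.2f}%'.format(scaled_accuracy * 100))
```

Wrapping the scaler and the classifier in one pipeline also ensures the same scaling is applied whenever the model is asked to predict.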


In [43]:
def evaluate(y_hat, y_true):
    points = 0
    for pred, true in zip(y_hat, y_true):
        if pred == true:
            points += 1
    return points / len(y_hat)
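
The loop in `evaluate` computes the fraction of matching predictions. An equivalent NumPy formulation is sketched below; `evaluate_np` is a hypothetical name, and the labels are made up for the example.

```python
import numpy as np


def evaluate_np(y_hat, y_true):
    """Fraction of predictions equal to the true labels."""
    return float(np.mean(np.asarray(y_hat) == np.asarray(y_true)))


# Three of the four predictions match the true labels
print(evaluate_np([1, 0, 1, 1], [1, 0, 0, 1]))
```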

In [99]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, 
                                                    stratify = dataset.target, test_size = 0.2,
                                                    random_state=4)

model = KNN(k = 3)
model.fit(X_train, y_train)
outcomes = []
for i in range(len(X_test)):
    outcomes.append(model.predict(X_test[i], method='mode'))
accuracy = evaluate(outcomes, y_test)
print('Accuracy: {:.2f}%'.format(accuracy * 100))

Accuracy: 94.74%
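
As a sanity check on the hand-written `KNN` class, the same split can be fed to scikit-learn's `KNeighborsClassifier`. This is only a sketch: `reference` and `reference_accuracy` are names chosen here, and small differences from the result above are possible because the two implementations may break distance ties differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target,
    stratify=dataset.target, test_size=0.2, random_state=4)

# p=2 selects the Euclidean case of the Minkowski metric
reference = KNeighborsClassifier(n_neighbors=3, p=2)
reference.fit(X_train, y_train)

# Unlike the logistic regression cell, this is measured on held-out data
reference_accuracy = reference.score(X_test, y_test)
print('Accuracy: {:.2f}%'.format(reference_accuracy * 100))
```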


## Step 3: Training a neural network model

### Question

Train a model on the data to obtain a training accuracy > 85%. Store the training history in a variable named `history`.
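
One possible starting point is sketched below. It is not the reference solution: the layer sizes, optimizer, epoch count, batch size and the feature standardization step are all assumptions made for illustration. The metric is named `acc` so that the history keys match what `plot_loss_acc` and the assertion cell expect.

```python
from keras.layers import Dense, Input
from keras.models import Sequential
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()

# Standardize the 30 features; unscaled inputs tend to slow training down
x_train = StandardScaler().fit_transform(dataset.data)
# Column vector of 0/1 labels to match the single sigmoid output
y_train = dataset.target.reshape(-1, 1).astype('float32')

# A small fully connected network (sizes are illustrative assumptions)
model = Sequential([
    Input(shape=(30,)),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),
])
# Naming the metric 'acc' stores it under history.history['acc']
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

history = model.fit(x_train, y_train, epochs=30, batch_size=32, verbose=0)
print('Training accuracy: {:.2f}%'.format(history.history['acc'][-1] * 100))
```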

In [None]:
# YOUR CODE HERE

In [None]:
plot_loss_acc(history)

In [None]:
# Retrieve final accuracy
final_acc = history.history['acc'][-1]
# Assert final accuracy
assert final_acc > 0.85

## Step 4: Beating the simple model

### Question

Optimize training to beat the simple model by achieving a training accuracy > 96%.

In [None]:
# YOUR CODE HERE

In [None]:
# Plot training history
plot_loss_acc(history)

In [None]:
# Retrieve final accuracy
final_acc = history.history['acc'][-1]
# Assert final accuracy
assert final_acc > 0.96