> This is a self-correcting activity generated by [nbgrader](https://nbgrader.readthedocs.io). Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Classify handwritten digits

In this activity, you'll try several classifiers on the [UCI handwritten digits dataset](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), first separately and then combined into an ensemble.

![UCI digits](images/uci_digits.png)

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

In [3]:
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

scikit-learn version: 0.23.2


## Step 1: Loading and preparing the data

In [4]:
# Load the UCI handwritten digits dataset (8x8 images) bundled with scikit-learn
digits = load_digits()

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

print(f'digits.images: {digits.images.shape}. targets: {digits.target.shape}')
print(f'data: {data.shape}. targets: {digits.target.shape}')

digits.images: (1797, 8, 8). targets: (1797,)
data: (1797, 64). targets: (1797,)


### Question

Split the data into training, validation and test sets.

In [8]:
# YOUR CODE HERE
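# Hold out 20% of the samples (360) as the test set
train_images, test_images, train_labels, test_labels = train_test_split(data, digits.target, test_size=0.2)

# Scale pixel values from [0, 16] to [0, 1]. The first 1237 remaining
# samples are used for training and the last 200 for validation.
x_train = train_images[:1237].astype("float32") / 16.
x_test = test_images.astype("float32") / 16.
x_val = train_images[1237:].astype("float32") / 16.

y_train = train_labels[:1237]
y_test = test_labels
y_val = train_labels[1237:]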

In [9]:
print(f'x_train: {x_train.shape}. x_val: {x_val.shape}. x_test: {x_test.shape}')
print(f'y_train: {y_train.shape}. y_val: {y_val.shape}. y_test: {y_test.shape}')

assert x_train.shape == (1237, 64)
assert x_val.shape == (200, 64)
assert x_test.shape == (360, 64)
assert y_train.shape == (1237,)
assert y_val.shape == (200,)
assert y_test.shape == (360,)

x_train: (1237, 64). x_val: (200, 64). x_test: (360, 64)
y_train: (1237,). y_val: (200,). y_test: (360,)
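
The same 1237/200/360 split can also be obtained with two calls to `train_test_split`, which avoids hard-coded slice indices. The sketch below is one possible variant, not the activity's reference solution. The `random_state=42` seed and the `stratify` arguments are assumptions added here for reproducibility and class balance.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
# Flatten the 8x8 images and scale pixel values from [0, 16] to [0, 1]
data = digits.images.reshape((len(digits.images), -1)).astype("float32") / 16.0

# First set aside 360 test samples, then 200 validation samples.
# random_state and stratify are illustrative choices, not part of the activity.
x_rest, x_test, y_rest, y_test = train_test_split(
    data, digits.target, test_size=360, stratify=digits.target, random_state=42
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=200, stratify=y_rest, random_state=42
)

print(x_train.shape, x_val.shape, x_test.shape)
```

Passing an integer as `test_size` requests an absolute number of samples, so the shapes match the assertions above regardless of rounding.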


## Step 2: Train several models

### Question

Create and train various models, such as a linear classifier, a multilayer perceptron, a random forest...

In [10]:
# YOUR CODE HERE
sgd_model = SGDClassifier()
sgd_model.fit(x_train, y_train)

mlp_model = MLPClassifier(max_iter=500)
mlp_model.fit(x_train, y_train)

rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

RandomForestClassifier()

### Question

Display the default score for each model.

In [12]:
# YOUR CODE HERE
print('SGD Linear Classifier scores:')
print(f'Training score: {sgd_model.score(x_train, y_train)*100:.2f}%')
print(f'Validation score: {sgd_model.score(x_val, y_val)*100:.2f}%')
print(f'Test score: {sgd_model.score(x_test, y_test)*100:.2f}%')

print('\nMultiLayer Perceptron Classifier scores:')
print(f'Training score: {mlp_model.score(x_train, y_train)*100:.2f}%')
print(f'Validation score: {mlp_model.score(x_val, y_val)*100:.2f}%')
print(f'Test score: {mlp_model.score(x_test, y_test)*100:.2f}%')

print('\nRandomForest Classifier scores:')
print(f'Training score: {rf_model.score(x_train, y_train)*100:.2f}%')
print(f'Validation score: {rf_model.score(x_val, y_val)*100:.2f}%')
print(f'Test score: {rf_model.score(x_test, y_test)*100:.2f}%')

SGD Linear Classifier scores:
Training score: 95.47%
Validation score: 91.00%
Test score: 92.22%

MultiLayer Perceptron Classifier scores:
Training score: 100.00%
Validation score: 99.00%
Test score: 96.39%

RandomForest Classifier scores:
Training score: 100.00%
Validation score: 98.00%
Test score: 96.67%
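
Printing three scores per model gets repetitive as the number of models grows. As a minimal sketch, the scores could instead be collected into a pandas table. The helper name `score_table` and the small stand-in models below are hypothetical and only serve to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def score_table(models, splits):
    """Return a DataFrame of accuracies: one row per model, one column per split."""
    rows = {
        name: {split: model.score(x, y) for split, (x, y) in splits.items()}
        for name, model in models.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Stand-in data and models so that the sketch runs on its own
digits = load_digits()
x, y = digits.data / 16.0, digits.target
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)

demo_models = {
    "SGD": SGDClassifier(random_state=0).fit(x_tr, y_tr),
    "RandomForest": RandomForestClassifier(random_state=0).fit(x_tr, y_tr),
}
demo_scores = score_table(demo_models, {"train": (x_tr, y_tr), "test": (x_te, y_te)})

# Show accuracies as percentages
print((demo_scores * 100).round(2))
```

In the notebook itself, the same helper could be called with the already fitted `sgd_model`, `mlp_model` and `rf_model` and the three data splits.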


### Question

Create a `VotingClassifier` including all your models. Fit it on the training data.

In [43]:
# YOUR CODE HERE
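# With loss='log', SGDClassifier exposes predict_proba, which soft voting requires.
# VotingClassifier fits clones of the listed estimators, so the models trained
# earlier are left unchanged.
estimators = [("SGD Linear", SGDClassifier(loss='log')), ("MLP", mlp_model), ("RandomForest", rf_model)]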
vot_model = VotingClassifier(estimators)
vot_model.fit(x_train, y_train)

VotingClassifier(estimators=[('SGD Linear', SGDClassifier(loss='log')),
                             ('MLP', MLPClassifier(max_iter=500)),
                             ('RandomForest', RandomForestClassifier())])

### Question

Show the `VotingClassifier` score and compare it to each model's individual score.

In [44]:
# YOUR CODE HERE
print('Voting Classifier scores:')
print(f'Training score: {vot_model.score(x_train, y_train)*100:.2f}%')
print(f'Validation score: {vot_model.score(x_val, y_val)*100:.2f}%')
print(f'Test score: {vot_model.score(x_test, y_test)*100:.2f}%')

Voting Classifier scores:
Training score: 100.00%
Validation score: 98.50%
Test score: 98.06%
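
By default, `VotingClassifier` uses hard voting: each estimator predicts a label and the most frequent label wins. The NumPy sketch below illustrates that majority rule on made-up predictions. The function name `hard_vote` and the sample values are hypothetical, and the code is not scikit-learn's actual implementation.

```python
import numpy as np


def hard_vote(predictions):
    """Majority vote over an array of shape (n_models, n_samples).

    Labels are assumed to be non-negative integers, as with the digits 0-9.
    """
    predictions = np.asarray(predictions)
    # For each sample (column), count the votes per label and keep the most
    # frequent one; np.argmax breaks ties in favor of the smallest label.
    return np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions
    )


# Three models (rows) predicting three samples (columns)
preds = np.array([
    [3, 8, 1],
    [3, 9, 1],
    [5, 9, 7],
])
print(hard_vote(preds))  # [3 9 1]
```

A single model's mistake is outvoted whenever the other two agree, which is one plausible reason an ensemble can score higher than its members.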


### Question

Show the score for a soft voting classifier.

In [45]:
# YOUR CODE HERE
soft_vot_model = VotingClassifier(estimators, voting='soft')
soft_vot_model.fit(x_train, y_train)

VotingClassifier(estimators=[('SGD Linear', SGDClassifier(loss='log')),
                             ('MLP', MLPClassifier(max_iter=500)),
                             ('RandomForest', RandomForestClassifier())],
                 voting='soft')

### Question

Compute the soft `VotingClassifier` score on the test data. Compare it to each model's individual score.

In [46]:
# YOUR CODE HERE
print('Soft Voting Classifier scores:')
print(f'Training score: {soft_vot_model.score(x_train, y_train)*100:.2f}%')
print(f'Validation score: {soft_vot_model.score(x_val, y_val)*100:.2f}%')
print(f'Test score: {soft_vot_model.score(x_test, y_test)*100:.2f}%')

Soft Voting Classifier scores:
Training score: 100.00%
Validation score: 98.00%
Test score: 98.06%
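
Soft voting averages the class probabilities returned by each estimator's `predict_proba` and picks the class with the highest mean, so a confident model can outweigh two hesitant ones. The sketch below shows the idea in plain NumPy with equal weights. The function name `soft_vote` and the probabilities are made up for illustration, and this is not scikit-learn's own code.

```python
import numpy as np


def soft_vote(probas, classes):
    """Average probabilities of shape (n_models, n_samples, n_classes), then argmax."""
    mean_proba = np.asarray(probas).mean(axis=0)  # shape (n_samples, n_classes)
    return np.asarray(classes)[mean_proba.argmax(axis=1)]


# Three models, one sample, two classes (illustrative values)
probas = np.array([
    [[0.55, 0.45]],  # model A leans slightly toward class 0
    [[0.60, 0.40]],  # model B leans slightly toward class 0
    [[0.05, 0.95]],  # model C is very confident in class 1
])
print(soft_vote(probas, classes=[0, 1]))  # [1]
```

A hard vote on the same predictions would return class 0 (two votes against one), whereas the averaged probabilities are 0.4 and 0.6, so soft voting returns class 1.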
