# Classify handwritten digits

In this activity, you'll try several classifiers on the [UCI handwritten digits dataset](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), first separately and then combined into an ensemble.

![UCI digits](images/uci_digits.png)

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Set up plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

In [3]:
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

scikit-learn version: 0.22.2.post1


## Step 1: Loading and preparing the data

In [4]:
# Load the UCI handwritten digits dataset (8x8 images bundled with scikit-learn)
digits = load_digits()

# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

print(f'data: {data.shape}. targets: {digits.target.shape}')

data: (1797, 64). targets: (1797,)


### Question

Split the data into training, validation and test sets.

In [5]:
# BEGIN SOLUTION CODE
x_train, x_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.2)

# Set apart the first 200 images as validation data
x_val, x_train = x_train[:200], x_train[200:]
y_val, y_train = y_train[:200], y_train[200:]
# END SOLUTION CODE

In [6]:
print(f'x_train: {x_train.shape}. x_val: {x_val.shape}. x_test: {x_test.shape}')
print(f'y_train: {y_train.shape}. y_val: {y_val.shape}. y_test: {y_test.shape}')

assert x_train.shape == (1237, 64)
assert x_val.shape == (200, 64)
assert x_test.shape == (360, 64)
assert y_train.shape == (1237,)
assert y_val.shape == (200,)
assert y_test.shape == (360,)

x_train: (1237, 64). x_val: (200, 64). x_test: (360, 64)
y_train: (1237,). y_val: (200,). y_test: (360,)
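
The solution above shuffles without a fixed seed, so the three sets change from run to run. As a variation, here is a minimal sketch that obtains the same set sizes with two calls to `train_test_split`, assuming you want a reproducible split (`random_state=42`) that preserves class proportions (`stratify`). The name `x_rest` is only illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Hold out 20% of the samples as test data
x_rest, x_test, y_rest, y_test = train_test_split(
    data, digits.target, test_size=0.2,
    stratify=digits.target, random_state=42)

# Carve 200 validation samples out of the remaining data
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=200,
    stratify=y_rest, random_state=42)

print(f'x_train: {x_train.shape}. x_val: {x_val.shape}. x_test: {x_test.shape}')
```

Passing an integer as `test_size` requests an absolute number of samples rather than a fraction, which is why the validation set contains exactly 200 images.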


## Step 2: Train several models

### Question

Create and train various models, such as a linear classifier, a multilayer perceptron, a random forest...

In [7]:
# BEGIN SOLUTION CODE
sgd_clf = SGDClassifier(loss='log', random_state=42)
mlp_clf = MLPClassifier(random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

models = [sgd_clf, mlp_clf, rf_clf]

for model in models:
    print("Training the", model)
    model.fit(x_train, y_train)
# END SOLUTION CODE

Training the SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=42, shuffle=True, tol=0.001, validation_fraction=0.1,
              verbose=0, warm_start=False)
Training the MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=42, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
Training the RandomForestClass

### Question

Display the default score (mean accuracy) of each model on the validation data.

In [8]:
# BEGIN SOLUTION CODE
[model.score(x_val, y_val) for model in models]
# END SOLUTION CODE

[0.985, 0.985, 0.975]
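
For scikit-learn classifiers, the default `score` method returns the mean accuracy on the given data. The sketch below checks this on a small random forest, which is trained here only for illustration and is not one of the models above; the names `x_fit` and `x_eval` are likewise illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_fit, x_eval, y_fit, y_eval = train_test_split(
    x, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(x_fit, y_fit)

# Three ways to obtain the same number
default_score = clf.score(x_eval, y_eval)
metric_score = accuracy_score(y_eval, clf.predict(x_eval))
manual_accuracy = np.mean(clf.predict(x_eval) == y_eval)

print(default_score, metric_score, manual_accuracy)
```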

### Question

Create a `VotingClassifier` including all your models. Fit it on the training data.

In [9]:
# BEGIN SOLUTION CODE
named_models = [
    ("sgd_clf", sgd_clf),
    ("mlp_clf", mlp_clf),
    ("rf_clf", rf_clf),
]

voting_clf = VotingClassifier(named_models)
voting_clf.fit(x_train, y_train)
# END SOLUTION CODE

VotingClassifier(estimators=[('sgd_clf',
                              SGDClassifier(alpha=0.0001, average=False,
                                            class_weight=None,
                                            early_stopping=False, epsilon=0.1,
                                            eta0=0.0, fit_intercept=True,
                                            l1_ratio=0.15,
                                            learning_rate='optimal', loss='log',
                                            max_iter=1000, n_iter_no_change=5,
                                            n_jobs=None, penalty='l2',
                                            power_t=0.5, random_state=42,
                                            shuffle=True, tol=0.001,
                                            validation_fraction=0.1, verbose=...
                                                     criterion='gini',
                                                     max_depth=None,
                

### Question

Show the `VotingClassifier` score on the validation data and compare it to each model's individual score.

In [10]:
# BEGIN SOLUTION CODE
voting_clf.score(x_val, y_val)
# END SOLUTION CODE

0.985

### Question

Show the score for a soft voting classifier.

In [11]:
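# BEGIN SOLUTION CODE
# Switch the already fitted ensemble to soft voting: predictions now
# average the class probabilities predicted by each estimator, so
# every estimator must implement predict_proba()
voting_clf.voting = "soft"
voting_clf.score(x_val, y_val)
# END SOLUTION CODE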

0.99
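
To see why hard and soft voting can disagree, consider a toy case with three classifiers and two classes. The probabilities below are made up for illustration and do not come from the models above.

```python
import numpy as np

# One row per classifier, one column per class (hypothetical values)
probas = np.array([
    [0.45, 0.55],  # classifier A slightly prefers class 1
    [0.40, 0.60],  # classifier B slightly prefers class 1
    [0.95, 0.05],  # classifier C strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print(f'hard: {hard_prediction}. soft: {soft_prediction}')
```

Hard voting picks class 1 by two votes to one, while soft voting picks class 0 because the single confident classifier outweighs the two hesitant ones once probabilities are averaged.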

### Question

Compute the `VotingClassifier` score on the test data. Compare it to each model's individual score.

In [12]:
# BEGIN SOLUTION CODE
print(voting_clf.score(x_test, y_test))
[estimator.score(x_test, y_test) for estimator in voting_clf.estimators_]
# END SOLUTION CODE

0.9888888888888889


[0.9555555555555556, 0.9833333333333333, 0.9777777777777777]
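
A bare list of scores is hard to read because it does not say which model produced which value. The sketch below labels each score using the `named_estimators_` attribute of a fitted `VotingClassifier`. To keep it quick and self-contained it swaps in three lightweight models (naive Bayes, a decision tree, a small random forest) in place of the ones trained above, so its numbers are not comparable with the results shown here.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

x, y = load_digits(return_X_y=True)
x_fit, x_eval, y_fit, y_eval = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Illustrative stand-ins for the models used in this activity
ensemble = VotingClassifier([
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=42)),
], voting="soft")
ensemble.fit(x_fit, y_fit)

# The fitted sub-estimators were trained on label-encoded targets;
# for the digits 0-9 these coincide with the original labels
scores = {name: estimator.score(x_eval, y_eval)
          for name, estimator in ensemble.named_estimators_.items()}
scores["ensemble"] = ensemble.score(x_eval, y_eval)

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```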