*This notebook contains an excerpt (edited/augmented by Vishal Patel) from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas*

### Loading and visualizing the digits data

We'll use Scikit-Learn's data access interface and take a look at this data:

In [None]:
from sklearn.datasets import load_digits

# Load data
digits = load_digits()

# Dimensions (shape) of the data
digits.images.shape

The images data is a three-dimensional array: 1,797 samples, each consisting of an 8 × 8 grid of pixels.
Let's inspect the raw values of a couple of images, then visualize the first hundred:

In [None]:
# View the first two records

digits.images[:2]

In [None]:
# View the first image

import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(digits.images[0], cmap='binary', interpolation='nearest')

In [None]:
# Target value for the first image

digits.target[0]

In [None]:
# View the second image

plt.imshow(digits.images[1], cmap='binary', interpolation='nearest')

In [None]:
# Target value for the second image

digits.target[1]

In [None]:
# Create a 10x10 grid (subplots) to view the first 100 images/digits

fig, axes = plt.subplots(10, 10, figsize=(12, 12))

In [None]:
# Remove the axis tick marks (and consequently, labels)

fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]})

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]})

# Iterate through the subplots and use imshow() to plot images
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i])

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]})

# Change the color-map to a black and white spectrum
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]})

# Include the target value (digit) for each image
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]), color='green')

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]})

# Move the target value (text) to the bottom-left corner by positioning it in axes coordinates
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]), color='green',
            transform=ax.transAxes)

In order to work with this data within Scikit-Learn, we need a two-dimensional, ``[n_samples, n_features]`` representation.
We can accomplish this by treating each pixel in the image as a feature: that is, by flattening out the pixel arrays so that we have a length-64 array of pixel values representing each digit.
Additionally, we need the target array, which gives the previously determined label for each digit.
These two quantities are built into the digits dataset under the ``data`` and ``target`` attributes, respectively:

In [None]:
# Grab all data (1797 records, and 8x8=64 columns)

X = digits.data
X.shape

In [None]:
# Grab the target (true) value for each image

y = digits.target
y.shape

We see here that there are 1,797 samples and 64 features.

In [None]:
# View the first two records

X[:2]

In [None]:
# View the first 20 target values

y[:20]
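
The relationship between `digits.images` and `digits.data` described above can be checked directly. The following is a minimal sketch, assuming only that `data` is the row-by-row flattening of `images`; the names `flat` and `restored` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image into a length-64 row; -1 lets NumPy infer the 64
flat = digits.images.reshape(len(digits.images), -1)

# The flattened images should match the ready-made feature matrix
same_as_data = np.array_equal(flat, digits.data)

# Reshaping back recovers the original image grid
restored = digits.data.reshape(-1, 8, 8)
same_as_images = np.array_equal(restored, digits.images)

print(flat.shape, same_as_data, same_as_images)
```

The same `reshape(-1, 8, 8)` call is what later turns rows of the test set back into images for plotting.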

### Classification on digits

Let's apply a classification algorithm to the digits.
We will split the data into a training set and a testing set, and fit a decision tree model:

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=314)

In [None]:
# The default is 75% train and 25% test
len(Xtrain), len(Xtest)
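
The split proportions can also be written out explicitly. This is a minimal sketch, assuming `test_size=0.25` (the documented default when neither size is given); passing it by hand is an illustration, not something the original cell does:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# Same split as above, with the 25% test fraction written out
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=314)

# Every sample lands in exactly one of the two sets
n_train, n_test = len(Xtrain), len(Xtest)
print(n_train, n_test, n_train + n_test == len(X))
```

Changing `test_size` trades training data for a larger, less noisy test estimate.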

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Define the model (object)
clf = DecisionTreeClassifier()

# Fit (train) the model
clf.fit(Xtrain, ytrain)

# Make predictions on the test data
y_model = clf.predict(Xtest)
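
Because `DecisionTreeClassifier()` is created without a `random_state`, the fitted tree, and hence its accuracy, may differ slightly from run to run. Below is a minimal sketch of pinning the seed; the value `0` is an arbitrary choice for illustration, and `pred_a` and `pred_b` are illustrative names:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=314)

# Two trees built with the same seed (0 is an arbitrary illustrative value)
pred_a = DecisionTreeClassifier(random_state=0).fit(Xtrain, ytrain).predict(Xtest)
pred_b = DecisionTreeClassifier(random_state=0).fit(Xtrain, ytrain).predict(Xtest)

# With the seed fixed, repeated fits give identical predictions
reproducible = np.array_equal(pred_a, pred_b)
print(reproducible)
```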

Now that we have generated predictions using our model, we can gauge its accuracy by comparing the true values of the test set to the predictions:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

With even this extremely simple model, we find about 80% accuracy for classification of the digits!
However, this single number doesn't tell us *where* we've gone wrong. One nice way to find out is to use the *confusion matrix*, which we can compute with Scikit-Learn and plot with Seaborn:

In [None]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(ytest, y_model)
mat
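
To make the layout of the matrix concrete, here is a small sketch with made-up labels (`true_labels` and `pred_labels` are illustrative, not taken from the digits data). In Scikit-Learn's convention, rows index the true class and columns the predicted class, so the diagonal counts the correct predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

true_labels = np.array([0, 0, 1, 1, 2, 2])
pred_labels = np.array([0, 1, 1, 1, 2, 0])

# toy_mat[i, j] = number of samples of true class i predicted as class j
toy_mat = confusion_matrix(true_labels, pred_labels)

# Accuracy is the diagonal (correct) count over the total count
acc_from_mat = np.trace(toy_mat) / toy_mat.sum()
acc = accuracy_score(true_labels, pred_labels)

print(toy_mat)
print(acc_from_mat, acc)
```

The two accuracy values agree, which shows that the single accuracy number is a summary of the matrix diagonal.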

In [None]:
# Plot the confusion matrix

import seaborn as sns
plt.figure(figsize=(9, 9))

sns.heatmap(mat, annot=True, cbar=False, cmap="YlGnBu")
plt.xlabel('Predicted value', fontsize=14)
plt.ylabel('True value', fontsize=14);

In [None]:
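# Normalize each row so its entries sum to 1 (fraction of each true digit);
# keepdims=True keeps the row sums as a column so the division is row-wise
mat_pctg = mat / mat.sum(axis=1, keepdims=True)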

plt.figure(figsize=(9, 9))

sns.heatmap(mat_pctg, annot=True, cbar=False, cmap="YlGnBu", fmt='0.0%')
plt.xlabel('Predicted value', fontsize=14)
plt.ylabel('True value', fontsize=14);
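
Row-wise normalization depends on how NumPy broadcasts the divisor, so the shape of the row sums matters. A minimal sketch with a made-up 2×2 matrix (`toy` is illustrative) contrasts the two forms:

```python
import numpy as np

toy = np.array([[8, 2],
                [1, 4]])

# keepdims=True keeps the sums as a column, shape (2, 1),
# so each row is divided by its own total
by_row = toy / toy.sum(axis=1, keepdims=True)

# Without keepdims the sums have shape (2,) and broadcast across columns,
# dividing entry [i, j] by the total of row j instead
mixed = toy / toy.sum(axis=1)

print(by_row.sum(axis=1))  # each row sums to 1
print(mixed.sum(axis=1))   # rows generally do not sum to 1
```

With the row-wise form, each cell of the heatmap reads as the share of a given true digit that received each predicted label.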

The confusion matrix shows us where the mislabeled points tend to be: for example, a large number of twos here are misclassified as either ones or eights.
Another way to gain intuition into the characteristics of the model is to plot the inputs again, with their predicted labels.
We'll use green for correct labels, and red for incorrect labels:

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

test_images = Xtest.reshape(-1, 8, 8)

for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')

Examining this subset of the data, we can gain insight into where the algorithm might not be performing optimally.
To go beyond our 80% classification rate, we might move to a more sophisticated algorithm such as support vector machines (see [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)), random forests (see [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb)) or another classification approach.
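
To examine the mistakes directly rather than scanning the grid by eye, one could collect the positions of the misclassified samples. The sketch below uses made-up labels standing in for `ytest` and `y_model`; the names `wrong` and `pairs` are illustrative:

```python
import numpy as np

# Made-up labels standing in for ytest and y_model
true_labels = np.array([3, 8, 2, 2, 7, 1])
pred_labels = np.array([3, 8, 8, 2, 1, 1])

# Positions where the prediction disagrees with the true label
wrong = np.flatnonzero(true_labels != pred_labels)

# (true, predicted) pairs for each mistake
pairs = list(zip(true_labels[wrong].tolist(), pred_labels[wrong].tolist()))

print(wrong, pairs)
```

Indexing the reshaped test images with `wrong` would then show only the misclassified digits.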

### Exercise: Build a model using Random Forest and plot the confusion matrix

In [None]:
### INSERT CODE HERE ###

Confusion matrix:

In [None]:
### INSERT CODE HERE ###

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(12, 12),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

test_images = Xtest.reshape(-1, 8, 8)

for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')

## Manual Test

In [None]:
import imageio as img
from skimage.transform import resize
import warnings

warnings.filterwarnings("ignore")

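# read the sample image (imageio can read from a URL; recent Matplotlib
# versions no longer accept URLs in plt.imread)
sample = img.imread("https://www.dropbox.com/s/t4k1ncufnqtcf0c/six.png?dl=1")

# convert the image into the desired shape/format:
# shrink to 8x8, keep the first channel, flatten to one row of 64 features
sample = resize(sample, (8, 8))
sample = sample[:, :, 0]
sample = sample.reshape(1, -1)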

warnings.filterwarnings("default")

# predict the digit using the model
clf.predict(sample)
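
One caveat for manual tests: the digits features range from 0 to 16, with larger values meaning darker ink, whereas an image resized with scikit-image typically holds floats in [0, 1]. The sketch below shows one way to bridge the two. It assumes the sample is dark ink on a light background, which is a guess about the image, and `to_digits_scale` is a hypothetical helper, not part of any library:

```python
import numpy as np

def to_digits_scale(gray, invert=True):
    """Hypothetical helper: map an 8x8 grayscale image with values in [0, 1]
    to one row of 64 features on the 0-16 scale of the digits data."""
    gray = np.asarray(gray, dtype=float)
    if invert:
        # Assumption: dark ink on a light background, so flip the intensities
        gray = 1.0 - gray
    return (gray * 16.0).reshape(1, -1)

# Made-up example: a white 8x8 image with a short dark vertical stroke
demo = np.ones((8, 8))
demo[2:6, 3] = 0.0

features = to_digits_scale(demo)
print(features.shape, features.min(), features.max())
```

A row produced this way could be passed to `clf.predict` in place of the unscaled `sample`; whether that improves the prediction depends on the image.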

More on 'Hand-written Digits Recognition': https://www.kaggle.com/c/digit-recognizer