# MNIST

The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.

Datasets loaded by Scikit-Learn generally have a similar dictionary-like structure, including:

• A DESCR key describing the dataset

• A data key containing an array with one row per instance and one column per feature

• A target key containing an array with the labels
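
As a quick illustration of this structure, the sketch below inspects the three keys listed above. It uses `load_digits`, a small 8×8 digits dataset bundled with Scikit-Learn, purely as a stand-in because it needs no download; it is not the MNIST data used in the rest of this notebook.

```python
from sklearn.datasets import load_digits

# load_digits ships with Scikit-Learn, so nothing is downloaded.
digits = load_digits()

print(sorted(digits.keys()))     # includes 'DESCR', 'data' and 'target'
print(digits["data"].shape)      # one row per instance, one column per feature
print(digits["target"].shape)    # one label per instance
print(digits["DESCR"][:60])      # beginning of the dataset description
```

The same `["data"]` and `["target"]` lookups are what the MNIST cells below rely on.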

In [None]:
# Fetching the data may take some time.
# Note: fetch_mldata was removed in scikit-learn 0.22, so this cell needs an older release.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist
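
On recent Scikit-Learn releases, `fetch_openml` is the usual replacement for `fetch_mldata`. The sketch below is one possible way to load MNIST with it, not the loader this notebook was written with. It assumes two things: that OpenML returns the labels as strings, and that `fetch_mldata` returned the training and test parts each sorted by label. The helper names `sort_by_target` and `load_mnist_sorted` are hypothetical, and whether the resulting row order matches the old one exactly is not guaranteed, so indices such as `X[36005]` should be re-checked after loading.

```python
import numpy as np

def sort_by_target(data, target, n_train=60000):
    """Sort the training part and the test part by label, separately."""
    # A stable sort keeps the original order among rows with the same label.
    train_order = np.argsort(target[:n_train], kind="stable")
    test_order = np.argsort(target[n_train:], kind="stable") + n_train
    order = np.concatenate([train_order, test_order])
    return data[order], target[order]

def load_mnist_sorted():
    """Download MNIST from OpenML (needs network access) and sort it by label."""
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    target = mnist["target"].astype(np.uint8)   # assumed: labels arrive as strings
    return sort_by_target(mnist["data"], target)

# Tiny synthetic check of the sorting helper: 4 "training" rows, 2 "test" rows.
toy_data = np.arange(6).reshape(6, 1)
toy_target = np.array([2, 0, 1, 0, 1, 0])
sorted_data, sorted_target = sort_by_target(toy_data, toy_target, n_train=4)
print(sorted_target)        # [0 0 1 2 0 1]
print(sorted_data.ravel())  # [1 3 2 0 5 4]
```

Calling `X, y = load_mnist_sorted()` would then stand in for the `fetch_mldata` cell and the `X, y = ...` assignment that follows it.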

In [None]:
X, y = mnist["data"], mnist["target"]
X.shape 
# X is a matrix, 70000 rows and 784 columns

There are 70,000 images, and each image has 784 features. This is because each image
is 28×28 pixels, and each feature simply represents one pixel’s intensity, from 0
(white) to 255 (black). 
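
To make the link between the 784 features and the 28×28 pixel grid concrete, here is a minimal sketch on a synthetic image (random intensities, not a real MNIST digit). It assumes the pixels are flattened row by row, which is what `reshape(28, 28)` in the cells below implies.

```python
import numpy as np

# A synthetic 28x28 "image" with integer intensities in the range 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flattening row by row gives the 784-element feature vector.
features = image.reshape(-1)
print(features.shape)                                  # (784,)

# Pixel (row, col) of the image is feature number row * 28 + col.
row, col = 3, 5
print(features[row * 28 + col] == image[row, col])     # True

# Reshaping the feature vector restores the original image.
restored = features.reshape(28, 28)
print(np.array_equal(restored, image))                 # True
```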

Let’s take a peek at one digit from the dataset. All you need to
do is grab an instance’s feature vector, reshape it to a 28×28 array, and display it using
Matplotlib’s imshow() function:

In [None]:
y.shape
# y is an array

In [None]:
28*28

In [None]:
# Plot one of the images of the digit "5"

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[36005] # it is digit "5"
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")

# save_fig("some_digit_plot")
plt.show()

The following code shows a few more images from the MNIST dataset to give you a feel for
the complexity of the classification task.



In [None]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.binary,
               interpolation="nearest")
    plt.axis("off")

In [None]:
# EXTRA
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")
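
The grid layout inside `plot_digits` can be checked without any plotting. The sketch below is a simplified variant of the same tiling logic, under the hypothetical name `tile_images`, that returns the combined array instead of drawing it; it pads with whole blank images rather than one wide blank strip.

```python
import numpy as np

def tile_images(instances, images_per_row=10, size=28):
    """Arrange flattened size x size images into one 2D grid array."""
    images_per_row = min(len(instances), images_per_row)
    # Ceiling division: number of rows needed to hold every image.
    n_rows = (len(instances) - 1) // images_per_row + 1
    # Pad with blank images so that the last row is complete.
    n_empty = n_rows * images_per_row - len(instances)
    images = [np.asarray(inst).reshape(size, size) for inst in instances]
    images += [np.zeros((size, size))] * n_empty
    rows = [
        np.concatenate(images[r * images_per_row:(r + 1) * images_per_row], axis=1)
        for r in range(n_rows)
    ]
    return np.concatenate(rows, axis=0)

# 23 blank dummy "digits" laid out 10 per row need 3 rows of 28 pixels each.
dummy = np.zeros((23, 784))
grid = tile_images(dummy, images_per_row=10)
print(grid.shape)   # (84, 280)
```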

In [None]:
import numpy as np
plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=10)
#save_fig("more_digits_plot")
plt.show()

In [None]:
y[36005]  # label of the digit plotted earlier (some_digit = X[36005])

But wait! You should always create a test set and set it aside before inspecting the data
closely. The MNIST dataset is actually already split into a training set (the first 60,000
images) and a test set (the last 10,000 images):
    

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Let’s also shuffle the training set; this will guarantee that all cross-validation folds will
be similar (you don’t want one fold to be missing some digits). Moreover, some learning
algorithms are sensitive to the order of the training instances, and they perform
poorly if they get many similar instances in a row. Shuffling the dataset ensures that
this won’t happen:

In [None]:
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
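
The key point of the shuffle is that one permutation is applied to both arrays, so every image keeps its own label. The toy sketch below (synthetic data, with a seeded generator used only to make it repeatable) checks that property.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded only so the sketch is repeatable

# Toy training set sorted by label; each row's single feature is 10 * its label.
y_toy = np.repeat(np.arange(3), 4)        # [0 0 0 0 1 1 1 1 2 2 2 2]
X_toy = y_toy.reshape(-1, 1) * 10

# One permutation, applied to both arrays, keeps rows and labels paired.
shuffle_index = rng.permutation(len(X_toy))
X_shuffled, y_shuffled = X_toy[shuffle_index], y_toy[shuffle_index]

print(y_shuffled)                                          # labels in shuffled order
print(np.array_equal(X_shuffled[:, 0], y_shuffled * 10))   # True: pairs intact
```

Shuffling `X_train` and `y_train` with two separate permutations would break this pairing and make the labels meaningless.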

# Binary classifier

Let’s simplify the problem for now and only try to identify one digit—for example,
the number 5. This “5-detector” will be an example of a binary classifier, capable of
distinguishing between just two classes, 5 and not-5. Let’s create the target vectors for
this classification task:

In [None]:
# Prepare the targets for training a classification model (SGDClassifier):
# the "5-detector" labels are True for all 5s and False for all other digits

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
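
A small sketch of what these target vectors look like, using made-up labels in place of `y_train`. It also shows a caveat that applies only under an assumption: if a loader returns the labels as strings, they need to be compared with `"5"` or converted to integers first.

```python
import numpy as np

# Toy labels standing in for y_train.
y_toy = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 5], dtype=np.uint8)

# Elementwise comparison yields a boolean array: True for 5s, False otherwise.
y_toy_5 = (y_toy == 5)
print(y_toy_5)
print(y_toy_5.sum(), "of", len(y_toy_5), "labels are 5s")   # 2 of 10 labels are 5s

# If the labels were strings, compare against "5" or convert them to integers.
y_str = y_toy.astype(str)
print((y_str == "5").sum())                   # 2
print((y_str.astype(np.uint8) == 5).sum())    # 2
```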

Now let’s pick a classifier and train it.

A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn’s SGDClassifier class.

I will introduce SGD in the next chapter.

This classifier has the advantage of being capable of handling very large datasets efficiently.
Let’s create an SGDClassifier and train it on the whole training set:

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

SGDClassifier?
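
For readers who cannot download MNIST, the same fit-then-predict pattern can be tried on Scikit-Learn's small bundled digits dataset. This is only a stand-in sketch: `load_digits` holds 8×8 images rather than MNIST's 28×28 ones, and the names `X_small`, `y_small_5` and `clf` are chosen here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Small bundled dataset (8x8 digit images), used instead of MNIST.
digits = load_digits()
X_small, y_small = digits["data"], digits["target"]

# Binary target: True for 5s, False for every other digit.
y_small_5 = (y_small == 5)

# Same estimator and settings as in the cell above.
clf = SGDClassifier(random_state=42)
clf.fit(X_small, y_small_5)

# predict expects a 2D array: one row per instance.
pred = clf.predict(X_small[:10])
print(pred)            # one boolean prediction per instance
print(clf.classes_)    # [False  True]
```

Predicting on instances the model was trained on, as done here, says little about real performance; it only shows the shape of the inputs and outputs.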

In [None]:
some_digit

In [None]:
# After training, check the prediction for some_digit, which is a "5"

sgd_clf.predict([some_digit])

# some_digit = X[36005],  it is digit "5"

The classifier guesses that this image represents a 5 (True).

It looks like it guessed right.



In [None]:
# Let's try another image

some_digit = X[16005] # it is digit "2"

In [None]:
# After training, check the prediction for the new image
sgd_clf.predict([some_digit])

So, according to the classifier, this image is not a "5".

In [None]:
# Why?
# Let's see this image, X[16005]

In [None]:
# plot this image, X[16005]

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[16005] 
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")

# save_fig("some_digit_plot")
plt.show()

The classifier guessed right in these particular cases. Now, let’s evaluate this model’s performance.

Question: how do we evaluate its performance?

Let's move on to the next task.