# <center> Keras </center>
## <center>1.2 MNIST database</center>

# MNIST database

MNIST, short for __M__odified __N__ational __I__nstitute of __S__tandards and __T__echnology,<br> 
is a dataset of 70000 images of handwritten digits: 60000 for training and 10000 for testing. <br>
The digits are size-normalized and centered in fixed-size grayscale images of 28x28 pixels.

All the images in the database are labelled. Here's an extract of the MNIST dataset: <br>
<img src="http://blog.welcomege.com/wp-content/uploads/2017/08/2017-08-08_14-39-24.png" width = "80%" /><br>

<br>The Keras library ships with a loader for MNIST, so no manual download is needed.

## Splitting Dataset
Evaluating a model requires splitting your available data into three sets:
training, validation, and test set. You train on the training data, and evaluate your model
on the validation data. Once your model is ready for prime time, you test it one final time
on the test data.

__Training-dataset__ is used to train the network.<br>
__Validation-dataset__ is used to evaluate the model during training, for example when tuning hyperparameters.<br>
__Test-dataset__ is used only once, to provide an unbiased final evaluation of the model. Keeping it separate prevents information about the test data from leaking into training.<br><br>
*Within the scope of this workshop, we will focus only on the training and test datasets.*

## Best practice
There is no standard way of dividing a dataset into training, validation, and test sets.
As a common rule of thumb, split the dataset into training and test sets in the following ratio (Pareto principle):<br>
&emsp;Training:Testing = 80:20<br>
When a validation set is also needed, use the following ratio:<br>
&emsp;Training:Validation:Testing = 60:20:20<br>
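The 60:20:20 rule of thumb could be sketched as follows. This is a minimal illustration that assumes NumPy is installed; the helper name `split_indices`, its parameters, and the fixed seed are hypothetical choices for this example and are not part of Keras.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Return shuffled, non-overlapping index arrays for train/validation/test.

    Hypothetical helper for illustration; the test set receives whatever
    remains after the training and validation fractions are taken.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle before slicing
    n_train = int(round(n_samples * train_frac))
    n_val = int(round(n_samples * val_frac))
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index arrays can be used to index both the image array and the label array, which keeps each image paired with its label.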

## Pitfalls
When dividing the dataset into training and test sets, take special care that both sets contain enough examples of each class. This requires knowledge about the dataset. Failure to do so can result in poor network performance. For example, if the MNIST dataset were sorted by digit, we could not simply take the first 60000 images for training, because digits 8 and 9 would be under-represented or missing entirely from the training set.<br>
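The pitfall can be reproduced on a small synthetic example. The sketch below does not use MNIST itself: it assumes a made-up label array with 100 samples per digit, sorted by digit, and compares a naive 80:20 slice with a shuffled one.

```python
import numpy as np

# Synthetic stand-in for labels sorted by digit: 100 samples per class (assumption).
sorted_labels = np.repeat(np.arange(10), 100)

# Naive split: the first 80% of a sorted array never reaches digits 8 and 9.
naive_train = sorted_labels[:800]
print("Classes in naive training split:", np.unique(naive_train))

# Shuffling first lets every class appear in both splits.
rng = np.random.default_rng(0)
shuffled = rng.permutation(sorted_labels)
shuffled_train, shuffled_test = shuffled[:800], shuffled[800:]
print("Train counts per digit:", np.bincount(shuffled_train, minlength=10))
print("Test counts per digit: ", np.bincount(shuffled_test, minlength=10))
```

Shuffling only makes a balanced split likely. If exact class proportions matter, a stratified split is the stricter option.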

# Code

In [None]:
# Importing the MNIST dataset
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Keras provides the MNIST dataset as a set of four NumPy arrays.<br>
`train_images` and `train_labels` form the __training set__, the data that the model will learn from. 
The model will then be tested on the __test set__, `test_images` and `test_labels`.<br><br>

Let's examine the loaded data:

In [None]:
# Training data
print("Shape of training images:\n" + str(train_images.shape))
print("Number of training labels:\n" + str(len(train_labels)))
print("Training labels:\n" + str(train_labels))

print("\n----------------\n")
# Test data
print("Shape of test images:\n" + str(test_images.shape))
print("Number of test labels:\n" + str(len(test_labels)))
print("Test labels:\n" + str(test_labels))


Let's plot one MNIST image and see what it looks like:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.imshow(train_images[0], cmap='gray', interpolation='none')
print("Digit Class: " + str(train_labels[0]))

Let's plot some more digits:

In [None]:
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(train_images[i], cmap='gray', interpolation='none')
    plt.title("Class {}".format(train_labels[i]))
plt.tight_layout()


# Task

Print out the distribution of digits in the test dataset.<br>
__NOTE:__ This task requires some previous knowledge of Python.

# Feedback
<a href = "http://goto/ml101_doc/Keras02">Feedback: MNIST database</a> <br>

# Navigation

<div>
<span> <h3 style="display:inline">&lt;&lt; Prev: <a href = "Keras01.ipynb">MNIST with Keras</a></h3> </span>
<span style="float: right"><h3 style="display:inline">Next: <a href = "Keras03.ipynb">Process input data</a> &gt;&gt; </h3></span>
</div>