# The MNIST handwritten digit recognition task
Based on [https://cambridgespark.com/content/tutorials/deep-learning-for-complete-beginners-recognising-handwritten-digits/index.html].

In [1]:
#import tensorflow as tf
from keras.datasets import mnist # subroutines for fetching the MNIST dataset
from keras.models import Model # basic class for specifying and training a neural network
from keras.layers import Input, Dense # the two types of neural network layer we will be using
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values

Using TensorFlow backend.


## Hyperparameters
### Define some parameters of our model. They are assumed to be fixed before training starts.

- The batch size: the number of training examples used simultaneously in a single iteration of the gradient descent algorithm.
- The number of epochs: the number of times the training algorithm iterates over the entire training set before terminating.
- The number of neurons in each of the two hidden layers of the MLP.

In [13]:
batch_size = 128 # in each iteration, we consider 128 training examples at once
num_epochs = 20 # we iterate twenty times over the entire training set
hidden_size = 512 # there will be 512 neurons in both hidden layers 
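To make the relationship between batch size and epochs concrete, here is a small sketch that counts gradient-descent iterations. The helper `updates_per_epoch` is a hypothetical name introduced for illustration, and the figure of 54000 samples assumes the 10% validation hold-out used later in this notebook.

```python
import math

def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of gradient-descent iterations needed to see every sample once."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_samples / batch_size)

num_fit_samples = 54000  # assumption: 60000 training images minus a 10% validation hold-out
batch_size = 128
num_epochs = 20

per_epoch = updates_per_epoch(num_fit_samples, batch_size)
total = per_epoch * num_epochs
print(per_epoch, total)
```

Doubling the batch size roughly halves the number of weight updates per epoch, which is one reason the two settings interact.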


To preprocess the input data, we will first flatten the images into 1D (as we will consider each pixel as a separate input feature), and we will then force the pixel intensity values to be in the [0,1] range by dividing them by 255. 
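The two preprocessing steps can be tried on a small stand-in array before touching the real dataset. This is only a sketch: `fake_images` is randomly generated placeholder data, not MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for five 28x28 greyscale images with 8-bit pixel intensities.
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Step 1: flatten each image so that every pixel becomes a separate input feature.
flat = fake_images.reshape(len(fake_images), 28 * 28)

# Step 2: convert to float and rescale intensities from [0, 255] to [0, 1].
scaled = flat.astype("float32") / 255

print(flat.shape)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```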

## Probabilistic classification
### For each class, output a value corresponding to the probability that the input belongs to that class.

- This implies a need to transform the training labels into a "one-hot" encoding. For example, if the desired output class is 3 and there are five classes overall (labelled 0 to 4), the appropriate one-hot encoding is [0 0 0 1 0]. Keras provides out-of-the-box functionality for doing just that.
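The five-class example above can be reproduced with plain NumPy. The helper `one_hot` is a hypothetical illustration of the idea, not the Keras utility itself.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]

print(one_hot([3], 5))
```

Each row contains a single 1 at the index of the class and 0 everywhere else.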

In [3]:
num_train = 60000 # there are 60000 training examples in MNIST
num_test = 10000 # there are 10000 test examples in MNIST

height, width, depth = 28, 28, 1 # MNIST images are 28x28 and greyscale
num_classes = 10 # there are 10 classes (1 per digit)

(X_train, y_train), (X_test, y_test) = mnist.load_data() # fetch MNIST data

X_train = X_train.reshape(num_train, height * width) # Flatten data to 1D
X_test = X_test.reshape(num_test, height * width) # Flatten data to 1D
X_train = X_train.astype('float32') 
X_test = X_test.astype('float32')
X_train /= 255 # Normalise data to [0, 1] range
X_test /= 255 # Normalise data to [0, 1] range

Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels
Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


## Define our model!
### Using a stack of three Dense layers, which correspond to a fully unrestricted MLP structure

- Use ReLU activations for the neurons in the first two layers, and a softmax activation for the neurons in the final one. Softmax is designed to turn any real-valued vector into a vector of probabilities; for the j-th neuron with input $z_j$ it is defined as $\mathrm{softmax}(z)_j = \exp(z_j) / \sum_i \exp(z_i)$.
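As a rough illustration, softmax exponentiates each input and divides by the sum of all the exponentials. The sketch below is a plain NumPy version written for this note (not the Keras implementation); subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector to a vector of probabilities."""
    z = np.asarray(z, dtype="float64")
    shifted = z - z.max()  # avoids overflow in exp without changing the output
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax([2.0, 1.0, -1.0])
print(probs, probs.sum())
```

The largest input receives the largest probability, and the outputs always sum to 1.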

A convenient feature of Keras, compared with working directly in a lower-level framework such as TensorFlow, is automatic shape inference: we only need to specify the shape of the input layer, and Keras then takes care of initialising the weight variables with the proper shapes. Once all the layers have been defined, we simply need to identify the input(s) and the output(s) in order to define our model.

In [9]:
inp = Input(shape=(height * width,)) # Our input is a 1D vector of size 784 (= 28*28)
hidden_1 = Dense(hidden_size, activation='relu')(inp) # First hidden ReLU layer
hidden_2 = Dense(hidden_size, activation='relu')(hidden_1) # Second hidden ReLU layer
out = Dense(num_classes, activation='softmax')(hidden_2) # Output softmax layer

model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers
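The shapes that get inferred for a stack of Dense layers can also be worked out by hand, which is a useful sanity check. The sketch below assumes each Dense layer holds a weight matrix of shape (inputs, units) plus one bias per unit; `dense_shapes` and `count_params` are hypothetical helper names.

```python
def dense_shapes(input_size, layer_sizes):
    """Return (weight_shape, bias_shape) for each Dense layer in a stack."""
    shapes = []
    fan_in = input_size
    for units in layer_sizes:
        shapes.append(((fan_in, units), (units,)))
        fan_in = units  # the next layer's input size is this layer's output size
    return shapes

def count_params(shapes):
    """Total number of trainable values across all layers."""
    return sum(w[0] * w[1] + b[0] for w, b in shapes)

shapes = dense_shapes(784, [512, 512, 10])
for weight_shape, bias_shape in shapes:
    print(weight_shape, bias_shape)
print(count_params(shapes))
```

If Keras is available, the total can be compared against the parameter count reported by `model.summary()`.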

To finish off specifying the model, we need to define our loss function, the optimisation algorithm to use, and which metrics to report.

## Cross-entropy loss
### This loss is better suited to probabilistic tasks (i.e. ones with logistic/softmax output neurons)
- By its manner of derivation, it aims only to maximise the model's confidence in the correct class.
- It is not concerned with the distribution of probabilities over the other classes (whereas the squared error loss would dedicate equal attention to getting all of the other class probabilities as close to zero as possible). This is because incorrect classes, i.e. classes $i'$ with $\hat{y}_{i'} = 0$, eliminate the respective neuron's output from the loss function.
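The difference between the two losses can be seen on a toy example. In the sketch below, `target` plays the role of the one-hot ground truth and the two predictions assign the same probability to the correct class but spread the remainder differently; the function names and numbers are made up for illustration.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Categorical cross-entropy for a single example."""
    pred = np.clip(pred, eps, 1.0)  # guard against log(0)
    return float(-np.sum(target * np.log(pred)))

def squared_error(target, pred):
    """Sum of squared differences for a single example."""
    return float(np.sum((target - pred) ** 2))

target = np.array([1.0, 0.0, 0.0])
pred_a = np.array([0.7, 0.3, 0.0])    # remaining mass on one wrong class
pred_b = np.array([0.7, 0.15, 0.15])  # remaining mass spread over both wrong classes

print(cross_entropy(target, pred_a), cross_entropy(target, pred_b))
print(squared_error(target, pred_a), squared_error(target, pred_b))
```

Cross-entropy is identical for both predictions because only the correct-class term survives, while the squared error changes with how the wrong-class probabilities are distributed.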

## Optimisation algorithm
### Most optimisers revolve around some form of gradient descent
- Their key differences lie in how the learning rate, η, is chosen or adapted during training.
- Here we will use the Adam optimiser, which typically performs well.
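To give a feel for what "adapting the learning rate" means, here is a minimal sketch of the textbook Adam update applied to a one-dimensional toy problem. It is not the Keras implementation; the function name `adam_minimise`, the toy objective, and the step size of 0.1 are assumptions chosen for illustration.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a scalar function given its gradient, using the Adam update rule."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        # The effective step is scaled by the recent gradient magnitude.
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimise(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```

Plain gradient descent would instead use the fixed update `w -= lr * g`; Adam divides by a running estimate of the gradient's scale, so the step size adapts per parameter.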

## Accuracy
### The proportion of the inputs classified correctly.
- This is an appropriate metric to report here, as our classes are roughly balanced (each of the ten digits appears a similar number of times).
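Accuracy can be computed by comparing the most probable predicted class with the true class for each input. The sketch below uses a made-up batch of four examples and three classes; `accuracy` is a hypothetical helper, not the Keras metric.

```python
import numpy as np

def accuracy(one_hot_targets, predicted_probs):
    """Fraction of rows whose most probable class matches the true class."""
    true_classes = np.argmax(one_hot_targets, axis=1)
    predicted_classes = np.argmax(predicted_probs, axis=1)
    return float(np.mean(true_classes == predicted_classes))

targets = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.5, 0.3, 0.2],   # wrong: predicts class 0, truth is class 2
                  [0.1, 0.6, 0.3]])

print(accuracy(targets, probs))
```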

In [11]:
model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy

## Training algorithm
### Finally, we call the training algorithm with the determined batch size and epoch count. 
- It is good practice to set aside a fraction of the training data to be used just for verification that our algorithm is (still) properly generalising (this is commonly referred to as the validation set); here we will hold out 10% of the data for this purpose.

- An excellent out-of-the-box feature of Keras is verbosity; it's able to provide detailed real-time pretty-printing of the training algorithm's progress.
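The hold-out itself is simple to picture. As far as I understand the Keras documentation, `validation_split` takes the validation samples from the end of the provided arrays, before any shuffling; the sketch below mimics that behaviour with a hypothetical helper `split_last_fraction` on placeholder arrays.

```python
import numpy as np

def split_last_fraction(x, y, validation_fraction):
    """Hold out the last `validation_fraction` of the samples for validation."""
    split_at = int(round(len(x) * (1.0 - validation_fraction)))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Placeholder arrays with as many entries as the MNIST training set.
x = np.arange(60000)
y = np.arange(60000)

(train_x, train_y), (val_x, val_y) = split_last_fraction(x, y, 0.1)
print(len(train_x), len(val_x))
```

Because the split is positional, data that is sorted by class should be shuffled before relying on this kind of hold-out.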

In [12]:
model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1) # Evaluate the trained model on the test set!

Train on 54000 samples, validate on 6000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.10260707112895712, 0.9812]
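With `metrics=['accuracy']`, the list returned by `model.evaluate` holds the loss first and the accuracy second. A small sketch of turning those two numbers into something easier to read, using the values copied from the output above:

```python
# Values copied from the evaluation output above: [loss, accuracy].
test_loss, test_accuracy = [0.10260707112895712, 0.9812]
num_test = 10000  # number of MNIST test examples

num_correct = round(test_accuracy * num_test)
num_wrong = num_test - num_correct
print(f"loss={test_loss:.4f}, accuracy={test_accuracy:.2%}, misclassified={num_wrong}")
```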

As can be seen, our model achieves an accuracy of about 98% on the test set (98.12% in the run shown above; the exact figure varies between runs). This is quite respectable for such a simple model, although it is outclassed by state-of-the-art approaches.

## Attempt different hyperparameter values/optimisation algorithms/activation functions, add more hidden layers, etc. 
### To achieve accuracies above 99%.

#### NOTE
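- Simply increasing the number of neurons does not help.
e.g. going from 512 to 800 neurons took longer to train, and accuracy dropped from 98.55% to 98.12%.
- Keeping 512 neurons and adding one more layer gave 98.24%.
- Doubling the batch size and using three hidden layers gave 98.13%.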

In [15]:
batch_size = 128*2 # in each iteration, we consider 256 training examples at once
num_epochs = 20 # we iterate twenty times over the entire training set
hidden_size = 512 # there will be 512 neurons in each of the three hidden layers

inp = Input(shape=(height * width,)) # Our input is a 1D vector of size 784 (= 28*28)
hidden_1 = Dense(hidden_size, activation='relu')(inp) # First hidden ReLU layer
hidden_2 = Dense(hidden_size, activation='relu')(hidden_1) # Second hidden ReLU layer
hidden_3 = Dense(hidden_size, activation='relu')(hidden_2) # Third hidden ReLU layer
out = Dense(num_classes, activation='softmax')(hidden_3) # Output softmax layer, fed by the third hidden layer

model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers
model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy
model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1) # Evaluate the trained model on the test set!

Train on 54000 samples, validate on 6000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.08921291756161713, 0.9813]