## Dataset

In this notebook, we perform image recognition on the **MNIST dataset**, a collection of 70,000 handwritten digits split into a training set of 60,000 examples and a test set of 10,000 examples. Each image is labeled with the digit it represents. MNIST is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image of 28x28 pixels. Each feature represents one pixel’s intensity, from 0 (white) to 255 (black), so each image has 28 x 28 = 784 features.
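As a quick check of the feature count, the sketch below flattens one 28x28 image into a single vector. The all-zero `image` is a placeholder used only for illustration, not real MNIST data.

```python
# Placeholder 28x28 image (all zeros); real MNIST pixels range from 0 to 255.
HEIGHT, WIDTH = 28, 28
image = [[0 for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten row by row: each pixel becomes one feature.
features = [pixel for row in image for pixel in row]

print(len(features))  # 784
```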

## Task Statement

The goal is to identify the digits using pattern recognition techniques. Image recognition is the ability of an AI system to detect, classify, and identify objects in images. Since the dataset contains handwritten digits (0-9), this is a multi-class classification problem with 10 classes.

We are going to use a simple ConvNet model to classify the MNIST digits.

## Instantiate a small ConvNet

In [1]:
from keras import layers
from keras import models

Using TensorFlow backend.


A ConvNet is basically a stack of Conv2D and MaxPooling2D layers. It takes input tensors of shape (image_height, image_width, image_channels), which we configure as (28, 28, 1) for MNIST images.

For an RGB image, the dimension of the depth axis (channel axis) is 3 because the image has three color channels: red, green, and blue. For a grayscale image, the dimension of the depth axis is 1 because a single channel holds the levels of gray.

In Keras Conv2D layers, the first two arguments passed to the layer are:
> `Conv2D(output_depth, (window_height, window_width))`

- Size of the patches extracted from the inputs: They are given by `(window_height, window_width)`; this notebook uses 3x3 windows.

In [2]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation='relu'))

The first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26 x 26 grid of values, which is a **Response Map** of the filter over the input. The response map quantifies the presence of the filter's pattern at different locations in the input.

Here, **Feature Map** means that every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.
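To make the idea of a response map concrete, here is a minimal pure-Python sketch of sliding one filter over a tiny image with no padding and a stride of 1. The helper `conv2d_valid`, the 5x5 `image`, and the vertical-edge `kernel` are all hypothetical illustrations; a real Conv2D layer also adds a bias, applies the activation, and learns its filter values during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    response = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sum of element-wise products between the filter and the current patch.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        response.append(row)
    return response

# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical 3x3 filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = conv2d_valid(image, kernel)
for row in response:
    print(row)
```

The response is largest where the patch straddles the edge and drops to zero over the uniform bright region, which is what "presence of the filter's pattern" means here. Applied to a 28x28 input, the same function returns a 26x26 map.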


## View ConvNet Architecture

In [3]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


The output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). These 3D tensors are called **Feature Maps**. They have two spatial axes (height and width) as well as a depth axis (also called the channels axis). As we move deeper into the network, the width and height dimensions shrink.
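The shrinking spatial dimensions in the summary can be reproduced with simple arithmetic. The sketch below assumes the Keras defaults used in this notebook (no padding and stride 1 for the 3x3 convolutions, non-overlapping 2x2 pooling); the helper names are hypothetical.

```python
def conv_output_size(size, window):
    # An unpadded, stride-1 convolution loses (window - 1) positions per axis.
    return size - window + 1

def pool_output_size(size, window):
    # Non-overlapping pooling; a leftover row or column is dropped.
    return size // window

size = 28
sizes = [size]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        size = conv_output_size(size, 3)
    else:
        size = pool_output_size(size, 2)
    sizes.append(size)

print(sizes)  # [28, 26, 13, 11, 5, 3]
```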

The output feature map is also a 3D tensor, whose depth can be arbitrary. The different channels no longer stand for specific colors but rather stand for **filters**. **Filters** encode specific aspects of the input data. For example, at a high level, a single filter could encode the "presence of a face" in the input.

The number of channels is controlled by the first argument passed to the Conv2D layers (32 or 64).
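The `Param #` column in the summary follows from the filter count and window size. As a rough check, assuming each filter has one weight per window position per input channel plus a single bias, the counts can be recomputed as follows (`conv2d_param_count` is a hypothetical helper, not a Keras function):

```python
def conv2d_param_count(window_height, window_width, in_channels, filters):
    # Weights per filter: window_height * window_width * in_channels, plus one bias.
    return (window_height * window_width * in_channels + 1) * filters

counts = [
    conv2d_param_count(3, 3, 1, 32),   # first Conv2D: 1 input channel, 32 filters
    conv2d_param_count(3, 3, 32, 64),  # second Conv2D: 32 input channels, 64 filters
    conv2d_param_count(3, 3, 64, 64),  # third Conv2D: 64 input channels, 64 filters
]

print(counts, sum(counts))  # [320, 18496, 36928] 55744
```

The pooling layers contribute no parameters, which is why the three convolution counts add up to the total reported by `model.summary()`.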

## Add a Classifier on top of the ConvNet

Next, we feed the last output tensor (of shape (3, 3, 64)) into a densely connected classifier network (a stack of Dense layers).
These classifiers process *vectors*, which are 1D, whereas the current output is a 3D tensor.

First we flatten the 3D outputs into 1D and then add a few Dense layers on top.

In [4]:
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

Here, we do a 10-way classification. The final layer has 10 outputs (one for each digit) and a softmax activation function, which turns the outputs into a probability distribution over the 10 digits.
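For intuition, here is a minimal pure-Python sketch of what a softmax activation computes. The `scores` list is made up for illustration; in the model, these values would come from the final Dense layer.

```python
import math

def softmax(scores):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for digits 0-9, with digit 7 scoring highest.
scores = [0.0] * 10
scores[7] = 5.0

probs = softmax(scores)
predicted_digit = probs.index(max(probs))

print(predicted_digit)
print(round(sum(probs), 6))
```

The outputs are non-negative and sum to 1, so the index of the largest value can be read as the predicted digit.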

The (3, 3, 64) outputs from the earlier layer are flattened into vectors of shape (576,), since 3 * 3 * 64 = 576, before going through the two Dense layers.
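The flattened size, and the parameter counts one would expect for the two Dense layers, can be worked out by hand. This sketch assumes each Dense layer keeps its default bias term; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(inputs, units):
    # One weight per (input, unit) pair, plus one bias per unit.
    return inputs * units + units

flat_size = 3 * 3 * 64                             # length of the flattened vector
hidden_params = dense_param_count(flat_size, 64)   # Dense(64) on the flattened vector
output_params = dense_param_count(64, 10)          # Dense(10) on the 64 hidden units

print(flat_size, hidden_params, output_params)  # 576 36928 650
```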

In [5]:
# View model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)               

## Train ConvNet on MNIST Images

In [8]:
from keras.datasets import mnist
from keras.utils import to_categorical

In [10]:
# Load the MNIST training and test splits
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channel axis and scale pixel values from [0, 255] to [0, 1]
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

# One-hot encode the integer labels (0-9)
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
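
The `to_categorical` calls above turn each integer label into a one-hot vector, which is the format `categorical_crossentropy` expects. A minimal pure-Python sketch of the same idea, with a hypothetical `one_hot` helper, made-up labels, and the class count passed explicitly, looks like this:

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 at the label's index and 0.0 elsewhere."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded

# Hypothetical labels, used only for illustration.
labels = [5, 0, 4]
encoded = one_hot(labels, 10)

print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```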

In [11]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x18f9dc43c88>

In [12]:
test_loss, test_acc = model.evaluate(test_images, test_labels)



In [19]:
print("The Accuracy on the test set is", round(test_acc * 100, 2) , '%')

The Accuracy on the test set is 99.09 %


## Summary

A similar classification task using a multi-layer perceptron network gave us an accuracy of 97.54%. [[View Notebook]](https://github.com/rojinadeuja/Multi-Layer-Perceptron/blob/master/Image-Recognition-Using-MLP.ipynb)

The basic ConvNet model implemented in this notebook gives us a higher test accuracy of 99.09%.
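
One way to read the gap between the two accuracies is in terms of error rate. The short calculation below uses only the two figures quoted above; the variable names are illustrative.

```python
mlp_accuracy = 97.54      # accuracy (%) reported for the multi-layer perceptron
convnet_accuracy = 99.09  # accuracy (%) reported for the ConvNet in this notebook

mlp_error = 100 - mlp_accuracy
convnet_error = 100 - convnet_accuracy

# Fraction of the MLP's errors that the ConvNet eliminates.
relative_reduction = (mlp_error - convnet_error) / mlp_error

print(round(mlp_error, 2), round(convnet_error, 2), round(relative_reduction * 100, 1))
# 2.46 0.91 63.0
```

In other words, the error rate drops from 2.46% to 0.91%, a relative reduction of roughly 63%.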