## MNIST dataset 
<br>
MNIST is a collection of handwritten digits stored as 28-by-28-pixel images. Datasets for training and testing can be downloaded at https://www.kaggle.com/c/digit-recognizer/data. Each row of the training data holds the label followed by the pixel values (0-255) for one handwritten digit image. We can follow the same procedure as in the XOR example above to construct our neural network, changing only the number of input nodes, hidden nodes, and output nodes.
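
As a quick sanity check on those node counts, the sketch below pushes a few random "images" through a two-layer sigmoid network and prints the resulting shapes. The names `X_demo`, `W1_demo`, and `W2_demo` are made up for illustration, and the hidden-layer size of 200 is only an assumption taken from the training cell further down.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_inputs = 28 * 28   # one input node per pixel
num_hidden = 200       # assumed hidden-layer size, for illustration only
num_outputs = 10       # one output node per digit class (0-9)

rng = np.random.RandomState(0)
# 5 fake "images" with pixel values already scaled to [0, 1)
X_demo = rng.rand(5, num_inputs)
W1_demo = 0.01 * rng.randn(num_inputs, num_hidden)
W2_demo = 0.01 * rng.randn(num_hidden, num_outputs)

hidden = sigmoid(np.dot(X_demo, W1_demo))   # shape (5, 200)
output = sigmoid(np.dot(hidden, W2_demo))   # shape (5, 10)
print(hidden.shape, output.shape)
```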

### 1. Batch vs Online Learning
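If we update the weights only after calculating the error over all training data, it is called batch learning. If we instead update the weights after each individual record, as the training loop below does, it is called online learning. Each approach has its own advantages.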
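
To make the two update schedules concrete, here is a minimal sketch on a toy problem. It assumes a single sigmoid layer and a squared-error loss. The helper names `batch_epoch` and `online_epoch` and the logical-OR data are hypothetical and not part of the MNIST workflow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_epoch(X, y, W, lr):
    """One pass over the data with a single update from all records."""
    out = sigmoid(np.dot(X, W))
    delta = (out - y) * out * (1.0 - out)
    return W - lr * np.dot(X.T, delta)

def online_epoch(X, y, W, lr):
    """One pass over the data with one update per record."""
    W = W.copy()
    for x_row, y_row in zip(X, y):
        x_row = x_row.reshape(1, -1)   # each record has to be a 2D array
        y_row = y_row.reshape(1, -1)
        out = sigmoid(np.dot(x_row, W))
        delta = (out - y_row) * out * (1.0 - out)
        W -= lr * np.dot(x_row.T, delta)
    return W

# toy data: logical OR, with a constant 1 in the last column as a bias input
X_toy = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y_toy = np.array([[0.], [1.], [1.], [1.]])
W0 = np.zeros((3, 1))

W_batch = batch_epoch(X_toy, y_toy, W0, lr=0.5)
W_online = online_epoch(X_toy, y_toy, W0, lr=0.5)
print(W_batch.ravel())
print(W_online.ravel())
```

Starting from the same weights, the two schedules end the pass with different values, because each online update already sees the weights changed by the previous record.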

In [1]:
import numpy as np
import pandas as pd
import warnings 
import matplotlib.pyplot as plt 
%matplotlib inline 

# we use sigmoid activation function throughout the workbook
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Plot 20 example figures

In [None]:
# load some data and plot 20 sample images for view 
with open('train.csv', 'r') as f:
    # skip header row in csv file
    data = f.readlines()[1:21]

plt.figure(figsize=(15,1)) 
for i in range(20):
    plt.subplot(1, 20, i + 1)  # subplot indices start at 1
    # split pixel values by comma
    values = data[i].split(',') 
    # convert string to float and reshape matrix
    pixels = np.asarray(values[1:], dtype=float).reshape((28, 28))
    # plot in grayscale
    plt.imshow(pixels, cmap='Greys')
    # no ticks
    plt.xticks(())
    plt.yticks(())
   
plt.show()

In [None]:
# load training dataset in pandas dataframe for manipulation
df = pd.read_csv('train.csv', sep=',',header=0)

# get pixel values for each image and convert to numpy array
X = df.iloc[:,1:].values
# normalize pixel values to between 0 and 1 
X = X / 255.0  

# first element of each row is the label 
label = df.iloc[:,0]
# set up target array of 10 nodes for all 10 classes 
y = np.zeros((df.shape[0], 10))
# set node for the correct label to 1 and keep others 0
for i in range(df.shape[0]):
    y[i, label[i]] = 1.0
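
The loop above sets one target entry per record. As an aside, the same one-hot array can be built in a single step with NumPy integer indexing. The sketch below uses a small made-up label array, `labels_demo`, to check that both approaches agree.

```python
import numpy as np

# hypothetical labels for 4 records
labels_demo = np.array([3, 0, 9, 3])

# loop version, mirroring the cell above
y_loop = np.zeros((labels_demo.shape[0], 10))
for i in range(labels_demo.shape[0]):
    y_loop[i, labels_demo[i]] = 1.0

# vectorized version: pair each row index with its label column
y_vec = np.zeros((labels_demo.shape[0], 10))
y_vec[np.arange(labels_demo.shape[0]), labels_demo] = 1.0

print(np.array_equal(y_loop, y_vec))
```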

In [None]:
learning_rate = 0.1
# number of hidden nodes
num_nodes = 200
# number of records in training set
num_train = 40000

X_train = X[:num_train,:]
y_train = y[:num_train,:]
W1 = 0.01 * np.random.randn(X.shape[1], num_nodes) # dim (784, N)
W2 = 0.01 * np.random.randn(num_nodes, 10) # dim (N, 10)


for i in range(6):
    
    # go through all records 
    for X_online, y_online in zip(X_train, y_train):        
        
        # each record has to be a 2D array 
        X_online = np.array([X_online]) 
        y_online = np.array([y_online])
        
        # forward propagation 
        z1 = sigmoid(np.dot(X_online, W1))    
        z2 = sigmoid(np.dot(z1, W2)) 

        # backward propagation
        z2_delta = (z2 - y_online) * z2 * (1.0 - z2) 
        z2_gradient = np.dot(z1.T, z2_delta) 
        z1_delta = np.dot(z2_delta, W2.T) * z1 * (1.0 - z1) 
        z1_gradient = np.dot(X_online.T, z1_delta) 

        # update weights
        W2 -= learning_rate * z2_gradient
        W1 -= learning_rate * z1_gradient 
           
    z2 = sigmoid(np.dot(sigmoid(np.dot(X_train, W1)), W2)) 
    loss = np.around(0.5 * np.sum((z2 - y_train)**2) / num_train, decimals=5) 
    print("Average squared difference between output and target at epoch {} : {}".format(i + 1, loss))
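
For reference, the backward-propagation lines in the loop above correspond to the following gradients. This is a sketch that assumes the per-record loss is the squared error reported at the end of each epoch. The symbols $x$ (one record, `X_online`), $y$ (its target, `y_online`), $\eta$ (`learning_rate`), and $\odot$ (element-wise product) are notation introduced here.

$$
\begin{aligned}
z_1 &= \sigma(x W_1), \qquad z_2 = \sigma(z_1 W_2), \qquad E = \tfrac{1}{2} \lVert z_2 - y \rVert^2 \\
\delta_2 &= (z_2 - y) \odot z_2 \odot (1 - z_2), \qquad \frac{\partial E}{\partial W_2} = z_1^{T} \delta_2 \\
\delta_1 &= (\delta_2 W_2^{T}) \odot z_1 \odot (1 - z_1), \qquad \frac{\partial E}{\partial W_1} = x^{T} \delta_1 \\
W_2 &\leftarrow W_2 - \eta \, \frac{\partial E}{\partial W_2}, \qquad W_1 \leftarrow W_1 - \eta \, \frac{\partial E}{\partial W_1}
\end{aligned}
$$

The factors $z(1 - z)$ come from the derivative of the sigmoid, $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$.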

In [None]:
X_test = X[num_train:,:]
y_test = label[num_train:]
predict = np.argmax(sigmoid(np.dot(sigmoid(np.dot(X_test, W1)), W2)), axis=1)
accuracy = 100.0 * np.sum(predict == y_test.values) / len(y_test)
print('prediction accuracy: %.2f%%' % round(accuracy, 4))
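
The prediction step picks the output node with the largest activation and compares it with the true label. The sketch below walks through that on a small, made-up output matrix. `outputs_demo` and `labels_demo` are hypothetical values, not results from the trained network.

```python
import numpy as np

# hypothetical network outputs for 4 records and 3 classes
outputs_demo = np.array([[0.1, 0.8, 0.1],
                         [0.7, 0.2, 0.1],
                         [0.2, 0.3, 0.5],
                         [0.6, 0.3, 0.1]])
labels_demo = np.array([1, 0, 2, 1])

# index of the largest output in each row is the predicted class
predict_demo = np.argmax(outputs_demo, axis=1)
# fraction of matching predictions, expressed as a percentage
accuracy_demo = 100.0 * np.mean(predict_demo == labels_demo)
print('prediction accuracy: %.2f%%' % accuracy_demo)  # prints: prediction accuracy: 75.00%
```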