In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sys

In [2]:
# Column 0 holds the labels, so cast it to int.
# The remaining columns are pixel intensities, so cast them to floating point.
df = pd.read_csv("train.csv")
train = df.values  # DataFrame.as_matrix() was removed in pandas 1.0

#labels (first column) and pixel intensities (remaining columns)
train_y = train[:,0].astype('int8')
train_x = train[:,1:].astype('float64')

train = None

test = pd.read_csv("test.csv").values.astype('float64')

In [3]:
# Accuracy can suffer if the examples are not spread roughly evenly across the digits,
# so check the per-digit counts first.
no_of_train = train_y.shape[0]
counts = [0]*10
for i in range(no_of_train):
    counts[train_y[i]] += 1
for i in counts:
    print(i)

4132
4684
4177
4351
4072
3795
4137
4401
4063
4188


Almost uniform, so we can apply a NN happily!
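The same check can be done without the explicit loop. This is only a sketch on a handful of made-up labels (the `labels` array is a stand-in for `train_y` before one-hot encoding), using `np.bincount`:

```python
import numpy as np

# Made-up labels standing in for train_y (before one-hot encoding).
labels = np.array([0, 1, 1, 2, 2, 2], dtype='int8')

# minlength=10 guarantees one slot per digit, even for digits that never occur.
counts = np.bincount(labels, minlength=10)

# Fraction of examples per digit; a uniform dataset would give 0.1 everywhere.
share = counts / counts.sum()

print(counts)
print(share)
```

Looking at `share` rather than raw counts makes it easier to see how far each digit is from the ideal 10%.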

In [4]:
# Scale pixel intensities from [0, 255] to [0, 1] to keep the activations (and the exponentials in softmax) in a numerically safe range!
test /= 255
train_x /= 255

In [5]:
#look at this : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
#get_dummies gives you a one-hot encoding.
train_y = pd.get_dummies(train_y).values  # as_matrix() was removed in pandas 1.0
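For reference, a one-hot encoding can also be built with plain NumPy. This is a sketch on made-up labels, not what the notebook uses: indexing the rows of a 10x10 identity matrix always yields 10 columns, whereas `get_dummies` only creates columns for labels that actually occur in the data.

```python
import numpy as np

# Made-up digit labels; the real notebook uses train_y.
labels = np.array([3, 0, 9])

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(10, dtype='uint8')[labels]

print(one_hot)
# argmax along each row recovers the original label.
print(one_hot.argmax(axis=1))
```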

As one-hot encoding is used here and we have to pick the most probable of 10 labels, it makes sense to use the softmax function:
\begin{equation*}
y_i=σ(x)_i=\frac{e^{x_i}}{\sum_{j=1}^{N}e^{x_j}}
\end{equation*}
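A tiny worked example may help here. The scores below are made up; the code is a direct transcription of the softmax formula for a single example, which is fine for small scores like these:

```python
import numpy as np

# Made-up scores for 3 classes (a single example).
scores = np.array([1.0, 2.0, 3.0])

# Exponentiate each score and divide by the sum of all exponentials.
probs = np.exp(scores) / np.sum(np.exp(scores))

print(probs)           # roughly [0.09, 0.24, 0.67]
print(probs.sum())     # the outputs form a probability distribution
print(probs.argmax())  # index of the most probable class
```

The largest score always gets the largest probability, so taking the argmax of the softmax output is the same as taking the argmax of the raw scores.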

Recall that the derivative of the sigmoid y = σ(x) is:
\begin{equation*}
\frac{dy}{dx}=σ'(x)=y(1-y)
\end{equation*}
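If you don't trust the identity, it is easy to check numerically. A small sketch (the `sigmoid` helper and the sample points are just for illustration) comparing y(1-y) with a central finite difference:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)  # a few sample points
y = sigmoid(x)

# Closed-form derivative expressed through the output y.
analytic = y * (1 - y)

# Central finite difference as an independent estimate.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```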
I intend to apply the most widely used combination: **cross-entropy as the loss function + softmax classification + weight decay**. (I might apply RBMs to MNIST in another notebook, to learn the same!) Read this: http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf

No. of hidden layers: 2, with 500 and 300 units respectively.

In [6]:
#randomly initialised from a Gaussian with mean zero and standard deviation 1/sqrt(num_inputs_to_layer)
def getWeights():
    inp = 784
    w1_dim = 500
    w2_dim = 300
    classes = 10
    
    #layer1
    w1 = np.random.normal(0,inp**-0.5,[inp,w1_dim])
    b1 = np.random.normal(0,inp**-0.5,[1,w1_dim])
    
    #layer2
    w2 = np.random.normal(0,w1_dim**-0.5,[w1_dim,w2_dim])
    b2 = np.random.normal(0,w1_dim**-0.5,[1,w2_dim])
    
    #layer3
    w3 = np.random.normal(0,w2_dim**-0.5,[w2_dim,classes])
    b3 = np.random.normal(0,w2_dim**-0.5,[1,classes])
    
    return [w1,b1,w2,b2,w3,b3]
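To get a feel for the size of this network, here is a sketch that builds the same 784-500-300-10 shapes in a loop and counts the parameters. The names `layer_sizes` and `params` and the seeded generator are my own choices for illustration, not part of the notebook:

```python
import numpy as np

layer_sizes = [784, 500, 300, 10]   # input, two hidden layers, output
rng = np.random.default_rng(0)      # seeded only to make the sketch repeatable

params = []
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    std = fan_in ** -0.5            # standard deviation 1/sqrt(fan_in)
    params.append(rng.normal(0, std, (fan_in, fan_out)))  # weight matrix
    params.append(rng.normal(0, std, (1, fan_out)))       # bias row

n_params = sum(p.size for p in params)
print([p.shape for p in params])
print(n_params)  # 784*500+500 + 500*300+300 + 300*10+10 = 545810
```

With roughly half a million parameters against 33,600 training examples, some form of regularisation is clearly worth having.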

Read this: http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf to learn more about 'Dropout'. It mainly helps avoid overfitting by randomly ignoring a few units during training. Here, 'p' is the probability that a unit will be ignored, so we simply zero out the selected ones. In the paper, the entire network is used at test time and the weights are scaled down by the retention probability to compensate. The code here instead scales the surviving activations up by 1/(1-p) during training ('inverted dropout'), so no rescaling is needed at test time. See https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5 for a condensed version of the paper.
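Here is a small sketch of why scaling by 1/(1-p) during training is enough. The function name `inverted_dropout`, the all-ones activations and the seeded generator are assumptions made for the demo:

```python
import numpy as np

def inverted_dropout(x, p, rng):
    """Zero each unit with probability p and scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    mask = rng.binomial(1, keep, size=x.shape) / keep
    return x * mask

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))          # made-up activations, all equal to 1
dropped = inverted_dropout(acts, 0.5, rng)

print(np.unique(dropped))  # survivors are scaled to 2.0, the rest are 0.0
print(dropped.mean())      # stays close to the original mean of 1.0
```

Because the expected activation is unchanged, the same weights can be used at test time with dropout simply switched off (p = 0).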

In [7]:
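def softmax(x):
    #subtract the row-wise max before exponentiating to avoid overflow; the result is unchanged
    e = np.exp(x - np.max(x,axis=1,keepdims=True))
    return e/np.sum(e,axis=1,keepdims=True)

def ReLu(x,der=False):
    #der=False gives max(0,x); der=True gives its derivative (1 where x>0, else 0)
    if not der:
        return x*(x>0)
    else:
        return 1*(x>0)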

def dropout(x,p):
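    #Bernoulli keep-mask, scaled by 1/(1-p) so the expected activation is unchanged
    m = np.random.binomial([np.ones_like(x)],(1-p))[0] / (1-p)
    #[0] strips the extra leading axis introduced by wrapping the array in a list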
    return x*m
                           
def forward_prop(x,weights,p):
    w1,b1,w2,b2,w3,b3 = weights                    

    #in case you're wondering what @ is : https://stackoverflow.com/questions/27385633/what-is-the-symbol-for-in-python
    z1 = ReLu(x@w1 + b1)
    z1 = dropout(z1,p)
                           
    z2 = ReLu(z1@w2 + b2)
    z2 = dropout(z2,p)
                           
    return [z1,z2,softmax(z2@w3 + b3)]

Use the cross-entropy loss function when dealing with classification:
\begin{equation*}
-\frac{1}{N}\sum_{i=1}^{N}\left[y\log(a)+(1-y)\log(1-a)\right]
\end{equation*}
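To see the formula at work, here is a sketch on two made-up predictions. Note the assumptions: the helper `cross_entropy` and its `eps` clipping are mine (clipping is one common way to dodge log(0)), and the mean is taken over every entry of the array:

```python
import numpy as np

def cross_entropy(a, y, eps=1e-12):
    """Mean of -[y*log(a) + (1-y)*log(1-a)] over all entries."""
    a = np.clip(a, eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([[1, 0], [0, 1]])             # one-hot targets
good = np.array([[0.9, 0.1], [0.2, 0.8]])  # mostly right
bad = np.array([[0.1, 0.9], [0.8, 0.2]])   # mostly wrong

print(cross_entropy(good, y))  # small loss
print(cross_entropy(bad, y))   # much larger loss
```

Confident, correct predictions drive the loss towards zero, while confident mistakes are punished heavily.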

Here is a refresher (I knew you'd forget!): https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions

In [8]:
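def log2(x):
    #natural log (despite the name) that returns -inf for 0 without raising a warning
    if x!=0:
        return np.log(x)
    else:
        return -np.inf #don't worry, the resulting nan/inf products are handled by nan_to_num :
    #https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nan_to_num.html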
    
def log(y):
    return [[log2(nx) for nx in x] for x in y]

def cost(a,y):
    loss = -np.mean((np.nan_to_num(y*log(a)) + np.nan_to_num((1-y)*log(1-a))),keepdims = True)
    return loss

In [9]:
#hold out 20% of the training data as a validation set
frac = 0.2
sz = round(train_x.shape[0]*frac)

indices = np.arange(no_of_train)
np.random.shuffle(indices)

x_train = train_x[indices[sz:]]
x_valid = train_x[indices[:sz]]

y_train = train_y[indices[sz:]]
y_valid = train_y[indices[:sz]]

train_x = None
train_y = None

x_train.shape

(33600, 784)

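And here it comes: backprop! Just apply the chain rule repeatedly to arrive at those complex equations that we got in the Stanford ML course! Below, a is the network output and y the one-hot target (as in the loss), X_i is the i-th input to the layer, and the terms are summed over the N examples.
For the output layer:
\begin{equation*}
\nabla W_{ij}=\frac{\partial L}{\partial W_{ij}} = \frac{1}{N}(a_j-y_j)X_i
\end{equation*}

For a hidden layer, the error δ of the next layer (a - y at the output) is propagated back through that layer's weights and multiplied by the derivative of the activation:
\begin{equation*}
\nabla W_{ij}=\frac{\partial L}{\partial W_{ij}} = \frac{1}{N}\left[\sum_k \delta_k W^{next}_{jk}\right]\sigma'(x_j)X_i
\end{equation*}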
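A finite-difference check is a handy way to validate the sign and scaling of an output-layer gradient. The sketch below makes some assumptions: a single softmax layer without bias, the plain categorical cross-entropy (only the y*log(a) term), random made-up data, and helper names of my own:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(W, X, Y):
    """Categorical cross-entropy averaged over the examples."""
    A = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(A), axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                 # 5 examples, 4 inputs
W = rng.normal(size=(4, 3))                 # 3 classes
Y = np.eye(3)[rng.integers(0, 3, size=5)]   # random one-hot targets

# Analytic gradient: inputs times (output - target), averaged over examples.
analytic = X.T @ (softmax(X @ W) - Y) / X.shape[0]

# Numerical gradient: nudge one weight at a time.
numeric = np.zeros_like(W)
h = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        numeric[i, j] = (loss(Wp, X, Y) - loss(Wm, X, Y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

If the two disagree in sign, the update will climb the loss instead of descending it, so this check is worth running before training.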
Enter momentum! It helps avoid getting trapped in local minima by taking larger leaps, and it reduces convergence time. Read this: https://www.quora.com/What-does-momentum-mean-in-neural-networks
\begin{equation*}
\begin{aligned}
V_{i+1} &= \gamma V_{i} + \eta\nabla L(W) \\
W &= W - V_{i+1}
\end{aligned}
\end{equation*}
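The two update equations translate almost line for line into code. This is a sketch on a toy problem (minimising 0.5*w^2, whose gradient is just w); the function name and the values of eta and gamma are picked only for the demo:

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, gamma=0.9):
    """One update: V <- gamma*V + eta*grad, then W <- W - V."""
    v_new = gamma * v + eta * grad
    return w - v_new, v_new

# Toy problem: minimise f(w) = 0.5 * w^2, so grad f(w) = w.
w = np.array([5.0])
v = np.zeros_like(w)   # the velocity starts at zero, like the cache in SGD
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)

print(w)  # very close to the minimum at 0
```

The velocity has to be carried from one call to the next, which is why the update function needs some kind of cache of the previous values.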
Also, add regularisation using the L2 norm.
\begin{equation*}
L = -\frac{1}{N}\sum_{i=1}^{N}\left[y\log(a)+(1-y)\log(1-a)\right] + \frac{\lambda}{2}\sum_{j=1}^{n_j}\sum_{i=1}^{n_i}W_{ij}^{2}
\end{equation*}
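In terms of code, the penalty adds one term to the loss and one term to each weight gradient. A sketch with a made-up weight matrix and lambda (the helper names are mine):

```python
import numpy as np

def l2_penalty(weight_matrices, lam):
    """0.5 * lambda * sum of squared weights over all matrices."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weight_matrices)

def l2_grad(w, lam):
    """Gradient of the penalty with respect to w: lambda * w."""
    return lam * w

W = np.array([[1.0, -2.0], [3.0, 0.5]])  # made-up weights
lam = 0.01

penalty = l2_penalty([W], lam)
print(penalty)          # 0.5 * 0.01 * (1 + 4 + 9 + 0.25) = 0.07125
print(l2_grad(W, lam))  # each weight is pulled towards zero in proportion to its size
```

Since the gradient of the penalty is just lambda times the weight, every update shrinks the weights a little, which is where the name 'weight decay' comes from.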

In [None]:
def SGD(weights,x,a,op,eta_lr,gamma_mom,delta,cache = None):
    w1,b1,w2,b2,w3,b3 = weights
    if cache is None:
        prev_w1 = np.zeros_like(w1)
        prev_w2 = np.zeros_like(w2)
        prev_w3 = np.zeros_like(w3)
        prev_b1 = np.zeros_like(b1)
        prev_b2 = np.zeros_like(b2)
        prev_b3 = np.zeros_like(b3)
    
    else:
        prev_w1,prev_b1,prev_w2,prev_b2,prev_w3,prev_b3 = cache
    
    z1,z2,out = op
    