## Neural Network and Convolutional Neural Network Practice

This notebook is a self-study of deep learning, following the book "Deep Learning" published by O'Reilly.

The basic building block of a neural network is the cell (neuron). A cell takes an input and generates an output based on the input value, similar to $y = f(x)$, where $x$ is the input value, $f$ is the cell's internal function, and $y$ is the output value.

In [2]:
# import necessary package
import numpy as np

### Helper functions

The step function is not a good activation function in practice. Cell outputs are usually passed through the **Sigmoid** or **ReLU** function instead; Sigmoid produces a smooth output curve, while ReLU is piecewise linear. For classification problems, the **Softmax** function is commonly used on the output layer. For regression problems, the cell output is usually used directly.
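One reason the step function is awkward for gradient-based training is that its slope is zero almost everywhere, so it passes no learning signal back. The sketch below compares finite-difference slopes of a step function and a sigmoid at a few sample points. The helper names `step` and `slope` and the sample points are illustrative assumptions, not part of the book's code.

```python
import numpy as np

def step(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slope(f, x, h=1e-4):
    # central finite difference, used here only for illustration
    return (f(x + h) - f(x - h)) / (2 * h)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
step_slope = slope(step, xs)        # zero at every sample point
sigmoid_slope = slope(sigmoid, xs)  # strictly positive at every sample point

print(step_slope)
print(sigmoid_slope)
```

Away from the jump at 0, the step function gives a slope of exactly zero, while the sigmoid gives a positive slope everywhere.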

In [3]:
# helper functions

# sigmoid and ReLU functions are commonly used for passing values from layer to layer
def sigmoid(x):
    return 1 / ( 1 + np.exp(-x) )

# ReLU
def relu(x):
    return np.maximum(0, x)

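# softmax function, usually used for classification problems
# works on a single sample (1-D) or a batch (2-D, one sample per row)
def softmax(x):
    # subtract the row-wise max for numerical stability (does not change the result)
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    # normalize each sample separately so that every row sums to 1
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)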

# gradient of sigmoid function
def sigmoid_grad(x):
    return ( 1.0 - sigmoid(x) ) * sigmoid(x)

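# gradient of ReLU function
# (not implemented here) it is 1 where x > 0 and 0 elsewhere

# gradient of softmax function
# softmax is not its own gradient: its Jacobian is diag(y) - y * y^T.
# Combined with the cross-entropy loss, the gradient with respect to the
# softmax input simplifies to y - t, which is what TwoLayerNet.gradient uses.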

### Loss function
To train a model, we need a loss function that measures how far the predictions are from the targets; learning adjusts the weights to reduce it. Both the mean squared error and the cross-entropy error are commonly used in neural networks.

The cross entropy error function can be expressed as:

$E = - \sum t_k\log y_k$
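As a small worked example of the formula above: with a one-hot target, only the term for the correct class survives, so $E = -\log y_{\text{correct}}$. The probability vectors below are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical softmax outputs for 3 classes; class index 2 is the correct one.
t = np.array([0, 0, 1])
y_confident = np.array([0.1, 0.2, 0.7])  # puts 0.7 on the correct class
y_wrong = np.array([0.7, 0.2, 0.1])      # puts only 0.1 on the correct class

# Full sum from the formula: E = -sum_k t_k * log(y_k)
E_sum = -np.sum(t * np.log(y_confident))

# With a one-hot target the sum collapses to the correct class only.
E_correct = -np.log(y_confident[np.argmax(t)])

# A prediction that puts little mass on the correct class is penalized more.
E_wrong = -np.sum(t * np.log(y_wrong))

print(E_sum, E_correct)  # both equal -log(0.7), about 0.357
print(E_wrong)           # -log(0.1), about 2.303
```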

In [4]:
# error functions

# mse
def mse(y, t):
    return 0.5 * np.sum( (y-t)**2 )

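# cross-entropy error for one sample with a one-hot target t
# (given a batch, this returns the sum over all samples, not the mean)
def cross_entropy_error(y, t):
    delta = 1e-7 # small constant so that log(0) is never evaluated
    return -np.sum( t * np.log(y+delta) )

# batch version (expects t as integer class labels, not one-hot vectors)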
#def cross_entropy_error(y, t):
#    if y.ndim == 1:
#        t = t.reshape(1, t.size)
#        y = y.reshape(1, y.size)
#    
#    batch_size = y.shape[0]
#    return -np.sum(np.log( y[np.arange(batch_size), t])) / batch_size
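The commented-out batch version above indexes `y` with integer class labels, while the MNIST section of this notebook uses one-hot labels. A minimal sketch of a batch-averaged variant that accepts one-hot targets could look like the following. The name `cross_entropy_error_batch` and the sample values are hypothetical and not taken from the book.

```python
import numpy as np

def cross_entropy_error_batch(y, t):
    """Mean cross-entropy over a batch; y and t have shape (batch, classes), t is one-hot."""
    if y.ndim == 1:
        # treat a single sample as a batch of size 1
        y = y.reshape(1, y.size)
        t = t.reshape(1, t.size)
    delta = 1e-7  # small constant so that log(0) is never evaluated
    return -np.sum(t * np.log(y + delta)) / y.shape[0]

# Two made-up samples with 3 classes each.
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([[1, 0, 0],
              [0, 1, 0]])

batch_loss = cross_entropy_error_batch(y, t)         # mean of the two per-sample losses
single_loss = cross_entropy_error_batch(y[0], t[0])  # loss of the first sample alone
print(batch_loss, single_loss)
```

Dividing by the batch size keeps the loss comparable across batch sizes and matches the `(y - t) / batch_num` scaling used in the backward pass of `TwoLayerNet.gradient`.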

### Practice 1: 1 layer network

First, we build a simple one-layer network that takes 2 input values and outputs scores for 3 classes, using a 2×3 weight matrix.

In [5]:
# simple net practice
class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2,3)
    
    def predict(self, x):
        return np.dot(x,self.W)
    
    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)
        
        return loss

In [6]:
# initiate simple 1 layer network
net = simpleNet()

# given inputs x1, x2 = 0.6, 0.9
x = np.array([0.6, 0.9])

# predict y
y_hat = net.predict(x)
print('Prediction: ' + str(y_hat))

# assume actual result is [0, 0, 1]
t = np.array([0, 0, 1])

# cross entropy error
error = net.loss(x, t)
print('Error: ' + str(error))

Prediction: [-0.76498923  0.01598985 -1.56366737]
Error: 2.088882638363264
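To see how the two printed numbers relate, the sketch below recomputes the error by hand from the printed scores: softmax turns the scores into probabilities, and the cross-entropy error is the negative log of the probability assigned to the true class. The scores are copied from the output above (rounded to 8 decimals), so the result agrees only up to rounding.

```python
import numpy as np

# Scores copied from the "Prediction" line printed above.
scores = np.array([-0.76498923, 0.01598985, -1.56366737])
t = np.array([0, 0, 1])  # true class is index 2

# softmax: shift by the max for stability, exponentiate, normalize
shifted = scores - np.max(scores)
probs = np.exp(shifted) / np.sum(np.exp(shifted))

# cross-entropy error, with the same small delta used in cross_entropy_error
error = -np.sum(t * np.log(probs + 1e-7))

print(probs)
print(error)  # about 2.0889, in line with the Error printed above
```

The true class receives the lowest score of the three, so its probability is small and the error is correspondingly large.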


### Practice 2: 2 layers network

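Next, we build a simple two-layer network. The input, hidden, and output sizes are constructor arguments; the example below uses 100 hidden cells. The weights are initialized with Gaussian random values scaled by `weight_init_std`, and the biases are initialized to zero.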

In [72]:
# 2 layers
class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        self.params = {}
        
        # 1st layer size: from input to cell size
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        
        # 2nd layer size: from cell size to output size
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        # get weights
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        
        # first layer output: input * W1 + b1
        # use sigmoid function to smooth values
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        
        # second layer output: input * W2 + b2
        # use softmax to normalize the result
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        return y
    
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis = 1)
        t = np.argmax(t, axis = 1)
        
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
    
    # numerical gradient
    # kept only as a placeholder to show how slow the numerical gradient is;
    # backpropagation through the computational graph (gradient() below) is much faster
    def numerical_gradient(self, x, t):
        pass
        #loss_W = lambda W: self.loss(x, t)
        
        #grads = {}
        #grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        #grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        #grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        #grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        #return grads
    
    # gradient by backpropagation through the computational graph
    def gradient(self, x, t):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        
        grads = {}
        
        batch_num = x.shape[0]
        
        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        # backward
        dy = (y-t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis = 0)
        
        dz1 = np.dot(dy, W2.T)
        da1 = sigmoid_grad(a1) * dz1
        grads['W1'] = np.dot(x.T, da1)
        grads['b1'] = np.sum(da1, axis = 0)
        
        return grads

In [71]:
# initiate two layers network with input size 784 (equals to 28*28 image size), 
# hidden layer with 100 cells, and output size 10 as 0~9 digits.
net = TwoLayerNet(input_size = 784, hidden_size = 100, output_size= 10)

# show the matrix size
print('Size of input * 1st layer cells: ' + str(net.params['W1'].shape))
print('Size of 1st layer bias vector: ' + str(net.params['b1'].shape))
print('Size of 1st layer cells * output: ' + str(net.params['W2'].shape))
print('Size of 2nd layer bias vector: ' + str(net.params['b2'].shape))

# randomly generate 100 fake images (28*28 pixels each)
x = np.random.rand(100,784)

# prediction
y = net.predict(x)

# randomly generate fake labels (not one-hot; used only to check shapes)
t = np.random.rand(100, 10)

# compute gradients
# skip numerical gradient, too slow
#numerical_grads = net.numerical_gradient(x, t)

# computational graph gradients
#compute_graph_grads = net.gradient(x, t)

Size of input * 1st layer cells: (784, 100)
Size of 1st layer bias vector: (100,)
Size of 1st layer cells * output: (100, 10)
Size of 2nd layer bias vector: (10,)


An important concept is how backpropagation through a computational graph computes the gradient. It yields the analytic gradient in a single forward and backward pass, which is much faster than the numerical gradient method.
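A common way to gain confidence in a backpropagation implementation is a gradient check: compute the gradient numerically on a tiny network and compare it with the analytic result. The sketch below does this for a network with the same structure as `TwoLayerNet` (sigmoid hidden layer, softmax output, mean cross-entropy loss). The layer sizes, the random seed, and the helper names `forward_loss`, `backprop_gradient`, and `numerical_gradient` are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward_loss(params, x, t):
    # mean cross-entropy of a tiny 2-layer network
    z1 = sigmoid(x @ params['W1'] + params['b1'])
    y = softmax(z1 @ params['W2'] + params['b2'])
    return -np.sum(t * np.log(y)) / x.shape[0]

def backprop_gradient(params, x, t):
    # forward pass
    z1 = sigmoid(x @ params['W1'] + params['b1'])
    y = softmax(z1 @ params['W2'] + params['b2'])
    # backward pass: softmax + cross-entropy gives (y - t) / batch size
    dy = (y - t) / x.shape[0]
    grads = {'W2': z1.T @ dy, 'b2': dy.sum(axis=0)}
    da1 = (dy @ params['W2'].T) * z1 * (1.0 - z1)  # sigmoid'(a1) = z1 * (1 - z1)
    grads['W1'] = x.T @ da1
    grads['b1'] = da1.sum(axis=0)
    return grads

def numerical_gradient(f, w, h=1e-4):
    # central difference: two loss evaluations per parameter
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        orig = w[idx]
        w[idx] = orig + h
        f_plus = f()
        w[idx] = orig - h
        f_minus = f()
        w[idx] = orig  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
params = {
    'W1': 0.1 * rng.standard_normal((4, 5)),
    'b1': np.zeros(5),
    'W2': 0.1 * rng.standard_normal((5, 3)),
    'b2': np.zeros(3),
}
x = rng.random((6, 4))                      # 6 fake samples with 4 features
t = np.eye(3)[rng.integers(0, 3, size=6)]   # random one-hot targets

analytic = backprop_gradient(params, x, t)
max_diff = {}
for key in params:
    numeric = numerical_gradient(lambda: forward_loss(params, x, t), params[key])
    max_diff[key] = float(np.max(np.abs(numeric - analytic[key])))
    print(key, max_diff[key])
```

The speed gap comes from the evaluation count: the central difference needs two loss evaluations per parameter, so the 784 × 100 `W1` matrix used in this notebook alone would take 156,800 forward passes for a single gradient, whereas backpropagation needs one forward and one backward pass.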

### Apply 2 layers model to train and test MNIST data

In [30]:
import os
from mnist import MNIST
mnidata = MNIST(os.getcwd()+'/Data')
train_img, train_lab = mnidata.load_training()
test_img, test_lab = mnidata.load_testing()

In [66]:
train_data = np.array(train_img)
train_label = np.array(train_lab)
test_data = np.array(test_img)
test_label = np.array(test_lab)

# convert label to one hot encoding
def get_one_hot(targets, nb_classes):
    res = np.eye(nb_classes)[np.array(targets).reshape(-1)]
    return res.reshape(list(targets.shape)+[nb_classes])

train_label = get_one_hot(train_label, 10)
test_label = get_one_hot(test_label, 10)

# optional binarization (disabled): set every non-zero pixel to 1
#train_data[train_data>0] = 1
#test_data[test_data>0] = 1
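The commented-out lines above would binarize the pixels. A common alternative, shown here as an assumption rather than as this notebook's method, is to scale the raw intensities to the range [0, 1]. With unscaled inputs and a sigmoid hidden layer, the pre-activations can become large enough that the sigmoid saturates and its gradient nearly vanishes. The sketch uses random integers in 0..255 as a stand-in for image rows, so the numbers only illustrate the effect and are not MNIST statistics.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Stand-in for 5 flattened images with integer intensities (assumed range 0..255).
raw = rng.integers(0, 256, size=(5, 784))
scaled = raw / 255.0  # now every value lies in [0, 1]

# Hidden-layer weights drawn the same way as in TwoLayerNet (std 0.01, 50 cells).
W1 = 0.01 * rng.standard_normal((784, 50))

a_raw = raw @ W1        # pre-activations from unscaled inputs
a_scaled = scaled @ W1  # exactly 255 times smaller

# mean sigmoid gradient at the hidden layer for both cases
grad_raw = np.mean(sigmoid(a_raw) * (1.0 - sigmoid(a_raw)))
grad_scaled = np.mean(sigmoid(a_scaled) * (1.0 - sigmoid(a_scaled)))

print(np.abs(a_raw).mean(), np.abs(a_scaled).mean())
print(grad_raw, grad_scaled)
```

Applied to this notebook, the equivalent step would be `train_data = train_data / 255.0` and `test_data = test_data / 255.0`, assuming the loader returns intensities in the 0 to 255 range.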

In [63]:
# apply mnist data
# use 50 cells only
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

In [68]:
# parameters
iteration_number = 1200
train_data_size = train_data.shape[0]
batch_data_size = 100
learning_rate = 0.1

train_lost_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_data_size // batch_data_size, 1)

print('Training data size: ' + str(train_data.shape[0]))
print('Testing data size: ' + str(test_data.shape[0]))
print('1 epoch needs ' + str(iter_per_epoch) + ' iterations')
print('Note: one epoch corresponds to one full pass over the training data')

Training data size: 60000
Testing data size: 10000
1 epoch needs 600 iterations
Note: one epoch corresponds to one full pass over the training data


In [69]:
# training
for i in range(iteration_number):
    # draw a random mini-batch (np.random.choice samples with replacement)
    batch_mask = np.random.choice(train_data_size, batch_data_size)
    train_data_batch = train_data[batch_mask]
    train_label_batch = train_label[batch_mask]
    
    # compute gradient
    grad = network.gradient(train_data_batch, train_label_batch)
    
    # update weights
    # W' = W - learning_rate * dW
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
        
    # record loss
    loss = network.loss(train_data_batch, train_label_batch)
    train_lost_list.append(loss)
    
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(train_data, train_label)
        test_acc = network.accuracy(test_data, test_label)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))

train acc, test acc | 0.11236666666666667, 0.1135
train acc, test acc | 0.11236666666666667, 0.1135
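The two accuracy lines above are identical, which suggests the network is not learning. One quick diagnostic is to compare the accuracy with the score obtained by always predicting the most frequent class. The sketch below uses the digit-1 counts of the standard MNIST splits (6,742 of 60,000 training images and 1,135 of 10,000 test images); these counts are an outside fact about the dataset and are not computed in this notebook. The helper `majority_class_accuracy` is a hypothetical name introduced for this example.

```python
import numpy as np

def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(labels))
    return counts.max() / counts.sum()

# Share of digit 1 in the standard MNIST training and test splits.
train_baseline = 6742 / 60000
test_baseline = 1135 / 10000
print(train_baseline, test_baseline)

# Toy check of the helper on made-up labels: class 1 appears 3 times out of 5.
toy_baseline = majority_class_accuracy([1, 1, 0, 1, 2])
print(toy_baseline)
```

Both baselines coincide with the printed accuracies, which is consistent with the network predicting the same class for every image. With the one-hot labels used here, the helper could be applied as `majority_class_accuracy(np.argmax(train_label, axis=1))`.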
