# M2177.003100 Deep Learning <br> Assignment #1 Part 2: Implementing Neural Networks from Scratch

Copyright (C) Data Science & AI Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. 

Previously in `Assignment1-1_Data_Curation.ipynb`, we created a pickle with formatted datasets for training, development and testing on the [notMNIST dataset](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html).

The goal of this assignment is to implement a simple 3-layer neural network from scratch. We won't derive all the math that's required, but I will try to give an intuitive explanation of what we are doing and will point to resources to read up on the details.

But why implement a Neural Network from scratch at all? Even if you plan on using Neural Network libraries like [PyBrain](http://pybrain.org) in the future, implementing a network from scratch at least once is an extremely valuable exercise. It helps you gain an understanding of how neural networks work, and that is essential to designing effective models.

One thing to note is that the code examples here aren't terribly efficient. They are meant to be easy to understand. In an upcoming part of the assignment, we will explore how to write an efficient Neural Network implementation using [PyTorch](http://pytorch.org/). 

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related materials. However, if you really feel that something essential is missing and you cannot proceed to the next step, then contact the teaching staff with a clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have done **part 1 - 3**, run the *CollectSubmission.sh* script with your **Student number** as input argument. <br>
This will produce a compressed file called *[Your student number].tar.gz*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* &nbsp; 20\*\*-\*\*\*\*\*)

## Load datasets

First reload the data we generated in `Assignment1-1_Data_Curation.ipynb`.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
from six.moves import cPickle as pickle
from six.moves import range
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline


In [3]:
pickle_file = '/home/jackyoung96/2020_2/Deeplearning_assignment/HW1_data/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat the data into a shape that's better suited to the models we're going to train:
- unnormalized data,
- data as a flat matrix,
- labels as float one-hot encodings.

In [4]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset * 255.0 + 255.0/2
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # one-hot encoding, Map the label 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)


In [6]:
data_size = 2000
train_dataset = train_dataset[0:data_size]
train_labels = train_labels[0:data_size]

print(train_dataset.shape)

(2000, 784)


## Training a Neural Network

![Sample network](Utils/nn-from-scratch-3-layer-network-1024x693.png)

Let's now build a neural network with one input layer, one hidden layer, and one output layer. The number of nodes in the input layer is determined by the dimensionality of our data, 784. Similarly, the number of nodes in the output layer is determined by the number of classes we have, 10. The input to the network will be the pixel values of the input image and its output will be ten probabilities, one for each class.

### How our network makes predictions

Our network makes predictions using *forward propagation*, which is just a bunch of matrix multiplications and the application of activation functions. If $x$ is the 784-dimensional input to our network then we calculate our prediction $\hat{y}$ (ten-dimensional) as follows:

$$
\begin{aligned}
z_1 & = xW_1 + b_1 \\
a_1 & = \tanh(z_1) \\
z_2 & = a_1W_2 + b_2 \\
a_2 & = \hat{y} = \mathrm{softmax}(z_2)
\end{aligned}
$$

$z_i$ is the input of layer $i$ and $a_i$ is the output of layer $i$ after applying the activation function. $W_1, b_1, W_2, b_2$ are  parameters of our network, which we need to learn from our training data. You can think of them as matrices transforming data between layers of the network. Looking at the matrix multiplications above we can figure out the dimensionality of these matrices. If we use 1024 nodes for our hidden layer then $W_1 \in \mathbb{R}^{784\times1024}$, $b_1 \in \mathbb{R}^{1024}$, $W_2 \in \mathbb{R}^{1024\times10}$, $b_2 \in \mathbb{R}^{10}$. Now you see why we have more parameters if we increase the size of the hidden layer.
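
The forward pass and the parameter shapes above can be checked with a small NumPy sketch. The helper names `softmax` and `forward`, the random inputs, and the max-subtraction trick inside `softmax` are illustrative assumptions, not part of the assignment code; subtracting the row maximum leaves the softmax output unchanged and only guards against overflow in `np.exp`.

```python
import numpy as np

def softmax(z):
    # Shift each row by its maximum before exponentiating. The result is
    # mathematically identical, but large scores no longer overflow.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    z1 = x.dot(W1) + b1   # (N, hidden)
    a1 = np.tanh(z1)      # (N, hidden)
    z2 = a1.dot(W2) + b2  # (N, 10)
    return softmax(z2)    # (N, 10), each row sums to 1

rng = np.random.RandomState(0)
n_in, n_hidden, n_out = 784, 1024, 10

x = rng.rand(5, n_in)  # five made-up inputs, for shape checking only
W1 = rng.randn(n_in, n_hidden) / np.sqrt(n_in)
b1 = np.zeros((1, n_hidden))
W2 = rng.randn(n_hidden, n_out) / np.sqrt(n_hidden)
b2 = np.zeros((1, n_out))

y_hat = forward(x, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size

print(y_hat.shape)   # (5, 10)
print(n_params)      # 784*1024 + 1024 + 1024*10 + 10 = 814090
```

Almost all of the 814,090 parameters sit in $W_1$, which is why the hidden layer size dominates the parameter count.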

### Learning the Parameters

Learning the parameters for our network means finding parameters ($W_1, b_1, W_2, b_2$) that minimize the error on our training data. But how do we define the error? We call the function that measures our error the *loss function*. A common choice with the softmax output is the [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression). If we have $N$ training examples and $C$ classes then the loss for our prediction $\hat{y}$ with respect to the true labels $y$ is given by:

$$
\begin{aligned}
L(y,\hat{y}) = - \frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log\hat{y}_{n,i}
\end{aligned}
$$



The formula looks complicated, but all it really does is sum over our training examples and add to the loss whenever the probability assigned to the correct class falls short of 1. So, the further away $y$ (the correct labels) and $\hat{y}$ (our predictions) are, the greater our loss will be.
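
To make the loss concrete, here is a small worked example with two training examples and three classes. The function name `cross_entropy`, the clipping constant `eps`, and the probability tables are made up for illustration; the clipping is an added safeguard against `log(0)` and is not part of the formula above.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot labels (N, C); y_hat: predicted probabilities (N, C).
    # Clipping keeps log() finite if a predicted probability is exactly 0.
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y * np.log(y_hat)) / y.shape[0]

y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Most of the probability mass is on the correct class.
mostly_right = np.array([[0.9, 0.05, 0.05],
                         [0.1, 0.8, 0.1]])

# Most of the probability mass is on a wrong class.
mostly_wrong = np.array([[0.1, 0.8, 0.1],
                         [0.9, 0.05, 0.05]])

loss_good = cross_entropy(y, mostly_right)
loss_bad = cross_entropy(y, mostly_wrong)

print("%.3f" % loss_good)  # 0.164 = -(ln 0.9 + ln 0.8) / 2
print("%.3f" % loss_bad)   # 2.649 = -(ln 0.1 + ln 0.05) / 2
```

Only the probability assigned to the true class enters the sum, because every other $y_{n,i}$ is zero.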

Remember that our goal is to find the parameters that minimize our loss function. We can use [gradient descent](http://cs231n.github.io/optimization-1/) to find its minimum. I will implement the most vanilla version of gradient descent, also called batch gradient descent with a fixed learning rate. Variations such as SGD (stochastic gradient descent) or minibatch gradient descent typically perform better in practice. So if you are serious you'll want to use one of these, and ideally you would also [decay the learning rate over time](http://cs231n.github.io/neural-networks-3/#anneal).

As an input, gradient descent needs the gradients (vectors of derivatives) of the loss function with respect to our parameters: $\frac{\partial{L}}{\partial{W_1}}$, $\frac{\partial{L}}{\partial{b_1}}$, $\frac{\partial{L}}{\partial{W_2}}$, $\frac{\partial{L}}{\partial{b_2}}$. To calculate these gradients we use the famous *backpropagation algorithm*, which is a way to efficiently calculate the gradients starting from the output. I won't go into detail on how backpropagation works, but there are many excellent explanations ([here](http://colah.github.io/posts/2015-08-Backprop/) or [here](http://cs231n.github.io/optimization-2/)) floating around the web.

Applying the backpropagation formula we find the following (trust me on this):

$$
\begin{aligned}
& \delta_3 = \hat{y} - y \\
& \delta_2 = (1 - \tanh^2z_1) \circ \delta_3W_2^T \\
& \frac{\partial{L}}{\partial{W_2}} = a_1^T \delta_3  \\
& \frac{\partial{L}}{\partial{b_2}} = \delta_3\\
& \frac{\partial{L}}{\partial{W_1}} = x^T \delta_2\\
& \frac{\partial{L}}{\partial{b_1}} = \delta_2 \\
\end{aligned}
$$
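
Rather than taking these formulas on trust, they can be compared against finite-difference gradients on a tiny network. The sketch below is an illustration under stated assumptions: the helper names (`forward_loss`, `analytic_grads`, `numeric_grad`), the toy sizes, and the step `h` are all made up here. The formulas are written per example, so the code sums over the batch and divides by $N$ to match the averaged loss; the bias gradients become column sums of the deltas.

```python
import numpy as np

def forward_loss(x, y, W1, b1, W2, b2):
    a1 = np.tanh(x.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.sum(y * np.log(probs)) / x.shape[0]
    return loss, a1, probs

def analytic_grads(x, y, W1, b1, W2, b2):
    _, a1, probs = forward_loss(x, y, W1, b1, W2, b2)
    delta3 = (probs - y) / x.shape[0]           # averaged over the batch
    delta2 = (1 - a1 ** 2) * delta3.dot(W2.T)   # 1 - tanh^2(z1) = 1 - a1^2
    return {'W1': x.T.dot(delta2),
            'b1': delta2.sum(axis=0, keepdims=True),
            'W2': a1.T.dot(delta3),
            'b2': delta3.sum(axis=0, keepdims=True)}

def numeric_grad(x, y, params, name, h=1e-5):
    # Central differences: perturb one entry at a time and re-evaluate the loss.
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(*params[name].shape):
        old = params[name][idx]
        params[name][idx] = old + h
        loss_plus = forward_loss(x, y, **params)[0]
        params[name][idx] = old - h
        loss_minus = forward_loss(x, y, **params)[0]
        params[name][idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

rng = np.random.RandomState(0)
x = rng.randn(4, 6)                       # 4 toy examples, 6 features
y = np.eye(3)[rng.randint(0, 3, size=4)]  # one-hot labels, 3 classes
params = {'W1': rng.randn(6, 5) * 0.5, 'b1': np.zeros((1, 5)),
          'W2': rng.randn(5, 3) * 0.5, 'b2': np.zeros((1, 3))}

grads = analytic_grads(x, y, **params)
max_err = max(np.max(np.abs(grads[k] - numeric_grad(x, y, params, k)))
              for k in params)
print(max_err)
```

If the largest difference is many orders of magnitude below the gradient values themselves, the analytic formulas and the numerical estimate agree.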

### Activation functions

Neural networks use a variety of activation functions. Which one is appropriate depends on the characteristics of the function, the type of network, and the type of data.

![Activation Functions](Utils/activation-functions.png)
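
As a reference for the implementation, here is a minimal sketch of three activation functions named in this assignment (tanh, sigmoid, ReLU) together with the derivatives that backpropagation needs. The `*_grad` helper names are assumptions made for this example, and which functions the figure above actually shows is not assumed here.

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # The derivative at exactly 0 is undefined; 0 is used here by convention.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))       # [0. 0. 2.]
print(relu_grad(z))  # [0. 0. 1.]
print(sigmoid(0.0))  # 0.5
```

Swapping the hidden activation means replacing both the forward call and the matching derivative in the $\delta_2$ term.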




## Implementation

Now we are ready for our implementation. We start by defining some useful variables and parameters for gradient descent:

In [8]:
num_examples = len(train_dataset) # training set size
print(num_examples)
nn_input_dim = 784 # input layer dimensionality
nn_output_dim = 10 # output layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

2000


### Loss function

First let's implement the loss function we defined above. We use this to evaluate how well our model is doing:

In [9]:
# Helper function to evaluate the total loss on the dataset
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = train_dataset.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    # Pick, for every example, the predicted probability of its true class
    correct_logprobs = -np.log(probs[np.arange(num_examples), np.argmax(train_labels, axis=1)])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss

We also implement a helper function to calculate the output of the network. It does forward propagation as defined above and returns the class with the highest probability.

In [10]:
# Helper function to predict an output (the class index, 0 to 9)
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

## Build model
Finally, here comes the function to train our neural network. It implements batch gradient descent using the backpropagation derivatives we found above.

In [11]:
# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}
    
    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = train_dataset.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = (probs - train_labels) / num_examples  # average over the training set
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(train_dataset.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        
        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" %(i, calculate_loss(model)))
    
    return model

## A network with a hidden layer of size 10

Let's see what happens if we train a network with a hidden layer size of 10. 

In [12]:
# Build a model with a 10-dimensional hidden layer
model = build_model(10, print_loss=True)

Loss after iteration 0: 2.409069
Loss after iteration 1000: 0.951440
Loss after iteration 2000: 0.737469
Loss after iteration 3000: 0.639097
Loss after iteration 4000: 0.577911
Loss after iteration 5000: 0.535709
Loss after iteration 6000: 0.504747
Loss after iteration 7000: 0.480976
Loss after iteration 8000: 0.462012
Loss after iteration 9000: 0.446527
Loss after iteration 10000: 0.433805
Loss after iteration 11000: 0.423181
Loss after iteration 12000: 0.414191
Loss after iteration 13000: 0.406507
Loss after iteration 14000: 0.399886
Loss after iteration 15000: 0.394142
Loss after iteration 16000: 0.389138
Loss after iteration 17000: 0.384766
Loss after iteration 18000: 0.380936
Loss after iteration 19000: 0.377566


## Varying the hidden layer size

In the example above we picked a hidden layer size of 10. Let's now get a sense of how varying the hidden layer size affects the result.


In [13]:
hidden_layer_dimensions = [50, 100]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    model = build_model(nn_hdim, print_loss=True)    

Loss after iteration 0: 2.330436
Loss after iteration 1000: 0.666129
Loss after iteration 2000: 0.560475
Loss after iteration 3000: 0.501706
Loss after iteration 4000: 0.459902
Loss after iteration 5000: 0.427561
Loss after iteration 6000: 0.401522
Loss after iteration 7000: 0.380165
Loss after iteration 8000: 0.362433
Loss after iteration 9000: 0.347567
Loss after iteration 10000: 0.334993
Loss after iteration 11000: 0.324275
Loss after iteration 12000: 0.315080
Loss after iteration 13000: 0.307149
Loss after iteration 14000: 0.300276
Loss after iteration 15000: 0.294291
Loss after iteration 16000: 0.289057
Loss after iteration 17000: 0.284456
Loss after iteration 18000: 0.280395
Loss after iteration 19000: 0.276795
Loss after iteration 0: 2.362255
Loss after iteration 1000: 0.650155
Loss after iteration 2000: 0.551860
Loss after iteration 3000: 0.495918
Loss after iteration 4000: 0.455431
Loss after iteration 5000: 0.423593
Loss after iteration 6000: 0.397577
Loss after iteration 700

The training loss is lower with 50 hidden units than with 10, but training loss alone does not tell us how well a model generalizes. While a hidden layer of low dimensionality tends to capture the general trend of the data, higher dimensionalities are prone to overfitting. They are "memorizing" the data as opposed to fitting the general shape. If we were to evaluate our model on a separate test set (and you should!), the model with a smaller hidden layer size would likely perform better because it generalizes better. We could counteract overfitting with stronger regularization, but picking a suitable size for the hidden layer is a much more "economical" solution.
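
To check generalization rather than training loss, compare accuracy on the training set with accuracy on held-out data. The helper below is a minimal sketch; the name `accuracy` and the toy labels are assumptions made for this example. It expects class indices (as returned by `predict`) and one-hot labels (as produced by `reformat`).

```python
import numpy as np

def accuracy(predicted_classes, one_hot_labels):
    # Fraction of examples whose predicted class index matches the label.
    true_classes = np.argmax(one_hot_labels, axis=1)
    return np.mean(predicted_classes == true_classes)

# Toy check: four examples, ten classes, one wrong prediction.
toy_labels = np.eye(10)[[0, 3, 3, 7]]
toy_predictions = np.array([0, 3, 2, 7])
toy_acc = accuracy(toy_predictions, toy_labels)
print(toy_acc)  # 0.75
```

With the notebook's own objects this would be called as `accuracy(predict(model, train_dataset), train_labels)` and `accuracy(predict(model, valid_dataset), valid_labels)`; a large gap between the two numbers is the usual sign of overfitting.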

---
## Problem 2

Implement a neural network with <font color='red'>$two\ hidden\ layers$</font> to improve your model's validation / test accuracy as much as you can. You can just copy and paste the code above, but since relevant material can appear on the exam, I strongly recommend that you implement it yourself.

Here are some things you can try:

1. Instead of batch gradient descent, use **minibatch** gradient descent ([more info](http://cs231n.github.io/optimization-1/#gd)) to train the network. Minibatch gradient descent typically performs better in practice. 
2. We used a fixed learning rate epsilon for gradient descent. Implement an **annealing** schedule for the gradient descent learning rate ([more info](http://cs231n.github.io/neural-networks-3/#anneal)). 
3. We used a tanh activation function for our hidden layer. Experiment with other activation functions such as **ReLU** function. Note that changing the activation function also means changing the backpropagation derivative.
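
The first two suggestions can be sketched independently of any particular network. The code below is an illustration under stated assumptions: the helper names `iterate_minibatches` and `step_decay`, the decay constants, and the toy data are all made up here, and step decay is only one of several possible annealing schedules.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices of the shuffle.
    # The last batch may be smaller if batch_size does not divide len(X).
    perm = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], Y[idx]

def step_decay(initial_rate, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the learning rate by `drop` every `epochs_per_drop` epochs.
    return initial_rate * drop ** (epoch // epochs_per_drop)

rng = np.random.RandomState(0)
X = rng.rand(10, 4)                       # 10 toy examples
Y = np.eye(2)[rng.randint(0, 2, size=10)]

for epoch in (0, 10, 25):
    learning_rate = step_decay(0.01, epoch)
    batch_sizes = [xb.shape[0] for xb, yb in iterate_minibatches(X, Y, 4, rng)]
    # A real training loop would run forward/backward propagation on each
    # (xb, yb) pair here and update the parameters with learning_rate.
    print(epoch, learning_rate, batch_sizes)
```

Inside a training loop, the gradients should be divided by the size of the current minibatch rather than by the size of the full training set.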

**Evaluation**: Use the `print_loss` option and show that the model actually trains.

---

In [None]:
print(__doc__)
""" TODO """

# Dataset Loading Part

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
from six.moves import cPickle as pickle
from six.moves import range
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

pickle_file = '/home/jackyoung96/2020_2/Deeplearning_assignment/HW1_data/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset * 255.0 + 255.0/2
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # one-hot encoding, Map the label 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)
Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)


# Datasize and hyperparameter setting

In [2]:
print(train_dataset.shape)

(200000, 784)


In [3]:
data_size = 2000
valid_size = 200

train_dataset = train_dataset[0:data_size]
train_labels = train_labels[0:data_size]
valid_dataset = valid_dataset[0:valid_size]
valid_labels = valid_labels[0:valid_size]

print(train_dataset.shape)

num_examples = len(train_dataset) # training set size
print(num_examples)
nn_input_dim = 784 # input layer dimensionality
nn_hidden_dim_1 = [100,200,400]
nn_hidden_dim_2 = [20,30,50]
nn_output_dim = 10 # output layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = [0.1,0.01,0.001] # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength

(2000, 784)
2000


In [4]:
def make_layer(dim_in, dim_h1, dim_h2, dim_out):
    W1 = np.random.randn(dim_h1,dim_in)
    b1 = np.zeros((dim_h1,1))
    W2 = np.random.randn(dim_h2,dim_h1)
    b2 = np.zeros((dim_h2,1))
    W3 = np.random.randn(dim_out,dim_h2)
    b3 = np.zeros((dim_out,1))
    
    return W1,b1,W2,b2,W3,b3
    

In [5]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

In [20]:
def build_model(X,Y,dim_h1,dim_h2,learning_rate,reg_lambda,epoch=10000, print_cost = False):
    # Train a network with two tanh hidden layers by full-batch gradient descent.
    # X: (m, dim_in) inputs, Y: (m, dim_out) one-hot labels.
    decay_rate = 0.9
    dim_in = X.shape[1]
    dim_out = Y.shape[1]
    m = X.shape[0]
    cost_prev = 1000000
    
    # Weights start as unscaled standard-normal samples, biases as zeros
    W1 = np.random.randn(dim_in,dim_h1)
    b1 = np.zeros((1,dim_h1))
    W2 = np.random.randn(dim_h1,dim_h2)
    b2 = np.zeros((1,dim_h2))
    W3 = np.random.randn(dim_h2,dim_out)
    b3 = np.zeros((1,dim_out))
    
    for i in range(epoch):
        #feed forward
        Z1 = np.dot(X,W1)+b1
        A1 = np.tanh(Z1)
        Z2 = np.dot(A1,W2)+b2
        A2 = np.tanh(Z2)
        Z3 = np.dot(A2,W3)+b3
        A3 = sigmoid(Z3)  # element-wise sigmoid output, one value per class

        #back propagation (the L2 penalty gradient is added to the weights only)
        dZ3 = A3-Y
        dW3 = np.dot(A2.T,dZ3)*(1/m) + 2*reg_lambda*W3
        db3 = np.sum(dZ3,axis=0)
        dZ2 = np.dot(dZ3,W3.T)*(1-np.power(A2,2))
        dW2 = np.dot(A1.T,dZ2)*(1/m) + 2*reg_lambda*W2
        db2 = np.sum(dZ2,axis=0)
        dZ1 = np.dot(dZ2,W2.T)*(1-np.power(A1,2))
        dW1 = np.dot(X.T,dZ1)*(1/m) + 2*reg_lambda*W1
        db1 = np.sum(dZ1,axis=0)

        #update parameter
        W1 = W1 - learning_rate * dW1
        W2 = W2 - learning_rate * dW2
        W3 = W3 - learning_rate * dW3
        b1 = b1 - learning_rate * db1
        b2 = b2 - learning_rate * db2
        b3 = b3 - learning_rate * db3
        
        #calculate cost (cross-entropy of the pre-update predictions, without the L2 term)
        cross_entropy = -np.sum(np.log(A3)*Y, axis=1)
        cost = np.sum(cross_entropy)*(1/m)
        
        
        if print_cost:
            if(i%1000==999):
                # learning rate decay; note that it only runs when print_cost is True
                learning_rate = learning_rate * decay_rate
                
                # halve the rate again if the cost went up since the last check
                if cost > cost_prev:
                    learning_rate = learning_rate * 0.5
                cost_prev = cost
                
                print("iteration {} cost : {}".format(i+1,cost))
        
    
    return {
        'W1':W1,
        'b1':b1,
        'W2':W2,
        'b2':b2,
        'W3':W3,
        'b3':b3,
    }
    
    
    

In [7]:
def test(model,X,Y):
    m = X.shape[0]
    
    W1 = model['W1']
    b1 = model['b1']
    W2 = model['W2']
    b2 = model['b2']
    W3 = model['W3']
    b3 = model['b3']
    
    #feed forward
    Z1 = np.dot(X,W1)+b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(A1,W2)+b2
    A2 = np.tanh(Z2)
    Z3 = np.dot(A2,W3)+b3
    A3 = sigmoid(Z3)
    
    pred = np.argmax(A3,axis=1)
    target = np.argmax(Y,axis=1)
    
    cross_entropy = -np.sum(np.log(A3)*Y, axis=1)
    cost = np.sum(cross_entropy)*(1/m)
    
    accuracy = sum(pred==target)/m
    
    return accuracy, cost

# Let's select the hyperparameters

In [9]:
for e in epsilon:
    for hid_1 in nn_hidden_dim_1:
        for hid_2 in nn_hidden_dim_2:
            model = build_model(train_dataset, train_labels, hid_1, hid_2, e, reg_lambda,print_cost= True)
            acc, cost = test(model, valid_dataset, valid_labels)
            print("dim_h1 {} dim_h2 {} epsilon {} : cost = {}, accuracy = {}".format(hid_1, hid_2,e, cost,acc))

iteration 1000 cost : 82.8795162311825
iteration 2000 cost : 84.77009893295944
iteration 3000 cost : 81.91881530843597
iteration 4000 cost : 86.29231073966056
iteration 5000 cost : 101.9685157760117
iteration 6000 cost : 69.90171798659294
iteration 7000 cost : 79.51377600214808
iteration 8000 cost : 78.33311319299264
iteration 9000 cost : 57.70926585093017
iteration 10000 cost : 62.22223898551865
dim_h1 100 dim_h2 20 epsilon 0.1 : cost = 63.98430281805804, accuracy = 0.105
iteration 1000 cost : 98.3447158398883
iteration 2000 cost : 79.63912740690466
iteration 3000 cost : 91.47581073068604
iteration 4000 cost : 114.22744667184135
iteration 5000 cost : 50.690413021792004
iteration 6000 cost : 72.80719326041816
iteration 7000 cost : 92.84369099033243
iteration 8000 cost : 77.34644692812726
iteration 9000 cost : 83.21733421416788
iteration 10000 cost : 90.15053783932865
dim_h1 100 dim_h2 30 epsilon 0.1 : cost = 85.25227355883602, accuracy = 0.125
iteration 1000 cost : 72.70597563709866
it

iteration 1000 cost : 1.7080091840951543
iteration 2000 cost : 1.1068143578303835
iteration 3000 cost : 0.8601663068488712
iteration 4000 cost : 0.7018675113058449
iteration 5000 cost : 0.5935016171124449
iteration 6000 cost : 0.5189593748295716
iteration 7000 cost : 0.4573810144313781
iteration 8000 cost : 0.4126100349030779
iteration 9000 cost : 0.3789598374041059
iteration 10000 cost : 0.39515705450539484
dim_h1 400 dim_h2 50 epsilon 0.01 : cost = 0.8806560541180447, accuracy = 0.795
iteration 1000 cost : 2.481651024795913
iteration 2000 cost : 2.3296928754039254
iteration 3000 cost : 2.2555673308259077
iteration 4000 cost : 2.2106131996249725
iteration 5000 cost : 2.183734987862936
iteration 6000 cost : 2.1546656630921652
iteration 7000 cost : 2.134726106092771
iteration 8000 cost : 2.116967469463713
iteration 9000 cost : 2.0981469597651805
iteration 10000 cost : 2.0794257163338217
dim_h1 100 dim_h2 20 epsilon 0.001 : cost = 2.2343523417264404, accuracy = 0.185
iteration 1000 cost 

# Learning rate = 0.01 works best, so test the hidden layer sizes again

In [10]:
e = 0.01
nn_hidden_dim_1 = [50,100,150,200,250,300,350,400,450,500]
nn_hidden_dim_2 = [20,30,50,70,100,150,200,250,300,400]
for hid_1 in nn_hidden_dim_1:
    for hid_2 in nn_hidden_dim_2:
        model = build_model(train_dataset, train_labels, hid_1, hid_2, e, reg_lambda,print_cost= False)
        acc, cost = test(model, valid_dataset, valid_labels)
        print("dim_h1 {} dim_h2 {} epsilon {} : cost = {}, accuracy = {}".format(hid_1, hid_2,e, cost,acc))

dim_h1 50 dim_h2 20 epsilon 0.01 : cost = 1.2700413144452758, accuracy = 0.685
dim_h1 50 dim_h2 30 epsilon 0.01 : cost = 1.0020496725565586, accuracy = 0.73
dim_h1 50 dim_h2 50 epsilon 0.01 : cost = 0.9814453805316734, accuracy = 0.75
dim_h1 50 dim_h2 70 epsilon 0.01 : cost = 1.003839748035584, accuracy = 0.73
dim_h1 50 dim_h2 100 epsilon 0.01 : cost = 1.0523197469344014, accuracy = 0.745
dim_h1 50 dim_h2 150 epsilon 0.01 : cost = 0.9380602843021668, accuracy = 0.76
dim_h1 50 dim_h2 200 epsilon 0.01 : cost = 0.9183202295245277, accuracy = 0.755
dim_h1 50 dim_h2 250 epsilon 0.01 : cost = 1.0580843142112455, accuracy = 0.735
dim_h1 50 dim_h2 300 epsilon 0.01 : cost = 0.8772264317188715, accuracy = 0.79
dim_h1 50 dim_h2 400 epsilon 0.01 : cost = 0.9233949111314874, accuracy = 0.775
dim_h1 100 dim_h2 20 epsilon 0.01 : cost = 1.0451622994133636, accuracy = 0.77
dim_h1 100 dim_h2 30 epsilon 0.01 : cost = 1.0828539782026574, accuracy = 0.7
dim_h1 100 dim_h2 50 epsilon 0.01 : cost = 1.01866575

# Hidden layer 1 = 450 and hidden layer 2 = 300 work best

And its result:

In [18]:
e=0.01
hid_1=450
hid_2=300

model = build_model(train_dataset, train_labels, hid_1, hid_2, e, reg_lambda,epoch=20000,print_cost= True)
acc, cost = test(model, test_dataset, test_labels)
print("dim_h1 {} dim_h2 {} epsilon {} : cost = {}, accuracy = {}".format(hid_1, hid_2,e, cost,acc))

iteration 1000 cost : 0.5589362103262916
iteration 2000 cost : 0.2857177892919409
iteration 3000 cost : 0.2086103930926925
iteration 4000 cost : 0.1685329694670511
iteration 5000 cost : 0.1496999336544728
iteration 6000 cost : 0.1426500674112879
iteration 7000 cost : 0.1379729294085051
iteration 8000 cost : 0.13636778321346316
iteration 9000 cost : 0.1369857547402311
iteration 10000 cost : 0.13841317815429113
iteration 11000 cost : 0.14008645251534416
iteration 12000 cost : 0.14198971561608006
iteration 13000 cost : 0.1442127151470001
iteration 14000 cost : 0.14657355730923627
iteration 15000 cost : 0.14893428091656358
iteration 16000 cost : 0.15120806403005027
iteration 17000 cost : 0.15335015603053131
iteration 18000 cost : 0.1553471620653516
iteration 19000 cost : 0.15719996927746252
iteration 20000 cost : 0.15891239120295506
dim_h1 450 dim_h2 300 epsilon 0.01 : cost = 1.1139135370289857, accuracy = 0.73


In [22]:
acc, cost = test(model, test_dataset, test_labels)
print('test data accuracy : {}, cost : {}'.format(acc, cost))

test data accuracy : 0.8425, cost : 0.6776996527210342
