# Part 1: Basics of Neural Networks
* <b>Learning Objective:</b> In this part, you will implement a basic multi-layer fully connected neural network to perform the same classification task.
* <b>Provided Code:</b> We provide skeletons of the classes you need to complete. Forward checks and gradient checks are also provided so you can verify your implementation.
* <b>TODOs:</b> You will implement the forward and backward passes for standard layers and loss functions, several widely used optimizers, and part of the training procedure. Finally, you will train a network from scratch on your own.

In [1]:
from lib.fully_conn import *
from lib.layer_utils import *
from lib.grad_check import *
from lib.datasets import *
from lib.optim import *
from lib.train import *
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

## Loading the data (CIFAR-10)
Load the properly split CIFAR-10 data.

## Implement Standard Layers
You will now implement the following standard layers, all commonly seen in fully connected neural networks. Please refer to the file layer_utils.py under the directory lib. Take a look at each class skeleton, and we will walk you through the network layer by layer. We provide pre-computed results for some examples so you can check the forward pass, as well as gradient checks for the backward pass.
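As a rough illustration of what a gradient check does, the sketch below compares an analytic gradient with a centered finite-difference estimate. The helper names `numerical_gradient` and `rel_error` are hypothetical and are not taken from lib/grad_check.py; the provided checker may differ in step size and error metric.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx for a scalar-valued f (hypothetical helper)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
    return grad

def rel_error(a, b):
    """Maximum elementwise relative error between two arrays (hypothetical helper)."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# Toy check: f(x) = sum(x^2) has the analytic gradient 2x.
x = np.random.default_rng(0).standard_normal((3, 4))
analytic = 2.0 * x
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
grad_check_error = rel_error(analytic, numeric)
```

A small `grad_check_error` suggests that the analytic and numerical gradients agree; the same comparison applies to each layer's backward pass.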

## FC Forward
In the class skeleton "fc", please complete the forward pass in the function "forward". The input to the fc layer may not have shape (batch size, feature size); it could be an image or any other higher-dimensional data. Make sure that you handle this dimensionality issue.
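One common way to handle higher-dimensional input is to flatten everything except the batch axis before the affine transform. The following is a minimal sketch, assuming weights of shape (D, M) and a bias of shape (M,); the function name `fc_forward_sketch` is hypothetical and does not reflect the actual class interface in layer_utils.py.

```python
import numpy as np

def fc_forward_sketch(x, w, b):
    """Affine forward pass (hypothetical helper).

    x: (N, d1, ..., dk), w: (D, M) with D = d1 * ... * dk, b: (M,).
    """
    x_flat = x.reshape(x.shape[0], -1)  # collapse trailing dims to (N, D)
    return x_flat @ w + b               # (N, M)

# Toy input shaped like a small batch of multi-channel images.
x = np.linspace(-0.1, 0.5, num=2 * 4 * 5 * 6).reshape(2, 4, 5, 6)
w = np.linspace(-0.2, 0.3, num=120 * 3).reshape(120, 3)
b = np.linspace(-0.3, 0.1, num=3)
out = fc_forward_sketch(x, w, b)
```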

In [1]:
# Test the fc forward function
# Compare your output with the above pre-computed ones. 
# The difference should not be larger than 1e-8


## FC Backward
Please complete the function "backward" as the backward pass of the fc layer. Follow the instructions in the comments to store gradients into the predefined dictionaries in the attributes of the class. Parameters of the layer are also stored in the predefined dictionary.
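For reference, the gradients of an affine layer out = x_flat · w + b follow from the chain rule: dx = dout · wᵀ (reshaped back to the input shape), dw = x_flatᵀ · dout, and db is the sum of dout over the batch. The sketch below returns these values directly instead of storing them in the class dictionaries; `fc_backward_sketch` is a hypothetical name, and the storage convention in the skeleton is not shown here.

```python
import numpy as np

def fc_backward_sketch(dout, x, w):
    """Gradients of out = x_flat @ w + b with respect to x, w, b (hypothetical helper)."""
    x_flat = x.reshape(x.shape[0], -1)
    dx = (dout @ w.T).reshape(x.shape)  # restore the original input shape
    dw = x_flat.T @ dout
    db = dout.sum(axis=0)
    return dx, dw, db

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))
w = rng.standard_normal((6, 5))
b = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))
dx, dw, db = fc_backward_sketch(dout, x, w)

# Spot check one weight entry against a centered finite difference.
h = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0, 0] += h
w_minus[0, 0] -= h
f = lambda w_: np.sum((x.reshape(4, -1) @ w_ + b) * dout)
dw00_numeric = (f(w_plus) - f(w_minus)) / (2 * h)
```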

In [2]:
# Test the fc backward function
# The error should be around 1e-10


## ReLU Forward
In the class skeleton "relu", please complete the forward pass.

In [3]:
# Test the relu forward function
# Compare your output with the above pre-computed ones. 
# The difference should not be larger than 1e-8


## ReLU Backward
Please complete the backward pass of the class relu.
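ReLU computes max(0, x) elementwise, so its backward pass lets the upstream gradient through only where the input was positive. A minimal sketch with hypothetical function names (the class skeleton caches its input differently):

```python
import numpy as np

def relu_forward_sketch(x):
    """Elementwise max(0, x)."""
    return np.maximum(0, x)

def relu_backward_sketch(dout, x):
    """Pass the upstream gradient only where the input was positive."""
    return dout * (x > 0)

x = np.array([[-1.5, 0.0, 2.0], [3.0, -0.5, 0.25]])
out = relu_forward_sketch(x)
dx = relu_backward_sketch(np.ones_like(x), x)
```

Here the gradient at exactly x = 0 is taken to be 0, which is a common convention rather than a requirement.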

In [4]:
# Test the relu backward function
# The error should not be larger than 1e-10


# Dropout

Dropout **[1]** is a technique for regularizing neural networks by randomly setting some features to zero during the forward pass. In this part, you will implement a dropout layer and modify your fully-connected network to optionally use dropout.

**[1] Geoffrey E. Hinton et al, "Improving neural networks by preventing co-adaptation of feature detectors", arXiv 2012**

## Dropout Forward
In the class "dropout", please complete the forward pass. Remember that dropout is applied only during the training phase, so pay attention to this while implementing the function.

## Dropout Backward
Please complete the backward pass. Again, remember that dropout is applied only during the training phase, and handle this in the backward pass as well.
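The sketch below shows one possible way to organize the train/test distinction. It assumes "inverted" dropout, where the surviving activations are scaled by 1 / keep_prob at training time so that test time becomes an identity; the assignment skeleton may define its probability parameter or scaling differently, and all names here are hypothetical.

```python
import numpy as np

def dropout_forward_sketch(x, keep_prob, training, rng):
    """Inverted dropout (assumed variant): scale kept units by 1 / keep_prob."""
    if not training:
        return x, None  # identity at test time
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask, mask

def dropout_backward_sketch(dout, mask, training):
    """Reuse the forward mask during training; identity at test time."""
    if not training:
        return dout
    return dout * mask

rng = np.random.default_rng(0)
x = np.ones((500, 500))
train_out, mask = dropout_forward_sketch(x, 0.8, True, rng)
test_out, _ = dropout_forward_sketch(x, 0.8, False, rng)
dx = dropout_backward_sketch(np.ones_like(x), mask, True)
```

With this scaling, the mean of `train_out` stays close to the mean of the input, which is a quick sanity check for the forward pass.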

## Testing cascaded layers: FC + ReLU
Please find the TestFCReLU function in fully_conn.py under the lib directory. <br />
You only need to complete a few lines of code in the TODO block. <br />
Please design an FC --> ReLU two-layer mini-network whose parameters match the given x, w, and b. <br />
Please insert the names you defined for each layer into param_name_w and param_name_b, respectively. <br />
Here you only modify the param_name part; the _w and _b suffixes are assigned automatically during network setup.

In [5]:
# The errors should not be larger than 1e-7


## SoftMax Function and Loss Layer
In layer_utils.py, please first complete the function softmax, which will be used in the function cross_entropy. Please refer to the mathematical expression of the cross-entropy loss function, and complete its forward pass and backward pass.
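For a batch of N score vectors s with integer labels y, the softmax probabilities are p = exp(s) / Σ exp(s), the loss is L = -(1/N) Σ_i log p_{i, y_i}, and the gradient with respect to the scores is (p - onehot(y)) / N. The sketch below assumes the loss is averaged over the batch, which may or may not match the skeleton's convention; the function names are hypothetical. Subtracting the row maximum before exponentiating avoids overflow without changing the result.

```python
import numpy as np

def softmax_sketch(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_sketch(scores, labels):
    """Mean cross-entropy loss and its gradient with respect to the scores."""
    n = scores.shape[0]
    probs = softmax_sketch(scores)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    dscores = probs.copy()
    dscores[np.arange(n), labels] -= 1.0  # p - onehot(y)
    return loss, dscores / n

# With all-zero scores over 10 classes, the loss should equal log(10).
scores = np.zeros((4, 10))
labels = np.array([0, 3, 5, 9])
loss, dscores = cross_entropy_sketch(scores, labels)
stable = softmax_sketch(np.array([[1000.0, 1000.0]]))
```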

## Test a Small Fully Connected Network
Please find the SmallFullyConnectedNetwork function in fully_conn.py under the lib directory. <br />
Again, you only need to complete a few lines of code in the TODO block. <br />
Please design an FC --> ReLU --> FC --> ReLU network where the shapes of the parameters match the given shapes. <br />
Please insert the names you defined for each layer into param_name_w and param_name_b, respectively. <br />
Here you only modify the param_name part; the _w and _b suffixes are assigned automatically during network setup.

## Test a Fully Connected Network regularized with Dropout
Please find the DropoutNet function in fully_conn.py under the lib directory. <br />
For this part you don't need to design a new network; simply run the following test code. <br />
If something goes wrong, double-check your dropout implementation.

## Training a Network
In this section, we have defined a TinyNet class in fully_conn.py with a TODO block for you to fill in.
* Please design a two-layer fully connected network for this part.
* Please read train.py under the lib directory carefully and complete the TODO blocks in the train_net function first.
* In addition, read how the SGD function is implemented in optim.py. You will be asked to complete three other optimization methods in later sections.

### Now train the network to achieve at least 50% validation accuracy

In [6]:
# Take a look at what names of params were stored


In [7]:
# How to load the parameters to a newly defined network


In [8]:
# Plot the learning curves


## Different Optimizers
There are several optimizers more advanced than vanilla SGD. You will implement three more sophisticated and widely used methods in this section. Please complete the TODOs in optim.py under the lib directory.

## SGD + Momentum
The update rule of SGD with momentum is shown below: <br />
\begin{equation}
v_t: \text{velocity} \\
\gamma: \text{momentum} \\
\eta: \text{learning rate} \\
v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta) \\
\theta = \theta - v_t
\end{equation}
Complete the SGDM() function in optim.py
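A single SGD-with-momentum step following the rule v_t = γ v_{t-1} + η ∇J(θ), θ = θ - v_t can be sketched as a pure function. The signature below is hypothetical; the SGDM() function in optim.py likely keeps the velocity in its own state rather than passing it in and out.

```python
import numpy as np

def sgdm_step_sketch(theta, grad, velocity, lr=1e-2, momentum=0.9):
    """One momentum update: accumulate the velocity, then step against it."""
    velocity = momentum * velocity + lr * grad
    theta = theta - velocity
    return theta, velocity

# Single step with hand-checkable numbers.
theta1, v1 = sgdm_step_sketch(np.array([1.0]), np.array([2.0]),
                              np.array([0.5]), lr=0.1, momentum=0.9)

# Minimize J(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    grad = 2.0 * theta
    theta, velocity = sgdm_step_sketch(theta, grad, velocity)
```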

In [37]:
# SGD with momentum


In [38]:
# Test the implementation of SGD with Momentum


updated_w error:  8.88234703351e-09
velocity error:  4.26928774328e-09


Run the following code block to train a multi-layer fully connected network with both SGD and SGD plus Momentum. The network trained with SGDM optimizer should converge faster.

In [9]:
# Arrange a small subset of the data


## RMSProp
The update rule of RMSProp is shown below: <br />
\begin{equation}
\gamma: \text{decay rate} \\
\epsilon: \text{small number} \\
g_t^2: \text{squared gradients} \\
\eta: \text{learning rate} \\
E[g^2]_t: \text{decaying average of past squared gradients at update step } t \\
E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2 \\
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} g_t
\end{equation}
Complete the RMSProp() function in optim.py
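A single RMSProp step can be sketched as follows, assuming the standard formulation in which the gradient g_t is divided by the square root of the decaying average of squared gradients (with ε inside the square root, as in the notation here). The function name and signature are hypothetical and do not reflect how RMSProp() in optim.py stores its state.

```python
import numpy as np

def rmsprop_step_sketch(theta, grad, cache, lr=1e-2, decay=0.99, eps=1e-8):
    """One RMSProp update; cache holds the running average E[g^2]."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / np.sqrt(cache + eps)
    return theta, cache

# Single step with hand-checkable numbers (eps set to 0 for clarity).
theta1, cache1 = rmsprop_step_sketch(np.array([1.0]), np.array([2.0]),
                                     np.array([0.0]), lr=0.1, decay=0.9, eps=0.0)
```

In this example the cache becomes 0.1 · 2² = 0.4, so the step size is 0.1 · 2 / √0.4.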

In [10]:
# Test RMSProp implementation; you should see errors less than 1e-7


## Adam
The update rule of Adam is shown below: <br />
\begin{equation}
g_t: \text{gradients at update step } t \\
m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \\
v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\hat{m_t} = \frac{m_t}{1-\beta_1^t}: \text{bias-corrected } m_t \\
\hat{v_t} = \frac{v_t}{1-\beta_2^t}: \text{bias-corrected } v_t \\
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v_t}}+\epsilon}\hat{m_t}
\end{equation}
Complete the Adam() function in optim.py
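A single Adam step can be sketched as below, assuming the standard formulation: the first moment uses β1, the second moment uses β2, both are bias-corrected by dividing by 1 - β^t, and the parameter step is η · m̂_t / (√v̂_t + ε). The function name, signature, and default hyperparameters are assumptions and are not taken from optim.py.

```python
import numpy as np

def adam_step_sketch(theta, grad, m, v, t,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t counts the steps taken before this one."""
    t = t + 1
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

theta0 = np.array([1.0, -1.0])
grad0 = np.array([0.5, -3.0])
theta1, m1, v1, t1 = adam_step_sketch(theta0, grad0,
                                      np.zeros(2), np.zeros(2), 0)
```

On the very first step from zero moments, the bias correction makes m̂ equal to the gradient and v̂ equal to its square, so each parameter moves by roughly the learning rate in the direction opposite to the sign of its gradient.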

In [11]:
# Test Adam implementation; you should see errors around 1e-7 or less


## Comparing the optimizers
Compare the results of all the optimizers above.

## Training a Network with Dropout
Write code to compare the results with and without dropout.

In [12]:
# Train two identical nets, one with dropout and one without


In [13]:
# Plot train and validation accuracies of the two models


### Inline Question: Describe what you observe from the above results and graphs

In [15]:
#Answer: fill in here.

## Plot the Activation Functions
For each activation function, use the given lambda function template to plot its corresponding curve.
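As an illustration of the lambda style, the sketch below defines a few common activation functions and evaluates them on a grid. The notebook's own template and its exact list of activations are not shown here, so the dictionary, the chosen functions, and the leaky ReLU slope of 0.01 are all assumptions.

```python
import numpy as np

# Hypothetical set of activations written as lambdas over NumPy arrays.
activations = {
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": lambda x: np.tanh(x),
    "relu": lambda x: np.maximum(0.0, x),
    "leaky_relu": lambda x: np.where(x > 0, x, 0.01 * x),
}

xs = np.linspace(-5.0, 5.0, 101)
curves = {name: fn(xs) for name, fn in activations.items()}
```

Each entry of `curves` can then be passed together with `xs` to `plt.plot` to draw the corresponding curve.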