# Homework 2

**Due: Wednesday May 04, 2022**

# Classification with one hidden layer Neural Network

In this assignment you will build your first neural network, which will have one or more hidden layers. You may start with a single hidden layer. You will see a big difference between this model and the one we implemented using logistic regression.

**You will perform the following tasks:**
- Implement a 4-class classification neural network with a single hidden layer.
- Use units with a non-linear activation function, such as tanh, in the hidden layer, and a softmax output layer.
- Compute the cross-entropy loss
- Implement forward and backward propagation

# Dataset and Helper Code

- You may download a subset of the data file [CIFAR10_small](https://cs.umd.edu/class/spring2022/cmsc426-0201/data/cifar10_data_small.h5) for the assignment from the web page. This subset contains images of 4 (airplane, automobile, bird, and cat) out of 10 classes. If you want to use the full dataset with all 10 categories, you may download [CIFAR10](https://cs.umd.edu/class/spring2022/cmsc426-0201/data/cifar10_data.h5)
- You may find the Logistic Regression and single layer neural network notebooks shown in class useful for this homework.
-  You may use the following [helper Jupyter notebook](https://github.com/nayeemmz/cmsc426Spring2022/blob/main/assets/hw2/loadCifarData.ipynb) to load the provided data file.

## 1 - Packages ##

Import all the packages that you will need during this assignment.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from skimage.transform import resize
import os, shutil
%matplotlib inline

## 2 - Overview of the Problem set ##

**Problem Statement**: We have a dataset containing:

- a training set of 20000 images of four categories labeled as airplane 0, automobile 1, bird 2, cat 3. These labels are in one hot encoded format as you will see in the provided helper code.
- a test set of 4000 images labeled as airplane 0, automobile 1, bird 2, cat 3
- each image is of shape (32x32x3).

We will build a simple image-recognition algorithm that can correctly classify images of these four categories.
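
As a minimal sketch of the one-hot format mentioned above, integer labels 0 to 3 can be mapped to vectors containing a single 1. The helper names `to_one_hot` and `from_one_hot` are hypothetical and not part of the provided helper notebook. The layout with one row per example is also an assumption, so the arrays produced by the helper notebook may be arranged differently.

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Map integer labels of shape (m,) to a one-hot array of shape (m, num_classes)."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.shape[0], num_classes))
    one_hot[np.arange(labels.shape[0]), labels] = 1.0   # set one entry per row
    return one_hot

def from_one_hot(one_hot):
    """Recover integer labels by taking the index of the 1 in each row."""
    return np.argmax(one_hot, axis=1)

example_labels = np.array([0, 3, 1, 2])   # airplane, cat, automobile, bird
encoded = to_one_hot(example_labels, 4)
print(encoded)
print(from_one_hot(encoded))
```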

Let's get more familiar with the dataset. Load the data.

In [None]:
# Load your data set  
# and their labels
### START CODE HERE ### 




### END CODE HERE ###


In [None]:
print(train_data.shape)
print(test_data.shape)
print(train_labels.shape)
print(test_labels.shape)

In [None]:
# label categories: airplane 0, automobile 1, bird 2, cat 3, deer 4, dog 5, frog 6, horse 7, ship 8, and truck 9

label_names=['airplane',
 'automobile',
 'bird',
 'cat',
 'deer',
 'dog',
 'frog',
 'horse',
 'ship',
 'truck']

In [None]:
# Show an Example of a picture and its categorical label



Many software bugs in deep learning come from having matrix/vector dimensions that don't fit. If you can keep your matrix/vector dimensions straight you will go a long way toward eliminating many bugs. 

**Exercise:** Find the values for:

- m_train (number of training examples)
- m_test (number of test examples)
- num_px (= height = width of a training image)

Remember that `train_data` is a numpy-array of shape (m_train, num_px, num_px, 3). 

In [None]:
### START CODE HERE ### 


### END CODE HERE ###

print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: (" + str(num_px) + ", " + str(num_px) + ", 3)")

For convenience, you should now reshape images of shape (num_px, num_px, 3) into a numpy-array of shape (num_px $*$ num_px $*$ 3, 1). After this, our training (and test) dataset is a numpy-array where each column represents a flattened image. There should be m_train (respectively m_test) columns.

**Exercise:** Reshape the training and test data sets so that images of size (num_px, num_px, 3) are flattened into single vectors of shape (num\_px $*$ num\_px $*$ 3, 1).

A trick when you want to flatten a matrix X of shape (a,b,c,d) to a matrix X_flatten of shape (b$*$c$*$d, a) is to use: 
```python
X_flatten = X.reshape(X.shape[0], -1).T      # X.T is the transpose of X
```

To represent color images, the red, green and blue (RGB) channels must be specified for each pixel, so each pixel value is actually a vector of three numbers ranging from 0 to 255.

One common preprocessing step in machine learning is to center and standardize your dataset: you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array. For picture datasets, it is simpler and works almost as well to just divide every value in the dataset by 255 (the maximum value of a pixel channel).

Let's standardize our dataset.
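
The flattening trick and the division by 255 can be tried on made-up data first. In this sketch, `fake_images` is a random stand-in with the same layout as the image arrays described above. It is not the CIFAR-10 file, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for image data: 5 images of 32x32 pixels with 3 channels.
fake_images = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

# Flatten each image into one column: (5, 32, 32, 3) -> (3072, 5).
flat = fake_images.reshape(fake_images.shape[0], -1).T

# Scale pixel values from [0, 255] to [0, 1] and convert to float.
scaled = flat.astype(np.float64) / 255.0

print(flat.shape)
print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)
```

Converting to float before dividing keeps the result independent of the integer type the images were stored in.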

In [None]:
# Standardize the dataset

### START CODE HERE ###


### END CODE HERE ###

<font color='blue'>
**What you need to remember:**

Common steps for pre-processing a new dataset are:
- Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)
- "Standardize" the data (divide by 255 and make it float)

## 3 - Neural Network model

You are going to train a Neural Network with a single hidden layer.

**Here is a representative model with a single unit in the output layer (Remember you would need to modify this to have four units in the output layer and start with 10 units in the hidden layer)**:
<img src="images/OneLayerNN.png" >

**Mathematically**:

For one example $x_{i}$:
$$z_{1 i} =  W_{1} x_{i} + b_{1 i}\tag{1}$$ 

$$a_{1 i} = \tanh(z_{1 i})\tag{2}$$

$$z_{2 i} = W_{2} a_{1 i} + b_{2 i}\tag{3}$$

$$\hat{y}_{i} = a_{2 i} = \sigma(z_{ 2 i})\tag{4}$$

$$\text{where } \sigma \text{ is the softmax function}. $$

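Given the predictions on all the examples, you can also compute the cost $J$ as follows:

$$J = - \frac{1}{n} \sum\limits_{i = 1}^{n} y_{i}\log\left(a_{2i}\right)\tag{6}$$

Here $y_{i}\log\left(a_{2i}\right)$ is summed over the four classes, so with one-hot labels only the log-probability of the true class contributes.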

**Reminder**: The general methodology to build a Neural Network is to:

1. Define the neural network structure (# of input units, # of hidden units, etc).
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

You often build helper functions to compute steps 1-3 and then merge them into one function we call `nn_model()`. Once you've built `nn_model()` and learnt the right parameters, you can make predictions on new data.

### 3.1 - Defining the neural network structure ####

**Exercise**: Define three variables:

- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 10)
- n_y: the size of the output layer

**Hint**: Use shapes of X and Y to find n_x and n_y. Also, hard code the hidden layer size to be 10. You can change it later if you want to try different values.

In [None]:
def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    ### START CODE HERE ### 
           # size of input layer
    
           # size of output layer
    ### END CODE HERE ###
    return (n_x, n_h, n_y)

In [None]:
(n_x, n_h, n_y) = layer_sizes(train_data, train_labels)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the hidden layer is: n_h = " + str(n_h))
print("The size of the output layer is: n_y = " + str(n_y))

### 3.2 - Initialize the model's parameters ####

**Exercise**: Implement the function `initialize_parameters()`.

**Instructions**:
- Make sure your parameters' sizes are right. Refer to the neural network figure above if needed.
- You will initialize the weights matrices with random values. 
    - Use: `np.random.randn(a,b) * 0.01` to randomly initialize a matrix of shape (a,b).
- You will initialize the bias vectors as zeros. 
    - Use: `np.zeros((a,b))` to initialize a matrix of shape (a,b) with zeros.
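
To see how these two calls behave, here is a small sketch with made-up sizes (3 units fed by 5 inputs). The names `W_demo`, `b_demo` and `x_demo` are illustrative and are not the variables the assignment asks for.

```python
import numpy as np

np.random.seed(0)                      # only so the demo is repeatable
W_demo = np.random.randn(3, 5) * 0.01  # small random weights, shape (3, 5)
b_demo = np.zeros((3, 1))              # zero bias column, shape (3, 1)

x_demo = np.ones((5, 2))               # two made-up input columns
z_demo = W_demo @ x_demo + b_demo      # the (3, 1) bias broadcasts across both columns

print(W_demo.shape, b_demo.shape, z_demo.shape)
```

Scaling the random weights by 0.01 keeps the inputs to tanh close to zero at the start of training, where tanh is not saturated and its gradient is largest.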

In [None]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Arguments:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    
    np.random.seed(2) # we set up a seed so that your output matches ours although the initialization is random.
    
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

In [None]:
parameters = initialize_parameters(n_x, n_h, n_y)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

### 3.3 - The Loop ####

**Question**: Implement `forward_propagation()`.

**Instructions**:
- Look above at the mathematical representation of your classifier.
- You will need to implement the softmax function yourself. It is not provided in this notebook.
- You can use the function `np.tanh()`. It is part of the numpy library.
- The steps you have to implement are:
    1. Retrieve each parameter from the dictionary "parameters" (which is the output of `initialize_parameters()`) by using `parameters[".."]`.
    2. Implement Forward Propagation. Compute $Z_{1}, A_{1}, Z_{2}$ and $A_{2}$ (the vector of all your predictions on all the examples in the training set).
- Values needed in the backpropagation are stored in "`cache`". The `cache` will be given as an input to the backpropagation function.
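
One common way to write a column-wise softmax is sketched below, assuming scores are stored with one column per example. The name `softmax_columns` is hypothetical. Subtracting the column maximum before exponentiating is a standard numerical-stability trick that the assignment does not require.

```python
import numpy as np

def softmax_columns(z):
    """Softmax over axis 0, so each column of the result sums to 1."""
    shifted = z - np.max(z, axis=0, keepdims=True)   # avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# 4 classes (rows) and 2 made-up examples (columns).
scores = np.array([[1.0, 1000.0],
                   [2.0, 1000.0],
                   [3.0, 1000.0],
                   [4.0, 1000.0]])
probs = softmax_columns(scores)
print(probs.sum(axis=0))   # each column sums to 1
print(probs[:, 1])         # equal scores give equal probabilities
```

Without the shift, `np.exp(1000.0)` overflows to infinity and the second column would come out as NaN.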

In [None]:
def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(z)
    """

 
    s = 1/(1+np.exp(-z))
   
    
    return s

def softmax(z):
    """
    Compute the softmax of z, turning scores into class probabilities
    
    Arguments:
    z -- A numpy array of scores of shape (number of classes, number of examples)
    
    Return:
    s -- array of the same shape as z; each column holds class probabilities that sum to 1
    """
    
    
    ### Start code here ###
    
    
    
    ### End code here ###
    
    
    
    

def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    A2 -- The softmax output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    # Implement Forward Propagation to calculate A2 (probabilities)
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache

In [None]:
A2, cache = forward_propagation(train_data, parameters)

# Note: we use the mean here just to make sure that your output matches ours. 
print(np.mean(cache['Z1']), np.mean(cache['A1']), np.mean(cache['Z2']), np.mean(cache['A2']))

Now that you have computed $A_{2}$ (in the Python variable "`A2`"), which contains $a_{2i}$ for every example, you can compute the cost function as follows:

$$J = - \frac{1}{n} \sum\limits_{i = 1}^{n} y_{i}\log\left(a_{2i}\right)\tag{7}$$

**Exercise**: Implement `compute_cost()` to compute the value of the cost $J$.

**Instructions**:
- There are many ways to implement the cross-entropy loss. To help you, here is how we would have implemented the sum
$- \sum\limits_{i=1}^{n} y_{i}\log\left(a_{2i}\right)$ (without the $\frac{1}{n}$ factor):
```python
logprobs = np.multiply(np.log(A2),Y)
cost = - np.sum(logprobs)                # no need to use a for loop!
```

(you can use either `np.multiply()` and then `np.sum()` or directly `np.dot()`).
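
The snippet above can be checked by hand on a tiny example. This sketch assumes one column per example and one-hot labels. The arrays `A_demo` and `Y_demo` are made up for illustration.

```python
import numpy as np

# Made-up softmax outputs for 3 examples and 4 classes (columns sum to 1).
A_demo = np.array([[0.7, 0.1, 0.25],
                   [0.1, 0.6, 0.25],
                   [0.1, 0.2, 0.25],
                   [0.1, 0.1, 0.25]])
# One-hot labels for true classes 0, 1 and 3.
Y_demo = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [0, 0, 1]])

n_examples = Y_demo.shape[1]
total = -np.sum(np.multiply(np.log(A_demo), Y_demo))
mean_cost = total / n_examples          # apply the 1/n factor from the cost J

# Same value computed directly from the probability of each true class.
expected = -(np.log(0.7) + np.log(0.6) + np.log(0.25)) / 3
print(mean_cost, expected)
```

Because the labels are one-hot, the elementwise product keeps only the log-probability assigned to the correct class of each example.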

In [None]:
def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (7)
    
    Arguments:
    A2 -- The softmax output of the second activation, of shape (n_y, number of examples)
    Y -- one-hot encoded "true" labels, of shape (n_y, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
    
    Returns:
    cost -- cross-entropy cost given in equation (7)
    """
    
    m = Y.shape[1] # number of examples

    # Compute the cross-entropy cost
    ### START CODE HERE ### (≈ 2 lines of code)
    
    ### END CODE HERE ###
    
    cost = float(np.squeeze(cost))  # makes sure cost is a plain Python float.
                                    # E.g., turns [[17]] into 17.0
    assert(isinstance(cost, float))
    
    return cost

In [None]:
print("cost = " + str(compute_cost(A2, train_labels, parameters)))

Using the cache computed during forward propagation, you can now implement backward propagation.

**Question**: Implement the function `backward_propagation()`.

**Instructions**:
Backpropagation is usually the hardest (most mathematical) part in deep learning. To help you, here again is the slide from the lecture on backpropagation. You'll want to use the six equations on the right of this slide, since you are building a vectorized implementation.  

<img src="images/gradDesc.png" >

<!--
$\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})$

$\frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1] (i) T} $

$\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}$

$\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } =  W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * ( 1 - a^{[1] (i) 2}) $

$\frac{\partial \mathcal{J} }{ \partial W_1 } = \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} }  X^T $

$\frac{\partial \mathcal{J} _i }{ \partial b_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}$

- Note that $*$ denotes elementwise multiplication.
- The notation you will use is common in deep learning coding:
    - dW1 = $\frac{\partial \mathcal{J} }{ \partial W_1 }$
    - db1 = $\frac{\partial \mathcal{J} }{ \partial b_1 }$
    - dW2 = $\frac{\partial \mathcal{J} }{ \partial W_2 }$
    - db2 = $\frac{\partial \mathcal{J} }{ \partial b_2 }$
    
!-->

- Tips:
    - To compute dZ1 you'll need the gradient of the tanh activation function: if $a_1 = \tanh(z)$ then $\frac{\partial a_1}{\partial z} = 1-a_1^2$. So you can compute this derivative using `(1 - np.power(A1, 2))`.
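
The tanh derivative in the tip can be verified numerically. The sketch below compares the formula with a central finite difference on a few sample points. The variable names and the step size `eps` are arbitrary choices for the demo.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)
a = np.tanh(z)

analytic = 1 - np.power(a, 2)                                  # 1 - tanh(z)^2
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)    # central difference

print(np.max(np.abs(analytic - numeric)))   # should be very close to zero
```

The same finite-difference idea can be used as a sanity check on the gradients of a whole network, although it is far too slow to use for training.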

In [None]:
def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- one-hot encoded "true" labels of shape (n_y, number of examples)
    
    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    ### START CODE HERE ### (≈ 2 lines of code)
    
    ### END CODE HERE ###
        
    # Retrieve also A1 and A2 from dictionary "cache".
    ### START CODE HERE ### (≈ 2 lines of code)
    
    ### END CODE HERE ###
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    ### START CODE HERE ### (≈ 6 lines of code, corresponding to 6 equations on slide above)
    
    ### END CODE HERE ###
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads

In [None]:
grads = backward_propagation(parameters, cache, train_data, train_labels)
print ("dW1 = "+ str(grads["dW1"]))
print ("db1 = "+ str(grads["db1"]))
print ("dW2 = "+ str(grads["dW2"]))
print ("db2 = "+ str(grads["db2"]))

**Question**: Implement the update rule. Use gradient descent. You have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

**General gradient descent rule**: $ w = w - \alpha \frac{\partial J }{ \partial w }$ where $\alpha$ is the learning rate and $w$ represents a parameter.
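
To see the update rule in isolation, here is a sketch that applies it to a one-dimensional toy cost $J(w) = (w - 3)^2$, whose minimum is at $w = 3$. The toy cost, the function name `grad_J`, and the learning rate of 0.1 are assumptions made for this demo and have nothing to do with the network itself.

```python
def grad_J(w):
    """Gradient of the toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0
alpha = 0.1                      # learning rate, chosen arbitrarily for the demo
for _ in range(100):
    w = w - alpha * grad_J(w)    # the update rule from the text

print(w)                         # approaches the minimizer 3.0
```

For this particular toy cost, each step multiplies the distance to the minimum by $(1 - 2\alpha)$, so any learning rate above 1.0 makes the iterates diverge. This is a reminder that the learning rate usually needs tuning.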

In [None]:
def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 
    learning_rate -- step size alpha used in the gradient descent update
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    # Retrieve each gradient from the dictionary "grads"
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    # Update rule for each parameter
    ### START CODE HERE ### (≈ 4 lines of code)
    
    ### END CODE HERE ###
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

In [None]:
parameters = update_parameters(parameters, grads)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

### 3.4 - Integrate parts 3.1, 3.2 and 3.3 in nn_model() ####

**Question**: Build your neural network model in `nn_model()`.

**Instructions**: The neural network model has to use the previous functions in the right order.

In [None]:
def nn_model(X, Y, n_h, num_iterations = 100000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (n_y, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
  
    
    
    # Get n_x and n_y from X and Y, initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    ### START CODE HERE ### (≈ 5 lines of code)
    
    ### END CODE HERE ###
    
    # Loop (gradient descent)

    for i in range(0, num_iterations+1):
         
        ### START CODE HERE ### 
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        
        
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        
 
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        
 
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        
        
        ### END CODE HERE ###
        
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters

In [None]:
parameters = nn_model(train_data, train_labels, 4, num_iterations=10000, print_cost=True)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

### 3.5 Predictions

**Question**: Use your model to predict by building `predict()`.
Use forward propagation to predict results.

**Reminder**: predictions = $y_{prediction} = \text{class label}$, i.e. the index of the largest softmax output for each example.
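
As a sketch of how probabilities turn into class labels and an accuracy figure, the example below uses made-up softmax outputs with one column per example. The arrays `probs` and `one_hot_true` are illustrative stand-ins and are not produced by the model.

```python
import numpy as np

# Made-up softmax outputs: 4 classes (rows) by 3 examples (columns).
probs = np.array([[0.7, 0.1, 0.2],
                  [0.1, 0.6, 0.2],
                  [0.1, 0.2, 0.5],
                  [0.1, 0.1, 0.1]])
# One-hot labels for true classes 0, 1 and 3.
one_hot_true = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0],
                         [0, 0, 1]])

pred_classes = np.argmax(probs, axis=0)          # most probable class per column
true_classes = np.argmax(one_hot_true, axis=0)   # undo the one-hot encoding
accuracy = np.mean(pred_classes == true_classes) # fraction of correct predictions

print(pred_classes, true_classes, accuracy)
```

Here the third example is predicted as class 2 (bird) while its true class is 3 (cat), so two of the three predictions are correct.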
    


In [None]:
def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)
    
    Returns:
    predictions -- vector of predictions of our model (airplane:0, automobile:1, bird:2, cat:3)
    """
    
    # Computes probabilities using forward propagation, and classifies to a particular category.
    ### START CODE HERE ### 
    
    ### END CODE HERE ###
    
    return predictions

In [None]:
## find predictions
### start code here (1 line)###

### end code here ###
print("predictions mean = " + str(np.mean(predictions)))

In [None]:
# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(train_data, train_labels, n_h = 10, num_iterations = 10000, print_cost=True)

In [None]:
# Print accuracy


### Start code here for train accuracy ###


### END CODE HERE ###

### Start code here for test accuracy ###


### END CODE HERE ###

### 3.6 - Tuning hidden layer size  ###

In the cell below, write code to observe different behaviors of the model for various hidden layer sizes.

In [None]:
# Write code to test different hidden layer size and its impact on accuracy
# You will require a for-loop for the various hidden layer sizes

### START CODE### 


### END CODE ###

## 4 - Report

In the cell below, write your analysis in Markdown, covering the following:


- Describe how well the larger models (with more hidden units) fit the training data. Do they eventually overfit, i.e. does accuracy on the test set drop?
- State which hidden layer size worked best and why you think so.
- Train for different numbers of iterations and report the effect.
- Try a few different values of the learning rate and report their effect.

## 5 - Extra Credit (10 points)

In the cells below, complete the following and describe your results in Markdown:


- Add a second hidden layer to your model and show your results.