# CS549 Machine Learning Spring 2024 - Irfan Khan
# Assignment 7: Simple Neural Network

**Total points: 10**

Updated assignment designed by Ex-Professor Yang Xu, Computer Science Dept., SDSU

In this assignment, you will implement a 2-layer shallow neural network model. 

We will use the model to conduct the same binary classification task, i.e., classifying two categories of the sign language dataset.

The input size is the number of pixels in an image ($64 \times 64 = 4096$). The size of the hidden layer is determined by the hyperparameter `n_h`, and the size of the output layer is 1.

In [18]:
import numpy as np
import matplotlib.pyplot as plt
from utils import *

# utils is a "utilities file" that contains the helper functions used in this assignment.
# It is provided in the zip folder for this assignment. Please leave it unchanged.
%matplotlib inline
np.random.seed(1)

In [19]:
# Load data
#Since data is in n x m format, convert into m x n format, m: sample size, n: number of features
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig = load_data()
X_train = X_train_orig.T
Y_train = Y_train_orig.T
X_test = X_test_orig.T
Y_test = Y_test_orig.T

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(286, 4096)
(286, 1)
(125, 4096)
(125, 1)


**Expected output**

(286, 4096)<br>
(286, 1)<br>
(125, 4096)<br>
(125, 1)

### 1.1 Initialize parameters
**1 point**

The parameters associated with the hidden layer are $W^{[1]}$ and $B^{[1]}$, and the parameters associated with the output layer are $W^{[2]}$ and $B^{[2]}$.

We use **tanh** as the activation function for the hidden layer, and **sigmoid** for the output layer.

**Instructions:**
- Initialize parameters randomly
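- Use `np.random.randn(size_out, size_in) * 0.01` to initialize $W^{[l]}$, where `size_out` is the output size of the current layer and `size_in` is the input size from the previous layer. Note that `np.random.randn` takes the dimensions as separate arguments, not as a tuple.
- Use `np.zeros((1, size_out))` to initialize $B^{[l]}$ as a row vector, matching the expected output below.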
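
The orientation of the bias matters because the data is stored with one sample per row. The following is a minimal sketch, with hypothetical sizes chosen only for illustration, of how a `(1, n_h)` bias broadcasts across the rows of `X @ W1.T` while an `(n_h, 1)` bias does not:

```python
import numpy as np

m, n_i, n_h = 5, 3, 4                  # hypothetical sizes, for illustration only
X = np.ones((m, n_i))                  # m samples stored as rows
W1 = np.random.randn(n_h, n_i) * 0.01  # shape (n_h, n_i)

B_row = np.zeros((1, n_h))             # row vector: broadcasts across the m rows
Z1 = X @ W1.T + B_row
print(Z1.shape)                        # (5, 4)

B_col = np.zeros((n_h, 1))             # column vector: wrong orientation for this layout
try:
    X @ W1.T + B_col                   # (5, 4) + (4, 1) cannot be broadcast
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)                    # False
```

If `m` happened to equal `n_h`, the column-shaped bias would broadcast without an error, which can hide the mistake on small test cases.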

In [20]:
def init_params(n_i, n_h, n_o):
    """
    Args:
    n_i -- size of input layer
    n_h -- size of hidden layer
    n_o -- size of output layer
    
    Return:
    params -- a dict object containing all parameters:
        W1 -- weight matrix of layer 1
        B1 -- bias vector of layer 1
        W2 -- weight matrix of layer 2
        B2 -- bias vector of layer 2
    """
    np.random.seed(2) # For deterministic repeatability, DO NOT change this line! 
    
    ### START Code ###
    # 1 hidden layer 
    # 2 output layer
    W1 = np.random.randn(n_h, n_i)*0.01 
    W2 = np.random.randn(n_o, n_h)*0.01
    B1 = np.zeros((1, n_h))  # row vector, so it broadcasts over the m rows of X @ W1.T
    B2 = np.zeros((1, n_o))
    ### END Code ###
    
    params = {'W1': W1, 'B1': B1, 'W2': W2, 'B2': B2}
    
    
    return params

In [21]:
# Evaluate Task
ps = init_params(3, 4, 1)
print('W1 =', ps['W1'])
print('B1 =' ,ps['B1'])
print('W2 =', ps['W2'])
print('B2 =', ps['B2'])

W1 = [[-0.00416758 -0.00056267 -0.02136196]
 [ 0.01640271 -0.01793436 -0.00841747]
 [ 0.00502881 -0.01245288 -0.01057952]
 [-0.00909008  0.00551454  0.02292208]]
B1 = [[0. 0. 0. 0.]]
W2 = [[ 0.00041539 -0.01117925  0.00539058 -0.0059616 ]]
B2 = [[0.]]


**Expected output**

W1 = [[-0.00416758 -0.00056267 -0.02136196]<br>
 [ 0.01640271 -0.01793436 -0.00841747]<br>
 [ 0.00502881 -0.01245288 -0.01057952]<br>
 [-0.00909008  0.00551454  0.02292208]]<br>
B1 = [[0. 0. 0. 0.]]<br>
W2 = [[ 0.00041539 -0.01117925  0.00539058 -0.0059616 ]]<br>
B2 = [[0.]]<br>

### 1.2 Forward propagation

**2 points**

Use the following formulas to implement forward propagation:
- $Z^{[1]} = XW^{[1]T} + B^{[1]}$
- $A^{[1]} = \tanh(Z^{[1]})$ --> use the `np.tanh` function
- $Z^{[2]} = A^{[1]}W^{[2]T} + B^{[2]}$
- $A^{[2]} = \sigma(Z^{[2]})$ --> directly use the `sigmoid` function provided in the `utils` module
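
The `sigmoid` helper lives in `utils` and its source is not shown here. As a rough reference, assuming it is the standard logistic function $\sigma(z) = 1/(1+e^{-z})$, a numerically stable version might look like the sketch below (`sigmoid_sketch` is a hypothetical name, not the helper from `utils`):

```python
import numpy as np

def sigmoid_sketch(z):
    """Elementwise logistic function 1 / (1 + exp(-z)), computed without overflow."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so exp never overflows
    # For z >= 0 use 1 / (1 + e^-z); for z < 0 use the equivalent e^z / (1 + e^z)
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(sigmoid_sketch(np.array([[0.0], [1000.0], [-1000.0]])))
```

The output stays in $[0, 1]$ even for very large inputs, where a direct `1 / (1 + np.exp(-z))` would emit an overflow warning.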

In [44]:
def forward_prop(X, params):
    """
    Args:
    X -- input data of shape (m,n_in)
    params -- a python dict object containing all parameters (output of init_params)
    
    Return:
    A2 -- the activation of the output layer
    cache -- a python dict containing all intermediate values for later use in backprop
             i.e., 'Z1', 'A1', 'Z2', 'A2'
    """
    m = X.shape[0]
    
    # Retrieve parameters
    ### START Code ###
    W1 = params['W1']
    B1 = params['B1']
    W2 = params['W2']
    B2 = params['B2']  
    ### END Code ###
    
    # Implement forward propagation
    ### START Code ###
    Z1 = np.dot(X, W1.T) + B1
    A1 = np.tanh(Z1)
    Z2 = np.dot(A1, W2.T) + B2
    A2 = sigmoid(Z2)
    ### END Code ###
    
    
    assert A1.shape[0] == m
    assert A2.shape[0] == m
    
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    return A2, cache

In [45]:
# Evaluate Task
X_tmp, params_tmp = forwardprop_testcase()

A2, cache = forward_prop(X_tmp, params_tmp)

print('mean(Z1) =', np.mean(cache['Z1']))
print('mean(A1) =', np.mean(cache['A1']))
print('mean(Z2) =', np.mean(cache['Z2']))
print('mean(A2) =', np.mean(cache['A2']))


mean(Z1) = 0.006415781628350418
mean(A1) = 0.006410368144939439
mean(Z2) = -6.432516196270971e-05
mean(A2) = 0.49998391870952374


**Expected output**

mean(Z1) = 0.006415781628350418<br>
mean(A1) = 0.006410368144939439<br>
mean(Z2) = -6.432516196270971e-05<br>
mean(A2) = 0.49998391870952374<br>
***

### 1.3 Backward propagation
**3 points**

Use the following formulas to implement backward propagation:
- $dZ^{[2]} = A^{[2]} - Y$
- $dW^{[2]} = \frac{1}{m} dZ^{[2]T} A^{[1]}$, where $m$ is the number of examples
- $dB^{[2]} = \frac{1}{m}$ `np.sum(dZ2, axis=0, keepdims=True)`
- $dA^{[1]} = dZ^{[2]} W^{[2]}$
- $dZ^{[1]} = dA^{[1]} * g'(Z^{[1]})$
    - Here $*$ denotes element-wise multiplication
    - $g(z)$ is the tanh function, so its derivative is $g'(Z^{[1]}) = 1 - (g(Z^{[1]}))^2 = 1 - (A^{[1]})^2$
- $dW^{[1]} = \frac{1}{m} dZ^{[1]T} X$
- $dB^{[1]} = \frac{1}{m}$ `np.sum(dZ1, axis=0, keepdims=True)`
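
One way to sanity-check these formulas is a finite-difference gradient check. The sketch below is self-contained and does not use the assignment's helper functions. The `sigmoid` definition, the tiny random data, and the step size `eps` are all assumptions made for illustration. It compares the analytic $dW^{[1]}$ with a central-difference estimate of the cross-entropy cost gradient.

```python
import numpy as np

def sigmoid(z):  # assumed logistic definition, not the one from utils
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(X, Y, W1, B1, W2, B2):
    """Mean binary cross-entropy of the 2-layer network."""
    A1 = np.tanh(X @ W1.T + B1)
    A2 = sigmoid(A1 @ W2.T + B2)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

def analytic_dW1(X, Y, W1, B1, W2, B2):
    """dW1 computed with the backward-propagation formulas."""
    m = X.shape[0]
    A1 = np.tanh(X @ W1.T + B1)
    A2 = sigmoid(A1 @ W2.T + B2)
    dZ2 = A2 - Y
    dZ1 = (dZ2 @ W2) * (1 - A1 ** 2)
    return dZ1.T @ X / m

rng = np.random.default_rng(0)
m, n_i, n_h = 6, 3, 4  # hypothetical sizes
X = rng.normal(size=(m, n_i))
Y = (rng.random((m, 1)) > 0.5).astype(float)
W1 = rng.normal(size=(n_h, n_i)) * 0.5
B1 = np.zeros((1, n_h))
W2 = rng.normal(size=(1, n_h)) * 0.5
B2 = np.zeros((1, 1))

eps = 1e-6
numeric = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    Wp, Wm = W1.copy(), W1.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    # Central difference: (J(w + eps) - J(w - eps)) / (2 * eps)
    numeric[idx] = (cost_fn(X, Y, Wp, B1, W2, B2) - cost_fn(X, Y, Wm, B1, W2, B2)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_dW1(X, Y, W1, B1, W2, B2)))
print(max_diff)  # expected to be very small if the formulas are implemented correctly
```

The same loop can be repeated for the other parameters. A large discrepancy usually points to a transposed matrix or a missing $\frac{1}{m}$ factor.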

In [46]:
def backward_prop(X, Y, params, cache):
    """
    Args:
    X -- input data of shape (m,n_in)
    Y -- input label of shape (m,1)
    params -- a python dict containing all parameters
    cache -- a python dict containing 'Z1', 'A1', 'Z2' and 'A2' (output of forward_prop)
    
    Return:
    grads -- a python dict containing the gradients w.r.t. all parameters,
             i.e., dW1, dB1, dW2, dB2
    """
    m = X.shape[0]
    
    # Retrieve parameters
    ### START Code ###
    W1 = params['W1']
    B1 = params['B1']
    W2 = params['W2']
    B2 = params['B2']
    ### END Code ###
    
    
    # Retrieve intermediate values stored in cache
    ### START Code ###
    Z1 = cache['Z1']
    A1 = cache['A1']
    Z2 = cache['Z2']
    A2 = cache['A2']
    ### END Code ###
    
    
    # Implement backprop
    ### START Code ###
    dZ2 = A2 - Y
    dW2 = (1/m) * np.dot(dZ2.T, A1)
    dB2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)
    dA1 = np.dot(dZ2, W2)
    dZ1 = dA1 * (1 - np.power(A1, 2))
    dW1 = (1/m) * np.dot(dZ1.T, X)
    dB1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)
    
    ### END Code ###
    
    
    grads = {'dW1': dW1, 'dB1': dB1, 'dW2': dW2, 'dB2': dB2}
    
    return grads

In [47]:
# Evaluate Task
X_tmp, Y_tmp, params_tmp, cache_tmp = backprop_testcase()

grads = backward_prop(X_tmp, Y_tmp, params_tmp, cache_tmp)
print('mean(dW1)', np.mean(grads['dW1']))
print('mean(dB1)', np.mean(grads['dB1']))
print('mean(dW2)', np.mean(grads['dW2']))
print('mean(dB2)', np.mean(grads['dB2']))



mean(dW1) -0.00014844465852477853
mean(dB1) -0.0002838378969105248
mean(dW2) -0.004079186018202939
mean(dB2) 0.09998392000000002


**Expected output**

mean(dW1) -0.00014844465852477853<br>
mean(dB1) -0.0002838378969105248<br>
mean(dW2) -0.004079186018202939<br>
mean(dB2) 0.09998392000000002<br>

***

### 1.4 Update parameters
**1 point**

Update $W^{[1]}, B^{[1]}, W^{[2]}, B^{[2]}$ accordingly, where $\alpha$ is the learning rate:
- $W^{[1]} = W^{[1]} - \alpha\ dW^{[1]}$
- $B^{[1]} = B^{[1]} - \alpha\ dB^{[1]}$
- $W^{[2]} = W^{[2]} - \alpha\ dW^{[2]}$
- $B^{[2]} = B^{[2]} - \alpha\ dB^{[2]}$
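
The four updates share one pattern, so they can also be written as a loop over the parameter names. This is only a sketch of that idea, not the required solution: `sgd_step` is a hypothetical name, and the shape assertion is an added safeguard that is not part of the assignment.

```python
import numpy as np

def sgd_step(params, grads, alpha):
    """Return new parameters after one gradient descent step; inputs are not mutated."""
    new_params = {}
    for name, value in params.items():
        grad = grads['d' + name]
        # Without this check, a mismatched gradient could broadcast silently
        # and change the shape of the parameter.
        assert grad.shape == value.shape, f"{name}: {value.shape} vs {grad.shape}"
        new_params[name] = value - alpha * grad
    return new_params

params = {'W1': np.array([[1.0, -2.0]]), 'B1': np.array([[0.5]])}
grads = {'dW1': np.array([[10.0, 10.0]]), 'dB1': np.array([[-5.0]])}
new_params = sgd_step(params, grads, alpha=0.1)
print(new_params['W1'])  # [[ 0. -3.]]
print(new_params['B1'])  # [[1.]]
```

The shape check is useful here because subtracting a `(1, n_h)` gradient from an `(n_h, 1)` bias does not raise an error; it quietly produces an `(n_h, n_h)` array.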

In [48]:
def update_params(params, grads, alpha):
    """
    Args:
    params -- a python dict containing all parameters
    grads -- a python dict containing the gradients w.r.t. all parameters (output of backward_prop)
    alpha -- learning rate
    
    Return:
    params -- a python dict containing all updated parameters
    """
    # Retrieve parameters
    W1 = params['W1']
    B1 = params['B1']
    W2 = params['W2']
    B2 = params['B2']
    
    # Retrieve gradients
    dW1 = grads['dW1']
    dB1 = grads['dB1']
    dW2 = grads['dW2']
    dB2 = grads['dB2']
    
    ### START Code ###
    W1 = W1 - alpha * dW1
    B1 = B1 - alpha * dB1
    W2 = W2 - alpha * dW2
    B2 = B2 - alpha * dB2
    ### END Code ###
    
    
    params = {'W1': W1, 'B1': B1, 'W2': W2, 'B2': B2}
    
    return params

In [49]:
# Evaluate Task
params_tmp, grads_tmp = update_params_testcase()

params = update_params(params_tmp, grads_tmp, 0.01)
print('W1 =', params['W1'])
print('B1 =', params['B1'])
print('W2 =', params['W2'])
print('B2 =', params['B2'])

W1 = [[ 0.004169   -0.00056367 -0.02136304]
 [ 0.0163645  -0.01790747 -0.00838857]
 [ 0.00504726 -0.01246588 -0.01059348]
 [-0.00911046  0.0055289   0.0229375 ]]
B1 = [[-4.13852251e-07  1.12173654e-05 -5.39304763e-06  5.94305036e-06]]
W2 = [[ 0.00048642 -0.011058    0.00546531 -0.00606545]]
B2 = [[-0.00099984]]


**Expected output**

W1 = [[ 0.004169   -0.00056367 -0.02136304]<br>
 [ 0.0163645  -0.01790747 -0.00838857]<br>
 [ 0.00504726 -0.01246588 -0.01059348]<br>
 [-0.00911046  0.0055289   0.0229375 ]]<br>
B1 = [[-4.13852251e-07  1.12173654e-05 -5.39304763e-06  5.94305036e-06]]<br>
W2 = [[ 0.00048642 -0.011058    0.00546531 -0.00606545]]<br>
B2 = [[-0.00099984]]<br>

***

### 1.5 Integrated model
**1.5 points**

Integrate `init_params`, `forward_prop`, `backward_prop` and `update_params` into one model.
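
The training loop also reports a cost so that convergence can be monitored. Assuming the standard binary cross-entropy $J = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log a_i + (1-y_i)\log(1-a_i)\right]$, a minimal sketch could look as follows. The function name `cross_entropy_cost` and the clipping constant `eps` are hypothetical additions that guard against `log(0)`.

```python
import numpy as np

def cross_entropy_cost(A2, Y, eps=1e-12):
    """Mean binary cross-entropy; eps is an assumed clipping constant."""
    A2 = np.clip(A2, eps, 1 - eps)  # keep log() away from 0
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

A2 = np.array([[0.5], [0.5]])       # an untrained model predicts about 0.5 everywhere
Y = np.array([[1.0], [0.0]])
print(cross_entropy_cost(A2, Y))    # log(2), about 0.693
```

A starting cost close to $\log 2 \approx 0.693$ is what one would expect from near-zero initial weights, since the sigmoid output is then close to 0.5 for every sample.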

In [50]:
def nn_model(X, Y, n_h, num_iters=10000, alpha=0.01, verbose=False):
    """
    Args:
    X -- training data of shape (m,n_in)
    Y -- training label of shape (m,1)
    n_h -- size of hidden layer
    num_iters -- number of iterations for gradient descent
    alpha -- learning rate
    verbose -- if True, print the cost every 1000 iterations
    
    Return:
    params -- parameters learned by the model. Use these to make predictions on new data
    """
    np.random.seed(3)
    m = X.shape[0]
    n_in = X.shape[1]
    n_out = 1
    
    # Initialize parameters and retrieve them
    ### START Code ###
    params = init_params(n_in, n_h, n_out)
    ### END Code ###
    
    
    # Gradient descent loop
    for i in range(num_iters):
        ### START Code ###
        # Forward propagation
        A2, cache = forward_prop(X, params)

        # Backward propagation
        grads = backward_prop(X, Y, params, cache)

        # Update parameters
        params = update_params(params, grads, alpha)

        # Compute the cross-entropy cost from the pre-update activations A2
        logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
        cost = - np.sum(logprobs) / m
        cost = np.squeeze(cost)
        
        ### END Code ###
        
        # Print cost
        if i % 1000 == 0 and verbose:
            print('Cost after iter {}: {}'.format(i, cost))
    
    return params

In [51]:
# Evaluate Task 1.5
X_tmp, Y_tmp = nn_model_testcase()

params_tmp = nn_model(X_tmp, Y_tmp, n_h=5, num_iters=5000, alpha=0.01)
print('W1 =', params_tmp['W1'])
print('B1 =', params_tmp['B1'])
print('W2 =', params_tmp['W2'])
print('B2 =', params_tmp['B2'])

W1 = [[ 0.30237531 -0.17417915 -0.15306611]
 [ 1.25575279 -0.42239646 -0.35147978]
 [ 1.29886467 -0.43536728 -0.36668058]
 [-1.32065465  0.43563934  0.37269501]
 [ 0.41146082 -0.22524765 -0.15315463]]
B1 = [[-0.10251157 -0.82319548 -0.85962928  0.87045666 -0.16520153]]
W2 = [[ 0.42009393  1.87265216  1.95145175 -1.98319859  0.56655482]]
B2 = [[-0.81216478]]


**Expected output**

W1 = [[ 0.30237531 -0.17417915 -0.15306611]<br>
 [ 1.25575279 -0.42239646 -0.35147978]<br>
 [ 1.29886467 -0.43536728 -0.36668058]<br>
 [-1.32065465  0.43563934  0.37269501]<br>
 [ 0.41146082 -0.22524765 -0.15315463]]<br>
B1 = [[-0.10251157 -0.82319548 -0.85962928  0.87045666 -0.16520153]]<br>
W2 = [[ 0.42009393  1.87265216  1.95145175 -1.98319859  0.56655482]]<br>
B2 = [[-0.81216478]]<br>

***

### 1.6 Predict
**1 point**

Use the learned parameters to make predictions on new data. 
- Compute $A^{[2]}$ by calling `forward_prop`. Note that the `cache` returned will not be used in making predictions.
- Convert $A^{[2]}$ into a vector of 0s and 1s by thresholding at 0.5.
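
The conversion step can be sketched with an explicit comparison. The activations below are made-up values for illustration only.

```python
import numpy as np

A2 = np.array([[0.2], [0.5], [0.51], [0.97]])  # hypothetical output activations
pred = (A2 > 0.5).astype(float)                # 1.0 where the activation exceeds 0.5
print(pred.ravel())                            # [0. 0. 1. 1.]
```

`np.round(A2)` gives the same result on these values, because NumPy rounds exact halves to the nearest even number, so 0.5 maps to 0 under both approaches.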

In [52]:
def predict(X, params):
    """
    Args:
    X -- input data of shape (m,n_in)
    params -- a python dict containing the learned parameters
    
    Return:
    pred -- predictions of model on X, a vector of 0s and 1s
    """
    
   
    ### START Code ###
    A2, cache = forward_prop(X, params)
    pred = np.round(A2)
    ### END Code ###
    
    
    return pred

In [53]:
# Evaluate Task 1.6
# NOTE: the X_tmp and params_tmp are the ones generated in evaluating Task 1.5 (two cells above)
pred = predict(X_tmp, params_tmp)
print('predictions = ', pred)

predictions =  [[0.]
 [1.]
 [1.]
 [0.]
 [0.]]


**Expected output**

predictions =  [[0.]<br>
 [1.]<br>
 [1.]<br>
 [0.]<br>
 [0.]]<br>

***

### 1.7 Train and evaluate

**0.5 point**

Train the neural network model on X_train and Y_train, and evaluate it on X_test and Y_test.
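
Accuracy is the fraction of predictions that equal the labels. The comparison is only meaningful when both arrays have the same shape, as the following sketch with made-up values illustrates:

```python
import numpy as np

pred = np.array([[1.0], [0.0], [1.0], [1.0]])   # shape (4, 1)
y_col = np.array([[1.0], [0.0], [0.0], [1.0]])  # shape (4, 1)
y_flat = y_col.ravel()                          # shape (4,)

acc = np.mean(pred == y_col)                    # element-wise comparison
print(acc)                                      # 0.75

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) matrix, which is not an accuracy
print((pred == y_flat).shape)                   # (4, 4)

acc_safe = np.mean(pred.ravel() == y_flat)      # flatten both sides first
print(acc_safe)                                 # 0.75
```

Flattening both arrays before the comparison avoids the accidental broadcast regardless of how the labels are stored.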



In [56]:
# Train the model on X_train and Y_train, and print cost
# DO NOT change the hyperparameters, so that your output matches the expected one.
params = nn_model(X_train, Y_train, n_h = 10, num_iters=10000, verbose=True)

# Make predictions on X_test
pred = predict(X_test, params)


# Compute accuracy by comparing predictions and Y_test
### START YOUR CODE ###
# Flatten both arrays so the comparison is element-wise regardless of shape
acc = np.mean(pred.flatten() == Y_test.flatten())
### END YOUR CODE ###
print('Accuracy = {0:.2f}%'.format(acc * 100))

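Cost after iter 0: 0.6931077265775999
Cost after iter 1000: 0.27191665440434465
Cost after iter 2000: 0.05471220502234073
Cost after iter 3000: 0.024320586832899165
Cost after iter 4000: 0.01459203911762461
Cost after iter 5000: 0.010128918610307609
Cost after iter 6000: 0.0076426515266461124
Cost after iter 7000: 0.006082606294144723
Cost after iter 8000: 0.005022499769792935
Cost after iter 9000: 0.004259937441762031
Accuracy = 95.20%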

**Expected output**

Cost after iter 0: 0.6931077265775999<br>
Cost after iter 1000: 0.27191665440434465<br>
Cost after iter 2000: 0.05471220502234073<br>
Cost after iter 3000: 0.024320586832899165<br>
Cost after iter 4000: 0.01459203911762461<br>
Cost after iter 5000: 0.010128918610307609<br>
Cost after iter 6000: 0.0076426515266461124<br>
Cost after iter 7000: 0.006082606294144723<br>
Cost after iter 8000: 0.005022499769792935<br>
Cost after iter 9000: 0.004259937441762031<br>
Accuracy = 95.20%<br>
***