<img align="center" src="figures/course.png" width="800">

# 16720 (B) Neural Networks for Recognition - Assignment 3

     Instructor: Kris Kitani                       TAs: Arka, Jinkun, Rawal, Rohan, Sheng-Yu

## Q1 Implementing a Fully Connected Network (75 points)

**Please include all the answers to the write-up questions in HW3:PDF**. Each question is marked with either a "write-up" or an "auto-grader" tag.

In [45]:
# Do Not Modify
# Do Not Import ANY other packages
import numpy as np

# use for a "no activation" layer
def linear(x):
    return x

def linear_deriv(post_act):
    return np.ones_like(post_act)

def tanh(x):
    return np.tanh(x)

def tanh_deriv(post_act):
    return 1-post_act**2

def relu(x):
    return np.maximum(x, 0)

def relu_deriv(x):
    return (x > 0).astype(float)  # np.float was removed in NumPy 1.24

### Q1.1 Network Initialization

#### Q1.1.1 (3 points, write-up)
Why is it not a good idea to initialize a network with all zeros? If you imagine that every layer has weights and biases, what can a zero-initialized network output after training?

<font color="red">**Please include your answer to HW3:PDF**</font>

#### Q1.1.2 (3 points, auto-grader)
Implement `initialize_weights` below to initialize neural network weights with [Xavier initialization](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), where $Var[w] = \frac{2}{n_{in}+ n_{out}}$ and $n_{in}$, $n_{out}$ are the input and output dimensionalities of the layer. Please use a **uniform distribution** to sample random numbers (see eq. 16 in [Xavier initialization](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)); we recommend using `np.random.uniform()`.

In [46]:
def initialize_weights(in_size: int, out_size: int, params: dict, name: str='' ):
    '''
    Initialize the weights W and b for a linear layer Y = XW + b
    
    [input]
    * in_size -- the feature dimension of the input
    * out_size -- the feature dimension of the output
    * params -- a dictionary containing parameters
    * name -- name of the layer
    
    HINTS:
    (1) b should be a 1D array, not a 2D array with a singleton dimension
    '''
    init_value = (6 / (in_size + out_size)) ** .5
    W, b = np.random.uniform(-init_value, init_value, size=(in_size, out_size)), np.zeros((out_size,))

    params['W' + name] = W
    params['b' + name] = b

In [47]:
params = {}
initialize_weights(2, 25, params, 'layer1')
initialize_weights(25, 4, params, 'output')
assert(params['Wlayer1'].shape == (2, 25))
assert(params['blayer1'].shape == (25,))
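
The bound used for the uniform distribution follows from the variance of $U(-a, a)$, which is $a^2/3$: setting $a = \sqrt{6/(n_{in}+n_{out})}$ gives $Var[w] = \frac{2}{n_{in}+n_{out}}$. The sketch below checks this empirically; the layer sizes and the seeded generator are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical layer sizes, chosen only for illustration.
n_in, n_out = 300, 200

# For U(-a, a) the variance is a^2 / 3, so a = sqrt(6 / (n_in + n_out))
# gives Var[w] = 2 / (n_in + n_out).
bound = np.sqrt(6.0 / (n_in + n_out))
target_var = 2.0 / (n_in + n_out)

rng = np.random.default_rng(0)  # seeded so the check is repeatable
W = rng.uniform(-bound, bound, size=(n_in, n_out))

print(f"target variance:    {target_var:.6f}")
print(f"empirical variance: {W.var():.6f}")
```

The two printed values should agree to within sampling noise.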


#### Q1.1.3 (2 points, write-up)
Why is it good practice to initialize the parameters with random numbers? Explain the intuition behind scaling the initialization according to layer size (see near Fig. 6 in [Xavier initialization](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)).

<font color="red">**Please include your answer to HW3:PDF**</font>

### Q1.2 Forward Propagation

Please refer to `appendix.ipynb` for the forward propagation equations. We will implement forward propagation in code here.

#### Q1.2.1 (12 points, auto-grader)
Implement `sigmoid`, along with `forward` propagation for a single layer with an activation function, namely
$y = \sigma(X W + b)$, returning the output and intermediate results for an $N \times D$ dimension input $X$, with examples along the rows, data dimensions along the columns.

In [48]:
def sigmoid(x: np.ndarray):
    '''
    A sigmoid activation function
    
    [input]
    * x -- input data [N x D]
    
    [output]
    * res -- output after the sigmoid function
    '''
    return 1 / (1 + np.exp(-x))
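
For inputs with large magnitude, the direct formula can overflow inside `np.exp`. One common way around this is sketched below, under the assumption that only NumPy is available; `stable_sigmoid` is a hypothetical name, not part of the assignment's interface.

```python
import numpy as np

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid that never exponentiates a positive number (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # exp(-|x|) always lies in [0, 1], so it cannot overflow.
    z = np.exp(-np.abs(x))
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

vals = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(vals)  # [0.  0.5 1. ]
```

Both branches are algebraically the same function; the split only decides which form is evaluated so that the exponent is never positive.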


In [49]:
def forward(X: np.ndarray, params: dict, name: str='',
            activation: callable=sigmoid):
    """
    Do a forward pass

    [input]
    * X -- input data [N x D]
    * params -- a dictionary containing parameters
    * name -- name of the layer
    * activation -- the activation function (default is sigmoid)
    
    [output]
    * post_act -- output after a linear layer and activation
    """
    pre_act, post_act = None, None
    # get the layer parameters
    W = params['W' + name]
    b = params['b' + name]
    
    ## compute pre_act using X, W and b.
    ## compute post_act using pre_act.
    pre_act  = X @ W + b
    post_act = activation(pre_act)

    # store the pre-activation and post-activation values
    # these will be important in backprop
    params['cache_' + name] = (X, pre_act, post_act)

    return post_act

In [50]:
params = {'Wlayer1': np.random.rand(10, 25), 'blayer1': np.random.rand(25,)}
X = np.random.rand(3, 10)
y = forward(X, params, 'layer1')
assert 'cache_layer1' in params


#### Q1.2.2 (5 points, auto-grader)
Implement the `softmax` function. Please implement a numerically stable computation of softmax using Theory:Q2. Hint: translate the input by subtracting its maximum element.

In [51]:
def softmax(X: np.ndarray):
    """
    A softmax function.
    
    [input]
    * X -- input data [N x D]
    
    [output]
    * res -- values after softmax
    """
    exp_res = np.exp(X  - np.max(X, axis = 1).reshape(X.shape[0], 1))
    res     = exp_res / np.sum(exp_res, axis = 1).reshape(X.shape[0], 1)

    return res


In [52]:
X = np.array([[1, 2, 3], [1, 2, 3]])
softmax(X)

array([[0.09003057, 0.24472847, 0.66524096],
       [0.09003057, 0.24472847, 0.66524096]])
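
The hint works because softmax is unchanged when a constant is added to every entry of a row. The sketch below illustrates this, and shows what happens without the shift; `softmax_rows` is a hypothetical stand-in written so the example is self-contained.

```python
import numpy as np

def softmax_rows(X: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the max-shift trick (hypothetical helper)."""
    shifted = X - X.max(axis=1, keepdims=True)  # largest entry per row becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

X = np.array([[1.0, 2.0, 3.0]])
# Adding a constant to every logit in a row leaves the probabilities unchanged.
same = np.allclose(softmax_rows(X), softmax_rows(X + 1000.0))
print(same)  # True

# Without the shift, exp(1001) overflows to inf and inf / inf becomes nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(X + 1000.0) / np.exp(X + 1000.0).sum(axis=1, keepdims=True)
print(np.isnan(naive).all())  # True
```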

#### Q1.2.3 (5 points, auto-grader)
Implement `compute_loss_and_acc` to compute the accuracy of a set of labels, along with the scalar loss across the data.  The loss function generally used for classification is the cross-entropy loss.

$$L_{\textbf{f}}(\textbf{D}) = - \sum_{(\textbf{x}, \textbf{y})\in \textbf{D}}\textbf{y}\cdot\log(\textbf{f}(\textbf{x}))$$
Here $\textbf{D}$ is the full training dataset of data samples $\textbf{x}$ ($N\times 1$ vectors, N = dimensionality of data) and labels $\textbf{y}$ ($C\times 1$ one-hot vectors, C = number of classes).

In [53]:
def compute_loss_and_acc(y: np.ndarray, probs: np.ndarray):
    """
    Compute total loss and accuracy
    
    [input]
    * y -- one hot labels [N x C]
    * probs -- class probabilities [N x C]
    
    [output]
    * loss -- cross-entropy loss
    * acc -- accuracy
    """
    loss = -np.sum(y * np.log(probs + 1e-20))
    acc  = np.sum(np.argmax(probs, axis = 1) == np.argmax(y, axis = 1)) / y.shape[0]

    return loss, acc

In [54]:
probs = np.array([[.51, .49, 0], [.98, 0.02, 0.0], [.65, .33, .02]])
y     = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
loss, acc = compute_loss_and_acc(y, probs)
print(f"Loss: {loss}")
print(f"Accuracy: {acc}")

Loss: 5.016150474784366
Accuracy: 0.6666666666666666
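
Because the labels are one-hot, only the probability assigned to the true class contributes to each sample's loss. The sketch below recomputes the example above by hand; the `labels` array of class indices is an assumption introduced here as an equivalent of the one-hot `y`.

```python
import numpy as np

probs = np.array([[0.51, 0.49, 0.00],
                  [0.98, 0.02, 0.00],
                  [0.65, 0.33, 0.02]])
labels = np.array([0, 1, 0])  # class indices equivalent to the one-hot rows of y

# Probability assigned to the true class of each sample: 0.51, 0.02, 0.65
p_true = probs[np.arange(len(labels)), labels]
per_sample = -np.log(p_true)

print(per_sample)        # roughly [0.673, 3.912, 0.431]
print(per_sample.sum())  # roughly 5.016, matching the loss printed above

# Two of the three argmax predictions agree with the labels.
accuracy = np.mean(probs.argmax(axis=1) == labels)
print(accuracy)
```

The second sample dominates the total because the network gives its true class a probability of only 0.02.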


### Q1.3 Backwards Propagation

#### Q1.3.1 (10 points, auto-grader)
Compute back-propagation for a single layer, given the original weights, the appropriate intermediate results, and the gradient of the loss with respect to the layer's output. You should return the gradient with respect to $X$ so you can feed it into the next layer of the backward pass. As a sanity check, each gradient should have the same dimensions as the object it is taken with respect to.

In [55]:
def sigmoid_deriv(post_act: np.ndarray):
    """
    Derivative of sigmoid.
    
    we give this to you because you proved it
    it's a function of post_act
    """
    res = post_act*(1.0-post_act)
    return res


def backwards(delta: np.ndarray, params: dict, name: str='',
              activation_deriv: callable=sigmoid_deriv):
    """
    Do a backwards pass

    [input]
    * delta -- errors to backprop
    * params -- a dictionary containing parameters
    * name -- name of the layer
    * activation_deriv -- the derivative of the activation_func
    
    [output]
    * grad_X -- gradient w.r.t X
    """
    grad_X, grad_W, grad_b = None, None, None
    # everything you may need for this layer
    W = params['W' + name]
    b = params['b' + name]
    X, pre_act, post_act = params['cache_' + name]

    # DEBUG: Printing shapes
    # print(f"Pre activation shape: {pre_act.shape}")
    # print(f"Post activation shape: {post_act.shape}")
    # print(f"X shape: {X.shape}")
    # print(f"W shape: {W.shape}")
    # print(f"Delta shape: {delta.shape}")

    # Do the derivative through activation first
    # then compute the derivative W,b, and X

    act_deriv = activation_deriv(post_act)
    # Calculate the error by multiplying the activation derivative (signal) 
    # times the error between the prediction and the label
    errors    = act_deriv * delta

    
    # The gradient wrt X is the weighted errors
    grad_X = errors @ W.T
    # print(f"Gradient wrt X Shape: {grad_X.shape}")

    # The gradient wrt W is weighted by the x values
    grad_W = (X.T @ errors)

    # The gradient wrt b is the errors summed over the batch,
    # matching the summed (not averaged) loss and grad_W above
    grad_b = np.sum(errors, axis=0)

    # store the gradients
    params['grad_W' + name] = grad_W
    params['grad_b' + name] = grad_b

    return grad_X

In [56]:

# we use random values to test your implementation 
# independent of previous questions
n, c1, c2 = 5, 40, 20 
delta = np.random.rand(n, c2)
name = 'layer1'
params = {
    'W'+name: np.random.rand(c1, c2),
    'b'+name: np.random.rand(c2),
    'cache_'+name: (np.random.rand(n, c1), 
                     np.random.rand(n, c2), 
                     np.random.rand(n, c2))
}
print()
grad = backwards(delta, params, name, tanh_deriv)

assert 'grad_W' + name in params
assert 'grad_b' + name in params

assert params['grad_W'+name].shape == params['W'+name].shape
assert params['grad_b'+name].shape == params['b'+name].shape
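
Shape checks alone do not confirm that the values are right. A minimal, self-contained sketch of a value check for one layer follows. It assumes a tanh activation and a made-up scalar loss `sum(delta * y)`, chosen so that the upstream gradient is exactly `delta`; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 4, 3, 2                      # hypothetical batch, input, output sizes
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, H))
b = rng.normal(size=H)
delta = rng.normal(size=(N, H))        # stand-in for the upstream gradient dL/dy

def scalar_loss(W_):
    # A loss whose gradient w.r.t. the layer output y is exactly `delta`
    return np.sum(delta * np.tanh(X @ W_ + b))

# Analytic gradients, in the same order as a layer's backward pass
post_act = np.tanh(X @ W + b)
errors = delta * (1.0 - post_act ** 2)  # through the activation
grad_W = X.T @ errors                   # [D x H], same shape as W
grad_b = errors.sum(axis=0)             # [H], same shape as b
grad_X = errors @ W.T                   # [N x D], same shape as X

# Central-difference estimate for one entry of W
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 0] += eps
W_minus[1, 0] -= eps
numeric = (scalar_loss(W_plus) - scalar_loss(W_minus)) / (2 * eps)
print(abs(numeric - grad_W[1, 0]))      # should be tiny
```

For this summed loss the bias gradient is the plain sum of `errors` over the batch.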





### Q1.4 Convolutional Layer (10 points)

So far we have worked with linear layers in fully-connected networks. In practice, convolutional layers are commonly used to extract image features. You will implement the forward and backward propagation for a convolutional layer in this subsection.

#### Q1.4.1 (5 points, auto-grader)
Similar to Q1.2.1, implement `conv_forward` for a single convolutional layer with zero padding.

In [57]:
def pad_img(img, pad):
  return np.stack([np.pad(Xc, pad) for Xc in img])

def get_empty_out(X, filt, stride, pad):
    F, _, HH, WW = filt.shape
    out_rows = int(((X.shape[2] + 2 * pad - HH)/stride) + 1)
    out_cols = int(((X.shape[3] + 2 * pad - WW)/stride) + 1)
    res = np.zeros((X.shape[0], F, out_rows, out_cols))
    return res, out_rows, out_cols


def conv_forward(X: np.ndarray, params: dict, name: str='',
            stride: int=1, pad: int=0):
    """
    Do a forward pass for a convolutional layer

    [input]
    * X -- input data [N x C x H x W]
    * params -- a dictionary containing parameters
    * name -- name of the layer
    * stride, pad -- convolution parameters
    
    [output]
    * res -- output after a convolutional layer
    """
    res = None
    # get the layer parameters
    w = params['W' + name] # Conv Filter weights [F x C x HH x WW]
    b = params['b' + name] # Biases [F]

    # Debug Dimensions
    print(f"X dimensions: {X.shape}")
      
    # Filter Details
    F, _, HH, WW = w.shape

    # Get empty output array
    res, out_rows, out_cols = get_empty_out(X, w, stride, pad)

    # Iterate over the batch
    for img_num in range(X.shape[0]):
      
      # Pad the image
      X_pad = pad_img(X[img_num], pad)
      # print(X_pad.shape)

      for i in range(out_rows):
        for j in range(out_cols):
            for filt_num, filt in enumerate(w):
              # Compute the idx for the X matrix
              i_start = stride * i
              i_end   = i_start + HH
              j_start = stride * j
              j_end   = j_start + WW

              # Calculate the index in the padding
              X_patch = X_pad[:, i_start:i_end, j_start:j_end]

              # Debug the shape of the patch and the filter
              # print(f"X Patch Shape: {X_patch.shape}")
              # print(f"Filt Shape: {filt.shape}")

              # Convolve each filter and add the bias
              conv_res = np.dot(X_patch.flatten(), filt.flatten()) + b[filt_num]
              # print(f"({img_num}, {c}, {i//stride}, {j//stride})")
              res[img_num, filt_num, i, j] += conv_res
            
    # store the input and convolution parameters
    # these will be important in backprop
    params['cache_' + name] = (X, stride, pad)

    return res

In [58]:
x_shape = np.array((2, 3, 4, 4))
w_shape = np.array((3, 3, 4, 4))
x = np.linspace(-0.1, 0.5, num=np.prod(x_shape), dtype=np.float64).reshape(*x_shape)
w = np.linspace(-0.2, 0.3, num=np.prod(w_shape), dtype=np.float64).reshape(*w_shape)
b = np.linspace(-0.1, 0.2, num=3, dtype=np.float64)

params = {'WConv_layer1': w, 'bConv_layer1': b}
y = conv_forward(np.array(x), params, 'Conv_layer1', stride=2, pad=1)
assert 'cache_Conv_layer1' in params


y_ref = np.array([[[[-0.08759809, -0.10987781],
                              [-0.18387192, -0.2109216 ]],
                             [[ 0.21027089,  0.21661097],
                              [ 0.22847626,  0.23004637]],
                             [[ 0.50813986,  0.54309974],
                              [ 0.64082444,  0.67101435]]],
                            [[[-0.98053589, -1.03143541],
                              [-1.19128892, -1.24695841]],
                             [[ 0.69108355,  0.66880383],
                              [ 0.59480972,  0.56776003]],
                             [[ 2.36270298,  2.36904306],
                              [ 2.38090835,  2.38247847]]]], 
            )
# print(y_ref.shape)
assert y.shape == y_ref.shape
print(y_ref - y)

max_diff = np.max(np.abs((y_ref - y)))
base = (np.abs(y_ref) + np.abs(y)).clip(np.finfo(float).eps).max()
print(max_diff/base) # the difference should be less than 1e-8


X dimensions: (2, 3, 4, 4)
[[[[-3.87559808e-09 -3.59587783e-09]
   [-2.44387191e-09  4.71107839e-09]]

  [[ 2.99227090e-09  2.02061101e-09]
   [-5.81523746e-10 -4.67795361e-09]]

  [[-1.39860123e-10 -2.36290021e-09]
   [ 1.28082456e-09 -4.06698553e-09]]]


 [[[-4.83253570e-09 -3.30143513e-09]
   [ 1.60471125e-09  1.10418341e-11]]

  [[ 1.96908356e-09  2.24880392e-09]
   [ 3.40080963e-09  5.55759994e-10]]

  [[-1.22929755e-09 -2.20095719e-09]
   [-4.80309170e-09  1.10047838e-09]]]]
1.01418245052412e-09
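
The spatial size of a convolution's output follows from the input size, filter size, stride, and padding as $\lfloor (H + 2p - HH)/s \rfloor + 1$. The sketch below evaluates that formula for the two configurations used in this section; `conv_out_size` is a hypothetical helper, not part of the assignment code.

```python
def conv_out_size(in_size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a strided, zero-padded convolution (hypothetical helper)."""
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_out_size(4, 4, stride=2, pad=1))   # (4 + 2 - 4) // 2 + 1 = 2
print(conv_out_size(16, 7, stride=2, pad=3))  # (16 + 6 - 7) // 2 + 1 = 8
```

When the division is not exact, the floor means the last partial window is simply dropped.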


#### Q1.4.2 (5 points, auto-grader)
Implement `conv_backward` for a single convolutional layer with zero padding.
Compute back-propagation for a single convolutional layer, given the original weights, the cached input, and the gradient of the loss with respect to the layer's output. Similar to Q1.3.1, you should return the gradient with respect to $X$ so you can feed it into the next layer of the backward pass. As a sanity check, each gradient should have the same dimensions as the object it is taken with respect to.

In [59]:
def conv_backward(delta: np.ndarray, params: dict, name: str=''):
    """
    Do a backwards pass for a convolutional layer

    [input]
    * delta -- errors to backprop
    * params -- a dictionary containing parameters
    * name -- name of the layer
    
    [output]
    * grad_X -- gradient w.r.t X
    """
    grad_X, grad_W, grad_b = None, None, None
    # everything you may need for this layer
    W = params['W' + name]
    b = params['b' + name]
    X, stride, pad = params['cache_' + name]

    # Filter Details (read from this layer's W, not the global w)
    F, _, HH, WW = W.shape

    # Get empty output array
    _, out_rows, out_cols = get_empty_out(X, W, stride, pad)
    print(f"Out Rows: {out_rows}")
    print(f"Out Cols: {out_cols}")

    # Instantiate arrays for grads of X, W, b
    grad_X = np.zeros((X.shape[0], X.shape[1], X.shape[2] + 2*pad, X.shape[3] + 2*pad))
    grad_W = np.zeros_like(W)
    grad_b = np.zeros_like(b)

    # Iterate over the batch
    for img_num in range(X.shape[0]):
      
      # Pad the image
      X_pad = pad_img(X[img_num], pad)

      for i in range(out_rows):
        for j in range(out_cols):
            for filt_num, filt in enumerate(W):
              # Compute the idx for the X matrix
              i_start = stride * i
              i_end   = i_start + HH
              j_start = stride * j
              j_end   = j_start + WW

              # Find the patch of X used for the original conv.
              X_patch = X_pad[:, i_start:i_end, j_start:j_end]

              # For X: add the filter, scaled by this output's delta, back onto the padded input patch
              grad_X[img_num, :, i_start:i_end, j_start:j_end] += filt * delta[img_num, filt_num, i, j]

              # For W, X times delta. Each slice ij times delta (the X value of that slice times delta)
              grad_W[filt_num] += X_patch * delta[img_num, filt_num, i, j]

              # For b, just sum up the delta
              grad_b[filt_num] += delta[img_num, filt_num, i, j]

    # store the gradients
    params['grad_W' + name] = grad_W
    params['grad_b' + name] = grad_b

    # Get rid of the padding (explicit end indices also work when pad == 0)
    return grad_X[:, :, pad:pad + X.shape[2], pad:pad + X.shape[3]]

In [60]:
x = np.random.rand(5, 4, 16, 16)
w = np.random.rand(8, 4, 7, 7)
b = np.random.rand(8,)
dout = np.random.rand(5, 8, 8, 8)

params = {'WConv_layer1': w, 'bConv_layer1': b}
y = conv_forward(x, params, 'Conv_layer1', stride=2, pad=3)
dx = conv_backward(dout, params, 'Conv_layer1')
assert x.shape == dx.shape
assert params['grad_WConv_layer1'].shape == params['WConv_layer1'].shape
assert params['grad_bConv_layer1'].shape == params['bConv_layer1'].shape


X dimensions: (5, 4, 16, 16)
Out Rows: 8
Out Cols: 8
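
One detail to watch when stripping padding from a gradient array: the slice `pad:-pad` is empty when `pad` is 0, because `0:-0` is the same as `0:0`. The sketch below shows the pitfall and one way to avoid it; `strip_padding` is a hypothetical helper used only for illustration.

```python
import numpy as np

def strip_padding(a: np.ndarray, pad: int) -> np.ndarray:
    """Remove `pad` rows/columns from each spatial border (hypothetical helper)."""
    h, w = a.shape[-2] - 2 * pad, a.shape[-1] - 2 * pad
    # Explicit end indices work for pad == 0, unlike a[..., pad:-pad, pad:-pad].
    return a[..., pad:pad + h, pad:pad + w]

a = np.arange(16.0).reshape(1, 1, 4, 4)
print(a[:, :, 0:-0, 0:-0].shape)   # (1, 1, 0, 0): the slice 0:-0 is empty
print(strip_padding(a, 0).shape)   # (1, 1, 4, 4)
print(strip_padding(a, 1).shape)   # (1, 1, 2, 2)
```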


### Q1.5 The Training Loop
You usually see gradient descent in three forms: "normal", "stochastic" and "batch". "Normal" gradient descent aggregates the updates over the entire dataset before changing the weights. Stochastic gradient descent applies an update after every single data example. Batch gradient descent is a compromise: a random subset of the full dataset is evaluated before each gradient update. (Elsewhere these are often called full-batch, stochastic, and mini-batch gradient descent.)
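
In code, the three forms differ only in the batch size, which in turn sets how many weight updates happen in one pass over the data. A small sketch, with a hypothetical helper and dataset size:

```python
import math

def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of weight updates in one pass over the data (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)

n = 40  # hypothetical dataset size
print(updates_per_epoch(n, n))   # "normal": one update from the whole dataset -> 1
print(updates_per_epoch(n, 1))   # stochastic: one update per example -> 40
print(updates_per_epoch(n, 5))   # batch: one update per random subset -> 8
```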

#### Q1.5.1 (10 points, auto-grader)
Write a training loop that generates random batches, iterates over them for many iterations, does forward and backward propagation, and applies a gradient update step. Specifically, implement `get_random_batches` and `train` functions below.

In [61]:
def get_random_batches(x: np.ndarray, y: np.ndarray, batch_size: int) -> list:
    """
    Split x and y into random batches
    
    [input]
    * x -- training samples
    * y -- training labels
    * batch_size -- batch size
    
    [output]
    * batches -- a list of [(batch1_x,batch1_y)...]
    """
    batches = []
    shuffled_idxs = np.random.permutation(y.shape[0])
    shuffled_x    = x[shuffled_idxs]
    shuffled_y    = y[shuffled_idxs]
    for i in range(0, len(y), batch_size):
      # Create the batch of X
      batches.append((shuffled_x[i:i+batch_size], shuffled_y[i:i+batch_size]))
    # print(batches[0])
    return batches


In [62]:
n, c1, c2 = 20, 100, 5
batch_size = 3
x = np.random.rand(n, c1)
y = np.random.rand(n, c2)
batches = get_random_batches(x, y, batch_size)
assert type(batches) == list
assert len(batches) >= 6


In [63]:
def train(x: np.ndarray, y: np.ndarray, params: dict, batch_size: int = 5,
          max_iters: int = 500, learning_rate: float=1e-3):
    
    """
    Train the network with two sequential layers: 
    (1) one layer named "layer1" with sigmoid activation
    (2) one layer named "output" with softmax activation

    [input]
    * x -- training samples
    * y -- training labels
    * params -- a dictionary containing initial parameters
    * batch_size -- batch size
    * max_iters -- total number of iterations
    * learning_rate -- learning rate
    
    [output]
    * total_loss, avg_acc -- loss and accuracy for the last iteration
    """

    batches = get_random_batches(x, y, batch_size)

    for itr in range(max_iters):
        total_loss = 0
        avg_acc = 0
        num_batches = len(batches)
        for xb, yb in batches:

            # forward
            layer_1_out = forward(X = xb, params = params, name = "layer1", activation = sigmoid)
            probs       = forward(X = layer_1_out, params = params, name = "output", activation = softmax)
            
            # loss
            # be sure to add loss and accuracy to epoch totals
            loss_b, acc_b = compute_loss_and_acc(y = yb, probs = probs)
            total_loss += loss_b
            avg_acc += acc_b / num_batches
            
            # backward
            # For softmax + cross-entropy, dL/d(pre-activation) = probs - yb,
            # so the output layer is backpropagated with linear_deriv
            delta_2 = backwards(delta = probs - yb, params = params, name = "output", activation_deriv=linear_deriv)
            grad    = backwards(delta = delta_2, params = params, name = "layer1", activation_deriv = sigmoid_deriv)

            # apply gradient
            # Gradient descent
            params["Woutput"] = params["Woutput"] - learning_rate * params["grad_Woutput"]
            params["Wlayer1"] = params["Wlayer1"] - learning_rate * params["grad_Wlayer1"]
            params["boutput"] = params["boutput"] - learning_rate * params["grad_boutput"]
            params["blayer1"] = params["blayer1"] - learning_rate * params["grad_blayer1"]
        
        if itr % 100 == 0:
            print("itr: {:02d} \t loss: {:.2f} \t acc : {:.2f}".format(
                itr, total_loss, avg_acc))

    return total_loss, avg_acc


In [64]:
# Successful implementation of the dependent functions is required to get full score for the `train` function

# create inputs
g0 = np.random.multivariate_normal([3.6,40],[[0.05,0],[0,10]],10)
g1 = np.random.multivariate_normal([3.9,10],[[0.01,0],[0,5]],10)
g2 = np.random.multivariate_normal([3.4,30],[[0.25,0],[0,5]],10)
g3 = np.random.multivariate_normal([2.0,10],[[0.5,0],[0,10]],10)
x = np.vstack([g0,g1,g2,g3])

# create labels
y_idx = np.array([0 for _ in range(10)] + [1 for _ in range(10)] + [2 for _ in range(10)] + [3 for _ in range(10)])

# turn labels to one_hot
y = np.zeros((y_idx.shape[0],y_idx.max()+1))
y[np.arange(y_idx.shape[0]),y_idx] = 1

# parameters in a dictionary
params = {}
# initialize a layer
initialize_weights(2,25,params,'layer1')
initialize_weights(25,4,params,'output')

# train the two-layer neural network
total_loss, avg_acc = train(x, y, params, batch_size=5, max_iters=500, learning_rate=1e-3)
print("itr: {:02d} \t loss: {:.2f} \t acc : {:.2f}".format(500, total_loss, avg_acc))

# with default settings, you should get loss < 35 and accuracy > 70%
assert total_loss < 35 and avg_acc > 0.70


itr: 00 	 loss: 58.23 	 acc : 0.00
itr: 100 	 loss: 45.61 	 acc : 0.50
itr: 200 	 loss: 38.63 	 acc : 0.60
itr: 300 	 loss: 34.13 	 acc : 0.65
itr: 400 	 loss: 31.33 	 acc : 0.65
itr: 500 	 loss: 29.57 	 acc : 0.68


AssertionError: 

### Q1.6 Numerical Gradient Checker

#### Q1.6.1 (15 points, auto-grader)
Implement the `centeral_differences_gradient` function. Instead of using the analytical gradients computed from the chain rule, add $\epsilon$ offset to each element in the weights, and compute the numerical gradient of the loss with central differences. Central differences is just $\frac{f(x+\epsilon) - f(x-\epsilon)}{2 \epsilon}$. Remember, this needs to be done for each scalar dimension in all of your weights independently. 
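
To see why central differences are preferred over a one-sided difference, the sketch below compares both on a function with a known derivative. The test function $f(x) = x^3$, the point, and the step size are assumptions chosen only for illustration.

```python
def f(x):
    return x ** 3          # hypothetical test function with known derivative 3x^2

x0, eps = 2.0, 1e-3
exact = 3 * x0 ** 2        # 12.0

forward_diff = (f(x0 + eps) - f(x0)) / eps
central_diff = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

print(abs(forward_diff - exact))   # error on the order of eps
print(abs(central_diff - exact))   # error on the order of eps**2
```

The one-sided estimate is off by about `3 * x0 * eps`, while the central estimate is off by about `eps**2`, which is why the gradient checker uses the symmetric form.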

In [None]:
def centeral_differences_gradient(params: dict, eps = 1e-6):
    """
    Compute the estimated gradients using central difference
    
    Hint:
    please feel free to reuse the functions above

    Note: this uses the global x and y from the training cell, and only
    perturbs 2D weight matrices; 1D biases are skipped, so their stored
    gradients are left untouched.
    """
    # iterate over a snapshot, since forward() writes caches into params
    for k, v in list(params.items()):
        # skip caches and stored gradients
        if '_' in k:
            continue

        # only 2D weight matrices are checked
        if len(v.shape) != 2:
            continue

        # Holds on to the gradient values for each weight term
        grad = np.zeros_like(v)

        for idx in np.ndindex(*v.shape):
            # Left (- eps)
            v[idx] -= eps
            input_left = forward(x, params, "layer1", activation=sigmoid)
            out_left = forward(input_left, params, "output", activation=softmax)
            loss_left, _ = compute_loss_and_acc(y, out_left)

            # Right (+ eps)
            v[idx] += 2*eps
            input_right = forward(x, params, "layer1", activation=sigmoid)
            out_right = forward(input_right, params, "output", activation=softmax)
            loss_right, _ = compute_loss_and_acc(y, out_right)

            # Central Difference
            grad[idx] = (loss_right - loss_left) / (2 * eps)

            # Set it back to the old value
            v[idx] -= eps

        # Save the gradient
        params["grad_"+k] = grad


In [None]:
# Compute the analytical gradients
h1 = forward(x,params,'layer1')
probs = forward(h1,params,'output',softmax)
delta1 = probs
delta1[np.arange(probs.shape[0]),y_idx] -= 1

delta2 = backwards(delta1,params,'output',linear_deriv)
backwards(delta2,params,'layer1',sigmoid_deriv)

import copy
params_orig = copy.deepcopy(params)

# Compute the estimated gradient using central difference
centeral_differences_gradient(params)

total_error = 0
for k in params.keys():
    if 'grad_' in k:
        # relative error
        err = np.abs(params[k] - params_orig[k])/np.maximum(np.abs(params[k]),np.abs(params_orig[k]))
        err = err.sum()
        total_error += err
        print(f"{k} Error: {err}")
print(f"Total Error: {total_error}")
# should be less than 1e-4
assert 0. < total_error < 1e-4
print('Test case passed, good job!')

grad_Woutput Error: 1.9659158481792783e-06
grad_boutput Error: 0.0
grad_Wlayer1 Error: 1.2296770548214458e-06
grad_blayer1 Error: 0.0
Total Error: 3.195592903000724e-06
Test case passed, good job!
