The diagram below shows a neural network used for a classification problem. The network contains two hidden layers and one output layer. The input to the network is a column vector $x \in \mathbb{R}^{3}$. Each of the two hidden layers contains 3 neurons, and the output layer also contains 3 neurons. Each neuron in the $l^{th}$ layer is connected to all the neurons in the $(l+1)^{th}$ layer. Each neuron also has a bias (not shown explicitly in the figure).

<img src="https://backend.seek.onlinedegree.iitm.ac.in/22t3_cs3004/assets/img/W3GA1.png">

The weights and biases are initialized as given below...

$W_1=\begin{bmatrix}
    0.5488135 & 0.71518937 & 0.60276338\\
    0.54488318 & 0.4236548 & 0.64589411\\
    0.43758721 & 0.891773 & 0.96366276
    \end{bmatrix}$

$W_2=\begin{bmatrix}
    0.56804456 & 0.92559664 & 0.07103606\\
    0.0871293 & 0.0202184 & 0.83261985\\
    0.77815675 & 0.87001215 & 0.97861834
    \end{bmatrix}$


$W_3=\begin{bmatrix}
    0.11827443 & 0.63992102 & 0.14335329\\
    0.94466892 & 0.52184832 & 0.41466194\\
    0.26455561 & 0.77423369 & 0.45615033
    \end{bmatrix}$


$b_1=\begin{bmatrix}
    0.38344152\\
    0.79172504\\
    0.52889492
    \end{bmatrix}$


$b_2=\begin{bmatrix}
    0.79915856\\
    0.46147936\\
    0.78052918
    \end{bmatrix}$


$b_3=\begin{bmatrix}
    0.56843395\\
    0.0187898\\
    0.6176355
    \end{bmatrix}$


The weights that connect the outputs of the neurons in the previous layer $(i-1)$ to a neuron in the present layer $i$ form one row of the weight matrix $W_i$. The input to the network is

$x=\begin{bmatrix}
    1\\
    0\\
    1
    \end{bmatrix}$

and the corresponding label

$y=\begin{bmatrix}
    0\\
    0\\
    1
    \end{bmatrix}$
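To make the row convention concrete, the short sketch below recomputes the first layer's pre-activation for the given input, once as a matrix-vector product and once neuron by neuron. The values of `W1`, `b1`, and `x` are copied from the text above; the name `a1_rows` and the loop are illustrative only.

```python
import numpy as np

# Values copied from the problem statement
W1 = np.array([[0.5488135, 0.71518937, 0.60276338],
               [0.54488318, 0.4236548, 0.64589411],
               [0.43758721, 0.891773, 0.96366276]])
b1 = np.array([[0.38344152], [0.79172504], [0.52889492]])
x = np.array([[1], [0], [1]])

# Whole layer at once: a1 = W1 x + b1
a1 = W1 @ x + b1

# Neuron by neuron: row j of W1 holds the weights feeding neuron j
a1_rows = np.array([[W1[j, :] @ x[:, 0] + b1[j, 0]] for j in range(3)])

print(np.allclose(a1, a1_rows))  # True
```

Because $x$ has a zero in its second position, the second column of $W_1$ does not contribute to $a_1$ for this input.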


All neurons in the hidden layers use the sigmoid activation function, and the neurons in the output layer use the softmax function. Assume that the network uses the cross-entropy loss (with the natural logarithm).
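For reference, the standard definitions assumed here are written out below. The symbols $\sigma$, $\hat{y}$, and the indices $j, k$ are notation introduced for this note and do not appear in the figure.

$\sigma(z)=\dfrac{1}{1+e^{-z}}$ (applied element-wise)

$\hat{y}_k=\text{softmax}(a_3)_k=\dfrac{e^{a_{3,k}}}{\sum_{j} e^{a_{3,j}}}$

$L(\theta)=-\sum_{k} y_k \ln \hat{y}_k$

Since $y$ is one-hot, only the term of the true class survives in the sum, so the loss reduces to $-\ln \hat{y}_c$ for the true class $c$.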

You are advised to use the NumPy package to compute the matrix-vector multiplications. You can download the initial weights [HERE](https://drive.google.com/file/d/1TJax-Eq1I-TD3Fofsu4ODPLwCPsz1250/view).

In [60]:
import numpy as np

In [61]:
params = np.load('parameters.npz')
W1=params.get('W1')
W2=params.get('W2')
W3=params.get('W3')
b1=params.get('b1')
b2=params.get('b2')
b3=params.get('b3')
x=np.array([1,0,1]).reshape(-1,1)
y=np.array([0,0,1]).reshape(-1,1)

In [62]:
def pre_activation(W, x, b):
    return np.dot(W,x)+b

In [63]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

In [64]:
def softmax(z):
    return np.exp(z)/np.sum(np.exp(z))
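The direct formula works for the small pre-activations in this exercise, but `np.exp` can overflow for large inputs. A common remedy, sketched here under the hypothetical name `softmax_stable`, is to subtract the maximum entry before exponentiating; this leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax_stable(z):
    """Softmax that shifts z by its maximum to avoid overflow in exp."""
    shifted = z - np.max(z)      # largest entry becomes 0
    exp_z = np.exp(shifted)      # every value now lies in (0, 1]
    return exp_z / np.sum(exp_z)

# Large inputs that would overflow in the direct formula
z = np.array([[1000.0], [1001.0], [1002.0]])
print(softmax_stable(z).ravel())  # approximately [0.090, 0.245, 0.665]
```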

In [65]:
def cross_entropyloss(y, y_pred):
    return -np.sum(y*np.log(y_pred))

In [66]:
def forward_prop(W1, W2, W3, b1, b2, b3, x):

    a1=pre_activation(W1,x,b1)
    h1=sigmoid(a1)

    a2=pre_activation(W2,h1,b2)
    h2=sigmoid(a2)

    a3=pre_activation(W3,h2,b3)
    y_pred=softmax(a3)

    return a1, h1, a2, h2, a3, y_pred

In [67]:
def back_prop(y, y_pred, a3, h2, a2, h1, a1, x, W1, W2, W3):
    del_y_pred = -y/y_pred

    del_a3 = -(y-y_pred)
    del_h2 = W3.T@del_a3
    del_W3 = del_a3@h2.T
    del_b3 = del_a3

    del_a2 = del_h2*sigmoid(a2)*(1-sigmoid(a2))
    del_h1 = W2.T@del_a2
    del_W2 = del_a2@h1.T
    del_b2 = del_a2

    # derivative of the sigmoid, evaluated at the pre-activation a1
    del_a1 = del_h1*sigmoid(a1)*(1-sigmoid(a1))
    del_W1 = del_a1@x.T
    del_b1 = del_a1
    
    return del_y_pred, del_a3, del_h2, del_a2, del_h1, del_a1, del_W3, del_W2, del_W1, del_b3, del_b2, del_b1
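One way to sanity-check a hand-written backward pass is to compare it with a numerical gradient. The sketch below estimates $\nabla_{b_2}L(\theta)$ by central differences, using the parameter values printed in the problem statement. The helper names `loss_for_b2` and `numerical_grad` and the step size `eps` are assumptions made for this illustration, not part of the assignment.

```python
import numpy as np

# Initial parameters, input, and label copied from the problem statement
W1 = np.array([[0.5488135, 0.71518937, 0.60276338],
               [0.54488318, 0.4236548, 0.64589411],
               [0.43758721, 0.891773, 0.96366276]])
W2 = np.array([[0.56804456, 0.92559664, 0.07103606],
               [0.0871293, 0.0202184, 0.83261985],
               [0.77815675, 0.87001215, 0.97861834]])
W3 = np.array([[0.11827443, 0.63992102, 0.14335329],
               [0.94466892, 0.52184832, 0.41466194],
               [0.26455561, 0.77423369, 0.45615033]])
b1 = np.array([[0.38344152], [0.79172504], [0.52889492]])
b2 = np.array([[0.79915856], [0.46147936], [0.78052918]])
b3 = np.array([[0.56843395], [0.0187898], [0.6176355]])
x = np.array([[1.0], [0.0], [1.0]])
y = np.array([[0.0], [0.0], [1.0]])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_for_b2(b2_value):
    """Forward pass and cross-entropy loss with b2 replaced by b2_value."""
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2_value)
    a3 = W3 @ h2 + b3
    y_pred = np.exp(a3) / np.sum(np.exp(a3))
    return -np.sum(y * np.log(y_pred))

def numerical_grad(f, v, eps=1e-6):
    """Central-difference estimate of the gradient of f at v."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step.flat[i] = eps
        grad.flat[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return grad

num_del_b2 = numerical_grad(loss_for_b2, b2)
print(num_del_b2.ravel())
```

If the backward pass is correct, `np.allclose(num_del_b2, del_b2, atol=1e-6)` should hold for the `del_b2` returned by `back_prop`. The same check can be repeated for any other parameter by perturbing that parameter instead.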

In [68]:
a1, h1, a2, h2, a3, y_pred = forward_prop(W1, W2, W3, b1, b2, b3, x)

In [69]:
del_y_pred, del_a3, del_h2, del_a2, del_h1, del_a1, del_W3, del_W2, del_W1, del_b3, del_b2, del_b1 = back_prop(y, y_pred, a3, h2, a2, h1, a1, x, W1, W2, W3)

## Questions

Q1. How many (learnable) parameters are there in the network?

In [70]:
print("\033[1;32m {}".format(W1.shape[0]*W1.shape[1]+W2.shape[0]*W2.shape[1]+W3.shape[0]*W3.shape[1]+b1.shape[0]+b2.shape[0]+b3.shape[0]))

36
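The same count can be obtained from the layer sizes alone: a fully connected layer with $n_{in}$ inputs and $n_{out}$ neurons has $n_{in} \cdot n_{out}$ weights and $n_{out}$ biases. The sketch below applies this to the sizes described in the problem statement; the list `layer_sizes` is a name introduced for illustration.

```python
# Sizes of the input, hidden layer 1, hidden layer 2, and output layer
layer_sizes = [3, 3, 3, 3]

# Each layer contributes n_in * n_out weights plus n_out biases
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(n_params)  # 36
```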


Q2. What is the sum of the elements of the output $a_1$? (Choose the option nearest to your answer.)

(a) 3.7

(b) 5.44

(c) 4.16

(d) 17.59

In [71]:
print("\033[1;32m {:.2f}".format(a1.sum()))

5.45


Q3. What is the sum of the elements of the output $h_1$? (Choose the option nearest to your answer.)

(a) 2.57

(b) 3.46

(c) 0.67

(d) 0.42

In [72]:
print("\033[1;32m {:.2f}".format(h1.sum()))

2.57


Q4. The sums of the elements of $a_2$, $h_2$, and $a_3$ are $6.46$, $2.63$, and $4.87$, respectively. What is the loss value?

In [73]:
a2.sum(), h2.sum(), a3.sum() # checking the sum

(6.460166773282593, 2.63139309587371, 4.874920988265704)

In [74]:
print("\033[1;32m {:.2f}".format(cross_entropyloss(y, y_pred)))

0.86


Q5. Choose the vector that corresponds to $\nabla_{a_3}L(\theta)$.

In [75]:
print("\033[1;32m {}".format(del_a3))

[[ 0.23691422]
 [ 0.33838847]
 [-0.57530268]]
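The vector printed above equals $\hat{y} - y$. A brief derivation, assuming a one-hot label with true class $c$ (the symbols $\hat{y}$, $c$, and $\delta_{kc}$ are notation introduced here for illustration):

$L(\theta) = -\ln \hat{y}_c = -a_{3,c} + \ln \sum_{j} e^{a_{3,j}}$

$\dfrac{\partial L}{\partial a_{3,k}} = -\delta_{kc} + \dfrac{e^{a_{3,k}}}{\sum_{j} e^{a_{3,j}}} = \hat{y}_k - y_k$

Here $\delta_{kc}$ is 1 when $k = c$ and 0 otherwise, so it coincides with $y_k$. This is the expression `del_a3 = -(y-y_pred)` used in `back_prop`: the entry of the true class is negative and the other two are positive.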


Q6. After computing the gradients, we update $b_2$ by subtracting its gradient scaled by the learning rate $\eta$, i.e.,

$b_2-\eta\nabla_{b_2}L(\theta)$.

Which of the following is the gradient vector of $b_2$?

In [76]:
print("\033[1;32m {}".format(del_b2))

[[ 0.01838198]
 [-0.01997644]
 [-0.0038401 ]]


Q7. Update all the parameters with the calculated gradients (the code below uses a learning rate of $\eta = 1$). Forward propagate the input through the network. What is the new loss value?

In [77]:
W3-=del_W3
b3-=del_b3
W2-=del_W2
b2-=del_b2
W1-=del_W1
b1-=del_b1

a1, h1, a2, h2, a3, y_pred = forward_prop(W1, W2, W3, b1, b2, b3, x)

print("\033[1;32m {:.2f}".format(cross_entropyloss(y, y_pred)))

0.07
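The update cell above subtracts the raw gradients in place, which corresponds to a learning rate of 1. A minimal sketch of the same rule with an explicit learning rate is shown below; the function name `sgd_step`, the argument `eta`, and the toy arrays are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

def sgd_step(params, grads, eta=1.0):
    """Return new parameters after one gradient-descent step: p - eta * g."""
    return [p - eta * g for p, g in zip(params, grads)]

# Toy values, purely illustrative
params = [np.array([[0.5], [0.2]])]
grads = [np.array([[0.1], [-0.4]])]

(new_b,) = sgd_step(params, grads, eta=0.5)
print(new_b.ravel())  # approximately [0.45, 0.4]
```

Returning new arrays instead of using `-=` keeps the original parameters available, which is convenient when the loss before and after the step is to be compared.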
