<h2>Contents of <b>caches</b>:</h2>
<br/>
caches is a list of length L, with one entry per layer.<br/>
The cache returned by the forward_prop function on its l<sup>th</sup> call (layer l) is stored at index l - 1, because Python indices start at 0.<br/>
Each cache contains a linear_cache and an activation_cache, i.e.,
<br/><b>caches[ l - 1 ] = ( linear_cache, activation_cache )</b><br/>
<b>linear_cache = ( A<sup>[ l - 1 ]</sup> , W<sup>[ l ]</sup> , b<sup>[ l ]</sup> )</b><br/>and<br/>
<b>activation_cache = ( Z<sup>[ l ]</sup> )</b><br/>
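
The layout described above can be sketched as follows. This is only an illustration: the layer sizes, the random values and the inline forward pass are assumptions standing in for the forward_prop function, which is not shown in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                    # number of examples (assumed)
layer_dims = [3, 5, 2]   # n[0], n[1], n[2] (assumed sizes)

A = rng.standard_normal((layer_dims[0], m))  # A[0] = X
caches = []
for l in range(1, len(layer_dims)):
    W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    b = np.zeros((layer_dims[l], 1))
    Z = W @ A + b
    linear_cache = (A, W, b)   # (A[l-1], W[l], b[l])
    activation_cache = Z       # Z[l]
    caches.append((linear_cache, activation_cache))
    A = np.maximum(0, Z)       # stand-in activation, used only to reach the next layer

# The cache of layer l sits at index l - 1.
(A_prev, W1, b1), Z1 = caches[0]
print(A_prev.shape, W1.shape, b1.shape, Z1.shape)
```

Unpacking caches[0] this way mirrors how the single-step back_prop function reads its cache argument.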

In [3]:
import numpy as np

<h2>Single Step of Back Propagation:</h2>
<br/>
Function for implementing one step of back_prop:
<ul><li> Takes as input <b>dA</b>, <b>Y</b>, <b>cache</b> and <b>activation</b>.<ul><li><b>dA</b> is the post-activation gradient for the current layer l, i.e., dA<sup>[ l ]</sup>.
<li><b>Y</b> is the ground truth label matrix. It is used only when the activation is softmax.
<li><b>cache</b> is a tuple holding the <b>linear_cache</b> (itself a tuple) and the <b>activation_cache</b> for layer l.
<li><b>activation</b> is the name of the activation function used for layer l ( 'softmax' or 'relu' ).
</ul>
<li> The function calculates 
<ul>
    <li><b>dZ<sup>[ L ]</sup></b> as <b>dZ = AL - Y</b>. This is for softmax.
    <li><b>dW<sup>[ L ]</sup></b> as <b>dW = ( dZ . transpose( A<sup>[ L - 1 ]</sup> ) ) / m</b>. This is for softmax.
<li><b>dZ<sup>[ l ]</sup></b> as <b>dZ = dA * g<sup>[ l ] '</sup>( Z<sup>[ l ]</sup> )</b>, an elementwise product. Here dA is dA<sup>[ l ]</sup> and g<sup>[ l ] '</sup>( ) is the derivative of g<sup>[ l ]</sup>( ) w.r.t. Z<sup>[ l ]</sup>. shape( dZ<sup>[ l ]</sup> ) = shape( Z<sup>[ l ]</sup> ). This is for relu.
<li><b>dW<sup>[ l ]</sup></b> as <b>dW = ( dZ<sup>[ l ]</sup> . transpose( A<sup>[ l - 1 ]</sup> ) ) / m</b>. shape( dW<sup>[ l ]</sup> ) = shape( W<sup>[ l ]</sup> ). This is for relu.
<li><b>db<sup>[ l ]</sup></b> as <b>db = np.sum( dZ<sup>[ l ]</sup>, axis = 1, keepdims = True ) / m</b>. shape( db<sup>[ l ]</sup> ) = shape( b<sup>[ l ]</sup> ). This is the same for both softmax and relu.
<li><b>dA<sup>[ l - 1 ]</sup></b> as <b> dA_prev = transpose( W<sup>[ l ]</sup> ) . dZ<sup>[ l ]</sup></b>. shape( dA<sup>[ l - 1 ]</sup> ) = shape( A<sup>[ l - 1 ]</sup> ). This is the same for both softmax and relu.
</ul> 
<li>The function returns <b>dA_prev</b>, <b>dW</b> and <b>db</b>.
</ul>
<br/>
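For softmax, <b>dZ = AL - Y</b> already combines the loss gradient dA = - ( Y / A<sup>[ L ]</sup> ) with the derivative of the softmax.<br/>
The elementwise rule used for relu, <b>dZ = dA * g<sup>[ l ] '</sup>( Z<sup>[ l ]</sup> )</b>, does not carry over directly. Each softmax output depends on every entry of Z<sup>[ L ]</sup>, so the derivative is a Jacobian matrix with entries ∂A<sub>i</sub> / ∂Z<sub>j</sub> = A<sub>i</sub> ( δ<sub>ij</sub> - A<sub>j</sub> ). The term A<sup>[ L ]</sup> ( 1 - A<sup>[ L ]</sup> ) is only the diagonal of that matrix. Applying the chain rule with the full Jacobian to dA gives A<sup>[ L ]</sup> - Y, provided each column of Y sums to 1.<br/><br/>
For relu, g<sup>[ l ] '</sup>( Z<sup>[ l ]</sup> ) = 1 if z > 0 and 0 if z < 0. The code below uses 0.01 instead of 0 for z < 0, which is the derivative of a leaky relu with slope 0.01.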
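
The relation between dA and dZ for the softmax layer can be checked numerically on a single example. The sketch below is illustrative only: the number of classes, the random Z and the choice of true class are assumptions. It compares the chain rule with the full softmax Jacobian against the elementwise product with A ( 1 - A ).

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes = 4
z = rng.standard_normal(n_classes)   # Z[L] for one example (assumed values)
y = np.zeros(n_classes)
y[2] = 1.0                           # one-hot label, true class chosen arbitrarily

# softmax, shifted by max(z) for numerical stability
t = np.exp(z - z.max())
a = t / t.sum()

dA = -y / a                          # gradient of the cross-entropy loss w.r.t. A[L]

# full softmax Jacobian: J[i, j] = dA_i / dZ_j = a_i * (delta_ij - a_j)
J = np.diag(a) - np.outer(a, a)
dZ_full = J.T @ dA                   # chain rule with the whole Jacobian
dZ_diag = dA * a * (1 - a)           # elementwise product with the diagonal only

print(np.allclose(dZ_full, a - y))   # True
print(np.allclose(dZ_diag, a - y))   # False
```

The diagonal-only product matches AL - Y at the true class but is zero everywhere else, which is why the code uses AL - Y directly.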

In [2]:
def back_prop_single_step(dA, Y, cache, activation):
    """One backward step for layer l; returns dA_prev, dW, db."""
    linear_cache, activation_cache = cache
    Z = activation_cache
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]  # number of examples
    if activation == 'softmax':
        # Recompute the softmax activation from Z. Subtracting the column-wise
        # max leaves the result unchanged and avoids overflow in np.exp.
        t = np.exp(Z - np.max(Z, axis=0, keepdims=True))
        sum_t = np.sum(t, axis=0, keepdims=True)  # sum over the rows of each column
        gofZ = t / sum_t
        # softmax with cross-entropy loss: dZ = AL - Y (dA is not needed here)
        dZ = gofZ - Y
    elif activation == 'relu':
        # derivative is 1 for Z >= 0 and alpha for Z < 0 (leaky relu slope)
        alpha = 0.01
        gDashOfZ = np.ones_like(Z)
        gDashOfZ[Z < 0] = alpha
        dZ = np.multiply(dA, gDashOfZ)
    else:
        raise ValueError("unsupported activation: " + str(activation))

    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db
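
One way to gain confidence in these formulas is a finite-difference gradient check. The following is a minimal sketch, not part of the original notebook. It assumes a leaky relu forward pass with slope 0.01 (forward_prop is not shown here) and an illustrative cost J = ( 1 / m ) * sum( G * A ), where the random matrix G plays the role of dA. The helper names leaky_relu and cost are hypothetical.

```python
import numpy as np

def leaky_relu(Z, alpha=0.01):
    # assumed forward activation, matching the slope used in the backward step
    return np.where(Z < 0, alpha * Z, Z)

def cost(W, b, A_prev, G):
    # illustrative cost: J = (1/m) * sum(G * A), so the per-example dJ/dA is G
    m = A_prev.shape[1]
    return np.sum(G * leaky_relu(W @ A_prev + b)) / m

rng = np.random.default_rng(2)
n_prev, n, m = 3, 2, 5
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n, n_prev))
b = rng.standard_normal((n, 1))
G = rng.standard_normal((n, m))      # stands in for dA

# analytic gradients, same formulas as the single step
Z = W @ A_prev + b
dZ = G * np.where(Z < 0, 0.01, 1.0)
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = W.T @ dZ

# numerical gradient of J w.r.t. W by central differences
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n):
    for j in range(n_prev):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_num[i, j] = (cost(W_plus, b, A_prev, G)
                        - cost(W_minus, b, A_prev, G)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6))
print(dW.shape == W.shape, db.shape == b.shape, dA_prev.shape == A_prev.shape)
```

The shape checks at the end correspond to the shape identities listed for dW, db and dA_prev.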

<h2>Building L Layers  of Back Propagation:</h2>
<br/>
Function for building the L-layer model(L steps of back_prop):
<ul>
<li> Takes as argument <b>AL</b>, <b>Y</b>, <b>caches</b> where
<ul>
<li><b>AL</b> is A<sup>[ L ]</sup> i.e., the activation matrix of the last layer and of shape ( n<sup>[ L ]</sup>, m ).
<li><b>Y</b> is the ground truth label matrix, one-hot encoded, of shape ( n<sup>[ L ]</sup>, m ), so that AL - Y and Y / AL are elementwise operations.
<li><b>caches</b> is the L element vector as returned by the L layer forward-prop model function.
</ul>
<li> The function calculates
<ul>
<li><b>dA<sup>[ L ]</sup></b> as <b>dAL = - ( Y / AL )</b>. It calls the one step back_prop function with dA<sup>[ L ]</sup>, Y, caches[ L - 1 ] and the activation name 'softmax' as arguments.
<li>For each of the remaining L-1 layers ( in order l = L-1, L-2, ..., 1 ) it calls the one step back_prop function with dA<sup>[ l ]</sup>, Y, caches[ l - 1 ] and the activation name 'relu' as arguments.
</ul>
<li> When one step back prop is called for layer l of the NN, it returns dA<sup>[ l - 1 ]</sup>, dW<sup>[ l ]</sup> and db<sup>[ l ]</sup>. These are stored in a dictionary called <b>grads</b> as grads[ "dA" + str( l - 1 ) ], grads[ "dW" + str( l ) ] and grads[ "db" + str( l ) ] respectively.
<li> The function returns the dict <b>grads</b>.
</ul>

In [1]:
def L_Layer_Back_Prop(AL, Y, caches):
    L = len(caches)  # number of layers

    grads = {}
    # Gradient of the cross-entropy loss w.r.t. AL. The softmax step computes
    # dZ = AL - Y directly, so dAL only keeps the call signature uniform.
    dAL = - ( np.divide(Y, AL) )
    current_cache = caches[L-1]  # cache of layer L
    grads["dA"+str(L-1)], grads["dW"+str(L)], grads["db"+str(L)] = back_prop_single_step(dAL, Y, current_cache, activation='softmax')

    # loop index l = L-2, ..., 0 handles layer l+1, whose cache is caches[l]
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        grads["dA"+str(l)], grads["dW"+str(l+1)], grads["db"+str(l+1)] = back_prop_single_step(grads[ "dA"+str(l+1) ], Y, current_cache, activation='relu')

    return grads
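
The index bookkeeping in this loop is easy to get wrong, so the short sketch below only prints which cache, which incoming gradient and which grads keys belong to each layer. The value L = 3 and the name schedule are assumptions made for illustration; no gradients are computed.

```python
L = 3  # number of layers (assumed for illustration)

# entries: (layer number, index into caches, key of incoming dA, keys written to grads)
schedule = [(L, L - 1, "dAL", ["dA" + str(L - 1), "dW" + str(L), "db" + str(L)])]
for l in reversed(range(L - 1)):
    schedule.append((l + 1, l, "dA" + str(l + 1),
                     ["dA" + str(l), "dW" + str(l + 1), "db" + str(l + 1)]))

for layer, cache_idx, dA_key, keys in schedule:
    print(f"layer {layer}: caches[{cache_idx}], input {dA_key}, writes {keys}")
# layer 3: caches[2], input dAL, writes ['dA2', 'dW3', 'db3']
# layer 2: caches[1], input dA2, writes ['dA1', 'dW2', 'db2']
# layer 1: caches[0], input dA1, writes ['dA0', 'dW1', 'db1']
```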
        

<h2>Parameter Update:</h2>
<br/>
Function for updating parameters:
<ul>
<li> Takes as arguments the dict <b>parameters</b>, the dict <b>grads</b> and the learning rate <b>alpha</b>.
<li> For l in 0, 1, ..., L-1 it updates parameters[ "W" + str( l + 1 ) ] and parameters[ "b" + str( l + 1 ) ] by subtracting alpha times grads[ "dW" + str( l + 1 ) ] and grads[ "db" + str( l + 1 ) ] respectively.
<li> Returns the updated dict <b>parameters</b>.
</ul>

In [11]:
def update_parameters(parameters, grads, alpha):
    # parameters holds one W and one b per layer, so it has 2 * L entries
    L = len(parameters) // 2
    for l in range(L):
        # gradient descent step: move each parameter against its gradient
        parameters[ "W"+str(l+1) ] = parameters[ "W"+str(l+1) ] - ( alpha * grads[ "dW"+str(l+1) ])
        parameters[ "b"+str(l+1) ] = parameters[ "b"+str(l+1) ] - ( alpha * grads[ "db"+str(l+1) ])
    return parameters
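
As a small worked example of the update rule, the sketch below applies one gradient descent step to a single-layer parameter dict. The numbers are made up for illustration, and the loop repeats the update inline so that the snippet runs on its own.

```python
import numpy as np

# illustrative values only
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[1.0]])}
alpha = 0.1  # learning rate

L = len(parameters) // 2  # one W and one b per layer
for l in range(L):
    parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - alpha * grads["dW" + str(l + 1)]
    parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - alpha * grads["db" + str(l + 1)]

print(parameters["W1"])  # approximately [[0.98, -1.96]]
print(parameters["b1"])  # approximately [[0.4]]
```

Each entry moves by alpha times its gradient: 1.0 - 0.1 * 0.2 = 0.98, -2.0 - 0.1 * ( -0.4 ) = -1.96 and 0.5 - 0.1 * 1.0 = 0.4.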