# Deep L-Layer Neural Network
- Deep neural network notations
    - $L =$ number of layers
    - $n^{[l]} =$ number of units in layer $l$
    - $a^{[l]} = g^{[l]}(z^{[l]}) =$ activations in layer $l$
        - $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
    - $x = a^{[0]}$ (the input features; $X = A^{[0]}$ in vectorized form)
    - $\hat y = a^{[L]}$

# Forward Propagation in a Deep Network
- For a single training example:
    - $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
    - $a^{[l]} = g^{[l]}(z^{[l]})$
- Vectorized for the entire training set:
    - $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
    - $A^{[l]} = g^{[l]}(Z^{[l]})$
        - $Z^{[l]} = [z^{[l](1)}, ..., z^{[l](m)}]$
        - $A^{[l]} = [a^{[l](1)}, ..., a^{[l](m)}]$
- Note that, whether we process a single training example or the entire training set, we still need an explicit for-loop over the layers $l = 1, ..., L$.
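
The vectorized equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `forward_propagation`, the `"W1"`, `"b1"`, ... dictionary layout, and the choice of ReLU and sigmoid activations are all assumptions made for the example.

```python
import numpy as np

def relu(Z):
    """Element-wise ReLU, used here as an example hidden activation."""
    return np.maximum(0, Z)

def sigmoid(Z):
    """Element-wise sigmoid, used here as an example output activation."""
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, parameters, activations):
    """Compute A^{[L]} for inputs X of shape (n^{[0]}, m).

    parameters:  dict {"W1": ..., "b1": ..., ..., "WL": ..., "bL": ...} (assumed layout)
    activations: list of L callables; activations[l - 1] plays the role of g^{[l]}
    """
    A = X  # A^{[0]} = X
    caches = []
    for l in range(1, len(activations) + 1):  # explicit loop over the layers
        W, b = parameters[f"W{l}"], parameters[f"b{l}"]
        Z = W @ A + b  # b has shape (n^{[l]}, 1) and broadcasts across the m columns
        caches.append((A, W, Z))  # A is still A^{[l-1]} at this point
        A = activations[l - 1](Z)
    return A, caches

# Usage: a 2-layer network (3 inputs, 4 hidden units, 1 output) on m = 5 examples.
rng = np.random.default_rng(0)
parameters = {
    "W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))
AL, caches = forward_propagation(X, parameters, [relu, sigmoid])
print(AL.shape)  # (1, 5)
```

The loop over `l` is the only explicit loop; the $m$ training examples are handled by the matrix product.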

# Getting Your Matrix Dimensions Right
- Let $n^{[l]}$ be the number of units in layer $l$ (with $n^{[0]}$ the number of input features). The parameters $W^{[l]}$ and $b^{[l]}$ then have shapes:
    - $W^{[l]}: (n^{[l]}, n^{[l-1]})$
    - $b^{[l]}: (n^{[l]}, 1)$
- For a single training example:
    - $z^{[l]}: (n^{[l]}, 1)$
    - $a^{[l]}: (n^{[l]}, 1)$
- For vectorized implementation over the entire training set:
    - $Z^{[l]}: (n^{[l]}, m)$
    - $A^{[l]}: (n^{[l]}, m)$
- In backward propagation, each gradient has the same shape as the quantity it is taken with respect to:
    - $dW^{[l]}: (n^{[l]}, n^{[l-1]})$
    - $db^{[l]}: (n^{[l]}, 1)$
    - $dZ^{[l]}: (n^{[l]}, m)$
    - $dA^{[l]}: (n^{[l]}, m)$
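
One way to check these dimensions is to assert them while stepping through the layers. The sketch below assumes a hypothetical `layer_dims` list $[n^{[0]}, ..., n^{[L]}]$, a hypothetical helper `initialize_parameters`, small random initial weights, and `tanh` as a stand-in activation. None of these come from the notes above.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Create W^{[l]} of shape (n^{[l]}, n^{[l-1]}) and b^{[l]} of shape (n^{[l]}, 1)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

layer_dims = [5, 4, 3, 1]  # n^{[0]} = 5 inputs, then layers with 4, 3 and 1 units
m = 7                      # number of training examples
parameters = initialize_parameters(layer_dims)

A = np.ones((layer_dims[0], m))  # A^{[0]}: (n^{[0]}, m)
for l in range(1, len(layer_dims)):
    W, b = parameters[f"W{l}"], parameters[f"b{l}"]
    assert W.shape == (layer_dims[l], layer_dims[l - 1])
    assert b.shape == (layer_dims[l], 1)
    Z = W @ A + b                         # (n^{[l]}, n^{[l-1]}) @ (n^{[l-1]}, m) -> (n^{[l]}, m)
    assert Z.shape == (layer_dims[l], m)
    A = np.tanh(Z)                        # an element-wise activation keeps the shape
    assert A.shape == (layer_dims[l], m)
```

Shape assertions like these tend to catch a transposed weight matrix or a bias stored as a rank-1 array early.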

# Building Blocks of Deep Neural Networks

![p1.png](attachment:p1.png)

# Forward and Backward Propagation
- Forward propagation for layer $l$
    - Input $a^{[l-1]}$
    - Output $a^{[l]}$, cache $z^{[l]}, W^{[l]}, b^{[l]}$
    - Functions for a single training example:
        - $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
        - $a^{[l]} = g^{[l]}(z^{[l]})$
    - Vectorized implementation:
        - $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
        - $A^{[l]} = g^{[l]}(Z^{[l]})$
- Backward propagation for layer $l$
    - Input $da^{[l]}$, together with the values cached during forward propagation
    - Output $da^{[l-1]}, dW^{[l]}, db^{[l]}$
    - Functions for a single training example:
        - $dz^{[l]} = da^{[l]} \odot g^{[l]'}(z^{[l]})$, where $\odot$ denotes the element-wise product
        - $dW^{[l]} = dz^{[l]} a^{[l-1]T}$
        - $db^{[l]} = dz^{[l]}$
        - $da^{[l-1]} = W^{[l]T} dz^{[l]}$
    - Vectorized implementation:
        - $dZ^{[l]} = dA^{[l]} \odot g^{[l]'}(Z^{[l]})$, where $\odot$ denotes the element-wise product
        - $dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}$
        - $db^{[l]} = \frac{1}{m} \, \text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True})$
        - $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$
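
The vectorized backward step for one layer can be sketched as below. The function name `layer_backward` and the cache layout `(A_prev, W, Z)` are assumptions for illustration. In particular, the sketch stores $A^{[l-1]}$ in the cache because the $dW^{[l]}$ formula needs it.

```python
import numpy as np

def relu_derivative(Z):
    """g'(Z) for ReLU: 1 where Z > 0, else 0 (example activation derivative)."""
    return (Z > 0).astype(Z.dtype)

def layer_backward(dA, cache, activation_derivative):
    """Backward step for layer l.

    dA:    dA^{[l]}, shape (n^{[l]}, m)
    cache: (A_prev, W, Z) saved during the forward pass (assumed layout)
    Returns dA^{[l-1]}, dW^{[l]}, db^{[l]}.
    """
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * activation_derivative(Z)             # element-wise product
    dW = (dZ @ A_prev.T) / m                       # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m     # same shape as b: (n^{[l]}, 1)
    dA_prev = W.T @ dZ                             # same shape as A_prev
    return dA_prev, dW, db

# Usage: one layer with 3 inputs, 2 units and m = 5 examples.
rng = np.random.default_rng(1)
A_prev = rng.standard_normal((3, 5))
W = rng.standard_normal((2, 3))
b = np.zeros((2, 1))
Z = W @ A_prev + b
dA = rng.standard_normal((2, 5))
dA_prev, dW, db = layer_backward(dA, (A_prev, W, Z), relu_derivative)
print(dA_prev.shape, dW.shape, db.shape)  # (3, 5) (2, 3) (2, 1)
```

`keepdims=True` keeps `db` as a column of shape $(n^{[l]}, 1)$ rather than a rank-1 array, so it matches $b^{[l]}$.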

# Parameters vs. Hyperparameters
- Parameters
    - $W^{[l]}, l=1, ..., L$
    - $b^{[l]}, l=1, ..., L$
- Hyperparameters
    - Learning rate $\alpha$
    - \# iterations
    - \# layers $L$ (that is, $L - 1$ hidden layers)
    - \# hidden units $n^{[l]}, l=1, ..., L-1$
    - Choice of activation functions $g^{[l]}, l=1, ..., L$
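
To make the distinction concrete, the sketch below keeps the hyperparameters in a plain dictionary chosen before training, while the parameters are the arrays changed by the standard gradient descent update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$. The dictionary keys, the values, and the function name `update_parameters` are hypothetical.

```python
import numpy as np

# Hyperparameters: chosen by hand (or by a search), not learned from data.
hyperparameters = {
    "learning_rate": 0.1,             # alpha
    "num_iterations": 100,
    "layer_dims": [2, 3, 1],          # fixes L and each n^{[l]}
    "activations": ["relu", "sigmoid"],
}

def update_parameters(parameters, grads, learning_rate):
    """One gradient descent step on every W^{[l]} and b^{[l]}."""
    L = len(parameters) // 2  # each layer contributes one W and one b
    updated = {}
    for l in range(1, L + 1):
        updated[f"W{l}"] = parameters[f"W{l}"] - learning_rate * grads[f"dW{l}"]
        updated[f"b{l}"] = parameters[f"b{l}"] - learning_rate * grads[f"db{l}"]
    return updated

# Usage with a single 1x1 layer so the arithmetic is easy to follow.
parameters = {"W1": np.ones((1, 1)), "b1": np.zeros((1, 1))}
grads = {"dW1": np.ones((1, 1)), "db1": np.ones((1, 1))}
new_parameters = update_parameters(parameters, grads, hyperparameters["learning_rate"])
```

Changing a hyperparameter such as `learning_rate` changes how the parameters evolve, which is why hyperparameters are tuned separately from training.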