# C1_Notes_W2 - Shallow Neural Networks

> Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

## Table of contents
  * [1. Neural Networks Overview](#1-neural-networks-overview)
  * [2. Neural Network Representation](#2-neural-network-representation)
  * [3. Computing a Neural Network's Output](#3-computing-a-neural-networks-output)
  * [4. Vectorizing across multiple examples](#4-vectorizing-across-multiple-examples)
  * [5. Activation functions](#5-activation-functions)
  * [6. Why do you need non-linear activation functions?](#6-why-do-you-need-non-linear-activation-functions)
  * [7. Derivatives of activation functions](#7-derivatives-of-activation-functions)
  * [8. Gradient descent for Neural Networks](#8-gradient-descent-for-neural-networks)
  * [9. Random Initialization](#9-random-initialization)

# 1. Neural Networks Overview 

- In logistic regression we had:

  ```
  X1  \  
  X2   ==>  z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
  X3  /
  ```

- In a neural network with one hidden layer we will have:

  ```
  X1  \  
  X2   =>  z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
  X3  /
  ```


- `X` is the input vector `(X1, X2, X3)`, and `Y` is the output variable, with shape `(1x1)`.
- A NN is a stack of logistic regression units.

![](images/c1w2n_basicnn.png)

# 2. Neural Network Representation

- We will define a neural network that has one hidden layer.
- A NN consists of an input layer, hidden layers, and an output layer.
- "Hidden" means that the true values of that layer's units are not observed in the training set.
- a<sup>[0]</sup> = x (the input layer)
- a<sup>[1]</sup> represents the activations of the hidden neurons.
- a<sup>[2]</sup> represents the output layer.
- This is a 2-layer NN. **The input layer isn't counted.**
![](images/c1w2n_basicnn2.png)


# 3. Computing a Neural Network's Output

- Equations of Hidden layers:
  ![](images/c1w2n_basicnn3.png)

  - ![](Images/05.png)
- Here is some information about the last image (**NOTE: THESE CORRESPOND TO ONLY 1 OBSERVATION, m=1**):
  - `noOfHiddenNeurons = 4`
  - `Nx = 3`
  - Shapes of the variables:
    - `W1` is the weight matrix of the hidden layer, it has a shape of `(noOfHiddenNeurons,Nx)`
    - `b1` is the bias vector of the hidden layer, it has a shape of `(noOfHiddenNeurons,1)`
    - `z1` is the result of the equation `z1 = W1*X + b1`, it has a shape of `(noOfHiddenNeurons,1)`
    - `a1` is the result of the equation `a1 = sigmoid(z1)`, it has a shape of `(noOfHiddenNeurons,1)`
    - `W2` is the weight matrix of the output layer, it has a shape of `(1,noOfHiddenNeurons)`
    - `b2` is the bias of the output layer, it has a shape of `(1,1)`
    - `z2` is the result of the equation `z2 = W2*a1 + b2`, it has a shape of `(1,1)`
    - `a2` is the result of the equation `a2 = sigmoid(z2)`, it has a shape of `(1,1)`
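
The shapes listed above can be checked with a small NumPy sketch for a single observation. The variable names `n_x` and `n_h`, the random values, and the seed are assumptions made for illustration only; they are not from the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 3, 4                        # Nx = 3 inputs, noOfHiddenNeurons = 4

rng = np.random.default_rng(0)         # arbitrary seed, only for reproducibility
x = rng.standard_normal((n_x, 1))      # one observation as a column vector
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

z1 = W1 @ x + b1       # (4, 1)
a1 = sigmoid(z1)       # (4, 1)
z2 = W2 @ a1 + b2      # (1, 1)
a2 = sigmoid(z2)       # (1, 1)

for name, value in [("z1", z1), ("a1", a1), ("z2", z2), ("a2", a2)]:
    print(name, value.shape)
```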
        
![](images/c1w2n_basicnn4.png)

#### So to recap, we can think of each neuron as a single logistic regression, and we are just stacking them on top of one another. Each hidden neuron produces an output (a[1]i), and those outputs are fed into layer 2 (a[2]), which is also pretty much just another logistic regression! (Though that obviously changes if you use a different activation function.)


# 4. Vectorizing across multiple examples

- Pseudo code for forward propagation for the 2-layer NN, one example at a time:

  ```
  for i = 1 to m
    # shape of z[1, i] is (noOfHiddenNeurons,1)
    z[1, i] = W1*x[i] + b1      
    
    # shape of a[1, i] is (noOfHiddenNeurons,1)
    a[1, i] = sigmoid(z[1, i])  
    
    # shape of z[2, i] is (1,1)
    z[2, i] = W2*a[1, i] + b2   
    
    # shape of a[2, i] is (1,1)
    a[2, i] = sigmoid(z[2, i])  
  ```

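- So the new (vectorized) pseudo code:
    - `X.shape = (Nx,m)`
    - `W1.shape = (noOfHiddenNeurons, Nx)`
    - `W2.shape = (noOfOutputNeurons, noOfHiddenNeurons) = (1, noOfHiddenNeurons)`
    - `b1.shape = (noOfHiddenNeurons, 1)` and `b2.shape = (1, 1)`; both are broadcast across the `m` columns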

  ```
  Z1 = W1X + b1     # shape of Z1 (noOfHiddenNeurons,m)
  A1 = sigmoid(Z1)  # shape of A1 (noOfHiddenNeurons,m)
  Z2 = W2A1 + b2    # shape of Z2 is (1,m)
  A2 = sigmoid(Z2)  # shape of A2 is (1,m)
  ```

- Notice that `m` is always the number of columns.
- In the last example we can call `X` = `A0`. So the previous step can be rewritten as:

  ```
  Z1 = W1A0 + b1    # shape of Z1 (noOfHiddenNeurons,m)
  A1 = sigmoid(Z1)  # shape of A1 (noOfHiddenNeurons,m)
  Z2 = W2A1 + b2    # shape of Z2 is (1,m)
  A2 = sigmoid(Z2)  # shape of A2 is (1,m)
  ```
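
As a rough check that the loop and the vectorized form compute the same thing, here is a minimal NumPy sketch. The function names `forward_loop` and `forward_vectorized`, the sizes, and the random data are hypothetical and only serve the comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(X, W1, b1, W2, b2):
    """Forward pass over the m examples, one column of X at a time."""
    m = X.shape[1]
    A2 = np.zeros((1, m))
    for i in range(m):
        x_i = X[:, i:i + 1]                      # keep it a column: (Nx, 1)
        a1 = sigmoid(W1 @ x_i + b1)              # (noOfHiddenNeurons, 1)
        A2[:, i:i + 1] = sigmoid(W2 @ a1 + b2)   # (1, 1)
    return A2

def forward_vectorized(X, W1, b1, W2, b2):
    """Same computation on all m examples at once."""
    Z1 = W1 @ X + b1      # b1 has shape (n_h, 1) and broadcasts over the m columns
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2     # b2 has shape (1, 1) and broadcasts as well
    return sigmoid(Z2)

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

A2_loop = forward_loop(X, W1, b1, W2, b2)
A2_vec = forward_vectorized(X, W1, b1, W2, b2)
print(A2_vec.shape, np.allclose(A2_loop, A2_vec))
```

Each column of `A2_vec` is the prediction for one example, which is why `m` shows up as the number of columns everywhere.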

![](images/c1w2n_basicnn5.png)
![](images/c1w2n_basicnn6.png)


# 5. Activation functions

- So far we have been using sigmoid, but in some cases other functions can be a lot better.
- Sigmoid can lead to the vanishing gradient problem: as its output saturates near 0 or 1, its slope gets closer and closer to zero, so the gradients get smaller and smaller and the updates become slower and slower.
- Sigmoid activation function range is (0,1)
      
      `# Where z is the input matrix
      A = 1 / (1 + np.exp(-z))`
      
- Tanh activation function range is (-1,1)   (a shifted and rescaled version of the sigmoid function)
  - In NumPy we can implement Tanh using one of these methods:
    
    `# Where z is the input matrix
    A = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))`

    Or
    
    `# Where z is the input matrix
    A = np.tanh(z)   `
    
- It turns out that the tanh activation usually works better than sigmoid activation function for hidden units because the **mean of its output is closer to zero, and so it centers the data better for the next layer.**
- The disadvantage of sigmoid and tanh is that if the input is **very negative or very positive,** the slope will be near zero, which causes the vanishing gradient problem and slows down gradient descent.
- One of the popular activation functions that mitigates this slow-down is the RELU function.
  
  `# If z is negative the slope is 0`
  `# If z is positive the slope is 1 (the output equals z)
  RELU = np.maximum(0, z)`
  
- So here is a basic rule for choosing activation functions:
    - If the output is a binary classification (labels 0 or 1), use sigmoid for the output layer and RELU for the other layers.
- Leaky RELU differs from basic RELU in that if the input is negative the slope is very small instead of zero. It works as well as RELU (and sometimes better), but most people use RELU.

  `#the 0.01 can be a parameter for your algorithm.
  Leaky_RELU = np.maximum(0.01 * z, z)`
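
The four activation functions discussed above can be sketched in NumPy as follows. The function names and the `alpha` parameter name are assumptions for illustration; `np.maximum` is used so that the functions work element-wise on a whole matrix `z`.

```python
import numpy as np

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Output in (-1, 1), centered around zero."""
    return np.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """Like RELU, but negative inputs keep a small slope alpha (assumed 0 < alpha < 1)."""
    return np.maximum(alpha * z, z)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid   ", sigmoid(z))
print("tanh      ", tanh(z))
print("relu      ", relu(z))
print("leaky_relu", leaky_relu(z))
```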
  
- In a NN you have to make a lot of choices, like:
  - Number of hidden layers.
  - Number of neurons in each hidden layer.
  - **Learning rate. (The most important parameter)**
  - Activation functions.
  - And others.
- It turns out there are no firm guidelines for these choices. You should, for example, try several activation functions and compare the results.

# 6. Why do you need non-linear activation functions?

- If we remove the activation function from our algorithm, we are effectively using a linear activation function (the identity function).
- A linear activation function outputs linear activations:
  - However many hidden layers you add, the output is still a linear function of the input, so with a sigmoid output the network is equivalent to logistic regression (so it's useless for a lot of complex problems)
  - **IT TURNS OUT THAT A NEURAL NETWORK WITH ONLY LINEAR ACTIVATION FUNCTIONS IS USUALLY NO BETTER THAN A STRAIGHT LOGISTIC REGRESSION!!!**
- **You might use linear activation function in one place - in the output layer if the output is real numbers (regression problem).** BUT even in this case if the output value is non-negative you could use RELU instead.
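
To see why linear hidden layers add nothing, note that `W2(W1 X + b1) + b2 = (W2 W1) X + (W2 b1 + b2)`, which is a single linear layer. The sketch below checks this numerically; the names `W_eq` and `b_eq` and the random values are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_h, m = 3, 4, 6
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

# Two layers with the identity activation g(z) = z in the hidden layer
A1 = W1 @ X + b1
Z2_two_layers = W2 @ A1 + b2

# One equivalent linear layer
W_eq = W2 @ W1           # shape (1, n_x)
b_eq = W2 @ b1 + b2      # shape (1, 1)
Z2_one_layer = W_eq @ X + b_eq

print(np.allclose(Z2_two_layers, Z2_one_layer))
```

Applying a sigmoid to either `Z2` gives the same predictions, so the two-layer linear network behaves like plain logistic regression with weights `W_eq` and bias `b_eq`.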

# 7. Derivatives of activation functions

- Derivative of the Sigmoid activation function:

  ```
  g(z) = 1 / (1 + np.exp(-z))
  g'(z) = (1 / (1 + np.exp(-z))) * (1 - (1 / (1 + np.exp(-z))))
  g'(z) = g(z) * (1 - g(z))
  ```

- Derivative of the Tanh activation function:

  ```
  g(z)  = (e^z - e^-z) / (e^z + e^-z)
  g'(z) = 1 - np.tanh(z)**2 = 1 - g(z)**2
  ```

- Derivative of the RELU activation function:

  ```
  g(z)  = np.maximum(0,z)
  g'(z) = { 0  if z < 0
            1  if z >= 0  }
  # The derivative is undefined at exactly z = 0; using 1 there is a convention
  ```

- Derivative of the leaky RELU activation function:

  ```
  g(z)  = np.maximum(0.01 * z, z)
  g'(z) = { 0.01  if z < 0
            1     if z >= 0   }
  ```
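
One way to sanity-check the derivative formulas above is to compare them with a central finite difference. This is only a sketch: the helper `numerical_derivative`, the step `eps`, and the test points are assumptions, and `z = 0` is avoided because RELU and leaky RELU have a kink there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return np.where(z >= 0, 1.0, 0.0)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)

def numerical_derivative(g, z, eps=1e-6):
    """Central difference approximation of g'(z)."""
    return (g(z + eps) - g(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # test points away from the kink at 0
pairs = {
    "sigmoid": (sigmoid, d_sigmoid),
    "tanh": (np.tanh, d_tanh),
    "relu": (relu, d_relu),
    "leaky_relu": (leaky_relu, d_leaky_relu),
}
results = {
    name: bool(np.allclose(dg(z), numerical_derivative(g, z), atol=1e-6))
    for name, (g, dg) in pairs.items()
}
print(results)
```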

# 8. Gradient descent for Neural Networks
- In this section we will have the full back propagation of the neural network (Just the equations with no explanations).
- Gradient descent algorithm:
  - NN parameters:
    - `n[0] = Nx`
    - `n[1] = NoOfHiddenNeurons`
    - `n[2] = NoOfOutputNeurons = 1`
    - `W1` shape is `(n[1],n[0])`
    - `b1` shape is `(n[1],1)`
    - `W2` shape is `(n[2],n[1])`
    - `b2` shape is `(n[2],1)`
  - Cost function `J =  J(W1, b1, W2, b2) = (1/m) * Sum(L(Y,A2))`
  - Then Gradient descent:

    ```
    Repeat:
        Compute predictions (y'[i], i = 1,...,m)
        Get derivatives: dW1, db1, dW2, db2
        Update: W1 = W1 - LearningRate * dW1
                b1 = b1 - LearningRate * db1
                W2 = W2 - LearningRate * dW2
                b2 = b2 - LearningRate * db2
    ```

- Forward propagation:

  ```
  Z1 = W1A0 + b1    # A0 is X
  A1 = g1(Z1)
  Z2 = W2A1 + b2
  A2 = Sigmoid(Z2)      # Sigmoid because the output is between 0 and 1
  ```

- Backpropagation (derivatives):
  ```
  dZ2 = A2 - Y                                   # (derivative of the cost function) * (derivative of the sigmoid function)
  dW2 = np.dot(dZ2, A1.T) / m                    # matrix product
  db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # sum over the m examples
  dZ1 = np.dot(W2.T, dZ2) * g1'(Z1)              # matrix product, then element-wise product (*)
  dW1 = np.dot(dZ1, A0.T) / m                    # A0 = X
  db1 = np.sum(dZ1, axis=1, keepdims=True) / m
  # Hint: the transposes are there to keep the matrix dimensions consistent
  ```
- How we derived the 6 equations of the backpropagation:   
  ![](Images/06.png)
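
Putting the forward propagation, the six backpropagation equations, and the update rule together gives a short training loop. The sketch below is one possible arrangement, not the course's reference implementation: it assumes a tanh hidden layer (so `g1'(Z1) = 1 - A1**2`), a cross-entropy cost, and a made-up toy dataset; the names `train`, `predict`, and the hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_h=4, learning_rate=0.5, num_iterations=1000, seed=0):
    """Gradient descent for a 2-layer NN (tanh hidden layer, sigmoid output)."""
    n_x, m = X.shape
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.01
    b2 = np.zeros((1, 1))
    costs = []
    for _ in range(num_iterations):
        # Forward propagation
        Z1 = W1 @ X + b1
        A1 = np.tanh(Z1)
        Z2 = W2 @ A1 + b2
        A2 = sigmoid(Z2)
        # Cross-entropy cost J = (1/m) * Sum(L(Y, A2))
        costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))
        # Backpropagation
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # element-wise product with tanh'(Z1)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # Gradient descent update
        W1 = W1 - learning_rate * dW1
        b1 = b1 - learning_rate * db1
        W2 = W2 - learning_rate * dW2
        b2 = b2 - learning_rate * db2
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, costs

def predict(params, X):
    """Return 0/1 predictions by thresholding A2 at 0.5."""
    A1 = np.tanh(params["W1"] @ X + params["b1"])
    A2 = sigmoid(params["W2"] @ A1 + params["b2"])
    return (A2 > 0.5).astype(int)

# Toy data (an assumption for this sketch): the label is 1 when x1 + x2 > 0
rng = np.random.default_rng(42)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # shape (1, 200)

params, costs = train(X, Y)
accuracy = float(np.mean(predict(params, X) == Y))
print(f"cost: {costs[0]:.3f} -> {costs[-1]:.3f}, training accuracy: {accuracy:.2f}")
```

The shapes of the gradients match the shapes of the parameters they update, which is a quick way to catch a missing transpose.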

# 9. Random Initialization

- In logistic regression it wasn't important to initialize the weights randomly (zeros work fine), while in a NN we have to initialize them randomly.

- If we initialize all the weights with zeros in a NN it won't work (initializing the biases with zeros is OK):
  - all hidden units will be completely identical (symmetric) and compute exactly the same function, so we don't get any additional value from having multiple neurons
  - on each gradient descent iteration all the hidden units will receive the same update, so they stay identical

![](images/c1w2n_basicnn7.png)
- To solve this we initialize the W's with small random numbers:

  ```
  import numpy as np

  W1 = np.random.randn(2, 2) * 0.01     # 0.01 to make it small enough
  b1 = np.zeros((2, 1))                 # it's OK to have b as zero; the random W1 already breaks the symmetry
  ```

- We need small values because with sigmoid (or tanh), if the weights are too large you are more likely to end up with very large values of Z even at the very start of training. This saturates the tanh or sigmoid activation function, which slows down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.

- The constant 0.01 is alright for networks with 1 hidden layer. For a deep NN a different constant may work better, but it will still be a small number.
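
The symmetry problem can be observed directly. The sketch below runs a few gradient descent steps from three starting points (all zeros, all weights equal to the same constant, and small random values) and checks whether the rows of `W1`, one row per hidden unit, are still identical afterwards. The helper names, the toy data, and the step count are assumptions for illustration, and a tanh hidden layer is assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_hidden_weights(X, Y, W1, W2, steps=50, learning_rate=0.5):
    """Run a few gradient descent steps (tanh hidden layer) and return W1."""
    m = X.shape[1]
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # uses W2 from before this step's update
        W2 = W2 - learning_rate * (dZ2 @ A1.T) / m
        b2 = b2 - learning_rate * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 = W1 - learning_rate * (dZ1 @ X.T) / m
        b1 = b1 - learning_rate * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1

def rows_identical(W):
    """True when every row of W equals the first row (all hidden units are the same)."""
    return bool(np.allclose(W, W[0]))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # toy labels, an assumption

n_h = 3
W1_zero = train_hidden_weights(X, Y, np.zeros((n_h, 2)), np.zeros((1, n_h)))
W1_const = train_hidden_weights(X, Y, np.full((n_h, 2), 0.5), np.full((1, n_h), 0.5))
W1_rand = train_hidden_weights(
    X, Y,
    rng.standard_normal((n_h, 2)) * 0.01,
    rng.standard_normal((1, n_h)) * 0.01,
)

print("zeros:   ", rows_identical(W1_zero))
print("constant:", rows_identical(W1_const))
print("random:  ", rows_identical(W1_rand))
```

With identical starting weights the hidden units never diverge from each other, while the small random start lets each unit learn something different.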


# Hand Written Notes

![](HandNotes/c1w2n_notes1.jpg)
![](HandNotes/c1w2n_notes2.jpg)
![](HandNotes/c1w2n_notes3.jpg)
![](HandNotes/c1w2n_notes4.jpg)