# The need for optimization

## Loss Function
![image-3](image-3.png)

- A lower loss function value means a better model
- Goal: find the weights that give the lowest value of the loss function
- **Gradient descent** is the method used to find those weights
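
As a minimal sketch of what "lower loss means a better model" looks like in numbers, the snippet below computes a mean squared error for two sets of predictions against the same targets. The function name `mean_squared_error` and all values are hypothetical, chosen only for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical targets and two candidate models' predictions
targets = [3, 5, 7]
preds_model_a = [4, 5, 9]   # misses by 1, 0 and 2
preds_model_b = [3, 6, 7]   # misses by 0, 1 and 0

loss_a = mean_squared_error(preds_model_a, targets)
loss_b = mean_squared_error(preds_model_b, targets)

# Model B has the lower loss, so it fits these targets better
print(loss_a)
print(loss_b)
```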

## Gradient descent
- Start at a random point
- Until you are somewhere flat:
    - Find the slope
    - Take a step downhill
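
The steps above can be sketched for a single weight as follows. This is only an illustration under assumptions: the bowl-shaped loss `(w - 3) ** 2`, the fixed starting point, the `step_size`, and the `tolerance` that defines "flat" are all hypothetical choices.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at w = 3
    return (w - 3) ** 2

def slope(w):
    # Derivative of (w - 3) ** 2 with respect to w
    return 2 * (w - 3)

w = 0.0            # starting point (fixed rather than random, for reproducibility)
step_size = 0.1    # assumed fraction of the slope to move by on each step
tolerance = 1e-6   # "flat" here means the slope magnitude is below this value
max_steps = 1000   # safety limit so the loop always terminates

steps = 0
while abs(slope(w)) > tolerance and steps < max_steps:
    w -= step_size * slope(w)   # take a step downhill
    steps += 1

print(w, steps, loss(w))
```

The loop stops once the slope is close to zero, which for this loss happens near `w = 3`.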

## Coding how weight changes affect accuracy
![image-4](image-4.png)


In [1]:
import numpy as np

# Input nodes
inputs = np.array([0,3])

# Initial weights
weights = {'weight_0': np.array([2,1]),
           'weight_1': np.array([1,2]),
           'output': np.array([1,1])}

# Target value
target = np.array(3)

In [2]:
# Forward pass: calculate both hidden node values, then return the output node value
def calc_nodes(inputs, weights):
    node_0 = (inputs*weights['weight_0']).sum()
    node_1 = (inputs*weights['weight_1']).sum()
    hidden = np.array([node_0,node_1])
    output = (hidden*weights['output']).sum()
    return output

In [3]:
# Calculate the output with the initial weights
init_pred = calc_nodes(inputs, weights)

# Error for the initial weights
error_0 = init_pred - target
error_0

6

In [4]:
# Try a different, hand-picked set of weights
upt_weights = {'weight_0': np.array([2,1]),
               'weight_1': np.array([1,1]),
               'output': np.array([0.5,0.5])}

# Calculate for updated weights
pred = calc_nodes(inputs, upt_weights)

# Error for updated weight
error_1 = pred - target
error_1

0.0
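
To see the same effect across more than two weight settings, the sketch below sweeps a few candidate values for the output weights and records the error for each. It uses plain Python lists so that it runs on its own. The helper `predict`, the name `sweep_weights`, and the candidate values are assumptions made for illustration; the hidden-layer weights are the hand-picked ones from the cell above.

```python
def predict(inputs, weights):
    """Forward pass: two inputs, two hidden nodes, one output, no activation function."""
    node_0 = sum(i * w for i, w in zip(inputs, weights['weight_0']))
    node_1 = sum(i * w for i, w in zip(inputs, weights['weight_1']))
    hidden = [node_0, node_1]
    return sum(h * w for h, w in zip(hidden, weights['output']))

inputs = [0, 3]
target = 3

errors = {}
for out_w in [0.0, 0.25, 0.5, 0.75, 1.0]:   # hypothetical candidate output weights
    sweep_weights = {'weight_0': [2, 1],
                     'weight_1': [1, 1],
                     'output': [out_w, out_w]}
    errors[out_w] = predict(inputs, sweep_weights) - target

# The error changes sign around out_w = 0.5, where it is exactly zero
for out_w, err in errors.items():
    print(out_w, err)
```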

# Gradient Descent

![gradient_descent_LR_0.2.gif](gradient_descent_LR_0.2.gif)

- With gradient descent, you repeatedly find the slope that captures how your loss function changes as a weight changes. You then make a small change to the weight to reach a lower point, and repeat this until you cannot go downhill any more.

![image-6](image-6.png)

- If the slope is positive, going opposite the slope means moving to lower numbers; if it is negative, it means moving to higher numbers.
- Subtracting the slope from the current weight achieves this, but too big a step might lead us far astray.
- So, instead of directly subtracting the slope, we multiply the slope by a small number, called the learning rate, and subtract that product from the weight.
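
A minimal sketch of this update rule, comparing a small learning rate with subtracting the raw slope. The function name `update_weight` and the numbers are hypothetical.

```python
def update_weight(weight, slope, learning_rate):
    """One gradient descent step: move against the slope, scaled by the learning rate."""
    return weight - learning_rate * slope

weight = 2.0
slope = 10.0   # hypothetical positive slope, so the weight should decrease

small_step = update_weight(weight, slope, learning_rate=0.01)
full_step = update_weight(weight, slope, learning_rate=1.0)   # same as subtracting the raw slope

print(small_step)   # a small move to a lower weight
print(full_step)    # a much larger jump in the same direction
```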

## Slope calculation example
![image-8](image-8.png)

- To calculate the slope for a weight, multiply three things:
1. **Slope of the loss function w.r.t. the value at the node we feed into**
    - e.g. for the mean squared error loss function: the slope of the loss w.r.t. the prediction
    - i.e. 2 x (Predicted Value - Actual Value) = 2 x Error; here: 2 x (-4)
2. **The value of the node that feeds into our weight**; here: 3
3. **Slope of the activation function w.r.t. the value we feed into**; here: none (there is no activation function, so this factor is 1)

Note: for the ReLU function, the slope is 0 if the input into a node is negative. If the input into the node is positive, the output is the same as the input, so the slope is 1.

Thus, slope = 2 x (-4) x 3 = -24
- If the `learning rate` is `0.01`, the new weight would be: 2 - 0.01 x (-24) = 2.24
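
The arithmetic above can be checked with a short sketch. The helper `relu_slope` is a hypothetical illustration of the ReLU note; it is not used in this example because the example has no activation function. Treating an input of exactly 0 as having slope 0 is a convention assumed here.

```python
def relu_slope(node_input):
    """Slope of ReLU: 1 for positive input, 0 otherwise (0 at exactly 0 by convention here)."""
    return 1 if node_input > 0 else 0

error = -4             # predicted value minus actual value, from the example
node_value = 3         # value of the node feeding into the weight
activation_slope = 1   # no activation function in this example, so the factor is 1

slope = 2 * error * node_value * activation_slope

weight = 2
learning_rate = 0.01
new_weight = weight - learning_rate * slope

print(slope)
print(new_weight)
print(relu_slope(-2.0), relu_slope(3.0))
```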

In [5]:
from numpy import array

# Initializing values
input_data = array([1, 2, 3])
weights = array([0, 2, 1])
target = 0 

In [6]:
# Calculate the predictions: preds
preds = (input_data * weights).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * error * input_data

# Print the slope
print(slope)

[14 28 42]


In [7]:
# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * error * input_data

# Update the weights: weights_updated
weights_updated = weights - (slope * learning_rate) #<-- gradient descent

# Get updated predictions: preds_updated
preds_updated = (input_data*weights_updated).sum()

# Calculate updated error: error_updated
error_updated = preds_updated-target

# Print the original error
print(error)

# Print the updated error
print(error_updated)

7
5.04
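
A single update already reduces the error. As a sketch of repeating that update, the loop below applies the same rule several times and records the squared error before each step. It is written with plain Python lists so that it runs on its own; `n_updates = 20` is an arbitrary choice and `squared_errors` is a name introduced here.

```python
input_data = [1, 2, 3]
weights = [0.0, 2.0, 1.0]
target = 0
learning_rate = 0.01
n_updates = 20   # arbitrary number of updates, chosen for illustration

squared_errors = []
for _ in range(n_updates):
    # Prediction and error with the current weights
    preds = sum(x * w for x, w in zip(input_data, weights))
    error = preds - target
    squared_errors.append(error ** 2)

    # Slope of the squared error w.r.t. each weight, then the gradient descent step
    slope = [2 * error * x for x in input_data]
    weights = [w - learning_rate * s for w, s in zip(weights, slope)]

print(squared_errors[0], squared_errors[-1])
```

With these values the squared error shrinks on every update, starting from the same error of 7 (squared error 49) as in the cell above.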


# Backpropagation
![image-7](image-7.png)

- Backpropagation calculates the necessary slopes sequentially, starting from the weights closest to the prediction, moving back through the hidden layers, and ending at the weights coming from the inputs
- Allows gradient descent to update all weights in the neural network (by getting gradients for all weights)
- Comes from the chain rule of calculus
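
The three factors multiplied in the slope calculation example correspond to one application of the chain rule. As a sketch, assume a weight $w$ that connects a node with value $x$ to a node with input $z$ (where $z$ contains the term $w x$), an activation function $f$, a prediction $\hat{y} = f(z)$, a target $y$, and the squared-error loss $L = (\hat{y} - y)^2$. These symbols are introduced here for illustration only.

$$
\frac{\partial L}{\partial w}
= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
= 2(\hat{y} - y) \cdot f'(z) \cdot x
$$

For weights further from the prediction, the product gains additional factors for each layer the slope is passed back through.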

## Example of backpropagation
1. Start at some random set of weights
2. Use forward propagation to make a prediction

![image-9](image-9.png)

3. Use backward propagation to calculate the slope of the loss function w.r.t. each weight
    - Calculating the slopes (here the error is 3 and the slope of the activation function is 1):
        - For node '1': 2 x error x input x slope of activation function = 2 x 3 x 1 x 1 = 6, i.e. the gradient for weight 1 is 6
        - For node '3': 2 x error x input x slope of activation function = 2 x 3 x 3 x 1 = 18, i.e. the gradient for weight 2 is 18
4. Multiply that slope by the learning rate, and subtract the result from the current weights
5. Keep going with that cycle until we get to a flat part
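
The cycle above can be sketched for a very small network: two inputs, two hidden nodes with ReLU, and one output node without an activation function. Everything in this block (the function names, the weights, the inputs, the target, and the learning rate) is a hypothetical illustration and does not reproduce the network in the figures.

```python
def relu(z):
    return max(0.0, z)

def relu_slope(z):
    # Slope of ReLU: 1 for positive input, 0 otherwise
    return 1.0 if z > 0 else 0.0

def forward(inputs, w_hidden, w_output):
    """Return the hidden node inputs, the hidden node values, and the prediction."""
    z = [sum(x * w for x, w in zip(inputs, w_node)) for w_node in w_hidden]
    h = [relu(v) for v in z]
    pred = sum(a * w for a, w in zip(h, w_output))
    return z, h, pred

def backward(inputs, target, w_hidden, w_output):
    """Slopes of the squared-error loss w.r.t. every weight."""
    z, h, pred = forward(inputs, w_hidden, w_output)
    d_pred = 2 * (pred - target)   # slope of the loss w.r.t. the prediction
    # Weights closest to the prediction first: slope x value of the feeding node
    grad_output = [d_pred * a for a in h]
    # Then pass the slope back through each hidden node to the input weights
    grad_hidden = []
    for j in range(len(w_hidden)):
        d_hidden = d_pred * w_output[j] * relu_slope(z[j])
        grad_hidden.append([d_hidden * x for x in inputs])
    return grad_hidden, grad_output

inputs = [1.0, 2.0]
target = 4.0
w_hidden = [[0.5, 1.0], [-1.0, 0.25]]
w_output = [2.0, 1.0]
learning_rate = 0.01

_, _, pred_before = forward(inputs, w_hidden, w_output)
grad_hidden, grad_output = backward(inputs, target, w_hidden, w_output)

# One gradient descent step on every weight
new_w_output = [w - learning_rate * g for w, g in zip(w_output, grad_output)]
new_w_hidden = [[w - learning_rate * g for w, g in zip(w_node, g_node)]
                for w_node, g_node in zip(w_hidden, grad_hidden)]
_, _, pred_after = forward(inputs, new_w_hidden, new_w_output)

print(grad_output, grad_hidden)
print(abs(pred_before - target), abs(pred_after - target))
```

The second hidden node receives a negative input, so its ReLU slope is 0 and all gradients that pass through it are 0. After one step, the prediction is closer to the target than before.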
