## Overview

In the previous module, we learned more about the main components of neural networks, including neurons, weights, activation functions, and the types of layers. We also created a simple perceptron binary classifier and an actual neural network using the Keras library. As we built those models, we focused more on the architecture of the models, such as the number of layers and the number of neurons in those layers. We also looked at parameters like the batch size and the learning rate and how they affect training time.

Now we're going to get more specific about *how* a neural network learns. First, we need to specify that we're talking about a *feed-forward* neural network, where information moves in one direction: forward from the input, through the hidden layers, to the output. Even though the weights are adjusted each iteration, this process doesn't send a layer's output back to an earlier layer inside the neural network.

In contrast, consider a multi-layer neural network where a layer's output is fed back to an earlier layer. For example, if we added feedback from the last hidden layer to the first hidden layer, it would be considered a *recurrent neural network* (RNN), something we will cover in the next sprint.
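To make the feed-forward case concrete, here is a minimal NumPy sketch of a single forward pass. The layer sizes, the random weights, and the sigmoid activation are all assumptions chosen for illustration; they are not taken from the models built in the previous module.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny network: 3 inputs -> 4 hidden neurons -> 1 output
rng = np.random.default_rng(seed=0)
W1 = rng.normal(size=(3, 4))  # weights: input -> hidden
b1 = np.zeros(4)              # biases for the hidden layer
W2 = rng.normal(size=(4, 1))  # weights: hidden -> output
b2 = np.zeros(1)              # bias for the output neuron

def forward(x):
    """One feed-forward pass: each layer only sees the output of the layer before it."""
    hidden = sigmoid(x @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return output

x = np.array([0.5, -1.0, 2.0])  # made-up input values
prediction = forward(x)
print(prediction)
```

Nothing computed in `output` is ever passed back into `hidden`, which is what makes this network feed-forward rather than recurrent.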

### Loss Function

So, what's going on when we train a neural network? First, remember that we adjust the weights by comparing the model's prediction with the expected result, and then increasing or decreasing those weights to get closer to the correct answer. To do that, we minimize a loss function that compares the output to the target (the correct answer). This function is minimized using the process of gradient descent.
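As a small illustration of what "comparing the output to the target" can look like, the sketch below uses mean squared error. This is only one common choice of loss function and is an assumption here; the text above does not say which loss is used. The prediction values are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

# Made-up targets and two sets of made-up predictions
y_true = np.array([1.0, 0.0, 1.0, 1.0])
close_preds = np.array([0.9, 0.1, 0.8, 0.7])  # mostly near the targets
far_preds = np.array([0.2, 0.9, 0.4, 0.1])    # mostly far from the targets

loss_close = mean_squared_error(y_true, close_preds)
loss_far = mean_squared_error(y_true, far_preds)

print("Loss for close predictions:", loss_close)
print("Loss for far predictions:  ", loss_far)
```

The predictions that sit closer to the targets produce the smaller loss, so lowering the loss corresponds to moving the model's output toward the correct answers.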

### Gradient Descent

Given a function, in this case the neural network loss function, we find the minimum of the function with the process of gradient descent. We start by choosing a random location on the function, find the direction of the negative gradient, take a small step in that direction, and then repeat the process until we reach a minimum of the function. Sometimes we find only a local minimum, but the goal is to find the global minimum.
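The difference between a local and a global minimum can be seen with a curve that has two dips. The function below is a hypothetical example chosen only because it has two minima; it is not a real neural network loss. The starting points are fixed by hand rather than chosen randomly so that the result is reproducible.

```python
def f(x):
    """Hypothetical curve with two minima (for illustration only)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f with respect to x."""
    return 4 * x**3 - 6 * x + 1

def descend(start_x, rate=0.01, n_iters=200):
    """Repeatedly take a small step in the direction of the negative gradient."""
    x = start_x
    for _ in range(n_iters):
        x = x - rate * grad_f(x)
    return x

x_from_right = descend(start_x=2.0)   # settles in the shallower dip (near x = 1.13)
x_from_left = descend(start_x=-2.0)   # settles in the deeper dip (near x = -1.30)

print("Start at  2.0 -> x = {:.3f}, f(x) = {:.3f}".format(x_from_right, f(x_from_right)))
print("Start at -2.0 -> x = {:.3f}, f(x) = {:.3f}".format(x_from_left, f(x_from_left)))
```

Both runs stop at a point where the gradient is close to zero, but only the run that starts on the left reaches the lower of the two values. Where gradient descent ends up can depend on where it starts.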



## Follow Along

Let's work through an example with a real function and find the minimum using gradient descent. The function we'll use is a simple parabola, which has a single minimum.

$$ y = (x-3)^2 $$

Let's graph this function to see where the minimum is.


```python
# Plot the function

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 15, 0.01)
y = (x - 3)**2

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')

plt.clf()  # comment out or delete this line to show the plot
```

![mod2_obj1_function.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_4/sprint_2/mod2_obj1_function.png)

We can see it reaches a minimum ($y=0$) when $x=3$. Using calculus, we can find the minimum mathematically. We do this by looking at the slope of the function and finding where that slope is equal to zero. The slope is found by taking the derivative of the function. We'll take the derivative of $y$ with respect to $x$.

$$ \frac{dy}{dx} = 2(x-3)$$

If we set the derivative equal to zero and solve for $x$, we'll have a solution for the minimum of the function:

$$ 0 = 2(x-3)$$ which is true when $x = 3$.
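As a sanity check on the derivative, we can compare the formula $2(x-3)$ with a numerical estimate of the slope. The sketch below uses a central difference with a small step `h`; the helper names and the value of `h` are assumptions made for this illustration.

```python
def f(x):
    """The parabola from this section."""
    return (x - 3) ** 2

def analytic_slope(x):
    """Slope from the derivative formula 2(x - 3)."""
    return 2 * (x - 3)

def numeric_slope(x, h=1e-5):
    """Central-difference estimate of the slope at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.0, 3.0, 5.0]:
    print("x = {}: formula = {}, estimate = {:.6f}".format(
        x, analytic_slope(x), numeric_slope(x)))
```

The two values should agree closely at every point, and both should be essentially zero at $x = 3$, where the parabola bottoms out.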

Using Python, we'll implement a gradient descent function. First, let's write the update rule, which moves $x$ a small step in the direction opposite to the slope:

$$ x_{current} = x_{previous} - \text{learning rate} \times \frac{dy}{dx}$$ where $\frac{dy}{dx}$ equals $2(x_{previous}-3)$.
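As a quick check by hand, assume a starting value of $x_{previous} = 1$ and a learning rate of $0.05$ (the same values used in the code later in this section). A single update then works out to:

$$ x_{current} = 1 - 0.05 \times 2(1-3) = 1 - 0.05 \times (-4) = 1.2 $$

Because the slope is negative at $x = 1$, subtracting it moves $x$ to the right, toward the minimum.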


The basic steps that we'll follow are:

* initialize at a value for x
* calculate a "new" current *x* with the learning rate and gradient
* update the next iteration with this "new" value of x
* iterate until the maximum number of iterations is reached


```python
# Initialize at x=1
cur_x = 1

# Learning rate (how much to adjust x each iteration)
rate = 0.05

# Maximum number of iterations
max_iters = 25

# Initialize iteration counter
iters = 0

# Gradient (slope) of the function y = (x-3)**2
def grad(x):
    return 2 * (x - 3)

while iters < max_iters:
    # Set the previous x as the current
    prev_x = cur_x

    # Calculate the "new" current x by stepping against the gradient
    cur_x = prev_x - (rate * grad(prev_x))

    # Advance the iteration counter
    iters += 1
    print("Iteration {} - x value: {}.".format(iters, cur_x))

# Print out the final estimate
print("The estimated minimum occurs at", cur_x)
```

```
Iteration 1 - x value: 1.2.
Iteration 2 - x value: 1.38.
Iteration 3 - x value: 1.5419999999999998.
Iteration 4 - x value: 1.6877999999999997.
Iteration 5 - x value: 1.8190199999999999.
Iteration 6 - x value: 1.937118.
Iteration 7 - x value: 2.0434061999999997.
Iteration 8 - x value: 2.1390655799999996.
Iteration 9 - x value: 2.2251590219999997.
Iteration 10 - x value: 2.3026431198.
Iteration 11 - x value: 2.37237880782.
Iteration 12 - x value: 2.4351409270380002.
Iteration 13 - x value: 2.4916268343342.
Iteration 14 - x value: 2.54246415090078.
Iteration 15 - x value: 2.5882177358107024.
Iteration 16 - x value: 2.629395962229632.
Iteration 17 - x value: 2.6664563660066687.
Iteration 18 - x value: 2.6998107294060016.
Iteration 19 - x value: 2.7298296564654017.
Iteration 20 - x value: 2.7568466908188616.
Iteration 21 - x value: 2.7811620217369755.
Iteration 22 - x value: 2.8030458195632777.
Iteration 23 - x value: 2.82274123760695.
Iteration 24 - x value: 2.840467113846255.
Iteration 25
```

After 25 iterations, the value is approaching the minimum at $x=3$ but has not yet reached it. To speed up the convergence, we can adjust the learning rate, which controls the size of the step we adjust $x$ by. We'll leave that up to you to complete as a challenge.
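As a hint for why the learning rate matters, we can track the distance between $x$ and the minimum from one iteration to the next. This short derivation applies to this particular parabola only, and $r$ is a symbol introduced here as shorthand for the learning rate:

$$ x_{current} - 3 = (x_{previous} - 3) - r \times 2(x_{previous} - 3) = (1 - 2r)(x_{previous} - 3) $$

With $r = 0.05$, each iteration multiplies the distance to the minimum by $0.9$. The first step, for example, shrinks the distance from $2$ to $1.8$, which is consistent with the gradual approach in the output above.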

## Challenge

Following the example above, try experimenting with the gradient descent algorithm. Specifically, you can change the learning rate and adjust the maximum number of iterations. How close can you get to the minimum value of the function?
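One possible starting point for these experiments is to wrap the loop in a function so that different settings are easy to compare. This is a sketch, not a reference solution: the function name `gradient_descent` and the optional `tolerance` early-stopping check are assumptions added here and are not part of the example above.

```python
def gradient_descent(start_x, rate, max_iters, tolerance=1e-8):
    """Minimize y = (x - 3)**2 and return (final x, iterations used).

    Stops early if the size of a step falls below `tolerance`
    (a hypothetical convenience, not part of the original loop).
    """
    cur_x = start_x
    for i in range(1, max_iters + 1):
        step = rate * 2 * (cur_x - 3)  # learning rate times the gradient
        cur_x = cur_x - step
        if abs(step) < tolerance:
            return cur_x, i
    return cur_x, max_iters

# Compare a few learning rates; try adding your own values to this list
for rate in [0.01, 0.05, 0.1]:
    final_x, used = gradient_descent(start_x=1, rate=rate, max_iters=25)
    print("rate={}: x={:.6f} after {} iterations".format(rate, final_x, used))
```

Calling the function with `rate=0.05` and `max_iters=25` repeats the run from the example, so it can serve as a baseline when comparing other settings.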

## Additional Resources

* [Khan Academy: Gradient and Directional Derivatives](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/why-the-gradient-is-the-direction-of-steepest-ascent)
* [Implement a Gradient Descent in Python](https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1)