# How the Machine Learns
## The Algorithm
1. We start off by making a **prediction** using completely random parameters.
2. Then we **calculate the error**: this is where we measure how good our prediction was.
3. Next comes the learning step, where we adjust our parameters so that we have a smaller error the next time around and **learn from our mistakes**.
4. We make a new prediction with the modified parameters, which we found by changing our old ones based on how far off they were.
5. We keep repeating steps 2-4 until we are satisfied.
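The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data (points on the line y = 2x), the one-parameter model `w * x`, the squared error and the `learning_rate` value are all hypothetical choices, not something this notebook prescribes.

```python
import random

# Hypothetical toy data that follows y = 2 * x (assumed for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

random.seed(0)
w = random.uniform(-1.0, 1.0)  # start from a completely random parameter
learning_rate = 0.01           # assumed step size

errors = []
for _ in range(200):  # keep repeating
    # Make a prediction with the current parameter
    predictions = [w * x for x in xs]

    # Calculate the error (mean of the squared differences)
    error = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
    errors.append(error)

    # Learning step: nudge w against the slope of the error with respect to w
    slope = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    w = w - learning_rate * slope

print('Learned parameter:', w)
print('First error:', errors[0], 'Last error:', errors[-1])
```

The learned `w` ends up very close to 2, and the error shrinks on every pass through the loop.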




# Cost Functions and Loss Functions
Loss and cost functions are methods of measuring the error in machine learning predictions.

![example](loss-vs-cost.png)

## Loss Functions
Loss functions measure the error per observation, whilst cost functions measure the error over all observations.

A **Loss Function** is a function that associates a cost with a decision. It measures the error between a single prediction and the corresponding actual value.
>  Loss functions are used to determine the error (aka “the loss”) between the output of our algorithms and the given target value.
> In layman’s terms, the loss function expresses how far off the mark our computed output is. 

### Well Known Loss Functions
`l1_loss = abs(actual - prediction)`  
`l2_loss = (actual - prediction) ** 2`
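As a quick illustration, the two losses above can be wrapped in small Python functions and applied to a single observation. The function names and the sample values are assumptions made up for this sketch.

```python
def l1_loss(actual, prediction):
    # Absolute difference between one actual value and one prediction
    return abs(actual - prediction)


def l2_loss(actual, prediction):
    # Squared difference between one actual value and one prediction
    return (actual - prediction) ** 2


# Hypothetical single observation
actual, prediction = 3.0, 2.5
print(l1_loss(actual, prediction))  # 0.5
print(l2_loss(actual, prediction))  # 0.25
```

Note that each call looks at one observation only, which is what makes these loss functions rather than cost functions.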

## Cost Functions
A **Cost Function** is a function which measures the error between predictions and their actual values across the whole dataset. To do this it aggregates the loss values that are calculated per observation.

Cost functions are used in regression when finding a line of best fit: we minimize the overall loss between all the points and the predictions from the line.

### Well Known Cost Functions
Mean Square Error or mse: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2$  
Mean Absolute Error or mae: $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat y_i|$  
In both formulas, $y_i$ is the observed value in the $i^{th}$ observation and $\hat y_i$ is the predicted value for the $i^{th}$ observation.
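A minimal sketch of both cost functions in plain Python is shown below. The function names `mse` and `mae` and the three sample observations are assumptions for illustration; each function simply averages the per-observation loss over the whole dataset.

```python
def mse(actual, predicted):
    # Average of the squared (L2) losses over all observations
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    # Average of the absolute (L1) losses over all observations
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed values and predictions
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print(mse(observed, predicted))  # 1.6666666666666667
print(mae(observed, predicted))  # 1.0
```

The errors here are 1, 0 and -2. Squaring gives 1, 0 and 4 (mean 5/3), while the absolute values are 1, 0 and 2 (mean 1), so MSE punishes the largest miss more heavily than MAE does.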

# Gradient Descent
In Data Science and Machine Learning we want our models to have a low error and a high accuracy. Gradient Descent is the algorithm we use to minimize the error measured by our loss and cost functions. **Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.** It is the learning algorithm our computers use to reduce that error step by step.

To run gradient descent:
1. We make a random guess for the parameter of our loss or cost function.
2. Then we run until a certain stopping condition is met, for example running for a fixed number of iterations.
3. While the algorithm is running, we take the derivative of our loss or cost function at the current guess. This is called finding the gradient.
4. Then we travel in the direction opposite to our gradient, heading towards the point where the gradient is zero. We need a learning rate to determine how big our steps are as we head towards the minimum. If our gradient is negative we go in the positive direction; if it is positive we go in the negative direction.
5. Lastly we review the result of our gradient descent algorithm and decide if we want to modify our initial loss or cost function to change what the model cares about more.
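Step 4 can be written as a single update rule. The symbols are notational assumptions rather than something defined above: $x_{old}$ is the current guess, $\gamma$ is the learning rate, and $f'$ is the derivative of the loss or cost function.

$$x_{new} = x_{old} - \gamma \cdot f'(x_{old})$$

Because the gradient is subtracted, a positive gradient moves the guess in the negative direction and a negative gradient moves it in the positive direction. The size of the move is the learning rate times the size of the gradient.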


## Gradient Descent Example
Our simple cost function:  
## $$f(x) = x^2 + x + 1$$  
We are going to use gradient descent to minimize the cost.  

We import our modules.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Below is our cost function written as a Python function.

In [None]:
def f(x):
    return x**2 + x + 1

Below we use `linspace` to create a sequence of `num` equally spaced values between `start` and `stop`.

In [None]:
x_1 = np.linspace(start=-3, stop=3, num=500)
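To see what `np.linspace` returns, here is a small sketch that uses only 7 points so the values are easy to read. The variable name `small` is a hypothetical name for this illustration; the second call repeats the notebook's own `x_1`.

```python
import numpy as np

# 7 equally spaced values from -3 to 3, both endpoints included
small = np.linspace(start=-3, stop=3, num=7)
print(small.tolist())  # [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# The same call with num=500 gives a much finer grid over the same range
x_1 = np.linspace(start=-3, stop=3, num=500)
print(len(x_1), x_1[0], x_1[-1])  # 500 -3.0 3.0
```

A finer grid does not change the function, it only makes the plotted curve look smoother.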

Next we plot our cost function using our generated x values

In [None]:
plt.plot(x_1, f(x_1))
plt.xlim([-3, 3])
plt.ylim(0, 8)
plt.xlabel('X', fontsize=16)
plt.ylabel('f(x)', fontsize=16)

We want the value of x that minimizes the cost. The slope tells us when we have reached our lowest point: we want the point where the slope is zero.

Below is a Python function that is the derivative of our cost function.

In [None]:
def df(x):
    return 2*x + 1
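One way to sanity check a hand-written derivative is to compare it against a numerical slope. The sketch below assumes a central difference with a small step `h`; the helper `numerical_slope` is a hypothetical name introduced for this check, not part of the notebook.

```python
def f(x):
    return x**2 + x + 1


def df(x):
    return 2*x + 1


def numerical_slope(func, x, h=1e-6):
    # Central difference: slope of the line through two points very close to x
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (-3.0, -0.5, 2.0):
    print(x, df(x), numerical_slope(f, x))
```

The two slopes agree closely at every point. Setting `2*x + 1` equal to zero gives x = -0.5, which is where the cost is lowest, with f(-0.5) = 0.75.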

Next we graph our derivative.

In [None]:
plt.figure(figsize=[15, 5])

# Chart 1: Cost function
plt.subplot(1, 2, 1) # one row, two columns, index 1
plt.plot(x_1, f(x_1), linewidth=5)
plt.title('Cost Function')
plt.grid()
plt.xlim(-3, 3)
plt.ylim(0, 8)
plt.xlabel('X', fontsize=16)
plt.ylabel("f(x)", fontsize=16)

# Chart 2: Derivative
plt.subplot(1, 2, 2)
plt.title('Slope of Cost Function')
plt.plot(x_1, df(x_1), c='skyblue', linewidth=5)
plt.grid()
plt.xlim(-3, 3)
plt.ylim(-4, 5)
plt.xlabel('X', fontsize=16)
plt.ylabel('df(x)', fontsize=16)

plt.show()

Our Gradient Descent Algorithm

To make our calculations more precise we need to run our algorithm for more iterations or make our precision value smaller.

In [None]:
new_x = 3 # Our random guess
previous_x = 0
step_multiplier = 0.1 # Our learning rate
precision = 0.00001

learning_steps = []
gradients_list = []

for n in range(500): # Our secondary stopping condition
    previous_x = new_x
    gradient = df(previous_x) # Find the gradient

    learning_steps.append(new_x)
    gradients_list.append(gradient)

    new_x = previous_x - step_multiplier * gradient # Modify our parameter
    step_size = abs(new_x - previous_x) # How far we moved in this step

    if (step_size < precision): # Our primary stopping condition
        break

print('Local minimum occurs at: ', new_x)
print('Slope or df(x) value at this point is:', df(new_x))
print('f(x) value or cost at this point is:', f(new_x))
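To see how the precision value affects the result, the loop above can be wrapped in a function and run with two different settings. This is a sketch only: the function name `gradient_descent` and its parameter names are assumptions for illustration, and the logic mirrors the loop above rather than any library API.

```python
def df(x):
    return 2*x + 1


def gradient_descent(derivative, start, learning_rate=0.1, precision=0.00001, max_iter=500):
    x = start
    for step in range(1, max_iter + 1):  # secondary stopping condition
        new_x = x - learning_rate * derivative(x)
        moved = abs(new_x - x)  # how far we moved in this step
        x = new_x
        if moved < precision:  # primary stopping condition
            break
    return x, step


coarse_x, coarse_steps = gradient_descent(df, start=3, precision=0.001)
fine_x, fine_steps = gradient_descent(df, start=3, precision=0.00001)

print('precision 0.001  :', coarse_x, 'after', coarse_steps, 'steps')
print('precision 0.00001:', fine_x, 'after', fine_steps, 'steps')
```

The smaller precision value takes more steps and lands closer to the true minimum at x = -0.5.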

In [None]:
plt.figure(figsize=[15, 5])

plt.subplot(1, 3, 1) # one row, three columns, index 1
plt.plot(x_1, f(x_1), linewidth=3, alpha=0.6)
plt.title('Cost Function')
plt.grid()
plt.xlim([-3, 3])
plt.ylim(0, 8)
plt.xlabel('X', fontsize=16)
plt.ylabel("f(x)", fontsize=16)

learning_steps_array = np.array(learning_steps)
plt.scatter(learning_steps, f(learning_steps_array), color='red', s=40)

# Chart 2: Derivative
plt.subplot(1, 3, 2)
plt.title('Slope of Cost Function')
plt.plot(x_1, df(x_1), c='skyblue', linewidth=3, alpha=0.6)
plt.grid()
plt.xlim(-2, 3)
plt.ylim(-3, 6)
plt.xlabel('X', fontsize=16)
plt.ylabel('df(x)', fontsize=16)

learning_steps_array = np.array(learning_steps)
plt.scatter(learning_steps, gradients_list, color='red', s=40)

plt.show()

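The further we are from our minimum point, the steeper our slope is. A steep slope means high error, so the algorithm takes bigger steps there and smaller steps as the slope flattens out near the minimum.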
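The shrinking steps can be checked by printing the first few updates. The sketch below reuses the same starting guess and learning rate as the example; the list name `step_sizes` is a hypothetical helper for this illustration.

```python
def df(x):
    return 2*x + 1


x = 3                  # same starting guess as the example
step_multiplier = 0.1  # same learning rate as the example

step_sizes = []
for _ in range(5):
    gradient = df(x)
    step = step_multiplier * gradient  # the step is proportional to the slope
    step_sizes.append(abs(step))
    print(f"x = {x:.4f}  slope = {gradient:.4f}  step = {abs(step):.4f}")
    x = x - step
```

Each step is smaller than the one before it, because the slope gets flatter as x approaches the minimum.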

In [None]:
# Chart 3: Derivative Close Up
plt.figure(figsize=[20, 3])
plt.subplot(1, 3, 3)
plt.title('Derivative Close Up')
plt.plot(x_1, df(x_1), c='skyblue', linewidth=3, alpha=0.6)
plt.grid()
plt.xlim(-0.55, -0.2)
plt.ylim(-0.3, 0.8)
plt.xlabel('X', fontsize=16)
plt.ylabel('df(x)', fontsize=16)

learning_steps_array = np.array(learning_steps)
plt.scatter(learning_steps, gradients_list, color='red', s=40)

plt.show()