## Regression

**Regression** - think about fitting a line through data. If the line is given by the equation $y = w_{1}x + w_{2}$, we can adjust it by modifying the weights: changing $w_{1}$ increases or decreases the slope, and changing $w_{2}$ moves the line up or down.

This notebook covers several methods for moving the line to fit the data.

### Absolute trick

This is one method that we can use to adjust the line to better fit the data.

![Absolute trick](extra_images/absolute_trick.png)

Note that it's not restricted to adding: subtraction is used instead when we want to lower the line. If `p` is negative, the line rotates in the other direction. Why is `p` so important? It tells us how far the point is from the y-axis in terms of **horizontal distance**. If the point is close to the y-axis we change the slope by a small amount, otherwise we change it by a lot more.
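
As a rough sketch of the rule described above, assuming the usual form of the absolute trick in which the slope moves by `p` times a small step and the intercept moves by the step itself. The function name `absolute_trick` and the `learning_rate` parameter are illustrative assumptions, not names taken from the figure.

```python
def absolute_trick(w1, w2, p, q, learning_rate=0.01):
    """Move the line y = w1*x + w2 one small step toward the point (p, q).

    Assumed form of the rule: add when the point is above the line,
    subtract when it is below.
    """
    y_hat = w1 * p + w2  # current prediction at x = p
    if q > y_hat:
        # Point is above the line: raise the line.
        return w1 + p * learning_rate, w2 + learning_rate
    if q < y_hat:
        # Point is below the line: lower the line.
        return w1 - p * learning_rate, w2 - learning_rate
    # Point is already on the line: nothing to do.
    return w1, w2


# The line y = 2x + 3 predicts 13 at x = 5, so the point (5, 15) is above it.
new_w1, new_w2 = absolute_trick(2.0, 3.0, p=5.0, q=15.0, learning_rate=0.1)
print(new_w1, new_w2)
```

Because the slope update is scaled by `p`, a point far from the y-axis rotates the line more than a point close to it.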

### Square trick

This solves a problem with the absolute trick: `p` is only good for telling us the horizontal distance, but what about the vertical distance? Enter `q`.

![Square trick](extra_images/square-trick.png)

More importantly, there is only one rule regardless of whether the line is above or below the point, whereas the absolute trick required either addition or subtraction.
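
A minimal sketch of the single rule, assuming the usual form of the square trick in which both updates are scaled by the vertical difference between `q` and the current prediction. The names `square_trick` and `learning_rate` are illustrative assumptions.

```python
def square_trick(w1, w2, p, q, learning_rate=0.01):
    """Move the line y = w1*x + w2 one small step toward the point (p, q).

    The sign of (q - y_hat) decides the direction, so no branching on
    "above" or "below" is needed.
    """
    y_hat = w1 * p + w2   # current prediction at x = p
    diff = q - y_hat      # vertical difference, positive if the point is above
    new_w1 = w1 + p * diff * learning_rate
    new_w2 = w2 + diff * learning_rate
    return new_w1, new_w2


# Point above the line y = 2x + 3 (prediction 13, target 15): the line moves up.
print(square_trick(2.0, 3.0, p=5.0, q=15.0))
# Point below the same line (target 10): the same rule moves the line down.
print(square_trick(2.0, 3.0, p=5.0, q=10.0))
```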

### Gradient descent

Gradient descent is a more reliable method for fitting the line, especially when training neural networks, because it works directly with the error. We compute the gradient of the error with respect to the weights and step in the direction of the negative gradient, which is the direction in which the error decreases fastest. Repeating this step minimizes the error by descending towards its minimum value.

![Gradient descent](extra_images/gradient-descent.png)
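
The descent loop might look roughly like the following, using the mean squared error defined later in these notes as the error being minimized. The toy data, the starting weights of zero, and the `learning_rate` and `epochs` values are assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w1*x + w2 by repeatedly stepping against the MSE gradient."""
    w1, w2 = 0.0, 0.0  # assumed starting point
    m = len(xs)
    for _ in range(epochs):
        # Difference between the real value and the prediction for each point.
        residuals = [y - (w1 * x + w2) for x, y in zip(xs, ys)]
        # Gradient of (1 / 2m) * sum((y - y_hat)^2) with respect to each weight.
        grad_w1 = -sum(r * x for r, x in zip(residuals, xs)) / m
        grad_w2 = -sum(residuals) / m
        # Step in the direction of the negative gradient.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2


# Toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w1, w2 = gradient_descent(xs, ys)
print(w1, w2)  # approaches slope 2 and intercept 1
```

If the learning rate is too large the steps overshoot and the error grows instead of shrinking, so the value usually needs tuning for the data at hand.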


How do we actually measure the error?

### Mean Absolute Error

It's one of the most commonly known methods: $Error = \frac{1}{m}\sum_{i=1}^{m}|y_{i} - \hat{y}_{i}|$

We take the absolute value of the difference between the real value and the prediction so that each point contributes a correct error regardless of whether the difference is positive or negative. This keeps positive and negative differences from cancelling each other out.
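
A small sketch of the formula, with made-up values chosen so that the raw differences cancel out while the absolute differences do not. The function name `mean_absolute_error` is an illustrative assumption.

```python
def mean_absolute_error(ys, y_hats):
    """Average absolute difference between real values and predictions."""
    m = len(ys)
    return sum(abs(y - y_hat) for y, y_hat in zip(ys, y_hats)) / m


ys = [3.0, 5.0, 7.0]      # real values
y_hats = [4.0, 4.0, 7.0]  # predictions: one too high, one too low, one exact

# Without the absolute value the differences -1, +1 and 0 sum to zero,
# which would wrongly suggest a perfect fit.
signed_sum = sum(y - y_hat for y, y_hat in zip(ys, y_hats))
print(signed_sum)
print(mean_absolute_error(ys, y_hats))
```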

### Mean Squared Error

Two important aspects to note:
- if we are squaring then we avoid having negative numbers
- the 1/2 is for convenience as later we are taking the derivative of this error

The equation: $Error = \frac{1}{2m}\sum_{i=1}^{m}(y_{i} - \hat{y}_{i})^2$
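
The same formula as a short sketch, including the 1/2 factor. The function name `mean_squared_error` and the sample values are illustrative assumptions.

```python
def mean_squared_error(ys, y_hats):
    """Half the average squared difference between real values and predictions."""
    m = len(ys)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / (2 * m)


ys = [3.0, 5.0, 7.0]      # real values
y_hats = [4.0, 4.0, 9.0]  # predictions

# Squared differences are 1, 1 and 4, so the larger miss dominates the error.
print(mean_squared_error(ys, y_hats))
```

Squaring penalizes large misses more heavily than the absolute value does, which is one practical difference between the two error functions.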



### Minimizing Error Functions

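The tricks and the error functions represent the same thing: a gradient descent step that minimizes the mean absolute error is the absolute trick, and a gradient descent step that minimizes the mean squared error is the square trick.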

As an example, we can derive the gradient of the squared error in the following manner.

We first define the squared error for a single point: $Error = \frac{1}{2}(y - \hat{y})^2$

Also, we define the prediction as $\hat{y} = w_{1}x + w_{2}$

So to calculate the derivative of the error with respect to $w_{1}$, we simply use the chain rule:

$\frac{\partial}{\partial w_{1}} Error = \frac{\partial Error}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_{1}}$

Both factors follow from the definitions of the error and the prediction:

- The first factor of the right hand side is the derivative of the Error with respect to the prediction $\hat{y}$, which is $-(y - \hat{y})$.
- The second factor is the derivative of the prediction with respect to $w_{1}$, which is simply $x$.

Our final equation is therefore:

$\frac{\partial}{\partial w_{1}} Error = -(y - \hat{y})x$

The mean squared error averages this single-point error over all $m$ points, so its derivative is the average of this expression. Because gradient descent subtracts the gradient, the step adds a multiple of $(y - \hat{y})x$ to $w_{1}$, which is the same update as the square trick.
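
One way to sanity-check a derivative like this is to compare it with a finite-difference estimate. The sketch below assumes the single-point error $\frac{1}{2}(y - \hat{y})^2$, for which differentiating through $\hat{y}$ gives $-(y - \hat{y})x$ (the minus sign comes from the inner derivative of $y - \hat{y}$). The function names, the sample point, and the step size `eps` are illustrative assumptions.

```python
def squared_error(w1, w2, x, y):
    """Single-point squared error with the 1/2 convenience factor."""
    y_hat = w1 * x + w2
    return 0.5 * (y - y_hat) ** 2


def grad_w1(w1, w2, x, y):
    """Analytic derivative of squared_error with respect to w1."""
    y_hat = w1 * x + w2
    return -(y - y_hat) * x


w1, w2, x, y = 2.0, 3.0, 5.0, 15.0
eps = 1e-6

# Central finite difference: nudge w1 slightly in both directions.
numeric = (squared_error(w1 + eps, w2, x, y)
           - squared_error(w1 - eps, w2, x, y)) / (2 * eps)
analytic = grad_w1(w1, w2, x, y)

print(numeric, analytic)  # the two values should agree closely
```

The same check works for $w_{2}$ by nudging `w2` instead, where the expected derivative is $-(y - \hat{y})$.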