# Linear Regression

Suppose we have a line represented by the equation

$$y = w_1x + w_2$$

and a point with coordinates $(p, q)$.

### Absolute Trick

To move the line closer to the point, we add $p$ to the slope and $1$ to the intercept:

$$y = (w_1 + p)x + (w_2 + 1)$$

This step turns out to be too large and over-corrects the line.

Instead, we scale the step by a small number called the learning rate, denoted alpha ($\alpha$):

$$y = (w_1 + p\alpha)x + (w_2 + \alpha)$$

The formula above works when the point $(p, q)$ is above the line.

When the point is underneath the line, we need to subtract instead in order to move the line toward it:

$$y = (w_1 - p\alpha)x + (w_2 - \alpha)$$
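The two cases can be combined into one small function. This is only a minimal sketch: the name `absolute_trick` and the example numbers are made up for illustration, and the choice to leave the line untouched when the point already lies on it is an assumption.

```python
def absolute_trick(w1, w2, p, q, alpha):
    """Return the updated (w1, w2) after one absolute-trick step toward (p, q)."""
    q_hat = w1 * p + w2              # height of the line at x = p
    if q > q_hat:                    # point is above the line: add
        return w1 + p * alpha, w2 + alpha
    if q < q_hat:                    # point is below the line: subtract
        return w1 - p * alpha, w2 - alpha
    return w1, w2                    # assumption: no move if the point is on the line


# Line y = 2x + 3 and the point (1, 8), which lies above the line
# because the line has height 5 at x = 1.
new_w1, new_w2 = absolute_trick(2.0, 3.0, p=1.0, q=8.0, alpha=0.5)
print(new_w1, new_w2)  # 2.5 3.5
```

Note that the size of the step depends only on $p$ and $\alpha$, not on how far the point is from the line.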

### Square Trick
If we have a point that is close to a line, we want to move the line very little.

If a point is far from the line, we want to move the line a lot more.

The absolute trick does not take into account how far the point is from the line.

The square trick addresses this.

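The point has coordinates $(p, q)$, and the point on the line directly above or below it is $(p, q')$. The signed vertical distance from the line to the point is $q - q'$, which is positive when the point is above the line and negative when it is below, so a single formula covers both cases:

$$y = (w_1 + p(q - q')\alpha)x + (w_2 + (q - q')\alpha)$$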
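The update above can be sketched in a few lines. The function name `square_trick` and the sample values are assumptions chosen for illustration. The example shows that a distant point moves the line more than a nearby one.

```python
def square_trick(w1, w2, p, q, alpha):
    """Return the updated (w1, w2) after one square-trick step toward (p, q)."""
    q_hat = w1 * p + w2          # q' in the text: height of the line at x = p
    diff = q - q_hat             # signed vertical distance, negative if below
    return w1 + p * diff * alpha, w2 + diff * alpha


# Line y = 2x + 3, which has height 5 at x = 1.
far = square_trick(2.0, 3.0, p=1.0, q=8.0, alpha=0.5)    # point 3 units above
near = square_trick(2.0, 3.0, p=1.0, q=5.5, alpha=0.5)   # point 0.5 units above
print(far)   # (3.5, 4.5)
print(near)  # (2.25, 3.25)
```

No separate branch is needed for points below the line, because the sign of `diff` already flips the direction of the step.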

# Gradient Descent

We move the line and calculate the error.
- If we move in one direction and the error increases, that is not the way to go.
- If we move in the other direction and the error decreases, we pick that one.

We repeat these steps many times, reducing the error a little each time, until we reach a line that fits the points well.
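The "try both directions and keep the better one" intuition can be illustrated with a toy sketch. It is not real gradient descent, which uses the derivative to choose the direction and step size. It tunes a single weight rather than a whole line, and the names `descend` and `toy_error` are hypothetical.

```python
def descend(error, w, step=0.1, iterations=100):
    """Repeatedly try a small move in each direction and keep the lowest error."""
    for _ in range(iterations):
        candidates = [w - step, w, w + step]   # move left, stay, or move right
        w = min(candidates, key=error)         # keep whichever has the lowest error
    return w


def toy_error(w):
    """A made-up error curve whose minimum is at w = 3."""
    return (w - 3.0) ** 2


w = descend(toy_error, w=0.0)
print(round(w, 2))
```

Starting from `w = 0`, the weight walks toward 3 one step at a time and then stays put, because moving either way would increase the error.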

**The two most common error functions for linear regression**

- Mean absolute error
- Mean squared error

### Mean absolute error
We have some points with coordinates $(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)$, and a line whose predictions we denote by $\hat{y}$.
The corresponding points on the line are $(x_1,\hat{y}_1),(x_2,\hat{y}_2),\ldots,(x_m,\hat{y}_m)$.

The vertical distance from a point to the line is $|y_i - \hat{y}_i|$; we call it the error for that point.

The total error is the sum of these distances over all the points:

$$Error = \sum_{i=1}^m | y_i - \hat{y}_i|$$

The mean absolute error is this sum divided by the number of points:

$$Error = \frac{1}{m} \sum_{i=1}^m | y_i - \hat{y}_i|$$
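As a small worked example, the mean absolute error of a line over a handful of points can be computed as follows. The function name and the data are made up for illustration.

```python
def mean_absolute_error(w1, w2, xs, ys):
    """Mean absolute error of the line y = w1 * x + w2 over the points (xs, ys)."""
    m = len(xs)
    return sum(abs(y - (w1 * x + w2)) for x, y in zip(xs, ys)) / m


# The line y = 2x + 3 predicts 5, 7, 9 at x = 1, 2, 3.
# The absolute errors against the observed 5, 8, 7 are 0, 1, 2.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 8.0, 7.0]
print(mean_absolute_error(2.0, 3.0, xs, ys))  # 1.0
```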

### Mean squared error
Similar to the mean absolute error, but instead of taking the distance between the point and the prediction, we draw a square with this segment as its side and use its area.

$$Error = \frac{1}{2m} \sum_{i=1}^m ( y_i - \hat{y}_i)^2$$

The factor of one half is there for convenience, because later we will take the derivative of this error and the 2 from the exponent cancels it.
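The same kind of sketch works for the mean squared error, including the factor of one half used in these notes. The function name and the data are illustrative assumptions.

```python
def mean_squared_error(w1, w2, xs, ys):
    """Mean squared error, with the 1/2 factor, of y = w1 * x + w2 over (xs, ys)."""
    m = len(xs)
    return sum((y - (w1 * x + w2)) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# The line y = 2x + 3 predicts 5, 7, 9 at x = 1, 2, 3.
# The squared errors against the observed 5, 8, 7 are 0, 1, 4, so the result is 5/6.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 8.0, 7.0]
print(mean_squared_error(2.0, 3.0, xs, ys))
```

Because the errors are squared, the point that is 2 units away contributes four times as much as the point that is 1 unit away.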

## Batch Gradient Descent vs Stochastic Gradient Descent

We have two ways to do linear regression:

- by applying the square (or absolute) trick at every point one by one, and repeating this process many times
- by applying the square (or absolute) trick at every point all at the same time, adding up the resulting adjustments to update the weights once, and repeating this process many times

The former is called ***stochastic gradient descent***. The latter is called ***batch gradient descent***.

In most cases, the best way is to split your data into many small batches, each with roughly the same number of points, and then use each batch to update your weights.
This is called ***mini-batch gradient descent***.
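The three variants differ only in how many points feed each weight update, which the sketch below captures with a single `batch_size` parameter. The name `fit_line`, the toy data, and the hyperparameters are all assumptions. Averaging the square-trick terms within a batch is one reasonable choice rather than the only one, and the points are visited in a fixed order for simplicity, whereas stochastic variants usually shuffle them.

```python
def fit_line(xs, ys, batch_size, alpha=0.05, epochs=2000):
    """Fit y = w1 * x + w2 by applying the square trick one batch at a time."""
    w1, w2 = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch = list(zip(xs[start:start + batch_size],
                             ys[start:start + batch_size]))
            # Square-trick terms for every point in the batch, all computed
            # with the weights as they were before this update, then averaged.
            d1 = sum(x * (y - (w1 * x + w2)) for x, y in batch) / len(batch)
            d2 = sum(y - (w1 * x + w2) for x, y in batch) / len(batch)
            w1 += alpha * d1
            w2 += alpha * d2
    return w1, w2


# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

stochastic = fit_line(xs, ys, batch_size=1)        # one point per update
mini_batch = fit_line(xs, ys, batch_size=2)        # two points per update
batch = fit_line(xs, ys, batch_size=len(xs))       # all points per update
for name, (w1, w2) in [("stochastic", stochastic),
                       ("mini-batch", mini_batch),
                       ("batch", batch)]:
    print(name, round(w1, 3), round(w2, 3))
```

On this noise-free data all three settings should end up close to $w_1 = 2$ and $w_2 = 1$. They differ in how many updates they make per pass over the data.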


# The Same Thing

The absolute/square trick and gradient descent are in fact doing the same thing.

### The square trick and the mean squared error

We have a point $(x, y)$ and the equation

$$\hat{y} = w_1x + w_2$$

We want each move of the line to reduce the mean squared error, so we ask in which direction the error decreases.

For a single point, the squared error is

$$E = \frac{1}{2}  ( y - \hat{y})^2$$

To minimize it, we take the derivatives with respect to $w_1$ and $w_2$.

Let's start with $w_1$.

$$
\begin{aligned}
\frac{\partial E}{\partial w_1}
&= \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_1} \\
&= \frac{\partial {(\frac{1}{2}  ( y - \hat{y})^2)}}{\partial \hat{y}} \frac{\partial (w_1x + w_2)}{\partial w_1} \\
&= \frac{\partial {(\frac{1}{2}  ( y^2 - 2y\hat{y} +(\hat{y})^2))}}{\partial \hat{y}} x \\
&= \frac{\partial {(\frac{y^2}{2} - \frac{2y\hat{y}}{2} +\frac{(\hat{y})^2}{2})}}{\partial \hat{y}} x \\
&= (0 - y + \hat{y}) x \\
&= -(y - \hat{y}) x
\end{aligned}
$$

Substituting the point $(p, q)$ and its prediction $q'$, we get the first part of the square trick, up to sign:

$$ -p(q - q')$$

Now let's take the derivative with respect to $w_2$.

$$
\begin{aligned}
\frac{\partial E}{\partial w_2}
&= \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_2} \\
&= \frac{\partial {(\frac{1}{2}  ( y - \hat{y})^2)}}{\partial \hat{y}} \frac{\partial (w_1x + w_2)}{\partial w_2} \\
&= \frac{\partial {(\frac{1}{2}  ( y^2 - 2y\hat{y} +(\hat{y})^2))}}{\partial \hat{y}} 1 \\
&= \frac{\partial {(\frac{y^2}{2} - \frac{2y\hat{y}}{2} +\frac{(\hat{y})^2}{2})}}{\partial \hat{y}} \\
&= (0 - y + \hat{y}) \\
&= -(y - \hat{y})
\end{aligned}
$$

So we get the second part, again up to sign:

$$ -(q - q')$$

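Let's look back at the square trick:

$$y = (w_1 + p(q - q')\alpha)x + (w_2 + (q - q')\alpha)$$

They are the same thing, apart from the sign. The sign flips because gradient descent moves against the gradient, which means subtracting $\alpha$ times each derivative from the corresponding weight:

$$w_1 \leftarrow w_1 - \alpha \frac{\partial E}{\partial w_1} = w_1 + p(q - q')\alpha, \qquad w_2 \leftarrow w_2 - \alpha \frac{\partial E}{\partial w_2} = w_2 + (q - q')\alpha$$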
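As a sanity check on the derivation, the derivatives can be approximated numerically and compared with the analytic expressions $-(q - q')p$ and $-(q - q')$. This is a minimal sketch using central finite differences. The function names, the step size `h`, and the sample values are assumptions.

```python
def squared_error(w1, w2, x, y):
    """E = 1/2 * (y - y_hat)^2 for a single point."""
    return 0.5 * (y - (w1 * x + w2)) ** 2


def numerical_gradient(w1, w2, x, y, h=1e-6):
    """Approximate dE/dw1 and dE/dw2 with central finite differences."""
    dw1 = (squared_error(w1 + h, w2, x, y) - squared_error(w1 - h, w2, x, y)) / (2 * h)
    dw2 = (squared_error(w1, w2 + h, x, y) - squared_error(w1, w2 - h, x, y)) / (2 * h)
    return dw1, dw2


# Line y = 2x + 3 and the point (p, q) = (2, 8); the line has height 7 at x = 2.
w1, w2, p, q = 2.0, 3.0, 2.0, 8.0
q_hat = w1 * p + w2

dw1, dw2 = numerical_gradient(w1, w2, p, q)
print(dw1, -(q - q_hat) * p)   # both close to -2
print(dw2, -(q - q_hat))       # both close to -1
```

Stepping against these derivatives, that is, subtracting $\alpha$ times each of them, gives exactly the square-trick update.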