# Linear Regression

## Introduction
Linear regression is an algorithm for fitting a linear model. Given a linear model described by the equation $ y = wx + b $, it finds the values of $w$ and $b$ such that the model "fits" the provided data, meaning that, given input values $x$, it predicts the values of $y$ as accurately as possible.<br>
<br>
For demonstration, look at the plot below. The blue dots are data points and the red line represents a model that has been fit to that data: its parameters $w$ and $b$ are such that the linear equation $ y = wx + b $ correctly predicts $y$ for the given input values $x$.

![Linear model plot](linear_model_1.png)

The model in the example above matches the data points perfectly. In real examples, however, this happens very rarely. In fact, the data points will almost always be such that it is impossible to find parameters that put them all on the same line. See the example below for a demonstration.

![Linear model plot - random data points](linear_model_2.png)

In this case, we can clearly see that no straight red line we could draw would pass through all the blue dots. How, then, can we fit the linear model's parameters to such data? By doing the best we can in that situation, which means finding parameters $w$ and $b$ that minimize, over all data points, the distance between the model's prediction $\hat{y}$ (the red line) and the actual value $y$ of that data point.<br>
<br>
In other words, we will calculate the error, measured by a so-called **cost function**, and try to minimize it.

## Calculating the error

The function that defines the error **for a single data point** is called the **loss function**.<br>
<br>
For linear regression, the error for each **individual** data point is the **squared** distance between the model's prediction and the actual value of that data point in the dataset. Mathematically, it is represented as:

$$
\tag{1}
\begin{equation}
l(\hat{y}^{(j)}, y^{(j)}) = (\hat{y}^{(j)} - y^{(j)})^2
\end{equation}
$$

Where:<br><br>
$j$ - index of a data point in the dataset.<br>
$y^{(j)}$ - actual value for the $j$-th data point in the dataset.<br>
$\hat{y}^{(j)}$ - value predicted by the model for the $j$-th data point in the dataset.<br>
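
As a minimal sketch of equation (1), the loss for a single data point can be computed as below. The function name `squared_loss` and the sample numbers are made up for illustration, and plain Python is used so that no library is required.

```python
def squared_loss(y_hat, y):
    """Loss for a single data point: squared difference between
    the prediction y_hat and the actual value y (equation (1))."""
    return (y_hat - y) ** 2


# Hypothetical predictions and actual values for three data points.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]

# One loss value per data point.
losses = [squared_loss(y_hat, y) for y_hat, y in zip(predictions, targets)]
print(losses)  # [0.25, 0.25, 0.0]
```

Because the difference is squared, it does not matter whether the prediction lies above or below the actual value: both cases give a non-negative loss.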

However, to calculate an error that can be used to fit the model, we need to know the error for the whole dataset. To measure it, we take the mean of the loss function values over all data points.<br>
This is called the **cost function** and can be represented by the equation below (please don't confuse the cost function symbol $J$ with the $J$ used to represent the matrix of ones in the linear model notebook):

$$
\tag{2}
\begin{equation}
J(\hat{y}, y) = {1 \over m}\displaystyle\sum_{j=1}^m{(\hat{y}^{(j)} - y^{(j)})^2}
\end{equation}
$$

Where:<br><br>
$m$ - number of data points in the dataset.<br>
<br><br>
It is worth noting that in many machine learning materials this cost function contains a factor of 2 in the denominator. That constant is there to simplify the expression after taking its derivative (more on that later); it does not change which parameters minimize the cost. The equation then is:

$$
\tag{3}
\begin{equation}
J(\hat{y}, y) = {1 \over 2m}\displaystyle\sum_{j=1}^m{(\hat{y}^{(j)} - y^{(j)})^2}
\end{equation}
$$
<br><br>
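
The two cost variants above, the mean of the squared losses and the same mean divided by 2, can be sketched as follows. The function names and the sample data are hypothetical, chosen only to make the arithmetic easy to follow. Halving the cost rescales every value by the same constant, so whichever predictions give the lower cost under one variant also give the lower cost under the other.

```python
def mean_squared_cost(predictions, targets):
    """Mean of the squared losses over all m data points."""
    m = len(targets)
    total = sum((y_hat - y) ** 2 for y_hat, y in zip(predictions, targets))
    return total / m


def half_mean_squared_cost(predictions, targets):
    """Same cost with the extra factor of 2 in the denominator."""
    return mean_squared_cost(predictions, targets) / 2


# Hypothetical predictions and actual values for four data points.
predictions = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]

# Squared losses are 0.25, 0.25, 0.0 and 1.0, which sum to 1.5.
print(mean_squared_cost(predictions, targets))       # 0.375
print(half_mean_squared_cost(predictions, targets))  # 0.1875
```
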
The value predicted by the model, $\hat{y}$, is given by the linear model $\hat{y} = wx + b$ or, for several input variables, $\hat{y} = \vec{x}\vec{w} + b$, so equation $(3)$ can finally be written as:

$$
\tag{4}
\begin{equation}
J(\hat{y}, y) = J(\vec{w}, b, y) = {1 \over 2m}\displaystyle\sum_{j=1}^m{(\vec{x}^{(j)}\vec{w} + b - y^{(j)})^2}
\end{equation}
$$

Where:<br><br>
$\vec{w}$ - vector of weights (coefficients of the input variables).<br>
$\vec{x}^{(j)}$ - vector of independent variables for the $j$-th data point.<br>
$b$ - bias coefficient.<br>
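
To make the dependence on the parameters concrete, here is a small sketch of the cost written directly in terms of $\vec{w}$ and $b$, with each squared term being the prediction $\vec{x}^{(j)}\vec{w} + b$ minus the actual value $y^{(j)}$. The names `predict` and `cost` and the tiny dataset are assumptions made for this example only.

```python
def predict(x, w, b):
    """Linear model for one data point: dot product of the
    feature vector x and the weights w, plus the bias b."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w)) + b


def cost(X, y, w, b):
    """Half of the mean squared difference between predictions
    and actual values over the whole dataset."""
    m = len(y)
    total = sum((predict(x_j, w, b) - y_j) ** 2 for x_j, y_j in zip(X, y))
    return total / (2 * m)


# Hypothetical dataset generated exactly by y = 2*x1 + 1*x2 + 3.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [7.0, 7.0, 10.0]

print(cost(X, y, w=[2.0, 1.0], b=3.0))  # 0.0, the generating parameters fit perfectly
print(cost(X, y, w=[0.0, 0.0], b=0.0))  # 33.0, all-zero parameters fit poorly
```

The data stays fixed between the two calls; only the parameters change, and the cost changes with them. That is what makes the cost usable as a target for fitting.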

## Using Gradient Descent to fit parameters

Fitting the model to the data means finding the model parameters $\vec{w}$ and $b$ for which $J(\vec{w}, b, y)$ is minimal. For convenience, let's define the model parameters as a single variable theta: $\theta = \{\vec{w}, b\}$. Our task then is to find $\theta_{opt}$ as:
$$
\theta_{opt} = \operatorname*{arg\,min}_{\theta} J(\theta, y)
$$
<br><br>
In order to find the optimal parameters we will use a procedure called Gradient Descent. This algorithm iteratively updates the model parameters by subtracting from their values the derivatives of the cost function taken with respect to those parameters, scaled by a step size (the learning rate). A properly calibrated Gradient Descent procedure results in a process like the one visualised in the graph below: each iteration gets closer to the minimal value of the cost function.

![Gradient Descent visualisation](gradient_descent_1.png)
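
The iterative update described above can be sketched for the single-variable model $\hat{y} = wx + b$ as follows. This is only an illustration under stated assumptions: the gradient expressions are the standard derivatives of the cost with the factor of 2 in the denominator and are not derived in the text above, and the function name, starting values, learning rate, step count and data are all hypothetical choices.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the gradient
    of the cost (1 / 2m) * sum((w*x + b - y)^2)."""
    w, b = 0.0, 0.0  # assumed starting point
    m = len(xs)
    for _ in range(steps):
        # Prediction error for every data point with the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Derivatives of the cost with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        # Subtract the scaled derivatives from the parameters.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)  # close to 2.0 and 1.0
```

The learning rate is the part that needs calibrating: if it is too large for the data, the updates overshoot and the cost grows instead of shrinking, while a very small value converges slowly.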