# Linear Regression

Linear regression is a supervised learning technique used to describe and model the relationship between a scalar dependent variable Y and one or more independent/predictor variables X.

In linear regression, we use a weighted linear combination of the predictor variables to describe the relationship between them and Y. 

The weights w are found by selecting the combination that minimises the total error between the observed values of Y and the values predicted by the model.


# Formulation

Given input data $x\in\mathbb R^d$ and output data $y\in\mathbb R$, we seek to learn a function $f : \mathbb R^d \rightarrow \mathbb R$ such that $y \approx f(x;w)$ for each data pair $(x,y)$.

$f(x;w)$ is a regression function with free parameters $w$.

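We can model the relationship between the length-$d$ vector of regressor variables $\mathbf x_i$ and the response $y_i$ of the $i$-th observation as $$ y_i = f(\mathbf x_i;w) + \varepsilon_i = w_0 \cdot 1 + w_1x_{i1} + w_2x_{i2} + \cdots + w_dx_{id} + \varepsilon_i = \mathbf x_i^Tw + \varepsilon_i$$ where the error term $\varepsilon_i$ is a random variable that adds noise to the linear relationship between the dependent variable and the regressors. Notice that the first weight $w_0$ is multiplied by the constant 1; we call $w_0$ the intercept. In the compact form $\mathbf x_i^Tw$, the vector $\mathbf x_i$ is understood to carry this constant 1 as its first entry, so that it has length $d+1$ and matches $w$.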
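As a minimal sketch of the noise-free part of this model for a single observation, the snippet below prepends the constant 1 to the regressors and takes the inner product with the weights. The function name `predict_one` and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def predict_one(x, w):
    """Return w_0 + w_1*x_1 + ... + w_d*x_d for a single observation x."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Prepend the constant 1 so that w[0] acts as the intercept.
    x_aug = np.concatenate(([1.0], x))
    return float(x_aug @ w)

# Hypothetical values for d = 2 (not taken from any dataset).
w = np.array([0.5, 2.0, -1.0])   # [w_0, w_1, w_2]
x = np.array([3.0, 4.0])         # [x_1, x_2]
y_hat = predict_one(x, w)        # 0.5 + 2.0*3.0 - 1.0*4.0
print(y_hat)
```

The observed $y_i$ would differ from this value by the noise term $\varepsilon_i$.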

For a set of $n$ observations we can express the above using matrix notation as $ \mathbf y = \mathbf Xw + \varepsilon$, where: 

$$  \mathbf y = \begin{bmatrix} 
y_1 \\ y_2 \\ \vdots \\ y_n
   \end{bmatrix},
$$

$$  \mathbf X = \begin{bmatrix} 
\mathbf x_1^T \\ \mathbf x_2^T \\ \vdots \\ \mathbf x_n^T
   \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\
   1 & x_{21} & x_{22} & \cdots & x_{2d} \\
   \vdots & \vdots & \vdots & \ddots & \vdots \\
   1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix},
$$

and

$$  \mathbf w = \begin{bmatrix} 
w_0 \\ w_1 \\ \vdots \\ w_d
   \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} 
\varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n
   \end{bmatrix}
$$
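The matrix form can be sketched in NumPy as follows. This is an illustration on synthetic data, not a prescribed implementation: the sizes `n` and `d`, the weights `w`, the noise scale and the random seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, assumed for reproducibility
n, d = 5, 2                      # hypothetical number of observations and regressors

# Raw regressors: one row per observation, one column per variable.
features = rng.normal(size=(n, d))

# Design matrix X: a leading column of ones carries the intercept w_0.
X = np.column_stack([np.ones(n), features])   # shape (n, d + 1)

w = np.array([1.0, 2.0, -3.0])                # hypothetical [w_0, w_1, w_2]
eps = rng.normal(scale=0.1, size=n)           # noise vector, one entry per observation

# y = Xw + eps, computed for all n observations at once.
y = X @ w + eps
print(X.shape, y.shape)
```

Each row of `X` is one $\mathbf x_i^T$ with the constant 1 in front, so `X @ w` evaluates every $\mathbf x_i^Tw$ in a single matrix-vector product.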

Clearly, a good model is one whose predictions are as close to the observed data as possible. More formally, we aim to find the parameter vector $w_{ls}$ that minimises the difference between our estimates $\hat y = \mathbf Xw$ and the observed values $\mathbf y$. Rather than minimising the absolute differences, we seek the $w_{ls}$ that minimises the sum of squared differences between the estimates $\hat y$ and the observed responses $\mathbf y$.

Our 'Least Squares' objective is thus:

$$ L = \sum_{i=1}^{n}(y_i - \mathbf x_i^Tw)^2 = \lVert \mathbf y - \mathbf Xw \rVert^2 = (\mathbf y - \mathbf Xw)^T(\mathbf y - \mathbf Xw) $$

and $w_{ls}$ is the value of $w$ that minimises $L$.
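To make the squared-error objective concrete, the sketch below evaluates it on synthetic data both as an explicit sum over observations and as a squared norm of the residual vector, taking predictions as $\hat y = \mathbf Xw$. The helper name `squared_error_loss`, the data and the seed are assumptions for illustration. The last step uses NumPy's `np.linalg.lstsq` as one readily available way to obtain a minimiser numerically; it is not derived in the text above.

```python
import numpy as np

def squared_error_loss(X, y, w):
    """Sum of squared residuals, ||y - Xw||^2."""
    residual = y - X @ w
    return float(residual @ residual)

rng = np.random.default_rng(1)                # fixed seed, assumed for reproducibility
n, d = 50, 2                                  # hypothetical problem size
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -3.0])           # hypothetical generating weights
y = X @ w_true + rng.normal(scale=0.5, size=n)

# Elementwise form: sum over i of (y_i - x_i^T w)^2.
loss_sum = sum((y[i] - X[i] @ w_true) ** 2 for i in range(n))

# Vector form: (y - Xw)^T (y - Xw).
loss_vec = squared_error_loss(X, y, w_true)

# Numerical minimiser of the same objective.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(loss_sum, loss_vec)
print(squared_error_loss(X, y, w_ls))
```

The two forms of the loss should agree up to floating-point error, and the loss at `w_ls` should be no larger than the loss at any other weight vector, including the `w_true` used to generate the data.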