# Linear Regression

### What is regression?

Regression is the prediction of quantitative values given a set of input values.

### What is Linear Regression?

Linear Regression is a linear approach to modelling the relationship between a dependent variable and one or more explanatory or independent variables. This is done by fitting a line or a hyperplane in the input space.<br><br>
Linear Regression focuses on the conditional probability distribution of the response (dependent) variable given the explanatory variables.<br><br>
A simple example to understand the concept is predicting the price of a house given the number of rooms in it.
- The <b>term "linear" in linear regression</b> does not imply that we can only model linear relationships between the independent variable X and the dependent variable Y. It means that the model is linear in its parameters, i.e. Y is a linear function of the adjustable coefficients, which measure the strength of each relationship.

### Simplest form of Linear Regression

In this form the model is a linear function of the input variable X as well as of the parameters $\beta$.

$Y = \overbrace{\beta_0 + \sum \limits_{j=1}^p X_j \beta_j}^{f(X)} +\epsilon$

In the above equation, 
1. X represents the input vector for a single example containing p features, i.e. $X^T = (X_1, X_2, \ldots, X_p)$
2. $X_j$ represents a feature of input vector X
3. $\beta_0$ represents the intercept term, also known as the bias in Machine Learning (not to be confused with bias in the statistical sense). The intercept is the value the output variable is biased to take when every input is zero. For example, when predicting the price of a house from the number of bedrooms, a house with zero bedrooms is not predicted to be free; it is assigned a base price irrespective of the number of rooms. The bias is that base price.
4. $\beta_j$ represents the parameter or coefficient assigned to each feature. It measures the amount by which $Y$ changes for a unit increase in $X_j$ when the other features are held fixed.
5. $\epsilon$ represents the noise or error in the estimation. This variable captures all other factors that influence the dependent variable Y and are not included in the regressors X.<br>
<b>Note: </b> $f(X)$ is the expected value of $Y$ for each input vector $X$; the observed $Y$ is $f(X)$ plus the error term $\epsilon$.<br> 
In the real world, Y is rarely a linear function of the raw inputs X. Fitting this simplest form in such cases leads to inaccurate predictions. We therefore prefer a more general form of Linear Regression, which can capture more complex relationships between the features of X and the dependent variable Y.
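
To make the simplest form concrete before generalizing it, here is a minimal sketch of evaluating $f(X) = \beta_0 + \sum_j X_j \beta_j$ for the house price example. The function name `predict` and the coefficient values (a base price of 50 and 25 per room, in arbitrary units) are assumptions chosen only for illustration, not estimates from any data.

```python
import numpy as np

def predict(x, beta_0, beta):
    """Evaluate f(X) = beta_0 + sum_j X_j * beta_j for a single example."""
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return beta_0 + x @ beta

# Hypothetical coefficients: intercept (base price) and one slope (per room).
beta_0 = 50.0
beta = [25.0]

price_3_rooms = predict([3], beta_0, beta)
price_4_rooms = predict([4], beta_0, beta)
print(price_3_rooms, price_4_rooms)
```

The difference between the two predictions equals the slope, which matches the reading of $\beta_j$ as the change in $Y$ per unit change in $X_j$.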

### General Form of the Linear Regression model

Using the general form of Linear Regression we can model more complex relationships, such as polynomial ones.

$Y = \overbrace{\beta_0 + \sum \limits_{j=1}^p \phi_j(X)\beta_j}^{f(X)} + \epsilon$

In the above equation, 
1. $\phi_j$ represents a derived feature of X: a function of one or more observed features that can be linear or non-linear, such as $X_1 X_2$ or $X_1^2$
2. $\beta_j$ represents the parameter or coefficient assigned to each feature which measures the amount by which $Y$ varies if there is a unit increase or decrease in $\phi_j(X)$.
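
The general form can be sketched in a few lines of NumPy. The helper `expand_features` and the particular choice of derived features $(X_1, X_2, X_1 X_2, X_1^2)$ are assumptions for illustration, as are the coefficient values. The point is that the prediction stays a linear function of $\beta$ even though it is non-linear in the observed features.

```python
import numpy as np

def expand_features(X):
    """Map observed features (X_1, X_2) to phi(X) = (X_1, X_2, X_1*X_2, X_1^2)."""
    X = np.asarray(X, dtype=float)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2])

# Hypothetical coefficients, chosen only for illustration.
beta_0 = 1.0
beta = np.array([2.0, -1.0, 0.5, 3.0])

# Two examples with two observed features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.0]])

Phi = expand_features(X)      # shape (2, 4): one row of derived features per example
y_hat = beta_0 + Phi @ beta   # still a linear function of beta
print(y_hat)
```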

### Objective

Our objective is to estimate the values of the $\beta$ coefficients of the true model $f(X)$ using the training data in order to approximate the response variable values Y.

$\hat{Y} = \hat{\beta}_0 + \sum \limits_{j=1}^p X_j \hat{\beta}_j$

As a simple example of approximating the model $f(X)$, consider a case with only one explanatory variable: given the TV advertising budget, we want to predict the product sales of a company.
<div>
<img src = "attachment:linear%20reg.JPG" width = "500px"></div>

In the above plot the blue line is the approximation of the true model. The larger the number of training examples (provided they are randomly selected), the better the approximation tends to be.
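
The effect of the sample size can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the "true" line (intercept 7, slope 0.05), the noise level, the budget range and the helper name `mean_slope_error`. None of it comes from the data behind the plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" model for the simulation: sales = 7 + 0.05 * budget + noise.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_STD = 7.0, 0.05, 3.0

def mean_slope_error(n_samples, n_repeats=200):
    """Average |estimated slope - true slope| over repeated random samples."""
    errors = []
    for _ in range(n_repeats):
        budget = rng.uniform(0.0, 300.0, size=n_samples)
        noise = rng.normal(0.0, NOISE_STD, size=n_samples)
        sales = TRUE_INTERCEPT + TRUE_SLOPE * budget + noise
        # np.polyfit returns coefficients from the highest degree down.
        slope, intercept = np.polyfit(budget, sales, deg=1)
        errors.append(abs(slope - TRUE_SLOPE))
    return float(np.mean(errors))

error_small = mean_slope_error(20)     # few training examples
error_large = mean_slope_error(2000)   # many training examples
print(error_small, error_large)
```

Averaging over repeated draws is used so that the comparison does not depend on one lucky or unlucky sample.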

### How to fit the model to training data?

In order to fit a model we use the <b>least squares method</b>. In this approach we find the coefficients $\beta$ that minimize the residual sum of squares (RSS).<br> The RSS is the sum of the squared residuals, where each residual is obtained by subtracting the estimated value from the observed value of the dependent variable. <br>
$RSS(\beta) = \sum \limits_{i=1}^N(y_i-f(x_i))^2$<br>
$\hspace{1.6cm}= \sum \limits_{i=1}^N(y_i- \beta_0 - \sum \limits_{j=1}^px_{ij}\beta_j)^2$<br>
where $N$ is the number of training examples, $p$ is the number of explanatory variables and $x_{ij}$ is the value of feature $j$ in training example $i$.
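
A direct translation of the RSS formula into NumPy might look as follows. The function name `rss`, the three data points and the two candidate lines are made up for illustration.

```python
import numpy as np

def rss(beta_0, beta, X, y):
    """Residual sum of squares: sum_i (y_i - beta_0 - sum_j x_ij * beta_j)^2."""
    residuals = y - (beta_0 + X @ beta)
    return float(residuals @ residuals)

# Tiny made-up data set: N = 3 examples, p = 1 feature.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 7.0])

rss_a = rss(0.0, np.array([2.0]), X, y)    # candidate line y = 2x
rss_b = rss(-0.5, np.array([2.5]), X, y)   # candidate line y = -0.5 + 2.5x
print(rss_a, rss_b)
```

The second candidate has the smaller RSS, so by the least squares criterion it is the better fit of the two (though not necessarily the best possible line).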

### How do we minimize the RSS?

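We can absorb the intercept (bias) term into the vector $\beta$ by adding a constant feature $X_0 = 1$ to $X$, which reduces the model $f(X)$ to
$f(X) = X\beta$<br>
The vector $\beta$ now has $p+1$ entries instead of $p$. For the whole training set, $X$ is the $N \times (p+1)$ matrix with one row per example and $y$ is the vector of the $N$ observed responses.<br>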
Now RSS can be written as,<br>
$RSS(\beta) = (y-X\beta)^T(y-X\beta)$<br>
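
As an intermediate step (a sketch added for readability, using only the symbols already defined), the product can be expanded as<br>
$RSS(\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$<br>
where the two cross terms were merged because $y^TX\beta$ is a scalar and so equals its transpose $\beta^TX^Ty$. The identities $\frac{\partial (\beta^TX^Ty)}{\partial \beta} = X^Ty$ and $\frac{\partial (\beta^TX^TX\beta)}{\partial \beta} = 2X^TX\beta$ (the latter because $X^TX$ is symmetric) are all that is needed for the differentiation.<br>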

Differentiating RSS with respect to $\beta$ we obtain,<br>

$\frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta)$<br>

$\frac{\partial^2 RSS}{\partial\beta\,\partial\beta^T} = 2X^TX$<br>

Here we assume that X has full column rank, so $X^TX$ is positive definite (all its eigenvalues are positive) and RSS has a unique minimum. We set the first derivative to zero,<br>

$X^T(y-X\beta) = 0$<br>

to obtain the unique solution<br>

<b>$\hat\beta = (X^TX)^{-1}X^Ty$</b><br>

Therefore,<br>

$\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty$<br>
where $\hat y = \hat f(X)$ is the approximation of $f(X)$.
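
The closed-form solution can be tried out on synthetic data. The data-generating model ($y = 1 + 2x_1 - 3x_2$ plus noise), the seed and the sample size are assumptions for this sketch. It solves the normal equations $X^TX\beta = X^Ty$ with `np.linalg.solve` rather than forming $(X^TX)^{-1}$ explicitly, which gives the same $\hat\beta$ and is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from an assumed model y = 1 + 2*x1 - 3*x2 + noise.
N = 200
features = rng.normal(size=(N, 2))
y = 1.0 + features @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=N)

# Add X_0 = 1 so that the intercept becomes part of beta.
X = np.column_stack([np.ones(N), features])

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Cross-check against NumPy's general least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```

Both routes should agree up to rounding, and the estimates should land close to the coefficients used to generate the data.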

### Geometric Interpretation of the Model
<div>
    <img src = "attachment:projec_lr.JPG" width = "400px"> </div>
    
<center>The above image shows that the outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input feature vectors $x_1$ and $x_2$ in the N-dimensional space. The projection $\hat y$ represents the vector of the least squares predictions.</center>
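
The projection view can be checked numerically. In this sketch the data are random numbers (an assumption; any full-rank $X$ would do) and `H` denotes the matrix $X(X^TX)^{-1}X^T$ from the expression for $\hat y$. If $\hat y$ is an orthogonal projection, the residual $y - \hat y$ must be orthogonal to every column of $X$, and projecting twice must change nothing.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 6
X = rng.normal(size=(N, 2))   # the two columns play the role of x_1 and x_2
y = rng.normal(size=N)

# Projection matrix H = X (X^T X)^{-1} X^T, built without an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
residual = y - y_hat

# Inner products of the residual with x_1 and x_2 (zero up to rounding).
print(X.T @ residual)
# Idempotence check: H applied twice equals H applied once.
print(np.allclose(H @ H, H))
```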

### Assumptions of linear regression regarding residuals

Every model rests on certain assumptions, and conclusions drawn from it are unreliable when they do not hold. The assumptions made by Linear Regression regarding residuals are given below:
 <div>
    <img src = "attachment:Q-Q.png" width = "300"></div>

1.	Conditional expected value of the residuals is 0 – This assumption states that, for any value of the explanatory variables, the error terms are centred on 0, so positive and negative errors balance out on average.
2.	Residuals follow a normal distribution – This assumption states that, for each value of the independent variable x, the errors between the observed values of y and the population mean of y at that x are normally distributed.
 
3.	Constant Variance or Homoscedasticity – This assumption states that the residuals are independent of predictor variable x and for all values of x the variance of the residuals remains constant. This concept is homoscedasticity. If the variance of residuals changes with increase in the value of x then it is known as heteroscedasticity.

<div>
    <img src = "attachment:homoscedasticity.png" width = "600"></div>

 
4.	In case of time series data, the error terms are uncorrelated – This means that the current value of residuals is not dependent on the previous values.
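
These assumptions are usually inspected with plots such as the ones above, but rough numerical checks are also possible. The sketch below is one possible set of summaries, not a standard procedure: the function name `residual_checks`, the split into low and high fitted values, and the synthetic data are all assumptions. The last quantity is the Durbin-Watson statistic, for which values near 2 suggest no first-order correlation between consecutive residuals.

```python
import numpy as np

def residual_checks(X, y):
    """Fit least squares (X already contains the column of ones) and
    return rough numerical summaries of the residuals."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    residuals = y - fitted

    # Constant variance: compare residual variance for low vs high fitted values.
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]

    # Normality (partial check): sample skewness, close to 0 for symmetric residuals.
    z = (residuals - residuals.mean()) / residuals.std()

    # Uncorrelated errors: Durbin-Watson statistic on residuals in their given order.
    dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

    return {
        "mean": float(residuals.mean()),
        "variance_ratio": float(high.var() / low.var()),
        "skewness": float(np.mean(z ** 3)),
        "durbin_watson": float(dw),
    }

# Synthetic data that satisfies the assumptions by construction.
rng = np.random.default_rng(7)
N = 500
x = rng.uniform(0.0, 10.0, size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])

checks = residual_checks(X, y)
print(checks)
```

On data that violates an assumption, for example noise whose spread grows with x, the corresponding summary would be expected to drift away from its reference value (a variance ratio far from 1 in that case).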
