# Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory (independent) variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model. 

A linear regression line has an equation of the form **y = β0 + β1 x**, where **x is the explanatory (independent) variable and y is the dependent variable**. 

The slope of the line is **β1**, and **β0** is the intercept (the value of y when x = 0). 

Linear regression explains two important aspects of the variables, which are as follows:

* Does the set of independent variables explain the dependent variable significantly?
* Which variables are the most significant in explaining the dependent variable? In which way do they impact the dependent variable? The impact is usually determined by the magnitude and the sign of the beta coefficients in the equation.



## Assumptions of Linear Regression

1. **Linear relationship**

One of the most important assumptions is that a linear relationship exists between the dependent and the independent variables. If you fit a linear model to a non-linear data set, the model cannot capture the trend in the data, which results in a poorly fitting model and inaccurate predictions.



**How can you determine if the assumption is met?**

A simple way to check this assumption is to create a scatter plot of x against y. If the data points fall roughly along a straight line, there is a linear relationship between the dependent and the independent variables, and the assumption holds.

![](https://i.imgur.com/XmytW3M.jpg)

**What should you do if this assumption is violated?**

If a linear relationship doesn’t exist between the dependent and the independent variables, then apply a non-linear transformation such as logarithmic, exponential, square root, or reciprocal either to the dependent variable, independent variable, or both. 
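As a rough illustration of the transformation idea, the sketch below uses made-up data that grows exponentially and compares the linear correlation before and after taking the logarithm of the dependent variable. The data, the constants, and the helper name `pearson` are assumptions chosen only for this example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical data that follows y = 2 * exp(0.5 * x)
x = [float(i) for i in range(1, 11)]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

r_raw = pearson(x, y)                            # curved relationship
r_log = pearson(x, [math.log(yi) for yi in y])   # log(y) = log(2) + 0.5 * x is linear in x

print(f"correlation of x with y:      {r_raw:.4f}")
print(f"correlation of x with log(y): {r_log:.4f}")
```

After the log transformation the relationship is exactly linear in this toy case, so a linear model becomes appropriate for the transformed variable.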

2. **Little or No autocorrelation in the residuals**

Autocorrelation occurs when the residual errors are dependent on each other. The presence of correlation in the error terms drastically reduces the model’s accuracy. This usually occurs in time series models, where the next instant depends on the previous instant.

**How to determine if the assumption is met?**

Autocorrelation can be tested with the **Durbin-Watson test**. The null hypothesis of the test is that there is no serial correlation. The Durbin-Watson test statistic is defined as:
![](https://i.imgur.com/eJwDh2F.jpg)

The test statistic is approximately equal to 2*(1-r), where r is the sample autocorrelation of the residuals. Thus, for r = 0, indicating no serial correlation, the test statistic equals 2. The statistic always lies between 0 and 4. The closer the statistic is to 0, the more evidence for positive serial correlation. The closer it is to 4, the more evidence for negative serial correlation.
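The statistic can be computed directly from a list of residuals. The sketch below assumes the standard definition, the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The two residual series are invented for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals: neighbours are similar -> positive autocorrelation
trending = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]
# Hypothetical residuals: the sign flips every step -> negative autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(f"trending:    {durbin_watson(trending):.3f}")     # well below 2
print(f"alternating: {durbin_watson(alternating):.3f}")  # well above 2
```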

**What should you do if this assumption is violated?**

If the assumption is violated, consider the following options:

* For positive correlation, consider adding lags of the dependent variable, the independent variables, or both.
* For negative correlation, check whether any of the variables is over-differenced.
* For seasonal correlation, consider adding a few seasonal variables to the model.


3. **Little or no Multicollinearity between the features**

Multicollinearity is a state of very high inter-correlations or inter-associations among the independent variables. It is a type of disturbance in the data that, if present, weakens the statistical power of the regression model.

**How to determine if the assumption is met?**

Use a scatter plot to visualise the correlation between the variables. Pair plots and heatmaps (of the correlation matrix) can be used to identify highly correlated features.

Another way is to compute the VIF (Variance Inflation Factor). As a rule of thumb, VIF <= 4 implies no serious multicollinearity, whereas VIF >= 10 implies serious multicollinearity.
![](https://i.imgur.com/lGzQglL.jpg)
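A minimal sketch of a VIF calculation follows, assuming the usual definition VIF_j = 1 / (1 - R_j²), where R_j² is the R² obtained by regressing feature j on all the other features. It uses NumPy, and the synthetic features `x1`, `x2`, `x3` and the helper name `vif` are assumptions made for this example.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])          # intercept + other features
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)  # regress feature j on the rest
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                  # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=200)      # nearly a copy of x1

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x3 get large values, x2 stays close to 1
```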


**What should you do if this assumption is violated?**

Reduce the correlation between variables by either transforming or combining the correlated variables.

4. **Homoscedasticity**

Homoscedasticity means the residuals have constant variance at every level of x. The absence of this phenomenon is known as heteroscedasticity. Heteroscedasticity generally arises in the presence of outliers and extreme values.

**How to determine if the assumption is met?**

Create a scatter plot of the residuals against the fitted values. If the data points are spread evenly without a prominent pattern, the residuals have constant variance (homoscedasticity). If instead a funnel-shaped pattern is seen, the residuals are not spread evenly, which indicates non-constant variance (heteroscedasticity).
![](https://i.imgur.com/ganj1Sh.jpg)

**What should you do if this assumption is violated?**

* Transform the dependent variable
* Redefine the dependent variable
* Use weighted regression

5. **Normal distribution of error terms**

The last assumption that needs to be checked for linear regression is that the error terms are normally distributed. If the error terms don’t follow a normal distribution, confidence intervals may become too wide or too narrow.

**How to determine if the assumption is met?**

Check the assumption using a Q-Q (Quantile-Quantile) plot. If the data points on the graph form a straight diagonal line, the assumption is met.

![](https://i.imgur.com/agAQnnm.jpg)
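The points of a Q-Q plot can be computed with the standard library alone. The sketch below pairs the sorted, standardized residuals with the quantiles of a standard normal distribution and uses the correlation between the two as a rough numeric summary of how straight the plot is. The plotting positions `(i + 0.5) / n`, the simulated residuals, and the helper names are assumptions for illustration, not a formal test.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)   # standardized, ordered
    # (i + 0.5) / n is one common choice of plotting position
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sample

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return cov / math.sqrt(
        sum((u - mean_a) ** 2 for u in a) * sum((v - mean_b) ** 2 for v in b)
    )

random.seed(0)
normal_resid = [random.gauss(0, 1) for _ in range(500)]        # simulated normal errors
skewed_resid = [random.expovariate(1.0) for _ in range(500)]   # simulated skewed errors

r_normal = pearson(*qq_points(normal_resid))
r_skewed = pearson(*qq_points(skewed_resid))
print(f"Q-Q correlation, normal residuals: {r_normal:.4f}")
print(f"Q-Q correlation, skewed residuals: {r_skewed:.4f}")
```

A correlation very close to 1 corresponds to points lying along the diagonal. The skewed residuals bend away from it and give a visibly lower value.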

You can also check the normality of the error terms using statistical tests like the Kolmogorov-Smirnov or Shapiro-Wilk test.


![](https://i.imgur.com/DeXfeIg.jpg)

The Q-Q plot of the advertising data set shows that the errors (residuals) are fairly normally distributed. The histogram in the “Error(residuals) vs Predicted values” plot in assumption no. 3 also shows that the errors are normally distributed with a mean close to 0.



## Simple Linear Regression
Simple linear regression models try to model relationships on data with one feature or explanatory variable x and a single response variable y where the objective is to predict y. Methods like ordinary least squares (OLS) are typically used to get the best linear fit during model training.

In particular, we have measurements $(x_1, y_1), \ldots, (x_n, y_n)$ that lie approximately on a straight line.
A simple model for these data is that the {$x_i$} are fixed and the variables {$Y_i$} are random such that

<center>$Y_i = β_0 + β_1 x_i + ε_i, \quad i = 1, \ldots, n$</center>

<center><img src="https://i.imgur.com/pcxq3VO.jpg"/></center>

for certain unknown parameters $β_0$ and $β_1$. The {$ε_i$} are assumed to be independent with expectation 0 ($Eε_i=0$) and unknown variance $σ^2$.
The unknown line

<center>$y = β_0 + β_1 x$</center>

is called the regression line. Thus, we view the responses as random variables that would lie exactly on the regression line, were it not for some “disturbance” or “error” term represented by the {$ε_i$}. The extent of the disturbance is modeled by the parameter $σ^2$. The model is called the simple linear regression model.
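To make the model concrete, the sketch below simulates data from $Y_i = β_0 + β_1 x_i + ε_i$ and recovers the parameters with the closed-form least-squares estimates. The "true" values `beta0`, `beta1`, `sigma`, the normal distribution of the errors, and the sample size are assumptions picked only for this illustration.

```python
import random

random.seed(42)

# Hypothetical true parameters of the regression line and the noise level
beta0, beta1, sigma = 2.0, 0.5, 0.3

x = [i / 10 for i in range(100)]                               # fixed x_i
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]  # Y_i = b0 + b1*x_i + eps_i

# Closed-form ordinary least squares estimates for one explanatory variable
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx              # estimated slope
b0 = y_bar - b1 * x_bar       # estimated intercept

print(f"estimated intercept: {b0:.3f} (true value {beta0})")
print(f"estimated slope:     {b1:.3f} (true value {beta1})")
```

The estimates are close to, but not exactly equal to, the true parameters, because the disturbance terms move the observations off the regression line.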



## Multiple Linear Regression

In a multiple linear regression model, the response Y depends on a d-dimensional explanatory vector x = [$x_1$, . . . , $x_d$] via the linear relationship

<center> $Y = β_0 + β_1 x_1 + · · · + β_d x_d + ε$ </center>

where $E ε = 0$ and Var $ε = σ^2$.

<center><img src='https://i.imgur.com/3muWVKz.png'></center>
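A minimal sketch of fitting such a model with NumPy's least-squares solver is given below. The coefficient vector `beta`, the noise scale, and the random design matrix are hypothetical values used only to generate example data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
beta = np.array([1.0, 2.0, -1.0, 0.5])     # hypothetical [beta_0, beta_1, beta_2, beta_3]

X = rng.normal(size=(n, d))                # explanatory vectors, one row per observation
eps = rng.normal(scale=0.1, size=n)        # error term with mean 0
y = beta[0] + X @ beta[1:] + eps           # Y = b0 + b1*x1 + ... + bd*xd + eps

A = np.column_stack([np.ones(n), X])       # column of ones for the intercept
b, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares estimates of all coefficients

print("estimated coefficients:", np.round(b, 3))
```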


## Residual analysis and Prediction
When implementing linear regression of some dependent variable 𝑦 on the set of independent variables **𝐱 = (𝑥₁, …, 𝑥ᵣ)**, where **𝑟** is the number of predictors, you assume a linear relationship between 𝑦 and **𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀**. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with **𝑏₀, 𝑏₁, …, 𝑏ᵣ**. These estimators define the estimated regression function **𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ**. This function should capture the dependencies between the inputs and output sufficiently well.

**The estimated or predicted response**, **𝑓(𝐱ᵢ)**, for each observation **𝑖 = 1, …, 𝑛**, should be as close as possible to the corresponding actual response 𝑦ᵢ. 

<center><img src='https://i.imgur.com/647iRkD.jpg'></center>


The differences **𝑦ᵢ - 𝑓(𝐱ᵢ)** for all observations 𝑖 = 1, …, 𝑛, are called the **residuals**. **Regression is about determining the best predicted weights—that is, the weights corresponding to the smallest residuals.**

To get the best weights, you usually **minimize the sum of squared residuals (SSR)** for all observations 𝑖 = 1, …, 𝑛: **SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))²**. This approach is called the method of **ordinary least squares.**
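The sketch below computes residuals and the SSR for a small invented data set, and checks that nudging the least-squares weights in any direction makes the SSR larger. The data values and the function name `ssr` are assumptions for illustration.

```python
def ssr(b0, b1, x, y):
    """Sum of squared residuals for the line f(x) = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Ordinary least squares weights in closed form (single predictor)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
best = ssr(b0, b1, x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSR = {best:.4f}")

# Any perturbed weights give a larger SSR
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(f"shift ({db0:+.2f}, {db1:+.2f}) -> SSR = {ssr(b0 + db0, b1 + db1, x, y):.4f}")
```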

## Model Assessment

Once you have a working model, it's time to evaluate it on unseen (test) data, not on the training data.

Evaluation metrics are used to measure the performance of machine learning models. Some common evaluation metrics for regression are described below.

In regression tasks, the goal is to predict the continuous value. The difference between the actual value and the predicted value is called the *error*.

***Error = Actual value - Predicted value***

The mean of the squared errors over all samples is called the Mean Squared Error (MSE).

*MSE = SUM(SQUARE(Actual value - Predicted value)) / Number of Samples*

***1. MSE***: 

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

where **n is the number of examples in the data set.**


Taking the square root of the mean squared error gives the Root Mean Squared Error (RMSE). RMSE is one of the most widely used regression metrics. 

***2. RMSE***: 

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

At times you will work with datasets that contain outliers. A suitable metric for those kinds of datasets is the Mean Absolute Error (MAE). It is as simple to calculate as MSE: MAE is the mean of the absolute values of the errors.

*MAE = SUM(ABSOLUTE(Actual value - Predicted value)) / Number of Samples*

***3. MAE:***

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

As noted above, MAE is less sensitive to outliers than MSE, because the errors are not squared. It is a suitable metric for problems that are likely to have abnormal scenarios, such as time series.

<center><img src='https://i.imgur.com/Bs022gj.png'></center>
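The three metrics can be written in a few lines of plain Python. The sketch below uses invented predictions, once without and once with a single badly wrong prediction, to show how much more strongly MSE reacts to an outlier than MAE.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and two sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]            # small errors only
y_pred_outlier = [2.0, 5.0, 8.0, 19.0]   # last prediction is off by 10

for name, pred in [("no outlier", y_pred), ("with outlier", y_pred_outlier)]:
    print(f"{name}: MSE={mse(y_true, pred):.2f}, "
          f"RMSE={rmse(y_true, pred):.2f}, MAE={mae(y_true, pred):.2f}")
```

In this toy case the single outlier raises MSE from 0.5 to 25.5, while MAE only rises from 0.5 to 3.0.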


***4. $R^2$:***

The **Coefficient of Determination**, denoted as **𝑅²**, tells you what proportion of the variation in 𝑦 can be explained by the dependence on 𝐱, using the particular regression model. 

A larger 𝑅² indicates a better fit and means that the model can better explain the variation of the output with different inputs.

The 𝑅² is calculated by dividing the sum of squares of residuals from the regression model ($SS_{RES}$) by the total sum of squares of errors from the average model ($SS_{TOT}$) and then subtracting the result from 1.

<center><img src='https://i.imgur.com/9NpOZ7Q.png'></center>

The value **𝑅² = 1** corresponds to **$SS_{RES} = 0$**. That’s the perfect fit, since the values of predicted and actual responses fit completely to each other.
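As a small worked example, the sketch below computes 𝑅² as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data values are invented, and the function name `r_squared` is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))  # regression model
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)              # average model
    return 1.0 - ss_res / ss_tot

# Hypothetical actual values and predictions
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.9, 4.9]

print(f"R^2 of the predictions:    {r_squared(y_true, y_pred):.3f}")
print(f"R^2 of a perfect fit:      {r_squared(y_true, y_true):.3f}")
print(f"R^2 of the average model:  {r_squared(y_true, [3.0] * 5):.3f}")
```

Predicting the mean for every observation gives 𝑅² = 0, which is why the average model serves as the baseline.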


**Drawbacks of using R Squared :**

👉 Every time we add an independent (predictor/explanatory) variable Xi to a regression model, R2 increases or stays the same, even if the variable is insignificant for our regression model.

👉 R2 assumes that every independent variable in the model helps to explain variations in the dependent variable. In fact, some independent variables don’t help to explain the dependent variable. In simple words, some variables don’t contribute to predicting the dependent variable.

👉 So, if we add new features to the data (which may or may not be useful), the R2 value for the model would either increase or remain the same but it would never decrease.

So, to overcome these problems, we have adjusted-R2, which is a slightly modified version of R2.
Let’s look at what Adjusted-R2 is.

👉 Similar to R2,  Adjusted-R2 measures the proportion of variations explained by only those independent variables that really help in explaining the dependent variable.

👉 Unlike R2, the Adjusted-R2 punishes for adding such independent variables that don’t help in predicting the dependent variable (target).

Let us mathematically understand how this feature is accommodated in Adjusted-R2. Here is the formula for adjusted R2

<center><img src='https://i.imgur.com/vttpfEy.png'></center>
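The penalty can be seen numerically. The sketch below assumes the commonly used form of the formula, adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where `n` is the number of observations and `p` the number of predictors. These symbol names and the example numbers are assumptions for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (requires n > p + 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical scenario: five extra predictors raise R^2 only slightly
small_model = adjusted_r2(r2=0.900, n=50, p=5)
large_model = adjusted_r2(r2=0.905, n=50, p=10)

print(f"5 predictors,  R^2 = 0.900 -> adjusted R^2 = {small_model:.4f}")
print(f"10 predictors, R^2 = 0.905 -> adjusted R^2 = {large_model:.4f}")
```

Although R² went up, the adjusted value went down, which signals that the added variables did not contribute enough to justify the lost degrees of freedom.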


**R2 vs Adjusted-R2**

👉 Adjusted-R2 is an improved version of R2.

👉 Adjusted-R2 includes the independent variable in the model on merit.

👉 Adjusted-R2 < R2

👉 R2 includes extraneous variations whereas adjusted-R2 includes pure variations.

👉 The difference between R2 and adjusted-R2 is only the degrees of freedom.


### Loss Functions and Cost Functions

#### Loss Functions
Loss functions play an important role in any statistical model. They define an objective against which the performance of the model is evaluated, and the parameters learned by the model are determined by minimizing a chosen loss function.

**Loss function is usually a function defined on a data point, prediction and label, and measures the penalty**. For example:

* **square loss $l(f(x_i|θ),y_i)=(f(x_i|θ)−y_i)^2$** , used in linear regression

* **hinge loss $l(f(x_i|θ),y_i)=max(0,1−f(x_i|θ)y_i)$** , used in SVM

* **0/1 loss $l(f(x_i|θ),y_i)=1⟺f(x_i|θ)≠y_i$** , used in theoretical analysis and definition of accuracy
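The three losses above translate directly into code. In this sketch the function names are assumptions, and the hinge loss assumes labels encoded as -1 or +1.

```python
def square_loss(prediction, y):
    """Squared difference between prediction and label."""
    return (prediction - y) ** 2

def hinge_loss(score, y):
    """Hinge loss for a label y in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - score * y)

def zero_one_loss(prediction, y):
    """1 if the prediction differs from the label, otherwise 0."""
    return 0 if prediction == y else 1

print(square_loss(2.5, 3.0))    # penalty grows with the squared distance
print(hinge_loss(0.3, 1))       # correct side, but inside the margin
print(hinge_loss(2.0, 1))       # correct side, outside the margin
print(hinge_loss(2.0, -1))      # wrong side
print(zero_one_loss(1, -1))     # misclassified
```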

#### Cost Functions

The cost function measures the difference, or error, between **actual y** and **predicted y** for the model's current parameters. This provides feedback to the model so that it can adjust the parameters to reduce the error and move toward a local or global minimum. 

The optimizer iterates, moving along the direction of steepest descent (the negative gradient), until the cost function is close to its minimum (ideally close to zero). At this point, the model stops learning. 

Additionally, while the terms, **cost function and loss function, are considered synonymous**, there is a slight difference between them. It’s worth noting that **a loss function refers to the error of one training example**, while **a cost function calculates the average error across an entire training set.**

Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). For example: 

* **MSE** also known as **L2 loss**, 

* **MAE**, also known as **L1 loss**

## Gradient Descent

**Gradient descent is an optimization algorithm which is commonly used to train machine learning models**.  

Training data helps these models learn over time, and the cost function within gradient descent specifically acts as a barometer, gauging its accuracy with each iteration of parameter updates. Until the function is close to or equal to zero, the model will continue to adjust its parameters to yield the smallest possible error. Once machine learning models are optimized for accuracy, they can be powerful tools for artificial intelligence (AI) and computer science applications. 

#### How does gradient descent work?

Before we dive into gradient descent, it may help to review some concepts from linear regression. You may recall the following equation of a line, $y = mx + c$, where m represents the slope and c is the intercept on the y-axis.

You may also recall plotting a scatterplot in statistics and finding the line of best fit, which required calculating the error between the actual output and the predicted output $(\hat{y})$ using the mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on a convex function, such as the one below:

<center><img src='https://imgur.com/R2hNk7x.jpg'></center>

The starting point is just an arbitrary point for us to evaluate the performance. From that starting point, we will find the derivative (or slope), and from there, we can use a tangent line to observe the steepness of the slope. The slope will inform the updates to the parameters—i.e. the weights and bias. The slope at the starting point will be steeper, but as new parameters are generated, the steepness should gradually reduce until it reaches the lowest point on the curve, known as the point of convergence.   

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to minimize the cost function, or the error between predicted and actual y. In order to do this, it requires two pieces of information: a direction and a learning rate. These factors determine the partial derivative calculations of future iterations, allowing it to gradually arrive at the local or global minimum (i.e. point of convergence).

Gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be:

* differentiable
* convex

First, what does it mean that it has to be **differentiable**? If a function is differentiable, it has a derivative at each point in its domain. Not all functions meet this criterion.

The next requirement is that the function has to be **convex**. For a univariate function, this means that the line segment connecting any two points on the function’s curve lies on or above the curve (it does not cross it). If the segment does cross the curve, the function may have a local minimum that is not a global one.

A way to check mathematically whether a univariate function is convex is to calculate the second derivative and check that its value is never negative (a value that is always greater than 0 means the function is strictly convex):

  $\frac{d^2 f(x)}{dx^2} \geq 0$

![](https://i.imgur.com/hw2LD44.jpg)
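The second-derivative check can be approximated numerically. The sketch below estimates the second derivative with a central finite difference on a grid of points. The two example functions, the interval, the step size, and the tolerance are assumptions, and a grid check like this is only a numerical hint, not a proof of convexity.

```python
def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

def looks_convex(f, lo, hi, steps=200, tol=1e-6):
    """True if the estimated f'' is non-negative at every grid point in [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(second_derivative(f, x) >= -tol for x in xs)

def convex_f(x):
    return x ** 2 - x + 3          # second derivative is 2 everywhere

def non_convex_f(x):
    return x ** 4 - 2 * x ** 2     # second derivative 12x^2 - 4 is negative near 0

print("x^2 - x + 3 looks convex:", looks_convex(convex_f, -3, 3))
print("x^4 - 2x^2 looks convex: ", looks_convex(non_convex_f, -3, 3))
```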


#### Learning rate
**Learning rate** (also referred to as step size or alpha) is the size of the steps that are taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum. Conversely, a low learning rate takes small steps. While this has the advantage of more precision, the larger number of iterations compromises overall efficiency, as it takes more time and computation to reach the minimum.

<center> <img src='https://i.imgur.com/YceEMHR.jpg'> </center>
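The effect of the learning rate can be tried on a toy problem. The sketch below runs gradient descent on the convex function f(w) = w², whose gradient is 2w and whose minimum is at w = 0. The function, the starting point, the number of steps, and the three learning rates are assumptions chosen for illustration.

```python
def descend(learning_rate, w=5.0, steps=50):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient   # step against the gradient
    return w

finals = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in finals.items():
    print(f"learning rate {lr}: w after 50 steps = {w:.4g}")
```

With the smallest rate the iterate is still far from 0 after 50 steps, the middle rate gets very close, and the largest rate overshoots on every step so that w moves further away from the minimum.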


The formula for gradient descent used is as follows:


Linear Equation : $y_i = mx_i + c$

where m= slope and c = intercept

$y_{i pred}$ = predicted value of y 

<center><img src='https://i.imgur.com/CrcTFzZ.png'></center>









**Step by Step Algorithm:**

1. Let **m = 0** and **c = 0**. Let **L** be our learning rate. It could be a small value like **0.01** for good accuracy.

2. Calculate the **partial derivative of the Cost function with respect to m**. Let the partial derivative of the Cost function with respect to m be $D_m$ (how much the Cost function changes with a small change in m).

<center><img src='https://i.imgur.com/CFcfmNw.png'></center>

Similarly, let’s find the partial derivative with respect to c. Let the partial derivative of the Cost function with respect to c be $D_c$ (how much the Cost function changes with a small change in c).

<center><img src='https://i.imgur.com/CN2xmCo.png'></center>

 

3. Now update the current values of m and c using the following equation:

 <center><img src='https://i.imgur.com/wfIO9F6.png'></center>

4. **We will repeat this process until our Cost function is very small (ideally 0).**

The gradient descent algorithm gives the optimum values of m and c for the linear regression equation. With these values of m and c, we get the equation of the best-fit line and are ready to make predictions.
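The steps above can be sketched in plain Python. This version assumes the cost function is the mean squared error, so that $D_m = -\frac{2}{n}\sum x_i (y_i - y_{i pred})$ and $D_c = -\frac{2}{n}\sum (y_i - y_{i pred})$. The data set, the number of epochs, and the function name are assumptions for illustration.

```python
def gradient_descent(x, y, L=0.01, epochs=5000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                        # step 1: start from m = 0, c = 0
    n = len(x)
    for _ in range(epochs):
        y_pred = [m * xi + c for xi in x]
        # step 2: partial derivatives of the cost with respect to m and c
        D_m = (-2 / n) * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, y_pred))
        D_c = (-2 / n) * sum(yi - pi for yi, pi in zip(y, y_pred))
        # step 3: move m and c against the gradient
        m = m - L * D_m
        c = c - L * D_c
    return m, c                            # step 4: here a fixed number of repeats

# Hypothetical noise-free data generated from y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

m, c = gradient_descent(x, y)
print(f"m = {m:.4f}, c = {c:.4f}")
```

A fixed epoch count is used here for simplicity. Stopping once the cost or the size of the updates falls below a threshold is a common alternative.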



### Types of gradient Descent:

1. Batch Gradient Descent: This type of gradient descent processes all the training examples in each iteration. If the number of training examples is large, batch gradient descent is computationally very expensive and is therefore not preferred. Instead, we prefer stochastic gradient descent or mini-batch gradient descent.

2. Stochastic Gradient Descent: This type of gradient descent processes one training example per iteration. The parameters are therefore updated after every single example, which makes each iteration much faster than in batch gradient descent. However, when the number of training examples is large, processing only one example at a time can add overhead for the system, because the number of iterations becomes very large.

3. Mini Batch gradient descent: This type of gradient descent often works faster than both batch gradient descent and stochastic gradient descent. Here b examples are processed per iteration, where `b<m` and m is the number of training examples (not the slope used earlier). So even if the number of training examples is large, they are processed in batches of b training examples at a time. Thus, it works for larger training sets, and with fewer iterations than stochastic gradient descent.
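The three variants differ only in how many examples feed each parameter update, so one function with a `batch_size` argument can sketch all of them. The data set, learning rate, epoch count, and names below are assumptions for illustration, and the cost is assumed to be the mean squared error.

```python
import random

def gradient_descent(x, y, batch_size, L=0.01, epochs=2000, seed=0):
    """Fit y = m*x + c, updating the parameters once per batch of examples."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new example order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [y[i] - (m * x[i] + c) for i in batch]
            D_m = (-2 / len(batch)) * sum(x[i] * e for i, e in zip(batch, errors))
            D_c = (-2 / len(batch)) * sum(errors)
            m -= L * D_m
            c -= L * D_c
    return m, c

# Hypothetical noise-free data generated from y = 2x + 1
x = [i / 10 for i in range(1, 51)]
y = [2 * xi + 1 for xi in x]

n = len(x)
results = {
    "batch": gradient_descent(x, y, batch_size=n),        # all examples per update
    "mini-batch": gradient_descent(x, y, batch_size=10),  # b = 10 examples per update
    "stochastic": gradient_descent(x, y, batch_size=1),   # one example per update
}
for name, (m, c) in results.items():
    print(f"{name:>10}: m = {m:.4f}, c = {c:.4f}")
```

For the same number of passes over the data, smaller batches perform more parameter updates, which is one reason the mini-batch and stochastic variants are attractive on large data sets.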

### Gradient Descent Play Ground

Use the following link to explore gradient descent visually:

https://developers.google.com/machine-learning/crash-course/fitter/graph