# Linear Regression
<li>Linear regression is a statistical method that fits a straight line to describe the mathematical relationship between two variables.</li>
<li>Linear regression analysis is used to predict the value of a variable based on the value of another variable.</li>
<li>The variable you want to predict is called the dependent variable.</li>
<li>The variable you are using to predict the other variable's value is called the independent variable.</li>
<li>Linear regression comes in two forms:</li>

<ol>
    <li>Simple Linear Regression</li>
    <li>Multiple Linear Regression</li>
</ol>

## 1. Simple Linear Regression
<li>Simple linear regression is a regression model that estimates the relationship between one independent and one dependent variable using a straight line.</li>
<li>Both variables should be quantitative.</li>

<li>The following equation is the general form of the simple linear regression model:</li>

$$\hat{y} = B_0 + B_1 x_1$$

<li>Here, $\hat{y}$ represents the predicted value and $x_1$ represents the feature column we choose to use in our model.</li>
<li>This general form is the same for every dataset.</li>
<li>On the other hand, B0 and B1 are parameter values that are specific to the dataset.</li>
<li>The goal of simple linear regression is to find the optimal B0 and B1 values that best describe the relationship between the feature and the target column.</li>


<li>The following diagram shows different simple linear regression models depending on the data:</li>

![](images/regression_figure.png)

<li>The first step is to select the feature x1 that we want to use in our model.</li>
<li>Once we select this feature, we can use scikit-learn to determine the optimal parameter values B0 and B1 based on the training data.</li>
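As a minimal sketch of this step, the snippet below fits a simple linear regression with scikit-learn's `LinearRegression`. The training data is synthetic and purely an assumption for illustration (generated from B0 = 2 and B1 = 3 plus noise); with real data, `x1` and `y` would come from the chosen feature and target columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (an assumption for illustration): y = 2 + 3 * x1 plus noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x1 + rng.normal(0, 1.0, size=200)

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
X = x1.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

b0 = model.intercept_   # estimate of B0
b1 = model.coef_[0]     # estimate of B1
y_hat = model.predict(X)  # predicted values for the training data

print(f"B0 = {b0:.3f}, B1 = {b1:.3f}")
```

The fitted values should land close to the B0 and B1 used to generate the data, though not exactly on them because of the added noise.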


## 2. Multiple Linear Regression

<li>A multiple linear regression model allows us to capture the relationship between multiple feature columns and the target column.</li>
<li>Here's what the formula looks like:</li>

$$\hat{y} = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n$$

<li>Where $\hat{y}$ represents the predicted value and $x_1, x_2, \dots, x_n$ represent the n feature columns.</li>
<li>B0, B1, B2, ..., Bn represent the n + 1 parameter values that are specific to the dataset.</li>
<li>The goal here is to find the optimal values of B0, B1, B2, ..., Bn such that the model best represents the relationship between the features and the target column.</li>

![](images/multiple_linear_regression.png)

<li>The parameter values can be estimated using the following equations:</li>

![](images/mle_eqn.png)
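The equations in the figure are not reproduced here. As an assumption, the sketch below uses ordinary least squares, a standard way to estimate these parameters, and solves it with `numpy.linalg.lstsq`. The two features and the coefficients in `true_beta` are hypothetical values used only to simulate data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two hypothetical feature columns x1 and x2.
X = rng.normal(size=(n, 2))

# B0, B1, B2 used only to simulate the target column.
true_beta = np.array([1.0, 2.0, -0.5])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, size=n)

# Prepend a column of ones so the first coefficient plays the role of B0.
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution of X_design @ beta ≈ y.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated B0, B1, B2:", beta_hat)
```

Using `lstsq` rather than explicitly inverting a matrix is a design choice here: it tends to behave better numerically when features are strongly correlated.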

## Cost/Loss Function For Linear Regression

<li>A cost function measures the performance of a machine learning model on a data set.</li>
<li>It quantifies the error between predicted and expected values and presents that error in the form of a single real number.</li>
<li>Depending on the problem, the cost function can be formed in many different ways.</li>
<li>The purpose of a cost function is to be either minimized or maximized. For linear regression, it is an error measure, so it is minimized.</li>
<li>For algorithms relying on gradient descent to optimize model parameters, the cost function has to be differentiable.</li>

![](images/cost_function.png)
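To make the idea concrete, here is a small sketch that assumes mean squared error (MSE) as the cost function, a common choice for linear regression; the figure above may use a slightly different scaling. The function name `mse_cost` and the toy data are hypothetical.

```python
import numpy as np

def mse_cost(b0, b1, x, y):
    """Mean squared error of the line y_hat = b0 + b1 * x on the data (x, y)."""
    y_hat = b0 + b1 * x
    return np.mean((y - y_hat) ** 2)

# Toy data that lies exactly on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

cost_perfect = mse_cost(1.0, 2.0, x, y)  # the true line: every error is 0
cost_worse = mse_cost(0.0, 2.0, x, y)    # intercept off by 1: every error is 1

print(cost_perfect, cost_worse)  # 0.0 1.0
```

The single number returned makes it possible to compare candidate parameter values: the lower the cost, the better the line fits the data.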

### Optimization (Using Gradient Descent)
<li>Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Here, that function is our loss function.</li>

![](images/gradient_descent.jpg)


#### Steps of Gradient Descent

![](images/gradient_descent_steps.png)
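The loop below is a minimal sketch of batch gradient descent for simple linear regression, assuming a mean squared error loss. The learning rate, iteration count, and function name are illustrative assumptions, not values taken from the figure.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iters=5000):
    """Estimate b0 and b1 by repeatedly stepping against the gradient of the MSE loss."""
    b0, b1 = 0.0, 0.0  # start from an arbitrary initial guess
    n = len(x)
    for _ in range(n_iters):
        error = (b0 + b1 * x) - y                 # prediction minus target
        grad_b0 = (2.0 / n) * np.sum(error)       # d(MSE)/d(b0)
        grad_b1 = (2.0 / n) * np.sum(error * x)   # d(MSE)/d(b1)
        b0 -= learning_rate * grad_b0             # move downhill
        b1 -= learning_rate * grad_b1
    return b0, b1

# Toy data that lies exactly on the line y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

b0, b1 = gradient_descent(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

If the learning rate is too large, the updates overshoot and the loss grows instead of shrinking; if it is too small, many more iterations are needed to reach the minimum.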

## Assumptions For Linear Regression
<ol>
<li><b>Linearity:</b> The relationship between the dependent variable and the independent variable(s) is linear.</li>
<li><b>Independence:</b> The observations are independent of each other.</li>
<li><b>Homoscedasticity:</b> The variance of the errors is constant across all levels of the independent variable(s).</li>
<li><b>Normality:</b> The errors follow a normal distribution.</li>
<li><b>No multicollinearity:</b> The independent variables are not highly correlated with each other.</li>
</ol>

### 1. Linearity
<li>Linearity means that there should be a linear relationship between the independent variable(s) and the dependent variable.</li>
<li>In other words, the change in the dependent variable should be proportional to the change in the independent variable(s), with a constant slope and intercept.</li>

For example, let's say you want to use linear regression to model the relationship between a person's height and their weight. If the relationship between height and weight is not linear, this would violate the assumption of linearity. In this case, the model may not be able to accurately capture the complex and nonlinear effects of height on weight, and may yield biased and inefficient estimates of the regression coefficients.

Another example would be if you were studying the relationship between a company's sales revenue and its advertising budget. If the relationship between sales revenue and advertising budget is not linear, this would violate the assumption of linearity. In this case, the model may not be able to capture the diminishing or increasing returns to scale of the advertising budget on sales revenue, and may yield unreliable and inaccurate predictions of the sales revenue for different levels of advertising budget.

<li>Violation of the linearity assumption can lead to biased and inefficient estimates of the regression coefficients, and can affect the validity of the model.</li> 
<li>Therefore, it is important to check for linearity when using linear regression, for example by plotting the dependent variable against each independent variable and examining the scatter plot or trend line.</li>
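As a numeric stand-in for eyeballing a plot, the sketch below fits a straight line and averages the residuals within three bins of the independent variable. This is an informal check under assumed synthetic data, not a formal test; the helper `binned_residual_means` is hypothetical. Bin averages near zero are consistent with linearity, while a systematic pattern (for example positive, negative, positive) suggests curvature.

```python
import numpy as np

def binned_residual_means(x, y, n_bins=3):
    """Fit a straight line, then average the residuals within equal-count bins of x."""
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    residuals = y - (intercept + slope * x)
    order = np.argsort(x)                        # sort residuals by x
    chunks = np.array_split(residuals[order], n_bins)
    return np.array([chunk.mean() for chunk in chunks])

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y_linear = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=300)      # truly linear
y_curved = 1.0 + 0.5 * x**2 + rng.normal(0, 1.0, size=300)   # quadratic

linear_means = binned_residual_means(x, y_linear)
curved_means = binned_residual_means(x, y_curved)

print("Linear data:", np.round(linear_means, 2))
print("Curved data:", np.round(curved_means, 2))
```

For the curved data, the straight line underpredicts at both ends and overpredicts in the middle, which is the same U-shaped pattern a residual plot would show.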


### 2. Independence
<li>Independence is one of the assumptions of linear regression, which means that the observations should be independent of each other. </li>
<li>This means that the value of the dependent variable for one observation should not be related to the value of the dependent variable for any other observation.</li>

For example, let's say you want to use linear regression to model the relationship between a person's weight and their height. If you collect data from a group of identical twins, the weight and height of one twin would be highly correlated with the weight and height of their sibling, violating the assumption of independence. In this case, you would need to collect data from non-related individuals to ensure independence.

Another example would be if you were studying the impact of a new medication on blood pressure. If you measured the blood pressure of the same person before and after taking the medication, the observations would not be independent because the values of blood pressure before and after the medication are related to each other for that person. In this case, you would need to collect data from different people who have a similar health condition and administer the medication to some of them, while the rest receive a placebo.

<li>Independence of the residuals can be checked with the Durbin-Watson test, whose statistic ranges from 0 to 4.</li>
<li>If the Durbin-Watson test statistic is close to 2 (e.g., between 1.5 and 2.5), this suggests that the residuals are not autocorrelated.</li>
<li>If the test statistic is well below 2 or well above 2, this suggests the presence of positive or negative autocorrelation, respectively.</li>
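The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it directly with NumPy on two synthetic residual series (both are assumptions for illustration). The `durbin_watson` function from `statsmodels.stats.stattools`, listed among this notebook's imports, computes the same quantity.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(3)

# Independent residuals: no relationship between consecutive values.
independent = rng.normal(size=1000)

# Positively autocorrelated residuals: each value carries over 0.8 of the previous one.
noise = rng.normal(size=1000)
autocorrelated = np.zeros(1000)
for t in range(1, 1000):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + noise[t]

dw_independent = durbin_watson_stat(independent)
dw_autocorrelated = durbin_watson_stat(autocorrelated)

print(f"Independent residuals:    {dw_independent:.2f}")
print(f"Autocorrelated residuals: {dw_autocorrelated:.2f}")
```

The first value should fall near 2, while the second should fall well below 2, matching the interpretation given above for positive autocorrelation.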


### 3. Homoscedasticity
<li>Homoscedasticity is one of the assumptions of linear regression, which means that the variance of the errors is constant across all levels of the independent variable(s).</li>
<li>In other words, the spread of the residuals should be similar across the range of the independent variable(s).</li>

For example, let's say you want to use linear regression to model the relationship between a student's study time and their exam scores. If the variance of the errors increases or decreases as the study time increases, this would violate the assumption of homoscedasticity. In this case, the model may overemphasize the effect of the study time on the exam scores for some values of study time, while underemphasizing it for others.

Another example would be if you were studying the relationship between a car's speed and its fuel efficiency. If the variance of the errors increases or decreases as the speed increases, this would violate the assumption of homoscedasticity. In this case, the model may overemphasize the effect of the speed on fuel efficiency for some speeds, while underemphasizing it for others.

Violation of the homoscedasticity assumption leaves the coefficient estimates unbiased but makes them inefficient, and it biases their standard errors, which affects the validity of the statistical inferences (confidence intervals and p-values) based on the model. Therefore, it is important to check for homoscedasticity when using linear regression.
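One informal way to check this is to compare the spread of the residuals at low and high values of the independent variable. The sketch below does that with a simple variance ratio on synthetic study-time-style data; the helper `residual_variance_ratio` and the data are assumptions for illustration, and this is a rough diagnostic rather than a formal test.

```python
import numpy as np

def residual_variance_ratio(x, y):
    """Fit a line, then return var(residuals in upper half of x) / var(residuals in lower half)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    order = np.argsort(x)
    lower, upper = np.array_split(residuals[order], 2)
    return upper.var() / lower.var()

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=400)

# Constant noise level: homoscedastic.
y_constant = 5.0 + 2.0 * x + rng.normal(0, 1.0, size=400)

# Noise level grows with x: heteroscedastic.
y_growing = 5.0 + 2.0 * x + rng.normal(0, 1.0, size=400) * x

ratio_constant = residual_variance_ratio(x, y_constant)
ratio_growing = residual_variance_ratio(x, y_growing)

print(f"Constant noise: ratio = {ratio_constant:.2f}")
print(f"Growing noise:  ratio = {ratio_growing:.2f}")
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests that the spread of the residuals changes across the range of the independent variable.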

### 4. Normality
<li>Normality is one of the assumptions of linear regression, which means that the errors (residuals) should follow a normal distribution.</li>
<li>In other words, the distribution of the residuals should be symmetrical and bell-shaped around zero.</li>

For example, let's say you want to use linear regression to model the relationship between a person's age and their cholesterol level. If the distribution of the residuals is skewed or has outliers, this would violate the assumption of normality. In this case, the model may be overestimating or underestimating the effect of age on cholesterol level, depending on the direction and magnitude of the skewness or outliers.

Another example would be if you were studying the relationship between a company's advertising budget and its sales revenue. If the distribution of the residuals is not normal, this would violate the assumption of normality. In this case, the non-normal residuals may be a sign that the model is missing nonlinearities or interactions between the variables, and the estimated confidence intervals and p-values may be inaccurate.

<li>Violation of the normality assumption does not by itself bias the least-squares coefficient estimates, but it can affect the validity of the statistical inferences based on the model (confidence intervals and p-values), especially in small samples.</li>
<li>Therefore, it is important to check for normality when using linear regression.</li>
<li>For example, we can check normality by examining the histogram, Q-Q plot, or normal probability plot of the residuals.</li>
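A Q-Q plot compares the sorted residuals with the quantiles expected under a normal distribution; if the residuals are normal, the points fall close to a straight line. The sketch below summarizes that idea as a single number, the correlation between the two sets of quantiles. The helper `qq_correlation` and both residual samples are assumptions for illustration, and the plotting positions `(i - 0.5) / n` are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted residuals and the matching standard normal quantiles."""
    sorted_resid = np.sort(np.asarray(residuals, dtype=float))
    n = len(sorted_resid)
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions strictly inside (0, 1)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, sorted_resid)[0, 1]

rng = np.random.default_rng(5)
normal_resid = rng.normal(0, 1, size=500)            # bell-shaped residuals
skewed_resid = rng.exponential(1.0, size=500) - 1.0  # right-skewed residuals centered near 0

qq_normal = qq_correlation(normal_resid)
qq_skewed = qq_correlation(skewed_resid)

print(f"Normal residuals: {qq_normal:.4f}")
print(f"Skewed residuals: {qq_skewed:.4f}")
```

A value very close to 1 corresponds to a nearly straight Q-Q plot, while a noticeably lower value corresponds to the curved pattern that skewed residuals produce.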

### 5. No multicollinearity

<li>No multicollinearity is one of the assumptions of linear regression, which means that the independent variables should not be highly correlated with each other.</li>
<li>In other words, there should be no perfect or near-perfect linear relationship between any two or more independent variables.</li>

For example, let's say you want to use linear regression to model the relationship between a student's exam scores and their study time, their attendance rate, and their participation in a review session. If study time and attendance rate are highly correlated with each other, this would violate the assumption of no multicollinearity. In this case, the model may not be able to distinguish between the effects of study time and attendance rate on the exam scores, and the estimated regression coefficients and their standard errors may be unstable or even impossible to calculate.

Another example would be if you were studying the relationship between a car's fuel efficiency and its engine size, weight, and horsepower. If engine size and horsepower are highly correlated with each other, this would violate the assumption of no multicollinearity. In this case, the model may not be able to separate the effects of engine size and horsepower on fuel efficiency, and the estimated regression coefficients and their standard errors may be unreliable or even misleading.

Violation of the no multicollinearity assumption can lead to unstable and inaccurate estimates of the regression coefficients, and can affect the interpretation and prediction of the model. Therefore, it is important to check for multicollinearity when using linear regression.

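<li>Multicollinearity can be checked with the variance inflation factor (VIF), which measures how much the variance of a coefficient estimate is inflated because its feature is correlated with the other features.</li>
<li>If the VIF values are greater than 5 or 10, this indicates problematic levels of multicollinearity, and you may need to consider removing one of the correlated features, transforming the data, or using a different model that is robust to multicollinearity.</li>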
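The VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy on synthetic car data loosely modeled on the engine size, horsepower, and weight example above; the helper `vif` and all numbers are assumptions for illustration. The `variance_inflation_factor` function listed among this notebook's imports offers a library version.

```python
import numpy as np

def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(6)
engine_size = rng.normal(2.0, 0.5, size=500)
horsepower = 80.0 * engine_size + rng.normal(0, 5.0, size=500)  # nearly determined by engine size
weight = rng.normal(1500.0, 200.0, size=500)                    # unrelated to the other two

vifs = vif(np.column_stack([engine_size, horsepower, weight]))
print(dict(zip(["engine_size", "horsepower", "weight"], np.round(vifs, 1))))
```

Engine size and horsepower should both show VIF values far above 10, while weight should stay close to 1, which flags the first two as the correlated pair.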


## Performance Metrics In Linear Regression

<li>Mean Absolute Error</li>
<li>Mean Squared Error</li>
<li>Root Mean Squared Error</li>
<li>R²</li>
<li>Adjusted R²</li>
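The sketch below computes these five metrics with NumPy using their standard definitions. The function name `regression_metrics` and the small arrays are hypothetical; `n_features` is the number of independent variables in the model, which the adjusted R² needs in order to penalize extra features.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R^2, and adjusted R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # average absolute error
    mse = np.mean(errors ** 2)             # average squared error
    rmse = np.sqrt(mse)                    # same units as the target
    ss_res = np.sum(errors ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0, 11.0])

metrics = regression_metrics(y_true, y_pred, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this toy data, two of the five predictions are off by 1, so MAE and MSE are both 0.4, R² is 0.95, and the adjusted R² is slightly lower at about 0.9333.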

```python
# Numerical and data-handling libraries
import numpy as np
import pandas as pd

# Plotting and statistics
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Modeling and assumption checks
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
```