# <p style="padding:10px;background-color:#85BB65;margin:0;color:white;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Assumptions Of Linear Regression</p>

When building a linear regression model, we need to keep some assumptions in mind so that the regression line fits the data well.

Linear regression is a supervised machine-learning algorithm in which one or more independent (predictor) variables explain the dependent (response) variable. Linear regression has six assumptions.


#### **<mark style="background-color:#85BB65;color:white;border-radius:5px;opacity:1.0">Assumptions of linear regression</mark>**


* 1- Linearity
* 2- No multicollinearity
* 3- Mean of residuals is zero
* 4- Normality of residuals
* 5- Error terms should be independent of each other
* 6- Homoscedasticity (no heteroscedasticity)


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error ,r2_score ,mean_squared_error

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv(r"/kaggle/input/advertising-dataset/Advertising.csv").set_index("Unnamed: 0")
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
def two_plots_num_column(feature):
    
    print(f"the skewness value of {feature} column = {df[feature].skew():.2f}")
    plt.figure(figsize=(10,4))
    
    plt.subplot(1,2,1)
    plt.title('Histogram')
    sns.histplot(data=df, x=feature, kde=True)
    plt.axvline(x = df[feature].mean(), c = 'red')
    plt.axvline(x = df[feature].median(), c = 'green')

    plt.subplot(1,2,2)
    plt.title('Boxplot')
    sns.boxplot(y=df[feature])
    plt.show()


In [None]:
two_plots_num_column("Newspaper")

* The `Newspaper` column is right-skewed.

In [None]:
q1, q3 = df['Newspaper'].quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

df.loc[(df["Newspaper"] < lower_bound) | (df["Newspaper"] > upper_bound), "Newspaper"] = np.nan
# assign the result back instead of using inplace=True on a column selection
df["Newspaper"] = df["Newspaper"].fillna(df["Newspaper"].mean())

In [None]:
two_plots_num_column('Sales')

# <b>I <span style='color:#85BB65'>|</span> Linearity</b> 

<br> 

#### **<mark style="background-color:#85BB65;color:white;border-radius:5px;opacity:1.0">Note that</mark>**

* The relationship between X and the mean of Y should be linear. If it is not, we may use polynomial regression or other machine-learning techniques.

* Linear regression needs the relationship between the independent and dependent variables to be linear. 

<br>

<div style="border-radius:10px;border:#85BB65 solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
If the independent and dependent variables are not linearly related and we still try to fit a straight line, the model will fit poorly and give less accurate predictions.
</div>

Let's use a pair plot to check the relationship between each independent variable and the `Sales` variable.

In [None]:
# `height` replaces the old `size` argument of pairplot
sns.pairplot(df, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=5, aspect=0.7);

### <b><span style='color:#85BB65'>|</span> Observations </b> 


* The `TV` feature seems to have a linear relationship with `Sales`.
* The `Radio` feature doesn't form a clearly linear shape with the `Sales` variable, but it still does better than `Newspaper`, which hardly shows any specific shape.

These plots suggest that a plain linear regression might not be the best model for this data.
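
When a relationship looks curved, polynomial regression is one alternative mentioned earlier. The following is a minimal sketch of that idea, assuming synthetic data with a square-root shape (not the Advertising dataset); the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, illustrative data (not the Advertising dataset):
# the response grows with the square root of the feature, so it is curved.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200).reshape(-1, 1)
y_curved = 2.0 * np.sqrt(x.ravel()) + rng.normal(0, 1.0, size=200)

# Straight-line fit
linear_fit = LinearRegression().fit(x, y_curved)
r2_linear = r2_score(y_curved, linear_fit.predict(x))

# Degree-2 polynomial fit: still a linear regression, but on the columns [x, x^2]
poly_fit = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y_curved)
r2_poly = r2_score(y_curved, poly_fit.predict(x))

print(f"R squared, straight line:       {r2_linear:.3f}")
print(f"R squared, degree-2 polynomial: {r2_poly:.3f}")
```

On training data the polynomial fit can never score lower than the straight line, so a comparison on held-out data is a fairer way to decide whether the extra term is worth keeping.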

# <b>II <span style='color:#85BB65'>|</span> Multicollinearity</b> 

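Multicollinearity occurs when independent variables are strongly correlated with each other. Linear regression assumes that the independent variables are not highly correlated with one another.

To check for this, we can inspect pairwise plots of the variables and avoid using highly correlated variables together.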
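
A numeric correlation matrix can complement the visual check. The sketch below assumes three synthetic features with hypothetical names (`feat_a`, `feat_b`, `feat_c`) and an arbitrary cut-off of 0.8; neither comes from the Advertising data.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative features (hypothetical names, not the Advertising data)
rng = np.random.default_rng(1)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + rng.normal(scale=0.1, size=300),  # nearly a copy of feat_a
    "feat_c": rng.normal(size=300),                 # unrelated to the others
})

corr = demo.corr().abs()

# Keep each pair once (upper triangle) and list the pairs above the cut-off
threshold = 0.8  # arbitrary cut-off chosen for this sketch
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(row, col) for row in upper.index for col in upper.columns
              if upper.loc[row, col] > threshold]

print(corr.round(2))
print("highly correlated pairs:", high_pairs)
```

The same pattern could be applied to the notebook's features by replacing `demo` with `df.drop(columns='Sales')`.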

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df)
plt.show()

* Sometimes an independent variable is correlated with a combination of two or more other independent variables, which is hard to identify from a pairwise plot. In that case, you can check the VIF (Variance Inflation Factor).

### **<span style='color:#85BB65'>What's the Variance Inflation Factor (VIF)? </span>** 

VIF ranges from 1 to infinity. A value of 1 indicates no multicollinearity, and the higher the VIF, the stronger the multicollinearity.

* VIF between 1 and 5 indicates moderate multicollinearity.
* VIF between 5 and 10 indicates a high level of multicollinearity.
* VIF above 10 indicates a very high level of multicollinearity.
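
The VIF of a feature is commonly defined as 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand under that definition, assuming synthetic features (not the Advertising data); `vif_from_r2` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_from_r2(features, j):
    """VIF of column j: regress it on the other columns and use 1 / (1 - R^2)."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    r2 = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r2)

# Synthetic, illustrative features (not the Advertising data)
rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)   # strongly related to x1
x3 = rng.normal(size=500)                   # unrelated to x1 and x2
features = np.column_stack([x1, x2, x3])

vifs = [vif_from_r2(features, j) for j in range(features.shape[1])]
print([round(v, 2) for v in vifs])
```

The two related columns should land in the "very high" band, while the unrelated one stays close to 1.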

In [None]:
# check for multicollinearity

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = df.drop(columns='Sales')
# variance_inflation_factor expects a design matrix that includes an intercept column
exog = sm.add_constant(features)

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = features.columns

# calculating VIF for each feature (index 0 is the constant, so start at 1)
vif_data["VIF"] = [variance_inflation_factor(exog.values, i)
                   for i in range(1, exog.shape[1])]

vif_data

### <b><span style='color:#85BB65'>|</span> Observations </b> 

* All features have a VIF below 5, which is acceptable.

Now let's build the model and check the other assumptions.

In [None]:
X = df.drop(["Sales"],axis=1)
y = df.Sales

In [None]:
# split data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 42 ,test_size=0.25)

In [None]:
# build and fit the model

model = LinearRegression()
model.fit(X_train,y_train)

y_pred= model.predict(X_train)

In [None]:
print("R squared: {}".format(r2_score(y_true=y_train ,y_pred=y_pred)))
print(f"mae : {mean_absolute_error(y_train,y_pred)}")

# <b>III <span style='color:#85BB65'>|</span> Mean of residuals</b> 

The mean of the residuals should be equal to zero.

In [None]:
# compute the residuals (observed minus predicted) on the training data
residuals = y_train.values - y_pred

mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))

### <b><span style='color:#85BB65'>|</span> Observations </b> 

* The mean of the residuals is almost equal to zero, which is what we want.
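
A near-zero mean on the training data is expected whenever ordinary least squares is fitted with an intercept, because the intercept absorbs any constant offset. The sketch below illustrates this on synthetic data (not the Advertising dataset) by comparing a fit with and without an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data (not the Advertising dataset)
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 5.0 + X_demo @ np.array([1.5, -0.7]) + rng.normal(size=200)

with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)
without_intercept = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

mean_with = np.mean(y_demo - with_intercept.predict(X_demo))
mean_without = np.mean(y_demo - without_intercept.predict(X_demo))

print(f"mean residual with intercept:    {mean_with:.2e}")
print(f"mean residual without intercept: {mean_without:.2e}")
```

For this reason the check is more informative on held-out data, or across ranges of a predictor, than on the full training set.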

# <b>IV <span style='color:#85BB65'>|</span> Normality of residuals</b> 

It is assumed that the error term is normally distributed.

In [None]:
# Plot the histogram of the error terms

fig = plt.figure()
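# histplot replaces the deprecated distplot; kde=True overlays a density curve
sns.histplot(residuals, bins=20, kde=True, stat="density")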
fig.suptitle('Error Terms', fontsize = 20)    
plt.xlabel('Errors', fontsize = 18)
plt.show()

### <b><span style='color:#85BB65'>|</span> Observations </b> 

* The error terms are approximately left-skewed, which suggests that a linear regression fit may not be the best choice here.
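
A histogram can be backed up with numbers. The sketch below assumes synthetic residuals (not the residuals of the model above) and uses the sample skewness together with the Shapiro-Wilk test from SciPy, whose null hypothesis is that the sample comes from a normal distribution.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative residuals (not the residuals of the model above)
rng = np.random.default_rng(4)
normal_resid = rng.normal(size=200)
skewed_resid = -rng.exponential(size=200)   # long left tail

for name, resid in [("normal", normal_resid), ("left-skewed", skewed_resid)]:
    skewness = stats.skew(resid)
    stat, p_value = stats.shapiro(resid)
    print(f"{name}: skewness = {skewness:.2f}, Shapiro-Wilk p-value = {p_value:.4f}")
```

A negative skewness with a small p-value points to a left-skewed, non-normal sample. Passing the notebook's `residuals` array to the same two functions would give the corresponding numbers for the fitted model.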

# <b>V <span style='color:#85BB65'>|</span> Error terms should be independent of each other</b> 

This means that one error term should not depend on any other error term. 
The plot below shows that the error terms are randomly scattered and do not follow any pattern.

In [None]:
plt.scatter(y_pred , residuals)
plt.axhline(y=0,color="red" ,linestyle="--")
plt.show()
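
A common numeric companion to this plot is the Durbin-Watson statistic, which is close to 2 when consecutive residuals are uncorrelated and moves toward 0 under positive autocorrelation. It is most meaningful when the observations have a natural order, such as time. The sketch below assumes synthetic residuals (not the residuals of the model above).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Synthetic, illustrative residuals (not the residuals of the model above)
rng = np.random.default_rng(5)
independent = rng.normal(size=500)

# AR(1) series: each value carries over 90% of the previous one
noise = rng.normal(size=500)
autocorrelated = np.zeros(500)
for t in range(1, 500):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + noise[t]

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)
print(f"Durbin-Watson, independent residuals:    {dw_independent:.2f}")
print(f"Durbin-Watson, autocorrelated residuals: {dw_autocorrelated:.2f}")
```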

# <b>VI <span style='color:#85BB65'>|</span> Homoscedasticity And Heteroscedasticity</b>

* Homoscedasticity means that the variance of the error term is constant: it should neither increase nor decrease as the fitted values (or the predictors) change, and the residuals should not follow any pattern.

* Heteroscedasticity refers to the situation where the variance of the residuals in a regression model is not constant across different levels of the predictor variables.

![](https://media.geeksforgeeks.org/wp-content/uploads/20190425172205/hetero.jpg)

<br>

<div style="border-radius:10px;border:#85BB65 solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
The null hypothesis of the Breusch-Pagan test is that the variance of the residuals is constant (homoscedastic), while the alternative hypothesis is that the variance of the residuals is not constant (heteroscedastic).
</div>

<br>

To identify heteroscedasticity, we will use a statistical test called `Breusch-Pagan`.
This test checks whether heteroscedasticity exists or not.


In [None]:
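import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value)
name = ['lagrange multiplier stat', 'lm p_value', 'f_statistic', 'f p_value']
# the test expects the explanatory variables to include a constant column
test = sms.het_breuschpagan(residuals, sm.add_constant(X_train))
lzip(name, test)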

Hope you liked the notebook, any suggestions would be highly appreciated.

