# OLS Summary Table

This notebook explores the **OLS** summary table produced by `statsmodels`, using its formula interface (`statsmodels.formula.api`).

In [51]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import r2_score

### Importing the dataset

In [25]:
df = pd.read_csv(r"https://raw.githubusercontent.com/bhattbhavesh91/linear-regression-assumptions/master/data.csv")
df.rename(columns={
        "Feature 1" : "ft1",
        "Feature 2" : "ft2",
        "Feature 3" : "ft3"}, inplace=True)

In [26]:
df.head()

| | ft1 | ft2 | ft3 | Target |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


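### Using the `statsmodels` library

#### Significance
The `summary()` method of a fitted OLS model reports far more than the fitted coefficients. For each term it gives the standard error, t-statistic, p-value, and 95% confidence interval, and for the model as a whole it gives fit measures (R-squared, F-statistic, AIC, BIC) and residual diagnostics (Omnibus, Durbin-Watson, Jarque-Bera). This makes it a more **comprehensive** report than `sklearn`'s `LinearRegression` provides when the goal is to judge how well the model fits the data.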

In [39]:
ml1 = smf.ols("Target ~ ft1 + ft2 + ft3", data=df).fit()
ml1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Target | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Tue, 03 Sep 2024 | Prob (F-statistic): | 1.58e-96 |
| Time: | 16:13:28 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| ft1 | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| ft2 | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| ft3 | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
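
Several entries in the summary are tied together by simple formulas. As a rough check, the sketch below recomputes AIC and BIC from the log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. It plugs in the rounded values printed in the table, so agreement is only expected to about one decimal place, and it assumes that k counts the intercept as well as the slopes (`Df Model + 1`).

```python
import math

# Values copied from the first summary table (rounded as printed there).
n_obs = 200
df_model = 3
log_likelihood = -386.18

# Assumption: k counts every estimated coefficient, i.e. slopes plus intercept.
k = df_model + 1

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

print(f"AIC ~ {aic:.1f}")  # table reports 780.4
print(f"BIC ~ {bic:.1f}")  # table reports 793.6
```

Because BIC multiplies k by ln(n) instead of 2, it penalises extra parameters more heavily than AIC once n exceeds about 7 observations.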


#### Adjusting the features
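Dropping the `ft3` column: the summary of `ml1` shows that its coefficient has a **p-value of `0.860`**, well above `0.05`, and a 95% confidence interval of `[-0.013, 0.011]` that contains zero, so there is no evidence that `ft3` contributes to the model. The negative sign of the t-statistic only mirrors the sign of the coefficient; what matters is its small magnitude (`|t| = 0.177`).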
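
The decision rule can be written out explicitly. The snippet below is a minimal sketch that hard-codes the p-values and confidence bounds from the coefficient table of the first model; the names `coef_table`, `ALPHA`, and `is_insignificant` are illustrative and not part of `statsmodels`.

```python
ALPHA = 0.05  # significance level matching the 95% confidence intervals

# (p-value, CI lower bound, CI upper bound), copied from the first summary table.
coef_table = {
    "Intercept": (0.000, 2.324, 3.554),
    "ft1": (0.000, 0.043, 0.049),
    "ft2": (0.000, 0.172, 0.206),
    "ft3": (0.860, -0.013, 0.011),
}


def is_insignificant(p_value, ci_low, ci_high, alpha=ALPHA):
    """Flag a term whose p-value exceeds alpha and whose interval contains zero."""
    return p_value > alpha and ci_low <= 0.0 <= ci_high


# The intercept is kept regardless, so only the features are screened.
drop_candidates = [
    name
    for name, row in coef_table.items()
    if name != "Intercept" and is_insignificant(*row)
]

print(drop_candidates)  # ['ft3']
```

With a fitted model at hand, the same numbers are available programmatically through `ml1.pvalues` and `ml1.conf_int()` rather than being typed in by hand.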

In [40]:
ml2 = smf.ols("Target ~ ft1 + ft2", data=df.drop("ft3", axis=1)).fit()
ml2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Target | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 859.6 |
| Date: | Tue, 03 Sep 2024 | Prob (F-statistic): | 4.83e-98 |
| Time: | 16:13:39 | Log-Likelihood: | -386.2 |
| No. Observations: | 200 | AIC: | 778.4 |
| Df Residuals: | 197 | BIC: | 788.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.9211 | 0.294 | 9.919 | 0.000 | 2.340 | 3.502 |
| ft1 | 0.0458 | 0.001 | 32.909 | 0.000 | 0.043 | 0.048 |
| ft2 | 0.1880 | 0.008 | 23.382 | 0.000 | 0.172 | 0.204 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.022 | Durbin-Watson: | 2.081 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 148.679 |
| Skew: | -1.323 | Prob(JB): | 5.19e-33 |
| Kurtosis: | 6.292 | Cond. No. | 425.0 |
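
One way to read the two summaries side by side is to compare their information criteria, where lower is better. The sketch below uses only the rounded values printed in the two tables; the dictionary layout and variable names are illustrative.

```python
# Rounded values copied from the two summary tables.
models = {
    "ml1 (ft1 + ft2 + ft3)": {"aic": 780.4, "bic": 793.6, "adj_r2": 0.896},
    "ml2 (ft1 + ft2)": {"aic": 778.4, "bic": 788.3, "adj_r2": 0.896},
}

# Lower AIC/BIC indicates a better trade-off between fit and complexity.
best_by_aic = min(models, key=lambda name: models[name]["aic"])
best_by_bic = min(models, key=lambda name: models[name]["bic"])

aic_gain = models["ml1 (ft1 + ft2 + ft3)"]["aic"] - models["ml2 (ft1 + ft2)"]["aic"]

for name, stats in models.items():
    print(f"{name}: AIC={stats['aic']}, BIC={stats['bic']}, adj. R2={stats['adj_r2']}")
print(f"Preferred by AIC: {best_by_aic}")
print(f"Preferred by BIC: {best_by_bic}")
print(f"AIC improvement from dropping ft3: {aic_gain:.1f}")
```

The AIC falls by roughly 2, which is the penalty for one parameter: the log-likelihood barely moves (-386.18 versus -386.2), so removing `ft3` costs almost nothing in fit while simplifying the model.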


### Modelling
#### **Before** dropping the `ft3` column

In [65]:
X_train, X_test, y_train, y_test = tts(df.drop("Target", axis=1).values, df["Target"].values, shuffle=True, test_size=0.2, random_state=33)

X_train.shape

(160, 3)

In [66]:
lr1 = LR()
lr1.fit(X_train, y_train)

In [68]:
y_pred1 = lr1.predict(X_test)
mae1, r2_1 = mae(y_test, y_pred1), r2_score(y_test, y_pred1)
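
For reference, the two metrics used here can be written out by hand. This is a plain-Python sketch of the usual definitions (mean of absolute errors, and one minus the ratio of residual to total sum of squares), not the `sklearn` implementation; the function names and the toy values are hypothetical.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, not taken from the dataset.
y_true_toy = [10.0, 12.0, 14.0, 16.0]
y_pred_toy = [11.0, 12.0, 13.0, 17.0]

print(mae_by_hand(y_true_toy, y_pred_toy))  # 0.75
print(r2_by_hand(y_true_toy, y_pred_toy))   # 0.85
```

MAE is in the units of the target, while R2 is unitless, which is why the two are reported together.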

#### **After** dropping the `ft3` column

In [73]:
X_train2, X_test2, y_train, y_test = tts(df.drop(["ft3", "Target"], axis=1).values, df["Target"].values, shuffle=True, test_size=0.2, random_state=33)

X_train2.shape

(160, 2)

In [74]:
lr2 = LR()
lr2.fit(X_train2, y_train)

In [75]:
# Predict on the two-feature test split; lr2 was fit without ft3
y_pred2 = lr2.predict(X_test2)
mae2, r2_2 = mae(y_test, y_pred2), r2_score(y_test, y_pred2)

### Results

In [79]:
print(f"{round((mae1 - mae2) / mae1 * 100, 2)}% decrease in MAE\n{round((r2_2 - r2_1) / r2_1 * 100, 2)}% increase in R2-score")

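0.11% decrease in MAE
0.1% increase in R2-score

Dropping `ft3` leaves test performance essentially unchanged, which is consistent with its non-significant coefficient in the OLS summary.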
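
The percentages are relative changes measured against the first model. The sketch below restates that arithmetic as two small helpers; the function names and the metric values are hypothetical and chosen only to make the calculation easy to follow.

```python
def pct_decrease(before, after):
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


def pct_increase(before, after):
    """Relative rise from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100


# Hypothetical metric values, not the notebook's results.
mae_before, mae_after = 2.0, 1.5
r2_before, r2_after = 0.5, 0.625

print(f"{round(pct_decrease(mae_before, mae_after), 2)}% decrease in MAE")
print(f"{round(pct_increase(r2_before, r2_after), 2)}% increase in R2-score")
```

A relative change divides by the baseline, so small differences look larger when the baseline metric is itself small.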
