# Multiple Linear Regression using statsmodels & sklearn

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression

In [214]:
bit = pd.read_csv('Fit_Data.csv')

In [215]:
bit.head(5)

Unnamed: 0,Calories Burned,Steps,Distance,Floors,Minutes Sedentary,Minutes Lightly Active(Fat Burn),Minutes Fairly Active(Cardio),Minutes Very Active(Peak),Activity Calories,Minutes Asleep,Minutes Awake,Number of Awakenings,Time in Bed,Minutes REM Sleep,Minutes Light Sleep,Minutes Deep Sleep
0,2736,10201,4.46,6,633,341,10,24,1482,334,98,20,432,47.0,242.0,45.0
1,2637,9539,4.25,4,608,292,31,2,1302,414,93,33,507,50.0,346.0,18.0
2,2656,11394,4.75,5,750,242,32,27,1328,331,58,27,389,31.0,278.0,22.0
3,2934,17150,7.2,6,541,294,16,36,1657,464,89,36,553,84.0,341.0,39.0
4,2961,18607,7.82,11,452,270,18,48,1651,526,126,46,652,79.0,401.0,46.0


In [440]:
x = bit[['Time in Bed', 'Number of Awakenings']]
y = bit['Distance']

### Create multiple regression using statsmodels

In [441]:
x1 = sm.add_constant(x)
results = sm.OLS(y,x1).fit()

In [442]:
results.summary()

0,1,2,3
Dep. Variable:,Distance,R-squared:,0.029
Model:,OLS,Adj. R-squared:,0.02
Method:,Least Squares,F-statistic:,3.166
Date:,"Tue, 04 Aug 2020",Prob (F-statistic):,0.0442
Time:,18:59:50,Log-Likelihood:,-485.1
No. Observations:,213,AIC:,976.2
Df Residuals:,210,BIC:,986.3
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.4224,0.572,9.472,0.000,4.294,6.551
Time in Bed,0.0026,0.003,0.955,0.341,-0.003,0.008
Number of Awakenings,-0.0734,0.037,-1.978,0.049,-0.147,-0.000

0,1,2,3
Omnibus:,13.0,Durbin-Watson:,1.711
Prob(Omnibus):,0.002,Jarque-Bera (JB):,13.524
Skew:,0.583,Prob(JB):,0.00116
Kurtosis:,2.596,Cond. No.,1480.0


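#### A common rule of thumb: a variable with a p-value > 0.05 is not statistically significant at the 5% level and is a candidate for removal.

#### **Number of Awakenings** has a p-value of 0.049, just below 0.05, while **Time in Bed** has a p-value of 0.341, well above it.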
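
As a rough check on where the `P>|t|` column comes from, the sketch below recomputes two-sided p-values from the t statistics in the summary above. It assumes a standard normal approximation to the t distribution. That is reasonable with 210 residual degrees of freedom, but statsmodels uses the t distribution itself, so the values should only agree to about two decimal places. The helper name `two_sided_p_normal` is made up for this illustration.

```python
from statistics import NormalDist

def two_sided_p_normal(t_stat):
    """Approximate two-sided p-value for a t statistic using the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

# t statistics copied from the summary table above
t_stats = {"Time in Bed": 0.955, "Number of Awakenings": -1.978}

approx_p = {name: two_sided_p_normal(t) for name, t in t_stats.items()}
for name, p in approx_p.items():
    verdict = "below" if p < 0.05 else "above"
    print(f"{name}: p is roughly {p:.3f} ({verdict} 0.05)")
```

With the fitted model in hand, the exact values can be read from `results.pvalues` instead of approximating them.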

### Multiple regression using sklearn

In [445]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression()

In [446]:
reg.coef_

array([ 0.00261202, -0.07343498])

In [447]:
reg.intercept_

5.42239731408418
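
To make the fitted coefficients concrete, here is a minimal sketch that rebuilds a prediction by hand as intercept plus coefficient times feature. The numbers are copied from `reg.intercept_` and `reg.coef_` above and from the first row of `bit.head()`. The function name `predict_distance` is an illustrative assumption, not part of sklearn.

```python
# Values copied from reg.intercept_ and reg.coef_ above
intercept = 5.42239731408418
coef_time_in_bed = 0.00261202
coef_awakenings = -0.07343498

def predict_distance(time_in_bed, awakenings):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    return intercept + coef_time_in_bed * time_in_bed + coef_awakenings * awakenings

# First row of bit.head(): Time in Bed = 432, Number of Awakenings = 20, Distance = 4.46
predicted = predict_distance(432, 20)
residual = 4.46 - predicted  # observed minus predicted
print(round(predicted, 3), round(residual, 3))
```

The prediction comes out near 5.08, about 0.62 above the recorded distance of 4.46. In the notebook itself, `reg.predict(x)` performs the same calculation for every row.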

### R^2 (unadjusted)

In [448]:
reg.score(x,y)

0.029270258697293272

## Verify adjusted R^2 with the formula below

### Formula for adjusted R^2

$R^2_{adj.} = 1 - (1-R^2)\frac{n-1}{n-p-1}$

where $n$ is the number of observations and $p$ is the number of predictors.

In [449]:
x.shape

(213, 2)

In [450]:
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)

In [451]:
adjusted_r2

0.020025213542029463

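#### The adjusted R^2 (0.020) is lower than R^2 (0.029) and matches the value in the statsmodels summary. Adjusted R^2 is always lower than R^2 when the model has at least one predictor and R^2 < 1; the gap is the penalty for the number of predictors. Of the two variables, **Number of Awakenings** has the lower p-value in the statsmodels summary (0.049 vs. 0.341), so it appears to be the more useful one.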
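
The formula above can be wrapped in a small reusable helper. This is a sketch only: the name `adjusted_r2_score` is chosen here to avoid shadowing the notebook's `adjusted_r2` variable, and the ten-predictor case is hypothetical, included to show how the penalty grows when R^2 stays the same.

```python
def adjusted_r2_score(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 = 0.029270258697293272  # reg.score(x, y) from above
n = 213                    # x.shape[0]

two_predictors = adjusted_r2_score(r2, n, p=2)
# Hypothetical: the same R^2 achieved with ten predictors instead of two
ten_predictors = adjusted_r2_score(r2, n, p=10)
print(two_predictors, ten_predictors)
```

With two predictors the helper reproduces the value computed above. With ten predictors and the same R^2, the result drops below zero, which shows that adjusted R^2 can be negative for weak models.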

## Feature Selection

In [452]:
from sklearn.feature_selection import f_regression

In [453]:
f_regression(x,y)

(array([2.3850073 , 5.42235232]), array([0.12400278, 0.02082706]))

In [454]:
p_values = f_regression(x,y)[1]
p_values.round(3)

array([0.124, 0.021])
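
Note that these p-values (0.124 and 0.021) differ from the ones in the statsmodels summary (0.341 and 0.049). As far as sklearn's documentation describes it, `f_regression` scores each feature on its own, as a separate one-predictor regression against the target, whereas the OLS summary tests each coefficient with the other predictor held in the model. The sketch below illustrates the univariate calculation in plain Python, assuming the usual relation F = r^2 / (1 - r^2) * (n - 2) for a simple regression with an intercept. The function name `univariate_f_statistic` is hypothetical, and the five rows come from `bit.head()`, so the numbers will not match the full-data output.

```python
def univariate_f_statistic(xs, ys):
    """F statistic of a simple (one-predictor) regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return r_squared / (1 - r_squared) * (n - 2)

# Five rows from bit.head(), for illustration only (far too few for inference)
time_in_bed = [432, 507, 389, 553, 652]
awakenings = [20, 33, 27, 36, 46]
distance = [4.46, 4.25, 4.75, 7.2, 7.82]

f_time_in_bed = univariate_f_statistic(time_in_bed, distance)
f_awakenings = univariate_f_statistic(awakenings, distance)
print(f_time_in_bed, f_awakenings)
```

Each call looks at one feature and ignores the other, which is why a summary table that pairs these p-values with the multiple-regression coefficients mixes two different models.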

## Create Summary Table

In [455]:
reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary

Unnamed: 0,Features
0,Time in Bed
1,Number of Awakenings


In [456]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)

In [457]:
reg_summary

Unnamed: 0,Features,Coefficients,p-values
0,Time in Bed,0.002612,0.124
1,Number of Awakenings,-0.073435,0.021
