# Linear Regression

References:
    
* [Dimensionality Reduction in Python](https://campus.datacamp.com/courses/dimensionality-reduction-in-python/feature-selection-ii-selecting-for-model-accuracy)
* [Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

In [288]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
import statsmodels.api as sm


## Sample data

In [282]:
np.random.seed(1)

df = pd.DataFrame(np.random.normal(size=(1000, 4), loc=5, scale=5), columns=["x1", "x2", "x3", "x4"])
df["error"] = np.random.normal(size=len(df), loc=0, scale=10)

# Actual intercept is 20
# Actual coefficients are [5, 2, 0, 0]
df["y"] = 20 + 5 * df["x1"] + 2 * df["x2"] + 0 * df["x3"] + 0 * df["x4"] + df["error"]

X = df[["x1", "x2", "x3", "x4"]]
y = df["y"]

df.head()

|   | x1 | x2 | x3 | x4 | error | y |
|---|----|----|----|----|-------|---|
| 0 | 13.121727 | 1.941218 | 2.359141 | -0.364843 | -1.40371 | 88.08736 |
| 1 | 9.327038 | -6.507693 | 13.724059 | 1.193965 | 1.416417 | 55.03622 |
| 2 | 6.595195 | 3.753148 | 12.31054 | -5.300704 | 3.119686 | 63.60196 |
| 3 | 3.387914 | 3.079728 | 10.668847 | -0.499456 | 7.690852 | 50.789878 |
| 4 | 4.137859 | 0.610708 | 5.211069 | 7.914076 | 5.842858 | 47.753568 |


## Using the LinearRegression estimator

In [283]:
lr = LinearRegression()
lr.fit(X, y)

print("Intercept: {:.3f}".format(lr.intercept_))
print("Coefficients:", lr.coef_)
print("R2 score: {:.3f}".format(lr.score(X, y)))

Intercept: 21.348
Coefficients: [ 4.99107031  2.00208645 -0.07989671 -0.031451  ]
R2 score: 0.873
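
`LinearRegression` fits ordinary least squares, so the same kind of fit can be cross-checked with plain NumPy. The following is a minimal sketch, assuming the sample data is re-created with the same seed and call order as in the "Sample data" section; the names `design` and `beta` are illustrative only.

```python
import numpy as np

# Re-create the sample data (same seed and same order of random draws).
np.random.seed(1)
X = np.random.normal(size=(1000, 4), loc=5, scale=5)
error = np.random.normal(size=1000, loc=0, scale=10)
y = 20 + 5 * X[:, 0] + 2 * X[:, 1] + error

# Prepend a column of ones so the first fitted parameter is the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("Intercept: {:.3f}".format(beta[0]))
print("Coefficients:", beta[1:])
```

The estimates should land close to the true values (intercept 20, coefficients 5, 2, 0, 0), with the remaining gap coming from the noise term.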


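## Using Lasso to shrink coefficients

* Lasso stands for Least Absolute Shrinkage and Selection Operator
* It adds an L1 penalty (`alpha` times the sum of the absolute values of the coefficients) to the least-squares loss, which shrinks coefficients and can set some of them to exactly zero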
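
Why a penalty on absolute coefficient values can produce exact zeros is easiest to see through soft-thresholding, the shrinkage step that coordinate-descent lasso solvers apply to each coefficient. The sketch below is illustrative only: `soft_threshold` is a hypothetical helper written for this note, not part of scikit-learn.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink each value toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


values = np.array([3.0, -0.5, 1.0, -2.5])
shrunk = soft_threshold(values, 1.0)

print("Before:", values)
print("After: ", shrunk)
print("Non-zero entries left:", np.count_nonzero(shrunk))
```

Large values are pulled toward zero by the threshold, while values no larger than the threshold are removed entirely. This is the mechanism behind lasso's feature selection.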

In [295]:
# When alpha is too low, model might overfit
# When alpha is too high, model might become too simple and inaccurate
la = Lasso(alpha=2)
la.fit(X, y)

print("Intercept: {:.3f}".format(la.intercept_))
print("Coefficients:", la.coef_)
print("R2 score: {:.3f}".format(la.score(X, y)))

zero_coef = la.coef_ == 0
print("The model has ignored {} out of {} features.".format(sum(zero_coef), len(la.coef_)))

# Keep only the features whose coefficients are non-zero
mask = la.coef_ != 0
reduced_X = X.loc[:, mask]
print("Reduced features:", reduced_X.columns.values)

Intercept: 21.619
Coefficients: [ 4.90549375  1.91595741 -0.         -0.        ]
R2 score: 0.872
The model has ignored 2 out of 4 features.
Reduced features: ['x1' 'x2']
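
To see the effect of `alpha` described in the comments above, one option is to refit the model over a few values and count the zeroed coefficients. This is a rough sketch, assuming the sample data is re-created with the same seed; the particular `alphas` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Re-create the sample data (same seed and same order of random draws).
np.random.seed(1)
X = np.random.normal(size=(1000, 4), loc=5, scale=5)
error = np.random.normal(size=1000, loc=0, scale=10)
y = 20 + 5 * X[:, 0] + 2 * X[:, 1] + error

alphas = [0.01, 1, 10, 1000]  # arbitrary illustrative values
n_zero = {}
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients that the penalty has driven to exactly zero.
    n_zero[alpha] = int(np.sum(model.coef_ == 0))
    print("alpha={:<7} R2={:.3f}  zero coefficients: {}".format(
        alpha, model.score(X, y), n_zero[alpha]))
```

With a very large `alpha` the penalty dominates and every coefficient is set to zero, so the model predicts only the mean of `y`.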


## Finding the best alpha value with LassoCV

In [292]:
lcv = LassoCV()
lcv.fit(X, y)

print("Alpha: {:.3f}".format(lcv.alpha_))
print("Intercept: {:.3f}".format(lcv.intercept_))
print("Coefficients:", lcv.coef_)
print("R2 score: {:.3f}".format(lcv.score(X, y)))


Alpha: 0.883
Intercept: 21.368
Coefficients: [ 4.95261684  1.96433853 -0.04311655 -0.        ]
R2 score: 0.873
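
`LassoCV` picks `alpha_` by cross-validation: it fits the model for a grid of candidate alphas on each fold and keeps the alpha with the lowest mean test error. The sketch below inspects that selection through the fitted `alphas_` and `mse_path_` attributes. It assumes the sample data is re-created with the same seed, and `cv=5` is set explicitly here as an assumption for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Re-create the sample data (same seed and same order of random draws).
np.random.seed(1)
X = np.random.normal(size=(1000, 4), loc=5, scale=5)
error = np.random.normal(size=1000, loc=0, scale=10)
y = 20 + 5 * X[:, 0] + 2 * X[:, 1] + error

lcv = LassoCV(cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): one test-fold MSE per candidate alpha and fold.
mean_mse = lcv.mse_path_.mean(axis=1)
best_index = int(np.argmin(mean_mse))

print("Candidate alphas tried:", len(lcv.alphas_))
print("Alpha with lowest mean CV error: {:.3f}".format(lcv.alphas_[best_index]))
print("Selected alpha_: {:.3f}".format(lcv.alpha_))
```

Plotting `mean_mse` against `lcv.alphas_` is a common way to check how flat the error curve is around the selected value.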


## Using the statsmodels package

In [285]:
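# sm.OLS does not add an intercept automatically (sm.add_constant(X) would add one),
# so this fits a model without an intercept and the summary reports an uncentered R-squared
result = sm.OLS(y, X).fit()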

print("Summary:\n{}".format(result.summary()))
print("\nCoefficients:\n{}".format(result.params))

Summary:
                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.951
Model:                            OLS   Adj. R-squared (uncentered):              0.951
Method:                 Least Squares   F-statistic:                              4812.
Date:                Wed, 02 Sep 2020   Prob (F-statistic):                        0.00
Time:                        10:34:47   Log-Likelihood:                         -4047.8
No. Observations:                1000   AIC:                                      8104.
Df Residuals:                     996   BIC:                                      8123.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------