# Linear Regression

Linear regression is one of the simplest models in machine learning, and it is intuitive to build and explain. That second point matters especially in a business setting, where you need to be able to explain how you arrive at a prediction and why.

You should check out the slides for an explanation of what I am doing here. Everything below will be based on them.

In [90]:
%matplotlib inline

import numpy as np

## Finding betas (single coefficient)

In [91]:
# Initializing the data
X = np.array([[1800], [1900], [2000]])
y = np.array([3.5, 3.4, 3.7])

In [92]:
betas = np.linalg.inv(X.T @ X) @ X.T @ y

In [93]:
betas

array([ 0.00185806])
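With a single column and no intercept, the matrix formula used above reduces to a ratio of two dot products: beta = (x . y) / (x . x). A minimal sketch that checks this against the matrix version on the same three data points (the names `x`, `beta_scalar` and `beta_matrix` are introduced here for illustration only):

```python
import numpy as np

# Same three observations as above, no intercept column
X = np.array([[1800], [1900], [2000]], dtype=float)
y = np.array([3.5, 3.4, 3.7])

# Matrix form of the normal equations
beta_matrix = np.linalg.inv(X.T @ X) @ X.T @ y

# Scalar form: with one column, X.T @ X and X.T @ y are plain dot products
x = X[:, 0]
beta_scalar = (x @ y) / (x @ x)

print(beta_matrix[0], beta_scalar)  # both approximately 0.00185806
```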

## Finding betas (including intercept)

In [94]:
# Initializing the data
X = np.array([[1, 1800], [1, 1900], [1, 2000]])
y = np.array([3.5, 3.4, 3.7])

In [95]:
betas = np.linalg.inv(X.T @ X) @ X.T @ y

In [96]:
betas

array([  1.63333333e+00,   1.00000000e-03])
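Explicitly inverting `X.T @ X` is fine for a toy example, but forming that product squares the condition number of `X`, so a least-squares solver is usually the safer choice on real data. A sketch using `np.linalg.lstsq` on the same data, which should agree with the betas above up to rounding:

```python
import numpy as np

X = np.array([[1, 1800], [1, 1900], [1, 2000]], dtype=float)
y = np.array([3.5, 3.4, 3.7])

# Normal equations, as in the cell above
betas_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; avoids forming and inverting X.T @ X
betas_lstsq, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(betas_lstsq)  # intercept and slope

# The second condition number is roughly the square of the first
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))
```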

## Computing the covariance matrix

In [97]:
# Finding SSE (sum of squared errors)
y_hat = X @ betas
sse = np.sum((y - y_hat)**2)

In [98]:
sse

0.026666666666666772

In [99]:
# Finding number of rows
n = X.shape[0]

# Finding rank of X
r = np.linalg.matrix_rank(X)

# Finding residual standard error
rse = np.sqrt(sse/(n-r))

In [100]:
rse

0.16329931618554552

In [101]:
# The covariance of beta hat
cov_betas = rse**2 * np.linalg.inv(X.T @ X)

In [102]:
cov_betas

array([[  4.82222222e+00,  -2.53333333e-03],
       [ -2.53333333e-03,   1.33333333e-06]])
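The square roots of the diagonal of this covariance matrix are the standard errors of the coefficients, and dividing each coefficient by its standard error gives its t-statistic. A self-contained sketch that recomputes the quantities above and adds those two steps (`std_errs` and `t_stats` are names introduced here):

```python
import numpy as np

X = np.array([[1, 1800], [1, 1900], [1, 2000]], dtype=float)
y = np.array([3.5, 3.4, 3.7])

XtX_inv = np.linalg.inv(X.T @ X)
betas = XtX_inv @ X.T @ y

# Residual variance estimate: SSE divided by the residual degrees of freedom
residuals = y - X @ betas
n, r = X.shape[0], np.linalg.matrix_rank(X)
sigma2 = np.sum(residuals**2) / (n - r)

cov_betas = sigma2 * XtX_inv

# Standard errors are the square roots of the diagonal entries
std_errs = np.sqrt(np.diag(cov_betas))
t_stats = betas / std_errs

print(std_errs)
print(t_stats)
```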

# Example Wine Dataset

The following is data about wine price based on certain attributes like age and winter rain levels.

In [103]:
%matplotlib inline

from pandas import DataFrame, Series
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [104]:
#Read in Wine Data
wine = pd.read_csv("~/src/kaggledecal/datasets/wine/wine.csv")


In [105]:
wine.shape #not visible b/c not the last line
wine.head()


| | Year | Price | WinterRain | AGST | HarvestRain | Age | FrancePop |
|---|---|---|---|---|---|---|---|
| 0 | 1952 | 7.495 | 600 | 17.1167 | 160 | 31 | 43183.569 |
| 1 | 1953 | 8.0393 | 690 | 16.7333 | 80 | 30 | 43495.03 |
| 2 | 1955 | 7.6858 | 502 | 17.15 | 130 | 28 | 44217.857 |
| 3 | 1957 | 6.9845 | 420 | 16.1333 | 110 | 26 | 45152.252 |
| 4 | 1958 | 6.7772 | 582 | 16.4167 | 187 | 25 | 45653.805 |


In [106]:
#Fitting our linear regression model (sm.OLS does not add an intercept column on its own)
wine_y = wine["Price"]
wine_x = wine.drop(["Price"],axis=1)

clf = sm.OLS(wine_y,wine_x)
result = clf.fit()

In [107]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.829 |
| Model: | OLS | Adj. R-squared: | 0.784 |
| Method: | Least Squares | F-statistic: | 18.47 |
| Date: | Sun, 25 Sep 2016 | Prob (F-statistic): | 1.04e-06 |
| Time: | 23:12:49 | Log-Likelihood: | -2.1043 |
| No. Observations: | 25 | AIC: | 16.21 |
| Df Residuals: | 19 | BIC: | 23.52 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| Year | -0.0002 | 0.005 | -0.044 | 0.965 | -0.011 | 0.011 |
| WinterRain | 0.0010 | 0.001 | 1.963 | 0.064 | -6.89e-05 | 0.002 |
| AGST | 0.6012 | 0.103 | 5.836 | 0.000 | 0.386 | 0.817 |
| HarvestRain | -0.0040 | 0.001 | -4.523 | 0.000 | -0.006 | -0.002 |
| Age | 0.0004 | 0.074 | 0.005 | 0.996 | -0.155 | 0.155 |
| FrancePop | -4.953e-05 | 0.000 | -0.297 | 0.770 | -0.000 | 0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.769 | Durbin-Watson: | 2.792 |
| Prob(Omnibus): | 0.413 | Jarque-Bera (JB): | 1.026 |
| Skew: | -0.005 | Prob(JB): | 0.599 |
| Kurtosis: | 2.008 | Cond. No. | 86100.0 |
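The columns of the coefficient table are related in a simple way: t is coef divided by std err, and the 95% interval is coef plus or minus a critical t-value times std err. A quick check for the AGST row, assuming a critical value of about 2.093 for 19 residual degrees of freedom (taken from a standard t-table rather than computed here):

```python
# Values read off the AGST row of the summary above (already rounded)
coef = 0.6012
std_err = 0.103

# Two-sided 95% critical value of Student's t with 19 degrees of freedom
# (assumed from a standard t-table)
t_crit = 2.093

t_stat = coef / std_err
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

print(round(t_stat, 2))                        # close to the reported 5.836
print(round(ci_lower, 3), round(ci_upper, 3))  # close to the reported 0.386 and 0.817
```

Small differences from the table come from using the rounded coef and std err rather than the full-precision values.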


## Wine Validation Method

In [122]:
#Read in Wine Data
wine = pd.read_csv("../../datasets/wine/wine.csv")

#Train Test Random Split
local_train, local_test = train_test_split(wine,test_size=0.2,random_state=123)

In [123]:
local_train.shape

(20, 7)

In [124]:
local_test.shape

(5, 7)
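With 25 rows and `test_size=0.2`, the split holds out 5 rows and trains on the remaining 20. Conceptually, a random split shuffles the row indices and slices them. The sketch below illustrates that idea with NumPy only; `simple_split` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as `train_test_split` with `random_state=123`.

```python
import numpy as np

def simple_split(n_rows, test_size=0.2, seed=123):
    """Return (train_idx, test_idx) taken from a shuffled range of row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = simple_split(25, test_size=0.2)
print(len(train_idx), len(test_idx))  # 20 5
```

The index arrays could then be used with something like `wine.iloc[train_idx]` and `wine.iloc[test_idx]`.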

In [125]:
local_train_y = local_train["Price"]
local_train_x = local_train.drop(["Price"],axis=1)
local_test_y = local_test["Price"]
local_test_x = local_test.drop("Price",axis=1)

In [126]:
#The Model
clf = sm.OLS(local_train_y,local_train_x)
result = clf.fit()
preds = result.predict(local_test_x)
preds

array([ 7.26553482,  6.72032722,  6.40673963,  5.92787539,  6.42593892])

In [127]:
#SSE of Linear Model
np.sum((local_test_y.values - preds)**2)

1.576636974693699
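SSE grows with the number of test rows, so it is often easier to compare models on the mean squared error (SSE divided by the number of rows) or on its square root, which is in the same units as Price. A small sketch using the SSE printed above and the 5 test rows:

```python
import numpy as np

sse = 1.576636974693699   # value printed above
n_test = 5                # rows in local_test

mse = sse / n_test        # mean squared error
rmse = np.sqrt(mse)       # root mean squared error, same units as Price

print(mse, rmse)
```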

## Multicollinearity

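The correlation matrix of the training data shows some very strong pairwise correlations: Year and Age are perfectly negatively correlated (-1.0), and FrancePop tracks Year almost exactly (0.994). Collinear predictors like these make the individual coefficients and their p-values unreliable, which is a problem when interpreting our model.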

In [128]:
local_train.corr()

| | Year | Price | WinterRain | AGST | HarvestRain | Age | FrancePop |
|---|---|---|---|---|---|---|---|
| Year | 1.0 | -0.445622 | 0.142337 | -0.384322 | -0.110929 | -1.0 | 0.993772 |
| Price | -0.445622 | 1.0 | 0.200214 | 0.656723 | -0.676018 | 0.445622 | -0.452493 |
| WinterRain | 0.142337 | 0.200214 | 1.0 | -0.288179 | -0.162791 | -0.142337 | 0.108707 |
| AGST | -0.384322 | 0.656723 | -0.288179 | 1.0 | -0.295792 | 0.384322 | -0.371744 |
| HarvestRain | -0.110929 | -0.676018 | -0.162791 | -0.295792 | 1.0 | 0.110929 | -0.08804 |
| Age | -1.0 | 0.445622 | -0.142337 | 0.384322 | 0.110929 | 1.0 | -0.993772 |
| FrancePop | 0.993772 | -0.452493 | 0.108707 | -0.371744 | -0.08804 | -0.993772 | 1.0 |
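The -1.0 between Year and Age is not a coincidence. In the five rows shown by `wine.head()` earlier, Year + Age is always 1983, so Age is an exact linear function of Year and carries no extra information. A quick check on those five rows (values copied from that output):

```python
import numpy as np

# Year and Age from the five rows displayed by wine.head()
year = np.array([1952, 1953, 1955, 1957, 1958])
age = np.array([31, 30, 28, 26, 25])

print(year + age)                    # [1983 1983 1983 1983 1983]
print(np.corrcoef(year, age)[0, 1])  # -1.0 up to floating-point rounding
```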


Let's remove Year and FrancePop and leave Age in instead.

In [131]:
wine2 = wine.drop(["Year","FrancePop"],axis=1)

local_train, local_test = train_test_split(wine2,test_size=0.2,random_state=123)

local_train_y = local_train["Price"]
local_train_x = local_train.drop(["Price"],axis=1)
local_test_y = local_test["Price"]
local_test_x = local_test.drop("Price",axis=1)

clf = sm.OLS(local_train_y,local_train_x)
result = clf.fit()
preds = result.predict(local_test_x)

#SSE of Linear Model
np.sum((local_test_y.values - preds)**2)

1.6647133176412703

In [132]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.999 |
| Model: | OLS | Adj. R-squared: | 0.999 |
| Method: | Least Squares | F-statistic: | 4103.0 |
| Date: | Mon, 26 Sep 2016 | Prob (F-statistic): | 7.28e-24 |
| Time: | 00:08:10 | Log-Likelihood: | 1.7346 |
| No. Observations: | 20 | AIC: | 4.531 |
| Df Residuals: | 16 | BIC: | 8.514 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| WinterRain | 0.0012 | 0.000 | 2.972 | 0.009 | 0.000 | 0.002 |
| AGST | 0.3875 | 0.020 | 19.427 | 0.000 | 0.345 | 0.430 |
| HarvestRain | -0.0049 | 0.001 | -6.631 | 0.000 | -0.007 | -0.003 |
| Age | 0.0343 | 0.008 | 4.400 | 0.000 | 0.018 | 0.051 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.602 | Durbin-Watson: | 1.959 |
| Prob(Omnibus): | 0.74 | Jarque-Bera (JB): | 0.497 |
| Skew: | -0.344 | Prob(JB): | 0.78 |
| Kurtosis: | 2.649 | Cond. No. | 241.0 |


Huh, that makes a bit more sense now! Older wine generally costs more, and with the troublesome columns removed, the p-value for Age (0.000, down from 0.996 in the first fit) now reflects this.
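One thing to keep in mind when reading the R-squared of 0.999: this design matrix has no constant column, and for models without an intercept statsmodels reports the uncentered R-squared, which compares the residuals to the sum of squared y values rather than to the variation of y around its mean. That makes it not directly comparable to the 0.829 from the first fit. The sketch below uses synthetic data (not the wine data; all numbers are made up for illustration) to show how far apart the two definitions can be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y sits near 7 and depends only weakly on x
x = rng.uniform(15.0, 18.0, size=20)
y = 7.0 + 0.05 * (x - 16.5) + rng.normal(0.0, 0.5, size=20)

# Fit y = b * x with no intercept column
X = x.reshape(-1, 1)
b = np.linalg.lstsq(X, y, rcond=None)[0]
ssr = np.sum((y - X @ b) ** 2)

r2_uncentered = 1 - ssr / np.sum(y ** 2)             # baseline: always predicting 0
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)  # baseline: always predicting the mean

print(r2_uncentered, r2_centered)
```

Adding a constant column to the predictors (for example with `sm.add_constant`) makes statsmodels report the usual centered R-squared.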