# Linear Regression

*In my opinion, Linear Regression is the most important Machine Learning model that exists. It is worth studying well.*

* predicting continuous variables is commonplace in data science
* the results are interpretable
* can solve complex problems with proper engineering of features
* Linear Regression has been backed up by deep statistical research
* Linear Regression is the base to understand other linear models (such as Logistic Regression or Poisson Regression)
* Prerequisite for understanding neural networks
* very often does the job sufficiently well

In [1]:
import pickle

import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

### 1. Define Business Goal

**Predict the number of rented bicycles for a given day.**

### 2. Get Data

In [2]:
df = pd.read_csv('../data/bicycles/train.csv', index_col=0)
df.head(3)

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32


### 3. Split Data into Training, Validation and Test sets
The test set has already been split off by Kaggle. They only give us its input data, not the correct results.

In [None]:
# what do we need to fill into the gap?
train, val = train_test_split(df, ...)

In [None]:
# how can we check the number of rows and columns?
...
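
As a rough illustration of the two cells above, here is a minimal sketch on a made-up DataFrame. The frame `demo`, the names `demo_train` and `demo_val`, and the values `test_size=0.2` and `random_state=42` are assumptions chosen for the example, not the notebook's intended answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the bicycle data: 100 rows, 3 columns
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "temp": rng.uniform(0, 35, size=100),
    "humidity": rng.integers(20, 100, size=100),
    "count": rng.integers(1, 500, size=100),
})

# test_size and random_state are illustrative choices:
# 20% of the rows go to validation, and the seed makes the split repeatable
demo_train, demo_val = train_test_split(demo, test_size=0.2, random_state=42)

# .shape returns (number of rows, number of columns)
print(demo_train.shape, demo_val.shape)  # (80, 3) (20, 3)
```

Fixing `random_state` keeps the split identical between runs, which makes later scores comparable.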

### 4. Explore Data

In [None]:
sns.pairplot(train)

In [None]:
# what else might we want to look at?

### 5. Define X and y

In [None]:
# X is a matrix of input features
Xtrain = ...
Xval = ...

# y is a vector of scalar values --> Regression
ytrain = ...
yval = ...

In [None]:
Xtrain.shape, ytrain.shape

In [None]:
Xval.shape, yval.shape
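
One possible way to separate features from the target, sketched on the three rows shown in section 2. The name `feature_cols` and the choice of columns are assumptions for illustration. In the rows shown, `casual + registered` equals `count`, so those two columns would probably leak the target and are left out here.

```python
import pandas as pd

# miniature frame built from the three rows printed by df.head(3)
demo_train = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02],
    "humidity": [81, 80, 80],
    "windspeed": [0.0, 0.0, 0.0],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# illustrative feature selection (hypothetical, not the notebook's answer)
feature_cols = ["temp", "humidity", "windspeed"]

X_demo = demo_train[feature_cols]  # 2D: a DataFrame with one column per feature
y_demo = demo_train["count"]       # 1D: a Series with one scalar per row

print(X_demo.shape, y_demo.shape)  # (3, 3) (3,)
```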

### 6. Train a Linear Regression Model

The model is:

$\hat y = w_1x_1 + w_2x_2 + \dots + w_nx_n + w_0$

There are two ways to fit a Linear Regression model:

#### a) Normal Equation

Uses an analytical approach to calculate the coefficients directly.
This closed-form solution is called the **Normal Equation**.

The Normal Equation has two big disadvantages:

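* its cost grows roughly cubically with the number of features, $O(n^3)$, because an $n \times n$ matrix has to be inverted
* it can fail if your features are redundant (linearly dependent), because the matrix $X^TX$ is then not invertible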

For data sets with many features, b) is usually the better choice.
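
The Normal Equation is commonly written as $w = (X^TX)^{-1}X^Ty$. The following is a minimal NumPy sketch of it on synthetic, noise-free data. The data and the variable names (`Xb`, `w`) are assumptions for the example. It solves the linear system instead of inverting the matrix explicitly, which is generally the more stable option.

```python
import numpy as np

# synthetic data with known coefficients: w_1 = 3, w_2 = -2, w_0 = 5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# append a column of ones so that the last coefficient acts as the intercept w_0
Xb = np.column_stack([X, np.ones(len(X))])

# solve (Xb^T Xb) w = Xb^T y; this fails if the columns of Xb are redundant
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

print(w.round(3))  # [ 3. -2.  5.]
```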

#### b) Gradient Descent

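Iteratively optimizes the coefficients to find the lowest possible MSE.

* finds the global minimum if the learning rate is small enough (the MSE is a convex function)
* only needs the partial derivatives of the MSE, so each step has linear time complexity in the number of data points and features

Gradient descent and its variants are the standard way to train neural networks (e.g. in TensorFlow) and are available in scikit-learn as `SGDRegressor`. Note that scikit-learn's `LinearRegression`, the `OLS` class in statsmodels and R's `lm` do not use gradient descent; they solve the least-squares problem directly with matrix decompositions.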
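
To make the idea concrete, here is a minimal sketch of batch gradient descent on the MSE, written in plain NumPy. The synthetic data, the learning rate of 0.1 and the 1000 iterations are assumptions for the example, and this is not how any particular library implements it.

```python
import numpy as np

# synthetic data with known coefficients: w_1 = 3, w_2 = -2, w_0 = 5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

w = np.zeros(2)      # start with all coefficients at zero
w0 = 0.0             # intercept
learning_rate = 0.1  # illustrative value; too large a value makes the loop diverge

for _ in range(1000):
    residuals = X @ w + w0 - y                # prediction minus truth
    grad_w = 2 * X.T @ residuals / len(y)     # partial derivatives of the MSE w.r.t. w
    grad_w0 = 2 * residuals.mean()            # partial derivative w.r.t. the intercept
    w -= learning_rate * grad_w               # step against the gradient
    w0 -= learning_rate * grad_w0

print(w.round(3), round(w0, 3))
```

On this well-scaled toy data the loop recovers the coefficients used to generate `y`. On real data, features with very different scales usually need standardizing first.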

In [None]:
# train the model
m = LinearRegression(fit_intercept=True)
...
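
For orientation, this is how fitting typically looks with scikit-learn, shown on synthetic data rather than the bicycle data. The names `X_demo`, `y_demo` and `m_demo` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data with known coefficients: w_1 = 3, w_2 = -2, w_0 = 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

m_demo = LinearRegression(fit_intercept=True)
m_demo.fit(X_demo, y_demo)  # scikit-learn takes X first, then y

# the fitted coefficients and intercept should match the values used above
print(m_demo.coef_.round(3), round(m_demo.intercept_, 3))
```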

### 7. Evaluate the Model

#### R squared

* 0 = the model explains none of the variance in the data (no better than always predicting the mean)
* 1 = the model completely explains the variance in the data
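
A small worked example of how R squared can be computed by hand, using made-up numbers. The arrays `y_true` and `y_hat` are assumptions for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.5, 8.5])  # hypothetical model predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation: 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation: 20.0
r2 = 1 - ss_res / ss_tot

# a "model" that always predicts the mean scores exactly 0
y_mean = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - y_mean) ** 2) / ss_tot

print(round(r2, 2), r2_baseline)  # 0.95 0.0
```

For a fitted scikit-learn regressor, `m.score(X, y)` returns the same quantity.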

#### MSE

* Mean Squared Error
* it is very sensitive to outliers, because each residual is squared, so
* residuals greater than one have a disproportionately big effect on the error
* residuals less than one have a disproportionately small effect on the error

#### MAE

* Mean Absolute Error
* average of the absolute residuals
* less sensitive to outliers than the MSE
* same unit as the target variable
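
The difference in outlier sensitivity can be seen in a small made-up example. The helper functions `mse` and `mae` and all numbers are assumptions for illustration; in the notebook the scikit-learn functions imported at the top do the same job.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared residuals."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean of the absolute residuals."""
    return np.mean(np.abs(y - y_hat))

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_good = np.array([12.0, 18.0, 32.0, 38.0])     # every residual has size 2
y_outlier = np.array([12.0, 18.0, 32.0, 60.0])  # the last residual has size 20

print(mse(y_true, y_good), mae(y_true, y_good))        # 4.0 2.0
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # 103.0 6.5
```

A single bad prediction multiplies the MSE by about 26 here, but the MAE only by about 3.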

#### RMSLE

* Root Mean Squared Log Error
* doesn't penalise over-estimates as much as underestimates
* good for count data that stretches over several orders of magnitude
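
The asymmetry can be checked with a minimal sketch. The helper `rmsle` below is written by hand for illustration, using the common definition with `log(1 + y)`, and the numbers are made up.

```python
import numpy as np

def rmsle(y, y_hat):
    """Root mean squared error between log(1 + y_hat) and log(1 + y)."""
    return np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y)) ** 2))

y_true = np.array([100.0])

under = rmsle(y_true, np.array([50.0]))   # prediction is 50 too low
over = rmsle(y_true, np.array([150.0]))   # prediction is 50 too high

# the same absolute miss costs more when it is an underestimate
print(under > over)  # True
```

Because of the logarithm, the metric is only defined for predictions greater than -1, so negative predictions from a linear model need handling (for example clipping at zero) before scoring.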


In [None]:
ypred = m.predict(Xtrain)
mse_train = mean_squared_error(ytrain, ypred)
mae_train = mean_absolute_error(ytrain, ypred)

print(f"training MSE {mse_train:4.2f}")
print(f"training MAE {mae_train:4.2f}")

In [None]:
ypred_val = m.predict(Xval)
mse_val = mean_squared_error(yval, ypred_val)
mae_val = mean_absolute_error(yval, ypred_val)
print(f"validation MSE {mse_val:4.2f}")
print(f"validation MAE {mae_val:4.2f}")

In [None]:
# inspect the coefficients

# one coefficient per column of Xtrain, in the same order as the columns
m.coef_.round(1), m.intercept_.round(1)

### Statsmodels

In [None]:
from statsmodels.regression.linear_model import OLS

In [None]:
Xtrain['intercept'] = 1  # <-- OLS does not do this on its own

In [None]:
sm = OLS(ytrain, Xtrain)  # note the order: y first, then X (the opposite of scikit-learn)
result = sm.fit()
result.summary()