## Linear Regression
In this notebook we build a linear prediction model. Linear regression is one of the
simplest predictive models.  
We use it as a baseline against which to compare the performance of other models. If other
models do not perform significantly better, linear regression is preferable,
as it trains much faster than more complex models and is relatively easy to understand
and interpret.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import sys, os
sys.path.append(os.path.abspath(os.path.join("..")))
from utils.evaluation import mean_average_percentage_error, root_mean_squared_error

First we load our prepared data.

In [2]:
trips_df = pd.read_pickle('../00_data/trips_hourly_selected.pkl')

We split the dataframe into the features `X` and the target `y`, and standardize the features in `X` so that each of them has the same mean (0) and standard deviation (1).

In [3]:
trips_df

index,starting_trips,ongoing_trips_prev,available_bikes,min_temp,hour,month,is_weekday,is_holiday
2019-01-01 01:00:00,8.0,2.0,870.0,15.6,1,1,True,True
2019-01-01 02:00:00,11.0,4.0,868.0,15.0,2,1,True,True
2019-01-01 03:00:00,2.0,3.0,869.0,15.0,3,1,True,True
2019-01-01 06:00:00,2.0,1.0,871.0,12.2,6,1,True,True
2019-01-01 07:00:00,3.0,1.0,871.0,12.8,7,1,True,True
...,...,...,...,...,...,...,...,...
2019-12-31 19:00:00,28.0,12.0,850.0,20.0,19,12,True,True
2019-12-31 20:00:00,40.0,8.0,853.0,20.6,20,12,True,True
2019-12-31 21:00:00,23.0,15.0,846.0,21.1,21,12,True,True
2019-12-31 22:00:00,16.0,6.0,855.0,21.7,22,12,True,True


In [4]:
X = trips_df[
    [
        "ongoing_trips_prev",
        "available_bikes",
        "min_temp",
        "hour",
        "month",
        "is_weekday",
        "is_holiday",
    ]
]
X_std = StandardScaler().fit_transform(X)
y = trips_df["starting_trips"]
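
As a quick illustration of what the standardization step does, the following sketch compares `StandardScaler` with a manual z-score on a small toy matrix. The values and the names `X_toy`, `X_toy_std`, and `X_manual` are made up for this example and are not part of the trips data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with made-up values (not taken from trips_df).
X_toy = np.array([[900.0, 10.0], [880.0, 14.0], [860.0, 18.0], [840.0, 26.0]])

X_toy_std = StandardScaler().fit_transform(X_toy)

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_std, X_manual))
print(X_toy_std.mean(axis=0).round(6), X_toy_std.std(axis=0).round(6))
```

After the transformation every column has mean 0 and standard deviation 1, so the coefficients of a model fitted on it are expressed per standard deviation of each feature.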


We perform a train-test split with a test size of 30%, so that we can later evaluate our
model on unseen data.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=0
)
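
Note that the scaler above was fitted on the full dataset before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows one way to do this with a scikit-learn pipeline; it runs on synthetic stand-in data, and the names `X_demo`, `y_demo`, and `pipe` are assumptions for this example, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the trips data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 500.0], scale=[3.0, 50.0], size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Split the raw (unscaled) features first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# ... then let the pipeline fit the scaler on the training rows only.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print("Scaler means (training rows only):", scaler.mean_)
print("Test R^2:", pipe.score(X_te, y_te))
```

For a plain linear regression the effect on the metrics is usually small, but the pattern keeps the test set strictly unseen, which matters more for the models this baseline is compared against.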


### Training the model

In [6]:
lin_mod = LinearRegression()
lin_mod.fit(X_train, y_train)
y_pred = lin_mod.predict(X_test)
y_true = y_test


Here we print the coefficients of the model, which are easy to interpret.
Because the features in `X` were standardized, a coefficient is the average change in the
outcome (demand) when the corresponding feature increases by one standard deviation
while the other features stay fixed.  
E.g. if the temperature (`min_temp`) increases by one standard deviation, the demand will
(on average) increase by ~10.48.

In [7]:
print(
    "The Coefficients for our multiple linear regression model are:",
    "\n" "\n" "min_temp        =   ",
    lin_mod.coef_[0],
    "\n" "available_bikes =  ",
    lin_mod.coef_[1],
    "\n" "hour            =   ",
    lin_mod.coef_[2],
    "\n" "month           =   ",
    lin_mod.coef_[3],
    "\n" "is_holiday      =   ",
    lin_mod.coef_[4],
    "\n" "The Intercept is:",
    lin_mod.intercept_,
)


The Coefficients for our multiple linear regression model are: 

min_temp        =    10.476929709657899 
available_bikes =   2.9790711497688642 
hour            =    -3.7372022716485818 
month           =    5.99382804742423 
is_holiday      =    3.2786355231110527 
The Intercept is: 26.464125085090274
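
The print statement above attaches the labels by hand, and its five labels do not follow the seven-column order of `X` (which starts with `ongoing_trips_prev`), so it is worth double-checking which value belongs to which feature. A minimal sketch of a less error-prone approach is to pair the coefficients with the column names directly. It also shows how a coefficient on standardized features could be converted back to a change per original unit by dividing by the feature's standard deviation. The data here is synthetic; only the column names are borrowed from the notebook, and `coef_per_std` and `coef_per_unit` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are borrowed, values are made up.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    {
        "min_temp": rng.normal(15.0, 5.0, size=500),
        "hour": rng.integers(0, 24, size=500).astype(float),
    }
)
y_demo = 2.0 * X_demo["min_temp"] + 0.5 * X_demo["hour"] + rng.normal(0.0, 1.0, size=500)

scaler = StandardScaler().fit(X_demo)
model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

# Pair each coefficient with its column name instead of a hand-written label.
coef_per_std = pd.Series(model.coef_, index=X_demo.columns)

# Dividing by the feature's standard deviation gives the change per original unit.
coef_per_unit = coef_per_std / scaler.scale_

print(coef_per_std)
print(coef_per_unit)
```

In this toy setup `coef_per_unit` recovers values close to the 2.0 and 0.5 used to generate the data, while `coef_per_std` is larger because it refers to a change of one standard deviation.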


### Evaluating the test metrics

In [8]:
print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f}")
print(f"MSE: {mean_squared_error(y_true, y_pred):.2f}")
print(f"MAPE: {mean_average_percentage_error(y_true, y_pred) * 100:.2f}%")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.2f}")
print(f"R^2: {r2_score(y_true, y_pred):.2f}")

MAE: 12.06
MSE: 231.06
MAPE: 45.40%
RMSE: 15.20
R^2: 0.46
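
To make the relationship between these metrics explicit, here is a small sketch that computes them by hand with NumPy on a three-point toy example. The helpers `mape` and `rmse` are hypothetical stand-ins: the notebook's own `utils.evaluation` functions are not shown, so their exact definitions may differ (for example in how they handle hours with zero demand, where the percentage error is undefined).

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (hypothetical stand-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical stand-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Toy values, made up for illustration.
y_true_demo = np.array([10.0, 20.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 40.0])

mae_val = np.mean(np.abs(y_true_demo - y_pred_demo))
mse_val = np.mean((y_true_demo - y_pred_demo) ** 2)
rmse_val = rmse(y_true_demo, y_pred_demo)
mape_val = mape(y_true_demo, y_pred_demo)

# R^2 compares the squared error to that of always predicting the mean.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_val = 1.0 - ss_res / ss_tot

print(f"MAE: {mae_val:.2f}, MSE: {mse_val:.2f}, RMSE: {rmse_val:.2f}")
print(f"MAPE: {mape_val * 100:.2f}%, R^2: {r2_val:.2f}")
```

RMSE is the square root of MSE, which matches the output above (15.20 squared is roughly 231.06), and it is expressed in the same unit as the target, i.e. trips per hour.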


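As we can see, the model does not perform very well: with an R^2 of 0.46 it explains less
than half of the variance in hourly demand, and the MAPE of 45.40% means that predictions
are on average off by almost half of the actual value. A simple linear regression is
therefore not enough for this problem.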