# **INF161 - Bike Traffic Prediction Modelling**
### *Ole Kristian Westby | owe009@uib.no | H23*

In this notebook, I go through modelling and prediction for this project. The starting point is the final dataframe that was prepared for the previous deadline.


In [571]:
import pandas as pd

data_path = 'ready_data/ready_data.csv'  # named data_path to avoid shadowing the built-in dir()

# Load data into a dataframe
df = pd.read_csv(data_path)

# Look at first rows
df.head()

| | Datotid | Trafikkmengde | Globalstraling | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast |
|---|---|---|---|---|---|---|---|
| 0 | 2015-07-16 15:00:00 | 50 | 504.4 | 7.233333 | 13.9 | 4.083333 | 6.7 |
| 1 | 2015-07-16 16:00:00 | 101 | 432.833333 | 8.116667 | 13.733333 | 4.333333 | 7.2 |
| 2 | 2015-07-16 17:00:00 | 79 | 378.4 | 10.0 | 13.866667 | 3.933333 | 6.55 |
| 3 | 2015-07-16 18:00:00 | 56 | 212.583333 | 10.0 | 13.216667 | 4.233333 | 7.15 |
| 4 | 2015-07-16 19:00:00 | 45 | 79.75 | 10.0 | 12.683333 | 2.95 | 5.45 |


Let's check for missing data.

In [572]:
# Missing data
missing_data = df.isnull().sum()
missing_data

Datotid             0
Trafikkmengde       0
Globalstraling    403
Solskinstid       403
Lufttemperatur    403
Vindstyrke        403
Vindkast          403
dtype: int64

Next, we'll convert the "Datotid" column to datetime format, and extract month, day and hour from the dataframe.

In [573]:
# Datetime format
df['Datotid'] = pd.to_datetime(df['Datotid'])

# Extracting time-related stuff
df['Month'] = df['Datotid'].dt.month
df['DayOfWeek'] = df['Datotid'].dt.dayofweek # Monday: 0, Tuesday: 1, ... , Sunday: 6.
df['Hour'] = df['Datotid'].dt.hour

df.head()

| | Datotid | Trafikkmengde | Globalstraling | Solskinstid | Lufttemperatur | Vindstyrke | Vindkast | Month | DayOfWeek | Hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-07-16 15:00:00 | 50 | 504.4 | 7.233333 | 13.9 | 4.083333 | 6.7 | 7 | 3 | 15 |
| 1 | 2015-07-16 16:00:00 | 101 | 432.833333 | 8.116667 | 13.733333 | 4.333333 | 7.2 | 7 | 3 | 16 |
| 2 | 2015-07-16 17:00:00 | 79 | 378.4 | 10.0 | 13.866667 | 3.933333 | 6.55 | 7 | 3 | 17 |
| 3 | 2015-07-16 18:00:00 | 56 | 212.583333 | 10.0 | 13.216667 | 4.233333 | 7.15 | 7 | 3 | 18 |
| 4 | 2015-07-16 19:00:00 | 45 | 79.75 | 10.0 | 12.683333 | 2.95 | 5.45 | 7 | 3 | 19 |
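As a quick sanity check of the `dayofweek` numbering, here is a small sketch on a hypothetical two-row frame (the name `demo` and the second timestamp are made up for illustration). It shows the integer code next to the weekday name, so the Monday = 0 convention can be confirmed by eye.

```python
import pandas as pd

# Hypothetical mini-frame; only the first timestamp appears in the real data
demo = pd.DataFrame({"Datotid": ["2015-07-16 15:00:00", "2015-07-19 08:00:00"]})
demo["Datotid"] = pd.to_datetime(demo["Datotid"])

# Same extraction as in the cell above
demo["Month"] = demo["Datotid"].dt.month
demo["DayOfWeek"] = demo["Datotid"].dt.dayofweek  # Monday: 0, ..., Sunday: 6
demo["Hour"] = demo["Datotid"].dt.hour

# Weekday name alongside the integer code, just for checking the mapping
demo["DayName"] = demo["Datotid"].dt.day_name()
print(demo)
```

16 July 2015 was a Thursday, which matches the `DayOfWeek` value of 3 in the table above.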


Now, regarding the missing values: each of the five weather columns has 403 missing entries. We'll impute them with the column median, since the median is less sensitive to outliers than the mean.

In [574]:
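# Impute with median
for col in missing_data.index[missing_data > 0]:
    # Assign the result back; inplace=True on a column selection can trigger
    # a chained-assignment warning in recent pandas versions
    df[col] = df[col].fillna(df[col].median())

# Now we can check again
missing_data = df.isnull().sum()
missing_data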

Datotid           0
Trafikkmengde     0
Globalstraling    0
Solskinstid       0
Lufttemperatur    0
Vindstyrke        0
Vindkast          0
Month             0
DayOfWeek         0
Hour              0
dtype: int64

Looks like it has now been fixed. With the missing values filled in, we can move on to splitting the data. For this we first need to import train_test_split from scikit-learn.

In [575]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Datotid', 'Trafikkmengde'])
y = df['Trafikkmengde']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

X_train.shape, X_val.shape

((32680, 8), (32681, 8))

The data has now been successfully split into a training set and a validation set. The training set contains 32,680 samples and the validation set contains 32,681 samples.
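One caveat with the imputation step earlier: the median was computed on the full dataframe before the split, so the validation rows had a (small) influence on the fill value. Below is a minimal sketch of an alternative, assuming scikit-learn's `SimpleImputer` inside a pipeline so that the median is learned from the training split only. The data is synthetic, and `X_demo`, `y_demo` and `model` are made-up names; this is not what the cells in this notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Synthetic stand-in for two of the weather columns (assumption, not real data)
X_demo = pd.DataFrame({
    "Lufttemperatur": rng.normal(10, 5, 200),
    "Vindstyrke": rng.normal(4, 1, 200),
})
y_demo = 3 * X_demo["Lufttemperatur"] + rng.normal(0, 1, 200)

# Knock out every 20th temperature value to mimic missing measurements
X_demo.iloc[::20, 0] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1
)

# The imputer is fitted inside the pipeline, i.e. on the training rows only
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_tr, y_tr)

# Median the imputer learned for the first column
train_median = model.named_steps["simpleimputer"].statistics_[0]
print(train_median, X_tr["Lufttemperatur"].median())
```

With only 403 missing values out of 65,361 rows the practical difference is probably tiny here, but the pipeline version keeps the validation set untouched by anything learned during training.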

Next, we need to select a model. We'll import numpy, the model class and mean_squared_error (MSE). Let's try Linear Regression as our first model.

In [576]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Initialize the model and train it
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the validation set
lr_predict = lr.predict(X_val)

lr_rmse = np.sqrt(mean_squared_error(y_val, lr_predict))

lr_rmse

61.68697933176709

The RMSE for this model is approximately 61.69. Before we continue checking different models, I want to stop here and think a bit. Is ~61.69 a good RMSE here? To evaluate this, we need to know how many cyclists are counted in a typical hour, since each row is an hourly count. Let's check this.

In [577]:
cycle_stats = df['Trafikkmengde'].describe()
cycle_stats

count    65361.000000
mean        50.379905
std         69.782243
min          0.000000
25%          5.000000
50%         25.000000
75%         64.000000
max        608.000000
Name: Trafikkmengde, dtype: float64

Given this, it is clear that an RMSE of 61.69 is pretty high: it is larger than the mean (50.38), close to the 75th percentile (64) and only somewhat below the standard deviation (69.78). In other words, the model is not much better than always guessing the average. This model won't do, so we need to explore some more. We can try the Lasso Regression model from the lab work.
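One way to put an RMSE in context is to compare it with a trivial baseline that always predicts the training mean. The RMSE of such a baseline is roughly the standard deviation of the target, which is 69.78 in the summary above. The sketch below illustrates this relationship on synthetic counts, assuming scikit-learn's `DummyRegressor`; the Poisson data and the `*_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic hourly counts (assumption: Poisson with mean 50, not the real data)
y_train_demo = rng.poisson(50, 1000).astype(float)
y_val_demo = rng.poisson(50, 1000).astype(float)

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((1000, 1))
X_val_demo = np.zeros((1000, 1))

# Baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_rmse = np.sqrt(
    mean_squared_error(y_val_demo, baseline.predict(X_val_demo))
)

print("baseline RMSE:", baseline_rmse)
print("std of validation target:", y_val_demo.std())
```

Any real model should beat this number by a clear margin before it is worth keeping.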

In [578]:
from sklearn.linear_model import Lasso

# Initialize the model and train it
lasso = Lasso(alpha=0.1, random_state=1)
lasso.fit(X_train, y_train)

# Predict on val set
lasso_predict = lasso.predict(X_val)

# RMSE
lasso_rmse = np.sqrt(mean_squared_error(y_val, lasso_predict))
lasso_rmse


61.693154529589826

Hmm, that's marginally worse (61.693 vs. 61.687). So far the Linear Regression model is winning, but there has to be a better one. How about Random Forest?

In [579]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model and train it
# n_estimators=10 keeps this cell fast. With 100 estimators the RMSE was 26.1455
# (took about a minute to run); with 1000 estimators it was 25.89.
rf = RandomForestRegressor(n_estimators=10, random_state=1)

rf.fit(X_train, y_train)

# Predict on val set
rf_predict = rf.predict(X_val)

# RMSE
rf_rmse = np.sqrt(mean_squared_error(y_val, rf_predict))
rf_rmse

27.336702706600718

Woah! That's a big improvement. Definitely the best model so far with RMSE ~27.34 (26.1455 with 100 estimators, 25.89 with 1000 estimators, which took 11 minutes). Next up is Gradient Boosting.

In [580]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the model and train it
gb = GradientBoostingRegressor(n_estimators=10, random_state=1)
gb.fit(X_train, y_train)

# Predict on val set
gb_predict = gb.predict(X_val)

# RMSE
gb_rmse = np.sqrt(mean_squared_error(y_val, gb_predict))
gb_rmse

53.0983290926551

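Hmm, not better than Random Forest, though with only 10 estimators (the default is 100) Gradient Boosting is probably underfitted here. Out of the four models tested so far, RandomForest is winning. I'll try some more, starting with ElasticNet. PS: for the final deadline I will probably adjust this so that all models are checked at once for the best one instead of individually like this; the cell-by-cell approach is just for visualization and for describing my thinking.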

In [581]:
from sklearn.linear_model import ElasticNet

# Initialize the model and train it
elastic_net = ElasticNet(alpha=1, random_state=1)
elastic_net.fit(X_train, y_train)

# Predict on val set
y_predict = elastic_net.predict(X_val)

# RMSE
en_rmse = np.sqrt(mean_squared_error(y_val, y_predict))
en_rmse

61.754404052183446

ElasticNet lands at about 61.75, in line with the other linear models. Let's try a different type of model, KNeighborsRegressor.

In [582]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize the model and train it
kn = KNeighborsRegressor(n_neighbors=10)
kn.fit(X_train, y_train)

# Predict on val set
y_predict = kn.predict(X_val)

# RMSE
kn_rmse = np.sqrt(mean_squared_error(y_val, y_predict))
kn_rmse

49.66251270465994
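The one-model-per-cell comparison above could be condensed into a single loop. The following is a rough sketch of that idea, using the same model classes and settings as the cells above but on synthetic data; `X_demo`, `y_demo`, `candidates` and `results` are hypothetical names, and the printed numbers will not match the real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: a non-linear daily pattern plus noise (assumption)
X_demo = rng.uniform(0, 24, size=(400, 2))
y_demo = 50 + 40 * np.sin(X_demo[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 5, 400)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1
)

# Same model settings as in the individual cells above
candidates = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1, random_state=1),
    "ElasticNet": ElasticNet(alpha=1, random_state=1),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=1),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=10, random_state=1),
    "KNeighbors": KNeighborsRegressor(n_neighbors=10),
}

# Fit each model and record its validation RMSE
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))

# Print from best (lowest RMSE) to worst
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.2f}")

best_name = min(results, key=results.get)
```

Keeping the models in a dictionary makes it easy to add or remove candidates without copying a whole cell.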

After trying all of these models, I will stick with RandomForestRegressor. Let's now use GridSearchCV to find the optimal hyperparameters. I expect this to take about 25-30 minutes on my computer, so the cell below is wrapped in a triple-quoted string to keep it from running together with the rest of the notebook.

In [583]:
"""

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf, 
                           param_grid=param_grid, 
                           cv=3, 
                           n_jobs=-1, 
                           verbose=2, 
                           scoring='neg_mean_squared_error')

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

best_rf = grid_search.best_estimator_

rf_predict = best_rf.predict(X_val)

rf_rmse = np.sqrt(mean_squared_error(y_val, rf_predict))

rf_rmse
best_rf

"""

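GridSearchCV gives the best hyperparameters as min_samples_leaf=4 and n_estimators=200 (with random_state=1 fixed; the other grid parameters are not listed, so they presumably stayed at their defaults). This took 35 minutes for me to run, so I am embedding a screenshot as proof so that nobody has to run it again. The resulting RMSE is around 25.8.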

![GridSearchCV result](gridsearchd.png "GridSearch")
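For reference, the grid above has 4 × 4 × 3 × 3 = 144 parameter combinations, and with `cv=3` that means 432 model fits plus a final refit, which explains the long runtime. Below is a scaled-down sketch of the same search on synthetic data, so the mechanics can be tried in a few seconds. It assumes the `neg_root_mean_squared_error` scorer, which gives RMSE directly after flipping the sign; `small_grid`, `search` and the `*_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic stand-in for the training data (assumption, not the real dataset)
X_demo = rng.uniform(0, 24, size=(300, 2))
y_demo = 50 + 40 * np.sin(X_demo[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 5, 300)

# Deliberately tiny grid: 2 x 2 = 4 combinations, 12 fits with cv=3
small_grid = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=small_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Scores are negated so that "higher is better"; flip the sign to get RMSE
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```

With the parameters reported above, the tuned model can also be rebuilt directly as `RandomForestRegressor(n_estimators=200, min_samples_leaf=4, random_state=1)` without repeating the search.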