## Modelling

In this notebook, we build, fit, and compare machine learning models on the data prepared in preparation.ipynb, and select the best one.

We need numpy and pandas to manage the data as usual, and we will use several sklearn modules to construct the models.

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor

We saved the training, validation, and test data as CSV files during preparation, so we can load them here.

In [2]:
# Reading in our saved CSV files
data_train = pd.read_csv('data_train.csv')
data_val = pd.read_csv('data_val.csv')
data_test = pd.read_csv('data_test.csv')
target_2022 = pd.read_csv('target_2022.csv')

data_train.dtypes

Dato/Tid           object
Volum               int64
Solskinstid       float64
Lufttemperatur    float64
Vindstyrke        float64
Måned               int64
Dag                 int64
Ukedag              int64
Time                int64
Måned_1             int64
Måned_10            int64
Måned_11            int64
Måned_12            int64
Måned_2             int64
Måned_3             int64
Måned_4             int64
Måned_5             int64
Måned_6             int64
Måned_7             int64
Måned_8             int64
Måned_9             int64
Ukedag_0            int64
Ukedag_1            int64
Ukedag_2            int64
Ukedag_3            int64
Ukedag_4            int64
Ukedag_5            int64
Ukedag_6            int64
Time_0              int64
Time_1              int64
Time_10             int64
Time_11             int64
Time_12             int64
Time_13             int64
Time_14             int64
Time_15             int64
Time_16             int64
Time_17             int64
Time_18     

CSV files store timestamps as plain text, so pandas reads the Dato/Tid column back as "object" rather than "datetime". We need to convert that column back to datetime.

In [3]:
data_train['Dato/Tid'] = pd.to_datetime(data_train['Dato/Tid'])
data_val['Dato/Tid'] = pd.to_datetime(data_val['Dato/Tid'])
data_test['Dato/Tid'] = pd.to_datetime(data_test['Dato/Tid'])

target_2022['Dato/Tid'] = pd.to_datetime(target_2022['Dato/Tid'])
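
As an alternative, the conversion can happen while reading the file. The sketch below uses a made-up two-row CSV held in memory instead of the real data_train.csv, so the file contents are an assumption for illustration only:

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for data_train.csv
csv_text = "Dato/Tid,Volum\n2015-07-16 17:00:00,84\n2015-07-16 18:00:00,57\n"

# Without parse_dates, the timestamp column comes back as text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted to datetime while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Dato/Tid"])

print(raw["Dato/Tid"].dtype)
print(parsed["Dato/Tid"].dtype)
```

Both approaches should give the same result; parse_dates just saves the separate pd.to_datetime step for each file.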

Checking for missing data:

In [4]:
# Check each column for missing values (True means at least one is missing)
data_train.isnull().any()

Dato/Tid          False
Volum             False
Solskinstid       False
Lufttemperatur     True
Vindstyrke         True
Måned             False
Dag               False
Ukedag            False
Time              False
Måned_1           False
Måned_10          False
Måned_11          False
Måned_12          False
Måned_2           False
Måned_3           False
Måned_4           False
Måned_5           False
Måned_6           False
Måned_7           False
Måned_8           False
Måned_9           False
Ukedag_0          False
Ukedag_1          False
Ukedag_2          False
Ukedag_3          False
Ukedag_4          False
Ukedag_5          False
Ukedag_6          False
Time_0            False
Time_1            False
Time_10           False
Time_11           False
Time_12           False
Time_13           False
Time_14           False
Time_15           False
Time_16           False
Time_17           False
Time_18           False
Time_19           False
Time_2            False
Time_20         

We have missing data in the weather columns Lufttemperatur and Vindstyrke, which is to be expected. We will impute it as part of the modelling pipelines below.

In [5]:
# Overview of training data
data_train

Unnamed: 0,Dato/Tid,Volum,Solskinstid,Lufttemperatur,Vindstyrke,Måned,Dag,Ukedag,Time,Måned_1,...,Time_21,Time_22,Time_23,Time_3,Time_4,Time_5,Time_6,Time_7,Time_8,Time_9
0,2015-07-16 17:00:00,84,60.0,13.866667,3.933333,7,16,3,17,0,...,0,0,0,0,0,0,0,0,0,0
1,2015-07-16 18:00:00,57,60.0,13.216667,4.233333,7,16,3,18,0,...,0,0,0,0,0,0,0,0,0,0
2,2015-07-16 19:00:00,49,60.0,12.683333,2.950000,7,16,3,19,0,...,0,0,0,0,0,0,0,0,0,0
3,2015-07-16 20:00:00,45,36.0,12.066667,2.483333,7,16,3,20,0,...,0,0,0,0,0,0,0,0,0,0
4,2015-07-16 21:00:00,25,0.0,11.350000,1.083333,7,16,3,21,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45104,2021-12-31 16:00:00,7,0.0,6.600000,1.033333,12,31,4,16,0,...,0,0,0,0,0,0,0,0,0,0
45105,2021-12-31 18:00:00,5,0.0,6.483333,0.466667,12,31,4,18,0,...,0,0,0,0,0,0,0,0,0,0
45106,2021-12-31 19:00:00,4,0.0,5.616667,2.650000,12,31,4,19,0,...,0,0,0,0,0,0,0,0,0,0
45107,2021-12-31 20:00:00,2,0.0,4.700000,1.916667,12,31,4,20,0,...,0,0,0,0,0,0,0,0,0,0


We now need to find the best model for our data in order to create a prediction. We start by splitting the data into X and y. X holds the weather conditions (subject to change), the date and time columns, and the dummy variables for the time. y is the number of bikes counted (Volum).

In [6]:
# The timestamp is dropped: years are not periodic, so we won't predict on them
# sklearn also accepts DataFrames, but we convert to numpy arrays to keep the inputs uniform
X_train = data_train.drop(['Dato/Tid', 'Volum'], axis=1).to_numpy()
y_train = data_train['Volum'].to_numpy()

X_val = data_val.drop(['Dato/Tid', 'Volum'], axis=1).to_numpy()
y_val = data_val['Volum'].to_numpy()

X_test = data_test.drop(['Dato/Tid', 'Volum'], axis=1).to_numpy()
y_test = data_test['Volum'].to_numpy()

There are different ways to model data: classification, regression, cluster analysis, and hypothesis testing.
We don't need to classify data into categories, cluster it by shared characteristics, or test a hypothesis. We want to predict a numeric quantity for future data, so regression is the appropriate choice. Now we need to pick the best regression model for our data.

We fit each model on the training data and evaluate its predictions on the validation data. The best model is then run on the test data.

In [7]:
# Mean is a decent strategy, as data can vary. It also gave a better RMSE score than median
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_prediction = baseline.predict(X_val)

print('RMSE_baseline: ', np.round(np.sqrt(mean_squared_error(y_val, baseline_prediction)), decimals=2))

RMSE_baseline:  71.57


In the block above, we created a baseline model to compare against. We want an RMSE (root mean squared error) that is lower than the baseline and as close to 0 as possible. Without a baseline, an RMSE value on its own says little about how good a model is. RMSE is the square root of the average squared difference between predicted and actual values, so it is expressed in the same units as the target and depends on the characteristics of the data (a dataset with larger values and higher variance is likely to have a higher RMSE). The next step is to experiment with various regression models and pick the best one.
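
To make the RMSE calculation concrete, here is a minimal sketch with made-up numbers. The rmse helper and all values are my own assumptions for illustration and are not part of the pipeline in this notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up hourly counts, for illustration only
y_true = np.array([10, 20, 30, 40])

# A mean baseline predicts the same value for every row
baseline_pred = np.full(y_true.shape, y_true.mean())

# A hypothetical model that is off by 2 on every row
model_pred = y_true + np.array([2, -2, 2, -2])

print(rmse(y_true, baseline_pred))
print(rmse(y_true, model_pred))
```

The model's RMSE only means something next to the baseline's: being off by 2 is good here because the mean baseline is off by much more.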

It's hard to say what results to expect, as I'm quite inexperienced with many of these models and machine learning in general. I reckon that k-neighbours regression may do well, as it predicts values from other values in a given neighbourhood. A polynomial model also has potential, as different degrees can be tried to find a good fit; we will have to find the sweet spot for that. As I have many independent variables, a more complex model could also be good, but I will have to experiment with a few.

We can now impute our missing data. Our data seems to be missing at random (MAR), as values usually go missing in chunks (a given timeframe). For instance, the weather data for 20 May 2016 has a chunk of missing values. I assume the recording equipment was faulty or under maintenance during that time. With MAR data, k-neighbours imputation is a good choice. I use n_neighbors=9: neighbours beyond that could differ by much more, and it seemed to work best for all models when I graphed it. I could use IterativeImputer, but the share of missing data is quite small, so it is unlikely to change the prediction significantly. With SimpleImputer, I could impute with the mean or median, which isn't ideal given the variance and periodic nature of the data (e.g. imputing sunshine time with its median makes little sense at night). K-neighbours imputation also handles missing values in multiple variables, which is true in our case.
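
As a small illustration of what k-neighbours imputation does, here is a sketch on a toy matrix. The columns and values are made up, and n_neighbors=2 is used only because the example is tiny:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: columns are (hour, temperature); one temperature is missing
X = np.array([
    [10.0, 12.0],
    [11.0, 14.0],
    [12.0, np.nan],
    [13.0, 16.0],
    [23.0, 5.0],
])

# n_neighbors=2 only because the toy data is tiny; the notebook uses 9
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The gap is filled with the average temperature of the two rows
# that are closest in the columns that are present (here: hour)
print(X_filled[2])
```

The missing temperature is taken from the rows at hours 11 and 13 rather than from the overall mean, which is why this suits periodic data better than SimpleImputer.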

As for scaling, I can pick between StandardScaler and MinMaxScaler. StandardScaler centres each feature to mean 0 and scales it to unit variance, while MinMaxScaler rescales each feature to the range 0 to 1. I choose MinMaxScaler(), which puts the numeric features on the same 0 to 1 range as the dummy variables.
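
The difference between the two scalers is easiest to see on a single column. This is a minimal sketch with a made-up temperature column, not the real data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up temperature column, for illustration only
temps = np.array([[-5.0], [0.0], [10.0], [25.0]])

minmax = MinMaxScaler().fit_transform(temps)
standard = StandardScaler().fit_transform(temps)

print(minmax.ravel())    # smallest value maps to 0, largest to 1
print(standard.ravel())  # centred on 0 with unit variance
```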

Finally, for most of these models, I graphed the validation RMSE for parameter values from 1 to 300 and picked the best one. These graphs take extremely long to run, so I won't include them.

## Regression

Linear regression is the simplest model. It predicts the target as a weighted sum of the input variables plus an intercept, assuming the relationship between inputs and target is linear.

In [8]:
# make_pipeline() chains the imputer, the scaler, and the regression model, so fitting and transforming happen automatically
lr = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), LinearRegression())
lr.fit(X_train, y_train)
lr_prediction = lr.predict(X_val)

print('RMSE_linear_regression: ', np.round(np.sqrt(mean_squared_error(y_val, lr_prediction)), decimals=2))

RMSE_linear_regression:  43.12


Polynomial regression is linear regression on an expanded feature set: we add polynomial terms up to degree n (such as x^2 and products like x1*x2, with x being a variable), which lets the model capture non-linear relationships.

In [9]:
# I choose a degree of 2; a larger value makes this crash (too many generated columns for the available resources)
polynomial_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), PolynomialFeatures(degree=2), LinearRegression())
polynomial_model.fit(X_train, y_train)
polynomial_prediction = polynomial_model.predict(X_val)

print('RMSE_polynomial: ', np.round(np.sqrt(mean_squared_error(y_val, polynomial_prediction)), decimals=2))

RMSE_polynomial:  27.86
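
To see why higher degrees become expensive, here is a rough sketch of how PolynomialFeatures expands the inputs. The n_poly_features helper is hypothetical, and the figure of 50 input columns is an assumption for illustration, not a number taken from this notebook:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features a=2 and b=3; degree 2 yields 1, a, b, a^2, a*b, b^2
X = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(X)
print(expanded)

def n_poly_features(n_features, degree):
    """Number of output columns (bias included) for a full polynomial expansion."""
    return comb(n_features + degree, degree)

# 50 input columns is an assumption, not a figure taken from the notebook
for degree in (1, 2, 3, 4):
    print(degree, n_poly_features(50, degree))
```

The column count grows very quickly with the degree, and every extra column is multiplied by the number of rows in memory.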


Lasso regression is a form of linear regression that makes use of a method called shrinkage. An L1 penalty "shrinks" the coefficients towards zero and can set some of them to exactly zero, which acts as a form of feature selection. The strength of the penalty is set by alpha. With alpha=1, as used here, the penalty may be too strong for our data, which would explain why it does worse than plain linear regression.

In [10]:
lasso_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), Lasso(alpha=1))
lasso_model.fit(X_train, y_train)
lasso_prediction = lasso_model.predict(X_val)
print('RMSE_lasso: ', np.round(np.sqrt(mean_squared_error(y_val, lasso_prediction)), decimals=2))

RMSE_lasso:  48.25
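
The effect of the penalty can be sketched on synthetic data. Everything below (the data, the seed, and the alpha values) is an assumption for illustration and says nothing about the best alpha for the bike data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first of three features drives the target
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

coefs = {}
for alpha in (0.01, 1, 10):
    # A larger alpha means a stronger L1 penalty on the coefficients
    coefs[alpha] = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs[alpha], 2))
```

With a small alpha the fit is close to ordinary linear regression. As alpha grows, the useful coefficient is pulled towards zero as well, which is one way a regularised model can end up worse than the unregularised one.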


Elastic Net regression combines Lasso regression and Ridge regression by applying both of their penalties (L1 and L2). It is not guaranteed to beat either of them: with the settings used here, Lasso regression above gives a better RMSE.

In [11]:
elasticnet_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), ElasticNet(alpha=1))
elasticnet_model.fit(X_train, y_train)
elastic_net_prediction = elasticnet_model.predict(X_val)

print('RMSE_elastic_net: ', np.round(np.sqrt(mean_squared_error(y_val, elastic_net_prediction)), decimals=2))

RMSE_elastic_net:  66.55


K-neighbours regression predicts a value by averaging the targets of the k nearest training points (k=7 here). It tends to be less suited to data with many variables, as distances become less informative, but we still get a decent RMSE score.

In [12]:
kneighbors_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), KNeighborsRegressor(n_neighbors=7))
kneighbors_model.fit(X_train, y_train)
kneighbors_prediction = kneighbors_model.predict(X_val)

print('RMSE_k_neighbors: ', np.round(np.sqrt(mean_squared_error(y_val, kneighbors_prediction)), decimals=2))

RMSE_k_neighbors:  28.78
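
The parameter sweeps mentioned earlier can be sketched as a simple loop over candidate values. This is a minimal sketch on synthetic data with a short list of candidates. The data, the candidate values, and the split are all assumptions, and the real sweep over 1 to 300 would take far longer:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: a noisy daily cycle over the hour of day
hours = rng.uniform(0, 24, size=(400, 1))
counts = 50 + 40 * np.sin(hours[:, 0] * np.pi / 12) + rng.normal(0, 5, size=400)

X_tr, y_tr = hours[:300], counts[:300]
X_va, y_va = hours[300:], counts[300:]

# Fit one model per candidate k and record its validation RMSE
scores = {}
for k in (1, 3, 7, 15, 31):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))

best_k = min(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The value is chosen on the validation data only, so the test data stays untouched until the final check.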


Random forest regression uses ensemble learning and bootstrapping to combine multiple decision trees. Ensemble learning means training several models on our data and averaging their predictions to get an improved result. Bootstrapping means training each tree on a random sample of the data, drawn with replacement. Since no random_state is set, these random samples differ between runs, which is why the RMSE varies every time I run it.

In [13]:
# I tried to graph the number of trees (1 to 300), but it ran for over 1 hour, so I picked values via trial and error
# The default n_estimators=100 works well enough, especially as the score varies between runs
forest_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), RandomForestRegressor())
forest_model.fit(X_train, y_train)
forest_prediction = forest_model.predict(X_val)
print('RMSE_forest: ', np.round(np.sqrt(mean_squared_error(y_val, forest_prediction)), decimals=2))

RMSE_forest:  24.29
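
If a reproducible score is wanted, the randomness can be fixed with random_state. This is a minimal sketch on synthetic data; the seeds 42 and 7 and the small n_estimators=20 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))           # synthetic features
y = 10 * X[:, 0] + rng.normal(size=200)  # synthetic target

# Same seed -> identical forests -> identical predictions
a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y).predict(X)
b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y).predict(X)

# A different seed draws different bootstrap samples
c = RandomForestRegressor(n_estimators=20, random_state=7).fit(X, y).predict(X)

print(np.array_equal(a, b))
print(np.array_equal(a, c))
```

Passing random_state to RandomForestRegressor() inside make_pipeline() would have the same effect for the model above.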


A decision tree repeatedly splits the data into smaller subsets based on feature thresholds, so a prediction is made by following a path from the root to a leaf. I use it with its default settings here; parameters such as max_depth could be tuned.

In [14]:
dt_model = make_pipeline(KNNImputer(n_neighbors=9), MinMaxScaler(), DecisionTreeRegressor())
dt_model.fit(X_train, y_train)
dt_prediction = dt_model.predict(X_val)

print('RMSE_decision_tree: ', np.round(np.sqrt(mean_squared_error(y_val, dt_prediction)), decimals=2))

RMSE_decision_tree:  32.48


We have run through 7 models, and the one with the lowest validation RMSE is the random forest with 100 trees (the default n_estimators). Now we can check how well it performs on the test data.

In [15]:
prediction_test = forest_model.predict(X_test)

print('RMSE_test: ', np.round(np.sqrt(mean_squared_error(y_test, prediction_test)), decimals=2))

RMSE_test:  23.54


This is a decent result, especially compared with the baseline. We now apply the model to the 2022 data, given in the target_2022 CSV file.

## Predicting 2022 data

In [16]:
# The 2022 data does not cover all months, so I add the missing month dummy columns
# to match the training data, and fill them with 0s
extra_months = ['Måned_10', 'Måned_11', 'Måned_12', 'Måned_6', 'Måned_7', 'Måned_8', 'Måned_9']

for col in extra_months:
    target_2022[col] = 0

In [17]:
# The columns must be in the same order as in training, as the model receives plain arrays without column names
# We also add a placeholder Volum column filled with 0s
target_2022['Volum'] = 0
target_2022 = target_2022[data_train.columns]
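
The padding and reordering above can also be done in one step with reindex. The sketch below uses hypothetical miniature frames (train and target are stand-ins for data_train and target_2022, with made-up columns and values):

```python
import pandas as pd

# Hypothetical miniature frames; the real ones have many more columns
train = pd.DataFrame({"Temp": [1.0], "Måned_1": [1], "Måned_6": [0], "Måned_12": [0]})
target = pd.DataFrame({"Måned_1": [1, 1], "Temp": [3.0, 2.5]})

# reindex adds any column missing from `target`, fills it with 0,
# and puts the columns in the training order in a single step
aligned = target.reindex(columns=train.columns, fill_value=0)

print(list(aligned.columns))
```

This avoids keeping a hand-written list of missing months, which would need updating if the 2022 data covered a different period.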

In [18]:
# Drop the timestamp and the placeholder target column
target_x = target_2022.drop(['Dato/Tid', 'Volum'], axis=1)

In [19]:
# Array of predicted values
volum_2022 = forest_model.predict(target_x.to_numpy())
volum_2022

array([3.85, 1.67, 1.38, ..., 6.51, 4.56, 1.62])

In [20]:
# Input the values and round them
target_2022['Volum'] = volum_2022
target_2022 = target_2022.round({'Volum': 0})
target_2022['Volum'] = target_2022['Volum'].astype(int)

# Replace potential negative values with 0
target_2022.Volum = np.where(target_2022.Volum < 0, 0, target_2022.Volum)
target_2022

Unnamed: 0,Dato/Tid,Volum,Solskinstid,Lufttemperatur,Vindstyrke,Måned,Dag,Ukedag,Time,Måned_1,...,Time_21,Time_22,Time_23,Time_3,Time_4,Time_5,Time_6,Time_7,Time_8,Time_9
0,2022-01-01 00:00:00,4,0.0,3.016667,0.683333,1,1,5,0,1,...,0,0,0,0,0,0,0,0,0,0
1,2022-01-01 01:00:00,2,0.0,2.666667,1.233333,1,1,5,1,1,...,0,0,0,0,0,0,0,0,0,0
2,2022-01-01 02:00:00,1,0.0,2.316667,1.233333,1,1,5,2,1,...,0,0,0,0,0,0,0,0,0,0
3,2022-01-01 03:00:00,1,0.0,1.733333,0.900000,1,1,5,3,1,...,0,0,0,1,0,0,0,0,0,0
4,2022-01-01 04:00:00,1,0.0,1.100000,0.950000,1,1,5,4,1,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3550,2022-05-28 22:00:00,26,0.0,7.450000,4.983333,5,28,5,22,0,...,0,1,0,0,0,0,0,0,0,0
3551,2022-05-28 23:00:00,20,0.0,7.133333,5.050000,5,28,5,23,0,...,0,0,1,0,0,0,0,0,0,0
3552,2022-05-29 00:00:00,7,0.0,7.016667,4.566667,5,29,6,0,0,...,0,0,0,0,0,0,0,0,0,0
3553,2022-05-29 01:00:00,5,0.0,6.816667,3.533333,5,29,6,1,0,...,0,0,0,0,0,0,0,0,0,0


We can save this to a CSV file.

In [21]:
target_2022.to_csv('predictions.csv', index=False)

## Saving the model for the creation of a website

We want to serve our model from a website, which is done in app.py. We first need to store the model using the pickle library.

In [22]:
#pickle.dump(forest_model, open('model.pkl', 'wb'))
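
As a sketch of the full save and load round trip, the example below pickles a tiny stand-in model to a temporary directory. The LinearRegression model and the temporary path are assumptions for illustration; in this notebook the object would be forest_model and the file model.pkl:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be forest_model
model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0]))

# A temporary directory keeps this sketch from overwriting a real model.pkl
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")

    # The context manager closes the file even if dump() raises
    with open(path, "wb") as f:
        pickle.dump(model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.predict(np.array([[3.0]])))
```

Because the whole pipeline is pickled, the imputer and scaler travel with the model, so app.py would only need to load the file and call predict().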

## Summary

In the end, we got a good RMSE score, easily beating the baseline, and produced predictions for the 2022 data. This concludes the data science process. The trickiest parts were preparing the data, namely organising the rows into hourly data, setting up date and time, and merging the data. Picking the right model and the right input columns was a challenge as well, especially since the RMSE varied with the columns I selected. For instance, including the individual date and time columns affected the RMSE, even though I expected them to add little once scaled. This was a surprise, but I may have been wrong about the scaled date and time columns.

It was also frustrating that I couldn't do higher-degree polynomial regression, as the number of generated columns exceeded my computational resources. It was also annoying to make graphs that ran for 5+ minutes in order to select the best RMSE values for various regression models (60+ minutes for random forest regression). As for variable extraction, I also tried a weekend indicator (1 if yes, 0 if not), but that didn't improve my RMSE, so I discarded it. I also had dummy variables for each day of the month, but removing these improved my RMSE, interestingly, so I discarded them.

It would be nice to have extra data, so I could find out why there is a dip in Volum during the middle of summer. My guess is that people go on vacation and that many cyclists normally bike to work, but I can't know for sure. It would also be nice to have precipitation data, which I reckon would be the best predictor of bike traffic.
If I had unlimited time, I would try to get data on precipitation or other factors, and make use of that. I would also try more regression methods to get an improved RMSE score.

The figures gave me a good understanding of how Volum correlated with the weather conditions. I also learned about the characteristics of the data, especially when taking a closer look. I could have included a more diverse range of figures, but I felt that I got good insight from the ones I made. I wish I could've plotted multiple lines on the same graph, so that they were easier to compare. Unfortunately, the variables are on different scales, and I did not manage to set up multiple y-axes in plotly, although it does support a secondary y-axis.

I wasn't surprised by how well random forest regression worked, given that it averages many decision trees rather than relying on a single one. It is a more complex model that builds on decision trees, so it met expectations in that sense.

Ultimately, this was a valuable introduction to large-scale data science and a good learning experience to take forward in future projects.