# INF161: Data Science: Course Project
### *Predicting how many people cycle over Nygårdsbroen at a given time*

> #### by [Mathias Svendsen]
>-------------------------------------------
> *Autumn 2022* 
----------------------------------------------------------------------------

<a id="top"></a> 

<h2> Part 2: Modelling and Predictions</h2>
    
#### *Notebook Index:*
1. [**Exploratory Data Analysis & Feature Engineering**](#analysis) <br>
    1.1 [*Additional features*](#new_features) <br>
    1.2 [*Splitting the dataset*](#splitting) <br>
    1.3 [*Exploring the concatenated traffic and weather data*](#explore_concat) <br>
    1.4 [*Further analysis with regards to the scaling of given data*](#scaling) <br>
    1.5 [*Correlation analysis*](#Correlation-analysis) <br>
    </div>
2. [**Modelling: First Iteration**](#1st-iteration) <br>
    2.1 [*Scaling the data*](#scaling) <br>
    2.2 [*Linear Regression Models*](#linear_regression) <br>
    2.3 [*The Polynomial Model*](#polynomial_model) <br>
    2.4 [*K Nearest Neighbour*](#k_nearest_neighbour) <br>
    2.5 [*The Decision Tree Model*](#decision_tree) <br>
    2.6 [*The Multi Layer Perceptron*](#mlp) <br>
3. [**Final test and evaluation**](#saving) <br>
4. [**Actual Prediction**](#prediction_model) <br>

In [1]:
# Importing notebook dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.dummy import DummyRegressor
from sklearn import utils
from sklearn.pipeline import make_pipeline
from sklearn.gaussian_process import GaussianProcessRegressor

from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LogisticRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.neural_network import MLPRegressor
from math import sqrt, ceil
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from collections import defaultdict
import warnings
import pickle
import copy

# Setting options
plt.style.use('ggplot')
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# Setting random state for all estimators and functions that use randomization
random_state = 35

## <a id="analysis"></a>1) Exploratory Data Analysis

This notebook covers both modelling and prediction, with the goal of estimating how many people cycle over the bridge at a given time. It starts from the prepared data and adds a few time-based features.

### <a id="new_features"></a>1.1) Additional features

In [2]:
# Importing prepared data in .csv format

train_data_df = pd.read_csv('train_data.csv')
val_data_df   = pd.read_csv('val_data.csv')
test_data_df  = pd.read_csv('test_data.csv')

In [3]:
train_data_df.shape

(44825, 5)

In [4]:
train_data_df

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0
...,...,...,...,...,...
44820,2021-12-31 18:00:00,5,6.483333,0.466667,0.0
44821,2021-12-31 19:00:00,4,5.616667,2.650000,0.0
44822,2021-12-31 20:00:00,2,4.700000,1.916667,0.0
44823,2021-12-31 21:00:00,5,4.233333,1.733333,0.0


In [5]:
val_data_df.shape

(5603, 5)

In [6]:
test_data_df.shape

(5604, 5)

Displaying the imported dataframe shows that some metadata has changed since Preparation.ipynb.

In [7]:
train_data_df.shape

(44825, 5)

In [8]:
train_data_df.head()

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0
2,2015-07-16 19:00:00,49,12.683333,2.95,60.0
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0
4,2015-07-16 22:00:00,26,10.616667,1.05,0.0


In [9]:
train_data_df.dtypes

DT                 object
Volum               int64
Lufttemperatur    float64
Vindstyrke        float64
Solskinstid       float64
dtype: object

The datetime index has been reset to a default integer index, and the `DT` column is now of type `object` rather than datetime, which prevents us from using datetime operations on it. We therefore convert it back:

In [10]:
train_data_df['DT'] = pd.to_datetime(train_data_df['DT'])
val_data_df['DT'] = pd.to_datetime(val_data_df['DT'])
test_data_df['DT'] = pd.to_datetime(test_data_df['DT'])
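As an aside, the same conversion can also happen while the file is read. The sketch below is only an illustration on an in-memory CSV: the two rows are copied from the training data shown above, and `sample_df` is a hypothetical name. `parse_dates` is a standard parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Two rows copied from the training data shown above, as an in-memory CSV
csv_text = """DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid
2015-07-16 17:00:00,84,13.866667,3.933333,60.0
2015-07-16 18:00:00,57,13.216667,4.233333,60.0
"""

# parse_dates converts the DT column to a datetime dtype while reading
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['DT'])

print(sample_df.dtypes)
```

With this approach the separate `pd.to_datetime` calls would not be needed, at the cost of repeating the `parse_dates` argument for each file.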

In [11]:
# Checking for any NaN-values before setting the index
train_data_df.isnull().any()

DT                False
Volum             False
Lufttemperatur    False
Vindstyrke        False
Solskinstid       False
dtype: bool

In [12]:
val_data_df.isnull().any()

DT                False
Volum             False
Lufttemperatur    False
Vindstyrke        False
Solskinstid       False
dtype: bool

In [13]:
test_data_df.isnull().any()

DT                False
Volum             False
Lufttemperatur    False
Vindstyrke        False
Solskinstid       False
dtype: bool

None of the three datasets contains missing values, which is a prerequisite for using them as input to our models.


In [14]:
train_data_df

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0
...,...,...,...,...,...
44820,2021-12-31 18:00:00,5,6.483333,0.466667,0.0
44821,2021-12-31 19:00:00,4,5.616667,2.650000,0.0
44822,2021-12-31 20:00:00,2,4.700000,1.916667,0.0
44823,2021-12-31 21:00:00,5,4.233333,1.733333,0.0


Next, we prepare to split each dataset into features and target so that the models can be trained and evaluated.

`X` holds the features (the weather conditions and the time-based columns added below) and `y` holds the target, the number of bicycles per hour (`Volum`).

We also introduce a `weekday` column, a numerical value (0 = Monday, 6 = Sunday) that lets the models capture how the volume varies with the day of the week.

In [15]:
# New column representing each day of the week
train_data_df['weekday'] = train_data_df['DT'].dt.dayofweek
val_data_df['weekday'] = val_data_df['DT'].dt.dayofweek
test_data_df['weekday'] = test_data_df['DT'].dt.dayofweek


train_data_df

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid,weekday
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0,3
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0,3
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0,3
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0,3
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0,3
...,...,...,...,...,...,...
44820,2021-12-31 18:00:00,5,6.483333,0.466667,0.0,4
44821,2021-12-31 19:00:00,4,5.616667,2.650000,0.0,4
44822,2021-12-31 20:00:00,2,4.700000,1.916667,0.0,4
44823,2021-12-31 21:00:00,5,4.233333,1.733333,0.0,4


In [16]:
# New column representing the month
train_data_df['month'] = train_data_df['DT'].dt.month
val_data_df['month'] = val_data_df['DT'].dt.month
test_data_df['month'] = test_data_df['DT'].dt.month

train_data_df

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid,weekday,month
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0,3,7
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0,3,7
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0,3,7
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0,3,7
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0,3,7
...,...,...,...,...,...,...,...
44820,2021-12-31 18:00:00,5,6.483333,0.466667,0.0,4,12
44821,2021-12-31 19:00:00,4,5.616667,2.650000,0.0,4,12
44822,2021-12-31 20:00:00,2,4.700000,1.916667,0.0,4,12
44823,2021-12-31 21:00:00,5,4.233333,1.733333,0.0,4,12


In [17]:
# New column representing the season of the year
# I.e. 1: winter; 2: spring; 3: summer; 4: fall
train_data_df['season'] = train_data_df['DT'].dt.month % 12 // 3 + 1
val_data_df['season'] = val_data_df['DT'].dt.month % 12 // 3 + 1
test_data_df['season'] = test_data_df['DT'].dt.month % 12 // 3 + 1

train_data_df

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid,weekday,month,season
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0,3,7,3
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0,3,7,3
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0,3,7,3
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0,3,7,3
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0,3,7,3
...,...,...,...,...,...,...,...,...
44820,2021-12-31 18:00:00,5,6.483333,0.466667,0.0,4,12,1
44821,2021-12-31 19:00:00,4,5.616667,2.650000,0.0,4,12,1
44822,2021-12-31 20:00:00,2,4.700000,1.916667,0.0,4,12,1
44823,2021-12-31 21:00:00,5,4.233333,1.733333,0.0,4,12,1
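To make the season formula easier to verify, the sketch below applies it to one date per month and also checks the weekday of the first training timestamp. The year 2021 and the names `dates` and `season_by_month` are arbitrary choices for illustration.

```python
import pandas as pd

# One date per calendar month; the year is arbitrary
dates = pd.Series(pd.to_datetime([f'2021-{m:02d}-15' for m in range(1, 13)]))

# Same expression as in the notebook: December wraps around to 0 before the division
season = dates.dt.month % 12 // 3 + 1
season_by_month = dict(zip(dates.dt.month.tolist(), season.tolist()))
print(season_by_month)

# dayofweek counts from Monday = 0, so a Thursday maps to 3
first_weekday = pd.Timestamp('2015-07-16 17:00:00').dayofweek
print(first_weekday)
```

December, January and February map to 1 (winter), March to May to 2, June to August to 3 and September to November to 4, which matches the values in the table above.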


### <a id="splitting"></a>1.2) Splitting the dataset

In [18]:
X_train = train_data_df.drop(['DT', 'Volum'], axis=1)
y_train = train_data_df.drop(['DT', 'Lufttemperatur', 'Vindstyrke', 'Solskinstid', 'weekday', 'month', 'season'], axis=1)

X_val = val_data_df.drop(['DT', 'Volum'], axis=1) 
y_val = val_data_df.drop(['DT', 'Lufttemperatur', 'Vindstyrke', 'Solskinstid','weekday', 'month', 'season'], axis=1)

X_test = test_data_df.drop(['DT', 'Volum'], axis=1) 
y_test = test_data_df.drop(['DT', 'Lufttemperatur', 'Vindstyrke', 'Solskinstid','weekday', 'month', 'season'], axis=1)

In [19]:
# Combine training and validation data. It will be used for the re-training before predicting on test data. 
X_trainval = np.concatenate([X_train, X_val])
y_trainval = np.concatenate([y_train, y_val])

### <a id="explore_concat"></a>1.3) Exploring the concatenated traffic and weather data

In [20]:
display(X_train.describe())
display(y_train.describe())

Unnamed: 0,Lufttemperatur,Vindstyrke,Solskinstid,weekday,month,season
count,44825.0,44825.0,44825.0,44825.0,44825.0,44825.0
mean,8.705159,3.034921,7.519206,3.008009,6.760245,2.550563
std,5.811264,2.059423,17.872903,2.001796,3.454041,1.130157
min,-10.85,0.0,0.0,0.0,1.0,1.0
25%,4.4,1.416667,0.0,1.0,4.0,2.0
50%,8.4,2.616667,0.0,3.0,7.0,3.0
75%,12.9,4.166667,0.0,5.0,10.0,4.0
max,31.833333,14.733333,60.0,6.0,12.0,4.0


Unnamed: 0,Volum
count,44825.0
mean,52.710363
std,71.802716
min,0.0
25%,6.0
50%,27.0
75%,68.0
max,625.0


<br>

[*back to top*](#top) 
## <a id="1st-iteration"></a>2) Modelling: First Iteration

This section tests and compares several models and evaluates their results. All of them are regression models, and they are compared by their RMSE on the validation data.

Apart from the baseline, I will implement and evaluate several different kinds of models, hoping that at least one of them clearly improves on the baseline.

### <a id="scaling"></a>2.1) Scaling the data

The features have very different ranges (for example, `Solskinstid` runs from 0 to 60 while `Vindstyrke` stays below 15), so we standardise them to make them comparable. I've decided to scale all columns together, as this makes the data easier to work with afterwards.

In [21]:
# Scaling with StandardScaler
standardscaler = StandardScaler()

# Fit on X_train, transform X_val and X_test
X_train_scaled = standardscaler.fit_transform(X_train)
X_val_scaled = standardscaler.transform(X_val)
X_test_scaled = standardscaler.transform(X_test)
X_trainval_scaled = standardscaler.transform(X_trainval)

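# Fit on y_train, transform y_val and y_test.
# The same scaler object is refitted here, so from this point on it holds
# the statistics of the target and is used to unscale predictions.
y_train_scaled = standardscaler.fit_transform(y_train)
y_val_scaled = standardscaler.transform(y_val)
y_test_scaled = standardscaler.transform(y_test)
# fit_transform refits the scaler on the combined train+validation target
y_trainval_scaled = standardscaler.fit_transform(y_trainval)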
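The cell above uses a single `StandardScaler` object for both the features and the target, refitting it each time. As a possible alternative, the sketch below keeps one scaler per role, so that fitting one never overwrites the other. The data is synthetic and the names `x_scaler` and `y_scaler` are hypothetical; they are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(35)

# Synthetic stand-ins for the feature matrix and the target column
X_demo = rng.normal(loc=10.0, scale=5.0, size=(100, 3))
y_demo = rng.integers(0, 600, size=(100, 1)).astype(float)

# Hypothetical names: one scaler for the features, one for the target
x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_demo_scaled = x_scaler.fit_transform(X_demo)
y_demo_scaled = y_scaler.fit_transform(y_demo)

# inverse_transform on the target scaler maps scaled values back to counts
y_restored = y_scaler.inverse_transform(y_demo_scaled)
print(np.allclose(y_restored, y_demo))
```

With this layout, predictions would be unscaled with `y_scaler.inverse_transform`, and `x_scaler` would stay available for transforming new feature rows.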

In [22]:
X_train_scaled.shape

(44825, 6)

In [23]:
# Calculate the 3 principal components for independent variables (X)
pca = PCA(n_components=3)
X_train_scaled = pca.fit_transform(X_train_scaled)
X_val_scaled = pca.transform(X_val_scaled)
X_test_scaled = pca.transform(X_test_scaled)
# fit_transform refits the PCA on the combined train+validation features
X_trainval_scaled = pca.fit_transform(X_trainval_scaled)
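The notebook does not report how much of the variance the three components retain. A minimal sketch of how this could be inspected is shown below, on synthetic data with six columns; `X_demo` and `pca_demo` are hypothetical names, and the numbers it prints say nothing about the cycling data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(35)

# Synthetic stand-in with 6 columns, two of them strongly correlated
X_demo = rng.normal(size=(500, 6))
X_demo[:, 5] = X_demo[:, 4] + 0.1 * rng.normal(size=500)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3).fit(X_demo_scaled)

# Fraction of the total variance captured by each retained component
print(pca_demo.explained_variance_ratio_)
print('Total retained:', pca_demo.explained_variance_ratio_.sum())
```

Applied to the fitted `pca` object of this notebook, the same attribute would show whether three components are a reasonable choice.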

The baseline model gives us a reference point: it ignores the features and always predicts the mean of the training target. Any useful model should achieve a lower RMSE than this.

In [24]:
# Creating the baseline model to estimate number of cyclists per hour
# The scaled data is not used here: DummyRegressor ignores the features
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_prediction = baseline.predict(X_val)

print('RMSE_baseline: ', np.round(np.sqrt(mean_squared_error(y_val, baseline_prediction)), decimals=2))

RMSE_baseline:  69.89


As the output above shows, the baseline RMSE is 69.89. Now it's time to compare this with the results of other models:
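To illustrate what the mean baseline computes, here is a small worked example with made-up numbers. The arrays and the name `baseline_demo` are hypothetical and unrelated to the cycling data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up targets: the training mean is 25
y_train_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_val_demo = np.array([15.0, 35.0])

# DummyRegressor ignores the feature values; zeros serve as placeholders
baseline_demo = DummyRegressor(strategy='mean')
baseline_demo.fit(np.zeros((4, 1)), y_train_demo)
pred = baseline_demo.predict(np.zeros((2, 1)))

# Both errors are 10 in absolute value, so the RMSE is 10
rmse = np.sqrt(mean_squared_error(y_val_demo, pred))
print(pred, rmse)
```

Every prediction equals the training mean of 25, so the RMSE simply measures how far the validation targets lie from that mean.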

### <a id="linear_regression"></a>2.2) Linear Regression models

Now it's time to introduce some regression models. Four linear models are compared by their validation RMSE: ordinary linear regression, Lasso, Ridge and ElasticNet. The goal of this part is to determine which of them, if any, is worth developing further.

In [25]:
linreg_model = LinearRegression()
ridge_model = Ridge()
lasso_model = Lasso()
elasticnet_model = ElasticNet(max_iter=4000)

models = [linreg_model, lasso_model, ridge_model, elasticnet_model]


# The following snippet is adapted from Stack Overflow, with minor changes to fit the models used
for model, modelname in zip(models, ["Linear Regression", "Lasso", "Ridge", "ElasticNet"]):
    # Training the model 
    model.fit(X_train_scaled, y_train_scaled)
    
    print(f"""--------------------------------------------------------------------------
    Predicting with {modelname}:""")
    # Prediction on training data
    train_predict = model.predict(X_train_scaled)
    r2_trainpred = r2_score(y_train_scaled, train_predict)
    print(f'R2 for predictions made on training data:', r2_trainpred)

    # Prediction on validation data
    val_predict = model.predict(X_val_scaled)
    
    r2_valpred = r2_score(y_val_scaled, val_predict)
    print(f'R2 for predictions made on validation data:', r2_valpred)

    # Calculating the RMSE for predictions made on validation data
    val_predict_unscaled = standardscaler.inverse_transform(val_predict.reshape(-1,1))
    
    print(f'RMSE for {modelname}: ', sqrt(mean_squared_error(val_predict_unscaled, y_val)))

--------------------------------------------------------------------------
    Predicting with Linear Regression:
R2 for predictions made on training data: 0.1669008956318624
R2 for predictions made on validation data: 0.14446236233261633
RMSE for Linear Regression:  64.63283216098851
--------------------------------------------------------------------------
    Predicting with Lasso:
R2 for predictions made on training data: 0.0
R2 for predictions made on validation data: -0.00014575777797687373
RMSE for Lasso:  69.88579784243885
--------------------------------------------------------------------------
    Predicting with Ridge:
R2 for predictions made on training data: 0.16690089558766952
R2 for predictions made on validation data: 0.14446288155217424
RMSE for Ridge:  64.63281321906207
--------------------------------------------------------------------------
    Predicting with ElasticNet:
R2 for predictions made on training data: 0.0
R2 for predictions made on validation data: -0.

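Ridge and the standard linear regression yield practically identical results (an RMSE of about 64.63) and are the best of the four. Lasso, with an R2 of 0.0 on the training data, does no better than the baseline, and ElasticNet also shows an R2 of 0.0 on the training data. The good news is that linear regression and Ridge both beat the baseline RMSE of 69.89 by about 5 cyclists per hour, which is what we hoped for before doing our calculations.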
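A likely explanation for the R2 of 0.0 reported for Lasso above is the default regularisation strength, `alpha=1.0`. When both the features and the target are standardised, the L1 penalty at this strength outweighs any correlation between a feature and the target, so all coefficients are shrunk to zero. The sketch below shows the effect on synthetic data; the value `alpha=0.01` is an arbitrary illustration, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(35)

# Synthetic data with a moderate linear signal in the first feature
X_demo = rng.normal(size=(1000, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=1000)

X_std = StandardScaler().fit_transform(X_demo)
y_std = StandardScaler().fit_transform(y_demo.reshape(-1, 1)).ravel()

# Default strength (alpha=1.0) versus a much weaker penalty
lasso_default = Lasso(alpha=1.0).fit(X_std, y_std)
lasso_weak = Lasso(alpha=0.01).fit(X_std, y_std)

print('alpha=1.0 :', lasso_default.coef_)
print('alpha=0.01:', lasso_weak.coef_)
```

With a smaller `alpha`, Lasso and ElasticNet might behave more like Ridge on this problem, although that has not been tested in this notebook.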

### <a id="polynomial_model"></a>2.3) Polynomial model

The parameter tuning for this model is done manually rather than with sklearn's built-in functions. I've restricted the degree to the range 1-9, since a polynomial model of a higher degree could overfit the data.

In [26]:
# Some manual parameter tuning for degree k
k = [1,2,3,4,5,6,7,8,9]

scores = []

for num in k:
    model = make_pipeline(PolynomialFeatures(degree=num), LinearRegression())

    model.fit(X_train_scaled, y_train_scaled)

    val_predict = model.predict(X_val_scaled)

    # unscaling
    val_predict_not_scaled = standardscaler.inverse_transform(val_predict.reshape(-1,1))
    
    scores.append(np.round(np.sqrt(mean_squared_error(y_val, val_predict_not_scaled)),
                        decimals=2))
    
    print('RMSE Polynomial Model: ', np.round(np.sqrt(mean_squared_error(y_val, val_predict_not_scaled)),
                        decimals=2))
    
index_min = min(range(len(scores)), key=scores.__getitem__)

print(f"\nPolynomial Model of degree {index_min + 1} returns RMSE of: {min(scores)}")
      


RMSE Polynomial Model:  64.63
RMSE Polynomial Model:  64.35
RMSE Polynomial Model:  64.03
RMSE Polynomial Model:  63.64
RMSE Polynomial Model:  63.3
RMSE Polynomial Model:  63.04
RMSE Polynomial Model:  62.91
RMSE Polynomial Model:  63.27
RMSE Polynomial Model:  63.49

Polynomial Model of degree 7 returns RMSE of: 62.91


The RMSE of the better-performing models so far lies between roughly 63 and 65. Polynomial regression of degree 7 performs best with an RMSE of 62.91, slightly better than the linear regression models (64.63), but the results are still remarkably similar. A polynomial model tends to help when the relationship between the features and the target is not linear.
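One way to see why high degrees invite overfitting is to count how many columns `PolynomialFeatures` generates. The sketch below does this for three inputs, matching the three PCA components kept earlier, and compares the result with the closed-form count. The names `n_inputs` and `counts` are hypothetical.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 3  # number of PCA components kept in this notebook

counts = {}
for degree in range(1, 10):
    # Fitting on a dummy row is enough to determine the output width
    poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_inputs)))
    counts[degree] = poly.n_output_features_

for degree, n_features in counts.items():
    # Closed form for comparison: C(n_inputs + degree, degree)
    print(degree, n_features, comb(n_inputs + degree, degree))
```

At degree 7 the three inputs expand to 120 columns and at degree 9 to 220, so the linear regression inside the pipeline has far more coefficients to fit than the original three.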

### <a id="k_nearest_neighbour"></a>2.4) K Nearest Neighbours

For the K Nearest Neighbours model I will tune the number of neighbours with an iterative approach:

Note: Another solution would be to use sklearn's `GridSearchCV` class (in `sklearn.model_selection`), but it is not used here.

In [27]:
knn_rmse = 100000 # setting the base RMSE to a high number
best_k = None

for k in range(5, 51):
    
    knn = KNeighborsRegressor(n_neighbors=k)

    knn.fit(X_train_scaled, y_train_scaled)

    knn_predict = knn.predict(X_val_scaled)

    # unscaling
    knn_predict_not_scaled = standardscaler.inverse_transform(knn_predict.reshape(-1,1))
    
    
    new_rmse = np.round(np.sqrt(mean_squared_error(y_val, knn_predict_not_scaled)),
                        decimals=2)
    # print(f"for {k} neighbours the rmse is: {new_rmse}")
    
    if new_rmse < knn_rmse:
        knn_rmse = new_rmse
        best_k = k

print(f"\nRMSE K nearest neighbour: {knn_rmse} for {best_k} neighbours yields the best result")


RMSE K nearest neighbour: 62.79 for 32 neighbours yields the best result


The search shows that k = 32 gives the lowest validation RMSE (62.79), well inside the tested range rather than at its upper edge. With too few neighbours the predictions become noisy and could give misleading readings, so the search range was set to between 5 and 50 for this model.
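For comparison, the `GridSearchCV` alternative mentioned in the note above could look roughly like the sketch below. It runs on synthetic data, and it differs from the manual loop in one important way: it picks k by cross-validation on the training data instead of using the fixed validation set. The names `X_demo`, `y_demo` and `search` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(35)

# Synthetic stand-in for the scaled training data
X_demo = rng.normal(size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng.normal(size=300)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': list(range(5, 51))},
    # scikit-learn maximises scores, so the RMSE is negated
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X_demo, y_demo)

print('Best k:', search.best_params_['n_neighbors'])
print('Cross-validated RMSE:', -search.best_score_)
```

Because the selection criterion differs, the best k found this way would not necessarily match the one from the manual search.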

### <a id="decision_tree"></a>2.5) The Decision Tree Model

In [28]:
dtm = DecisionTreeRegressor(random_state=random_state)

dtm.fit(X_train_scaled, y_train_scaled)

dtm_predict = dtm.predict(X_val_scaled)

# unscaling
dtm_predict_not_scaled = standardscaler.inverse_transform(dtm_predict.reshape(-1,1))

print('RMSE Decision Tree model: ', np.round(np.sqrt(mean_squared_error(y_val, dtm_predict_not_scaled)),
                        decimals=2))

RMSE Decision Tree model:  89.07


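It is apparent that the decision tree model is not well suited to this dataset in its default configuration: with an RMSE of 89.07 it performs worse than all the other models in this notebook, including the baseline (69.89). An unrestricted tree can grow until it fits the training data very closely, so overfitting is a likely cause.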
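One possible reason for the weak result is that a decision tree without a depth limit can fit its training data almost perfectly, noise included. The sketch below illustrates this on synthetic data by comparing an unrestricted tree with a depth-limited one. The value `max_depth=4` is an arbitrary illustration, not a tuned setting, and the effect on the cycling data has not been tested here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(35)

# Synthetic data: a smooth signal plus noise
X_demo = rng.normal(size=(500, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * rng.normal(size=500)
X_fit, y_fit = X_demo[:400], y_demo[:400]
X_hold, y_hold = X_demo[400:], y_demo[400:]

full_tree = DecisionTreeRegressor(random_state=35).fit(X_fit, y_fit)
# max_depth=4 is an arbitrary illustrative value
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=35).fit(X_fit, y_fit)

for name, tree in [('unrestricted', full_tree), ('max_depth=4', shallow_tree)]:
    print(name, 'depth:', tree.get_depth(),
          'train R2:', round(tree.score(X_fit, y_fit), 3),
          'held-out R2:', round(tree.score(X_hold, y_hold), 3))
```

A training R2 close to 1 combined with a much lower held-out R2 is the typical signature of overfitting; parameters such as `max_depth` or `min_samples_leaf` are the usual way to limit it.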


### <a id="mlp"></a>2.6) The Multi Layer Perceptron Model

In [29]:
mlp = MLPRegressor(shuffle=False, random_state=random_state)

mlp.fit(X_train_scaled, y_train_scaled)

mlp_predict = mlp.predict(X_val_scaled)

# unscaling
mlp_predict_not_scaled = standardscaler.inverse_transform(mlp_predict.reshape(-1,1))

print('RMSE Multi Layer Perceptron model: ', np.round(np.sqrt(mean_squared_error(y_val, mlp_predict_not_scaled)),
                        decimals=2))

RMSE Multi Layer Perceptron model:  63.26


The Multi Layer Perceptron model performs similarly to K Nearest Neighbours, albeit slightly worse (63.26 versus 62.79). I will therefore proceed with the latter.

In [30]:
# Also implemented this model. But it was incredibly slow.

#gp = GaussianProcessRegressor(random_state = random_state)

#gp.fit(X_train_scaled, y_train_scaled)

#gp_predict = gp.predict(X_val_scaled)

# unscaling 
#gp_predict_not_scaled = standardscaler.inverse_transform(gp_predict.reshape(-1,1))


#print('RMSE Gaussian Process model: ', np.round(np.sqrt(mean_squared_error(y_val, gp_predict_not_scaled)),
                        #decimals=2))

## <a id="saving"></a>3) Final test and evaluation

Since our K Nearest Neighbours model yields the lowest validation RMSE, I will now retrain it on the combined training and validation data and evaluate it on the test data.

In [31]:
# Instantiating a new K Nearest Neighbours model with optimized parameters

best_knn = KNeighborsRegressor(n_neighbors=28)

# fitting the model
best_knn.fit(X_trainval_scaled, y_trainval_scaled)

# Predictions 
knn_predict_training_data = best_knn.predict(X_trainval_scaled)


# Now it's time to measure our RMSE for the training data. Hence, the data needs to be unscaled
train_predict_unscaled = standardscaler.inverse_transform(knn_predict_training_data)

train_data_rmse = np.round(np.sqrt(mean_squared_error(train_predict_unscaled, y_trainval)),
                        decimals=2)

print(f"RMSE for predictions on training data: {train_data_rmse}")

# Predicting on the test data for the first time
test_predict = best_knn.predict(X_test_scaled)

# Measuring test data RMSE
test_predict_unscaled = standardscaler.inverse_transform(test_predict)

test_data_rmse = np.round(np.sqrt(mean_squared_error(test_predict_unscaled, y_test)),
                        decimals=2)

print(f"\nFinal RMSE for predictions on test data with best model: \n{test_data_rmse}")

RMSE for predictions on training data: 61.85

Final RMSE for predictions on test data with best model: 
66.79


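There is a gap between the two results: the test RMSE (66.79) is about 5 higher than the RMSE on the data the model was trained on (61.85). Some gap is expected, since the model has already seen the training data.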

## <a id="prediction_model"></a>4) Actual Prediction

As the K Nearest Neighbour model performed best in the tests above, it will be used to predict the volume of cyclists passing the bridge for the 2022 data.

In [50]:
# Collecting data prior to 2022
combined_data = pd.concat([train_data_df, val_data_df, test_data_df])

# Collecting 2022 data that is used to complete our model
combined_data_2022 = pd.read_csv('combined_data_2022.csv', index_col=0)

# 2022 dates
dates_2022 = combined_data_2022.DT.reset_index(drop=True)

In [51]:
combined_data

Unnamed: 0,DT,Volum,Lufttemperatur,Vindstyrke,Solskinstid,weekday,month,season
0,2015-07-16 17:00:00,84,13.866667,3.933333,60.0,3,7,3
1,2015-07-16 18:00:00,57,13.216667,4.233333,60.0,3,7,3
2,2015-07-16 19:00:00,49,12.683333,2.950000,60.0,3,7,3
3,2015-07-16 20:00:00,45,12.066667,2.483333,36.0,3,7,3
4,2015-07-16 22:00:00,26,10.616667,1.050000,0.0,3,7,3
...,...,...,...,...,...,...,...,...
5599,2016-06-28 14:00:00,70,14.550000,3.900000,14.2,1,6,3
5600,2020-08-17 00:00:00,3,15.866667,2.050000,0.0,0,8,3
5601,2020-05-06 12:00:00,56,9.933333,3.750000,12.3,2,5,2
5602,2017-08-29 08:00:00,324,12.783333,0.716667,0.0,1,8,3


In [52]:
combined_data_2022

Unnamed: 0,DT,Lufttemperatur,Vindstyrke,Solskinstid
24,2022-01-01 01:00:00,2.666667,1.233333,0.0
25,2022-01-01 02:00:00,2.316667,1.233333,0.0
26,2022-01-01 03:00:00,1.733333,0.900000,0.0
27,2022-01-01 04:00:00,1.100000,0.950000,0.0
28,2022-01-01 05:00:00,0.350000,0.916667,0.0
...,...,...,...,...
3572,2022-05-28 22:00:00,7.450000,4.983333,0.0
3573,2022-05-28 23:00:00,7.120000,5.000000,0.0
3574,2022-05-29 00:00:00,7.016667,4.566667,0.0
3575,2022-05-29 01:00:00,6.816667,3.533333,0.0


In [53]:
# Resetting DT column to datetime object
combined_data['DT'] = pd.to_datetime(combined_data['DT'])
combined_data_2022['DT'] = pd.to_datetime(combined_data_2022['DT'])

In [54]:
# Creating new training set
display(combined_data_2022.head())

# New training set is the old training, validation and test set
X_train = combined_data.drop(["DT", "Volum"], axis=1)
y_train = combined_data.drop(["DT", "Lufttemperatur", "Vindstyrke", "Solskinstid", "weekday", "month", "season"], axis=1)

# New test set is the 2022 data (without the DT column)
X_test = combined_data_2022.copy()

# Adding additional features for 2022 data
X_test['weekday'] = X_test['DT'].dt.dayofweek
X_test['month'] = X_test['DT'].dt.month
X_test['season'] = X_test['DT'].dt.month % 12 // 3 + 1

X_test.drop("DT", axis=1, inplace=True)

# Converting to numeric value
#cols = combined_data_2022.columns.drop('Volum')
#combined_data_2022['Volum'] = pd.to_numeric(combined_data_2022['Volum'])


print(f"2022 data has {len(X_test.columns)} features.")
print(f"X_train has {len(X_train.columns)} features.")

display(X_train.head())
display(y_train.head())

Unnamed: 0,DT,Lufttemperatur,Vindstyrke,Solskinstid
24,2022-01-01 01:00:00,2.666667,1.233333,0.0
25,2022-01-01 02:00:00,2.316667,1.233333,0.0
26,2022-01-01 03:00:00,1.733333,0.9,0.0
27,2022-01-01 04:00:00,1.1,0.95,0.0
28,2022-01-01 05:00:00,0.35,0.916667,0.0


2022 data has 6 features.
X_train has 6 features.


Unnamed: 0,Lufttemperatur,Vindstyrke,Solskinstid,weekday,month,season
0,13.866667,3.933333,60.0,3,7,3
1,13.216667,4.233333,60.0,3,7,3
2,12.683333,2.95,60.0,3,7,3
3,12.066667,2.483333,36.0,3,7,3
4,10.616667,1.05,0.0,3,7,3


Unnamed: 0,Volum
0,84
1,57
2,49
3,45
4,26


In [59]:
# Note: some of this code is collected from an earlier lab:

model = KNeighborsRegressor(n_neighbors=28)

# Normalising training and 2022 data with StandardScaler
X_train_scaled = standardscaler.fit_transform(X_train)
X_test_scaled = standardscaler.transform(X_test)
# The scaler is refitted on the target so it can unscale the predictions below
y_train_scaled = standardscaler.fit_transform(y_train)

# Fitting on all data from before 2022 (training, validation and test sets)
model.fit(X_train_scaled, y_train_scaled)

# Prediction on training data
train_predict = model.predict(X_train_scaled)

# Prediction on 2022 (test) data
test_predict = model.predict(X_test_scaled)

# Unscale the predictions
test_predict_unscaled = standardscaler.inverse_transform(test_predict)

# Round predictions to nearest integer
test_predict = np.rint(test_predict_unscaled)

# Store 2022 predictions in a .csv file
predictions_df = pd.concat([dates_2022,pd.DataFrame(test_predict, columns=["predicted_volume"])], axis=1)
display(predictions_df)

predictions_df.to_csv('Predictions.csv', index=False)

Unnamed: 0,DT,predicted_volume
0,2022-01-01 01:00:00,13.0
1,2022-01-01 02:00:00,12.0
2,2022-01-01 03:00:00,9.0
3,2022-01-01 04:00:00,9.0
4,2022-01-01 05:00:00,8.0
...,...,...
3512,2022-05-28 22:00:00,23.0
3513,2022-05-28 23:00:00,24.0
3514,2022-05-29 00:00:00,27.0
3515,2022-05-29 01:00:00,20.0


Finally, let's save the model for the Flask application.

In [None]:
# Instantiating K Nearest Neighbour
model = KNeighborsRegressor(n_neighbors=28)

# Fitting model on the unscaled training data
model.fit(X_train, y_train)

# Saving model for the Flask application; the context manager closes the file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
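For completeness, here is a minimal sketch of the round trip a consumer of the pickle file, such as the Flask application, would rely on: dump, load, predict. It uses synthetic data and a temporary file so that the real `model.pkl` is not touched; `model_demo`, `loaded_model` and the file name `model_demo.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(35)

# Synthetic stand-in for the six feature columns and the Volum target
X_demo = rng.normal(size=(200, 6))
y_demo = rng.integers(0, 300, size=200).astype(float)

model_demo = KNeighborsRegressor(n_neighbors=28).fit(X_demo, y_demo)

# Hypothetical file in a temporary directory, removed again afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model_demo, f)
    with open(path, 'rb') as f:
        loaded_model = pickle.load(f)

# The loaded model should reproduce the original predictions
same = np.allclose(model_demo.predict(X_demo[:5]), loaded_model.predict(X_demo[:5]))
print(same)
```

Whoever loads the real model would need to pass the six features in the same column order as `X_train` (`Lufttemperatur`, `Vindstyrke`, `Solskinstid`, `weekday`, `month`, `season`) and unscaled, since the saved model is fitted on unscaled data.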