# Bike highways - revisit in PyCaret

We already analyzed the bike-highway data a bit. It was a time series, so let's focus on the counting-point nearest to our school and see if we can predict it. And by "we" I mean PyCaret.

Let's look at the [official PyCaret-documentation](https://pycaret.gitbook.io/docs/get-started/quickstart#time-series). We won't bother with the stock example this time but start right away with our own data.

In [1]:
import pandas as pd

df = pd.read_csv('files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')
df.head()

Unnamed: 0,Meetpunt surrogate key,Meetpunt locatie,Datum,Tijd,Aantal fietsers,Aantal fietsers van,Aantal fietsers naar,Meetpunt code
0,89,Rauwelkoven 54,2020-02-14,00:00:00,0,0,0,FMN GV 21 Geel
1,89,Rauwelkoven 54,2020-02-14,01:00:00,0,0,0,FMN GV 21 Geel
2,89,Rauwelkoven 54,2020-02-14,02:00:00,0,0,0,FMN GV 21 Geel
3,89,Rauwelkoven 54,2020-02-14,03:00:00,2,2,0,FMN GV 21 Geel
4,89,Rauwelkoven 54,2020-02-14,04:00:00,0,0,0,FMN GV 21 Geel


Once again we need to do some cleaning. As we can read [here](https://pycaret.gitbook.io/docs/learn-pycaret/official-blog/time-series-forecasting-with-pycaret-regression), PyCaret can't deal with dates, so we'll have to store the parts of the date and time (year, month, day, hour) in separate columns. We're also only interested in "Aantal fietsers" (number of cyclists), not the "van" (from) and "naar" (to) columns. The "Meetpunt surrogate key", "Meetpunt locatie" and "Meetpunt code" columns are always the same, so you can drop these as well.

Also rename "Aantal fietsers" to "nr_cyclists". It'll be easier to work with.

In [2]:
# DELETE

# "Datum" only holds the date, so the hour has to come from "Tijd".
dates = pd.to_datetime(df['Datum'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day
df['hour'] = pd.to_datetime(df['Tijd'], format='%H:%M:%S').dt.hour
df = df.rename(columns={"Aantal fietsers": "nr_cyclists"})

# Drop the raw date/time columns and everything that is constant or unused.
df = df.drop(columns=['Datum', 'Tijd', 'Meetpunt surrogate key', 'Meetpunt locatie','Aantal fietsers van','Aantal fietsers naar','Meetpunt code'])
df.head()

Unnamed: 0,nr_cyclists,year,month,day,hour
0,0,2020,2,14,0
1,0,2020,2,14,1
2,0,2020,2,14,2
3,2,2020,2,14,3
4,0,2020,2,14,4
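A minimal sketch of the date handling on a hypothetical three-row frame. The name `toy` and its values are made up for illustration; only the column names "Datum" and "Tijd" come from the data above. It assumes both columns hold strings, combines them into one timestamp, and reads the parts through the `.dt` accessor. Note that "Datum" alone carries no time of day, so the hour has to come from "Tijd".

```python
import pandas as pd

# Hypothetical miniature of the counter data; values are illustrative only.
toy = pd.DataFrame({
    "Datum": ["2020-02-14", "2020-02-14", "2020-02-15"],
    "Tijd": ["00:00:00", "03:00:00", "17:00:00"],
})

# Combine date and time into one timestamp, then split it into its parts.
timestamp = pd.to_datetime(toy["Datum"] + " " + toy["Tijd"])
toy["year"] = timestamp.dt.year
toy["month"] = timestamp.dt.month
toy["day"] = timestamp.dt.day
toy["hour"] = timestamp.dt.hour

print(toy[["year", "month", "day", "hour"]])
```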


Finally, group this data so you're working with the monthly totals, not the hourly data. Otherwise you'll be predicting way too many zeros.

In [6]:
# Sum the hourly counts per (year, month); the day and hour columns are not needed.
df_monthly = df.groupby(["year", "month"], as_index=False)["nr_cyclists"].sum()
df_monthly.head()

Unnamed: 0,year,month,nr_cyclists
0,2020,2,2067
1,2020,4,26496
2,2020,5,22445
3,2020,6,23922
4,2020,7,20133
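The output above jumps from February 2020 straight to April 2020, so at least one month has no rows. Forecasting models generally work best with evenly spaced observations, so it can be worth listing such gaps before modelling. A small sketch in plain Python; `missing_months` is a hypothetical helper, not part of pandas or PyCaret:

```python
def missing_months(observed):
    """Return the (year, month) pairs absent between the first and last entry."""
    observed = sorted(set(observed))
    if not observed:
        return []
    seen = set(observed)
    (year, month), last = observed[0], observed[-1]
    gaps = []
    while (year, month) <= last:
        if (year, month) not in seen:
            gaps.append((year, month))
        # Step one calendar month forward.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps

# The first five (year, month) pairs from the table above.
head = [(2020, 2), (2020, 4), (2020, 5), (2020, 6), (2020, 7)]
print(missing_months(head))
```

On the real frame you could probably pass `zip(df_monthly["year"], df_monthly["month"])` to check the whole series.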


Next up is PyCaret! Some of these steps will take a while. If you [have better things to do](https://www.youtube.com/watch?v=nLJ8ILIE780), save the last variable you made (the setup or the best model) in a [pickle](https://www.geeksforgeeks.org/how-to-use-pickle-to-save-and-load-variables-in-python/) file.
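One way to make that habit automatic is a small load-or-compute helper built on the standard-library `pickle` module. This is only a sketch: `load_or_compute` and the file name are hypothetical, and it assumes the object you store can be pickled.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Return the pickled object at `path`, or build it with `compute()` and store it."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute()
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result

# Stand-in for a slow step; in the notebook this could wrap a call such as compare_models.
totals = load_or_compute("demo_cache.pkl", lambda: {"2020-02": 2067})
print(totals)
```

The second time the cell runs, the file already exists, so the slow step is skipped and the stored result is returned.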

First, set up the experiment using the setup function.

In [7]:
# DELETE
from pycaret.time_series import *

s = setup(df_monthly, fh = 3, fold = 5, session_id = 123, target="nr_cyclists")

Unnamed: 0,Description,Value
0,session_id,123
1,Target,nr_cyclists
2,Approach,Univariate
3,Exogenous Variables,Present
4,Original data shape,"(43, 3)"
5,Transformed data shape,"(43, 3)"
6,Transformed train set shape,"(40, 3)"
7,Transformed test set shape,"(3, 3)"
8,Rows with missing values,0.0%
9,Fold Generator,ExpandingWindowSplitter
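To get a feel for what `fold = 5` and `fh = 3` mean with an expanding window, here is a plain-Python sketch of one way such folds can be laid out over the 40 training rows. It assumes the folds are aligned to the end of the training set and move forward by `fh` rows each time. `expanding_folds` is a hypothetical helper, and PyCaret's own splitter may place the windows differently.

```python
def expanding_folds(n_train, fh, n_folds):
    """Yield (train_indices, test_indices) with a growing training window."""
    # Assumption: the last fold's test window ends at the last training row.
    first_cutoff = n_train - n_folds * fh
    if first_cutoff <= 0:
        raise ValueError("not enough rows for this many folds")
    for k in range(n_folds):
        cutoff = first_cutoff + k * fh
        yield range(0, cutoff), range(cutoff, cutoff + fh)

for train, test in expanding_folds(n_train=40, fh=3, n_folds=5):
    print(f"train rows 0..{train[-1]}  ->  test rows {test[0]}..{test[-1]}")
```

Under these assumptions the first fold trains on only 25 months, and every fold is scored on just three test months.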


Next up, compare the different models. We're predicting based on the monthly data, which gives us only 43 data points: 40 to train on and 3 to test on. This is not nearly enough, but as a POC it'll do.

Also, pass the option "n_select=5" to compare_models, so that it returns the five best models as a list instead of only the single best one.

In [8]:
best = compare_models(sort = 'MAE', n_select=5)

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2,TT (Sec)
gbr_cds_dt,Gradient Boosting w/ Cond. Deseasonalize & Detrending,0.7712,0.6906,5956.5454,6610.1847,0.1938,0.2138,-12.3424,0.048
ada_cds_dt,AdaBoost w/ Cond. Deseasonalize & Detrending,0.8298,0.7262,6389.4844,6932.9224,0.1953,0.2189,-16.6429,0.052
rf_cds_dt,Random Forest w/ Cond. Deseasonalize & Detrending,0.8865,0.7771,6835.1022,7420.5291,0.2113,0.2477,-15.7894,0.08
dt_cds_dt,Decision Tree w/ Cond. Deseasonalize & Detrending,0.9156,0.8674,7161.9308,8364.3143,0.2264,0.2729,-17.9133,0.034
omp_cds_dt,Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending,0.9336,0.816,7285.1628,7867.3416,0.2283,0.2469,-28.5719,0.032
br_cds_dt,Bayesian Ridge w/ Cond. Deseasonalize & Detrending,0.9464,0.8241,7383.7618,7945.8919,0.2301,0.2507,-29.998,0.038
huber_cds_dt,Huber w/ Cond. Deseasonalize & Detrending,0.997,0.8601,7747.689,8268.242,0.2414,0.2562,-30.5327,0.032
en_cds_dt,Elastic Net w/ Cond. Deseasonalize & Detrending,1.0109,0.8706,7902.4904,8408.7235,0.2464,0.2753,-33.2562,0.118
knn_cds_dt,K Neighbors w/ Cond. Deseasonalize & Detrending,1.0219,0.8937,7942.2503,8599.3869,0.2503,0.2741,-24.7515,0.052
et_cds_dt,Extra Trees w/ Cond. Deseasonalize & Detrending,1.0556,0.9237,8242.7958,8901.2829,0.2729,0.3409,-20.0125,0.068
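Two of the columns above are worth unpacking. MAE is the mean absolute error, here in cyclists per month. MASE divides that error by the MAE of a naive forecast on the training data, so a value below 1 means the model beats the naive forecast. A plain-Python sketch with made-up numbers, using the non-seasonal naive forecast "repeat the previous value" as the baseline. That baseline is an assumption (PyCaret may use a seasonal variant), and the function names are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mase(actual, predicted, train):
    """MAE scaled by the in-sample MAE of a 'repeat the previous value' forecast."""
    naive_mae = sum(abs(b - a) for a, b in zip(train, train[1:])) / (len(train) - 1)
    return mae(actual, predicted) / naive_mae

# Illustrative numbers only, not taken from the bike data.
train = [100, 120, 90, 110]
actual = [105, 95]
predicted = [110, 100]

print("MAE: ", mae(actual, predicted))
print("MASE:", round(mase(actual, predicted, train), 3))
```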


In [9]:
import pickle 

with open('best_model.pkl', 'wb') as file: 
    pickle.dump(best, file) 

In [10]:
import pickle

with open('best_model.pkl', 'rb') as file:
    # Call the load method to deserialize the stored models
    best_2 = pickle.load(file)

print(best_2)

[BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [1]},
                                                   n_jobs=1)],
                    regressor=GradientBoostingRegressor(random_state=123),
                    window_length=1), BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [1]},
                                                   n_jobs=1)],
                    regressor=AdaBoostRegressor(random_state=123),
                    window_length=1), BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [1]},
                                                   n_jobs=1)],
                    regressor=RandomForestRegressor(n_jobs=-1, random_state=123),
                    window_length=1), BaseCdsDtForecaster(fe_target_rr=[WindowSummarizer(lag_feature={'lag': [1]},
                                                   n_jobs=1)],
                    regressor=DecisionTreeRegressor(random_state=123),
                    window_len

Predict 6 months into the future!

In [11]:
# DELETE
plot_model(best[0], plot = 'forecast', data_kwargs = {'fh' : 6})

Now compare this model to the other four you stored. Some give a plausible forecast, others are plain bad. Still, it's a good start.

In [15]:
# DELETE
plot_model(best[1], plot = 'forecast', data_kwargs = {'fh' : 6})

There are a lot of other plots to be made. Experiment a bit!

In [None]:
#DELETE

