# Bike highways - revisit manually

After AWS and PyCaret, it's still nice to do a bit of predicting ourselves. Let's load the same dataset (again) and see if any of the models we saw can provide a good prediction.

In [None]:
import pandas as pd

df = pd.read_csv('files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')

df["date_time"] = df["Datum"] + " " + df["Tijd"]
df["date_time"] = pd.to_datetime(df["date_time"])
df = df.set_index("date_time")
df = df[["Aantal fietsers"]]
df.head()

First, group by month and plot the data.

In [None]:
# Keep rows with at least 30 cyclists, then sum per month ('ME' = month end)
df_time_month = df.loc[df["Aantal fietsers"] >= 30].resample('ME').sum()
df_time_month.plot()

Next, calculate the autocorrelation of this dataset. This should reveal any seasonality in the data.

In [None]:
# DELETE

pd.plotting.autocorrelation_plot(df_time_month['Aantal fietsers'])

It's not as obvious as in the example dataset, but there are definite spikes at lags 12 and 24 (months). This is significant, given that we only have three years' worth of data.
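
To put a number on such a spike, the correlation at a single lag can be computed directly. The sketch below uses a synthetic monthly series (an assumption, not the counter data) with a yearly sine pattern, and compares lag 6 with lag 12 using `Series.autocorr`. The plot normalises slightly differently, so the values will not match it exactly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three years of monthly counts (NOT the real counter data):
# a yearly sine pattern plus a little noise.
rng = np.random.default_rng(0)
months = pd.date_range("2020-01-01", periods=36, freq="MS")
yearly = 1000 * np.sin(2 * np.pi * np.arange(36) / 12)
counts = pd.Series(5000 + yearly + rng.normal(0, 100, 36), index=months)

# Correlation of the series with itself shifted by 6 and by 12 months
lag_6 = counts.autocorr(lag=6)    # half a year apart: opposite phase
lag_12 = counts.autocorr(lag=12)  # a full year apart: same phase

print(f"lag 6: {lag_6:.2f}, lag 12: {lag_12:.2f}")
```

A strongly positive value at lag 12 together with a strongly negative one at lag 6 is the signature of a yearly cycle.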

What if we group the data by day?

In [None]:
# Same filter as before, but summed per day
df_time_day = df.loc[df["Aantal fietsers"] >= 30].resample('D').sum()
# df_time_day.plot()
pd.plotting.autocorrelation_plot(df_time_day['Aantal fietsers'])

The same spikes appear at lags 365 and 730! They are not much higher than in the monthly data, though, so we can simply keep working with the monthly data...

## ARIMA

* [Try this one](https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)
* [Or this one](https://www.geeksforgeeks.org/python-arima-model-for-time-series-forecasting/)

Both require the statsmodels package to check for stationarity. You can install it using pip. If you use the bottom one, you'll also need pmdarima. The solution uses the bottom one.

In [None]:
%pip install statsmodels
%pip install pmdarima

Now run the model. The easiest way to go about it is to reload the CSV file, set the date as index, drop all columns besides "Aantal fietsers" and resample by month.

Then do a seasonal decompose.

In [None]:
# DELETE

from statsmodels.tsa.seasonal import seasonal_decompose

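# Reload the CSV and keep only the date and the cyclist count
df = pd.read_csv('files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')[["Datum", "Aantal fietsers"]]
df = df.loc[df['Aantal fietsers'] > 1]
df["Datum"] = pd.to_datetime(df["Datum"])

# Monthly totals ('ME' = month end; the 'M' alias is deprecated in pandas 2.2+)
df1 = df.resample('ME', on='Datum').sum()

# Additive decomposition into trend, seasonal and residual components
result = seasonal_decompose(df1['Aantal fietsers'], model='additive')
result.plot()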
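
For intuition about what an additive decomposition does, here is a rough hand-rolled sketch on a synthetic series (an assumption, not the counter data). It uses a plain centred 12-month moving average for the trend and per-month averages for the seasonal part. This is a simplification: `seasonal_decompose` uses a slightly different filter for even periods, so its output will not match this exactly.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (illustrative only): linear trend + repeating yearly pattern
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
pattern = np.tile([-3, -2, 0, 2, 4, 5, 5, 4, 2, 0, -2, -3], 3) * 100.0
series = pd.Series(5000 + 10.0 * np.arange(36) + pattern, index=idx)

# Trend: centred 12-month moving average (undefined near both ends of the series)
trend = series.rolling(window=12, center=True).mean()

# Seasonal: average detrended value for each calendar month
detrended = series - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")

# Residual: whatever trend and seasonal component do not explain
residual = series - trend - seasonal

print(residual.dropna().abs().max())
```

Because the synthetic series contains no noise, the residual is essentially zero here. On real data the residual shows what the trend and the yearly pattern leave unexplained.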

Next, try to fit an Arima model on your data.

In [None]:
# DELETE

from pmdarima import auto_arima

import warnings 
warnings.filterwarnings("ignore") 
  
# Fit auto_arima to the monthly bike counter data
stepwise_fit = auto_arima(df1['Aantal fietsers'], start_p = 1, start_q = 1, 
                          max_p = 3, max_q = 3, m = 12, 
                          start_P = 0, seasonal = True, 
                          d = None, D = 1, trace = True, 
                          error_action ='ignore',   # we don't want to know if an order does not work 
                          suppress_warnings = True,  # we don't want convergence warnings 
                          stepwise = True)           # set to stepwise 
  
# To print the summary 
stepwise_fit.summary() 
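
The `D = 1` and `m = 12` arguments ask for one seasonal difference at a lag of 12 months. As a minimal sketch of what that means, assuming a synthetic series with a linear trend and a yearly cycle (not the counter data), the same operation can be done with `Series.diff`:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (illustrative only): upward trend + yearly cycle
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
yearly = 1000 * np.sin(2 * np.pi * np.arange(36) / 12)
series = pd.Series(5000 + 20.0 * np.arange(36) + yearly, index=idx)

# One seasonal difference at lag 12: y[t] - y[t-12]
seasonal_diff = series.diff(12).dropna()

print("observations before:", len(series), "after:", len(seasonal_diff))
print("std before:", round(series.std(), 1), "after:", round(seasonal_diff.std(), 6))
```

The yearly cycle cancels out and only the year-over-year change remains. The price is the first 12 observations, which matters when the series is only three years long.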

Train a model using the parameters you found.

In [None]:
# DELETE

# Split data into train / test sets 
train = df1.iloc[:len(df1)-12] 
test = df1.iloc[len(df1)-12:] # set one year(12 months) for testing 
  
# Fit a SARIMAX(1, 0, 0)x(0, 1, [1], 12) on the training set 
from statsmodels.tsa.statespace.sarimax import SARIMAX 
  
model = SARIMAX(train['Aantal fietsers'],  
                order = (1, 0, 0),  
                seasonal_order =(0, 1, 1, 12)) 
  
result = model.fit() 
result.summary() 


Finally, test the trained model.

In [None]:
# DELETE

start = len(train) 
end = len(train) + len(test) - 1
  
# Predictions for one year against the test set
predictions = result.predict(start, end).rename("Predictions")
  
# plot predictions and actual values 
predictions.plot(legend = True) 
test['Aantal fietsers'].plot(legend = True) 


And what are the MSE and RMSE of our model? We know from the graph that it won't be great, but it should still be workable, no?

In [None]:


# Load specific evaluation tools 
from sklearn.metrics import mean_squared_error 
from statsmodels.tools.eval_measures import rmse 
  
# Calculate root mean squared error 
print(rmse(test["Aantal fietsers"], predictions))
  
# Calculate mean squared error 
print(mean_squared_error(test["Aantal fietsers"], predictions) )
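
To read these two numbers, it helps to remember that the RMSE is simply the square root of the MSE, which puts it back in the unit of the data (cyclists per month). The sketch below works this out by hand on made-up values (an assumption, not model output) and also relates the RMSE to the average monthly total.

```python
import numpy as np

# Made-up monthly totals, purely for illustration (NOT model output)
actual = np.array([5200.0, 6100.0, 7400.0, 6900.0])
predicted = np.array([5000.0, 6500.0, 7000.0, 7100.0])

errors = actual - predicted
mse = np.mean(errors ** 2)   # mean squared error, in cyclists squared
rmse_value = np.sqrt(mse)    # root mean squared error, in cyclists

# RMSE relative to the typical monthly total gives a rough sense of scale
relative = rmse_value / actual.mean()

print(f"MSE: {mse:.0f}, RMSE: {rmse_value:.1f}, relative: {relative:.1%}")
```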
