# Bike highways - revisit manually

After AWS and PyCaret, it's still nice to do a bit of predicting ourselves. Let's load the same dataset again and see if any of the models we saw can provide a good prediction.

In [None]:
import pandas as pd

df = pd.read_csv('../files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')

df["date_time"] = df["Datum"] + " " + df["Tijd"]
df["date_time"] = pd.to_datetime(df["date_time"])
df = df.set_index("date_time")
df = df[["Aantal fietsers"]]
df.head()

First, group by month and plot the data.

In [None]:
#DELETE
df_time_month = df.loc[ df["Aantal fietsers"] >= 30 ].resample('ME').sum().reset_index().set_index(['date_time'])
df_time_month.plot()

Next, calculate the autocorrelation on this dataset. This should reveal any seasonality in the data.

In [None]:
# DELETE

pd.plotting.autocorrelation_plot(df_time_month['Aantal fietsers'])

It's not as obvious as in the example dataset, but there is a definite spike at lags 12 and 24 (months). This is notable because we only have about three years' worth of data.
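
To make the idea of a spike at lag 12 concrete, here is a minimal sketch on a synthetic series (not the bike counter data). It assumes a perfectly regular yearly pattern and uses `Series.autocorr`, which computes a plain Pearson correlation between the series and a shifted copy of itself. The values in `autocorrelation_plot` are estimated slightly differently, so they will not match these numbers exactly.

```python
import numpy as np
import pandas as pd

# Synthetic monthly counts: 4 years with a clean 12-month cycle (assumption, toy data)
months = np.arange(48)
monthly = pd.Series(1000 + 300 * np.sin(2 * np.pi * months / 12))

# Correlation of the series with itself, shifted by a full year and by half a year
lag_12 = monthly.autocorr(lag=12)
lag_6 = monthly.autocorr(lag=6)

print(f"lag 12: {lag_12:.2f}, lag 6: {lag_6:.2f}")
```

A value close to 1 at lag 12 means that each month resembles the same month one year earlier, while a value close to -1 at lag 6 reflects the opposite season.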

What if we group the data by day?

In [None]:
#DELETE
df_time_day = df.loc[ df["Aantal fietsers"] >= 30 ].resample('D').sum().reset_index().set_index(['date_time'])
# df_time_day.plot()
pd.plotting.autocorrelation_plot(df_time_day['Aantal fietsers'])

The same spikes appear at lags 365 and 730 (days)! They are not much higher, though, so we can simply keep working with the monthly data...

## ARIMA

We'll apply an ARIMA model to predict the number of cyclists per month.

Follow these steps:

* Reload the CSV file
* Set the date as index
* Drop all columns besides "Aantal fietsers"
* Resample as months
* Show the top 5 rows

In [None]:
# DELETE

import pandas as pd

df = pd.read_csv('../files/bike_counters_data/Measured data-nl-Geel_FMN GV 21 Geel.csv')[ ["Datum", "Aantal fietsers"] ]
df = df.loc[ df['Aantal fietsers'] > 1 ]
df["Datum"] = pd.to_datetime(df["Datum"])
df_months = df.resample('ME', on='Datum').sum()

df_months.head()

Good start. For auto-arima, our dataset should have:
* A date column named ds (currently the index)
* A value column named y (currently "Aantal fietsers")
* A column called "unique_id" that contains a value (the same for all rows)

The ds column (the month) should be stored as a datetime. Drop all other columns.
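
The layout described above can be sketched on a small toy frame. The numbers and the `toy` and `long_df` names are made up for illustration; only the column names ds, y and unique_id come from the requirements listed above.

```python
import pandas as pd

# Toy monthly totals with the date as index, mimicking the resampled frame (made-up values)
toy = pd.DataFrame(
    {"Aantal fietsers": [1200, 1350, 1500]},
    index=pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
)

long_df = (
    toy.rename(columns={"Aantal fietsers": "y"})  # value column -> y
    .rename_axis("ds")                            # name the date index ds
    .reset_index()                                # turn the index into a regular column
    .assign(unique_id="Cyclists")                 # same identifier on every row
    .loc[:, ["unique_id", "ds", "y"]]             # keep only the three required columns
)

print(long_df)
```

The unique_id column looks redundant with a single series, but it is what allows the same layout to hold several series in one frame.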

In [None]:
#DELETE
df_months["ds"]= df_months.index
df_months["y"]= df_months["Aantal fietsers"]
df_months['ds'] = pd.to_datetime(df_months['ds'])
df_months['unique_id'] = 'Cyclists'

df_months = df_months[["ds", "unique_id", "y"]]
# Sort by date just to be sure
df_months = df_months.sort_values('ds')
df_months.head()

We should tell the model the frequency of our data. Here it is monthly, because we resampled it that way, but let's determine it automatically anyway.

In [None]:
freq = pd.infer_freq(df_months['ds'])
freq
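
As a small sketch of what `pd.infer_freq` does, the example below uses hand-written month-end dates rather than the bike data. The returned alias depends on the pandas version: recent versions report month-end as 'ME', older ones as 'M'. When a month is missing, no regular frequency can be inferred and the function returns None.

```python
import pandas as pd

# Regular month-end dates (toy values)
regular = pd.to_datetime(
    ["2021-01-31", "2021-02-28", "2021-03-31", "2021-04-30", "2021-05-31"]
)
# The same dates with March removed, so the spacing is no longer regular
with_gap = pd.to_datetime(["2021-01-31", "2021-02-28", "2021-04-30", "2021-05-31"])

regular_freq = pd.infer_freq(regular)  # month-end alias, 'ME' or 'M' depending on pandas version
gap_freq = pd.infer_freq(with_gap)     # None: no single frequency fits

print(regular_freq, gap_freq)
```

If the inferred frequency comes back as None on real data, that is usually a hint that some periods are missing after resampling or filtering.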

Now chop off the last 12 months as a test set (df_test) and put the remainder in df_train.

In [None]:
#DELETE
df_train = df_months[:-12]
df_test = df_months[-12:]

And now we're ready to create and fit the model!

In [None]:
#DELETE
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# Initialize forecast engine
sf = StatsForecast(
    models=[AutoARIMA(season_length=12)],  # adjust season_length if needed
    freq=freq,
    n_jobs=-1  # use all CPUs
)

# Fit model on training data
sf_fitted = sf.fit(df_train)


And now predict the last 12 months of our data.

In [None]:
#DELETE
# Forecast next 12 periods
df_forecast = sf_fitted.predict(h=12)
print(df_forecast)


Calculate the RMSE.

In [None]:
#DELETE
from sklearn.metrics import mean_squared_error
import numpy as np

# Forecast the next 12 months
df_forecast = sf.predict(h=12)
df_forecast["y"] = df_forecast["AutoARIMA"]

rmse = np.sqrt(mean_squared_error(df_test["y"], df_forecast["y"]))
print(f"RMSE on test set: {rmse:.4f} cyclists")
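
For reference, the RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand with NumPy on made-up numbers (not the bike data) and also divides it by the mean of the actual values, which is one simple way to judge whether an RMSE is large relative to the scale of the data.

```python
import numpy as np

# Made-up actual and predicted monthly counts, for illustration only
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Root mean squared error: sqrt(mean((actual - predicted)^2))
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))

# RMSE relative to the average level of the data
relative_rmse = rmse_manual / actual.mean()

print(f"RMSE: {rmse_manual:.2f}, relative to mean: {relative_rmse:.1%}")
```

Because the RMSE is expressed in the same unit as the data (cyclists per month here), a large absolute value can still be a modest relative error when the monthly totals themselves are large.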

Kind of a big number. Maybe the graph will enlighten us?

In [None]:
#DELETE
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(df_months['ds'], df_months['y'], label='Original')
plt.plot(df_forecast['ds'], df_forecast['AutoARIMA'], label='Forecast')
plt.legend()
plt.title("AutoARIMA Forecast")
plt.show()


The forecast isn't great, but it roughly follows the data. Let's not forget that we are working with only about 40 months' worth of data, which is nowhere near enough for a reliable model. The main goal was getting a successful pipeline (data -> prediction).