**Student:** Michele Cristina Otta

# Time Series Forecasting
Data Science Track -> Air Quality Index Forecasting

In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [12]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np

# load air quality index dataset
air_quality = fetch_ucirepo(id=360)
air_quality = pd.DataFrame(air_quality.data.features)

air_quality['Date'] = pd.to_datetime(air_quality['Date'], format='mixed').dt.strftime('%Y/%m/%d')
air_quality['Timestamp'] = pd.to_datetime(air_quality['Date'].astype(str) + ' ' + air_quality['Time'])
air_quality.set_index('Timestamp', inplace=True)
air_quality.drop(columns=['Date', 'Time'], inplace=True)
air_quality.sort_index(inplace=True)

# prepare the data as a time series forecasting task
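def prepare_data(dataset, series='NO2(GT)', lag=2, horizon=1):
    # each row of X holds `lag` consecutive values of the series;
    # the matching y is the value `horizon` steps after the end of that window
    u = dataset[series].to_numpy()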

    n = len(u)
    X, y = [], []

    for i in range(n - lag - horizon + 1):
        X.append(u[i : i + lag])
        y.append(u[i + lag + horizon - 1])

    return np.array(X), np.array(y)

x, y = prepare_data(air_quality)

print(x)
print(y)

[[113  92]
 [ 92 114]
 [114 122]
 ...
 [190 179]
 [179 175]
 [175 156]]
[114 122 116 ... 175 156 168]
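
The UCI documentation for this dataset states that missing readings are tagged with the value -200, so windows built directly from `NO2(GT)` can contain that sentinel as if it were a real concentration. The sketch below is one possible way to handle it and is not part of the original notebook: it runs on a small made-up series, and the helper name `prepare_data_skip_missing` is hypothetical. It marks the sentinel as NaN and keeps only the windows whose lags and target are all observed.

```python
import numpy as np

MISSING = -200  # sentinel the UCI Air Quality documentation uses for missing readings

def prepare_data_skip_missing(values, lag=2, horizon=1):
    """Build (lags, target) pairs, skipping any window that touches a missing reading."""
    u = np.array(values, dtype=float)  # copy, so the caller's data is left untouched
    u[u == MISSING] = np.nan
    X, y = [], []
    for i in range(len(u) - lag - horizon + 1):
        window = u[i : i + lag]
        target = u[i + lag + horizon - 1]
        if np.isnan(window).any() or np.isnan(target):
            continue  # a lag or the target is missing: drop this sample
        X.append(window)
        y.append(target)
    return np.array(X), np.array(y)

toy_series = [113, 92, -200, 122, 116, 96]  # made-up values with one missing reading
X_clean, y_clean = prepare_data_skip_missing(toy_series)
print(X_clean)
print(y_clean)
```

Dropping windows shortens the dataset and shifts the fold boundaries, so applying this to the real series would require recomputing every error reported in this notebook.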


In [7]:
from sklearn.model_selection import TimeSeriesSplit

# use time series split (5 splits)
tscv = TimeSeriesSplit(n_splits=5)
print(tscv)
for i, (train_index, test_index) in enumerate(tscv.split(x)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
Fold 0:
  Train: index=[   0    1    2 ... 1557 1558 1559]
  Test:  index=[1560 1561 1562 ... 3116 3117 3118]
Fold 1:
  Train: index=[   0    1    2 ... 3116 3117 3118]
  Test:  index=[3119 3120 3121 ... 4675 4676 4677]
Fold 2:
  Train: index=[   0    1    2 ... 4675 4676 4677]
  Test:  index=[4678 4679 4680 ... 6234 6235 6236]
Fold 3:
  Train: index=[   0    1    2 ... 6234 6235 6236]
  Test:  index=[6237 6238 6239 ... 7793 7794 7795]
Fold 4:
  Train: index=[   0    1    2 ... 7793 7794 7795]
  Test:  index=[7796 7797 7798 ... 9352 9353 9354]
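
The fold boundaries above follow an expanding-window pattern: each fold trains on everything that precedes its test block, and all test blocks have the same size. A small sketch on a made-up array of 12 samples (not the notebook's data) makes the layout easier to inspect; the names `toy` and `toy_cv` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(12).reshape(-1, 1)  # 12 made-up samples in time order
toy_cv = TimeSeriesSplit(n_splits=3)

fold_sizes = []
for train_index, test_index in toy_cv.split(toy):
    # every training index precedes every test index, so no future sample is used for fitting
    assert train_index.max() < test_index.min()
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"train={train_index.tolist()} test={test_index.tolist()}")

print(fold_sizes)
```

The training set grows from fold to fold while the test block keeps a fixed length, which is the same shape as the five folds printed for the air quality data.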


In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# compare 3 regression models (Linear Regression, Random Forest, naive (Y t+1 = Y t))
linear_regression = LinearRegression()
tree_regression = RandomForestRegressor()

models = {
    "Linear Regression": linear_regression,
    "Random Forest": tree_regression,
    "Naive (Y t+1 = Y t)": None
    }

for name, model in models.items():
    mse = []
    mae = []

    for train_index, test_index in tscv.split(x):
        x_train, x_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]

        if model is None:
            # naive baseline: repeat the last training target over the whole test fold
            # (a constant forecast, not a step-by-step Y t+1 = Y t update)
            last_value = y_train[-1]
            predict_scores = np.full_like(y_test, last_value, dtype=np.float64)
        else:
            model.fit(x_train, y_train)
            predict_scores = model.predict(x_test)

        # collect MSE and MAE for this fold
        mse.append(mean_squared_error(y_test, predict_scores))
        mae.append(mean_absolute_error(y_test, predict_scores))

    # report the mean over the folds, followed by the per-fold values
    print(f"\n======== {name} ========")
    print(f"Mean Squared Error (MSE): {np.mean(mse):.2f} -> {mse}")
    print(f"Mean Absolute Error (MAE): {np.mean(mae):.2f} -> {mae}")


======== Linear Regression ========
Mean Squared Error (MSE): 4895.31 -> [4785.487713997605, 4698.43348723038, 4555.01407133462, 5521.495642314247, 4916.12894762335]
Mean Absolute Error (MAE): 39.10 -> [38.39083468551891, 40.31008262307764, 38.20649265872244, 39.62457577921467, 38.95032095905486]

======== Random Forest ========
Mean Squared Error (MSE): 4342.27 -> [4053.9185210448345, 4073.2697838708937, 4091.193373103411, 4958.356749481134, 4534.610093445831]
Mean Absolute Error (MAE): 32.95 -> [32.4791400213121, 31.044554101724138, 30.308353234742086, 34.472933137087935, 36.44305862111812]

======== Naive (Y t+1 = Y t) ========
Mean Squared Error (MSE): 24117.79 -> [19778.20205259782, 23036.26491340603, 54245.47466324567, 15170.942912123155, 8358.048107761386]
Mean Absolute Error (MAE): 107.77 -> [101.0166773572803, 110.40089801154586, 184.0891597177678, 73.68954457985889, 69.65426555484285]


### Analysis
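* Random Forest had the best performance, averaged over the 5 folds:
  * Mean Squared Error (MSE) = 4342.27 (best fold: 4053.92)
  * Mean Absolute Error (MAE) = 32.95 (best fold: 30.31)
  * A plausible explanation is that it can capture nonlinear relationships between the two lagged values and the target, which a linear model cannot.
* The Naive model had the worst results by a wide margin:
  * Mean Squared Error (MSE) = 24117.79
  * Mean Absolute Error (MAE) = 107.77
  * As implemented, it repeats the last training value for every point in the test fold, so it is a constant forecast rather than a step-by-step Y t+1 = Y t baseline. This likely explains much of the gap and the large variation between folds.
* Linear Regression performed far better than the Naive model but worse than Random Forest (about 13% higher MSE and 19% higher MAE):
  * Mean Squared Error (MSE) = 4895.31
  * Mean Absolute Error (MAE) = 39.10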
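
The rule Y t+1 = Y t can also be applied one step at a time: with `lag=2` and `horizon=1`, the last column of each row of `x` is the observation immediately before the corresponding target, so it can serve directly as the prediction. The sketch below illustrates the idea on a made-up series. `make_windows` is a hypothetical stand-in for `prepare_data`, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def make_windows(u, lag=2, horizon=1):
    """Hypothetical stand-in for prepare_data: rows of `lag` values and their targets."""
    u = np.asarray(u, dtype=float)
    n = len(u) - lag - horizon + 1
    X = np.array([u[i : i + lag] for i in range(n)])
    y = np.array([u[i + lag + horizon - 1] for i in range(n)])
    return X, y

toy_series = [10, 12, 15, 14, 18, 21, 20, 23]  # made-up values
X_toy, y_toy = make_windows(toy_series)

# persistence forecast: predict each target with the value observed just before it
# (this shortcut assumes horizon=1, where the last lag is the previous time step)
persistence_pred = X_toy[:, -1]

# constant forecast: repeat one earlier value for every prediction
constant_pred = np.full_like(y_toy, toy_series[1], dtype=np.float64)

persistence_mae = mean_absolute_error(y_toy, persistence_pred)
constant_mae = mean_absolute_error(y_toy, constant_pred)
print(persistence_mae)
print(constant_mae)
```

In the cross-validation loop this would amount to using `x_test[:, -1]` as the naive prediction. The metrics for the real data would have to be recomputed before drawing any conclusion about how such a baseline ranks against the two fitted models.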

# References
* [Air Quality Index Dataset](https://archive.ics.uci.edu/dataset/360/air+quality)
* [A Practical Guide on Scikit-learn for Time Series Forecasting](https://medium.com/@mouse3mic3/a-practical-guide-on-scikit-learn-for-time-series-forecasting-bbd15b611a5d)
* [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
* [Random Forest for Time Series Forecasting](https://www.analyticsvidhya.com/blog/2021/06/random-forest-for-time-series-forecasting/)