## Baseline Model

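This notebook establishes a naive baseline forecast for four pollutants (`no2`, `o3`, `pm10`, `pm2.5`) and scores it with mean absolute error (MAE) over a 30-period horizon.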

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

import sys
import os
from dotenv import load_dotenv

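import warnings
# Silence all warnings for cleaner notebook output. This blanket filter also
# covers "Maximum Likelihood optimization failed to converge" messages,
# so no separate message-specific filter is needed.
warnings.filterwarnings("ignore")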

In [3]:
# Resolve the src directory once, expose it via PYTHONPATH, and add it to sys.path
src_dir = os.path.abspath(os.path.join('..', 'src'))
os.environ['PYTHONPATH'] = src_dir
if src_dir not in sys.path:
    sys.path.append(src_dir)

# Load environment variables from .env file
# (by default load_dotenv does not override variables that are already set)
load_dotenv()

In [4]:
from plot import plot_one_sample

In [5]:
df = pd.read_parquet('../data/transformed/tabular_data.parquet')

---

### Data Split for Tabular Data

Gradient-boosted tree models such as XGBoost, LightGBM, and CatBoost treat each row as an independent sample and have no built-in notion of temporal order. To use them for forecasting, the tabular data is first converted into time series arrays (numpy arrays), from which fixed-length feature windows and label windows can be cut.

Structuring the data this way exposes the temporal information (the recent history of each pollutant and its interactions with the other features) as ordinary model inputs, which is what lets these models learn from it.

In contrast, models designed specifically for time series forecasting, such as ARIMA, typically do not need this conversion. They operate directly on the ordered sequence, so a simple chronological split that keeps the series contiguous is enough.
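To make the reshaping idea concrete, here is a minimal sketch of cutting a single series into lagged feature windows and future label windows. The helper `make_windows` and its parameters `n_lags` and `horizon` are hypothetical names used for illustration only; this is not the implementation of `get_ts` or `get_feats_and_label` in `data_split`.

```python
import numpy as np

def make_windows(series, n_lags, horizon):
    """Cut a 1-D series into (features, labels) pairs.

    Hypothetical helper for illustration: each sample uses `n_lags` consecutive
    values as features and the following `horizon` values as labels.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("Series is too short for the requested window sizes.")

    # Slide a window one step at a time; temporal order is preserved in each row
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    y = np.stack([series[i + n_lags:i + n_lags + horizon] for i in range(n_samples)])
    return X, y

# Toy series 0..9 with 3 lags and a 2-step horizon
toy = np.arange(10.0)
X_demo, y_demo = make_windows(toy, n_lags=3, horizon=2)
print(X_demo.shape, y_demo.shape)  # (6, 3) (6, 2)
print(X_demo[0], y_demo[0])        # [0. 1. 2.] [3. 4.]
```

Each row of `X_demo` can then be fed to a tabular model as an ordinary feature vector, while an ARIMA-style model would consume `toy` directly.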

In [6]:
from data_split import get_ts
ts, feat_names = get_ts(df)

In [7]:
feat_names[:4] # label_idx 1 to 4

['no2', 'o3', 'pm10', 'pm2.5']

In [8]:
from data_split import get_feats_and_label
X_train, y_train = get_feats_and_label(timeseries=df, offset=30, label_period=30, label_idx=1)

In [9]:
X_train.shape, y_train.shape

((1191, 31), (30, 1))

---

### 1. Baseline Model: Naive Forecast

In [10]:
class BaselineModelPreviousMonth:
    def __init__(self):
        self.monthly_average = None

    def fit(self, y_train: pd.Series):
        """
        Fit the model by calculating the average of the last month's data from the training series.

        Args:
            y_train (pd.Series): Series containing the target values for training.
        """
        # Calculate the average of the last month's data
        self.monthly_average = y_train[-30:].mean()

    def predict(self, forecast_period: int) -> np.ndarray:
        """
        Predict the next values by using the monthly average from the training data.

        Args:
            forecast_period (int): The number of periods to forecast.

        Returns:
            np.ndarray: An array containing the forecasted values.
        """
        if self.monthly_average is None:
            raise ValueError("The model has not been fitted yet. Please call the fit method before predict.")
        
        return np.full(forecast_period, self.monthly_average)
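As a quick sanity check of the logic, the same idea (average the most recent window, repeat it over the horizon, score with MAE) can be traced by hand on a toy series. This is a minimal standalone sketch: the function name `mean_of_last_window_forecast`, the toy values, and the window of 2 are all made up for illustration, and MAE is computed with plain numpy rather than `mean_absolute_error`.

```python
import numpy as np

def mean_of_last_window_forecast(history, window, horizon):
    """Repeat the mean of the last `window` observations `horizon` times."""
    history = np.asarray(history, dtype=float)
    if history.size == 0:
        raise ValueError("history must contain at least one observation.")
    level = history[-window:].mean()
    return np.full(horizon, level)

# Toy data (illustrative values, not taken from the pollutant dataset)
history = np.array([10.0, 12.0, 14.0, 16.0])
actual = np.array([15.0, 17.0, 13.0])

# Mean of the last two observations: (14 + 16) / 2 = 15
forecast = mean_of_last_window_forecast(history, window=2, horizon=3)

# MAE: (|15 - 15| + |17 - 15| + |13 - 15|) / 3 = 4 / 3
mae = np.mean(np.abs(actual - forecast))
print(forecast, round(mae, 3))  # [15. 15. 15.] 1.333
```

Because the forecast is a flat line, any trend or seasonality inside the horizon shows up directly as error, which is what makes this a useful lower bar for later models.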

In [11]:
for pollutant in ['no2', 'o3', 'pm10', 'pm2.5']:
    # Label index for this pollutant, computed once and reused for both splits
    label_idx = df.columns.get_loc(pollutant) - 1
    X_train, y_train = get_feats_and_label(timeseries=df, offset=30, label_period=30, label_idx=label_idx)
    X_test, y_test = get_feats_and_label(timeseries=df, offset=0, label_period=30, label_idx=label_idx)

    # Fit on the training labels and forecast a constant value for 30 periods
    model = BaselineModelPreviousMonth()
    model.fit(y_train)
    y_pred = model.predict(30)

    mae = mean_absolute_error(y_test, y_pred)
    print(f"mae for {pollutant}={mae:.3f}")
    plot_one_sample(features=y_train.reset_index(), target=y_test.reset_index(), example_pollutant=pollutant, predictions=y_pred)

mae for no2=4.162
mae for o3=11.739
mae for pm10=4.519
mae for pm2.5=3.300

**Baseline Results**

| Pollutant | MAE |
| :-------- | --: |
| no2 | 4.162 |
| o3 | 11.739 |
| pm10 | 4.519 |
| pm2.5 | 3.300 |

The naive baseline gives a reference point that later models must beat: MAE ranges from 3.300 for PM2.5 to 11.739 for O3 over the 30-period forecast horizon. O3 has by far the largest absolute error, although MAE is expressed on each pollutant's own scale, so the values are not directly comparable across pollutants. In the next notebook, I will build on this baseline by fitting more expressive models and incorporating additional features, with the goal of reducing the prediction error for all four pollutants.
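One way to put errors for series with different typical levels on a common footing is to divide MAE by the mean absolute level of the observed values. The sketch below is an optional illustration, not part of the original analysis; `relative_mae` is a hypothetical helper and the arrays are toy values rather than the pollutant data.

```python
import numpy as np

def relative_mae(actual, predicted):
    """MAE divided by the mean absolute level of the observed values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(actual - predicted))
    scale = np.mean(np.abs(actual))
    if scale == 0:
        raise ValueError("Observed values are all zero; relative MAE is undefined.")
    return mae / scale

# Two toy series with the same shape but a 10x difference in level
low_level = relative_mae([10, 12, 8], [10, 10, 10])
high_level = relative_mae([100, 120, 80], [100, 100, 100])

# Raw MAE differs by 10x (about 1.33 vs 13.33), yet the relative error matches
print(round(low_level, 4), round(high_level, 4))  # 0.1333 0.1333
```

Applied to the four pollutants, a ratio like this would indicate whether the high O3 error reflects a harder forecasting problem or simply larger concentration values.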