# Machine Learning Model

## Context

In the previous notebook, ts_model.ipynb, we removed the time dependency (seasonality and autocorrelation) from our target feature, which allows us to move to a classic machine learning approach, specifically a tree-based model.

In this notebook, we perform feature engineering, model selection, feature selection, and finally model tuning. The candidate models are LightGBM and XGBoost, chosen for their typically high accuracy.

**Data Source**
The data used in this notebook was extracted from the notebook *model/ts_model.ipynb*

- **Date:** 06/12/2025
- **Location:** ../data/wrangle

## Set up

### Libraries

In [78]:
## Base
import os
import pickle
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from tqdm import tqdm  # progress bars used in the selection and tuning loops

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Model
import optuna
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score, GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

from lightgbm import LGBMRegressor, early_stopping, log_evaluation
from xgboost import XGBRegressor

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")

import logging
logging.getLogger("lightgbm").setLevel(logging.CRITICAL)
logging.getLogger("lightgbm.engine").setLevel(logging.CRITICAL)
logging.getLogger("lightgbm.basic").setLevel(logging.CRITICAL)

In [2]:
# Custom functions
import sys
from pathlib import Path
sys.path.insert(1, Path.cwd().parents[1].as_posix())

from src.ts_utils import *

from config import *

In [3]:
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['#003366'])

# Data

Our target variable contains 7 years of daily data, totaling 2,555 rows, as a result of the seasonal division applied in the previous notebook. This is a small amount of data for training tree-based models.

In [4]:
df = pd.read_parquet(os.path.join(DATA_PATH_WRANGLE, 'weather_linear_resid.parquet'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2555 entries, 0 to 2554
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   time      2555 non-null   datetime64[ns]
 1   tavg      2555 non-null   float64       
 2   prcp      2555 non-null   float64       
 3   snow      2555 non-null   float64       
 4   wspd      2555 non-null   float64       
 5   pres      2555 non-null   float64       
 6   tamp      2555 non-null   float64       
 7   sin_1     2555 non-null   float64       
 8   cos_1     2555 non-null   float64       
 9   ml_resid  2555 non-null   float64       
dtypes: datetime64[ns](1), float64(9)
memory usage: 199.7 KB


In [5]:
df['ml_resid'].describe()

count    2555.000000
mean        0.002710
std         3.179228
min       -15.814890
25%        -1.684089
50%         0.196558
75%         1.828278
max        11.023475
Name: ml_resid, dtype: float64

# Feature Engineering
In this section, we create all the features our models might use. The plan is to generate as many features as possible to expose the information contained in the data and decide later which ones should remain in the final model. In other words, at this stage we are not concerned with whether a feature is useful; that is determined later in the notebook.

The only rule is that the feature must be available at the time of forecasting.

## Lagging

Given this rule, we begin by lagging all relevant features so that they are available at forecast time. We apply lags from 365 to 729 days (`range(365, 2*365)` excludes the upper bound).

In [6]:
list_columns = ['tavg', 'prcp', 'snow', 'wspd', 'pres', 'tamp']
for c in list_columns:
    for lag in range(365, 2*365):
        df[f'{c}_{lag}'] = df[c].shift(lag)

In [7]:
df.dropna().shape

(1826, 2200)
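
To make the row count concrete: a lag of k places the value from k days earlier on the current row, so the first k rows have no value. With lags up to 729 days, only rows from index 729 onward are complete, which matches 2,555 - 729 = 1,826. Below is a minimal pure-Python sketch of the same idea. The `shift` helper is hypothetical and only mimics what `Series.shift` does for a positive lag.

```python
def shift(values, lag):
    """Return a copy of `values` delayed by `lag` positions, padding with None."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return [None] * min(lag, len(values)) + list(values[:max(len(values) - lag, 0)])

series = [10.0, 11.0, 12.0, 13.0, 14.0]
lagged = {lag: shift(series, lag) for lag in (1, 2)}

# Rows where every lagged column has a value (what dropna() would keep)
complete_rows = [
    i for i in range(len(series))
    if all(col[i] is not None for col in lagged.values())
]

print(lagged[2])        # [None, None, 10.0, 11.0, 12.0]
print(complete_rows)    # [2, 3, 4]

# Same accounting for the notebook: 2555 rows, largest lag 729
print(2555 - 729)       # 1826
```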

So far, our table has 2,200 columns and 1,826 complete rows, about 5 years of data. Next, we add the first differences of the lagged features, which capture day-to-day changes.

## Differences

In [8]:
not_include = {'time', 'tavg', 'prcp', 'snow', 'wspd', 'pres', 'tamp', 'sin_1', 'cos_1', 'seasonal_resid', 'ml_resid', 'seasonal_pattern'}
set_all_columns = set(df.columns)
set_columns = set_all_columns - not_include

len(set_columns)

2190

In [9]:
for c in set_columns:
    df[f'{c}_diff'] = df[c].diff()

In [10]:
df.dropna().shape

(1825, 4390)

## Time
Finally, we can also include time-based features such as month, day of the month, day of the year, and year. Since we have removed time dependence, it is likely that none of these features will be considered relevant. However, because tree-based models are non-linear, it's possible that this information, combined with other features, could reveal patterns not already captured.

In [11]:
df['day_month'] = df['time'].dt.day
df['day_year'] = df['time'].dt.day_of_year
df['month'] = df['time'].dt.month.astype('category')
df['year'] =  df['time'].dt.year

In [12]:
df.dropna().shape

(1825, 4394)

# Model Selection

We now have 4,394 columns; however, 10 of them (the timestamp, the unlagged weather variables, the `sin_1`/`cos_1` terms, and the target) cannot be used as features. To select the best model for this task, we run cross-validation using all remaining features with the default configuration of each model and measure the Mean Absolute Error (MAE). The model with the lowest MAE is selected.

In [13]:
set_all_columns = set(df.columns)
set_columns = set_all_columns - not_include
len(set_columns)

4384

Above we have the number of features used for the model selection.

In [14]:
df.dropna(inplace=True)
X = df[list(set_columns)]
y = df['ml_resid']

In [15]:
model_xgb  = XGBRegressor(random_state=25, enable_categorical=True)
model_lgbm = LGBMRegressor(random_state=25)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

In [16]:
mae_xgb  = -cross_val_score(model_xgb,  X, y, cv=cv, n_jobs=-2, scoring="neg_mean_absolute_error")
mae_lgbm = -cross_val_score(model_lgbm, X, y, cv=cv, n_jobs=-2, scoring="neg_mean_absolute_error")

print(f"XGBoost Mean MAE:  {mae_xgb.mean():.4f}")
print(f"LightGBM Mean MAE: {mae_lgbm.mean():.4f}")

XGBoost Mean MAE:  2.5637
LightGBM Mean MAE: 2.4818
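
For reference, the evaluation above amounts to repeatedly shuffling the row indices, holding out 20% for validation, and averaging the MAE across splits. Below is a minimal pure-Python sketch of that procedure, assuming a trivial mean predictor in place of the real models. The helpers `mae` and `shuffle_split` are illustrative and are not scikit-learn's implementation.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def shuffle_split(n_rows, n_splits, test_size, seed):
    """Yield (train_idx, val_idx) pairs from independent random permutations."""
    rng = random.Random(seed)
    n_val = int(round(n_rows * test_size))
    for _ in range(n_splits):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        yield idx[n_val:], idx[:n_val]

y = [0.5, -1.2, 3.1, 0.0, -2.4, 1.8, 0.7, -0.3, 2.2, -1.0]
scores = []
for train_idx, val_idx in shuffle_split(len(y), n_splits=5, test_size=0.2, seed=42):
    # Stand-in "model": predict the training mean for every validation row
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    scores.append(mae([y[i] for i in val_idx], [prediction] * len(val_idx)))

print(f"Mean MAE over {len(scores)} splits: {sum(scores) / len(scores):.4f}")
```

Because each split is drawn independently, a row can land in several validation sets, unlike K-fold, where every row is validated exactly once.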


# Feature Selection

Based on the MAE results, we selected LightGBM due to its superior performance. Next, we proceed to feature selection. The chosen method is straightforward: we first train our model on all available data and remove features that show no importance.

We then apply a Greedy Forward Selection method, a stepwise feature selection approach that iteratively builds a model by adding features one at a time based on their contribution to predictive performance. Starting with an empty set of features, the algorithm evaluates each candidate feature by temporarily including it in the model and measuring the improvement in a performance metric, such as mean absolute error (MAE). The feature that yields the largest improvement is permanently added to the model.

This process repeats until adding new features no longer significantly improves performance or a predefined stopping criterion is met.
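
The procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's implementation: `score_fn` stands in for the cross-validated MAE, `min_improvement` is a hypothetical stopping tolerance, and the toy usefulness values are made up.

```python
def greedy_forward_selection(candidates, score_fn, max_features, min_improvement=0.0):
    """Add one feature at a time, keeping the one that lowers score_fn the most."""
    selected, history = [], []
    remaining = list(candidates)  # copy so the caller's list is not mutated
    best_so_far = float("inf")
    while remaining and len(selected) < max_features:
        # Score every candidate when temporarily added to the current set
        trial_scores = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = min(trial_scores, key=lambda pair: pair[0])
        if best_so_far - score <= min_improvement:
            break  # no candidate improves the metric enough
        selected.append(feature)
        remaining.remove(feature)
        history.append(score)
        best_so_far = score
    return selected, history

# Toy score: the error shrinks by each feature's (made-up) usefulness
usefulness = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.1}

def toy_error(features):
    return 1.0 - sum(usefulness[f] for f in features)

selected, history = greedy_forward_selection(list(usefulness), toy_error, max_features=10)
print(selected)                          # ['a', 'b', 'd']
print([round(h, 2) for h in history])    # [0.5, 0.2, 0.1]
```

Each step costs one model evaluation per remaining candidate, which is why the candidate pool is reduced before running the search.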

## Gain Selection

In [17]:
model = LGBMRegressor(random_state=25).fit(X, y)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.455222 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 717576
[LightGBM] [Info] Number of data points in the train set: 1825, number of used features: 4384
[LightGBM] [Info] Start training from score -0.049022


In [18]:
importance_gain = model.booster_.feature_importance(importance_type='gain')
feature_names = model.booster_.feature_name()

feat_imp = pd.DataFrame({
    'feature': feature_names,
    'gain': importance_gain
}).sort_values(by='gain', ascending=False)

feat_imp

Unnamed: 0,feature,gain
1261,tavg_430_diff,835.089394
1701,wspd_655_diff,778.790444
2809,wspd_560_diff,603.810895
625,snow_402,596.010201
487,pres_390,572.651686
...,...,...
2355,tavg_538_diff,0.000000
2354,snow_640_diff,0.000000
797,tavg_620,0.000000
2383,prcp_624_diff,0.000000


In [19]:
feat_imp.loc[feat_imp['gain'] > 0, 'gain'].describe()

count    1806.000000
mean       52.722365
std        74.959522
min         0.731386
25%         5.416520
50%        23.570001
75%        75.134867
max       835.089394
Name: gain, dtype: float64

Even after discarding the roughly 2,600 features with zero gain (4,384 - 1,806 = 2,578), we still have too many features to run greedy feature selection. Therefore, we retain only the features that contribute more than 0.05% of the total gain.

In [20]:
feature_pool = feat_imp.loc[feat_imp['gain'] > feat_imp['gain'].sum()* 0.0005, 'feature'].tolist()
len(feature_pool)

627
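
With the numbers above, the total gain is roughly 1,806 × 52.72 ≈ 95,217, so the cutoff is about 47.6. The filter itself can be sketched in pure Python as follows. `filter_by_gain_share` is a hypothetical helper and the gains are made up for illustration.

```python
def filter_by_gain_share(gains, min_share):
    """Keep features whose gain exceeds `min_share` of the total gain."""
    total = sum(gains.values())
    threshold = total * min_share
    kept = [name for name, gain in gains.items() if gain > threshold]
    # Most important first, mirroring the sorted importance table
    return sorted(kept, key=lambda name: gains[name], reverse=True)

# Made-up gains for illustration only
gains = {"f1": 900.0, "f2": 99.0, "f3": 0.6, "f4": 0.4, "f5": 0.0}
print(filter_by_gain_share(gains, min_share=0.0005))  # ['f1', 'f2', 'f3']

# Cutoff implied by the describe() output above: count * mean * 0.0005
print(round(1806 * 52.722365 * 0.0005, 1))            # 47.6
```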

## Greedy Selection
This method leverages LightGBM's ability to handle complex, nonlinear relationships and interactions among features, making it an effective way to identify a predictive subset of features while reducing dimensionality and potential overfitting.

In [21]:
def evaluate_feature(feature, current_features, X, y):
    features_to_use = current_features + [feature]
    cv = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
    maes = []
    for train_idx, val_idx in cv.split(X):
        X_train, X_val = X.iloc[train_idx][features_to_use], X.iloc[val_idx][features_to_use]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
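        # n_estimators is the scikit-learn API name for the number of boosting rounds.
        # Note: subsample only takes effect when subsample_freq > 0 (the default is 0).
        model = LGBMRegressor(n_estimators=75, max_depth=4, subsample=0.6, verbose=-1, random_state=25)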
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        maes.append(mean_absolute_error(y_val, y_pred))
        
    return np.mean(maes), feature

selected_features = []
remaining_features = list(feature_pool)  # copy, so feature_pool is not mutated by remove()
performance_history = []

In [22]:
for step in range(1, 50):
    description = f'step_{step} (MAE: {performance_history[-1]:.4f})' if step > 1 else f'step_{step}'

    # Parallel evaluation using joblib
    results = Parallel(n_jobs=-2)(
        delayed(evaluate_feature)(f, selected_features, X, y) for f in tqdm(remaining_features, desc=description, leave=True, position=0)
    )
    best_mae, best_feature = min(results, key=lambda x: x[0])

    # Stop only if the best candidate worsens the MAE by more than 0.01 (checked after step 10)
    if step > 10 and best_mae - performance_history[-1] > 0.01:
        print("No improvement, stopping selection.")
        break

    # Save results
    selected_features.append(best_feature)
    remaining_features.remove(best_feature)
    performance_history.append(best_mae)

step_1: 100%|███████████████████████████████████████████████████████████████████████████████████| 627/627 [06:36<00:00,  1.58it/s]
step_2 (MAE: 2.4025): 100%|████████████████████████████████████████████████████████████████████████████| 626/626 [06:16<00:00,  1.66it/s]
step_3 (MAE: 2.3824): 100%|████████████████████████████████████████████████████████████████████████████| 625/625 [06:20<00:00,  1.64it/s]
step_4 (MAE: 2.3719): 100%|████████████████████████████████████████████████████████████████████████████| 624/624 [06:26<00:00,  1.61it/s]
step_5 (MAE: 2.3590): 100%|████████████████████████████████████████████████████████████████████████████| 623/623 [06:30<00:00,  1.60it/s]
step_6 (MAE: 2.3392): 100%|█████████████████████████████████████████████████████████████████████| 622/622 [06:30<00:00,  1.59it/s]
step_7 (MAE: 2.3220): 100%|████████████████████████████████████████████████████████████████████████████| 621/621 [06:45<00:00,  1.53it/s]
step_8 (MAE: 2.3149): 100%|█████████████████████

KeyboardInterrupt: 

The code was manually interrupted because the metric was no longer improving as more features were added. The change was too small to trigger the early-stopping condition, but the metrics shown in the progress bar indicated that the model had already reached its best performance at the 39th step.
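
The loop's stopping rule only fires when the best candidate makes the MAE worse by more than 0.01, so a plateau never triggers it. One possible alternative, shown here as an assumption rather than what the notebook ran, is a patience rule that stops when the last few steps fail to beat the earlier best by a minimum margin. The function name, `patience`, and `min_delta` are all hypothetical.

```python
def should_stop(history, patience=5, min_delta=0.001):
    """Return True when the last `patience` steps failed to beat the earlier best by min_delta."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    best_recent = min(history[-patience:])
    return best_before - best_recent < min_delta

# Made-up MAE trajectory that improves and then plateaus
mae_history = [2.40, 2.38, 2.37, 2.36, 2.3598, 2.3597, 2.3599, 2.3596, 2.3597]
print(should_stop(mae_history[:5]))   # False: not enough history yet
print(should_stop(mae_history))       # True: last 5 steps gained less than 0.001
```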

In [23]:
selected_features, performance_history

(['pres_703',
  'tamp_404',
  'prcp_702',
  'tamp_709_diff',
  'wspd_718',
  'prcp_411',
  'wspd_646',
  'tamp_582',
  'tamp_523_diff',
  'tamp_372',
  'prcp_382',
  'tamp_469',
  'prcp_703',
  'tavg_378_diff',
  'wspd_657',
  'tavg_412',
  'wspd_416',
  'tavg_486_diff',
  'wspd_524',
  'tamp_397_diff',
  'snow_370_diff',
  'pres_389_diff',
  'wspd_597',
  'pres_440_diff',
  'tavg_682_diff',
  'snow_694',
  'tavg_695_diff',
  'tamp_532_diff',
  'tamp_382',
  'wspd_401_diff',
  'tavg_728',
  'tamp_691_diff',
  'prcp_376',
  'tamp_403',
  'tamp_490',
  'tamp_375',
  'wspd_464_diff',
  'tamp_720_diff',
  'wspd_600_diff',
  'pres_717_diff',
  'tavg_427_diff',
  'wspd_613',
  'tavg_497_diff',
  'tamp_721',
  'prcp_460_diff',
  'prcp_614_diff',
  'prcp_386',
  'prcp_443_diff',
  'tamp_517',
  'tamp_456_diff'],
 [np.float64(2.4025164527247753),
  np.float64(2.382404352509843),
  np.float64(2.371872236610991),
  np.float64(2.3590130649809837),
  np.float64(2.339218500035797),
  np.float64(2.32

Using the method above, we selected 39 predictive features, significantly reducing the dimensionality. For comparison, these 39 features, combined with simpler hyperparameters, achieved an MAE 0.25 points lower than the model trained with all 4,384 features during "Model Selection".

In [108]:
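# index() returns the 0-based position of the best step; slicing up to it keeps the
# features selected before that step (use `+ 1` to also include the feature added at it).
X = X[selected_features[:performance_history.index(min(performance_history))]]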
X.shape

(1825, 38)
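
Because `performance_history[i]` is the MAE obtained with the first `i + 1` features, slicing up to the index of the minimum keeps one feature fewer than the best step used. That is why the shape above shows 38 columns for a best step of 39. A small sketch with made-up values illustrates the difference. `best_prefix` is a hypothetical helper.

```python
def best_prefix(features, history):
    """Return the features up to and including the step with the lowest score."""
    best_step = history.index(min(history))   # 0-based position of the best score
    return features[:best_step + 1]           # + 1 keeps the feature added at that step

features = ["f_a", "f_b", "f_c", "f_d"]
history = [2.40, 2.35, 2.31, 2.33]            # made-up MAE after adding each feature

print(best_prefix(features, history))                 # ['f_a', 'f_b', 'f_c']
print(features[:history.index(min(history))])         # ['f_a', 'f_b']
```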

# Model Tuning

After selecting the predictive features, we move on to tuning the model. The strategy has two stages. First, we use Bayesian optimization, which learns from the results of past trials to decide which hyperparameters to evaluate next, to identify promising regions of the search space. Then, within those regions, we apply a grid search to determine the final hyperparameters.

## Optuna

Optuna is a hyperparameter optimization framework that incorporates Bayesian optimization techniques.

In [87]:
X_values = X.values
y_values = y.values

kf = KFold(n_splits=5, shuffle=True, random_state=25)

def objective(trial):
    params = {
        "objective": "regression",
        "metric": "mae",
        "boosting_type": "gbdt",
        "verbosity": -1,

        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "max_depth": trial.suggest_int("max_depth", 3, 12),

        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
        "n_estimators": trial.suggest_int("n_estimators", 200, 1500),

        "min_child_samples": trial.suggest_int("min_child_samples", 5, 50),
        "min_child_weight": trial.suggest_float("min_child_weight", 1e-5, 1.0),

        "min_gain_to_split": 0.0,

        "lambda_l1": trial.suggest_float("lambda_l1", 0.0, 2.0),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.0, 2.0),

        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": 1,
    }

    maes = []
    best_iters = []
    
    for train_idx, valid_idx in kf.split(X_values):
        X_train, X_valid = X_values[train_idx], X_values[valid_idx]
        y_train, y_valid = y_values[train_idx], y_values[valid_idx]

        model = LGBMRegressor(**params)

        model.fit(
            X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            callbacks=[
                early_stopping(200, verbose=False),  
                log_evaluation(period=0)            
            ]
        )
        best_iters.append(model.best_iteration_)
        pred = model.predict(X_valid)
        maes.append(mean_absolute_error(y_valid, pred))
        
    trial.set_user_attr("mean_best_iteration", np.mean(best_iters))
    
    return np.mean(maes)

In [88]:
n_trials = 100
optuna.logging.set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(direction="minimize")

with tqdm(total=n_trials, desc="Hyperparameter Search", position=0) as pbar:
    for _ in range(n_trials):
        study.optimize(objective, n_trials=1, catch=(Exception,))
        pbar.update(1)

print("Best Optuna params:")
print(study.best_params)
print("Average n_estimator:", study.best_trial.user_attrs["mean_best_iteration"])

best = study.best_params

Hyperparameter Search: 100%|████████████████████████████████████████████████████████████████████| 100/100 [08:27<00:00,  5.08s/it]

Best Optuna params:
{'num_leaves': 188, 'max_depth': 3, 'learning_rate': 0.0476510398940014, 'n_estimators': 1378, 'min_child_samples': 17, 'min_child_weight': 0.8738048497952383, 'lambda_l1': 0.17405239654238058, 'lambda_l2': 0.48525397028025363, 'feature_fraction': 0.8046112243391896, 'bagging_fraction': 0.790649561847512}
133.2





## Grid Search
With the promising search region identified, we run a grid search around the Optuna optimum to pick the final hyperparameters.

In [89]:
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

In [92]:
grid = {
    "learning_rate": [max(1e-5, best["learning_rate"] * f) for f in [0.7, 1.0, 1.3]],
    "num_leaves": sorted(list({max(16, best["num_leaves"] + d) for d in [-16, 0, 16]})),
    "min_child_samples": sorted(list({max(2, best["min_child_samples"] + d) for d in [-5, 0, 5]})),
    "feature_fraction": sorted(list({min(1.0, best["feature_fraction"] * f) for f in [0.9, 1.0, 1.1]})),
    "bagging_fraction": sorted(list({min(1.0, best["bagging_fraction"] * f) for f in [0.9, 1.0, 1.1]})),
    "lambda_l1": [max(0, best["lambda_l1"] * f) for f in [0.7, 1.0, 1.3]],
    "lambda_l2": [max(0, best["lambda_l2"] * f) for f in [0.7, 1.0, 1.3]],
    "n_estimators": [int(study.best_trial.user_attrs["mean_best_iteration"] * n) for n in [0.8, 1.0, 1.2]]
}

grid

{'learning_rate': [0.03335572792580098,
  0.0476510398940014,
  0.06194635186220182],
 'num_leaves': [172, 188, 204],
 'min_child_samples': [12, 17, 22],
 'feature_fraction': [0.7241501019052707,
  0.8046112243391896,
  0.8850723467731086],
 'bagging_fraction': [0.7115846056627608,
  0.790649561847512,
  0.8697145180322633],
 'lambda_l1': [0.1218366775796664, 0.17405239654238058, 0.22626811550509476],
 'lambda_l2': [0.33967777919617753, 0.48525397028025363, 0.6308301613643298],
 'n_estimators': [106, 133, 159]}
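
An exhaustive grid grows multiplicatively with every parameter added. With eight parameters of three values each, the grid has 3^8 = 6,561 candidates, and 5-fold cross-validation multiplies that by five, which matches the fit count reported by the grid search. The arithmetic can be checked with a short sketch, where the small grid at the end is made up for illustration.

```python
from itertools import product
from math import prod

# Number of values per hyperparameter, as in the grid printed above
grid_sizes = {
    "learning_rate": 3, "num_leaves": 3, "min_child_samples": 3,
    "feature_fraction": 3, "bagging_fraction": 3,
    "lambda_l1": 3, "lambda_l2": 3, "n_estimators": 3,
}
n_candidates = prod(grid_sizes.values())
n_folds = 5
print(n_candidates, n_candidates * n_folds)   # 6561 32805

# Equivalent enumeration on a smaller made-up grid
small_grid = {"learning_rate": [0.03, 0.05], "num_leaves": [172, 188, 204]}
combos = [dict(zip(small_grid, values)) for values in product(*small_grid.values())]
print(len(combos))   # 6
```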

In [94]:
base_model = LGBMRegressor(
    objective="regression",
    random_state=25,
    verbosity=-1,
    n_jobs=1
)

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=grid,
    scoring=mae_scorer,
    cv=kf,
    n_jobs=-2,
    verbose=1,
    refit=True
)

grid_search.fit(X, y)

print("Grid best params:", grid_search.best_params_)
print("Grid best MAE:", -grid_search.best_score_)  

Fitting 5 folds for each of 6561 candidates, totalling 32805 fits
Grid best params: {'bagging_fraction': 0.7115846056627608, 'feature_fraction': 0.8046112243391896, 'lambda_l1': 0.22626811550509476, 'lambda_l2': 0.6308301613643298, 'learning_rate': 0.03335572792580098, 'min_child_samples': 22, 'n_estimators': 106, 'num_leaves': 172}
Grid best MAE: 2.347952880634722
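
One detail worth keeping in mind: parameters that Optuna tuned but the grid does not vary (here `max_depth` and `min_child_weight`) are not set on `base_model`, so the grid search runs with LightGBM's defaults for them. If the intent were to keep Optuna's values fixed, one option is to separate the tuned parameters that are absent from the grid and pass them to the base estimator. The helper below is a hypothetical sketch, not part of the notebook.

```python
def split_fixed_params(best_params, grid):
    """Return the tuned parameters that the grid does not vary."""
    return {name: value for name, value in best_params.items() if name not in grid}

# Values copied from the Optuna output above (rounded)
best = {
    "num_leaves": 188, "max_depth": 3, "learning_rate": 0.0477,
    "n_estimators": 1378, "min_child_samples": 17, "min_child_weight": 0.8738,
    "lambda_l1": 0.1741, "lambda_l2": 0.4853,
    "feature_fraction": 0.8046, "bagging_fraction": 0.7906,
}
grid_keys = {
    "learning_rate", "num_leaves", "min_child_samples", "feature_fraction",
    "bagging_fraction", "lambda_l1", "lambda_l2", "n_estimators",
}

fixed = split_fixed_params(best, grid_keys)
print(fixed)   # {'max_depth': 3, 'min_child_weight': 0.8738}
# These could then be passed on, e.g. LGBMRegressor(**fixed, bagging_freq=1, ...),
# where bagging_freq=1 mirrors the value used in the Optuna objective.
```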


## Final Model

Now that we have identified the best feature set and hyperparameters for our model, we can finalize it.

In [110]:
model = LGBMRegressor(
    **grid_search.best_params_,
    objective="regression",
    random_state=25,
    n_jobs=-2,
    verbosity=-1,
)

In [104]:
kf = KFold(n_splits=10, shuffle=True, random_state=25)
final_lgbm = -cross_val_score(model, X, y, cv=kf, n_jobs=-2, scoring="neg_mean_absolute_error")
print(f"LightGBM All MAEs: {final_lgbm}")
print(f"LightGBM Mean MAE: {final_lgbm.mean():.4f}")

LightGBM All MAEs: [2.2375594  2.24416481 2.56208358 2.19960629 2.30541475 2.62504182
 2.35684063 2.28164927 2.31679155 2.4487121 ]
LightGBM Mean MAE: 2.3578
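
The mean hides the fold-to-fold variation, which is worth reporting when each fold validates on only about 180 rows. A short sketch using the fold values printed above:

```python
from statistics import mean, stdev

# Fold MAEs copied from the output above
fold_maes = [2.2375594, 2.24416481, 2.56208358, 2.19960629, 2.30541475,
             2.62504182, 2.35684063, 2.28164927, 2.31679155, 2.4487121]

print(f"mean  = {mean(fold_maes):.4f}")                          # mean  = 2.3578
print(f"std   = {stdev(fold_maes):.4f}")                         # sample standard deviation
print(f"range = {min(fold_maes):.4f} to {max(fold_maes):.4f}")   # range = 2.1996 to 2.6250
```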


In [111]:
light = model.fit(X, y)
# Use a context manager so the file handle is closed after writing
with open('lightgbm_model.pkl', 'wb') as f:
    pickle.dump(light, f)
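
To reuse the saved model later, the pickle can be loaded back the same way. The sketch below is a minimal round trip. It uses a plain dictionary as a stand-in so that it runs without a fitted LightGBM model, and the helper names `save_model` and `load_model` are hypothetical.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize `model` to `path`, closing the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously pickled object; only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in object so the sketch runs without a fitted LightGBM model
stand_in = {"name": "lightgbm_model", "n_features": 38}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lightgbm_model.pkl")
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)   # True
```

At prediction time, the input must contain the same selected feature columns, in the same order, as the `X` used for fitting.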