# Preliminary Modelling and Model Selection

In [1]:
import pandas as pd
from src.features.helpers.load_data import load_data
from src.models.model_2.model.pipelines_2h import pipeline

train_data, augmented_data, test_data = load_data('2_00h')

all_train_data_transformed = pipeline.fit_transform(pd.concat([train_data, augmented_data]))

# The concat order is [train_data, augmented_data], so the first len(train_data) rows are the original data
# (this assumes the pipeline neither drops nor reorders rows)
target = 'bg+1:00'
train_part = all_train_data_transformed.iloc[:len(train_data)]
augmented_part = all_train_data_transformed.iloc[len(train_data):]

X_train, y_train = train_part.drop(columns=[target]), train_part[target]
X_augmented, y_augmented = augmented_part.drop(columns=[target]), augmented_part[target]

## Lazy Predict

To get a quick overview of how different regression models perform, we use the `LazyPredict` library. It automates the training and evaluation of a range of regression models on the data and reports their scores side by side.
For our use case, we wrote a custom wrapper around `LazyPredict` to control which models are applied.

In [None]:
from src.features.helpers.LazyPredict import get_lazy_regressor
from sklearn.model_selection import train_test_split

# Hold out 20% of the augmented data as the evaluation set
X_augmented_train, X_augmented_test, y_augmented_train, y_augmented_test = train_test_split(X_augmented, y_augmented, test_size=0.2, random_state=42)

# Train on the original data plus the remaining 80% of the augmented data
X_train_all = pd.concat([X_train, X_augmented_train], axis=0)
y_train_all = pd.concat([y_train, y_augmented_train], axis=0)

# Fit every candidate regressor and score it on the held-out augmented rows
lazy_predict_regressor = get_lazy_regressor(exclude=['SVN'])
models, predictions = lazy_predict_regressor.fit(X_train=X_train_all, y_train=y_train_all, X_test=X_augmented_test, y_test=y_augmented_test)
models


## Model Selection

Based on the results from the `LazyPredict` library, we select the best-performing models from different categories, fine-tune them, and combine them into a `StackingRegressor` for the final prediction.

In this case we will use:

* `HistGradientBoostingRegressor`
* `LassoLarsIC`
* `KNeighborsRegressor`
* `XGBRegressor`

### Why HistGradientBoostingRegressor

Category: Gradient Boosting Model

Strengths:
* Efficient histogram-based implementation of gradient boosting that scales well to large datasets and supports categorical features natively.
* Handles missing values natively and works effectively on high-dimensional datasets.
* Generally robust to overfitting when regularization (e.g. an L2 penalty or early stopping) is used.

Unique Contribution to Stacking:
* Captures complex, non-linear relationships.
* Performs well on structured (tabular) data and serves as a strong base learner alongside simpler models.



### Why LassoLarsIC

Category: Linear Model

Strengths:
* Fits a Lasso (L1-regularized) model with the LARS algorithm and selects the regularization strength using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
* Particularly effective for datasets with a large number of features where most coefficients are expected to be zero (sparse models).

Unique Contribution to Stacking:
* Provides a strong linear baseline that helps the ensemble learn from both linear trends and more complex patterns captured by non-linear models.
* Limits overfitting through implicit feature selection: the L1 penalty drives the coefficients of uninformative features to zero.
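The feature-selection effect can be seen in a small sketch with scikit-learn's `LassoLarsIC`. The data is synthetic and is an assumption for illustration: only the first three of twenty features influence the target, and the criterion is set to BIC.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first three features carry signal; the other 17 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso_ic = LassoLarsIC(criterion="bic")
lasso_ic.fit(X, y)

# Indices of the features that kept a non-zero coefficient.
selected = np.flatnonzero(lasso_ic.coef_)
print("chosen alpha:", lasso_ic.alpha_)
print("non-zero coefficients at indices:", selected)
```

No separate cross-validation loop is needed here, because the information criterion picks the regularization strength from a single fit of the LARS path.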



### Why KNeighborsRegressor

Category: Instance-Based Learning (Non-parametric)

Strengths:
* Simple yet effective for small-to-medium datasets, particularly when the relationship between features and the target variable is highly localized.
* Handles non-linear relationships without assuming any prior distribution.

Unique Contribution to Stacking:
* Introduces local prediction capability that complements global models like gradient boosting.
* Offers diversity to the ensemble, as its predictions are based purely on similarity rather than a parametric model.
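A minimal sketch of such a similarity-based prediction with scikit-learn's `KNeighborsRegressor` follows. The synthetic data, the neighbour count, and the distance weighting are assumptions. Because distances depend on feature scale, the sketch standardizes the features first; in this notebook that step may already be covered by the preprocessing pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
X[:, 1] *= 100  # put the second feature on a much larger scale
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Scale first so that no single feature dominates the distance computation.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=10, weights="distance"),
)
knn.fit(X, y)

# Each prediction is a distance-weighted average of the 10 nearest training targets.
X_new = np.array([[0.0, 0.0], [1.5, 100.0]])
knn_preds = knn.predict(X_new)
print(knn_preds)
```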

### Why XGBRegressor

Category: Gradient Boosting Model

Strengths:
* High performance and efficiency due to optimized implementations.
* Supports a variety of tuning options and regularization techniques (L1 and L2).
* Often among the strongest performers on tabular regression tasks.

Unique Contribution to Stacking:
* Brings additional predictive power, especially when the dataset has complex feature interactions.
* Complements `HistGradientBoostingRegressor` by using a different gradient boosting implementation with its own tree-building strategy and regularization options.
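The regularization options can be set through the scikit-learn style interface of `xgboost`, as in the sketch below. The hyperparameter values and the synthetic data are assumptions rather than tuned settings, and the import is guarded because `xgboost` is a separate package that may not be installed.

```python
import numpy as np

try:
    from xgboost import XGBRegressor
except ImportError:  # xgboost is an optional, separately installed package
    XGBRegressor = None

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# The target depends on an interaction between the first two features.
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

xgb_preds = None
if XGBRegressor is not None:
    xgb = XGBRegressor(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.05,
        reg_alpha=0.1,   # L1 penalty on leaf weights
        reg_lambda=1.0,  # L2 penalty on leaf weights
        random_state=0,
    )
    xgb.fit(X, y)
    xgb_preds = xgb.predict(X[:5])
    print(xgb_preds)
```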


### Why Stacking these models

1. Diversity: The models span three categories (linear, gradient boosting, instance-based), so their errors are less likely to be correlated.
2. Complementary Strengths: Combining models that excel in different aspects (e.g., linear vs. non-linear, global vs. local) leads to a well-rounded regressor.
3. Error Reduction: Aggregating predictions can offset individual model weaknesses, which tends to reduce both bias and variance.
4. Improved Generalization: Stacking can capture patterns that individual models miss, which often improves overall performance.


Carefully tuning these models and stacking their predictions lets the ensemble benefit from their combined strengths, which should yield a more robust and accurate final regressor.

