# Modelling

This notebook is used for both training models and using them to generate predictions. Feel free to begin configuring model code!

Below are some models we may use. Included are some observations concerning each of them.
1. N-BEATS
    - Sequence modelling.
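    - Interpretable when the trend/seasonality variant of the architecture is used.
    - Very performant for time-series forecasting.
    - More stable to train than RNN-based networks; N-BEATS itself is built from fully connected blocks, not recurrent layers.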
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.nbeats.html).
2. XGBoost (or LightGBM)
    - Feature modelling. Can use [TSfresh](https://tsfresh.readthedocs.io/en/latest/) or [Ts2Vec](https://github.com/WenjieDu/PyPOTS?tab=readme-ov-file#user-content-fn-48-143bda604d5e3bfec7be057bccbe8255) (considered better than TSfresh) to automatically generate and filter a lot of good features.
    - Quick training
    - Handles label-encoding of categoricals
    - Performant with limited data
    - Handles missing values
    - Requires good-quality training data 
    - Tip: XGBoost can use quantile regression as its objective. Set `objective="reg:quantileerror"` with `quantile_alpha=0.2` (available from XGBoost 2.0; plain `alpha` is the L1 regularisation term). This optimises the model for the 20% quantile, which is what the task asks for.
    - [Docs](https://xgboost.readthedocs.io/en/stable/).

3. DLinear
    - Incredibly simple, yet surprisingly performant for seasonal and cyclical time-series.
    - Decomposes the series into trend and seasonal parts and fits a single linear layer to each; loosely, N-BEATS without the deep stacks.
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.dlinear.html).
4. Temporal Fusion Transformer
    - Sequence modelling. Handles time-series very well.
    - Good interpretability.
    - Requires a good deal of training data to perform well. Maybe we have enough, maybe not.
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.tft_model.html)
5. Random Forest Regressor
    - Bagged decision trees; simpler than XGBoost and usually less accurate.
    - Feature modelling.
    - [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
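
The quantile objective mentioned for XGBoost above can be illustrated without training any model. The sketch below uses synthetic data (an assumption, not project data) to show that the constant prediction minimising the pinball loss at alpha = 0.2 lands on the empirical 20% quantile. The two parameter dictionaries at the end are only there to show how the same objective is usually spelled in XGBoost and LightGBM; they are not executed.

```python
import numpy as np

ALPHA = 0.2  # target quantile from the task description


def pinball_loss(y_true, y_pred, alpha=ALPHA):
    """Mean quantile (pinball) loss: under-prediction costs alpha, over-prediction 1 - alpha."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.maximum(alpha * error, (alpha - 1) * error)))


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
y = rng.normal(loc=100.0, scale=10.0, size=10_000)

# Try a grid of constant predictions and keep the one with the lowest loss.
candidates = np.linspace(60.0, 140.0, 801)
losses = [pinball_loss(y, c) for c in candidates]
best_constant = float(candidates[int(np.argmin(losses))])

print(f"loss-minimising constant: {best_constant:.1f}")
print(f"empirical 20% quantile:   {np.quantile(y, ALPHA):.1f}")

# Parameter dictionaries expressing the same objective (not executed here).
# XGBoost names the quantile level `quantile_alpha`; plain `alpha` is L1 regularisation.
xgb_params = {"objective": "reg:quantileerror", "quantile_alpha": ALPHA}
lgb_params = {"objective": "quantile", "alpha": ALPHA}
```

Because over-prediction is penalised four times as heavily as under-prediction at alpha = 0.2, a model trained with this objective will deliberately predict low.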

## Setup

### Imports

In [None]:
import pandas as pd
from pathlib import Path

### Helper Functions

In [None]:
def load_data(filename, folder="1_raw"):
    """
    Load data from a CSV file in a subfolder of the project's 'data' directory.
    This version is adjusted to work even if the notebook is run from a subfolder.

    Parameters
    ----------
    filename : str
        The name of the file to load, including the extension (e.g., "data.csv").
    folder : str, optional
        The subfolder within 'data' to load from. Defaults to "1_raw".
    """
    # Go up one level from the current working directory to find the project root.
    # Built outside the try block so file_path is always defined in the handlers.
    project_root = Path.cwd().parent
    file_path = project_root / "data" / folder / filename

    try:
        df = pd.read_csv(file_path, sep=",")

        print(f"Data loaded successfully from {file_path}")
        return df
    except FileNotFoundError:
        print(f"Error: The file was not found at {file_path}")
        return None
    except Exception as e:
        print(f"An error occurred while loading the file: {e}")
        return None


def save_data(df, filename, folder="2_interim"):
    """
    Save a dataframe to a CSV file in a subfolder of the project's 'data' directory.

    This function automatically creates the destination folder if it doesn't exist.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to save.
    filename : str
        The name for the output file, including the extension (e.g., "processed_orders.csv").
    folder : str, optional
        The subfolder within 'data' to save to. Defaults to "2_interim".
    """
    try:
        PROJECT_ROOT = Path.cwd().parent
        save_dir = PROJECT_ROOT / "data" / folder
        save_dir.mkdir(parents=True, exist_ok=True)

        # The full filename, including extension, is now expected
        file_path = save_dir / filename

        df.to_csv(file_path, sep=",", index=False)

        print(f"Data saved successfully to {file_path} ✅")

    except Exception as e:
        print(f"An error occurred while saving the file: {e}")
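
A quick way to sanity-check the path layout used by `load_data` and `save_data` is a round trip in a temporary directory. The sketch below reproduces the same `data/<folder>/<filename>` layout, with a temporary directory standing in for the project root. The file name `example.csv` and the two-row frame are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the project root; the helpers use Path.cwd().parent instead.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp)

    # Same layout as save_data: <root>/data/<folder>/<filename>
    save_dir = project_root / "data" / "2_interim"
    save_dir.mkdir(parents=True, exist_ok=True)
    file_path = save_dir / "example.csv"

    original = pd.DataFrame({"rm_id": [1, 2], "net_weight": [10.5, 20.0]})
    original.to_csv(file_path, sep=",", index=False)

    # Same read call as load_data
    restored = pd.read_csv(file_path, sep=",")

# Report whether the frame survived the round trip unchanged.
print(restored.equals(original))
```

Writing with `index=False` matters here: without it, the saved index would come back as an extra unnamed column on reload.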

## LightGBM

### Training


Setup (Blocks 1-4): Import libraries, load the data, and define the helper functions. `quantile_error_scorer` computes the quantile (pinball) loss at alpha = 0.2, which is the metric the competition asks for.

Optuna objective (Block 5): This is the heart of the tuning process. For each trial (a set of hyperparameters), it runs a full time-series cross-validation. It dynamically creates the correct training/validation sets for each fold, trains a model, and calculates the score. It returns the average score across the folds, which gives Optuna a stable target to minimize.

Running the Study (Block 6): This block starts the optimization. It runs the objective function 50 times (`n_trials=50`), with Optuna's default TPE sampler proposing each new set of hyperparameters based on the results of earlier trials.

Final Model Training (Block 7): This is the critical final step. Once Optuna has found the best settings, you don't want to use one of the models from the cross-validation folds. You want a single, final model that has learned from all available historical data. This block trains that definitive model using the optimal parameters, ready for you to generate your final predictions for the submission file.
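
To make the fold structure described above concrete, here is a minimal sketch of how `TimeSeriesSplit` carves a sorted date array into expanding training periods and the validation windows that follow them. The twelve toy dates are an assumption standing in for `all_dates`.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Twelve consecutive days stand in for the sorted unique arrival dates.
toy_dates = pd.date_range("2024-01-01", periods=12, freq="D")

folds = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(toy_dates):
    fold = {
        "train_end": toy_dates[train_idx[-1]],
        "val_start": toy_dates[val_idx[0]],
        "val_end": toy_dates[val_idx[-1]],
    }
    folds.append(fold)
    print(
        f"train up to {fold['train_end'].date()} | "
        f"validate {fold['val_start'].date()} to {fold['val_end'].date()}"
    )
```

Each validation window starts the day after its training period ends, so no fold is scored on dates the model has already seen.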

In [None]:
# --- Block 1: Imports ---
import numpy as np
import lightgbm as lgb
import optuna
from sklearn.model_selection import TimeSeriesSplit

# --- Block 2: Data Loading & Initial Preparation ---
# Load your raw dataframes here
# datasets = load_all_your_csvs()
# master_df = ... your full, merged historical dataframe ...
# master_df['date_arrival'] = pd.to_datetime(master_df['date_arrival'])
master_df = None

# Create the full date range for TimeSeriesSplit
# This should be based on your aggregated daily data
# all_dates = pd.to_datetime(master_df['date_arrival']).dt.date.unique()
# all_dates.sort()
all_dates = []


# --- Block 3: Feature and Target Creation Functions ---
# It's best practice to put your feature engineering logic into functions.


def create_features(data, last_hist_date):
    """
    Creates features for all rm_ids based on historical data up to a cutoff date.

    Args:
        data (pd.DataFrame): The master dataframe with all historical data.
        last_hist_date (pd.Timestamp): The last date to include for feature calculation.

    Returns:
        pd.DataFrame: A dataframe where each row is an rm_id and each column is a feature.
    """
    # This is where your feature engineering logic goes.
    # For now, this is a placeholder.
    # Example:
    # features = data[data['date_arrival'] <= last_hist_date].groupby('rm_id').agg(
    #     avg_net_weight=('net_weight', 'mean'),
    #     std_net_weight=('net_weight', 'std')
    # )
    # return features
    # Fail loudly instead of silently returning None.
    raise NotImplementedError("Replace with your actual feature engineering")


def create_target(data, forecast_start_date, forecast_end_date):
    """
    Creates the cumulative target variable for a given forecast period.

    Args:
        data (pd.DataFrame): The master dataframe.
        forecast_start_date (pd.Timestamp): The first day of the forecast period.
        forecast_end_date (pd.Timestamp): The last day of the forecast period.

    Returns:
        pd.Series: A series where the index is rm_id and the value is the cumulative net_weight.
    """
    # This is where your target creation logic goes.
    # For now, this is a placeholder.
    # Example:
    # target_period = data[
    #     (data['date_arrival'] >= forecast_start_date) &
    #     (data['date_arrival'] <= forecast_end_date)
    # ]
    # target = target_period.groupby('rm_id')['net_weight'].sum()
    # return target
    # Fail loudly instead of silently returning None.
    raise NotImplementedError("Replace with your actual target creation")


# --- Block 4: Custom Scorer for the Competition Metric ---
def quantile_error_scorer(y_true, y_pred):
    """
    Custom scoring function for the Quantile Error at alpha=0.2.
    """
    alpha = 0.2
    error = y_true - y_pred
    loss = np.maximum(alpha * error, (alpha - 1) * error)
    return np.mean(loss)


# --- Block 5: Optuna Objective Function ---
def objective(trial):
    """
    The function Optuna will minimize. It trains a model using TimeSeriesSplit
    and returns the average validation score.
    """
    params = {
        "objective": "quantile",
        "alpha": 0.2,
        "metric": "quantile",
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 20, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "subsample_freq": 1,  # LightGBM ignores subsample unless the bagging frequency is > 0
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "random_state": 42,
        "n_jobs": -1,
    }

    tscv = TimeSeriesSplit(n_splits=3)  # Use 3-5 splits for tuning
    scores = []

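    # This loop is for robust validation of a single hyperparameter set
    for train_indices, val_indices in tscv.split(all_dates):
        # Validation window: the dates that follow the training period
        train_cutoff_date = all_dates[train_indices[-1]]
        validation_start_date = all_dates[val_indices[0]]
        validation_end_date = all_dates[val_indices[-1]]

        # NOTE: This assumes a fixed forecast horizon for validation, e.g., 30 days.
        # The training target uses a window of the same length at the end of the
        # training period, so training features never overlap the training target.
        horizon = len(val_indices)
        if len(train_indices) <= horizon:
            continue  # not enough history before the training window
        train_feature_cutoff = all_dates[train_indices[-horizon - 1]]
        train_target_start = all_dates[train_indices[-horizon]]

        # Training set: features up to the earlier cutoff, target from the window after it.
        # Targets are aligned to the feature index; rm_ids without deliveries get 0.
        X_train = create_features(master_df, train_feature_cutoff)
        y_train = create_target(
            master_df, train_target_start, train_cutoff_date
        ).reindex(X_train.index, fill_value=0)

        # Validation set: features up to the training cutoff, target from the held-out window.
        X_val = create_features(master_df, train_cutoff_date)
        y_val = create_target(
            master_df, validation_start_date, validation_end_date
        ).reindex(X_val.index, fill_value=0)

        model = lgb.LGBMRegressor(**params)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_val, y_val)],
            callbacks=[lgb.early_stopping(50, verbose=False)],
        )

        preds = model.predict(X_val)
        score = quantile_error_scorer(y_val, preds)
        scores.append(score)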

    return np.mean(scores)


# --- Block 6: Running the Optuna Study ---
print("Starting hyperparameter tuning with Optuna...")
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)  # Start with 50 trials, increase if needed

# --- Block 7: Getting Results and Training the Final Model ---
print("\n--- OPTUNA RESULTS ---")
print(f"Number of finished trials: {len(study.trials)}")
best_trial = study.best_trial
print(f"  Value (Quantile Error): {best_trial.value:.4f}")
print("  Best Params: ")
for key, value in best_trial.params.items():
    print(f"    {key}: {value}")

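# Get the best hyperparameters (copied, so the trial record is left untouched)
best_params = dict(best_trial.params)
# Restore the fixed settings that Optuna does not tune
best_params.update(
    {
        "objective": "quantile",
        "alpha": 0.2,
        "subsample_freq": 1,  # needed for subsample to take effect
        "random_state": 42,
        "n_jobs": -1,
    }
)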

print("\n--- Training Final Model on All Historical Data ---")
# A supervised model needs a target, so the final fit uses the most recent window
# of history as the target and everything before that window as features.
# ASSUMPTION: the 30-day horizon is a placeholder mirroring the example above;
# set it to the real forecast horizon.
history_end = pd.to_datetime("2024-12-31")
final_horizon = pd.Timedelta(days=30)
final_feature_cutoff = history_end - final_horizon

final_X_train = create_features(master_df, final_feature_cutoff)
final_y_train = create_target(
    master_df, final_feature_cutoff + pd.Timedelta(days=1), history_end
).reindex(final_X_train.index, fill_value=0)

final_model = lgb.LGBMRegressor(**best_params)
final_model.fit(final_X_train, final_y_train)

print("Final model trained successfully!")
# To predict, build features up to history_end and pass them to final_model.predict.
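
The placeholder logic in `create_features` and `create_target` (Block 3) can be tried on a toy frame before wiring in the real data. The sketch below follows the commented-out examples in those functions. The five-row delivery log, the cutoff date and the window dates are all made up for illustration, and only the column names `rm_id`, `date_arrival` and `net_weight` come from the notebook.

```python
import pandas as pd

# Tiny made-up delivery log; column names follow the notebook's examples.
toy = pd.DataFrame(
    {
        "rm_id": [1, 1, 1, 2, 2],
        "date_arrival": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-01-10", "2024-01-25"]
        ),
        "net_weight": [100.0, 200.0, 50.0, 40.0, 60.0],
    }
)

cutoff = pd.Timestamp("2024-01-31")
window_start = cutoff + pd.Timedelta(days=1)
window_end = pd.Timestamp("2024-02-29")

# Features: aggregate only rows on or before the cutoff.
history = toy[toy["date_arrival"] <= cutoff]
features = history.groupby("rm_id").agg(
    avg_net_weight=("net_weight", "mean"),
    std_net_weight=("net_weight", "std"),
)

# Target: cumulative net_weight inside the forecast window.
in_window = toy[toy["date_arrival"].between(window_start, window_end)]
target = in_window.groupby("rm_id")["net_weight"].sum()

# rm_id 2 has no deliveries in the window, so align to the feature index and fill with 0.
target = target.reindex(features.index, fill_value=0.0)

print(features.join(target.rename("target")))
```

The `reindex` step is the part that is easy to miss: without it, any `rm_id` with no deliveries in the window drops out of the target and the feature and target rows no longer line up.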

### Prediction

### Model Evaluation