# Feature Engineering

Some Todos:
- `kernel_receivals`
    - 
- `kernel_purchase_orders`
    - Consider `modified_date_time` in purchase_orders: although it can be noisy, the time difference `(modified_date_time - created_date_time)` could be engineered into a feature such as `order_revision_duration`.
- `extended_materials`
    - Many entries have a `stock_location` prefixed with `DELETED` followed by a date. It would be useful to check whether deliveries differ before and after that date. My hypothesis is that once a material is deleted, there are no more deliveries for that `rm_id`. If this is the case, a boolean feature `is_deleted` would be powerful.
- `extended_transportation`
    - `packaging_ratio = (gross_weight - net_weight) / gross_weight`.
      Perhaps heavier materials (rm_id A) require more robust (and heavier) packaging than lighter ones (rm_id B).
    - `weight_discrepancy = (gross_weight - tare_weight) - net_weight`.
      The average discrepancy per `supplier_id` or `transporter_name` could be a useful feature: it essentially measures how reliable that supplier's or transporter's reported weights are. A model could learn that predictions for transporters with historically high discrepancies are less certain.
    - Do not include `gross_weight` or `tare_weight` directly in the model's feature set. Instead, use them to engineer the more abstract features above (`packaging_ratio`, `weight_discrepancy`, etc.), then compute historical averages of those features per `rm_id`, `supplier_id`, or `transporter_name` for the final model. As long as each average uses only deliveries that precede the row being predicted, this extracts the information the weights carry without leaking data.
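
The `extended_transportation` ideas above can be sketched as follows. This is a minimal sketch under assumptions, not the notebook's final implementation: it assumes a dataframe with `gross_weight`, `tare_weight`, `net_weight`, and `transporter_name` columns plus a delivery timestamp column, here called `date_arrival` (a hypothetical name). The helpers `add_weight_features` and `add_historical_mean` are illustrative names as well. The historical mean is shifted by one row within each group so that a row never sees its own value.

```python
import numpy as np
import pandas as pd


def add_weight_features(df):
    """Add packaging_ratio and weight_discrepancy (formulas from the to-do list)."""
    out = df.copy()
    # Avoid division by zero: a gross weight of 0 yields NaN instead of inf.
    gross = out["gross_weight"].replace(0, np.nan)
    out["packaging_ratio"] = (out["gross_weight"] - out["net_weight"]) / gross
    out["weight_discrepancy"] = (
        out["gross_weight"] - out["tare_weight"]
    ) - out["net_weight"]
    return out


def add_historical_mean(df, value_col, group_col, time_col):
    """Per-group expanding mean of value_col over strictly earlier rows."""
    out = df.sort_values(time_col).copy()
    # shift(1) drops the current row from its own average (no leakage).
    out[f"{value_col}_hist_mean_by_{group_col}"] = out.groupby(group_col)[
        value_col
    ].transform(lambda s: s.shift(1).expanding().mean())
    return out


# Toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "transporter_name": ["A", "A", "B", "A"],
        "date_arrival": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
        ),
        "gross_weight": [1000.0, 2000.0, 1500.0, 1200.0],
        "tare_weight": [100.0, 200.0, 150.0, 100.0],
        "net_weight": [900.0, 1790.0, 1350.0, 1080.0],
    }
)
features = add_weight_features(toy)
features = add_historical_mean(
    features, "weight_discrepancy", "transporter_name", "date_arrival"
)
print(features)
```

The first delivery of each transporter gets `NaN` for the historical mean because no earlier deliveries exist; how to fill that (global mean, zero, or a separate indicator) is left open here.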

## Setup

This notebook is meant for all feature engineering. Feel free to add or change sections.

### Imports

In [None]:
import pandas as pd
from pathlib import Path

### Helper Functions

In [None]:
def load_data(filename, folder="1_raw"):
    """
    Load data from a CSV file in a subfolder of the project's 'data' directory.

    Assumes the notebook is run from a direct subfolder of the project root,
    so the project root is the parent of the current working directory.

    Parameters
    ----------
    filename : str
        The name of the file to load, including the extension (e.g., "data.csv").
    folder : str, optional
        The subfolder within 'data' to load from. Defaults to "1_raw".

    Returns
    -------
    pandas.DataFrame or None
        The loaded dataframe, or None if the file could not be read.
    """
    # Build the path outside the try block so it is always defined
    # when the error message below references it.
    project_root = Path.cwd().parent
    file_path = project_root / "data" / folder / filename

    try:
        df = pd.read_csv(file_path, sep=",")
    except FileNotFoundError:
        print(f"Error: The file was not found at {file_path}")
        return None
    except Exception as e:
        print(f"An error occurred while loading the file: {e}")
        return None

    print(f"Data loaded successfully from {file_path}")
    return df


def save_data(df, filename, folder="2_interim"):
    """
    Save a dataframe to a CSV file in a subfolder of the project's 'data' directory.

    This function automatically creates the destination folder if it doesn't exist.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to save.
    filename : str
        The name for the output file, including the extension (e.g., "processed_orders.csv").
    folder : str, optional
        The subfolder within 'data' to save to. Defaults to "2_interim".
    """
    try:
        project_root = Path.cwd().parent
        save_dir = project_root / "data" / folder
        save_dir.mkdir(parents=True, exist_ok=True)

        # filename must already include the extension (e.g., ".csv")
        file_path = save_dir / filename

        df.to_csv(file_path, sep=",", index=False)

        print(f"Data saved successfully to {file_path} ✅")

    except Exception as e:
        print(f"An error occurred while saving the file: {e}")

### Data Loading

## Introductory Feature Engineering
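
As a starting point, two of the simpler ideas from the to-do list can be sketched as below. This is a rough sketch under assumptions: it assumes `created_date_time` and `modified_date_time` parse with `pd.to_datetime`, and that deleted stock locations literally start with the string `DELETED`. The exact format of the date that follows the prefix is not known here, so the sketch only flags the prefix and leaves the date unparsed. The function names and the toy values are hypothetical.

```python
import pandas as pd


def add_order_revision_duration(orders):
    """Add order_revision_duration in days (modified minus created)."""
    out = orders.copy()
    # errors="coerce" turns unparseable timestamps into NaT instead of raising.
    created = pd.to_datetime(out["created_date_time"], utc=True, errors="coerce")
    modified = pd.to_datetime(out["modified_date_time"], utc=True, errors="coerce")
    out["order_revision_duration"] = (
        (modified - created).dt.total_seconds() / 86400.0
    )
    return out


def add_is_deleted(materials):
    """Flag materials whose stock_location starts with 'DELETED'."""
    out = materials.copy()
    # Missing locations are treated as not deleted (an assumption).
    out["is_deleted"] = (
        out["stock_location"].fillna("").astype(str).str.startswith("DELETED")
    )
    return out


# Toy data; the stock_location strings are invented examples.
orders = pd.DataFrame(
    {
        "created_date_time": ["2024-01-01 08:00:00", "2024-01-05 12:00:00"],
        "modified_date_time": ["2024-01-03 08:00:00", "2024-01-05 18:00:00"],
    }
)
materials = pd.DataFrame(
    {
        "rm_id": [1, 2, 3],
        "stock_location": ["DELETED_28.02.2018: Silo 4", "Silo 7", None],
    }
)

orders_feat = add_order_revision_duration(orders)
materials_feat = add_is_deleted(materials)
print(orders_feat["order_revision_duration"].tolist())
print(materials_feat["is_deleted"].tolist())
```

Whether `is_deleted` is safe to use depends on when the deletion was recorded: if the flag was set after the period being predicted, it should be applied only from the deletion date onward.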

## Saving the Processed Data

In [None]:
# y_train =
# x_train =
# We may instead generate K-fold windowed cross-validation datasets on the fly
# while training the model. If so, y_test is not saved here.
# y_test =
# x_test =

# save_data expects the full filename, including the extension.
# save_data(x_train, "x_train.csv", folder="processed")
# save_data(x_test, "x_test.csv", folder="processed")
# save_data(y_train, "y_train.csv", folder="processed")
# save_data(y_test, "y_test.csv", folder="processed")
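
If the windowed cross-validation mentioned in the cell above is generated during training, one possible shape for it is an expanding-window split: each fold trains on everything before a cutoff and validates on the window right after it. The sketch below is an assumption about what "windowed" means here, not a settled design. The function `expanding_window_splits`, its parameters, and the `date` column are all hypothetical.

```python
import pandas as pd


def expanding_window_splits(df, time_col, n_folds, val_days):
    """Yield (train_idx, val_idx) index labels for expanding-window folds.

    The last n_folds windows of val_days days each are used for validation,
    oldest first; every fold trains on all rows strictly before its window.
    """
    times = pd.to_datetime(df[time_col])
    end = times.max() + pd.Timedelta(days=1)  # exclusive upper bound
    window = pd.Timedelta(days=val_days)
    for k in range(n_folds, 0, -1):
        val_start = end - k * window
        val_end = val_start + window
        train_mask = (times < val_start).to_numpy()
        val_mask = ((times >= val_start) & (times < val_end)).to_numpy()
        yield df.index[train_mask], df.index[val_mask]


# Toy data: ten consecutive days, one row per day.
toy = pd.DataFrame(
    {"date": pd.date_range("2024-01-01", periods=10, freq="D"), "y": range(10)}
)
folds = [
    (len(train_idx), list(val_idx))
    for train_idx, val_idx in expanding_window_splits(
        toy, "date", n_folds=3, val_days=2
    )
]
print(folds)
```

Because validation rows always come after the training rows, this kind of split keeps the time ordering that the leakage concerns in the to-do list depend on.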