# Exploratory Data Analysis

This notebook is meant for all EDA. Feel free to add or change sections.


Below is a temporary list of TODOs. Extend it whenever you notice something you want to check later but don't have time for right now.

TODO:
- Plot all time-based values to see if there is an obvious trend over time.
- Check for cyclical trends.
    - Do sine- and cos-transformations of hour/weekday/month respectively and look for patterns.
- Check for outliers.
    - Make scatterplots for the continuous values and see which values look suspicious.
- If there are missing values, look for patterns that explain when data is usually missing.
    - Do weekends usually have missing data for example?
- Check for [class imbalances](https://www.geeksforgeeks.org/machine-learning/how-to-handle-imbalanced-classes-in-machine-learning/).
    - If some classes or categories have far fewer data points than others, the data is imbalanced.
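
The sine and cosine transformation from the cyclical-trends item could look like the sketch below. It runs on synthetic timestamps; the helper name `add_cyclical_features` and the column name `timestamp` are assumptions for illustration, not names from the datasets.

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df, datetime_col):
    """Add sine/cosine encodings of hour, weekday, and month (hypothetical helper)."""
    out = df.copy()
    ts = pd.to_datetime(out[datetime_col])
    # (component values, period length) for each cyclical component
    components = {
        "hour": (ts.dt.hour, 24),
        "weekday": (ts.dt.dayofweek, 7),
        "month": (ts.dt.month - 1, 12),  # shift months to 0..11
    }
    for name, (values, period) in components.items():
        angle = 2 * np.pi * values / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    return out


# Synthetic hourly timestamps; "timestamp" is a placeholder column name
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")
demo = pd.DataFrame({"timestamp": hours})
demo_cyc = add_cyclical_features(demo, "timestamp")
```

Each sine/cosine pair places the value on a unit circle, so hour 23 ends up next to hour 0 instead of 23 steps away. Plotting a target against these columns is one way to look for cyclical patterns.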

## Setup

### Imports

In [None]:
import pandas as pd
from pathlib import Path
from tabulate import tabulate

### Helper Functions

In [None]:
def load_data(filename, folder="1_raw"):
    """
    Load data from a CSV file in a subfolder of the project's 'data' directory.
    Assumes the notebook is run from a folder one level below the project root.

    Parameters
    ----------
    filename : str
        The name of the file to load, including the extension (e.g., "data.csv").
    folder : str, optional
        The subfolder within 'data' to load from. Defaults to "1_raw".
    """
    try:
        # Go up one level from the current working directory to find the project root
        PROJECT_ROOT = Path.cwd().parent

        file_path = PROJECT_ROOT / "data" / folder / filename

        df = pd.read_csv(file_path, sep=",")

        print(f"Data loaded successfully from {file_path}")
        return df
    except FileNotFoundError:
        print(f"Error: The file was not found at {file_path}")
        return None
    except Exception as e:
        print(f"An error occurred while loading the file: {e}")
        return None


def save_data(df, filename, folder="2_interim"):
    """
    Save a dataframe to a CSV file in a subfolder of the project's 'data' directory.

    This function automatically creates the destination folder if it doesn't exist.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to save.
    filename : str
        The name for the output file, including the extension (e.g., "processed_orders.csv").
    folder : str, optional
        The subfolder within 'data' to save to. Defaults to "2_interim".
    """
    try:
        PROJECT_ROOT = Path.cwd().parent
        save_dir = PROJECT_ROOT / "data" / folder
        save_dir.mkdir(parents=True, exist_ok=True)

        # The full filename, including extension, is now expected
        file_path = save_dir / filename

        df.to_csv(file_path, sep=",", index=False)

        print(f"Data saved successfully to {file_path} ✅")

    except Exception as e:
        print(f"An error occurred while saving the file: {e}")

### Data Loading

In [None]:
# Extended files
df_em = load_data("extended_materials.csv")
df_et = load_data("extended_transportation.csv")

# Kernel files
df_kpo = load_data("kernel_purchase_orders.csv")
df_kr = load_data("kernel_receivals.csv")

# Other files
df_pm = load_data("prediction_mapping.csv")
df_ss = load_data("sample_submission.csv")

datasets = {
    "extended_materials": df_em,
    "extended_transportation": df_et,
    "kernel_purchase_orders": df_kpo,
    "kernel_receivals": df_kr,
    "prediction_mapping": df_pm,
    "sample_submission": df_ss,
}

## Introductory EDA

### Checking .head(n=20) for all dataframes

In [None]:
df_em.head(n=20)

In [None]:
df_em.describe(include="all")

In [None]:
df_et.head(n=20)

In [None]:
df_kpo.head(n=20)

In [None]:
df_kr.head(n=20)

In [None]:
df_pm.head(n=20)

In [None]:
df_ss.head(n=20)

In [None]:
# --- Full Table ---
all_features = sorted(
    {feature for df in datasets.values() for feature in df.columns}
)
presence_data = []

for name, df in datasets.items():
    row = {}
    for feature in all_features:
        if feature in df.columns:
            series = df[feature]
            total_count = len(series)

            if total_count > 0:
                non_nan_count = series.count()
                fill_grade = (non_nan_count / total_count) * 100
                row[feature] = f"✅ {fill_grade:.1f}%"
            else:
                # An empty dataframe has no rows, so a fill grade is undefined
                row[feature] = "✅ n/a"
        else:
            row[feature] = ""
    presence_data.append(row)

presence_df = pd.DataFrame(presence_data, index=datasets.keys())
presence_df_transposed = presence_df.T
presence_df_transposed.index.name = "Feature Name"
presence_df_transposed.columns.name = "DataFrame Name"

print("--- Feature Presence & Fill Grade Across All DataFrames ---")
print(
    tabulate(presence_df_transposed, headers="keys", tablefmt="grid", stralign="center")
)
print("\n" * 2)  # Add some space between the tables


# --- Kernel Table ---

# 1. Filter the datasets dictionary
kernel_datasets = {
    "kernel_purchase_orders": datasets["kernel_purchase_orders"],
    "kernel_receivals": datasets["kernel_receivals"],
}

# 2. Re-run the table generation logic on the filtered dictionary
kernel_features = sorted(
    {feature for df in kernel_datasets.values() for feature in df.columns}
)
kernel_presence_data = []

for name, df in kernel_datasets.items():
    row = {}
    for feature in kernel_features:
        if feature in df.columns:
            series = df[feature]
            total_count = len(series)

            if total_count > 0:
                non_nan_count = series.count()
                fill_grade = (non_nan_count / total_count) * 100
                row[feature] = f"✅ {fill_grade:.1f}%"
            else:
                # An empty dataframe has no rows, so a fill grade is undefined
                row[feature] = "✅ n/a"
        else:
            row[feature] = ""
    kernel_presence_data.append(row)

kernel_presence_df = pd.DataFrame(kernel_presence_data, index=kernel_datasets.keys())
kernel_presence_df_transposed = kernel_presence_df.T
kernel_presence_df_transposed.index.name = "Feature Name"
kernel_presence_df_transposed.columns.name = "DataFrame Name"


print("--- Feature Presence & Fill Grade for Kernel DataFrames ---")
print(
    tabulate(
        kernel_presence_df_transposed,
        headers="keys",
        tablefmt="grid",
        stralign="center",
    )
)
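
The full table and the kernel table share the same logic, so it could be factored into one helper. The following is a sketch only: the function name `fill_grade_table` is hypothetical, it returns numeric percentages rather than formatted strings, and it is demonstrated on toy frames with made-up values.

```python
import pandas as pd


def fill_grade_table(frames):
    """Return a features x dataframes table of fill grades in percent.

    Hypothetical helper: a cell is NaN where a dataframe lacks the column.
    """
    # DataFrame.notna().mean() gives the share of non-missing values per column
    grades = {name: df.notna().mean() * 100 for name, df in frames.items()}
    table = pd.DataFrame(grades).sort_index()
    table.index.name = "Feature Name"
    table.columns.name = "DataFrame Name"
    return table


# Toy frames with invented values, standing in for the loaded datasets
toy = {
    "a": pd.DataFrame({"x": [1, None, 3, 4], "y": ["u", "v", None, None]}),
    "b": pd.DataFrame({"y": ["u", "v"], "z": [0.5, 1.5]}),
}
toy_table = fill_grade_table(toy)
```

Calling the helper once with `datasets` and once with `kernel_datasets` would then produce both tables, and the result can still be passed to `tabulate` for display.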

## 'Advice' regarding what to do next

Based on the fill-grade tables above and the project requirements, here is a concrete, step-by-step plan for your EDA and data processing. Think of this as a roadmap for your `1_exploratory_data_analysis.ipynb` and `2_data_processing.ipynb` notebooks.

### 🗺️ Your Action Plan: From Raw Data to a Clean Foundation

Your goal is to transform the raw CSV files into a single, clean "master dataframe" that can be fed into your feature engineering functions.

---
### **Part 1: Targeted Exploratory Data Analysis (EDA)**
*(`1_exploratory_data_analysis.ipynb`)*

Your EDA shouldn't be a random search; it should be a targeted investigation to answer specific questions that will help you build features. Here’s what you're looking for:

#### **1. Understand the Target Variable (`net_weight`)**
First, get a feel for the delivery patterns.
* **Action:**
    1.  Load `kernel_receivals.csv`.
    2.  Convert `date_arrival` to a datetime object.
    3.  Aggregate the data to get the total daily `net_weight` for each `rm_id`.
* **What to look for (and visualize):**
    * **Sparsity:** Plot the daily deliveries for a few high-volume `rm_id`s. Do deliveries happen every day, or are they large but infrequent? This is critical. A model needs to handle many days with zero deliveries.
    * **Seasonality:** Are there weekly patterns (e.g., fewer deliveries on weekends)? Or monthly/quarterly patterns? A bar chart of total deliveries by month or day of the week can reveal this.
    * **Outliers:** Are there any negative `net_weight` values? Or values that are orders of magnitude larger than the rest? These could be data entry errors that need cleaning.
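
The aggregation steps above might look like the following sketch. It uses a small synthetic frame in place of `kernel_receivals.csv`: the values are invented, and only the column names `rm_id`, `date_arrival`, and `net_weight` come from the text. The helper column `date` is an assumption added for grouping.

```python
import pandas as pd

# Synthetic stand-in for kernel_receivals (values are invented for illustration)
receivals = pd.DataFrame(
    {
        "rm_id": [1, 1, 1, 2],
        "date_arrival": [
            "2024-01-01 08:00",
            "2024-01-01 15:30",
            "2024-01-04 09:00",
            "2024-01-02 10:00",
        ],
        "net_weight": [100.0, 50.0, 200.0, 75.0],
    }
)
receivals["date_arrival"] = pd.to_datetime(receivals["date_arrival"])
# Strip the time of day so that all deliveries on one day share a key
receivals["date"] = receivals["date_arrival"].dt.normalize()

# Total net_weight per calendar day (rows) and rm_id (columns)
daily = receivals.groupby(["rm_id", "date"])["net_weight"].sum().unstack("rm_id")

# Reindex to a full daily calendar so days without deliveries show up as 0
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range).fillna(0.0)

# Sparsity: share of zero-delivery days per rm_id
zero_share = (daily == 0).mean()
# Seasonality: total delivered weight per day of week (0 = Monday)
weekday_totals = daily.groupby(daily.index.dayofweek).sum()
```

Reindexing to a full calendar matters here: without it, days with no deliveries are simply absent from the frame, and the sparsity would be invisible in both plots and summary statistics.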


#### **2. Analyze the Link Between Orders and Receivals (Most Important)**
This is where your most powerful predictive signals are.
* **Action:**
    1.  Merge `kernel_receivals.csv` and `kernel_purchase_orders.csv`.
    2.  Calculate the **Delivery Delay**: `delay = date_arrival - delivery_date` (actual arrival minus expected delivery). This is your single most important exploratory variable.
* **What to look for (and visualize):**
    * **Delay Distribution:** Plot a histogram of the `delay` in days. Is it normally distributed around zero? Is there a long tail of very late deliveries? The shape of this distribution tells you how reliable the `delivery_date` is.
    * **Segmented Delays:** Does the average delay change based on other categories? Create boxplots of the delay grouped by:
        * `supplier_id` (from `kernel_receivals`)
        * `transporter_name` (from `extended_transportation`)
        * `raw_material_format_type` (from `extended_materials`)
        If you find that "Supplier X" is consistently 5 days late, you've just discovered a hugely valuable feature.
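
A minimal sketch of the delay calculation, using synthetic stand-ins for the two kernel files. The values are invented. It assumes the composite key `purchase_order_id` plus `purchase_order_item_no` described in Part 2, and that `delivery_date` holds the expected delivery date; the column name `delay_days` is introduced here for illustration.

```python
import pandas as pd

# Synthetic stand-ins for the kernel files (values are invented)
orders = pd.DataFrame(
    {
        "purchase_order_id": [10, 10, 11],
        "purchase_order_item_no": [1, 2, 1],
        "delivery_date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-08"]),
    }
)
receivals = pd.DataFrame(
    {
        "purchase_order_id": [10, 10, 11],
        "purchase_order_item_no": [1, 2, 1],
        "supplier_id": ["A", "A", "B"],
        "date_arrival": pd.to_datetime(["2024-01-07", "2024-01-09", "2024-01-13"]),
    }
)

key = ["purchase_order_id", "purchase_order_item_no"]
# validate raises if an order line appears more than once on the orders side
merged = receivals.merge(orders, on=key, how="left", validate="many_to_one")

# Positive delay = arrived later than expected; negative = arrived early
merged["delay_days"] = (merged["date_arrival"] - merged["delivery_date"]).dt.days

# Per-supplier summary as a numeric counterpart to the boxplots
delay_by_supplier = merged.groupby("supplier_id")["delay_days"].agg(
    ["mean", "median", "count"]
)
```

From here, `merged["delay_days"].plot.hist()` and `merged.boxplot(column="delay_days", by="supplier_id")` would give the histogram and boxplots, assuming matplotlib is installed.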

---
### **Part 2: Data Cleaning and Processing**
*(`2_data_processing.ipynb`)*

Based on your EDA findings, you can now build a clean, unified dataset.

#### **1. Create the `master_df`**
* **Action:**
    1.  Start with `kernel_receivals`.
    2.  **Left join** `kernel_purchase_orders` using `purchase_order_id` and `purchase_order_item_no` as the composite key.
    3.  Optionally, left join `extended_materials` (on `rm_id`) and `extended_transportation` (on your composite key) to bring in useful features like `raw_material_format_type` and `transporter_name`.
    4.  **Ignore the sparse columns** from `extended_transportation` (e.g., `wood`, `plastic`, `ice`). They are mostly empty, so they are unlikely to add useful signal.
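
The joins above can be sketched as follows, again on synthetic frames with invented values (including the format types `"bulk"` and `"bag"`). The indicator column name `order_match` is an assumption, and the `extended_transportation` join is left out for brevity.

```python
import pandas as pd

# Synthetic stand-ins (values are invented); order 12 has no matching purchase order
receivals = pd.DataFrame(
    {
        "purchase_order_id": [10, 10, 12],
        "purchase_order_item_no": [1, 2, 1],
        "rm_id": [100, 101, 100],
        "net_weight": [500.0, 250.0, 300.0],
    }
)
orders = pd.DataFrame(
    {
        "purchase_order_id": [10, 10, 11],
        "purchase_order_item_no": [1, 2, 1],
        "delivery_date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-08"]),
    }
)
materials = pd.DataFrame(
    {"rm_id": [100, 101], "raw_material_format_type": ["bulk", "bag"]}
)

po_key = ["purchase_order_id", "purchase_order_item_no"]

# indicator adds a column saying whether each receival found a matching order
master_df = receivals.merge(orders, on=po_key, how="left", indicator="order_match")
master_df = master_df.merge(materials, on="rm_id", how="left")

# A left join should never change the row count unless the right side has duplicate keys
assert len(master_df) == len(receivals)

unmatched = int((master_df["order_match"] == "left_only").sum())
```

Checking the row count after every join is a cheap guard: duplicate keys on the right-hand side silently multiply rows, which would inflate `net_weight` totals later on.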

#### **2. Clean the Data**
* **Action:**
    * **Data Types:** Convert all date columns (`date_arrival`, `delivery_date`, `created_date_time`) to datetime with `pd.to_datetime`.
    * **Missing Values:** Your `net_weight` fill rate is 99.9%. The easiest and safest approach is to simply **drop the rows** where `net_weight` is missing. For `batch_id` (52.8% filled), treat the missing values as a special category (e.g., fill with "Unknown" or -1).
    * **Outliers:** Based on your EDA, remove any obvious errors (e.g., negative `net_weight`).
    * **Categorical Features:** For columns like `supplier_id`, `transporter_name`, etc., convert them to pandas' `category` dtype. This is more memory-efficient and LightGBM can handle it directly.
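
The cleaning steps above could be sketched like this. The frame is synthetic with invented values; filling `batch_id` with `-1` is one of the two options mentioned above, chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for master_df (values are invented)
master_df = pd.DataFrame(
    {
        "date_arrival": ["2024-01-07", "2024-01-09", "2024-01-11", "2024-01-13"],
        "net_weight": [500.0, np.nan, 250.0, -20.0],
        "batch_id": [1.0, np.nan, np.nan, 4.0],
        "supplier_id": [7, 7, 8, 9],
    }
)

# Data types: parse date strings into datetime64 values
master_df["date_arrival"] = pd.to_datetime(master_df["date_arrival"])

# Missing values: drop rows without a target value
clean = master_df.dropna(subset=["net_weight"])

# Outliers: remove obviously invalid negative weights
clean = clean[clean["net_weight"] >= 0].copy()

# Missing batch_id becomes its own sentinel category
clean["batch_id"] = clean["batch_id"].fillna(-1)

# Categorical features: use the memory-efficient category dtype
clean["supplier_id"] = clean["supplier_id"].astype("category")
```

Note that the category conversion happens after filtering. Otherwise categories that only occurred in dropped rows would remain in the dtype; `Series.cat.remove_unused_categories()` handles that case if the order has to be reversed.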

#### **3. Save the Processed Data**
* **Action:** Save your clean `master_df` to the `data/3_processed` directory as a Parquet or Feather file. Both formats are typically much faster to load than a CSV and preserve your data types, so you don't have to repeat the cleaning steps every time you work on your model.
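
To illustrate why CSV is a poor fit for the cleaned data, the sketch below round-trips a tiny synthetic frame through CSV in memory and compares the dtypes before and after. The values are invented.

```python
import io

import pandas as pd

# Tiny synthetic frame with a datetime and a category column (values are invented)
df = pd.DataFrame(
    {
        "date_arrival": pd.to_datetime(["2024-01-07", "2024-01-09"]),
        "supplier_id": pd.Series([7, 8]).astype("category"),
    }
)

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# Compare dtypes: the datetime and category information is lost in the CSV
dtypes_before = df.dtypes.astype(str).to_dict()
dtypes_after = round_tripped.dtypes.astype(str).to_dict()
```

After the round trip, `date_arrival` comes back as plain text and `supplier_id` as an integer column. Writing with `DataFrame.to_parquet` and reading with `pd.read_parquet` generally keeps both dtypes, provided a Parquet engine such as `pyarrow` is installed.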

Once you have this clean `master_df`, you are perfectly set up to start implementing the `create_features` and `create_target` functions in your `lgbm_pipeline.py` script. Your data work will then directly feed your modeling pipeline.