# Exploratory Data Analysis

This notebook is meant for all EDA. Feel free to add or change sections.


Below is a temporary list of TODOs. Extend it if you notice something you want to check later but don't have time to do right now.

TODO:
- Plot all time-based values to see if there is an obvious trend over time.
- Check for cyclical trends.
    - Apply sine and cosine transformations to hour/weekday/month and look for patterns.
- Check for outliers.
    - Make scatterplots of the continuous values and see which values look suspicious.
- If there are missing values, look for patterns that explain when data is usually missing.
    - For example, do weekends usually have missing data?
- Check for [class imbalances](https://www.geeksforgeeks.org/machine-learning/how-to-handle-imbalanced-classes-in-machine-learning/).
    - Class imbalance means that some classes or categories have far fewer data points than others, which can bias a model towards the common ones.
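
The sine/cosine idea from the TODO list can be sketched as follows. This is a minimal illustration on made-up timestamps, not on the project data; the helper name `add_cyclical_features` and the column names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df, column, period):
    """Return a copy of df with sine/cosine encodings of a periodic column.

    `period` is the length of one cycle (24 for hour, 7 for weekday, 12 for month).
    """
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Toy timestamps, not taken from the project data.
demo = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 23:00"]
        )
    }
)
demo["hour"] = demo["timestamp"].dt.hour
demo = add_cyclical_features(demo, "hour", period=24)
print(demo[["hour", "hour_sin", "hour_cos"]])
```

The point of the encoding is that hour 23 and hour 0 end up close to each other in the (`hour_sin`, `hour_cos`) plane, whereas the raw integers are 23 apart.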

## Setup

### Imports

In [37]:
import pandas as pd
from pathlib import Path
from tabulate import tabulate

### Helper Functions

In [26]:
def load_data(filename, folder="1_raw"):
    """
    Load data from a CSV file in a subfolder of the project's 'data' directory.
    This version is adjusted to work even if the notebook is run from a subfolder.

    Parameters
    ----------
    filename : str
        The name of the file to load, including the extension (e.g., "data.csv").
    folder : str, optional
        The subfolder within 'data' to load from. Defaults to "1_raw".
    """
    try:
        # Go up one level from the current working directory to find the project root
        PROJECT_ROOT = Path.cwd().parent

        file_path = PROJECT_ROOT / "data" / folder / filename

        df = pd.read_csv(file_path, sep=",")

        print(f"Data loaded successfully from {file_path}")
        return df
    except FileNotFoundError:
        print(f"Error: The file was not found at {file_path}")
        return None
    except Exception as e:
        print(f"An error occurred while loading the file: {e}")
        return None


def save_data(df, filename, folder="2_interim"):
    """
    Save a dataframe to a CSV file in a subfolder of the project's 'data' directory.

    This function automatically creates the destination folder if it doesn't exist.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to save.
    filename : str
        The name for the output file, including the extension (e.g., "processed_orders.csv").
    folder : str, optional
        The subfolder within 'data' to save to. Defaults to "2_interim".
    """
    try:
        PROJECT_ROOT = Path.cwd().parent
        save_dir = PROJECT_ROOT / "data" / folder
        save_dir.mkdir(parents=True, exist_ok=True)

        # The full filename, including extension, is now expected
        file_path = save_dir / filename

        df.to_csv(file_path, sep=",", index=False)

        print(f"Data saved successfully to {file_path} ✅")

    except Exception as e:
        print(f"An error occurred while saving the file: {e}")

### Data Loading

In [32]:
# Extended files
df_em = load_data("extended_materials.csv")
df_et = load_data("extended_transportation.csv")

# Kernel files
df_kpo = load_data("kernel_purchase_orders.csv")
df_kr = load_data("kernel_receivals.csv")

# Other files
df_pm = load_data("prediction_mapping.csv")
df_ss = load_data("sample_submission.csv")

datasets = {
    "extended_materials": df_em,
    "extended_transportation": df_et,
    "kernel_purchase_orders": df_kpo,
    "kernel_receivals": df_kr,
    "prediction_mapping": df_pm,
    "sample_submission": df_ss,
}

Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/extended_materials.csv
Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/extended_transportation.csv
Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/kernel_purchase_orders.csv
Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/kernel_receivals.csv
Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/prediction_mapping.csv
Data loaded successfully from /home/kfkh/Code/tdt4173-course-project/data/1_raw/sample_submission.csv


## Introductory EDA

### Checking .head(n=20) for all dataframes

In [28]:
df_em.head(n=20)

Unnamed: 0,rm_id,product_id,product_version,raw_material_alloy,raw_material_format_type,stock_location
0,,,,,,
1,342.0,91900170.0,1.0,SB06 Traders,24.0,DELETED_28.02:2011_SB06 anodiz
2,343.0,91900143.0,2.0,SB02 606035,24.0,SB 02
3,345.0,91900143.0,2.0,SA10 606035,3.0,DELETED_28.02:2011_ST01
4,346.0,91900146.0,2.0,SA15 600540,3.0,DELETED_28.02:2011_SA 300370
5,347.0,91900143.0,2.0,SA13 606020,3.0,DELETED_28.02:2011_SA13 6035
6,348.0,91900143.0,2.0,SA11 606035,3.0,DELETED_28.02:2011_ST01
7,353.0,91900143.0,1.0,TYB 6060,23.0,DELETED_28.02:2011_TYB
8,354.0,91900182.0,1.0,SA99.5,21.0,DELETED_10.09:2015_SA 99.5
9,355.0,91900152.0,14.0,"PM99,7 Coils",7.0,DELETED_21.06:2019_Bobinas Primario


In [31]:
df_em.describe(include="all")

Unnamed: 0,rm_id,product_id,product_version,raw_material_alloy,raw_material_format_type,stock_location
count,1217.0,1217.0,1217.0,1217,1217.0,1217
unique,,,,180,,153
top,,,,CPS Prof 6060,,SB 16
freq,,,,55,,43
mean,2546.416598,83518760.0,16.420707,,32.516845,
std,783.283365,26469140.0,13.187869,,13.172436,
min,342.0,1002.0,1.0,,1.0,
25%,2133.0,91900150.0,5.0,,24.0,
50%,2160.0,91900300.0,13.0,,36.0,
75%,3125.0,91901440.0,25.0,,47.0,


In [None]:
df_et.head(n=20)

Unnamed: 0,rm_id,product_id,purchase_order_id,purchase_order_item_no,receival_item_no,batch_id,transporter_name,vehicle_no,unit_status,vehicle_start_weight,...,net_weight,wood,ironbands,plastic,water,ice,other,chips,packaging,cardboard
0,365.0,91900143.0,208545.0,10.0,1,,Transporter0,Vehicle0,Transferred,40040.0,...,11420.0,,,,,,,,,
1,365.0,91900143.0,208545.0,10.0,2,,Transporter0,Vehicle0,Transferred,40040.0,...,13760.0,,,,,,,,,
2,365.0,91900143.0,208490.0,10.0,1,,Transporter1,Vehicle1,Transferred,39940.0,...,11281.0,,,,,,,,,
3,365.0,91900143.0,208490.0,10.0,2,,Transporter1,Vehicle1,Transferred,39940.0,...,13083.0,,,,,,,,,
4,379.0,91900296.0,210435.0,20.0,1,,Transporter2,Vehicle2,Transferred,39360.0,...,23910.0,,,,,,,,,
5,389.0,91900330.0,208535.0,30.0,1,,Transporter3,Vehicle3,Transferred,22600.0,...,8680.0,,,,,,,,,
6,365.0,91900143.0,208532.0,10.0,1,,Transporter1,Vehicle4,Transferred,39080.0,...,14840.0,,,,,,,,,
7,369.0,91900146.0,208532.0,30.0,2,,Transporter1,Vehicle4,Transferred,39080.0,...,6745.0,,,,,,,,,
8,366.0,91900160.0,208532.0,20.0,3,,Transporter1,Vehicle4,Transferred,39080.0,...,3015.0,,,,,,,,,
9,365.0,91900143.0,208537.0,10.0,1,,Transporter4,Vehicle5,Transferred,40500.0,...,25060.0,,,,,,,,,


In [None]:
df_kpo.head(n=20)

Unnamed: 0,purchase_order_id,purchase_order_item_no,quantity,delivery_date,product_id,product_version,created_date_time,modified_date_time,unit_id,unit,status_id,status
0,1,1,-14.0,2003-05-12 00:00:00.0000000 +02:00,91900143,1,2003-05-12 10:00:48.0000000 +00:00,2004-06-15 06:16:18.0000000 +00:00,,,2,Closed
1,22,1,23880.0,2003-05-27 00:00:00.0000000 +02:00,91900160,1,2003-05-27 12:42:07.0000000 +00:00,2012-06-29 09:41:13.0000000 +00:00,,,2,Closed
2,41,1,0.0,2004-03-08 00:00:00.0000000 +01:00,91900143,1,2004-03-08 13:44:31.0000000 +00:00,2012-07-04 13:51:02.0000000 +00:00,,,2,Closed
3,61,1,0.0,2004-03-10 00:00:00.0000000 +01:00,91900143,1,2004-03-10 11:39:06.0000000 +00:00,2012-07-04 13:50:59.0000000 +00:00,,,2,Closed
4,141,10,25000.0,2004-10-28 00:00:00.0000000 +02:00,91900143,1,2004-10-22 12:21:54.0000000 +00:00,2012-07-04 13:50:55.0000000 +00:00,,,2,Closed
5,161,10,6000.0,2005-03-11 00:00:00.0000000 +01:00,91900143,1,2005-03-11 13:53:25.0000000 +00:00,2012-07-04 13:50:49.0000000 +00:00,,,2,Closed
6,161,20,15000.0,2006-03-27 00:00:00.0000000 +02:00,91900143,1,2006-03-27 11:04:44.0000000 +00:00,2012-07-04 13:50:52.0000000 +00:00,,,2,Closed
7,361,10,150000.0,2012-07-31 00:00:00.0000000 +02:00,91900296,1,2012-07-04 13:53:29.0000000 +00:00,2014-07-29 10:58:02.0000000 +00:00,,,2,Closed
8,361,20,150000.0,2012-07-31 00:00:00.0000000 +02:00,91900170,1,2012-07-04 13:55:14.0000000 +00:00,2014-07-29 10:58:03.0000000 +00:00,,,2,Closed
9,361,30,150000.0,2012-07-31 00:00:00.0000000 +02:00,91901050,1,2012-07-09 07:52:43.0000000 +00:00,2014-07-29 10:58:05.0000000 +00:00,,,2,Closed


In [None]:
df_kr.head(n=20)

Unnamed: 0,rm_id,product_id,purchase_order_id,purchase_order_item_no,receival_item_no,batch_id,date_arrival,receival_status,net_weight,supplier_id
0,365.0,91900143.0,208545.0,10.0,1,,2004-06-15 13:34:00 +02:00,Completed,11420.0,52062
1,365.0,91900143.0,208545.0,10.0,2,,2004-06-15 13:34:00 +02:00,Completed,13760.0,52062
2,365.0,91900143.0,208490.0,10.0,1,,2004-06-15 13:38:00 +02:00,Completed,11281.0,50468
3,365.0,91900143.0,208490.0,10.0,2,,2004-06-15 13:38:00 +02:00,Completed,13083.0,50468
4,379.0,91900296.0,210435.0,20.0,1,,2004-06-15 13:40:00 +02:00,Completed,23910.0,52577
5,389.0,91900330.0,208535.0,30.0,1,,2004-06-15 13:43:00 +02:00,Completed,8680.0,55251
6,365.0,91900143.0,208532.0,10.0,1,,2004-06-15 13:46:00 +02:00,Completed,14840.0,20023
7,369.0,91900146.0,208532.0,30.0,2,,2004-06-15 13:46:00 +02:00,Completed,6745.0,20023
8,366.0,91900160.0,208532.0,20.0,3,,2004-06-15 13:46:00 +02:00,Completed,3015.0,20023
9,365.0,91900143.0,208537.0,10.0,1,,2004-06-16 08:26:00 +02:00,Completed,25060.0,50387


In [None]:
df_pm.head(n=20)

Unnamed: 0,ID,rm_id,forecast_start_date,forecast_end_date
0,1,365,2025-01-01,2025-01-02
1,2,365,2025-01-01,2025-01-03
2,3,365,2025-01-01,2025-01-04
3,4,365,2025-01-01,2025-01-05
4,5,365,2025-01-01,2025-01-06
5,6,365,2025-01-01,2025-01-07
6,7,365,2025-01-01,2025-01-08
7,8,365,2025-01-01,2025-01-09
8,9,365,2025-01-01,2025-01-10
9,10,365,2025-01-01,2025-01-11


In [None]:
df_ss.head(n=20)

Unnamed: 0,ID,predicted_weight
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
5,6,0
6,7,0
7,8,0
8,9,0
9,10,0


In [41]:
def build_presence_table(frames):
    """Build a feature-by-dataframe table showing each feature's fill grade (% non-NaN)."""
    features = sorted({feature for df in frames.values() for feature in df.columns})
    rows = []

    for df in frames.values():
        row = {}
        for feature in features:
            if feature not in df.columns:
                row[feature] = ""  # Feature does not exist in this dataframe
            elif len(df) == 0:
                row[feature] = "✅ 100.0%"  # Empty dataframe: no values can be missing
            else:
                fill_grade = (df[feature].count() / len(df)) * 100
                row[feature] = f"✅ {fill_grade:.1f}%"
        rows.append(row)

    table = pd.DataFrame(rows, index=list(frames.keys())).T
    table.index.name = "Feature Name"
    table.columns.name = "DataFrame Name"
    return table


# --- Full Table ---
presence_df_transposed = build_presence_table(datasets)

print("--- Feature Presence & Fill Grade Across All DataFrames ---")
print(
    tabulate(presence_df_transposed, headers="keys", tablefmt="grid", stralign="center")
)
print("\n" * 2)  # Add some space between the tables


# --- Kernel Table ---
kernel_datasets = {
    "kernel_purchase_orders": datasets["kernel_purchase_orders"],
    "kernel_receivals": datasets["kernel_receivals"],
}
kernel_presence_df_transposed = build_presence_table(kernel_datasets)

print("--- Feature Presence & Fill Grade for Kernel DataFrames ---")
print(
    tabulate(
        kernel_presence_df_transposed,
        headers="keys",
        tablefmt="grid",
        stralign="center",
    )
)

--- Feature Presence & Fill Grade Across All DataFrames ---
+--------------------------+----------------------+---------------------------+--------------------------+--------------------+----------------------+---------------------+
|       Feature Name       |  extended_materials  |  extended_transportation  |  kernel_purchase_orders  |  kernel_receivals  |  prediction_mapping  |  sample_submission  |
|            ID            |                      |                           |                          |                    |      ✅ 100.0%       |      ✅ 100.0%      |
+--------------------------+----------------------+---------------------------+--------------------------+--------------------+----------------------+---------------------+
|         batch_id         |                      |         ✅ 52.8%          |                          |      ✅ 52.8%      |                      |                     |
+--------------------------+----------------------+---------------------------+

## 'Advice' regarding what to do next

Based on that table and the project requirements, here is a concrete, step-by-step plan for your EDA and data processing. Think of this as a roadmap for your `1_exploratory_data_analysis.ipynb` and `2_data_processing.ipynb` notebooks.

### 🗺️ Your Action Plan: From Raw Data to a Clean Foundation

Your goal is to transform the raw CSV files into a single, clean "master dataframe" that can be fed into your feature engineering functions.

---
### **Part 1: Targeted Exploratory Data Analysis (EDA)**
*(`1_exploratory_data_analysis.ipynb`)*

Your EDA shouldn't be a random search; it should be a targeted investigation to answer specific questions that will help you build features. Here’s what you're looking for:

#### **1. Understand the Target Variable (`net_weight`)**
First, get a feel for the delivery patterns.
* **Action:**
    1.  Load `kernel_receivals.csv`.
    2.  Convert `date_arrival` to a datetime object.
    3.  Aggregate the data to get the total daily `net_weight` for each `rm_id`.
* **What to look for (and visualize):**
    * **Sparsity:** Plot the daily deliveries for a few high-volume `rm_id`s. Do deliveries happen every day, or are they large but infrequent? This is critical. A model needs to handle many days with zero deliveries.
    * **Seasonality:** Are there weekly patterns (e.g., fewer deliveries on weekends)? Or monthly/quarterly patterns? A bar chart of total deliveries by month or day of the week can reveal this.
    * **Outliers:** Are there any negative `net_weight` values? Or values that are orders of magnitude larger than the rest? These could be data entry errors that need cleaning.
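
The three action steps above could look roughly like this. It is a sketch on a few toy rows shaped like `kernel_receivals.csv`; bucketing by UTC calendar day is an assumption made for simplicity, and the local timezone may be the better choice for the real data.

```python
import pandas as pd

# Toy rows shaped like kernel_receivals.csv; the values are illustrative only.
receivals = pd.DataFrame(
    {
        "rm_id": [365, 365, 365, 379],
        "date_arrival": [
            "2004-06-15 13:34:00 +02:00",
            "2004-06-15 13:38:00 +02:00",
            "2004-06-18 08:26:00 +02:00",
            "2004-06-16 13:40:00 +02:00",
        ],
        "net_weight": [11420.0, 13760.0, 25060.0, 23910.0],
    }
)

# utc=True yields a single timezone-aware column even if the offsets differ.
receivals["date_arrival"] = pd.to_datetime(receivals["date_arrival"], utc=True)
receivals["arrival_day"] = receivals["date_arrival"].dt.floor("D")

# Total net_weight per day and rm_id, one column per rm_id.
daily = (
    receivals.groupby(["rm_id", "arrival_day"])["net_weight"].sum().unstack("rm_id")
)

# Reindex to a complete calendar so days without deliveries show up as zeros.
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range).fillna(0.0)

# Sparsity: share of days with no delivery, per rm_id.
zero_share = (daily == 0).mean()
print(zero_share)

# Weekly seasonality: total weight per day of week (0 = Monday).
by_weekday = daily.groupby(daily.index.dayofweek).sum()
print(by_weekday)
```

Calling `daily.plot()` on the result gives the per-`rm_id` time series plot mentioned under "Sparsity".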


#### **2. Analyze the Link Between Orders and Receivals (Most Important)**
This is where your most powerful predictive signals are.
* **Action:**
    1.  Merge `kernel_receivals.csv` and `kernel_purchase_orders.csv`.
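    2.  Calculate the **Delivery Delay**: `delay = date_arrival - delivery_date`, i.e. the actual arrival (from `kernel_receivals`) minus the expected delivery date (from `kernel_purchase_orders`). This is your single most important exploratory variable.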
* **What to look for (and visualize):**
    * **Delay Distribution:** Plot a histogram of the `delay` in days. Is it normally distributed around zero? Is there a long tail of very late deliveries? The shape of this distribution tells you how reliable the `delivery_date` is.
    * **Segmented Delays:** Does the average delay change based on other categories? Create boxplots of the delay grouped by:
        * `supplier_id` (from `kernel_receivals`)
        * `transporter_name` (from `extended_transportation`)
        * `raw_material_format_type` (from `extended_materials`)
        If you find that "Supplier X" is consistently 5 days late, you've just discovered a hugely valuable feature.
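
A minimal sketch of the delay calculation, using toy rows with the column names from the two kernel files. The values are made up, and measuring the delay in fractional days is just one reasonable choice.

```python
import pandas as pd

ORDER_KEY = ["purchase_order_id", "purchase_order_item_no"]

# Toy data; only the column names mirror the real files.
orders = pd.DataFrame(
    {
        "purchase_order_id": [208545, 208490],
        "purchase_order_item_no": [10, 10],
        "delivery_date": [
            "2004-06-14 00:00:00 +02:00",
            "2004-06-20 00:00:00 +02:00",
        ],
    }
)
receivals = pd.DataFrame(
    {
        "purchase_order_id": [208545, 208490, 208490],
        "purchase_order_item_no": [10, 10, 10],
        "date_arrival": [
            "2004-06-15 13:34:00 +02:00",
            "2004-06-15 13:38:00 +02:00",
            "2004-06-22 09:00:00 +02:00",
        ],
        "supplier_id": [52062, 50468, 50468],
    }
)

orders["delivery_date"] = pd.to_datetime(orders["delivery_date"], utc=True)
receivals["date_arrival"] = pd.to_datetime(receivals["date_arrival"], utc=True)

# validate= raises if an order line unexpectedly appears more than once.
merged = receivals.merge(orders, on=ORDER_KEY, how="left", validate="many_to_one")

# Positive = arrived late, negative = arrived early.
delay = merged["date_arrival"] - merged["delivery_date"]
merged["delay_days"] = delay.dt.total_seconds() / 86400

print(merged.groupby("supplier_id")["delay_days"].agg(["mean", "median", "count"]))
```

From here, `merged["delay_days"].hist()` gives the delay distribution, and `merged.boxplot(column="delay_days", by="supplier_id")` gives the segmented view.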

---
### **Part 2: Data Cleaning and Processing**
*(`2_data_processing.ipynb`)*

Based on your EDA findings, you can now build a clean, unified dataset.

#### **1. Create the `master_df`**
* **Action:**
    1.  Start with `kernel_receivals`.
    2.  **Left join** `kernel_purchase_orders` using `purchase_order_id` and `purchase_order_item_no` as the composite key.
    3.  Optionally, left join `extended_materials` (on `rm_id`) and `extended_transportation` (on your composite key) to bring in useful features like `raw_material_format_type` and `transporter_name`.
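    4.  **Ignore the sparse columns** from `extended_transportation` (e.g., `wood`, `plastic`, `ice`). They are empty in the rows previewed above, so they are unlikely to be useful; confirm their fill grade before dropping them.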
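
The join order above might be sketched like this. The frames are toy stand-ins for the raw files; the 5% fill threshold, the choice to keep one row per `rm_id`, and the use of `receival_item_no` as part of the transportation key are all assumptions that should be checked against the real data.

```python
import numpy as np
import pandas as pd

ORDER_KEY = ["purchase_order_id", "purchase_order_item_no"]

# Toy frames shaped like the raw files; the values are illustrative only.
receivals = pd.DataFrame(
    {
        "rm_id": [365, 365, 379],
        "purchase_order_id": [208545, 208490, 999999],
        "purchase_order_item_no": [10, 10, 20],
        "receival_item_no": [1, 1, 1],
        "net_weight": [11420.0, 11281.0, 23910.0],
    }
)
orders = pd.DataFrame(
    {
        "purchase_order_id": [208545, 208490],
        "purchase_order_item_no": [10, 10],
        "quantity": [25000.0, 15000.0],
    }
)
materials = pd.DataFrame(
    {"rm_id": [365, 365, 379], "raw_material_format_type": [3.0, 3.0, 24.0]}
)
transport = pd.DataFrame(
    {
        "purchase_order_id": [208545, 208490, 999999],
        "purchase_order_item_no": [10, 10, 20],
        "receival_item_no": [1, 1, 1],
        "transporter_name": ["Transporter0", "Transporter1", "Transporter2"],
        "wood": [np.nan, np.nan, np.nan],
    }
)

# Drop columns that are (almost) entirely empty; the threshold is arbitrary.
fill_rate = transport.notna().mean()
transport = transport.loc[:, fill_rate >= 0.05]

# Assumption: one row per rm_id is enough, so duplicates are dropped first.
materials = materials.drop_duplicates(subset="rm_id")

master_df = (
    receivals.merge(
        orders, on=ORDER_KEY, how="left", validate="many_to_one", indicator="order_match"
    )
    .merge(materials, on="rm_id", how="left", validate="many_to_one")
    .merge(
        transport,
        on=ORDER_KEY + ["receival_item_no"],
        how="left",
        validate="one_to_one",
    )
)

# A validated left join must not change the number of receival rows.
assert len(master_df) == len(receivals)
print(master_df["order_match"].value_counts())
```

The `indicator` column makes it easy to count receivals that have no matching purchase order (`left_only`), which is worth knowing before building features from order data.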

#### **2. Clean the Data**
* **Action:**
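    * **Data Types:** Convert all date columns (`date_arrival`, `delivery_date`, `created_date_time`) to datetime with `pd.to_datetime`. The timestamps carry mixed UTC offsets (`+01:00`, `+02:00`), so passing `utc=True` is likely needed to get a single consistent datetime column.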
    * **Missing Values:** Your `net_weight` fill rate is 99.9%. The easiest and safest approach is to simply **drop the rows** where `net_weight` is missing. For `batch_id` (52.8% filled), treat the missing values as a special category (e.g., fill with "Unknown" or -1).
    * **Outliers:** Based on your EDA, remove any obvious errors (e.g., negative `net_weight`).
    * **Categorical Features:** For columns like `supplier_id`, `transporter_name`, etc., convert them to pandas' `category` dtype. This is more memory-efficient and LightGBM can handle it directly.
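
The cleaning steps above, sketched on a toy frame. Treating every negative `net_weight` as an error and using `-1` as the "missing" marker for `batch_id` are assumptions carried over from the advice, not verified facts about the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for master_df; the values are illustrative only.
master = pd.DataFrame(
    {
        "date_arrival": [
            "2004-06-15 13:34:00 +02:00",
            "2004-12-15 13:38:00 +01:00",
            "2004-06-16 08:26:00 +02:00",
            "2004-06-17 08:00:00 +02:00",
        ],
        "net_weight": [11420.0, np.nan, -14.0, 25060.0],
        "batch_id": [np.nan, 1001.0, np.nan, 1002.0],
        "supplier_id": [52062, 50468, 50468, 50387],
    }
)

clean = master.copy()

# Data types: parse timestamps into one timezone-aware column.
clean["date_arrival"] = pd.to_datetime(clean["date_arrival"], utc=True)

# Missing values: drop rows without a target value.
clean = clean.dropna(subset=["net_weight"])

# Outliers: drop negative weights (assumed here to be data entry errors).
clean = clean[clean["net_weight"] >= 0].copy()

# batch_id: keep the rows and mark missing values with a sentinel.
clean["batch_id"] = clean["batch_id"].fillna(-1).astype("int64")

# Categorical features: use the pandas category dtype.
clean["supplier_id"] = clean["supplier_id"].astype("category")

clean = clean.reset_index(drop=True)
print(clean.dtypes)
print(clean)
```

Logging how many rows each step removes (for example by comparing `len(clean)` before and after) makes it easier to spot a filter that is more aggressive than intended.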

#### **3. Save the Processed Data**
* **Action:** Save your clean `master_df` to the `data/3_processed` directory as a Parquet or Feather file. This is much faster to load than a CSV and preserves your data types, so you don't have to repeat the cleaning steps every time you work on your model.
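
To illustrate the dtype argument, here is a small round-trip comparison on a toy frame. It uses in-memory buffers instead of `data/3_processed` so that it runs anywhere, and it assumes a Parquet engine such as `pyarrow` is installed for the second half.

```python
import io

import pandas as pd

# Toy frame with the dtypes produced by the cleaning step; values are illustrative.
master = pd.DataFrame(
    {
        "date_arrival": pd.to_datetime(
            ["2004-06-15 11:34:00", "2004-06-16 06:26:00"], utc=True
        ),
        "supplier_id": pd.Series([52062, 50387]).astype("category"),
        "net_weight": [11420.0, 25060.0],
    }
)

# CSV round trip: the datetime and category dtypes are not restored on load.
csv_buffer = io.StringIO()
master.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)
print(from_csv.dtypes)

# Parquet round trip: dtypes are stored with the data.
try:
    parquet_buffer = io.BytesIO()
    master.to_parquet(parquet_buffer, index=False)
    parquet_buffer.seek(0)
    from_parquet = pd.read_parquet(parquet_buffer)
    print(from_parquet.dtypes)
except ImportError:
    print("No Parquet engine found; install pyarrow or fastparquet.")
```

In the project itself, the buffers would be replaced by a path under `data/3_processed`, for example by giving the existing `save_data` helper a Parquet branch.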

Once you have this clean `master_df`, you are perfectly set up to start implementing the `create_features` and `create_target` functions in your `lgbm_pipeline.py` script. Your data work will then directly feed your modeling pipeline.