# Project Report TDT4173

The purpose of this report is to summarize all the steps our group took to find fitting models/algorithms for the problem at hand. This includes exploratory data analysis, feature engineering, various predictors including boosting/bagging, as well as feature and model interpretations.



## Table of contents
1. [Planned actions](#planned-actions)
2. [Exploratory data analysis](#Exploratory-data-analysis)
    1. [Important observations](#Important-observations)
    2. [Steps taken](#Exploratory-steps)
3. [Feature engineering](#Feature-engineering)
    1. [Important observations](#Feature-engineering-Important-observations)
    2. [Steps taken](#Feature-engineering-Steps-taken)
4. [Model training](#Model-training)
    1. [Important observations](#Model-training-Important-observations)
    2. [Steps taken](#Model-training-Steps-taken)
5. [Model evaluation and interpretation](#Model-evaluation-and-interpretation)


## Planned actions

1. Perform exploratory data analysis in order to get an understanding of the data and notice patterns/dependencies.
2. Using the results of the data analysis, perform feature engineering on a simple model (xgboost and random forest).
3. Use the engineered features on better models (boosting/bagging).
4. When the model performs to satisfaction, perform model interpretation.


## Exploratory data analysis: 

The purpose of performing an exploratory data analysis is to get an understanding of the different types of data included in the problem and their relations. This will be useful when creating models, both to understand why different models perform a certain way and to see how feature engineering can help improve performance.

### Important observations

This section summarizes the steps taken in the next subsection and includes the most important observations made during testing.

- There are 33171 purchase orders and 122590 receivals, roughly 3.7 receivals per purchase order. Purchase orders therefore tend to be split into several receivals, whether due to stock unavailability, large orders or other reasons. We may try to merge the two dataframes on the purchase order IDs, but this depends on the IDs existing in both dataframes. If they do not, we may consider dropping the unmatched IDs.

- In both receivals and purchase orders, there are a few NaN values in the different features. Proposed solution is to drop these rows. The column `batch_id` in the receivals dataframe has about half of its values as NaN. Proposed solution is to either drop this column, or combine the non-NaN rows (aka the batches) and then drop the column.
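The merge check described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual code: the rows are made up, and joining on `purchase_order_id` alone (without `purchase_order_item_no`) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two dataframes; all values are illustrative only.
receivals = pd.DataFrame({
    "purchase_order_id": [208545.0, 208545.0, 208490.0, 999999.0, np.nan],
    "net_weight": [11420.0, 13760.0, 11281.0, 500.0, 250.0],
})
purchase_orders = pd.DataFrame({
    "purchase_order_id": [208545, 208490, 210435],
    "quantity": [25000.0, 24000.0, 23000.0],
})

# Receivals without an order id cannot be linked, so set them aside first.
linked = receivals.dropna(subset=["purchase_order_id"]).astype(
    {"purchase_order_id": "int64"}
)

# Receivals per purchase order: values above 1 mean the order was split.
receivals_per_order = linked.groupby("purchase_order_id").size()

# A left merge with indicator=True flags receivals with no matching order.
merged = linked.merge(
    purchase_orders, on="purchase_order_id", how="left", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]

print(receivals_per_order)
print("Receivals without a matching purchase order:", len(unmatched))
```

The `_merge` column makes it easy to decide later whether unmatched rows should be dropped or kept with missing order information.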



### Steps taken

This section describes all the steps taken in the exploratory data analysis. All steps taken are to be included here and may include old steps not discussed further. In further sections we may observe new data patterns, which will be noted here. 

#### 07.10.:
- Converted date columns to datetime format, and visualized the head of the dataframes to get an understanding of the data.

- Checked the number of NaN values in the receivals dataframe. About half of the values in the `batch_id` column are NaN, so we could drop the entire column. All other columns contain a handful of NaN values. These may overlap across features, but I propose to drop all rows with NaN values in the receivals dataframe.

- Checked the number of NaN values in the purchase orders dataframe. There are a few NaN values in the `unit` and `unit_id` columns. We could remove these rows, as we cannot be certain of the unit of the purchase order. Most of the units are 'kg' and a handful are 'pund'. We could either remove the rows with 'pund' or convert them to 'kg'.

- In the `receival_status` column in the receivals dataframe, there are 142 orders that are not 'Completed'.

- There are about 4400 purchase orders which are not 'Closed' in the `status` column in the purchase orders dataframe. This could be an important observation as these orders are not completed, and could be a reason for delay.
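The NaN and status checks listed above could look roughly like this. It is a sketch on a small made-up dataframe; the status value `Planned` is hypothetical, and dropping `batch_id` before `dropna` is only one of the options discussed.

```python
import numpy as np
import pandas as pd

# Small made-up sample with the same column names as the receivals data.
df = pd.DataFrame({
    "batch_id": [np.nan, 101.0, np.nan, 102.0],
    "net_weight": [11420.0, np.nan, 11281.0, 13083.0],
    "receival_status": ["Completed", "Completed", "Planned", "Completed"],
})

# Number of missing values in every column at once.
nan_per_column = df.isna().sum()

# Drop the mostly empty column first, then the remaining incomplete rows.
cleaned = df.drop(columns=["batch_id"]).dropna()

# Rows whose status is anything other than 'Completed'.
not_completed = int((df["receival_status"] != "Completed").sum())

print(nan_per_column)
print("Rows kept:", len(cleaned), "| not completed:", not_completed)
```

Dropping the column before calling `dropna` matters: otherwise every row with a missing `batch_id` would be removed as well.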



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb

plt.style.use('seaborn-v0_8')

In [3]:
data_receivals = pd.read_csv('./Project_materials/data/kernel/receivals.csv')
data_purchase_orders = pd.read_csv('./Project_materials/data/kernel/purchase_orders.csv')
# data_materials = pd.read_csv('./Project_materials/data/extended/materials.csv')
# data_transportation = pd.read_csv('./Project_materials/data/extended/transportation.csv')


#### Printing dataframe heads:

In [4]:
data_receivals['date_arrival'] = pd.to_datetime(data_receivals['date_arrival'], utc=True).dt.tz_localize(None)
print("Amount of receivals: ", len(data_receivals['rm_id']))
data_receivals.head()

Amount of receivals:  122590


| | rm_id | product_id | purchase_order_id | purchase_order_item_no | receival_item_no | batch_id | date_arrival | receival_status | net_weight | supplier_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 365.0 | 91900143.0 | 208545.0 | 10.0 | 1 | NaN | 2004-06-15 11:34:00 | Completed | 11420.0 | 52062 |
| 1 | 365.0 | 91900143.0 | 208545.0 | 10.0 | 2 | NaN | 2004-06-15 11:34:00 | Completed | 13760.0 | 52062 |
| 2 | 365.0 | 91900143.0 | 208490.0 | 10.0 | 1 | NaN | 2004-06-15 11:38:00 | Completed | 11281.0 | 50468 |
| 3 | 365.0 | 91900143.0 | 208490.0 | 10.0 | 2 | NaN | 2004-06-15 11:38:00 | Completed | 13083.0 | 50468 |
| 4 | 379.0 | 91900296.0 | 210435.0 | 20.0 | 1 | NaN | 2004-06-15 11:40:00 | Completed | 23910.0 | 52577 |


In [5]:
data_purchase_orders['delivery_date'] = pd.to_datetime(data_purchase_orders['delivery_date'], utc=True).dt.tz_localize(None)
data_purchase_orders['created_date_time'] = pd.to_datetime(data_purchase_orders['created_date_time'], utc=True).dt.tz_localize(None)
data_purchase_orders['modified_date_time'] = pd.to_datetime(data_purchase_orders['modified_date_time'], utc=True).dt.tz_localize(None)
print("Amount of purchase orders: ", len(data_purchase_orders['purchase_order_id']))
data_purchase_orders.head()

Amount of purchase orders:  33171


| | purchase_order_id | purchase_order_item_no | quantity | delivery_date | product_id | product_version | created_date_time | modified_date_time | unit_id | unit | status_id | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | -14.0 | 2003-05-11 22:00:00 | 91900143 | 1 | 2003-05-12 10:00:48 | 2004-06-15 06:16:18 | NaN | NaN | 2 | Closed |
| 1 | 22 | 1 | 23880.0 | 2003-05-26 22:00:00 | 91900160 | 1 | 2003-05-27 12:42:07 | 2012-06-29 09:41:13 | NaN | NaN | 2 | Closed |
| 2 | 41 | 1 | 0.0 | 2004-03-07 23:00:00 | 91900143 | 1 | 2004-03-08 13:44:31 | 2012-07-04 13:51:02 | NaN | NaN | 2 | Closed |
| 3 | 61 | 1 | 0.0 | 2004-03-09 23:00:00 | 91900143 | 1 | 2004-03-10 11:39:06 | 2012-07-04 13:50:59 | NaN | NaN | 2 | Closed |
| 4 | 141 | 10 | 25000.0 | 2004-10-27 22:00:00 | 91900143 | 1 | 2004-10-22 12:21:54 | 2012-07-04 13:50:55 | NaN | NaN | 2 | Closed |


#### Study of column values, especially NaN values:

In [6]:

# Count the missing values in the net_weight column
num_nan = data_receivals['net_weight'].isna().sum()

print(num_nan)



68


#### 16.10.: Initial Modeling & Lag Analysis

**Baseline Models**:
- Started with simple rule-based baseline (0 for inactive, 75% of PO qty for active) → Score: 134,136 (beat 0 VTs) ❌
- Moved to XGBoost with basic features (365d, 90d aggregates, PO data) + quantile regression (α=0.2) → Score: 10,135 (beat 2 VTs) ✅
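To make the quantile target concrete, here is the standard pinball (quantile) loss that quantile regression minimizes, written in plain NumPy. Whether the competition score uses exactly this formula is an assumption; the report only states that α=0.2 was used as the regression quantile.

```python
import numpy as np


def pinball_loss(y_true, y_pred, alpha=0.2):
    """Mean pinball loss for quantile level alpha."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha per unit,
    # over-prediction (diff < 0) costs (1 - alpha) per unit.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


actual = np.array([100.0, 100.0])
under = pinball_loss(actual, actual - 10.0)  # predictions 10 too low
over = pinball_loss(actual, actual + 10.0)   # predictions 10 too high

print(f"under-prediction loss: {under:.1f}, over-prediction loss: {over:.1f}")
```

With α=0.2, predicting 10 units too high costs four times as much as predicting 10 units too low, which is why conservative predictions are favoured.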

**Lag Pattern Discovery**:
- Analyzed delivery lag (actual arrival - expected delivery date)
- Key finding: Median lag = -15 days (deliveries arrive ~2 weeks EARLY!)
- Supplier-specific variation: std dev of 47.3 days across suppliers → significant
- Product-specific variation: std dev of 11.4 days
- Temporal stability: 2020+ data is stable (2004-2006 had weird patterns, excluded from lag calc)
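The lag analysis above can be sketched as below. The toy rows are invented, the join on `purchase_order_id` alone is an assumption, and measuring supplier variation as the standard deviation of per-supplier medians is one possible reading of the log, not necessarily the calculation that produced 47.3 days.

```python
import pandas as pd

# Invented example rows; column names follow the two dataframes above.
orders = pd.DataFrame({
    "purchase_order_id": [1, 2, 3, 4],
    "delivery_date": pd.to_datetime(
        ["2024-03-20", "2024-04-10", "2024-05-01", "2024-06-15"]
    ),
})
receivals = pd.DataFrame({
    "purchase_order_id": [1, 2, 3, 4],
    "supplier_id": [52062, 52062, 50468, 50468],
    "date_arrival": pd.to_datetime(
        ["2024-03-05", "2024-03-26", "2024-05-11", "2024-06-05"]
    ),
})

joined = receivals.merge(orders, on="purchase_order_id", how="inner")

# Negative lag means the delivery arrived before the expected date.
joined["lag_days"] = (joined["date_arrival"] - joined["delivery_date"]).dt.days

median_lag = joined["lag_days"].median()
supplier_median = joined.groupby("supplier_id")["lag_days"].median()
spread_across_suppliers = supplier_median.std()

print("Overall median lag:", median_lag)
print(supplier_median)
```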

**Lag Adjustment Implementation**:
- Computed supplier-specific median lags from 2020+ data
- Adjusted PO expected_arrival = delivery_date + supplier_lag
- Only count POs with expected_arrival in forecast window (critical!)
- XGBoost with lag adjustment → Score: 10,135 ✅
- LightGBM with lag adjustment → Score: 9,600 ✅ (best so far!)
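A minimal sketch of the lag adjustment and the window filter follows. The supplier lags, the orders and the forecast window are hypothetical, and the sketch assumes each open order already carries a `supplier_id` (in the raw data that column lives in the receivals dataframe).

```python
import pandas as pd

# Hypothetical supplier-specific median lags in days (negative = early).
supplier_lag = pd.Series({52062: -15, 50468: 0})

open_orders = pd.DataFrame({
    "supplier_id": [52062, 50468, 52062],
    "delivery_date": pd.to_datetime(["2025-02-05", "2025-01-20", "2025-06-30"]),
    "quantity": [25000.0, 12000.0, 8000.0],
})

# Shift every delivery date by its supplier's lag; unknown suppliers get 0.
lag = open_orders["supplier_id"].map(supplier_lag).fillna(0)
open_orders["expected_arrival"] = (
    open_orders["delivery_date"] + pd.to_timedelta(lag, unit="D")
)

# Only orders expected to arrive inside the forecast window are counted.
window_start = pd.Timestamp("2025-01-01")
window_end = pd.Timestamp("2025-01-31")
in_window = open_orders["expected_arrival"].between(window_start, window_end)
po_quantity_in_window = open_orders.loc[in_window, "quantity"].sum()

print(open_orders[["delivery_date", "expected_arrival"]])
print("PO quantity expected in window:", po_quantity_in_window)
```

In this toy case the first order has a February delivery date but is still counted for January, because its supplier typically delivers 15 days early.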

**Failed Experiments**:
- Random Forest with mean predictions → Score: 16,763 ❌ (RF needs 20th percentile extraction, too slow for iteration)
- Ensemble XGBoost + LightGBM → No improvement (models too correlated, -0.2% on validation)

**Feature Engineering Attempts**:
1. **2023-2024 training data + trend features + supplier categorical** → Score: 11,800 ❌
   - Added: 30d, 180d aggregates, trend_ratio, acceleration, supplier_id as categorical
   - Problem: 2023 data created distribution shift (2023 patterns ≠ 2025 patterns)
   - Validation improved (24,248) but Kaggle worse → classic overfitting
   
2. **2024 data + supplier as categorical** → Score: 15,000 ❌❌
   - Problem: LightGBM's categorical feature handling overfits with 87 suppliers on small dataset
   - Learned: categorical_feature parameter is dangerous with limited data

**Key Observations**:
- LightGBM (9,600) slightly beats XGBoost (10,135) with same features
- Lag adjustment is critical (improves ~16% from baseline)
- More data ≠ better (2023 data hurts due to distribution shift)
- Categorical features in LightGBM overfit easily
- Validation loss can be misleading (need time-based validation for forecasting)
- Random 80/20 split includes old patterns, but Kaggle tests on 2025 (unseen conditions)

**Current Best**: LightGBM + lag adjustment (2024 data, 13 features) → 9,600 (beats 2-3 VTs)

**Next Steps**: Test supplier_id as numeric feature (not categorical), add trend features carefully with 2024 data only

#### 17.10.: Validation Strategy, Feature Analysis & Zero-Inflation

**Starting Point**: LightGBM with lag adjustment → Score: 9,600 (2024 data, time-based features)

**Phase 1: Diagnostic Deep Dive**
- Problem: Score stuck at 9,600 despite alpha adjustments (0.15, 0.20, 0.25, 0.30)
- Root cause: Random 80/20 split mixes all months → model learns "average 2024" instead of declining trend
- Key finding: **-35.6% decline** from training (Jan-Aug: 342k kg mean) to validation (Sep-Nov: 220k kg mean)

**Validation Strategy Experiments**:
1. **Random split + alpha=0.15 + 0.88 adjustment** → Score: 9,500 ✅
   - Predictions: 37k kg × 0.88 = 32.7k kg mean
   - Issue: Still predicting "average" rather than learning trend
   
2. **Time-based split + alpha=0.15** → Score: 9,500
   - Training: Jan-Aug 2024, Validation: Sep-Nov 2024
   - Model sees -35.6% decline but predictions still ~37k kg
   
3. **Time-based split + alpha=0.25 (removed 0.88)** → Score: 9,960 ❌
   - Predictions increased to 45k kg
   - Proved over-predicting makes score worse → test set has LOW activity
   
4. **Time-based split + alpha=0.15 + 0.85 adjustment** → Score: ~9,500
   - No real improvement, wasted submission
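The difference between the two validation strategies can be illustrated with a synthetic declining series. The numbers below are invented purely to show the mechanics; only the split dates (train Jan-Aug 2024, validate Sep-Nov 2024) come from the experiments above.

```python
import numpy as np
import pandas as pd

# Synthetic daily target that declines steadily through 2024.
dates = pd.date_range("2024-01-01", "2024-11-30", freq="D")
samples = pd.DataFrame({
    "date": dates,
    "target": np.linspace(400.0, 200.0, len(dates)),
})

# Time-based split: train on Jan-Aug, validate on Sep-Nov.
cutoff = pd.Timestamp("2024-09-01")
train = samples[samples["date"] < cutoff]
valid = samples[samples["date"] >= cutoff]

# Random 80/20 split for comparison.
shuffled = samples.sample(frac=1.0, random_state=0)
n_train = int(0.8 * len(shuffled))
random_train, random_valid = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# Relative change of the mean target from training to validation.
time_gap = valid["target"].mean() / train["target"].mean() - 1.0
random_gap = random_valid["target"].mean() / random_train["target"].mean() - 1.0

print(f"time-based split:  {time_gap:+.1%}")
print(f"random split:      {random_gap:+.1%}")
```

The random split shows almost no gap because both halves contain every month, while the time-based split exposes the decline the model will face on later data.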

**Key Insight**: Hyperparameter tuning won't get from 9,600 → 5,000. Need fundamental changes.

**Phase 2: Deep EDA - The Breakthrough**

**Critical Discoveries**:
1. **Zero-Inflation Problem** ⚠️
   - Test set: 203 unique rm_ids
   - **Only 60 (29.6%) have ANY 2024 data**
   - **143 rm_ids (70.4%) are DEAD** (last delivery 2004-2005)
   - **72.9% of test predictions (22,200/30,450) are for DEAD rm_ids**
   - Problem: Quantile regression predicts positive values for everything → massive over-prediction

2. **Feature Correlation Analysis** (Sep-Nov 2024):
```
   Strong features (correlation > 0.65):
   - total_90, rate_90: +0.9526 ***
   - recency_weighted: +0.9327 ***
   - total_180: +0.9131 ***
   - rate_30, total_30: +0.8872 ***
   - active_ratio_90: +0.6716 ***
   
   Weak features (remove):
   - momentum: +0.0713
   - slope_90: +0.0084
   - cv_90: -0.0207
   - PO quantity: +0.0767
```

3. **Seasonality**: January vs other months 1.03x ratio → month/quarter features likely noise
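A correlation screen like the one summarized above might be computed as follows. The data is synthetic, the use of Pearson correlation is an assumption (the log does not name the coefficient), and the 0.65 threshold is taken from the "strong features" cut-off above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic target plus one informative and one pure-noise feature.
target = pd.Series(rng.gamma(2.0, 1000.0, n))
features = pd.DataFrame({
    "total_90": target * 0.9 + rng.normal(0.0, 200.0, n),
    "momentum": rng.normal(0.0, 1.0, n),
})

# Pearson correlation of every feature column with the target.
corr = features.corrwith(target).sort_values(ascending=False)

# Keep only features whose absolute correlation clears the threshold.
keep = corr[corr.abs() > 0.65].index.tolist()

print(corr)
print("Features kept:", keep)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a feature that looks weak here can still matter inside a tree model through interactions.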

**Phase 3: The Working Solution** ✅

**Two-Stage Model + Guardrails**:
1. **Two-Stage Architecture**:
   - Stage 1: LGBMClassifier (n=400) predicts P(delivery > 0)
   - Stage 2: LGBMRegressor (quantile α=0.15) predicts amount IF delivered
   - Combined: prediction = regressor × classifier_probability
   - **Why it works**: Dead rm_ids get low probability (~0.1) → effectively zero prediction

2. **Per-Horizon Calibration**:
   - Learned on Sep-Nov validation for horizons {7, 30, 60, 90, 150}
   - factor_h = sum(actual) / sum(predicted), clipped to [0.70, 1.10]
   - Fixes systematic over/under-prediction at different horizons

3. **Activity-Based Guardrails**:
   - days_since_last > 365 → force 0
   - 180 < days_since_last ≤ 365 → soft cap at 8% of last year volume
   - Upper cap: min(pred, (total_365/365) × horizon × 1.5)
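The three components above can be sketched in plain NumPy. This is an illustrative reading of the rules rather than the project's actual code: the classifier and regressor outputs are replaced by made-up arrays, the "soft cap" is implemented as a simple hard minimum, and the order in which the guardrails are applied is an assumption.

```python
import numpy as np


def combine_two_stage(amount_if_delivered, prob_delivery):
    """Stage 2 amount scaled by the stage 1 delivery probability."""
    return np.asarray(amount_if_delivered, float) * np.asarray(prob_delivery, float)


def calibration_factor(actual, predicted, low=0.70, high=1.10):
    """sum(actual) / sum(predicted) for one horizon, clipped to [low, high]."""
    total_pred = float(np.sum(predicted))
    if total_pred <= 0.0:
        return 1.0  # nothing to rescale
    return float(np.clip(np.sum(actual) / total_pred, low, high))


def apply_guardrails(pred, days_since_last, total_365, horizon):
    """Apply the activity-based caps to an array of predictions."""
    pred = np.asarray(pred, float).copy()
    days = np.asarray(days_since_last, float)
    total_365 = np.asarray(total_365, float)

    # Upper cap: 1.5x the last year's average daily rate over the horizon.
    pred = np.minimum(pred, total_365 / 365.0 * horizon * 1.5)

    # Cold rm_ids (180 < days <= 365): cap at 8% of last year's volume.
    cold = (days > 180) & (days <= 365)
    pred[cold] = np.minimum(pred[cold], 0.08 * total_365[cold])

    # Dead rm_ids (no delivery for more than a year): force zero.
    pred[days > 365] = 0.0
    return pred


# Made-up outputs for an active, a cold and a dead rm_id at a 30-day horizon.
raw = combine_two_stage([40000.0, 40000.0, 40000.0], [0.9, 0.5, 0.1])
final = apply_guardrails(
    raw,
    days_since_last=[5, 200, 4000],
    total_365=[365000.0, 100000.0, 0.0],
    horizon=30,
)
print(raw)    # [36000. 20000.  4000.]
print(final)  # [36000.  8000.     0.]
```

Even with a low classifier probability the dead rm_id still receives a positive raw prediction, which is why the hard zero rule is applied on top of the two-stage output.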

**Result**: 9,600 → **8,200** ✅✅ (~15% improvement!)

**Phase 4: Feature Engineering - Step 1** (In Progress)

**Removed (9 weak features)**:
- ❌ momentum, slope_90, cv_90 (correlation < 0.1)
- ❌ future_po_quantity, future_po_count (correlation 0.08)
- ❌ month, quarter (weak seasonality)
- ❌ avg_weight_365d, daily_rate_365d (redundant)

**Added (5 strong features)**:
- ✅ recency_weighted (correlation 0.93)
- ✅ total_30, rate_30, count_30 (correlation 0.89)
- ✅ active_ratio_90 (correlation 0.67)

**Net**: 16 features → 14 features (more focused, less noise)

**Expected**: 8,200 → 7,200-7,500

**Key Learnings**:
- Zero-inflation critical: 73% of test should be ~0, quantile regression can't learn this
- Domain knowledge > ML: Hard-coded rules (dead = 0) beat learned patterns
- Feature correlation matters: momentum (0.07 correlation) = noise
- Validation strategy crucial: Random split hides declining trend
- EDA before iteration: Deep analysis revealed 73% dead rm_ids problem

**Failed Experiments**:
- Random split + alpha tuning → 9,500-9,960 ❌ (doesn't learn trend)
- Remove 0.88 adjustment → 9,960 ❌ (predictions too high)
- Time-based + various alphas → 9,500-9,960 ❌ (hyperparameter band-aid)

**Current Best**: Two-stage + guardrails → 8,200 (beats ~3-4 VTs)

**Next Steps** (if Step 1 works):
1. Interpolate calibration for horizons 2-151
2. Activity-aware guardrails (smooth decay)
3. Ensemble with median predictor

#### 18.10.: Feature Pruning Based on Correlation Analysis

**Removed (9 weak features)**:
- ❌ momentum, slope_90, cv_90 (correlation < 0.1)
- ❌ future_po_quantity, future_po_count (correlation 0.08)
- ❌ month, quarter (weak seasonality)
- ❌ avg_weight_365d, daily_rate_365d (redundant)

**Added (5 strong features)**:
- ✅ recency_weighted (correlation 0.93) - weights recent deliveries higher
- ✅ total_30, rate_30, count_30 (correlation 0.89) - captures very recent activity
- ✅ active_ratio_90 (correlation 0.67) - ratio of active days in last 90d

**Net**: 16 features → 14 features (more focused, less noise)
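To show what the added features measure, here is one possible way to compute them for a single rm_id. The exact definitions used in the project are not given in this log, so everything here is an assumption: the window boundaries, the exponential decay with a 30-day half-life for `recency_weighted`, and the helper name `activity_features` are all hypothetical.

```python
import pandas as pd


def activity_features(deliveries, as_of, half_life_days=30.0):
    """Recent-activity features from deliveries strictly before `as_of`."""
    age_days = (as_of - deliveries["date_arrival"]).dt.days
    past = deliveries.assign(age_days=age_days)
    past = past[past["age_days"] >= 1]  # never use same-day or future data

    last_30 = past[past["age_days"] <= 30]
    last_90 = past[past["age_days"] <= 90]

    # Each delivery loses half of its weight every `half_life_days`.
    decay = 0.5 ** (past["age_days"] / half_life_days)

    return {
        "total_30": float(last_30["net_weight"].sum()),
        "count_30": int(len(last_30)),
        "rate_30": float(last_30["net_weight"].sum() / 30.0),
        "active_ratio_90": last_90["date_arrival"].dt.normalize().nunique() / 90.0,
        "recency_weighted": float((past["net_weight"] * decay).sum()),
    }


# Invented delivery history for one rm_id.
history = pd.DataFrame({
    "date_arrival": pd.to_datetime(
        ["2024-11-30", "2024-11-30", "2024-11-01", "2024-09-15", "2024-01-01"]
    ),
    "net_weight": [10000.0, 5000.0, 8000.0, 12000.0, 20000.0],
})
feats = activity_features(history, as_of=pd.Timestamp("2024-12-01"))
print(feats)
```

All of these look only backwards from the forecast date, so they can be computed identically for training rows and for the test period.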

**Result**: 8,200 → **6,200** ✅✅✅ (24% improvement!)

**Key Insight**: Weak features weren't neutral - they were actively adding noise and confusing the model. Removing them + adding highly correlated features = massive gain.

**Total Progress**: 9,600 (baseline) → 6,200 (35% improvement!)

---

#### Fine-Tuning Attempts (Hit Plateau)

**Step 2: Interpolated Calibration** → 6,700 ❌ (worse by 500)
- Tried to fill calibration for ALL horizons 2-151 (not just {7,30,60,90,150})
- Used scipy.interpolate.interp1d to smooth factors across horizons
- **Failed:** Overfitted to Sep-Nov 2024 patterns, didn't generalize to 2025 test set
- Learning: Validation-learned adjustments don't always transfer to test
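The interpolation step can be sketched like this. The log says `scipy.interpolate.interp1d` was used; the sketch swaps in `np.interp`, which performs the same piecewise-linear interpolation, to stay dependency-free. The anchor factors are hypothetical, and holding the end values constant outside horizons 7-150 is an assumption about how the edges were handled.

```python
import numpy as np

# Horizons with learned calibration factors (factor values are hypothetical).
anchor_horizons = np.array([7, 30, 60, 90, 150])
anchor_factors = np.array([0.95, 0.90, 0.88, 0.92, 1.00])

# Fill every horizon from 2 to 151 by linear interpolation; np.interp
# repeats the first/last factor outside the anchor range.
all_horizons = np.arange(2, 152)
factors = np.interp(all_horizons, anchor_horizons, anchor_factors)

# Keep the same clipping range as the per-horizon calibration.
factors = np.clip(factors, 0.70, 1.10)

factor_by_horizon = dict(zip(all_horizons.tolist(), factors.tolist()))
print(factor_by_horizon[45])  # halfway between the factors at 30 and 60
```

Because every interpolated factor is derived from only five validation estimates, any noise in those estimates is spread across all 150 horizons, which fits the overfitting seen in this step.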

**Step 3: Aggressive Dead Filtering** → 6,200 (no change)
- Tightened guardrails: cold rm_ids (180-365d) from 8% cap → 5% cap
- Added new warm rm_ids (90-180d) cap at 15%
- **No effect:** Only affected 19/203 rm_ids, and their predictions were already low from two-stage model
- Learning: Guardrails already optimal from ChatGPT's original implementation

**Step 4: Lower Alpha (0.15 → 0.10)** → 6,140 ✅ (tiny improvement, -60 points)
- Changed quantile target from 15th percentile to 10th percentile
- More conservative predictions
- **Minimal improvement:** Suggests we're close to optimal conservatism level

**Step 5: Remove Calibration Entirely** → 6,138 ✅ (basically same, -2 points)
- Removed all per-horizon calibration adjustments
- Let model's forecast_horizon feature handle effects naturally
- **No impact:** Calibration was irrelevant (factors too close to 1.0)
- Learning: Sep-Nov calibration neither helped nor hurt

**Key Observations**:
- Hit a plateau around 6,100-6,200 - small tweaks (±60 points) don't move the needle
- Feature engineering (Step 1) gave 24% improvement, everything else combined gave <1%
- Calibration/guardrails/alpha tuning are all second-order effects
- Need <5,000 for grade A (currently 6,138) - requires 18.5% more improvement

**Failed Experiments**:
- Interpolated calibration → overfitting to validation
- Tighter guardrails → no effect (already optimal)
- Lower alpha → diminishing returns

**Current Best**: Step 5 (6,138 points) with alpha=0.10, no calibration, Step 1 features

**Possible next steps** (untested):
- Even lower alpha (0.05 or 0.08) for maximum conservatism
- Nuclear option: predict 0 for >90 days inactive (high risk/reward)
- Activity-aware upper caps (make hot rm_ids able to spike more, cold less)

We probably need something fundamentally different to improve our score further.

## Feature engineering:

The purpose of performing feature engineering on the datasets is to increase the performance of a predictive model. This can, for example, be done by removing features, merging features or giving features extra "weight".

### Important observations

This section summarizes the steps taken in the next subsection and includes the most important observations made during feature engineering.



### Steps taken

This section describes all the steps taken during feature engineering. All steps taken are to be included here and may include old steps not discussed further. In further sections we may observe new feature behaviour, which will also be noted here. 

## Model training:

This is the section where we will train different models on the data, and try to find the best model for the problem. At the end of this section we should have a model that performs well on the data, and is able to make good predictions on unseen data.

### Important observations
This section summarizes the steps taken in the next subsection and includes the most important observations made during model training.

### Steps taken
This section describes all the steps taken during model training. All steps taken are to be included here. 

## Model evaluation and interpretation:
The purpose of this section is to evaluate the model performance and interpret the model. This will include feature importance, SHAP values and partial dependence plots.