# Shipment delay probability

This notebook loads the raw CSVs under `data/`, inspects how they connect (as documented in `README.md`), and builds a model that answers: *What is the probability this shipment will be delayed?*

The analysis uses the Python 3.12 runtime that ships with this workspace. All code assumes relative paths from the repo root.


## How the tables relate

From `README.md` and `data.md`:

- `shipments.csv` carries `po_id` (link to `purchase_orders.csv`) and `dest_site_id` (`sites.csv`).
- `purchase_orders.csv` links to `suppliers.csv` via `supplier_id` and to `skus.csv` via `sku_id`.
- `shipments.csv` also links to `transit_events.csv` via `shipment_id` (event-level telemetry).

Those joins let us enrich each shipment with commercial (PO), item (SKU), supplier, and destination details to understand drivers of delays.
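
As a hedged sketch of how these joins might be checked before enrichment, the snippet below uses toy frames (the rows are invented for illustration, not read from `data/`) together with the `validate` and `indicator` options of `DataFrame.merge`. It confirms that each shipment matches at most one PO and lists shipments whose `po_id` finds no match.

```python
import pandas as pd

# Toy stand-ins for shipments.csv and purchase_orders.csv (illustrative values).
shipments_toy = pd.DataFrame({
    "shipment_id": ["SH-1", "SH-2", "SH-3"],
    "po_id": ["PO-1", "PO-1", "PO-9"],  # PO-9 has no matching purchase order
})
po_toy = pd.DataFrame({
    "po_id": ["PO-1", "PO-2"],
    "supplier_id": ["S-1", "S-2"],
})

# validate="many_to_one" raises MergeError if po_id is duplicated on the PO side,
# which would otherwise silently multiply shipment rows.
joined = shipments_toy.merge(
    po_toy, on="po_id", how="left", validate="many_to_one", indicator=True
)

# Share of shipments that found a PO; "left_only" marks orphaned keys.
match_rate = (joined["_merge"] == "both").mean()
orphans = joined.loc[joined["_merge"] == "left_only", "shipment_id"].tolist()
print(f"match rate: {match_rate:.2%}, orphans: {orphans}")
```

The same pattern would apply to the supplier, SKU, and site joins, each of which is expected to be many-to-one from the shipment side.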


In [1]:
!python3.12 -m pip install -q catboost





In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss, classification_report
from catboost import CatBoostClassifier

pd.set_option('display.max_columns', None)
PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / 'data').exists():
    PROJECT_ROOT = PROJECT_ROOT.parent
BASE_PATH = PROJECT_ROOT / 'data'


In [3]:
shipments = pd.read_csv(BASE_PATH / 'shipments.csv', parse_dates=['ship_date', 'eta_date'])
purchase_orders = pd.read_csv(BASE_PATH / 'purchase_orders.csv', parse_dates=['order_date', 'promised_date'])
suppliers = pd.read_csv(BASE_PATH / 'suppliers.csv')
skus = pd.read_csv(BASE_PATH / 'skus.csv')
sites = pd.read_csv(BASE_PATH / 'sites.csv')
transit_events = pd.read_csv(BASE_PATH / 'transit_events.csv', parse_dates=['event_ts'])

print(f"shipments: {shipments.shape}")
print(f"purchase_orders: {purchase_orders.shape}")
print(f"suppliers: {suppliers.shape}")
print(f"skus: {skus.shape}")
print(f"sites: {sites.shape}")
print(f"transit_events: {transit_events.shape}")


shipments: (217500, 10)
purchase_orders: (150000, 10)
suppliers: (200, 6)
skus: (5000, 8)
sites: (2000, 7)
transit_events: (400000, 4)


In [4]:
shipments.head()


Unnamed: 0,shipment_id,po_id,ship_qty,mode,incoterm,origin_country,dest_site_id,ship_date,eta_date,status
0,SH-00000001,PO-0104861,6,Road,CIF,Mexico,ST-00452,2025-02-03,2025-02-10,Delivered
1,SH-00000002,PO-0099598,3,Sea,EXW,India,ST-00086,2024-08-21,2024-09-27,Delayed
2,SH-00000003,PO-0110843,8,Road,FOB,Italy,ST-00252,2023-01-21,2023-01-30,In Transit
3,SH-00000004,PO-0047627,6,Sea,FOB,Argentina,ST-00800,2023-07-03,2023-08-05,In Transit
4,SH-00000005,PO-0003236,6,Road,DAP,Brazil,ST-01995,2025-06-19,2025-06-27,Delivered


In [5]:
shipments['status'].value_counts(normalize=True).rename('share').to_frame().style.format({'share': '{:.3%}'})


Unnamed: 0_level_0,share
status,Unnamed: 1_level_1
In Transit,44.896%
Delivered,38.080%
Created,9.943%
Delayed,6.090%
Lost,0.991%


In [6]:
purchase_orders.head()


Unnamed: 0,po_id,supplier_id,sku_id,order_qty,unit_price_usd,order_date,promised_date,region,country,status
0,PO-0000001,S-0161,SKU-02137,14,305.99,2024-02-17,2024-03-15,EMEA,France,Closed
1,PO-0000002,S-0043,SKU-04868,6,197.63,2024-12-31,2025-01-23,EMEA,UK,Open
2,PO-0000003,S-0055,SKU-02915,9,140.06,2023-05-04,2023-05-26,EMEA,UK,Closed
3,PO-0000004,S-0051,SKU-03909,12,760.98,2025-06-21,2025-07-20,APAC,Vietnam,Closed
4,PO-0000005,S-0098,SKU-03079,13,754.72,2024-05-19,2024-06-04,AMER,Chile,Closed


In [7]:
suppliers.head()


Unnamed: 0,supplier_id,region,country,primary_vendor,on_time_performance,iso_certified
0,S-0001,APAC,Malaysia,Huawei,0.882,True
1,S-0002,EMEA,Germany,NEC,0.842,True
2,S-0003,EMEA,Austria,Cisco,0.933,True
3,S-0004,APAC,South Korea,Ericsson,0.959,True
4,S-0005,EMEA,Spain,Nokia,0.873,True


In [8]:
skus.head()


Unnamed: 0,sku_id,vendor,category,technology,unit_weight_kg,unit_volume_m3,std_cost_usd,supplier_nominal_lead_time_days
0,SKU-00001,Juniper,Cabling,5G,11.01,0.1236,360.72,27
1,SKU-00002,ZTE,Power,5G,8.82,0.098,965.54,81
2,SKU-00003,Ciena,Microwave,4G,26.61,0.284,1022.4,20
3,SKU-00004,Samsung,Microwave,4G,4.89,0.0684,544.35,85
4,SKU-00005,Juniper,Optical Transport,4G,31.83,0.3347,1566.84,22


In [9]:
sites.head()


Unnamed: 0,site_id,region,country,site_type,operator,latitude,longitude
0,ST-00001,EMEA,Italy,Cell Site,Three,44.65748,-3.44596
1,ST-00002,APAC,South Korea,Cell Site,Singtel,19.46501,122.29595
2,ST-00003,EMEA,Austria,Warehouse,,41.41491,27.7507
3,ST-00004,EMEA,Italy,Integration Center,,51.17115,-1.20329
4,ST-00005,AMER,Canada,Cell Site,Verizon,-19.96994,-93.20117


## Join coverage and timeline sanity checks


In [10]:
po = purchase_orders.rename(columns={
    'supplier_id': 'po_supplier_id',
    'sku_id': 'po_sku_id',
    'order_qty': 'po_order_qty',
    'unit_price_usd': 'po_unit_price_usd',
    'order_date': 'po_order_date',
    'promised_date': 'po_promised_date',
    'region': 'po_region',
    'country': 'po_country',
    'status': 'po_status'
})
supplier_dim = suppliers.rename(columns={
    'region': 'supplier_region',
    'country': 'supplier_country',
    'primary_vendor': 'supplier_primary_vendor',
    'on_time_performance': 'supplier_on_time_performance',
    'iso_certified': 'supplier_iso_certified'
})
sku_dim = skus.rename(columns={
    'vendor': 'sku_vendor',
    'category': 'sku_category',
    'technology': 'sku_technology',
    'unit_weight_kg': 'sku_unit_weight_kg',
    'unit_volume_m3': 'sku_unit_volume_m3',
    'std_cost_usd': 'sku_std_cost_usd',
    'supplier_nominal_lead_time_days': 'sku_supplier_nominal_lead_time_days'
})
site_dim = sites.rename(columns={
    'region': 'dest_region',
    'country': 'dest_country',
    'site_type': 'dest_site_type',
    'operator': 'dest_operator',
    'latitude': 'dest_latitude',
    'longitude': 'dest_longitude'
})

full = (shipments
        .merge(po, on='po_id', how='left')
        .merge(supplier_dim, left_on='po_supplier_id', right_on='supplier_id', how='left')
        .merge(sku_dim, left_on='po_sku_id', right_on='sku_id', how='left')
        .merge(site_dim, left_on='dest_site_id', right_on='site_id', how='left'))

coverage = pd.Series({
    'purchase_orders_joined': full['po_order_qty'].notna().mean(),
    'suppliers_joined': full['supplier_region'].notna().mean(),
    'skus_joined': full['sku_vendor'].notna().mean(),
    'sites_joined': full['dest_region'].notna().mean()
}).to_frame('coverage').style.format({'coverage': '{:.2%}'})

timeline_features = {
    'planned_transit_days': (full['eta_date'] - full['ship_date']).dt.days,
    'po_promised_lead_time_days': (full['po_promised_date'] - full['po_order_date']).dt.days
}

time_checks = []
for name, series in timeline_features.items():
    time_checks.append({
        'feature': name,
        'min': series.min(),
        'max': series.max(),
        'pct_negative': (series < 0).mean()
    })

timeline_check_df = pd.DataFrame(time_checks).set_index('feature').style.format({'pct_negative': '{:.2%}'})

coverage, timeline_check_df


(<pandas.io.formats.style.Styler at 0x24d6885a030>,
 <pandas.io.formats.style.Styler at 0x24d6885aab0>)
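
The cell's last expression is a tuple, so the notebook prints the two `Styler` reprs instead of rendering the tables. A minimal sketch of one way to see the numbers, using a toy frame in place of `full` (the values are invented): keep the checks as plain pandas objects and print them.

```python
import pandas as pd

# Toy stand-in for `full` after the left joins (illustrative values).
full_toy = pd.DataFrame({
    "ship_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-10"]),
    "eta_date": pd.to_datetime(["2024-01-08", "2024-01-25", "2024-03-20"]),
    "po_order_qty": [5.0, None, 7.0],  # None mimics a shipment with no matching PO
})

# Join coverage: share of rows where a PO column was filled by the merge.
coverage = pd.Series(
    {"purchase_orders_joined": full_toy["po_order_qty"].notna().mean()},
    name="coverage",
)

# Timeline check: planned transit days, including how often they are negative.
planned = (full_toy["eta_date"] - full_toy["ship_date"]).dt.days
timeline = pd.DataFrame(
    {"min": [planned.min()], "max": [planned.max()],
     "pct_negative": [(planned < 0).mean()]},
    index=["planned_transit_days"],
)

# Printing plain objects shows the values in any front end.
print(coverage.map("{:.2%}".format))
print(timeline)
```

In a notebook, calling `display(coverage)` and `display(timeline_check_df)` on separate lines would also render the original styled tables.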

## Feature engineering with leak-free columns


In [11]:
labeled = full[full['status'].isin(['Delivered', 'Delayed'])].copy()
labeled['delayed'] = (labeled['status'] == 'Delayed').astype(int)

labeled['planned_transit_days'] = (labeled['eta_date'] - labeled['ship_date']).dt.days.clip(lower=0)
labeled['po_order_value_usd'] = labeled['po_order_qty'] * labeled['po_unit_price_usd']
labeled['po_promised_lead_time_days'] = (
    labeled['po_promised_date'] - labeled['po_order_date']
).dt.days.clip(lower=0)
labeled['ship_month'] = labeled['ship_date'].dt.month.astype('Int64')
labeled['ship_year'] = labeled['ship_date'].dt.year.astype('Int64')

numeric_features = [
    'ship_qty',
    'planned_transit_days',
    'po_order_qty',
    'po_unit_price_usd',
    'po_order_value_usd',
    'po_promised_lead_time_days',
    'supplier_on_time_performance',
    'sku_unit_weight_kg',
    'sku_unit_volume_m3',
    'sku_std_cost_usd',
    'sku_supplier_nominal_lead_time_days',
    'ship_month',
    'ship_year'
]
categorical_features_raw = [
    'mode',
    'incoterm',
    'origin_country',
    'po_region',
    'po_country',
    'dest_region',
    'dest_site_type',
    'sku_vendor',
    'sku_category',
    'sku_technology',
    'supplier_region',
    'supplier_primary_vendor',
    'supplier_iso_certified'
]

cardinality = labeled[categorical_features_raw].nunique().sort_values(ascending=False)
cardinality


origin_country             27
po_country                 27
supplier_primary_vendor    10
sku_vendor                 10
sku_category               10
incoterm                    5
mode                        4
dest_site_type              4
po_region                   3
dest_region                 3
sku_technology              3
supplier_region             3
supplier_iso_certified      2
dtype: int64

In [12]:
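MAX_CARDINALITY = 1  # bucket every column that has more than one distinct value
TOP_N = 50  # keep the 50 most frequent levels; no column exceeds 27, so nothing becomes 'Other' yet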

category_keep_map = {}
labeled[categorical_features_raw] = labeled[categorical_features_raw].fillna('Unknown')

for col in categorical_features_raw:
    counts = labeled[col].value_counts()
    if len(counts) > MAX_CARDINALITY:
        keepers = counts.nlargest(TOP_N).index
        category_keep_map[col] = set(keepers)
        labeled[col] = labeled[col].where(labeled[col].isin(keepers), 'Other')

simplified_cats = categorical_features_raw  # column names preserved after bucketing
len(category_keep_map), list(category_keep_map.keys())


(13,
 ['mode',
  'incoterm',
  'origin_country',
  'po_region',
  'po_country',
  'dest_region',
  'dest_site_type',
  'sku_vendor',
  'sku_category',
  'sku_technology',
  'supplier_region',
  'supplier_primary_vendor',
  'supplier_iso_certified'])
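
Every column has more than one level, so all 13 are registered in `category_keep_map`. Since `TOP_N = 50` exceeds the largest cardinality (27), no value is actually mapped to `'Other'`. The sketch below, on an invented toy column, shows how the same `where`/`isin` pattern behaves once `TOP_N` drops below the number of levels; `bucket_top_n` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy categorical column with four levels (illustrative values).
origin = pd.Series(["Japan", "Japan", "Japan", "Poland", "Poland", "Chile", "Peru"])

def bucket_top_n(col: pd.Series, top_n: int) -> pd.Series:
    """Keep the top_n most frequent levels and map the rest to 'Other'."""
    keepers = col.value_counts().nlargest(top_n).index
    return col.where(col.isin(keepers), "Other")

# top_n below the cardinality: rare levels collapse into 'Other'.
print(bucket_top_n(origin, top_n=2).tolist())

# top_n at or above the cardinality: the column comes back unchanged.
print(bucket_top_n(origin, top_n=50).equals(origin))
```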

In [13]:
def apply_category_buckets(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df[categorical_features_raw] = df[categorical_features_raw].fillna('Unknown')
    for col, keep in category_keep_map.items():
        df[col] = df[col].where(df[col].isin(keep), 'Other')
    return df


In [14]:
model_features = numeric_features + simplified_cats
feature_ready = apply_category_buckets(labeled)
feature_df = feature_ready.dropna(subset=numeric_features).copy()

print('Rows with labels after cleaning:', len(feature_df))
print('Delay rate:', feature_df['delayed'].mean())


Rows with labels after cleaning: 96070
Delay rate: 0.1378786301655043


## Attempting to use actual arrival timestamps

`transit_events.csv` captures detailed telemetry. Ideally we would label a shipment as delayed if its first `Delivered` event occurs after the planned ETA. Before using it, check data quality.


In [15]:
delivered_events = transit_events[transit_events['event_status'] == 'Delivered']
arrivals = (delivered_events.sort_values('event_ts')
            .drop_duplicates('shipment_id', keep='first')
            .rename(columns={'event_ts': 'actual_arrival_ts'}))
shipments_with_arrivals = shipments.merge(arrivals[['shipment_id', 'actual_arrival_ts']], on='shipment_id', how='left')
shipments_with_arrivals['actual_arrival_lag_days'] = (
    shipments_with_arrivals['actual_arrival_ts'] - shipments_with_arrivals['ship_date']
).dt.days
shipments_with_arrivals['delay_vs_eta_days'] = (
    shipments_with_arrivals['actual_arrival_ts'] - shipments_with_arrivals['eta_date']
).dt.days

quality_stats = shipments_with_arrivals[['actual_arrival_lag_days', 'delay_vs_eta_days']].describe().T
share_negative_lag = (
    shipments_with_arrivals['actual_arrival_lag_days'] < 0
).mean()
share_missing_arrival = shipments_with_arrivals['actual_arrival_ts'].isna().mean()

quality_stats, share_negative_lag, share_missing_arrival


(                           count       mean         std     min    25%   50%  \
 actual_arrival_lag_days  55232.0 -26.384831  420.845628 -1027.0 -332.0 -28.0   
 delay_vs_eta_days        55232.0 -44.506283  421.065561 -1055.0 -351.0 -46.0   
 
                            75%     max  
 actual_arrival_lag_days  274.0  1025.0  
 delay_vs_eta_days        255.0  1016.0  ,
 np.float64(0.13347586206896553),
 np.float64(0.7460597701149425))

The arrival telemetry is sparse: about 75% of shipments have no `Delivered` event. Among the 55,232 shipments that do, roughly 53% of recorded arrivals precede the ship date (13.3% of all shipments), so those timestamps are unreliable. Instead, we use the curated shipment-level `status` (`Delayed` vs `Delivered`) as the supervised label.
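
Both shares printed by In [15] are fractions of all 217,500 shipments, because a comparison against a missing lag evaluates to `False`. The short sketch below re-expresses them relative to the shipments that have a recorded arrival, using only numbers shown in the outputs above.

```python
# Figures copied from the outputs of In [3] and In [15].
n_shipments = 217_500
n_with_arrival = 55_232                    # `count` row of describe()
share_negative_lag = 0.13347586206896553   # fraction of ALL shipments
share_missing_arrival = 0.7460597701149425

# The missing share should equal 1 - recorded / total.
recorded_share = n_with_arrival / n_shipments
print(f"recorded arrivals: {recorded_share:.1%}")

# Negative lags as a fraction of recorded arrivals only.
n_negative = round(share_negative_lag * n_shipments)
negative_among_recorded = n_negative / n_with_arrival
print(f"arrival before ship date, among recorded: {negative_among_recorded:.1%}")
```

This works out to roughly 25% of shipments with a recorded arrival, and roughly 53% of those arrivals dated before the ship date.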


In [16]:
# `full` (shipments joined to PO, supplier, SKU, and site attributes) is reused from In [10];
# the renames and merges there are unchanged, so they are not repeated here.

labeled = full[full['status'].isin(['Delivered', 'Delayed'])].copy()
labeled['delayed'] = (labeled['status'] == 'Delayed').astype(int)

labeled['planned_transit_days'] = (labeled['eta_date'] - labeled['ship_date']).dt.days
labeled['po_order_value_usd'] = labeled['po_order_qty'] * labeled['po_unit_price_usd']
labeled['po_promised_lead_time_days'] = (labeled['po_promised_date'] - labeled['po_order_date']).dt.days
labeled['po_ship_gap_days'] = (labeled['ship_date'] - labeled['po_order_date']).dt.days
labeled['eta_vs_promise_gap_days'] = (labeled['eta_date'] - labeled['po_promised_date']).dt.days
labeled['ship_month'] = labeled['ship_date'].dt.month.astype('Int64')
labeled['ship_year'] = labeled['ship_date'].dt.year.astype('Int64')
labeled['eta_month'] = labeled['eta_date'].dt.month.astype('Int64')

print('Rows with labels:', len(labeled))
print('Delay rate:', labeled['delayed'].mean())
labeled.head()


Rows with labels: 96070
Delay rate: 0.1378786301655043


Unnamed: 0,shipment_id,po_id,ship_qty,mode,incoterm,origin_country,dest_site_id,ship_date,eta_date,status,po_supplier_id,po_sku_id,po_order_qty,po_unit_price_usd,po_order_date,po_promised_date,po_region,po_country,po_status,supplier_id,supplier_region,supplier_country,supplier_primary_vendor,supplier_on_time_performance,supplier_iso_certified,sku_id,sku_vendor,sku_category,sku_technology,sku_unit_weight_kg,sku_unit_volume_m3,sku_std_cost_usd,sku_supplier_nominal_lead_time_days,site_id,dest_region,dest_country,dest_site_type,dest_operator,dest_latitude,dest_longitude,delayed,planned_transit_days,po_order_value_usd,po_promised_lead_time_days,po_ship_gap_days,eta_vs_promise_gap_days,ship_month,ship_year,eta_month
0,SH-00000001,PO-0104861,6,Road,CIF,Mexico,ST-00452,2025-02-03,2025-02-10,Delivered,S-0025,SKU-01340,7,254.94,2024-05-25,2024-06-24,APAC,Singapore,Closed,S-0025,EMEA,UK,ZTE,0.974,True,SKU-01340,Nokia,Optical Transport,Dual (4G/5G),9.86,0.1112,251.64,82,ST-00452,APAC,Thailand,Data Center,Telstra,26.53168,106.92953,0,7,1784.58,30,254,231,2,2025,2
1,SH-00000002,PO-0099598,3,Sea,EXW,India,ST-00086,2024-08-21,2024-09-27,Delayed,S-0002,SKU-00367,11,180.57,2024-12-11,2024-12-29,APAC,Vietnam,Closed,S-0002,EMEA,Germany,NEC,0.842,True,SKU-00367,Ciena,Antenna,4G,22.27,0.226,169.04,83,ST-00086,EMEA,France,Cell Site,Vodafone,43.52944,13.44771,1,37,1986.27,18,-112,-93,8,2024,9
4,SH-00000005,PO-0003236,6,Road,DAP,Brazil,ST-01995,2025-06-19,2025-06-27,Delivered,S-0036,SKU-04394,11,1083.48,2023-04-01,2023-04-27,EMEA,Czechia,Closed,S-0036,APAC,Singapore,Nokia,0.859,True,SKU-04394,NEC,Fiber,5G,7.68,0.0868,979.73,50,ST-01995,EMEA,Netherlands,Cell Site,Telefónica,45.73979,5.84234,0,8,11918.28,26,810,792,6,2025,6
7,SH-00000008,PO-0075655,8,Road,FOB,Austria,ST-00867,2025-08-28,2025-09-05,Delivered,S-0192,SKU-04292,12,366.78,2025-06-03,2025-07-13,EMEA,Czechia,Open,S-0192,AMER,Brazil,Ericsson,0.869,False,SKU-04292,Cisco,Antenna,5G,16.25,0.1741,340.57,7,ST-00867,APAC,China,Cell Site,Singtel,18.95017,119.69147,0,8,4401.36,40,86,54,8,2025,9
8,SH-00000009,PO-0046101,2,Rail,DAP,Thailand,ST-00593,2023-02-25,2023-03-11,Delivered,S-0149,SKU-01962,12,965.66,2023-08-25,2023-09-17,AMER,Colombia,Open,S-0149,APAC,Indonesia,Cisco,0.902,True,SKU-01962,NEC,RAN,Dual (4G/5G),6.13,0.075,952.8,18,ST-00593,APAC,Thailand,Cell Site,Bharti Airtel,7.9046,115.04544,0,14,11587.92,23,-181,-190,2,2023,3
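
The preview includes rows where the shipment date precedes the PO order date (`po_ship_gap_days` of -112 and -181). Clipping such gaps at zero would hide them. One possible alternative, sketched below with the dates from the first two preview rows, is to record a flag before clipping so the inconsistency stays visible; the `po_ship_gap_is_negative` column name is hypothetical.

```python
import pandas as pd

# Dates from the first two preview rows; the second shipment leaves before its PO was placed.
df = pd.DataFrame({
    "po_order_date": pd.to_datetime(["2024-05-25", "2024-12-11"]),
    "ship_date": pd.to_datetime(["2025-02-03", "2024-08-21"]),
})

raw_gap = (df["ship_date"] - df["po_order_date"]).dt.days

# Hypothetical indicator column: 1 where the raw gap is chronologically impossible.
df["po_ship_gap_is_negative"] = (raw_gap < 0).astype(int)

# Clipped magnitude for use as a numeric feature.
df["po_ship_gap_days"] = raw_gap.clip(lower=0)

print(df[["po_ship_gap_days", "po_ship_gap_is_negative"]])
print("share negative:", df["po_ship_gap_is_negative"].mean())
```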


In [17]:
X = feature_df[model_features]
y = feature_df['delayed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cat_feature_indices = [model_features.index(col) for col in simplified_cats]
class_weights = [1.0, (y_train == 0).mean() / (y_train == 1).mean()]

cat_model = CatBoostClassifier(
    iterations=400,
    depth=6,
    learning_rate=0.05,
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100,
    class_weights=class_weights
)

cat_model.fit(
    X_train,
    y_train,
    eval_set=(X_test, y_test),
    cat_features=cat_feature_indices,
    use_best_model=True
)

y_proba = cat_model.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.5).astype(int)

print('ROC-AUC:', roc_auc_score(y_test, y_proba))
print('Brier score:', brier_score_loss(y_test, y_proba))
print(classification_report(y_test, y_pred))


0:	test: 0.5000000	best: 0.5000000 (0)	total: 305ms	remaining: 2m 1s


100:	test: 0.5035074	best: 0.5060660 (74)	total: 17.7s	remaining: 52.3s


200:	test: 0.5055026	best: 0.5060660 (74)	total: 34.6s	remaining: 34.3s


300:	test: 0.5003153	best: 0.5060784 (203)	total: 54s	remaining: 17.8s


399:	test: 0.5047067	best: 0.5060784 (203)	total: 1m 15s	remaining: 0us

bestTest = 0.5060784261
bestIteration = 203

Shrink model to first 204 iterations.


ROC-AUC: 0.5060784260774416
Brier score: 0.24782288703197516
              precision    recall  f1-score   support

           0       0.86      0.61      0.71     16565
           1       0.14      0.40      0.21      2649

    accuracy                           0.58     19214
   macro avg       0.50      0.50      0.46     19214
weighted avg       0.76      0.58      0.64     19214
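
Because `eval_set=(X_test, y_test)` is combined with `use_best_model=True`, the iteration count is selected on the same rows that are then scored, so the reported metrics lean slightly optimistic. Below is a minimal sketch of a three-way split that keeps the test rows untouched. It runs on synthetic data, and the 60/20/20 proportions and the `X_valid`/`y_valid` names are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X / y with roughly the notebook's 14% positive rate.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.14).astype(int)

# First carve out the final test set (20%), stratified on the label.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Then split the remainder into train (60% of total) and validation (20% of total).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42
)

# The validation pair would serve as eval_set; the test pair is scored once at the end.
print(len(X_train), len(X_valid), len(X_test))
```

With this layout, `cat_model.fit(..., eval_set=(X_valid, y_valid), use_best_model=True)` would choose the iteration on validation rows, and ROC-AUC and the Brier score would be computed on `X_test` only.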



In [18]:
importances = cat_model.get_feature_importance(type='PredictionValuesChange')
feature_importance = (
    pd.Series(importances, index=model_features)
    .sort_values(ascending=False)
    .to_frame(name='importance')
)
print(feature_importance.head(15))


                            importance
sku_technology                8.786808
supplier_primary_vendor       5.764982
origin_country                5.205949
dest_site_type                5.205508
dest_region                   5.017975
po_region                     4.820937
po_country                    4.721119
mode                          4.670431
po_promised_lead_time_days    4.303458
ship_month                    3.986122
incoterm                      3.933874
sku_vendor                    3.851714
po_order_value_usd            3.674847
sku_unit_weight_kg            3.618806
sku_unit_volume_m3            3.329995
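
To put these values in context, the sketch below compares a few of them with the share each feature would receive if importance were spread evenly. It assumes that `PredictionValuesChange` importances are normalized to sum to 100, as CatBoost's documentation describes for this type, and uses the 26 model features (13 numeric plus 13 categorical).

```python
import pandas as pd

# A few of the importances printed above (top three and last two shown).
shown = pd.Series({
    "sku_technology": 8.786808,
    "supplier_primary_vendor": 5.764982,
    "origin_country": 5.205949,
    "sku_unit_weight_kg": 3.618806,
    "sku_unit_volume_m3": 3.329995,
})

n_features = 13 + 13                 # numeric_features + categorical_features_raw
uniform_share = 100 / n_features     # assumes importances sum to 100

# Ratio above 1 means more than an even share of the total importance.
ratio = (shown / uniform_share).round(2)
print(f"uniform share: {uniform_share:.2f}")
print(ratio)
```

Only the top feature exceeds twice the even share. Together with a ROC-AUC near 0.5, that suggests reading the ranking cautiously.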


In [19]:
in_flight = full[full['status'].isin(['In Transit', 'Created'])].copy()
for col in ['ship_date', 'eta_date', 'po_order_date', 'po_promised_date']:
    in_flight[col] = pd.to_datetime(in_flight[col])

in_flight['planned_transit_days'] = (in_flight['eta_date'] - in_flight['ship_date']).dt.days.clip(lower=0)
in_flight['po_order_value_usd'] = in_flight['po_order_qty'] * in_flight['po_unit_price_usd']
in_flight['po_promised_lead_time_days'] = (
    in_flight['po_promised_date'] - in_flight['po_order_date']
).dt.days.clip(lower=0)
in_flight['ship_month'] = in_flight['ship_date'].dt.month.astype('Int64')
in_flight['ship_year'] = in_flight['ship_date'].dt.year.astype('Int64')

live_ready = apply_category_buckets(in_flight)
live_ready = live_ready.dropna(subset=numeric_features).copy()
X_live = live_ready[model_features]
live_ready['delay_probability'] = cat_model.predict_proba(X_live)[:, 1]

print('In-flight shipments scored:', len(live_ready))
print(live_ready['delay_probability'].describe())

top_at_risk = (live_ready[['shipment_id', 'status', 'delay_probability', 'mode', 'origin_country', 'po_id']]
               .sort_values('delay_probability', ascending=False)
               .head(10))
print('\nTop at-risk shipments:')
print(top_at_risk.to_string(index=False))


In-flight shipments scored: 119274
count    119274.000000
mean          0.496859
std           0.011125
min           0.406475
25%           0.490480
50%           0.497313
75%           0.503871
max           0.584747
Name: delay_probability, dtype: float64

Top at-risk shipments:
shipment_id     status  delay_probability mode origin_country      po_id
SH-00186732    Created           0.584747 Rail          Japan PO-0099167
SH-00044886    Created           0.583443 Road        Germany PO-0046300
SH-00156699    Created           0.577213  Air         Poland PO-0022942
SH-00173041 In Transit           0.576264 Road       Colombia PO-0069312
SH-00194804    Created           0.574910  Sea         Poland PO-0149226
SH-00142434 In Transit           0.573641  Air          Japan PO-0120837
SH-00069594 In Transit           0.573407 Road        Germany PO-0137511
SH-00194690 In Transit           0.570964  Sea         France PO-0102819
SH-00117208 In Transit           0.570952  Sea          Chin
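
Because training used balancing `class_weights`, the scores cluster around 0.5 rather than around the 13.8% labeled delay rate. If calibrated probabilities are needed, one common adjustment divides the predicted odds by the positive-class weight. It is not part of the original notebook, and it is only an approximation that assumes the model fits the weighted objective well. The sketch uses summary values from the outputs above; `undo_class_weight` is a hypothetical helper name.

```python
import numpy as np

def undo_class_weight(p, pos_weight: float) -> np.ndarray:
    """Rescale odds from a positively weighted model back to the original prior."""
    p = np.asarray(p, dtype=float)
    return p / (p + pos_weight * (1.0 - p))

# Positive-class weight as defined in In [17]: share of class 0 over share of class 1.
# Approximation: the overall labeled delay rate stands in for the train-split rate.
delay_rate = 0.1378786301655043
pos_weight = (1 - delay_rate) / delay_rate

# Min, mean, and max in-flight scores from the describe() output.
scores = np.array([0.406475, 0.496859, 0.584747])
adjusted = undo_class_weight(scores, pos_weight)
print(adjusted.round(3))
```

Under this adjustment a raw score of exactly 0.5 maps back to the labeled delay rate, and the mean in-flight score maps to about 0.14.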

## Key takeaways

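- After clipping negative planned durations to zero, we retain 96,070 labeled shipments (13.8% delayed). No row is dropped for missing numeric features, which implies complete PO, supplier, and SKU joins for these rows. The categorical bucketing step is in place but relabels nothing at `TOP_N = 50`, since no column has more than 27 levels.
- Even with the cleaner feature set, the CatBoost baseline reaches only ROC-AUC ≈ 0.51, barely above chance, and its Brier score of ≈ 0.248 is close to the 0.25 that a constant prediction of 0.5 would earn. The features carry little signal for the current `status` label, which points to the need for better ETA/promised-date governance.
- The importance ranking is led by SKU technology, supplier primary vendor, origin country, destination site type and region, PO region/country, transport mode, and promised lead time. These are plausible from a business standpoint, but with ROC-AUC this close to 0.5 the ranking should be treated as tentative rather than as confirmed drivers.
- All 119,274 in-flight or newly created shipments are scored with the same pipeline. Scores span only 0.41–0.58 with a mean of 0.50, far from the 13.8% labeled delay rate and consistent with the balancing class weights, so they are best read as a relative ranking. Among the nine visible rows of the top-10 list, Japan, Poland, and Germany each appear twice as origin country.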
- Next leverage point: improve upstream timestamps (or define an independent SLA-based target) so the model can learn from real variance rather than contradictory dates.

