# Feature Engineering - Integrated Version

**Unified approach that combines:**
- ✅ **Your project**: domain-specific logic, advanced temporal features
- ✅ **Reference repository**: 48+ features, multiple time windows, lag/trend/EWM

**Pipeline**: data_clean.csv → advanced features → train_data.csv & test_data.csv

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("🚀 Feature Engineering - Integrated Version")
print("   Combining the best of your project + the reference repository")

# Load the daily-aggregated dataset produced by data_cleaning.ipynb
data_clean = pd.read_csv('data_clean.csv')

print(f"\n📦 Aggregated dataset loaded: {data_clean.shape}")
print(f"📋 Available columns: {list(data_clean.columns)}")

# Convert date_arrival to datetime and inspect the daily aggregation
data_clean['date_arrival'] = pd.to_datetime(data_clean['date_arrival'])
print("\n✅ Daily aggregation check:")
print(f"   Date range: {data_clean['date_arrival'].min()} → {data_clean['date_arrival'].max()}")
print(f"   Unique materials: {data_clean['rm_id'].nunique()}")
print(f"   Total records: {len(data_clean)}")

# Sort by rm_id and date so the rolling calculations see each material in chronological order
data_clean = data_clean.sort_values(['rm_id', 'date_arrival']).reset_index(drop=True)
print("✅ Dataset sorted for rolling window calculations")

🚀 Feature Engineering - Integrated Version
   Combining the best of your project + the reference repository

📦 Aggregated dataset loaded: (41907, 9)
📋 Available columns: ['date_arrival', 'rm_id', 'net_weight', 'num_deliveries', 'supplier_id', 'product_id', 'quantity', 'delivery_delay_days', 'receival_status']

✅ Daily aggregation check:
   Date range: 2004-06-15 00:00:00 → 2024-12-19 00:00:00
   Unique materials: 204
   Total records: 41907
✅ Dataset sorted for rolling window calculations
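
The check above prints the date range and the record counts, but it does not confirm that each material has at most one row per day. A minimal sketch of such a check, assuming the `rm_id` and `date_arrival` columns shown in the output; the helper name `count_duplicate_days` and the toy frame are hypothetical:

```python
import pandas as pd

def count_duplicate_days(df):
    """Count rows that repeat an (rm_id, date_arrival) pair already seen."""
    return int(df.duplicated(subset=['rm_id', 'date_arrival']).sum())

# Toy data (hypothetical): material 1 appears twice on 2024-01-01
toy = pd.DataFrame({
    'rm_id': [1, 1, 1, 2],
    'date_arrival': pd.to_datetime(
        ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-01']),
    'net_weight': [10.0, 5.0, 20.0, 7.0],
})
n_duplicates = count_duplicate_days(toy)
print(f"Duplicate (rm_id, date_arrival) rows: {n_duplicates}")
```

A result of zero on `data_clean` would support the assumption that every row is one material-day.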


## 1. Temporal Window Features (Repository Approach)

In [2]:
# TIME WINDOWS: [7, 30, 90] (reduced set adapted from the reference repository)
# For each window we compute the sum, mean, std and max of net_weight,
# plus the sum of num_deliveries.
# NOTE: rolling(window=N) spans the last N records of each material, which equals
# N calendar days only when a material has one row for every day. The window also
# includes the current row, so these statistics contain the row's own net_weight.

print("🔄 Computing Temporal Window Features...")

# Essential windows (3 instead of 8, for efficiency)
windows = [7, 30, 90]

for window in windows:
    print(f"   📊 Window {window} days...")
    
    # Group by material and compute rolling statistics
    grouped = data_clean.groupby('rm_id')
    
    data_clean[f'weight_sum_{window}d'] = (
        grouped['net_weight'].rolling(window=window, min_periods=1).sum().reset_index(level=0, drop=True)
    )
    
    data_clean[f'weight_mean_{window}d'] = (
        grouped['net_weight'].rolling(window=window, min_periods=1).mean().reset_index(level=0, drop=True)
    )
    
    # std is NaN when the window holds a single observation; treat that as zero spread
    data_clean[f'weight_std_{window}d'] = (
        grouped['net_weight'].rolling(window=window, min_periods=1).std().reset_index(level=0, drop=True)
    ).fillna(0)
    
    data_clean[f'weight_max_{window}d'] = (
        grouped['net_weight'].rolling(window=window, min_periods=1).max().reset_index(level=0, drop=True)
    )
    
    data_clean[f'deliveries_sum_{window}d'] = (
        grouped['num_deliveries'].rolling(window=window, min_periods=1).sum().reset_index(level=0, drop=True)
    )

print(f"✅ Temporal Window Features: {len(windows)} windows × 5 stats = {len(windows) * 5} features")

# Check the results
sample_cols = ['date_arrival', 'rm_id', 'net_weight', 'weight_mean_7d', 'weight_mean_30d', 'weight_mean_90d']
print("\n🔍 Example features (first 5 records):")
print(data_clean[sample_cols].head().to_string())

🔄 Computing Temporal Window Features...
   📊 Window 7 days...
   📊 Window 30 days...
   📊 Window 90 days...
✅ Temporal Window Features: 3 windows × 5 stats = 15 features

🔍 Example features (first 5 records):
  date_arrival  rm_id  net_weight  weight_mean_7d  weight_mean_30d  weight_mean_90d
0   2022-12-16   -1.0     42132.0         42132.0          42132.0          42132.0
1   2004-06-23  342.0     24940.0         24940.0          24940.0          24940.0
2   2005-03-29  343.0     21760.0         21760.0          21760.0          21760.0
3   2004-09-01  345.0     22780.0         22780.0          22780.0          22780.0
4   2004-06-24  346.0       820.0           820.0            820.0            820.0
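
In the sample above each rolling mean equals the row's own `net_weight`, because the window includes the current record. The windows also count records rather than calendar days. As a hedged alternative, and not the notebook's method, the sketch below uses a time-based window with `closed='left'` so that each row only sees strictly earlier days. It assumes a recent pandas version, no missing `rm_id` values, and one row per material-day; the helper name `add_calendar_window_mean` and the toy data are hypothetical.

```python
import pandas as pd

def add_calendar_window_mean(df, days):
    """Mean of net_weight over the previous `days` calendar days, per rm_id.

    closed='left' excludes the current row, so the feature never contains
    the value it is meant to help predict. Rows with no earlier data in the
    window get NaN.
    """
    df = df.sort_values(['rm_id', 'date_arrival']).reset_index(drop=True)
    rolled = (
        df.set_index('date_arrival')
          .groupby('rm_id')['net_weight']
          .rolling(f'{days}D', closed='left')
          .mean()
    )
    # rolled is ordered by (rm_id, date_arrival), the same order as df
    df[f'weight_mean_prev_{days}d'] = rolled.to_numpy()
    return df

# Toy data (hypothetical): material 1 has a 17-day gap before its last delivery
toy = pd.DataFrame({
    'rm_id': [1, 1, 1, 2, 2],
    'date_arrival': pd.to_datetime(
        ['2024-01-01', '2024-01-03', '2024-01-20', '2024-01-02', '2024-01-04']),
    'net_weight': [10.0, 20.0, 30.0, 5.0, 7.0],
})
result = add_calendar_window_mean(toy, days=7)
print(result)
```

The delivery on 2024-01-20 gets NaN here because nothing arrived in the preceding seven days, whereas a record-based window would still average the two January deliveries from weeks earlier.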


## 2. Lag Features & Trend Analysis

In [3]:
# LAG FEATURES: past values that expose temporal patterns
print("🔄 Computing Lag Features...")

# Lag features (earlier values of each material)
# NOTE: shift(n) steps back n records within a material, not n calendar days.
# Missing history is filled with 0, which cannot be told apart from a true zero.
data_clean['weight_lag_7d'] = (
    data_clean.groupby('rm_id')['net_weight'].shift(7).fillna(0)
)

data_clean['weight_lag_30d'] = (
    data_clean.groupby('rm_id')['net_weight'].shift(30).fillna(0)
)

# RATIO FEATURES: comparisons between different periods
print("🔄 Computing Ratio & Trend Features...")

# Ratio between windows (recent vs. longer-term activity); 1e-6 avoids division by zero
data_clean['ratio_7d_30d'] = (
    data_clean['weight_mean_7d'] / (data_clean['weight_mean_30d'] + 1e-6)
).fillna(1)

data_clean['ratio_30d_90d'] = (
    data_clean['weight_mean_30d'] / (data_clean['weight_mean_90d'] + 1e-6)
).fillna(1)

# Trend features (absolute differences between window means)
data_clean['trend_7d_30d'] = (
    data_clean['weight_mean_7d'] - data_clean['weight_mean_30d']
).fillna(0)

data_clean['trend_30d_90d'] = (
    data_clean['weight_mean_30d'] - data_clean['weight_mean_90d']
).fillna(0)

# Coefficient of variation (relative volatility)
data_clean['cv_7d'] = (
    data_clean['weight_std_7d'] / (data_clean['weight_mean_7d'] + 1e-6)
).fillna(0)

data_clean['cv_30d'] = (
    data_clean['weight_std_30d'] / (data_clean['weight_mean_30d'] + 1e-6)
).fillna(0)

print("✅ Lag & Trend Features: 8 features")
print("   - 2 lag features (7d, 30d)")
print("   - 2 ratio features (trend strength)")
print("   - 2 trend features (direction)")
print("   - 2 volatility features (stability)")

🔄 Computing Lag Features...
🔄 Computing Ratio & Trend Features...
✅ Lag & Trend Features: 8 features
   - 2 lag features (7d, 30d)
   - 2 ratio features (trend strength)
   - 2 trend features (direction)
   - 2 volatility features (stability)
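
Because `shift(7)` moves back seven records, the lag columns match a seven-day lag only for materials that have a row every day. One possible way to get a true calendar lag, sketched here under the assumption of one row per `(rm_id, date_arrival)`, is a self-merge on dates moved forward by the lag. The helper name `add_calendar_lag` and the toy data are hypothetical.

```python
import pandas as pd

def add_calendar_lag(df, days):
    """Attach the net_weight observed exactly `days` calendar days earlier.

    Rows without an observation on that earlier day keep NaN instead of 0,
    so "no history" stays distinguishable from a real zero weight.
    """
    past = df[['rm_id', 'date_arrival', 'net_weight']].copy()
    # Move past rows forward in time so they line up with the rows they precede
    past['date_arrival'] = past['date_arrival'] + pd.Timedelta(days=days)
    past = past.rename(columns={'net_weight': f'weight_lag_{days}d_calendar'})
    return df.merge(past, on=['rm_id', 'date_arrival'], how='left')

# Toy data (hypothetical): four delivery days for a single material
toy = pd.DataFrame({
    'rm_id': [1, 1, 1, 1],
    'date_arrival': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-08', '2024-01-09']),
    'net_weight': [10.0, 20.0, 30.0, 40.0],
})
lagged = add_calendar_lag(toy, days=7)
print(lagged)
```

On this toy frame `shift(7)` would return NaN for every row, while the calendar lag finds the deliveries of 2024-01-01 and 2024-01-02 for the last two rows.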


## 3. EWM (Exponential Weighted Moving) Features

In [4]:
# EWM FEATURES: weighted averages that give more weight to recent values
print("🔄 Computing EWM Features...")

# EWM features with different spans (weights decay for older values)
# NOTE: the span counts records within a material, not calendar days
spans = [7, 30]  # Two essential spans

for span in spans:
    data_clean[f'ewm_mean_{span}d'] = (
        data_clean.groupby('rm_id')['net_weight']
        .ewm(span=span, adjust=False).mean()
        .reset_index(level=0, drop=True)
    )
    
    # std is NaN for the first observation of a material; treat that as zero spread
    data_clean[f'ewm_std_{span}d'] = (
        data_clean.groupby('rm_id')['net_weight']
        .ewm(span=span, adjust=False).std()
        .reset_index(level=0, drop=True)
    ).fillna(0)

# EWM ratio (compares two decay rates)
data_clean['ewm_ratio_7d_30d'] = (
    data_clean['ewm_mean_7d'] / (data_clean['ewm_mean_30d'] + 1e-6)
).fillna(1)

print(f"✅ EWM Features: {len(spans)} spans × 2 stats + 1 ratio = {len(spans) * 2 + 1} features")
print(f"   - EWM mean and std for spans {spans}")
print("   - EWM ratio for trend detection")

🔄 Computing EWM Features...
✅ EWM Features: 2 spans × 2 stats + 1 ratio = 5 features
   - EWM mean and std for spans [7, 30]
   - EWM ratio for trend detection
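
With `adjust=False`, pandas computes the EWM mean recursively: the first value is kept as is, and every later value is `(1 - alpha) * previous + alpha * current` with `alpha = 2 / (span + 1)`. The short sketch below reproduces that recursion by hand on a toy series and compares it with pandas; the function name `ewm_mean_manual` and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

def ewm_mean_manual(values, span):
    """Recursive EWM mean, matching pandas ewm(span=span, adjust=False).mean()."""
    alpha = 2.0 / (span + 1.0)
    out = [float(values[0])]          # the first output is the first input
    for x in values[1:]:
        out.append((1.0 - alpha) * out[-1] + alpha * float(x))
    return np.array(out)

x = pd.Series([10.0, 20.0, 40.0, 40.0])
manual = ewm_mean_manual(x.to_numpy(), span=7)   # alpha = 0.25
pandas_result = x.ewm(span=7, adjust=False).mean().to_numpy()
print(manual)
print(np.allclose(manual, pandas_result))
```

With span 7 each new observation carries a weight of 0.25, while span 30 gives it about 0.065, which is why `ewm_ratio_7d_30d` reacts to recent changes.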


## 4. Calendar & Time Features

In [5]:
# CALENDAR FEATURES: temporal patterns (seasonality, working days, etc.)
print("🔄 Computing Calendar Features...")

# Basic calendar features
data_clean['day_of_week'] = data_clean['date_arrival'].dt.dayofweek
data_clean['day_of_month'] = data_clean['date_arrival'].dt.day
data_clean['month'] = data_clean['date_arrival'].dt.month
data_clean['quarter'] = data_clean['date_arrival'].dt.quarter
data_clean['week_of_year'] = data_clean['date_arrival'].dt.isocalendar().week

# Business day indicators
data_clean['is_weekday'] = (data_clean['day_of_week'] < 5).astype(int)
data_clean['is_month_start'] = data_clean['date_arrival'].dt.is_month_start.astype(int)
data_clean['is_month_end'] = data_clean['date_arrival'].dt.is_month_end.astype(int)
data_clean['is_quarter_start'] = data_clean['date_arrival'].dt.is_quarter_start.astype(int)
data_clean['is_quarter_end'] = data_clean['date_arrival'].dt.is_quarter_end.astype(int)

# Cyclical encoding: places the ends of a cycle (Sunday/Monday, December/January) next to
# each other. This mainly helps linear and distance-based models; tree-based models can
# also split on the raw integer columns above.
data_clean['day_of_week_sin'] = np.sin(2 * np.pi * data_clean['day_of_week'] / 7)
data_clean['day_of_week_cos'] = np.cos(2 * np.pi * data_clean['day_of_week'] / 7)
data_clean['month_sin'] = np.sin(2 * np.pi * data_clean['month'] / 12)
data_clean['month_cos'] = np.cos(2 * np.pi * data_clean['month'] / 12)

# Time since start (global time trend)
min_date = data_clean['date_arrival'].min()
data_clean['days_since_start'] = (data_clean['date_arrival'] - min_date).dt.days

print("✅ Calendar Features: 15 features")
print("   - Basic calendar: day_of_week, month, quarter, etc.")
print("   - Business indicators: weekday, month/quarter start/end")
print("   - Cyclical encoding: sin/cos for periodicity")
print("   - Global trend: days_since_start")

🔄 Computing Calendar Features...
✅ Calendar Features: 15 features
   - Basic calendar: day_of_week, month, quarter, etc.
   - Business indicators: weekday, month/quarter start/end
   - Cyclical encoding: sin/cos for periodicity
   - Global trend: days_since_start
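
The point of the sin/cos pair is that the distance between two encoded months reflects how far apart they are on the calendar circle, so December and January end up as close as June and July. A small sketch using the same formula as the cell above; the helper names `encode_month` and `distance` are hypothetical:

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encode_month(month_a) - encode_month(month_b)))

dec_to_jan = distance(12, 1)   # neighbours across the year boundary
jun_to_jul = distance(6, 7)    # neighbours in mid-year
dec_to_jun = distance(12, 6)   # opposite sides of the year
print(dec_to_jan, jun_to_jul, dec_to_jun)
```

On the raw integer column, December and January differ by 11, while the encoded pair sits as close as any other pair of adjacent months.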


## 5. Material & Supplier Features

In [6]:
# MATERIAL & SUPPLIER FEATURES: encodings and aggregations
print("🔄 Computing Material & Supplier Features...")

# Target encoding for categorical features (rm_id, supplier_id, product_id),
# using the overall mean of the target within each category.
# NOTE: these statistics are computed on every row, including rows that later
# fall in the test split, so they carry target information across the split.

# RM_ID encoding (very important)
rm_target_mean = data_clean.groupby('rm_id')['net_weight'].mean().to_dict()
rm_target_std = data_clean.groupby('rm_id')['net_weight'].std().fillna(0).to_dict()
rm_target_count = data_clean.groupby('rm_id').size().to_dict()

data_clean['rm_id_target_mean'] = data_clean['rm_id'].map(rm_target_mean)
data_clean['rm_id_target_std'] = data_clean['rm_id'].map(rm_target_std)
data_clean['rm_id_frequency'] = data_clean['rm_id'].map(rm_target_count)

# SUPPLIER_ID encoding
supplier_target_mean = data_clean.groupby('supplier_id')['net_weight'].mean().to_dict()
supplier_target_count = data_clean.groupby('supplier_id').size().to_dict()

data_clean['supplier_target_mean'] = data_clean['supplier_id'].map(supplier_target_mean)
data_clean['supplier_frequency'] = data_clean['supplier_id'].map(supplier_target_count)

# PRODUCT_ID encoding
product_target_mean = data_clean.groupby('product_id')['net_weight'].mean().to_dict()
product_target_count = data_clean.groupby('product_id').size().to_dict()

data_clean['product_target_mean'] = data_clean['product_id'].map(product_target_mean)
data_clean['product_frequency'] = data_clean['product_id'].map(product_target_count)

# Material diversity per supplier (how many distinct materials it handles)
supplier_material_count = data_clean.groupby('supplier_id')['rm_id'].nunique().to_dict()
data_clean['supplier_material_diversity'] = data_clean['supplier_id'].map(supplier_material_count)

# 3 (rm_id) + 3 (supplier) + 2 (product) = 8 new columns
print("✅ Material & Supplier Features: 8 features")
print("   - RM_ID: target_mean, target_std, frequency")
print("   - Supplier: target_mean, frequency, material_diversity")
print("   - Product: target_mean, frequency")

🔄 Computing Material & Supplier Features...
✅ Material & Supplier Features: 8 features
   - RM_ID: target_mean, target_std, frequency
   - Supplier: target_mean, frequency, material_diversity
   - Product: target_mean, frequency
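
The encodings above are fitted on the full dataset before it is split. A common alternative, shown here only as a sketch and not as the notebook's method, is to fit the category means on the training rows and map them onto the test rows, falling back to the training-wide mean for unseen categories. The helper names `fit_mean_encoding` and `apply_mean_encoding` and the toy frames are hypothetical.

```python
import pandas as pd

def fit_mean_encoding(train, key, target):
    """Learn per-category target means and a global fallback from training rows only."""
    means = train.groupby(key)[target].mean()
    fallback = float(train[target].mean())
    return means, fallback

def apply_mean_encoding(df, key, means, fallback):
    """Map the learned means onto any frame; unseen categories get the fallback."""
    return df[key].map(means).fillna(fallback)

# Toy data (hypothetical): material 3 never appears in training
train = pd.DataFrame({'rm_id': [1, 1, 2], 'net_weight': [10.0, 30.0, 50.0]})
test = pd.DataFrame({'rm_id': [1, 3]})

means, fallback = fit_mean_encoding(train, 'rm_id', 'net_weight')
test_encoded = apply_mean_encoding(test, 'rm_id', means, fallback)
print(test_encoded.tolist())
```

This keeps test targets out of the features. Inside the training set each row still contributes to its own category mean, so out-of-fold or expanding means would be a further refinement.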


## 6. Train/Test Split & Final Preparation

In [7]:
# FINAL DATASET PREPARATION for ML
print("🔄 Final dataset preparation...")

# Exclude columns that are not model inputs (dates, raw IDs, status, target)
feature_columns = [col for col in data_clean.columns if col not in [
    'date_arrival', 'rm_id', 'supplier_id', 'product_id', 
    'receival_status', 'net_weight'  # net_weight is the target
]]

print(f"📊 Total feature columns: {len(feature_columns)}")
print("   Breakdown:")
print("   - Temporal windows: 15 features (3 windows × 5 stats)")
print("   - Lag & trend: 8 features")
print("   - EWM: 5 features")
print("   - Calendar: 15 features")
print("   - Material/Supplier: 8 features")
print(f"   - TOTAL: {15+8+5+15+8} engineered features + 3 original numeric columns")

# Check for remaining missing values
print("\n🔍 Missing values in the final dataset:")
missing_final = data_clean[feature_columns + ['net_weight']].isnull().sum()
missing_final = missing_final[missing_final > 0]
if len(missing_final) > 0:
    print(missing_final)
else:
    print("   ✅ No missing values!")

# Temporal split (last 20% of rows for test)
# Important: for time series use a temporal split, not a random one
data_clean = data_clean.sort_values('date_arrival')
split_point = int(len(data_clean) * 0.8)

train_data = data_clean.iloc[:split_point].copy()
test_data = data_clean.iloc[split_point:].copy()

print("\n📊 Temporal split completed:")
print(f"   Training: {len(train_data)} records ({len(train_data)/len(data_clean)*100:.1f}%)")
print(f"   Test: {len(test_data)} records ({len(test_data)/len(data_clean)*100:.1f}%)")
print(f"   Split date: {data_clean.iloc[split_point]['date_arrival']}")

# Save the final datasets
train_data.to_csv('train_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

print("\n✅ Datasets saved:")
print(f"   train_data.csv: {train_data.shape}")
print(f"   test_data.csv: {test_data.shape}")

# Show a sample of the final features
print("\n🔍 Sample features (first 5 training rows):")
sample_features = ['net_weight'] + feature_columns[:10]  # Target + first 10 features
print(train_data[sample_features].head().to_string())

🔄 Final dataset preparation...
📊 Total feature columns: 54
   Breakdown:
   - Temporal windows: 15 features (3 windows × 5 stats)
   - Lag & trend: 8 features
   - EWM: 5 features
   - Calendar: 15 features
   - Material/Supplier: 8 features
   - TOTAL: 51 engineered features + 3 original numeric columns

🔍 Missing values in the final dataset:
   ✅ No missing values!

📊 Temporal split completed:
   Training: 33525 records (80.0%)
   Test: 8382 records (20.0%)
   Split date: 2022-03-01 00:00:00

✅ Datasets saved:
   train_data.csv: (33525, 60)
   test_data.csv: (8382, 60)

🔍 Sample features (first 5 training rows):
      net_weight  num_deliveries    quantity  delivery_delay_days  weight_sum_7d  weight_mean_7d  weight_std_7d  weight_max_7d  deliveries_sum_7d  weight_sum_30d  weight_mean_30d
1023      8680.0               1    350000.0               -199.0         8680.0          8680.0            0.0         8680.0                1.0          8680.0           8680.0
627       6745.0               1    600000.0 
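
The split above cuts at a row position, so rows that share the split date can land on both sides of the boundary. A hedged variant that cuts on the date itself is sketched below; the helper name `split_by_date` and the toy frame are hypothetical, and the resulting proportions are only approximately 80/20.

```python
import pandas as pd

def split_by_date(df, date_col, train_fraction=0.8):
    """Split so that every row of a given date falls on the same side."""
    ordered_dates = df[date_col].sort_values()
    cutoff = ordered_dates.iloc[int(len(df) * train_fraction)]
    train = df[df[date_col] < cutoff].copy()
    test = df[df[date_col] >= cutoff].copy()
    return train, test, cutoff

# Toy data (hypothetical): four rows share 2024-01-03, where a positional cut would fall
toy = pd.DataFrame({
    'date_arrival': pd.to_datetime(
        ['2024-01-01'] * 2 + ['2024-01-02'] * 3 + ['2024-01-03'] * 4 + ['2024-01-04']),
    'net_weight': range(10),
})
train, test, cutoff = split_by_date(toy, 'date_arrival')
print(cutoff, len(train), len(test))
```

Here a positional cut at row 8 would put three of the 2024-01-03 rows in training and one in test, while the date cut sends all four to test.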

## ✅ Feature Engineering Complete - Summary

In [8]:
print("="*80)
print("✅ FEATURE ENGINEERING COMPLETE - INTEGRATED VERSION")
print("="*80)

print("""
🎯 UNIFIED APPROACH IMPLEMENTED:

1. 📊 REFERENCE REPOSITORY (Advanced Features):
   ✅ Temporal Windows: [7, 30, 90] → 15 features
   ✅ Lag Features: history at 7d, 30d → 2 features
   ✅ Ratio & Trend: comparisons across windows → 6 features
   ✅ EWM Features: exponentially weighted averages → 5 features
   ✅ Calendar Features: full seasonality → 15 features

2. 📈 YOUR PROJECT (Domain Knowledge):
   ✅ Material encoding: target encoding for rm_id
   ✅ Supplier analysis: diversity and frequency
   ✅ Product features: statistical encoding
   ✅ Missing value strategy: domain-specific
   → 8 features in total

📊 FINAL OUTPUT:
   Files: train_data.csv, test_data.csv
   Features: 51 engineered features (54 model inputs with the 3 original numeric columns)
   Split: temporal (80/20) for time series
   Target: net_weight (daily aggregated)
   
🚀 NEXT STEP: baseline_model.ipynb
   Input: train_data.csv, test_data.csv
   Process: CatBoost + Quantile Loss + Optuna
""")

print("="*80)

✅ FEATURE ENGINEERING COMPLETE - INTEGRATED VERSION

🎯 UNIFIED APPROACH IMPLEMENTED:

1. 📊 REFERENCE REPOSITORY (Advanced Features):
   ✅ Temporal Windows: [7, 30, 90] → 15 features
   ✅ Lag Features: history at 7d, 30d → 2 features
   ✅ Ratio & Trend: comparisons across windows → 6 features
   ✅ EWM Features: exponentially weighted averages → 5 features
   ✅ Calendar Features: full seasonality → 15 features

2. 📈 YOUR PROJECT (Domain Knowledge):
   ✅ Material encoding: target encoding for rm_id
   ✅ Supplier analysis: diversity and frequency
   ✅ Product features: statistical encoding
   ✅ Missing value strategy: domain-specific
   → 8 features in total

📊 FINAL OUTPUT:
   Files: train_data.csv, test_data.csv
   Features: 51 engineered features (54 model inputs with the 3 original numeric columns)
   Split: temporal (80/20) for time series
   Target: net_weight (daily aggregated)

🚀 NEXT STEP: baseline_model.ipynb
   Input: train_data.csv, test_data.csv
   Process: CatBoost + Quantile Loss + Optuna

