# 04 - Supervised Learning

**Day 3: Part 1 - Classification (Customer Satisfaction Prediction)**

## Objectives
1. Load the feature-engineered data from Day 2
2. Define feature sets (avoiding data leakage)
3. Build preprocessing pipeline
4. Establish baseline (DummyClassifier)
5. Train and compare classification models:
   - Logistic Regression
   - Decision Tree
   - Random Forest
   - LightGBM (with hyperparameter tuning)
6. SHAP interpretability analysis
7. Save best model and artifacts

## Key Principles
- **Baselines FIRST** - Always establish dummy baselines before training real models
- **Cross-validation** - Use StratifiedKFold for robust performance estimates
- **No data leakage** - Exclude `review_score` and any feature derived from it
- **Class imbalance** - Handle with class_weight='balanced'
- **Primary metric** - ROC-AUC (threshold-independent and insensitive to class prevalence)
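
As a rough illustration of the class-imbalance principle above, scikit-learn's `class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`. The sketch below checks that formula on a made-up 80/20 label vector; the toy labels are an assumption for illustration, not the project's actual split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumed for illustration): 80 satisfied (1), 20 unsatisfied (0)
y_toy = np.array([1] * 80 + [0] * 20)

# Formula behind class_weight='balanced': n_samples / (n_classes * bincount(y))
manual_weights = len(y_toy) / (2 * np.bincount(y_toy))

# Same computation through scikit-learn's helper
sklearn_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy
)

# The minority class (0) receives the larger weight
print(dict(zip([0, 1], manual_weights)))
```

The rarer class gets a proportionally larger weight, so errors on unsatisfied customers count more during training.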

## 1. Setup & Imports

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import json
from datetime import datetime

# sklearn
from sklearn.model_selection import StratifiedKFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix,
    roc_curve
)

# LightGBM
from lightgbm import LGBMClassifier

# SHAP
import shap

# Joblib for saving
import joblib

print("Imports complete!")

In [None]:
# Settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
plt.style.use('seaborn-v0_8-whitegrid')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Create plots directory
Path('../models/plots').mkdir(parents=True, exist_ok=True)

print(f"Random state: {RANDOM_STATE}")
print("Settings configured!")

## 2. Load Data

In [None]:
# Load featured data from Day 2
train = pd.read_parquet('../data/processed/train_featured.parquet')
val = pd.read_parquet('../data/processed/val_featured.parquet')
test = pd.read_parquet('../data/processed/test_featured.parquet')

print("Data loaded:")
print(f"  Train: {train.shape[0]:,} rows x {train.shape[1]} columns")
print(f"  Val:   {val.shape[0]:,} rows x {val.shape[1]} columns")
print(f"  Test:  {test.shape[0]:,} rows x {test.shape[1]} columns")

In [None]:
# Load and merge cluster labels (optional enhancement)
try:
    clustered = pd.read_parquet('../models/customer_segments_clustered.parquet')
    cluster_cols = ['customer_unique_id', 'cluster_id', 'customer_segment']
    # Keep one row per customer so the left merge cannot duplicate order rows
    cluster_labels = clustered[cluster_cols].drop_duplicates('customer_unique_id')

    def add_cluster_labels(df):
        """Left-merge cluster labels; customers without a cluster get -1 / 'unknown'."""
        df = df.merge(cluster_labels, on='customer_unique_id', how='left')
        df['cluster_id'] = df['cluster_id'].fillna(-1).astype(int)
        df['customer_segment'] = df['customer_segment'].fillna('unknown')
        return df

    # Merge if not already present; all three splits are reassigned together,
    # so a failure cannot leave only some of them merged
    if 'cluster_id' not in train.columns:
        train, val, test = (add_cluster_labels(df) for df in (train, val, test))
        print("Merged cluster labels from customer_segments_clustered.parquet")
    else:
        print("Cluster labels already present")

    print(f"\nCluster distribution (train):")
    print(train['customer_segment'].value_counts())
except Exception as e:
    print(f"Warning: Could not load cluster labels: {e}")
    print("Proceeding without cluster features.")

In [None]:
# Verify target distribution
print("=" * 50)
print("TARGET DISTRIBUTION: is_satisfied")
print("=" * 50)

for name, df in [('Train', train), ('Val', val), ('Test', test)]:
    satisfied = df['is_satisfied'].mean()
    unsatisfied = 1 - satisfied
    print(f"{name}: {satisfied:.1%} satisfied, {unsatisfied:.1%} unsatisfied (n={len(df):,})")

# Imbalance ratio
train_satisfied = train['is_satisfied'].sum()
train_unsatisfied = len(train) - train_satisfied
imbalance_ratio = train_satisfied / train_unsatisfied
print(f"\nImbalance ratio (train): {imbalance_ratio:.2f}:1 (satisfied:unsatisfied)")
print("-> Moderate imbalance, will use class_weight='balanced'")

## 3. Define Feature Sets

**Critical: Avoid Data Leakage**

Must EXCLUDE:
- `review_score` - Target is derived from this
- `is_satisfied` - Target variable
- ID columns - Not predictive
- Timestamp columns - Use derived features instead

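Safe to USE (known no later than the moment the review is submitted):
- Delivery features: `delivery_days`, `is_late_delivery`, `delivery_delay_days` (delivery happens before the review)
- Review text features: `review_sentiment_polarity`, `review_word_count` (written together with the score, so they are only available if predictions are made once the review text exists)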
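
The exclusion rules above can be enforced with a small guard before building the feature matrix. This is a minimal sketch: `LEAKY_COLUMNS`, `ID_SUFFIXES`, and `find_leaks` are hypothetical names introduced here for illustration, and the `_id` suffix convention is an assumption about how ID columns are named.

```python
# Hypothetical names for illustration; adapt to the notebook's own lists
LEAKY_COLUMNS = {'review_score', 'is_satisfied'}
ID_SUFFIXES = ('_id',)  # assumed naming convention for ID columns


def find_leaks(features, leaky=LEAKY_COLUMNS, id_suffixes=ID_SUFFIXES):
    """Return feature names that are known leaks or look like ID columns."""
    return sorted(
        name for name in features
        if name in leaky or name.endswith(id_suffixes)
    )


candidate_features = ['delivery_days', 'review_score', 'order_id', 'price']
leaks = find_leaks(candidate_features)
print(leaks)  # ['order_id', 'review_score']
```

Calling `find_leaks(ALL_FEATURES)` and asserting that the result is empty would catch an accidental inclusion before any model is trained.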

In [None]:
# Define feature columns
NUMERICAL_FEATURES = [
    # Delivery (key predictors!)
    'delivery_days', 'delivery_delay_days',
    
    # Price/Value
    'price', 'freight_value', 'payment_value', 'payment_installments',
    'freight_ratio', 'price_vs_category_zscore', 'payment_per_installment',
    
    # Product
    'product_weight_g', 'product_volume_cm3', 'product_photos_qty',
    
    # Geographic
    'seller_customer_distance_km',
    
    # NLP (review text features - written after delivery)
    'review_text_length', 'review_word_count', 'review_caps_ratio',
    'review_sentiment_polarity', 'review_sentiment_subjectivity',
    
    # Temporal (cyclical encoding)
    'order_hour_sin', 'order_hour_cos',
    'order_dayofweek_sin', 'order_dayofweek_cos',
    
    # RFM (customer-level)
    'recency', 'frequency', 'monetary', 'avg_order_value',
]

CATEGORICAL_FEATURES = [
    'customer_state', 'seller_state',
    'customer_region', 'seller_region',
    'product_category_name_english', 'payment_type',
]

BINARY_FEATURES = [
    'is_weekend', 'is_month_start', 'is_month_end',
    'is_same_state', 'is_full_payment', 'is_high_installment',
    'has_review_comment', 'is_late_delivery',
]

# Add cluster features if available
if 'customer_segment' in train.columns:
    CATEGORICAL_FEATURES.append('customer_segment')
    print("Added customer_segment to categorical features")

ALL_FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES + BINARY_FEATURES

print(f"\nFeature counts:")
print(f"  Numerical: {len(NUMERICAL_FEATURES)}")
print(f"  Categorical: {len(CATEGORICAL_FEATURES)}")
print(f"  Binary: {len(BINARY_FEATURES)}")
print(f"  Total: {len(ALL_FEATURES)}")

In [None]:
# Create X, y
X_train = train[ALL_FEATURES].copy()
y_train = train['is_satisfied'].copy()

X_val = val[ALL_FEATURES].copy()
y_val = val['is_satisfied'].copy()

X_test = test[ALL_FEATURES].copy()
y_test = test['is_satisfied'].copy()

print("Feature matrices created:")
print(f"  X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"  X_val: {X_val.shape}, y_val: {y_val.shape}")
print(f"  X_test: {X_test.shape}, y_test: {y_test.shape}")

In [None]:
# Check missing values
print("Missing values in X_train:")
missing = X_train.isnull().sum()
missing_cols = missing[missing > 0].sort_values(ascending=False)

if len(missing_cols) > 0:
    print(missing_cols)
    print(f"\n-> Will be handled by SimpleImputer in preprocessing pipeline")
else:
    print("  No missing values!")

## 4. Build Preprocessing Pipeline

In [None]:
# Create preprocessing pipeline
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', numerical_transformer, NUMERICAL_FEATURES),
    ('cat', categorical_transformer, CATEGORICAL_FEATURES),
    ('bin', 'passthrough', BINARY_FEATURES),
], remainder='drop')

print("Preprocessing pipeline created:")
print(f"  Numerical: {len(NUMERICAL_FEATURES)} features -> impute(median) + scale")
print(f"  Categorical: {len(CATEGORICAL_FEATURES)} features -> impute + one-hot encode")
print(f"  Binary: {len(BINARY_FEATURES)} features -> passthrough")

In [None]:
# Test preprocessing pipeline
print("Testing preprocessing pipeline...")
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)

print(f"\nProcessed shapes:")
print(f"  X_train: {X_train.shape} -> {X_train_processed.shape}")
print(f"  X_val: {X_val.shape} -> {X_val_processed.shape}")
print(f"\nOne-hot encoding expanded {len(CATEGORICAL_FEATURES)} categorical features to {X_train_processed.shape[1] - len(NUMERICAL_FEATURES) - len(BINARY_FEATURES)} columns")

## 5. Baseline Model (DummyClassifier)
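
A most-frequent baseline predicts the same class for every row. The sketch below, on made-up labels with an assumed 80% positive rate, shows what that implies: accuracy equals the majority share, recall on the positive class is 1.0 when that class is the majority, and ROC-AUC is 0.5 because constant scores cannot rank samples.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels (assumed): 80 positives, 20 negatives; features are irrelevant here
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.zeros((len(y_toy), 1))

toy_dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
toy_pred = toy_dummy.predict(X_toy)
toy_proba = toy_dummy.predict_proba(X_toy)[:, 1]

toy_accuracy = accuracy_score(y_toy, toy_pred)   # share of the majority class
toy_recall = recall_score(y_toy, toy_pred)       # every positive is "found"
toy_auc = roc_auc_score(y_toy, toy_proba)        # constant scores give 0.5

print(toy_accuracy, toy_recall, toy_auc)
```

This is why accuracy alone is a weak yardstick on imbalanced data: a model that learns nothing can still score as high as the majority share.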

In [None]:
# DummyClassifier baseline
print("=" * 60)
print("BASELINE: DummyClassifier (most_frequent)")
print("=" * 60)

dummy = DummyClassifier(strategy='most_frequent', random_state=RANDOM_STATE)
dummy.fit(X_train_processed, y_train)

y_pred_dummy = dummy.predict(X_val_processed)

baseline_accuracy = accuracy_score(y_val, y_pred_dummy)
baseline_roc_auc = 0.5  # Constant predictions cannot rank samples, so ROC-AUC is 0.5

print(f"\nValidation Metrics:")
print(f"  Accuracy: {baseline_accuracy:.4f}")
print(f"  ROC-AUC:  {baseline_roc_auc:.4f} (no-skill baseline)")
print(f"\n-> Any model must exceed ROC-AUC 0.50 to add value")

In [None]:
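# Store baseline metrics. Precision, recall, and F1 are computed rather than
# hard-coded: a most-frequent baseline that predicts the positive class has recall 1.0.
baseline_metrics = {
    'name': 'Dummy (Baseline)',
    'accuracy': baseline_accuracy,
    'precision': precision_score(y_val, y_pred_dummy, zero_division=0),
    'recall': recall_score(y_val, y_pred_dummy, zero_division=0),
    'f1': f1_score(y_val, y_pred_dummy, zero_division=0),
    'roc_auc': baseline_roc_auc,
}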

# Initialize results list
all_metrics = [baseline_metrics]
trained_models = {}

In [None]:
# Define evaluation function
def evaluate_model(model, X, y, name):
    """Evaluate classifier and return metrics dict."""
    y_pred = model.predict(X)
    
    # Get probabilities for ROC-AUC
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X)[:, 1]
    else:
        y_proba = y_pred
    
    return {
        'name': name,
        'accuracy': accuracy_score(y, y_pred),
        'precision': precision_score(y, y_pred, zero_division=0),
        'recall': recall_score(y, y_pred, zero_division=0),
        'f1': f1_score(y, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y, y_proba),
    }

print("Evaluation function defined.")

## 6. Model 1: Logistic Regression

In [None]:
print("=" * 60)
print("MODEL 1: Logistic Regression")
print("=" * 60)

log_reg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        class_weight='balanced',
        max_iter=1000,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    ))
])

print("Training Logistic Regression...")
log_reg_pipeline.fit(X_train, y_train)

logreg_metrics = evaluate_model(log_reg_pipeline, X_val, y_val, 'Logistic Regression')
print(f"\nValidation Metrics:")
print(f"  Accuracy: {logreg_metrics['accuracy']:.4f}")
print(f"  ROC-AUC:  {logreg_metrics['roc_auc']:.4f}")
print(f"  F1:       {logreg_metrics['f1']:.4f}")

In [None]:
# Cross-validation
print("Running 5-fold cross-validation...")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_scores = cross_val_score(log_reg_pipeline, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)

print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
logreg_metrics['cv_roc_auc_mean'] = cv_scores.mean()
logreg_metrics['cv_roc_auc_std'] = cv_scores.std()

all_metrics.append(logreg_metrics)
trained_models['Logistic Regression'] = log_reg_pipeline

## 7. Model 2: Decision Tree

In [None]:
print("=" * 60)
print("MODEL 2: Decision Tree")
print("=" * 60)

dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(
        max_depth=10,
        min_samples_split=20,
        min_samples_leaf=10,
        class_weight='balanced',
        random_state=RANDOM_STATE,
    ))
])

print("Training Decision Tree...")
dt_pipeline.fit(X_train, y_train)

dt_metrics = evaluate_model(dt_pipeline, X_val, y_val, 'Decision Tree')
print(f"\nValidation Metrics:")
print(f"  Accuracy: {dt_metrics['accuracy']:.4f}")
print(f"  ROC-AUC:  {dt_metrics['roc_auc']:.4f}")
print(f"  F1:       {dt_metrics['f1']:.4f}")

all_metrics.append(dt_metrics)
trained_models['Decision Tree'] = dt_pipeline

## 8. Model 3: Random Forest

In [None]:
print("=" * 60)
print("MODEL 3: Random Forest")
print("=" * 60)

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        max_depth=15,
        min_samples_split=10,
        min_samples_leaf=5,
        class_weight='balanced',
        n_jobs=-1,
        random_state=RANDOM_STATE,
    ))
])

print("Training Random Forest (100 trees)...")
rf_pipeline.fit(X_train, y_train)

rf_metrics = evaluate_model(rf_pipeline, X_val, y_val, 'Random Forest')
print(f"\nValidation Metrics:")
print(f"  Accuracy: {rf_metrics['accuracy']:.4f}")
print(f"  ROC-AUC:  {rf_metrics['roc_auc']:.4f}")
print(f"  F1:       {rf_metrics['f1']:.4f}")

all_metrics.append(rf_metrics)
trained_models['Random Forest'] = rf_pipeline

## 9. Model 4: LightGBM (Base)

In [None]:
print("=" * 60)
print("MODEL 4: LightGBM (Base)")
print("=" * 60)

lgbm_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(
        n_estimators=200,
        max_depth=10,
        learning_rate=0.1,
        class_weight='balanced',
        n_jobs=-1,
        random_state=RANDOM_STATE,
        verbose=-1,
    ))
])

print("Training LightGBM (base model)...")
lgbm_pipeline.fit(X_train, y_train)

lgbm_metrics = evaluate_model(lgbm_pipeline, X_val, y_val, 'LightGBM (base)')
print(f"\nValidation Metrics:")
print(f"  Accuracy: {lgbm_metrics['accuracy']:.4f}")
print(f"  ROC-AUC:  {lgbm_metrics['roc_auc']:.4f}")
print(f"  F1:       {lgbm_metrics['f1']:.4f}")

all_metrics.append(lgbm_metrics)
trained_models['LightGBM (base)'] = lgbm_pipeline

## 10. LightGBM Hyperparameter Tuning
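
When `max_depth` and `num_leaves` are searched together, some combinations are redundant: a tree of depth `d` has at most `2**d` leaves, so a larger `num_leaves` cannot be reached. The sketch below flags such pairs for the depth and leaf values used in this section; the variable names are introduced here for illustration.

```python
from itertools import product

# Values taken from the search space in this section
max_depths = [5, 10, 15]
num_leaves_options = [31, 50, 100]

# Pairs where the depth limit, not num_leaves, is the binding constraint
capped_pairs = [
    (depth, leaves)
    for depth, leaves in product(max_depths, num_leaves_options)
    if leaves > 2 ** depth
]

print(capped_pairs)  # [(5, 50), (5, 100)]
```

With `max_depth=5`, the settings `num_leaves=50` and `num_leaves=100` behave alike, so part of the random-search budget may be spent on effectively duplicate configurations.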

In [None]:
print("=" * 60)
print("HYPERPARAMETER TUNING: LightGBM")
print("=" * 60)

# Define parameter grid
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, 15],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__num_leaves': [31, 50, 100],
    'classifier__min_child_samples': [20, 50, 100],
    'classifier__subsample': [0.8, 1.0],
    'classifier__colsample_bytree': [0.8, 1.0],
}

# Calculate total combinations
total_combinations = 1
for v in param_grid.values():
    total_combinations *= len(v)
print(f"Total parameter combinations: {total_combinations}")
print(f"Will sample 30 combinations with RandomizedSearchCV")

In [None]:
# Create base pipeline for tuning
lgbm_tune_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(
        class_weight='balanced',
        subsample_freq=1,  # LightGBM ignores subsample unless bagging is enabled (freq > 0)
        n_jobs=-1,
        random_state=RANDOM_STATE,
        verbose=-1,
    ))
])

# RandomizedSearchCV
print("\nRunning RandomizedSearchCV (this may take a few minutes)...")
search = RandomizedSearchCV(
    lgbm_tune_pipeline,
    param_grid,
    n_iter=30,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE),
    scoring='roc_auc',
    n_jobs=-1,
    random_state=RANDOM_STATE,
    verbose=1,
)

search.fit(X_train, y_train)
print("\nTuning complete!")

In [None]:
# Best model results
print("=" * 60)
print("BEST MODEL: LightGBM (Tuned)")
print("=" * 60)

print(f"\nBest CV ROC-AUC: {search.best_score_:.4f}")
print(f"\nBest Parameters:")
for param, value in search.best_params_.items():
    print(f"  {param}: {value}")

# Evaluate on validation set
best_lgbm = search.best_estimator_
lgbm_tuned_metrics = evaluate_model(best_lgbm, X_val, y_val, 'LightGBM (tuned)')

print(f"\nValidation Metrics:")
print(f"  Accuracy: {lgbm_tuned_metrics['accuracy']:.4f}")
print(f"  ROC-AUC:  {lgbm_tuned_metrics['roc_auc']:.4f}")
print(f"  F1:       {lgbm_tuned_metrics['f1']:.4f}")

lgbm_tuned_metrics['cv_roc_auc'] = search.best_score_
lgbm_tuned_metrics['best_params'] = search.best_params_

all_metrics.append(lgbm_tuned_metrics)
trained_models['LightGBM (tuned)'] = best_lgbm

## 11. Model Comparison

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame(all_metrics)
comparison_df = comparison_df[['name', 'accuracy', 'precision', 'recall', 'f1', 'roc_auc']]
comparison_df = comparison_df.sort_values('roc_auc', ascending=False)

print("=" * 70)
print("MODEL COMPARISON (Validation Set)")
print("=" * 70)
print(comparison_df.to_string(index=False))

# Save comparison
comparison_df.to_csv('../models/classification_comparison.csv', index=False)
print("\n✓ Saved: models/classification_comparison.csv")

In [None]:
# ROC Curves
fig, ax = plt.subplots(figsize=(10, 8))

# Plot ROC curves for each model
models_to_plot = [
    ('Logistic Regression', log_reg_pipeline),
    ('Decision Tree', dt_pipeline),
    ('Random Forest', rf_pipeline),
    ('LightGBM (tuned)', best_lgbm),
]

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

for (name, model), color in zip(models_to_plot, colors):
    y_proba = model.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, y_proba)
    auc = roc_auc_score(y_val, y_proba)
    ax.plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})', color=color, linewidth=2)

# Random baseline
ax.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.500)', linewidth=1)

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - Classification Models', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../models/plots/classification_roc_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved: models/plots/classification_roc_curves.png")

In [None]:
# Confusion Matrix for best model
y_pred_best = best_lgbm.predict(X_val)
cm = confusion_matrix(y_val, y_pred_best)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=['Unsatisfied', 'Satisfied'],
            yticklabels=['Unsatisfied', 'Satisfied'],
            annot_kws={'size': 14})
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('Confusion Matrix - LightGBM (tuned)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../models/plots/classification_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved: models/plots/classification_confusion_matrix.png")

In [None]:
# Classification Report
print("=" * 60)
print("CLASSIFICATION REPORT: LightGBM (tuned)")
print("=" * 60)
print(classification_report(y_val, y_pred_best, target_names=['Unsatisfied', 'Satisfied']))

## 12. SHAP Analysis
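
One-hot encoding splits each categorical feature across several columns, which can understate its importance in a per-column ranking. The sketch below sums attributions per row over the columns of each source feature before taking the mean absolute value. It is an illustration only: `group_shap_importance` is a hypothetical helper, the numbers are made up rather than real SHAP output, and matching columns by the `<feature>_` prefix is an assumption about the encoder's naming.

```python
import numpy as np


def group_shap_importance(shap_vals, feature_names, categorical_features):
    """Mean |SHAP| per source feature, summing one-hot columns per row first."""
    groups = {}
    for col, name in enumerate(feature_names):
        # Map 'payment_type_boleto' -> 'payment_type'; other names map to themselves
        source = next(
            (c for c in categorical_features if name.startswith(c + '_')), name
        )
        groups.setdefault(source, []).append(col)
    return {
        source: float(np.abs(shap_vals[:, cols].sum(axis=1)).mean())
        for source, cols in groups.items()
    }


# Made-up attribution matrix: 3 rows, 4 encoded columns
toy_names = ['delivery_days', 'payment_type_boleto',
             'payment_type_credit_card', 'is_late_delivery']
toy_vals = np.array([
    [0.5, 0.1, -0.3, -0.2],
    [-0.5, 0.2, 0.0, 0.4],
    [1.0, -0.1, 0.1, 0.0],
])

grouped = group_shap_importance(toy_vals, toy_names, ['payment_type'])
print(grouped)
```

Prefix matching can misassign a column if another feature's name begins with a categorical feature's name plus an underscore, so the grouping is worth checking against the actual encoded names.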

In [None]:
print("=" * 60)
print("SHAP FEATURE IMPORTANCE ANALYSIS")
print("=" * 60)

# Get the classifier from the pipeline
classifier = best_lgbm.named_steps['classifier']

# Transform validation data
X_val_transformed = best_lgbm.named_steps['preprocessor'].transform(X_val)

# Get feature names after preprocessing
def get_feature_names(preprocessor, num_features, cat_features, bin_features):
    """Extract feature names after pipeline transformation."""
    names = []
    
    # Numerical features (same names after scaling)
    names.extend(num_features)
    
    # Categorical features (one-hot encoded names)
    try:
        cat_encoder = preprocessor.named_transformers_['cat'].named_steps['encoder']
        cat_feature_names = cat_encoder.get_feature_names_out(cat_features)
        names.extend(cat_feature_names.tolist())
    except Exception:
        names.extend(cat_features)
    
    # Binary features (passthrough)
    names.extend(bin_features)
    
    return names

feature_names = get_feature_names(
    best_lgbm.named_steps['preprocessor'],
    NUMERICAL_FEATURES,
    CATEGORICAL_FEATURES,
    BINARY_FEATURES
)

print(f"Total features after preprocessing: {len(feature_names)}")

In [None]:
# Create SHAP explainer
print("\nComputing SHAP values (this may take a moment)...")

# Use a sample for faster computation if dataset is large
sample_size = min(5000, len(X_val_transformed))
sample_idx = np.random.choice(len(X_val_transformed), sample_size, replace=False)
X_sample = X_val_transformed[sample_idx]

explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X_sample)

print(f"SHAP values computed for {sample_size} samples.")

In [None]:
# SHAP Summary Plot (beeswarm)
plt.figure(figsize=(12, 10))

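# Depending on the SHAP version, binary-classifier output is a list [class_0, class_1],
# a 3D array (samples, features, classes), or a single 2D array
if isinstance(shap_values, list):
    shap_vals = shap_values[1]  # Use class 1 (satisfied)
elif np.ndim(shap_values) == 3:
    shap_vals = shap_values[:, :, 1]  # Use class 1 (satisfied)
else:
    shap_vals = shap_values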

shap.summary_plot(shap_vals, X_sample, feature_names=feature_names, show=False, max_display=20)
plt.title('SHAP Feature Importance - Customer Satisfaction\n(Top 20 Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../models/plots/shap_classification_summary.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved: models/plots/shap_classification_summary.png")

In [None]:
# SHAP Bar Plot (mean absolute values)
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_vals, X_sample, feature_names=feature_names, 
                  plot_type='bar', show=False, max_display=20)
plt.title('Mean |SHAP| - Feature Importance\n(Top 20 Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../models/plots/shap_classification_bar.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved: models/plots/shap_classification_bar.png")

In [None]:
# Top features by SHAP importance
shap_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': np.abs(shap_vals).mean(axis=0)
}).sort_values('importance', ascending=False)

print("\n" + "=" * 60)
print("TOP 20 FEATURES BY SHAP IMPORTANCE")
print("=" * 60)
print(shap_importance.head(20).to_string(index=False))

# Save SHAP importance
shap_importance.to_csv('../models/shap_classification_importance.csv', index=False)
print("\n✓ Saved: models/shap_classification_importance.csv")

## 13. Save Best Model & Artifacts

In [None]:
# Select best model (the tuned LightGBM is assumed to rank first;
# confirm against comparison_df before saving)
best_classifier = best_lgbm
best_clf_name = 'LightGBM (tuned)'

print("=" * 60)
print("SAVING BEST CLASSIFICATION MODEL")
print("=" * 60)
print(f"\nBest Model: {best_clf_name}")
print(f"Validation ROC-AUC: {lgbm_tuned_metrics['roc_auc']:.4f}")
print(f"Validation Accuracy: {lgbm_tuned_metrics['accuracy']:.4f}")
print(f"Validation F1: {lgbm_tuned_metrics['f1']:.4f}")

In [None]:
# Save model
joblib.dump(best_classifier, '../models/satisfaction_classifier.joblib')
print("\n✓ Saved: models/satisfaction_classifier.joblib")

In [None]:
# Save classification metadata
clf_metadata = {
    'model_name': best_clf_name,
    'algorithm': 'LightGBM',
    'best_params': {k.replace('classifier__', ''): v for k, v in search.best_params_.items()},
    'features': {
        'numerical': NUMERICAL_FEATURES,
        'categorical': CATEGORICAL_FEATURES,
        'binary': BINARY_FEATURES,
        'total_after_encoding': len(feature_names),
    },
    'metrics': {
        'validation': {
            'accuracy': lgbm_tuned_metrics['accuracy'],
            'precision': lgbm_tuned_metrics['precision'],
            'recall': lgbm_tuned_metrics['recall'],
            'f1': lgbm_tuned_metrics['f1'],
            'roc_auc': lgbm_tuned_metrics['roc_auc'],
        },
        'cv_roc_auc': float(search.best_score_),
    },
    'class_distribution': {
        'train': {
            'satisfied': float(y_train.mean()),
            'unsatisfied': float(1 - y_train.mean()),
        },
        'val': {
            'satisfied': float(y_val.mean()),
            'unsatisfied': float(1 - y_val.mean()),
        },
    },
    'top_features': shap_importance.head(10).to_dict('records'),
    'training_date': datetime.now().isoformat(),
}

with open('../models/classification_metadata.json', 'w') as f:
    json.dump(clf_metadata, f, indent=2, default=str)

print("✓ Saved: models/classification_metadata.json")

## 14. Part 1 Summary

In [None]:
print("\n" + "=" * 70)
print("PART 1: CLASSIFICATION COMPLETE")
print("=" * 70)

print(f"""
TASK: Customer Satisfaction Prediction
TARGET: is_satisfied (binary: review_score >= 4)

BEST MODEL: {best_clf_name}
─────────────────────────────────────────
  Validation ROC-AUC:   {lgbm_tuned_metrics['roc_auc']:.4f}
  Validation Accuracy:  {lgbm_tuned_metrics['accuracy']:.4f}
  Validation F1:        {lgbm_tuned_metrics['f1']:.4f}
  Validation Precision: {lgbm_tuned_metrics['precision']:.4f}
  Validation Recall:    {lgbm_tuned_metrics['recall']:.4f}
  CV ROC-AUC:           {search.best_score_:.4f}

IMPROVEMENT OVER BASELINE:
─────────────────────────────────────────
  ROC-AUC: {baseline_roc_auc:.4f} → {lgbm_tuned_metrics['roc_auc']:.4f} (+{lgbm_tuned_metrics['roc_auc'] - baseline_roc_auc:.4f})

TOP 5 PREDICTORS (by SHAP):
─────────────────────────────────────────""")

# Rank the top five features starting at 1
for rank, (_, row) in enumerate(shap_importance.head(5).iterrows(), start=1):
    print(f"  {rank}. {row['feature']}: {row['importance']:.4f}")

print(f"""
ARTIFACTS SAVED:
─────────────────────────────────────────
  ✓ models/satisfaction_classifier.joblib
  ✓ models/classification_metadata.json
  ✓ models/classification_comparison.csv
  ✓ models/shap_classification_importance.csv
  ✓ models/plots/classification_roc_curves.png
  ✓ models/plots/classification_confusion_matrix.png
  ✓ models/plots/shap_classification_summary.png
  ✓ models/plots/shap_classification_bar.png

{'='*70}
Ready for Part 2: Regression (Delivery Time Prediction)!
{'='*70}
""")

---

# Part 2: Regression (Delivery Time Prediction)

*To be implemented in next session*