# III. Advanced Modeling: Product Recommendation System

## Expert-Level Modeling Strategy

### Problem Reframing
Instead of simple binary classification, we'll build a **product recommendation system** that:
- Predicts adoption probability for each customer-product pair
- Ranks products by adoption likelihood for each customer
- Outputs top-3 product recommendations per customer
- Achieves business-relevant performance metrics
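
The ranking step described above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's implementation: the `scores` frame, its `adoption_prob` column, and the helper `top_k_recommendations` are hypothetical names, and the probabilities are made up.

```python
import pandas as pd

# Hypothetical scored pairs: one row per (customer, product) with a model score.
scores = pd.DataFrame({
    "user_id":       [1, 1, 1, 1, 2, 2, 2, 2],
    "product_id":    [10, 11, 12, 13, 10, 11, 12, 13],
    "adoption_prob": [0.10, 0.80, 0.35, 0.60, 0.55, 0.05, 0.90, 0.20],
})

def top_k_recommendations(scored_pairs, k=3):
    """Return the k highest-scoring products per customer, best first."""
    ranked = scored_pairs.sort_values(
        ["user_id", "adoption_prob"], ascending=[True, False]
    )
    top = ranked.groupby("user_id").head(k).copy()
    # Rank 1 is the strongest recommendation for that customer
    top["rank"] = top.groupby("user_id").cumcount() + 1
    return top.reset_index(drop=True)

top3 = top_k_recommendations(scores, k=3)
print(top3)
```

Sorting once and taking `head(k)` per group avoids a Python loop over customers, which matters at the scale of 100,000 customers.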

### Advanced Techniques
1. **Multi-Target Learning**: Predict multiple products simultaneously
2. **Ensemble Methods**: Combine multiple algorithms for robustness
3. **Deep Feature Engineering**: Create business-meaningful features
4. **Cross-Product Features**: Interaction terms between customer and product attributes
5. **Time-Series Features**: Temporal patterns and seasonality
6. **Collaborative Filtering**: User-item interaction patterns
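
As an illustration of item 6, here is a minimal item-based collaborative filtering sketch. It assumes a small dense binary interaction matrix with made-up values; the function names `item_similarity` and `cf_scores` are hypothetical, and a real dataset of this size would likely need a sparse matrix instead.

```python
import numpy as np

# Hypothetical binary interaction matrix: rows = customers, columns = products,
# 1 = adopted. Values are invented for illustration.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between product columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for products nobody adopted
    normalized = matrix / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)  # a product should not recommend itself
    return sim

def cf_scores(matrix, sim):
    """Score each product by its similarity to the customer's adopted products."""
    scores = matrix @ sim
    scores[matrix > 0] = -np.inf  # mask products the customer already adopted
    return scores

sim = item_similarity(interactions)
scores = cf_scores(interactions, sim)
best_product = scores.argmax(axis=1)  # column index of the top suggestion per customer
print(best_product)
```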

### Success Criteria
- **Precision@3 > 60%**: At least 60% of top-3 recommendations should be adopted
- **Recall@10 > 80%**: Capture 80% of actual adoptions in top-10 recommendations
- **NDCG@5 > 0.7**: Strong ranking quality for business decisions
- **Business Impact**: Measurable lift in conversion rates
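
To make the three ranking metrics concrete, the sketch below computes them for a single customer. It assumes `relevance` is a binary list ordered by predicted score (1 = the product was adopted); the function names are illustrative and are not taken from the notebook. Averaging the per-customer values gives the figures the criteria refer to.

```python
import numpy as np

def precision_at_k(relevance, k):
    """Fraction of the top-k recommended items that were adopted."""
    top = np.asarray(relevance, dtype=float)[:k]
    return float(top.sum()) / k

def recall_at_k(relevance, k, n_relevant):
    """Fraction of all adopted items that appear in the top-k."""
    if n_relevant == 0:
        return 0.0
    top = np.asarray(relevance, dtype=float)[:k]
    return float(top.sum()) / n_relevant

def ndcg_at_k(relevance, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    rel = np.asarray(relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(rank + 1)
    top = rel[:k]
    dcg = float((top * discounts[:len(top)]).sum())
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((ideal * discounts[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# One customer's candidates, sorted by predicted score (hypothetical labels)
relevance = [1, 0, 1, 0, 0, 1]
p3 = precision_at_k(relevance, 3)
r3 = recall_at_k(relevance, 3, n_relevant=sum(relevance))
n5 = ndcg_at_k(relevance, 5)
print(f"P@3={p3:.3f}  R@3={r3:.3f}  NDCG@5={n5:.3f}")
```

Note that `precision_at_k` divides by `k` even when a customer has fewer than `k` candidates, so short candidate lists are penalized; whether that is the desired behavior is a design choice.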

### Model Architecture
We'll implement a **hybrid recommendation system** combining:
- Content-based filtering (customer/product features)
- Collaborative filtering (interaction patterns)
- Deep neural networks for complex patterns
- Gradient boosting for structured data

In [1]:
# Advanced Modeling - Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Advanced ML Libraries
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    precision_recall_curve, roc_curve, average_precision_score,
    precision_score, recall_score, f1_score, ndcg_score
)
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Advanced Ensemble Methods
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available - will use LightGBM instead")

try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("LightGBM not available - will use alternative")

# Neural Networks
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, TensorDataset
    PYTORCH_AVAILABLE = True
    print("PyTorch available for deep learning")
except ImportError:
    PYTORCH_AVAILABLE = False
    print("PyTorch not available - will use sklearn MLPClassifier")
    print("To install PyTorch: pip install torch")

# Model Interpretation
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available - will use feature importance instead")

# Utility Libraries
import joblib
import json
from datetime import datetime, timedelta
from itertools import combinations
from scipy import stats
from collections import defaultdict

print("✓ Advanced modeling libraries imported successfully")
print(f"Available advanced libraries:")
print(f"  - XGBoost: {XGBOOST_AVAILABLE}")
print(f"  - LightGBM: {LIGHTGBM_AVAILABLE}")
print(f"  - PyTorch: {PYTORCH_AVAILABLE}")
print(f"  - SHAP: {SHAP_AVAILABLE}")

PyTorch available for deep learning
✓ Advanced modeling libraries imported successfully
Available advanced libraries:
  - XGBoost: True
  - LightGBM: True
  - PyTorch: True
  - SHAP: True


In [2]:
# Optional: Install missing libraries
print("📦 LIBRARY INSTALLATION CHECK")
print("=" * 40)

missing_libs = []

if not PYTORCH_AVAILABLE:
    missing_libs.append("torch")
    print("⚠️  PyTorch not found")

if not LIGHTGBM_AVAILABLE:
    missing_libs.append("lightgbm")
    print("⚠️  LightGBM not found")

if not SHAP_AVAILABLE:
    missing_libs.append("shap")
    print("⚠️  SHAP not found")

if missing_libs:
    print(f"\n💡 To install missing libraries, run:")
    print(f"   pip install {' '.join(missing_libs)}")
    print(f"\n🔄 You can continue without these libraries - alternative models will be used")
else:
    print("✅ All advanced libraries are available!")

print("\n🚀 Proceeding with available libraries...")

📦 LIBRARY INSTALLATION CHECK
✅ All advanced libraries are available!

🚀 Proceeding with available libraries...


In [3]:
# Configuration and Constants
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Set random seeds for reproducibility
if PYTORCH_AVAILABLE:
    torch.manual_seed(RANDOM_STATE)

# Business Configuration
TARGET_PRECISION_AT_3 = 0.60  # 60% precision for top-3 recommendations
TARGET_RECALL_AT_10 = 0.80    # 80% recall for top-10 recommendations
TARGET_NDCG_AT_5 = 0.70       # NDCG@5 target of 0.70 (a score in [0, 1], not a percentage)

# Model Configuration
N_FOLDS = 5
TEST_SIZE = 0.2
VAL_SIZE = 0.2
MAX_FEATURES_TO_SELECT = 50

print("🎯 ADVANCED MODELING CONFIGURATION")
print("=" * 50)
print(f"Target Precision@3: {TARGET_PRECISION_AT_3:.1%}")
print(f"Target Recall@10: {TARGET_RECALL_AT_10:.1%}")
print(f"Target NDCG@5: {TARGET_NDCG_AT_5:.1%}")
print(f"Cross-validation folds: {N_FOLDS}")
print(f"Random state: {RANDOM_STATE}")

🎯 ADVANCED MODELING CONFIGURATION
Target Precision@3: 60.0%
Target Recall@10: 80.0%
Target NDCG@5: 70.0%
Cross-validation folds: 5
Random state: 42


In [4]:
# Load and Prepare Data
print("📊 LOADING DATA FOR ADVANCED MODELING")
print("=" * 50)

# Load all datasets
customers = pd.read_csv('data/data_customers.csv')
products = pd.read_csv('data/data_products.csv')
adoption_logs = pd.read_csv('data/data_adoption_logs.csv')

print(f"Loaded datasets:")
print(f"  - Customers: {customers.shape}")
print(f"  - Products: {products.shape}")
print(f"  - Adoption logs: {adoption_logs.shape}")

# Check for processed data
try:
    processed_adoption = pd.read_csv('data/processed_adoption_logs.csv')
    processed_products = pd.read_csv('data/processed_products.csv')
    PROCESSED_AVAILABLE = True
    print(f"  - Processed adoption: {processed_adoption.shape}")
    print(f"  - Processed products: {processed_products.shape}")
except FileNotFoundError:
    PROCESSED_AVAILABLE = False
    print("  - No processed data found, will create from raw data")

# Basic data exploration
print(f"\nData Overview:")
print(f"  - Unique customers: {customers['user_id'].nunique():,}")
print(f"  - Unique products: {products['product_id'].nunique():,}")
print(f"  - Total adoption events: {adoption_logs.shape[0]:,}")
print(f"  - Adoption rate: {adoption_logs['adopted'].mean():.1%}")

📊 LOADING DATA FOR ADVANCED MODELING
Loaded datasets:
  - Customers: (100000, 37)
  - Products: (1000, 26)
  - Adoption logs: (949650, 10)
  - Processed adoption: (949650, 17)
  - Processed products: (1000, 81)

Data Overview:
  - Unique customers: 100,000
  - Unique products: 1,000
  - Total adoption events: 949,650
  - Adoption rate: 25.1%


In [5]:
# Debug: Check column names
print("\n🔍 DEBUG: Checking column names")
print(f"Raw products columns: {list(products.columns)}")
if PROCESSED_AVAILABLE:  # processed_products is only defined when the processed files were found
    print(f"Processed products columns: {list(processed_products.columns)}")
print(f"Adoption logs columns: {list(adoption_logs.columns)}")
print(f"Customers columns: {list(customers.columns)}")


🔍 DEBUG: Checking column names
Raw products columns: ['product_id', 'category', 'tier', 'apr', 'reward_type', 'reward_value', 'eligibility', 'tenor_months', 'risk_adj_margin', 'hist_conv_rate', 'hist_profit', 'budget_remaining', 'max_redemptions', 'offer_dates', 'launch_recency_days', 'compliance_tag', 'channels', 'target_segments', 'geo_applic', 'merchant_industry', 'cost_to_bank', 'expected_utility', 'cross_sell_score', 'bundle_depth', 'valid_window', 'popularity_trend']
Processed products columns: ['apr', 'reward_value', 'tenor_months', 'risk_adj_margin', 'hist_conv_rate', 'hist_profit', 'budget_remaining', 'max_redemptions', 'offer_dates', 'launch_recency_days', 'cost_to_bank', 'expected_utility', 'cross_sell_score', 'bundle_depth', 'profitability_score', 'profit_segment', 'roi', 'is_competitive_apr', 'budget_utilization', 'offer_dates_year', 'offer_dates_month', 'offer_dates_quarter', 'offer_dates_dayofweek', 'offer_dates_is_weekend', 'product_id_frequency', 'category_DebitCard',

In [6]:
# Advanced Feature Engineering
print("🔧 ADVANCED FEATURE ENGINEERING")
print("=" * 50)

def create_customer_product_matrix():
    """Create optimized customer-product interaction matrix"""
    
    # Instead of Cartesian product, use existing adoption logs as base
    print("Using existing adoption logs as base matrix (memory efficient)...")
    
    # Start with existing interactions
    base_matrix = adoption_logs[['user_id', 'product_id', 'adopted']].copy()
    
    print(f"Base interaction matrix: {base_matrix.shape[0]:,} customer-product pairs")
    print(f"Adoption rate: {base_matrix['adopted'].mean():.1%}")
    
    # Optional: Add some negative samples strategically (not all combinations)
    # This prevents memory explosion while maintaining model effectiveness
    
    # Get customers and products with interactions
    active_customers = base_matrix['user_id'].unique()
    active_products = base_matrix['product_id'].unique()
    
    print(f"Active customers: {len(active_customers):,}")
    print(f"Active products: {len(active_products):,}")
    
    # Create a small sample of non-existing pairs for better model training
    print("Adding strategic negative samples...")
    
    # Sample some customers and products for negative examples
    sample_size = min(50000, len(active_customers) * 10)  # Limit to prevent memory issues
    
    # Create negative samples
    np.random.seed(RANDOM_STATE)
    negative_samples = []
    
    existing_pairs = set(zip(base_matrix['user_id'], base_matrix['product_id']))
    
    attempts = 0
    while len(negative_samples) < sample_size and attempts < sample_size * 3:
        customer = np.random.choice(active_customers)
        product = np.random.choice(active_products)
        
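        # Unlogged pairs are assumed to be non-adoptions (label 0); this is an
        # assumption, not an observed outcome
        if (customer, product) not in existing_pairs:
            negative_samples.append({
                'user_id': customer,
                'product_id': product,
                'adopted': 0
            })
            # Record the pair so the same negative cannot be drawn twice
            existing_pairs.add((customer, product))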
        
        attempts += 1
    
    if negative_samples:
        negative_df = pd.DataFrame(negative_samples)
        base_matrix = pd.concat([base_matrix, negative_df], ignore_index=True)
        print(f"Added {len(negative_samples):,} negative samples")
    
    print(f"Final matrix: {base_matrix.shape[0]:,} customer-product pairs")
    print(f"Final adoption rate: {base_matrix['adopted'].mean():.1%}")
    
    return base_matrix

def engineer_customer_features(df):
    """Create advanced customer features"""
    print("\n🧑‍💼 Engineering customer features...")
    
    customer_features = customers.copy()
    
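    # Behavioral features from adoption history
    # NOTE: these statistics use the full adoption log, including the 'adopted'
    # labels of rows that later fall into the validation and test splits.
    # Recompute them on training rows only to avoid target leakage.
    adoption_stats = adoption_logs.groupby('user_id').agg({
        'adopted': ['count', 'sum', 'mean'],
        'product_id': 'nunique'
    }).round(4)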
    
    adoption_stats.columns = [
        'total_interactions', 'total_adoptions', 'adoption_rate', 'unique_products_tried'
    ]
    
    # Customer lifecycle features
    if 'signup_date' in customers.columns:
        customer_features['signup_date'] = pd.to_datetime(customer_features['signup_date'])
        customer_features['days_since_signup'] = (
            pd.Timestamp.now() - customer_features['signup_date']
        ).dt.days
        
        # Customer maturity segments
        # include_lowest=True keeps customers with 0 days since signup in 'New'
        customer_features['customer_maturity'] = pd.cut(
            customer_features['days_since_signup'],
            bins=[0, 30, 90, 365, float('inf')],
            labels=['New', 'Recent', 'Established', 'Veteran'],
            include_lowest=True
        )
    
    # Financial profile features
    if 'annual_income' in customer_features.columns and 'account_balance' in customer_features.columns:
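        # Division by a zero income yields inf, which fillna alone does not catch
        customer_features['balance_to_income_ratio'] = (
            customer_features['account_balance'] / customer_features['annual_income']
        ).replace([np.inf, -np.inf], np.nan).fillna(0)
        
        # include_lowest=True keeps a ratio of exactly 0 in the 'Low' bin
        customer_features['financial_capacity'] = pd.cut(
            customer_features['balance_to_income_ratio'],
            bins=[0, 0.1, 0.5, 1.0, float('inf')],
            labels=['Low', 'Medium', 'High', 'Very High'],
            include_lowest=True
        )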
    
    # Merge adoption statistics
    customer_features = customer_features.merge(adoption_stats, on='user_id', how='left')
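    # Fill only numeric columns: fillna(0) on a categorical column can raise
    # because 0 is not one of its categories
    numeric_cols = customer_features.select_dtypes(include=[np.number]).columns
    customer_features[numeric_cols] = customer_features[numeric_cols].fillna(0)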
    
    print(f"  ✓ Customer features: {customer_features.shape[1]} columns")
    return customer_features

def engineer_product_features(df):
    """Create advanced product features"""
    print("\n📦 Engineering product features...")
    
    if PROCESSED_AVAILABLE:
        product_features = processed_products.copy()
        # Add product_id back if it's missing
        if 'product_id' not in product_features.columns:
            product_features['product_id'] = products['product_id'].values
    else:
        product_features = products.copy()
    
    # Product popularity features
    product_stats = adoption_logs.groupby('product_id').agg({
        'adopted': ['count', 'sum', 'mean'],
        'user_id': 'nunique'
    }).round(4)
    
    product_stats.columns = [
        'total_exposures', 'total_adoptions', 'adoption_rate', 'unique_customers'
    ]
    
    # Reset index to make product_id a column
    product_stats = product_stats.reset_index()
    
    # Product performance segments
    # include_lowest=True keeps products with a 0% adoption rate in 'Poor'
    product_stats['performance_tier'] = pd.cut(
        product_stats['adoption_rate'],
        bins=[0, 0.1, 0.3, 0.6, 1.0],
        labels=['Poor', 'Average', 'Good', 'Excellent'],
        include_lowest=True
    )
    
    # Market penetration
    total_customers = customers['user_id'].nunique()
    product_stats['market_penetration'] = product_stats['unique_customers'] / total_customers
    
    # Merge with product features
    product_features = product_features.merge(product_stats, on='product_id', how='left')
    
    # Handle missing values properly for different column types
    for col in product_features.columns:
        if product_features[col].dtype.name == 'category':
            # For categorical columns, fill with the mode or first category
            if product_features[col].isnull().any():
                mode_val = product_features[col].mode()
                if len(mode_val) > 0:
                    product_features[col] = product_features[col].fillna(mode_val[0])
                else:
                    # If no mode, use first category
                    product_features[col] = product_features[col].cat.add_categories(['Unknown']).fillna('Unknown')
        elif pd.api.types.is_numeric_dtype(product_features[col]):
            # For numeric columns, fill with 0
            product_features[col] = product_features[col].fillna(0)
        else:
            # For other types, fill with 'Unknown' or appropriate default
            product_features[col] = product_features[col].fillna('Unknown')
    
    print(f"  ✓ Product features: {product_features.shape[1]} columns")
    return product_features

def create_interaction_features(df, customer_features, product_features):
    """Create customer-product interaction features"""
    print("\n🔗 Creating interaction features...")
    
    # Merge customer and product features
    enriched_df = df.merge(customer_features, on='user_id', how='left')
    enriched_df = enriched_df.merge(product_features, on='product_id', how='left')
    
    # Create interaction features
    numeric_customer_cols = customer_features.select_dtypes(include=[np.number]).columns
    numeric_product_cols = product_features.select_dtypes(include=[np.number]).columns
    
    # Remove ID columns
    numeric_customer_cols = [col for col in numeric_customer_cols if 'id' not in col.lower()]
    numeric_product_cols = [col for col in numeric_product_cols if 'id' not in col.lower()]
    
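    # Create key interaction features (limit to avoid explosion)
    # Pairs whose columns are missing from enriched_df are skipped by the check
    # below; 'price' is not among the raw product columns printed earlier.
    # Both merged frames carry 'adoption_rate', so pandas suffixes the customer
    # column with '_x' and the product column with '_y'.
    key_interactions = [
        ('annual_income', 'price'),
        ('account_balance', 'price'), 
        ('adoption_rate_x', 'adoption_rate_y'),  # customer vs product adoption rates
        ('unique_products_tried', 'market_penetration')
    ]
    
    for customer_col, product_col in key_interactions:
        if customer_col in enriched_df.columns and product_col in enriched_df.columns:
            # Ratio features (the small epsilon guards against division by zero)
            interaction_name = f"{customer_col}_to_{product_col}_ratio"
            enriched_df[interaction_name] = (
                enriched_df[customer_col] / (enriched_df[product_col] + 1e-8)
            )
            
            # Multiplicative features (customer value times product value)
            product_name = f"{customer_col}_times_{product_col}"
            enriched_df[product_name] = enriched_df[customer_col] * enriched_df[product_col]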
    
    print(f"  ✓ Interaction features created: {enriched_df.shape[1]} total columns")
    return enriched_df

# Execute feature engineering
print("Starting comprehensive feature engineering...")

# Create customer-product matrix
cp_matrix = create_customer_product_matrix()

# Engineer features
customer_features = engineer_customer_features(cp_matrix)
product_features = engineer_product_features(cp_matrix)

# Create final feature matrix
feature_matrix = create_interaction_features(cp_matrix, customer_features, product_features)

print(f"\n✅ FEATURE ENGINEERING COMPLETED")
print(f"Final feature matrix: {feature_matrix.shape}")
print(f"Adoption rate: {feature_matrix['adopted'].mean():.1%}")

🔧 ADVANCED FEATURE ENGINEERING
Starting comprehensive feature engineering...
Using existing adoption logs as base matrix (memory efficient)...
Base interaction matrix: 949,650 customer-product pairs
Adoption rate: 25.1%
Active customers: 100,000
Active products: 1,000
Adding strategic negative samples...
Added 50,000 negative samples
Final matrix: 999,650 customer-product pairs
Final adoption rate: 23.8%

🧑‍💼 Engineering customer features...
  ✓ Customer features: 41 columns

📦 Engineering product features...
  ✓ Product features: 88 columns

🔗 Creating interaction features...
  ✓ Interaction features created: 134 total columns
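
The adoption statistics above are aggregated over the full adoption log, so each row's features partly encode its own label and the labels of rows that end up in validation or test. A minimal sketch of one alternative is to aggregate over training-side users only and merge the result onto every row. The toy `logs` frame, the `train_users` set, and the helper name are hypothetical.

```python
import pandas as pd

# Toy adoption log (hypothetical values)
logs = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "product_id": [10, 11, 12, 10, 11, 10],
    "adopted":    [1, 0, 1, 0, 0, 1],
})
train_users = {1, 2}  # assumed to come from a user-based split

def train_only_product_stats(logs, train_users):
    """Aggregate product adoption statistics from training users only."""
    train_logs = logs[logs["user_id"].isin(train_users)]
    return (
        train_logs.groupby("product_id")["adopted"]
        .agg(total_exposures="count", total_adoptions="sum", adoption_rate="mean")
        .reset_index()
    )

product_stats = train_only_product_stats(logs, train_users)
# Held-out rows receive statistics that never saw their own labels
enriched = logs.merge(product_stats, on="product_id", how="left")
print(enriched)
```

Training rows still see their own label inside the aggregate; out-of-fold aggregation on the training set is a common further refinement.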


In [7]:
# Advanced Data Preprocessing (Ultra Memory Efficient)
print("🔄 ADVANCED DATA PREPROCESSING (ULTRA MEMORY EFFICIENT)")
print("=" * 50)

# Check available memory and reduce dataset size if needed
print(f"Original dataset: {feature_matrix.shape[0]:,} rows, {feature_matrix.shape[1]} columns")
print(f"Memory usage: {feature_matrix.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Sample the dataset to manageable size for memory constraints
max_rows = 200000  # Reduce to 200k rows to fit in memory
if feature_matrix.shape[0] > max_rows:
    print(f"\n⚠️  Dataset too large for memory. Sampling {max_rows:,} rows...")
    
    # Approximately stratified sampling: the 0.24 positive share is hardcoded
    # to roughly match the 23.8% adoption rate of the full matrix
    sampled_positive = feature_matrix[feature_matrix['adopted'] == 1].sample(
        n=min(int(max_rows * 0.24), feature_matrix[feature_matrix['adopted'] == 1].shape[0]), 
        random_state=42
    )
    sampled_negative = feature_matrix[feature_matrix['adopted'] == 0].sample(
        n=max_rows - len(sampled_positive), 
        random_state=42
    )
    
    sampled_data = pd.concat([sampled_positive, sampled_negative], ignore_index=True)
    sampled_data = sampled_data.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle
    
    print(f"Sampled dataset: {sampled_data.shape[0]:,} rows")
    print(f"Adoption rate: {sampled_data['adopted'].mean():.1%}")
    print(f"Memory usage: {sampled_data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    
    feature_matrix_working = sampled_data.copy()
else:
    feature_matrix_working = feature_matrix.copy()

def preprocess_for_modeling_minimal(df):
    """Minimal preprocessing to avoid memory issues"""
    
    processed_df = df.copy()
    
    # Handle only essential categorical encoding
    categorical_cols = processed_df.select_dtypes(include=['object', 'category']).columns.tolist()
    categorical_cols = [col for col in categorical_cols if col not in ['user_id', 'product_id']]
    
    print(f"Processing {len(categorical_cols)} categorical columns...")
    
    label_encoders = {}
    for col in categorical_cols[:5]:  # Limit to first 5 to save memory
        if col in processed_df.columns:
            try:
                le = LabelEncoder()
                processed_df[col] = le.fit_transform(processed_df[col].astype(str))
                label_encoders[col] = le
                print(f"  ✓ Label encoded {col}")
            except Exception as e:
                print(f"  ⚠️  Skipped {col}: {e}")
                processed_df = processed_df.drop(columns=[col])
    
    # Drop remaining categorical columns to save memory
    remaining_cats = [col for col in categorical_cols[5:] if col in processed_df.columns]
    if remaining_cats:
        processed_df = processed_df.drop(columns=remaining_cats)
        print(f"  ✓ Dropped {len(remaining_cats)} categorical columns to save memory")
    
    # Handle missing values minimally
    numeric_cols = processed_df.select_dtypes(include=[np.number]).columns.tolist()
    for col in numeric_cols:
        if processed_df[col].isnull().sum() > 0:
            if col.endswith('_rate') or col.endswith('_ratio'):
                processed_df[col] = processed_df[col].fillna(0)
            else:
                processed_df[col] = processed_df[col].fillna(processed_df[col].median())
    
    # Skip correlation removal to save memory
    print("  ✓ Skipping correlation analysis to save memory")
    
    # Convert to smaller data types
    for col in processed_df.select_dtypes(include=['float64']).columns:
        if col not in ['user_id', 'product_id']:
            processed_df[col] = pd.to_numeric(processed_df[col], downcast='float')
    
    for col in processed_df.select_dtypes(include=['int64']).columns:
        if col not in ['user_id', 'product_id']:
            processed_df[col] = pd.to_numeric(processed_df[col], downcast='integer')
    
    return processed_df, label_encoders

# Apply minimal preprocessing
modeling_data, encoders = preprocess_for_modeling_minimal(feature_matrix_working)

# Prepare features and target
feature_cols = [col for col in modeling_data.columns if col not in ['user_id', 'product_id', 'adopted']]
X = modeling_data[feature_cols]
y = modeling_data['adopted'].astype('uint8')

print(f"\n📊 FINAL MODELING DATASET")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]:,}")
print(f"Positive samples: {y.sum():,} ({y.mean():.1%})")
print(f"Memory usage: {X.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Check data quality
print(f"\nData quality checks:")
print(f"  - Missing values: {X.isnull().sum().sum()}")
print(f"  - Infinite values: {np.isinf(X.select_dtypes(include=[np.number])).sum().sum()}")
print(f"  - Feature variance > 0: {(X.var() > 0).sum()}/{X.shape[1]}")

# Clear intermediate variables to free memory
del feature_matrix_working
if 'sampled_data' in locals():
    del sampled_data
    del sampled_positive, sampled_negative

print("\n✓ Memory optimization completed")

🔄 ADVANCED DATA PREPROCESSING (ULTRA MEMORY EFFICIENT)
Original dataset: 999,650 rows, 134 columns
Memory usage: 2041.0 MB

⚠️  Dataset too large for memory. Sampling 200,000 rows...
Sampled dataset: 200,000 rows
Adoption rate: 24.0%
Memory usage: 408.3 MB
Processing 18 categorical columns...
  ✓ Label encoded occupation
  ✓ Label encoded income_tier
  ✓ Label encoded marital_status
  ✓ Label encoded preferred_language
  ✓ Label encoded products
  ✓ Dropped 13 categorical columns to save memory
  ✓ Skipping correlation analysis to save memory

📊 FINAL MODELING DATASET
Features: 118
Samples: 200,000
Positive samples: 48,000 (24.0%)
Memory usage: 53.6 MB

Data quality checks:
  - Missing values: 0
  - Infinite values: 0
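
The downsampling step in this cell hardcodes the positive share. As a hedged alternative, `DataFrameGroupBy.sample` (available in pandas 1.1 and later) draws the same fraction from each class, so the class ratio follows the data. The synthetic frame and the helper `stratified_sample` below are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature matrix (hypothetical data)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "adopted": (rng.random(10_000) < 0.24).astype(int),
})

def stratified_sample(frame, target_col, n_rows, random_state=42):
    """Downsample to about n_rows while preserving the class ratio."""
    frac = min(1.0, n_rows / len(frame))
    sampled = frame.groupby(target_col, group_keys=False).sample(
        frac=frac, random_state=random_state
    )
    # Shuffle so the classes are not stored in contiguous blocks
    return sampled.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

sample = stratified_sample(df, "adopted", n_rows=2_000)
print(len(sample), f"{sample['adopted'].mean():.3f}", f"{df['adopted'].mean():.3f}")
```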


In [8]:
# Smart Train-Validation-Test Split
print("🎯 INTELLIGENT DATA SPLITTING")
print("=" * 50)

def create_smart_splits(X, y, modeling_data):
    """Create user-based train/validation/test splits (60/20/20 by user)"""
    
    # Split by user rather than by interaction so that every interaction of a
    # given customer lands in exactly one split. Products are not split and
    # can appear in all three sets. The splits are not stratified on the target.
    
    users = modeling_data['user_id'].unique()
    
    # Split users (not individual interactions)
    train_users, temp_users = train_test_split(
        users, test_size=TEST_SIZE + VAL_SIZE, random_state=RANDOM_STATE
    )
    val_users, test_users = train_test_split(
        temp_users, test_size=TEST_SIZE / (TEST_SIZE + VAL_SIZE), random_state=RANDOM_STATE
    )
    
    # Create splits based on users
    train_mask = modeling_data['user_id'].isin(train_users)
    val_mask = modeling_data['user_id'].isin(val_users)
    test_mask = modeling_data['user_id'].isin(test_users)
    
    X_train = X[train_mask]
    y_train = y[train_mask]
    X_val = X[val_mask]
    y_val = y[val_mask]
    X_test = X[test_mask]
    y_test = y[test_mask]
    
    print(f"User-based splits:")
    print(f"  Train users: {len(train_users):,} ({len(train_users)/len(users):.1%})")
    print(f"  Val users: {len(val_users):,} ({len(val_users)/len(users):.1%})")
    print(f"  Test users: {len(test_users):,} ({len(test_users)/len(users):.1%})")
    
    print(f"\nInteraction splits:")
    print(f"  Train: {X_train.shape[0]:,} interactions ({y_train.mean():.1%} positive)")
    print(f"  Val: {X_val.shape[0]:,} interactions ({y_val.mean():.1%} positive)")
    print(f"  Test: {X_test.shape[0]:,} interactions ({y_test.mean():.1%} positive)")
    
    return X_train, X_val, X_test, y_train, y_val, y_test, train_mask, val_mask, test_mask

# Create splits
X_train, X_val, X_test, y_train, y_val, y_test, train_mask, val_mask, test_mask = create_smart_splits(
    X, y, modeling_data
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("✅ Data splitting and scaling completed successfully")

🎯 INTELLIGENT DATA SPLITTING
User-based splits:
  Train users: 53,382 (60.0%)
  Val users: 17,794 (20.0%)
  Test users: 17,795 (20.0%)

Interaction splits:
  Train: 119,975 interactions (23.8% positive)
  Val: 39,926 interactions (24.4% positive)
  Test: 40,099 interactions (24.1% positive)
✅ Data splitting and scaling completed successfully
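
The same user-based split can also be expressed with scikit-learn's `GroupShuffleSplit`, which keeps all rows of a group on one side of the split. This is a sketch on synthetic data, not a replacement for the cell above; the `*_demo` arrays and the 200 synthetic users are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic interactions: 1,000 rows spread over 200 hypothetical users
rng = np.random.default_rng(42)
n_rows = 1_000
user_ids = rng.integers(0, 200, size=n_rows)
X_demo = rng.normal(size=(n_rows, 4))
y_demo = rng.integers(0, 2, size=n_rows)

# First split: about 60% of users for training, 40% held out
outer = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, temp_idx = next(outer.split(X_demo, y_demo, groups=user_ids))

# Second split: divide the held-out users evenly into validation and test
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
val_rel, test_rel = next(
    inner.split(X_demo[temp_idx], y_demo[temp_idx], groups=user_ids[temp_idx])
)
val_idx, test_idx = temp_idx[val_rel], temp_idx[test_rel]

train_users = set(user_ids[train_idx])
val_users = set(user_ids[val_idx])
test_users = set(user_ids[test_idx])
print(len(train_users), len(val_users), len(test_users))
```

Because the fractions apply to users rather than rows, the row counts per split only approximate 60/20/20, as in the output above.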


In [9]:
# Advanced Model Development
print("🤖 ADVANCED MODEL DEVELOPMENT")
print("=" * 50)

# PyTorch Neural Network Class
if PYTORCH_AVAILABLE:
    class PyTorchBinaryClassifier(nn.Module):
        def __init__(self, input_size, hidden_sizes=(128, 64, 32), dropout_rate=0.3):
            # Tuple default avoids sharing one mutable list across instances
            super().__init__()
            
            layers = []
            prev_size = input_size
            
            for hidden_size in hidden_sizes:
                layers.append(nn.Linear(prev_size, hidden_size))
                layers.append(nn.ReLU())
                layers.append(nn.BatchNorm1d(hidden_size))
                layers.append(nn.Dropout(dropout_rate))
                prev_size = hidden_size
            
            layers.append(nn.Linear(prev_size, 1))
            layers.append(nn.Sigmoid())
            
            self.network = nn.Sequential(*layers)
        
        def forward(self, x):
            return self.network(x)
    
    class PyTorchWrapper:
        """Wrapper to make PyTorch model compatible with sklearn interface"""
        
        def __init__(self, input_size, hidden_sizes=(128, 64, 32), learning_rate=0.001, epochs=50, batch_size=512):
            self.input_size = input_size
            self.hidden_sizes = hidden_sizes
            self.learning_rate = learning_rate
            self.epochs = epochs
            self.batch_size = batch_size
            self.model = None
            self.scaler = StandardScaler()
            
        def fit(self, X, y):
            # Scale features
            X_scaled = self.scaler.fit_transform(X)
            
            # Convert to tensors
            X_tensor = torch.FloatTensor(X_scaled)
            y_tensor = torch.FloatTensor(y.values if hasattr(y, 'values') else y).reshape(-1, 1)
            
            # Create dataset
            dataset = TensorDataset(X_tensor, y_tensor)
            dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
            
            # Initialize model
            self.model = PyTorchBinaryClassifier(X_scaled.shape[1], self.hidden_sizes)
            criterion = nn.BCELoss()
            optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
            
            # Training loop
            self.model.train()
            for epoch in range(self.epochs):
                total_loss = 0
                for batch_X, batch_y in dataloader:
                    optimizer.zero_grad()
                    outputs = self.model(batch_X)
                    loss = criterion(outputs, batch_y)
                    loss.backward()
                    optimizer.step()
                    total_loss += loss.item()
                
                if epoch % 10 == 0:
                    avg_loss = total_loss / len(dataloader)
                    print(f"    Epoch {epoch}/{self.epochs}, Loss: {avg_loss:.4f}")
            
            return self
        
        def predict(self, X):
            X_scaled = self.scaler.transform(X)
            X_tensor = torch.FloatTensor(X_scaled)
            
            self.model.eval()
            with torch.no_grad():
                outputs = self.model(X_tensor)
                predictions = (outputs.numpy() > 0.5).astype(int).flatten()
            
            return predictions
        
        def predict_proba(self, X):
            X_scaled = self.scaler.transform(X)
            X_tensor = torch.FloatTensor(X_scaled)
            
            self.model.eval()
            with torch.no_grad():
                outputs = self.model(X_tensor)
                proba_positive = outputs.numpy().flatten()
                proba_negative = 1 - proba_positive
                
            return np.column_stack([proba_negative, proba_positive])

class AdvancedRecommenderEnsemble:
    """Advanced ensemble recommender system"""
    
    def __init__(self, random_state=42):
        self.random_state = random_state
        self.models = {}
        self.weights = {}
        self.scaler = StandardScaler()
        self.feature_selector = None
        
    def create_base_models(self):
        """Create diverse base models"""
        models = {}
        
        # 1. Logistic Regression with regularization
        models['logistic'] = LogisticRegression(
            random_state=self.random_state,
            class_weight='balanced',
            C=0.1,
            max_iter=1000
        )
        
        # 2. Random Forest with tuned parameters
        models['random_forest'] = RandomForestClassifier(
            n_estimators=200,
            max_depth=15,
            min_samples_split=10,
            min_samples_leaf=5,
            class_weight='balanced_subsample',
            random_state=self.random_state,
            n_jobs=-1
        )
        
        # 3. Gradient Boosting
        models['gradient_boosting'] = GradientBoostingClassifier(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=8,
            min_samples_split=10,
            min_samples_leaf=5,
            random_state=self.random_state
        )
        
        # 4. LightGBM if available (instead of XGBoost)
        if LIGHTGBM_AVAILABLE:
            models['lightgbm'] = lgb.LGBMClassifier(
                n_estimators=100,
                learning_rate=0.1,
                max_depth=8,
                min_child_samples=10,
                subsample=0.8,
                colsample_bytree=0.8,
                random_state=self.random_state,
                objective='binary',
                metric='binary_logloss',
                verbose=-1  # Suppress output
            )
        
        # 5. PyTorch Neural Network if available
        if PYTORCH_AVAILABLE:
            models['pytorch_nn'] = PyTorchWrapper(
                input_size=None,  # Will be set dynamically
                hidden_sizes=[128, 64, 32],
                learning_rate=0.001,
                epochs=30,
                batch_size=512
            )
        
        return models
    
    def fit(self, X_train, y_train, X_val, y_val, feature_names):
        """Fit ensemble with feature selection and validation"""
        
        print("🔧 Training advanced ensemble...")
        
        # Feature selection
        self.feature_selector = SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=self.random_state),
            max_features=min(MAX_FEATURES_TO_SELECT, X_train.shape[1])
        )
        
        X_train_selected = self.feature_selector.fit_transform(X_train, y_train)
        X_val_selected = self.feature_selector.transform(X_val)
        
        # np.asarray lets feature_names be a list as well as a pandas Index
        selected_features = np.asarray(feature_names)[self.feature_selector.get_support()]
        print(f"  ✓ Selected {len(selected_features)} most important features")
        
        # Train base models
        self.models = self.create_base_models()
        validation_scores = {}
        
        for name, model in self.models.items():
            print(f"  🏋️ Training {name}...")
            
            # The PyTorch wrapper needs its input width set before fitting
            if name == 'pytorch_nn':
                model.input_size = X_train_selected.shape[1]
            model.fit(X_train_selected, y_train)
            
            # Validate
            val_pred = model.predict_proba(X_val_selected)[:, 1]
            val_auc = roc_auc_score(y_val, val_pred)
            validation_scores[name] = val_auc
            
            print(f"    Validation AUC: {val_auc:.4f}")
        
        # Calculate ensemble weights based on validation performance
        total_score = sum(validation_scores.values())
        self.weights = {name: score/total_score for name, score in validation_scores.items()}
        
        print(f"\n  📊 Ensemble weights:")
        for name, weight in self.weights.items():
            print(f"    {name}: {weight:.3f}")
        
        return self
    
    def predict_proba(self, X):
        """Ensemble prediction with weighted voting"""
        X_selected = self.feature_selector.transform(X)
        
        predictions = np.zeros((X.shape[0], 2))
        
        for name, model in self.models.items():
            model_pred = model.predict_proba(X_selected)
            predictions += self.weights[name] * model_pred
        
        return predictions
    
    def predict(self, X):
        """Binary predictions"""
        proba = self.predict_proba(X)
        return (proba[:, 1] > 0.5).astype(int)

# Train the advanced ensemble
ensemble_model = AdvancedRecommenderEnsemble(random_state=RANDOM_STATE)
ensemble_model.fit(X_train_scaled, y_train, X_val_scaled, y_val, X.columns)

print("✅ Advanced ensemble model training completed")

🤖 ADVANCED MODEL DEVELOPMENT
🔧 Training advanced ensemble...
  ✓ Selected 28 most important features
  🏋️ Training logistic...
    Validation AUC: 0.7146
  🏋️ Training random_forest...
    Validation AUC: 0.7186
  🏋️ Training gradient_boosting...
    Validation AUC: 0.7161
  🏋️ Training lightgbm...
    Validation AUC: 0.7198
  🏋️ Training pytorch_nn...
    Epoch 0/30, Loss: 0.6051
    Epoch 10/30, Loss: 0.4915
    Epoch 20/30, Loss: 0.4874
    Validation AUC: 0.7205

  📊 Ensemble weights:
    logistic: 0.199
    random_forest: 0.200
    gradient_boosting: 0.200
    lightgbm: 0.201
    pytorch_nn: 0.201
✅ Advan
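
The validation AUCs above differ only in the third decimal place, so dividing each AUC by their sum gives weights that are almost uniform (0.199 to 0.201). One possible alternative, sketched here rather than taken from the notebook, is to weight each model by its margin over a baseline. The helper name `auc_skill_weights` and its `floor` parameter are hypothetical. With AUCs this close, any weighting scheme will stay near uniform unless `floor` is raised, and a high floor risks fitting validation noise.

```python
def auc_skill_weights(validation_auc, floor=0.5):
    """Weight each model by its AUC margin over `floor` (0.5 = random ranking)."""
    skill = {name: max(auc - floor, 0.0) for name, auc in validation_auc.items()}
    total = sum(skill.values())
    if total == 0:
        # No model beats the floor: fall back to uniform weights
        return {name: 1.0 / len(skill) for name in skill}
    return {name: s / total for name, s in skill.items()}


# Validation AUCs copied from the training log above
validation_auc = {
    'logistic': 0.7146,
    'random_forest': 0.7186,
    'gradient_boosting': 0.7161,
    'lightgbm': 0.7198,
    'pytorch_nn': 0.7205,
}

weights = auc_skill_weights(validation_auc)             # margin over chance
sharper = auc_skill_weights(validation_auc, floor=0.71) # hypothetical stricter floor

for name in validation_auc:
    print(f"{name}: {weights[name]:.3f} (floor=0.5), {sharper[name]:.3f} (floor=0.71)")
```

The returned dictionary has the same shape as `self.weights`, so it could be swapped into `AdvancedRecommenderEnsemble.fit` without touching `predict_proba`.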

In [10]:
# Advanced Evaluation Framework
print("📈 ADVANCED EVALUATION FRAMEWORK")
print("=" * 50)

def evaluate_recommendation_system(model, X_test, y_test, test_data, top_k=(3, 5, 10)):
    """Comprehensive evaluation for recommendation system.

    test_data must be row-aligned with X_test and contain 'user_id' and 'adopted'.
    """
    
    results = {}
    
    # Get prediction probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    
    # Basic classification metrics
    results['classification'] = {
        'auc_roc': roc_auc_score(y_test, y_pred_proba),
        'auc_pr': average_precision_score(y_test, y_pred_proba),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred)
    }
    
    # Recommendation-specific metrics
    results['recommendation'] = {}
    
    # Create user-item matrix for ranking evaluation
    test_df = test_data.copy()
    test_df['prediction_score'] = y_pred_proba
    
    users = test_df['user_id'].unique()
    
    precision_at_k = {}
    recall_at_k = {}
    ndcg_at_k = {}
    
    for k in top_k:
        precisions = []
        recalls = []
        ndcgs = []
        
        for user in users:
            user_data = test_df[test_df['user_id'] == user].copy()
            
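            # Users with fewer than k candidate rows are skipped entirely.
            # If no user has k candidates, every @k metric falls back to 0.
            if len(user_data) < k:
                continue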
                
            # Sort by prediction score (descending)
            user_data = user_data.sort_values('prediction_score', ascending=False)
            
            # Top-k recommendations
            top_k_items = user_data.head(k)
            
            # Calculate metrics
            relevant_items = user_data[user_data['adopted'] == 1]
            
            if len(relevant_items) > 0:
                # Precision@k
                precision_k = len(top_k_items[top_k_items['adopted'] == 1]) / k
                precisions.append(precision_k)
                
                # Recall@k
                recall_k = len(top_k_items[top_k_items['adopted'] == 1]) / len(relevant_items)
                recalls.append(recall_k)
                
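                # NDCG@k (computed on the top-k slice only, so the ideal
                # ranking ignores adopted items ranked below position k)
                y_true = top_k_items['adopted'].values.reshape(1, -1)
                y_score = top_k_items['prediction_score'].values.reshape(1, -1)
                
                # Only lists mixing adopted and non-adopted items are scored;
                # top-k lists that are all hits or all misses are dropped
                if len(np.unique(y_true)) > 1:
                    ndcg_k = ndcg_score(y_true, y_score, k=k)
                    ndcgs.append(ndcg_k)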
        
        precision_at_k[k] = np.mean(precisions) if precisions else 0
        recall_at_k[k] = np.mean(recalls) if recalls else 0
        ndcg_at_k[k] = np.mean(ndcgs) if ndcgs else 0
    
    results['recommendation'] = {
        'precision_at_k': precision_at_k,
        'recall_at_k': recall_at_k,
        'ndcg_at_k': ndcg_at_k
    }
    
    return results

# Prepare test data for evaluation
test_data = modeling_data[test_mask].copy()

# Evaluate the model
evaluation_results = evaluate_recommendation_system(
    ensemble_model, X_test_scaled, y_test, test_data
)

# Display results
print("🎯 EVALUATION RESULTS")
print("=" * 30)

print("\n📊 Classification Metrics:")
for metric, value in evaluation_results['classification'].items():
    print(f"  {metric.upper()}: {value:.4f}")

print("\n🎯 Recommendation Metrics:")
for k in [3, 5, 10]:
    print(f"\n  📋 Top-{k} Recommendations:")
    print(f"    Precision@{k}: {evaluation_results['recommendation']['precision_at_k'][k]:.4f}")
    print(f"    Recall@{k}: {evaluation_results['recommendation']['recall_at_k'][k]:.4f}")
    print(f"    NDCG@{k}: {evaluation_results['recommendation']['ndcg_at_k'][k]:.4f}")

# Check if we meet business targets
print("\n🎯 BUSINESS TARGET ASSESSMENT:")
precision_3 = evaluation_results['recommendation']['precision_at_k'][3]
recall_10 = evaluation_results['recommendation']['recall_at_k'][10]
ndcg_5 = evaluation_results['recommendation']['ndcg_at_k'][5]

print(f"  Precision@3: {precision_3:.1%} (Target: {TARGET_PRECISION_AT_3:.1%}) {'✅' if precision_3 >= TARGET_PRECISION_AT_3 else '❌'}")
print(f"  Recall@10: {recall_10:.1%} (Target: {TARGET_RECALL_AT_10:.1%}) {'✅' if recall_10 >= TARGET_RECALL_AT_10 else '❌'}")
print(f"  NDCG@5: {ndcg_5:.1%} (Target: {TARGET_NDCG_AT_5:.1%}) {'✅' if ndcg_5 >= TARGET_NDCG_AT_5 else '❌'}")

targets_met = sum([
    precision_3 >= TARGET_PRECISION_AT_3,
    recall_10 >= TARGET_RECALL_AT_10,
    ndcg_5 >= TARGET_NDCG_AT_5
])

print(f"\n📈 Overall Performance: {targets_met}/3 targets met")

if targets_met >= 2:
    print("🎉 Model performance is ACCEPTABLE for production deployment!")
    SHAP_READY = True
else:
    print("⚠️  Model needs further improvement before deployment")
    SHAP_READY = False

📈 ADVANCED EVALUATION FRAMEWORK
🎯 EVALUATION RESULTS

📊 Classification Metrics:
  AUC_ROC: 0.7220
  AUC_PR: 0.4328
  PRECISION: 0.4757
  RECALL: 0.3171
  F1: 0.3805

🎯 Recommendation Metrics:

  📋 Top-3 Recommendations:
    Precision@3: 0.3961
    Recall@3: 0.8832
    NDCG@3: 0.7616

  📋 Top-5 Recommendations:
    Precision@5: 0.3074
    Recall@5: 0.9730
    NDCG@5: 0.6756

  📋 Top-10 Recommendations:
    Precision@10: 0.0000
    Recall@10: 0.0000
    NDCG@10: 0.0000

🎯 BUSINESS TARGET ASSESSMENT:
  Precision@3: 39.6% (Target: 60.0%) ❌
  Recall@10: 0.0% (Target: 80.0%) ❌
  NDCG@5: 67.6% (Target: 70.0%) ❌

📈 Overall Performance: 0/3 targets met
⚠️  Model needs further improvement before deployment
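
All three @10 metrics are exactly 0.0000. That is what the evaluation loop would produce if no test user has ten or more candidate rows: the `len(user_data) < k` check skips every user and the empty lists fall back to 0. This is a likely explanation, not a confirmed one. The sketch below shows one way to compute per-user ranking metrics without skipping short lists, and with NDCG normalised against the user's full set of adopted items. The function name `ranking_metrics_at_k` and the toy data are hypothetical. Column names follow `test_df` above.

```python
import numpy as np
import pandas as pd


def ranking_metrics_at_k(df, k, user_col='user_id',
                         score_col='prediction_score', label_col='adopted'):
    """Mean Precision@k, Recall@k and NDCG@k over users with >= 1 adoption."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1 / log2(rank + 1)
    precisions, recalls, ndcgs = [], [], []

    for _, group in df.groupby(user_col):
        ranked = group.sort_values(score_col, ascending=False)
        labels = ranked[label_col].to_numpy()
        n_relevant = int(labels.sum())
        if n_relevant == 0:
            continue  # recall and NDCG are undefined without a relevant item

        top = labels[:k]  # may hold fewer than k items; the user is kept
        hits = int(top.sum())
        precisions.append(hits / k)        # denominator stays k by convention
        recalls.append(hits / n_relevant)

        dcg = float((top * discounts[:len(top)]).sum())
        ideal_dcg = float(discounts[:min(n_relevant, k)].sum())
        ndcgs.append(dcg / ideal_dcg)

    if not precisions:
        nan = float('nan')  # signal "no evaluable users" instead of 0
        return {'precision': nan, 'recall': nan, 'ndcg': nan}
    return {'precision': float(np.mean(precisions)),
            'recall': float(np.mean(recalls)),
            'ndcg': float(np.mean(ndcgs))}


# Toy data: u1 has three candidates, u2 has two, u3 has no adoption
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    'prediction_score': [0.9, 0.5, 0.1, 0.8, 0.2, 0.7],
    'adopted': [0, 1, 0, 1, 0, 0],
})

metrics_at_3 = ranking_metrics_at_k(toy, k=3)
metrics_at_10 = ranking_metrics_at_k(toy, k=10)
print(metrics_at_3)
print(metrics_at_10)
```

Returning NaN when no user can be evaluated keeps a missing measurement distinct from a genuine score of zero. Note that Precision@k has a ceiling below 1 for users with fewer than k candidates, which matters when comparing against a fixed target.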

In [11]:
# Generate Top-3 Product Recommendations
print("🎯 GENERATING TOP-3 PRODUCT RECOMMENDATIONS")
print("=" * 50)

def generate_user_recommendations(model, user_data, customers_df, products_df, top_k=3):
    """Generate top-k product recommendations for each user"""
    
    recommendations = {}
    users = user_data['user_id'].unique()
    
    print(f"Generating recommendations for {len(users):,} users...")
    
    for user in users[:100]:  # Limit to first 100 users for demo
        user_interactions = user_data[user_data['user_id'] == user].copy()
        
        if len(user_interactions) == 0:
            continue
            
        # Get prediction scores
        user_features = user_interactions[feature_cols]
        user_scaled = scaler.transform(user_features)
        scores = model.predict_proba(user_scaled)[:, 1]
        
        # Add scores to dataframe
        user_interactions['recommendation_score'] = scores
        
        # Sort by score and get top-k
        top_recommendations = user_interactions.sort_values(
            'recommendation_score', ascending=False
        ).head(top_k)
        
        # Format recommendations
        user_recs = []
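        for _, row in top_recommendations.iterrows():
            matches = products_df[products_df['product_id'] == row['product_id']]
            if matches.empty:
                continue  # product missing from the catalog; skip instead of raising
            product_info = matches.iloc[0]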
            
            rec = {
                'product_id': row['product_id'],
                'product_name': product_info.get('product_name', f"Product_{row['product_id']}"),
                'product_category': product_info.get('category', 'Unknown'),
                'recommendation_score': row['recommendation_score'],
                'predicted_adoption_probability': row['recommendation_score'],
                'actual_adopted': row['adopted']
            }
            user_recs.append(rec)
        
        recommendations[user] = user_recs
    
    return recommendations

# Generate recommendations
user_recommendations = generate_user_recommendations(
    ensemble_model, test_data, customers, products, top_k=3
)

# Display sample recommendations
print("\n📋 SAMPLE RECOMMENDATIONS")
print("=" * 30)

sample_users = list(user_recommendations.keys())[:5]
for user in sample_users:
    print(f"\n👤 User {user}:")
    for i, rec in enumerate(user_recommendations[user], 1):
        status = "✅ ADOPTED" if rec['actual_adopted'] else "❌ Not adopted"
        print(f"  {i}. {rec['product_name']} ({rec['product_category']})")
        print(f"     Score: {rec['recommendation_score']:.3f} | {status}")

# Calculate recommendation accuracy
print(f"\n📊 RECOMMENDATION ACCURACY")
print("=" * 30)

total_recs = 0
correct_recs = 0

for user, recs in user_recommendations.items():
    for rec in recs:
        total_recs += 1
        if rec['actual_adopted']:
            correct_recs += 1

recommendation_accuracy = correct_recs / total_recs if total_recs > 0 else 0
print(f"Overall recommendation accuracy: {recommendation_accuracy:.1%}")
print(f"Total recommendations generated: {total_recs:,}")
print(f"Successful recommendations: {correct_recs:,}")

🎯 GENERATING TOP-3 PRODUCT RECOMMENDATIONS
Generating recommendations for 17,795 users...

📋 SAMPLE RECOMMENDATIONS

👤 User 55cb2fca-2a06-4c20-b996-9d9dac80f871:
  1. Product_3f19d078-4239-49a9-a77b-c389f421670e (Mortgage)
     Score: 0.438 | ❌ Not adopted
  2. Product_69702bb0-34ca-4850-8d38-3d3c261d063e (SavingsAccount)
     Score: 0.437 | ✅ ADOPTED
  3. Product_ce07d04c-b6e7-449d-a1ab-f13267ac7cf1 (FXTransfer)
     Score: 0.437 | ❌ Not adopted

👤 User aec0d92b-484d-4f8e-8fbb-1c856995452d:
  1. Product_5c28ab1e-bb10-4348-b3ab-c8a4dfa4f0e2 (FXTransfer)
     Score: 0.445 | ❌ Not adopted
  2. Product_199c7c90-4c12-4f8e-9a1e-86b14c055774 (Insurance)
     Score: 0.152 | ❌ Not adopted

👤 User 053eb314-4483-45e3-bdc2-4982cac49fda:
  1. Product_5c28ab1e-bb10-4348-b3ab-c8a4dfa4f0e2 (FXTransfer)
     Score: 0.195 | ❌ Not adopted
  2. Product_69702bb0-34ca-4850-8d38-3d3c261d063e (SavingsAccount)
     Score: 0.186 | ❌ Not adopted

👤 User fcc0bc5e-7ebe-4246-bed8-281dee7ae828:
  1. Product_b8fb9bd
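
The loop in `generate_user_recommendations` calls `predict_proba` once per user and is capped at the first 100 users. If the whole test set is scored in a single batch call first, the top-k selection can be done for every user at once with a sort and a group-by. The sketch below shows only the selection step. The function name `top_k_per_user` and the toy frame `scored` are hypothetical, and the scores are made up for illustration.

```python
import pandas as pd


def top_k_per_user(scored, k=3, user_col='user_id',
                   score_col='recommendation_score'):
    """Return the k highest-scoring rows per user with a 1-based rank column."""
    ranked = scored.sort_values([user_col, score_col], ascending=[True, False])
    top = ranked.groupby(user_col, sort=False).head(k).copy()
    top['rank'] = top.groupby(user_col, sort=False).cumcount() + 1
    return top.reset_index(drop=True)


# Toy frame standing in for test_data with scores already attached
scored = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'a', 'b', 'b'],
    'product_id': ['p1', 'p2', 'p3', 'p4', 'p1', 'p2'],
    'recommendation_score': [0.10, 0.90, 0.50, 0.70, 0.30, 0.60],
})

top2 = top_k_per_user(scored, k=2)
print(top2)
```

Users with fewer than k candidates simply return fewer rows, which matches the two-item lists in the sample output above.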

In [12]:
# Save Advanced Model and Results (Simplified)
print("💾 SAVING ADVANCED MODEL AND RESULTS")
print("=" * 50)

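# Make sure the output directory exists before writing to it
import os
os.makedirs('model', exist_ok=True)

# Save the ensemble model
model_save_path = 'model/advanced_ensemble_recommender.joblib'
joblib.dump(ensemble_model, model_save_path)
print(f"✅ Ensemble model saved to: {model_save_path}")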

# Save the scaler
scaler_save_path = 'model/advanced_feature_scaler.joblib'
joblib.dump(scaler, scaler_save_path)
print(f"✅ Feature scaler saved to: {scaler_save_path}")

# Save evaluation results (simplified)
results_save_path = 'model/advanced_evaluation_results.json'
simple_results = {
    'auc_roc': float(evaluation_results['classification']['auc_roc']),
    'precision_at_3': float(evaluation_results['recommendation']['precision_at_k'][3]),
    'recall_at_10': float(evaluation_results['recommendation']['recall_at_k'][10]),
    'ndcg_at_5': float(evaluation_results['recommendation']['ndcg_at_k'][5]),
    'targets_met': int(targets_met)
}
with open(results_save_path, 'w') as f:
    json.dump(simple_results, f, indent=2)
print(f"✅ Evaluation results saved to: {results_save_path}")

# Save simple metadata
metadata_save_path = 'model/advanced_model_metadata.json'
advanced_metadata = {
    'model_type': 'Advanced Ensemble Recommender',
    'creation_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'models_used': list(ensemble_model.models.keys()),
    'features_selected': int(ensemble_model.feature_selector.get_support().sum()),
    'total_features': int(len(feature_cols)),
    'auc_roc': float(evaluation_results['classification']['auc_roc']),
    'precision_at_3': float(evaluation_results['recommendation']['precision_at_k'][3]),
    'targets_met': int(targets_met),
    'dataset_size': int(modeling_data.shape[0]),
    'adoption_rate': float(modeling_data['adopted'].mean())
}

with open(metadata_save_path, 'w') as f:
    json.dump(advanced_metadata, f, indent=2)
print(f"✅ Advanced model metadata saved to: {metadata_save_path}")

print(f"\n🎉 ADVANCED MODELING COMPLETED!")
print(f"📊 Final Performance Summary:")
print(f"  - Classification AUC: {evaluation_results['classification']['auc_roc']:.3f}")
print(f"  - Precision@3: {evaluation_results['recommendation']['precision_at_k'][3]:.1%}")
print(f"  - Recall@10: {evaluation_results['recommendation']['recall_at_k'][10]:.1%}")
print(f"  - NDCG@5: {evaluation_results['recommendation']['ndcg_at_k'][5]:.1%}")
print(f"  - Business targets met: {targets_met}/3")
print(f"  - Dataset size: {modeling_data.shape[0]:,} samples")
print(f"  - Features used: {len(feature_cols)} total, {ensemble_model.feature_selector.get_support().sum()} selected")

print(f"\n💡 RECOMMENDATIONS FOR IMPROVEMENT:")
print(f"  1. Increase dataset size (currently using 200k sample)")
print(f"  2. Add more feature engineering (interaction terms)")
print(f"  3. Hyperparameter tuning for ensemble models")
print(f"  4. Try different sampling strategies for imbalanced data")
print(f"  5. Consider matrix factorization techniques")
print(f"  6. Use the full dataset instead of sample")

💾 SAVING ADVANCED MODEL AND RESULTS
✅ Ensemble model saved to: model/advanced_ensemble_recommender.joblib
✅ Feature scaler saved to: model/advanced_feature_scaler.joblib
✅ Evaluation results saved to: model/advanced_evaluation_results.json
✅ Advanced model metadata saved to: model/advanced_model_metadata.json

🎉 ADVANCED MODELING COMPLETED!
📊 Final Performance Summary:
  - Classification AUC: 0.722
  - Precision@3: 39.6%
  - Recall@10: 0.0%
  - NDCG@5: 67.6%
  - Business targets met: 0/3
  - Dataset size: 200,000 samples
  - Features used: 118 total, 28 selected

💡 RECOMMENDATIONS FOR IMPROVEMENT:
  1. Increase dataset size (currently using 200k sample)
  2. Add more feature engineering (interaction terms)
  3. Hyperparameter tuning for ensemble models
  4. Try different sampling strategies for imbalanced data
  5. Consider matrix factorization techniques
  6. Use the full dataset instead of sample
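
Item 5 in the list above mentions matrix factorization. As a rough illustration of what that could look like, and not something the notebook implements, the sketch below factorises a small user-product adoption matrix with scikit-learn's `TruncatedSVD` and recommends the highest-scoring product each user has not yet adopted. The interaction data, the product labels and the choice of two latent components are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical adoption log: one row per (user, product) adoption
interactions = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u4'],
    'product_id': ['savings', 'fx', 'savings', 'mortgage', 'fx', 'insurance', 'savings'],
    'adopted':    [1, 1, 1, 1, 1, 1, 1],
})

# Map ids to row/column positions and build the sparse user-item matrix
user_codes, user_index = pd.factorize(interactions['user_id'])
item_codes, item_index = pd.factorize(interactions['product_id'])
matrix = csr_matrix(
    (interactions['adopted'].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
)

# Low-rank factorisation: matrix ~ user_factors @ item_factors
svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(matrix)   # shape (n_users, 2)
item_factors = svd.components_             # shape (2, n_items)
scores = user_factors @ item_factors

# Exclude products the user already holds, then pick the best remaining one
scores[matrix.toarray() > 0] = -np.inf
best_item = item_index[scores.argmax(axis=1)]

for user, item in zip(user_index, best_item):
    print(f"{user}: recommend {item}")
```

Scores from a factorisation like this could also be fed to the ensemble as an extra feature, which would be one way to add the collaborative-filtering signal described in the modeling strategy.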