<h2>üöÄ 06. Final Production Pipeline</h2>

<hr>

<h3>üìã Project Details</h3>
<ul>
    <li><b>Project:</b> FreshCart Customer Churn Prediction</li>
    <li><b>Goal:</b> End-to-End Data Processing & Model Training Pipeline</li>
</ul>

<hr>

<h3>üéØ Purpose</h3>
<p>
    This script consolidates all previous steps (Data Loading, Feature Engineering, and Modeling) into a single, reproducible pipeline. It simulates a production training run:
</p>

<ol>
    <li><b>Load Raw Data</b></li>
    <li><b>Apply Cutoff Strategy</b> (Prevent Leakage)</li>
    <li><b>Generate All Features</b> (RFM + Behavioral + Advanced)</li>
    <li><b>Train Final Model</b> (using Optimized Hyperparameters)</li>
    <li><b>Export Artifacts</b> (Model & Metadata) for Deployment</li>
</ol>
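
<p>
    The cutoff strategy in step 2 can be illustrated on toy data. The sketch below is a minimal example, not the pipeline code itself: the column names follow the Instacart <code>orders</code> table, the rows are made up, and the 30-day rule mirrors the churn label used in the feature pipeline. Each user's last order is held out to produce the label, and only the earlier orders remain available for feature generation.
</p>

```python
import pandas as pd

# Toy orders table; the values are made up for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [1, 1, 1, 2, 2, 2],
    "order_number": [1, 2, 3, 1, 2, 3],
    "days_since_prior_order": [None, 7.0, 30.0, None, 5.0, 6.0],
})

orders_sorted = orders.sort_values(["user_id", "order_number"])

# The last order of each user is the "future": it only produces the label.
last_orders = orders_sorted.groupby("user_id").tail(1)

# Everything before it is the "history": the only data features may see.
history = orders_sorted.drop(last_orders.index)

labels = last_orders[["user_id"]].copy()
labels["is_churn"] = (last_orders["days_since_prior_order"] >= 30).astype(int)

print(labels.to_dict("records"))
print(sorted(history["order_id"]))
```

<p>
    Because the held-out order never enters <code>history</code>, no feature computed from it can carry information about the label.
</p>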

<hr>

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import joblib
import json
import sys
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, classification_report
from scipy import stats

In [2]:
# Add src to path to import custom modules
# Adjust the path if you are running this script from a different location
sys.path.append('../src') 
# If running from the root 'FreshCart-Churn-Prediction' folder, use: sys.path.append('src')

In [3]:
from config import RAW_DATA_DIR, PROCESSED_DATA_DIR, MODEL_DIR, RANDOM_STATE
from data.data_loader import InstacartDataLoader

In [4]:
# Import Feature Engineering Modules
from features.rfm_features import RFMFeatureEngineer
from features.behavioral_features import BehavioralFeatureEngineer

print("✅ Environment Setup Complete")

✅ Environment Setup Complete


<h4><b>
📦 Step 1: Ingest Raw Data
</b></h4>

In [5]:
def load_raw_data():
    """Load the raw Instacart datasets and concatenate prior/train order products."""
    # Relies on InstacartDataLoader and RAW_DATA_DIR imported in the cells above.
    print("⏳ Loading Raw Data...")

    # Initialize the data loader using the path from config.py
    loader = InstacartDataLoader(RAW_DATA_DIR)

    # Load all datasets into a dictionary
    data = loader.load_all_data()

    # Separate main dataframes
    orders_df = data['orders']
    products_df = data['products']

    # Concatenate prior and train order products into a single dataframe
    order_products = pd.concat([
        data['order_products_prior'],
        data['order_products_train']
    ], ignore_index=True)

    print(f"✅ Data Loaded. Orders: {len(orders_df):,}, Products: {len(products_df):,}")

    # Return all necessary dataframes for the pipeline
    return orders_df, products_df, order_products

<h4><b>
🛠️ Step 2: Feature Engineering Pipeline (The "Leakage-Free" Logic)
</b></h4>

In [6]:
def calculate_trend(series):
    """
    Return the linear-regression slope of a series against its order index.
    A series with fewer than 2 data points has no defined slope, so it returns 0.
    """
    if len(series) < 2:
        return 0
    try:
        # x is time (order sequence), y is value (basket size etc.)
        slope, _, _, _, _ = stats.linregress(np.arange(len(series)), series.values)
        return slope
    except Exception:
        # Treat any failed fit as "no trend" without swallowing KeyboardInterrupt
        return 0

def run_feature_pipeline(orders_df, order_products, products_df):
    """
    Executes the full feature engineering pipeline (MATCHING 03_FEATURE_ENGINEERING LOGIC).
    """
    print("\n⚙️ Starting Feature Pipeline...")
    
    # 1. SORT & SPLIT (Cutoff Strategy)
    print("   1. Applying Cutoff Strategy (Splitting History vs Future)...")
    orders_sorted = orders_df.sort_values(['user_id', 'order_number'])
    last_orders = orders_sorted.groupby('user_id').tail(1)  # Target
    orders_history = orders_sorted.drop(last_orders.index)  # Features History
    
    # Filter order_products for history only
    op_history = order_products[order_products['order_id'].isin(orders_history['order_id'])]
    
    # 2. GENERATE TARGETS
    print("   2. Generating Targets...")
    labels = last_orders[['user_id', 'days_since_prior_order']].copy()
    labels['is_churn'] = (labels['days_since_prior_order'] >= 30).astype(int)
    
    # 3. RFM FEATURES
    print("   3. Creating RFM features (Raw Metrics Only)...")
    rfm_eng = RFMFeatureEngineer()
    rfm_feats = rfm_eng.create_all_rfm_features(orders_history, op_history)
        
    # Custom Risk & Value Metrics
    rfm_feats['clv_proxy'] = rfm_feats['total_orders'] * rfm_feats['avg_basket_size']
    rfm_feats['engagement_score'] = rfm_feats['orders_per_day'] * rfm_feats['total_items_ordered']
    rfm_feats['at_risk_score'] = rfm_feats['days_since_last_order'] / (rfm_feats['avg_days_between_orders'] + 1)
    
    # 4. BEHAVIORAL FEATURES
    print("   4. Generating Behavioral Features...")
    beh_eng = BehavioralFeatureEngineer()
    beh_feats = beh_eng.create_all_behavioral_features(orders_history, op_history, products_df)
    
    # 5. TIME SERIES / TREND FEATURES (Was missing in previous version!)
    print("   5. Deriving Time-Series Trends (This takes time)...")
    
    # Group data by user
    # Only taking necessary columns for calculation
    user_trends = orders_history.groupby('user_id').agg({
        'days_since_prior_order': list  # List of days between orders for each user
    }).reset_index()
    
    # order_products must merge with orders for basket size trend
    order_sizes = op_history.groupby('order_id').size().reset_index(name='basket_size')
    order_sizes = order_sizes.merge(orders_history[['order_id', 'user_id']], on='order_id')
    basket_trends = order_sizes.groupby('user_id').agg({'basket_size': list}).reset_index()
    
    # Apply Trend Functions
    user_trends['order_frequency_trend'] = user_trends['days_since_prior_order'].apply(lambda x: calculate_trend(pd.Series(x).dropna()))
    basket_trends['basket_size_trend'] = basket_trends['basket_size'].apply(lambda x: calculate_trend(pd.Series(x)))
    
    # 6. RATIO FEATURES
    print("   6. Generating Velocity & Acceleration Metrics...")
    # Velocity
    rfm_feats['purchase_velocity'] = 1 / (rfm_feats['avg_days_between_orders'] + 1)
    
    # Acceleration (Last order / Average days between orders)
    # > 1 means slowing down (Churn risk), < 1 means speeding up
    rfm_feats['recency_acceleration'] = rfm_feats['days_since_last_order'] / (rfm_feats['avg_days_between_orders'] + 0.01)
    
    # Interaction
    rfm_feats['recency_x_frequency'] = rfm_feats['days_since_last_order'] * rfm_feats['total_orders']

    # 7. MERGE ALL
    print("   7. Merging Feature Sets...")
    final_df = labels[['user_id', 'is_churn']].merge(rfm_feats, on='user_id', how='left')
    final_df = final_df.merge(beh_feats, on='user_id', how='left')
    final_df = final_df.merge(user_trends[['user_id', 'order_frequency_trend']], on='user_id', how='left')
    final_df = final_df.merge(basket_trends[['user_id', 'basket_size_trend']], on='user_id', how='left')
    
    # Fill NaNs with 0
    final_df = final_df.fillna(0)
    
    return final_df
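
<p>
    To make the trend features concrete, here is a small sketch of what the slope measures. It assumes a toy basket-size history and uses <code>np.polyfit</code> as a stand-in for <code>stats.linregress</code>; for a straight-line fit both give the same least-squares slope. The helper name <code>slope</code> is hypothetical. A positive value means baskets grow from one order to the next, a negative value means they shrink.
</p>

```python
import numpy as np

def slope(values):
    """Least-squares slope of values against their order index (0, 1, 2, ...)."""
    values = np.asarray(values, dtype=float)
    if len(values) < 2:
        return 0.0  # same convention as the pipeline: no trend for short histories
    x = np.arange(len(values))
    return float(np.polyfit(x, values, deg=1)[0])

growing = slope([2, 4, 6, 8])      # basket grows by 2 items per order
shrinking = slope([10, 7, 4, 1])   # basket shrinks by 3 items per order
single = slope([5])                # too short to fit a line

print(growing, shrinking, single)
```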

In [7]:
# Helper function to get bins from the TRAINING data
def get_qcut_bins(data_series, q):
    """Calculate quantile bin edges ONLY from the training data series."""
    # retbins=True returns (binned_series, bin_edges); keep only the edges
    return pd.qcut(data_series, q=q, retbins=True, duplicates='drop')[1]

# Helper function to apply the bins to the TEST data
def apply_qcut_bins(test_series, bins, labels):
    """Apply bin edges pre-calculated on the training data to the test series."""
    # pd.cut applies the explicit edges computed on the train set, so no test
    # statistics enter the binning; that is what prevents leakage.
    # right=True keeps intervals as (a, b], matching the edges pd.qcut produced.
    return pd.cut(
        test_series, 
        bins=bins, 
        labels=labels, 
        include_lowest=True, 
        right=True  # Same interval convention as pd.qcut
    ).astype(float).fillna(0).astype(int) # Fillna(0) for robustness if test data falls outside train bins
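
<p>
    The fit-on-train, apply-to-test pattern behind these helpers can be checked on toy data. This is a sketch with made-up values: it keeps the default right-closed intervals of <code>pd.cut</code> so they line up with the edges that <code>pd.qcut</code> returns, and it uses <code>labels=False</code> to get integer bin codes. Values beyond the training range come back as <code>NaN</code>, which is the case the <code>fillna(0)</code> above guards against.
</p>

```python
import pandas as pd

train = pd.Series(range(1, 101))          # toy training feature: 1..100
test = pd.Series([1, 30, 60, 100, 150])   # 150 lies outside the training range

# Learn quartile edges from the training data only.
_, edges = pd.qcut(train, q=4, retbins=True, duplicates="drop")

# Reuse those edges on the test data; no test statistics are involved.
codes = pd.cut(test, bins=edges, labels=False, include_lowest=True)

print(edges.tolist())
print(codes.tolist())
```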

<h4><b>
🤖 Step 3: Final Model Training
</b></h4>

In [8]:
def train_model(final_dataset):
    print("\n🚀 Preparing Data for Training...")

    # 1. Prepare X and y
    # Remove 'user_id' and 'is_churn' from features
    feature_cols = [c for c in final_dataset.columns if c not in ['user_id', 'is_churn']]
    
    X = final_dataset[feature_cols].copy()
    y = final_dataset['is_churn']

    # 2. Split for Training/Validation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )
    print(f"✅ Data split complete. X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")

    # 3. Load Best Hyperparameters
    try:
        with open(MODEL_DIR / 'best_params.json', 'r') as f:
            best_params = json.load(f)
        print("✅ Loaded Best Hyperparameters from previous step.")
    except (OSError, json.JSONDecodeError) as e:
        # Covers a missing/unreadable file as well as malformed JSON
        print(f"⚠️ Could not load best params file. Using default parameters. Error: {e}")
        best_params = {
            'objective': 'binary',
            'metric': 'auc',
            'boosting_type': 'gbdt',
            'learning_rate': 0.05,
            'n_estimators': 1000,
            'scale_pos_weight': 2.3
        }

    # 4. Train LightGBM
    print("\n🚀 Training Final LightGBM Model...")
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_test, label=y_test, reference=dtrain)

    final_model = lgb.train(
        best_params,
        dtrain,
        valid_sets=[dvalid],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
    )
    
    # --- OPTIMAL THRESHOLD CALCULATION ---
    print("\n⚖️ Calculating Optimal Threshold...")
    from sklearn.metrics import precision_recall_curve
    
    y_pred_proba = final_model.predict(X_test)
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
    # precisions/recalls have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point so indices line up, and treat 0/0 as F1 = 0.
    denom = precisions[:-1] + recalls[:-1]
    f1_scores = np.divide(
        2 * precisions[:-1] * recalls[:-1], denom,
        out=np.zeros_like(denom), where=denom > 0
    )
    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx]
    
    print(f"🏆 Best Threshold Found: {best_threshold:.4f}")
    
    # Save threshold immediately to ensure it matches the model
    threshold_path = MODEL_DIR / 'optimal_threshold.json'
    with open(threshold_path, 'w') as f:
        json.dump({'threshold': float(best_threshold)}, f)
    print(f"💾 Threshold saved to: {threshold_path}")

    print("✅ Model Training Complete.")
    
    # Return the feature_cols list as-is (no scoring applied)
    return final_model, X_test, y_test, feature_cols, best_threshold
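
<p>
    The threshold search can be pictured with a plain NumPy sweep. The sketch below illustrates the idea on made-up scores and is not the code this notebook runs: instead of <code>precision_recall_curve</code> it tries every distinct score as a cutoff and keeps the one with the highest F1. The helper name <code>best_f1_threshold</code> is hypothetical.
</p>

```python
import numpy as np

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1 over all distinct scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    best_thr, best_f1 = 0.5, -1.0
    for thr in np.unique(y_score):
        y_pred = y_score >= thr
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when nothing is positive
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = float(thr), float(f1)
    return best_thr, best_f1

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
thr, f1 = best_f1_threshold(y_true, y_score)
print(thr, round(f1, 3))
```

<p>
    A threshold chosen on the same rows it is later scored on gives an optimistic F1; tuning it on a separate validation split would give a cleaner estimate.
</p>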

<h4><b>
📊 Step 4: Quick Validation & Sanity Check
</b></h4>

In [9]:
def validate_model(model, X_test, y_test, threshold):
    print("\n📊 Validating Model Performance...")
    
    y_pred_prob = model.predict(X_test)
    
    # Convert probabilities to labels using the tuned threshold
    y_pred = (y_pred_prob >= threshold).astype(int)

    auc = roc_auc_score(y_test, y_pred_prob)
    f1 = f1_score(y_test, y_pred)
    
    print(f"Final AUC Score: {auc:.4f}")
    print(f"Final F1 Score : {f1:.4f} (at threshold {threshold:.4f})")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
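
<p>
    The two headline metrics behave differently: AUC is computed from the raw probabilities and does not depend on the threshold, while F1 needs hard labels and therefore moves with it. The sketch below shows this on made-up scores. The helper <code>pairwise_auc</code> is hypothetical and computes AUC as the share of (positive, negative) pairs that the scores rank correctly, which is a standard interpretation of the metric rather than the routine <code>roc_auc_score</code> uses internally.
</p>

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)

# Hard labels, and therefore F1, change with the cutoff; the ranking does not.
preds_low = (np.asarray(y_score) >= 0.35).astype(int).tolist()
preds_high = (np.asarray(y_score) >= 0.5).astype(int).tolist()
print(auc, preds_low, preds_high)
```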

<h4><b>
💾 Step 5: Export Artifacts for Deployment
</b></h4>

In [10]:
def save_artifacts(model, feature_cols, final_dataset):
    print("\n📦 Exporting Production Artifacts...")

    # 1. Save Model
    model_path = MODEL_DIR / 'final_model_optimized.pkl'
    joblib.dump(model, model_path)
    print(f"  -> Model saved to: {model_path}")

    # 2. Save Feature List (Critical for API input validation)
    feature_path = PROCESSED_DATA_DIR / 'model_features.json'
    with open(feature_path, 'w') as f:
        json.dump(feature_cols, f)
    print(f"  -> Feature list saved to: {feature_path}")

    # 3. Save Dataset (Optional, for Dashboard EDA)
    data_path = PROCESSED_DATA_DIR / 'final_features_advanced.parquet'
    final_dataset.to_parquet(data_path)
    print(f"  -> Processed data saved to: {data_path}")

    print("\n🎉 PIPELINE FINISHED SUCCESSFULLY!")
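
<p>
    The saved feature list is what lets a serving layer check its inputs. The following is a minimal sketch under stated assumptions: the helper <code>align_features</code> is hypothetical, the two feature names are only a sample, and a temporary directory stands in for <code>PROCESSED_DATA_DIR</code>. It reloads the JSON list and returns a record's values in training column order, rejecting records with missing features.
</p>

```python
import json
import tempfile
from pathlib import Path

def align_features(record, feature_cols):
    """Return the record's values in training column order; fail on gaps."""
    missing = [c for c in feature_cols if c not in record]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Extra keys in the record are ignored; only known features are kept.
    return [record[c] for c in feature_cols]

with tempfile.TemporaryDirectory() as tmp:
    feature_path = Path(tmp) / "model_features.json"
    # Write a JSON list of column names, then read it back.
    feature_path.write_text(json.dumps(["total_orders", "avg_basket_size"]))
    feature_cols = json.loads(feature_path.read_text())

row = align_features(
    {"avg_basket_size": 9.5, "total_orders": 12, "extra": 1}, feature_cols
)
print(row)
```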


<h4><b>
MAIN EXECUTION
</b></h4>

In [11]:
if __name__ == "__main__":
    # 1. Load Data
    data_loader = InstacartDataLoader(RAW_DATA_DIR) 
    data_dict = data_loader.load_all_data()
    print("✅ Raw Data Loaded.")
    
    orders_df = data_dict['orders']
    products_df = data_dict['products']

    order_products = pd.concat([
        data_dict['order_products_prior'],
        data_dict['order_products_train']
    ])
    
    # 2. Run Feature Pipeline
    final_dataset = run_feature_pipeline(orders_df, order_products, products_df)
    print(f"\n✅ Pipeline Complete. Dataset Shape: {final_dataset.shape}")
    
    # 3. Train Model (also returns the tuned threshold)
    final_model, X_test, y_test, feature_cols, best_threshold = train_model(final_dataset)
    
    # 4. Validate (using the tuned threshold)
    validate_model(final_model, X_test, y_test, best_threshold)
    
    # 5. Save
    save_artifacts(final_model, feature_cols, final_dataset)

INFO:data.data_loader:üì¶ Loading Instacart datasets...
INFO:data.data_loader:   Loading orders.csv...
INFO:data.data_loader:   ‚úÖ Loaded orders: (3421083, 7)
INFO:data.data_loader:   Loading order_products__prior.csv...
INFO:data.data_loader:   ‚úÖ Loaded order_products_prior: (32434489, 4)
INFO:data.data_loader:   Loading order_products__train.csv...
INFO:data.data_loader:   ‚úÖ Loaded order_products_train: (1384617, 4)
INFO:data.data_loader:   Loading products.csv...
INFO:data.data_loader:   ‚úÖ Loaded products: (49688, 4)
INFO:data.data_loader:   Loading aisles.csv...
INFO:data.data_loader:   ‚úÖ Loaded aisles: (134, 2)
INFO:data.data_loader:   Loading departments.csv...
INFO:data.data_loader:   ‚úÖ Loaded departments: (21, 2)
INFO:data.data_loader:‚úÖ All datasets loaded successfully!

INFO:data.data_loader:DATA SUMMARY
INFO:data.data_loader:orders                   :  3,421,083 rows x   7 columns
INFO:data.data_loader:                           Memory: 358.81 MB
INFO:data.data_

✅ Raw Data Loaded.

⚙️ Starting Feature Pipeline...
   1. Applying Cutoff Strategy (Splitting History vs Future)...


INFO:features.rfm_features:🔧 Creating RFM features...
INFO:features.rfm_features:   Creating recency features...


   2. Generating Targets...
   3. Creating RFM features (Raw Metrics Only)...


INFO:features.rfm_features:   Creating frequency features...
INFO:features.rfm_features:   Creating monetary features (using basket size as a proxy)...
INFO:features.rfm_features:✅ Created 14 RFM features
INFO:features.rfm_features:   Features: ['days_since_last_order', 'days_since_first_order', 'customer_age_days', 'avg_days_between_orders', 'total_orders', 'orders_per_day', 'order_regularity', 'std_days_between_orders', 'avg_basket_size', 'total_items_ordered', 'basket_size_std', 'basket_size_cv', 'avg_unique_products_per_order', 'total_unique_products_ordered']
INFO:features.behavioral_features:üß† Creating behavioral features...
INFO:features.behavioral_features:   Creating time-based features...


   4. Generating Behavioral Features...


  weekend_orders = orders_df.groupby('user_id').apply(
  night_orders = orders_df.groupby('user_id').apply(
  morning_orders = orders_df.groupby('user_id').apply(
  afternoon_orders = orders_df.groupby('user_id').apply(
INFO:features.behavioral_features:   Creating reorder behavior features...
INFO:features.behavioral_features:   Creating diversity features...
  exploration_df = order_products_full.groupby('user_id').apply(calculate_exploration).reset_index(name='exploration_rate')
INFO:features.behavioral_features:✅ Created 22 behavioral features


   5. Deriving Time-Series Trends (This takes time)...
   6. Generating Velocity & Acceleration Metrics...
   7. Merging Feature Sets...

✅ Pipeline Complete. Dataset Shape: (206209, 46)

🚀 Preparing Data for Training...
✅ Data split complete. X_train shape: (164967, 44), X_test shape: (41242, 44)
✅ Loaded Best Hyperparameters from previous step.

🚀 Training Final LightGBM Model...
Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.764437
Did not meet early stopping. Best iteration is:
[100]	valid_0's auc: 0.764437

⚖️ Calculating Optimal Threshold...
🏆 Best Threshold Found: 0.4272
💾 Threshold saved to: d:\egitim_ve_calismalar\Lodos Makine Öğrenmesi Bootcamp 02.11.2025\html\FreshCart-Churn-Prediction\notebooks\..\models\optimal_threshold.json
✅ Model Training Complete.

📊 Validating Model Performance...
Final AUC Score: 0.7644
Final F1 Score : 0.5902 (at threshold 0.4272)

Classification Report:
              precision    re