<h2>🚀 06. Final Production Pipeline</h2>

<hr>

<h3>📋 Project Details</h3>
<ul>
    <li><b>Project:</b> FreshCart Customer Churn Prediction</li>
    <li><b>Goal:</b> End-to-End Data Processing & Model Training Pipeline</li>
</ul>

<hr>

<h3>🎯 Purpose</h3>
<p>
    This script consolidates all previous steps (Data Loading, Feature Engineering, and Modeling) into a single, reproducible pipeline. It simulates a production training run:
</p>

<ol>
    <li><b>Load Raw Data</b></li>
    <li><b>Apply Cutoff Strategy</b> (Prevent Leakage)</li>
    <li><b>Generate All Features</b> (RFM + Behavioral + Advanced)</li>
    <li><b>Train Final Model</b> (using Optimized Hyperparameters)</li>
    <li><b>Export Artifacts</b> (Model & Metadata) for Deployment</li>
</ol>

<hr>

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import joblib
import json
import sys
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, classification_report

In [2]:
# Add src to path to import custom modules
# Adjust the path if you are running this script from a different location
sys.path.append('../src') 
# If running from the root 'FreshCart-Churn-Prediction' folder, use: sys.path.append('src')
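<p>
    The relative path above depends on where the notebook is launched from. A minimal sketch of a location-independent alternative follows; it assumes a folder named <code>src</code> exists in the working directory or one of its parents. The helpers <code>find_src_dir</code> and <code>add_to_sys_path</code> are hypothetical names introduced for illustration, not part of the project.
</p>

```python
import sys
from pathlib import Path
from typing import Optional


def find_src_dir(start: Path, folder_name: str = "src") -> Optional[Path]:
    """Walk upward from `start` and return the first `folder_name` directory found."""
    for candidate in [start, *start.parents]:
        src_dir = candidate / folder_name
        if src_dir.is_dir():
            return src_dir
    return None


def add_to_sys_path(directory: Path) -> bool:
    """Prepend `directory` to sys.path once; return True if it was added."""
    entry = str(directory.resolve())
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


# Hypothetical usage: search upward from the current working directory.
src_dir = find_src_dir(Path.cwd())
if src_dir is not None:
    add_to_sys_path(src_dir)
```

<p>
    With this approach the same cell should work whether it is run from the project root or from a subfolder, as long as the assumed <code>src</code> folder is an ancestor-level sibling.
</p>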

In [3]:
from config import RAW_DATA_DIR, PROCESSED_DATA_DIR, MODEL_DIR, RANDOM_STATE
from data.data_loader import InstacartDataLoader

In [4]:
# Import Feature Engineering Modules
from features.rfm_features import RFMFeatureEngineer
from features.behavioral_features import BehavioralFeatureEngineer

print("✅ Environment Setup Complete")

✅ Environment Setup Complete


<h4><b>
📦 Step 1: Ingest Raw Data
</b></h4>

In [5]:
def load_data():
    print("⏳ Loading Raw Data...")
    loader = InstacartDataLoader(RAW_DATA_DIR)
    data = loader.load_all_data()

    orders_df = data['orders']
    products_df = data['products']
    order_products = pd.concat([
        data['order_products_prior'],
        data['order_products_train']
    ], ignore_index=True)

    print(f"✅ Data Loaded. Orders: {len(orders_df):,}, Products: {len(products_df):,}")
    return orders_df, products_df, order_products

<h4><b>
🛠️ Step 2: Feature Engineering Pipeline (The "Leakage-Free" Logic)
</b></h4>

In [6]:
def run_feature_pipeline(orders, order_products, products):
    """
    Executes the full feature engineering pipeline with strict time-based splitting.
    """
    print("\n⚙️ Starting Feature Pipeline...")
    
    # 1. SORT & SPLIT (Cutoff Strategy)
    print("   1. Applying Cutoff Strategy (Splitting History vs Future)...")
    orders_sorted = orders.sort_values(['user_id', 'order_number'])
    last_orders = orders_sorted.groupby('user_id').tail(1)  # Target
    orders_history = orders_sorted.drop(last_orders.index)  # Features History
    
    # Filter order_products for history only
    op_history = order_products[order_products['order_id'].isin(orders_history['order_id'])]
    
    # 2. GENERATE TARGETS
    print("   2. Generating Targets...")
    labels = last_orders[['user_id', 'days_since_prior_order']].copy()
    labels['is_churn'] = (labels['days_since_prior_order'] >= 30).astype(int)
    
    # 3. RFM FEATURES
    print("   3. Generating RFM Features...")
    rfm_eng = RFMFeatureEngineer()
    rfm_feats = rfm_eng.create_all_rfm_features(orders_history, op_history)
    
    # 4. BEHAVIORAL FEATURES
    print("   4. Generating Behavioral Features...")
    beh_eng = BehavioralFeatureEngineer()
    beh_feats = beh_eng.create_all_behavioral_features(orders_history, op_history, products)
    
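    # 5. ADVANCED FEATURES (Derived)
    print("   5. Deriving Advanced Metrics (Velocity, Acceleration)...")
    # Purchase Velocity: 1 / (Average Days Between Orders + 1); the +1 avoids division by zero.
    # Note: only velocity is derived in this step; no acceleration feature is computed here.
    rfm_feats['purchase_velocity'] = 1 / (rfm_feats['avg_days_between_orders'] + 1)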
    
    # 6. MERGE ALL
    print("   6. Merging Feature Sets...")
    final_df = labels[['user_id', 'is_churn']].merge(rfm_feats, on='user_id', how='left')
    final_df = final_df.merge(beh_feats, on='user_id', how='left')
    
    # Fill NaNs with 0
    final_df = final_df.fillna(0)
    
    return final_df
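<p>
    The cutoff logic in step 1 of the function can be checked on a toy order log. The sketch below reuses the same pandas calls on made-up data (the order IDs and day gaps are illustrative, not FreshCart records) and asserts the two properties that keep the split leakage-free: no target order appears in the history, and each user receives exactly one label.
</p>

```python
import pandas as pd

# Toy order log: two users, order_number increases with time (made-up values).
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 20, 21],
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [float("nan"), 7.0, 30.0, float("nan"), 4.0],
})

orders_sorted = orders.sort_values(["user_id", "order_number"])
last_orders = orders_sorted.groupby("user_id").tail(1)   # one target row per user
orders_history = orders_sorted.drop(last_orders.index)   # everything before the target

# Same labeling rule as the pipeline: a gap of 30+ days before the last order is churn.
labels = last_orders[["user_id", "days_since_prior_order"]].copy()
labels["is_churn"] = (labels["days_since_prior_order"] >= 30).astype(int)

# Leakage checks: target orders must be absent from the history,
# and every user must have exactly one label.
assert set(last_orders["order_id"]).isdisjoint(orders_history["order_id"])
assert labels["user_id"].is_unique

print(labels[["user_id", "is_churn"]])
```

<p>
    Features are then computed only from <code>orders_history</code>, so nothing about the final order (including its <code>days_since_prior_order</code>) can reach the model inputs.
</p>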


<h4><b>
🤖 Step 3: Final Model Training
</b></h4>

In [7]:
def train_model(final_dataset):
    # 1. Prepare X and y
    feature_cols = [c for c in final_dataset.columns if c not in ['user_id', 'is_churn']]
    
    X = final_dataset[feature_cols]
    y = final_dataset['is_churn']

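    # 2. Split for Final Validation
    # Ideally, retrain on the full data for production; the split is kept here to report validation metrics.
    # Note: the same hold-out set also drives early stopping below, so those metrics may be slightly optimistic.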
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )

    # 3. Load Best Hyperparameters
    try:
        with open(MODEL_DIR / 'best_params.json', 'r') as f:
            best_params = json.load(f)
        print("✅ Loaded Best Hyperparameters from previous step.")
    except FileNotFoundError:
        # Catch only the missing-file case; other errors (e.g. malformed JSON) should surface.
        print("⚠️ Best params file not found. Using default parameters.")
        best_params = {
            'objective': 'binary',
            'metric': 'auc',
            'boosting_type': 'gbdt',
            'learning_rate': 0.05,
            'n_estimators': 1000,
            'scale_pos_weight': 2.3  # Imbalance handling
        }

    # 4. Train LightGBM
    print("\n🚀 Training Final LightGBM Model...")
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_test, label=y_test, reference=dtrain)

    final_model = lgb.train(
        best_params,
        dtrain,
        valid_sets=[dvalid],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
    )

    print("✅ Model Training Complete.")
    return final_model, X_test, y_test, feature_cols
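<p>
    In <code>train_model</code>, the 20% hold-out set is used both for early stopping and for the reported metrics. One possible refinement, sketched below under the assumption that a second hold-out is affordable, is a stratified three-way split: a validation set for early stopping and an untouched test set for reporting. The function <code>three_way_split</code> and the synthetic data are hypothetical and only illustrate the idea.
</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, valid_size=0.2, test_size=0.2, random_state=42):
    """Stratified train/valid/test split.

    `valid` is meant for early stopping, `test` only for final reporting.
    """
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Rescale so that valid_size remains a fraction of the full dataset.
    relative_valid = valid_size / (1.0 - test_size)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=relative_valid,
        random_state=random_state, stratify=y_rest,
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# Synthetic stand-in for the feature matrix (not FreshCart data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.3).astype(int)

X_train, X_valid, X_test, y_train, y_valid, y_test = three_way_split(X, y)
print(len(X_train), len(X_valid), len(X_test))
```

<p>
    Under this scheme, <code>X_valid</code> would be passed to <code>valid_sets</code> for early stopping, while <code>X_test</code> would be reserved for <code>validate_model</code>.
</p>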


<h4><b>
📊 Step 4: Quick Validation & Sanity Check
</b></h4>

In [8]:
def validate_model(model, X_test, y_test):
    print("\n📊 Validating Model Performance...")
    y_pred_prob = model.predict(X_test)
    y_pred = (y_pred_prob >= 0.38).astype(int) # Using our optimized threshold

    auc = roc_auc_score(y_test, y_pred_prob)
    print(f"Final AUC Score: {auc:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
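<p>
    The threshold of 0.38 is hard-coded in <code>validate_model</code>, and this notebook does not show how it was chosen. One common approach, sketched below as an assumption rather than the method actually used, is to sweep candidate thresholds on validation predictions and keep the one with the highest F1 for the churn class. The helper names and the six-row example are illustrative only.
</p>

```python
import numpy as np


def f1_at_threshold(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for y_prob >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob, grid=None):
    """Return (threshold, f1) for the grid value with the highest F1."""
    if grid is None:
        grid = np.arange(0.05, 0.96, 0.01)
    scores = [f1_at_threshold(y_true, y_prob, t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]


# Tiny hand-made example (not model output).
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.35, 0.40, 0.60, 0.80, 0.55]
threshold, f1 = best_f1_threshold(y_true, y_prob, grid=[0.3, 0.38, 0.5, 0.7])
print(f"best threshold={threshold}, F1={f1:.3f}")
```

<p>
    If a sweep like this is used, it should run on validation data rather than the final test set, so the reported classification report stays an honest estimate.
</p>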


<h4><b>
💾 Step 5: Export Artifacts for Deployment
</b></h4>

In [9]:
def save_artifacts(model, feature_cols, final_dataset):
    print("\n📦 Exporting Production Artifacts...")

    # 1. Save Model
    model_path = MODEL_DIR / 'final_model_optimized.pkl'
    joblib.dump(model, model_path)
    print(f"  -> Model saved to: {model_path}")

    # 2. Save Feature List (Critical for API input validation)
    feature_path = PROCESSED_DATA_DIR / 'model_features.json'
    with open(feature_path, 'w') as f:
        json.dump(feature_cols, f)
    print(f"  -> Feature list saved to: {feature_path}")

    # 3. Save Dataset (Optional, for Dashboard EDA)
    data_path = PROCESSED_DATA_DIR / 'final_features_advanced.parquet'
    final_dataset.to_parquet(data_path)
    print(f"  -> Processed data saved to: {data_path}")

    print("\n🎉 PIPELINE FINISHED SUCCESSFULLY!")
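<p>
    The saved feature list is described as critical for API input validation. A minimal sketch of how a serving layer might use it follows. It assumes requests arrive as a dict of feature name to value; <code>load_feature_list</code> and <code>build_feature_row</code> are hypothetical helpers, and filling missing features with 0 simply mirrors the <code>fillna(0)</code> used during training.
</p>

```python
import json
import tempfile
from pathlib import Path


def load_feature_list(path):
    """Read the ordered list of training feature names from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)


def build_feature_row(payload, feature_cols, fill_value=0.0):
    """Order payload values like the training columns; reject unknown keys."""
    unknown = sorted(set(payload) - set(feature_cols))
    if unknown:
        raise ValueError(f"Unexpected features: {unknown}")
    # Missing features fall back to fill_value, matching fillna(0) at training time.
    return [float(payload.get(col, fill_value)) for col in feature_cols]


# Round-trip demo with a temporary file standing in for model_features.json.
with tempfile.TemporaryDirectory() as tmp:
    feature_path = Path(tmp) / "model_features.json"
    feature_path.write_text(
        json.dumps(["total_orders", "avg_basket_size", "purchase_velocity"])
    )
    feature_cols = load_feature_list(feature_path)

row = build_feature_row({"avg_basket_size": 9.5, "total_orders": 12}, feature_cols)
print(row)
```

<p>
    Keeping the column order identical to training matters because the model receives a positional feature matrix at prediction time.
</p>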


<h4><b>
MAIN EXECUTION
</b></h4>

In [10]:
if __name__ == "__main__":
    # 1. Load Data
    orders_df, products_df, order_products = load_data()
    
    # 2. Run Feature Pipeline
    final_dataset = run_feature_pipeline(orders_df, order_products, products_df)
    print(f"\n✅ Pipeline Complete. Dataset Shape: {final_dataset.shape}")
    
    # 3. Train Model
    final_model, X_test, y_test, feature_cols = train_model(final_dataset)
    
    # 4. Validate
    validate_model(final_model, X_test, y_test)
    
    # 5. Save
    save_artifacts(final_model, feature_cols, final_dataset)


INFO:data.data_loader:📦 Loading Instacart datasets...
INFO:data.data_loader:   Loading orders.csv...


⏳ Loading Raw Data...


INFO:data.data_loader:   ✅ Loaded orders: (3421083, 7)
INFO:data.data_loader:   Loading order_products__prior.csv...
INFO:data.data_loader:   ✅ Loaded order_products_prior: (32434489, 4)
INFO:data.data_loader:   Loading order_products__train.csv...
INFO:data.data_loader:   ✅ Loaded order_products_train: (1384617, 4)
INFO:data.data_loader:   Loading products.csv...
INFO:data.data_loader:   ✅ Loaded products: (49688, 4)
INFO:data.data_loader:   Loading aisles.csv...
INFO:data.data_loader:   ✅ Loaded aisles: (134, 2)
INFO:data.data_loader:   Loading departments.csv...
INFO:data.data_loader:   ✅ Loaded departments: (21, 2)
INFO:data.data_loader:✅ All datasets loaded successfully!

INFO:data.data_loader:DATA SUMMARY
INFO:data.data_loader:orders                   :  3,421,083 rows x   7 columns
INFO:data.data_loader:                           Memory: 358.81 MB
INFO:data.data_loader:order_products_prior     : 32,434,489 rows x   4 columns
INFO:data.data_loader:                  

✅ Data Loaded. Orders: 3,421,083, Products: 49,688

⚙️ Starting Feature Pipeline...
   1. Applying Cutoff Strategy (Splitting History vs Future)...


INFO:features.rfm_features:🔧 Creating RFM features...
INFO:features.rfm_features:   Creating recency features...


   2. Generating Targets...
   3. Generating RFM Features...


INFO:features.rfm_features:   Creating frequency features...
INFO:features.rfm_features:   Creating monetary features (using basket size as a proxy)...
INFO:features.rfm_features:✅ Created 14 RFM features
INFO:features.rfm_features:   Features: ['days_since_last_order', 'days_since_first_order', 'customer_age_days', 'avg_days_between_orders', 'total_orders', 'orders_per_day', 'order_regularity', 'std_days_between_orders', 'avg_basket_size', 'total_items_ordered', 'basket_size_std', 'basket_size_cv', 'avg_unique_products_per_order', 'total_unique_products_ordered']
INFO:features.behavioral_features:🧠 Creating behavioral features...
INFO:features.behavioral_features:   Creating time-based features...


   4. Generating Behavioral Features...


INFO:features.behavioral_features:   Creating reorder behavior features...
INFO:features.behavioral_features:   Creating diversity features...
INFO:features.behavioral_features:✅ Created 22 behavioral features


   5. Deriving Advanced Metrics (Velocity, Acceleration)...
   6. Merging Feature Sets...

✅ Pipeline Complete. Dataset Shape: (206209, 39)
✅ Loaded Best Hyperparameters from previous step.

🚀 Training Final LightGBM Model...
Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.756228
Did not meet early stopping. Best iteration is:
[100]	valid_0's auc: 0.756228
✅ Model Training Complete.

📊 Validating Model Performance...
Final AUC Score: 0.7562

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.51      0.65     28605
           1       0.43      0.85      0.58     12637

    accuracy                           0.61     41242
   macro avg       0.66      0.68      0.61     41242
weighted avg       0.75      0.61      0.63     41242


📦 Exporting Production Artifacts...
  -> Model saved to: d:\egitim_ve_calismalar\Lodos Makine Öğrenmesi Bootcamp 02.11.2025\html\FreshCart_E-Ticaret_P