<h2>🎯 FreshCart Churn Prediction - Baseline Model</h2>
 
<h4>
    <b>
        Zero2End Machine Learning Bootcamp - Final Project
    </b>
</h4>

<h4>
    📋 Notebook Contents
</h4>

<h5>
    <ol>
        <li>Data Preparation and Feature Creation</li>
        <li>Train-Test Split</li>
        <li>Baseline Model: Logistic Regression</li>
        <li>Baseline Model: Random Forest</li>
        <li>Model Evaluation</li>
        <li>Baseline Results and Next Steps</li>
    </ol>
</h5>

In [1]:
# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path
import sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)
import joblib

In [2]:
# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("Set2")
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [3]:
# Add src to path
sys.path.append('../src')

In [4]:
from config import RAW_DATA_DIR, PROCESSED_DATA_DIR, MODEL_DIR, RANDOM_STATE
from data.data_loader import InstacartDataLoader
from data.churn_labels import ChurnLabelCreator
from features.rfm_features import create_rfm_features_pipeline
from features.behavioral_features import create_behavioral_features_pipeline

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


<h4>
    1️⃣ Data Preparation and Feature Creation
</h4>

In [5]:
# Load data
print("📦 Loading Instacart data...")
loader = InstacartDataLoader(RAW_DATA_DIR)
data = loader.load_all_data()

orders_df = data['orders']
order_products = pd.concat([
    data['order_products_prior'],
    data['order_products_train']
], ignore_index=True)
products_df = data['products']

print("✅ Data loaded:")
print(f"   Orders: {len(orders_df):,}")
print(f"   Order-Products: {len(order_products):,}")
print(f"   Products: {len(products_df):,}")

INFO:data.data_loader:📦 Loading Instacart datasets...
INFO:data.data_loader:   Loading orders.csv...


📦 Loading Instacart data...


INFO:data.data_loader:   ✅ Loaded orders: (3421083, 7)
INFO:data.data_loader:   Loading order_products__prior.csv...
INFO:data.data_loader:   ✅ Loaded order_products_prior: (32434489, 4)
INFO:data.data_loader:   Loading order_products__train.csv...
INFO:data.data_loader:   ✅ Loaded order_products_train: (1384617, 4)
INFO:data.data_loader:   Loading products.csv...
INFO:data.data_loader:   ✅ Loaded products: (49688, 4)
INFO:data.data_loader:   Loading aisles.csv...
INFO:data.data_loader:   ✅ Loaded aisles: (134, 2)
INFO:data.data_loader:   Loading departments.csv...
INFO:data.data_loader:   ✅ Loaded departments: (21, 2)
INFO:data.data_loader:✅ All datasets loaded successfully!

INFO:data.data_loader:DATA SUMMARY
INFO:data.data_loader:orders                   :  3,421,083 rows x   7 columns
INFO:data.data_loader:                           Memory: 358.81 MB
INFO:data.data_loader:order_products_prior     : 32,434,489 rows x   4 columns
INFO:data.data_loader:                  

✅ Data loaded:
   Orders: 3,421,083
   Order-Products: 33,819,106
   Products: 49,688


In [6]:
# Create churn labels
print("\n🏷️  Creating churn labels...")
churn_creator = ChurnLabelCreator(
    churn_threshold_days=30,
    min_orders=3,
    observation_window_days=90
)

labels_df = churn_creator.create_churn_labels(orders_df)

print("\n✅ Churn labels created:")
print(f"   Total users: {len(labels_df):,}")
print(f"   Churned: {labels_df['is_churn'].sum():,} ({labels_df['is_churn'].mean():.2%})")
print(f"   Active: {(labels_df['is_churn']==0).sum():,} ({(labels_df['is_churn']==0).mean():.2%})")


INFO:data.churn_labels:🎯 Churn Definition:
INFO:data.churn_labels:   Threshold: 30 days
INFO:data.churn_labels:   Min Orders: 3
INFO:data.churn_labels:   Observation Window: 90 days
INFO:data.churn_labels:🏷️  Creating churn labels...
INFO:data.churn_labels:📊 Creating user order summary...


🏷️  Creating churn labels...


INFO:data.churn_labels:✅ User summary created: (206209, 9)
INFO:data.churn_labels:📅 Calculating recency for each user...
INFO:data.churn_labels:✅ Recency calculated for 206209 users
INFO:data.churn_labels:
INFO:data.churn_labels:CHURN LABEL STATISTICS
INFO:data.churn_labels:Total Users:              206,209
INFO:data.churn_labels:Eligible Users:           206,209 (min 3 orders)
INFO:data.churn_labels:Churned Users:            204,617
INFO:data.churn_labels:Active Users:               1,592
INFO:data.churn_labels:Churn Rate:                 99.23%


✅ Churn labels created:
   Total users: 206,209
   Churned: 204,617 (99.23%)
   Active: 1,592 (0.77%)
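
With 99.23% of users labelled as churned, plain accuracy says little about a baseline model: always predicting "churned" already scores about 99.2%. The sketch below works this out from the counts reported above. It is an illustration only; the variable names are hypothetical and not part of the project code.

```python
import numpy as np

# Counts copied from the churn label statistics above.
n_churned, n_active = 204_617, 1_592

# Ground truth: 1 = churned, 0 = active.
y_true = np.concatenate([
    np.ones(n_churned, dtype=int),
    np.zeros(n_active, dtype=int),
])

# Trivial "model" that predicts the majority class (churned) for every user.
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
recall_churned = (y_pred[y_true == 1] == 1).mean()
recall_active = (y_pred[y_true == 0] == 0).mean()
balanced_accuracy = (recall_churned + recall_active) / 2

print(f"Majority-class accuracy: {accuracy:.2%}")
print(f"Recall (churned):        {recall_churned:.2%}")
print(f"Recall (active):         {recall_active:.2%}")
print(f"Balanced accuracy:       {balanced_accuracy:.2%}")
```

Because of this, per-class recall, balanced accuracy, or ROC-AUC are likely to be more informative than accuracy for the baselines, and passing `stratify` to `train_test_split` helps keep the few active users represented in both the training and test sets.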


In [7]:
# Create RFM features
print("\n🔧 Creating RFM features...")
rfm_features = create_rfm_features_pipeline(orders_df, order_products)

print(f"‚úÖ RFM features created: {rfm_features.shape}")
print(f"   Features: {[col for col in rfm_features.columns if col != 'user_id']}")

INFO:features.rfm_features:🔧 Creating RFM features...
INFO:features.rfm_features:   Creating recency features...



🔧 Creating RFM features...


INFO:features.rfm_features:   Creating frequency features...
INFO:features.rfm_features:   Creating monetary features (using basket size as a proxy)...
INFO:features.rfm_features:✅ Created 14 RFM features
INFO:features.rfm_features:   Features: ['days_since_last_order', 'days_since_first_order', 'customer_age_days', 'avg_days_between_orders', 'total_orders', 'orders_per_day', 'order_regularity', 'std_days_between_orders', 'avg_basket_size', 'total_items_ordered', 'basket_size_std', 'basket_size_cv', 'avg_unique_products_per_order', 'total_unique_products_ordered']
INFO:features.rfm_features:📊 Calculating RFM scores...
INFO:features.rfm_features:✅ RFM scores calculated
INFO:features.rfm_features:
RFM Segment Distribution:


rfm_segment
At Risk      46469
Promising    66084
Loyal        59074
Champions    34582
Name: count, dtype: int64
✅ RFM features created: (206209, 20)
   Features: ['days_since_last_order', 'days_since_first_order', 'customer_age_days', 'avg_days_between_orders', 'total_orders', 'orders_per_day', 'order_regularity', 'std_days_between_orders', 'avg_basket_size', 'total_items_ordered', 'basket_size_std', 'basket_size_cv', 'avg_unique_products_per_order', 'total_unique_products_ordered', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_segment']
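
The `recency_score`, `frequency_score`, `monetary_score`, and `rfm_score` columns come from `create_rfm_features_pipeline`, whose implementation is not shown in this notebook. One common way to build such scores is quantile binning, sketched below on toy data. The helper `add_rfm_scores`, the choice of four bins, the input columns used for each score, and the summed `rfm_score` are all assumptions made for illustration, not the project's actual logic.

```python
import pandas as pd


def add_rfm_scores(df: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
    """Hypothetical helper: quantile-based R/F/M scores from 1 (worst) to n_bins (best)."""
    out = df.copy()
    labels = list(range(1, n_bins + 1))

    def quantile_score(series: pd.Series, higher_is_better: bool) -> pd.Series:
        # rank(method="first") breaks ties, so qcut always gets unique bin edges.
        ranks = series.rank(method="first", ascending=higher_is_better)
        return pd.qcut(ranks, q=n_bins, labels=labels).astype(int)

    # Fewer days since the last order is better, so recency is ranked in reverse.
    out["recency_score"] = quantile_score(out["days_since_last_order"], higher_is_better=False)
    out["frequency_score"] = quantile_score(out["total_orders"], higher_is_better=True)
    # Basket size stands in for monetary value, as in the log output above.
    out["monetary_score"] = quantile_score(out["avg_basket_size"], higher_is_better=True)
    out["rfm_score"] = out[["recency_score", "frequency_score", "monetary_score"]].sum(axis=1)
    return out


# Toy data for eight made-up users (not taken from the Instacart dataset).
toy = pd.DataFrame({
    "user_id": range(1, 9),
    "days_since_last_order": [1, 3, 7, 10, 15, 20, 25, 30],
    "total_orders": [40, 30, 25, 12, 10, 6, 4, 3],
    "avg_basket_size": [12.0, 9.5, 15.0, 8.0, 7.0, 5.0, 6.0, 3.0],
})

scored = add_rfm_scores(toy)
print(scored[["user_id", "recency_score", "frequency_score", "monetary_score", "rfm_score"]])
```

Ranking before binning is a design choice: it keeps the bins roughly equal in size even when many users share the same value, which is plausible for a column such as `days_since_last_order`.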
