Predictive Models:

1 - Which current customers will become Power Buyers?

2 - Which customers are likely to churn in the next period?

In [2]:
%pip install scikit-learn


In [13]:
# ── Setup ───────────────────────────────────────
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

core = pd.read_csv(r'C:\Users\luket\projects\traditonal\ecom_core_labeled.csv')

In [8]:
print(core.columns)

Index(['user_id', 'target_event', 'target_customer_value', 'target_revenue',
       'target_actual_profit', 'view_count_mean', 'cart_count_mean',
       'purchase_count_mean', 'time_to_view_mean', 'time_to_cart_mean',
       'time_to_purchase_mean', 'view_revenue_mean', 'cart_revenue_mean',
       'purchase_revenue_mean', 'session_number_mean',
       'inter_session_time_mean', 'session_recency_mean',
       'purchase_number_mean', 'inter_purchase_time_mean',
       'purchase_recency_mean', 'session_count_ratio', 'click_count_ratio',
       'transaction_count_ratio', 'purchase_count_month_lag0',
       'purchase_count_month_ma3', 'is_direct_buyer', 'is_power_buyer',
       'value_segment', 'is_loss_making', 'is_low_engage', 'is_churn'],
      dtype='object')
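
The modelling cells in this notebook drop every column whose name contains one of a handful of substring tokens (the `leak()` helper). As a rough sanity check, the sketch below applies those same tokens to the column names printed above, to show which features actually reach the model. `COLUMNS`, `LEAK_TOKENS` and `surviving_features` are illustrative names that are not part of the notebook, and the column list is copied by hand from the output rather than read from the CSV.

```python
# Column names copied by hand from the print(core.columns) output (assumption:
# the CSV has exactly these columns).
COLUMNS = [
    'user_id', 'target_event', 'target_customer_value', 'target_revenue',
    'target_actual_profit', 'view_count_mean', 'cart_count_mean',
    'purchase_count_mean', 'time_to_view_mean', 'time_to_cart_mean',
    'time_to_purchase_mean', 'view_revenue_mean', 'cart_revenue_mean',
    'purchase_revenue_mean', 'session_number_mean',
    'inter_session_time_mean', 'session_recency_mean',
    'purchase_number_mean', 'inter_purchase_time_mean',
    'purchase_recency_mean', 'session_count_ratio', 'click_count_ratio',
    'transaction_count_ratio', 'purchase_count_month_lag0',
    'purchase_count_month_ma3', 'is_direct_buyer', 'is_power_buyer',
    'value_segment', 'is_loss_making', 'is_low_engage', 'is_churn',
]

# Same substring tokens as the leak() helper in the modelling cells.
LEAK_TOKENS = (
    'purchase_', 'revenue', 'transaction_',
    'cart_count_mean', 'purchase_count_mean',
    'target_', 'is_', '_cohort', 'value_segment',
)


def surviving_features(columns, tokens=LEAK_TOKENS, id_col='user_id'):
    """Return the columns that are neither the identifier nor flagged by a token."""
    return [
        c for c in columns
        if c != id_col and not any(tok in c for tok in tokens)
    ]


kept = surviving_features(COLUMNS)
print(len(kept), "features survive the filter:")
for name in kept:
    print("  ", name)
```

Under these assumptions only eight engagement features remain. Because matching is by substring, `time_to_purchase_mean` and `inter_purchase_time_mean` are removed as well, since both contain `purchase_`.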


In [None]:
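# Power Buyer Prediction Model

# LABEL
y = core['is_power_buyer'].astype(int)

# FEATURE SELECTION
def leak(col):
    """Return True if the column name contains a token that could leak the label."""
    leaks = (
        'purchase_', 'revenue', 'transaction_',    # purchase and revenue aggregates
        'cart_count_mean', 'purchase_count_mean',  # cart and purchase counts
        'target_',                                 # target_* outcome columns
        'is_', '_cohort', 'value_segment'          # label flags and segment columns
    )
    return any(tok in col for tok in leaks)

# Drop the identifier plus every column flagged by leak()
X = core.drop(columns=['user_id'] + [c for c in core.columns if leak(c)])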

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

num_cols = X_tr.select_dtypes(include=['number', 'bool']).columns.tolist()
cat_cols = X_tr.select_dtypes(include=['object', 'category']).columns.tolist()

pre = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
      ])

pipe = Pipeline([
        ('prep', pre),
        ('gb', GradientBoostingClassifier(
                learning_rate=0.05,
                n_estimators=300,
                max_depth=3,
                random_state=42))
      ])

# TRAIN & EVAL
pipe.fit(X_tr, y_tr)
proba_te = pipe.predict_proba(X_te)[:, 1]  # probability of the positive class

roc = roc_auc_score(y_te, proba_te)
print("ROC-AUC:", round(roc, 3))

# Precision among the 5 % of test users with the highest predicted probability
top_k = int(0.05 * len(proba_te))
idx   = np.argsort(proba_te)[::-1][:top_k]
precision_top5 = y_te.iloc[idx].mean()
print(f"Precision @ top-5 %: {precision_top5:.2%}")



ROC-AUC: 0.665
Precision @ top-5 %: 83.87%


### Power-Buyer Prediction Model

| Metric | Result | What It Means |
|--------|--------|---------------|
| **ROC-AUC** | **0.665** | The model has moderate ability to rank users by their likelihood of becoming Power Buyers (random = 0.50, perfect = 1.00). |
| **Precision @ Top 5 %** | **83.9 %** | If we target only the highest-scored 5 % of customers in the held-out test set, **~4 out of 5** of them are Power Buyers. This is a ~16× lift over random selection, which implies a Power-Buyer base rate of roughly 5 %. |
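
Precision at a top fraction and the corresponding lift can be computed with a small helper. The following is a minimal sketch on made-up toy data, not the notebook's own evaluation code; `precision_at_top_fraction` and `lift_at_top_fraction` are hypothetical names, and lift is taken here as precision in the shortlist divided by the overall positive rate.

```python
def precision_at_top_fraction(y_true, scores, fraction):
    """Share of positives among the top `fraction` of rows ranked by score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    top_k = int(fraction * len(scores))  # same truncation as the notebook cell
    if top_k == 0:
        raise ValueError("fraction is too small for this sample size")
    # Indices sorted by descending score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:top_k]) / top_k


def lift_at_top_fraction(y_true, scores, fraction):
    """Precision in the shortlist divided by the overall positive rate."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("no positive examples, lift is undefined")
    return precision_at_top_fraction(y_true, scores, fraction) / base_rate


# Toy data (invented for illustration): 2 positives out of 10 rows
y_toy = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s_toy = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

print("precision @ top 20 %:", precision_at_top_fraction(y_toy, s_toy, 0.2))
print("lift @ top 20 %:", round(lift_at_top_fraction(y_toy, s_toy, 0.2), 2))
```

The helpers index their inputs by position, so a pandas Series such as `y_te` should be passed as `y_te.to_numpy()`. Printing the base rate next to the precision makes a stated lift easy to check.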


In [None]:
# CHURN PROPENSITY MODEL

y = core['is_churn'].astype(int)

def leak(col):
    leaks = (
        'purchase_', 'revenue', 'transaction_',  
        'cart_count_mean', 'purchase_count_mean',
        'target_', 'is_', '_cohort', 'value_segment'
    )
    return any(tok in col for tok in leaks)

X = core.drop(columns=['user_id'] + [c for c in core.columns if leak(c)])


X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)


num_cols = X_tr.select_dtypes(include=['number', 'bool']).columns.tolist()
cat_cols = X_tr.select_dtypes(include=['object', 'category']).columns.tolist()

pre = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
      ])

pipe_churn = Pipeline([
        ('prep', pre),
        ('gb', GradientBoostingClassifier(
                learning_rate=0.05,
                n_estimators=300,
                max_depth=3,
                random_state=42))
      ])

# TRAIN & EVALUATE
pipe_churn.fit(X_tr, y_tr)
proba_te = pipe_churn.predict_proba(X_te)[:,1]

roc = roc_auc_score(y_te, proba_te)
print("Churn Model  ►  ROC-AUC:", round(roc, 3))

# precision at top-10 % (retention shortlist)
top_k = int(0.10 * len(proba_te))
idx   = np.argsort(proba_te)[::-1][:top_k]
precision_top10 = y_te.iloc[idx].mean()
print(f"Precision @ top-10 %: {precision_top10:.2%}")



Churn Model  ►  ROC-AUC: 0.865
Precision @ top-10 %: 97.83%


### Churn Predictive Model

| Metric | Result | What It Means |
|--------|--------|---------------|
| **ROC-AUC** | **0.865** | The model has strong discriminative power: a randomly chosen churner receives a higher score than a randomly chosen retained customer about 86.5 % of the time (random = 0.50, perfect = 1.00). |
| **Precision @ Top 10 %** | **97.8 %** | If we target only the 10 % highest-scored customers in the held-out test set, **~98 %** of them really do churn. The lift over the baseline churn rate is not quantified here; compare against `y_te.mean()` to size it. |
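
ROC-AUC can be read as a pairwise ranking probability: the chance that a randomly chosen positive outranks a randomly chosen negative. The sketch below computes it by brute force on invented toy data. `pairwise_auc` is a hypothetical helper written for illustration, with ties counted as half a win; it is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Runs in O(n_pos * n_neg), so it is only
    meant for small samples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): 2 churners, 3 retained customers
y_toy_auc = [1, 0, 1, 0, 0]
s_toy_auc = [0.9, 0.8, 0.7, 0.6, 0.5]

# 5 of the 6 positive/negative pairs are ranked correctly
print("pairwise AUC:", round(pairwise_auc(y_toy_auc, s_toy_auc), 3))
```

On this toy sample the only mis-ranked pair is the churner scored 0.7 against the retained customer scored 0.8.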

