# 03 - Model Experiments: Churn Prediction

## Goals:
- Try different classifiers (Random Forest, XGBoost, etc.)
- Add/remove features and observe impact
- Track everything with MLflow

---



## 1. Load the Processed Data

We load the train and test datasets that were previously cleaned and saved to the `data/processed/` folder.

These files will be used as the base input for our feature engineering and model training experiments.

In [8]:
# 1. Load train and test data

import pandas as pd

# Load processed train and test sets
train_df = pd.read_csv("../data/processed/train.csv")
test_df = pd.read_csv("../data/processed/test.csv")

print(f"✅ Train shape: {train_df.shape}")
print(f"✅ Test shape: {test_df.shape}")

# Optional: show a few rows
train_df.head()

✅ Train shape: (7088, 19)
✅ Test shape: (3039, 19)


|  | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Trans_Amt | Total_Trans_Ct | Total_Amt_Chng_Q4_Q1 | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 3 | 36 | 2 | 3 | 3 | 6680.0 | 1839 | 7632 | 95 | 0.617 | 0.532 | 0.275 | F | Uneducated | Married | Less than $40K | Blue | 0 |
| 1 | 39 | 1 | 34 | 3 | 1 | 1 | 2884.0 | 2517 | 4809 | 87 | 0.693 | 0.74 | 0.873 | F | Graduate | Single | Unknown | Blue | 0 |
| 2 | 52 | 1 | 36 | 4 | 2 | 2 | 14858.0 | 1594 | 4286 | 72 | 0.51 | 0.636 | 0.107 | M | Unknown | Married | $80K - $120K | Blue | 0 |
| 3 | 34 | 0 | 17 | 4 | 1 | 4 | 2638.0 | 2092 | 1868 | 43 | 0.591 | 0.344 | 0.793 | M | Graduate | Married | $40K - $60K | Blue | 0 |
| 4 | 47 | 5 | 36 | 3 | 1 | 2 | 8896.0 | 1338 | 4252 | 70 | 0.741 | 0.591 | 0.15 | M | Doctorate | Single | Less than $40K | Blue | 0 |
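
Before any feature engineering, it can be worth confirming that the two splits agree on their columns and that the target is present in both. The following is a minimal sketch on two small made-up frames; `check_splits` is a hypothetical helper that is not part of this notebook, and in practice it would be called on `train_df` and `test_df`.

```python
import pandas as pd


def check_splits(train, test, target="churn"):
    """Return a small report comparing two splits (hypothetical helper)."""
    return {
        # Columns present in one split but not the other
        "only_in_train": sorted(set(train.columns) - set(test.columns)),
        "only_in_test": sorted(set(test.columns) - set(train.columns)),
        # The target must exist in both splits
        "target_in_both": target in train.columns and target in test.columns,
        # Fraction of all rows that landed in the training split
        "train_fraction": len(train) / (len(train) + len(test)),
        # Total count of missing values per split
        "train_missing": int(train.isna().sum().sum()),
        "test_missing": int(test.isna().sum().sum()),
    }


# Made-up stand-ins for train_df / test_df (not real data)
toy_train = pd.DataFrame(
    {"Customer_Age": [44, 39, 52], "Gender": ["F", "F", "M"], "churn": [0, 0, 1]}
)
toy_test = pd.DataFrame({"Customer_Age": [34], "Gender": ["M"], "churn": [0]})

report = check_splits(toy_train, toy_test)
print(report)
```

With the shapes printed above, `train_fraction` would be 7088 / (7088 + 3039), or roughly 0.70.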


## 2. Select Features and Target

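We define the features to use for model training and group them by type. The target variable (`churn`) is split off later, just before the model is fitted.

The features include both numerical and categorical columns, so the cell below handles each group in a pipeline: three engineered ratio features are added first, then selected numerical columns are scaled and the categorical columns are one-hot encoded.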

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
import numpy as np

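# === 1. Define your custom feature engineering logic ===
def add_interaction_features(df):
    """Add three ratio features; the 1e-3 in each denominator avoids division by zero."""
    df = df.copy()  # leave the caller's DataFrame untouched
    # Average amount per transaction
    df["Avg_Transaction_Amt"] = df["Total_Trans_Amt"] / (df["Total_Trans_Ct"] + 1e-3)
    # Revolving balance as a share of the credit limit
    df["Revolve_to_Limit"] = df["Total_Revolving_Bal"] / (df["Credit_Limit"] + 1e-3)
    # Q4-to-Q1 change in amount relative to the change in transaction count
    df["AmtCt_Chg_Ratio"] = df["Total_Amt_Chng_Q4_Q1"] / (df["Total_Ct_Chng_Q4_Q1"] + 1e-3)
    return df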

# Wrap it as a FunctionTransformer
interaction_transformer = FunctionTransformer(add_interaction_features)

# === 2. Separate features ===
categorical = [
    "Gender", "Education_Level", "Marital_Status", 
    "Income_Category", "Card_Category"
]

numerical_to_scale = [
    "Credit_Limit", "Total_Revolving_Bal", 
    "Total_Trans_Amt", "Total_Trans_Ct", 
    "Avg_Utilization_Ratio", "Avg_Transaction_Amt", 
    "Revolve_to_Limit", "AmtCt_Chg_Ratio"
]

numerical_no_scale = [
    "Customer_Age", "Dependent_count", "Months_on_book",
    "Total_Relationship_Count", "Months_Inactive_12_mon",
    "Contacts_Count_12_mon", "Total_Amt_Chng_Q4_Q1", 
    "Total_Ct_Chng_Q4_Q1"
]

# === 3. Define transformers ===

cat_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

num_scale_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

num_noscale_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

# === 4. Compose all into one ColumnTransformer ===
preprocessor = ColumnTransformer([
    ("num_scaled", num_scale_transformer, numerical_to_scale),
    ("num_noscale", num_noscale_transformer, numerical_no_scale),
    ("cat", cat_transformer, categorical)
])

# === 5. Final pipeline with feature engineering + preprocessor ===
full_pipeline = Pipeline([
    ("feature_engineering", interaction_transformer),
    ("preprocessor", preprocessor)
])
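
To see what the three engineered ratios look like, here is a small sketch that applies the same `add_interaction_features` logic to two made-up rows. The values are invented for illustration only; the second row deliberately has zero transactions and a zero count-change ratio to show the effect of the `1e-3` offset.

```python
import numpy as np
import pandas as pd


def add_interaction_features(df):
    # Same logic as the notebook cell above
    df = df.copy()
    df["Avg_Transaction_Amt"] = df["Total_Trans_Amt"] / (df["Total_Trans_Ct"] + 1e-3)
    df["Revolve_to_Limit"] = df["Total_Revolving_Bal"] / (df["Credit_Limit"] + 1e-3)
    df["AmtCt_Chg_Ratio"] = df["Total_Amt_Chng_Q4_Q1"] / (df["Total_Ct_Chng_Q4_Q1"] + 1e-3)
    return df


# Made-up rows (not taken from the dataset)
toy = pd.DataFrame({
    "Total_Trans_Amt": [1000.0, 0.0],
    "Total_Trans_Ct": [10, 0],
    "Total_Revolving_Bal": [500.0, 0.0],
    "Credit_Limit": [2000.0, 1500.0],
    "Total_Amt_Chng_Q4_Q1": [0.6, 0.5],
    "Total_Ct_Chng_Q4_Q1": [0.5, 0.0],
})

engineered = add_interaction_features(toy)
new_cols = ["Avg_Transaction_Amt", "Revolve_to_Limit", "AmtCt_Chg_Ratio"]
print(engineered[new_cols])

# No inf or NaN appears, even where a raw denominator is zero
all_finite = bool(np.isfinite(engineered[new_cols].to_numpy()).all())
print("All engineered values finite:", all_finite)
```

In the second row `AmtCt_Chg_Ratio` comes out at roughly 500 (0.5 / 0.001). The offset prevents infinities, but it can still produce very large values when a denominator is zero, so it may be worth inspecting the distribution of these columns on the real data.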

## 4. Train and Evaluate the Baseline Model

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline

# Combine preprocessing pipeline with the classifier
rf_pipeline = Pipeline([
    ("feature_pipeline", full_pipeline),
    ("classifier", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
])

# Separate features and target
X_train = train_df.drop(columns=["churn"])  # Drop the target column
y_train = train_df["churn"]

X_test = test_df.drop(columns=["churn"])
y_test = test_df["churn"]

# Fit the model
rf_pipeline.fit(X_train, y_train)

# Predict
y_pred = rf_pipeline.predict(X_test)
y_proba = rf_pipeline.predict_proba(X_test)[:, 1]

# Evaluate
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("✅ ROC AUC:", roc_auc_score(y_test, y_proba))

✅ Accuracy: 0.9519578808818691
✅ ROC AUC: 0.9843969899300179
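
Accuracy and ROC AUC are easier to interpret next to a trivial baseline and at least one other classifier, which is also the first goal of this notebook. The sketch below shows one way such a comparison loop could look. It runs on synthetic, imbalanced data from `make_classification`, not on the churn dataset, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for a boosted model such as XGBoost. The sample size and class weights are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the churn dataset)
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

candidates = {
    # Always predicts the majority class; a floor for accuracy
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    # Same hyperparameters as the baseline model above
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=42
    ),
    # Stand-in for a boosted model such as XGBoost
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, roc_auc={scores['roc_auc']:.3f}")
```

The majority-class baseline scores an ROC AUC of 0.5 by construction, while its accuracy tracks the share of the majority class, so a high accuracy on imbalanced data says little on its own. In the notebook itself, the same loop could wrap each candidate in a `Pipeline` with `full_pipeline` as the first step, as `rf_pipeline` does.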


## 5. MLflow Tracking Block
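
A minimal sketch of what such a tracking block could look like, assuming MLflow is installed and its default local tracking location is acceptable. The helper `log_run`, the run name `rf_baseline`, and the experiment name `churn-prediction` are hypothetical and do not come from this notebook. The hyperparameters are those of the baseline `RandomForestClassifier` above.

```python
def log_run(run_name, params, metrics, experiment_name="churn-prediction"):
    """Log one training run to MLflow and return its run ID (hypothetical helper)."""
    # Imported lazily so the helper can be defined where MLflow is not installed
    import mlflow

    mlflow.set_experiment(experiment_name)  # experiment name is an assumption
    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)    # hyperparameters, one entry per key
        mlflow.log_metrics(metrics)  # evaluation scores, one entry per key
        return run.info.run_id


# Hyperparameters of the baseline classifier defined earlier in the notebook
rf_params = {
    "model": "RandomForestClassifier",
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42,
}

# In the notebook, the metrics would come from the evaluation cell, for example:
# rf_metrics = {
#     "accuracy": accuracy_score(y_test, y_pred),
#     "roc_auc": roc_auc_score(y_test, y_proba),
# }
# run_id = log_run("rf_baseline", rf_params, rf_metrics)
```

Keeping the parameters and metrics in plain dictionaries makes it straightforward to call the same helper once per experiment when features or classifiers change.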

## 6. (Optional) Register the Best Model