# Customer churn modeling
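Quick comparison of three scikit-learn classifiers (logistic regression, random forest, gradient boosting) on a customer purchase dataset, scored by accuracy and F1 on a held-out 20% test split.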

## Workflow overview
1. Import libraries and set paths.
2. Inspect dataset columns to confirm the churn label name.
3. Clean/encode the data and create train/test splits.
4. Train three lightweight models and compare metrics.
5. Report the top performer and jot next steps.

In [None]:
# Core libraries for path handling, data prep, and modeling
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Configure the dataset location and churn label name
data_path = Path("..") / "data" / "train.xlsx"
target_col = "Churn_retained"

# Show the resolved path for quick sanity checking
data_path

WindowsPath('../data/train.xlsx')
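Because `data_path` is relative, it resolves against whatever directory the notebook kernel was started in. A small guard along these lines can fail early and show the absolute path in the error message. This is only a sketch: `require_file` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return the absolute form of `path`, or raise if no file exists there."""
    resolved = path.resolve()  # absolute path, relative to the current working directory
    if not resolved.is_file():
        raise FileNotFoundError(f"Dataset not found at {resolved}")
    return resolved
```

One possible use is `data_path = require_file(data_path)` before the first `pd.read_excel` call.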

## Confirm available columns
This quick look prevents typos in `target_col` and helps document the dataset schema.

In [None]:
# Display header names without loading entire file
pd.read_excel(data_path, nrows=0).columns.tolist()

['Customer.ID',
 'Purchase.Date',
 'Product.Price',
 'Quantity',
 'Total.Purchase.Amount',
 'Age',
 'Product.Category_Clothing',
 'Product.Category_Electronics',
 'Product.Category_Home',
 'Payment.Method_Credit Card',
 'Payment.Method_PayPal',
 'Gender_Male',
 'Returns_Return',
 'Churn_retained']
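If `target_col` is misspelled, the header list can also be used to suggest the intended name. The following is a minimal sketch using the standard library's `difflib`; `suggest_column` is a hypothetical helper, and the `cutoff` value is an arbitrary choice.

```python
from difflib import get_close_matches

# Column names copied from the header output above
columns = [
    "Customer.ID", "Purchase.Date", "Product.Price", "Quantity",
    "Total.Purchase.Amount", "Age", "Product.Category_Clothing",
    "Product.Category_Electronics", "Product.Category_Home",
    "Payment.Method_Credit Card", "Payment.Method_PayPal",
    "Gender_Male", "Returns_Return", "Churn_retained",
]


def suggest_column(name, columns, cutoff=0.6):
    """Return [name] if it exists, else up to three similar column names."""
    if name in columns:
        return [name]
    # Compare case-insensitively, then map back to the original spelling
    lowered = {c.lower(): c for c in columns}
    hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[h] for h in hits]


print(suggest_column("churn_retaned", columns))
```

The best match comes first in the returned list, so the first element is the most likely intended label.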

## Clean and encode data
Convert date columns to ordinal day numbers, drop rows missing the target, one-hot encode any remaining categoricals, fill missing feature values with column medians, and make a stratified 80/20 train/test split.

In [None]:
# Load full dataset (single Excel sheet)
df = pd.read_excel(data_path)
print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
if target_col not in df.columns:
    raise KeyError(f"'{target_col}' column not found. Please update target_col to match your file")

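# Convert any date-like columns to ordinal day numbers so scikit-learn can consume them.
# Unparseable dates become NaN (not pd.NA) so the column stays numeric and is
# median-filled below instead of being dummy-encoded as an object column.
date_cols = [col for col in df.columns if "date" in col.lower()]
for col in date_cols:
    dt_series = pd.to_datetime(df[col], errors="coerce")
    df[col] = dt_series.map(lambda x: x.toordinal() if pd.notnull(x) else float("nan"))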

# Basic cleanup + dummy encoding
df = df.dropna(subset=[target_col])
y = df[target_col].astype(int)
X = pd.get_dummies(df.drop(columns=[target_col]), drop_first=True)
X = X.fillna(X.median())

# Stratified split keeps churn ratio stable between train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape

Loaded 202611 rows and 14 columns


(162088, 13)
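To make the date step concrete, here is a standard-library stand-in for what the cell does with pandas: parse each value, take its ordinal day number, and turn anything unparseable into a missing value. The date format and the sample strings are assumptions for illustration; the notebook itself relies on `pd.to_datetime` to infer the format.

```python
import math
from datetime import datetime


def to_ordinal(value, fmt="%Y-%m-%d"):
    """Parse a date string and return its proleptic Gregorian ordinal, or NaN."""
    try:
        return float(datetime.strptime(value, fmt).toordinal())
    except (TypeError, ValueError):
        # Mirrors errors="coerce": unparseable input becomes a missing value
        return math.nan


# Made-up sample values, including two that cannot be parsed
raw_dates = ["2023-01-01", "2023-01-15", "not a date", None]
ordinals = [to_ordinal(v) for v in raw_dates]

# Ordinals count whole days, so the gap between two dates survives the conversion
print(ordinals[1] - ordinals[0])  # 14.0
```

Keeping missing values as float NaN keeps the column numeric, which is what median filling expects.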

## Train and evaluate models
Fit three simple classifiers, capture accuracy and F1 on the holdout set, and compare results in a small table.

In [None]:
# Define a lightweight mix of linear and tree-based models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

results = []
for name, model in models.items():
    # Standard fit/predict loop
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds)  # binary F1 for label 1 (the default pos_label)
    })

# Build a tidy comparison frame sorted by accuracy
results_df = (
    pd.DataFrame(results)
    .sort_values(by="accuracy", ascending=False)
    .reset_index(drop=True)
)
results_df

|   | model | accuracy | f1 |
|---|---|---|---|
| 0 | Gradient Boosting | 0.798978 | 0.888249 |
| 1 | Logistic Regression | 0.798904 | 0.888212 |
| 2 | Random Forest | 0.798904 | 0.888200 |
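One way to put these scores in context is to compare them with a trivial classifier that predicts label 1 for every row. Its accuracy and precision both equal the share of label-1 rows, and its recall is 1. The sketch below works that out by hand. The 0.7989 positive rate is an assumption chosen to match the reported accuracy, because the notebook does not print the actual class balance.

```python
def always_positive_scores(positive_rate):
    """Accuracy and F1 for a classifier that predicts the positive class for every row."""
    precision = positive_rate  # share of predicted positives that are truly positive
    recall = 1.0               # no true positive is ever missed
    accuracy = positive_rate   # every positive row is right, every negative row is wrong
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Assumed positive rate (hypothetical, not measured in the notebook)
acc, f1 = always_positive_scores(0.7989)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

If the real share of label-1 rows is close to 0.80, the three models score about the same as this baseline, so printing `y.value_counts(normalize=True)` is a useful check before reading much into the table.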


## Highlight the top performer
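Print the top row of the comparison table. The three accuracies are identical until the fifth decimal place, so treat the ranking as a near tie rather than a clear win for one model.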

In [None]:
# Grab top row from the comparison table and format the metrics
best_scores = results_df.iloc[0]
print(
    f"Best model: {best_scores['model']} (accuracy={best_scores['accuracy']:.3f}, "
    f"f1={best_scores['f1']:.3f})"
)

Best model: Gradient Boosting (accuracy=0.799, f1=0.888)
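The size of the lead can be checked with a little arithmetic on the numbers printed above. This sketch assumes no rows were dropped for a missing label, so the test set is simply the loaded rows minus the training rows. The standard error uses the usual binomial approximation and is meant only as a rough yardstick.

```python
import math

# Figures copied from the notebook outputs above
n_rows, n_train = 202611, 162088
acc_best, acc_next = 0.798978, 0.798904

# Assumption: no rows were dropped for a missing label, so the rest is the test set
n_test = n_rows - n_train

# How many test predictions the accuracy gap corresponds to
gap_rows = (acc_best - acc_next) * n_test

# Rough binomial standard error of one accuracy estimate on n_test rows
std_err = math.sqrt(acc_best * (1 - acc_best) / n_test)

print(f"test rows: {n_test}")
print(f"gap in predictions: {gap_rows:.1f}")
print(f"standard error of accuracy: {std_err:.4f}")
```

Under these assumptions the gap amounts to about three test predictions, well inside a standard error of roughly 0.002, so a different `random_state` could plausibly reorder the table.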


## Next steps
- Update `target_col` if your label has a different name.
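- Try more feature engineering or class balancing (for example `class_weight="balanced"`), and compare against a majority-class baseline: the three accuracies sit within 0.0001 of each other, which often means the models predict the dominant class for almost every row.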
- Swap in other estimators (XGBoost, LightGBM, etc.) and compare in `results_df`.
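For the class-balancing idea, scikit-learn's `class_weight="balanced"` option (accepted by `LogisticRegression` and `RandomForestClassifier`) weights each class by `n_samples / (n_classes * class_count)`. `GradientBoostingClassifier` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`. The sketch below reproduces the formula with the standard library on made-up labels; `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (made up): three rows of class 1 for every row of class 0
toy_labels = [1, 1, 1, 0, 1, 1, 1, 0]
weights = balanced_class_weights(toy_labels)
print(weights)

# Per-row weights that could be passed as sample_weight to an estimator's fit()
sample_weight = [weights[label] for label in toy_labels]
```

The rarer class gets the larger weight, so errors on it cost more during training.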