# 🧹 Cleaned Customer Churn Prediction Notebook

**What I did:**
- Removed fragile / outdated imports and non-functional cells.
- Consolidated the pipeline: load data → preprocess → train/test split → model → eval → save model.

**Notes for you:**
- Put your dataset as `churn.csv` in the same folder, or modify the path in the data-loading cell.
- If you want specific parts restored from your original notebook, tell me which cell numbers or paste the code.


## 1) Environment / Requirements

The install command in this cell is commented out. If any of the listed packages is missing from your environment, uncomment the `pip install` line and run the cell once.

In [None]:
# Uncomment and run the next line if packages are missing in your environment
# !pip install pandas scikit-learn matplotlib seaborn joblib openpyxl
print('Assuming standard data-science packages are installed: pandas, scikit-learn, matplotlib, seaborn, joblib')
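
Rather than assuming the packages are present, their installed versions can be checked directly. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and the package list is copied from the print statement above.

```python
from importlib import metadata

# Distribution names as they appear on PyPI (taken from the cell above).
REQUIRED = ["pandas", "scikit-learn", "matplotlib", "seaborn", "joblib"]

def check_packages(names):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

status = check_packages(REQUIRED)
for name, version in status.items():
    print(f"{name}: {version or 'MISSING'}")
```

Any package reported as `MISSING` is a candidate for the commented-out `pip install` line.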


## 2) Imports

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, RocCurveDisplay
import matplotlib.pyplot as plt
import joblib

print('Imports successful')


## 3) Load dataset (robust)

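This cell looks for `churn.csv` in the working directory. If the file is not found, it generates a small synthetic dataset (500 rows, roughly 25% churn) so the notebook runs end-to-end for testing. The synthetic `churn` label is drawn independently of the features, so metrics on the demo data will sit near chance level.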

In [None]:
DATA_PATH = 'churn.csv'

if os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_PATH)
    print(f'Loaded dataset from {DATA_PATH} — rows: {len(df):,}, columns: {len(df.columns):,}')
else:
    print(f'File {DATA_PATH} not found. Creating a small synthetic dataset for demo purposes.')
    rng = np.random.RandomState(0)
    n = 500
    df = pd.DataFrame({
        'age': rng.randint(18,80,size=n),
        'tenure_months': rng.randint(0,72,size=n),
        'monthly_charges': np.round(rng.uniform(20,120,size=n),2),
        'has_partner': rng.choice(['Yes','No'], size=n),
        'contract_type': rng.choice(['Month-to-month','One year','Two year'], size=n),
        'churn': rng.choice([0,1], size=n, p=[0.75,0.25])
    })

# Quick peek
print(df.head())
print('\nData shape:', df.shape)
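
The later cells stratify on the `churn` column and read column 1 of `predict_proba`, so they implicitly assume a binary target with no missing labels. A quick check right after loading can surface a mismatched file early. This is only a sketch: `validate_churn_frame` is a hypothetical helper, and the three checks are assumptions about what the rest of the notebook needs.

```python
import pandas as pd

def validate_churn_frame(df, target="churn"):
    """Raise ValueError if the frame cannot be used by the rest of the notebook."""
    if target not in df.columns:
        raise ValueError(f"Missing target column {target!r}; found {list(df.columns)}")
    if df[target].isna().any():
        raise ValueError(f"Target column {target!r} contains missing values")
    n_classes = df[target].nunique()
    if n_classes != 2:
        raise ValueError(f"Expected a binary target, found {n_classes} distinct values")
    return df

# Tiny illustrative frame; in the notebook this would be the loaded `df`.
demo = pd.DataFrame({"age": [30, 45, 52], "churn": [0, 1, 0]})
validate_churn_frame(demo)
print("Frame looks usable")
```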


## 4) Basic EDA

In [None]:
# Data info (df.info() prints its report directly and returns None)
df.info()

# Numeric summary
print('\nNumeric summary:')
print(df.select_dtypes(include=[np.number]).describe().T)

# Class balance
print('\nChurn distribution:')
print(df['churn'].value_counts(normalize=True))
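
The summary above does not report missing values explicitly, and the preprocessing cells that follow do not impute them. A per-column missing-value table is one way to see whether that matters for a given file. This is a sketch on a made-up frame; `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Per-column count and share of missing values, worst column first."""
    counts = df.isna().sum()
    out = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    return out[out["n_missing"] > 0].sort_values("share", ascending=False)

# Illustrative data only; in the notebook, pass the loaded `df` instead.
demo = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "contract_type": ["One year", None, None, "Two year"],
    "churn": [0, 1, 0, 0],
})
report = missing_summary(demo)
print(report)
```

Columns with no missing values are dropped from the table, so an empty result means nothing needs imputing.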


## 5) Preprocessing and feature setup

In [None]:
# Identify features
TARGET = 'churn'
features = [c for c in df.columns if c != TARGET]

# Simple heuristic: numeric dtypes are scaled; everything else (object, string, category) is one-hot encoded
numeric_features = [c for c in features if pd.api.types.is_numeric_dtype(df[c])]
categorical_features = [c for c in features if c not in numeric_features]

print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)

# Preprocessing transformers
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
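# `sparse_output` replaces the `sparse` argument (renamed in scikit-learn 1.2; the old name was removed in 1.4)
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])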

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', cat_transformer, categorical_features)
])
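
The transformers above assume there are no missing values; `OneHotEncoder` and the downstream classifier may fail or behave unexpectedly if the real data contains gaps. One common option is to put a `SimpleImputer` in front of each transformer. The sketch below uses a made-up four-row frame, and the choice of median and most-frequent imputation is an assumption, not something taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative frame with one gap in each column.
demo = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "contract_type": pd.Series(["One year", np.nan, "Two year", "One year"], dtype=object),
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),        # fill numeric gaps with the column median
    ("scaler", StandardScaler()),
])
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill categorical gaps with the mode
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age"]),
        ("cat", cat_transformer, ["contract_type"]),
    ],
    sparse_threshold=0,  # always return a dense array, whatever the encoder emits
)

Xt = preprocessor.fit_transform(demo)
print(Xt.shape)
print(np.isnan(Xt).any())
```

Because the imputers live inside the pipeline, the fill values are learned from the training split only and reused at prediction time.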


## 6) Train / Test split

In [None]:
X = df[features]
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)
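
The `stratify=y` argument keeps the churn rate the same in both splits, which matters when the positive class is a minority. A small self-contained check on a toy label vector (not the notebook's data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 25% positives, mirroring the synthetic churn rate.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train churn rate:", y_tr.mean())
print("Test churn rate:", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, especially on small test sets.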


## 7) Model pipeline and training

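A `RandomForestClassifier` is used as a baseline that needs little tuning. The preprocessing and the classifier are wrapped in one `Pipeline`, so the same transformations are applied at training and prediction time.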

In [None]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])

model.fit(X_train, y_train)
print('Model trained')


## 8) Evaluation

In [None]:
y_pred = model.predict(X_test)
y_proba = None
if hasattr(model, 'predict_proba'):
    try:
        y_proba = model.predict_proba(X_test)[:,1]
    except Exception:
        y_proba = None

print('\nClassification report:')
print(classification_report(y_test, y_pred))

print('\nConfusion matrix:')
print(confusion_matrix(y_test, y_pred))

if y_proba is not None:
    roc = roc_auc_score(y_test, y_proba)
    print(f'ROC AUC: {roc:.4f}')
    RocCurveDisplay.from_predictions(y_test, y_proba)
    plt.show()
else:
    print('Probability estimates not available — skipping ROC')
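
The `predict` call above picks the class with the higher predicted probability, which for a binary problem amounts to a cutoff of about 0.5. For churn, where missing a churner is often more costly than a false alarm, it can be worth looking at other cutoffs on `y_proba`. The sketch below uses hand-written labels and probabilities purely for illustration; `metrics_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and scores; in the notebook these would be y_test and y_proba.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.80, 0.30])

def metrics_at(y_true, y_score, threshold):
    """Precision and recall when predicting churn for scores >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    return precision_score(y_true, y_hat), recall_score(y_true, y_hat)

for t in (0.3, 0.5):
    p, r = metrics_at(y_true, y_score, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold trades precision for recall; the right balance depends on the cost of each kind of error, which the notebook does not specify.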


## 9) Feature importance (approximate)

This cell shows feature importances if the classifier exposes `feature_importances_`. Because the `OneHotEncoder` expands each categorical column into several columns, the feature names are reconstructed from the fitted preprocessor.

In [None]:
try:
    clf = model.named_steps['classifier']
    pre = model.named_steps['preprocessor']
    # Get transformed feature names for numeric + onehot
    num_feats = numeric_features
    cat_feats = []
    if 'cat' in pre.named_transformers_:
        ohe = pre.named_transformers_['cat'].named_steps['onehot']
        cat_names = ohe.get_feature_names_out(categorical_features)
        cat_feats = list(cat_names)
    feature_names = list(num_feats) + cat_feats
    importances = clf.feature_importances_
    fi = pd.DataFrame({'feature': feature_names, 'importance': importances}).sort_values('importance', ascending=False)
    print(fi.head(20))
except Exception as e:
    print('Could not compute feature importances:', e)
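
The importances above are impurity-based: they are computed from the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below runs it on a toy dataset in which only `tenure_months` carries signal; the data and the `noise` column are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 400
X = pd.DataFrame({
    "tenure_months": rng.randint(0, 72, size=n),
    "noise": rng.normal(size=n),          # unrelated to the label by construction
})
y = (X["tenure_months"] < 24).astype(int)  # toy rule: short tenure means churn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the test set in turn and measure the drop in score.
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm)
```

Applied to a full pipeline and the raw `X_test`, the same call yields one score per original column rather than one per one-hot column, which is often easier to read.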


## 10) Save trained model

In [None]:
OUT_MODEL = 'churn_model.joblib'
try:
    joblib.dump(model, OUT_MODEL)
    print(f'Model saved to {OUT_MODEL}')
except Exception as e:
    print('Error saving model:', e)


## 11) Example: load saved model and predict on a sample

In [None]:
# Load model and run a single prediction
m = joblib.load(OUT_MODEL)
print('Loaded model from', OUT_MODEL)

sample = X_test.iloc[:3]
print('Sample input:\n', sample)
print('Predictions:', m.predict(sample))
if hasattr(m, 'predict_proba'):
    print('Probabilities:', m.predict_proba(sample)[:,1])


## 12) Next steps / suggestions

- Replace the synthetic demo data with your real `churn.csv` file.
- Add cross-validation, hyperparameter tuning (GridSearchCV / RandomizedSearchCV).
- Add better EDA plots and missing-value handling depending on your data.
- If you want me to restore or port specific cells from your original notebook, tell me the cell numbers or paste code snippets.
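
The cross-validation and tuning suggestion could look roughly like the sketch below. It rebuilds a small pipeline on invented data so that it runs on its own; the parameter grid, the `roc_auc` scoring choice, and the toy labeling rule are assumptions for illustration, not recommendations derived from real churn data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with a deterministic label so the example has learnable signal.
rng = np.random.RandomState(0)
n = 300
X = pd.DataFrame({
    "tenure_months": rng.randint(0, 72, size=n),
    "monthly_charges": np.round(rng.uniform(20, 120, size=n), 2),
    "contract_type": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
y = ((X["tenure_months"] < 12) | (X["contract_type"] == "Month-to-month")).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure_months", "monthly_charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1) Cross-validated estimate for the untuned pipeline.
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# 2) Small grid search; step-name prefixes route parameters into the pipeline.
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV ROC AUC: {search.best_score_:.3f}")
```

Tuning the whole pipeline, rather than a bare classifier on pre-transformed data, keeps the scaler and encoder from seeing the validation folds.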