# Customer Churn Prediction

## Introduction
Customer churn occurs when a customer stops using a company's service. Churn leads directly to lost revenue, so predicting it early matters: by identifying customers who are likely to leave, a company can act (discounts, offers, better service) to retain them.

In this project, we aim to **predict customer churn** using machine learning techniques. We use a **Kaggle customer churn dataset** (loaded from the `customer-churn-ng-intern` input directory). The dataset contains customer details such as demographics (for example `Geography` and `Gender`) and account information. The binary target variable is `Exited`.

### Objectives:
1. Perform Exploratory Data Analysis (EDA) to understand patterns in churn.  
2. Preprocess the data (handle missing values, encode categorical variables, scale numerical features).  
3. Build machine learning models to predict churn.  
4. Evaluate the models using metrics such as Accuracy, Precision, Recall, F1-score, and ROC-AUC.  
5. Provide insights and possible business recommendations.
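
The metrics listed in objective 4 can be illustrated on a toy example. The sketch below is dependency-free and only meant to show what each number measures: the labels and probabilities are made up, the 0.5 threshold is an assumption, and the helper `roc_auc` is a hypothetical name. In the notebook itself, `sklearn.metrics` would normally be used instead.

```python
# Toy labels and predicted churn probabilities (made-up values for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

threshold = 0.5  # assumed decision threshold
y_pred = [1 if p >= threshold else 0 for p in y_prob]

# Confusion-matrix counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This is the rank interpretation of ROC AUC.
    """
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc(y_true, y_prob)
print(accuracy, precision, recall, f1, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC AUC is computed from the ranking of the probabilities alone.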


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/customer-churn-ng-intern/sample_submission.csv
/kaggle/input/customer-churn-ng-intern/train.csv
/kaggle/input/customer-churn-ng-intern/test.csv


In [None]:
# 📦 Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import clone

# Make plots look nice
plt.style.use("seaborn-v0_8")
sns.set_palette("Set2")

In [None]:
# 📂 Load the dataset (paths match the input files listed above)
train = pd.read_csv("/kaggle/input/customer-churn-ng-intern/train.csv")
test = pd.read_csv("/kaggle/input/customer-churn-ng-intern/test.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)

# Quick look
train.head()

In [None]:
# 🔍 Basic EDA

# Check for missing values
print("Missing values per column:\n", train.isnull().sum().sort_values(ascending=False).head(10))

# Target distribution
print("\nTarget distribution:")
print(train["Exited"].value_counts(normalize=True))

# Quick statistical summary
train.describe().T.head(10)

In [None]:
# 📊 Target distribution plot
sns.countplot(data=train, x="Exited")
plt.title("Target Distribution (Exited)")
plt.show()

# 📊 Correlation heatmap (numeric features only)
numeric_features = train.select_dtypes(include=[np.number]).drop(columns=["Exited", "id"], errors="ignore")
plt.figure(figsize=(10,6))
sns.heatmap(numeric_features.corr(), annot=False, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

In [None]:
# ⚙️ Feature Engineering & Preprocessing

# Separate features & target; also drop identifier-like columns if present
X = train.drop(columns=["Exited", "id", "CustomerId", "Surname"], errors="ignore")
y = train["Exited"]

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

# Initialize encoder & scaler
target_enc = TargetEncoder(cols=categorical_cols, smoothing=0.2)
scaler = StandardScaler()

# Fit-transform categorical features with TargetEncoder.
# NOTE: fitting on the full training set is fine for a quick look at the data,
# but for model evaluation the encoder must be refit inside each CV fold,
# otherwise the validation labels leak into the encoded features.
X_cat_encoded = target_enc.fit_transform(X[categorical_cols], y)

# Scale numeric features
X_num_scaled = scaler.fit_transform(X[numeric_cols])

# Combine processed features into one DataFrame
X_processed = pd.DataFrame(
    np.hstack([X_num_scaled, X_cat_encoded.to_numpy()]),
    columns=numeric_cols + categorical_cols,
    index=X.index,
)

print("Processed dataset shape:", X_processed.shape)
X_processed.head()
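
Target encoding replaces each category with a smoothed mean of the target, so an encoder fit on the same rows it later encodes has already seen their labels. A common way around this is out-of-fold encoding. The sketch below is a simplified, dependency-free illustration of that idea: the function names, the toy data, the alternating fold assignment, and the additive smoothing with weight `m` are all assumptions for illustration, and this is not the formula `category_encoders.TargetEncoder` uses internally.

```python
from collections import defaultdict


def fit_target_means(categories, targets, m=1.0):
    """Smoothed target mean per category, shrunk toward the global mean (prior)."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return means, prior


def out_of_fold_encode(categories, targets, n_folds=2, m=1.0):
    """Encode each row using statistics computed from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for k in range(n_folds):
        # Simplistic fold assignment by row position (no shuffling or stratification)
        val_idx = [i for i in range(n) if i % n_folds == k]
        train_idx = [i for i in range(n) if i % n_folds != k]
        means, prior = fit_target_means(
            [categories[i] for i in train_idx],
            [targets[i] for i in train_idx],
            m,
        )
        for i in val_idx:
            # Categories unseen in the training folds fall back to the prior
            encoded[i] = means.get(categories[i], prior)
    return encoded


# Made-up toy data
geo = ["France", "Spain", "France", "Germany", "Spain", "France"]
exited = [1, 0, 0, 1, 0, 1]
encoded = out_of_fold_encode(geo, exited)
print(encoded)
```

No row's encoded value depends on its own label, which is the property that refitting the encoder inside each cross-validation fold is meant to provide.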


In [None]:
# ⚡ XGBoost Model Setup

from xgboost import XGBClassifier

# Define model with tuned hyperparameters
xgb_model = XGBClassifier(
    n_estimators=10000,       # large, will stop early
    learning_rate=0.01,       # small LR for better convergence
    max_depth=6,              # depth of trees
    subsample=0.8,            # row sampling
    colsample_bytree=0.8,     # feature sampling
    objective="binary:logistic",
    eval_metric="auc",
    random_state=42,
    n_jobs=-1,
    tree_method="hist",       # fast & memory efficient
    scale_pos_weight=(y.value_counts()[0] / y.value_counts()[1]) # handle imbalance
)

print("✅ XGBoost model initialized")
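
The `scale_pos_weight` value above is the ratio of negative to positive examples. A tiny illustration with made-up labels (the 80/20 split is an assumption, not the actual class balance of this dataset):

```python
from collections import Counter

# Made-up labels: 80 retained customers (0) and 20 churned customers (1)
y_toy = [0] * 80 + [1] * 20

counts = Counter(y_toy)
scale_pos_weight = counts[0] / counts[1]  # negatives / positives
print(scale_pos_weight)  # 4.0
```

With this setting, errors on the minority (churned) class weigh about four times as much during training. This mainly affects the predicted probabilities and threshold-based metrics; a ranking metric such as ROC AUC is usually less sensitive to it.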


In [None]:
# ⚡ Cross-validation with StratifiedKFold

# Preprocessing: scale numeric columns, target-encode categorical columns.
# A fresh copy is fit inside every fold so validation labels never leak in.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", TargetEncoder(), categorical_cols),
])

# Setup Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"🔹 Fold {fold+1}")

    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # Fit preprocessing on the training part only, then apply it to the validation part
    fold_prep = clone(preprocessor)
    X_tr_t = fold_prep.fit_transform(X_tr, y_tr)
    X_val_t = fold_prep.transform(X_val)

    # Early stopping is configured on the estimator (XGBoost 1.6+);
    # newer XGBoost releases no longer accept early_stopping_rounds in fit()
    model = clone(xgb_model).set_params(early_stopping_rounds=200)
    model.fit(X_tr_t, y_tr, eval_set=[(X_val_t, y_val)], verbose=False)

    # Predicted probability of class 1
    y_val_pred = model.predict_proba(X_val_t)[:, 1]

    # ROC AUC
    score = roc_auc_score(y_val, y_val_pred)
    cv_scores.append(score)

    print(f"✅ Fold {fold+1} AUC: {score:.5f}")

print("\n📊 Mean CV AUC:", np.mean(cv_scores))


In [None]:
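# ⚡ Train final model on full training data and generate submission

# Fresh preprocessing, fit on the full training data
final_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_cols),
    ("cat", TargetEncoder(), categorical_cols),
])
X_full = final_preprocessor.fit_transform(X, y)

# Same hyperparameters as the cross-validated model.
# There is no validation set here, so no early stopping: all n_estimators
# trees are built. Consider lowering n_estimators to roughly the best
# iteration observed during cross-validation.
final_model = clone(xgb_model)
final_model.fit(X_full, y)

# Transform test set (the ColumnTransformer selects the needed columns by name)
X_test_transformed = final_preprocessor.transform(test)

# Predictions (probabilities for class=1)
test_pred = final_model.predict_proba(X_test_transformed)[:, 1]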

# Build submission DataFrame
submission = pd.DataFrame({
    "id": test["id"],      # Kaggle requires "id" column from test set
    "Exited": test_pred    # Target column name in dataset
})

# Save CSV
submission.to_csv("submission.csv", index=False)

print("✅ Submission file saved as submission.csv")
submission.head()


In [None]:
# 📊 Feature Importance from XGBoost

importances = final_model.feature_importances_

# Feature names after preprocessing: the ColumnTransformer outputs the scaled
# numeric columns first, then the target-encoded categorical columns
all_features = numeric_cols + categorical_cols
assert len(all_features) == len(importances), "feature names do not match model inputs"

# Map importance scores
feat_importances = pd.DataFrame({
    "Feature": all_features,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

# Plot Top 15
plt.figure(figsize=(10, 6))
sns.barplot(data=feat_importances.head(15), x="Importance", y="Feature", palette="viridis")
plt.title("Top 15 Important Features (XGBoost)")
plt.show()

feat_importances.head(10)

In [None]:
# 📝 Final Report

print(" Model Report :")
print(f"Cross-Validation ROC AUC (mean ± std): {np.mean(cv_scores):.5f} ± {np.std(cv_scores):.5f}")
print("----------------------------------")
print("Public LB score will be visible after submission on Kaggle.")
print("Best CV vs Public LB comparison will guide further tuning.")
print("==================================")

# 🏆 Model Summary Notes
report_notes = {
    "Model": "XGBoost with Target Encoding",
    "Feature Engineering": "Categorical TargetEncoder + standardized numeric features",
    "Regularization": "Early stopping + small learning rate, max_depth, subsample, colsample_bytree",
    "Evaluation": "Stratified 5-fold CV (ROC AUC)",
    "Explainability": "Feature importance (XGBoost)",
    "Next Steps": [
        "Try LightGBM / CatBoost for comparison",
        "Hyperparameter tuning with Optuna",
        "SHAP values for per-prediction explanations",
        "Stacking ensemble (XGBoost + LightGBM + Logistic Regression)"
    ]
}

import pprint
pprint.pprint(report_notes)

# ✅ Reminder for Kaggle submission
print("\n➡️ Now upload 'submission.csv' to Kaggle and track Public LB score!")


In [None]:
# ⚡ Hyperparameter Tuning with Optuna

import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline

# Objective function for Optuna
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 300, 2000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 10),
        "reg_lambda": trial.suggest_float("reg_lambda", 1, 20),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "random_state": 42,
        "n_jobs": -1,
        "eval_metric": "auc",
        "tree_method": "hist"
    }

    # Preprocessing lives inside the pipeline so it is refit on each CV training
    # fold; XGBoost cannot consume the raw string columns directly
    pipe = Pipeline(steps=[
        ("preprocess", ColumnTransformer(transformers=[
            ("num", StandardScaler(), numeric_cols),
            ("cat", TargetEncoder(), categorical_cols),
        ])),
        ("model", XGBClassifier(**params)),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # The model already uses all cores (n_jobs=-1), so the folds run sequentially
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

    return scores.mean()

# Run Optuna study
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 🔥 you can increase to 200+ for stronger tuning

print("Best parameters:", study.best_params)
print("Best CV ROC AUC:", study.best_value)


In [None]:
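# Train final model with tuned params
best_params = study.best_params
best_params.update({
    "random_state": 42,
    "n_jobs": -1,
    "eval_metric": "auc",
    "tree_method": "hist"
})

# Preprocessing + model in one pipeline, fit on the full training data.
# n_estimators was tuned by Optuna and there is no held-out set here,
# so early stopping is not used.
final_pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", TargetEncoder(), categorical_cols),
    ])),
    ("model", XGBClassifier(**best_params)),
])
final_pipeline.fit(X, y)
final_model = final_pipeline.named_steps["model"]

# Predict on test (the ColumnTransformer selects the needed columns by name)
y_test_pred = final_pipeline.predict_proba(test)[:, 1]

# Save submission
submission = pd.DataFrame({"id": test["id"], "Exited": y_test_pred})
submission.to_csv("submission_optuna.csv", index=False)

print("✅ submission_optuna.csv is ready. Upload it to Kaggle!")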


In [None]:
# ⚖️ Train/Validation Split with Stratified K-Fold
from sklearn.model_selection import StratifiedKFold

# Define number of folds
n_splits = 5  # Common choice (5-fold CV)

# Create Stratified K-Fold object
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

print(f"✅ Stratified {n_splits}-Fold cross-validation is ready!")

In [None]:
# 📦 Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import clone

# 📥 Load the dataset
train_data = pd.read_csv('/kaggle/input/customer-churn-ng-intern/train.csv')
test_data = pd.read_csv('/kaggle/input/customer-churn-ng-intern/test.csv')
sample_sub = pd.read_csv('/kaggle/input/customer-churn-ng-intern/sample_submission.csv')

# 🧽 Feature setup
drop_features = ['id', 'CustomerId', 'Surname']
X_train = train_data.drop(columns=drop_features + ['Exited'])
y_train = train_data['Exited']
X_test = test_data.drop(columns=drop_features)

# 🔍 Column types
cat_cols = ['Geography', 'Gender']
num_cols = [col for col in X_train.columns if col not in cat_cols]

# 🔄 Preprocessing using TargetEncoder + StandardScaler
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_cols),
    ('cat', TargetEncoder(), cat_cols)
])

# 🔧 XGBoost Classifier
# (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
xgb = XGBClassifier(
    eval_metric='auc',
    random_state=42
)

# ⚙️ Pipeline (used for hyperparameter tuning only)
clf_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', xgb)
])

# 🔍 Hyperparameter search space
param_distributions = {
    'model__n_estimators': [100, 200, 300],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 5, 7],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0]
}

# 🔎 RandomizedSearchCV
random_search = RandomizedSearchCV(
    clf_pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='roc_auc',
    random_state=42,
    verbose=1,
    n_jobs=-1
)

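# 🚂 Fit for best params
random_search.fit(X_train, y_train)
print("✅ Best parameters:", random_search.best_params_)

# 💡 Final model and preprocessing
best_model = random_search.best_estimator_.named_steps['model']

# Fit preprocessor
# Note: the target encoder is fit on the full training set before the folds are
# split, so each validation fold's labels influence its own encoding and the
# out-of-fold AUC may be slightly optimistic.
X_train_transformed = preprocessor.fit_transform(X_train, y_train)
X_test_transformed = preprocessor.transform(X_test)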

# 🔁 Stratified K-Fold CV with early stopping
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(len(X_train))
test_preds = np.zeros(len(X_test))

for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    X_tr, X_val = X_train_transformed[train_idx], X_train_transformed[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

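    # Early stopping is configured on the estimator (XGBoost 1.6+);
    # newer XGBoost releases no longer accept early_stopping_rounds in fit()
    model = clone(best_model).set_params(early_stopping_rounds=30)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        verbose=False
    )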

    oof_preds[val_idx] = model.predict_proba(X_val)[:, 1]
    
    test_preds += model.predict_proba(X_test_transformed)[:, 1] / skf.n_splits

    fold_score = roc_auc_score(y_val, oof_preds[val_idx])
    print(f"📈 Fold {fold+1} AUC: {fold_score:.4f}")

# 🎯 Overall AUC
print(f"\n🎯 Overall ROC AUC: {roc_auc_score(y_train, oof_preds):.4f}")

# 📤 Submission
sample_sub['Exited'] = test_preds
sample_sub.to_csv('submission_xgb_targetencoder.csv', index=False)
print("📁 submission_xgb_targetencoder.csv created.")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
✅ Best parameters: {'model__subsample': 1.0, 'model__n_estimators': 300, 'model__max_depth': 3, 'model__learning_rate': 0.05, 'model__colsample_bytree': 1.0}
📈 Fold 1 AUC: 0.9365
📈 Fold 2 AUC: 0.9316
📈 Fold 3 AUC: 0.9314
📈 Fold 4 AUC: 0.9413
📈 Fold 5 AUC: 0.9298

🎯 Overall ROC AUC: 0.9339
📁 submission_xgb_targetencoder.csv created.
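
The five fold scores printed above can be summarized as a mean and a spread, which is a common way to report cross-validation results. A minimal sketch using only the standard library; the choice of the sample standard deviation (`statistics.stdev`) is an assumption, and `np.std` would give the slightly smaller population value:

```python
import statistics

# Fold AUCs copied from the output above
fold_aucs = [0.9365, 0.9316, 0.9314, 0.9413, 0.9298]

mean_auc = statistics.mean(fold_aucs)
std_auc = statistics.stdev(fold_aucs)  # sample standard deviation (n - 1)

print(f"CV ROC AUC: {mean_auc:.4f} ± {std_auc:.4f}")
```

The mean of the fold scores need not equal the overall out-of-fold AUC exactly, because the overall figure ranks the pooled predictions from all folds together.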
