# Bank Customer Churn — End-to-End ML Pipeline

This notebook implements a **fully reproducible, end-to-end machine learning pipeline**
for customer churn prediction.

**This is the main entry point of the project.**

Running this notebook from top to bottom will:
- load the dataset
- perform preprocessing
- train a baseline model
- evaluate its performance

No other notebooks are required to execute this pipeline.


In [2]:
from __future__ import annotations

# Core
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    classification_report,
    confusion_matrix,
)

# System
from pathlib import Path
import warnings


In [3]:
# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.3f}".format)

# Visualization style
plt.style.use("seaborn-v0_8")
sns.set_context("notebook")

# Warnings (note: this silences every warning, including deprecation notices)
warnings.filterwarnings("ignore")

# Reproducibility
RANDOM_STATE = 42

# Dataset configuration
TARGET_COL = "churn"
ID_COL = "customer_id"


## Data Source & Acquisition

This project uses the **Bank Customer Churn Dataset** from Kaggle.

- Source: https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset
- Description: Customer-level data from a European retail bank,
  commonly used for churn prediction tasks.

The dataset is downloaded programmatically to ensure full reproducibility
in Google Colab.


In [4]:
# KaggleHub (Colab-friendly dataset download)
try:
    import kagglehub  # type: ignore
except ImportError:
    # Install on demand when the package is missing (e.g. a fresh Colab runtime)
    %pip install -q kagglehub
    import kagglehub  # type: ignore


# Download dataset
dataset_dir = Path(
    kagglehub.dataset_download("gauravtopre/bank-customer-churn-dataset")
)

csv_files = sorted(dataset_dir.glob("*.csv"))
assert csv_files, f"No CSV files found in: {dataset_dir}"

DATA_PATH = csv_files[0]

print("Using dataset file:", DATA_PATH.resolve())


Downloading from https://www.kaggle.com/api/v1/datasets/download/gauravtopre/bank-customer-churn-dataset?dataset_version_number=1...
Extracting files...
Using dataset file: /root/.cache/kagglehub/datasets/gauravtopre/bank-customer-churn-dataset/versions/1/Bank Customer Churn Prediction.csv
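The file-selection step above picks the first CSV in alphabetical order and stops if none is found. A minimal sketch of the same logic as a reusable helper, assuming a hypothetical function name `pick_csv` and placeholder file names (neither is part of the original notebook):

```python
from pathlib import Path
import tempfile


def pick_csv(directory: Path) -> Path:
    """Return the alphabetically first CSV in `directory`; raise if there is none."""
    csv_files = sorted(directory.glob("*.csv"))
    if not csv_files:
        # An explicit exception survives `python -O`, unlike a bare assert
        raise FileNotFoundError(f"No CSV files found in: {directory}")
    return csv_files[0]


# Demo on a throwaway directory; the file names are placeholders
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "b.csv").write_text("x\n1\n")
    (tmp_dir / "a.csv").write_text("x\n2\n")
    chosen_name = pick_csv(tmp_dir).name

print("Chosen file:", chosen_name)
```

If a dataset ever ships more than one CSV, sorting keeps the choice deterministic, but it may be safer to select the file by its expected name.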





In [5]:
# Load dataset
df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
display(df.head())


Dataset shape: (10000, 12)


| | customer_id | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## Dataset Structure & Sanity Check

Before building the pipeline, we inspect the dataset structure:
- column names and data types
- presence of missing values
- basic assumptions about the target variable

This step confirms that the pipeline will operate on clean, expected inputs.


In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB
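`df.info()` covers column names, dtypes, and non-null counts. The missing-value and target checks listed above could additionally be made explicit. The following is a sketch on a small made-up frame (`toy` and its values are illustrative stand-ins, not rows to rely on); in the notebook the same checks would run against `df` and `TARGET_COL`.

```python
import pandas as pd

# Toy stand-in for `df`; values are invented for illustration only
toy = pd.DataFrame(
    {
        "credit_score": [619, 608, 502, 699],
        "country": ["France", "Spain", "France", "Spain"],
        "churn": [1, 0, 1, 0],
    }
)
target_col = "churn"

# 1) No missing values anywhere
missing_per_column = toy.isna().sum()
assert missing_per_column.sum() == 0, "Unexpected missing values"

# 2) Target is binary and encoded as 0/1
target_values = set(toy[target_col].unique().tolist())
assert target_values <= {0, 1}, f"Unexpected target values: {target_values}"

# 3) Class balance, which informs the choice of evaluation metrics
churn_rate = toy[target_col].mean()
print(f"Churn rate: {churn_rate:.1%}")
```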


## Removing Identifier Columns

Identifier columns uniquely identify customers but carry no predictive
information that generalizes to new customers. Keeping them may introduce
noise or unintended data leakage.

These columns are removed before feature selection and model training.


In [7]:
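# Candidate identifier columns. Only names actually present in `df` are dropped,
# so the extra entries are harmless if this file does not contain them.
id_columns = [
    ID_COL,
    "CustomerId",
    "RowNumber",
    "Surname",
]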

existing_id_columns = [c for c in id_columns if c in df.columns]

print("Dropping identifier columns:", existing_id_columns)

df = df.drop(columns=existing_id_columns)

df.head()


Dropping identifier columns: ['customer_id']


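| | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |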
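The notebook drops identifiers by name. As a complementary check, one could flag columns whose values are unique on every row, since that is what makes a column an identifier. This is a heuristic sketch, not the notebook's method: the helper name `find_id_candidates` and the toy values are assumptions, and float columns are skipped because continuous measurements are often all-unique without being identifiers.

```python
import pandas as pd

# Toy stand-in for the raw dataset; values are invented for illustration
toy = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "age": [42, 41, 42, 39],
        "estimated_salary": [101348.88, 112542.58, 113931.57, 93826.63],
        "churn": [1, 0, 1, 0],
    }
)


def find_id_candidates(frame: pd.DataFrame) -> list[str]:
    """Return non-float columns that take a distinct value on every row."""
    return [
        col
        for col in frame.columns
        if not pd.api.types.is_float_dtype(frame[col])
        and frame[col].nunique() == len(frame)
    ]


candidates = find_id_candidates(toy)
print("Identifier candidates:", candidates)
```

The result is a list of candidates to review by hand rather than a list to drop automatically.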


## Target Variable Definition

We explicitly separate the target variable from input features
to prevent data leakage and simplify downstream processing.


In [8]:
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

print("Features shape:", X.shape)
print("Target shape:", y.shape)


Features shape: (10000, 10)
Target shape: (10000,)


## Feature Type Identification

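Different feature types require different preprocessing strategies.
We derive numerical and categorical feature groups from the column dtypes
to build a clean and interpretable preprocessing pipeline.
Note that the binary flags `credit_card` and `active_member` are stored as
integers, so they fall into the numerical group.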


In [9]:
numerical_features = (
    X.select_dtypes(include=["int64", "float64"])
    .columns
    .tolist()
)

categorical_features = (
    X.select_dtypes(include=["object"])
    .columns
    .tolist()
)

print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)


Numerical features: ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'credit_card', 'active_member', 'estimated_salary']
Categorical features: ['country', 'gender']
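Because dtype-based selection places 0/1 flags in the numerical group, one possible refinement is to separate binary flags from continuous columns so they can be handled differently if desired. The sketch below uses a made-up frame (`toy_X`) and hypothetical variable names; it illustrates an alternative grouping and is not what the notebook does.

```python
import pandas as pd

# Toy stand-in for `X`; values are invented for illustration
toy_X = pd.DataFrame(
    {
        "credit_score": [619, 608, 502, 699],
        "balance": [0.0, 83807.86, 159660.8, 0.0],
        "credit_card": [1, 0, 1, 0],
        "active_member": [1, 1, 0, 0],
        "country": ["France", "Spain", "France", "France"],
    }
)

numeric_cols = toy_X.select_dtypes(include="number").columns.tolist()

# A numeric column whose values are all 0 or 1 is treated as a binary flag
binary_flags = [
    col for col in numeric_cols
    if set(toy_X[col].unique().tolist()) <= {0, 1}
]
continuous = [col for col in numeric_cols if col not in binary_flags]
categorical = toy_X.select_dtypes(exclude="number").columns.tolist()

print("Continuous:", continuous)
print("Binary flags:", binary_flags)
print("Categorical:", categorical)
```

Scaling a 0/1 flag is harmless for logistic regression, so keeping the flags in the numerical group is a defensible simplification for a baseline.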


## Train / Test Split

The dataset is split into training and test sets **before** any preprocessing
to avoid data leakage.

Stratified sampling is used to preserve the churn ratio in both sets.


In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y,
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (8000, 10)
Test shape: (2000, 10)
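To confirm that stratification preserved the churn ratio, the positive rate can be compared across the full data and both splits. The sketch below does this on synthetic labels with a 20% positive rate (the data and the names `toy_X`, `toy_y` are assumptions for illustration); in the notebook the same comparison would use `y`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 positives (20% positive rate)
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,
)

# With stratify=..., each split keeps the class proportions of the full data
rates = {
    "full": toy_y.mean(),
    "train": y_tr.mean(),
    "test": y_te.mean(),
}
for name, rate in rates.items():
    print(f"{name:>5} positive rate: {rate:.3f}")
```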


## Preprocessing Pipeline

We construct a preprocessing pipeline that applies:
- scaling to numerical features
- one-hot encoding to categorical features

Fitting these transformers on the training data and reusing them on the test
data keeps the transformations consistent across both sets.


In [11]:
# Numerical preprocessing
numerical_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

# Categorical preprocessing
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

preprocessor
