# Bank Customer Churn — Data Preprocessing

## Purpose of this notebook
This notebook prepares the dataset for machine learning models.

The main goals are:
- Load the dataset in a reproducible way
- Define the final feature set
- Split data into train and test sets
- Apply appropriate preprocessing to numerical and categorical features
- Build a clean and reusable preprocessing pipeline

All steps are designed to be reproducible and independent of previous notebooks.


In [19]:
from __future__ import annotations

# Core libraries
import numpy as np
import pandas as pd


# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# System utilities
from pathlib import Path
import warnings

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.3f}".format)

# Visualization style
plt.style.use("seaborn-v0_8")
sns.set_context("notebook")

# Warnings / reproducibility
warnings.filterwarnings("ignore")
RANDOM_STATE = 42

# Dataset config (portable)
TARGET_COL = "churn"
LOCAL_DATA_PATH: Path | None = None  # e.g., Path("../data/raw/bank_churn.csv")


## Data Source & Acquisition

This project uses the **Bank Customer Churn Dataset** from Kaggle:
- Source: [gauravtopre/bank-customer-churn-dataset](https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset)

The dataset describes customers of a European retail bank and is commonly used for churn prediction tasks.

### Reproducibility notes
- In **Google Colab**, we can download the dataset programmatically.
- In a **local environment**, you can set `LOCAL_DATA_PATH` to a CSV on disk (recommended for stable runs without Kaggle auth).


In [20]:
# KaggleHub (Colab-friendly). For local runs, prefer LOCAL_DATA_PATH.
try:
    import kagglehub  # type: ignore
except ImportError:
    # Install on demand (IPython/Colab magic), then retry the import.
    %pip install -q kagglehub
    import kagglehub  # type: ignore


In [21]:
# Resolve data path (LOCAL first, then KaggleHub)
if LOCAL_DATA_PATH is not None:
    data_path = LOCAL_DATA_PATH
else:
    dataset_dir = Path(
        kagglehub.dataset_download("gauravtopre/bank-customer-churn-dataset")
    )
    csv_files = sorted(dataset_dir.glob("*.csv"))
    assert csv_files, f"No CSV files found in: {dataset_dir}"
    data_path = csv_files[0]

print("Using dataset file:", data_path)

Using Colab cache for faster access to the 'bank-customer-churn-dataset' dataset.
Using dataset file: /kaggle/input/bank-customer-churn-dataset/Bank Customer Churn Prediction.csv
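
The resolution cell above guards against an empty download directory with `assert`, which Python skips when run with `-O`. Below is a minimal sketch of the same local-first logic as a function that raises explicitly. `resolve_data_path` and its `download` parameter are hypothetical names introduced for illustration; they are not part of the notebook or of kagglehub.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable, Optional


def resolve_data_path(
    local_path: Optional[Path],
    download: Callable[[], str],
) -> Path:
    """Return the CSV to load: the local file if given, else the first downloaded CSV."""
    if local_path is not None:
        if not local_path.exists():
            raise FileNotFoundError(f"File not found: {local_path}")
        return local_path
    dataset_dir = Path(download())
    # Sort so the choice is deterministic when several CSVs are present.
    csv_files = sorted(dataset_dir.glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in: {dataset_dir}")
    return csv_files[0]


# Demo: a temporary directory stands in for the downloaded dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.csv").write_text("x\n1\n")
    (Path(tmp) / "a.csv").write_text("x\n2\n")
    resolved_name = resolve_data_path(None, lambda: tmp).name
```

In this notebook, `download` could be `lambda: kagglehub.dataset_download("gauravtopre/bank-customer-churn-dataset")`, which keeps the Kaggle call out of the way whenever `LOCAL_DATA_PATH` is set.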


In [22]:
# Load dataset
assert data_path.exists(), f"File not found: {data_path}"

df = pd.read_csv(data_path)

print("Dataset shape:", df.shape)
display(df.head())

Dataset shape: (10000, 12)


(index),customer_id,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,15634602,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Dataset Structure Check

Before modifying the dataset, we inspect its structure and data types.
This serves as a reference point before preprocessing steps are applied.


In [23]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB
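
The `df.info()` output can also serve as a lightweight contract for later runs. The sketch below checks a frame against the column names listed above and reports columns containing nulls. `check_schema` and `EXPECTED_COLUMNS` are hypothetical helpers, and the toy frame is made up to trigger both checks.

```python
from __future__ import annotations

import pandas as pd

# Column names copied from the df.info() output above.
EXPECTED_COLUMNS = {
    "customer_id", "credit_score", "country", "gender", "age", "tenure",
    "balance", "products_number", "credit_card", "active_member",
    "estimated_salary", "churn",
}


def check_schema(frame: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the frame looks as expected."""
    problems = []
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    null_counts = frame.isna().sum()
    with_nulls = null_counts[null_counts > 0]
    if not with_nulls.empty:
        problems.append(f"columns with nulls: {with_nulls.index.tolist()}")
    return problems


# Toy frame: most columns are absent and one value is null.
toy = pd.DataFrame({"customer_id": [1, 2], "churn": [0, None]})
toy_problems = check_schema(toy)
```

Calling `check_schema(df)` right after loading would surface a changed upstream file before any preprocessing runs.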


## Removing Identifier Columns

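Identifier columns uniquely identify customers but carry no generalizable
signal about churn.

Keeping them invites overfitting, because a model can memorize individual IDs,
and may cause unintended leakage if the IDs happen to encode collection order.
They are therefore removed before model training. The candidate list below
also covers identifier names that appear in other versions of this dataset;
only columns actually present are dropped.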


In [24]:
id_columns = [
    "customer_id",
    "CustomerId",
    "RowNumber",
    "Surname"
]

existing_id_columns = [col for col in id_columns if col in df.columns]
print("Dropping identifier columns:", existing_id_columns)

df = df.drop(columns=existing_id_columns)

df.head()


Dropping identifier columns: ['customer_id']


(index),credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
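
The identifier list above is maintained by hand. As a complementary check, the sketch below flags columns whose values are all distinct, which is the defining property of an identifier. `identifier_like_columns` is a hypothetical helper, and the toy frame reuses a few values from the preview above.

```python
import pandas as pd


def identifier_like_columns(frame: pd.DataFrame, threshold: float = 1.0) -> list:
    """Return columns whose share of distinct values is at least `threshold`."""
    n_rows = len(frame)
    return [
        col
        for col in frame.columns
        if n_rows > 0 and frame[col].nunique(dropna=False) / n_rows >= threshold
    ]


toy = pd.DataFrame(
    {
        "customer_id": [15634602, 15647311, 15619304, 15701354],
        "country": ["France", "Spain", "France", "France"],
        "tenure": [2, 1, 8, 1],
    }
)
flagged = identifier_like_columns(toy)
```

Treat the result as a prompt for manual review rather than an automatic drop list: continuous columns such as `balance` or `estimated_salary` can also be nearly unique without being identifiers.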


## Target Variable Definition

We separate the target column `churn` from the features so that it can never
end up among the model inputs, and so that downstream steps can refer to
`X` and `y` directly.


In [25]:
TARGET_COL = "churn"

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

X.shape, y.shape


((10000, 10), (10000,))

## Feature Type Definition

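For model preprocessing, we explicitly define numerical and categorical
feature groups by dtype: integer and float columns are treated as numerical,
object columns as categorical. These definitions are used to build the
preprocessing pipelines and must be independent of the EDA notebook.

Note that this grouping places the binary flags `credit_card` and
`active_member` in the numerical group, so they are standardized together
with the continuous features.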


In [26]:
# X already excludes the target and identifier columns,
# so the dtype selection needs no further filtering.
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

numerical_features, categorical_features


(['credit_score',
  'age',
  'tenure',
  'balance',
  'products_number',
  'credit_card',
  'active_member',
  'estimated_salary'],
 ['country', 'gender'])
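
Selecting by dtype puts 0/1 flags in the same group as continuous features. If that is not wanted, one possible alternative is to split the numeric columns by their observed values. This is a sketch on a made-up toy frame, not what the notebook does; the names `binary_cols` and `continuous_cols` are introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "credit_score": [619, 608, 502, 699],
        "credit_card": [1, 0, 1, 0],
        "active_member": [1, 1, 0, 0],
        "country": ["France", "Spain", "France", "France"],
    }
)

numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Treat numeric columns whose values all lie in {0, 1} as binary flags.
binary_cols = [c for c in numeric_cols if set(toy[c].unique()) <= {0, 1}]
continuous_cols = [c for c in numeric_cols if c not in binary_cols]
```

Whether to scale binary flags is a modeling choice. Standardizing them, as this notebook does, is harmless for many estimators, while leaving them as 0/1 keeps coefficients easier to read.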

## Train / Test Split

We split the dataset into training and test sets before applying any preprocessing.
This prevents data leakage and ensures that preprocessing steps are learned
only from the training data.


In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

X_train.shape, X_test.shape


((8000, 10), (2000, 10))
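
The `stratify=y` argument keeps the class proportions equal across both splits. The sketch below illustrates this on synthetic labels; the 20% positive rate is an arbitrary illustration value, not a statistic of the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 200 positives and 800 negatives.
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Comparing `y_train.mean()` with `y_test.mean()` in the notebook is a quick way to confirm the same property on the real data.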

## Preprocessing Pipeline

We build a preprocessing pipeline that applies different transformations
to numerical and categorical features.

This approach ensures that:
- preprocessing is learned only from the training data
- the same transformations are applied consistently to train and test sets
- the pipeline can be reused directly in model training
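
To make the first two points concrete, here is a small sketch with made-up numbers: the scaler's mean and standard deviation come from the training rows only, and the test row is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler().fit(train)  # statistics are learned from train only
test_scaled = scaler.transform(test)  # test reuses the train mean and std

train_mean = scaler.mean_[0]
train_std = scaler.scale_[0]
```

Calling `fit` or `fit_transform` on the test set instead would recompute the statistics from test data, which is exactly the leakage the pipeline is meant to prevent.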


In [28]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline


In [29]:
# Numerical features: scaling
numerical_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

# Categorical features: one-hot encoding
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ]
)


In [30]:
from sklearn.compose import ColumnTransformer


In [31]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

preprocessor


In [32]:
# Fit preprocessing only on training data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed.shape, X_test_processed.shape


((8000, 13), (2000, 13))
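
The encoder is configured with `handle_unknown="ignore"`, which matters when the test set contains a category absent from the training set. The sketch below uses a made-up `"Italy"` value to show the effect. It relies on the encoder's default sparse output plus `.toarray()` rather than `sparse_output=False`, purely so that it does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"country": ["France", "Germany", "Spain"]})
test = pd.DataFrame({"country": ["Spain", "Italy"]})  # "Italy" is unseen in train

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# The default output is a sparse matrix; densify it for inspection.
encoded = encoder.transform(test).toarray()
categories = encoder.categories_[0].tolist()
```

The unseen category is encoded as a row of zeros instead of raising an error, so the output width stays fixed at the number of categories learned during `fit`.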

## Feature Names After Preprocessing

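After scaling and one-hot encoding, the 10 input columns become 13: the
8 numerical features keep their names, while `country` and `gender` expand
into 3 and 2 indicator columns respectively. We extract the final feature
names to improve transparency and interpretability of downstream models.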


In [33]:
# Get numerical feature names (unchanged)
num_feature_names = numerical_features

# Get categorical feature names after one-hot encoding
cat_feature_names = (
    preprocessor
    .named_transformers_["cat"]
    .named_steps["encoder"]
    .get_feature_names_out(categorical_features)
    .tolist()
)

# Final feature list
feature_names = num_feature_names + cat_feature_names

len(feature_names), feature_names[:10]


(13,
 ['credit_score',
  'age',
  'tenure',
  'balance',
  'products_number',
  'credit_card',
  'active_member',
  'estimated_salary',
  'country_France',
  'country_Germany'])

In [34]:
X_train_df = pd.DataFrame(
    X_train_processed,
    columns=feature_names,
    index=X_train.index
)

X_test_df = pd.DataFrame(
    X_test_processed,
    columns=feature_names,
    index=X_test.index
)

X_train_df.head()


(index),credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,country_France,country_Germany,country_Spain,gender_Female,gender_Male
2151,1.059,1.715,0.685,-1.226,-0.91,0.641,-1.03,1.042,1.0,0.0,0.0,0.0,1.0
8392,0.914,-0.66,-0.696,0.413,-0.91,0.641,-1.03,-0.624,0.0,1.0,0.0,0.0,1.0
5006,1.079,-0.185,-1.732,0.602,0.809,0.641,0.971,0.308,0.0,1.0,0.0,1.0,0.0
4117,-0.929,-0.185,-0.006,-1.226,0.809,0.641,-1.03,-0.29,1.0,0.0,0.0,0.0,1.0
7182,0.427,0.955,0.339,0.548,0.809,-1.56,0.971,0.135,0.0,1.0,0.0,0.0,1.0


## Save Preprocessing Artifacts

At this stage we persist all preprocessing outputs required
for downstream modeling:

- processed train / test datasets
- fitted preprocessing pipeline

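Artifacts are written to `/content/artifacts` in the Colab runtime. This
directory is ephemeral: it disappears when the runtime is recycled, and it is
not committed to version control, so copy or download the files if they are
needed beyond the current session.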
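
Because the artifact location is Colab-specific, a local run needs a different directory. The sketch below shows one way to choose it, under the assumption that a Colab root directory exists only on Colab. `resolve_artifacts_dir` and both of its parameters are hypothetical names, and temporary directories stand in for the real paths.

```python
import tempfile
from pathlib import Path


def resolve_artifacts_dir(colab_root: Path, fallback_root: Path) -> Path:
    """Use <colab_root>/artifacts when colab_root exists, else <fallback_root>/artifacts."""
    root = colab_root if colab_root.is_dir() else fallback_root
    out = root / "artifacts"
    out.mkdir(parents=True, exist_ok=True)
    return out


with tempfile.TemporaryDirectory() as tmp:
    missing_colab = Path(tmp) / "content"  # absent, as on a local machine
    local_root = Path(tmp) / "project"
    local_root.mkdir()
    chosen = resolve_artifacts_dir(missing_colab, local_root)
    chosen_parent_name = chosen.parent.name
    chosen_exists = chosen.is_dir()
```

In the notebook this could be called as `resolve_artifacts_dir(Path("/content"), Path.cwd())`, which leaves the Colab behavior unchanged.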



In [38]:
from pathlib import Path
import joblib

# Use a stable, runtime-level directory in Colab
ARTIFACTS_DIR = Path("/content/artifacts")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

print("Saving artifacts to:", ARTIFACTS_DIR.resolve())

# Save preprocessing pipeline
joblib.dump(preprocessor, ARTIFACTS_DIR / "preprocessor.joblib")

# Save processed datasets
X_train_df.to_parquet(ARTIFACTS_DIR / "X_train.parquet")
X_test_df.to_parquet(ARTIFACTS_DIR / "X_test.parquet")

# Reuse TARGET_COL so the saved column name stays in sync with the config.
y_train.to_frame(TARGET_COL).to_parquet(ARTIFACTS_DIR / "y_train.parquet")
y_test.to_frame(TARGET_COL).to_parquet(ARTIFACTS_DIR / "y_test.parquet")

print("\nSaved files:")
for p in sorted(ARTIFACTS_DIR.iterdir()):
    print(" -", p.name)




Saving artifacts to: /content/artifacts

Saved files:
 - X_test.parquet
 - X_train.parquet
 - preprocessor.joblib
 - y_test.parquet
 - y_train.parquet
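
Before handing the saved preprocessor to a modeling notebook, it is worth confirming that it survives a save/load round trip. The sketch below does this with a small fitted `StandardScaler` standing in for the full `preprocessor`, and a temporary directory instead of the artifacts directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

fitted = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
sample = np.array([[4.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.joblib"
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object should transform new data exactly like the original.
same_output = np.allclose(fitted.transform(sample), restored.transform(sample))
```

Since joblib files are pickle-based, they should only be loaded from trusted sources and with a scikit-learn version compatible with the one used for saving.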
