# Bank Customer Churn: Exploratory Data Analysis (EDA)

## Project Context
Customer churn is a critical challenge for the banking industry because retaining an existing customer is generally cheaper than acquiring a new one.

This notebook represents the **first stage** of an end-to-end machine learning project.
Its goal is to explore the data, understand customer characteristics, and identify potential drivers of churn before building predictive models.


## Objectives of this notebook
- Understand the structure and origin of the dataset
- Load the data in a fully reproducible way
- Perform an initial inspection of features and target variable
- Identify potential data quality and preprocessing issues


In [8]:
# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# System utilities
import os

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.3f}".format)

# Visualization style
plt.style.use("seaborn-v0_8")
sns.set_context("notebook")

# Reproducibility
RANDOM_STATE = 42



## Dataset Origin

The dataset used in this project is the **Bank Customer Churn Dataset**, originally published on Kaggle:

https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset

The data describes customers of a European retail bank and is commonly used for
customer churn prediction and customer analytics tasks.

To keep the analysis reproducible, the dataset is downloaded programmatically
with `kagglehub`, Kaggle's official Python client, rather than read from a manually saved copy.
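
A programmatic download fixes where the data comes from, but it does not by itself guarantee that the same bytes arrive on every run, since a hosted dataset can be updated. One way to make this checkable is to record a checksum of the file once and compare it on later runs. The sketch below is an assumption and not part of the original notebook: `file_sha256` is a hypothetical helper, and the temporary demo file stands in for the downloaded CSV.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file (hypothetical content, not the real dataset).
# In the notebook, the path of the downloaded CSV would be passed instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.csv"
    demo_file.write_bytes(b"customer_id,churn\n1,0\n")
    demo_digest = file_sha256(demo_file)

print(demo_digest)
```

Storing the digest from a first run and asserting equality in later runs would flag a silently changed file before any analysis is repeated on it.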


In [9]:
# Install KaggleHub for programmatic dataset access
!pip install -q kagglehub


In [10]:
import kagglehub

# Download dataset from Kaggle
dataset_path = kagglehub.dataset_download(
    "gauravtopre/bank-customer-churn-dataset"
)

print("Dataset downloaded to:", dataset_path)
os.listdir(dataset_path)


Using Colab cache for faster access to the 'bank-customer-churn-dataset' dataset.
Dataset downloaded to: /kaggle/input/bank-customer-churn-dataset


['Bank Customer Churn Prediction.csv']

In [12]:
# Find CSV file in the dataset directory
csv_files = [f for f in os.listdir(dataset_path) if f.endswith(".csv")]

assert len(csv_files) == 1, f"Expected exactly one CSV file, found: {csv_files}"

DATA_PATH = os.path.join(dataset_path, csv_files[0])
print("Using dataset file:", DATA_PATH)

# Load dataset
df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
df.head()


Using dataset file: /kaggle/input/bank-customer-churn-dataset/Bank Customer Churn Prediction.csv
Dataset shape: (10000, 12)


|   | customer_id | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## Dataset Structure and Data Types

Before any transformations, we inspect the structure of the dataset:
- number of rows and columns
- data types
- presence of missing values

This step helps identify potential preprocessing requirements.
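
Beyond `df.info()`, a few explicit checks can make the data quality picture more concrete. The sketch below is illustrative only: `quality_report` is a hypothetical helper, and the toy frame is invented so that each check has something to find. It counts missing values per column, fully duplicated rows, and whether the identifier column is unique.

```python
import pandas as pd


def quality_report(frame, id_col):
    """Summarize missing values, duplicate rows, and ID uniqueness."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "id_is_unique": bool(frame[id_col].is_unique),
    }


# Hypothetical toy frame, not the real dataset:
# one missing balance and one repeated customer_id
toy = pd.DataFrame(
    {
        "customer_id": [1, 2, 2],
        "balance": [0.0, None, 150.5],
        "churn": [1, 0, 0],
    }
)

report = quality_report(toy, id_col="customer_id")
print(report)
```

Applied to the real data, the same call would be `quality_report(df, id_col="customer_id")`, assuming `df` is loaded as above.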


In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB


## Dataset Overview

Based on the dataset structure, we observe:
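- 10,000 customer records
- 12 columns in total: one identifier (`customer_id`), ten candidate features, and the target (`churn`)
- No missing values (every column reports 10,000 non-null entries)
- A mix of numerical (`int64`, `float64`) and categorical (`object`) columns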

Before proceeding further, we explicitly identify feature types
to structure the exploratory analysis.



In [14]:
TARGET_COL = "churn"

# Numerical features (excluding the target and the customer_id identifier)
numerical_features = [
    col for col in df.select_dtypes(include=["int64", "float64"]).columns
    if col != TARGET_COL and col != "customer_id"
]

# Categorical features
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()

numerical_features, categorical_features


(['credit_score',
  'age',
  'tenure',
  'balance',
  'products_number',
  'credit_card',
  'active_member',
  'estimated_salary'],
 ['country', 'gender'])
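
Selecting by dtype groups every integer column together, so `credit_card` and `active_member`, which hold only 0 and 1 in the preview rows, land in the numerical list next to genuinely continuous columns such as `credit_score`. If it is useful to treat such flags separately, they can be detected by their set of distinct values. This is a minimal sketch under that assumption: `split_binary_flags` is a hypothetical helper, and the toy frame reuses three rows from the preview.

```python
import pandas as pd


def split_binary_flags(frame, columns):
    """Split columns into 0/1 flags and all other columns."""
    binary, other = [], []
    for col in columns:
        # Distinct non-missing values observed in the column
        values = set(frame[col].dropna().unique())
        if values <= {0, 1}:
            binary.append(col)
        else:
            other.append(col)
    return binary, other


# Values taken from the first three preview rows
toy = pd.DataFrame(
    {
        "credit_score": [619, 608, 502],
        "credit_card": [1, 0, 1],
        "active_member": [1, 1, 0],
    }
)

binary_flags, non_binary = split_binary_flags(toy, toy.columns)
print(binary_flags, non_binary)
```

A column that happens to contain a single value of 0 or 1 would also be reported as a flag, so the result is worth checking against the column definitions.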

## Target Variable Distribution

We now analyze the distribution of the target variable.
This step is important to assess potential class imbalance
and to guide metric selection in later stages.


In [15]:
churn_counts = df[TARGET_COL].value_counts()
churn_ratio = df[TARGET_COL].value_counts(normalize=True)

churn_counts, churn_ratio


(churn
 0    7963
 1    2037
 Name: count, dtype: int64,
 churn
 0   0.796
 1   0.204
 Name: proportion, dtype: float64)

### Initial Observation

The dataset shows a moderate class imbalance: 20.4% of customers churned (2,037 of 10,000), about one churner for every four retained customers.
This is typical for churn problems and will be taken into account during model evaluation and threshold selection.
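
To make the effect of this imbalance on metric choice concrete, the sketch below scores a trivial classifier that always predicts the most frequent class, using the counts reported above. `majority_baseline` is a hypothetical helper written for illustration, not part of the original notebook.

```python
def majority_baseline(counts, positive_label=1):
    """Score a classifier that always predicts the most frequent class."""
    majority_label = max(counts, key=counts.get)
    total = sum(counts.values())
    accuracy = counts[majority_label] / total
    # Every positive is recalled only if the positive class is the one predicted
    positive_recall = 1.0 if majority_label == positive_label else 0.0
    return {
        "predicts": majority_label,
        "accuracy": accuracy,
        "positive_recall": positive_recall,
    }


# Class counts from the value_counts() output above
observed_counts = {0: 7963, 1: 2037}

baseline = majority_baseline(observed_counts)
print(baseline)
```

The baseline reaches an accuracy of 0.7963 while identifying no churners at all, which is why accuracy alone is a weak yardstick here and class-sensitive metrics such as recall or precision on the churn class are worth tracking.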
