# 01 – Exploratory Data Analysis
## Bank Customer Churn

**Objective:** Understand distributions, missing values, and initial churn patterns.

**Data source:** [Kaggle – Churn for Bank Customers](https://www.kaggle.com/datasets/mathchi/churn-for-bank-customers)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Path is relative to the notebooks/ folder
df = pd.read_csv('../data/raw/Churn_Modelling.csv')
df.head()

In [None]:
# Shape, dtypes, missing values
print(df.shape)
print(df.dtypes)
df.isnull().sum()
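
The raw counts from `df.isnull().sum()` are easier to read next to percentages. Below is a minimal sketch of a helper for that; the name `missing_summary` and its output column names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and percentage, most-missing columns first.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        # Share of rows with a missing value in each column, in percent
        "pct_missing": (n_missing / len(frame) * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)
```

Calling `missing_summary(df)` in a cell would show one row per column, which makes it quick to spot columns that need attention during cleaning.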

In [None]:
# Target distribution (Exited = churn)
# print() is needed here: only the last expression of a cell is displayed automatically
print(df['Exited'].value_counts(normalize=True))
sns.countplot(data=df, x='Exited')
plt.title('Churn (Exited) distribution')
plt.show()

In [None]:
# Churn rate by Geography - bar chart
df.groupby('Geography')['Exited'].mean().mul(100).plot(kind='bar')
plt.title('Churn rate by Geography (%)')
plt.ylabel('Churn %')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap (numeric columns)
num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']
plt.figure(figsize=(10, 8))  # larger canvas so the 9x9 annotations stay legible
sns.heatmap(df[num_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation matrix (numeric features and Exited)')
plt.tight_layout()
plt.show()
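
The heatmap shows every pairwise correlation, but for churn the `Exited` column is the one of most interest. One way to read it off directly is to pull that column out and rank it by absolute value. This is a sketch under assumptions: `target_correlations` is a hypothetical helper name, and it uses the pandas default (Pearson) correlation, which captures only linear association with the 0/1 target.

```python
import pandas as pd


def target_correlations(frame: pd.DataFrame, target: str = "Exited") -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    ordered by absolute strength (strongest first).

    Hypothetical helper for illustration only.
    """
    # Keep numeric columns only so string columns do not break corr()
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Reorder by magnitude while keeping the sign of each correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

For example, `target_correlations(df[num_cols])` would return a Series that can be plotted with `.plot(kind='barh')`.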

In [None]:
# Churn rate by NumOfProducts
df.groupby('NumOfProducts')['Exited'].agg(['sum', 'count']).assign(
    churn_rate_pct=lambda x: (x['sum'] / x['count'] * 100).round(2)
)
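
The same sum/count/rate pattern applies to any categorical column, so it can be wrapped in a small function. The sketch below assumes the target is a 0/1 column; `churn_rate_by` and the output column names `churned` and `customers` are illustrative choices, not names from the dataset.

```python
import pandas as pd


def churn_rate_by(frame: pd.DataFrame, column: str, target: str = "Exited") -> pd.DataFrame:
    """Churned count, group size, and churn rate (%) for each value of `column`.

    Hypothetical helper; assumes `target` is coded 0/1.
    """
    return (
        frame.groupby(column)[target]
        .agg(churned="sum", customers="count")
        .assign(
            churn_rate_pct=lambda x: (x["churned"] / x["customers"] * 100).round(2)
        )
    )
```

Usage would look like `churn_rate_by(df, 'Geography')` or `churn_rate_by(df, 'NumOfProducts')`. Keeping the `customers` column next to the rate helps flag groups that are too small for the percentage to be reliable.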

**Next:** `02_data_cleaning.ipynb` – cleaning, feature engineering, train/test split.