# Customer Churn Analysis (Banking Dataset)

## Executive Summary
In this analysis I explore customer churn (`Exited`) to identify the main patterns and high-risk segments.  
Goal: produce actionable insights and practical recommendations.

Note: after the MVP, I plan to add a machine learning part that predicts the `Exited` value for a given customer.

**Key outputs**
- Overall churn rate
- Top churn drivers (by segment comparison)
- 2–3 high-risk segments
- 3 actionable recommendations + how to measure impact


### Importing the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Reading the Dataset

In [3]:
df = pd.read_csv('../data/churn.csv')

## 1) Defining context

### Business question
Which customer segments are more likely to churn, and what patterns can explain churn behavior?

### Target definition
`Exited`  
- 1 = customer churned  
- 0 = customer retained

### Unit of analysis
1 row = 1 customer

### Scope
This project focuses on:
- data checks & data quality
- exploratory analysis (univariate / bivariate / multivariate)
- segmentation and business recommendations  
- prediction of future customer churn with a machine learning model


## 2) Dataset sanity check

### What I check
- Dataset size (rows/columns)
- Column types and basic structure
- Duplicates (row-level and customer ID if available)
- Target distribution (churn rate baseline)

### Notes
- Identifier-like columns (e.g., `RowNumber`, `CustomerId`) are used only for checks and will not be treated as analytical features.


In [None]:
#df.head()
#df.shape
#df.info()
df.describe()
#df["Exited"].value_counts(normalize=True)
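The checks listed above can be gathered into one small helper. This is a minimal sketch: `sanity_report` is a hypothetical helper name, and `toy` is a made-up stand-in for the real data, so the numbers are only illustrative.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5],
    "CustomerId": [101, 102, 103, 103, 105],
    "Age": [42, 35, 51, 51, 29],
    "Exited": [1, 0, 1, 1, 0],
})

def sanity_report(df, id_col="CustomerId", target="Exited"):
    """Collect the basic structural checks in one dictionary (hypothetical helper)."""
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        # Fully identical rows
        "duplicate_rows": int(df.duplicated().sum()),
        # Same customer ID appearing more than once
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        # Mean of a 0/1 target is the churn rate
        "churn_rate": float(df[target].mean()),
    }

report = sanity_report(toy)
print(report)
```

On the real data the same call would be `sanity_report(df)`, assuming the column names match those used in this notebook.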

## 3) Data quality

### Missing values
- Check missing values per column (%)
- Decide whether missing values require imputation or exclusion

### Valid ranges & constraints (quick checks)
- Age: reasonable range (e.g., 18+)
- Tenure: expected bounds (e.g., 0–10)
- CreditScore: typical bounds (e.g., 300–850)
- NumOfProducts: expected bounds (e.g., 1–4)
- Balance / EstimatedSalary: non-negative values

### Outliers & skew
Identify variables with heavy tails or extreme values (often Balance/Salary).
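A possible way to run the data quality checks above in one pass. The bounds come from the bullet list in this section; the helper name `count_out_of_range` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in with a few deliberately invalid values (made up for illustration).
toy = pd.DataFrame({
    "Age": [42, 17, 51, 38],
    "Tenure": [2, 11, 5, 7],
    "CreditScore": [619, 850, 299, 700],
    "NumOfProducts": [1, 2, 5, 1],
    "Balance": [0.0, 50000.0, -10.0, np.nan],
})

# Expected (min, max) per column; None means "no bound on that side".
bounds = {
    "Age": (18, None),
    "Tenure": (0, 10),
    "CreditScore": (300, 850),
    "NumOfProducts": (1, 4),
    "Balance": (0, None),
}

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100

def count_out_of_range(df, bounds):
    """Count non-missing values outside the expected bounds (hypothetical helper)."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = df[col].dropna()
        bad = pd.Series(False, index=values.index)
        if low is not None:
            bad |= values < low
        if high is not None:
            bad |= values > high
        counts[col] = int(bad.sum())
    return pd.Series(counts)

violations = count_out_of_range(toy, bounds)

# Skewness as a quick numeric hint of heavy tails (positive = long right tail).
skewness = toy.skew(numeric_only=True)

print(missing_pct, violations, skewness, sep="\n\n")
```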


## 4) Target analysis (baseline)

### Overall churn rate
Compute the baseline churn rate and comment on class balance:
- Is churn rare (<10%), moderate (10–30%), or high (>30%)?
- Why class balance matters (interpretation of segments and later modeling).
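The baseline can be computed directly from the target column. The sketch below uses a made-up `Exited` series and a hypothetical `classify_churn_rate` helper that applies the thresholds listed above (the 10% and 30% boundaries are assigned to "moderate" here, which is a choice, not something the thresholds dictate).

```python
import pandas as pd

def classify_churn_rate(rate):
    """Map a churn rate (0-1) to the rare / moderate / high labels used above."""
    if rate < 0.10:
        return "rare"
    if rate <= 0.30:
        return "moderate"
    return "high"

# Toy target column (made up): 2 churned customers out of 10.
exited = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="Exited")

baseline = exited.mean()             # share of churned customers
label = classify_churn_rate(baseline)

# Accuracy of always predicting the majority class: the bar any later model has to beat.
majority_accuracy = max(baseline, 1 - baseline)

print(f"baseline churn = {baseline:.1%} ({label}), majority-class accuracy = {majority_accuracy:.1%}")
```

The last line illustrates why class balance matters: with imbalanced classes, a model can look accurate while never identifying a single churner.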


## 5) Univariate analysis

### Numerical variables
For each key numerical variable (e.g., Age, Tenure, CreditScore, Balance, EstimatedSalary):
- distribution (histogram)
- spread/outliers (boxplot)
- short note: “what stands out?”

### Categorical variables
For each categorical/binary variable (e.g., Geography, Gender, HasCrCard, IsActiveMember, NumOfProducts):
- frequency table (%)
- short note: “dominant groups / imbalance?”
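As a numeric companion to the histograms, boxplots, and frequency tables described above, here is a small sketch on made-up data. The 1.5 × IQR rule is the same convention a standard boxplot uses to flag outliers; the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in (made up for illustration).
toy = pd.DataFrame({
    "Age": [25, 31, 38, 40, 44, 52, 61, 79],
    "Geography": ["France", "France", "Spain", "Germany",
                  "France", "Spain", "France", "Germany"],
})

# Numerical variable: summary statistics and skewness.
age_summary = toy["Age"].describe()
age_skew = toy["Age"].skew()

# Outliers according to the 1.5 * IQR rule.
q1, q3 = toy["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["Age"] < q1 - 1.5 * iqr) | (toy["Age"] > q3 + 1.5 * iqr)]

# Categorical variable: frequency table in percent.
freq_pct = toy["Geography"].value_counts(normalize=True).mul(100).round(1)

print(age_summary, f"skew = {age_skew:.2f}", f"outliers: {len(outliers)}", freq_pct, sep="\n\n")
```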


## 6) Bivariate analysis with target (core section)

### Goal
Quantify churn differences across groups:
- churn rate by category
- churn rate across binned numerical variables (e.g., Age groups)

### What I produce
For each key feature:
- churn rate per group
- group size (important: avoid misleading small groups)
- quick takeaway: “higher/lower churn than baseline?”

### Examples of comparisons
- Churn by Geography
- Churn by IsActiveMember
- Churn by NumOfProducts
- Churn by AgeGroup (binned)
- Churn by Balance presence (Balance = 0 vs > 0)
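The comparisons above all follow the same pattern: churn rate per group, group size, and the gap to the baseline. A minimal sketch, assuming a hypothetical helper `churn_by_group`, made-up toy data, and illustrative age bin edges:

```python
import pandas as pd

# Toy stand-in (made up for illustration).
toy = pd.DataFrame({
    "Geography": ["France", "France", "France", "France",
                  "Germany", "Germany", "Spain", "Spain"],
    "Age": [25, 33, 47, 58, 41, 52, 29, 63],
    "Exited": [0, 0, 1, 0, 1, 1, 0, 1],
})

def churn_by_group(df, by, target="Exited"):
    """Churn rate, group size, and difference vs the overall baseline per group."""
    baseline = df[target].mean()
    out = df.groupby(by, observed=True)[target].agg(churn_rate="mean", n="size")
    out["diff_vs_baseline"] = out["churn_rate"] - baseline
    return out.sort_values("churn_rate", ascending=False)

# Categorical feature.
by_geo = churn_by_group(toy, "Geography")

# Numerical feature, binned first (bin edges are an assumption).
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[17, 30, 45, 60, 100],
    labels=["18-30", "31-45", "46-60", "61+"],
)
by_age = churn_by_group(toy, "AgeGroup")

print(by_geo, by_age, sep="\n\n")
```

Keeping `n` next to the rate makes it easy to spot groups that are too small to support a conclusion.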


## 7) Multivariate analysis (segment discovery)

### Goal
Find practical high-risk segments using combinations of variables.

### Approach
Create 2–3 simple cross-segment views (pivot tables), such as:
- Geography × IsActiveMember → churn rate
- AgeGroup × IsActiveMember → churn rate
- NumOfProducts × HasBalance → churn rate

### Output
A short list of high-risk segments with:
- churn rate
- segment size
- why this segment might churn (hypothesis)
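One way to build such a cross-segment view and turn it into a ranked list. This is a sketch on made-up data; `min_n` is a hypothetical minimum segment size, chosen very small here only because the toy frame has eight rows.

```python
import pandas as pd

# Toy stand-in (made up for illustration).
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany",
                  "Germany", "Spain", "France", "Germany"],
    "IsActiveMember": [1, 0, 0, 0, 1, 1, 0, 1],
    "Exited": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Pivot view: churn rate for each Geography x IsActiveMember cell.
rate = toy.pivot_table(
    index="Geography", columns="IsActiveMember", values="Exited", aggfunc="mean"
)

# Long view with the segment size next to the rate.
segments = (
    toy.groupby(["Geography", "IsActiveMember"])["Exited"]
    .agg(churn_rate="mean", n="size")
)

# Keep only segments large enough to trust, then rank by churn rate.
min_n = 2  # hypothetical threshold; use a much larger value on real data
high_risk = segments[segments["n"] >= min_n].sort_values("churn_rate", ascending=False)

print(rate, high_risk, sep="\n\n")
```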


## 8) Key findings (final)

Write 5–7 findings max. Each finding must include:
- the segment/variable
- the churn rate (or difference vs baseline)
- a one-line interpretation

**Template**
1) **Finding:** …  
   **Evidence:** churn = __% vs baseline __% (n=__)  
   **Interpretation:** …

2) **Finding:** …  
   **Evidence:** …  
   **Interpretation:** …


## 9) Recommendations (business actions)

Provide 3 practical actions tied to findings.

For each recommendation include:
- **Who** (which segment)
- **What** (action)
- **Why** (based on findings)
- **How to measure** (metrics before/after)

**Template**
1) **Segment:** …  
   **Action:** …  
   **Rationale:** …  
   **Measure:** churn rate in this segment, retention at 30/60/90 days, etc.


## 10) Limitations & next steps

### Limitations
- This is an observational dataset: correlation ≠ causation
- Some variables may be proxies (e.g., geography may correlate with pricing/product differences)
- Snapshot data: churn behavior may change over time

### Next steps
- Optional: add a simple, interpretable baseline model (logistic regression) to validate the findings
- Build a small dashboard (Power BI) to monitor churn by segment over time (if data available)


### Splitting the dataset into independent variables (X) and dependent variable (y)

In [17]:
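# Features: every column except the target (identifier-like columns are still included here).
X = df.drop("Exited", axis=1)
# Target: 1 = churned, 0 = retained.
y = df["Exited"]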

### Encoding Categorical Data: Geography and Gender


In [18]:
categorical_cols = ["Geography", "Gender"]
numerical_cols = X.drop(columns=categorical_cols).columns

In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop="first", sparse_output=False), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)
X_clean = preprocessor.fit_transform(X)

In [30]:
# One-hot encoded column names
ohe_feature_names = preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_cols)

# All feature names: one-hot columns first, then the passthrough numerical columns
all_feature_names = list(ohe_feature_names) + list(numerical_cols)

In [32]:
X_df = pd.DataFrame(X_clean, columns=all_feature_names)

In [33]:
X_df.head()
X_df.info()

<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Geography_Germany  10000 non-null  object
 1   Geography_Spain    10000 non-null  object
 2   Gender_Male        10000 non-null  object
 3   RowNumber          10000 non-null  object
 4   CustomerId         10000 non-null  object
 5   Surname            10000 non-null  str   
 6   CreditScore        10000 non-null  object
 7   Age                10000 non-null  object
 8   Tenure             10000 non-null  object
 9   Balance            10000 non-null  object
 10  NumOfProducts      10000 non-null  object
 11  HasCrCard          10000 non-null  object
 12  IsActiveMember     10000 non-null  object
 13  EstimatedSalary    10000 non-null  object
dtypes: object(13), str(1)
memory usage: 1.1+ MB
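In the output above, every numeric column has dtype `object`. A likely cause is that the text column `Surname` is passed through together with the numbers, so the transformer returns a single mixed array that NumPy can only store as `object`. The sketch below reproduces the effect on made-up data, without scikit-learn, and shows two possible fixes; the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in (made up for illustration).
toy = pd.DataFrame({
    "Surname": ["Rossi", "Smith", "Lee"],
    "CreditScore": [619, 608, 502],
    "Balance": [0.0, 1200.5, 300.0],
})

# One array holding both text and numbers falls back to object dtype,
# and a DataFrame built from it keeps the numeric columns as object.
mixed = pd.DataFrame(toy.to_numpy(), columns=toy.columns)

# Option A: drop the text column before converting, so the array stays numeric.
numeric_cols = ["CreditScore", "Balance"]
numeric_only = pd.DataFrame(toy[numeric_cols].to_numpy(), columns=numeric_cols)

# Option B: convert the object columns back to numbers afterwards.
restored = mixed.drop(columns=["Surname"]).apply(pd.to_numeric)

print(mixed.dtypes, numeric_only.dtypes, restored.dtypes, sep="\n\n")
```

Option A also matches the note in section 2 that identifier-like columns are not meant to be analytical features.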
