<h2>Project: Beta Bank Customer Retention Strategy</h2>

<h3>Introduction</h3>
<p>
Beta Bank is experiencing gradual customer churn month after month. Since retaining existing customers is cheaper than acquiring new ones, the goal of this project is to predict whether a customer will leave the bank soon using historical client behavior and contract termination data.
</p>

<h3>Project Objective</h3>
<p>
Build a classification model that achieves the <b>maximum possible F1 score</b>. To pass the project, the model must reach an <b>F1 score of at least 0.59</b> on the test set. In addition, we will compute <b>AUC-ROC</b> and compare it with F1 to evaluate overall classification quality.
</p>
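<p>
To make the two metrics concrete, here is a minimal pure-Python sketch of how F1 and AUC-ROC are computed. The labels and scores are toy values invented for illustration, and the helper names <code>f1_score_from_labels</code> and <code>auc_roc_from_scores</code> are hypothetical; in practice a library implementation would normally be used instead.
</p>

```python
def f1_score_from_labels(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc_from_scores(y_true, y_score):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted churn probabilities (invented for illustration only)
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8, 0.1, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # F1 depends on this 0.5 threshold

f1 = f1_score_from_labels(y_true, y_pred)
auc = auc_roc_from_scores(y_true, y_score)
print(f"F1 = {f1:.3f}, AUC-ROC = {auc:.3f}")  # F1 = 0.667, AUC-ROC = 0.867
```

<p>
F1 is computed from hard predictions at a single threshold, while AUC-ROC uses the raw scores across all thresholds, which is why the two numbers can differ noticeably for the same model.
</p>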

<h3>Project Workflow</h3>
<ol>
  <li>
    <b>Data Preparation:</b> Load <code>data/Churn.csv</code>, review the dataset structure using <code>info()</code> and <code>describe()</code>, and verify overall data quality.
  </li>
  <li>
    <b>Preprocessing:</b> Standardize column names (lowercase / snake_case), check and handle missing values and duplicates, encode categorical variables (e.g., <code>Geography</code>, <code>Gender</code>) using One-Hot Encoding or Label Encoding, and scale numerical features (e.g., <code>Balance</code>, <code>EstimatedSalary</code>) while avoiding data leakage.
  </li>
  <li>
    <b>Exploratory Data Analysis (EDA):</b> Compare feature distributions for churned customers (<code>Exited=1</code>) vs. retained customers (<code>Exited=0</code>), examine churn rates across groups (e.g., by country and gender), and check correlations to identify potential churn drivers.
  </li>
  <li>
    <b>Class Imbalance Check:</b> Evaluate the balance of the target class (<code>Exited</code>). Train a baseline model without addressing imbalance and document the impact on performance (especially F1).
  </li>
  <li>
    <b>Model Improvement:</b> Apply at least two imbalance-handling methods (e.g., class weights, upsampling, downsampling). Train and tune at least two algorithms (e.g., Logistic Regression, Decision Tree, Random Forest), and select the best model based on F1 score.
  </li>
  <li>
    <b>Final Testing & Conclusions:</b> Evaluate the final model on the test set, calculate <b>F1</b> and <b>AUC-ROC</b>, and summarize the most important factors that predict customer churn with clear, business-relevant insights.
  </li>
</ol>
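<p>
The preprocessing step can be sketched roughly as follows, using pandas only. The rows are toy values invented for illustration, <code>to_snake_case</code> is a hypothetical helper, and the 75/25 split and <code>random_state</code> are arbitrary assumptions. The point of the sketch is the order of operations: split first, then compute scaling statistics on the training rows only.
</p>

```python
import re
import pandas as pd

# Toy rows using a few of the dataset's column names (values invented for illustration)
df = pd.DataFrame({
    "CreditScore": [620, 610, 500, 700, 850, 640, 820, 380],
    "Geography": ["France", "Spain", "France", "Germany", "Spain", "Spain", "France", "Germany"],
    "Gender": ["Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female"],
    "EstimatedSalary": [101000.0, 112000.0, 114000.0, 94000.0, 79000.0, 150000.0, 10000.0, 119000.0],
    "Exited": [1, 0, 1, 0, 0, 1, 0, 1],
})


def to_snake_case(name):
    """Insert '_' before each inner capital, then lowercase: CreditScore -> credit_score."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


df.columns = [to_snake_case(col) for col in df.columns]

# One-hot encode categoricals; drop_first removes one redundant column per feature
df = pd.get_dummies(df, columns=["geography", "gender"], drop_first=True, dtype=int)

# Split first, so that scaling statistics come from the training rows only
train = df.sample(frac=0.75, random_state=12345)
test = df.drop(train.index)

numeric = ["credit_score", "estimated_salary"]
train = train.astype({col: float for col in numeric})
test = test.astype({col: float for col in numeric})

mean, std = train[numeric].mean(), train[numeric].std()
train[numeric] = (train[numeric] - mean) / std
test[numeric] = (test[numeric] - mean) / std  # reuses training mean/std to avoid leakage

print(list(df.columns))
```

<p>
Fitting the scaler on the full dataset before splitting would let test-set statistics influence training, which is the leakage the workflow warns about.
</p>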

<h3>Data Description</h3>
<ul>
  <li><b>RowNumber</b> — row index in the dataset</li>
  <li><b>CustomerId</b> — unique customer identifier</li>
  <li><b>Surname</b> — surname</li>
  <li><b>CreditScore</b> — credit score</li>
  <li><b>Geography</b> — country of residence</li>
  <li><b>Gender</b> — gender</li>
  <li><b>Age</b> — age</li>
  <li><b>Tenure</b> — period of maturation for a customer’s fixed deposit (years)</li>
  <li><b>Balance</b> — account balance</li>
  <li><b>NumOfProducts</b> — number of banking products used by the customer</li>
  <li><b>HasCrCard</b> — whether the customer has a credit card (1 = yes, 0 = no)</li>
  <li><b>IsActiveMember</b> — whether the customer is an active member (1 = yes, 0 = no)</li>
  <li><b>EstimatedSalary</b> — estimated salary</li>
  <li><b>Exited</b> — <b>target</b> (1 = customer left, 0 = customer stayed)</li>
</ul>
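<p>
As a rough illustration of how these columns might be grouped for modeling, the sketch below separates identifier-like columns from candidate features and the target. The five rows are the first rows of the dataset, typed in by hand so the example is self-contained. Treating <code>RowNumber</code>, <code>CustomerId</code>, and <code>Surname</code> as non-predictive is an assumption, not something the data description states.
</p>

```python
import pandas as pd

columns = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography", "Gender", "Age",
    "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
    "EstimatedSalary", "Exited",
]

# First five rows of the dataset, reproduced by hand
rows = [
    (1, 15634602, "Hargrave", 619, "France", "Female", 42, 2.0, 0.0, 1, 1, 1, 101348.88, 1),
    (2, 15647311, "Hill", 608, "Spain", "Female", 41, 1.0, 83807.86, 1, 0, 1, 112542.58, 0),
    (3, 15619304, "Onio", 502, "France", "Female", 42, 8.0, 159660.8, 3, 1, 0, 113931.57, 1),
    (4, 15701354, "Boni", 699, "France", "Female", 39, 1.0, 0.0, 2, 0, 0, 93826.63, 0),
    (5, 15737888, "Mitchell", 850, "Spain", "Female", 43, 2.0, 125510.82, 1, 1, 1, 79084.1, 0),
]
df = pd.DataFrame(rows, columns=columns)

# Assumption: these columns identify a customer rather than describe behavior
id_columns = ["RowNumber", "CustomerId", "Surname"]
target_column = "Exited"

features = df.drop(columns=id_columns + [target_column])
target = df[target_column]

print(features.shape)   # (5, 10)
print(target.tolist())  # [1, 0, 1, 0, 0]
```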


### STEP 1: PREPARING THE DATA

In [1]:
# IMPORTS
import pandas as pd

In [2]:
# IMPORTING DATA
df = pd.read_csv('data/Churn.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
# FIRST FIVE ROWS
df.head()

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [7]:
# SUMMARY STATISTICS FOR ALL COLUMNS (NUMERIC AND CATEGORICAL)
df.describe(include='all')

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000,10000.0,10000,10000,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
unique,,,2932,,3,2,,,,,,,,
top,,,Smith,,France,Male,,,,,,,,
freq,,,32,,5014,5457,,,,,,,,
mean,5000.5,15690940.0,,650.5288,,,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,,96.653299,,,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,,350.0,,,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,,584.0,,,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,,652.0,,,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,,718.0,,,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,,850.0,,,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [8]:
# SUMMARY STATISTICS FOR NUMERIC COLUMNS ONLY
df.describe()

,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0
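
The mean of Exited in the summary above (0.2037 over 10,000 rows) already hints at class imbalance: about one customer in five churned. The sketch below rebuilds the target from that figure and derives the class shares, the imbalance ratio, and "balanced" class weights using the formula n_samples / (n_classes * class_count). The series is reconstructed from the summary statistics rather than read from the file, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the target from the summary: mean 0.2037 over 10,000 rows means 2,037 exits
exited = pd.Series([1] * 2037 + [0] * 7963, name="Exited")

# Share of each class and how many retained customers there are per churned one
class_share = exited.value_counts(normalize=True).sort_index()
imbalance_ratio = class_share[0] / class_share[1]

# "Balanced" class weights: n_samples / (n_classes * class_count)
counts = exited.value_counts().sort_index()
class_weights = len(exited) / (len(counts) * counts)

print(class_share.round(4).to_dict())    # {0: 0.7963, 1: 0.2037}
print(round(float(imbalance_ratio), 2))  # 3.91
print(class_weights.round(3).to_dict())  # {0: 0.628, 1: 2.455}
```

With roughly four retained customers for every churned one, a model that always predicts "stayed" would reach about 80% accuracy while scoring an F1 of zero on the churn class, which is why F1 is the target metric here.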


In [10]:
# CHECKING MISSING VALUES
df.isna().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64
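
Missing values are one half of the quality check named in the workflow; duplicates are the other. A minimal sketch of that check is shown below on a toy frame (values invented for illustration), since the result for the real dataset is not shown here. It distinguishes fully identical rows from repeated customer identifiers, which are different problems.

```python
import pandas as pd

# Toy frame with one fully duplicated row and one repeated CustomerId (values invented)
toy = pd.DataFrame({
    "CustomerId": [101, 102, 102, 103, 103],
    "Geography": ["France", "Spain", "Spain", "Germany", "France"],
    "Exited": [0, 1, 1, 0, 0],
})

# Rows that repeat an earlier row in every column
full_duplicates = int(toy.duplicated().sum())

# Rows whose CustomerId already appeared, even if other columns differ
id_duplicates = int(toy["CustomerId"].duplicated().sum())

# Dropping exact duplicates keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(full_duplicates, id_duplicates, len(deduplicated))  # 1 2 4
```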

In [14]:
# TENURE VALUE COUNTS, EXCLUDING MISSING VALUES
df['Tenure'].value_counts(dropna=True)

Tenure
1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: count, dtype: int64

In [15]:
# TENURE VALUE COUNTS, INCLUDING MISSING VALUES (NaN)
df['Tenure'].value_counts(dropna=False)

Tenure
1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
NaN     909
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: count, dtype: int64
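
The counts above show 909 missing Tenure values out of 10,000 rows. The sketch below rebuilds the column from those counts and illustrates one possible treatment: filling the gaps with the median while keeping an indicator of which rows were imputed. This is an assumption for illustration, not the notebook's stated choice; dropping the rows or imputing a different value are equally valid options to compare.

```python
import numpy as np
import pandas as pd

# Tenure counts copied from the value_counts() output above
tenure_counts = {1.0: 952, 2.0: 950, 8.0: 933, 3.0: 928, 5.0: 927, 7.0: 925,
                 4.0: 885, 9.0: 882, 6.0: 881, 10.0: 446, 0.0: 382}
n_missing = 909

# Expand the counts back into a column of 10,000 values (row order is not preserved)
values = [v for v, n in tenure_counts.items() for _ in range(n)] + [np.nan] * n_missing
tenure = pd.Series(values, name="Tenure")

missing_share = tenure.isna().mean()
median_tenure = tenure.median()  # NaN values are skipped by default

# Keep a flag so a model can still see which rows were imputed
tenure_was_missing = tenure.isna().astype(int)
tenure_filled = tenure.fillna(median_tenure)

print(f"{missing_share:.2%}")           # 9.09%
print(median_tenure)                    # 5.0
print(int(tenure_filled.isna().sum()))  # 0
```

The reconstructed median (5.0) matches the 50% row of the describe() output, which is a quick sanity check that the counts were copied correctly.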