# **Project Name**    - Churn Prediction



##### **Project Type**    - Supervised Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Mahmud Shaikh
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

In this project, I analyzed a customer churn dataset to build a classification model that predicts whether a customer will leave the company. I performed data preprocessing, handled categorical variables, and explored key features influencing churn. Various classification algorithms such as Logistic Regression, Random Forest, and XGBoost were evaluated. Model performance was assessed using metrics like accuracy, precision, recall, and the AUC-ROC curve to ensure robust prediction quality. The goal was to help the business identify at-risk customers and take proactive retention measures.

# **GitHub Link -**

https://github.com/mahmud-shaikh/ML_midcourse

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

Customer retention is a major challenge for businesses in today’s competitive market. It is more cost-effective to retain existing customers than to acquire new ones. However, identifying customers who are likely to leave the service—known as “churn”—can be difficult without proper tools. The goal of this project is to build a machine learning model that can accurately predict whether a customer is likely to churn based on their profile and behavior.

The dataset used contains information about customers such as credit score, geography, gender, age, tenure, balance, number of products, whether they have a credit card, if they are active members, and estimated salary. These features help us understand customer patterns and make predictions.

By creating a classification model, we aim to help the company take early action to retain at-risk customers. The model's performance will be evaluated using accuracy and other classification metrics, with a special focus on the AUC-ROC score, which helps us understand how well the model can separate churned customers from non-churned ones. This project will not only support decision-making but also improve customer satisfaction and reduce revenue loss.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb

from sklearn import metrics
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_auc_score, roc_curve, recall_score

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('/content/Churn Modeling.csv')
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape
# There are 10000 rows and 14 columns

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()
# There is no duplicate value present in the dataset

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()
# There is no missing value in the dataset

### What did you know about your dataset?

Answer: The dataset contains 10,000 rows and 14 columns, representing a sizable amount of information for analysis. There are no missing values or duplicate entries, indicating that the data is clean and ready for further processing and modeling. This suggests a reliable and well-maintained dataset suitable for generating meaningful insights.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**RowNumber**: Sequential index for each record (from 1 to 10,000). It holds no analytical value but helps in identifying rows.

**CustomerId**: Unique ID assigned to each customer. Acts as an identifier and is not used for modeling.

**CreditScore**: Numerical value representing the customer's creditworthiness. Ranges from 350 to 850, with a mean around 650.

**Age**: Customer's age. Ranges from 18 to 92 years, with an average age of 38.9 years.

**Tenure**: Number of years the customer has been with the bank. Values range from 0 to 10, with a mean of around 5 years.

**Balance**: The account balance of the customer. Ranges from 0 to ~250,898.09. Around 25% of customers have zero balance.

**NumOfProducts**: Number of bank products the customer uses. Ranges from 1 to 4, with most using 1 or 2 products.

**HasCrCard**: Indicates whether the customer has a credit card (1 = Yes, 0 = No). Around 70.5% have a credit card.

**IsActiveMember**: Indicates whether the customer is actively engaged (1 = Yes, 0 = No). About 51.5% are active members.

**EstimatedSalary**: Estimated annual salary of the customer. Ranges from ~11.58 to ~199,992.48, with an average of ~100,090.

**Exited: Target variable** – indicates if the customer has churned (1 = Yes, 0 = No). About 20.4% of customers have exited.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()
# The nunique() function is used to count distinct observations over requested axis. Return Series with number of distinct observations.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# from sklearn.preprocessing import StandardScaler
# Drop unnecessary columns
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)
df.head()

### What all manipulations have you done and insights you found?

##Dropped Irrelevant Columns:
Removed RowNumber, CustomerId, and Surname as they do not contribute to the model's predictive power.

Handled Missing and Duplicate Values:

Verified that there are no missing values in the dataset.

Checked and confirmed that there are no duplicate rows.

##Insights:

**Balance Distribution**:

A large number of customers have zero balance, suggesting many do not actively use the bank's saving services.

**Exit Pattern**:

Only around 20% of customers exited, indicating a class imbalance that should be handled during modeling.

**Age and Churn**:

The dataset's describe() shows that older individuals may be more prone to churn — this can be explored further in EDA.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

##**U-Univariate Analysis**

#### Chart 1 - **Countplot of Churn (Target Variable)**

In [None]:
# Churn Distribution (Target Variable) -chart 1 visualization code
sns.countplot(data=df, x='Exited', palette='Set2')
plt.title('Distribution of Churn (Exited)')
plt.xlabel('Exited (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is perfect to understand the class balance of our target variable.

##### 2. What is/are the insight(s) found from the chart?

The data is imbalanced—fewer customers have exited than stayed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**: Class imbalance may require handling before model building (e.g., SMOTE). If ignored, it can lead to biased predictions.

#### Chart - 2 **Histogram of Customer Age**

In [None]:
# Age Distribution: Chart - 2 visualization code
sns.histplot(df['Age'], kde=True, color='skyblue')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**Ans**: Histogram shows how age is spread out among customers.

##### 2. What is/are the insight(s) found from the chart?

**Ans**: Most customers are between 30 and 40, with a noticeable drop-off for older ages.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**: Helps segment campaigns for younger/middle-aged groups to retain customers better.

#### Chart - 3 **Histogram of Customer's Balance**

In [None]:
# Balance distribution visualization code
plt.figure(figsize=(6,4))
sns.histplot(df['Balance'], kde=True, color='orange')
plt.title('Distribution of Balance')
plt.show()

##### 1. Why did you pick the specific chart?

**Ans**: The histogram is chosen to understand how the balance is distributed across all customers. It helps us observe the concentration, skewness, and spread of balance values.

##### 2. What is/are the insight(s) found from the chart?

**Ans**:
The plot shows a right-skewed distribution, with a high number of customers having a very low balance (many close to 0), while a smaller portion holds a significantly higher balance. A peak near the lower end confirms that a large segment maintains minimal balance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:
Identifying customers with low balances can help in targeting them with personalized credit offers, overdraft facilities, or financial planning services. Customers with higher balances can be approached for investment or premium financial products. This segmentation supports customer retention and revenue growth strategies.

#### Chart - 4 **Distribution of Credit Score**

In [None]:
# Credit Score Distribution visualization code
sns.histplot(df['CreditScore'], bins=30, kde=True)
plt.title('Credit Score Distribution')
plt.xlabel('Credit Score')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer**: To understand credit score distribution among customers.

##### 2. What is/are the insight(s) found from the chart?

**Answer**: Most customers have a credit score between 600–700.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:
Creditworthiness insights support risk-based customer profiling.

#### Chart - 5 **Histogram of Tenure**

In [None]:
# Histogram of Tenure visualization code
sns.histplot(df['Tenure'], bins=10, kde=False)
plt.title('Customer Tenure Distribution')
plt.xlabel('Tenure')
plt.ylabel('Number of Customers')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer**: The histogram is used to understand how customer tenure (i.e., how many years customers have stayed with the bank) is distributed across the dataset. It's helpful to identify whether customers tend to stay long-term or leave within a few years. A histogram is ideal for showing the frequency of discrete numerical values like tenure.

##### 2. What is/are the insight(s) found from the chart?

**Answer**:

1. Tenure appears to be fairly evenly distributed from 1 to 9 years.

2. However, tenure of 10 years has a significant spike, indicating a larger group of loyal, long-term customers.

3. Very few customers have a tenure of 0, which may represent new or newly added accounts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:

1. The spike at 10 years suggests that once customers reach a long tenure, they are likely to stay, indicating high brand loyalty among older customers.

2. Retention strategies could be designed to support customers in their early years (0–3 years) where dropout might be higher.

3. Focused engagement and reward programs for long-tenure customers (like those at 10 years) can help in reducing churn further and enhancing lifetime value

##**B-Bivariate Analysis**

#### Chart - 6 **Boxplot of Age vs. Churn**

In [None]:
# Boxplot of Age vs. Churn visualization code
sns.boxplot(data=df, x='Exited', y='Age', palette='Set2')
plt.title('Age vs. Churn')
plt.xlabel('Exited (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer**: Boxplots are perfect for comparing the distribution of a numerical variable (Age) across categorical classes (Churn: 0 = Stayed, 1 = Exited). This helps in identifying any pattern or relationship between customer age and their likelihood of churning.

##### 2. What is/are the insight(s) found from the chart?

**Answer**:

Customers who churned (Exited = 1) are generally older, with the median age around 45–50 years.

Customers who did not churn (Exited = 0) are relatively younger, with the median around 35 years.

The spread (IQR) of age is slightly wider for churned customers, and there are more outliers in the non-churned group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:

1. Older customers are more likely to churn, indicating a potential age-based dissatisfaction or changing banking needs.

2. The business should consider:

 1. Personalized offers or financial products for older age groups.

 2. Customer service enhancements or loyalty programs targeting senior customers.

3. These actions could help reduce churn rate among a segment that is currently more prone to leaving the service.

#### Chart - 7 **Countplot of Gender vs. Churn**

In [None]:
# Countplot of Gender vs. Churn visualization code
sns.countplot(data=df, x='Gender', hue='Exited', palette='Set1')
plt.title('Gender vs. Churn')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer**: The countplot of Gender vs. Churn was chosen because it visually compares the number of male and female customers who stayed versus those who exited. It helps in understanding if there is any correlation between gender and customer churn.

##### 2. What is/are the insight(s) found from the chart?

**Answer**: From the chart, we can observe that although there are more male customers overall, female customers have a higher proportion of churn compared to males.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:
This insight suggests that gender may be a contributing factor to customer attrition. It would be beneficial for the business to investigate the reasons behind higher female churn. Addressing gender-specific concerns through targeted strategies or offers could help reduce overall churn and improve customer retention, especially among female customers.

#### Chart - 8 **Countplot of Geography vs. Churn**

In [None]:
# Countplot of Geography vs. Churn visualization code
sns.countplot(data=df, x='Geography', hue='Exited', palette='Set2')
plt.title('Churn by Geography')
plt.xlabel('Geography')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**Answer**: This countplot was chosen to compare the number of customers who stayed versus those who exited, segmented by geography. It allows us to analyze whether a customer's country is related to their likelihood of churning.

##### 2. What is/are the insight(s) found from the chart?

**Answer**: The chart shows that while France has the highest total number of customers, the churn rate in Germany is relatively higher compared to France and Spain. This indicates that Germany has a disproportionately high number of customers exiting relative to its total customer base.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:

This insight is crucial for designing region-specific retention strategies. Since Germany has a higher churn rate, the business should investigate local issues or customer preferences that may be driving this behavior. Tailoring communication, services, or incentives for customers in Germany could lead to improved customer loyalty and reduced churn in that region.

#### Chart - 9 **Boxplot: Balance vs. Churn**

In [None]:
# Boxplot: Balance vs. Churn visualization code
sns.boxplot(data=df, x='Exited', y='Balance', palette='Set2')
plt.title('Balance vs. Churn')
plt.xlabel('Exited')
plt.ylabel('Balance')
plt.show()

##### 1. Why did you pick the specific chart?

This boxplot was chosen to visualize the distribution of account balances for customers who exited versus those who stayed. It helps in understanding whether there's a relationship between a customer's balance and their likelihood to churn.

##### 2. What is/are the insight(s) found from the chart?

From the graph, we can observe that customers who exited generally have higher account balances compared to those who stayed. The median balance of churned customers is noticeably greater, and the spread of their balances extends higher as well.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**:

This trend indicates that customers with larger balances are more prone to churn, which is alarming since losing high-value customers can significantly impact the business financially. This insight suggests the need for priority-based retention efforts, such as offering loyalty benefits or personalized services to high-balance customers to encourage them to remain with the bank.

#### Chart - 10 **Barplot: Products Number vs. Churn**

In [None]:
# Barplot: Products Number vs. Churn visualization code
sns.barplot(data=df, x='NumOfProducts', y='Exited', palette='Set1')
plt.title('Number of Products vs. Churn Rate')
plt.xlabel('Number of Products')
plt.ylabel('Churn Rate')
plt.show()

##### 1. Why did you pick the specific chart?

This barplot was chosen to show the relationship between the number of products a customer holds and their churn rate. It helps identify if having more or fewer products influences a customer's likelihood to leave the bank.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that customers with two products have the lowest churn rate, while those with three or four products show a significantly higher churn rate, with customers holding four products having almost a 100% churn rate. Interestingly, even customers with just one product churn more than those with two.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

This insight suggests that simply increasing product subscriptions does not ensure customer retention and might even have the opposite effect. Customers with multiple products may have higher expectations or face more complexity, leading to dissatisfaction. The bank should focus on quality of service and customer experience rather than pushing product bundles, and carefully analyze why high-product users are leaving in such large numbers.

#### Chart - 11 **Boxplot: Estimated Salary vs. Churn**

In [None]:
# Boxplot: Estimated Salary vs. Churn visualization code
sns.boxplot(data=df, x='Exited', y='EstimatedSalary', palette='Set1')
plt.title('Estimated Salary vs. Churn')
plt.xlabel('Exited')
plt.ylabel('Estimated Salary')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen to visualize the distribution of Estimated Salary among customers who stayed (Exited = 0) versus those who churned (Exited = 1). This helps detect if salary levels influence customer churn behavior.

##### 2. What is/are the insight(s) found from the chart?

The distributions for both churned and non-churned customers are very similar. The medians, interquartile ranges, and overall spread of estimated salaries show no significant difference between the two groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

This suggests that Estimated Salary does not play a major role in churn decisions. Therefore, targeting or segmenting customers based on salary alone may not be effective in predicting or preventing churn. The bank should focus on other more impactful features like number of products, geography, or tenure for churn reduction strategies.

#### Chart - 12 **Countplot of HasCrCard vs. Churn**

In [None]:
# Countplot of HasCrCard vs. Churn visualization code
sns.countplot(data=df, x='HasCrCard', hue='Exited', palette='Set3')
plt.title('Credit Card Ownership vs. Churn')
plt.xlabel('Has Credit Card')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot was used to compare the number of customers with and without a credit card, split by their churn status. This helps us understand if owning a credit card influences the likelihood of customer churn.

##### 2. What is/are the insight(s) found from the chart?

The majority of customers, both churned and non-churned, own a credit card. However, the churn rate appears slightly higher among customers without a credit card. Still, the difference is not very large or significant visually.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

While credit card ownership has some association with churn, it is not a strong standalone predictor. The bank can consider integrating credit card data with other features (like product holding, tenure, or activity levels) to better understand customer retention patterns. Simply promoting credit cards may not be enough to reduce churn.

#### Chart - 13 **Countplot of IsActiveMember vs. Churn**

In [None]:
# Countplot of IsActiveMember vs. Churn visualization code
sns.countplot(data=df, x='IsActiveMember', hue='Exited', palette='Set2')
plt.title('Active Membership vs. Churn')
plt.xlabel('Is Active Member')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot was selected to compare churn behavior between active and inactive members. This helps assess if active membership status has any effect on customer retention.

##### 2. What is/are the insight(s) found from the chart?

Churn is significantly higher among inactive members compared to active ones. While both groups have customers who exited, inactive members show a higher churn proportion than active members.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact**

Active engagement appears to reduce churn risk. The bank should focus on increasing customer activity by promoting loyalty programs, personalized offers, or engagement campaigns. Keeping customers active is a key strategy to improve retention and lower churn rates.

##**M-Multivariate Analysis**

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was used to understand the linear relationships between numerical features and the churn (Exited) variable. It helps identify which features are most closely associated with customer churn.

##### 2. What is/are the insight(s) found from the chart?

**Insight**:

* Age shows a moderate positive correlation (0.29) with churn — older customers are more likely to exit.

* IsActiveMember shows a negative correlation (-0.16) — active members tend to stay.

* Balance and NumOfProducts also show some correlation, though weaker.

* Most other variables, like CreditScore, Tenure, and EstimatedSalary, show little to no linear correlation with churn.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Exited']], hue='Exited')
plt.suptitle('Pairplot of Selected Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot provides a comprehensive view of pairwise relationships and distributions between multiple numerical features, segmented by churn (Exited). It’s useful for detecting patterns, clusters, and potential nonlinear relationships across features.

##### 2. What is/are the insight(s) found from the chart?

**Insight**

* Age shows a visible pattern: churned customers (orange) are mostly concentrated in higher age groups, supporting earlier findings.

* Balance and Credit Score appear scattered with no clear separation by churn.

* Estimated Salary shows almost uniform distribution, indicating low influence on churn.

* No strong visual clustering was found in other pairs.



## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Checking for missing values
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There is no missing value in the dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Using IQR method for 'Balance'
Q1 = df['EstimatedSalary'].quantile(0.25)
Q3 = df['EstimatedSalary'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filtering out the outliers
df = df[(df['EstimatedSalary'] >= lower_bound) & (df['EstimatedSalary'] <= upper_bound)]
df

In [None]:
# Using IQR method for 'Balance'
Q1 = df['Balance'].quantile(0.25)
Q3 = df['Balance'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filtering out the outliers
df = df[(df['Balance'] >= lower_bound) & (df['Balance'] <= upper_bound)]
df

##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR Method: It's a common technique for identifying and removing extreme values without assuming a normal distribution. Effective for features like Balance or EstimatedSalary. There is no outlier present in the dataset.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# One-Hot Encoding for nominal features like Geography
df = pd.get_dummies(df, columns=['Geography'])

# Label Encoding for binary features like Gender
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

In [None]:
df['Geography_France'] = le.fit_transform(df['Geography_France'])
df['Geography_Germany'] = le.fit_transform(df['Geography_Germany'])
df['Geography_Spain'] = le.fit_transform(df['Geography_Spain'])
df

#### What all categorical encoding techniques have you used & why did you use those techniques?

**One-Hot Encoding**: Best for nominal (no order) categorical variables with multiple unique values.

**Label Encoding**: Suitable for binary columns where no ordinal relationship exists.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
# Splitting data into dependent and independent columns
X=df.drop(labels=['Exited'],axis=1)
y=df['Exited']

# Split your data to train and test. Choose Splitting ratio wisely.
# Splitting the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

##### What data splitting ratio have you used and why?

We split the dataset into training and testing sets using an 80:20 ratio to ensure that model performance can be evaluated on unseen data. We used stratified sampling to maintain the proportion of churned and non-churned customers.

### 9. Handling Imbalanced Dataset

In [None]:
y.value_counts(normalize=True)

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Initialize Logistic Regression model
clf = LogisticRegression(max_iter=1000)  # Increase max_iter if convergence warnings

# Fit the model
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # For ROC-AUC

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluation metrics
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
#ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Step 1: Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Define parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10],               # Regularization strength
    'penalty': ['l1', 'l2'],               # Type of regularization
    'solver': ['liblinear']                # Solver that supports both l1 and l2
}

# Step 3: Create Logistic Regression model
logreg = LogisticRegression()

# Step 4: Use Stratified K-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 5: Grid Search with Cross-Validation
grid_search = GridSearchCV(logreg, param_grid, cv=cv, scoring='roc_auc', verbose=1)
grid_search.fit(X_train, y_train)

# Step 6: Get best estimator
best_model = grid_search.best_estimator_

# Step 7: Predict on the test set
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

# Step 8: Evaluate the model
print("Best Parameters:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_prob))

# Step 9: ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc_score(y_test, y_prob):.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Tuned Logistic Regression')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Performance metrics before and after hyperparameter tuning
metrics_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
pre_tuning = [0.8085, 0.5845, 0.2039, 0.3024, 0.7732]
post_tuning = [0.8100, 0.5860, 0.1870, 0.2780, 0.7746]

# Setup bar chart
x = np.arange(len(metrics_names))
width = 0.35

plt.figure(figsize=(8, 5))
plt.bar(x - width/2, pre_tuning, width, label='Pre-Tuning')
plt.bar(x + width/2, post_tuning, width, label='Post-Tuning')
plt.xticks(x, metrics_names, rotation=45, ha='right')
plt.ylabel('Score')
plt.title('Model Performance Before vs After Hyperparameter Tuning')
plt.legend()
plt.tight_layout()
plt.show()

* Which hyperparameter optimization technique have you used and why?

For tuning the Logistic Regression model, I used GridSearchCV, which is a grid-based exhaustive search over a specified parameter space. GridSearchCV tries all possible combinations of given hyperparameters and evaluates them using cross-validation, selecting the one that gives the best performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

####**Improvement Summary**
* Accuracy increased slightly from 0.8085 to 0.8100 (+0.0015).

* Precision for the churn class improved marginally from 0.5845 to 0.5860.

* Recall dropped from 0.2039 to 0.1870, indicating fewer churners correctly identified.

* F1 Score decreased from 0.3024 to 0.2780, reflecting the trade‑off between precision and recall.

* ROC AUC Score saw a small bump from 0.7732 to 0.7746 (+0.0014), showing a minor improvement in class separability.

Overall, hyperparameter tuning yielded a modest gain in overall accuracy and ROC AUC, but at the cost of lower recall for the churn class. This indicates that further strategies—such as rebalancing the data or trying different algorithms—may be necessary to better capture churners.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save the model to a file
joblib.dump(best_model, 'logistic_regression_best_model.pkl')
print("Model saved as logistic_regression_best_model.pkl")

In [None]:
loaded_model = joblib.load('logistic_regression_best_model.pkl')
loaded_model

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we conducted a comprehensive analysis of the customer churn dataset to identify patterns, trends, and key factors influencing customer attrition. After performing thorough data preprocessing—including handling missing values, encoding categorical features, addressing outliers, and splitting the data—we implemented multiple machine learning models to predict customer churn.

Among the models tested, Logistic Regression with hyperparameter tuning using GridSearchCV delivered consistent performance and interpretability. While the overall accuracy remained around 81%, we observed limitations in recall and F1-score for predicting churned customers (class 1), indicating some imbalance in the dataset and the inherent difficulty of predicting rare events.

The model's ROC AUC score of approximately 0.77 demonstrates that it is capable of distinguishing between churned and retained customers better than random guessing.

Key features affecting churn include Credit Score, Age, Balance, Tenure, and Number of Products, which suggests that banks and financial institutions should focus retention strategies on customers with high risk profiles based on these indicators.

Finally, the best model was saved using joblib for future deployment purposes, making it ready for integration into a real-time decision-making system, such as a customer dashboard or automated alert system.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***