<a href="https://colab.research.google.com/github/naman39910/Exploratory-Data-Analysis/blob/main/ML__Telco_Customer_Churn_Prediction_(Classification).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Telco Customer Churn Prediction



##### **Project Type**    - EDA/Classification
##### **Contribution**    - Individual


# **Project Summary -**

This project focuses on analyzing the Telco Customer Churn dataset to build a machine learning model for predicting customer churn. The dataset contains comprehensive information about customers, including demographics (gender, SeniorCitizen, Partner, Dependents), account details (tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges), and the services they subscribe to (PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies). The primary goal is to identify factors influencing churn and develop a predictive model to help the telecommunications company retain at-risk customers.

The analysis began with data loading and initial exploration, revealing 7043 rows and 21 columns. The target variable, 'Churn', is binary, indicating whether a customer has left the company. Initial checks showed no duplicate entries or explicit missing values, although the 'TotalCharges' column, initially an object type, required conversion to numeric, which introduced 11 missing values. These were handled during the data wrangling phase.

Data visualization played a crucial role in understanding the relationships between variables and their impact on churn. Bar charts and pie charts illustrated the overall churn rate and the distribution of categorical features. Histograms and KDE plots provided insights into the distribution of numerical features like 'tenure', 'MonthlyCharges', and 'TotalCharges' in relation to churn. Correlation heatmaps and pair plots helped in identifying relationships between numerical variables. Key insights from the visualizations included that customers with lower tenure, higher monthly charges, and certain service combinations (like Fiber Optic internet without online security or tech support) were more likely to churn.

Hypothesis testing using a t-test, a Chi-Square test, and a one-way ANOVA confirmed some of these observations: mean 'MonthlyCharges' differed significantly between churned and non-churned customers and across 'Contract' types, while 'gender' did not show a significant association with churn.

Data preprocessing involved handling the 11 missing values in 'TotalCharges' (the notebook does not show an explicit imputation step for them) and, importantly, categorical encoding using One-Hot Encoding to convert the numerous object type features into a format suitable for machine learning models. The 'Churn' column was also encoded. Feature scaling using StandardScaler was applied so that the features have a similar range.

Several machine learning models were implemented and evaluated, including Random Forest, XGBoost, and K-Nearest Neighbors. Evaluation metrics such as accuracy, precision, recall, F1-score, and AUC-ROC were used, with a particular focus on metrics for the minority class (churned customers) to assess the models' ability to correctly identify customers likely to leave. Hyperparameter tuning using GridSearchCV was performed to optimize model performance and address potential overfitting observed in initial models like Random Forest. The XGBoost model, after tuning, demonstrated a good balance of performance across evaluation metrics, particularly in identifying churned customers, making it a strong candidate for the final prediction model. Feature importance analysis from models like XGBoost highlighted the most influential factors driving churn, such as contract type and internet service.

In conclusion, this project successfully analyzed the Telco Customer Churn dataset, identified key drivers of churn through exploratory data analysis and statistical testing, preprocessed the data for machine learning, and developed predictive models. The XGBoost model, with its robust performance and ability to provide feature importance insights, offers a valuable tool for the telecommunications company to proactively identify and target at-risk customers with tailored retention strategies, ultimately contributing to reduced churn and improved business outcomes.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The problem statement for this project is to analyze the Telco Customer Churn dataset and build a machine learning model to predict which customers are likely to churn (leave the company).

Telecommunications companies face the challenge of customer churn, which can significantly impact revenue and growth. Identifying customers at risk of churning is crucial for implementing targeted retention strategies.

This project aims to:

1.   Understand the factors influencing customer churn: Analyze the various features in the dataset, including demographics, services subscribed to, and account information, to identify patterns and correlations with churn.
2.   Develop a predictive model: Build and evaluate machine learning models that can accurately predict the likelihood of a customer churning.
3.   Provide actionable insights: Based on the model's predictions and feature importance, provide insights to the telecommunications company on which customers are at risk and what factors are driving churn, enabling them to take proactive measures to retain those customers.

Ultimately, the goal is to reduce customer churn and improve customer retention for the telecommunications company.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# 1. to handle the data
import pandas as pd
import numpy as np
from scipy import stats

# to visualize the data
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# To preprocess the data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder,OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
# import iterative imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# machine learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
#for classification tasks
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
# pipeline
from sklearn.pipeline import Pipeline
# metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error,mean_squared_error,r2_score, roc_auc_score

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
dataset = pd.read_csv('/content/WA_Fn-UseC_-Telco-Customer-Churn.csv')

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

The dataset has 7043 rows and 21 columns.

Two columns are of int datatype, one is of float datatype, and 18 are of object datatype.

The `info()` function shows the number of rows and columns, the datatype of each column, and the non-null count per column, which also reveals missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
dataset.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert the 0 and 1 values of SeniorCitizen into 'No' and 'Yes' to make them easier to read
def conv(value):
    if value == 0:
        return 'No'
    else:
        return 'Yes'

dataset['SeniorCitizen'] = dataset['SeniorCitizen'].apply(conv)

In [None]:
# Drop the customerid column
dataset.drop('customerID', axis=1, inplace=True)

In [None]:
# Convert the TotalCharges column from object to float; unparseable entries become NaN
dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'], errors='coerce')

In [None]:
dataset.info()

### What all manipulations have you done and insights you found?

We manipulated the following columns:

1.   A custom function conv is defined to convert the values in the SeniorCitizen column: it replaces 1 with 'Yes' and 0 with 'No' to make the data easier to interpret.
2.   The customerID column, which is not useful for analysis, is dropped from the dataset using drop().
3.   The TotalCharges column, originally stored as an object (string) type, is converted to numeric (float) type using pd.to_numeric(). With errors='coerce', entries that cannot be parsed become NaN, which introduces 11 missing values.
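
The effect of `errors='coerce'` can be seen on a small example. This is a minimal sketch on a made-up frame (the `raw` DataFrame and its values are assumptions for illustration, not rows from the Telco file):

```python
import pandas as pd

# Toy stand-in for the raw column: blank strings mimic entries
# that cannot be parsed as numbers (hypothetical values).
raw = pd.DataFrame({"TotalCharges": [" ", "840.5", " ", "1990.1"]})

# errors="coerce" turns anything unparseable into NaN instead of raising
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Count the missing values created by the conversion
n_missing = raw["TotalCharges"].isna().sum()
print(raw["TotalCharges"].dtype, n_missing)
```

Running the same `isna().sum()` check on the real dataset after conversion is a quick way to confirm how many values were coerced.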



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Bar Chart

In [None]:
# Chart - 1 visualization code
sns.countplot(x='Churn', data=dataset)
plt.xlabel("Churn")
plt.ylabel("Count")
plt.title("Churn Count")
plt.show()

##### 1. Why did you pick the specific chart?

We use a bar chart because it makes it easy to compare how many customers have churned (Yes) versus how many have not (No).

##### 2. What is/are the insight(s) found from the chart?

At a glance, we can see that more customers stayed (No) than left (Yes). This is useful for understanding the imbalance in churn, which could be important for building predictive models.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Even non-technical audiences can quickly understand the customer retention rate from this visual.

#### Chart - 2 : Pie Chart

In [None]:
# Chart - 2 visualization code
churn_counts = dataset['Churn'].value_counts()
plt.pie(churn_counts, labels=churn_counts.index, autopct='%1.1f%%', startangle=90, colors=['green', 'red'])
plt.title('Distribution of Churn')
plt.show()

##### 1. Why did you pick the specific chart?



*  The pie chart directly shows what percentage of customers churned (Yes) versus stayed (No).
*   It gives a clear visual impression of how much bigger one group is compared to the other.


##### 2. What is/are the insight(s) found from the chart?



*   Majority of Customers Did Not Churn: 73.5% of the customers did not leave the company. This indicates a high customer retention rate, which is a positive sign for the business.
*   About 1 in 4 Customers Churned: the remaining 26.5% of customers left the company.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Losing more than 1 in every 4 customers is a potential revenue and growth concern. It suggests the company should investigate why customers are leaving and improve retention strategies.
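
The percentages shown in the pie chart can also be computed directly. The sketch below uses a synthetic `churn` Series built to match the 73.5% / 26.5% split reported above (the 1000-row size is an assumption chosen for round numbers):

```python
import pandas as pd

# Synthetic labels reproducing the split reported in the chart
churn = pd.Series(["No"] * 735 + ["Yes"] * 265, name="Churn")

# normalize=True returns class shares instead of raw counts
shares = churn.value_counts(normalize=True)
churn_rate = shares["Yes"]

print(shares)
print(f"Churn rate: {churn_rate:.1%}")
```

On the real data, `dataset['Churn'].value_counts(normalize=True)` gives the same kind of summary and is a useful baseline: a model that always predicts "No" already reaches an accuracy equal to the majority share.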

#### Chart - 3 : Histogram Chart

In [None]:
# Chart - 3 visualization code
churned = dataset[dataset['Churn'] == 'Yes']
not_churned = dataset[dataset['Churn'] == 'No']

plt.hist([churned['tenure'], not_churned['tenure']], bins= 10,
         color =['red', 'blue'], label = ['yes', 'no'])
plt.title('Distribution of Tenure by Churn Status')
plt.xlabel('Tenure')
plt.ylabel('Frequency')
plt.legend()
plt.show()



##### 1. Why did you pick the specific chart?

It clearly compares the number of customers who churned (yes) vs. those who stayed (no) at different tenure levels.

##### 2. What is/are the insight(s) found from the chart?

Some insights found in this chart:

1.   Most churn happens early in the customer lifecycle.
2.   Customers with longer tenure tend to stay.
3.   The contrast is sharpest at the extremes of tenure.
4.   Churn decreases consistently over time.
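
The tenure pattern can be quantified by binning tenure and computing a churn rate per band. This is a rough sketch on invented rows; the band edges and the `toy` frame are assumptions, not values taken from the dataset:

```python
import pandas as pd

# Hypothetical customers: tenure in months and churn label
toy = pd.DataFrame({
    "tenure": [1, 3, 5, 10, 14, 20, 30, 40, 60, 70],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Assumed band edges (months); include_lowest keeps tenure == 0 in the first band
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels,
                            include_lowest=True)

# Share of churned customers within each tenure band
churn_by_band = (
    toy["Churn"].eq("Yes")
    .groupby(toy["tenure_band"], observed=True)
    .mean()
)
print(churn_by_band)
```

A table like this complements the histogram because it shows rates rather than counts, so bands with few customers are easier to compare.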





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Introduce loyalty rewards or long-term contracts to encourage customers to stay past the early-risk period.

#### Chart - 4 : Histogram Plot

In [None]:
# Chart - 4 visualization code
# make plot for MonthlyCharges
churned = dataset[dataset['Churn'] == 'Yes']
not_churned = dataset[dataset['Churn'] == 'No']

plt.figure(figsize=(10, 5))
plt.hist([churned['MonthlyCharges'], not_churned['MonthlyCharges']], bins= 10,
         color =['red', 'blue'], label = ['yes', 'no'])
plt.title('Distribution of Monthly Charges by Churn Status')
plt.xlabel('Monthly Charges')
plt.ylabel('Frequency')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

It directly relates churn to pricing (a key business metric) and gives a clear comparison between churned (red) and retained (blue) customers.

##### 2. What is/are the insight(s) found from the chart?



*   Customers with low monthly charges (~$20–30) are mostly retained.
*   Churn increases sharply in the $70–100 range — possibly due to dissatisfaction with perceived value at higher prices.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Target high-charging customers (e.g., $70+) with retention offers, loyalty rewards, or better service.
*   Reduce churn = more revenue continuity.



#### Chart - 5 : Histogram Plot

In [None]:
# Chart - 5 visualization code
churned = dataset[dataset['Churn'] == 'Yes']
not_churned = dataset[dataset['Churn'] == 'No']

plt.figure(figsize=(10, 5))
# Exclude the NaN values created by the TotalCharges conversion so they cannot affect binning
plt.hist([churned['TotalCharges'].dropna(), not_churned['TotalCharges'].dropna()], bins=10,
         color=['red', 'blue'], label=['yes', 'no'])
plt.title('Distribution of Total Charges by Churn Status')
plt.xlabel('Total Charges')
plt.ylabel('Count')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

*   Easy to read.
*   Ideal for comparing distributions.
*   Helpful for churn pattern discovery based on financial contribution (Total Charges).

##### 2. What is/are the insight(s) found from the chart?

We found some insights in this chart:

1.   Customers with lower Total Charges are more likely to churn.
2.   Long-term customers (higher Total Charges) have lower churn.
3.   Churn rate declines as Total Charges increase.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Customers who churn typically have low Total Charges, indicating they leave early. This highlights the need to improve early customer experience (onboarding, initial billing, value perception) to reduce churn.

#### Chart - 6 : kde plot

In [None]:
# Chart - 6 visualization code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Plot for Monthly Charges
sns.kdeplot(data=dataset, x="MonthlyCharges", hue="Churn", fill=True, alpha=0.5, ax=axes[0])
axes[0].set_title('Density Plot of Monthly Charges by Churn Status')
axes[0].set_xlabel('Monthly Charges')
axes[0].set_ylabel('Density')

# Plot for Total Charges
sns.kdeplot(data=dataset, x="TotalCharges", hue="Churn", fill=True, alpha=0.5, ax=axes[1])
axes[1].set_title('Density Plot of Total Charges by Churn Status')
axes[1].set_xlabel('Total Charges')
axes[1].set_ylabel('Density')

plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

KDE (Kernel Density Estimation) plots smooth the distribution and make it easy to compare the shape and spread of values for churned (Yes) vs. non-churned (No) customers.

##### 2. What is/are the insight(s) found from the chart?



*   Customers with higher monthly costs are more likely to churn, possibly due to perceived lack of value or affordability issues.
*   Churned customers don’t stay long enough to generate high total revenue, which hurts long-term profitability.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 : Line Chart

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 5))
sns.lineplot(x='tenure', y='MonthlyCharges', data=dataset)
plt.title('Monthly Charges Over Time')
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.show()

##### 1. Why did you pick the specific chart?


Line plots are ideal for showing how a numerical variable (MonthlyCharges) changes across a continuous variable (tenure).




##### 2. What is/are the insight(s) found from the chart?

We found some insights in this chart:

1.   Monthly charges increase slightly with tenure.
2.   Monthly charges vary widely at all tenure levels.
3.   The trend stabilizes, with occasional spikes, after about 10 months.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Long-term customers tend to spend more monthly, so focusing on retention and upselling can increase revenue.

#### Chart - 8 : Bar Chart

In [None]:
# Chart - 8 visualization code
# Map each Churn label to a fixed color so every plot uses the same legend colors
palette = {'No': 'blue', 'Yes': 'red'}

# Plot the churn split for every categorical predictor
for predictor in dataset.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges', 'tenure']):
    plt.figure(figsize=(5, 3))
    sns.countplot(data=dataset, x=predictor, hue='Churn', palette=palette)
    plt.title(predictor)
    plt.show()


##### 1. Why did you pick the specific chart?

Variables like gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, etc., are categorical. A countplot is well suited to visualizing how many observations fall into each category, and it works well with hue='Churn' to split each category by churn status.

##### 2. What is/are the insight(s) found from the chart?

We found the following insights in these charts:

*   Senior citizens are more likely to churn.
*   Customers without partners or dependents churn more.
*   Customers without tech-related services churn more.
*   Month-to-month contracts have the highest churn.
*   Paperless billing is slightly linked to higher churn.
*   Fiber optic internet users churn more.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can adjust strategy in three areas to create a positive business impact:

1.   Targeted Retention Campaigns
     Who to target: Senior citizens, month-to-month contract users, and those without partners or dependents.
     Impact: Tailored offers, better customer support, or loyalty benefits can reduce churn in these high-risk segments.
2.   Contract Strategy
     Insight: Customers with long-term contracts churn less.
     Action: Promote yearly contracts with discounts or perks to lock in loyalty.
3.   Product & Service Improvements
     Insight: Customers without tech support or online security churn more.
     Action: Improve or bundle these services to increase perceived value and reduce churn.


#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_dataset = dataset.select_dtypes(include=np.number)
sns.heatmap(numeric_dataset.corr(), annot=True)
plt.show()

##### 1. Why did you pick the specific chart?



*   Quickly See Relationships Between Variables: A heatmap visually shows how strongly related different numeric variables are, using both color intensity and numeric correlation values (Pearson correlation coefficients).
*   Simplifies Complexity: When dealing with many variables, heatmaps help summarize relationships in one compact visual instead of checking each scatter plot manually.



##### 2. What is/are the insight(s) found from the chart?



1.   Strong Positive Correlation: tenure vs. TotalCharges (0.83)
2.   Moderate Positive Correlation: MonthlyCharges vs. TotalCharges (0.65)
3.   Weak Correlation: tenure vs. MonthlyCharges (0.25)
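
One plausible reading of these correlations is that TotalCharges behaves roughly like tenure multiplied by MonthlyCharges. The sketch below tests that intuition on synthetic data; the product relationship and the uniform ranges are assumptions for illustration, not a property verified on the Telco file:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Independent synthetic tenure (months) and monthly charges
tenure = rng.integers(1, 73, size=500)
monthly = rng.uniform(20, 120, size=500)

toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    # Assumed relationship: total billed = months stayed x monthly price
    "TotalCharges": tenure * monthly,
})

# Pearson correlation matrix, the same quantity the heatmap displays
corr = toy.corr()
print(corr.round(2))
```

Under this assumption the total correlates strongly with both inputs even though the inputs are unrelated to each other, which is worth keeping in mind when deciding whether TotalCharges adds information beyond tenure and MonthlyCharges.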




#### Chart - 10 - Pair Plot

In [None]:
# Pair Plot visualization code
numeric_dataset = dataset.select_dtypes(include=np.number)
sns.pairplot(numeric_dataset)
plt.show()

##### 1. Why did you pick the specific chart?

Multivariable Exploration: This pair plot shows relationships between tenure, MonthlyCharges, and TotalCharges — all important numerical features in telecom churn analysis.

##### 2. What is/are the insight(s) found from the chart?

Some insights found in this chart:

1.   Strong Positive Correlation: TotalCharges vs. tenure (As tenure increases, TotalCharges also increase.)
2.   Positive Correlation: TotalCharges vs. MonthlyCharges (Customers with higher monthly charges generally have higher total charges.)
3.   Weak/No Clear Correlation: tenure vs. MonthlyCharges (There's no strong trend between tenure and monthly charges.)

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1 t-test

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): $\mu_{\text{churned}} = \mu_{\text{not churned}}$

Alternative Hypothesis (H₁): $\mu_{\text{churned}} \neq \mu_{\text{not churned}}$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Clean and convert MonthlyCharges to numeric, if needed
dataset['MonthlyCharges'] = pd.to_numeric(dataset['MonthlyCharges'], errors='coerce')

# Drop rows with missing MonthlyCharges or Churn
dataset = dataset.dropna(subset=['MonthlyCharges', 'Churn'])

# Separate the MonthlyCharges based on Churn values
churn_yes = dataset[dataset['Churn'] == 'Yes']['MonthlyCharges']
churn_no = dataset[dataset['Churn'] == 'No']['MonthlyCharges']

# Import ttest_ind
from scipy.stats import ttest_ind

# Perform independent t-test (Welch's t-test)
t_stat, p_value = ttest_ind(churn_yes, churn_no, equal_var=False)

t_stat, p_value

##### Which statistical test have you done to obtain P-Value?

I used Welch's t-test (an independent two-sample t-test that does not assume equal variances) to obtain the p-value. The null hypothesis is rejected: there is a statistically significant difference in average MonthlyCharges between customers who churned and those who did not.

##### Why did you choose the specific statistical test?

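The t-test is appropriate here because it compares the mean of a continuous variable (MonthlyCharges) between two independent groups: customers who churned and those who did not. Welch's variant (`equal_var=False`) is used because the two groups need not have equal variances.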
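
To make the test statistic concrete, the sketch below recomputes Welch's t by hand and compares it with `scipy.stats.ttest_ind`. The two samples are synthetic (the means, spreads, and sizes are invented for illustration), so the numbers are not the notebook's results:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical monthly charges for the two groups
churn_yes = rng.normal(75, 25, 200)
churn_no = rng.normal(61, 31, 500)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_value = ttest_ind(churn_yes, churn_no, equal_var=False)

# Manual Welch statistic: difference in means over the unpooled standard error
se = np.sqrt(churn_yes.var(ddof=1) / len(churn_yes)
             + churn_no.var(ddof=1) / len(churn_no))
t_manual = (churn_yes.mean() - churn_no.mean()) / se

print(t_stat, t_manual, p_value)
```

The manual value matches the library's statistic, which shows that the test is simply the mean difference scaled by how noisy each group's mean is.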

### Hypothetical Statement - 2 Chi-Square Test

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no association between Churn and gender. (They are independent.)

Alternative Hypothesis (H₁):
There is an association between Churn and gender.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# Create a contingency table for Churn vs Gender
contingency_table = pd.crosstab(dataset['Churn'], dataset['gender'])

# Perform Chi-Square Test of Independence
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Output the results
print("Chi-square Statistic:", chi2_stat)
print("p-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:\n", expected)

##### Which statistical test have you done to obtain P-Value?

I used the Chi-Square test of independence to obtain the p-value. The p-value is ≥ 0.05, so we fail to reject the null hypothesis: there is no evidence that Churn depends on gender.

##### Why did you choose the specific statistical test?

The Chi-Square Test of Independence is chosen when I want to examine whether two categorical variables are associated (i.e., dependent) or not associated (i.e., independent).
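
The expected frequencies that the test compares against can be derived from the table margins. Below is a small sketch with an invented 2x2 table (the counts are assumptions, not the notebook's crosstab):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = Churn (No, Yes), columns = gender (Female, Male)
observed = np.array([[260, 255],
                     [95, 90]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Under independence: expected cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

print(expected.round(2))
print(p_value)
```

When the observed counts sit close to these expected counts, as in this toy table, the statistic is small and the p-value is large, so independence is not rejected.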

### Hypothetical Statement - 3 ANOVA test

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
Mean MonthlyCharges are equal across all contract types.

$\mu_1 = \mu_2 = \mu_3$



Alternative Hypothesis (H₁):
At least one group's mean MonthlyCharges is different.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway
# Drop rows with missing MonthlyCharges or Contract values
dataset = dataset.dropna(subset=['MonthlyCharges', 'Contract'])

# Group MonthlyCharges by Contract type
month_to_month = dataset[dataset['Contract'] == 'Month-to-month']['MonthlyCharges']
one_year = dataset[dataset['Contract'] == 'One year']['MonthlyCharges']
two_year = dataset[dataset['Contract'] == 'Two year']['MonthlyCharges']

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(month_to_month, one_year, two_year)

# Print results
print("F-statistic:", f_stat)
print("p-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

I used a one-way ANOVA test to obtain the p-value. Consistent with the significant result reported in the Project Summary, the p-value is < 0.05 → Reject H₀ → mean MonthlyCharges differs for at least one contract type.


##### Why did you choose the specific statistical test?

The ANOVA (Analysis of Variance) test is used when you want to compare the means of a continuous variable across more than two groups.
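
The F-statistic is the ratio of between-group variance to within-group variance. The sketch below computes it by hand on synthetic groups and checks it against `scipy.stats.f_oneway`; the group means, spreads, and sizes are made up for illustration:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical monthly charges for three contract types
month_to_month = rng.normal(66, 27, 300)
one_year = rng.normal(65, 32, 150)
two_year = rng.normal(61, 35, 150)
groups = [month_to_month, one_year, two_year]

f_stat, p_value = f_oneway(*groups)

# Manual one-way ANOVA
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Variation of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat, f_manual, p_value)
```

A significant ANOVA only says that at least one group mean differs; a post-hoc comparison would be needed to say which contract types differ from each other.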

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
dataset.isnull().sum().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

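The count above includes the 11 NaN values that the TotalCharges conversion introduced in the Data Wrangling step, since none of the later `dropna` calls target that column. No imputation technique has been applied in this notebook; these rows should be imputed or dropped before fitting models that cannot handle NaN.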
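
Should any NaN values remain in TotalCharges, median imputation is one simple option. This is a sketch on a toy column, not the approach the notebook takes; the choice of the median strategy is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries (hypothetical values)
toy = pd.DataFrame({"TotalCharges": [100.0, np.nan, 300.0, 500.0, np.nan]})

# The median is robust to the right-skew typical of cumulative charges
imputer = SimpleImputer(strategy="median")
toy["TotalCharges"] = imputer.fit_transform(toy[["TotalCharges"]]).ravel()

print(toy)
```

Dropping the affected rows with `dropna(subset=['TotalCharges'])` is an equally reasonable alternative when only a handful of rows are missing.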

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments.
# Box plots of the numeric columns only
sns.boxplot(data=dataset.select_dtypes(include=np.number))
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

There are no significant outliers in the numerical columns of the dataset.
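
A box plot flags points beyond 1.5 times the interquartile range, and the same rule can be applied numerically. The helper `iqr_outliers` and the sample values below are hypothetical, shown only to illustrate the check:

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside 1.5 * IQR from the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

# Made-up charges with one extreme value
charges = pd.Series([20, 25, 30, 45, 50, 70, 80, 95, 100, 400])
outliers = iqr_outliers(charges)
print(outliers.tolist())
```

Applying such a helper to tenure, MonthlyCharges, and TotalCharges gives a count of flagged values per column, which backs up the visual impression from the box plot.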

### 3. Categorical Encoding

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Identify the categorical columns
categorical_columns = dataset.select_dtypes(include=['object']).columns

# Apply OneHotEncoding
encoder = OneHotEncoder(sparse_output=False, drop=None, handle_unknown='ignore')
encoded_data = encoder.fit_transform(dataset[categorical_columns])

# Create a new DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data,
                          columns=encoder.get_feature_names_out(categorical_columns),
                          index=dataset.index)

# Drop the original categorical columns
dataset.drop(columns=categorical_columns, inplace=True)

# Concatenate the original DataFrame with the encoded DataFrame
dataset = pd.concat([dataset, encoded_df], axis=1)



In [None]:
dataset.drop('Churn_No', axis=1, inplace=True)

In [None]:
dataset.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

One-Hot Encoding converts each categorical column into multiple binary (0/1) columns—one for each category level.

One-Hot Encoding was used in this code:
1. Suitable for nominal categorical data (i.e., unordered categories like gender, city, product type).
One-hot encoding treats all categories as equally important without assuming any order or ranking.

2. Required for many ML models (e.g., Linear Regression, Logistic Regression, SVM, etc.)
These models cannot handle string or label-type categorical values directly, so encoding is necessary.

3. Scikit-learn compatibility:
This notebook uses sklearn.preprocessing.OneHotEncoder, which is a standard and robust way to convert categorical features.

4. Avoids introducing bias:
Unlike Label Encoding, which can create a false sense of order (e.g., assigning Male = 0, Female = 1), One-Hot does not assume any order.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

There are no text columns in the given dataset which I am working on. So, Skipping this part.

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
sns.scatterplot(x='tenure', y='TotalCharges', data=dataset)
plt.show()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# rename churn_yes to churn
dataset.rename(columns={'Churn_Yes': 'Churn'}, inplace=True)

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
features = [i for i in dataset.columns if i not in ['Churn']]

In [None]:
# Scaling your data
scaler = StandardScaler()
X = scaler.fit_transform(dataset[features])

##### Which method have you used to scale your data and why?

The independent features are on different scales, so I used StandardScaler to bring them onto a common scale with zero mean and unit variance.
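
As a rough illustration of what standardization does, the sketch below learns the mean and standard deviation from training values only and then applies them to both parts, which keeps test-set information out of the scaling step. The tenure values and the helper names `fit_standardizer` and `standardize` are hypothetical; StandardScaler performs the same arithmetic per column.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def standardize(column, mu, sigma):
    """Apply (x - mean) / std using statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]

# Hypothetical tenure values (months)
train_tenure = [1, 12, 24, 36, 72]
test_tenure = [2, 60]

mu, sigma = fit_standardizer(train_tenure)
train_scaled = standardize(train_tenure, mu, sigma)
test_scaled = standardize(test_tenure, mu, sigma)  # reuses the training statistics

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

With scikit-learn, the equivalent pattern is to call `fit_transform` on the training features and `transform` on the test features.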

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not required for this dataset.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
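# Split your data to train and test. Choose Splitting ratio wisely.
# Split data into features X and target y.
# Note: as written, this reassigns X to the unscaled feature columns, so the
# StandardScaler output from the Data Scaling step is not what the models are trained on.
X = dataset.drop('Churn', axis=1)
y = dataset['Churn']
# Split into train and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)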

##### What data splitting ratio have you used and why?

In this dataset, we've used a data splitting ratio of 80:20, meaning:


*   80% of the data is used for training (X_train, y_train)
*   20% of the data is used for testing (X_test, y_test)


Why this 80:20 ratio is used:


1.   Balanced Trade-off:


*   80% training data allows the model to learn well from a large portion of the dataset.
*   20% testing data is sufficient to evaluate the model’s performance on unseen data.

2.   Industry Standard:


*   This is a commonly used split in machine learning as a good starting point, especially when dataset size is reasonable.

3. Helps Detect Overfitting and Underfitting:

*   Enough training data reduces the risk of underfitting.
*   A separate test set helps detect overfitting by checking how well the model generalizes to new, unseen data.
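
Because the churn classes are not equally frequent, it can help to keep the class mix the same in both parts of the split. The sketch below shows the idea in plain Python; the 70/30 labels and the `stratified_split` helper are hypothetical, and in scikit-learn the same effect comes from passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # take the same share from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def churn_rate(labels, indices):
    """Share of churned customers (label 1) among the given rows."""
    return sum(labels[i] for i in indices) / len(indices)

# Hypothetical labels: 70 retained (0) and 30 churned (1)
y_example = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(y_example)

print(len(train_idx), len(test_idx))
print(churn_rate(y_example, train_idx), churn_rate(y_example, test_idx))
```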


### 9. Handling Imbalanced Dataset

In [None]:
print(dataset.Churn.value_counts())
print(" ")
# Dependent Variable Column Visualization
dataset['Churn'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Not Churn(%)','Churn(%)'],
                               colors=['skyblue','red'],
                               explode=[0,0]
                              )

##### Do you think the dataset is imbalanced? Explain Why.

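Yes. The value counts and pie chart above show that churned customers are a clear minority of the dataset. Class imbalance matters in supervised classification because a model can reach high accuracy by favouring the majority class while missing most of the minority class.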

In [None]:
# Handling Imbalanced Dataset (If needed)
# Shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I chose SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes, which are split roughly 73:27 between retained and churned customers in the standard Telco dataset. The cell above only prints the shapes of the train and test sets; SMOTE has to be applied to the training data, and only the training data, before fitting for it to take effect.

SMOTE addresses the problems that arise when training on an imbalanced dataset. Imbalanced datasets are common in practice, and many ML algorithms are biased towards the majority class when trained on them, so rebalancing techniques such as SMOTE can improve minority-class performance.

The simplest balancing schemes oversample the minority class by duplicating its samples, or undersample the majority class by discarding some of its samples.

SMOTE instead generates synthetic minority samples by interpolating between an original minority point and one of its nearest minority-class neighbours. It can be seen as a more sophisticated form of oversampling, or as a data augmentation algorithm for tabular data.

Because the synthetic points differ slightly from the originals rather than duplicating them, SMOTE is less prone to the overfitting that plain duplication can cause.

For these reasons, SMOTE is my preferred technique for balancing this dataset.
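
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference implementation: the minority points are hypothetical (tenure, MonthlyCharges) pairs, and `smote_sketch` is an illustrative helper. In practice, the `SMOTE` class from the imbalanced-learn package (with `fit_resample` applied to the training split only) is the usual route.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical churned customers as (tenure, MonthlyCharges)
minority = [(1.0, 70.0), (2.0, 80.0), (3.0, 75.0), (5.0, 95.0)]
synthetic = smote_sketch(minority, n_new=5)

for point in synthetic:
    print(point)
```

Every synthetic point lies on a line segment between two real minority points, so it stays within the range of the observed minority data.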

## ***7. ML Model Implementation***

In [None]:
# Defining a function to print classification evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluate_model(model, X_train, y_train, X_test, y_test):

  '''takes a fitted classifier and the train/test splits to print evaluation metrics,
  plot the top 20 important features (when the model exposes them),
  and returns a list of the model scores'''

  # Class predictions for accuracy, precision, recall and F1
  y_train_pred = model.predict(X_train)
  y_pred = model.predict(X_test)

  # Predicted churn probability, used for ROC AUC
  y_score = model.predict_proba(X_test)[:, 1]

  # Calculating evaluation metrics (precision, recall and F1 refer to the churn class)
  train_accuracy = accuracy_score(y_train, y_train_pred)
  test_accuracy = accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred)
  recall = recall_score(y_test, y_pred)
  f1 = f1_score(y_test, y_pred)
  roc_auc = roc_auc_score(y_test, y_score)

  # Printing evaluation metrics
  print("Train Accuracy :", train_accuracy)
  print("Test Accuracy :", test_accuracy)
  print("Precision :", precision)
  print("Recall :", recall)
  print("F1 :", f1)
  print("ROC AUC :", roc_auc)

  # Feature importances: tree ensembles expose feature_importances_, linear models coef_
  importance = getattr(model, 'feature_importances_', None)
  if importance is None and hasattr(model, 'coef_'):
    importance = np.absolute(model.coef_).ravel()

  if importance is not None:
    feat = pd.Series(importance, index=X_train.columns)
    plt.figure(figsize=(9,7))
    plt.title('Feature Importances (top 20) for '+type(model).__name__, fontsize = 15)
    plt.xlabel('Relative Importance')
    feat.nlargest(20).plot(kind='barh')

  model_score = [train_accuracy, test_accuracy, precision, recall, f1, roc_auc]
  return model_score

In [None]:
# Create a score dataframe
score = pd.DataFrame(index = ['Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1', 'ROC AUC'])

### ML Model - 1 : Random Forest

In [None]:
# ML Model - 1 Implementation
# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier()

# Fit the Algorithm
rf_model.fit(X_train,y_train)

# Predict on the model
# Making predictions on train and test data
train_class_preds = rf_model.predict(X_train)
test_class_preds = rf_model.predict(X_test)

In [None]:
# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_train,train_class_preds)
test_accuracy = accuracy_score(y_test,test_class_preds)

print("The accuracy on train dataset is", train_accuracy)
print("The accuracy on test dataset is", test_accuracy)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Confusion matrix for the test set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:

print(metrics.classification_report(y_true=y_train, y_pred=train_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_true=y_train, y_score=train_class_preds))

In [None]:
# y_true goes first, then y_pred; reversing them swaps precision and recall
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))

I first fit a Random Forest classifier with default settings, and the results show clear overfitting.

On the training set, the retained class (Churn = 0) reaches a precision of 100%, recall of 99% and F1-score of 100%, and the churn class reaches 100% on all three. Accuracy is 99%; the averaged precision, recall and F1-score are 100%, 100% and 99%, with a ROC AUC of 99%.

On the test set, the retained class reaches a precision of 82%, recall of 91% and F1-score of 86%. For the churn class, which matters most here, precision is 65%, recall is 46% and F1-score is 53%. Accuracy is 78%; the macro-averaged precision, recall and F1-score are 73%, 68% and 70%, and the ROC AUC computed on hard class predictions is 68%.

The next step is to try to improve these scores with hyperparameter tuning.
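
The ROC AUC values in this notebook are computed from hard 0/1 predictions, which reduces the score to the average of the two class recalls. The sketch below, on hypothetical labels and probabilities, shows how the same model can score differently when its predicted probabilities are used instead. The `roc_auc` helper is an illustrative pairwise implementation; with scikit-learn the scores would come from `model.predict_proba(X_test)[:, 1]`.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted churn probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.45, 0.7, 0.9]

# Hard predictions at the default 0.5 threshold
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, hard)

print(auc_from_proba)   # uses the full ranking of customers
print(auc_from_labels)  # equals (recall of class 0 + recall of class 1) / 2
```

Thresholding discards the ranking information, so the label-based value understates how well the model orders customers by risk.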

In [None]:
importances = rf_model.feature_importances_

importance_dict = {'Feature' : list(X.columns), # Use X.columns to get the feature names
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)

In [None]:
importance_df.sort_values(by=['Feature Importance'],ascending=False)
features = X.columns
importances = rf_model.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(10,8))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='red', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score


# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize model
rf = RandomForestClassifier(random_state=42)

# Use StratifiedKFold for balanced CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search with cross-validation
grid_search = GridSearchCV(
    rf, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict on test set
y_pred = best_rf.predict(X_test)

# Results
print("✅ Best Parameters:", grid_search.best_params_)
print("🎯 Accuracy on Test Set:", accuracy_score(y_test, y_pred))
print("📄 Classification Report:\n", classification_report(y_test, y_pred))


In [None]:

# Reuse the tuned Random Forest from the grid search
# (GridSearchCV refits the best estimator on the full training set by default)
rf = best_rf

# Predictions
train_preds = rf.predict(X_train)
test_preds = rf.predict(X_test)

# Accuracy
print("Train Accuracy:", accuracy_score(y_train, train_preds))
print("Test Accuracy:", accuracy_score(y_test, test_preds))

# Confusion Matrix
train_cm = confusion_matrix(y_train, train_preds)
test_cm = confusion_matrix(y_test, test_preds)

# 🔲 Plot confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(train_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Train Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

sns.heatmap(test_cm, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('Test Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

# 📊 Plot classification report for test set
report = classification_report(y_test, test_preds, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop("accuracy")

report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6))
plt.title("Test Set Classification Metrics")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Tuned Random Forest, training set: y_true goes first, then y_pred
tuned_train_preds = best_rf.predict(X_train)
print(metrics.classification_report(y_train, tuned_train_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, tuned_train_preds))

In [None]:
# Tuned Random Forest, test set
tuned_test_preds = best_rf.predict(X_test)
print(metrics.classification_report(y_test, tuned_test_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, tuned_test_preds))

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which applies grid search with cross-validation to find the hyperparameters that give the best model performance.

The goal is to find the hyperparameter values that generalize best. Manual search by trial and error is possible, but it takes a long time even for a single model.

Methods such as random search and grid search automate this. Grid search evaluates every combination of the specified hyperparameter values and selects the best-scoring one, which becomes time-consuming and expensive as the number of hyperparameters and values grows. The grid above has 2 × 3 × 2 × 2 = 24 combinations.

GridSearchCV scores each combination with cross-validation on the training data, here 5 stratified folds, so the choice does not depend on a single train/validation split.

That is why I used GridSearchCV for hyperparameter optimization.
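
To see where the cost of grid search comes from, the sketch below enumerates the Random Forest grid from the tuning cell with `itertools.product` and counts the model fits. It only counts combinations and does not train anything; `n_splits` mirrors the 5-fold setting used above.

```python
from itertools import product

# Same grid as in the Random Forest tuning cell
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
n_splits = 5  # folds used for cross-validation

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * n_splits

print(len(combinations), "combinations")
print(n_fits, "cross-validation fits, plus one final refit of the best combination")
print(combinations[0])
```

Adding one more hyperparameter with three candidate values would triple both numbers, which is why large grids are often replaced by a randomized search.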

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

On the training set, the retained class (Churn = 0) reaches a precision of 100%, recall of 83% and F1-score of 87%. For the churn class, precision is 47%, recall is 65% and F1-score is 54%. Accuracy is 91%; the averaged precision, recall and F1-score are 69%, 74% and 71%, with a ROC AUC of 56%.

Overfitting is reduced, at the cost of somewhat lower scores.

On the test set, the retained class reaches a precision of 91%, recall of 83% and F1-score of 87%. For the churn class, precision is 13%, recall is 90% and F1-score is 23%. Accuracy is 91%; the averaged precision, recall and F1-score are 69%, 74% and 71%, with a ROC AUC of 56%.

Recall improved noticeably, but the other scores decreased.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 2 : XGBoost

In [None]:
# ML Model - 2 Implementation
# Create an instance of the XGBClassifier
xg_model = XGBClassifier()

# Fit the Algorithm
xg_models=xg_model.fit(X_train,y_train)

# Predict on the model
# Making predictions on train and test data

train_class_preds = xg_models.predict(X_train)
test_class_preds = xg_models.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Confusion matrix for the test set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# y_true goes first, then y_pred; reversing them swaps precision and recall
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))

In [None]:
importances = xg_model.feature_importances_

importance_dict = {'Feature' : list(X.columns),
                   'Feature Importance' : importances}

importance_df = pd.DataFrame(importance_dict)
importance_df['Feature Importance'] = round(importance_df['Feature Importance'],2)
importance_df.sort_values(by=['Feature Importance'],ascending=False)

In [None]:
features = X.columns
importances = xg_model.feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(figsize=(10,8))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='red', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Define XGBoost model
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=xgb,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1,
                           verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# Predictions
y_pred = best_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the training and testing sets using the best model
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

# Confusion matrix - Train
cm_train = confusion_matrix(y_train, y_train_pred)
disp_train = ConfusionMatrixDisplay(confusion_matrix=cm_train, display_labels=best_model.classes_)
disp_train.plot(cmap='Blues')
plt.title("Confusion Matrix - Train Set")
plt.show()

# Confusion matrix - Test
cm_test = confusion_matrix(y_test, y_test_pred)
disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=best_model.classes_)
disp_test.plot(cmap='Blues')
plt.title("Confusion Matrix - Test Set")
plt.show()

# Classification Report
report_train = classification_report(y_train, y_train_pred, output_dict=True)
report_test = classification_report(y_test, y_test_pred, output_dict=True)

# Extract metrics
metrics_to_plot = ['precision', 'recall', 'f1-score']
labels = ['Train', 'Test']
classes = ['0', '1']  # Class labels (No churn, Churn)

# Plot each metric for class 1 (Churn)
for metric in metrics_to_plot:
    values = [report_train['1.0'][metric], report_test['1.0'][metric]]

    plt.figure(figsize=(6, 4))
    sns.barplot(x=labels, y=values, palette='pastel')
    plt.title(f"{metric.title()} for Class 1 (Churn)")
    plt.ylabel(metric.title())
    plt.ylim(0, 1)
    for i, v in enumerate(values):
        plt.text(i, v + 0.02, f"{v:.2f}", ha='center')
    plt.show()

In [None]:
# train_class_preds still holds the predictions of the untuned XGBoost model
# y_true goes first, then y_pred; reversing them swaps precision and recall
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))

On the training set, the retained class (Churn = 0) reaches a precision of 95%, recall of 97% and F1-score of 96%, with a ROC AUC of 91%.

In [None]:
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))

On the test set, the retained class reaches a precision of 84%, recall of 89% and F1-score of 86%, with a ROC AUC of 70%.

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV again, for the same reasons as for the Random Forest model: it evaluates every combination in the grid with cross-validation and selects the best-scoring one.

The XGBoost grid above has 3 × 3 × 3 × 2 = 54 combinations, each scored with 5-fold cross-validation, so the search runs 270 fits plus a final refit of the best combination on the full training set.

### ML Model - 3 : KNeighbors

In [None]:
# ML Model - 3 Implementation
# Create an instance of the KNeighborsClassifier
knn_model = KNeighborsClassifier()

# Fit the Algorithm
knn_models=knn_model.fit(X_train,y_train)

# Predict on the model
# Making predictions on train and test data

train_class_preds = knn_models.predict(X_train)
test_class_preds = knn_models.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Confusion matrix for the test set

labels = ['Retained', 'Churned']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) # annot=True annotates cells; fmt='d' shows integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# y_true goes first, then y_pred; reversing them swaps precision and recall
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_train, train_class_preds))

On the training set, the retained class (Churn = 0) reaches a precision of 85%, recall of 92% and F1-score of 89%, with a ROC AUC of 74%.

In [None]:
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score")
print(metrics.roc_auc_score(y_test, test_class_preds))

On the test set, the retained class reaches a precision of 83%, recall of 88% and F1-score of 85%, with a ROC AUC of 68%.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:

# Hyperparameter tuning
param_grid = {'n_neighbors': list(range(3, 16)), 'weights': ['uniform', 'distance']}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Best model
best_knn = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# Predictions
y_train_pred = best_knn.predict(X_train)
y_test_pred = best_knn.predict(X_test)



In [None]:
# Accuracy and classification report
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report:\n", classification_report(y_test, y_test_pred))

In [None]:
# Confusion Matrices
cm_train = confusion_matrix(y_train, y_train_pred)
ConfusionMatrixDisplay(cm_train).plot(cmap='Blues')
plt.title("Train Confusion Matrix")
plt.show()

In [None]:
cm_test = confusion_matrix(y_test, y_test_pred)
ConfusionMatrixDisplay(cm_test).plot(cmap='Blues')
plt.title("Test Confusion Matrix")
plt.show()

In [None]:
# Visualize evaluation metrics
report_test = classification_report(y_test, y_test_pred, output_dict=True)
metrics_to_plot = ['precision', 'recall', 'f1-score']
for metric in metrics_to_plot:
    value = report_test['1.0'][metric]
    plt.figure(figsize=(4, 3))
    sns.barplot(x=['Class 1 (Churn)'], y=[value], palette='mako')
    plt.title(f"{metric.title()} - Class 1 (Churn)")
    plt.ylim(0, 1)
    plt.text(0, value + 0.02, f"{value:.2f}", ha='center')
    plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

To evaluate models for positive business impact in the Telco Customer Churn problem, I considered the following evaluation metrics, not just for technical performance but also for their strategic business value:

1. F1-Score (Especially for Class 1 - Churn)
Why it matters:
Churned customers (class = 1) are the minority class and most critical to identify.

Business impact:
High F1-score means fewer missed churners (false negatives) and fewer false alarms (false positives).

Balanced view:
Combines precision (how many predicted churns actually churned) and recall (how many actual churns we captured).

Chosen as primary metric because it penalizes both kinds of error, while recall (below) reflects the fact that:

❝ Missing a true churner (FN) is costlier than targeting a wrong customer (FP) ❞


2. Recall (Sensitivity) – Focused on Churned Customers
Why it matters:
Recall = TP / (TP + FN) — how many actual churners were identified?

Business impact:
High recall = fewer lost customers = more retention = direct revenue gain.

Chosen because:

❝ It's better to flag more potential churners and apply retention offers than to miss them entirely ❞

3. Precision – Also for Churned Class
Why it matters:
How many of the predicted churns were actually churners?

Business impact:
High precision = fewer wasted resources (offers, discounts, calls) on customers who were not going to churn.

Chosen to optimize operational cost of churn-prevention efforts.

4. Confusion Matrix
Why it matters:
Gives raw insight into:

How many customers we saved (True Positives)

How many we missed (False Negatives)

Business impact:
Directly supports cost-benefit analysis of the model's output.

5. Accuracy (used carefully)
Why it matters:
Gives an overall sense of model performance.

Caution:
Can be misleading in imbalanced datasets like churn (e.g., 80% non-churn = 80% accuracy even if we predict all as "No").

Not the primary metric, but useful when paired with others.
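
The relationship between these metrics is easiest to see from the four confusion-matrix counts. The sketch below uses hypothetical counts for 1,000 customers (800 retained, 200 churned) and compares a model with a baseline that predicts "No" for everyone; `churn_metrics` is an illustrative helper.

```python
def churn_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and accuracy for the churn class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Hypothetical counts, not results from this notebook
model = churn_metrics(tn=720, fp=80, fn=100, tp=100)

# Baseline that never predicts churn
baseline = churn_metrics(tn=800, fp=0, fn=200, tp=0)

print("model   :", model)
print("baseline:", baseline)
```

The baseline reaches 80% accuracy while catching no churners at all, which is why recall and F1 for the churn class are given more weight than accuracy here.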

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Among the models explored (Random Forest, XGBClassifier and KNeighborsClassifier), XGBoost was selected as the final prediction model for Telco customer churn.

Here’s why:
 1. Superior Performance on Key Metrics ->

 In churn prediction, catching as many actual churners as possible (recall) is crucial for retaining revenue. XGBoost delivered the best recall and F1-score for the positive class (Churn) on the test set, and the highest test ROC AUC (70%, against 68% for both Random Forest and KNN).

2. Scalability & Speed ->

    XGBoost is highly optimized for large datasets.

    Much faster at inference than KNN (which needs to compute distances to all training points).

    Suitable for real-time prediction systems.

3. Robust to Imbalanced Data ->

     XGBoost supports built-in class weighting (scale_pos_weight) which helps handle class imbalance (more “No Churn” than “Yes”).

    KNN lacks this built-in balancing, leading to biased results.

4. Explainability for Business Decisions ->

    XGBoost provides:

            Feature importance scores

            SHAP values (advanced interpretability)

    Helps business users understand "why" a customer may churn, making it easier to justify retention actions.

5. Hyperparameter Tuning Results ->

    After applying GridSearchCV, XGBoost consistently outperformed KNN on the test set.

    Tuned parameters like n_estimators, max_depth, and learning_rate gave even better predictive power.
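
For the class-weighting point in item 3 above, a common starting value for `scale_pos_weight` is the ratio of negative to positive training samples. The sketch below computes it on hypothetical labels; the helper function is illustrative, and the models in this notebook were fitted without this setting.

```python
from collections import Counter

def scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive samples, a common starting value for XGBoost."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Hypothetical training labels: 730 retained (0) and 270 churned (1)
y_train_example = [0] * 730 + [1] * 270
weight = scale_pos_weight(y_train_example)

print(round(weight, 2))
```

The value would then be passed as `XGBClassifier(scale_pos_weight=weight)`, so that errors on churned customers count for more during training.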



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The final model is XGBoost, a gradient-boosted ensemble of decision trees. Its predictions can be explained at two levels:

1. Feature Importance with XGBoost ->
After training, XGBoost exposes importance scores (feature_importances_) that can be extracted and plotted, as done above. They show which features most influence churn prediction across the whole dataset.

2. Advanced Explainability with SHAP ->
SHAP (SHapley Additive exPlanations) explains the impact of each feature on a specific prediction.
It answers:
“Which features pushed this prediction toward churn or non-churn, and by how much?”
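
The Shapley idea behind SHAP can be illustrated without the library. The sketch below computes exact Shapley values for one customer by averaging each feature's marginal contribution over all feature orderings. Everything here is an assumption for illustration: `toy_churn_score` is a hand-written scoring rule, not the fitted XGBoost model, and the customer and baseline values are made up. For real tree ensembles, the `shap` package's `TreeExplainer` is the usual tool.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the baseline customer
        previous = predict(current)
        for name in order:
            current[name] = x[name]       # switch one feature to the customer's value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_churn_score(c):
    """Hypothetical hand-written churn score, used only to illustrate the method."""
    score = 0.2
    score += 0.3 if c['tenure'] < 12 else 0.0
    score += 0.2 if c['monthly_charges'] > 80 else 0.0
    # Interaction: short tenure on a month-to-month contract adds extra risk
    score += 0.2 if c['tenure'] < 12 and c['month_to_month'] else 0.0
    return score

customer = {'tenure': 3, 'monthly_charges': 95, 'month_to_month': True}
baseline = {'tenure': 40, 'monthly_charges': 60, 'month_to_month': False}

phi = shapley_values(toy_churn_score, customer, baseline)
print(phi)
print(sum(phi.values()), toy_churn_score(customer) - toy_churn_score(baseline))
```

The contributions add up to the gap between this customer's score and the baseline score, and the interaction term is shared equally between the two features involved.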



## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
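
A minimal sketch of the save, reload and sanity-check steps with the standard `pickle` module follows. The `model` dict, the `predict` helper and the unseen rows are hypothetical stand-ins so that the example runs on its own; a fitted scikit-learn or XGBoost estimator can be passed to `pickle.dump` in the same way, and `joblib.dump` / `joblib.load` follow the same pattern.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model: a plain dict of learned parameters
model = {'tenure_threshold': 12}

def predict(model, rows):
    """Flag a customer as churn (1) when tenure is below the stored threshold."""
    return [1 if row['tenure'] < model['tenure_threshold'] else 0 for row in rows]

# Hypothetical unseen customers
unseen = [{'tenure': 2}, {'tenure': 48}]

# Save the model to a file
path = os.path.join(tempfile.mkdtemp(), 'churn_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back and check that predictions are unchanged
with open(path, 'rb') as f:
    loaded = pickle.load(f)

before = predict(model, unseen)
after = predict(loaded, unseen)
print(before, after)
```

Pickle files can execute code when loaded, so only files from a trusted source should be unpickled.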

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**



*   The XGBoost model gave the best test results of the three models and provides an interpretable, business-aligned solution for churn prediction.

*   It enables the company to proactively retain customers, reduce revenue loss, and build better service strategies using data-driven decisions.

*   Low tenure and high monthly charges are key churn drivers.

*   Long-term contracts and tech support reduce churn.

*   These findings help design personalized retention strategies (e.g., offering better plans to high-risk users).






### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***