<a href="https://colab.research.google.com/github/ibudeX/Customer_Churn_Prediction/blob/main/Bank_Customer_Churn_ML_CLassification_Notebook2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning with Bank Churn Prediction

### What is Machine Learning?

Machine Learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed for every scenario.

Imagine teaching a child to recognize different types of fruits. Instead of giving them a list of rules, you show them many pictures of apples, oranges, and bananas. Over time, the child learns to identify these fruits on their own. Machine learning works in a similar way.

### Real World Applications of Machine Learning

Machine learning is used everywhere in our daily lives:

- Email Spam Detection: Your email provider uses machine learning to identify spam emails and move them to your spam folder.
- Recommendation Systems: Netflix suggests movies you might like, and Amazon recommends products based on your browsing history.
- Voice Assistants: Siri, Alexa, and Google Assistant use machine learning to understand your voice commands.
- Medical Diagnosis: Doctors use machine learning to detect diseases from medical images and predict patient outcomes.
- Financial Fraud Detection: Banks use machine learning to identify fraudulent transactions and protect your money.
- Customer Retention: Companies use machine learning to predict which customers might leave and take action to retain them.

### Our Project: Predicting Bank Customer Churn

In this notebook, we will build machine learning models to predict whether a customer will leave the bank or stay. This problem is called customer churn prediction. In our dataset the outcome is recorded in the DefaultNextMonth column, so throughout the notebook a default is treated as the event we want to predict.

Customer churn is when a customer leaves a bank. For banks, understanding which customers are likely to leave helps them:
- Take proactive steps to retain valuable customers
- Improve credit card conditions and customer satisfaction
- Plan customer acquisition and support needs
- Reduce the costs associated with losing customers

Customer churn is expensive. When a customer leaves, the bank loses the revenue that customer would have generated and usually has to spend on marketing and onboarding to win a replacement.



## Step 1: Import Required Libraries

Libraries are collections of pre-written code that help us perform specific tasks. Instead of writing everything from scratch, we use libraries to make our work easier and faster.

Here are the libraries we will use:

- pandas: For loading and manipulating data in tables
- numpy: For numerical calculations and working with arrays
- matplotlib and seaborn: For creating visualizations and charts
- sklearn: The main machine learning library that contains all the algorithms we need
- xgboost and lightgbm: Advanced machine learning libraries for gradient boosting algorithms

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

np.random.seed(42)

## Step 2: Load the Dataset

Now we will load our customer data from a CSV file. CSV stands for Comma Separated Values, which is a simple file format for storing tabular data.

We use pandas to read the CSV file and store it in a DataFrame. A DataFrame is like a spreadsheet or table where data is organized in rows and columns.

In [None]:
# Load the dataset
df = pd.read_csv("Bank_Default.csv")

In [None]:
# View the first five rows of the data
df.head()

Unnamed: 0,CustomerID,CreditLimit,Gender,Education,MaritalStatus,Age,PaymentStatus_Sept,PaymentStatus_Aug,PaymentStatus_July,PaymentStatus_June,...,BillAmount_June,BillAmount_May,BillAmount_April,PaymentAmount_Sept,PaymentAmount_Aug,PaymentAmount_July,PaymentAmount_June,PaymentAmount_May,PaymentAmount_April,DefaultNextMonth
0,CC000001,151182,Female,2,1,59,2,2,2,-1,...,30870,32881,28115,7265,4273,11946,30484,8466,3364,Yes
1,CC000002,64123,Male,2,1,72,2,2,1,-1,...,25165,17935,17182,10172,8794,13200,21939,16950,16127,Yes
2,CC000003,124986,Male,2,2,49,-1,-1,-1,-1,...,59136,49956,45403,73613,56130,46022,48283,42336,39152,No
3,CC000004,121746,Female,3,2,35,2,2,2,1,...,21247,19753,20451,7875,9906,11609,359,18814,17314,Yes
4,CC000005,45307,Female,3,3,63,2,1,-1,1,...,24680,29644,27716,4952,319,16730,10741,2578,204,No


## Step 3: Understanding the Dataset

Before we build any models, we need to understand what data we have. Let's look at each column and what it represents.

### Column Descriptions

Our dataset contains information about bank customers. Here is what each column means:

1. CustomerID: A unique identifier for each customer (for example, CC000001). This is like an account number that helps the bank keep track of individual customers.

2. CreditLimit: The credit limit on the customer's card. In this dataset it ranges from 10,000 to 380,723.

3. Gender: Whether the customer is Male or Female.

4. Education: The education level of the customer, stored as an integer code from 1 to 4. The dataset does not document what each code means.

5. MaritalStatus: The marital status of the customer, stored as an integer code from 1 to 3. As with Education, the meaning of each code is not documented.

6. Age: The age of the customer in years, from 21 to 74 in this dataset.

7. PaymentStatus_Sept, PaymentStatus_Aug, PaymentStatus_July, PaymentStatus_June, PaymentStatus_May, PaymentStatus_April: The repayment status for each of the six months, stored as an integer code. Values start at -1 and go up to 6 (in September), and -1 is the value for at least half of the customers in every month.

8. BillAmount_Sept, BillAmount_Aug, BillAmount_July, BillAmount_June, BillAmount_May, BillAmount_April: The amount billed to the customer in each of the six months.

9. PaymentAmount_Sept, PaymentAmount_Aug, PaymentAmount_July, PaymentAmount_June, PaymentAmount_May, PaymentAmount_April: The amount the customer paid in each of the six months.

10. DefaultNextMonth: This is our target variable. It tells us whether the customer left the bank or not. A value of Yes means the customer left (default occurred), and No means the customer is still with the bank.

### What Are We Trying to Predict?

Our goal is to predict the DefaultNextMonth column. We want to build a model that can look at the other columns and predict whether a customer will leave the bank or not. This is called a classification problem because we are classifying customers into two categories: those who will leave and those who will stay.

In [None]:
df.describe()

Unnamed: 0,CreditLimit,Education,MaritalStatus,Age,PaymentStatus_Sept,PaymentStatus_Aug,PaymentStatus_July,PaymentStatus_June,PaymentStatus_May,PaymentStatus_April,...,BillAmount_July,BillAmount_June,BillAmount_May,BillAmount_April,PaymentAmount_Sept,PaymentAmount_Aug,PaymentAmount_July,PaymentAmount_June,PaymentAmount_May,PaymentAmount_April
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,123841.9072,2.350567,1.5982,47.492667,0.3284,-0.100967,-0.3408,-0.3004,-0.317767,-0.2823,...,49453.899333,49113.525367,48833.8978,48467.731267,30989.378267,35347.663367,36757.7153,35833.370133,35255.590067,34449.493367
std,53772.124989,0.84508,0.734331,15.591495,1.721991,1.548091,1.252507,1.263308,1.151825,1.155502,...,40127.263843,40730.229374,41315.848312,41748.633808,31747.206676,34223.657061,35169.144281,35412.148075,35637.041483,35767.913755
min,10000.0,1.0,1.0,21.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,86813.25,2.0,1.0,34.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,17898.75,17257.75,16861.75,16366.75,6391.5,8362.25,9026.75,8488.0,8104.75,7445.75
50%,123179.5,2.0,1.0,47.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,40186.0,39220.0,38197.5,37205.5,19961.5,24573.0,26260.5,24722.5,23658.5,22603.5
75%,160620.0,3.0,2.0,61.0,2.0,1.0,-1.0,1.0,1.0,1.0,...,71697.75,70699.75,70163.75,69453.25,46336.0,53084.75,54162.25,52620.25,51550.5,50188.25
max,380723.0,4.0,3.0,74.0,6.0,5.0,4.0,4.0,3.0,3.0,...,286804.0,286804.0,286804.0,286804.0,207345.0,236043.0,246704.0,239947.0,256423.0,258540.0


## Step 4: Checking for Missing Values

Missing values are empty cells in our dataset where data is absent. For example, if a customer's age is not recorded, that cell would be empty or contain a special value like NaN (Not a Number).

Missing values can cause problems when training machine learning models, so we need to check if our dataset has any missing values and handle them appropriately.

In [None]:
# Check for missing values
df.isnull().sum()

CustomerID             0
CreditLimit            0
Gender                 0
Education              0
MaritalStatus          0
Age                    0
PaymentStatus_Sept     0
PaymentStatus_Aug      0
PaymentStatus_July     0
PaymentStatus_June     0
PaymentStatus_May      0
PaymentStatus_April    0
BillAmount_Sept        0
BillAmount_Aug         0
BillAmount_July        0
BillAmount_June        0
BillAmount_May         0
BillAmount_April       0
PaymentAmount_Sept     0
PaymentAmount_Aug      0
PaymentAmount_July     0
PaymentAmount_June     0
PaymentAmount_May      0
PaymentAmount_April    0
DefaultNextMonth       0
dtype: int64

## Step 5: Data Preprocessing

Data preprocessing is the process of preparing our data for machine learning. Raw data often needs to be cleaned and transformed before we can use it to train models.

### Why Do We Need Preprocessing?

Machine learning algorithms work with numbers. However, our dataset contains some columns with text values, namely Gender (Male, Female) and DefaultNextMonth (Yes, No). We need to convert these text values into numbers.

### Identifying Columns to Remove

Not all columns are useful for prediction. Let's identify which columns we should remove:

- CustomerID: This is just a unique identifier and has no predictive value. Each customer has a different ID, but the ID itself doesn't tell us anything about whether they will leave.

### Encoding Categorical Variables

We need to convert the text columns (Gender and DefaultNextMonth) into numbers. This process is called encoding. We will use Label Encoding, which assigns a unique number to each category.
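As a rough illustration of what Label Encoding does, the sketch below reproduces the idea in plain Python: collect the distinct categories, sort them, and replace each value with its position in the sorted list. scikit-learn's LabelEncoder orders its classes the same way, which is why Female becomes 0 and Male becomes 1 in this notebook. The helper name label_encode is ours for illustration and is not part of any library; the sample values are the first five rows shown by df.head().

```python
def label_encode(values):
    """Map each distinct category to an integer, using sorted order."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# First five rows of the Gender and DefaultNextMonth columns
gender = ["Female", "Male", "Male", "Female", "Female"]
default_next_month = ["Yes", "Yes", "No", "Yes", "No"]

gender_codes, gender_mapping = label_encode(gender)
default_codes, default_mapping = label_encode(default_next_month)

print(gender_mapping, gender_codes)
print(default_mapping, default_codes)
```

Because the codes follow alphabetical order, No maps to 0 and Yes maps to 1, so the positive class (1) is the one we want to detect.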

In [None]:
# check the info on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   CustomerID           30000 non-null  object
 1   CreditLimit          30000 non-null  int64 
 2   Gender               30000 non-null  object
 3   Education            30000 non-null  int64 
 4   MaritalStatus        30000 non-null  int64 
 5   Age                  30000 non-null  int64 
 6   PaymentStatus_Sept   30000 non-null  int64 
 7   PaymentStatus_Aug    30000 non-null  int64 
 8   PaymentStatus_July   30000 non-null  int64 
 9   PaymentStatus_June   30000 non-null  int64 
 10  PaymentStatus_May    30000 non-null  int64 
 11  PaymentStatus_April  30000 non-null  int64 
 12  BillAmount_Sept      30000 non-null  int64 
 13  BillAmount_Aug       30000 non-null  int64 
 14  BillAmount_July      30000 non-null  int64 
 15  BillAmount_June      30000 non-null  int64 
 16  Bill

In [None]:
# select the categorical columns
categorical_column = df.select_dtypes(include='object').columns.tolist()
print(categorical_column)

['CustomerID', 'Gender', 'DefaultNextMonth']


In [None]:
# a copy of the data
df_processed = df.copy()

In [None]:
# drop the irrelevant column
df_processed = df_processed.drop(columns='CustomerID')

In [None]:
# select the categorical columns
categorical_column = df_processed.select_dtypes(include='object').columns.tolist()
print(categorical_column)

['Gender', 'DefaultNextMonth']


In [None]:
# encode the categorical columns
for col in categorical_column:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])



In [None]:
df_processed.head()

Unnamed: 0,CreditLimit,Gender,Education,MaritalStatus,Age,PaymentStatus_Sept,PaymentStatus_Aug,PaymentStatus_July,PaymentStatus_June,PaymentStatus_May,...,BillAmount_June,BillAmount_May,BillAmount_April,PaymentAmount_Sept,PaymentAmount_Aug,PaymentAmount_July,PaymentAmount_June,PaymentAmount_May,PaymentAmount_April,DefaultNextMonth
0,151182,0,2,1,59,2,2,2,-1,1,...,30870,32881,28115,7265,4273,11946,30484,8466,3364,1
1,64123,1,2,1,72,2,2,1,-1,-1,...,25165,17935,17182,10172,8794,13200,21939,16950,16127,1
2,124986,1,2,2,49,-1,-1,-1,-1,-1,...,59136,49956,45403,73613,56130,46022,48283,42336,39152,0
3,121746,0,3,2,35,2,2,2,1,-1,...,21247,19753,20451,7875,9906,11609,359,18814,17314,1
4,45307,0,3,3,63,2,1,-1,1,1,...,24680,29644,27716,4952,319,16730,10741,2578,204,0


## Step 6: Splitting the Data into Features and Target

In machine learning, we separate our data into two parts:

1. Features (X): These are the input columns that we use to make predictions. Features are the information we know about each customer, such as age, credit limit, and payment history.

2. Target (y): This is the output column that we want to predict. In our case, it's the DefaultNextMonth column that tells us whether a customer left or stayed.

Think of it like this: Features are the clues, and the target is the answer we're trying to guess.

In [None]:
# split into features and target
X = df_processed.drop('DefaultNextMonth', axis=1) # input
y = df_processed['DefaultNextMonth'] # target

In [None]:
X #input

Unnamed: 0,CreditLimit,Gender,Education,MaritalStatus,Age,PaymentStatus_Sept,PaymentStatus_Aug,PaymentStatus_July,PaymentStatus_June,PaymentStatus_May,...,BillAmount_July,BillAmount_June,BillAmount_May,BillAmount_April,PaymentAmount_Sept,PaymentAmount_Aug,PaymentAmount_July,PaymentAmount_June,PaymentAmount_May,PaymentAmount_April
0,151182,0,2,1,59,2,2,2,-1,1,...,30313,30870,32881,28115,7265,4273,11946,30484,8466,3364
1,64123,1,2,1,72,2,2,1,-1,-1,...,35790,25165,17935,17182,10172,8794,13200,21939,16950,16127
2,124986,1,2,2,49,-1,-1,-1,-1,-1,...,46488,59136,49956,45403,73613,56130,46022,48283,42336,39152
3,121746,0,3,2,35,2,2,2,1,-1,...,25320,21247,19753,20451,7875,9906,11609,359,18814,17314
4,45307,0,3,3,63,2,1,-1,1,1,...,19742,24680,29644,27716,4952,319,16730,10741,2578,204
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,180219,0,3,1,66,-1,-1,-1,-1,-1,...,47300,57387,47470,54000,69113,48130,41332,48003,46660,50039
29996,218144,1,2,1,47,-1,-1,-1,-1,-1,...,81635,97589,83871,85993,93088,81768,79203,85210,69553,12691
29997,164285,0,2,1,65,-1,-1,-1,-1,-1,...,43459,47768,44728,54296,37892,36100,41297,39484,43131,49775
29998,108388,0,1,1,41,1,1,2,1,1,...,33294,26300,22478,25412,5216,19215,356,2472,6486,3741


In [None]:
y # target

0        1
1        1
2        0
3        1
4        0
        ..
29995    0
29996    0
29997    0
29998    1
29999    1
Name: DefaultNextMonth, Length: 30000, dtype: int32

## Step 7: Train-Test Split

### What is Train-Test Split?

Before we train our machine learning models, we need to split our data into two parts:

1. Training Set: This is the data we use to teach the model. The model learns patterns from this data.

2. Testing Set: This is the data we use to evaluate how well the model performs. The model has never seen this data during training.

### Why Do We Need This Split?

Imagine you are studying for an exam. You practice with sample questions (training data), and then you take the actual exam with different questions (testing data). If the exam only had the exact same questions you practiced, you might do well but it wouldn't truly test your understanding. Similarly, we test our model on new data it hasn't seen to check if it really learned the patterns or just memorized the training data.

This concept is called generalization. We want our model to generalize well, meaning it should perform well on new, unseen data, not just the data it was trained on.

### The 80-20 Split

We typically use 80% of our data for training and 20% for testing. This is a common practice in machine learning that gives the model enough data to learn from while keeping sufficient data to evaluate its performance.

In [None]:
df_processed['DefaultNextMonth'].value_counts()

DefaultNextMonth
0    20615
1     9385
Name: count, dtype: int64

In [None]:
# split into training and testing data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42, stratify = y)

In [None]:
y_train.value_counts()

DefaultNextMonth
0    16492
1     7508
Name: count, dtype: int64

In [None]:
y_test.value_counts()

DefaultNextMonth
0    4123
1    1877
Name: count, dtype: int64
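The counts above keep the same class ratio in both sets because the split was stratified on y. The following sketch shows the idea in plain Python under simple assumptions: group the row indices by class, shuffle each group, and send 20% of every group to the test set. The function stratified_split is a hypothetical helper written for this illustration; it is not how train_test_split is implemented internally, and it will not pick the same rows.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Same class sizes as DefaultNextMonth: 20615 zeros and 9385 ones
labels = [0] * 20615 + [1] * 9385
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(train_counts), dict(test_counts))
```

With these class sizes the sketch produces the same per-class counts as the notebook's split, which is the point of stratification: the test set is a small-scale copy of the class balance in the full data.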

## Step 8: Understanding Evaluation Metrics

Before we start building models, we need to understand how to measure their performance. Just like students get grades on exams, machine learning models get evaluated using specific metrics.

### The Four Key Metrics

We will use four main metrics to evaluate our models:

#### 1. Accuracy
Accuracy tells us what percentage of predictions were correct overall. It's calculated as:

Accuracy = (Correct Predictions) / (Total Predictions)

For example, if our model made 100 predictions and 85 were correct, the accuracy is 85%.
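Accuracy is easier to judge against a baseline. As a small sketch, the code below computes the accuracy of a trivial "model" that always predicts the most common class, using the class counts reported by value_counts() in Step 7. The variable names are ours.

```python
# Class counts for DefaultNextMonth, as printed by value_counts() in Step 7
class_counts = {0: 20615, 1: 9385}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# A model that always predicts the majority class is right this often
baseline_accuracy = class_counts[majority_class] / total

print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.4f}")
```

Any model we train should beat this baseline by a clear margin; otherwise it has not learned anything useful about the minority class.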

#### 2. Precision
Precision tells us: Of all the customers we predicted would leave, how many actually left?

Precision = (True Positives) / (True Positives + False Positives)

High precision means when the model predicts default, it's usually right. This is important if risk management efforts are expensive, as we don't want to waste resources on false alarms.

#### 3. Recall
Recall tells us: Of all the customers who actually left, how many did we correctly identify?

Recall = (True Positives) / (True Positives + False Negatives)

High recall means we catch most of the customers who will leave. This is important if missing a departing customer is very costly.

#### 4. F1 Score
F1 Score is the harmonic mean of precision and recall. It provides a single number that balances both metrics.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

F1 Score is useful when you want a balance between precision and recall.

### Confusion Matrix

A confusion matrix is a table that shows four types of predictions:

- True Positives (TP): We predicted default, and the customer actually left. Correct prediction.
- True Negatives (TN): We predicted no default, and the customer actually stayed. Correct prediction.
- False Positives (FP): We predicted default, but the customer actually stayed. Wrong prediction (False alarm).
- False Negatives (FN): We predicted no default, but the customer actually left. Wrong prediction (Missed detection).

Let's create a function to display all these metrics in an organized way.
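One possible version of such a function is sketched below in plain Python, so that every step of the calculation is visible. The name evaluate_predictions and the demo labels are assumptions made for this example; the rest of the notebook uses the ready-made functions from sklearn.metrics instead.

```python
def evaluate_predictions(y_true, y_pred, positive=1):
    """Print and return the confusion-matrix counts and the four metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    print(f"TP={tp}  FP={fp}")
    print(f"FN={fn}  TN={tn}")
    print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
          f"Recall={recall:.3f}  F1={f1:.3f}")
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


# Tiny made-up example: 1 = left, 0 = stayed
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]
demo_metrics = evaluate_predictions(y_true_demo, y_pred_demo)
```

In the demo, one leaver is missed (a false negative) and one stayer is flagged (a false positive), so all four metrics come out at 0.75.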

# Machine Learning Algorithms

Now we will learn and implement several machine learning algorithms. For each algorithm, we will:

1. Explain the intuition behind how it works
2. Discuss when to use it
3. Explain its strengths and weaknesses
4. Train the model
5. Make predictions
6. Evaluate performance

Let's begin!

## Algorithm 1: Logistic Regression

### What is Logistic Regression?

Despite having "regression" in its name, Logistic Regression is actually used for classification problems. It's one of the simplest and most widely used machine learning algorithms.

### The Intuition

Imagine you want to predict whether a student will pass or fail an exam based on the number of hours they studied. Logistic Regression draws an S-shaped curve (called a sigmoid curve) that transforms the hours studied into a probability between 0 and 1.

If the probability is above 0.5, we predict pass. If it's below 0.5, we predict fail.

In our case, Logistic Regression looks at features like credit limit, payment status, and bill amounts, and calculates the probability that a customer will leave.

### How It Works Step by Step

1. The algorithm assigns weights to each feature. Features that are more important get higher weights.
2. It multiplies each feature by its weight and adds them up.
3. This sum is transformed using a mathematical function to get a probability between 0 and 1.
4. If the probability is greater than 0.5, the model predicts the customer will leave. Otherwise, it predicts they will stay.
5. The algorithm learns the best weights by adjusting them to minimize prediction errors on the training data.
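Steps 2 to 4 can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical numbers chosen only to show the arithmetic; a real model learns its weights from the training data (step 5).

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Weighted sum of the features, passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two already-scaled features (not learned values)
weights = [0.8, -0.5]
bias = -0.2
customer = [2.0, 1.0]

probability = predict_proba(customer, weights, bias)
prediction = 1 if probability > 0.5 else 0
print(f"probability={probability:.3f}, prediction={prediction}")
```

Here the weighted sum is 0.9, which the sigmoid turns into a probability of about 0.71, so the customer is assigned to class 1.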

### When to Use Logistic Regression

- When you need a simple, interpretable model
- When you want to understand which features are most important
- When you have a binary classification problem (two categories)
- When you need fast training and prediction
- When the relationship between features and target is relatively linear

### Strengths

- Easy to understand and interpret
- Fast to train and make predictions
- Works well when the decision boundary is linear
- Provides probability estimates, not just class predictions
- Less prone to overfitting with small datasets

### Weaknesses

- Assumes a linear relationship between features and the target
- May not perform well with complex patterns in the data
- Cannot automatically learn feature interactions
- Performance can be limited on datasets with complex relationships

Let's now implement Logistic Regression on our credit card default dataset.

In [None]:
# implement a logistic regression model
logistic_model = LogisticRegression(
    max_iter=10, # maximum iteration,
    random_state=42 # for reproducibility
)




In [None]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# train a logistic regression model
logistic_model.fit(X_train,y_train) # model learns from the data / training

In [None]:
predictions = logistic_model.predict(X_test) # model makes predictions


In [None]:
accuracy = accuracy_score(y_test, predictions) # check accuracy

In [None]:
print(accuracy)

0.8603333333333333


## Hyperparameter Tuning

In [None]:
logistic_model = LogisticRegression(
    max_iter=1000,
    random_state=42
)




In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
logistic_model.fit(X_train,y_train) # model learns from the data / training

In [None]:
predictions = logistic_model.predict(X_test) # model makes predictions


In [None]:
accuracy = accuracy_score(y_test, predictions) # check accuracy

In [None]:
print(accuracy)

0.901


## Hyperparameter Tuning 2

### Explanation

The train-test split gave us four objects: X_train, X_test, y_train, y_test.

- X_train: input (features) for the training data
- X_test: input (features) for the testing data
- y_train: output (target) for the training data
- y_test: output (target) for the testing data; these are the correct answers
- model.predict(X_test): the model's predictions for the testing data
- accuracy_score(y_test, predictions): compares the correct answers with the predictions

## Hyperparameter Tuning 2 (continued)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()

In [None]:
X_train_scaled = sc.fit_transform(X_train)

In [None]:
X_test_scaled = sc.transform(X_test)  # reuse the mean and std learned from X_train
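A note on scaling the test set: the usual practice is to learn the mean and standard deviation from the training data only and then apply those same numbers to the test data, so that both sets are measured on one scale and nothing about the test set leaks into training. The plain-Python sketch below illustrates this for a single feature. The helpers fit_scaler and apply_scaler are hypothetical names that mirror what StandardScaler's fit and transform do, using the population standard deviation as StandardScaler does.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)


def apply_scaler(column, mean, std):
    """Standardize values with statistics that were learned elsewhere."""
    return [(value - mean) / std for value in column]


train_column = [10.0, 20.0, 30.0]
test_column = [20.0, 40.0]

# Fit on the training data only, then reuse the same statistics for the test data
mean, std = fit_scaler(train_column)
train_scaled = apply_scaler(train_column, mean, std)
test_scaled = apply_scaler(test_column, mean, std)

print(train_scaled)
print(test_scaled)
```

A test value of 20 maps to 0 because it equals the training mean. Had the test column been fitted separately, the same value would have received a different score, and the model would see inconsistent inputs.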

In [None]:
logistic_model = LogisticRegression(
    solver='liblinear',
    max_iter=1000,
    random_state=42
)


In [None]:
logistic_model.fit(X_train_scaled,y_train) # model learns from the data / training

In [None]:
predictions = logistic_model.predict(X_test_scaled) # model makes predictions


In [None]:
lr_accuracy = accuracy_score(y_test, predictions) # check accuracy

In [None]:
print(lr_accuracy)

0.9026666666666666



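Before looking at precision and recall for this model, here is how the classes are labelled:

- Positive (1): churn
- Negative (0): not churn

This gives four possible outcomes for each prediction:

- TP (True Positive): customers who left, and the model predicted they would leave
- FP (False Positive): customers who did not leave, but the model predicted they would leave
- TN (True Negative): customers who stayed, and the model predicted they would stay
- FN (False Negative): customers who left, but the model predicted they would stay

### Precision
Of all the customers we predicted would churn, how many actually churned?

Precision = TP / (TP + FP)

For example, suppose the model predicts that 100 customers will churn. If 70 of them actually leave and 30 stay, precision is 70 / 100 = 70%.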

In [None]:
precision = precision_score(y_test, predictions) # check precision
print(precision)

0.834108527131783


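### Recall
Of all the customers who actually churn, how many did we catch?

Recall = TP / (TP + FN)

For example, take 200 customers, of whom 150 will leave and 50 will stay. Recall is the share of those 150 leavers that the model correctly flags; the 50 customers who stay do not enter the calculation.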

In [None]:
recall = recall_score(y_test, predictions) # check recall
print(recall)

0.8598827916888652


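## Scenario A

- Class 0 (out of 100%): a = 90%
- Class 1 (out of 100%): b = 5%
- Accuracy: 96%
- Average of the two class scores: (a + b) / 2 = 47.5%

## Scenario B

- Class 0 (out of 100%): a = 90%
- Class 1 (out of 100%): b = 70%
- Accuracy: 89%
- Average of the two class scores: (a + b) / 2 = 80%

Scenario A has the higher accuracy, yet the model almost never gets class 1 right. Averaging the per-class scores exposes this, which is the motivation for looking at the F1 score instead of accuracy alone. Note that (a + b) / 2 is a simplification for intuition: the F1 score itself is the harmonic mean of precision and recall, as defined in Step 8.

### F1 Score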

In [None]:
lr_f1_score = f1_score(y_test, predictions) # check f1_score
print(lr_f1_score)

0.8467995802728226
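The scores printed for the scaled Logistic Regression model can be tied back to the formulas from Step 8. The confusion-matrix counts below are not printed anywhere in the notebook; they are inferred from the reported precision and recall together with the test-set class sizes (4123 zeros and 1877 ones), so treat them as a reconstruction.

```python
# Counts inferred from the printed scores (an assumption, not notebook output)
tp, fp, fn, tn = 1614, 321, 263, 3802

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f}  recall={recall:.4f}")
print(f"f1={f1:.4f}  accuracy={accuracy:.4f}")
```

The F1 score sits between precision and recall, slightly closer to the lower of the two, which is what a harmonic mean does.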


## Algorithm 2: Decision Tree

### What is a Decision Tree?

A Decision Tree is a machine learning algorithm that makes decisions by asking a series of yes/no questions. It's like playing a game of 20 questions to identify something.

### The Intuition

Imagine you are a bank manager trying to identify which customers might leave. You might ask:
- Was their most recent payment late? If yes, they might leave. If no, continue.
- Is their latest bill close to their credit limit? If yes, they might leave. If no, continue.
- Did they pay less than they were billed last month? If yes, they might leave. If no, they'll likely stay.

A Decision Tree works the same way. It asks questions about the data and makes decisions based on the answers.

### How It Works Step by Step

1. The algorithm starts with all the data at the top of the tree (called the root).
2. It finds the best feature and value to split the data. For example, "Is PaymentStatus_Sept greater than 0?"
3. It divides the data into two groups based on this question.
4. It repeats this process for each group, creating more branches.
5. It stops splitting when it reaches a stopping condition, such as:
   - All customers in a group have the same outcome (all leave or all stay)
   - The tree reaches a maximum depth
   - There are too few customers left to split further
6. The final groups at the bottom are called leaves, and they contain the predictions.
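Step 2, finding the best split, can be sketched as follows. The example assumes Gini impurity as the measure of how mixed a group is (this is the default criterion of scikit-learn's DecisionTreeClassifier) and uses a tiny made-up sample of one feature. The function names are ours.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels: 0.0 means the group is pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Try every midpoint between distinct values; keep the purest split."""
    best_threshold, best_impurity = None, float("inf")
    unique = sorted(set(values))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity


# Made-up values for one feature and the matching 0/1 target
payment_status = [-1, -1, 0, 2, 2, 3]
defaulted = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(payment_status, defaulted)
print(f"Best question: is the value <= {threshold}? (weighted Gini {impurity:.3f})")
```

A real tree repeats this search over every feature, picks the best one, and then applies the same procedure again inside each of the two resulting groups.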

### When to Use Decision Trees

- When you need an interpretable model that's easy to explain
- When you want to visualize how decisions are made
- When your data has both numerical and categorical features
- When feature relationships are non-linear
- When you don't want to spend time on feature scaling or normalization

### Strengths

- Very easy to understand and interpret
- Can be visualized as a flowchart
- Handles both numerical and categorical data
- Doesn't require feature scaling
- Automatically learns feature interactions
- Can capture non-linear patterns

### Weaknesses

- Prone to overfitting (memorizing training data instead of learning patterns)
- Can create overly complex trees that don't generalize well
- Small changes in data can lead to very different trees
- May not perform as well as ensemble methods
- Can be biased toward features with more categories

Let's implement a Decision Tree on our dataset.

In [None]:
# Implement a DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = 5,random_state=42)

In [None]:
# Train a DecisionTreeClassifier Model
dt.fit(X_train,y_train) #model learns from the data

In [None]:
predictions = dt.predict(X_test) # model makes predictions

In [None]:
dt_accuracy = accuracy_score(y_test, predictions) # check the accuracy

In [None]:
dt_f1_score = f1_score(y_test, predictions)  # check the f1_score
print(dt_f1_score)

0.8380697050938338


## Algorithm 3: Random Forest

### What is a Random Forest?

A Random Forest is an ensemble learning method that combines multiple Decision Trees to make better predictions. The name "forest" comes from the fact that we're creating many trees.

### The Intuition

Imagine you want to make an important bank decision, like whether to invest in retention efforts for a customer. Instead of asking just one manager, you ask ten managers for their opinions and then go with the majority vote. This usually gives you a better decision than relying on just one person.

Random Forest works the same way. It creates many Decision Trees, each trained on a slightly different subset of the data. When making a prediction, it asks all the trees for their opinion and uses majority voting to make the final decision.

### How It Works Step by Step

1. Create multiple Decision Trees (typically 100 or more).
2. For each tree:
   - Randomly select a subset of the training data (with replacement). This is called bootstrap sampling.
   - Randomly select a subset of features to consider at each split.
   - Train a Decision Tree on this subset.
3. To make a prediction for a new customer:
   - Pass the customer's data through all the trees.
   - Each tree gives its prediction (leave or stay).
   - Take a majority vote. If more trees predict default, the final prediction is default.
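The two ingredients described above, bootstrap sampling and majority voting, are sketched below in plain Python. The tree votes are made-up values and the helper names are ours. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this is a simplified picture.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-ins for ten training rows

sample = bootstrap_sample(rows, rng)
print("bootstrap sample:", sample)

# Made-up predictions from five trees for one customer (1 = leave, 0 = stay)
tree_votes = [1, 0, 1, 1, 0]
final_prediction = majority_vote(tree_votes)
print("final prediction:", final_prediction)
```

Each tree in the forest would be trained on its own bootstrap sample, which is what makes the trees disagree in useful ways.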

### Why Does This Work Better?

Each individual tree might make mistakes, but they make different mistakes because they're trained on different subsets of data and features. When we combine them, the errors tend to cancel out, leading to better overall predictions.

### When to Use Random Forest

- When you want high accuracy
- When you need a robust model that's less prone to overfitting
- When you have enough computational resources (Random Forest is slower than single trees)
- When interpretability is less important than performance
- When working with tabular data

### Strengths

- Generally very accurate
- Reduces overfitting compared to single Decision Trees
- Works well with both numerical and categorical features
- Can handle missing values in some implementations (support varies by library and version)
- Provides feature importance measures
- Requires less parameter tuning than other algorithms

### Weaknesses

- Less interpretable than single Decision Trees
- Slower to train and predict than single trees
- Requires more memory to store multiple trees
- Can be slower for real-time predictions
- May not perform well on very noisy data

Let's implement Random Forest on our dataset.

In [None]:
rf_model = RandomForestClassifier(n_estimators=100,
                                  max_depth=10,
                                  min_samples_split=10,
                                  random_state=42)

In [None]:
rf_model.fit(X_train,y_train) # model learns from data

In [None]:
rf_predictions = rf_model.predict(X_test)

In [None]:
rf_f1_score = f1_score(y_test,rf_predictions)
print(rf_f1_score)

0.8569230769230769


In [None]:
rf_accuracy = accuracy_score(y_test,rf_predictions)
print(rf_accuracy)

0.907



## Algorithm 4: Gradient Boosting

### What is Gradient Boosting?

Gradient Boosting is another ensemble method, but it works differently from Random Forest. Instead of building trees independently, Gradient Boosting builds trees sequentially, where each new tree tries to correct the mistakes made by previous trees.

### The Intuition

Imagine you are learning to predict which customers will leave the bank:
1. You make your first attempt and identify some customers who will leave, but miss others.
2. For your second attempt, you focus specifically on the customers you got wrong before.
3. You continue this process, each time focusing on your remaining mistakes.
4. By the end, the combination of all your attempts gives you a very accurate prediction.

Gradient Boosting works similarly. Each tree focuses on the mistakes of previous trees, gradually improving the overall predictions.

### How It Works Step by Step

1. Start with a simple initial prediction (usually the average).
2. Build a small Decision Tree that predicts the errors from step 1.
3. Add this tree's predictions to the initial predictions to get improved predictions.
4. Build another tree that predicts the remaining errors.
5. Add this new tree's predictions to improve predictions further.
6. Repeat steps 4-5 for a specified number of trees (typically 100-1000).
7. The final prediction is the initial prediction plus the sum of all tree predictions.

Each tree is relatively small and weak (they're called "weak learners"), but when combined, they create a powerful "strong learner."
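
The loop described above can be sketched in a few lines. To keep the arithmetic visible, this example uses a regression problem with squared error, where "the errors" are simply the residuals. The data is synthetic, and the names `X_toy`, `y_toy`, `n_trees`, and `learning_rate` are assumptions for illustration. Scaling each tree's contribution by a learning rate is an addition to the steps above; it mirrors the `learning_rate` parameter of `GradientBoostingClassifier`. For classification, the same idea is applied to the gradient of the log loss rather than to raw residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (assumption, for illustration only).
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
n_trees = 100

# Step 1: start from the average of the target.
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(n_trees):
    # Steps 2 and 4: fit a small tree to the current errors (residuals).
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_toy, residuals)
    # Steps 3 and 5: add a shrunken version of the tree's output to the prediction.
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(initial_mse, final_mse)
```

The training error after all trees is much lower than the error of the initial average, even though each depth-2 tree is weak on its own.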

### When to Use Gradient Boosting

- When you need very high accuracy
- When you have structured/tabular data
- When you can afford longer training times
- When you're participating in machine learning competitions
- When you have sufficient data and computational resources

### Strengths

- Often achieves the highest accuracy among traditional ML algorithms
- Handles different types of data well
- Can capture complex patterns
- Provides feature importance
- Less prone to overfitting than individual trees (with proper tuning)

### Weaknesses

- Slower to train than Random Forest
- Requires careful tuning of parameters
- Can overfit if not properly configured
- Sensitive to outliers
- Less interpretable than simpler models
- Trains sequentially, so cannot be parallelized as easily as Random Forest

Let's implement Gradient Boosting on our dataset.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gb_model = GradientBoostingClassifier(n_estimators=100,
                                      learning_rate=0.1,
                                      max_depth=4,
                                      random_state=42)

In [None]:
gb_model.fit(X_train,y_train) #model learns from the data

In [None]:
gb_predictions = gb_model.predict(X_test)

In [None]:
gb_f1_score = f1_score(y_test,gb_predictions)
print(gb_f1_score)

0.8586900464156781


In [None]:
gb_accuracy = accuracy_score(y_test,gb_predictions)
print(gb_accuracy)

0.9086666666666666


## Algorithm 5: XGBoost (Extreme Gradient Boosting)

### What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. It's an optimized and highly efficient implementation of gradient boosting. XGBoost has become one of the most popular machine learning algorithms, in part because it has featured in many winning solutions to machine learning competitions on tabular data.

### The Intuition

XGBoost works on the same principle as Gradient Boosting (correcting previous mistakes), but it includes several improvements.

Think of Gradient Boosting as a good bank analyst who learns from mistakes. XGBoost is like a brilliant analyst who not only learns from mistakes but also:
- Has better analytical techniques
- Works more efficiently
- Knows when to stop analyzing (avoids overfitting)
- Can analyze multiple aspects simultaneously (parallelization)

### How XGBoost Improves on Gradient Boosting

1. Regularization: XGBoost includes built-in penalties to prevent overfitting, making the model more robust.
2. Handling Missing Values: XGBoost can automatically learn the best way to handle missing data.
3. Parallel Processing: While trees are built sequentially, XGBoost can parallelize operations within each tree, making it faster.
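4. Tree Pruning: XGBoost removes splits whose improvement in the loss is smaller than a threshold (the `gamma` parameter), instead of relying only on a fixed stopping rule.
5. Built-in Cross-Validation: XGBoost ships with a cross-validation helper (`xgboost.cv`) that evaluates the model at each boosting round.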
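
To make the regularization and pruning points more concrete, here is a small sketch of the split-gain formula described in the XGBoost documentation ("Introduction to Boosted Trees"). `G` and `H` are the sums of first and second derivatives of the loss over the rows in a node. The function names and the gradient statistics are hypothetical; `lam` and `gamma` correspond to the `reg_lambda` and `gamma` parameters of `XGBClassifier`.

```python
def leaf_score(G: float, H: float, lam: float) -> float:
    """Quality of a leaf: larger means the leaf reduces the loss more."""
    return G * G / (H + lam)


def optimal_leaf_weight(G: float, H: float, lam: float = 1.0) -> float:
    """Best output value for a leaf; a larger lam shrinks it towards zero."""
    return -G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node, minus the per-split penalty gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for one candidate split.
gain_without_penalty = split_gain(-4.0, 10.0, 3.0, 10.0, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(-4.0, 10.0, 3.0, 10.0, lam=1.0, gamma=2.0)
print(gain_without_penalty, gain_with_penalty)
```

With `gamma=0` this split has a positive gain and would be kept. With `gamma=2` the gain turns negative, so the split is not worth making. That is how a single parameter can control how far trees grow.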

### When to Use XGBoost

- When you need top performance on structured/tabular data
- When you're working on a machine learning competition
- When you have sufficient computational resources
- When you need to handle missing values
- When you want a model that's less prone to overfitting

### Strengths

- Typically among the best-performing algorithms on structured data
- Fast training due to parallel processing
- Built-in regularization helps reduce overfitting
- Handles missing values automatically
- Provides excellent feature importance measures
- Highly customizable with many parameters

### Weaknesses

- Can be complex to tune properly
- Requires understanding of many hyperparameters
- May be overkill for simple problems
- Less interpretable than simpler models
- Requires installation of separate library

Let's implement XGBoost on our dataset.

In [None]:
!pip install xgboost lightgbm



In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_model = XGBClassifier(n_estimators=100,
                          learning_rate=0.1,
                          max_depth=5,
                          random_state=42)

In [None]:
xgb_model.fit(X_train,y_train)

In [None]:
xgb_predictions = xgb_model.predict(X_test)

In [None]:
xgb_f1_score = f1_score(y_test,xgb_predictions)
print(xgb_f1_score)

0.8582375478927203


In [None]:
xgb_accuracy = accuracy_score(y_test,xgb_predictions)
print(xgb_accuracy)

0.9075


## Algorithm 6: LightGBM (Light Gradient Boosting Machine)

### What is LightGBM?

LightGBM is another advanced implementation of gradient boosting, developed by Microsoft. The "Light" in its name refers to its fast training speed and low memory usage.

### The Intuition

If XGBoost is a brilliant analyst, LightGBM is a brilliant analyst who is also incredibly efficient. It's not just smart, it's also remarkably fast.

The key difference in how LightGBM builds trees is that it grows trees leaf-wise instead of level-wise:

- Level-wise (used by most algorithms): Grow all nodes at the same level before moving to the next level.
- Leaf-wise (used by LightGBM): Always split the leaf that will give the maximum reduction in loss.

Think of it like this: Instead of reviewing all customers at each level of analysis, LightGBM focuses on wherever the analysis will improve the most, which might mean completing one area before touching others.
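
The difference between the two growth strategies can be shown with a toy example. The dictionary `CANDIDATES` below is entirely hypothetical: it lists, for each node, how much the loss would drop if we split it and which child nodes that split creates. Both strategies get the same budget of three splits. This is a sketch of the idea, not of LightGBM's actual implementation.

```python
import heapq
from collections import deque

# Hypothetical candidate splits: node -> (loss reduction if split, child nodes created).
CANDIDATES = {
    "root": (10.0, ["A", "B"]),
    "A": (1.0, ["A1", "A2"]),
    "B": (6.0, ["B1", "B2"]),
    "A1": (0.5, []),
    "A2": (0.2, []),
    "B1": (4.0, []),
    "B2": (0.3, []),
}


def level_wise(max_splits):
    """Split nodes level by level (breadth-first), regardless of their gain."""
    order, queue = [], deque(["root"])
    while queue and len(order) < max_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend(CANDIDATES[node][1])
    return order


def leaf_wise(max_splits):
    """Always split the current leaf with the largest loss reduction."""
    order = []
    heap = [(-CANDIDATES["root"][0], "root")]  # max-heap via negated gain
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in CANDIDATES[node][1]:
            heapq.heappush(heap, (-CANDIDATES[child][0], child))
    return order


def total_gain(order):
    return sum(CANDIDATES[node][0] for node in order)


print(level_wise(3), total_gain(level_wise(3)))  # ['root', 'A', 'B'] 17.0
print(leaf_wise(3), total_gain(leaf_wise(3)))    # ['root', 'B', 'B1'] 20.0
```

With the same number of splits, the leaf-wise strategy reaches a larger total loss reduction, but it does so by growing one branch deeper than the other. That lopsided growth is why LightGBM limits tree size with parameters such as `num_leaves` and `max_depth`.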

### How LightGBM Is Different

1. Leaf-wise tree growth often leads to better accuracy, but it can overfit unless the tree size is limited.
2. Uses histogram-based algorithms to bin continuous features, making it faster.
3. Can handle large datasets very efficiently.
4. Supports categorical features directly, without one-hot encoding.
5. Can use gradient-based sampling (GOSS, Gradient-based One-Side Sampling) to focus on harder examples.
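
To see why histogram binning speeds things up, the sketch below bins one continuous feature with NumPy and compares the number of split thresholds a tree would need to check. The feature `balance` is synthetic and the choice of 16 quantile bins is an assumption for illustration. LightGBM's own binning is more elaborate, and the number of bins is controlled by its `max_bin` parameter.

```python
import numpy as np

# Hypothetical continuous feature: 10,000 account balances.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 200_000, size=10_000)

n_bins = 16
# Interior bin edges chosen at evenly spaced quantiles of the feature.
edges = np.quantile(balance, np.linspace(0, 1, n_bins + 1)[1:-1])
# Replace each raw value by the index of its bin (0 .. n_bins - 1).
binned = np.digitize(balance, edges)

# Without binning, a split can fall between any two distinct values.
n_exact_candidates = len(np.unique(balance)) - 1
# With binning, a split can only fall on a bin boundary.
n_hist_candidates = n_bins - 1
print(n_exact_candidates, n_hist_candidates)
```

The tree now evaluates a handful of thresholds per feature instead of thousands, at the cost of slightly coarser split points.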

### When to Use LightGBM

- When you have a large dataset and need fast training
- When memory is limited
- When you need high accuracy
- When you have categorical features
- When you want to try an alternative to XGBoost

### Strengths

- Extremely fast training speed
- Low memory usage
- Often matches or exceeds XGBoost's accuracy
- Handles large datasets efficiently
- Supports categorical features natively
- Provides good feature importance measures

### Weaknesses

- More prone to overfitting on small datasets
- Sensitive to hyperparameter tuning
- Less stable than XGBoost on small datasets
- Requires understanding of its unique parameters
- May be too complex for simple problems

Let's implement LightGBM on our dataset.

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lgb_model = LGBMClassifier(n_estimators=100,
                           learning_rate=0.1,
                           max_depth=5,
                           random_state=42, verbose=-1)

In [None]:
lgb_model.fit(X_train,y_train)

In [None]:
lgb_predictions = lgb_model.predict(X_test)

In [None]:
lgb_f1_score = f1_score(y_test,lgb_predictions)
print(lgb_f1_score)

0.8602205693767633


In [None]:
lgb_accuracy = accuracy_score(y_test,lgb_predictions)
print(lgb_accuracy)

0.9091666666666667


# Model Optimization and Validation

Now that we've trained 6 different machine learning models, it's time to take our analysis to the next level. In this section, we will:

1. **Compare all model performances** to understand which algorithms work best for our data
2. **Optimize our models** using hyperparameter tuning
3. **Validate using K-Fold Cross Validation** for more reliable performance estimates
4. **Save our models** for future use and deployment

Let's start by comparing the performance of all our models.
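
As a minimal sketch, the scores printed in the cells above can be collected into one table and ranked. Only the four ensemble models from this section are included, and the numbers are copied from the outputs shown earlier rather than recomputed. The dictionary name `results` and its layout are assumptions for this example.

```python
# Scores copied from the printed outputs of the cells above.
results = {
    "Random Forest": {"f1": 0.8569230769230769, "accuracy": 0.907},
    "Gradient Boosting": {"f1": 0.8586900464156781, "accuracy": 0.9086666666666666},
    "XGBoost": {"f1": 0.8582375478927203, "accuracy": 0.9075},
    "LightGBM": {"f1": 0.8602205693767633, "accuracy": 0.9091666666666667},
}

# Rank the models from best to worst F1 score.
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)

print(f"{'Model':<20}{'F1':>8}{'Accuracy':>10}")
for name, scores in ranked:
    print(f"{name:<20}{scores['f1']:>8.4f}{scores['accuracy']:>10.4f}")

best_name = ranked[0][0]
f1_spread = ranked[0][1]["f1"] - ranked[-1][1]["f1"]
print(best_name, round(f1_spread, 4))
```

LightGBM comes out on top here, but the gap between the best and worst F1 score is only about 0.003, so these four models perform almost identically on this test set.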


## Hyperparameter Tuning

### What Are Hyperparameters?

When we trained our models earlier, we picked a few settings by hand (for example, `n_estimators=100` and a fixed `max_depth`) and left the rest at their defaults. These settings are called **hyperparameters**, and adjusting them can improve a model's performance. Think of hyperparameters like the settings on your phone camera:

- **Default settings** work okay for most photos, but...
- **Adjusting settings** (brightness, contrast, focus) can give you much better photos in specific situations

Similarly, by tuning our model's hyperparameters, we can often improve its accuracy and performance.

### What Is Hyperparameter Tuning?

Hyperparameter tuning is the process of finding the best combination of settings for our model. We try different combinations and see which one gives us the best results.

### Why Not Tune Everything?

Each model has many hyperparameters, but tuning all of them would:
- Take too much time (hours or even days)
- Require massive computational resources
- Risk overfitting to our specific dataset

**For a 2-hour class and real-world efficiency**, we focus on tuning only the **most impactful hyperparameters** - the ones that typically give us the biggest performance improvements.

### Two Approaches to Hyperparameter Tuning

1. **GridSearchCV**: Tests every possible combination of parameters we specify. More thorough but slower.
2. **RandomizedSearchCV**: Tests random combinations of parameters. Faster and often finds good results.

We'll use **GridSearchCV** for this tutorial because:
- Our parameter grid is small (only 3-4 parameters per model)
- We want to systematically explore the best combinations
- It's easier to understand for beginners
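
As a rough sketch of what a grid search looks like in scikit-learn, the example below tunes two hyperparameters of a Random Forest. To stay self-contained and quick, it uses synthetic data and a deliberately tiny grid. The names `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are made up for this example, and the grid values are not the ones used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)

# Two values for each of two hyperparameters: 2 x 2 = 4 combinations.
demo_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5],
}

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=demo_grid,
    scoring="f1",  # same metric we printed for each model above
    cv=3,          # each combination is scored with 3-fold cross validation
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

Every combination is trained once per fold, so this small search already fits 4 x 3 = 12 models, plus one final refit on all the data. That multiplication is why large grids become slow.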



In [None]:
# XGBoost: candidate values for each hyperparameter

n_estimators = [100, 200, 300]
max_depth = [3, 5, 7, 9]

In [None]:
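# GridSearchCV tries every (n_estimators, max_depth) combination: 3 x 4 = 12 in total
# (100, 3), (100, 5), (100, 7), (100, 9)
# (200, 3), (200, 5), (200, 7), (200, 9)
# (300, 3), (300, 5), (300, 7), (300, 9)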

In [None]:
# RandomizedSearchCV tries only a random sample of (n_estimators, max_depth) combinations, for example:
# (100, 3), (200, 5), (300, 7), (200, 7), (300, 5)

