# Predicting client churn using supervised machine learning

**Introduction**
Beta Bank wants to predict which clients are most at risk of leaving the bank. I will use the client behavior data set to build a supervised machine learning model that predicts which clients will leave.

**Importing resources and data for evaluation**

In [1]:
#import packages to complete the tasks in this project
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, mean_absolute_error, classification_report, confusion_matrix
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.utils import shuffle, resample
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


Take a preliminary look at the data. Look at the different columns and see if there are any missing values. Also, look to see what the data type for each column is to decide if any need to be changed before processing.

In [2]:
#upload the data into a pandas dataframe
df = pd.read_csv('Churn.csv')
display(df.head())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
#Observe the shape of and features of the data
print("shape of the data:", df.shape)
print()
print(df.info())
print()
print(df.dtypes)


shape of the data: (10000, 14)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None

RowNumber            int64
CustomerId           int64
Su

**Pre-processing**
This data set needs a few pre-processing steps:

1. Fill the null values in the Tenure column.
2. Drop irrelevant data.
3. Use one-hot encoding to turn the nominal columns (Geography and Gender) into numeric values.
4. Split the data into Training, Validation, and Test sets.
5. Scale the data to normalize each feature.

In [4]:
#since there are 909 null values in the Tenure column, I want to get a closer look at what they are so that I can decide what to do with them.
nan_df = df[df['Tenure'].isna()]
display(nan_df.head())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.0,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.0,1,0,0,84509.57,0


In [5]:
#after looking at the Tenure column, I've decided to fill the NaN values with the median value for the whole dataset
tenure_median = df['Tenure'].median()
#fill only the Tenure column so that no other column can be touched by accident
df['Tenure'] = df['Tenure'].fillna(tenure_median)
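
Calling `fillna` on the whole DataFrame would put the Tenure median into any column that has gaps. Only Tenure has nulls in this data set, so the result is the same here, but restricting the fill to one column is the safer habit. A minimal sketch on a toy frame (the values and the `toy` name are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, 1.0],
    "Balance": [0.0, 83807.86, np.nan, 125510.82],
})

tenure_median = toy["Tenure"].median()  # median of 1, 2, 8

# Fill only the Tenure column; Balance keeps its NaN
toy["Tenure"] = toy["Tenure"].fillna(tenure_median)
print(toy)
```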

In [6]:
#In this assignment, RowNumber, CustomerId, and Surname are irrelevant to the task. Drop these from the data
df_rel = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
display(df_rel.head())

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [7]:
#Now I want to check the unique values of the gender and geography columns. This will help with encoding them
print(df_rel['Gender'].unique())
print(df_rel['Geography'].unique())

['Female' 'Male']
['France' 'Spain' 'Germany']


In [8]:
#Use One-hot encoding for the Data in the gender and geography columns
df_ohe = pd.get_dummies(df_rel, drop_first=True)
#verify that the dtype for the columns in question have changed
print(df_ohe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Age                10000 non-null  int64  
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  int64  
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  bool   
 10  Geography_Spain    10000 non-null  bool   
 11  Gender_Male        10000 non-null  bool   
dtypes: bool(3), float64(3), int64(6)
memory usage: 732.6 KB
None


In [9]:
#define features and target
target = df_ohe['Exited']
features = df_ohe.drop(columns=['Exited'])
#split the data into Training, Validation, and Test sets
#use the stratify parameter so that each set keeps the same class proportions as the full data
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.2, random_state=8420, stratify=target)
features_train, features_test, target_train, target_test = train_test_split(features_train, target_train, test_size=0.25, random_state=8420, stratify=target_train)

print(f'The training set shape is {len(features_train)} for features')
print(f'The validation set shape is {len(features_valid)} for features')
print(f'The test set shape is {len(features_test)} for features')

The training set shape is 6000 for features
The validation set shape is 2000 for features
The test set shape is 2000 for features
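
The two `test_size` values produce a 60/20/20 split because the second split takes 25% of the remaining 80%, which is another 20% of the total. A minimal sketch of that arithmetic, using a plain array as a stand-in for the rows (the variable names and `random_state=0` are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10_000)  # stand-in for the 10,000 customer rows

# First split: hold out 20% of all rows
rest, holdout_a = train_test_split(rows, test_size=0.2, random_state=0)
# Second split: 25% of the remaining 80% is another 20% of the total
train, holdout_b = train_test_split(rest, test_size=0.25, random_state=0)

print(len(train), len(holdout_a), len(holdout_b))
```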


In [10]:
#What is the distribution of the data to those who have left and those who have not
print('Those who have not exited yet make up:',((target==0).sum())/100,'% of this data set')
print('Those who have already exited make up:',(target==1).sum()/100,'% of this data set')

Those who have not exited yet make up: 79.63 % of this data set
Those who have already exited make up: 20.37 % of this data set


**Scale the Data**
I'm going to use the StandardScaler strategy to normalize the features of this data set. 

In [11]:
#Scale the data so that each feature is normalized
scaler = StandardScaler()
scaler.fit(features_train)
features_train_scaled = scaler.transform(features_train)
features_valid_scaled = scaler.transform(features_valid)
features_test_scaled = scaler.transform(features_test)

#reconvert each scaled set back to DataFrames
features_train_scaled = pd.DataFrame(features_train_scaled, columns=features_train.columns)
features_valid_scaled = pd.DataFrame(features_valid_scaled, columns=features_valid.columns)
features_test_scaled = pd.DataFrame(features_test_scaled, columns=features_test.columns)

display(features_train_scaled.head())

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
0,-0.451227,-1.331816,-0.365352,0.717553,-0.900446,-1.527525,0.952144,0.269298,-0.575297,1.722862,-1.106659
1,-1.59001,0.697556,-1.457949,0.247628,-0.900446,0.654654,-1.050262,0.035812,-0.575297,-0.580429,0.903621
2,-0.022877,0.504283,-1.09375,-1.225158,0.828411,0.654654,0.952144,-0.722471,-0.575297,-0.580429,-1.106659
3,-0.284066,-0.558722,-1.09375,0.057547,-0.900446,0.654654,0.952144,0.073021,1.738231,-0.580429,0.903621
4,0.144284,0.407646,1.455642,1.805569,0.828411,0.654654,0.952144,-0.767166,-0.575297,-0.580429,0.903621
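
The scaler is fit on the training set only and then reused for the other sets, so the validation and test data never influence the mean and standard deviation. A small sketch of what that means for a single column (the credit score values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[600.0], [650.0], [700.0]])  # hypothetical credit scores
valid = np.array([[650.0], [750.0]])

scaler = StandardScaler().fit(train)     # learns mean and std from train only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)   # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(valid_scaled.ravel())
```

The scaled training column has mean 0, while the validation column generally does not, because it is measured against the training statistics.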


**Findings from pre-processing**
There is an imbalance in this data that will need to be accounted for in the final model. With 80% of the observations in the not exited category and 20% having exited, the imbalance will likely prevent a model from being fully representative of our ultimate goal to be able to predict with high precision and recall which customers are most likely to leave.
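
To see why the imbalance matters, consider a hypothetical model that never predicts churn. The sketch below uses the class counts printed above (7,963 stayed, 2,037 exited); the model itself is made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the distribution printed above
y_true = np.array([0] * 7963 + [1] * 2037)
y_all_stay = np.zeros_like(y_true)  # hypothetical model that never predicts churn

acc = accuracy_score(y_true, y_all_stay)
f1 = f1_score(y_true, y_all_stay, zero_division=0)
print(acc, f1)
```

Accuracy looks respectable at about 0.80 even though the model finds no churners at all, while F1 is 0. This is why F1 is the metric used in the rest of this notebook.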

This is a relatively small data set with only 10,000 observations. After splitting the data, there are now 6000 in the training set, 2000 in the validation set and 2000 in the test set. Because of this it will be important to keep as many observations as possible. So when it comes time to balance the data, I will need to use Upsampling and class_weight to preserve the data observations.
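
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes those weights from the full-data class counts; the models later in this notebook compute them from the training set, which has nearly the same proportions because of stratification:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Full-data class counts from the distribution printed above
y = np.array([0] * 7963 + [1] * 2037)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

The churn class receives roughly four times the weight of the non-churn class, which mirrors the roughly 4:1 imbalance.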

**Get a baseline**
Before transforming the data any further, it is useful to see how each of the ML models performs on the imbalanced data. This shows how the imbalance affects the results.

In [12]:
#Create and train a LogisticRegression model
imbalanced_LR = LogisticRegression(random_state=8420, solver='liblinear')
imbalanced_LR.fit(features_train_scaled, target_train)
imbalanced_LR_prediction = imbalanced_LR.predict(features_valid_scaled)
#Calculate the F1 Score
print('Logistic Regression F1:', f1_score(target_valid, imbalanced_LR_prediction))

Logistic Regression F1: 0.31833910034602075


In [13]:
#Create and train a DecisionTreeClassifier
imbalanced_DT = DecisionTreeClassifier(random_state=8420, max_depth=15)
imbalanced_DT.fit(features_train_scaled, target_train)
imbalanced_DT_prediction = imbalanced_DT.predict(features_valid_scaled)
#Calculate the F1 Score
print('Decision Tree F1:', f1_score(target_valid, imbalanced_DT_prediction))

Decision Tree F1: 0.5287637698898409


In [14]:
# Create and train the RandomForest model
imbalanced_RF = RandomForestClassifier(random_state=8420, n_estimators=70, max_depth=15)

# Fit the model on the scaled (still imbalanced) training data
imbalanced_RF.fit(features_train_scaled, target_train)

# Make predictions on the validation set
imbalanced_RF_prediction = imbalanced_RF.predict(features_valid_scaled)

#Calculate the F1 score
f1 = f1_score(target_valid, imbalanced_RF_prediction)
print(f"Random Forest F1 Score: {f1}")

Random Forest F1 Score: 0.5966514459665144


**Findings from the baseline models**
The RandomForest model produced the best F1 score (.597) and LogisticRegression the poorest (.318). The RandomForest model already reaches the .59 goal, but it may improve further once the imbalance in the data is addressed.

**Upsampling**
In this section, I will use upsampling with the scaled data to see the effect of balancing the data using this strategy.
Since approximately 80% of the target values belong to the not exited group, there is an imbalance that needs to be addressed. I'm going to upsample the data because this dataset is relatively small. While the full data has 10,000 rows, after splitting it into Training, Validation, and Test sets there are only 6000, 2000, and 2000 rows respectively. Upsampling does not discard any of the precious data in these sets.
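
The repeat factor for the minority class follows from the class ratio. A quick sketch of the arithmetic using the full-data class counts (the stratified training set keeps nearly the same ratio; the variable names are illustrative):

```python
# Full-data class counts from the distribution printed earlier
n_stay, n_churn = 7963, 2037

ratio = n_stay / n_churn   # how many non-churn rows per churn row
repeat = round(ratio)      # whole number of copies of the churn class

# Share of churn rows after repeating the churn class `repeat` times
churn_share = (n_churn * repeat) / (n_stay + n_churn * repeat)
print(ratio, repeat, churn_share)
```

A repeat factor of 4 brings the churn class to just over half of the rows.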

In [15]:
# Updated upsample function
def upsample(features, target, repeat):
    features = features.reset_index(drop=True)
    target = target.reset_index(drop=True)
    # Split features and target by the target classes
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Repeat the minority class to balance the data
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle the upsampled data
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=8420)

    return features_upsampled, target_upsampled

# Call upsample function with scaled features
features_train_scaled_df = pd.DataFrame(features_train_scaled, columns=features_train.columns)
features_upsampled_scaled_train, target_upsampled_train = upsample(features_train_scaled_df, target_train, 4)


# Display proportions
print('Those who have not exited yet make up:', ((target_upsampled_train == 0).sum()) / len(target_upsampled_train) * 100, '% of this upsampled data set')
print('Those who have already exited make up:', ((target_upsampled_train == 1).sum()) / len(target_upsampled_train) * 100, '% of this upsampled data set')

Those who have not exited yet make up: 49.43099524105111 % of this upsampled data set
Those who have already exited make up: 50.569004758948886 % of this upsampled data set


In [16]:
# Call upsample function with scaled features
features_valid_scaled_df = pd.DataFrame(features_valid_scaled, columns=features_valid.columns)
features_upsampled_scaled_valid, target_upsampled_valid = upsample(features_valid_scaled_df, target_valid, 4)


# Display proportions
print('Those who have not exited yet make up:', ((target_upsampled_valid == 0).sum()) / len(target_upsampled_valid) * 100, '% of this upsampled data set')
print('Those who have already exited make up:', ((target_upsampled_valid == 1).sum()) / len(target_upsampled_valid) * 100, '% of this upsampled data set')

Those who have not exited yet make up: 49.456690468798506 % of this upsampled data set
Those who have already exited make up: 50.54330953120149 % of this upsampled data set


In [17]:
# Call upsample function with scaled features
features_test_scaled_df = pd.DataFrame(features_test_scaled, columns=features_test.columns)
features_upsampled_scaled_test, target_upsampled_test = upsample(features_test_scaled_df, target_test, 4)


# Display proportions
print('Those who have not exited yet make up:', ((target_upsampled_test == 0).sum()) / len(target_upsampled_test) * 100, '% of this upsampled data set')
print('Those who have already exited make up:', ((target_upsampled_test == 1).sum()) / len(target_upsampled_test) * 100, '% of this upsampled data set')

Those who have not exited yet make up: 49.37965260545906 % of this upsampled data set
Those who have already exited make up: 50.62034739454094 % of this upsampled data set


**Create Models**
Create models using the upsampled and scaled data to see if it out performs the original models.

In [18]:
#Create and train a LogisticRegression model
model1 = LogisticRegression(random_state=8420, solver='liblinear')
model1.fit(features_upsampled_scaled_train, target_upsampled_train)
predicted_valid_LR = model1.predict(features_upsampled_scaled_valid)

#Calculate the F1 score
print('Logistic Regression F1:', f1_score(target_upsampled_valid, predicted_valid_LR))

Logistic Regression F1: 0.720504353047133


In [19]:
#Create and train a DecisionTreeClassifier model
model_dt = DecisionTreeClassifier(random_state=8420, max_depth=15)
model_dt.fit(features_upsampled_scaled_train, target_upsampled_train)
predicted_valid_dt = model_dt.predict(features_upsampled_scaled_valid)

#Calculate the F1 score
print('Decision Tree F1:', f1_score(target_upsampled_valid, predicted_valid_dt))

Decision Tree F1: 0.6403477001086563


In [20]:
# Create and train the RandomForest model
rf_model = RandomForestClassifier(random_state=8420, n_estimators=70, max_depth=15)

# Fit the model on the upsampled training data
rf_model.fit(features_upsampled_scaled_train, target_upsampled_train)

# Make predictions on the validation set
predictions_valid_RF = rf_model.predict(features_upsampled_scaled_valid)

#Calculate the F1 score
f1 = f1_score(target_upsampled_valid, predictions_valid_RF) 
print(f"Random Forest F1 Score: {f1}")

Random Forest F1 Score: 0.7319884726224783


In [21]:
#Check the auc_roc score for the RandomForest Model
probabilities_valid = rf_model.predict_proba(features_upsampled_scaled_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_upsampled_valid, probabilities_one_valid)
print(f'auc_roc Score: {auc_roc}')

auc_roc Score: 0.8580761038388157


In [22]:
#Use the test set to verify that the RandomForest Model is a quality fit.
# Predict on the test set
predictions_test = rf_model.predict(features_upsampled_scaled_test)
#F1 score
f1_test = f1_score(target_upsampled_test, predictions_test)
print(f"Test F1 Score: {f1_test}")
probabilities_test = rf_model.predict_proba(features_upsampled_scaled_test)
probabilities_one_test = probabilities_test[:, 1]
auc_roc_test = roc_auc_score(target_upsampled_test, probabilities_one_test)
print(f'Test auc_roc Score: {auc_roc_test}')

Test F1 Score: 0.7028800583302953
Test auc_roc Score: 0.8412605305941472


**Findings from the Upsample Experiment**
The RandomForest classifier is the best of the three models on this metric, with an F1 score of .732, compared with .721 for LogisticRegression and .640 for the DecisionTree classifier. These scores are well above the baseline scores. However, the validation set was upsampled as well, and repeating the churn rows raises precision on the churn class, so these scores are not directly comparable with scores computed on the original class distribution.

I applied the RandomForest model to the test set (also upsampled) to check that it is not overfit. The F1 score for the test set was .703 and the auc_roc score was .841. Both are close to the validation results (.732 and .858), which suggests that the model generalizes reasonably well.
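
The validation and test sets above were upsampled along with the training set (cells In [16] and In [17]). A common alternative is to upsample only the training data and score on the untouched validation set, so that the metric reflects the class distribution the model would see in practice. The sketch below illustrates the difference on synthetic data generated with `make_classification`; the data, column names, and model settings are assumptions for illustration and do not reproduce the scores in this notebook:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Synthetic stand-in for the churn data: roughly 80% negatives, 20% positives
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
y = pd.Series(y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def upsample(features, target, repeat):
    """Repeat the positive class `repeat` times and shuffle."""
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    return shuffle(features_up, target_up, random_state=0)

# Upsample the training data only
X_train_up, y_train_up = upsample(X_train, y_train, 4)
model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
model.fit(X_train_up, y_train_up)

# Score on the original validation set
f1_original = f1_score(y_valid, model.predict(X_valid))

# Score on an upsampled copy of the same validation set, for comparison
X_valid_up, y_valid_up = upsample(X_valid, y_valid, 4)
f1_upsampled = f1_score(y_valid_up, model.predict(X_valid_up))

print(f1_original, f1_upsampled)
```

Repeating the positive rows multiplies true positives and false negatives by the same factor but leaves false positives unchanged, so recall stays the same while precision, and therefore F1, can only go up.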

**class_weight balancing**
Now I will use the class_weight parameter to balance the classes and see how it compares to the models above.

In [23]:
#Create and train a LogisticRegression model
model2 = LogisticRegression(class_weight='balanced', random_state=8420, solver='liblinear')
model2.fit(features_train_scaled, target_train)
predicted_valid_LR2 = model2.predict(features_valid_scaled)

#Calculate the F1 score
print('Logistic Regression F1:', f1_score(target_valid, predicted_valid_LR2))

Logistic Regression F1: 0.495


In [24]:
#Create and train a DecisionTreeClassifier model
model_dt2 = DecisionTreeClassifier(class_weight='balanced', random_state=8420, max_depth=15)
model_dt2.fit(features_train_scaled, target_train)
predicted_valid_dt2 = model_dt2.predict(features_valid_scaled)

#Calculate the F1 score
print('Decision Tree F1:', f1_score(target_valid, predicted_valid_dt2))

Decision Tree F1: 0.5022727272727273


In [25]:
# Create and train the RandomForest model
rf_model2 = RandomForestClassifier(class_weight='balanced', random_state=8420, n_estimators=70, max_depth=15)

# Fit the model on the scaled training data (class_weight handles the imbalance)
rf_model2.fit(features_train_scaled, target_train)

# Make predictions on the validation set
predictions_valid_RF2 = rf_model2.predict(features_valid_scaled)

#Calculate the F1 score
f1 = f1_score(target_valid, predictions_valid_RF2) 
print(f"Random Forest F1 Score: {f1}")

Random Forest F1 Score: 0.6176470588235294


In [26]:
#Check the auc_roc score for the RandomForest Model
probabilities_valid2 = rf_model2.predict_proba(features_valid_scaled)
probabilities_two_valid = probabilities_valid2[:, 1]

auc_roc2 = roc_auc_score(target_valid, probabilities_two_valid)
print(f'auc_roc Score: {auc_roc2}')

auc_roc Score: 0.8627872865161001


In [27]:
#Use the test set to verify that the RandomForest Model is a quality fit.
# Predict on the test set
predictions_test2 = rf_model2.predict(features_test_scaled)
#F1 score
f1_test2 = f1_score(target_test, predictions_test2)
print(f"Test F1 Score: {f1_test2}")
probabilities_test2 = rf_model2.predict_proba(features_test_scaled)
probabilities_one_test2 = probabilities_test2[:, 1]
auc_roc_test2 = roc_auc_score(target_test, probabilities_one_test2)
print(f'Test auc_roc Score: {auc_roc_test2}')

Test F1 Score: 0.5822021116138764
Test auc_roc Score: 0.8538664523598384


**Findings from class weight balancing**
After applying balanced class weights to each model, the results are mixed compared with the imbalanced baseline models: LogisticRegression improved from .318 to .495, RandomForest rose slightly from .597 to .618, and the DecisionTree dropped from .529 to .502. The best model is again the RandomForest model.

I'm curious to know which features ended up being the most important to predict churn. Below I'm going to make a data frame that shows which features were most predictive. This will help Beta Bank know how to adjust business practices.

In [28]:
# Extract feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'Feature': features_upsampled_scaled_train.columns,'Importance': feature_importances}).sort_values(by='Importance', ascending=False)

print(importance_df)

              Feature  Importance
1                 Age    0.252405
4       NumOfProducts    0.142689
3             Balance    0.141403
7     EstimatedSalary    0.132725
0         CreditScore    0.128477
2              Tenure    0.073367
6      IsActiveMember    0.040273
8   Geography_Germany    0.036005
10        Gender_Male    0.022139
5           HasCrCard    0.016931
9     Geography_Spain    0.013586


In [29]:
# Combine Age and target into a single DataFrame
age_churn_data = pd.DataFrame({'Age': features_train['Age'], 'Churn': target_train})

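# Group by age ranges and calculate churn rates
# (pd.cut bins are right-inclusive by default, so the first bin is (18, 30] and age 18 itself falls outside it)
age_churn_data['age_group'] = pd.cut(age_churn_data['Age'], bins=[18, 30, 45, 60, 75, 90], 
                                     labels=['18-30', '31-45', '46-60', '61-75', '76-90'])

# observed=False keeps the current behavior and avoids the pandas FutureWarning for categorical groupers
age_churn_rate = age_churn_data.groupby('age_group', observed=False)['Churn'].mean()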

print(age_churn_rate)

age_group
18-30    0.083051
31-45    0.159461
46-60    0.496945
61-75    0.267490
76-90    0.047619
Name: Churn, dtype: float64




**Conclusion:**
Beta Bank is facing a churn problem that can be addressed with a little research and machine learning. I used the provided data on 10,000 customers to build a model that predicts whether or not a client is at risk of leaving the bank.

Pre-processing Methods: Several steps were necessary to make the original dataset useful for the machine learning process. First, I filled the missing Tenure values with the median and removed irrelevant columns. Second, I converted the nominal columns to numerical values using one-hot encoding. I then split the data into training, validation, and test sets. Finally, I scaled the data to normalize each column. These steps prepared the data for the machine learning process.

Balancing methods: I used two different means of balancing the data to observe which performed more effectively with regard to precision and recall. The first method was upsampling, which increased the number of churn observations to roughly equal that of no-churn. I also used the class_weight parameter to create balance in each of the models. Of the two methods, the upsampling strategy produced higher F1 scores. However, those scores were measured on upsampled validation and test sets, while the class_weight scores were measured on the original class distribution, so the two sets of numbers are not directly comparable.

Models used: I created 3 classifier models and compared the F1 Scores of each. The three models I used were the Decision Tree, Logistic Regression, and Random Forest. 

Findings: Of the three models, the Random Forest model had the highest F1 score.
F1 Scores (upsampled data): 
Logistic Regression = .721
DecisionTree = .640
Random Forest = .732

Based on these findings, I recommend that Beta Bank use the RandomForest model created above to determine which clients are at highest risk of leaving the bank. I was curious to see which features were the most relevant to the highest performing RandomForest model, and found that Age was the most important feature. On further examination I learned that about 50% of the clients in the 46-60 age range left the bank, as did about 27% of those in the 61-75 range, compared with 16% or less in the younger groups. Beta Bank would benefit from further research into what 46-75 year old clients want from their bank and whether it can offer products or services that appeal to this demographic.
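
The figures from `age_churn_rate` are churn rates within each age group, which is a different quantity from each group's share of all churners. A small sketch on made-up data shows how the two can differ (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: a large group with a low churn rate and a small group with a high one
toy = pd.DataFrame({
    "age_group": ["31-45"] * 10 + ["46-60"] * 2,
    "Churn":     [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Churn rate within each age group (what the notebook computes)
churn_rate = toy.groupby("age_group")["Churn"].mean()

# Share of all churners that falls in each age group
churner_share = toy.loc[toy["Churn"] == 1, "age_group"].value_counts(normalize=True)

print(churn_rate)
print(churner_share)
```

In this toy example the 46-60 group has the higher churn rate (0.5 versus 0.3) but accounts for only a quarter of the churners, because the group is small.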