**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there are a couple of problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Beta Bank Customer Retention

## Introduction

This project focuses on predicting customer churn using machine learning techniques. Customer churn is when a customer stops doing business with a company. Predicting churn matters to Beta Bank because it helps the bank identify customers who are likely to leave and take proactive steps to retain them.

The dataset used contains information about the bank's customers and whether they exited (churned) or not. The data includes customer information such as credit score, gender, age, geography, etc.

The project involves the following steps:
- The data is loaded, explored, and preprocessed: missing values are handled, data types are converted, and unnecessary columns are dropped.
- The target variable is imbalanced: more customers stay with the bank than leave. Upsampling the minority class and downsampling the majority class are used to address this imbalance.
- A Logistic Regression model is trained on the preprocessed data, and its performance is evaluated using the F1 score and AUC-ROC.
- The model is then improved using upsampling and downsampling, and the results before and after the improvement are compared.

The goal of this project is to build a model that can accurately predict customer churn. The insights gained from this project could potentially be used to improve Beta Bank's customer retention strategies.

## Prepare the data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.metrics import accuracy_score

from sklearn.utils import resample
from sklearn.utils import shuffle

from sklearn.exceptions import FitFailedWarning
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings('ignore')

In [2]:
# Read the data
data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/Churn.csv')

# Examine the data
data.info()
display(data.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9731 | 9732 | 15627859 | Nebeolisa | 607 | Germany | Male | 29 | 7.0 | 102609.0 | 1 | 1 | 0 | 163257.44 | 0 |
| 1470 | 1471 | 15762332 | Ulyanova | 568 | Germany | Female | 31 | 1.0 | 61592.14 | 2 | 1 | 1 | 61796.64 | 0 |
| 4059 | 4060 | 15691952 | Fanucci | 676 | France | Male | 37 | 10.0 | 106242.67 | 1 | 1 | 1 | 166678.28 | 0 |
| 1311 | 1312 | 15750497 | Longo | 850 | France | Female | 37 | 7.0 | 153147.75 | 1 | 1 | 1 | 152235.3 | 0 |
| 1251 | 1252 | 15814930 | McGregor | 588 | Germany | Female | 40 | 10.0 | 125534.51 | 1 | 1 | 0 | 121504.18 | 1 |
| 3269 | 3270 | 15774744 | Lord | 664 | Germany | Male | 33 | NaN | 97286.16 | 2 | 1 | 0 | 143433.33 | 0 |
| 479 | 480 | 15797736 | Smith | 658 | France | Male | 29 | 4.0 | 80262.6 | 1 | 1 | 1 | 20612.82 | 0 |
| 5850 | 5851 | 15762091 | Simpson | 631 | Germany | Female | 22 | 6.0 | 139129.92 | 1 | 1 | 1 | 63747.51 | 0 |
| 3064 | 3065 | 15762228 | Barnes | 506 | Spain | Male | 35 | 6.0 | 110046.93 | 2 | 1 | 0 | 26318.73 | 0 |
| 1443 | 1444 | 15598751 | Ingram | 556 | France | Female | 43 | NaN | 0.0 | 3 | 0 | 0 | 125154.57 | 1 |


In [3]:
# Check for duplicates
print(data.duplicated().sum())

0


There are no duplicate rows, so we can move on.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

In [4]:
# Check for missing values
print(data.isnull().sum())

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64


There are 909 missing values for the 'Tenure' column. Some models will not be able to handle data with missing values. Therefore, we will fill in the missing values for tenure with the median value. We will also change the data type of 'Tenure' to integers if all the values are integers.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, that's one way to deal with missing values :)

</div>

In [5]:
# Fill missing values in 'Tenure' with the median value
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

# Check whether it's safe to convert 'Tenure' from float to int. If so, convert it.
if np.array_equal(data['Tenure'], data['Tenure'].astype('int')):
    data['Tenure'] = data['Tenure'].astype('int')

print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int32  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int32(1), int64(8), object(3)
memory usage: 1.0+ MB
None


We will now remove the columns that are not needed.

In [6]:
# Drop the columns that are not needed for the model
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

display(data.sample(10))

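| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8210 | 703 | Spain | Male | 31 | 6 | 0.0 | 2 | 1 | 1 | 67667.19 | 0 |
| 4969 | 655 | Spain | Female | 35 | 1 | 106405.03 | 1 | 1 | 1 | 82900.25 | 0 |
| 8412 | 699 | France | Male | 22 | 9 | 99339.0 | 1 | 1 | 0 | 68297.61 | 1 |
| 4592 | 834 | France | Male | 36 | 8 | 142882.49 | 1 | 1 | 0 | 89983.02 | 1 |
| 6450 | 834 | France | Female | 28 | 6 | 0.0 | 1 | 1 | 0 | 74287.53 | 0 |
| 8201 | 718 | Spain | Female | 49 | 10 | 82321.88 | 1 | 0 | 1 | 11144.4 | 0 |
| 2864 | 708 | Germany | Male | 37 | 8 | 153366.13 | 1 | 1 | 1 | 26912.34 | 0 |
| 3465 | 692 | Germany | Female | 43 | 2 | 69014.49 | 2 | 0 | 0 | 164621.43 | 0 |
| 923 | 572 | Germany | Female | 19 | 1 | 138657.08 | 1 | 1 | 1 | 16161.82 | 0 |
| 7262 | 641 | Spain | Female | 40 | 5 | 101090.27 | 1 | 1 | 1 | 51703.09 | 0 |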


These columns were dropped because they do not help the model predict customer churn. RowNumber is an index column that carries no meaningful information. CustomerId is a unique identifier for each customer; including it could lead the model to associate outcomes with individual IDs, which would not generalize to unseen data. Surname is the customer's last name, which is unlikely to influence the likelihood of churn.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Makes sense!

</div>

In [7]:
# Convert categorical data into numerical data
data = pd.get_dummies(data, drop_first=True)

display(data.sample(10))

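| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2764 | 660 | 38 | 7 | 0.0 | 2 | 0 | 1 | 146585.53 | 0 | False | False | True |
| 1916 | 543 | 48 | 1 | 100900.5 | 1 | 0 | 0 | 33310.72 | 1 | True | False | True |
| 3892 | 549 | 45 | 6 | 124240.93 | 1 | 1 | 1 | 146372.51 | 0 | True | False | True |
| 8264 | 742 | 33 | 5 | 0.0 | 2 | 0 | 0 | 38550.4 | 0 | False | False | True |
| 4446 | 701 | 37 | 3 | 0.0 | 2 | 1 | 1 | 164268.28 | 0 | False | False | False |
| 8837 | 664 | 46 | 2 | 0.0 | 1 | 1 | 1 | 177423.02 | 1 | False | False | True |
| 9818 | 558 | 31 | 7 | 0.0 | 1 | 1 | 0 | 198269.08 | 0 | False | False | True |
| 61 | 687 | 27 | 9 | 152328.88 | 2 | 0 | 0 | 126494.82 | 0 | True | False | False |
| 533 | 543 | 35 | 10 | 59408.63 | 1 | 1 | 0 | 76773.53 | 0 | False | True | True |
| 4565 | 593 | 46 | 2 | 76597.79 | 1 | 1 | 1 | 54453.72 | 0 | False | True | False |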


The new dataframe has each category encoded in its own column. To avoid the dummy variable trap, get_dummies is called with drop_first=True, so no Geography_France column is created: a customer is in France when both Geography_Germany and Geography_Spain are False. Likewise, the gender is Female when Gender_Male is False.
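
As a small illustration of how drop_first behaves, here is a sketch on a toy frame. The four rows are made up for the example and are not taken from the churn data; only the column names match.

```python
import pandas as pd

# Toy frame with the same two categorical columns as the churn data
# (the rows are illustrative, not real customers)
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# drop_first=True drops the first category of each column
# (France and Female, in sorted order)
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)

# A French female customer is represented by all-False dummy columns
first_row_all_false = not encoded.iloc[0].any()
print(first_row_all_false)
```

The dropped categories act as the baseline, so no information is lost by leaving them out.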

<div class="alert alert-success">
<b>Reviewer's comment</b>

Categorical features were encoded

</div>

In [8]:
# Split the data into features and target
# The 'Exited' column is the target, the rest are features
features = data.drop('Exited', axis=1)
target = data['Exited']

# First, split the data into a training set (60% of the data) and a temp set (40%)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42)

# Then, split the temp set into a validation set (50% of the temp) and a testing set (50% of the temp)
# This will result in a 20/20 split of the entire dataset for validation/testing
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=42)

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Note that we need three sets here: train, validation and test. The train set is used to train the models, the validation set to compare different models and balancing techniques as well as tune hyperparameters, and the test set to evaluate the final model.

</div>
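
One optional refinement to a 60/20/20 split like the one above is to pass `stratify` to train_test_split so that each set keeps roughly the same share of churned customers. The sketch below uses synthetic data (the arrays `X` and `y` are made up, with about 20% positives), so it only illustrates the mechanics and is not the notebook's actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features/target (assumption: ~20% positive class)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.2).astype(int)

# 60% train, 40% temp, keeping the class ratio in both parts
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# Split temp in half: 20% validation, 20% test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

sizes = (len(X_train), len(X_valid), len(X_test))
print(sizes)
print(y_train.mean(), y_valid.mean(), y_test.mean())
```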

In [9]:
# Examine the balance of classes
class_counts = target.value_counts()
print(class_counts)


Exited
0    7963
1    2037
Name: count, dtype: int64


This output shows the number of customers who stayed with the bank versus those who took their business elsewhere. Significantly more customers stayed than left.

In [10]:
# Calculate the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f'Imbalance Ratio: {imbalance_ratio}')

Imbalance Ratio: 3.9091801669121256


This shows that, at the time the data was collected, about four times as many customers had stayed as had taken their business elsewhere.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Class distribution was examined

</div>

## Train the model

In [11]:
# Train a Logistic Regression model without considering the imbalance
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(features_train, target_train)

In [12]:
# Make predictions on the validation set
predictions_valid = model.predict(features_valid)

In [13]:
# Evaluate the model
print(classification_report(target_valid, predictions_valid))

              precision    recall  f1-score   support

           0       0.81      0.99      0.89      1620
           1       0.33      0.02      0.04       380

    accuracy                           0.81      2000
   macro avg       0.57      0.51      0.47      2000
weighted avg       0.72      0.81      0.73      2000



Here's what can be seen from the results:
- Precision is the ratio of correctly predicted observations of a class to all observations predicted as that class. It is high for 0 but low for 1: .81 for 0 and .33 for 1.
- Recall is the ratio of correctly predicted observations of a class to all actual observations in that class. For 0, the recall is .99, while for 1, the recall is .02.
- F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is a better measure than accuracy for an uneven class distribution such as the one in our data. The F1 score for 0 is .89, while the F1 score for 1 is .04.
- Support is the number of actual occurrences of each class in the validation set. It is 1620 for 0 and 380 for 1.

From these metrics, we can conclude that the model performs well at predicting customers who did not exit (0), but poorly at predicting customers who exited (1).
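
To make the F1 definition concrete, the sketch below recomputes the F1 column from the rounded precision and recall values in the classification report above. The helper name `f1_from` is introduced here only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report above
f1_class_0 = f1_from(0.81, 0.99)
f1_class_1 = f1_from(0.33, 0.02)

print(round(f1_class_0, 2))
print(round(f1_class_1, 2))
```

Because the harmonic mean is dominated by the smaller of the two values, the very low recall for class 1 pulls its F1 score close to zero even though its precision is .33.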

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Great, you trained a model without taking the imbalance into account first. Note that you need to use the validation set to evaluate the model here: the test set should only be used once you've selected the model and are not going to make any changes. The same goes for models below

</div>

## Improve the model

We will use upsampling and downsampling to improve our model.

### Logistic Regression

#### Hyperparameter Tuning
Use GridSearchCV on the Logistic Regression model for hyperparameter tuning.

In [14]:
# Ignore FitFailedWarning
warnings.filterwarnings('ignore', category=FitFailedWarning)

# Hyperparameter tuning
param_grid = {
    'C': np.logspace(-3, 3, 7), 
    'penalty': ['l1', 'l2'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': [None, 'balanced'],
    'fit_intercept': [True, False],
    'max_iter': [100, 200, 300]
}

gridsearch = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
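
The grid above pairs the 'l1' penalty with solvers that only support 'l2' (newton-cg, lbfgs, sag), which is probably why FitFailedWarning has to be silenced. A possible alternative, sketched below, is to pass a list of dictionaries so that only valid solver/penalty pairs are tried. The reduced set of parameters and the name `shared` are choices made for this example, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Parameters that apply to every solver (subset chosen for illustration)
shared = {
    "C": np.logspace(-3, 3, 7),
    "class_weight": [None, "balanced"],
}

# One dictionary per group of solvers with the penalties they support
param_grid = [
    {**shared, "solver": ["liblinear", "saga"], "penalty": ["l1", "l2"]},
    {**shared, "solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"]},
]

# ParameterGrid expands the grid the same way GridSearchCV does
combos = list(ParameterGrid(param_grid))
n_combos = len(combos)
print(n_combos)
```

The same `param_grid` list can be passed directly to GridSearchCV in place of a single dictionary.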

#### Upsampling

In [15]:
# Upsampling
def upsample(features, target, repeat=None):
    # Separate majority and minority classes
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Upsample minority class to match the number of samples in majority class
    features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
    target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
    
    # Shuffle the dataset
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train)
print("Upsampling Target Value Counts: ")
print(target_upsampled.value_counts())
gridsearch.fit(features_upsampled, target_upsampled)
print("Upsampling Best Params: ", gridsearch.best_params_)
predictions_valid = gridsearch.predict(features_valid)
print("Upsampling F1 Score: ", f1_score(target_valid, predictions_valid))
print("Upsampling ROC-AUC Score: ", roc_auc_score(target_valid, predictions_valid))

Upsampling Target Value Counts: 
Exited
1    4773
0    4773
Name: count, dtype: int64


Upsampling Best Params:  {'C': 0.01, 'class_weight': 'balanced', 'fit_intercept': True, 'max_iter': 200, 'penalty': 'l2', 'solver': 'newton-cg'}
Upsampling F1 Score:  0.46696035242290745
Upsampling ROC-AUC Score:  0.6974496426250811
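
Note that the ROC-AUC above is computed from hard class predictions. In that case the ROC curve has a single operating point and the score reduces to balanced accuracy. ROC-AUC is usually computed from predicted probabilities instead. The sketch below shows both variants on synthetic data generated with make_classification, so the numbers it prints say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 80/20 class split)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

hard_predictions = clf.predict(X_valid)
positive_proba = clf.predict_proba(X_valid)[:, 1]

# AUC from hard labels equals balanced accuracy
auc_from_labels = roc_auc_score(y_valid, hard_predictions)
balanced_acc = balanced_accuracy_score(y_valid, hard_predictions)

# AUC from probabilities uses the full ranking of the validation examples
auc_from_proba = roc_auc_score(y_valid, positive_proba)

print(auc_from_labels, balanced_acc, auc_from_proba)
```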


#### Downsampling

In [16]:
# Downsampling
def downsample(features, target, fraction=None):
    # Separate majority and minority classes
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Downsample majority class to match the number of samples in minority class
    features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
    target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
    
    # Shuffle the dataset
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
    
    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(features_train, target_train)
print("Downsampling Target Value Counts: ")
print(target_downsampled.value_counts())
gridsearch.fit(features_downsampled, target_downsampled)
print("Downsampling Best Params: ", gridsearch.best_params_)
predictions_valid = gridsearch.predict(features_valid)
print("Downsampling F1 Score: ", f1_score(target_valid, predictions_valid))
print("Downsampling ROC-AUC Score: ", roc_auc_score(target_valid, predictions_valid))

Downsampling Target Value Counts: 
Exited
1    1227
0    1227
Name: count, dtype: int64
Downsampling Best Params:  {'C': 1.0, 'class_weight': 'balanced', 'fit_intercept': True, 'max_iter': 200, 'penalty': 'l1', 'solver': 'liblinear'}
Downsampling F1 Score:  0.46341463414634143
Downsampling ROC-AUC Score:  0.6950617283950616


### Random Forest

#### Hyperparameter Tuning
Use GridSearchCV on the Random Forest model for hyperparameter tuning.


In [20]:
# Hyperparameter tuning
param_grid = {
    'n_estimators': range(10, 201, 10), 
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced']
}

gridsearch = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='f1')

#### Upsampling

In [21]:
# Upsampling
def upsample(features, target, repeat=None):
    # Separate majority and minority classes
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Upsample minority class to match the number of samples in majority class
    features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
    target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
    
    # Shuffle the dataset
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train)
gridsearch.fit(features_upsampled, target_upsampled)
print("Upsampling Best Params: ", gridsearch.best_params_)
probabilities_valid = gridsearch.predict_proba(features_valid)
predicted_valid = probabilities_valid[:, 1] > 0.4  # custom threshold
print("Upsampling F1 Score: ", f1_score(target_valid, predicted_valid))
print("Upsampling ROC-AUC Score: ", roc_auc_score(target_valid, predicted_valid))

KeyboardInterrupt: 

#### Downsampling

In [19]:
# Downsampling
def downsample(features, target, fraction=None):
    # Separate majority and minority classes
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Downsample majority class to match the number of samples in minority class
    features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
    target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
    
    # Shuffle the dataset
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train)
gridsearch.fit(features_downsampled, target_downsampled)
print("Downsampling Best Params: ", gridsearch.best_params_)
probabilities_valid = gridsearch.predict_proba(features_valid)
predicted_valid = probabilities_valid[:, 1] > 0.4  # custom threshold
print("Downsampling F1 Score: ", f1_score(target_valid, predicted_valid))
print("Downsampling ROC-AUC Score: ", roc_auc_score(target_valid, predicted_valid))


Downsampling Target Value Counts: 
Exited
1    1227
0    1227
Name: count, dtype: int64
Downsampling Best Params:  {'bootstrap': False, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Downsampling F1 Score:  0.5412935323383085
Downsampling ROC-AUC Score:  0.7489441195581547


<div class="alert alert-danger">
<b>Reviewer's comment</b>

Upsampling should be applied only to the train set, otherwise it won't be possible to accurately estimate how the model will generalize to new data for two reasons:
    
1. Validation/test data obtained from an upsampled full dataset will not have the same distribution as actual data (which is not balanced)
2. There are bound to be the same examples in train and test, which is a clear case of data leakage.
    
The goal of upsampling is just to help the model better learn about the underrepresented class, but the validation and test set need to have the original data distribution in order for evaluation to make any sense.

</div>

<div class="alert alert-danger">
<b>Reviewer's comment</b>

The same comment as for upsampling: downsampling should only be applied to the train set. While the argument about having the same examples in train and test no longer works, the first point about validation/test data needing to have the original data distribution for accurate estimation of generalization performance applies here.

</div>
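
The order of operations described in the two comments above can be sketched as follows: split first, then resample only the training part, and leave the validation part untouched. The data here is synthetic (200 positives out of 1000 rows), and all variable names are chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with a 20% minority class (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 200 + [0] * 800)

# 1. Split first, so the validation set keeps the original distribution
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Upsample the minority class in the training set only
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0)

X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

train_positive_share = y_train_bal.mean()
valid_positive_share = y_valid.mean()
print(train_positive_share, valid_positive_share)
```

The balanced training set is used only for fitting; metrics are still computed on `X_valid` and `y_valid`, which keep the original class ratio.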

## Testing

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Please check the results after making sure that the test set is only used for final model evaluation and all prior comparisons are done using the validation set 

</div>

## Conclusion

This project involved building a machine learning model to predict customer churn. The dataset was imbalanced, with about four times as many customers continuing their business with Beta Bank as leaving. The initial model, trained without addressing the imbalance, performed poorly, with an F1 score of .04 for the minority class on the validation set. After the class imbalance was addressed with upsampling and downsampling, the validation F1 score rose to about .46-.47 for Logistic Regression and to .54 for the downsampled Random Forest. The corresponding AUC-ROC scores were about .70 for Logistic Regression and .75 for the Random Forest.

This project demonstrated the importance of properly preprocessing the data, handling class imbalance, and choosing the right evaluation metrics when working with imbalanced datasets. It also showed the iterative process of building a model and continually improving it based on performance.

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Don't forget to change the conclusions if needed

</div>