# Project_08 - Classification (Churn)

**Project description**
<br>Beta Bank customers are leaving: little by little, chipping away every month.
<br>The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.
<br>We need to predict whether a customer will leave the bank soon. 
<br>You have the data on clients’ past behavior and termination of contracts with the bank.
<br>Build a model with the maximum possible F1 score. To pass the project, you need an F1 score of at least 0.59. Check the F1 for the test set.
<br>Additionally, measure the AUC-ROC metric and compare it with the F1.

**Project Goal**
<br>Analyze the Beta Bank dataset, then preprocess and divide it for classification modeling.
<br>Examine the balance of classes. Train the model with and without taking into account the imbalance.
<br>Apply different models to see which one has the highest F1 score and AUC-ROC in predicting customer churn.

**Project instructions**
<br>Download and prepare the data and perform EDA.
<br>Examine the balance of classes.
<br>Improve the quality of the model.
<br>Use the training set to pick the best parameters. Train different models on training and validation sets.
<br>Perform the final testing.

**Data description**
<br>Features
<br>RowNumber — data string index
<br>CustomerId — unique customer identifier
<br>Surname — surname
<br>CreditScore — credit score
<br>Geography — country of residence
<br>Gender — gender
<br>Age — age
<br>Tenure — period of maturation for a customer’s fixed deposit (years)
<br>Balance — account balance
<br>NumOfProducts — number of banking products used by the customer
<br>HasCrCard — customer has a credit card
<br>IsActiveMember — customer’s activeness
<br>EstimatedSalary — estimated salary
<br>Target
<br>Exited — customer has left

In [4]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

In [5]:
df = pd.read_csv('datasets/project_03_dataset.csv')
print(df.head())
print(df.info())
print(df.duplicated().sum())

   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0     2.0       0.00              1          1               1   
1     1.0   83807.86              1          0               1   
2     8.0  159660.80              3          1               0   
3     1.0       0.00              2          0               0   
4     2.0  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       0  
4         790

In [6]:
#df.to_csv('~/work/project_datasets/project_03_dataset.csv', index=False, header=list(df.columns))

In [7]:
df = df.drop(['RowNumber','Surname', 'CustomerId'], axis=1)

The 'RowNumber', 'Surname' and 'CustomerId' columns are dropped because they are identifiers and are not useful features for training the models.
<br>The datatypes of the remaining columns look correct and there are no duplicated rows.
<br>The 'Tenure' column has 909 missing values that need to be addressed.

In [8]:
imputer = SimpleImputer(strategy='mean')
df['Tenure'] = imputer.fit_transform(df[['Tenure']])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB
None


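A SimpleImputer was used to fill the missing values of the 'Tenure' column with the column mean. Note that the mean is computed on the full dataset before the train/validation/test split, so it includes rows that later end up in the validation and test sets.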
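As an illustration of what mean imputation does, here is a minimal sketch on a toy column. The values and the `toy` and `Tenure_filled` names are hypothetical and are not taken from the Beta Bank data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two gaps (hypothetical values, not from the dataset)
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 8.0, 1.0, np.nan]})

imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it
toy['Tenure_filled'] = imputer.fit_transform(toy[['Tenure']]).ravel()

print(toy)
print('Learned mean:', imputer.statistics_[0])
```

The imputer learns the mean of the observed values only and writes it into every gap, so the filled column has no missing values and its mean is unchanged.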

In [9]:
df = pd.get_dummies(df, drop_first=True)
print(df.head())

   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  \
0          619   42     2.0       0.00              1          1   
1          608   41     1.0   83807.86              1          0   
2          502   42     8.0  159660.80              3          1   
3          699   39     1.0       0.00              2          0   
4          850   43     2.0  125510.82              1          1   

   IsActiveMember  EstimatedSalary  Exited  Geography_Germany  \
0               1        101348.88       1                  0   
1               1        112542.58       0                  0   
2               0        113931.57       1                  0   
3               0         93826.63       0                  0   
4               1         79084.10       0                  0   

   Geography_Spain  Gender_Male  
0                0            0  
1                1            0  
2                0            0  
3                0            0  
4                1            

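One-hot encoding with drop_first=True was applied to the categorical columns ('Geography' and 'Gender'). The first category of each column (France, Female) is dropped and acts as the baseline, which leaves 'Geography_Germany', 'Geography_Spain' and 'Gender_Male'.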
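The effect of `drop_first=True` is easier to see on a few rows. This is a small sketch with made-up rows; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's categorical column names
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 43],
})

# drop_first=True removes the first (alphabetical) category of each column,
# so France and Female become the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both 'Geography_Germany' and 'Geography_Spain' is a customer from France, so no information is lost by dropping that column.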

In [10]:
print(df['Exited'].value_counts())

0    7963
1    2037
Name: Exited, dtype: int64


The 'Exited' class (2,037 rows, about 20%) has far fewer observations than the 'Not Exited' class (7,963 rows, about 80%), so the classes are imbalanced.
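The size of the imbalance can be quantified directly from the counts printed above. The sketch below rebuilds the counts by hand; the `shares` and `ratio` names are only for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 7963, 1: 2037}, name='Exited')

shares = counts / counts.sum()   # fraction of rows in each class
ratio = counts[0] / counts[1]    # negatives per positive

print(shares.round(4))
print(f'Negative-to-positive ratio: {ratio:.2f}')
```

On the real DataFrame, `df['Exited'].value_counts(normalize=True)` gives the same shares in one call.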

In [11]:
features = df.drop(['Exited'], axis=1)
target = df['Exited']

seed = 12345
features_train, features_test, target_train, target_test = train_test_split(
                                                        features, target, test_size=0.4, stratify=target, random_state=seed)

features_valid, features_test, target_valid, target_test = train_test_split(
                                                        features_test, target_test, test_size=0.5, stratify=target_test, random_state=seed)

The data was split into three stratified parts with the following proportions, so that each part keeps the original class ratio and the training set has the most data for fitting the models.
<br>Train: 60%, Valid: 20%, Test: 20%
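To check that two chained calls really produce a 60/20/20 split with the class ratio preserved, the following sketch repeats the same calls on synthetic labels. The labels reuse the dataset's class counts, while the single feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (7963 / 2037)
y = np.array([0] * 7963 + [1] * 2037)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# First split: 60% train, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=12345)
# Second split: the held-out 40% is halved into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=12345)

for name, part in [('train', y_train), ('valid', y_valid), ('test', y_test)]:
    print(f'{name}: n={len(part)}, positive share={part.mean():.4f}')
```

Because of `stratify`, the positive share in every part stays close to the overall 20%.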

In [12]:
scaler = StandardScaler()
scaler.fit(features_train)

pd.options.mode.chained_assignment = None
features_train = scaler.transform(features_train)
features_valid = scaler.transform(features_valid)
features_test = scaler.transform(features_test)

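StandardScaler was fitted on the training set only and then applied to all three sets. Standardizing puts features with very different ranges (for example 'Balance' and 'Age') on a comparable scale, which matters for scale-sensitive models such as logistic regression.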
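The fit-on-train, transform-everywhere pattern can be sketched on random data. The two columns below are hypothetical stand-ins for a credit-score-like and a balance-like feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
train = np.column_stack([rng.normal(650, 100, 500), rng.normal(75000, 60000, 500)])
valid = np.column_stack([rng.normal(650, 100, 200), rng.normal(75000, 60000, 200)])

scaler = StandardScaler().fit(train)     # mean and std come from train only
train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)   # reuses the train mean and std

print('train:', train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print('valid:', valid_scaled.mean(axis=0).round(3), valid_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly. The validation columns are only approximately standardized, which is expected because their own statistics were never used.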

In [13]:
model_lr = LogisticRegression(random_state=seed)
model_lr.fit(features_train, target_train)
target_pred = model_lr.predict(features_valid)
target_pred_proba = model_lr.predict_proba(features_valid)[:, 1]

f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.3107861060329068
ROC_AUC_Score: 0.7874451916444971


In [14]:
model_lr = LogisticRegression(random_state=seed, class_weight='balanced')
model_lr.fit(features_train, target_train)
target_pred = model_lr.predict(features_valid)
target_pred_proba = model_lr.predict_proba(features_valid)[:, 1]

f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.5280701754385966
ROC_AUC_Score: 0.7936557788944724


LogisticRegression models were trained with and without the class_weight hyperparameter set to 'balanced'.
<br>With balanced class weights the F1 score rose from 0.31 to 0.53, while the ROC AUC score barely changed (0.787 vs 0.794).
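For reference, scikit-learn's 'balanced' mode weights each class by `n_samples / (n_classes * class_count)`. The sketch below evaluates this on the full-dataset counts as an approximation; the models above are fitted on the training split, whose weights differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the full-dataset class counts (an approximation of the train split)
y = np.array([0] * 7963 + [1] * 2037)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
# The same numbers from the formula n_samples / (n_classes * bincount(y))
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(4))))
print(manual.round(4))
```

Each 'Exited' row therefore counts roughly four times as much as a 'Not Exited' row in the loss.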

In [15]:
rf = RandomForestClassifier(random_state=seed)
params_rf = {'max_depth': [4, 6, 8, 10],
            'n_estimators': [200, 250, 300, 350]}
gscv = GridSearchCV(rf, param_grid=params_rf, cv=5, n_jobs=-1)
gscv.fit(features_train, target_train)
best_model = gscv.best_estimator_

target_pred = best_model.predict(features_valid)
target_pred_proba = best_model.predict_proba(features_valid)[:, 1]
f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')
print(gscv.best_params_)

F1_Score: 0.5855161787365177
ROC_AUC_Score: 0.872268819588137
{'max_depth': 10, 'n_estimators': 250}


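GridSearchCV was used to find the best RandomForestClassifier parameters, searching over 'max_depth' and 'n_estimators' with 5-fold cross-validation. Because no scoring argument is passed, the search ranks candidates by accuracy (the classifier's default score) rather than by F1.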
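Since the project target is F1, one option is to pass `scoring='f1'` so that the search ranks candidates by that metric. The sketch below runs on synthetic data from `make_classification` with a deliberately small grid; the data and grid values are assumptions, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (roughly 80/20), standing in for the real features
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=12345)

grid = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid={'max_depth': [4, 8], 'n_estimators': [25, 50]},
    scoring='f1',   # rank candidates by F1 of the positive class
    cv=3,
)
grid.fit(X, y)

print(grid.best_params_)
print('Best mean cross-validated F1:', round(grid.best_score_, 3))
```

With this setting, `best_score_` is the mean cross-validated F1 of the winning combination rather than its accuracy.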

In [16]:
model_rf = RandomForestClassifier(n_estimators=250, max_depth=10, random_state=seed, class_weight='balanced')
model_rf.fit(features_train, target_train)
target_pred = model_rf.predict(features_valid)
target_pred_proba = model_rf.predict_proba(features_valid)[:, 1]

f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.6329723225030084
ROC_AUC_Score: 0.8711788107202679


RandomForestClassifier models were trained with and without the class_weight hyperparameter set to 'balanced'.
<br>With balanced class weights the F1 score rose from 0.59 to 0.63, while the ROC AUC score was essentially unchanged (0.872 vs 0.871).
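F1 and ROC AUC can move differently because they measure different things. F1 is computed from hard predictions at a fixed cut-off (0.5 for `predict`), while ROC AUC depends only on how the predicted probabilities rank the rows. A small sketch with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical labels and predicted probabilities for ten customers
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.80])

# AUC uses the ranking of the probabilities, so it has no threshold
auc = roc_auc_score(y_true, proba)

# F1 needs hard labels, so it changes with the cut-off
f1_by_threshold = {}
for threshold in (0.5, 0.38):
    pred = (proba >= threshold).astype(int)
    f1_by_threshold[threshold] = f1_score(y_true, pred)
    print(f'threshold={threshold}: F1={f1_by_threshold[threshold]:.3f}, AUC={auc:.3f}')
```

The same probabilities give one AUC but two quite different F1 values. Rebalancing the classes mostly shifts predicted probabilities relative to the 0.5 cut-off, which is one plausible reason why F1 reacts much more strongly than ROC AUC.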

In [17]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    features_zeros_df = pd.DataFrame(features_zeros)
    features_ones_df = pd.DataFrame(features_ones)
    target_zeros_df = pd.Series(target_zeros)
    target_ones_df = pd.Series(target_ones)
    
    features_upsampled = pd.concat([features_zeros_df] + [features_ones_df] * repeat)
    target_upsampled = pd.concat([target_zeros_df] + [target_ones_df] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=seed)

    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

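Upsampling was used as an alternative way to balance the classes: the 'Exited' rows of the training set are repeated 4 times, which brings the roughly 1:4 class ratio close to 1:1. Only the training set is upsampled; the validation and test sets keep their original class distribution.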
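The same repeat-and-shuffle idea is shown here on ten toy rows so that the class counts can be checked by eye. The data and the `features_up` and `target_up` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Ten toy rows: eight negatives followed by two positives
features = pd.DataFrame({'x': np.arange(10)})
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
repeat = 4

# Keep the negatives once and stack the positives `repeat` times
features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)

# Shuffle both objects with the same permutation so rows stay aligned
features_up, target_up = shuffle(features_up, target_up, random_state=12345)

print('before:', target.value_counts().to_dict())
print('after: ', target_up.value_counts().to_dict())
```

The upsampled rows are exact copies, so no new information is added; the model simply sees the minority class more often during training.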

In [18]:
model_lr = LogisticRegression(random_state=seed, solver='liblinear')
model_lr.fit(features_upsampled, target_upsampled)
target_pred = model_lr.predict(features_valid)
target_pred_proba = model_lr.predict_proba(features_valid)[:, 1]

f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.5246753246753246
ROC_AUC_Score: 0.7937558503300819


The LogisticRegression model trained on the upsampled data performed about the same as the one with class_weight set to 'balanced': F1 was 0.525 vs 0.528, and ROC AUC was 0.794 in both cases.

In [19]:
model_rf = RandomForestClassifier(n_estimators=250, max_depth=10, random_state=seed)
model_rf.fit(features_upsampled, target_upsampled)
target_pred = model_rf.predict(features_valid)
target_pred_proba = model_rf.predict_proba(features_valid)[:, 1]

f1_result = f1_score(target_valid, target_pred)
roc_auc_result = roc_auc_score(target_valid, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.6445182724252492
ROC_AUC_Score: 0.8714590107399744


The RandomForestClassifier model trained on the upsampled data had a slightly higher F1 score than the one with class_weight set to 'balanced' (0.645 vs 0.633), with essentially the same ROC AUC (0.871).

In [20]:
model_rf = RandomForestClassifier(n_estimators=250, max_depth=10, random_state=seed)
model_rf.fit(features_upsampled, target_upsampled)
target_pred = model_rf.predict(features_test)
target_pred_proba = model_rf.predict_proba(features_test)[:, 1]

f1_result = f1_score(target_test, target_pred)
roc_auc_result = roc_auc_score(target_test, target_pred_proba)

print(f'F1_Score: {f1_result}')
print(f'ROC_AUC_Score: {roc_auc_result}')

F1_Score: 0.6033519553072626
ROC_AUC_Score: 0.855482601245313


Finally, the RandomForestClassifier model trained on the upsampled training dataset was evaluated on the test dataset.
<br>The final F1 score was 0.60, which is above the required threshold of 0.59, and the test ROC AUC was 0.855.

## Conclusions

In the dataset, the 'Exited' class had far fewer observations than the 'Not Exited' class (about 20% vs 80%), resulting in class imbalance.
<br>To mitigate this issue, two different approaches were employed.
<br>1) The 'class_weight' hyperparameter was set to 'balanced', which assigned higher weights to the 'Exited' class, thus compensating for the class imbalance.
<br>2) The 'Exited' class was upsampled, which increased the representation of the 'Exited' class in the training dataset, thereby balancing the class distribution.
<br>Of the models tried, the Random Forest Classifier trained on the upsampled dataset yielded the highest validation F1 score (0.64). On the test dataset it achieved an F1 score of 0.60, surpassing the required threshold of 0.59, and an AUC-ROC of 0.86.
<br>Balancing the classes changed F1 far more than AUC-ROC: for logistic regression F1 rose from 0.31 to 0.53 while AUC-ROC stayed near 0.79, and for the random forest F1 rose from 0.59 to 0.63-0.64 while AUC-ROC stayed near 0.87.
<br>Overall, upsampling the minority ('Exited') class and training a Random Forest Classifier on the upsampled dataset was successful in mitigating the class imbalance issue and achieving a satisfactory F1 score on the test dataset.