Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.
We need to predict whether a customer will leave the bank soon. You have the data on clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you need an F1 score of at least 0.59. Check the F1 for the test set.
Additionally, measure the AUC-ROC metric and compare it with the F1.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('/datasets/Churn.csv')
data

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


Most of the features are numeric; the exceptions are Surname, Geography, and Gender. Geography and Gender can be one-hot encoded (OHE). Surname has far too many distinct values to encode usefully and is probably not relevant to estimating churn, so I will exclude it from the model. It can be added back if the client thinks it might be important.

RowNumber contains no relevant information and can be dropped. I will check the other columns for outliers, NaN values, and duplicates.

In [3]:
data['CreditScore'].min()

350

In [4]:
data['CreditScore'].max()

850

Credit scores stay within a reasonable range (350 to 850).

In [5]:
data[data.isna().any(axis=1)]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [6]:
data[data.drop(['Tenure'], axis=1).isna().any(axis=1)]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited


All NaN entries are in Tenure; no other column has missing values. The column can be dropped if the model requires it, or the data can be sent back to the engineers to see whether they can fill in the missing values.
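A third option, not used in this notebook, is to impute Tenure rather than drop it. The sketch below assumes median imputation plus a missing-value flag is acceptable. The toy frame and the TenureMissing column name are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data; the values are illustrative only.
toy = pd.DataFrame({
    'Tenure': [2.0, np.nan, 8.0, 1.0, np.nan, 5.0],
    'Exited': [1, 0, 1, 0, 0, 1],
})

# Keep a flag for the rows that were missing (hypothetical column name),
# so the model can still use "Tenure was unknown" as a signal.
toy['TenureMissing'] = toy['Tenure'].isna().astype(int)

# Fill the gaps with the median of the observed values.
tenure_median = toy['Tenure'].median()
toy['Tenure'] = toy['Tenure'].fillna(tenure_median)

print(toy)
```

In a real pipeline the median should be computed on the training split only and then reused for the validation and test splits, so that no information leaks from the held-out data.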

In [7]:
data['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

The data is imbalanced, at a ratio of roughly 4:1 (7963 to 2037) for remaining to exited customers. I will first train a model on the imbalanced data, then on several balanced versions.
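To make the imbalance concrete, the short calculation below uses the two counts printed above. It also scores a hypothetical baseline that predicts "stays" for every customer, which illustrates why F1 rather than accuracy is the metric of interest here.

```python
# Class counts taken from the value_counts() output above.
stayed, exited = 7963, 2037
total = stayed + exited

churn_share = exited / total   # fraction of customers who left
ratio = stayed / exited        # remaining customers per exited customer

# Hypothetical baseline: always predict class 0 ("stays").
tp, fp, fn = 0, 0, exited               # no churner is ever caught
baseline_accuracy = stayed / total      # looks decent despite being useless
baseline_f1 = 2 * tp / (2 * tp + fp + fn)

print(round(churn_share, 4), round(ratio, 2))
print(round(baseline_accuracy, 4), baseline_f1)
```

The baseline is right about 80% of the time yet has an F1 of zero for the churn class, because it never identifies a single customer who leaves.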

In [8]:
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [9]:
data = data.drop(['Surname', 'Tenure', 'RowNumber'], axis=1)
ohe_data = pd.get_dummies(data, drop_first=True)
features = ohe_data.drop(['CustomerId', 'Exited'], axis=1)
target = ohe_data['Exited']

In [10]:
features_train, features_2, target_train, target_2 = train_test_split(features, target, test_size=.4, random_state=42)

In [11]:
features_test, features_valid, target_test, target_valid = train_test_split(features_2, target_2, test_size=.5, random_state=42)

Split the data into training (60%), test (20%), and validation (20%) sets. Dropped the Surname, Tenure, and RowNumber columns, and left CustomerId out of the features. Applied OHE to the categorical features.
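The splits above are purely random, so the share of exited customers can drift between the three sets. One possible refinement, not used in this notebook, is the stratify argument of train_test_split. The sketch below runs on a toy frame with a 4:1 imbalance; the data and the features_rest / target_rest names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 4:1 class imbalance (illustrative only).
features = pd.DataFrame({'x': range(100)})
target = pd.Series([0] * 80 + [1] * 20)

# First split: 60% train, 40% held out, preserving the class ratio.
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, stratify=target, random_state=42)

# Second split: halve the held-out part into validation and test sets.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, stratify=target_rest, random_state=42)

for name, part in [('train', target_train), ('valid', target_valid), ('test', target_test)]:
    print(name, len(part), part.mean())
```

With stratification each split keeps the same 20% share of positives, which makes F1 scores on the validation and test sets easier to compare.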

In [12]:
model = LogisticRegression(solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

In [13]:
f1 = f1_score(target_valid, predicted_valid)
f1

0.09503239740820735

In [14]:
auc = roc_auc_score(target_valid, predicted_valid)
auc

0.5220782106354614

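The AUC for the logistic regression model trained on the unbalanced data is about 0.52, barely better than the 0.5 expected from random guessing, and the F1 score is only about 0.10. Note that the AUC here is computed from the hard 0/1 predictions, so it reflects a single decision threshold rather than the full ROC curve.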
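AUC-ROC is normally computed from scores or probabilities rather than from hard class labels. When roc_auc_score receives 0/1 predictions, the result reduces to the average of the true positive rate and the true negative rate at that one threshold. The sketch below shows the difference on hand-made toy values (not the churn data). In this notebook the equivalent change would be to pass model.predict_proba(features_valid)[:, 1] instead of predicted_valid.

```python
from sklearn.metrics import roc_auc_score

# Hand-made toy labels and scores (illustrative, not from the churn data).
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]    # stand-in for predict_proba(...)[:, 1]
labels = [int(s >= 0.5) for s in scores]   # stand-in for predict(...)

# AUC over the full ranking of scores.
auc_from_scores = roc_auc_score(y_true, scores)

# AUC from hard labels: only one threshold (0.5) is represented.
auc_from_labels = roc_auc_score(y_true, labels)

print(auc_from_scores, auc_from_labels)
```

On these toy values the score-based AUC is 0.875 while the label-based one is 0.625, so the AUC figures reported in this notebook are best read as single-threshold summaries rather than as ranking quality.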

In [15]:
model = LogisticRegression(solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)


In [16]:
f1 = f1_score(target_valid, predicted_valid)
f1

0.47462919594067143

In [17]:
auc = roc_auc_score(target_valid, predicted_valid)
auc

0.6792845504369723

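With class_weight='balanced', F1 rises from about 0.10 to 0.47 and AUC from about 0.52 to 0.68. This is a clear improvement, but still well short of the 0.59 F1 target.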
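For reference, scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below applies that formula to the full-dataset counts shown earlier. The actual weights used by the model depend on the class counts in features_train, which are not printed in this notebook, so these numbers are only an approximation.

```python
import numpy as np

# Full-dataset class counts from value_counts(); the training split will differ slightly.
class_counts = np.array([7963, 2037])
n_samples = class_counts.sum()
n_classes = len(class_counts)

# Formula documented for class_weight='balanced'.
weights = n_samples / (n_classes * class_counts)

print(dict(zip([0, 1], weights.round(3))))
```

The minority class ends up weighted about 3.9 times as heavily as the majority class, which mirrors the roughly 4:1 imbalance.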

In [18]:
from sklearn.ensemble import RandomForestClassifier

In [19]:
for estim in range(5, 51, 5):
    for depth in range(5, 20, 5):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth, random_state=12345)
        model.fit(features_train, target_train)
        predicted_valid = model.predict(features_valid)
        f1 = f1_score(target_valid, predicted_valid)
        auc = roc_auc_score(target_valid, predicted_valid)
        print('N-estimator:', estim, ' | Depth:', depth, ' | F1:', f1, ' | AUC:', auc)

N-estimator: 5  | Depth: 5  | F1: 0.48514851485148514  | AUC: 0.6616945637683307
N-estimator: 5  | Depth: 10  | F1: 0.5518248175182482  | AUC: 0.6987483335802104
N-estimator: 5  | Depth: 15  | F1: 0.5416078984485191  | AUC: 0.6955488075840617
N-estimator: 10  | Depth: 5  | F1: 0.44974446337308355  | AUC: 0.6455265886535329
N-estimator: 10  | Depth: 10  | F1: 0.5571847507331379  | AUC: 0.7011850096282032
N-estimator: 10  | Depth: 15  | F1: 0.5774647887323943  | AUC: 0.7144867427047844
N-estimator: 15  | Depth: 5  | F1: 0.4594594594594595  | AUC: 0.6498592801066508
N-estimator: 15  | Depth: 10  | F1: 0.5793304221251819  | AUC: 0.7129240112575915
N-estimator: 15  | Depth: 15  | F1: 0.5832147937411096  | AUC: 0.7167160420678418
N-estimator: 20  | Depth: 5  | F1: 0.4594594594594595  | AUC: 0.6498592801066508
N-estimator: 20  | Depth: 10  | F1: 0.5962373371924746  | AUC: 0.7220189601540512
N-estimator: 20  | Depth: 15  | F1: 0.5754985754985754  | AUC: 0.7125907272996593
N-estimator: 25  | De

The hyperparameter search gets us right around the minimum F1 score (the best run shown above reaches about 0.596, at 20 estimators and depth 10), but it would be good to see whether other methods can push it higher.
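Reading the best combination off a long printout is error-prone. One alternative is to collect the results in a DataFrame and sort them. The sketch below does this on synthetic data from make_classification, since the churn file is not available here. The smaller grid and the X_train / X_valid names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn features (not the real data).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rows = []
for estim in (5, 10, 20):
    for depth in (5, 10):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth, random_state=12345)
        model.fit(X_train, y_train)
        f1 = f1_score(y_valid, model.predict(X_valid))
        rows.append({'n_estimators': estim, 'max_depth': depth, 'f1': f1})

# Sort so the best validation F1 comes first.
results = pd.DataFrame(rows).sort_values('f1', ascending=False).reset_index(drop=True)
best = results.iloc[0]
print(results)
```

The same pattern could wrap the loops in this notebook unchanged, with best holding the winning n_estimators and max_depth.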

In [20]:
from sklearn.utils import shuffle
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
for estim in range(5, 51, 5):
    for depth in range(5, 20, 5):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth, random_state=12345)
        model.fit(features_upsampled, target_upsampled)
        predicted_valid = model.predict(features_valid)
        f1 = f1_score(target_valid, predicted_valid)
        auc = roc_auc_score(target_valid, predicted_valid)
        print('N-estimator:', estim, ' | Depth:', depth, ' | F1:', f1, ' | AUC:', auc)

N-estimator: 5  | Depth: 5  | F1: 0.57243195785777  | AUC: 0.7570952451488667
N-estimator: 5  | Depth: 10  | F1: 0.6145940390544707  | AUC: 0.7699674122352245
N-estimator: 5  | Depth: 15  | F1: 0.5763097949886105  | AUC: 0.7320841356836025
N-estimator: 10  | Depth: 5  | F1: 0.5913978494623656  | AUC: 0.7703451340542142
N-estimator: 10  | Depth: 10  | F1: 0.6133056133056133  | AUC: 0.7675455488075841
N-estimator: 10  | Depth: 15  | F1: 0.5978391356542616  | AUC: 0.740490297733669
N-estimator: 15  | Depth: 5  | F1: 0.5854513584574934  | AUC: 0.7683083987557399
N-estimator: 15  | Depth: 10  | F1: 0.6237006237006237  | AUC: 0.7749518589838543
N-estimator: 15  | Depth: 15  | F1: 0.628099173553719  | AUC: 0.7612131536068731
N-estimator: 20  | Depth: 5  | F1: 0.6037399821905609  | AUC: 0.7814471930084432
N-estimator: 20  | Depth: 10  | F1: 0.6299376299376299  | AUC: 0.7793956450896165
N-estimator: 20  | Depth: 15  | F1: 0.6323529411764706  | AUC: 0.759235668789809
N-estimator: 25  | Depth: 5 

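Upsampling raises the F1 score a little, but not by much: in the runs shown, the best F1 goes from about 0.60 to about 0.63. The AUC values are also higher, roughly 0.73 to 0.78 compared with 0.65 to 0.72 without upsampling.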
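A quick sanity check on the upsampling step is to confirm the class counts it produces. The helper below is a hypothetical, stripped-down version of the upsample logic that works on the target only, applied to a toy series with a 4:1 imbalance.

```python
import pandas as pd

def upsample_counts(target, repeat):
    """Return class counts after repeating the positive rows `repeat` times."""
    zeros = target[target == 0]
    ones = target[target == 1]
    upsampled = pd.concat([zeros] + [ones] * repeat)
    return upsampled.value_counts().to_dict()

# Toy target: 8 customers stayed, 2 exited (illustrative only).
toy_target = pd.Series([0] * 8 + [1] * 2)
print(upsample_counts(toy_target, 4))
```

With repeat=4 the two classes come out even. Upsampling is applied to the training split only, as in the cell above, so that the validation and test sets keep their original class distribution.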

In [21]:
model = LogisticRegression(solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
predicted_valid

array([1, 1, 0, ..., 0, 1, 1])

In [22]:
f1 = f1_score(target_valid, predicted_valid)
f1

0.46038863976083705

In [23]:
auc = roc_auc_score(target_valid, predicted_valid)
auc

0.6670567323359502

In [24]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.25)
for estim in range(5, 51, 5):
    for depth in range(5, 20, 5):
        model = RandomForestClassifier(n_estimators=estim, max_depth=depth, random_state=12345)
        model.fit(features_downsampled, target_downsampled)
        predicted_valid = model.predict(features_valid)
        f1 = f1_score(target_valid, predicted_valid)
        auc = roc_auc_score(target_valid, predicted_valid)
        print('N-estimator:', estim, ' | Depth:', depth, ' | F1:', f1, ' | AUC:', auc)

N-estimator: 5  | Depth: 5  | F1: 0.5912882298424467  | AUC: 0.7658346911568658
N-estimator: 5  | Depth: 10  | F1: 0.5674931129476584  | AUC: 0.7478373574285293
N-estimator: 5  | Depth: 15  | F1: 0.5418060200668897  | AUC: 0.7359798548363206
N-estimator: 10  | Depth: 5  | F1: 0.6007532956685501  | AUC: 0.7712487038957192
N-estimator: 10  | Depth: 10  | F1: 0.5831012070566387  | AUC: 0.7590653236557546
N-estimator: 10  | Depth: 15  | F1: 0.5757575757575758  | AUC: 0.7509406013923864
N-estimator: 15  | Depth: 5  | F1: 0.5925233644859813  | AUC: 0.765738409124574
N-estimator: 15  | Depth: 10  | F1: 0.581651376146789  | AUC: 0.7593689823729818
N-estimator: 15  | Depth: 15  | F1: 0.5807033363390441  | AUC: 0.7607243371352392
N-estimator: 20  | Depth: 5  | F1: 0.6045627376425856  | AUC: 0.7729521552362613
N-estimator: 20  | Depth: 10  | F1: 0.5912882298424467  | AUC: 0.7658346911568658
N-estimator: 20  | Depth: 15  | F1: 0.5902255639097745  | AUC: 0.7632054510442898
N-estimator: 25  | Depth:

In [25]:
model = LogisticRegression(solver='liblinear')
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
predicted_valid

array([1, 1, 0, ..., 0, 1, 1])

In [26]:
f1 = f1_score(target_valid, predicted_valid)
f1

0.4599250936329588

In [27]:
auc = roc_auc_score(target_valid, predicted_valid)
auc

0.6665308843134351

In [28]:
model = RandomForestClassifier(n_estimators=45, max_depth=10, random_state=12345)
model.fit(features_upsampled, target_upsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=45,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [29]:
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
f1

0.631578947368421

In [30]:
auc = roc_auc_score(target_valid, predicted_valid)
auc

0.7787735150348096

The model with the highest validation F1 score was the random forest trained on the upsampled data with n_estimators=45 and max_depth=10. Next, we check its results on the test set.

In [31]:
predicted_test = model.predict(features_test)
f1 = f1_score(target_test, predicted_test)
f1

0.5863267670915411

In [32]:
auc = roc_auc_score(target_test, predicted_test)
auc

0.7619070825211176

On the test set the model reaches an F1 score of about 0.586, just short of the 0.59 minimum, although it clears the minimum on the validation set (about 0.632). Some drop is expected, because the hyperparameters were chosen to maximize the validation score. There may be a better model to use, but this one (the upsampled random forest) was the best of all those tested. Across the experiments, AUC tended to rise and fall together with F1, keeping in mind that the AUC values here were computed from hard class predictions.
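One possible next step, not tried in this notebook, is to tune the decision threshold instead of using the default 0.5 cut-off implied by predict. The sketch below sweeps thresholds over toy validation probabilities. The arrays and the f1_at helper are illustrative assumptions. With the real model the probabilities would come from model.predict_proba(features_valid)[:, 1].

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy validation labels and positive-class probabilities (illustrative only).
y_valid = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba_valid = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.42, 0.60, 0.80, 0.55])

def f1_at(threshold):
    """F1 when every probability at or above `threshold` is labeled as churn."""
    predicted = (proba_valid >= threshold).astype(int)
    return f1_score(y_valid, predicted, zero_division=0)

thresholds = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best_threshold = max(thresholds, key=f1_at)
best_f1 = f1_at(best_threshold)
f1_default = f1_at(0.5)

print(best_threshold, best_f1, f1_default)
```

The threshold should be selected on the validation set and then applied once to the test set. Tuning it on the test set directly would make the reported test F1 optimistic.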