## Load and inspect the dataset

In [1]:
import pandas as pd
customers_df = pd.read_csv('bank_customers.csv')
customers_df.head()

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [2]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


## Target incidence

The target column for prediction is 'Exited', which marks whether a customer has left the bank (1) or not (0).

The target incidence is the share of each target value in the dataset: the proportion of customers who exited (1) versus stayed (0). It shows whether the classes are balanced.

In [3]:
#!pip install tpot

In [4]:
#!pip install deap update_checker tqdm stopit

In [5]:
customers_df.Exited.value_counts(normalize=True).round(3)

0    0.796
1    0.204
Name: Exited, dtype: float64
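
Roughly one customer in five exited, so the classes are imbalanced. One consequence is that a model which always predicts the majority class already looks accurate. The sketch below illustrates this on a toy `Series` built to match the proportions printed above (796 zeros and 204 ones per 1,000 rows). The toy data and the names `toy_target`, `incidence`, and `baseline_accuracy` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for customers_df['Exited'], built to match the printed
# proportions (0.796 stayed, 0.204 exited). This is not the real data.
toy_target = pd.Series([0] * 796 + [1] * 204, name='Exited')

# Incidence as proportions, computed the same way as in the cell above
incidence = toy_target.value_counts(normalize=True).round(3)

# Accuracy of a trivial model that always predicts the most frequent class
majority_class = incidence.idxmax()
baseline_accuracy = (toy_target == majority_class).mean()

print(incidence)
print(f'Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}')
```

Any accuracy reported later is best read against this baseline rather than against 50%.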

## Find the optimal prediction model with TPOT

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    customers_df[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']],
    customers_df['Exited'],
    test_size = 0.25,
    stratify = customers_df['Exited'])

X_train.head(5)

,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
1780,802,33,8,0.0,2,1,0,143706.18
1876,640,39,9,131607.28,4,0,1,6981.43
3239,762,19,6,0.0,2,1,0,55500.17
1533,850,37,3,212778.2,1,0,1,69372.88
9199,544,26,6,0.0,1,1,0,100200.4


In [16]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

In [17]:
tpot = TPOTClassifier(
    generations = 5,
    population_size = 20,
    verbosity = 2,
    scoring = 'roc_auc',
    random_state = 42,
    disable_update_check = True,
    config_dict = 'TPOT light'
)
tpot.fit(X_train, y_train)

Optimization Progress:  33%|███▎      | 40/120 [00:07<00:16,  4.89pipeline/s]
Optimization Progress:  49%|████▉     | 59/120 [00:13<00:10,  5.79pipeline/s]
Optimization Progress:  66%|██████▌   | 79/120 [00:18<00:13,  3.08pipeline/s]
Optimization Progress:  83%|████████▎ | 100/120 [00:23<00:04,  4.80pipeline/s]
Optimization Progress: 100%|██████████| 120/120 [00:26<00:00,  4.49pipeline/s]
Generation 5 - Current best internal CV score: 0.8286398852793433
Best pipeline: DecisionTreeClassifier(ZeroCount(MultinomialNB(input_matrix, alpha=0.001, fit_prior=True)), criterion=entropy, max_depth=6, min_samples_leaf=15, min_samples_split=3)


TPOTClassifier(config_dict='TPOT light', disable_update_check=True,
               generations=5,
               log_file=<ipykernel.iostream.OutStream object at 0x7f0b44e51fd0>,
               population_size=20, random_state=42, scoring='roc_auc',
               verbosity=2)

In [18]:
# AUC score for tpot model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:')
for idx, (_, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print the step number and the fitted transformer or estimator
    print(f'{idx}. {transform}')


AUC score: 0.8364

Best pipeline steps:
1. StackingEstimator(estimator=MultinomialNB(alpha=0.001))
2. ZeroCount()
3. DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_leaf=15,
                       min_samples_split=3, random_state=42)


In [19]:
# Per-feature variance of the training set
X_train.var().round(3)

CreditScore        9.459865e+03
Age                1.135750e+02
Tenure             8.321000e+00
Balance            3.893007e+09
NumOfProducts      3.390000e-01
HasCrCard          2.060000e-01
IsActiveMember     2.500000e-01
EstimatedSalary    3.290748e+09
dtype: float64

## Decision Tree Classifier

In [24]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import metrics

In [29]:
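# Train a decision tree with the hyperparameters TPOT selected for the final
# pipeline step. The StackingEstimator(MultinomialNB) and ZeroCount steps are
# not reproduced here, so this is not identical to the TPOT pipeline.

clf = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=6,
    min_samples_leaf=15,
    min_samples_split=3,
    random_state=42
)

clf.fit(X_train, y_train)

# Predict class labels for the held-out test set
y_pred = clf.predict(X_test)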

In [30]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8576


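The decision tree classifier reaches an accuracy of about 86% (0.8576) on the test set. For context, always predicting the majority class (0) would already score about 80%, since 79.6% of customers stayed, so the gain over that baseline is roughly six percentage points.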
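
Because the classes are imbalanced, accuracy is easier to interpret next to a majority-class baseline and a threshold-free metric such as ROC AUC. The following is a minimal sketch of that comparison. To stay self-contained it uses synthetic data from `make_classification` with roughly an 80/20 class split as a stand-in for the bank dataset (an assumption), so the numbers it prints will not match the ones above. In the notebook, `X_train`, `X_test`, `y_train`, and `y_test` from the earlier split would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption): 8 features, ~80/20 classes
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Same tree hyperparameters as in the cell above
tree = DecisionTreeClassifier(
    criterion='entropy', max_depth=6, min_samples_leaf=15,
    min_samples_split=3, random_state=42).fit(X_tr, y_tr)
tree_acc = accuracy_score(y_te, tree.predict(X_te))
tree_auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print(f'Baseline accuracy: {baseline_acc:.4f}')
print(f'Tree accuracy:     {tree_acc:.4f}')
print(f'Tree ROC AUC:      {tree_auc:.4f}')
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_te, tree.predict(X_te)))
```

The confusion matrix shows how many exiting customers the tree actually catches, which a single accuracy figure hides.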