# Bank customer churn prediction: MVP


The overall goal of this project is to predict the churn of bank customers. From a business perspective, churn prediction supports the effort to retain customers, with the ultimate goal of increasing profitability.

In this notebook, a minimum viable product (MVP) is set up to investigate some models and to generate a baseline against which further modeling efforts can be compared.

Customer churn is defined as the percentage of customers that stopped using a company's product or service offering in a defined time frame. One might think that customer churn is not so important as long as more new customers are acquired than are lost. This entirely forgets the cost of acquiring new customers: bringing in new customers is a lot less profitable than retaining existing ones. In financial services, for example, a 5% increase in customer retention produces more than a 25% increase in profit (http://www2.bain.com/Images/BB_Prescription_cutting_costs.pdf). One reason is that returning customers spend on average more than new customers. In online services, a loyal customer spends on average 2/3 more than a new one (http://www2.bain.com/Images/Value_online_customer_loyalty_you_capture.pdf). At the same time, there is a cost associated with acquiring new customers, which decreases when fewer new customers have to be acquired. Keeping existing customers thus frees up funds that would otherwise be needed to grow by acquiring new customers.
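
As a minimal sketch of the definition above, the churn rate for a period can be computed from the number of customers at the start of the period and the number lost during it. The function name `churn_rate` and the example numbers are hypothetical and only serve as an illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the share of customers lost during a period, in percent."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical numbers, for illustration only
rate = churn_rate(customers_at_start=10_000, customers_lost=2_000)
print(f"Churn rate: {rate:.1f}%")
```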

Customer churn can be reduced by pooling resources into keeping the most profitable customers, instead of focusing on keeping the overall number of customers up (including unprofitable ones). Another option is to find out why and when customers leave, and to target this specific point in the customer lifetime with efforts to avoid churn. In either case, customer churn has to be thoroughly analyzed, which is what this small example project is designed to deliver.

## Outline

This churn prediction project follows this outline:

1. Dataset description
2. Descriptive visualizations using Tableau
3. Data extraction, transforming, and loading (ETL)
4. **Analysis of the dataset**
5. Visualization of the insights

In this part of the project, stage 4 is covered. Stages 1 and 2 can be found here: http://heikokromer.com/index.php/2020/01/10/bank-customer-churn-prediction-identifying-the-question/


Stage 3 can be found here: https://kyso.io/heiko/bank-customer-churn-prediction-etl


## RandomForest

Following the previous investigation, a RandomForest classifier is used in this part to predict churn on the dataset.

### 1. Load dataset

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
pd.options.mode.chained_assignment = None  # default='warn'

class Model_002():
    """Collects the helper methods for the analysis part."""
    def __init__(self):
        # Directory that contains the output of the ETL stage
        self.save_path_prepared = '../02.Prepared_data/'

    def load_data(self, fname):
        """
        Reads and returns the dataset produced by the ETL process.
        """
        # save_path_prepared already ends with a slash
        data = pd.read_csv(f"{self.save_path_prepared}{fname}", index_col=0)

        return data

    def select_features(self, features, label_col, dataset):
        """
        Selects the feature columns (passed as a list) and the label column
        (label_col) from the dataset. Returns the features and the labels.
        """
        data_features = dataset[features]
        data_labels = dataset[label_col]

        return data_features, data_labels
    

In [55]:
FNAME = '2020-01-26.One_hot_encoded.csv'
data = Model_002().load_data(FNAME)

data.head()

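| Unnamed: 0 | RowNumber | CustomerId | Surname | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Gender_Female | Gender_Male | Geography_France | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2 | 15647311 | Hill | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 3 | 15619304 | Onio | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 4 | 15701354 | Boni | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 1 | 0 | 0 | 0 | 1 |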


In [56]:
# EstimatedSalary is left out: the previous analysis found that something was wrong with it
# There are only two genders in this dataset, so one of the two gender columns (Gender_Male) can be removed
features = ['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'Gender_Female',
 'Geography_France',
 'Geography_Germany',
 'Geography_Spain']

label_col = 'Exited'

X, y = Model_002().select_features(features, label_col, data)

fraction_train = 0.6
fraction_test = 0.4
random_state = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=fraction_test, random_state=random_state)

print(f"Training set size (X, y): {X_train.shape, y_train.shape}")
print(f"Test set size (X, y): {X_test.shape, y_test.shape}")

# QA on number of cols
assert X_train.shape[1] == X_test.shape[1]

# QA on number of rows total
assert X_train.shape[0] + X_test.shape[0] == X.shape[0]


Training set size (X, y): ((6000, 11), (6000,))
Test set size (X, y): ((4000, 11), (4000,))
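
The split above does not pass `stratify`, so the share of churned customers can differ slightly between the training and the test set. As a hedged sketch of one alternative (not what this notebook does), `train_test_split` accepts `stratify=y` to keep the class ratio the same in both subsets. The data below is a synthetic stand-in with an assumed 20% positive share, and the `*_demo` names are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn labels: 20% positives (assumed ratio)
rng = np.random.default_rng(42)
y_demo = pd.Series(np.r_[np.ones(200, dtype=int), np.zeros(800, dtype=int)])
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

print(f"Positive share, train: {y_tr.mean():.3f}")
print(f"Positive share, test:  {y_te.mean():.3f}")
```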


### 2. Building blocks for pipeline

In [57]:
# preprocessing

numerical_cols = ['CreditScore', 'Age', 'Tenure', 'Balance']
categorical_cols = ['NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Gender_Female', 'Geography_France', 'Geography_Germany', 'Geography_Spain']

# numerical feature scaling

# MinMaxScaler with the default feature range (0, 1)
mms_std = Pipeline(steps=[
        ('minmax', MinMaxScaler(feature_range=(0, 1)))])

# MinMaxScaler with feature range (-1, 1)
mms_minus1 = Pipeline(steps=[
        ('minmax', MinMaxScaler(feature_range=(-1, 1)))])

# StandardScaler with default settings (zero mean, unit variance)
ss = Pipeline(steps=[
        ('standard', StandardScaler())])
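
To make the difference between the three scalers concrete, the following small sketch applies each of them to a toy column. The values are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column of values (hypothetical, not taken from the dataset)
values = np.array([[0.0], [50.0], [100.0]])

scaled_01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(values)
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)
standardized = StandardScaler().fit_transform(values)

print(scaled_01.ravel())     # values mapped onto [0, 1]
print(scaled_pm1.ravel())    # values mapped onto [-1, 1]
print(standardized.ravel())  # zero mean, unit variance
```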

In [58]:
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

### MinMaxScaler, feature range (0, 1)

In [59]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', mms_std, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier())
    ]
)
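
Since the same pipeline layout is needed for each scaler, one option is to wrap its construction in a small helper. This is only a sketch: `build_pipe` and the dictionary keys are hypothetical names, and the fixed `random_state` on the classifier is an added assumption (the pipeline above uses the default).

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

numerical_cols = ['CreditScore', 'Age', 'Tenure', 'Balance']
categorical_cols = ['NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Gender_Female',
                    'Geography_France', 'Geography_Germany', 'Geography_Spain']


def build_pipe(scaler, random_state=42):
    """Hypothetical helper: same pipeline layout as above, with a swappable scaler."""
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', scaler, numerical_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)])
    return Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            # random_state is an assumption added here for reproducibility
            ('classifier', RandomForestClassifier(random_state=random_state))])


# One pipeline per scaler variant defined in this notebook
pipes = {
    'minmax_0_1': build_pipe(MinMaxScaler(feature_range=(0, 1))),
    'minmax_-1_1': build_pipe(MinMaxScaler(feature_range=(-1, 1))),
    'standard': build_pipe(StandardScaler()),
}
print(list(pipes))
```

Each entry in `pipes` could then be passed as `estimator` to the same hyperparameter search.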

In [53]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
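# Number of features to consider at every split
# Note: for classifiers 'auto' behaves like 'sqrt'; it was deprecated in scikit-learn 1.1 and removed in 1.3
max_features = ['auto', 'sqrt']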
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = [{'classifier__n_estimators': n_estimators,
               'classifier__max_features': max_features,
               'classifier__max_depth': max_depth,
               'classifier__min_samples_split': min_samples_split,
               'classifier__min_samples_leaf': min_samples_leaf,
               'classifier__bootstrap': bootstrap}]

rf_random = RandomizedSearchCV(estimator=pipe, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)

rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    8.5s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   47.8s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  1.8min finished


{'classifier__n_estimators': 300,
 'classifier__min_samples_split': 10,
 'classifier__min_samples_leaf': 2,
 'classifier__max_features': 'sqrt',
 'classifier__max_depth': 60,
 'classifier__bootstrap': True}
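
To put `n_iter=100` into perspective, the size of the full grid can be counted with `ParameterGrid`. The sketch below rebuilds the same candidate values as a plain dictionary. Nothing is fitted, so the `'auto'` entry is harmless here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the random grid above
random_grid = {
    'classifier__n_estimators': [int(x) for x in np.linspace(start=100, stop=1000, num=10)],
    'classifier__max_features': ['auto', 'sqrt'],
    'classifier__max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__bootstrap': [True, False],
}

# ParameterGrid only enumerates combinations; no model is trained
n_total = len(ParameterGrid(random_grid))
n_iter = 100
print(f"Combinations in the full grid: {n_total}")
print(f"Share sampled with n_iter={n_iter}: {n_iter / n_total:.1%}")
```

The full grid has 4320 combinations, so the randomized search samples only about 2.3% of them, which is why a narrower grid search around the best candidate follows.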

In [64]:
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
# Create the parameter grid based on the results of random search 
param_grid = {
    'classifier__n_estimators': [200, 300, 500],
    'classifier__min_samples_split': [8, 10, 12],
    'classifier__min_samples_leaf': [1, 2, 3],
    'classifier__max_features': ['sqrt', 'auto'],
    'classifier__max_depth': [50, 60, 70],
    'classifier__bootstrap': [True]
}

rf_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2,  scoring=scoring, refit='AUC', return_train_score=True)
rf_cv.fit(X_train, y_train)
results = rf_cv.cv_results_
print("Tuned hyperparameters :(best parameters) ", rf_cv.best_params_)
print("Best score (refit key) :", rf_cv.best_score_)
rf_optimized = rf_cv.best_estimator_

Fitting 3 folds for each of 162 candidates, totalling 486 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   55.5s
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 486 out of 486 | elapsed:  3.1min finished


Tuned hyperparameters :(best parameters)  {'classifier__bootstrap': True, 'classifier__max_depth': 50, 'classifier__max_features': 'sqrt', 'classifier__min_samples_leaf': 3, 'classifier__min_samples_split': 12, 'classifier__n_estimators': 300}
Best score (refit key) : 0.8595374245001812
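
With multi-metric scoring, `cv_results_` holds one set of columns per scorer name (for example `mean_test_AUC` and `mean_test_Accuracy`). The following sketch shows one way to inspect them as a DataFrame. It uses a small synthetic dataset and a reduced grid so that it runs quickly. The `*_demo` names and the grid values are assumptions for illustration, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=42)

scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
demo_cv = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={'min_samples_leaf': [1, 3], 'max_depth': [3, None]},
    cv=3, scoring=scoring, refit='AUC', return_train_score=True)
demo_cv.fit(X_demo, y_demo)

# One row per candidate; column names carry the scorer name as suffix
results_df = pd.DataFrame(demo_cv.cv_results_)
cols = ['params', 'mean_train_AUC', 'mean_test_AUC', 'mean_test_Accuracy', 'rank_test_AUC']
print(results_df[cols].sort_values('rank_test_AUC').to_string(index=False))
```

Comparing `mean_train_AUC` with `mean_test_AUC` gives a first impression of how strongly each candidate overfits.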


In [69]:
print("Classification report for RandomForestClassifier:")
rf_predicted_test = rf_optimized.predict(X_test)
clf_report = classification_report(y_test, rf_predicted_test, target_names=['Stayed', 'Exited'])
print(clf_report)

Classification report for RandomForestClassifier:
              precision    recall  f1-score   support

      Stayed       0.88      0.97      0.92      3190
      Exited       0.78      0.47      0.59       810

    accuracy                           0.87      4000
   macro avg       0.83      0.72      0.75      4000
weighted avg       0.86      0.87      0.85      4000
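
The printed report is convenient to read but awkward to process further. `classification_report` also accepts `output_dict=True`, which returns the same numbers as a nested dictionary. The labels and predictions in this sketch are toy values for illustration only.

```python
from sklearn.metrics import classification_report

# Toy labels and predictions (hypothetical, for illustration only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0]

report = classification_report(
    y_true, y_pred, target_names=['Stayed', 'Exited'], output_dict=True)

# Nested dictionary: class name -> metric name -> value
exited = report['Exited']
print(f"Exited precision: {exited['precision']:.2f}")
print(f"Exited recall:    {exited['recall']:.2f}")
print(f"Exited f1-score:  {exited['f1-score']:.2f}")
print(f"Accuracy:         {report['accuracy']:.2f}")
```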



- [ ] find a better way to extract precision, recall, and the other metrics, and store these values in a dictionary
- [ ] repeat the process for the other scalers
- [ ] make a feature importance plot
- [ ] write a conclusion
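
For the feature importance item, a possible starting point is the impurity-based `feature_importances_` attribute of a fitted `RandomForestClassifier`. This is a minimal sketch on synthetic data with hypothetical feature names, not results from the bank dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
feature_names = [f"feature_{i}" for i in range(5)]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, one value per input column, sorted for plotting
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

ax = importances.plot.barh()
ax.set_xlabel("Impurity-based importance")
ax.set_title("RandomForest feature importances (synthetic data)")
plt.tight_layout()
plt.show()
```

For the tuned pipeline, the fitted forest would be available as `rf_optimized.named_steps['classifier']`. Its importances refer to the columns produced by the preprocessor, so the one-hot encoded output columns have to be mapped back to names before plotting.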