# Introduction
Load the `churn modeling` dataset. [This dataset](https://www.kaggle.com/shrutimechlearn/churn-modelling) contains details of a bank's customers. The target variable is binary and indicates whether the customer left the bank (closed their account) or remains a customer. Build an end-to-end machine learning pipeline to predict which customers are going to leave the bank.

## Importing Modules

In [1]:
import pandas as pd
import sklearn.model_selection
import sklearn.metrics
import sklearn.preprocessing
import sklearn.svm
import sklearn.tree
import plotly.express as px

## Loading the Dataset

In [3]:
df = pd.read_csv("../../datasets/churn_modeling.csv")
df = df.drop(["CustomerId", "Surname"], axis=1)
df = df.set_index("RowNumber")
df.head(3)

RowNumber,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,502,France,Female,42,8,159660.8,3,1,0,113931.57,1


## Splitting the Data into Training and Test Sets

In [4]:
x = df.drop(["Exited"], axis=1)
y = df["Exited"]
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)
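
The split above relies on the defaults of `train_test_split` (a 25% test set, no fixed seed, no stratification). Because churners are the minority class, a stratified and seeded split may give more repeatable results. The following is a minimal sketch on synthetic data; the arrays `x_demo` and `y_demo` and the seed value are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import sklearn.model_selection

# Toy stand-in for the churn data: 1000 rows, 20% positive labels (assumed ratio).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable between runs.
x_tr, x_te, y_tr, y_te = sklearn.model_selection.train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("train size:", len(y_tr), "positive rate:", y_tr.mean())
print("test size:", len(y_te), "positive rate:", y_te.mean())
```

In the notebook itself, the equivalent change would be passing `stratify=y` and a `random_state` to the existing call.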

## Encoding Categorical Attributes

In [5]:
# Building the one-hot encoder model
# Note: the encoder is fitted on every column, so numeric columns are one-hot encoded too
enc = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)

# Encoding the categorical attributes of training data
x_train = enc.transform(x_train)

# Encoding the categorical attributes of test data
x_test = enc.transform(x_test)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (7500, 12840)
x_test: (2500, 12840)


## Model Selection and Hyperparameter Tuning

In [6]:
# Decision Tree --------------------
parameters_grid = {
    "criterion": ["gini", "entropy"], 
    "max_depth": range(1, 20, 3),   # [1, 4, 7, ...]
    "min_samples_split": range(2, 20, 3)
}
model_1 = sklearn.model_selection.GridSearchCV(sklearn.tree.DecisionTreeClassifier(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_1.fit(x_train, y_train)
print("Accuracy of best decision tree classifier = {:.2f}".format(model_1.best_score_))
print("Best found hyperparameters of decision tree classifier = {}".format(model_1.best_params_))
# -----------------------------------

# SVM -------------------------------
parameters_grid = {
    "kernel": ["linear", "rbf", "poly"], 
    "C": [0.001, 0.01, 0.1, 1, 10, 100]
}
model_2 = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), 
                                               parameters_grid, scoring="accuracy", cv=5, n_jobs=-1)
model_2.fit(x_train, y_train)
print("Accuracy of best SVM classifier = {:.2f}".format(model_2.best_score_))
print("Best found hyperparameters of SVM classifier = {}".format(model_2.best_params_))
# -----------------------------------

Accuracy of best decision tree classifier = 0.84
Best found hyperparameters of decision tree classifier = {'criterion': 'gini', 'max_depth': 16, 'min_samples_split': 2}
Accuracy of best SVM classifier = 0.86
Best found hyperparameters of SVM classifier = {'C': 10, 'kernel': 'rbf'}
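
Since the goal is an end-to-end pipeline, the preprocessing and the classifier can also be wrapped in a single `Pipeline` and tuned together, so that the encoder is refitted inside each cross-validation fold. The sketch below uses synthetic data; the column subset, the label rule, and the reduced parameter grid are all assumptions chosen to keep the example small, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a few of the churn columns.
rng = np.random.default_rng(0)
n = 200
demo_x = pd.DataFrame({
    "Geography": rng.choice(["France", "Spain", "Germany"], size=n),
    "Age": rng.integers(18, 80, size=n),
    "Balance": rng.uniform(0, 200000, size=n),
})
# Hypothetical label rule, only so the example has something to learn.
demo_y = (demo_x["Age"] > 50).astype(int)

preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
    ("scale", StandardScaler(), ["Age", "Balance"]),
])
pipeline = Pipeline([("prep", preprocessor), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
search = GridSearchCV(
    pipeline,
    {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(demo_x, demo_y)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated accuracy = {:.2f}".format(search.best_score_))
```

After fitting, `search.predict` applies the same preprocessing to new rows automatically, which removes the need to transform `x_test` by hand.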


## Testing the Best Model

In [7]:
y_predicted = model_2.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
cm = sklearn.metrics.confusion_matrix(y_test, y_predicted)
# Each returned array holds one value per class: [class 0 (stayed), class 1 (exited)]
precision, recall, f1, support = sklearn.metrics.precision_recall_fscore_support(y_test, y_predicted)

print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-Score =", f1)
print("Confusion Matrix:\n", cm)

Accuracy = 0.846
Precision = [0.86200635 0.72727273]
Recall = [0.95909091 0.41538462]
F1-Score = [0.90796079 0.52876377]
Confusion Matrix:
 [[1899   81]
 [ 304  216]]
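
The reported metrics can be recomputed by hand from the confusion matrix, which also shows how they relate to each other. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with class 1 meaning the customer exited. The variable names are illustrative.

```python
# Confusion matrix values copied from the output above.
tn, fp = 1899, 81    # true class 0: predicted 0, predicted 1
fn, tp = 304, 216    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_churn = tp / (tp + fp)   # of predicted churners, how many really left
recall_churn = tp / (tp + fn)      # of real churners, how many were found
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

# Accuracy of always predicting the majority class (customer stays).
majority_baseline = (tn + fp) / total

print(f"accuracy          = {accuracy:.3f}")
print(f"precision (churn) = {precision_churn:.3f}")
print(f"recall (churn)    = {recall_churn:.3f}")
print(f"F1 (churn)        = {f1_churn:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

Always predicting "stays" would already reach about 0.79 accuracy on this test set, so the 0.846 accuracy is a modest gain, and the recall of roughly 0.42 for churners means more than half of the customers who left were missed.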
