## Dataset Description

This is a **customer churn** modeling dataset.

There are 10,000 rows (each representing a unique customer) and 14 columns: 13 features and one target (Exited).

The data is composed of both numerical and categorical features:

**Numeric Features:**
- CustomerId: A unique ID of the customer.
- CreditScore: The credit score of the customer.
- Age: The age of the customer.
- Tenure: The number of months the client has been with the firm.
- Balance: The balance remaining in the customer's account.
- NumOfProducts: The number of products the customer holds.
- EstimatedSalary: The estimated salary of the customer.

**Categorical Features:**
- Surname: The surname of the customer.
- Geography: The country of the customer.
- Gender: The gender of the customer (Male/Female).
- HasCrCard: Whether the customer has a credit card or not.
- IsActiveMember: Whether the customer is active or not.

**The target column:** 
Exited: Whether the customer churned (1) or not (0).

The dataset can be seen and downloaded [here](https://drive.google.com/file/d/12G9RpQauml0QOUAB3aaPaJVduyEnnMzR/view).

# 1 - ETL - Extract, Transform, Load

Let's take the following steps:
1. ETL - Extract, Transform, Load
    
    1.1 Load Libraries and install Dependencies

    1.2 Login to Weights & Biases
    
    1.3 Fetch Data
    
    1.4 Exploratory Data Analysis (EDA)
    
    1.5 Pre-processing
    
    1.6 Clean Data

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.1 - Load Libraries and install Dependencies

In [2]:
# install dependencies
!pip install ipython==7.22.0
!pip install joblib==1.0.1
!pip install lightgbm==3.3.1
!pip install matplotlib
!pip install numpy
!pip install pandas
!pip install scikit_learn==0.24.1
!pip install seaborn
!pip install pandas-profiling==3.1.0
!pip install wandb

In [3]:
import wandb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
import tempfile
import os
import logging
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, recall_score, confusion_matrix, classification_report 
import subprocess
import joblib
# Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 1.2 - Login to Weights & Biases

In [4]:
# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.3 - Fetch Data 

In [5]:
# download the data
!wget https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/customer-churn-detection/Churn_Modelling.csv

--2022-05-17 22:53:11--  https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/customer-churn-detection/Churn_Modelling.csv
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/machine-learning/customer-churn-detection/Churn_Modelling.csv [following]
--2022-05-17 22:53:11--  https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/machine-learning/customer-churn-detection/Churn_Modelling.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 674857 (659K) [text/plain]
Saving to: ‘Churn_Modelling.csv’


2022-05-17 22:53:13 (12.6 MB

In [6]:
# Upload Churn_Modelling.csv to W&B and store it as the raw_data.csv artifact
!wandb artifact put \
      --name projeto_01_EEC1731/raw_data.csv \
      --type raw_data \
      --description "The raw data for Customer Churn Detection" Churn_Modelling.csv

wandb: Uploading file Churn_Modelling.csv to: "macleal/projeto_01_EEC1731/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: macleal. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.16
wandb: Run data is saved locally in /content/wandb/run-20220517_225315-3io9ulft
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run logical-frost-13
wandb: ⭐️ View project at https://wandb.ai/macleal/projeto_01_EEC1731
wandb: 🚀 View run at https://wandb.ai/macleal/projeto_01_EEC1731/runs/3io9ulft
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("macleal/projeto_01_EEC1731/raw_data.csv:v0")

wandb: Waiting for W&B process to finish... (success).

## 1.4 - Exploratory Data Analysis (EDA)

In [7]:
# save_code=True tracks changes to the notebook and syncs them with W&B
run = wandb.init(project="projeto_01_EEC1731", save_code=True)

wandb: Currently logged in as: macleal. Use `wandb login --relogin` to force relogin


In [8]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact("projeto_01_EEC1731/raw_data.csv:latest")

# create a dataframe from the artifact
dc = pd.read_csv(artifact.file())
dc.head(5) 

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [9]:
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [10]:
dc.describe(exclude= ['O']) # Describe all numerical columns
dc.describe(include = ['O']) # Describe all non-numerical/categorical columns

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


Unnamed: 0,Surname,Geography,Gender
count,10000,10000,10000
unique,2932,3,2
top,Smith,France,Male
freq,32,5014,5457


In [11]:
dc.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [12]:
#ProfileReport(dc, title="Pandas Profiling Report", explorative=True)

In [13]:
# Checking number of unique customers in the dataset
dc.shape[0], dc.CustomerId.nunique()

(10000, 10000)

In [14]:
# churn value Distribution
dc["Exited"].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
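
The counts above imply a churn rate of roughly 20%, which matches the `Exited` mean reported by `describe()`. As a quick sanity check (a minimal sketch that hard-codes the two counts from the output above instead of reading them from `dc`), the rate and the accuracy of a trivial "nobody churns" baseline can be computed as:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 7963, 1: 2037}

total = sum(counts.values())
churn_rate = counts[1] / total

# A model that always predicts the majority class (0, "stayed")
# would already reach this accuracy without learning anything.
majority_baseline_accuracy = max(counts.values()) / total

print(f"total customers: {total}")
print(f"churn rate: {churn_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

With about 80% of customers in the majority class, accuracy alone says little here, which is presumably why recall and ROC AUC are among the imported metrics.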

In [15]:
dc.groupby(['Surname']).agg({'RowNumber':'count', 'Exited':'mean'}
                                  ).reset_index().sort_values(by='RowNumber', ascending=False).head()

Unnamed: 0,Surname,RowNumber,Exited
2473,Smith,32,0.28125
1689,Martin,29,0.310345
2389,Scott,29,0.103448
2751,Walker,28,0.142857
336,Brown,26,0.192308


In [16]:
dc.groupby(['Geography']).agg({'RowNumber':'count', 'Exited':'mean'}
                                  ).reset_index().sort_values(by='RowNumber', ascending=False)

Unnamed: 0,Geography,RowNumber,Exited
0,France,5014,0.161548
1,Germany,2509,0.324432
2,Spain,2477,0.166734


In [17]:
sns.set(style="whitegrid")
sns.boxplot(y=dc['CreditScore'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [18]:
sns.boxplot(y=dc['Age'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [19]:
sns.violinplot(y = dc.Tenure)

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [20]:
sns.violinplot(y = dc['Balance'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [21]:
sns.set(style = 'ticks')
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(dc.NumOfProducts, kde=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [22]:
# For numerical features, one of the most useful things to examine is the distribution of the data.
# A kernel density estimation (KDE) plot serves that purpose.
sns.kdeplot(dc.EstimatedSalary)

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [23]:
run.finish()


## 1.5 - Pre-processing

In [24]:
# create a new wandb job
run = wandb.init(project="projeto_01_EEC1731", job_type="process_data")

In [25]:
# create a new artifact
input_artifact="projeto_01_EEC1731/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data after preprocessing"

In [26]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact(input_artifact)

# create a dataframe from the artifact
dc = pd.read_csv(artifact.file())

In [27]:
# Separating out different columns into various categories as defined above
target_var = ['Exited']
cols_to_remove = ['RowNumber', 'CustomerId']
# numerical columns
num_feats = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
# categorical columns
cat_feats = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

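In [45]:
# Note: this cell was executed out of order (execution count 45), after `dc` had been
# reassigned to the 1,000-row test split, so the output below does not show the full dataset.
dc.head(5)
dc.shape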

Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
8073,Patel,777,Germany,Female,34,5,96693.66,1,1,1,172618.52,0
2083,Buccho,534,France,Male,24,1,0.0,1,1,1,169653.32,0
8586,Jen,650,Germany,Female,46,9,149003.76,2,1,0,176902.83,0
7951,Crawford,850,France,Female,40,0,0.0,2,1,0,1099.95,0
6426,Sokolova,743,Spain,Male,45,7,157332.26,1,1,0,125424.42,0


(1000, 12)

In [28]:
dc.drop(cols_to_remove, axis=1, inplace=True)

In [29]:
# Delete duplicated rows
dc.drop_duplicates(inplace=True)

# Generate a "clean data file"
dc.to_csv(artifact_name,index=False)

In [30]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                          type=artifact_type,
                          description=artifact_description)
artifact.add_file(artifact_name)

<ManifestEntry digest: jh9BviyMoEovUBDRWFrC6A==>

In [31]:
# Upload the artifact to Wandb
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f91ddf7a190>

In [32]:
# Close the run.
# Wait a moment after running the previous cell before executing this one, so the upload can finish.
run.finish()


# 2 - Data Checks

## 2.1 - Pytest

pytest uses the following conventions to automatically discover tests:

- Files containing tests should be named `test_*.py` or `*_test.py`.
- Test function names should start with `test_`.

An important aspect of using ``pytest`` is understanding how fixture scope works.

The scope of a fixture can take a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**: with the former, the fixture is executed only once per pytest session and the value it returns is shared by all the tests that need it; with the latter, every test function gets a fresh copy of the data. The latter is useful if, for example, the tests modify the input in a way that would make other tests fail.
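
The difference between the two scopes can be sketched with a small test file. This is an illustrative example, not the project's actual test suite: `load_rows`, the fixture names, and the two hard-coded rows (loosely based on the first rows of the dataset) are assumptions made for the sketch.

```python
import pytest


def load_rows():
    """Hypothetical stand-in for reading the dataset; returns a new list on every call."""
    return [
        {"CreditScore": 619, "Age": 42, "Exited": 1},
        {"CreditScore": 608, "Age": 41, "Exited": 0},
    ]


@pytest.fixture(scope="session")
def shared_rows():
    # Built once per pytest session; every test receives the same object.
    return load_rows()


@pytest.fixture(scope="function")
def fresh_rows():
    # Rebuilt for every test function, so in-place changes do not leak.
    return load_rows()


def test_target_is_binary(shared_rows):
    # Read-only check, so the shared session-scoped data is safe to use.
    assert {row["Exited"] for row in shared_rows} <= {0, 1}


def test_can_mutate_safely(fresh_rows):
    # This test modifies its input, so it asks for the function-scoped fixture.
    fresh_rows.pop()
    assert len(fresh_rows) == 1


def test_fresh_copy_is_intact(fresh_rows):
    # Passes regardless of test order because the fixture is rebuilt each time.
    assert len(fresh_rows) == 2
```

Saved under a name that follows the discovery conventions, for example `test_data.py`, the file would be picked up by running `pytest` in the same directory.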

## 2.2 - Create and run a test file

# 3 - Data Segregation

In [33]:
# global variables
# ratio used to split train and test data
test_size = 0.1
# seed used for reproducibility
seed = 42
# reference (column) to stratify the data
stratify = "Exited"
# name of the input artifact
artifact_input_name = "projeto_01_EEC1731/preprocessed_data.csv:latest"
# type of the artifact
artifact_type = "segregated_data"

In [34]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate wandb project
run = wandb.init(project="projeto_01_EEC1731", job_type="split_data")

logger.info("Downloading and reading artifact")
artifact = run.use_artifact(artifact_input_name)
artifact_path = artifact.file()
dc = pd.read_csv(artifact_path)

# First split into train/test; the train set is later divided into train and validation
logger.info("Splitting data into train and test")
splits = {}

# Keeping aside a test/holdout set
splits["train"], splits["test"] = train_test_split(dc, test_size = test_size, stratify = dc[stratify], random_state = seed)



17-05-2022 23:03:42 Downloading and reading artifact
17-05-2022 23:03:43 Splitting data into train and test
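
To see what the `stratify` argument does, the following sketch checks that both partitions keep the original class ratio. It uses a synthetic label array with a 20% positive rate rather than the churn data, so the variable names and sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 negatives and 20 positives (20% positive rate).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# With stratification, each partition keeps the 20% positive rate.
print(f"train size: {len(y_train)}, positive rate: {y_train.mean():.2f}")
print(f"test size:  {len(y_test)}, positive rate: {y_test.mean():.2f}")
```

Without `stratify`, a small test set drawn from imbalanced data can end up with noticeably more or fewer positives than the full dataset, which makes the holdout metrics harder to compare.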


In [35]:
# Save the artifacts. We use a temporary directory so we do not leave any trace behind
with tempfile.TemporaryDirectory() as tmp_dir:

    for split, dc in splits.items():

        # Make the artifact name from the name of the split plus the provided root
        artifact_name = f"{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        logger.info(f"Uploading the {split} dataset to {artifact_name}")

        # Save then upload to W&B
        dc.to_csv(temp_path,index=False)

        artifact = wandb.Artifact(name=artifact_name,
                                  type=artifact_type,
                                  description=f"{split} split of dataset {artifact_input_name}",
        )
        artifact.add_file(temp_path)

        logger.info("Logging artifact")
        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you
        # do not add this, the temp directory might be removed before
        # W&B had a chance to upload the datasets, and the upload
        # might fail
        artifact.wait()

17-05-2022 23:03:43 Uploading the train dataset to train.csv


<ManifestEntry digest: ae0JhsHUjOefxhJLO1vZJA==>

17-05-2022 23:03:43 Logging artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f91de4ff550>

<Artifact QXJ0aWZhY3Q6MTI5NjkwOTQ5>

17-05-2022 23:03:47 Uploading the test dataset to test.csv


<ManifestEntry digest: t+fpbEguwGRvJMTV004icA==>

17-05-2022 23:03:47 Logging artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f91de528fd0>

<Artifact QXJ0aWZhY3Q6MTI5NjkwOTcy>

In [36]:
run.finish()


# 4 - Train

## 4.1 - Holdout

In [108]:
# global variables

# ratio used to split train and validation data
val_size = 0.12
# seed used for reproducibility
seed = 42
# reference (column) to stratify the data
stratify = "Exited"
# name of the input artifact
artifact_input_name = "projeto_01_EEC1731/train.csv:latest"
# type of the artifact
artifact_type = "Train"

In [109]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate the wandb project
run = wandb.init(project="projeto_01_EEC1731",job_type="train")

logger.info("Downloading and reading train artifact")
local_path = run.use_artifact(artifact_input_name).file()
dc_train_val = pd.read_csv(local_path)

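# Splitting train.csv into train and validation datasets
logger.info("Spliting data into train/val")
# hold out a validation set; note that this split is shuffled but not stratified
# (dc_train_val[stratify] is passed as the label array, not as the stratify argument)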
dc_train, dc_val, y_train, y_val = train_test_split(dc_train_val,
                                                  dc_train_val[stratify],
                                                  test_size=val_size,
                                                  random_state=seed,
                                                  shuffle=True)


18-05-2022 00:54:29 Downloading and reading train artifact
18-05-2022 00:54:30 Spliting data into train/val


## 4.2 - Data preparation

In [110]:
# load test artifact
artifact_input_name = "projeto_01_EEC1731/test.csv:latest"
logger.info("Downloading and reading test artifact")
local_path = run.use_artifact(artifact_input_name).file()
dc_test = pd.read_csv(local_path)
y_test = dc_test["Exited"]

18-05-2022 00:54:34 Downloading and reading test artifact


In [111]:
# Logger Info
logger.info("dc train: {}".format(dc_train.shape))
logger.info("y train: {}".format(y_train.shape))
logger.info("dc val: {}".format(dc_val.shape))
logger.info("y val: {}".format(y_val.shape))
logger.info("dc test: {}".format(dc_test.shape))
logger.info("y test: {}".format(y_test.shape))

18-05-2022 00:54:35 dc train: (7920, 12)
18-05-2022 00:54:35 y train: (7920,)
18-05-2022 00:54:35 dc val: (1080, 12)
18-05-2022 00:54:35 y val: (1080,)
18-05-2022 00:54:35 dc test: (1000, 12)
18-05-2022 00:54:35 y test: (1000,)


In [112]:
dc_train.head()

Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
247,Mai,591,Spain,Male,31,8,0.0,1,1,1,141677.33,0
852,Manna,649,France,Male,45,7,0.0,2,0,1,75204.21,0
1650,Ugoji,622,France,Male,35,8,0.0,2,1,1,131772.51,0
4288,Bluett,766,Germany,Female,38,7,130933.74,1,0,1,2035.94,0
199,Wallace,663,France,Female,39,8,0.0,2,1,1,101168.9,0


In [113]:
dc_test.head()

Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,Patel,777,Germany,Female,34,5,96693.66,1,1,1,172618.52,0
1,Buccho,534,France,Male,24,1,0.0,1,1,1,169653.32,0
2,Jen,650,Germany,Female,46,9,149003.76,2,1,0,176902.83,0
3,Crawford,850,France,Female,40,0,0.0,2,1,0,1099.95,0
4,Sokolova,743,Spain,Male,45,7,157332.26,1,1,0,125424.42,0


In [115]:
# Outlier Removal
logger.info("Outlier Removal")
# temporary variable
x = dc_train.select_dtypes("int64").copy()
# identify outlier in the dataset
lof = LocalOutlierFactor()
outlier = lof.fit_predict(x)
mask = outlier != -1
# Logger Information
logger.info("dc_train shape [original]: {}".format(dc_train.shape))
logger.info("dc_train shape [outlier removal]: {}".format(dc_train.loc[mask,:].shape))

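# To avoid data leakage, outlier detection is fitted on the training split only;
# it should not be done in the preprocessing stage, before the data is split.
# Note that this procedure is not applied to the validation set, and that the mask
# is only used for logging here: dc_train itself keeps all of its rows.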

18-05-2022 00:54:35 Outlier Removal
18-05-2022 00:54:35 dc_train shape [original]: (7920, 12)
18-05-2022 00:54:35 dc_train shape [outlier removal]: (7891, 12)
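
`LocalOutlierFactor.fit_predict` returns `1` for inliers and `-1` for outliers, which is why the mask above keeps the rows whose label differs from `-1`. A minimal sketch on synthetic data (the Gaussian cluster and the injected far-away point are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# 200 points around the origin plus one point placed far from the cluster.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_point = np.array([[15.0, 15.0]])
X = np.vstack([inliers, far_point])

lof = LocalOutlierFactor()      # default settings, as in the cell above
labels = lof.fit_predict(X)     # 1 = inlier, -1 = outlier
mask = labels != -1

X_clean = X[mask]               # keep only the rows flagged as inliers
print(f"original shape: {X.shape}")
print(f"after outlier removal: {X_clean.shape}")
print(f"label of the injected point: {labels[-1]}")
```

Applying the mask (for example `dc_train.loc[mask, :]`) is what actually drops the flagged rows; computing it alone leaves the data unchanged.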


## 4.3 - Encoding Categorical Features

In [116]:
# Label encoding with scikit-learn's LabelEncoder
le = LabelEncoder()
# Label encoding of Gender variable
dc_train['Gender'] = le.fit_transform(dc_train['Gender'])
le_gender_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
le_gender_mapping

{'Female': 0, 'Male': 1}

In [117]:
# Encoding Gender feature for validation and test set
dc_val['Gender'] = dc_val.Gender.map(le_gender_mapping)
dc_test['Gender'] = dc_test.Gender.map(le_gender_mapping)

# Fill missing/NaN values created by categories not seen in training
dc_val['Gender'] = dc_val['Gender'].fillna(-1)
dc_test['Gender'] = dc_test['Gender'].fillna(-1)

In [118]:
dc_train.Gender.unique(), dc_val.Gender.unique(), dc_test.Gender.unique()

(array([1, 0]), array([1, 0]), array([0, 1]))
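
The mapping-plus-`fillna` pattern used above for the validation and test sets can be seen in isolation on a toy series. The value `"Other"` is a hypothetical category that was not present when the mapping was built:

```python
import pandas as pd

# Mapping learned on the training data (copied from the output above).
le_gender_mapping = {"Female": 0, "Male": 1}

# Toy "unseen" data containing one category the mapping does not know.
new_data = pd.Series(["Male", "Female", "Other"])

# map() yields NaN for unknown categories; fillna(-1) gives them a sentinel code.
encoded = new_data.map(le_gender_mapping).fillna(-1).astype(int)
print(encoded.tolist())
```

Calling `LabelEncoder.transform` directly on data with an unseen label would raise an error instead, which is the reason for going through an explicit mapping.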

In [119]:
# With the sklearn method(LabelEncoder())
le_ohe = LabelEncoder()
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse=False)
enc_train = le_ohe.fit_transform(dc_train.Geography).reshape(dc_train.shape[0],1)
ohe_train = ohe.fit_transform(enc_train)
ohe_train

array([[0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.]])

In [120]:
# mapping between classes
le_ohe_geography_mapping = dict(zip(le_ohe.classes_, le_ohe.transform(le_ohe.classes_)))
le_ohe_geography_mapping

{'France': 0, 'Germany': 1, 'Spain': 2}

In [121]:
# Encoding Geography feature for validation and test set
enc_val = dc_val.Geography.map(le_ohe_geography_mapping).ravel().reshape(-1,1)
enc_test = dc_test.Geography.map(le_ohe_geography_mapping).ravel().reshape(-1,1)

# Filling missing/NaN values created due to new categorical levels
enc_val[np.isnan(enc_val)] = 9999
enc_test[np.isnan(enc_test)] = 9999

In [122]:
ohe_val = ohe.transform(enc_val)
ohe_test = ohe.transform(enc_test)

In [123]:
# Show what happens when an unseen value is passed to the one-hot encoder
ohe.transform(np.array([[9999]]))

array([[0., 0., 0.]])
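
As an aside, `OneHotEncoder` can also be fitted on the string column directly, in which case the intermediate integer codes and the `9999` sentinel are not needed. This is a variation on the approach above, not what the notebook does. A minimal sketch, assuming a toy list of countries in which `"Italy"` is a made-up unseen value; `.toarray()` is used so the example does not depend on the `sparse` argument:

```python
from sklearn.preprocessing import OneHotEncoder

# Toy training column: one feature (country) per row.
train_countries = [["France"], ["Germany"], ["Spain"], ["France"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_countries)

# "Italy" was not seen during fit, so it is encoded as an all-zero row.
new_countries = [["Spain"], ["Italy"]]
encoded = encoder.transform(new_countries).toarray()

print(encoder.categories_[0].tolist())
print(encoded.tolist())
```

The all-zero row is the same behavior shown above for the value `9999`: with `handle_unknown='ignore'`, an unknown category simply activates none of the one-hot columns.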

In [124]:
cols = ['country_' + str(x) for x in le_ohe_geography_mapping.keys()]
cols

['country_France', 'country_Germany', 'country_Spain']

In [125]:
# Adding to the respective dataframes
dc_train = pd.concat([dc_train.reset_index(drop=True), pd.DataFrame(ohe_train, columns = cols)], axis = 1)
dc_val = pd.concat([dc_val.reset_index(drop=True), pd.DataFrame(ohe_val, columns = cols)], axis = 1)
dc_test = pd.concat([dc_test.reset_index(drop=True), pd.DataFrame(ohe_test, columns = cols)], axis = 1)
print("Training set")
dc_train.head()
print("\n\nValidation set")
dc_val.head()
print("\n\nTest set")
dc_test.head()

Training set


Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain
0,Mai,591,Spain,1,31,8,0.0,1,1,1,141677.33,0,0.0,0.0,1.0
1,Manna,649,France,1,45,7,0.0,2,0,1,75204.21,0,1.0,0.0,0.0
2,Ugoji,622,France,1,35,8,0.0,2,1,1,131772.51,0,1.0,0.0,0.0
3,Bluett,766,Germany,0,38,7,130933.74,1,0,1,2035.94,0,0.0,1.0,0.0
4,Wallace,663,France,0,39,8,0.0,2,1,1,101168.9,0,1.0,0.0,0.0




Validation set


Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain
0,Yegorova,700,France,1,37,1,135179.49,1,1,0,160670.37,0,1.0,0.0,0.0
1,Cyril,800,France,0,40,3,75893.11,2,1,0,132562.23,0,1.0,0.0,0.0
2,Loton,602,Spain,1,37,3,107592.89,2,0,1,153122.73,0,0.0,0.0,1.0
3,Clark,682,France,0,30,9,0.0,2,1,1,195104.91,0,1.0,0.0,0.0
4,O'Meara,551,France,1,42,1,50194.59,1,1,1,23399.58,0,1.0,0.0,0.0




Test set


Unnamed: 0,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain
0,Patel,777,Germany,0,34,5,96693.66,1,1,1,172618.52,0,0.0,1.0,0.0
1,Buccho,534,France,1,24,1,0.0,1,1,1,169653.32,0,1.0,0.0,0.0
2,Jen,650,Germany,0,46,9,149003.76,2,1,0,176902.83,0,0.0,1.0,0.0
3,Crawford,850,France,0,40,0,0.0,2,1,0,1099.95,0,1.0,0.0,0.0
4,Sokolova,743,Spain,1,45,7,157332.26,1,1,0,125424.42,0,0.0,0.0,1.0


In [126]:
dc_train.drop(['Geography'], axis=1, inplace=True)
dc_val.drop(['Geography'], axis=1, inplace=True)
dc_test.drop(['Geography'], axis=1, inplace=True)

In [127]:
means = dc_train.groupby(['Surname']).Exited.mean()
means.head()
means.tail()

Surname
Abazu      0.00
Abbie      0.00
Abbott     0.25
Abdulov    0.00
Abel       0.00
Name: Exited, dtype: float64

Surname
Zubarev     0.0
Zubareva    0.0
Zuev        0.0
Zuyev       1.0
Zuyeva      0.0
Name: Exited, dtype: float64

In [128]:
global_mean = y_train.mean()
global_mean

0.20517676767676768

In [129]:
# Creating new encoded features for surname - Target (mean) encoding
dc_train['Surname_mean_churn'] = dc_train.Surname.map(means).fillna(global_mean)

In [130]:
freqs = dc_train.groupby(['Surname']).size()
freqs.head()

Surname
Abazu      2
Abbie      1
Abbott     4
Abdulov    2
Abel       1
dtype: int64

In [131]:
dc_train['Surname_freq'] = dc_train.Surname.map(freqs).fillna(0)

In [132]:
# Leave-one-out (LOO) target encoding: remove each row's own label from its surname mean
dc_train['Surname_enc'] = ((dc_train.Surname_freq * dc_train.Surname_mean_churn) - dc_train.Exited)/(dc_train.Surname_freq - 1)
# Fill NaNs that occur when a surname appears only once (0/0) with the leave-one-out global mean
dc_train['Surname_enc'] = dc_train['Surname_enc'].fillna(((dc_train.shape[0] * global_mean) - dc_train.Exited) / (dc_train.shape[0] - 1))
dc_train.head(5)

Unnamed: 0,Surname,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain,Surname_mean_churn,Surname_freq,Surname_enc
0,Mai,591,1,31,8,0.0,1,1,1,141677.33,0,0.0,0.0,1.0,0.0625,16,0.066667
1,Manna,649,1,45,7,0.0,2,0,1,75204.21,0,1.0,0.0,0.0,0.461538,13,0.5
2,Ugoji,622,1,35,8,0.0,2,1,1,131772.51,0,1.0,0.0,0.0,0.333333,3,0.5
3,Bluett,766,0,38,7,130933.74,1,0,1,2035.94,0,0.0,1.0,0.0,0.0,1,0.205203
4,Wallace,663,0,39,8,0.0,2,1,1,101168.9,0,1.0,0.0,0.0,0.111111,18,0.117647


In [133]:
# Replace surnames by their training-set mean churn; surnames unseen in training get the global mean
dc_val['Surname_enc'] = dc_val.Surname.map(means).fillna(global_mean)
dc_test['Surname_enc'] = dc_test.Surname.map(means).fillna(global_mean)
# Show that LOO target encoding removes most of the correlation with the target seen for the plain mean encoding
dc_train[['Surname_mean_churn', 'Surname_enc', 'Exited']].corr()

Unnamed: 0,Surname_mean_churn,Surname_enc,Exited
Surname_mean_churn,1.0,0.55044,0.559735
Surname_enc,0.55044,1.0,-0.029586
Exited,0.559735,-0.029586,1.0
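
The leave-one-out encoding used above replaces each training row's surname by the mean churn of the *other* rows with the same surname: for a surname with frequency `n` and mean `m`, a row with label `y` gets `(n * m - y) / (n - 1)`. For the `Mai` row shown earlier this gives `(16 * 0.0625 - 0) / 15 ≈ 0.066667`, which matches the table. The sketch below reproduces the same logic on a made-up six-row frame (the surnames and labels are illustrative, and `Surname_loo` is a hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Smith", "Hill", "Hill", "Onio"],
    "Exited":  [1, 0, 0, 1, 1, 0],
})

grp = df.groupby("Surname")["Exited"]
n = grp.transform("count")   # rows per surname
total = grp.transform("sum")  # churned rows per surname

# Leave-one-out mean: remove the row's own label from its surname total.
# where() turns the zero denominator of single-row surnames into NaN.
df["Surname_loo"] = (total - df["Exited"]) / (n - 1).where(n > 1)

# Single-row surnames fall back to the leave-one-out global mean.
N = len(df)
global_loo = (df["Exited"].sum() - df["Exited"]) / (N - 1)
df["Surname_loo"] = df["Surname_loo"].fillna(global_loo)

print(df)
```

Because a row's own label never enters its encoded value, the encoding cannot simply memorize the target, which is consistent with the near-zero correlation between `Surname_enc` and `Exited` in the table above.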


In [134]:
dc_train.drop(['Surname_mean_churn'], axis=1, inplace=True)
dc_train.drop(['Surname_freq'], axis=1, inplace=True)
dc_train.drop(['Surname'], axis=1, inplace=True)
dc_val.drop(['Surname'], axis=1, inplace=True)
dc_test.drop(['Surname'], axis=1, inplace=True)
dc_train.head()
dc_val.head()
dc_test.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain,Surname_enc
0,591,1,31,8,0.0,1,1,1,141677.33,0,0.0,0.0,1.0,0.066667
1,649,1,45,7,0.0,2,0,1,75204.21,0,1.0,0.0,0.0,0.5
2,622,1,35,8,0.0,2,1,1,131772.51,0,1.0,0.0,0.0,0.5
3,766,0,38,7,130933.74,1,0,1,2035.94,0,0.0,1.0,0.0,0.205203
4,663,0,39,8,0.0,2,1,1,101168.9,0,1.0,0.0,0.0,0.117647


Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain,Surname_enc
0,700,1,37,1,135179.49,1,1,0,160670.37,0,1.0,0.0,0.0,1.0
1,800,0,40,3,75893.11,2,1,0,132562.23,0,1.0,0.0,0.0,0.205177
2,602,1,37,3,107592.89,2,0,1,153122.73,0,0.0,0.0,1.0,0.205177
3,682,0,30,9,0.0,2,1,1,195104.91,0,1.0,0.0,0.0,0.071429
4,551,1,42,1,50194.59,1,1,1,23399.58,0,1.0,0.0,0.0,0.0


Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain,Surname_enc
0,777,0,34,5,96693.66,1,1,1,172618.52,0,0.0,1.0,0.0,0.0
1,534,1,24,1,0.0,1,1,1,169653.32,0,1.0,0.0,0.0,0.0
2,650,0,46,9,149003.76,2,1,0,176902.83,0,0.0,1.0,0.0,0.0
3,850,0,40,0,0.0,2,1,0,1099.95,0,1.0,0.0,0.0,0.2
4,743,1,45,7,157332.26,1,1,0,125424.42,0,0.0,0.0,1.0,0.205177


In [135]:
corr = dc_train.corr()
sns.heatmap(corr, cmap = 'coolwarm')

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [136]:
sns.boxplot(x="Exited", y="Age", data=dc_train, palette="Set3")

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [137]:
sns.violinplot(x="Exited", y="Balance", data=dc_train, palette="Set3")

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [146]:
cat_vars_bv = ['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain']

for col in cat_vars_bv:
    # Churn rate for each level of the feature
    print(dc_train.groupby(col).Exited.mean())
    print()

Gender
0    0.247508
1    0.169684
Name: Exited, dtype: float64




HasCrCard
0    0.214653
1    0.201217
Name: Exited, dtype: float64




IsActiveMember
0    0.267806
1    0.145602
Name: Exited, dtype: float64




country_France
0.0    0.249618
1.0    0.161492
Name: Exited, dtype: float64




country_Germany
0.0    0.164485
1.0    0.328746
Name: Exited, dtype: float64




country_Spain
0.0    0.216588
1.0    0.170570
Name: Exited, dtype: float64




## 4.4 - Encoding Continuous Features

In [139]:
# Mean churn rate per number of products, on the training data
col = 'NumOfProducts'
dc_train.groupby([col]).Exited.mean()
# Number of customers per "NumOfProducts" value, on the training data
dc_train[col].value_counts()

NumOfProducts
1    0.280946
2    0.075720
3    0.812500
4    1.000000
Name: Exited, dtype: float64

1    4015
2    3645
3     208
4      52
Name: NumOfProducts, dtype: int64

In [140]:
eps = 1e-6

dc_train['bal_per_product'] = dc_train.Balance/(dc_train.NumOfProducts + eps)
dc_train['bal_by_est_salary'] = dc_train.Balance/(dc_train.EstimatedSalary + eps)
dc_train['tenure_age_ratio'] = dc_train.Tenure/(dc_train.Age + eps)
dc_train['age_surname_mean_churn'] = np.sqrt(dc_train.Age) * dc_train.Surname_enc

In [141]:
new_cols = ['bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
# Ensuring that the new columns don't have any missing values
dc_train[new_cols].isnull().sum()

bal_per_product           0
bal_by_est_salary         0
tenure_age_ratio          0
age_surname_mean_churn    0
dtype: int64

In [145]:
# Linear correlation of the new columns with the target, as a rough gauge of importance
sns.heatmap(dc_train[new_cols + ['Exited']].corr(), annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f91de7920d0>

In [143]:
dc_val['bal_per_product'] = dc_val.Balance/(dc_val.NumOfProducts + eps)
dc_val['bal_by_est_salary'] = dc_val.Balance/(dc_val.EstimatedSalary + eps)
dc_val['tenure_age_ratio'] = dc_val.Tenure/(dc_val.Age + eps)
dc_val['age_surname_mean_churn'] = np.sqrt(dc_val.Age) * dc_val.Surname_enc
dc_test['bal_per_product'] = dc_test.Balance/(dc_test.NumOfProducts + eps)
dc_test['bal_by_est_salary'] = dc_test.Balance/(dc_test.EstimatedSalary + eps)
dc_test['tenure_age_ratio'] = dc_test.Tenure/(dc_test.Age + eps)
dc_test['age_surname_mean_churn'] = np.sqrt(dc_test.Age) * dc_test.Surname_enc
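
The same four derived columns are computed separately for train, validation and test above. One way to keep the three splits in sync is a small helper, sketched here on a toy frame. The name `add_ratio_features` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EPS = 1e-6  # same guard against division by zero as in the cells above

def add_ratio_features(df, eps=EPS):
    """Return a copy of df with the four derived columns (hypothetical helper)."""
    out = df.copy()
    out["bal_per_product"] = out.Balance / (out.NumOfProducts + eps)
    out["bal_by_est_salary"] = out.Balance / (out.EstimatedSalary + eps)
    out["tenure_age_ratio"] = out.Tenure / (out.Age + eps)
    out["age_surname_mean_churn"] = np.sqrt(out.Age) * out.Surname_enc
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "Balance": [0.0, 130933.74],
    "NumOfProducts": [2, 1],
    "EstimatedSalary": [131772.51, 2035.94],
    "Age": [35, 38],
    "Tenure": [8, 7],
    "Surname_enc": [0.5, 0.205203],
})
toy_fe = add_ratio_features(toy)
print(toy_fe.T)
```

With such a helper, each split would be transformed by one call, e.g. `dc_val = add_ratio_features(dc_val)`, so a change to a formula applies everywhere.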

## 4.5 - Scaling Features

In [147]:
# initialize the standard scaler
sc = StandardScaler()
cont_vars = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc',
             'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio', 'age_surname_mean_churn']
cat_vars = ['Gender', 'HasCrCard', 'IsActiveMember', 'country_France', 'country_Germany', 'country_Spain']
# Scaling only continuous columns
cols_to_scale = cont_vars
sc_X_train = sc.fit_transform(dc_train[cols_to_scale])
# Converting from array to dataframe and naming the respective features/columns
sc_X_train = pd.DataFrame(data=sc_X_train, columns=cols_to_scale)
sc_X_train.shape
sc_X_train.head()

(7920, 11)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,EstimatedSalary,Surname_enc,bal_per_product,bal_by_est_salary,tenure_age_ratio,age_surname_mean_churn
0,-0.615123,-0.758164,1.031408,-1.221524,-0.911982,0.732843,-0.736899,-1.101999,-0.034388,1.345712,-0.761317
1,-0.016051,0.57839,0.685099,-1.221524,0.800823,-0.428077,1.520201,-1.101999,-0.034388,0.195969,1.724994
2,-0.294929,-0.376291,1.031408,-1.221524,0.800823,0.55986,1.520201,-1.101999,-0.034388,1.014917,1.39487
3,1.19242,-0.089887,0.685099,0.875705,-0.911982,-1.705924,-0.015308,1.201192,0.495269,0.517364,-0.016345
4,0.128552,0.005581,1.031408,-1.221524,0.800823,0.025383,-0.471358,-1.101999,-0.034388,0.751977,-0.458316


In [148]:
# Scaling validation and test sets with the mean and standard deviation fitted on the training set
sc_X_val = sc.transform(dc_val[cols_to_scale])
sc_X_test = sc.transform(dc_test[cols_to_scale])
# Converting val and test arrays to dataframes for re-usability
sc_X_val = pd.DataFrame(data=sc_X_val, columns=cols_to_scale)
sc_X_test = pd.DataFrame(data=sc_X_test, columns=cols_to_scale)
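
To make the "fit on train, transform everywhere" step concrete, here is a rough NumPy sketch of what standard scaling does. The arrays are made-up values; the point is only that the validation rows are scaled with the training mean and standard deviation, never their own.

```python
import numpy as np

# Made-up values for illustration: columns could be CreditScore and Age
train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
val = np.array([[650.0, 45.0]])

# Statistics come from the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

train_scaled = (train - mu) / sigma
val_scaled = (val - mu) / sigma  # reuse training statistics to avoid leakage

print(train_scaled)
print(val_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while validation and test columns generally do not, which is expected.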

In [151]:
# Creating feature-set and target for RFE model
y = dc_train['Exited'].values
X = dc_train[cat_vars + cont_vars]
X.columns = cat_vars + cont_vars
X.columns

Index(['Gender', 'HasCrCard', 'IsActiveMember', 'country_France',
       'country_Germany', 'country_Spain', 'CreditScore', 'Age', 'Tenure',
       'Balance', 'NumOfProducts', 'EstimatedSalary', 'Surname_enc',
       'bal_per_product', 'bal_by_est_salary', 'tenure_age_ratio',
       'age_surname_mean_churn'],
      dtype='object')

In [152]:
rfe_dt = RFE(estimator=DecisionTreeClassifier(max_depth = 4, criterion = 'entropy'), n_features_to_select=10) 
rfe_dt = rfe_dt.fit(X.values, y)  

In [157]:
mask = rfe_dt.support_.tolist()
selected_feats_dt = [b for a,b in zip(mask, X.columns) if a]
selected_feats_dt

['IsActiveMember',
 'country_Germany',
 'Age',
 'NumOfProducts',
 'EstimatedSalary',
 'Surname_enc',
 'bal_per_product',
 'bal_by_est_salary',
 'tenure_age_ratio',
 'age_surname_mean_churn']

In [170]:
selected_cat_vars = [x for x in selected_feats_dt if x in cat_vars]
selected_cont_vars = [x for x in selected_feats_dt if x in cont_vars]
# Using categorical features and scaled numerical features
X_train = np.concatenate((dc_train[selected_cat_vars].values, sc_X_train[selected_cont_vars].values), axis=1)
X_val = np.concatenate((dc_val[selected_cat_vars].values, sc_X_val[selected_cont_vars].values), axis=1)
X_test = np.concatenate((dc_test[selected_cat_vars].values, sc_X_test[selected_cont_vars].values), axis=1)
# print the shapes
X_train.shape, X_val.shape, X_test.shape

((7920, 10), (1080, 10), (1000, 10))

In [171]:
# Obtaining class weights based on the class samples imbalance ratio
_, num_samples = np.unique(y_train, return_counts=True)
weights = np.max(num_samples)/num_samples
# Define weight dictionary
weights_dict = dict()
class_labels = [0,1]
# Weights associated with classes
for a,b in zip(class_labels,weights):
    weights_dict[a] = b

weights_dict

{0: 1.0, 1: 3.873846153846154}
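
The weight for class 1 is simply the majority count divided by the minority count. The sketch below reproduces the number above with class counts of 6295 and 1625; these counts are inferred from the 7920 training rows and the printed ratio, so treat them as an assumption. It also builds the dictionary from the labels returned by `np.unique` rather than a hard-coded `[0, 1]`.

```python
import numpy as np

# Assumed class counts, inferred from 7920 training rows and the ratio printed above
y_toy = np.array([0] * 6295 + [1] * 1625)

labels, counts = np.unique(y_toy, return_counts=True)
# Majority class gets weight 1.0, minority class gets majority/minority
weights = counts.max() / counts
weights_dict = {int(k): float(w) for k, w in zip(labels, weights)}
print(weights_dict)
```

A weight of about 3.87 means each churner counts roughly as much as four non-churners when the tree evaluates splits.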

In [160]:
# Re-defining X_train and X_val to consider original unscaled continuous features. y_train and y_val remain unaffected
X_train = dc_train[selected_feats_dt].values
X_val = dc_val[selected_feats_dt].values
# Decision tree classifier model
clf = DecisionTreeClassifier(criterion='entropy', class_weight=weights_dict, max_depth=4, max_features=None,
                             min_samples_split=25, min_samples_leaf=15)
# Fit the model
clf.fit(X_train, y_train)
# Checking the importance of different features of the model
pd.DataFrame({'features': selected_feats_dt,
              'importance': clf.feature_importances_
             }).sort_values(by='importance', ascending=False)

DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.873846153846154},
                       criterion='entropy', max_depth=4, min_samples_leaf=15,
                       min_samples_split=25)

Unnamed: 0,features,importance
2,Age,0.44311
3,NumOfProducts,0.376083
0,IsActiveMember,0.095253
7,bal_by_est_salary,0.048546
5,Surname_enc,0.031103
4,EstimatedSalary,0.005906
1,country_Germany,0.0
6,bal_per_product,0.0
8,tenure_age_ratio,0.0
9,age_surname_mean_churn,0.0


In [161]:
# Validation metrics
print(f'Confusion Matrix: {confusion_matrix(y_val, clf.predict(X_val))}')
print(f'Area Under Curve: {roc_auc_score(y_val, clf.predict(X_val))}')
print(f'Recall score: {recall_score(y_val,clf.predict(X_val))}')
print(f'Classification report: \n{classification_report(y_val,clf.predict(X_val))}')

Confusion Matrix: [[568 304]
 [ 55 153]]
Area Under Curve: 0.6934765349329569
Recall score: 0.7355769230769231
Classification report: 
              precision    recall  f1-score   support

           0       0.91      0.65      0.76       872
           1       0.33      0.74      0.46       208

    accuracy                           0.67      1080
   macro avg       0.62      0.69      0.61      1080
weighted avg       0.80      0.67      0.70      1080
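
A note on the "Area Under Curve" figure: `roc_auc_score` is given hard 0/1 predictions here, and in that case the score reduces to the average of recall and specificity (balanced accuracy). The short check below recomputes it from the confusion matrix printed above.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 568, 304
fn, tp = 55, 153

tnr = tn / (tn + fp)  # specificity: share of non-churners predicted correctly
tpr = tp / (tp + fn)  # recall: share of churners predicted correctly
balanced_accuracy = (tpr + tnr) / 2
print(balanced_accuracy)
```

To measure how well the model ranks customers regardless of threshold, one would pass probabilities instead, e.g. `roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])`.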



In [166]:
# Decision tree classifier with a shallower tree (max_depth=3)
clf = DecisionTreeClassifier(criterion='entropy', class_weight=weights_dict, 
                            max_depth=3, max_features=None,
                            min_samples_split=25, min_samples_leaf=15)
# Fit the model
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight={0: 1.0, 1: 3.873846153846154},
                       criterion='entropy', max_depth=3, min_samples_leaf=15,
                       min_samples_split=25)

In [167]:
# Validation metrics
print(f'Confusion Matrix: {confusion_matrix(y_val, clf.predict(X_val))}')
print(f'Area Under Curve: {roc_auc_score(y_val, clf.predict(X_val))}')
print(f'Recall score: {recall_score(y_val,clf.predict(X_val))}')
print(f'Classification report: \n{classification_report(y_val,clf.predict(X_val))}')

Confusion Matrix: [[693 179]
 [ 70 138]]
Area Under Curve: 0.7290931545518702
Recall score: 0.6634615384615384
Classification report: 
              precision    recall  f1-score   support

           0       0.91      0.79      0.85       872
           1       0.44      0.66      0.53       208

    accuracy                           0.77      1080
   macro avg       0.67      0.73      0.69      1080
weighted avg       0.82      0.77      0.79      1080



In [164]:
## Preparing data and a few common model parameters
# Unscaled features will be used since it's a tree model

X_train = dc_train.drop(columns=['Exited'])
X_val = dc_val.drop(columns=['Exited'])

In [169]:
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
# CategoricalEncoder and AddFeatures are expected to come from the local utils module
from utils import *

# Two tuned configurations: one favouring F1, one favouring recall
best_f1_lgb = LGBMClassifier(boosting_type='dart', class_weight={0: 1, 1: 3.0}, min_child_samples=20,
                             n_jobs=-1, importance_type='gain', max_depth=6, num_leaves=63,
                             colsample_bytree=0.6, learning_rate=0.1, n_estimators=201,
                             reg_alpha=1, reg_lambda=1)
best_recall_lgb = LGBMClassifier(boosting_type='dart', num_leaves=31, max_depth=6, learning_rate=0.1,
                                 n_estimators=21, class_weight={0: 1, 1: 3.93}, min_child_samples=2,
                                 colsample_bytree=0.6, reg_alpha=0.3, reg_lambda=1.0,
                                 n_jobs=-1, importance_type='gain')
model = Pipeline(steps = [('categorical_encoding', CategoricalEncoder()),
                          ('add_new_features', AddFeatures()),
                          ('classifier', best_f1_lgb)
                         ])
# Fitting final model on train dataset
model.fit(X_train, y_train)
# Predict target probabilities
val_probs = model.predict_proba(X_val)[:,1]
# Predict target values on val data
val_preds = np.where(val_probs > 0.45, 1, 0) # The probability threshold can be tweaked
# Validation metrics
print(f'Confusion Matrix: {confusion_matrix(y_val,val_preds)}')
print(f'Area Under Curve: {roc_auc_score(y_val,val_preds)}')
print(f'Recall score: {recall_score(y_val,val_preds)}')
print(f'Classification report: \n{classification_report(y_val,val_preds)}')

NameError: ignored
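
The 0.45 cutoff used above is a tunable choice. One way to pick it is to sweep a few thresholds over validation probabilities and compare F1, as sketched below. The labels and probabilities are synthetic, and `f1_at_threshold` is a hypothetical helper, not part of the notebook or of `utils`.

```python
import numpy as np

def f1_at_threshold(y_true, probs, threshold):
    """F1 score of the positive class when predicting 1 for probs > threshold."""
    preds = (probs > threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Synthetic labels and predicted probabilities, for illustration only
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_toy = np.array([0.05, 0.10, 0.30, 0.40, 0.42, 0.55, 0.60, 0.90])

thresholds = [0.2, 0.35, 0.41, 0.45, 0.58, 0.7]
scores = [f1_at_threshold(y_toy, p_toy, t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]

for t, s in zip(thresholds, scores):
    print(f"threshold={t:.2f}  F1={s:.3f}")
print("best threshold:", best)
```

The threshold should be chosen on the validation set only, so that the test metrics remain an unbiased check.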

In [None]:
# Save model object
joblib.dump(model, 'final_churn_model_f1_0_45.sav')

['final_churn_model_f1_0_45.sav']

# 5 - Test

In [None]:
# Load model object
model = joblib.load('final_churn_model_f1_0_45.sav')
X_test = dc_test.drop(columns=['Exited'])
# Predict target probabilities
test_probs = model.predict_proba(X_test)[:,1]
# Predict target values on test data
test_preds = np.where(test_probs > 0.45, 1, 0) # Flexibility to tweak the probability threshold
#test_preds = model.predict(X_test)
# Test set metrics
roc_auc_score(y_test, test_preds)
recall_score(y_test, test_preds)
confusion_matrix(y_test, test_preds)
print(classification_report(y_test, test_preds))

0.7678570272911421

0.675392670157068

array([[696, 113],
       [ 62, 129]], dtype=int64)

              precision    recall  f1-score   support

           0       0.92      0.86      0.89       809
           1       0.53      0.68      0.60       191

    accuracy                           0.82      1000
   macro avg       0.73      0.77      0.74      1000
weighted avg       0.84      0.82      0.83      1000



In [None]:
# Adding predictions and their probabilities in the original test dataframe
test = dc_test.copy()
test['predictions'] = test_preds
test['pred_probabilities'] = test_probs
test.sample(5)

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,country_France,country_Germany,country_Spain,Surname_enc,bal_per_product,bal_by_est_salary,tenure_age_ratio,age_surname_mean_churn,predictions,pred_probabilities
674,617,1,34,9,0.0,2,1,0,118749.58,0,0.0,0.0,1.0,0.0,0.0,0.0,0.264706,0.0,0,0.046741
417,850,1,38,5,0.0,2,1,0,16491.64,0,1.0,0.0,0.0,0.375,0.0,0.0,0.131579,2.311655,0,0.083092
974,709,1,62,3,0.0,2,1,1,82195.15,0,0.0,0.0,1.0,0.25,0.0,0.0,0.048387,1.968502,0,0.05538
154,680,1,34,6,146422.22,1,1,0,67142.97,1,0.0,1.0,0.0,0.0,146422.073578,2.180753,0.176471,0.0,1,0.487481
335,833,1,29,1,96462.25,2,0,1,48986.18,0,0.0,1.0,0.0,0.20303,48231.100884,1.969173,0.034483,1.093352,0,0.022409


In [None]:
# Customers with a predicted churn probability above 0.7, highest first
high_churn_list = (test[test.pred_probabilities > 0.7]
                   .sort_values(by='pred_probabilities', ascending=False)
                   .reset_index(drop=True)
                   .drop(columns=['Exited', 'predictions']))
high_churn_list.shape
high_churn_list.head()

(103, 18)

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,country_France,country_Germany,country_Spain,Surname_enc,bal_per_product,bal_by_est_salary,tenure_age_ratio,age_surname_mean_churn,pred_probabilities
0,546,0,58,3,106458.31,4,1,0,128881.87,0.0,1.0,0.0,0.0,26614.570846,0.826015,0.051724,0.0,0.992935
1,479,1,51,1,107714.74,3,1,0,86128.21,0.0,1.0,0.0,0.333333,35904.901365,1.250633,0.019608,2.380476,0.979605
2,745,1,45,10,117231.63,3,1,1,122381.02,0.0,1.0,0.0,0.25,39077.196974,0.957923,0.222222,1.677051,0.976361
3,515,1,45,7,120961.5,3,1,1,39288.11,0.0,1.0,0.0,0.2,40320.48656,3.078832,0.155556,1.341641,0.970001
4,481,0,57,9,0.0,3,1,1,169719.35,1.0,0.0,0.0,0.222222,0.0,0.0,0.157895,1.677741,0.965838


In [None]:
high_churn_list.to_csv('high_churn_list.csv', index=False)