# Churn - Classification Analysis

## Overview

- [Description](#Description)
- [Data Description](#Data-Description)
- [Data Preparation](#Data-Preparation)
- [Binary Classification](#Binary-Classification)

## Description

Our objective is **churn prediction**: classifying whether a customer will leave the company (`Exited` = 1) or stay (`Exited` = 0).

## Data Description

Columns:
- **RowNumber** (int > 0). Row index; not used as a feature
- **CustomerId** (int > 0). Customer identifier; not used as a feature
- **Surname** (string). Not used as a feature
- **CreditScore** (int). Numerical feature
- **Geography** (string). Categorical feature
- **Gender** (string). Categorical feature
- **Age** (int > 0). Numerical feature
- **Tenure** (int >= 0). Numerical feature
- **Balance** (float). Numerical feature
- **NumOfProducts** (int > 0). Numerical feature
- **HasCrCard** (0/1). Binary feature
- **IsActiveMember** (0/1). Binary feature
- **EstimatedSalary** (float). Numerical feature
- **Exited** (0/1). Target
    - exited (1): the customer left the company
    - not exited (0): the customer remained at the company

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/Churn_Modelling.csv')

In [2]:
df.head()

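| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |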


In [3]:
df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

## Data Preparation

- Checking for missing data (see [Missing Data](../../00 Data Preparation/01_Missing Data.ipynb))
- Feature scaling, which is necessary for some classification algorithms (see [Feature Scaling](../../00 Data Preparation/03_Feature_Scaling.ipynb))
- One-hot encoding of categorical data (see [Categorical Data](../../00 Data Preparation/02_Categorical Data.ipynb))
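
The scaling and encoding steps listed above can also be expressed as a single preprocessing object. The following is only a sketch under stated assumptions: it uses scikit-learn's `ColumnTransformer`, which this notebook does not use, the toy rows are copied from the first rows of the dataset, and the grouping into `numeric_cols` and `categorical_cols` is an illustrative choice rather than the notebook's own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows copied from the first rows of the dataset (a subset of columns).
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850],
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "Gender": ["Female"] * 5,
    "Age": [42, 41, 42, 39, 43],
    "Tenure": [2, 1, 8, 1, 2],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
})

# Step 1: check for missing data before transforming anything.
has_missing = bool(sample.isnull().values.any())

# Assumed column grouping, for illustration only.
numeric_cols = ["CreditScore", "Age", "Tenure", "Balance"]
categorical_cols = ["Geography", "Gender"]

# Steps 2 and 3: scale the numeric columns, one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

prepared = preprocess.fit_transform(sample)
print(has_missing)     # False
print(prepared.shape)  # (5, 7): 4 scaled columns + 2 Geography + 1 Gender
```

Fitting the transformer on the training data only and reusing it on the test data keeps the scaling statistics from leaking across the split.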

In [4]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [5]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [19]:
# Values for categorical data
for column in df.columns:
    if (df[column].dtype) == 'object':
        print(column)
        print('----------------')
        print(df[column].value_counts())
        print('\n')

Surname
----------------
Smith         32
Scott         29
Martin        29
Walker        28
Brown         26
Shih          25
Yeh           25
Genovese      25
Wright        24
Maclean       24
Ma            23
Fanucci       23
White         23
Wilson        23
Lu            22
Johnson       22
Chu           22
Moore         22
Wang          22
McGregor      21
Mai           21
Thompson      21
Sun           21
Kao           20
Young         20
Kerr          20
Lo            20
Mitchell      20
Hughes        20
Trevisani     20
              ..
Barnet         1
Nicholas       1
Melendez       1
Yeates         1
Uren           1
Eddy           1
Bidencope      1
Lassetter      1
Wardell        1
Mahomed        1
Hearn          1
Kline          1
Cantamessa     1
Ross-Watt      1
Greathouse     1
Candler        1
Morres         1
Levi           1
Rene           1
Carandini      1
Plumb          1
Percy          1
Jamison        1
Mikkelsen      1
Gunson         1
Chidalu        1
Bazile

In [20]:
# isolating the target
y = df[['Exited']]
X = df.drop(labels=['Exited'], axis=1)

In [21]:
y.head()

| | Exited |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |


In [29]:
y['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64
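
The counts above show that the classes are imbalanced, which matters when judging a classifier by accuracy. As a small worked example (the variable names are illustrative), the churn rate and the accuracy of a trivial model that always predicts the majority class follow directly from these two counts:

```python
# Class counts taken from the value_counts() output above.
counts = {0: 7963, 1: 2037}
total = sum(counts.values())

# Share of customers who exited.
churn_rate = counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(f"churn rate: {churn_rate:.4f}")                             # churn rate: 0.2037
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")  # 0.7963
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.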

In [30]:
X.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |


In [31]:
# dropping columns that are not necessary: RowNumber, CustomerId, Surname
# (drop from X, not df, so the target 'Exited' does not leak back into the features)
columns_to_drop = ['RowNumber', 'CustomerId', 'Surname']
X = X.drop(labels=columns_to_drop, axis=1)
X.head()

| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |


In [10]:
from sklearn.preprocessing import StandardScaler
sd = StandardScaler(with_mean=True, with_std=True)

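# column names are case-sensitive and must match the DataFrame
features_to_scale = ['Age', 'Tenure', 'Balance']

Xscaled = sd.fit_transform(X[features_to_scale])
X[features_to_scale].describe()

| | Age | Tenure | Balance |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 38.9218 | 5.0128 | 76485.889288 |
| std | 10.487806 | 2.892174 | 62397.405202 |
| min | 18.0 | 0.0 | 0.0 |
| 25% | 32.0 | 3.0 | 0.0 |
| 50% | 37.0 | 5.0 | 97198.54 |
| 75% | 44.0 | 7.0 | 127644.24 |
| max | 92.0 | 10.0 | 250898.09 |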
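
The standardization that `StandardScaler` applies can be written out by hand as z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below uses a few toy ages taken from the first rows of the dataset. It also shows why `describe()`, which reports the sample standard deviation (`ddof=1`), gives a value slightly above 1 for a scaled column.

```python
import numpy as np

# Toy ages from the first rows of the dataset.
age = np.array([42, 41, 42, 39, 43], dtype=float)

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the estimator StandardScaler uses.
z = (age - age.mean()) / age.std(ddof=0)

n = len(age)
print(z.mean())        # approximately 0
print(z.std(ddof=0))   # approximately 1
print(z.std(ddof=1))   # sqrt(n / (n - 1)), slightly above 1
```

With 10000 rows the factor sqrt(n / (n - 1)) is about 1.00005, so the difference is negligible in practice.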


In [11]:
X = X.drop(labels=features_to_scale, axis=1)

for idx, feature in enumerate(features_to_scale):
    X[feature] = Xscaled[:, idx]

X[features_to_scale].describe()



In [12]:
X.head()



In [13]:
# one-hot encode the categorical columns of the churn data
X = pd.get_dummies(X, columns=['Geography', 'Gender'])
X.head()
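
For the churn data, the categorical columns listed in the Data Description are `Geography` and `Gender`. Below is a minimal sketch of one-hot encoding them with `pd.get_dummies`. The toy rows are made up for illustration and are not taken from the dataset, and `drop_first=True` is an optional variant that this notebook does not use: it removes one redundant indicator per column, which some linear models prefer.

```python
import pandas as pd

# Made-up toy rows for illustration only (not real records from the dataset).
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 35, 39],
})

# Full one-hot encoding: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)
print(encoded.columns.tolist())

# Optional variant: drop the first category of each column, since it is
# implied when all remaining indicators are 0.
reduced = pd.get_dummies(
    toy, columns=["Geography", "Gender"], drop_first=True, dtype=int
)
print(reduced.columns.tolist())
```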


## Binary Classification

Next we are going to perform a **binary classification** of customer churn. The target `Exited` is already binary, so no classes need to be merged. We classify the data as:
- y=0 if the customer stays
- y=1 if the customer exits

In [14]:
y['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

In [15]:
# flatten the single-column DataFrame into a Series
y = y['Exited']
y.value_counts()

0    7963
1    2037
Name: Exited, dtype: int64