## Descriptive Features

Variable explanations:
<ul> 
    <li> RowNumber - Numeric Discrete. Unit N/A. Corresponds to the record (row) number and has no effect on the output.
    <li> CustomerId - Categorical Nominal. Unit N/A. Contains random values and has no effect on a customer leaving the bank.
    <li> Surname - Categorical Nominal. Unit N/A. The surname of a customer has no impact on their decision to leave the bank.
    <li> CreditScore - Numeric Continuous. Unit N/A. Can have an effect on customer churn, since a customer with a higher credit score is expected to be less likely to leave the bank.
    <li> Geography - Categorical Nominal. Unit N/A. A customer's location can affect their decision to leave the bank.
    <li> Gender - Categorical Nominal. Unit N/A. It is interesting to explore whether gender plays a role in a customer leaving the bank.
    <li> Age - Numeric Discrete. Unit Years. Likely relevant, since older customers are expected to be less likely to leave their bank than younger ones.
    <li> Tenure - Numeric Continuous. Unit Years. Number of years that the customer has been a client of the bank.
    <li> Balance - Numeric Continuous. Unit Dollars. Total amount of dollars in the customer's bank accounts.
    <li> NumOfProducts - Numeric Discrete. Unit N/A. Number of products that the customer has purchased through the bank.
    <li> HasCrCard - Binary. Unit N/A. Denotes whether the customer has a credit card; 1 is positive, 0 is negative.
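    <li> IsActiveMember - Binary. Unit N/A. Denotes whether the customer is an active member; 1 is positive, 0 is negative.
    <li> EstimatedSalary - Numeric Continuous. The customer's estimated salary.
    <li> Exited - Binary. Unit N/A. Target feature indicating whether the customer left the bank; 1 is positive, 0 is negative.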
</ul>

## Data Preparation

Importing all the modules

In [2]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

Read in the CSV file. Please keep the CSV in the same folder as the notebook.

In [8]:
data = pd.read_csv('churn.csv')

data.head()

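| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |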


## Data Cleaning and Transformation
We first confirm that the feature types match the descriptions outlined in the documentation.

In [9]:
print(f"Shape of the dataset is {data.shape} \n")
print("Data types are below where 'object' indicates a string type: ")
print(data.dtypes)

Shape of the dataset is (10000, 14) 

Data types are below where 'object' indicates a string type: 
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
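
The comparison against the documented feature types can also be done programmatically rather than by eye. The sketch below is a minimal illustration on a small hand-made frame built from the first rows shown earlier; the `expected_kinds` mapping, the `CHECKERS` table and the `find_dtype_mismatches` helper are assumptions introduced here and are not part of the original notebook.

```python
import pandas as pd
from pandas.api import types as pdt

# Map a coarse "kind" label to the pandas predicate that tests for it.
CHECKERS = {
    "integer": pdt.is_integer_dtype,
    "float": pdt.is_float_dtype,
    "string": pdt.is_string_dtype,
}

# Hypothetical expectations for a few columns (an assumption for illustration).
expected_kinds = {
    "CreditScore": "integer",
    "Geography": "string",
    "Balance": "float",
    "Exited": "integer",
}

def find_dtype_mismatches(df, expected):
    """Return the columns that are missing or whose dtype is not of the expected kind."""
    mismatches = []
    for col, kind in expected.items():
        if col not in df.columns or not CHECKERS[kind](df[col]):
            mismatches.append(col)
    return mismatches

# Tiny stand-in for the real dataset.
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Balance": [0.0, 83807.86, 159660.8],
    "Exited": [1, 0, 1],
})

print(find_dtype_mismatches(sample, expected_kinds))
```

An empty list means every listed column has the expected kind. On the real data, `sample` would be replaced by `data` and the mapping extended to all columns.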


## Checking for Missing Values

In [5]:
print("\nNumber of missing values for each feature:")
print(data.isnull().sum())


Number of missing values for each feature:
RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64


On the surface, no feature contains missing values. Missing values could still be coded with a placeholder such as a question mark, so we check the unique values of the categorical features later.
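
`isnull()` only counts values that pandas already treats as missing, so a placeholder such as the question mark mentioned above would go unnoticed. The following is a minimal sketch of converting such a placeholder to `NaN` before counting. The small frame and its `?` entries are invented for illustration, and the `PLACEHOLDER` constant is an assumption rather than something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Stand-in frame; the "?" entries are made up to demonstrate the technique.
sample = pd.DataFrame({
    "Geography": ["France", "?", "Germany", "Spain"],
    "Gender": ["Female", "Male", " ? ", "Male"],
    "Age": [42, 41, 39, 43],
})

# Assumed placeholder token; adjust if the data uses a different marker.
PLACEHOLDER = "?"

cleaned = sample.copy()
text_cols = [c for c in cleaned.columns if pd.api.types.is_string_dtype(cleaned[c])]
for col in text_cols:
    # Strip surrounding whitespace, then turn the placeholder into a real missing value.
    stripped = cleaned[col].str.strip()
    cleaned[col] = stripped.mask(stripped == PLACEHOLDER, np.nan)

print(sample.isnull().sum())   # placeholders are not counted as missing
print(cleaned.isnull().sum())  # placeholders are now counted
```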

## Summary Statistics

In [24]:
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of integer-valued features</b>'))
data.describe(include='int64')

| | RowNumber | CustomerId | CreditScore | Age | Tenure | NumOfProducts | HasCrCard | IsActiveMember | Exited |
|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 1.5302 | 0.7055 | 0.5151 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 0.581654 | 0.45584 | 0.499797 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 2.0 | 1.0 | 1.0 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 4.0 | 1.0 | 1.0 | 1.0 |
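
Note that `include='int64'` restricts the summary to integer-typed columns, which is why the float columns `Balance` and `EstimatedSalary` do not appear in Table 1. As a minimal sketch on a small stand-in frame (not the real data), passing `np.number` selects every numeric column instead.

```python
import numpy as np
import pandas as pd

# Stand-in frame with one integer, one float and one string column.
sample = pd.DataFrame({
    "CreditScore": pd.Series([619, 608, 502, 699, 850], dtype="int64"),
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    "Geography": ["France", "Spain", "France", "France", "Spain"],
})

# Integer columns only, as in the cell above.
int_summary = sample.describe(include="int64")

# All numeric columns, integer and float alike.
numeric_summary = sample.describe(include=[np.number])

print(list(int_summary.columns))
print(list(numeric_summary.columns))
```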


In [25]:
display(HTML('<b>Table 2: Summary of categorical features</b>'))
data.describe(include='object')

| | Surname | Geography | Gender |
|---|---|---|---|
| count | 10000 | 10000 | 10000 |
| unique | 2932 | 3 | 2 |
| top | Smith | France | Male |
| freq | 32 | 5014 | 5457 |


## Continuous Features
The row number and customer ID columns have no impact on prediction, so we drop them.

In [10]:
data = data.drop(columns=['RowNumber', 'CustomerId'])
data.head()

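| | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |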
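
One way to support the claim that these columns are pure identifiers is to check that each takes a distinct value on every row. The sketch below does this on a small stand-in frame; `is_row_identifier` is a hypothetical helper introduced for illustration, not something from the original notebook.

```python
import pandas as pd

def is_row_identifier(series):
    """True when every row holds a distinct, non-missing value."""
    return bool(series.notna().all() and series.is_unique)

# Stand-in frame using values from the first rows of the dataset.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5],
    "CustomerId": [15634602, 15647311, 15619304, 15701354, 15737888],
    "Age": [42, 41, 42, 39, 43],
})

# Columns that behave like identifiers in this sample.
id_cols = [c for c in sample.columns if is_row_identifier(sample[c])]
print(id_cols)

# Drop them, as done for RowNumber and CustomerId above.
reduced = sample.drop(columns=id_cols)
print(list(reduced.columns))
```

Uniqueness alone is not proof that a column is an identifier, since a continuous feature can also be unique on every row, so the result should be read alongside the feature descriptions.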


## Categorical Features
Surname has no predictive power, so we drop it.


In [11]:
data = data.drop(columns='Surname')
data.head()

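| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |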


Let's look at the cardinality of the categorical features.

In [12]:
categoricalColumns = data.columns[data.dtypes==object].tolist()

for col in categoricalColumns:
    print('Unique values for ' + col)
    print(data[col].unique())
    print('')

Unique values for Geography
['France' 'Spain' 'Germany']

Unique values for Gender
['Female' 'Male']



There are no extra spaces, and the category labels are consistent.
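
The visual check above can be automated for datasets with more categories. The following is a minimal sketch on an invented frame that deliberately contains a padded label and a case variant; `label_issues` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Invented labels: "Spain " has a trailing space and "france" differs only by case.
sample = pd.DataFrame({
    "Geography": ["France", "Spain ", "Germany", "france"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

def label_issues(series):
    """Report labels with surrounding whitespace and labels that differ only by case."""
    labels = pd.Series(series.dropna().unique())
    # Labels that change when surrounding whitespace is stripped.
    padded = labels[labels != labels.str.strip()].tolist()
    # Labels that collide once whitespace and case are normalised.
    normalised = labels.str.strip().str.lower()
    case_variants = labels[normalised.duplicated(keep=False)].tolist()
    return {"padded": padded, "case_variants": case_variants}

for col in sample.columns:
    print(col, label_issues(sample[col]))
```

Two empty lists for a column correspond to the "no extra spaces, consistent labels" conclusion reached for `Geography` and `Gender` in the real data.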

## Dependent Variable
The target feature is already binary, with 1 as positive and 0 as negative, so no changes are needed.
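
As a minimal sketch, the binary encoding can be confirmed and the share of positive cases computed in a couple of lines. The series below is a small invented stand-in for the `Exited` column; on the full dataset, the mean of `Exited` reported in Table 1 (0.2037) is this same proportion.

```python
import pandas as pd

# Invented stand-in for data['Exited'].
exited = pd.Series([1, 0, 1, 0, 0], name="Exited")

# The target is binary if it contains no values other than 0 and 1.
is_binary = set(exited.unique()) <= {0, 1}

# For a 0/1 target, the mean equals the proportion of positive cases.
churn_rate = exited.mean()

print(is_binary, churn_rate)
print(exited.value_counts())
```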