**The dataset covers bank customer churn and can be found on Kaggle:**  
https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

**Disclaimer: The dataset above is simulated**

## Load the Data

In [57]:
# Load the required libraries for data manipulation and data visualization 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [58]:
# Load the dataset from local directory into a Pandas dataframe called 'df'
df = pd.read_csv('Churn_Modelling.csv', index_col=None)

In [59]:
# View the shape of the data using .shape
df.shape

(10000, 14)

## Data Wrangling

With real-world data, cleaning takes a lot of effort and can be a very long process. This Kaggle dataset is very clean and has no missing values, but I still check it to make sure everything looks good and that the values match their column names appropriately. I also drop the 'RowNumber' column, as it is redundant here.

In [60]:
# Check to see if there are any null values in our dataset 
df.isnull().any()

RowNumber          False
CustomerId         False
Surname            False
CreditScore        False
Geography          False
Gender             False
Age                False
Tenure             False
Balance            False
NumOfProducts      False
HasCrCard          False
IsActiveMember     False
EstimatedSalary    False
Exited             False
dtype: bool
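A yes/no null check can be complemented by per-column counts, a look at the dtypes, and a duplicate check on the identifier. The sketch below runs on a small stand-in frame rather than the real file; the missing 'Balance' value is injected purely for illustration, and the names `sample`, `null_counts`, and `duplicate_ids` are hypothetical.

```python
import pandas as pd

# Stand-in frame loosely based on the first rows of the dataset.
# The None in 'Balance' is injected here only to demonstrate the check.
sample = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304],
    "Geography": ["France", "Spain", "France"],
    "Balance": [0.0, None, 159660.80],
    "Exited": [1, 0, 1],
})

# Number of missing values in each column (instead of a True/False flag)
null_counts = sample.isnull().sum()

# Number of rows whose CustomerId repeats an earlier row
duplicate_ids = sample["CustomerId"].duplicated().sum()

print(null_counts)
print(sample.dtypes)  # confirm each column has the type its name suggests
print("duplicate ids:", duplicate_ids)
```

On the real dataframe, the same calls would be made on `df` directly.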

In [61]:
# View the data 
df.head()

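| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |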


In [64]:
# Drop the RowNumber column, as it is redundant
df = df.drop(columns='RowNumber')

In [65]:
df.head()

| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## Feature Conversion

The 'Geography' and 'Gender' columns are converted to numerical values because some modelling steps cannot operate on categorical values. Here, I convert 'Geography' into 3 numerical values and 'Gender' into 2.

In [66]:
print(df['Gender'].value_counts())
print(df['Geography'].value_counts())

Male      5457
Female    4543
Name: Gender, dtype: int64
France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64


In [67]:
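# Map each category to an integer code and assign the result back to the column;
# newer pandas versions warn about replace(..., inplace=True) on a column selection
df['Geography'] = df['Geography'].map({'France': 0, 'Germany': 1, 'Spain': 2})
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df.head()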

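| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | Hargrave | 619 | 0 | 1 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | Hill | 608 | 2 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | Onio | 502 | 0 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | Boni | 699 | 0 | 1 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | Mitchell | 850 | 2 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |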
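The conversion can also be written with explicit dictionaries, which keeps the category-to-code mapping visible in one place. The sketch below uses a small stand-in frame; `sample`, `geography_codes`, `gender_codes`, `encoded`, and `one_hot` are hypothetical names. It also shows one-hot encoding with `pd.get_dummies` as an alternative, which is not what this notebook does, but may be worth considering because integer codes for 'Geography' imply an ordering between countries that some models could pick up on.

```python
import pandas as pd

# Stand-in frame with the two categorical columns
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
})

# Same integer codes as used in the notebook
geography_codes = {"France": 0, "Germany": 1, "Spain": 2}
gender_codes = {"Male": 0, "Female": 1}

# assign() returns a new frame with both columns replaced by their codes
encoded = sample.assign(
    Geography=sample["Geography"].map(geography_codes),
    Gender=sample["Gender"].map(gender_codes),
)
print(encoded)

# Alternative (an assumption, not the notebook's approach):
# one indicator column per country, so no order is implied
one_hot = pd.get_dummies(sample, columns=["Geography"], dtype=int)
print(one_hot)
```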


## Data Rearrangement

For visual purposes, I like to move the response variable, in this case 'Exited', to the left side of the table. I find it quicker to read this way, and it also makes splitting the dataset into train and test sets easier later on.

In [68]:
# Move the response variable 'Exited' to the first column
first_column = df['Exited']
df.drop('Exited', axis=1, inplace=True)
df.insert(0, 'Exited', first_column)
df.head()

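| | Exited | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | 0 | 1 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 |
| 1 | 0 | 15647311 | Hill | 608 | 2 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 1 | 15619304 | Onio | 502 | 0 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 |
| 3 | 0 | 15701354 | Boni | 699 | 0 | 1 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 |
| 4 | 0 | 15737888 | Mitchell | 850 | 2 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 |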
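With the response variable in the first column, separating features from the target comes down to positional slicing. The sketch below illustrates this on a small stand-in frame and uses `pop` as a compact way to move the column; `sample`, `X`, and `y` are hypothetical names.

```python
import pandas as pd

# Stand-in frame with the response variable in the last column
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Age": [42, 41, 42],
    "Exited": [1, 0, 1],
})

# pop() removes the column and returns it; insert() places it at position 0
sample.insert(0, "Exited", sample.pop("Exited"))

# Target is the first column, features are everything after it
y = sample.iloc[:, 0]
X = sample.iloc[:, 1:]

print(list(sample.columns))
print(list(X.columns))
```

The resulting `X` and `y` could then be handed to whatever train/test splitting routine is used later.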


## Outlier Detection

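Check for outliers by calling the .describe() method and scanning the min and max rows of the output for extreme values. There seem to be no outliers in this data. Note that the statistics for 'CustomerId' and for the integer-coded 'Geography' and 'Gender' columns are not meaningful as quantities.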

In [69]:
df.describe()

| | Exited | CustomerId | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.2037 | 15690940.0 | 650.5288 | 0.7463 | 0.4543 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 |
| std | 0.402769 | 71936.19 | 96.653299 | 0.827529 | 0.497932 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 |
| min | 0.0 | 15565700.0 | 350.0 | 0.0 | 0.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 |
| 25% | 0.0 | 15628530.0 | 584.0 | 0.0 | 0.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 |
| 50% | 0.0 | 15690740.0 | 652.0 | 0.0 | 0.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 |
| 75% | 0.0 | 15753230.0 | 718.0 | 1.0 | 1.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 |
| max | 1.0 | 15815690.0 | 850.0 | 2.0 | 1.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 |
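Scanning min and max by eye can be complemented with a simple numeric rule. The sketch below uses the common 1.5 × IQR fence, which is a swapped-in technique and not something this notebook applies; the helper `iqr_outlier_counts` and the values in `sample` are made up for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each given column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return counts

# Illustrative values only; not rows from the real dataset
sample = pd.DataFrame({
    "Age": [32, 37, 39, 41, 42, 43, 44, 92],
    "CreditScore": [502, 584, 608, 619, 652, 699, 718, 850],
})

counts = iqr_outlier_counts(sample, ["Age", "CreditScore"])
print(counts)
```

Applied to the summary above, 'Age' has quartiles of 32 and 44, so the fences sit at 14 and 62 and the maximum of 92 falls outside them. Such ages may still be perfectly plausible for bank customers, so whether to treat them as outliers remains a judgment call.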
