## Specifying the Data Analysis Question
* Beta Bank is losing customers little by little every month. The bankers have worked out that it is cheaper to retain existing customers than to attract new ones, so we need to predict whether a customer will leave the bank soon.

## Metric of success
Build a model that predicts with high accuracy whether a customer is about to leave.

## Understanding the context
Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones. We need to predict whether a customer will leave the bank soon. The available data covers clients’ past behavior and the termination of their contracts with the bank.

## Recording the Experimental Design
1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking into account the
imbalance. Briefly describe your findings.
3. Improve the quality of the model. Make sure you use at least two approaches to
fixing class imbalance. Use the training set to pick the best parameters. Train
different models on training and validation sets. Find the best one. Briefly
describe your findings.
4. Perform the final testing
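
Step 2 of the plan can be sketched as below. This is a minimal illustration, not the notebook's own code: the `target` series and its 80/20 split are made up, standing in for the `Exited` column. It shows how to measure class balance and why a majority-class baseline is a useful reference point when accuracy is the metric.

```python
import pandas as pd

# Hypothetical target column: 1 = customer left, 0 = customer stayed.
# The 80/20 split is invented for illustration, not taken from the dataset.
target = pd.Series([0] * 80 + [1] * 20, name="exited")

# Share of each class; a large gap means the classes are imbalanced.
class_share = target.value_counts(normalize=True)
print(class_share)

# Accuracy of a constant model that always predicts the majority class.
majority_class = class_share.idxmax()
baseline_accuracy = (target == majority_class).mean()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained model should be compared against this baseline: with imbalanced classes, a model can score high accuracy while finding few of the customers who actually leave.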

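## Data Relevance
* The source data describes each customer (credit score, geography, gender, age, tenure, balance, number of products, credit card, activity and estimated salary) together with the `Exited` flag, so it is relevant for predicting whether a customer will leave.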

# Reading the data

In [3]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#read the data
churn_data = pd.read_csv('https://bit.ly/2XZK7Bo')

#preview of the data
print('\nPreview top 5:\n',churn_data.head())
print('\nPreview bottom 5:\n',churn_data.tail())
print('\nPreview sample 5:\n',churn_data.sample(5))
print(f'\n size of dataset:\n', churn_data.shape)





Preview top 5:
    RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  \
0          1    15634602  Hargrave          619    France  Female   42   
1          2    15647311      Hill          608     Spain  Female   41   
2          3    15619304      Onio          502    France  Female   42   
3          4    15701354      Boni          699    France  Female   39   
4          5    15737888  Mitchell          850     Spain  Female   43   

   Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  \
0     2.0       0.00              1          1               1   
1     1.0   83807.86              1          0               1   
2     8.0  159660.80              3          1               0   
3     1.0       0.00              2          0               0   
4     2.0  125510.82              1          1               1   

   EstimatedSalary  Exited  
0        101348.88       1  
1        112542.58       0  
2        113931.57       1  
3         93826.63       

In [4]:
#check the data type of each variable, the missing values and the number of unique values
print(f'\n data type of dataset variables:\n', churn_data.dtypes) #all variables have the expected data type
print(f'null values in the set\n {churn_data.isnull().sum()}') #tenure has 909 missing values, about 9% of the rows
print(f'number of unique values for each variable in the set\n {churn_data.nunique()}')



 data type of dataset variables:
 RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
null values in the set
 RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64
number of unique values for each variable in the set
 RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age       

## Data preparation
* Standardise the column names.
* Deal with the missing values in `Tenure` by replacing them with the mean of the variable.

In [5]:
#convert column names to lower case and strip surrounding whitespace
churn_data.columns = churn_data.columns.str.lower().str.strip()
churn_data['tenure'].unique()



array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

In [7]:
#replace the missing tenure values (about 9% of the rows) with the mean tenure
churn_data['tenure'] = churn_data['tenure'].fillna(churn_data['tenure'].mean())

In [8]:
#drop columns that carry no predictive information: rownumber, customerid and surname
churn_data.drop(columns=['rownumber', 'customerid', 'surname'], inplace=True)

## Solution Implementation
### Create, train, validate and test the model

In [9]:
churn_data['geography'].unique()
churn_data.head()

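| | creditscore | geography | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |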


In [11]:
#use one-hot encoding to convert the nominal variables geography and gender;
#customers from a particular country may have an influence on the churn rate
churn_data = pd.get_dummies(churn_data, drop_first=True)
churn_data.head()

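| | creditscore | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited | geography_Germany | geography_Spain | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 0 |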


In [12]:
#create a train dataset
train_df = churn_data.copy()

#create a training and validation dataset from the dataset
train_df, valid_df = train_test_split(train_df, test_size=0.25, random_state=1234)
print(train_df.shape)
print(valid_df.shape)

(7500, 12)
(2500, 12)
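
The experimental design ends with a final test, which needs data that was used for neither training nor model selection. One way to get it is a three-way split, sketched below. The sketch is not part of the original notebook: `demo_df` is a synthetic stand-in for the prepared frame, and the 60/20/20 proportions and the use of `stratify` (which keeps the churn rate similar in every part) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data frame (all values are made up).
rng = np.random.default_rng(1234)
demo_df = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "balance": rng.uniform(0, 200_000, size=1000),
    "exited": rng.choice([0, 1], size=1000, p=[0.8, 0.2]),
})

# Hold out 20% as the final test set, then split the remainder 75/25.
# Overall this gives 60% train, 20% validation and 20% test.
demo_rest, demo_test = train_test_split(
    demo_df, test_size=0.20, random_state=1234, stratify=demo_df["exited"])
demo_train, demo_valid = train_test_split(
    demo_rest, test_size=0.25, random_state=1234, stratify=demo_rest["exited"])

print(demo_train.shape, demo_valid.shape, demo_test.shape)
print("churn rate per part:",
      demo_train["exited"].mean(),
      demo_valid["exited"].mean(),
      demo_test["exited"].mean())
```

The test part is then touched only once, after the model and its parameters have been chosen on the validation part.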


In [13]:
#create features and target for both the training and the validation set
features_train = train_df.drop(columns=['exited'])
target_train = train_df['exited']
features_valid = valid_df.drop(columns=['exited'])
target_valid = valid_df['exited']

#create models for decision tree, random forest and logistic regression
#decision tree: train one model per depth from 1 to 10 and compare the scores
for d in range(1, 11, 1):
  tree_model = DecisionTreeClassifier(random_state=1234, max_depth=d)
  tree_model.fit(features_train, target_train)  #train the model
  #accuracy here is measured on the training set, not on the validation set
  print(f'Decision tree has accuracy of: {tree_model.score(features_train, target_train)} for depth of: {d}')

#random forest with balanced class weights: try n_estimators from 1 to 19
for n in range(1,20,1):
  forest_model = RandomForestClassifier(class_weight='balanced' ,random_state=1234, n_estimators=n)
  forest_model.fit(features_train, target_train)
  print(f'Random forest has accuracy of: {forest_model.score(features_train, target_train)} for n={n}')

#declare a model for logistic regression
log_model = LogisticRegression(random_state=1234, solver='liblinear')
log_model.fit(features_train, target_train)
print(f'UNBalanced: logistic regression has accuracy of: {log_model.score(features_train, target_train)}')
log_model = LogisticRegression(class_weight='balanced', random_state=1234, solver='liblinear')
log_model.fit(features_train, target_train)
print(f'Balanced: logistic regression has accuracy of: {log_model.score(features_train, target_train)}')

#compare random forests without and with class weighting (training accuracy)
forest_model = RandomForestClassifier(random_state=1234, n_estimators=11)
forest_model.fit(features_train, target_train)
print(f'UNBalanced: Random forest has accuracy of: {forest_model.score(features_train, target_train)}')

forest_model = RandomForestClassifier(class_weight='balanced' ,random_state=1234, n_estimators=11)
forest_model.fit(features_train, target_train)
print(f'Balanced: Random forest has accuracy of: {forest_model.score(features_train, target_train)}')




Decision tree has accuracy of: 0.796 for depth of: 1
Decision tree has accuracy of: 0.828 for depth of: 2
Decision tree has accuracy of: 0.8418666666666667 for depth of: 3
Decision tree has accuracy of: 0.8522666666666666 for depth of: 4
Decision tree has accuracy of: 0.8584 for depth of: 5
Decision tree has accuracy of: 0.8650666666666667 for depth of: 6
Decision tree has accuracy of: 0.8733333333333333 for depth of: 7
Decision tree has accuracy of: 0.8817333333333334 for depth of: 8
Decision tree has accuracy of: 0.8921333333333333 for depth of: 9
Decision tree has accuracy of: 0.9021333333333333 for depth of: 10
Random forest has accuracy of: 0.9273333333333333 for n=1
Random forest has accuracy of: 0.9329333333333333 for n=2
Random forest has accuracy of: 0.9686666666666667 for n=3
Random forest has accuracy of: 0.9604 for n=4
Random forest has accuracy of: 0.9784 for n=5
Random forest has accuracy of: 0.972 for n=6
Random forest has accuracy of: 0.9861333333333333 for n=7
Random f
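
The scores above are computed on the same rows the models were trained on, so they grow with tree depth and forest size and say little about unseen customers. A possible way to compare candidates on held-out data is sketched below. It is an illustration under assumptions rather than the notebook's own procedure: the data comes from `make_classification` as a synthetic stand-in, and the candidate values for `n_estimators` and the choice of F1 as the selection criterion are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly 80/20 imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=2000, n_features=11, weights=[0.8, 0.2],
                           random_state=1234)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

results = []
for n in (5, 11, 50):  # illustrative candidate values for n_estimators
    model = RandomForestClassifier(n_estimators=n, class_weight='balanced',
                                   random_state=1234)
    model.fit(X_train, y_train)
    valid_pred = model.predict(X_valid)
    valid_proba = model.predict_proba(X_valid)[:, 1]  # probability of class 1
    results.append({
        'n_estimators': n,
        'train_accuracy': model.score(X_train, y_train),
        'valid_accuracy': accuracy_score(y_valid, valid_pred),
        'valid_f1': f1_score(y_valid, valid_pred),
        'valid_roc_auc': roc_auc_score(y_valid, valid_proba),
    })

for row in results:
    print(row)

# Pick the candidate with the best F1 on the validation set.
best = max(results, key=lambda row: row['valid_f1'])
print('Best n_estimators by validation F1:', best['n_estimators'])
```

Printing training and validation accuracy side by side makes overfitting visible, while F1 and ROC AUC reflect how well the minority class is recognised.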

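## Recommendation
* A random forest with 11 estimators gives very high prediction accuracy and is not computationally expensive. All scores above were measured on the training set, so this choice should be confirmed on the validation set.
* On these training-set scores, weighting the classes does not improve the model accuracy.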
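
The experimental design asks for at least two approaches to class imbalance. Class weighting is the one used above; upsampling the minority class is another common option. The sketch below is a hypothetical helper, not code from the original notebook: the function name `upsample`, its `repeat` parameter and the demo data are all assumptions.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    """Repeat the rows of the positive class `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_up = pd.concat([features_zeros] + [features_ones] * repeat)
    target_up = pd.concat([target_zeros] + [target_ones] * repeat)

    # Shuffle both objects with the same permutation so rows stay aligned.
    features_up, target_up = shuffle(features_up, target_up, random_state=1234)
    return features_up, target_up


# Made-up demo data: 8 customers who stayed and 2 who left.
demo_features = pd.DataFrame({'age': range(20, 30)})
demo_target = pd.Series([0] * 8 + [1] * 2, name='exited')

features_up, target_up = upsample(demo_features, demo_target, repeat=4)
print(target_up.value_counts())
```

Upsampling should be applied to the training set only, so that the validation and test sets keep the original class proportions.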