# PREDICTING CUSTOMER CHURN WITHIN THE BANKING SPACE







# 1. INTRODUCTION

Customer churn is when someone chooses to stop using your products or services. In effect, it’s when a customer ceases to be a customer.
Customer churn is measured using the customer churn rate: the share of customers who stopped being customers during a set period of time, such as a year, a month or a financial quarter. In this notebook we predict customer churn in the banking sector.
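
As a minimal sketch of the churn rate described above, the calculation divides the customers lost during a period by the customers present at its start. The figures and variable names below are invented for illustration and do not come from the dataset.

```r
# Hypothetical figures, for illustration only
customers_at_start <- 2000   # customers at the beginning of the quarter
customers_lost     <- 150    # customers who left during the quarter

# Churn rate = customers lost / customers at the start of the period
churn_rate <- customers_lost / customers_at_start

cat(sprintf("Churn rate: %.1f%%\n", 100 * churn_rate))
```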


# 2. DATA REVIEW

First we import the churn dataset into our working environment with R's built-in read.csv() function.

In [None]:
#loading required libraries
#plyr is loaded before tidyverse/dplyr so that it does not mask dplyr functions
library(plyr)
library(tidyverse)
library(dplyr)
library(caret)
library(tm)
library(RColorBrewer)
library(wordcloud)
library(SnowballC)
library(gmodels)
library(forcats)
library(corrplot)
library(plotly)
library(caTools)
library(C50)


data<-read.csv("../input/banking-churn/Churn_Modelling.csv")
str(data)

Our data is made up of 10000 rows and 14 variables.

In [None]:
#checking presence of NA in all our columns
colSums(is.na(data))

We will not need columns 1, 2 and 3: they identify individual customers rather than describe them, so keeping them would distort our prediction.

In [None]:
data<-data[-c(1,2,3)]
#to check if they are removed
str(data)
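
Dropping columns by position works, but it breaks silently if the column order ever changes. A hedged alternative is to drop them by name. The sketch below uses a toy data frame, and the column names `RowNumber`, `CustomerId` and `Surname` are assumptions about what the first three columns are called.

```r
# Toy data frame; column names are assumed, values are invented
toy <- data.frame(
  RowNumber   = 1:3,
  CustomerId  = c(101, 102, 103),
  Surname     = c("A", "B", "C"),
  CreditScore = c(600, 720, 580),
  Exited      = c(0, 1, 0)
)

# Identifier columns to drop, referenced by name instead of position
id_cols <- c("RowNumber", "CustomerId", "Surname")
toy <- toy[, setdiff(names(toy), id_cols), drop = FALSE]

# Check which columns remain
names(toy)
```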

In [None]:
data$Exited<-mapvalues(data$Exited, from = c(0,1), to = c("no", "yes"))
data$IsActiveMember<-mapvalues(data$IsActiveMember, from = c(0,1), to = c("no", "yes"))
data$HasCrCard<-mapvalues(data$HasCrCard, from = c(0,1), to = c("no", "yes"))

#converting categorical variables to factors

data$Exited<-as.factor(data$Exited)
data$IsActiveMember<-as.factor(data$IsActiveMember)
data$HasCrCard<-as.factor(data$HasCrCard)

str(data)
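
The recoding and the factor conversion above can also be done in one step with base R's `factor()`, which avoids the dependency on `plyr`. This is a sketch on a toy vector, not a change to the workflow above; `flag` is a hypothetical stand-in for a 0/1 column such as `Exited`.

```r
# Toy 0/1 flag standing in for a column such as Exited
flag <- c(0, 1, 1, 0, 1)

# factor() recodes and converts at once; levels fixes the order of the labels
flag_f <- factor(flag, levels = c(0, 1), labels = c("no", "yes"))

levels(flag_f)
table(flag_f)
```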


In [None]:
#display the summary of descriptive statistics
summary(data)

# 3. EXPLORATORY DATA ANALYSIS

We want to visualize how the categorical variables relate to our target variable.


In [None]:
ggplot(data = data, aes(x = Exited, fill =Geography )) +
    geom_bar()

The proportion of churned customers is inversely related to the number of customers in each country, which suggests the bank may have a problem (perhaps too few customer service resources) in the areas where it has fewer clients.

In [None]:
ggplot(data = data, aes(x = Exited, fill =Gender )) +
    geom_bar()

From the plot, the proportion of women churning is higher than that of men.

In [None]:
ggplot(data = data, aes(x = HasCrCard, fill =Exited)) +
    geom_bar()

Most customers who exited had a credit card, which is expected given that most customers have one.

In [None]:
ggplot(data = data, aes(x = IsActiveMember, fill =Exited)) +
    geom_bar()

From the bar chart, active members outnumber inactive members.

A higher proportion of inactive members exited.
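
Stacked counts make proportions hard to read by eye, so a table of within-group shares can back up statements like the ones above. The following is a minimal sketch on invented data; the column names mirror the dataset but the values do not come from it.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  IsActiveMember = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  Exited         = c("no",  "no",  "no",  "yes", "no", "yes", "yes", "no")
)

# Counts per group (rows: IsActiveMember, columns: Exited)
counts <- table(toy$IsActiveMember, toy$Exited)

# Row-wise proportions: each row sums to 1
churn_share <- prop.table(counts, margin = 1)

round(churn_share, 2)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-group proportions instead of raw counts.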

In [None]:
#relationship between continuous variables and our target variable

ggplot(data, aes(x = Exited, y = Age,fill=Exited)) +
        geom_boxplot()

As per the boxplot above, customers who exited tend to be older than those who stayed, possibly because many banking products target the younger generation.

In [None]:
ggplot(data, aes(x = Exited, y = EstimatedSalary,fill=Exited)) +
        geom_boxplot()

From the boxplot, salary has little effect on customer churn.

In [None]:
ggplot(data, aes(x = Exited, y = Balance,fill=Exited)) +
        geom_boxplot()

Unfortunately, the bank is losing customers with higher bank balances.

In [None]:
ggplot(data, aes(x = Exited, y = NumOfProducts,fill=Exited)) +
        geom_boxplot()

The number of products appears to have no effect on customer churn.
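
Boxplot readings can be checked numerically by summarising each continuous variable within the two `Exited` groups. The sketch below does this for a toy `Age` column; the values are invented, and the same pattern would apply to `Balance`, `EstimatedSalary` or `NumOfProducts`.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  Exited = c("no", "no", "no", "yes", "yes", "yes"),
  Age    = c(30, 35, 40, 45, 50, 55)
)

# Median and interquartile range of Age within each Exited group
age_median <- tapply(toy$Age, toy$Exited, median)
age_iqr    <- tapply(toy$Age, toy$Exited, IQR)

age_median
age_iqr
```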

# 4. SPLITTING DATASET TO TRAINING AND TESTING

In this section the data is split into two parts: a train dataset and a test dataset. The splitting ratio is 90:10, meaning that 9,000 of the 10,000 rows are sampled at random for the train dataset and the remaining 1,000 rows form the test dataset. The train dataset is used to build the model and the test dataset is used to test its performance.

In [None]:
str(data)

In [None]:
set.seed(123)
train_sample<-sample(10000,9000)
str(train_sample)
churn_train<-data[train_sample,]
churn_test<-data[-train_sample,]
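
The split above hard-codes both the number of rows and the training size. A hedged variant derives them from the data and a chosen fraction, so the same code works if the dataset changes. The toy data frame and the name `train_frac` are assumptions made for this sketch.

```r
# Toy data frame standing in for the churn data
set.seed(123)
toy <- data.frame(
  x      = rnorm(200),
  Exited = rep(c("no", "yes"), times = c(160, 40))
)

train_frac <- 0.9                             # assumed training share
n_train    <- floor(train_frac * nrow(toy))   # training size follows the data
train_idx  <- sample(nrow(toy), n_train)      # random row indices, no replacement

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Sizes of the two sets
c(train = nrow(toy_train), test = nrow(toy_test))
```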

In [None]:
prop.table(table(churn_train$Exited))

In [None]:
prop.table(table(churn_test$Exited))

The class proportions are fairly even across the two sets, so we can now build our decision tree.
We will use the C5.0 algorithm in the C50 package to train our decision tree.


In [None]:
churn_model<-C5.0(churn_train[-11],churn_train$Exited)

In [None]:
churn_model

In [None]:
summary(churn_model)

# 5. EVALUATING MODEL PERFORMANCE
To apply our decision tree to the test dataset, we use predict().

In [None]:
churn_pred<-predict(churn_model,churn_test)

In [None]:
CrossTable(churn_test$Exited,churn_pred,prop.chisq=FALSE,prop.c=FALSE,prop.r=FALSE,dnn=c('Actual churn','predicted churn'))
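
An error rate can be read off a cross table as the share of off-diagonal cells. As a sketch, the calculation looks like this in base R. The `actual` and `predicted` vectors are invented toy labels, not the model's output.

```r
# Toy actual and predicted labels; values are invented for illustration
actual <- factor(
  c("no", "no", "no", "no", "yes", "yes", "yes", "no", "no", "yes"),
  levels = c("no", "yes")
)
predicted <- factor(
  c("no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "no"),
  levels = c("no", "yes")
)

# Confusion matrix: rows are actual classes, columns are predicted classes
conf_mat <- table(Actual = actual, Predicted = predicted)

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

conf_mat
error_rate
```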

# 6. IMPROVING MODEL PERFORMANCE

The error rate is 13.5%. We can try boosting our model to check whether the error rate falls.


In [None]:
churn_boost<-C5.0(churn_train[-11],churn_train$Exited,trials=10)
churn_boost

In [None]:
summary(churn_boost)

In [None]:
churn_boost_pred<-predict(churn_boost,churn_test)

In [None]:
CrossTable(churn_test$Exited,churn_boost_pred,prop.chisq=FALSE,prop.c=FALSE,prop.r=FALSE,dnn=c('Actual churn','predicted churn'))

Boosting has increased the error rate from 13.5% to 13.6%. However, the types of mistakes are different.
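
Two models with similar error rates can still differ in which mistakes they make. The sketch below counts false positives (predicting churn for a customer who stayed) and false negatives (missing a customer who churned) for two sets of predictions. The helper `count_errors` and all label vectors are hypothetical and are not taken from the models above.

```r
# Toy labels; values are invented for illustration
actual <- factor(
  c("no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  levels = c("no", "yes")
)
pred_a <- factor(
  c("no", "no", "no", "no", "no", "yes", "no", "no", "yes", "yes"),
  levels = c("no", "yes")
)
pred_b <- factor(
  c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes"),
  levels = c("no", "yes")
)

# Hypothetical helper: count each type of mistake separately
count_errors <- function(actual, predicted) {
  c(
    false_positive = sum(actual == "no"  & predicted == "yes"),
    false_negative = sum(actual == "yes" & predicted == "no")
  )
}

errors_a <- count_errors(actual, pred_a)
errors_b <- count_errors(actual, pred_b)

# Same total number of mistakes, different mix
rbind(model_a = errors_a, model_b = errors_b)
```

For churn, a false negative is usually the costlier mistake, because a customer who leaves unnoticed cannot be targeted with a retention offer.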