![Bank](http://www.abm.co.uk/banking-finance-services/wp-content/uploads/sites/92/2016/04/Banking_Icons_Frame4c.jpg)

A financial institution is planning to roll out a stock market trading facilitation service for its existing
account holders. This service costs the bank a significant amount of money in terms of infrastructure, licensing
and people. To make the service offering profitable, the bank charges a percentage-based commission on every
trade transaction. However, this is not a unique service: many competitors offer the same service, sometimes
at a lower commission. To retain and attract people who trade heavily
on the stock market, and who in turn generate good commission for the institution, the bank is planning to offer discounts
as it rolls out the service to its entire customer base.

The problem is that this discount hampers profits from customers who do not trade in large
quantities. To tackle this issue, the company wants to offer discounts selectively. To do so, it needs
to know which of its customers are going to be heavy traders, the money makers for the bank.
To find out, the bank did a beta run of the service with a small chunk of its customer base
[approx 10000 people]. It manually divided these customers into two revenue categories, 1
and 2. Revenue category 1 contains the customers who are money makers for the bank; revenue category 2 contains the
ones who need to be kept out of discount offers.

We need to use this study's data to build a prediction model that can identify whether a customer is
potentially eligible for discounts [falls in revenue grid category 1]. Let's get the data and begin.

In [None]:

rg_train=read.csv("../input/rg_train.csv",stringsAsFactors=F)
rg_test=read.csv("../input/rg_test.csv",stringsAsFactors=F)

## Let's take a sneak peek at our train & test data
head(rg_train)


The variable names are self-explanatory. Our train data contains 32 variables such as age_band, status, occupation, home_status etc., which will be involved in the modelling process. Similarly, our test data, shown below, has 31 variables: everything except the response variable (Revenue.Grid), which we will predict.

In [None]:
head(rg_test)

In [None]:
library(dplyr)
glimpse(rg_train)

In [None]:
glimpse(rg_test)

Data Preparation:- 

We'll combine our two datasets so that we do not need to prepare them separately. This also
avoids the problem of ending up with different columns in the two datasets.
However, before combining them, we need to add the response column to test, because the two datasets must
have the same columns to stack vertically.

In [None]:
rg_test$Revenue.Grid=NA
rg_train$data='train'
rg_test$data='test'
rg=rbind(rg_train,rg_test)
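
As a small illustration of why the placeholder column is needed, the sketch below stacks two toy data frames. The names toy_train and toy_test and all values are made up for illustration. rbind on data frames needs matching column names, so the response is added to the test part as NA first.

```r
# Toy stand-ins for the train and test sets (hypothetical values)
toy_train <- data.frame(id = 1:3, x = c(10, 20, 30), y = c(1, 2, 1))
toy_test  <- data.frame(id = 4:5, x = c(40, 50))

# Add the missing response column and a source flag so the columns match
toy_test$y <- NA
toy_train$data <- "train"
toy_test$data  <- "test"

toy_all <- rbind(toy_train, toy_test)
print(toy_all)

# The flag lets us split the stacked data back into its parts later
print(table(toy_all$data))
```

The NA responses in the test rows are placeholders only; they are never used for fitting.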

Let's start by looking at the first predictor variable in the data, “children”.

In [None]:
table(rg$children)

From the glimpse output above we saw that the variable 'children' has come in as character despite being numeric in nature. We can convert it to numeric by mapping "Zero" to 0 and keeping the first character of every other label.

In [None]:
rg = rg %>%
mutate(children=ifelse(children=="Zero",0,substr(children,1,1)),
children=as.numeric(children))

To check whether it has been converted to numeric, we will run glimpse again.

In [None]:
glimpse(rg)
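
The conversion can be tried on a small vector first. The labels below are an assumption about what table(rg$children) shows; only "Zero" and the first-character rule are taken from the code above.

```r
# Hypothetical category labels, assumed to resemble those in rg$children
children_raw <- c("Zero", "1", "2", "3", "4+", "Zero")

# "Zero" becomes 0; otherwise keep the first character (so "4+" becomes "4")
children_num <- as.numeric(ifelse(children_raw == "Zero", 0,
                                  substr(children_raw, 1, 1)))
print(children_num)
```

Note that an open-ended label such as "4+" is treated as exactly 4 under this rule.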

Let's look at the age band variable. We can convert this to numeric by taking the average of each age range.
Let's look at the frequency table first to find out if there are any non-numeric values.

In [None]:
table(rg$age_band)

In [None]:
rg=rg %>%
mutate(a1=as.numeric(substr(age_band,1,2)),
a2=as.numeric(substr(age_band,4,5)),
age=ifelse(substr(age_band,1,2)=="71",71,ifelse(age_band=="Unknown",NA,0.5*(a1+a2)))
) %>%
select(-a1,-a2,-age_band)

In [None]:
glimpse(rg)
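
A minimal sketch of the midpoint logic on a few labels. The labels are assumed to follow the "lo-hi" pattern, with "71+" and "Unknown" as special cases, as the code above implies.

```r
# Hypothetical labels, assumed to resemble those in table(rg$age_band)
age_band <- c("18-21", "22-25", "61-65", "71+", "Unknown")

# Lower and upper bounds; non-numeric pieces become NA (warnings suppressed on purpose)
a1 <- suppressWarnings(as.numeric(substr(age_band, 1, 2)))
a2 <- suppressWarnings(as.numeric(substr(age_band, 4, 5)))

# Midpoint for closed ranges, 71 for the open-ended top band, NA for "Unknown"
age <- ifelse(substr(age_band, 1, 2) == "71", 71,
              ifelse(age_band == "Unknown", NA, 0.5 * (a1 + a2)))
print(data.frame(age_band, age))
```

The NA produced for "Unknown" is filled in later, when missing values are imputed.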



Next we'll look at the various categorical variables and create dummies for them. Instead of converting them one by one, we can write a function that does this for any column and ignores categories with a low count.

In [None]:
CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  for( cat in categories){
    name=paste(var,cat,sep="_")
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }
  data[,var]=NULL
  return(data)
}

Let me explain the CreateDummies function that we created above.

t=table(data[,var]) : this creates a frequency table for the given categorical column. t is now
simply a table whose names are the categories of the variable and whose values are their frequencies in the data.

t=t[t>freq_cutoff] : this removes from the table those categories whose frequency is at or below
the frequency cutoff (the cutoff is a subjective choice).

t=sort(t) : this simply sorts the remaining table in ascending order.

categories=names(t)[-1] : since we sorted the table in ascending order in the previous line, the first category
has the lowest count. Here we take all category names except that first one, thus making n-1 dummies
from the n remaining categories. Rows that belong to a category removed by the cutoff get 0 in every dummy,
so they are treated the same as the left-out category.

name=paste(var,cat,sep="_") : all the dummy variables that we intend to make need a name. This
line creates that name by joining the variable name and the category name with an underscore.

name=gsub(" ","",name) : this line and the gsub lines after it simply clean up the name.
Since we don't have any control over what the categories can be, we remove spaces and replace special
characters in an automated fashion.

data[,name]=as.numeric(data[,var]==cat) : once we have a cleaned-up name, this line creates the dummy
variable for that particular category.

data[,var]=NULL : once the for loop has created all the dummies, this line removes the original variable
from the data.
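
The core of these steps can be traced on a toy data frame. This is a sketch with made-up values, a hypothetical column grp and a cutoff of 1; the name clean-up with gsub is left out because the toy labels need none.

```r
# Toy data: "a" appears 4 times, "b" twice, "c" once (hypothetical values)
toy <- data.frame(id = 1:7,
                  grp = c("a", "a", "a", "a", "b", "b", "c"),
                  stringsAsFactors = FALSE)

t <- table(toy$grp)         # a = 4, b = 2, c = 1
t <- t[t > 1]               # keeps a and b; c is at or below the cutoff
t <- sort(t)                # ascending by count: b, a
categories <- names(t)[-1]  # drop the first (least frequent) as the baseline

for (lvl in categories) {
  toy[, paste("grp", lvl, sep = "_")] <- as.numeric(toy$grp == lvl)
}
toy$grp <- NULL
print(toy)
```

Only grp_a is created here. Rows from "b" (the baseline) and from "c" (removed by the cutoff) both end up with 0, which is how rare categories get folded into the baseline.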

The following bit of code extracts the names of all categorical variables in the data.

In [None]:
names(rg)[sapply(rg,function(x) is.character(x))]

We are going to ignore the column 'data', since it only marks train and test rows, and make dummies for the rest.

In [None]:
cat_cols=c("status","occupation","occupation_partner","home_status","family_income","self_employed",
"self_employed_partner","TVarea","gender","region")
for(cat in cat_cols){
rg=CreateDummies(rg,cat,100)
}

We are dropping the variables post_code and post_area. They take too many distinct values to
be useful in the modelling process.

In [None]:
rg=rg %>%
select(-post_code,-post_area)

We need to convert our response variable to 1/0: category 1 becomes 1 and category 2 becomes 0. The NA values we added for the test rows stay NA.

In [None]:
rg$Revenue.Grid=as.numeric(rg$Revenue.Grid==1)

Next we take care of missing values, if any, in the data. The for loop below fills them with the column mean computed from the training rows.

In [None]:
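for(col in names(rg)){
  ## skip the train/test flag and the response (test rows have NA response by design)
  if(sum(is.na(rg[,col]))>0 && !(col %in% c("data","Revenue.Grid"))){
    ## impute with the mean computed on training rows only
    rg[is.na(rg[,col]),col]=mean(rg[rg$data=='train',col],na.rm=T)
  }
}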
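
A small sketch of the same imputation on toy data (values are made up). The point to notice is that the mean comes from the training rows only, yet it fills the gaps in both the train and the test rows.

```r
# Toy stacked data with a 'data' flag, mimicking the combined frame (hypothetical values)
toy <- data.frame(x = c(1, NA, 3, NA, 10),
                  data = c("train", "train", "train", "test", "test"),
                  stringsAsFactors = FALSE)

# Mean from training rows only: (1 + 3) / 2 = 2; the test value 10 is ignored
train_mean <- mean(toy[toy$data == "train", "x"], na.rm = TRUE)

# Fill gaps in both parts with the training mean
toy[is.na(toy$x), "x"] <- train_mean
print(toy)
```

Using only training rows for the mean keeps information from the test set out of the model-building data.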

Now let's separate our datasets to begin the modelling process.

In [None]:
rg_train=rg %>% filter(data=='train') %>% select(-data)
rg_test=rg %>% filter(data=='test') %>% select (-data,-Revenue.Grid)

## Now our train data contains 77 variables instead of 32. 
glimpse(rg_train)

In [None]:
## Similarly for test data, the number of variables increased from 31 to 76 because of the dummy variables.

glimpse(rg_test)

To get a tentative measure of performance, we'll break our training data into two parts: 80% for building the model and 20% held out for validation.

In [None]:
set.seed(2)
s=sample(1:nrow(rg_train),0.8*nrow(rg_train))
rg_train1=rg_train[s,]
rg_train2=rg_train[-s,]

The first thing we'll look to eliminate is severe multicollinearity. To examine VIF, we can run a linear regression. We are not concerned with the output of this linear regression model; we are only interested in the VIF values of the predictors.

In [None]:
library(car)
for_vif=lm(Revenue.Grid~.-REF_NO,data=rg_train1)
sort(vif(for_vif),decreasing = T)[1:3]
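
As a rough illustration of what VIF measures: for a numeric predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this by hand in base R on simulated data; the helper vif_by_hand and the variables x1, x2, x3 are hypothetical names, not part of the car package.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the others

# VIF for one predictor: regress it on the remaining predictors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(x1 = vif_by_hand(x1, data.frame(x2, x3)),
          x2 = vif_by_hand(x2, data.frame(x1, x3)),
          x3 = vif_by_hand(x3, data.frame(x1, x2)))
print(round(vifs, 2))
```

The formula involves only the predictors, which is why the response used in the helper regression does not matter. In this toy data x1 and x2 should show large values while x3 stays close to 1.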

In our data there are a few cases of extremely high VIF values; let's eliminate those variables one by one. The code
given below is the result of multiple iterations.

In [None]:
for_vif=lm(Revenue.Grid~.-REF_NO-Investment.in.Commudity
-Investment.in.Derivative-Investment.in.Equity
-region_SouthEast-TVarea_Central-occupation_Professional
-family_income_GT_EQ_35000-region_Scotland
-Portfolio.Balance,
data=rg_train1)
sort(vif(for_vif),decreasing = T)[1:3]

All VIF values are now less than 10. This is good enough for logistic regression. Let's move on to building our
classification model.

In [None]:
log_fit=glm(Revenue.Grid~.-REF_NO-Investment.in.Commudity
-Investment.in.Derivative-Investment.in.Equity
-region_SouthEast-TVarea_Central-occupation_Professional
-family_income_GT_EQ_35000-region_Scotland-Portfolio.Balance,data=rg_train1,family = "binomial")

In [None]:
log_fit=step(log_fit)

In [None]:
summary(log_fit)

## If we look at summary(log_fit), we'll find there are still some variables with high p-values.

We will rerun our logistic regression model with the variables selected by the step
function and then drop variables from the remaining bunch ourselves, based on p-values.

In [None]:
formula(log_fit)

We can use this formula to rerun our model and drop variables based on p-values too. The code given below is the result
of multiple iterations. We have used 0.1 as the p-value cutoff; you can make it lower and drop more variables.

In [None]:
## Fit on rg_train1 only, so that rg_train2 stays unseen for validation
log_fit=glm(Revenue.Grid ~ Average.Credit.Card.Transaction + Balance.Transfer +
Term.Deposit + Life.Insurance + Medical.Insurance + Average.A.C.Balance +
Personal.Loan + Investment.in.Mutual.Fund + Investment.Tax.Saving.Bond +
Home.Loan + Online.Purchase.Amount +
family_income_LT_30000GT_EQ_27500 +
self_employed_partner_No + TVarea_ScottishTV + TVarea_Meridian ,
data=rg_train1,family='binomial')
summary(log_fit)
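
The p-value screening described above can also be done by reading the coefficient table programmatically rather than by eye. This is a sketch on simulated data; the data frame toy and its columns are hypothetical.

```r
set.seed(3)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
# Simulated response depends on x1 and x2 only (hypothetical data)
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * toy$x1 - 1 * toy$x2))

fit <- glm(y ~ x1 + x2 + noise, data = toy, family = "binomial")

# Column 4 of the coefficient table holds the p-values, Pr(>|z|)
p_values <- summary(fit)$coefficients[, 4]
print(round(p_values, 4))

# Candidates to drop at a 0.1 cutoff (intercept excluded)
print(names(p_values)[p_values > 0.1 & names(p_values) != "(Intercept)"])
```

Dropping one variable at a time and refitting, as done in the iterations above, is safer than removing every candidate at once, because p-values shift when the set of predictors changes.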

Let's see how this model's scores perform on the validation data that we kept aside.

In [None]:
## We will be using library pROC
library(pROC)

In [None]:
val.score=predict(log_fit,newdata = rg_train2,type='response')
auc_score=auc(roc(rg_train2$Revenue.Grid,val.score))
auc_score
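
To make the AUC number easier to interpret: it can be read as the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below computes that directly on made-up labels and scores, without pROC.

```r
# Hypothetical labels and scores
y     <- c(0, 0, 1, 0, 1, 1, 0, 1)
score <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90)

pos <- score[y == 1]
neg <- score[y == 0]

# Compare every positive with every negative; ties count as half
cmp <- outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q))
auc_by_hand <- mean(cmp)
print(auc_by_hand)  # 15 of the 16 pairs are ordered correctly: 0.9375
```

The single misordered pair is the positive scored 0.40 against the negative scored 0.45.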

The area under the curve is 0.96, so the tentative performance of logistic regression is going to be around 0.96. Let's visualise how our eventual binary response behaves w.r.t. the score that we obtained.

In [None]:
library(ggplot2)
mydata=data.frame(Revenue.Grid=rg_train2$Revenue.Grid,val.score=val.score)
## geom_jitter() already draws the points, so geom_point() is not needed as well
ggplot(mydata,aes(y=Revenue.Grid,x=val.score,color=factor(Revenue.Grid)))+
geom_jitter()

We can see that response 0 is bunched around low scores and response 1 is bunched around high scores;
however, there is some overlap across score values as well.

We now know the tentative performance of the logistic regression model in terms of AUC score. Next we'll build the
model on the entire training data, following similar steps.

In [None]:
for_vif=lm(Revenue.Grid~.-REF_NO-Investment.in.Commudity
-Investment.in.Derivative-Investment.in.Equity
-region_SouthEast-TVarea_Central-occupation_Professional
-family_income_GT_EQ_35000-region_Scotland-Portfolio.Balance
,data=rg_train)
sort(vif(for_vif),decreasing = T)[1:3]

In [None]:
log.fit.final=glm(Revenue.Grid~.-REF_NO-Investment.in.Commudity
-Investment.in.Derivative-Investment.in.Equity
-region_SouthEast-TVarea_Central-occupation_Professional
-family_income_GT_EQ_35000-region_Scotland-Portfolio.Balance,
data=rg_train,family='binomial')

In [None]:
log.fit.final=step(log.fit.final)

In [None]:
log.fit.final=glm(Revenue.Grid ~ Average.Credit.Card.Transaction + Balance.Transfer +
Term.Deposit + Life.Insurance + Medical.Insurance + Average.A.C.Balance +
Personal.Loan + Investment.in.Mutual.Fund + Investment.Tax.Saving.Bond +
Home.Loan + Online.Purchase.Amount + status_Partner +
occupation_partner_Retired+
self_employed_partner_No + TVarea_ScottishTV + TVarea_Meridian +
gender_Female,
data=rg_train,family='binomial')

In [None]:
summary(log.fit.final)

Now, if we need to submit simple probability scores, we can make predictions on the test data and submit those.

In [None]:
## Use the final model, which was built on the entire training data
test.prob.score= predict(log.fit.final,newdata = rg_test,type='response')

test.prob.score

We can save these scores as a csv file.

In [None]:
write.csv(test.prob.score,"proper_submission_file_name.csv",row.names = F)
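
Writing a bare vector gives a single column named "x". If the scores need to be matched back to customers, one option is to write them alongside an id column. The sketch below uses made-up ids and scores in place of rg_test$REF_NO and test.prob.score, and the column names are an assumption; use whatever the submission format requires.

```r
# Hypothetical ids and scores standing in for rg_test$REF_NO and test.prob.score
ids    <- c(101, 102, 103)
scores <- c(0.82, 0.05, 0.47)

# Column names here are an assumption about the expected layout
submission <- data.frame(REF_NO = ids, Revenue.Grid = scores)

# A temporary file keeps the sketch from touching the working directory
out_file <- tempfile(fileext = ".csv")
write.csv(submission, out_file, row.names = FALSE)

# Read it back to confirm the layout
print(read.csv(out_file))
```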