# Santander Customer Transaction Prediction

## Introduction

For this project, our team takes on a Kaggle challenge based on Santander's data set. Santander is a Spanish bank that wants help identifying customer behavior and spending habits. We use the data set they provide to propose a solution to this problem. We believe this data set is good practice for applying what we have learned in class, such as classification and prediction methods.

## Project description/Abstract

We chose a dataset from a Kaggle challenge hosted for Santander bank. On the overview page, Kaggle gives this description:
At Santander, our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.
Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?
In this challenge, we will identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data.

## Data Set

The Kaggle competition's Santander Customer Transaction Prediction dataset contains 200 numeric feature variables, a binary “target” column, and a string “ID_code” column. The training set has 200,000 records, so some dimension reduction may be needed to make computation more efficient and classification more accurate.
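As a rough idea of what dimension reduction could look like here, the sketch below runs PCA with base R's `prcomp` on simulated data and keeps enough components to explain 90% of the variance. The simulated matrix, the 90% cutoff, and the names `X`, `k`, and `X_reduced` are assumptions for illustration only; this is not something the notebook actually applies to the Santander data.

```r
set.seed(1)
# Simulated stand-in for the feature matrix (assumption): 500 rows, 20 numeric columns
X <- matrix(rnorm(500 * 20), nrow = 500)
colnames(X) <- paste0("var_", 0:19)

# Principal component analysis on centered and scaled features
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first 1, 2, ... components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching 90% (the cutoff is an arbitrary choice)
k <- which(var_explained >= 0.90)[1]
X_reduced <- pca$x[, 1:k, drop = FALSE]
dim(X_reduced)
```

On independent noise like this, almost all components are needed; on real data with correlated features, `k` could be much smaller.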

In [32]:
install.packages("MASS", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("data.table", repos = "http://cran.us.r-project.org")

"package 'ggplot2' is in use and will not be installed"

package 'data.table' successfully unpacked and MD5 sums checked


"cannot remove prior installation of package 'data.table'"


The downloaded binary packages are in
	C:\Users\Naeem\AppData\Local\Temp\Rtmp6jpX05\downloaded_packages


In [33]:
library(MASS)
library(ggplot2)
library(data.table)

ERROR: Error in library(data.table): there is no package called 'data.table'


Let's check the dimensions of the train and test sets, find which variables are in train but not in test, and look at the head of each data set.

In [8]:
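train = fread("train.csv")
test = fread("test.csv")  # assumed to sit next to train.csv; needed for dim(test) and head(test) below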
dim(train) ; dim(test) ; setdiff(colnames(train) , colnames(test)) ; head(train) ; head(test)

ERROR: Error in fread("train.csv"): could not find function "fread"


## Data Pre-processing

It seems the variables have no meaningful names, and the only variable missing from the test set is the target column, which is what we need to predict.
Because we need the target column to both fit and evaluate the models, I will use only the train set in this case.
Let's remove the ID column.

In [9]:
train$ID_code = NULL

str(train)

# datasets are too big. let's keep the first 21 variables
# train <- train[,1:22]
# str(train)
# test <- test[,1:21]
# str(test)
summary(train)

ERROR: Error in train$ID_code = NULL: object 'train' not found


Check whether there are NA or blank values.

In [10]:
NaValue = function (x) {sum(is.na(x)) }
apply(train, 2, NaValue)
# no NAs
BlankValue = function (x) {sum(x=="") }
apply(train, 2, BlankValue)
# no blanks
dataset <- train

ERROR: Error in apply(train, 2, NaValue): object 'train' not found


The dataset is ready for modeling. As a first step, I split it into a train set and a test set in a 60-40 ratio.

In [11]:
set.seed(1)
row.number = sample(1:nrow(dataset), 0.6*nrow(dataset))
train = dataset[row.number,]
test = dataset[-row.number,]
dim(train)
dim(test)

ERROR: Error in nrow(dataset): object 'dataset' not found


## Logistic Regression
### Initial Model

In [12]:
attach(train)
model1 = glm(factor(target)~., data=train, family=binomial)
summary(model1)


ERROR: Error in attach(train): object 'train' not found


The initial model shows that most variables are statistically significant.

Predict on the training data and find the training accuracy.

In [None]:
pred.prob = predict(model1, type="response")
pred.prob = ifelse(pred.prob > 0.5, 1, 0) # I use 0.5 as default threshold. should we change it to lower?
table(pred.prob, target)
# the accuracy of the model is 0.9145583=(106456+3291)/120000
detach(train)

### Prediction on test Data

In [13]:
attach(test)
pred.prob = predict(model1, newdata= test, type="response")
pred.prob = ifelse(pred.prob > 0.5, 1, 0)
table(pred.prob, target)
# the accuracy of the model is 0.9139125=(70943+2170)/80000

ERROR: Error in attach(test): object 'test' not found
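The code comment above asks whether the 0.5 threshold should be lowered. Here is a small sketch of how that could be explored. It uses simulated data in which roughly one in ten labels is positive (an assumption meant to resemble an imbalanced target), and the names `metrics_at` and `baseline_accuracy` are made up for this example. It reports accuracy together with sensitivity and specificity, because with imbalanced classes accuracy alone can look good even if few positives are found.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
# Simulated imbalanced labels (assumption): roughly one in ten is positive
y <- rbinom(n, 1, plogis(-2.6 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")

# Accuracy, sensitivity and specificity at a given probability threshold
metrics_at <- function(threshold) {
  pred <- as.integer(p > threshold)
  tp <- sum(pred == 1 & y == 1)
  tn <- sum(pred == 0 & y == 0)
  c(threshold = threshold,
    accuracy = (tp + tn) / n,
    sensitivity = tp / sum(y == 1),
    specificity = tn / sum(y == 0))
}

results <- t(sapply(c(0.5, 0.3, 0.1), metrics_at))
baseline_accuracy <- mean(y == 0)  # accuracy of always predicting 0
print(round(results, 3))
print(baseline_accuracy)
```

Lowering the threshold can only keep or increase sensitivity and keep or decrease specificity, so the choice depends on which kind of error matters more.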


## LDA model

### Training model

In [14]:
attach(train)
lda.model = lda (factor(target)~., data=train)
lda.model


ERROR: Error in attach(train): object 'train' not found


### Prediction on Training Data

In [15]:
predmodel.train.lda = predict(lda.model, newdata=train)  # predict.lda expects newdata, not data
table(Predicted=predmodel.train.lda$class, target=train$target)
# accuracy =0.914625=(106361+3394)/120000 very similar to logistic regression

ERROR: Error in predict(lda.model, data = train): object 'lda.model' not found


### Plot
The plot below shows how the LDA classifier separates the response classes. 
The x-axis shows the value of the linear discriminant, i.e. the linear combination of features defined by the LDA coefficients. 
The two groups are the predicted response classes.

In [16]:
ldahist(predmodel.train.lda$x[,1], g= predmodel.train.lda$class)

ERROR: Error in ldahist(predmodel.train.lda$x[, 1], g = predmodel.train.lda$class): object 'predmodel.train.lda' not found


From the plot, Group 0 looks normally distributed, but Group 1 does not. 
LDA is therefore probably not ideal here, because it assumes that both groups are normally distributed with the same covariance.
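A visual check like this can be backed up with a quick numeric one. The sketch below is only an illustration on simulated data where class 1 is deliberately skewed (an assumption, not the Santander data). It groups the discriminant scores by the true class, runs a Shapiro-Wilk test on each group, and compares their spread. Note that `shapiro.test` only accepts between 3 and 5000 values, so on the full training set a random subsample would be needed.

```r
library(MASS)
set.seed(1)
n <- 300
# Class 0 is Gaussian; class 1 is skewed (exponential) to mimic a violated assumption
x0 <- cbind(rnorm(n), rnorm(n))
x1 <- cbind(rexp(n) + 1, rexp(n) + 1)
d <- data.frame(rbind(x0, x1), target = factor(rep(c(0, 1), each = n)))

fit <- lda(target ~ ., data = d)
scores <- predict(fit, newdata = d)$x[, 1]  # first (and only) linear discriminant

# Shapiro-Wilk normality test on the scores of each true class
p_values <- tapply(scores, d$target, function(s) shapiro.test(s)$p.value)
# Per-class standard deviation of the scores; LDA assumes a shared covariance
sds <- tapply(scores, d$target, sd)
print(p_values)
print(sds)
```

A very small p-value for a class suggests its scores are not normally distributed.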

### Prediction on Test Data

In [17]:
# check accuracy for test data
attach(test)
predmodel.test.lda = predict(lda.model, newdata=test)
table(Predicted=predmodel.test.lda$class, target=test$target)
# accuracy is 0.9140625 = (70891+2234)/80000 slightly higher than logistic regression (0.9139125)

ERROR: Error in attach(test): object 'test' not found


### Plot for Test Prediction

In [18]:
par(mfrow=c(1,1))
plot(predmodel.test.lda$x[,1], predmodel.test.lda$class, col=test$target+10)

ERROR: Error in plot(predmodel.test.lda$x[, 1], predmodel.test.lda$class, col = test$target + : object 'predmodel.test.lda' not found


## QDA model
### Training Model

In [19]:
attach(train)
qda.model = qda(factor(target)~., data=train)
qda.model

ERROR: Error in attach(train): object 'train' not found


### Prediction on Training Data

In [None]:
predmodel.train.qda = predict(qda.model, newdata=train)  # predict.qda expects newdata, not data
table(Predicted=predmodel.train.qda$class, target=target)
# accuracy = 0.939325 = (106068+6651)/120000 much better than logistic regression and LDA

### Predicting test results

In [20]:
attach(test)
predmodel.test.qda = predict(qda.model, newdata=test)
table(Predicted=predmodel.test.qda$class, target=test$target)
# accuracy = 0.9076625=(70390+2223)/80000 accuracy is lower in test set

ERROR: Error in attach(test): object 'test' not found


### Plot

In [21]:
par(mfrow=c(1,1))
plot(predmodel.test.qda$posterior[,2], predmodel.test.qda$class, col=test$target+10)

ERROR: Error in plot(predmodel.test.qda$posterior[, 2], predmodel.test.qda$class, : object 'predmodel.test.qda' not found


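QDA has more incorrect classifications on the test set (accuracy 0.9077).
Logistic regression (0.9139) and LDA (0.9141) are nearly identical, but since the LDA normality assumption looks violated, logistic regression seems the best classifier among the 3.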
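The comparison can be recomputed from the test-set confusion-matrix counts reported in the code comments above. This is just the arithmetic, with the variable names `correct`, `n_test`, and `accuracy` chosen for the example:

```r
# Correctly classified test rows (true negatives + true positives) as reported above
correct <- c(logistic = 70943 + 2170, lda = 70891 + 2234, qda = 70390 + 2223)
n_test <- 80000

accuracy <- correct / n_test
print(sort(accuracy, decreasing = TRUE))
# lda 0.9140625, logistic 0.9139125, qda 0.9076625
```

Logistic regression and LDA differ by only 12 correctly classified rows out of 80000, so the two are practically tied on accuracy.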

In [22]:
data <- as.data.frame(test[,1:10])

hist(data$var_8)

ERROR: Error in as.data.frame(test[, 1:10]): object 'test' not found


## XGBoost

XGBoost is short for e***X***treme ***G***radient ***Boost***ing package.

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for large structured data.
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

https://xgboost.readthedocs.io/en/latest/tutorials/model.html
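To make the idea of an ensemble of weak learners concrete, here is a minimal sketch of gradient boosting written in base R. It is a simplification under stated assumptions: it uses squared-error loss on a simulated regression problem, one-split stumps as the weak learners, and made-up helper names (`fit_stump`, `predict_stump`). It is not how XGBoost is implemented, which also uses second-order gradient information and regularization.

```r
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.2)   # simulated regression target (assumption)

# Fit a one-split regression stump to residuals r by scanning candidate thresholds
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- mean(r[x <= s])
    right <- mean(r[x > s])
    sse <- sum((r - ifelse(x <= s, left, right))^2)
    if (sse < best$sse) best <- list(sse = sse, split = s, left = left, right = right)
  }
  best
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

eta <- 0.1                 # shrinkage (learning rate)
n_trees <- 100
pred <- rep(mean(y), n)    # start from a constant prediction
mse <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  residual <- y - pred     # negative gradient of the squared-error loss
  stump <- fit_stump(x, residual)
  pred <- pred + eta * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}
print(c(first = mse[1], last = mse[n_trees]))
```

Each round fits a weak learner to what the current ensemble still gets wrong, and `eta` controls how much of that correction is added.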

In [31]:
library(data.table)
library(caret)
library(xgboost)
library(pROC)
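train = fread("../santander-customer-transaction-prediction/train.csv")
test = fread("../santander-customer-transaction-prediction/test.csv")  # assumed to sit next to train.csv; used as the final scoring set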

ERROR: Error in library(data.table): there is no package called 'data.table'


### Objective function

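### $\text{obj}(\theta) = \sum_{i=1}^n L(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^t \omega(\theta_k)$

Here $\hat{y}_i^{(t)}$ is the prediction for the i-th sample after $t$ boosting rounds, $\theta_k$ denotes the parameters of the k-th tree, and $\omega$ is the regularization term.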

### Loss Function
### $\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log p_i + (1-y_i) \log(1-p_i) \right]$

In the above equation, $N$ is the number of instances (samples), $y_i$ is the true label of the i-th instance (0 or 1), and $p_i$ is the predicted probability that the i-th instance has label 1. For each instance only one of the two terms is active: when $y_i = 1$ the contribution is $-\log(p_i)$, and when $y_i = 0$ it is $-\log(1-p_i)$. In other words, log loss averages, over all instances, the negative log of the probability the model assigned to the true outcome, so confident wrong predictions are penalized heavily.
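A small worked example may help. The function below is a direct base R translation of the formula. The name `log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration.

```r
# Binary log loss; p is the predicted probability of label 1
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)
p_confident <- c(0.9, 0.1, 0.8, 0.2)  # mostly right
p_wrong <- c(0.1, 0.9, 0.2, 0.8)      # confidently wrong

log_loss(y, p_confident)
log_loss(y, p_wrong)
log_loss(y, rep(0.5, 4))  # uninformative predictions give log(2)
```

The confidently wrong predictions get a much larger loss than the uninformative 0.5 predictions, which is why log loss rewards well-calibrated probabilities.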

### Regularization

The regularization term controls the complexity of the model, which helps us to avoid overfitting.

### $\omega(\theta) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$

Here $w$ is the vector of scores on the leaves, $T$ is the number of leaves, $\gamma$ is the penalty for each additional leaf, and $\lambda$ is the L2 regularization weight on the leaf scores.
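To see how the two parts of the penalty add up, the sketch below evaluates the formula for one hypothetical tree with three leaves. The leaf scores are made up; `gamma = 5` and `lambda = 1` are the values used in the parameter list later in this notebook.

```r
# Regularization term for a single tree with leaf scores w
omega <- function(w, gamma, lambda) {
  gamma * length(w) + 0.5 * lambda * sum(w^2)
}

w <- c(0.4, -0.2, 0.1)   # hypothetical leaf scores of a 3-leaf tree
omega(w, gamma = 5, lambda = 1)
# 5 * 3 + 0.5 * 1 * (0.16 + 0.04 + 0.01) = 15.105
```

With these values the per-leaf penalty dominates, so a split has to reduce the loss noticeably to be worth adding another leaf.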

### Data Preprocessing

In [26]:
# Let's remove the ID column

train$ID_code = NULL
test$ID_code = NULL

target = train$target
train$target = NULL  # drop the label so it is not passed to the model as a feature
summary(target)
table(target)

# we have 10% of the labels as positive and rest are zeros. Now let's create models

ERROR: Error in train$ID_code = NULL: object 'train' not found


### Cross Validation with 5 folds

In [28]:
nrounds = 5  # number of CV folds (not the number of boosting rounds)
set.seed(1234)
folds = createFolds(factor(target), k = 5, list = FALSE)

ERROR: Error in createFolds(factor(target), k = 5, list = FALSE): could not find function "createFolds"
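For readers without `caret`, the idea behind stratified folds and out-of-fold predictions can be sketched in base R. Everything here is illustrative: the data are simulated, a plain logistic regression stands in for the real model, and the names `folds` and `oof` are chosen for the example. It is not a reimplementation of `createFolds`.

```r
set.seed(1234)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.2 + x))  # simulated imbalanced labels (assumption)
d <- data.frame(x = x, y = y)
k <- 5

# Stratified fold assignment: shuffle fold labels separately within each class
folds <- integer(n)
for (cls in unique(d$y)) {
  idx <- which(d$y == cls)
  folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

# Out-of-fold predictions: each row is scored by a model that never saw it
oof <- numeric(n)
for (f in seq_len(k)) {
  valid <- which(folds == f)
  fit <- glm(y ~ x, data = d[-valid, ], family = binomial)
  oof[valid] <- predict(fit, newdata = d[valid, ], type = "response")
}

tab <- table(folds, d$y)
print(tab)
```

Because fold labels are assigned within each class, every fold keeps about the same share of positives, which matters when positives are rare.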


### Training Model

In [29]:
dev.result <-  rep(0, nrow(train)) 
pred_te <- rep(0, nrow(test))

for (this.round in 1:nrounds) {      
  valid <- c(1:length(target)) [folds == this.round]
  dev <- c(1:length(target)) [folds != this.round]
  
  dtrain<- xgb.DMatrix(data= as.matrix(train[dev,]), 
                       label= target[dev])
  #weight = w[dev])
  dvalid <- xgb.DMatrix(data= as.matrix(train[valid,]) , 
                        label= target[valid])
  valids <- list(val = dvalid)
  #### parameters are far from being optimal ####  
  param = list(objective = "binary:logistic", 
               eval_metric = "auc",
               max_depth = 4,
               eta = 0.05,
               gamma = 5,
               subsample = 0.7,   
               colsample_bytree = 0.7,
               min_child_weight = 50,  
               colsample_bylevel = 0.7,
               lambda = 1, 
               alpha = 0,
               booster = "gbtree",
               silent = 0
  ) 
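  model<- xgb.train(data = dtrain,
                    params= param, 
                    nrounds = 5000, 
                    verbose = T, 
                    watchlist = list(val1=dtrain , val2 = dvalid) ,  # early stopping monitors the last entry (val2)
                    early_stopping_rounds = 50 , 
                    print_every_n = 500,
                    maximize = T
  )
  # out-of-fold predictions for the rows held out in this fold
  pred = predict(model,as.matrix(train[valid,]))
  dev.result[valid] = pred  
  # score the test set with this fold's model and accumulate for averaging
  pred_test  = predict(model,as.matrix(test))
  pred_te = pred_te +pred_test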
}

ERROR: Error in nrow(train): object 'train' not found


Now, let's check the xgboost CV score

In [30]:
auc(target,dev.result)
pred_test = pred_te/nrounds

ERROR: Error in auc(target, dev.result): could not find function "auc"
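If `pROC` is not available, the AUC can also be computed by hand. The sketch below uses the rank-based (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The function name `auc_rank` and the toy inputs are assumptions for illustration; this is not `pROC`'s implementation.

```r
# AUC from ranks: labels are 0/1, scores are predicted probabilities
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)  # 0.75
```

In this toy example three of the four positive-negative pairs are ordered correctly, which gives 0.75.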
