In [1]:
# initial settings
# Note: installing the packages can take a while
library_path <- paste(getwd(), "packages",sep="/")
dir.create(library_path,showWarnings = FALSE)
.libPaths(library_path)

if(!require(tidyverse)){
    install.packages("tidyverse")
    library(tidyverse)
}
if(!require(repr)){
    install.packages("repr")
    library(repr)
}
if(!require(rpart)){
    install.packages("rpart")
    library(rpart)
}
if(!require(rpart.plot)){
    install.packages("rpart.plot")
    library(rpart.plot)
}
if(!require(caret)){
    install.packages("caret")
    library(caret)
}
if(!require(precrec)){
    install.packages("precrec")
    library(precrec)
}
if(!require(e1071)){
    install.packages("e1071")
    library(e1071)
}
if(!require(ISLR)){
    install.packages("ISLR")
    library(ISLR)
}
if(!require(Metrics)){
    install.packages("Metrics")
    library(Metrics)
}
if(!require(class)){
    install.packages("class")
    library(class)
}

#library(tidyverse)
#library(rpart)
#library(rpart.plot)
library(caret)
#library(class)
#library(e1071)

#install.packages('precrec',lib='.', verbose=TRUE)
#library(precrec,lib.loc='.')



# Plot size depending on your screen resolution to 9 x 6
options(repr.plot.width=9, repr.plot.height=6)


Loading required package: tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: repr

Loading required package: rpart

Loading required package: rpart.plot

Loading required package: caret

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift


Loading required package: precrec

Loading required p

# Welcome to Day 2 ML Workshop



### Exercise 1: Building a k-NN - Personal Loan Acceptance Case

Read in the `UniversalBank.csv` file. This dataset contains data on 5000 customers of **Universal Bank**, a relatively young bank that is growing rapidly in terms of overall customer acquisition.

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use **k-NN** to **predict whether a new customer will accept a loan offer**. This will serve as the basis for the design of a new campaign. 

The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. **Partition the data into training (60%) and validation (40%) sets.**

In [2]:
#load the data
universal.df <- read.csv("UniversalBank.csv")
dim(universal.df)
t(t(names(universal.df)))

0
ID
Age
Experience
Income
ZIP.Code
Family
CCAvg
Education
Mortgage
Personal.Loan


In [3]:
# partition the data
set.seed(1)  
train.index <- sample(row.names(universal.df), 0.6*dim(universal.df)[1])
valid.index <- setdiff(row.names(universal.df), train.index)  
train.df <- universal.df[train.index, -c(1, 5)]
valid.df <- universal.df[valid.index, -c(1, 5)]
t(t(names(train.df)))

0
Age
Experience
Income
Family
CCAvg
Education
Mortgage
Personal.Loan
Securities.Account
CD.Account


Consider the following customer:
- Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, 
- Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and
- Credit Card = 1. 

Perform a k-NN classification with all predictors except `ID` and `ZIP code` using `k = 1`. 

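Note that `knn` from the `class` library computes Euclidean distances on numeric inputs. The categorical predictors in this dataset are already coded as numbers (0/1 flags, and `Education` as a numeric code), so they are passed to it as they are and no dummy variables are created.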
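If one wanted to treat a multi-level categorical predictor such as `Education` as a set of indicator columns rather than as a number, a minimal sketch with base R's `model.matrix` could look like the following. The data frame `toy.df` and its values are made up for illustration and are not taken from `UniversalBank.csv`; the notebook itself keeps `Education` as a numeric column.

```r
# Toy data (hypothetical values, not taken from UniversalBank.csv)
toy.df <- data.frame(Income = c(49, 84, 120),
                     Education = c(1, 2, 3))

# Treat Education as a category rather than a number
toy.df$Education <- factor(toy.df$Education)

# One indicator column per level; "- 1" drops the intercept so that
# all three Education levels get their own column
dummies <- model.matrix(~ Education - 1, data = toy.df)
toy.dummy.df <- cbind(toy.df["Income"], dummies)
toy.dummy.df
```

Each row then has exactly one of the indicator columns set to 1, so the distance between two customers no longer depends on the arbitrary numeric spacing of the education codes.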

In [4]:
# building the new customer data
new.cust <- data.frame(Age = 40,                
                       Experience = 10,     
                       Income = 84,   
                       Family = 2,          
                       CCAvg = 2,          
                       Education = 2,        
                       Mortgage = 0,           
                       Securities.Account = 0, 
                       CD.Account = 0, 
                       Online = 1,            
                       CreditCard = 1)


In [5]:
# normalize the data
train.norm.df <- train.df[,-8]
valid.norm.df <- valid.df[,-8]

new.cust.norm <- new.cust
norm.values <- preProcess(train.df[, -8], method=c("center", "scale"))
train.norm.df <- predict(norm.values, train.df[, -8])
valid.norm.df <- predict(norm.values, valid.df[, -8])
new.cust.norm <- predict(norm.values, new.cust.norm)
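As a rough illustration of what centering and scaling amount to, the sketch below standardizes one toy predictor by hand. The vectors `train.x` and `valid.x` hold hypothetical numbers; the point is only that the mean and standard deviation are estimated on the training data and then reused for any other data, which is the same idea as fitting `preProcess` on `train.df` and applying it to the validation set and the new customer.

```r
# Toy training and validation values for one predictor (hypothetical numbers)
train.x <- c(20, 40, 60, 80, 100)
valid.x <- c(50, 110)

# Statistics are estimated on the training data only
mu <- mean(train.x)
sigma <- sd(train.x)

# The same training mean and standard deviation are applied to both sets
train.z <- (train.x - mu) / sigma
valid.z <- (valid.x - mu) / sigma

round(mean(train.z), 10)  # training data is centered at 0
sd(train.z)               # and has unit standard deviation
valid.z                   # validation data reuses the training statistics
```

Reusing the training statistics keeps the validation data on the same scale as the training data without letting information from the validation set leak into the model.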

In [6]:
# running the kNN
knn.pred <- class::knn(train = train.norm.df, 
                       test = new.cust.norm, 
                       cl = train.df$Personal.Loan, k = 1)
knn.pred

#### Answer: 
From the output we conclude that the above customer is classified as belonging to the **"loan not accepted"** group.
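To make the `k = 1` rule concrete, here is a minimal sketch that classifies a point by hand: compute the Euclidean distance to every training point and take the label of the closest one. The matrix `toy.train`, the labels `toy.labels`, and the point `toy.new` are hypothetical values chosen for illustration, not rows of the bank data.

```r
# Toy normalized training points (hypothetical values) and their labels
toy.train <- matrix(c(-1.0, -0.5,
                       0.2,  0.1,
                       1.5,  1.2),
                    ncol = 2, byrow = TRUE)
toy.labels <- factor(c(0, 0, 1))

# A new observation to classify
toy.new <- c(0.3, 0.0)

# Euclidean distance from the new point to every training point
dists <- sqrt(rowSums(sweep(toy.train, 2, toy.new)^2))

# With k = 1 the prediction is the label of the single closest point
nearest <- which.min(dists)
toy.labels[nearest]
```

Here the second training point is the closest, so the new point receives its label, 0.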

### Trying to find the optimal `k`

What is a choice of `k` that balances between overfitting and ignoring the predictor information?

In [7]:
# optimal k
accuracy.df <- data.frame(k = seq(1, 15, 1), overallaccurace = rep(0, 15))
for(i in 1:15) {
  knn.pred <- class::knn(train = train.norm.df, 
                         test = valid.norm.df, 
                         cl = train.df$Personal.Loan, k = i)
  accuracy.df[i, 2] <- confusionMatrix(knn.pred, 
                                       as.factor(valid.df$Personal.Loan))$overall[1]
}

which(accuracy.df[,2] == max(accuracy.df[,2])) 

#### Partial answer:
The output shows that the highest validation accuracy is reached at `k = 3`.

Now let's see the accuracy for each value of `k`:

In [8]:
# checking the accuracy per level of k built by the code above
accuracy.df

| k `<dbl>` | overallaccurace `<dbl>` |
|---|---|
| 1 | 0.9565 |
| 2 | 0.9525 |
| 3 | 0.959 |
| 4 | 0.958 |
| 5 | 0.9585 |
| 6 | 0.9535 |
| 7 | 0.9545 |
| 8 | 0.9545 |
| 9 | 0.9555 |
| 10 | 0.9545 |


#### Answer:
Best **k = 3**. The value of k that balances between overfitting (k too small) and ignoring the predictor information (k too large) is 3.
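The selection step can be retraced with the accuracies printed in the table above. This is only a sketch: the vector `acc` is typed in by hand from the ten rows shown in the output (the values for k = 11 to 15 are not shown, so they are left out), and it contrasts `which(... == max(...))`, which returns every tied k, with `which.max`, which returns only the first.

```r
# Validation accuracies for k = 1..10 as printed in the table above
acc <- c(0.9565, 0.9525, 0.9590, 0.9580, 0.9585,
         0.9535, 0.9545, 0.9545, 0.9555, 0.9545)

# which() with == returns every k that ties for the maximum
best.all <- which(acc == max(acc))

# which.max() returns only the first (smallest) such k
best.first <- which.max(acc)

best.all
best.first
```

Both calls point to k = 3 for these ten values. Note that the accuracies differ only in the third decimal place, so the ranking of neighboring values of k could change with a different random partition.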

Now, show the **confusion matrix** for the **validation data** that results from using the best k.

In [9]:
# knn with k = 3
knn.pred <- class::knn(train = train.norm.df, 
                       test = valid.norm.df, 
                       cl = train.df$Personal.Loan, k = 3)

confusionMatrix(knn.pred, as.factor(valid.df$Personal.Loan), positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1783   70
         1   12  135
                                          
               Accuracy : 0.959           
                 95% CI : (0.9494, 0.9673)
    No Information Rate : 0.8975          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7452          
                                          
 Mcnemar's Test P-Value : 3.082e-10       
                                          
            Sensitivity : 0.6585          
            Specificity : 0.9933          
         Pos Pred Value : 0.9184          
         Neg Pred Value : 0.9622          
             Prevalence : 0.1025          
         Detection Rate : 0.0675          
   Detection Prevalence : 0.0735          
      Balanced Accuracy : 0.8259          
                                          
       'Positive' Class : 1               
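
The main statistics in this output can be recomputed by hand from the four cell counts. The sketch below copies the counts from the confusion matrix above, with class 1 as the positive class; the variable names `TP`, `FN`, `FP`, and `TN` are introduced here for illustration.

```r
# Counts copied from the validation confusion matrix above (positive class = 1)
TP <- 135   # predicted 1, actual 1
FN <- 70    # predicted 0, actual 1
FP <- 12    # predicted 1, actual 0
TN <- 1783  # predicted 0, actual 0

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # share of actual acceptors that were found
specificity <- TN / (TN + FP)   # share of actual non-acceptors that were found
precision   <- TP / (TP + FP)   # "Pos Pred Value" in the caret output

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The high accuracy is driven mostly by the large "loan not accepted" class; the sensitivity of about 0.66 shows that roughly one in three actual acceptors is still missed.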
                        

Consider the following customer: 
- Age = 40, Experience = 10, 
- Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, 
- Securities Account = 0, CD Account = 0, Online = 1 and Credit Card = 1. 

Classify the customer using the **best k**.

In [10]:
# predict new customer with k = 3
knn.pred <- class::knn(train = train.norm.df, 
                       test = new.cust.norm, 
                       cl = train.df$Personal.Loan, k = 3)
knn.pred

### Repartition the data, this time into training, validation, and test sets (50% : 30% : 20%)

Apply the k-NN method with the k chosen above. Compare the confusion matrix of the test set with that of the training and validation sets. Comment on the differences and their reason. 

In [11]:
# 3-way partition
set.seed(1)  
train.index <- sample(row.names(universal.df), 0.5*dim(universal.df)[1])
valid.index <- sample(setdiff(row.names(universal.df), train.index), 
                      0.3*dim(universal.df)[1])
test.index <-  setdiff(row.names(universal.df), c(train.index, valid.index)) 
train.df <- universal.df[train.index, -c(1, 5)]
valid.df <- universal.df[valid.index, -c(1, 5)]
test.df <- universal.df[test.index, -c(1, 5)]
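
The partition logic above can be checked on plain integer indices. In this sketch `ids` is a stand-in for the row names of `universal.df` (an assumption made so that the example runs without the CSV file); it verifies that the three sets have the intended sizes, do not overlap, and together cover all 5000 rows.

```r
set.seed(1)
n <- 5000                      # number of customers in the dataset
ids <- seq_len(n)              # stand-in for the row names (assumption)

# 50% training, 30% validation drawn from the rest, remainder is the test set
train.idx <- sample(ids, round(0.5 * n))
valid.idx <- sample(setdiff(ids, train.idx), round(0.3 * n))
test.idx  <- setdiff(ids, c(train.idx, valid.idx))

# Sizes of the three partitions
c(train = length(train.idx), valid = length(valid.idx), test = length(test.idx))

# The partitions do not overlap and together cover every row
length(intersect(train.idx, valid.idx)) == 0
setequal(c(train.idx, valid.idx, test.idx), ids)
```

Sampling the validation indices from `setdiff(ids, train.idx)` is what guarantees that no row ends up in more than one set.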

In [12]:
# normalization
train.norm.df <- train.df[,-8]
valid.norm.df <- valid.df[,-8]
test.norm.df <- test.df[,-8]
norm.values <- preProcess(train.df[, -8], method=c("center", "scale"))
train.norm.df <- predict(norm.values, train.df[, -8])
valid.norm.df <- predict(norm.values, valid.df[, -8])
test.norm.df <- predict(norm.values, test.df[, -8])


In [13]:
# predictions on train
knn.predt <- class::knn(train = train.norm.df, 
                       test = train.norm.df, 
                       cl = train.df$Personal.Loan, k = 3)

confusionMatrix(knn.predt, as.factor(train.df$Personal.Loan), positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2260   60
         1    8  172
                                          
               Accuracy : 0.9728          
                 95% CI : (0.9656, 0.9788)
    No Information Rate : 0.9072          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8204          
                                          
 Mcnemar's Test P-Value : 6.224e-10       
                                          
            Sensitivity : 0.7414          
            Specificity : 0.9965          
         Pos Pred Value : 0.9556          
         Neg Pred Value : 0.9741          
             Prevalence : 0.0928          
         Detection Rate : 0.0688          
   Detection Prevalence : 0.0720          
      Balanced Accuracy : 0.8689          
                                          
       'Positive' Class : 1               
                        

In [14]:
# predictions on validation
knn.predv <- class::knn(train = train.norm.df, 
                       test = valid.norm.df, 
                       cl = train.df$Personal.Loan, k = 3)

confusionMatrix(knn.predv, as.factor(valid.df$Personal.Loan), positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1358   44
         1    6   92
                                          
               Accuracy : 0.9667          
                 95% CI : (0.9563, 0.9752)
    No Information Rate : 0.9093          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7688          
                                          
 Mcnemar's Test P-Value : 1.672e-07       
                                          
            Sensitivity : 0.67647         
            Specificity : 0.99560         
         Pos Pred Value : 0.93878         
         Neg Pred Value : 0.96862         
             Prevalence : 0.09067         
         Detection Rate : 0.06133         
   Detection Prevalence : 0.06533         
      Balanced Accuracy : 0.83604         
                                          
       'Positive' Class : 1               
                        

In [15]:
# predictions on test
knn.predtt <- class::knn(train = train.norm.df, 
                       test = test.norm.df, 
                       cl = train.df$Personal.Loan, k = 3)
confusionMatrix(knn.predtt, as.factor(test.df$Personal.Loan), positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 882  40
         1   6  72
                                          
               Accuracy : 0.954           
                 95% CI : (0.9391, 0.9661)
    No Information Rate : 0.888           
    P-Value [Acc > NIR] : 1.144e-13       
                                          
                  Kappa : 0.7334          
                                          
 Mcnemar's Test P-Value : 1.141e-06       
                                          
            Sensitivity : 0.6429          
            Specificity : 0.9932          
         Pos Pred Value : 0.9231          
         Neg Pred Value : 0.9566          
             Prevalence : 0.1120          
         Detection Rate : 0.0720          
   Detection Prevalence : 0.0780          
      Balanced Accuracy : 0.8181          
                                          
       'Positive' Class : 1               
                              

### Conclusion:

We choose the k that minimizes the misclassification rate on the validation set; here the best k is 3. From the confusion matrices above we observe the following:

- The error rate increases from the training set (2.7%) to the validation set (3.3%), and again from the validation set to the test set (4.6%).

- The differences are small, and the lower performance on the test set is not unexpected: the training and validation sets were both used in choosing the optimal k, so their results can be optimistic due to **overfitting**.

- The test set was not used to select the optimal k, so its slightly lower accuracy reflects the performance to expect on new data.

- Because the gap between training and test accuracy is small, we conclude that the model is not overfitting substantially.
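
The comparison can be made explicit by computing the error rate of each set from the off-diagonal cells of its confusion matrix. The counts below are copied from the three outputs above; the vector names `errors`, `sizes`, and `error.rate` are introduced here for illustration.

```r
# Misclassified counts (off-diagonal cells) and set sizes from the three
# confusion matrices above
errors <- c(train = 60 + 8, valid = 44 + 6, test = 40 + 6)
sizes  <- c(train = 2500,   valid = 1500,   test = 1000)

# Error rate = misclassified observations / observations in the set
error.rate <- errors / sizes
round(error.rate, 4)
```

The rates come out at about 2.7%, 3.3%, and 4.6%, which match one minus the reported accuracies of 0.9728, 0.9667, and 0.954.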