# Decision Trees in R

We're using the [rpart](https://www.rdocumentation.org/packages/rpart) package for R.
There are alternative implementations in R, for instance the [tree](https://www.rdocumentation.org/packages/tree) package.

The dataset is available from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

## Set up environment and required packages

The *tidyverse* package includes *dplyr*, *tidyr*, *readr*, and *ggplot2*, among others.

In [None]:
options(warn=-1)

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ROCR))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(rpart.plot))

## Load the data
We have three data sets:
1. The full bank data set with more than 41,000 entries, quite unbalanced
2. A smaller subset, still unbalanced
3. A balanced sample of the full data set with about 9,200 entries

In [None]:
data_dir_default = "../data/"
data_sets = c("bank-full", "bank-10percent", "bank-balanced")

A little helper function for loading the different data sets

In [None]:
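read_data <- function(data_set, data_dir = data_dir_default) {
  # Build the file path, e.g. "../data/bank-full.csv"
  path <- paste0(data_dir, data_set, ".csv")
  # Read strings as factors explicitly: since R 4.0.0 this is no longer the
  # default, and the target column y must be a factor for confusionMatrix()
  read.csv(path, stringsAsFactors = TRUE)
}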

Load the data

In [None]:
bank_data <- read_data(data_sets[1])
cat("# data rows: ", nrow(bank_data), "- # features: ", ncol(bank_data), "\n")

## Partition the data in training and test set

Another helper, this time for splitting a data set into two random partitions.

In [None]:
partition_data <- function(data, prop = 0.8) {
  # Fixed seed so that the split is reproducible
  set.seed(4711)
  n <- nrow(data)
  # Use the prop argument instead of a hard-coded 0.8
  n_train <- round(prop * n)
  partition <- sample(1:n, n_train)

  first  <- data[partition, ]
  second <- data[-partition, ]

  list(first, second)
}

We'll use a standard 80/20 split.

In [None]:
partitions <- partition_data(bank_data)
train.df <- partitions[[1]]
test.df  <- partitions[[2]]

cat("Number of training samples :", nrow(train.df), "\n")
cat("Number of test samples     :", nrow(test.df), "\n")
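
Since the full data set is quite unbalanced, a purely random split may leave the two partitions with slightly different class ratios. A minimal sketch of a stratified alternative could look as follows, assuming the target column is named `y`; the function name `stratified_partition` and the toy data frame are hypothetical and only serve as illustration.

```r
# Hypothetical helper: sample the same proportion from every class of `target`
stratified_partition <- function(data, target = "y", prop = 0.8, seed = 4711) {
  set.seed(seed)
  # Row indices grouped by class label
  idx_by_class <- split(seq_len(nrow(data)), data[[target]])
  # Draw round(prop * class size) indices within each class
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(prop * length(idx)))]
  }))
  list(data[train_idx, , drop = FALSE], data[-train_idx, , drop = FALSE])
}

# Toy data: 90 "no" and 10 "yes" rows (made up for illustration)
toy <- data.frame(x = seq_len(100),
                  y = factor(rep(c("no", "yes"), times = c(90, 10))))
toy_parts <- stratified_partition(toy)
table(toy_parts[[1]]$y)  # class counts in the training part
table(toy_parts[[2]]$y)  # class counts in the test part
```

Like `partition_data`, the sketch returns a list of two data frames, so it could be swapped in without changing the calling code.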

## Build the Model

We're using the *rpart* routine with its default settings. *rpart* implements the CART algorithm with tree pruning.

In [None]:
bank_model1 <- rpart(formula = y ~ ., 
                     data = train.df, 
                     method = "class")

rpart.plot(bank_model1)

Each node in the tree describes a splitting criterion. Also each node shows
 * the predicted class
 * the predicted probability of having value 'yes'
 * the percentage of observations in the node
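
The same node-level information can be read from the fitted object itself. Here is a small sketch, assuming the `kyphosis` example data that ships with *rpart* as a stand-in for the bank data (the names `toy_fit` and `node_pct` are made up for this example):

```r
library(rpart)

# kyphosis is an example data set bundled with rpart (not the bank data)
toy_fit <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis, method = "class")

# Text version of the tree: split, n, loss, predicted class, class probabilities
print(toy_fit)

# Percentage of observations per node, relative to the root node
node_pct <- 100 * toy_fit$frame$n / toy_fit$frame$n[1]
round(node_pct, 1)
```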

## Evaluate on the test set
To get the predicted classes, call *predict* (which dispatches to *predict.rpart*) with *type="class"*; use *type="prob"* to get a matrix of class probabilities.

In [None]:
predicted <- function(model, data) {
    # Hard class labels ("no" / "yes")
    predicted_class <- predict(object = model,
                               newdata = data,
                               type = "class")

    # Matrix of class probabilities, one column per class
    predicted_probs <- predict(object = model,
                               newdata = data,
                               type = "prob")
    predicted_probs_yes <- predicted_probs[, "yes"]

    list(predicted_class, predicted_probs_yes)
}

predicted_class_probs <- predicted(bank_model1, test.df)
predict.class <- predicted_class_probs[[1]]
predict.probs.yes <- predicted_class_probs[[2]]
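
To illustrate how the two prediction types relate, here is a small sketch on the `kyphosis` data bundled with *rpart* (not the bank data). With rpart's defaults, i.e. assuming no custom loss matrix, the predicted class in this example is simply the column with the largest probability. All variable names are made up for the example.

```r
library(rpart)

kyph_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

kyph_class <- predict(kyph_fit, newdata = kyphosis, type = "class")  # factor
kyph_probs <- predict(kyph_fit, newdata = kyphosis, type = "prob")   # matrix

head(kyph_probs)  # one column per class, each row sums to 1

# Name of the class with the highest probability in each row
kyph_argmax <- colnames(kyph_probs)[max.col(kyph_probs, ties.method = "first")]
all(kyph_argmax == as.character(kyph_class))
```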

### Confusion Matrix

In [None]:
evaluation <- confusionMatrix(data = predict.class,       
                              reference = test.df$y)
print(evaluation)

### Classification Accuracy

In [None]:
accuracy <- evaluation$overall["Accuracy"]
cat("Classification Accuracy : ", format(100*accuracy,digits = 4), "%\n")
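
For reference, accuracy is the share of correct predictions, i.e. the diagonal of the confusion table divided by the total count. Here is a base-R sketch on made-up labels (the vectors `toy_actual` and `toy_predicted` are hypothetical):

```r
toy_actual    <- factor(c("no", "no", "no", "no", "yes", "yes"),
                        levels = c("no", "yes"))
toy_predicted <- factor(c("no", "no", "no", "yes", "yes", "no"),
                        levels = c("no", "yes"))

# Rows: predicted class, columns: actual class
toy_table <- table(Prediction = toy_predicted, Reference = toy_actual)

# Correct predictions sit on the diagonal
toy_accuracy <- sum(diag(toy_table)) / sum(toy_table)

# Share of the majority class: the accuracy of always predicting that class
toy_baseline <- max(table(toy_actual)) / length(toy_actual)
```

On an unbalanced data set it is worth comparing the accuracy against this baseline, which *caret* reports as the No Information Rate.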

### ROC Curve

In [None]:
pred <- prediction(predict.probs.yes, test.df$y)
roc_perf <- performance(pred,"tpr","fpr")
plot(roc_perf, colorize=TRUE)

### Area under Curve (AUC)

In [None]:
auc_perf <- performance(pred,"auc")
auc <- auc_perf@y.values[[1]]
cat("AUC :", auc)
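
The AUC can be read as the probability that a randomly chosen 'yes' case receives a higher score than a randomly chosen 'no' case. Here is a sketch of that rank-based computation in base R, on made-up scores (the function `auc_by_rank` and the toy vectors are hypothetical, not part of *ROCR*):

```r
toy_scores <- c(0.1, 0.4, 0.35, 0.8)
toy_labels <- c("no", "no", "yes", "yes")

auc_by_rank <- function(scores, labels, positive = "yes") {
  r <- rank(scores)                  # average ranks, so ties count as one half
  n_pos <- sum(labels == positive)
  n_neg <- length(labels) - n_pos
  # Rank sum of the positives, shifted and scaled to the range [0, 1]
  (sum(r[labels == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_by_rank(toy_scores, toy_labels)
toy_auc
```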

## Controlling the complexity
*rpart* has a so-called *complexity parameter* (*cp*) that controls tree pruning: a split that does not improve the overall fit by at least a factor of cp is not attempted
(see http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf, Section 4).
The default value for cp is 0.01. Let's build a more complex tree by lowering it.

In [None]:
bank_model2 <- rpart(formula = y ~ ., 
                     data = train.df, 
                     method = "class",
                     cp = 0.005)

rpart.plot(bank_model2)

Note that we have a deeper tree now. Unlike the previous one, this tree also considers the *month* feature.
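
The effect of *cp* can also be studied without refitting: a tree grown with a small cp can be cut back afterwards with `prune()`. Here is a sketch on the `kyphosis` data bundled with *rpart* (not the bank data); the cp values are arbitrary choices for illustration.

```r
library(rpart)

# Grow a large tree first: cp = 0 disables the complexity penalty,
# while minsplit and the other defaults still apply
big_tree <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class", cp = 0)

# One row per candidate subtree: cp, number of splits,
# relative error and cross-validated error
printcp(big_tree)

# Cut the tree back to the subtree that corresponds to a larger cp
small_tree <- prune(big_tree, cp = 0.05)

# Number of nodes before and after pruning
c(big = nrow(big_tree$frame), small = nrow(small_tree$frame))
```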

Let's do a few quantitative comparisons for different values of *cp* and different data sets.

In [None]:
build_and_evaluate_model <- function(data_set, cp) {
    data <- read_data(data_set)
    partitions <- partition_data(data)
    train.df <- partitions[[1]]
    test.df  <- partitions[[2]]
    model <- rpart(formula = y ~ ., 
                   data = train.df, 
                   method = "class",
                   cp = cp)

    predicted_class_probs <- predicted(model, test.df)
    predict.class <- predicted_class_probs[[1]]
    predict.probs.yes <- predicted_class_probs[[2]]
    
    evaluation <- confusionMatrix(data = predict.class,       
                                  reference = test.df$y)
    
    accuracy <- evaluation$overall["Accuracy"]
    
    pred <- prediction(predict.probs.yes, test.df$y)
    auc_perf <- performance(pred,"auc")
    auc <- auc_perf@y.values[[1]]
    
    cat("Data : ", 
        data_set,
        "\t - cp : ", cp, 
        "\t - accuracy : ", format(100*accuracy,digits = 4), 
        "\t - AUC : ", auc, 
        "\n")
    flush.console()
}

In [None]:
for (data_set in data_sets) {
  for ( cp in c(0.01, 0.005, 0.001)) {
      build_and_evaluate_model(data_set, cp)
  }
}