# German Credit dataset: decision tree algorithm
The following exercise is taken from <b>Machine Learning with R</b> (Third Edition) by <b>Brett Lantz</b>.

The dataset used in the exercise is the <b>German Credit dataset</b>, published by <b>Hans Hofmann</b>. The copy used here is downloaded from the textbook's GitHub page.

## Step 1: Collecting the dataset

In [1]:
credit <- read.csv("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-with-R-Third-Edition/master/Chapter05/credit.csv")

## Step 2: Exploring and preparing the data

In [2]:
str(credit)

'data.frame':	1000 obs. of  17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
 $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
 $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ exi

In [4]:
table(credit$checking_balance)
table(credit$credit_history)
table(credit$default)


    < 0 DM   > 200 DM 1 - 200 DM    unknown 
       274         63        269        394 


 critical      good   perfect      poor very good 
      293       530        40        88        49 


 no yes 
700 300 

### Data preparation - creating random training and test datasets

In [6]:
# The dataset is not randomly ordered, so random sampling is required.
# Use the textbook's RNG version so the sample matches the book.
RNGversion("3.5.2"); set.seed(123)

train_sample <- sample(1000, 900)
str(train_sample)

# Create training and test datasets
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]

prop.table(table(credit_train$default))
prop.table(table(credit_test$default))

"non-uniform 'Rounding' sampler used"


 int [1:900] 288 788 409 881 937 46 525 887 548 453 ...



       no       yes 
0.7033333 0.2966667 


  no  yes 
0.67 0.33 
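
Negative indexing is what keeps the two sets disjoint: `credit[-train_sample, ]` keeps every row whose position is not in `train_sample`. The sketch below repeats the same pattern on a made-up data frame (`toy` and its columns are hypothetical, not the credit data) and checks that no row ends up in both sets.

```r
# Made-up data frame standing in for `credit`; `id` records the original row.
toy <- data.frame(id = 1:1000, value = sqrt(1:1000))

set.seed(123)
toy_sample <- sample(1000, 900)   # 900 row positions drawn without replacement

toy_train <- toy[toy_sample, ]    # rows listed in the sample
toy_test  <- toy[-toy_sample, ]   # all remaining rows

nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))   # rows shared by both sets
```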

## Step 3: Training a model on the data

In [9]:
library(C50)
credit_model <- C5.0(credit_train[-17], credit_train$default)
credit_model
summary(credit_model)


Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Tree size: 57 

Non-standard options: attempt to group attributes



Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)


C5.0 [Release 2.07 GPL Edition]  	Sun Aug 09 15:51:25 2020
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (17 attributes) from undefined.data

Decision tree:

checking_balance in {> 200 DM,unknown}: no (412/50)
checking_balance in {< 0 DM,1 - 200 DM}:
:...credit_history in {perfect,very good}: yes (59/18)
    credit_history in {critical,good,poor}:
    :...months_loan_duration <= 22:
        :...credit_history = critical: no (72/14)
        :   credit_history = poor:
        :   :...dependents > 1: no (5)
        :   :   dependents <= 1:
        :   :   :...years_at_residence <= 3: yes (4/1)
        :   :       years_at_residence > 3: no (5/1)
        :   credit_history = good:
        :   :...savings_balance in {> 1000 DM,500 - 1000 DM}: no (15/1)
        :       savings_balance = 100 - 500 DM:
        :       :...other_credit = bank: yes (3)
        :       :   other_credit

- Training error rate of 14.8% (133 of the 900 training cases were misclassified)
- credit_history in {perfect, very good} as a predictor of default seems counterintuitive
- 35 false positives, 98 false negatives
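
The training error rate follows directly from the two error counts. A quick arithmetic check, with the counts typed in by hand from the notes above (the variable names are only illustrative):

```r
# Counts copied by hand from the notes above (training data, 900 cases).
false_positives <- 35   # actual "no", predicted "yes"
false_negatives <- 98   # actual "yes", predicted "no"
n_train <- 900

n_errors <- false_positives + false_negatives
training_error_rate <- n_errors / n_train

n_errors                              # 133
round(100 * training_error_rate, 1)   # 14.8
```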

## Step 4: Evaluating model performance

In [15]:
credit_pred <- predict(credit_model, credit_test)

library(gmodels)
CrossTable(credit_test$default, credit_pred, 
          prop.chisq = FALSE, prop.c = FALSE, 
          prop.r = FALSE, dnn = c("actual", "predicted default"))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
             | predicted default 
      actual |        no |       yes | Row Total | 
-------------|-----------|-----------|-----------|
          no |        59 |         8 |        67 | 
             |     0.590 |     0.080 |           | 
-------------|-----------|-----------|-----------|
         yes |        19 |        14 |        33 | 
             |     0.190 |     0.140 |           | 
-------------|-----------|-----------|-----------|
Column Total |        78 |        22 |       100 | 
-------------|-----------|-----------|-----------|

 


- Overall accuracy of 73% (73 of 100 test cases classified correctly)
- Only 14 of the 33 actual defaults (42%) were predicted correctly
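
Both figures can be recomputed from the four cell counts of the table. A minimal sketch, with the counts typed in by hand from the CrossTable output above (`conf_mat` is a hypothetical name, not an object created by the notebook):

```r
# Cell counts copied by hand from the CrossTable output above.
# matrix() fills column by column: predicted "no" first, then predicted "yes".
conf_mat <- matrix(c(59, 19, 8, 14), nrow = 2,
                   dimnames = list(actual = c("no", "yes"),
                                   predicted = c("no", "yes")))

# Overall accuracy: correct predictions (the diagonal) over all test cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual defaults that the model caught (often called recall).
default_recall <- conf_mat["yes", "yes"] / sum(conf_mat["yes", ])

round(c(accuracy = accuracy, default_recall = default_recall), 3)
```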

## Step 5: Improving model performance

In [12]:
# Add boosting: use 10 boosting iterations (trials = 10) instead of a single tree
credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)
credit_boost10
summary(credit_boost10)


Call:
C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Number of boosting iterations: 10 
Average tree size: 47.5 

Non-standard options: attempt to group attributes



Call:
C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)


C5.0 [Release 2.07 GPL Edition]  	Sun Aug 09 16:29:55 2020
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (17 attributes) from undefined.data

-----  Trial 0:  -----

Decision tree:

checking_balance in {> 200 DM,unknown}: no (412/50)
checking_balance in {< 0 DM,1 - 200 DM}:
:...credit_history in {perfect,very good}: yes (59/18)
    credit_history in {critical,good,poor}:
    :...months_loan_duration <= 22:
        :...credit_history = critical: no (72/14)
        :   credit_history = poor:
        :   :...dependents > 1: no (5)
        :   :   dependents <= 1:
        :   :   :...years_at_residence <= 3: yes (4/1)
        :   :       years_at_residence > 3: no (5/1)
        :   credit_history = good:
        :   :...savings_balance in {> 1000 DM,500 - 1000 DM}: no (15/1)
        :       savings_balance = 100 - 500 DM:
        :       :...other_credit = bank: yes

- With boosting, the error rate on the training data improves to 3.8%

In [16]:
credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10, 
          prop.chisq = FALSE, prop.c = FALSE, 
          prop.r = FALSE, dnn = c("actual default", "predicted default"))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        62 |         5 |        67 | 
               |     0.620 |     0.050 |           | 
---------------|-----------|-----------|-----------|
           yes |        13 |        20 |        33 | 
               |     0.130 |     0.200 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        75 |        25 |       100 | 
---------------|-----------|-----------|-----------|

 


- Test error rate dropped from 27% to 18%
- Yet predicting defaults is still weak: only 20 of the 33 actual defaults (61%) were caught

### Making some mistakes more costly than others

In [19]:
# Assign a penalty to certain kinds of errors

# Cost matrix
matrix_dim <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dim) <- c("predicted", "actual")

# A missed default (predicted "no", actual "yes") costs 4 times as much as a false alarm
error_cost <- matrix(c(0,1,4,0), nrow = 2, dimnames = matrix_dim)
error_cost

         actual
predicted no yes
      no   0   4
      yes  1   0
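
Because `matrix()` fills its cells column by column, the order of the values in `c(0, 1, 4, 0)` decides which error gets the cost of 4. The sketch below rebuilds the same matrix and looks up the two error costs by name to confirm the layout.

```r
# Rebuild the cost matrix the same way as in the cell above.
matrix_dim <- list(predicted = c("no", "yes"), actual = c("no", "yes"))
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = matrix_dim)

# matrix() fills column by column: first the actual = "no" column (0, 1),
# then the actual = "yes" column (4, 0).
error_cost["no", "yes"]   # predicted "no", actual "yes": a missed default
error_cost["yes", "no"]   # predicted "yes", actual "no": a false alarm
```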


In [21]:
credit_cost <- C5.0(credit_train[-17], credit_train$default, costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)
CrossTable(credit_test$default, credit_cost_pred, 
          prop.chisq = FALSE, prop.c = FALSE, 
          prop.r = FALSE, dnn = c("actual default", "predicted default"))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        37 |        30 |        67 | 
               |     0.370 |     0.300 |           | 
---------------|-----------|-----------|-----------|
           yes |         7 |        26 |        33 | 
               |     0.070 |     0.260 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        44 |        56 |       100 | 
---------------|-----------|-----------|-----------|

 


- Overall, the model makes more mistakes: 37% of the test cases are misclassified.
- However, its performance on defaults is much better: 26 of the 33 actual defaults (~79%) are caught, and only 7 are missed.
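
One way to compare the three models on the same footing is to weight each test-set error by the costs assigned above (4 for a missed default, 1 for a false alarm). The sketch below does that with the error counts typed in by hand from the three tables. The helper `total_cost` and the data frame `errors` are hypothetical names, and this comparison is an illustration rather than part of the textbook exercise.

```r
# Test-set error counts copied by hand from the three CrossTable outputs.
errors <- data.frame(
  model        = c("single tree", "boosted (10 trials)", "cost-sensitive"),
  missed       = c(19, 13, 7),   # actual "yes", predicted "no"
  false_alarms = c(8, 5, 30)     # actual "no", predicted "yes"
)

# Hypothetical helper: weight each kind of error by its assumed cost.
total_cost <- function(missed, false_alarms,
                       cost_missed = 4, cost_false_alarm = 1) {
  missed * cost_missed + false_alarms * cost_false_alarm
}

errors$error_rate <- (errors$missed + errors$false_alarms) / 100
errors$total_cost <- total_cost(errors$missed, errors$false_alarms)
errors
```

Under these particular weights the boosted and cost-sensitive models come out close to each other (57 versus 58), and both beat the single tree (84). A different cost ratio would shift that balance.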