This repository has been archived by the owner on May 28, 2018. It is now read-only.
/
Modeling.Rmd
778 lines (577 loc) · 26.3 KB
/
Modeling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
---
title: "Modeling - GiveMeSomeCredit"
author: "David Crook"
date: "November 6, 2015"
output: html_document
fontsize: 12
geometry: margin=0.6in
---
View output [Modeling.html](https://idcrook.github.io/SR_Foundations_DS_Fall_2015/capstone/GiveMeSomeCredit/Modeling.html).
## Setup
```{r Setup, message=FALSE, warning=FALSE, echo=FALSE}
setwd("~/projects/Classes/FoundationsOfDataScience_sliderule/github/capstone/GiveMeSomeCredit")
# need these for decision tree graph
library(rpart)
library(rpart.plot)
suppressMessages(library(dplyr)) # summarise
library(caret)
library(data.table)
#install.packages("stepPlr")
#install.packages("pROC") # needed for varImp on logreg
library(gridExtra)
library(ROCR)
source('EvaluationMetrics.R')
```
``` {r Set seed}
# set randomizer's seed
set.seed(142)
```
## Read the data set
```{r Read data, echo=FALSE}
# read in cleaned version saved by EDA.Rmd
cs <- read.csv("cs-training-cleaned.csv")
# restore levels to factors
cs$SeriousDlqin2yrs <- factor(cs$SeriousDlqin2yrs,
levels = c(0, 1), labels = c("ok", "delinquent"))
#str(cs)
nb_samples <- nrow(cs)
x_vars = c(
'RevolvingUtilizationOfUnsecuredLines',
'age',
'NumberOfTime30.59DaysPastDueNotWorse',
'DebtRatio',
'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans',
'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines',
'NumberOfTime60.89DaysPastDueNotWorse',
'NumberOfDependents'
)
```
Out of the **`r formatC(nb_samples, format='d', big.mark=',')`** samples, the incidence of loan delinquency is **`r formatC(100 * sum(cs$SeriousDlqin2yrs == 'delinquent') / nb_samples, format='f', digits=2, big.mark=',')`%**.
## Split the data
Since the "test" dataset provided in the competition did not include a value for the dependent variable (`SeriousDlqin2yrs`), the models created will be tested and validated with a subset of the training data.
First, the competition training data is split into a Training set and a Test set:
```{r Split data}
# lower the proportion of training to have models build faster. ideally would
# set this higher to ~0.8, but that takes hours to train the random forest model
train_proportion <- .2
train_indices <- createDataPartition(
y=cs$SeriousDlqin2yrs,
p=train_proportion,
list=FALSE)
cs_train <- cs[train_indices, ]
cs_test <- cs[-train_indices, ]
```
Then the training set is further split from the Training set as a Validation set for the purpose of estimating Out-Of-Sample performance metrics:
```{r Further split training data}
valid_proportion_of_train <- 1 / 3
valid_indices <- createDataPartition(
y=cs_train$SeriousDlqin2yrs,
p=valid_proportion_of_train,
list=FALSE)
cs_valid <- cs_train[valid_indices, ]
cs_train <- cs_train[-valid_indices, ]
```
Just to demonstrate that the data was split representatively by **`createDataPartition`** from **`caret`**: the delinquency incidences in the Training, Validation and Test sets are **`r formatC(100 * sum(cs_train$SeriousDlqin2yrs == 'delinquent') / nrow(cs_train), format='f', digits=2, big.mark=',')`%**, **`r formatC(100 * sum(cs_valid$SeriousDlqin2yrs == 'delinquent') / nrow(cs_valid), format='f', digits=2, big.mark=',')`%** and **`r formatC(100 * sum(cs_test$SeriousDlqin2yrs == 'delinquent') / nrow(cs_test), format='f', digits=2, big.mark=',')`%** respectively.
Sample name | No. of Observations
------------|--------------------
cs\_train | `r nrow(cs_train)`
cs\_test | `r nrow(cs_test)`
cs\_valid | `r nrow(cs_valid)`
## Classification Models
There will be built three types of classification models: a Classification (CART) Tree, a Random Forest, and a Logistic Regression.
```{r caret parameters}
# set up some turning and cross-validation parameters that will be used across
# the classifiers
caret_optimized_metric <- 'logLoss' # equivalent to 1 / 2 of Deviance
caret_train_control <- trainControl(
classProbs=TRUE, # compute class probabilities
summaryFunction=mnLogLoss, # equivalent to 1 / 2 of Deviance
method='repeatedcv', # repeated Cross Validation
number=5, # 5 folds
repeats=2, # 2 repeats
allowParallel=FALSE)
```
### CART
```{r CART model 1st iter, message=FALSE, warning=FALSE}
# cp values
cp.grid = expand.grid( .cp = seq(0.001, 0.05, 0.001))
cart_model = train(
x=cs_train[, x_vars],
y=cs_train$SeriousDlqin2yrs,
method='rpart', # CART
metric=caret_optimized_metric,
trControl=caret_train_control,
tuneGrid = cp.grid
)
# cart_model
# show the tree for the tuned .cp value
cart_model.best.tree = cart_model$finalModel
prp(cart_model.best.tree)
cart_varImp <- varImp(cart_model)
plot(cart_varImp, main = "Variable importance in CART, 1st Iter.")
```
### Random Forest
```{r Random forest 1st iter, message=FALSE, warning=FALSE}
B <- 600
# http://topepo.github.io/caret/Random_Forest.html
rf_model <- train(
x=cs_train[, x_vars],
y=cs_train$SeriousDlqin2yrs,
method='rf', # Random Forest
metric=caret_optimized_metric,
ntree=B, # number of trees in the Random Forest
nodesize=100, # minimum node size set small enough to allow for complex trees,
# but not so small as to require too large B to eliminate high variance
importance=TRUE, # evaluate importance of predictors
keep.inbag=TRUE,
trControl=caret_train_control,
tuneGrid=NULL)
rf_varImp <- varImp(rf_model)
plot(rf_varImp, main = "Variable importance in RF, 1st Iter.")
```
### Logistic Regression
```{r logit 1st iter, message=FALSE, warning=FALSE}
log_reg_model <- train(
x=cs_train[, x_vars],
y=cs_train$SeriousDlqin2yrs,
preProcess=c('center', 'scale'),
method='plr', # Penalized Logistic Regression
metric=caret_optimized_metric,
trControl=caret_train_control,
tuneGrid=expand.grid(
lambda=0, # weight penalty parameter
cp='aic')) # complexity parameter (AIC / BIC)
log_reg_varImp <- varImp(log_reg_model)
plot(log_reg_varImp, main = "Variable importance in LogReg, 1st Iter.")
```
## Model prediction comparison
In order to see which model seems to have the best prediction characteristics, create and chart ROC curves for each model based on the sample test data we kept in reserve for validation.
``` {r Create OOS predictions for ROC curve, echo=FALSE}
low_prob <- 1e-6
high_prob <- 1 - low_prob
log_low_prob <- log(low_prob)
log_high_prob <- log(high_prob)
log_prob_thresholds <- seq(from=log_low_prob, to=log_high_prob, length.out=1000)
prob_thresholds <- exp(log_prob_thresholds)
# "bin_classif_eval" function is from the "EvaluationMetrics.R" helper script.
# It calculates a sequence of $specificity and $sensitivity at numerous points
# in the probability prediction
cart_pred_probs <- predict(
cart_model, newdata=cs_valid[ , x_vars], type='prob')
cart_oos_performance <- bin_classif_eval(
cart_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
rf_pred_probs <- predict(
rf_model, newdata=cs_valid[ , x_vars], type='prob')
rf_oos_performance <- bin_classif_eval(
rf_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
log_reg_pred_probs <- predict(
log_reg_model, newdata=cs_valid[, x_vars], type='prob')
log_reg_oos_performance <- bin_classif_eval(
log_reg_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
```
### Plot of ROC curves
``` {r Plot ROC 1st iter, message=FALSE, warning=FALSE, echo=FALSE}
plot(x = 1 - rf_oos_performance$specificity,
y = rf_oos_performance$sensitivity,
type = "l", col='darkgreen', lwd=2,
xlim = c(0., 1.), ylim = c(0., 1.),
main = "ROC Curves (Validation Data) - 1st Iter",
xlab = "1 - Specificity", ylab = "Sensitivity")
abline(a=0,b=1,lty=2,col=8)
lines(x=1 - cart_oos_performance$specificity,
y=cart_oos_performance$sensitivity,
col='red', lwd=2)
lines(x=1 - log_reg_oos_performance$specificity,
y=log_reg_oos_performance$sensitivity,
col='green', lwd=2)
legend('right', c('Random Forest', 'Logistic Regression', 'CART'),
lty=1, col=c('darkgreen', 'green', 'red'), lwd=2.)
# save a plot to PNG
dev.copy(png,"ROC1a.png",width=6,height=4.5,units="in",res=200)
dev.off()
```
### Model variable importance plots
``` {r}
# impVars <- row.names(log_reg_varImp$importance)
png("varImp1a.png", width=12, height=4, units="in", res=300)
p1 <- plot(log_reg_varImp, main = "LogReg, 1st Iter.")
p2 <- plot(cart_varImp, main = "CART, 1st Iter.")
p3 <- plot(rf_varImp, main = "RF, 1st Iter.")
grid.arrange(p1, p2, p3, nrow = 1)
dev.off()
```
## predictions
A sensistivity threshold of **`75`%** was chosen, subjectively, meaning it is desired of the model parameters to catch **75%** of the *actual* delinquency cases.
#### Logistic regression
```{r}
log_reg_sensitivity_threshold <- .75
log_reg_i <- min(which(log_reg_oos_performance$sensitivity < log_reg_sensitivity_threshold)) - 1
log_reg_selected_prob_threshold <- prob_thresholds[log_reg_i]
```
The selected decision threshold is **`r formatC(log_reg_selected_prob_threshold, format='f', digits=3)`** – meaning when the logistic regression model is used to predict on new data, it will predict "Delinquent" when the predicted probability exceeds that threshold.
#### Random forest
```{r}
rf_sensitivity_threshold <- .75
rf_i <- min(which(rf_oos_performance$sensitivity < rf_sensitivity_threshold)) - 1
rf_selected_prob_threshold <- prob_thresholds[rf_i]
```
The selected decision threshold for the Random Forest model is **`r formatC(rf_selected_prob_threshold, format='f', digits=3)`**.
#### CART
```{r}
cart_sensitivity_threshold <- .30
cart_i <- min(which(cart_oos_performance$sensitivity < cart_sensitivity_threshold)) - 1
cart_selected_prob_threshold <- prob_thresholds[cart_i]
# cart_selected_prob_threshold
```
The selected decision threshold is **`r formatC(cart_selected_prob_threshold, format='f', digits=3)`** – meaning when the CART model is used to predict on new data, it will predict "Delinquent" when the predicted probability exceeds that threshold.
## Testing Performance of Model
Calculate the performance on the test split of data from the original dataset on the respective model.
First a helper function to display Accuracy, Sensitivity, and Specificity is created.
``` {r helper funcs, echo=FALSE}
# helper function to display Accuracy, Sensitivity, and Specificity
display_accu_sens_spec <- function(iperf) {
if (is.numeric(iperf)) {
s1 <-
c(formatC(iperf[1], format='f', digits=4),
formatC(iperf[2], format='f', digits=4),
formatC(iperf[3], format='f', digits=4)
)
} else {
s1 <-
c("Accuracy:", formatC(iperf$accuracy, format='f', digits=4),
"Sensitivity:", formatC(iperf$sensitivity, format='f', digits=4),
"specificity:", formatC(iperf$specificity, format='f', digits=4)
)
}
s1
}
# Show Confusion Matrix
####################
# preds - predictions (as probability)
# thresh - thresh to divide probability into one classs or another
# s - string to identify
showCM <- function (preds, thresh, s) {
print(paste0("Confusion matrix for ", s, " predictions at prob threshold of ", formatC(thresh, format='f', digits=3)))
# confusion matrix
pred <- ifelse(preds[, "delinquent"] > thresh, 1, 0)
pred <- factor(pred, levels = c(0, 1), labels = c("ok", "delinquent"))
cm <- confusionMatrix(pred, cs_test$SeriousDlqin2yrs, positive = "delinquent")
as.table(cm)
}
```
### CART model
Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(cart_selected_prob_threshold, format='f', digits=3)`**:
```{r}
cart_test_pred_probs <- predict(
cart_model, newdata=cs_test[, x_vars], type='prob')
cart_test_performance <- bin_classif_eval(
cart_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=cart_selected_prob_threshold)
# expected predictive performance
display_accu_sens_spec(cart_oos_performance[cart_i, ])
# tested performance
display_accu_sens_spec(cart_test_performance)
showCM(cart_test_pred_probs, cart_selected_prob_threshold, "cart")
```
The *accuracy* is not great, but the test split data performance is similar to its expected performance.
### Logit (Logistic regression) model
Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(log_reg_selected_prob_threshold, format='f', digits=3)`**:
```{r}
log_reg_test_pred_probs <- predict(
log_reg_model, newdata=cs_test[, x_vars], type='prob')
log_reg_test_performance <- bin_classif_eval(
log_reg_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=log_reg_selected_prob_threshold)
# expected predictive performance
display_accu_sens_spec(log_reg_oos_performance[log_reg_i, ])
# tested performance
display_accu_sens_spec(log_reg_test_performance)
showCM(log_reg_test_pred_probs, log_reg_selected_prob_threshold, "log_reg")
```
The *accuracy* is not great, but the test split data performance is similar to its expected performance.
### Random Forest model
Evaluating the performance of the selected Random Forest model, with a decision threshold at **`r formatC(rf_selected_prob_threshold, format='f', digits=3)`**:
```{r}
rf_test_pred_probs <- predict(
rf_model, newdata=cs_test[, x_vars], type='prob')
rf_test_performance <- bin_classif_eval(
rf_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=rf_selected_prob_threshold)
# expected predicted performance
display_accu_sens_spec(rf_oos_performance[rf_i, ])
# tested performance
display_accu_sens_spec(rf_test_performance)
showCM(rf_test_pred_probs, rf_selected_prob_threshold, "rf")
```
### Cumulative gains, lift charts and AUC
``` {r}
# prediction() needs this to be numeric (not factor SeriousDlqin2yrs)
cs_test_y_numeric <- as.integer(cs_test$SeriousDlqin2yrs)
cs_test_y_numeric <- ifelse(cs_test_y_numeric == 2, 1, 0)
png("cart_gain.png", width=6, height=4.5, units="in", res=100)
cart_rocr <- prediction(cart_test_pred_probs[,2], cs_test_y_numeric)
cart_gain <- performance(cart_rocr, "tpr", "rpp")
p1 <- plot(cart_gain, colorize=TRUE, main = "Gain chart for CART 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("cart_lift.png", width=6, height=4.5, units="in", res=100)
cart_lift <- performance(cart_rocr, "lift", "rpp")
plot(cart_lift, main="Lift curve for CART 1st Iter", colorize=T)
dev.off()
png("rf_gain.png", width=6, height=4.5, units="in", res=100)
rf_rocr <- prediction(rf_test_pred_probs[,2], cs_test_y_numeric)
rf_gain <- performance(rf_rocr, "tpr", "rpp")
p2 <- plot(rf_gain, colorize=TRUE, main = "Gain chart for RF 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("rf_lift.png", width=6, height=4.5, units="in", res=100)
rf_lift <- performance(rf_rocr, "lift", "rpp")
plot(rf_lift, main="Lift curve for RF 1st Iter", colorize=T)
dev.off()
png("logreg_gain.png", width=6, height=4.5, units="in", res=100)
log_reg_rocr <- prediction(log_reg_test_pred_probs[,2], cs_test_y_numeric)
log_reg_gain <- performance(log_reg_rocr, "tpr", "rpp")
p3 <- plot(log_reg_gain, colorize=TRUE, main = "Gain chart for LogReg 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("logreg_lift.png", width=6, height=4.5, units="in", res=100)
log_reg_lift <- performance(log_reg_rocr, "lift", "rpp")
plot(rf_lift, main="Lift curve for LogReg 1st Iter", colorize=T)
dev.off()
cart_auc <- performance(cart_rocr, "auc")@y.values
cart_auc
rf_auc <- performance(rf_rocr, "auc")@y.values
rf_auc
log_reg_auc <- performance(log_reg_rocr, "auc")@y.values
log_reg_auc
```
The Random Forest model has a higher *accuracy* at the chosen sensitivity threshold. The performance comparison on the test slice is in the expected performance ballpark, within 1% or so.
So the models seem to be able to predict reasonably for "new" data and are not overtrained.
# Build models using features variables
Using some constructed feature variables, the classifier models are rebuilt.
``` {r}
# Remove the non-linear DebtRatio and add MonthlyExpenses and NetMonthlySurplus.
# Replace the three deliquency variables with the single
# ConsolidatedNumberOfDaysPastDue
x_vars_features = c(
'RevolvingUtilizationOfUnsecuredLines',
'age',
# 'NumberOfTime30.59DaysPastDueNotWorse',
# 'NumberOfTime60.89DaysPastDueNotWorse',
# 'NumberOfTimes90DaysLate',
'MonthlyIncome',
'MonthlyExpenses',
'NetMonthlySurplus',
'NumberOfOpenCreditLinesAndLoans',
'NumberRealEstateLoansOrLines',
'NumberOfDependents',
'ConsolidatedNumberOfDaysPastDue'
)
```
### CART 2nd iter
```{r CART model 2nd iter, message=FALSE, warning=FALSE}
# cp values
### cp.grid = expand.grid( .cp = seq(0.001, 0.010, 0.001))
# changed to a fixed value since randomness was changing this to a worse
# predicting model
cp.grid = expand.grid( .cp = c(0.003))
cart_model2 = train(
x=cs_train[, x_vars_features],
y=cs_train$SeriousDlqin2yrs,
method='rpart', # CART
metric=caret_optimized_metric,
trControl=caret_train_control,
tuneGrid = cp.grid
)
cart_model2
#str(cart_model2)
# show the tree for the tuned .cp value
cart_model2.best.tree = cart_model2$finalModel
prp(cart_model2.best.tree)
cart2_varImp <- varImp(cart_model2)
plot(cart2_varImp, main = "Variable importance in CART, 2nd Iter.")
```
### Random Forest 2nd iter
``` {r Random forest 2nd iter, message=FALSE, warning=FALSE}
B <- 600
# http://topepo.github.io/caret/Random_Forest.html
rf_model2 <- train(
x=cs_train[, x_vars_features],
y=cs_train$SeriousDlqin2yrs,
method='rf', # Random Forest
metric=caret_optimized_metric,
ntree=B, # number of trees in the Random Forest
nodesize=100, # minimum node size set small enough to allow for complex trees,
# but not so small as to require too large B to eliminate high variance
importance=TRUE, # evaluate importance of predictors
keep.inbag=TRUE,
trControl=caret_train_control,
tuneGrid=NULL)
rf2_varImp <- varImp(rf_model2)
plot(rf2_varImp, main = "Variable importance in RF, 2nd Iter.")
```
### Logistic Regression 2nd iter
``` {r logit 2nd iter, message=FALSE, warning=FALSE}
log_reg_model2 <- train(
x=cs_train[, x_vars_features],
y=cs_train$SeriousDlqin2yrs,
preProcess=c('center', 'scale'),
method='plr', # Penalized Logistic Regression
metric=caret_optimized_metric,
trControl=caret_train_control,
tuneGrid=expand.grid(
lambda=0, # weight penalty parameter
cp='aic')) # complexity parameter (AIC / BIC)
log_reg2_varImp <- varImp(log_reg_model2)
plot(log_reg2_varImp, main = "Variable importance in LogReg, 2nd Iter.")
```
### Model variable importance plots
``` {r}
# impVars <- row.names(log_reg2_varImp$importance)
png("varImp2a.png", width=12, height=4, units="in", res=300)
p1 <- plot(log_reg2_varImp, main = "LogReg, 2nd Iter.")
p2 <- plot(cart2_varImp, main = "CART, 2nd Iter.")
p3 <- plot(rf2_varImp, main = "RF, 2nd Iter.")
grid.arrange(p1, p2, p3, nrow = 1)
dev.off()
```
## Model prediction comparison
In order to see which model seems to have the best prediction characteristics, create and chart ROC curves for each model based on the sample test data we kept in reserve for validation.
``` {r Create OOS predictions 2nd iter, echo=FALSE}
cart2_pred_probs <- predict(
cart_model2, newdata=cs_valid[ , x_vars_features], type='prob')
cart2_oos_performance <- bin_classif_eval(
cart2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
rf2_pred_probs <- predict(
rf_model2, newdata=cs_valid[ , x_vars_features], type='prob')
rf2_oos_performance <- bin_classif_eval(
rf2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
log_reg2_pred_probs <- predict(
log_reg_model2, newdata=cs_valid[, x_vars_features], type='prob')
log_reg2_oos_performance <- bin_classif_eval(
log_reg2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
```
### Plot of ROC curves
``` {r Plot ROC 2nd iter, message=FALSE, warning=FALSE, echo=FALSE}
plot(x = 1 - rf2_oos_performance$specificity,
y = rf2_oos_performance$sensitivity,
type = "l", col='darkgreen', lwd=2,
xlim = c(0., 1.), ylim = c(0., 1.),
main = "ROC Curves (Validation Data) - 2nd Iter",
xlab = "1 - Specificity", ylab = "Sensitivity")
abline(a=0,b=1,lty=2,col=8)
lines(x=1 - cart2_oos_performance$specificity,
y=cart2_oos_performance$sensitivity,
col='red', lwd=2)
lines(x=1 - log_reg2_oos_performance$specificity,
y=log_reg2_oos_performance$sensitivity,
col='green', lwd=2)
legend('right', c('Random Forest', 'Logistic Regression', 'CART'),
lty=1, col=c('darkgreen', 'green', 'red'), lwd=2.)
# save a plot to PNG
dev.copy(png,"ROC2a.png",width=6,height=4.5,units="in",res=200)
dev.off()
```
The CART model does significantly better this time using the engineered features than it did with the original variables, even matching or exceeding the Random Forest at higher specificity. The logisitic regression eventual catches up to the CART model at higher sensitivity. Again, however, the Random Forest model has the highest AUC among the three models.
## Predictions performance on 2nd iter
Again, a sensistivity threshold of **`75`%** is chosen.
#### Logistic regression
```{r echo=FALSE}
log_reg2_sensitivity_threshold <- .75
log_reg2_i <- min(which(log_reg2_oos_performance$sensitivity < log_reg2_sensitivity_threshold)) - 1
log_reg2_selected_prob_threshold <- prob_thresholds[log_reg2_i]
```
The selected decision threshold is **`r formatC(log_reg2_selected_prob_threshold, format='f', digits=3)`**.
#### Random forest
```{r echo=FALSE}
rf2_sensitivity_threshold <- .75
rf2_i <- min(which(rf2_oos_performance$sensitivity < rf_sensitivity_threshold)) - 1
rf2_selected_prob_threshold <- prob_thresholds[rf2_i]
```
The selected decision threshold for the Random Forest model is **`r formatC(rf2_selected_prob_threshold, format='f', digits=3)`**.
#### CART
```{r echo=FALSE}
cart2_sensitivity_threshold <- .75
cart2_i <- min(which(cart2_oos_performance$sensitivity < cart2_sensitivity_threshold)) - 1
cart2_selected_prob_threshold <- prob_thresholds[cart2_i]
#cart2_selected_prob_threshold
```
The selected decision threshold is **`r formatC(cart2_selected_prob_threshold, format='f', digits=3)`**.
## Evaluating 2nd iteration model performance
### Logit (Logistic regression) model 2nd iter
Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(log_reg2_selected_prob_threshold, format='f', digits=3)`**:
```{r}
log_reg2_test_pred_probs <- predict(
log_reg_model2, newdata=cs_test[, x_vars_features], type='prob')
log_reg2_test_performance <- bin_classif_eval(
log_reg2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=log_reg2_selected_prob_threshold)
# expected predictive performance
display_accu_sens_spec(log_reg2_oos_performance[log_reg2_i, ])
# tested performance
display_accu_sens_spec(log_reg2_test_performance)
showCM(log_reg2_test_pred_probs, log_reg2_selected_prob_threshold, "log_reg2")
```
The *accuracy* degrades slightly over the first iteration model, and the test split data performance is similar to its expected performance.
### Random Forest model 2nd iter
Evaluating the performance of the selected Random Forest model, with a decision threshold at **`r formatC(rf2_selected_prob_threshold, format='f', digits=3)`**:
```{r}
rf2_test_pred_probs <- predict(
rf_model2, newdata=cs_test[, x_vars_features], type='prob')
rf2_test_performance <- bin_classif_eval(
rf2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=rf2_selected_prob_threshold)
# expected predicted performance
display_accu_sens_spec(rf2_oos_performance[rf2_i, ])
# tested performance
display_accu_sens_spec(rf2_test_performance)
showCM(rf2_test_pred_probs, rf2_selected_prob_threshold, "rf2")
```
The Random Forest model has a higher *accuracy* at the chosen sensitivity threshold. The performance comparison on the test slice is in the expected performance ballpark, within 1% or so.
### CART model 2nd iter
Evaluating the performance of the selected CART model, with a decision threshold at **`r formatC(cart2_selected_prob_threshold, format='f', digits=3)`**:
```{r}
cart2_test_pred_probs <- predict(
cart_model2, newdata=cs_test[, x_vars_features], type='prob')
cart2_test_performance <- bin_classif_eval(
cart2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=cart2_selected_prob_threshold)
# expected predicted performance
display_accu_sens_spec(cart2_oos_performance[cart2_i, ])
# tested performance
display_accu_sens_spec(cart2_test_performance)
showCM(cart2_test_pred_probs, cart2_selected_prob_threshold, "cart2")
```
The CART model is in the running this time. The prediction performance comparison on the test slice is reasonable.
### Cumulative gains and lift charts, AUC
``` {r}
# prediction() needs this to be numeric (not factor SeriousDlqin2yrs)
cs_test_y_numeric <- as.integer(cs_test$SeriousDlqin2yrs)
cs_test_y_numeric <- ifelse(cs_test_y_numeric == 2, 1, 0)
png("cart2_gain.png", width=6, height=4.5, units="in", res=100)
cart2_rocr <- prediction(cart2_test_pred_probs[,2], cs_test_y_numeric)
cart2_gain <- performance(cart2_rocr, "tpr", "rpp")
p1 <- plot(cart2_gain, colorize=TRUE, main = "Gain chart for CART 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("cart2_lift.png", width=6, height=4.5, units="in", res=100)
cart2_lift <- performance(cart2_rocr, "lift", "rpp")
plot(cart2_lift, main="Lift curve for CART 2nd Iter", colorize=T)
dev.off()
png("rf2_gain.png", width=6, height=4.5, units="in", res=100)
rf2_rocr <- prediction(rf2_test_pred_probs[,2], cs_test_y_numeric)
rf2_gain <- performance(rf2_rocr, "tpr", "rpp")
p2 <- plot(rf2_gain, colorize=TRUE, main = "Gain chart for RF 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("rf2_lift.png", width=6, height=4.5, units="in", res=100)
rf2_lift <- performance(rf2_rocr, "lift", "rpp")
plot(rf2_lift, main="Lift curve for RF 2nd Iter", colorize=T)
dev.off()
png("logreg2_gain.png", width=6, height=4.5, units="in", res=100)
log_reg2_rocr <- prediction(log_reg2_test_pred_probs[,2], cs_test_y_numeric)
log_reg2_gain <- performance(log_reg2_rocr, "tpr", "rpp")
p3 <- plot(log_reg2_gain, colorize=TRUE, main = "Gain chart for LogReg 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()
png("logreg2_lift.png", width=6, height=4.5, units="in", res=100)
log_reg2_lift <- performance(log_reg2_rocr, "lift", "rpp")
plot(log_reg2_lift, main="Lift curve for LogReg 2nd Iter", colorize=T)
dev.off()
cart2_auc <- performance(cart2_rocr, "auc")@y.values
cart2_auc
rf2_auc <- performance(rf2_rocr, "auc")@y.values
rf2_auc
log_reg2_auc <- performance(log_reg2_rocr, "auc")@y.values
log_reg2_auc
```