capstone/GiveMeSomeCredit/Modeling.Rmd

---
title: "Modeling - GiveMeSomeCredit"
author: "David Crook"
date: "November 6, 2015"
output: html_document
fontsize: 12
geometry: margin=0.6in
---

View output [Modeling.html](https://idcrook.github.io/SR_Foundations_DS_Fall_2015/capstone/GiveMeSomeCredit/Modeling.html).

## Setup


```{r Setup, message=FALSE, warning=FALSE, echo=FALSE}
setwd("~/projects/Classes/FoundationsOfDataScience_sliderule/github/capstone/GiveMeSomeCredit")

# need these for decision tree graph
library(rpart)
library(rpart.plot)

suppressMessages(library(dplyr)) # summarise
library(caret)
library(data.table)
#install.packages("stepPlr")
#install.packages("pROC")  # needed for varImp on logreg
library(gridExtra)
library(ROCR)

source('EvaluationMetrics.R')
```

``` {r Set seed}
# set randomizer's seed
set.seed(142)
```


## Read the data set

```{r Read data, echo=FALSE}
# read in cleaned version saved by EDA.Rmd
cs <- read.csv("cs-training-cleaned.csv")

# restore levels to factors
cs$SeriousDlqin2yrs <- factor(cs$SeriousDlqin2yrs, 
                              levels = c(0, 1), labels = c("ok", "delinquent"))
#str(cs)

nb_samples <- nrow(cs)

x_vars = c(
  'RevolvingUtilizationOfUnsecuredLines',
  'age',
  'NumberOfTime30.59DaysPastDueNotWorse',
  'DebtRatio',
  'MonthlyIncome',
  'NumberOfOpenCreditLinesAndLoans',
  'NumberOfTimes90DaysLate',
  'NumberRealEstateLoansOrLines',
  'NumberOfTime60.89DaysPastDueNotWorse',
  'NumberOfDependents'
  )
```

Out of the **`r formatC(nb_samples, format='d', big.mark=',')`** samples, the incidence of loan delinquency is **`r formatC(100 * sum(cs$SeriousDlqin2yrs == 'delinquent') / nb_samples, format='f', digits=2, big.mark=',')`%**.

## Split the data

Since the "test" dataset provided in the competition did not include a value for the dependent variable (`SeriousDlqin2yrs`), the models created will be tested and validated with a subset of the training data.

First, the competition training data is split into a Training set and a Test set:

```{r Split data}
# lower the proportion of training to have models build faster. ideally would
# set this higher to ~0.8, but that takes hours to train the random forest model
train_proportion <- .2
train_indices <- createDataPartition(
  y=cs$SeriousDlqin2yrs,
  p=train_proportion,
  list=FALSE)

cs_train <- cs[train_indices, ]
cs_test <- cs[-train_indices, ]
```

Then the training set is further split from the Training set as a Validation set for the purpose of estimating Out-Of-Sample performance metrics:

```{r Further split training data}
valid_proportion_of_train <- 1 / 3
valid_indices <- createDataPartition(
  y=cs_train$SeriousDlqin2yrs,
  p=valid_proportion_of_train,
  list=FALSE)

cs_valid <- cs_train[valid_indices, ]
cs_train <- cs_train[-valid_indices, ]
```

Just to demonstrate that the data was split representatively by **`createDataPartition`** from **`caret`**: the delinquency incidences in the Training, Validation and Test sets are **`r formatC(100 * sum(cs_train$SeriousDlqin2yrs == 'delinquent') / nrow(cs_train), format='f', digits=2, big.mark=',')`%**, **`r formatC(100 * sum(cs_valid$SeriousDlqin2yrs == 'delinquent') / nrow(cs_valid), format='f', digits=2, big.mark=',')`%** and **`r formatC(100 * sum(cs_test$SeriousDlqin2yrs == 'delinquent') / nrow(cs_test), format='f', digits=2, big.mark=',')`%** respectively.

Sample name | No. of Observations
------------|--------------------
cs\_train    | `r nrow(cs_train)`
cs\_test     | `r nrow(cs_test)`
cs\_valid    | `r nrow(cs_valid)`

## Classification Models

There will be built three types of classification models: a Classification (CART) Tree, a Random Forest, and a Logistic Regression.


```{r caret parameters}

# set up some turning and cross-validation parameters that will be used across
# the classifiers
caret_optimized_metric <- 'logLoss'   # equivalent to 1 / 2 of Deviance

caret_train_control <- trainControl(
  classProbs=TRUE,             # compute class probabilities
  summaryFunction=mnLogLoss,   # equivalent to 1 / 2 of Deviance
  method='repeatedcv',         # repeated Cross Validation
  number=5,                    # 5 folds
  repeats=2,                   # 2 repeats
  allowParallel=FALSE)
```

### CART

```{r CART model 1st iter, message=FALSE, warning=FALSE}

# cp values
cp.grid = expand.grid( .cp = seq(0.001, 0.05, 0.001))

cart_model = train(
  x=cs_train[, x_vars],
  y=cs_train$SeriousDlqin2yrs,
  method='rpart',     # CART
  metric=caret_optimized_metric,
  trControl=caret_train_control,
  tuneGrid = cp.grid
  )
# cart_model

# show the tree for the tuned .cp value 
cart_model.best.tree = cart_model$finalModel
prp(cart_model.best.tree)

cart_varImp <- varImp(cart_model)
plot(cart_varImp, main = "Variable importance in CART, 1st Iter.")
```


### Random Forest

```{r Random forest 1st iter, message=FALSE, warning=FALSE}
B <- 600

# http://topepo.github.io/caret/Random_Forest.html
rf_model <- train(
  x=cs_train[, x_vars],
  y=cs_train$SeriousDlqin2yrs,
  method='rf',        # Random Forest
  metric=caret_optimized_metric,
  ntree=B,            # number of trees in the Random Forest
  nodesize=100,       # minimum node size set small enough to allow for complex trees,
                      # but not so small as to require too large B to eliminate high variance
  importance=TRUE,    # evaluate importance of predictors
  keep.inbag=TRUE,
  trControl=caret_train_control,
  tuneGrid=NULL)

rf_varImp <- varImp(rf_model)
plot(rf_varImp, main = "Variable importance in RF, 1st Iter.")
```

### Logistic Regression

```{r logit 1st iter, message=FALSE, warning=FALSE}
log_reg_model <- train(
  x=cs_train[, x_vars],
  y=cs_train$SeriousDlqin2yrs,
  preProcess=c('center', 'scale'), 
  method='plr',       # Penalized Logistic Regression
  metric=caret_optimized_metric,
  trControl=caret_train_control,
  tuneGrid=expand.grid(
    lambda=0,   # weight penalty parameter
    cp='aic'))     # complexity parameter (AIC / BIC)

log_reg_varImp <- varImp(log_reg_model)
plot(log_reg_varImp, main = "Variable importance in LogReg, 1st Iter.")
```

## Model prediction comparison

In order to see which model seems to have the best prediction characteristics, create and chart ROC curves for each model based on the sample test data we kept in reserve for validation.

``` {r Create OOS predictions for ROC curve, echo=FALSE}
low_prob <- 1e-6
high_prob <- 1 - low_prob
log_low_prob <- log(low_prob)
log_high_prob <- log(high_prob)
log_prob_thresholds <- seq(from=log_low_prob, to=log_high_prob, length.out=1000)
prob_thresholds <- exp(log_prob_thresholds)

# "bin_classif_eval" function is from the "EvaluationMetrics.R" helper script. 
# It calculates a sequence of $specificity and $sensitivity at numerous points
# in the probability prediction

cart_pred_probs <- predict(
  cart_model, newdata=cs_valid[ , x_vars], type='prob')
cart_oos_performance <- bin_classif_eval(
  cart_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

rf_pred_probs <- predict(
  rf_model, newdata=cs_valid[ , x_vars], type='prob')
rf_oos_performance <- bin_classif_eval(
  rf_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

log_reg_pred_probs <- predict(
  log_reg_model, newdata=cs_valid[, x_vars], type='prob')
log_reg_oos_performance <- bin_classif_eval(
  log_reg_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

```


### Plot of ROC curves

``` {r Plot ROC 1st iter, message=FALSE, warning=FALSE, echo=FALSE}
plot(x = 1 - rf_oos_performance$specificity,
     y = rf_oos_performance$sensitivity,
     type = "l", col='darkgreen', lwd=2,
     xlim = c(0., 1.), ylim = c(0., 1.),
     main = "ROC Curves (Validation Data) - 1st Iter",
     xlab = "1 - Specificity", ylab = "Sensitivity")
abline(a=0,b=1,lty=2,col=8)
lines(x=1 - cart_oos_performance$specificity,
      y=cart_oos_performance$sensitivity,
      col='red', lwd=2)
lines(x=1 - log_reg_oos_performance$specificity,
      y=log_reg_oos_performance$sensitivity,
      col='green', lwd=2)
legend('right', c('Random Forest', 'Logistic Regression', 'CART'), 
    lty=1, col=c('darkgreen', 'green', 'red'), lwd=2.)

# save a plot to PNG
dev.copy(png,"ROC1a.png",width=6,height=4.5,units="in",res=200)
dev.off()
```

### Model variable importance plots

``` {r}
# impVars <- row.names(log_reg_varImp$importance)

png("varImp1a.png", width=12, height=4, units="in", res=300)
p1 <- plot(log_reg_varImp, main = "LogReg, 1st Iter.")
p2 <- plot(cart_varImp, main = "CART, 1st Iter.")
p3 <- plot(rf_varImp, main = "RF, 1st Iter.")
grid.arrange(p1, p2, p3, nrow = 1)
dev.off()
```


## predictions

A sensistivity threshold of **`75`%** was chosen, subjectively, meaning it is desired of the model parameters to catch **75%** of the *actual* delinquency cases.


#### Logistic regression

```{r}
log_reg_sensitivity_threshold <- .75
log_reg_i <- min(which(log_reg_oos_performance$sensitivity < log_reg_sensitivity_threshold)) - 1
log_reg_selected_prob_threshold <- prob_thresholds[log_reg_i]
```

The selected decision threshold is **`r formatC(log_reg_selected_prob_threshold, format='f', digits=3)`** &ndash; meaning when the logistic regression model is used to predict on new data, it will predict "Delinquent" when the predicted probability exceeds that threshold.

#### Random forest

```{r}
rf_sensitivity_threshold <- .75
rf_i <- min(which(rf_oos_performance$sensitivity < rf_sensitivity_threshold)) - 1
rf_selected_prob_threshold <- prob_thresholds[rf_i]
```

The selected decision threshold for the Random Forest model is **`r formatC(rf_selected_prob_threshold, format='f', digits=3)`**.

#### CART

```{r}
cart_sensitivity_threshold <- .30
cart_i <- min(which(cart_oos_performance$sensitivity < cart_sensitivity_threshold)) - 1
cart_selected_prob_threshold <- prob_thresholds[cart_i]
# cart_selected_prob_threshold
```

The selected decision threshold is **`r formatC(cart_selected_prob_threshold, format='f', digits=3)`** &ndash; meaning when the CART model is used to predict on new data, it will predict "Delinquent" when the predicted probability exceeds that threshold.


## Testing Performance of Model

Calculate the performance on the test split of data from the original dataset on the respective model.
First a helper function to display Accuracy, Sensitivity, and Specificity is created.

``` {r helper funcs, echo=FALSE}

# helper function to display Accuracy, Sensitivity, and Specificity
display_accu_sens_spec <- function(iperf) {
  if (is.numeric(iperf)) {
    s1 <- 
    c(formatC(iperf[1], format='f', digits=4),
      formatC(iperf[2], format='f', digits=4),
      formatC(iperf[3], format='f', digits=4)
    )
  } else {
    s1 <- 
    c("Accuracy:",    formatC(iperf$accuracy, format='f', digits=4),
      "Sensitivity:", formatC(iperf$sensitivity, format='f', digits=4),
      "specificity:", formatC(iperf$specificity, format='f', digits=4)
    )
  }
  s1
}


# Show Confusion Matrix
####################
# preds - predictions (as probability)
# thresh - thresh to divide probability into one classs or another
# s - string to identify
showCM <- function (preds, thresh, s) {
  print(paste0("Confusion matrix for ", s, " predictions at prob threshold of ", formatC(thresh, format='f', digits=3)))
  # confusion matrix
  pred <- ifelse(preds[, "delinquent"] > thresh, 1, 0)
  pred <- factor(pred, levels = c(0, 1), labels = c("ok", "delinquent"))
  cm <- confusionMatrix(pred, cs_test$SeriousDlqin2yrs, positive = "delinquent")
  as.table(cm)
}


```

### CART model 

Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(cart_selected_prob_threshold, format='f', digits=3)`**:

```{r}
cart_test_pred_probs <- predict(
  cart_model, newdata=cs_test[, x_vars], type='prob')

cart_test_performance <- bin_classif_eval(
  cart_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=cart_selected_prob_threshold)

# expected predictive performance 
display_accu_sens_spec(cart_oos_performance[cart_i, ])

# tested performance 
display_accu_sens_spec(cart_test_performance)

showCM(cart_test_pred_probs, cart_selected_prob_threshold, "cart")
```

The *accuracy* is not great, but the test split data performance is similar to its expected performance.

### Logit (Logistic regression) model 

Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(log_reg_selected_prob_threshold, format='f', digits=3)`**:

```{r}
log_reg_test_pred_probs <- predict(
  log_reg_model, newdata=cs_test[, x_vars], type='prob')

log_reg_test_performance <- bin_classif_eval(
  log_reg_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=log_reg_selected_prob_threshold)

# expected predictive performance 
display_accu_sens_spec(log_reg_oos_performance[log_reg_i, ])

# tested performance 
display_accu_sens_spec(log_reg_test_performance)

showCM(log_reg_test_pred_probs, log_reg_selected_prob_threshold, "log_reg")
```

The *accuracy* is not great, but the test split data performance is similar to its expected performance.

### Random Forest model 

Evaluating the performance of the selected Random Forest model, with a decision threshold at **`r formatC(rf_selected_prob_threshold, format='f', digits=3)`**:

```{r}
rf_test_pred_probs <- predict(
  rf_model, newdata=cs_test[, x_vars], type='prob')

rf_test_performance <- bin_classif_eval(
  rf_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=rf_selected_prob_threshold)

# expected predicted performance 
display_accu_sens_spec(rf_oos_performance[rf_i, ])

# tested performance 
display_accu_sens_spec(rf_test_performance)

showCM(rf_test_pred_probs, rf_selected_prob_threshold, "rf")
```


### Cumulative gains, lift charts and AUC

``` {r}

# prediction() needs this to be numeric (not factor SeriousDlqin2yrs)
cs_test_y_numeric <- as.integer(cs_test$SeriousDlqin2yrs)
cs_test_y_numeric <- ifelse(cs_test_y_numeric == 2, 1, 0)

png("cart_gain.png", width=6, height=4.5, units="in", res=100)
cart_rocr <- prediction(cart_test_pred_probs[,2], cs_test_y_numeric)
cart_gain <- performance(cart_rocr, "tpr", "rpp")
p1 <- plot(cart_gain, colorize=TRUE, main = "Gain chart for CART 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("cart_lift.png", width=6, height=4.5, units="in", res=100)
cart_lift <- performance(cart_rocr, "lift", "rpp")
plot(cart_lift, main="Lift curve for CART 1st Iter", colorize=T)
dev.off()

png("rf_gain.png", width=6, height=4.5, units="in", res=100)
rf_rocr <- prediction(rf_test_pred_probs[,2], cs_test_y_numeric)
rf_gain <- performance(rf_rocr, "tpr", "rpp")
p2 <- plot(rf_gain, colorize=TRUE, main = "Gain chart for RF 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("rf_lift.png", width=6, height=4.5, units="in", res=100)
rf_lift <- performance(rf_rocr, "lift", "rpp")
plot(rf_lift, main="Lift curve for RF 1st Iter", colorize=T)
dev.off()

png("logreg_gain.png", width=6, height=4.5, units="in", res=100)
log_reg_rocr <- prediction(log_reg_test_pred_probs[,2], cs_test_y_numeric)
log_reg_gain <- performance(log_reg_rocr, "tpr", "rpp")
p3 <- plot(log_reg_gain, colorize=TRUE, main = "Gain chart for LogReg 1st Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("logreg_lift.png", width=6, height=4.5, units="in", res=100)
log_reg_lift <- performance(log_reg_rocr, "lift", "rpp")
plot(rf_lift, main="Lift curve for LogReg 1st Iter", colorize=T)
dev.off()

cart_auc <- performance(cart_rocr, "auc")@y.values
cart_auc
rf_auc <- performance(rf_rocr, "auc")@y.values
rf_auc
log_reg_auc <- performance(log_reg_rocr, "auc")@y.values
log_reg_auc
```

The Random Forest model has a higher *accuracy* at the chosen sensitivity threshold. The performance comparison on the test slice is in the expected performance ballpark, within 1% or so.

So the models seem to be able to predict reasonably for "new" data and are not overtrained.

# Build models using features variables

Using some constructed feature variables, the classifier models are rebuilt.

``` {r}
# Remove the non-linear DebtRatio and add MonthlyExpenses and NetMonthlySurplus.
# Replace the three deliquency variables with the single
# ConsolidatedNumberOfDaysPastDue
x_vars_features = c(
  'RevolvingUtilizationOfUnsecuredLines',
  'age',
#   'NumberOfTime30.59DaysPastDueNotWorse',
#   'NumberOfTime60.89DaysPastDueNotWorse',
#   'NumberOfTimes90DaysLate',
  'MonthlyIncome',
  'MonthlyExpenses',
  'NetMonthlySurplus',
  'NumberOfOpenCreditLinesAndLoans',
  'NumberRealEstateLoansOrLines',
  'NumberOfDependents',
  'ConsolidatedNumberOfDaysPastDue'
  )
```

### CART 2nd iter

```{r CART model 2nd iter, message=FALSE, warning=FALSE}

# cp values
### cp.grid = expand.grid( .cp = seq(0.001, 0.010, 0.001))
# changed to a fixed value since randomness was changing this to a worse
# predicting model
cp.grid = expand.grid( .cp = c(0.003))

cart_model2 = train(
  x=cs_train[, x_vars_features],
  y=cs_train$SeriousDlqin2yrs,
  method='rpart',     # CART
  metric=caret_optimized_metric,
  trControl=caret_train_control,
  tuneGrid = cp.grid
  )
cart_model2
#str(cart_model2)

# show the tree for the tuned .cp value 
cart_model2.best.tree = cart_model2$finalModel
prp(cart_model2.best.tree)

cart2_varImp <- varImp(cart_model2)
plot(cart2_varImp, main = "Variable importance in CART, 2nd Iter.")
```


### Random Forest 2nd iter

``` {r Random forest 2nd iter, message=FALSE, warning=FALSE}
B <- 600

# http://topepo.github.io/caret/Random_Forest.html
rf_model2 <- train(
  x=cs_train[, x_vars_features],
  y=cs_train$SeriousDlqin2yrs,
  method='rf',        # Random Forest
  metric=caret_optimized_metric,
  ntree=B,            # number of trees in the Random Forest
  nodesize=100,       # minimum node size set small enough to allow for complex trees,
                      # but not so small as to require too large B to eliminate high variance
  importance=TRUE,    # evaluate importance of predictors
  keep.inbag=TRUE,
  trControl=caret_train_control,
  tuneGrid=NULL)

rf2_varImp <- varImp(rf_model2)
plot(rf2_varImp, main = "Variable importance in RF, 2nd Iter.")
```

### Logistic Regression 2nd iter

``` {r logit 2nd iter, message=FALSE, warning=FALSE}
log_reg_model2 <- train(
  x=cs_train[, x_vars_features],
  y=cs_train$SeriousDlqin2yrs,
  preProcess=c('center', 'scale'), 
  method='plr',       # Penalized Logistic Regression
  metric=caret_optimized_metric,
  trControl=caret_train_control,
  tuneGrid=expand.grid(
    lambda=0,   # weight penalty parameter
    cp='aic'))     # complexity parameter (AIC / BIC)

log_reg2_varImp <- varImp(log_reg_model2)
plot(log_reg2_varImp, main = "Variable importance in LogReg, 2nd Iter.")
```


### Model variable importance plots

``` {r}
# impVars <- row.names(log_reg2_varImp$importance)

png("varImp2a.png", width=12, height=4, units="in", res=300)
p1 <- plot(log_reg2_varImp, main = "LogReg, 2nd Iter.")
p2 <- plot(cart2_varImp, main = "CART, 2nd Iter.")
p3 <- plot(rf2_varImp, main = "RF, 2nd Iter.")
grid.arrange(p1, p2, p3, nrow = 1)
dev.off()
```

## Model prediction comparison

In order to see which model seems to have the best prediction characteristics, create and chart ROC curves for each model based on the sample test data we kept in reserve for validation.

``` {r Create OOS predictions 2nd iter, echo=FALSE}

cart2_pred_probs <- predict(
  cart_model2, newdata=cs_valid[ , x_vars_features], type='prob')
cart2_oos_performance <- bin_classif_eval(
  cart2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

rf2_pred_probs <- predict(
  rf_model2, newdata=cs_valid[ , x_vars_features], type='prob')
rf2_oos_performance <- bin_classif_eval(
  rf2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)

log_reg2_pred_probs <- predict(
  log_reg_model2, newdata=cs_valid[, x_vars_features], type='prob')
log_reg2_oos_performance <- bin_classif_eval(
  log_reg2_pred_probs$delinquent, cs_valid$SeriousDlqin2yrs, thresholds=prob_thresholds)
```


### Plot of ROC curves

``` {r Plot ROC 2nd iter, message=FALSE, warning=FALSE, echo=FALSE}
plot(x = 1 - rf2_oos_performance$specificity,
     y = rf2_oos_performance$sensitivity,
     type = "l", col='darkgreen', lwd=2,
     xlim = c(0., 1.), ylim = c(0., 1.),
     main = "ROC Curves (Validation Data) - 2nd Iter",
     xlab = "1 - Specificity", ylab = "Sensitivity")
abline(a=0,b=1,lty=2,col=8)
lines(x=1 - cart2_oos_performance$specificity,
      y=cart2_oos_performance$sensitivity,
      col='red', lwd=2)
lines(x=1 - log_reg2_oos_performance$specificity,
      y=log_reg2_oos_performance$sensitivity,
      col='green', lwd=2)
legend('right', c('Random Forest', 'Logistic Regression', 'CART'), 
    lty=1, col=c('darkgreen', 'green', 'red'), lwd=2.)

# save a plot to PNG
dev.copy(png,"ROC2a.png",width=6,height=4.5,units="in",res=200)
dev.off()
```

The CART model does significantly better this time using the engineered features than it did with the original variables, even matching or exceeding the Random Forest at higher specificity.  The logisitic regression eventual catches up to the CART model at higher sensitivity. Again, however, the Random Forest model has the highest AUC among the three models.


## Predictions performance on 2nd iter

Again, a sensistivity threshold of **`75`%** is chosen.


#### Logistic regression

```{r echo=FALSE}
log_reg2_sensitivity_threshold <- .75
log_reg2_i <- min(which(log_reg2_oos_performance$sensitivity < log_reg2_sensitivity_threshold)) - 1
log_reg2_selected_prob_threshold <- prob_thresholds[log_reg2_i]
```

The selected decision threshold is **`r formatC(log_reg2_selected_prob_threshold, format='f', digits=3)`**.

#### Random forest

```{r echo=FALSE}
rf2_sensitivity_threshold <- .75
rf2_i <- min(which(rf2_oos_performance$sensitivity < rf_sensitivity_threshold)) - 1
rf2_selected_prob_threshold <- prob_thresholds[rf2_i]
```

The selected decision threshold for the Random Forest model is **`r formatC(rf2_selected_prob_threshold, format='f', digits=3)`**.

#### CART

```{r echo=FALSE}
cart2_sensitivity_threshold <- .75
cart2_i <- min(which(cart2_oos_performance$sensitivity < cart2_sensitivity_threshold)) - 1
cart2_selected_prob_threshold <- prob_thresholds[cart2_i]
#cart2_selected_prob_threshold 
```

The selected decision threshold is **`r formatC(cart2_selected_prob_threshold, format='f', digits=3)`**.


## Evaluating 2nd iteration model performance

### Logit (Logistic regression) model  2nd iter

Evaluating the performance of the selected logit model, with a decision threshold at **`r formatC(log_reg2_selected_prob_threshold, format='f', digits=3)`**:

```{r}
log_reg2_test_pred_probs <- predict(
  log_reg_model2, newdata=cs_test[, x_vars_features], type='prob')

log_reg2_test_performance <- bin_classif_eval(
  log_reg2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=log_reg2_selected_prob_threshold)

# expected predictive performance 
display_accu_sens_spec(log_reg2_oos_performance[log_reg2_i, ])

# tested performance 
display_accu_sens_spec(log_reg2_test_performance)

showCM(log_reg2_test_pred_probs, log_reg2_selected_prob_threshold, "log_reg2")
```

The *accuracy* degrades slightly over the first iteration model, and the test split data performance is similar to its expected performance.

### Random Forest model 2nd iter

Evaluating the performance of the selected Random Forest model, with a decision threshold at **`r formatC(rf2_selected_prob_threshold, format='f', digits=3)`**:

```{r}
rf2_test_pred_probs <- predict(
  rf_model2, newdata=cs_test[, x_vars_features], type='prob')

rf2_test_performance <- bin_classif_eval(
  rf2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=rf2_selected_prob_threshold)

# expected predicted performance 
display_accu_sens_spec(rf2_oos_performance[rf2_i, ])

# tested performance 
display_accu_sens_spec(rf2_test_performance)

showCM(rf2_test_pred_probs, rf2_selected_prob_threshold, "rf2")
```

The Random Forest model has a higher *accuracy* at the chosen sensitivity threshold. The performance comparison on the test slice is in the expected performance ballpark, within 1% or so.

### CART model 2nd iter

Evaluating the performance of the selected CART model, with a decision threshold at **`r formatC(cart2_selected_prob_threshold, format='f', digits=3)`**:

```{r}
cart2_test_pred_probs <- predict(
  cart_model2, newdata=cs_test[, x_vars_features], type='prob')

cart2_test_performance <- bin_classif_eval(
  cart2_test_pred_probs$delinquent, cs_test$SeriousDlqin2yrs, thresholds=cart2_selected_prob_threshold)

# expected predicted performance 
display_accu_sens_spec(cart2_oos_performance[cart2_i, ])

# tested performance 
display_accu_sens_spec(cart2_test_performance)

showCM(cart2_test_pred_probs, cart2_selected_prob_threshold, "cart2")
```

The CART model is in the running this time. The prediction performance comparison on the test slice is reasonable.

### Cumulative gains and lift charts, AUC

``` {r}

# prediction() needs this to be numeric (not factor SeriousDlqin2yrs)
cs_test_y_numeric <- as.integer(cs_test$SeriousDlqin2yrs)
cs_test_y_numeric <- ifelse(cs_test_y_numeric == 2, 1, 0)

png("cart2_gain.png", width=6, height=4.5, units="in", res=100)
cart2_rocr <- prediction(cart2_test_pred_probs[,2], cs_test_y_numeric)
cart2_gain <- performance(cart2_rocr, "tpr", "rpp")
p1 <- plot(cart2_gain, colorize=TRUE, main = "Gain chart for CART 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("cart2_lift.png", width=6, height=4.5, units="in", res=100)
cart2_lift <- performance(cart2_rocr, "lift", "rpp")
plot(cart2_lift, main="Lift curve for CART 2nd Iter", colorize=T)
dev.off()

png("rf2_gain.png", width=6, height=4.5, units="in", res=100)
rf2_rocr <- prediction(rf2_test_pred_probs[,2], cs_test_y_numeric)
rf2_gain <- performance(rf2_rocr, "tpr", "rpp")
p2 <- plot(rf2_gain, colorize=TRUE, main = "Gain chart for RF 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("rf2_lift.png", width=6, height=4.5, units="in", res=100)
rf2_lift <- performance(rf2_rocr, "lift", "rpp")
plot(rf2_lift, main="Lift curve for RF 2nd Iter", colorize=T)
dev.off()

png("logreg2_gain.png", width=6, height=4.5, units="in", res=100)
log_reg2_rocr <- prediction(log_reg2_test_pred_probs[,2], cs_test_y_numeric)
log_reg2_gain <- performance(log_reg2_rocr, "tpr", "rpp")
p3 <- plot(log_reg2_gain, colorize=TRUE, main = "Gain chart for LogReg 2nd Iter") + abline(a=0,b=1,lty=2,col=8)
dev.off()

png("logreg2_lift.png", width=6, height=4.5, units="in", res=100)
log_reg2_lift <- performance(log_reg2_rocr, "lift", "rpp")
plot(log_reg2_lift, main="Lift curve for LogReg 2nd Iter", colorize=T)
dev.off()

cart2_auc <- performance(cart2_rocr, "auc")@y.values
cart2_auc
rf2_auc <- performance(rf2_rocr, "auc")@y.values
rf2_auc
log_reg2_auc <- performance(log_reg2_rocr, "auc")@y.values
log_reg2_auc

```