# Classification

## Simulate data

In [1]:
suppressMessages({
    library('dplyr')
})

set.seed(37)

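# Simulate N observations from a noisy logistic model with two predictors.
getData <- function(N=1000) {
    x1 <- rnorm(N, mean=0, sd=1)
    x2 <- rnorm(N, mean=0, sd=1)
    # linear score plus extra Gaussian noise
    y <- 1 + 2.0 * x1 + 3.0 * x2 + rnorm(N, mean=0, sd=1)
    # logistic transform: score -> probability of the positive class
    y <- 1.0 / (1.0 + exp(-y))
    # draw one 0/1 label per observation
    y <- rbinom(n=N, size=1, prob=y)
    
    # recode the label as a factor with levels 'neg' and 'pos';
    # caret needs valid R names as class levels when classProbs=TRUE
    df <- data.frame(x1=x1, x2=x2, y=y)
    df <- df %>% 
            mutate(y=ifelse(y == 0, 'neg', 'pos')) %>%
            mutate_if(is.character, as.factor)
    return(df)
}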

T <- getData()  # caution: this name masks R's built-in `T` shorthand for TRUE

print(summary(T))

       x1                x2             y      
 Min.   :-2.8613   Min.   :-3.28763   neg:394  
 1st Qu.:-0.6961   1st Qu.:-0.59550   pos:606  
 Median :-0.0339   Median : 0.06348            
 Mean   :-0.0184   Mean   : 0.03492            
 3rd Qu.: 0.6836   3rd Qu.: 0.69935            
 Max.   : 3.8147   Max.   : 3.17901            
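
Written out, the simulation in `getData` draws each observation as follows. The symbols $\eta_i$, $p_i$, and $\varepsilon_i$ are notation introduced here for illustration; they are not names used in the code.

```latex
\begin{aligned}
x_{1i},\; x_{2i},\; \varepsilon_i &\sim \mathcal{N}(0, 1) \\
\eta_i &= 1 + 2\,x_{1i} + 3\,x_{2i} + \varepsilon_i \\
p_i &= \frac{1}{1 + e^{-\eta_i}} \\
y_i &\sim \mathrm{Bernoulli}(p_i)
\end{aligned}
```

A draw of $y_i = 0$ is then labelled `neg` and $y_i = 1$ is labelled `pos`. Because the noise term $\varepsilon_i$ sits inside the logistic transform, the data are not exactly a logistic regression in `x1` and `x2` alone, so fitted coefficients should not be expected to match 1, 2, and 3 exactly.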


## Classification methods

There are many classification models available in `R`. Here, we use the `caret` package to fit three of them: a random forest, logistic regression, and AdaBoost. The `caret` documentation [lists](http://topepo.github.io/caret/train-models-by-tag.html) the other available models by tag.
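
Before fitting, it can help to check which tuning parameters `caret` exposes for a given method. The sketch below uses `modelLookup` for the three method strings that appear in this section; it only reads `caret`'s model catalogue and does not fit anything.

```r
library('caret')

# For each method string used in this section, print the tuning parameters
# that train() can vary, and whether the method supports class probabilities.
for (name in c('rf', 'glm', 'adaboost')) {
    print(modelLookup(name))
}
```

The `parameter` column of each result gives the names that a custom `tuneGrid` passed to `train` would need as its columns.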

### Random forest

In [2]:
library('caret')

ctrl <- trainControl(
    method='LGOCV', 
    p=0.8,
    classProbs=TRUE,
    summaryFunction=twoClassSummary
)
m <- train(y ~ ., data=T, method='rf', metric='ROC', trControl=ctrl)

Loading required package: lattice
Loading required package: ggplot2


note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .



In [3]:
print(m)

Random Forest 

1000 samples
   2 predictor
   2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 80%) 
Summary of sample sizes: 801, 801, 801, 801, 801, 801, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8836576  0.7533333  0.8363636

Tuning parameter 'mtry' was held constant at a value of 2
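
With only two predictors, the default grid collapses to a single `mtry` value, which is why the output above reports it as held constant. The following is a minimal sketch of passing an explicit grid instead. It assumes a smaller re-simulated data set (`N = 300`) and only 5 resampling splits to keep the run short, so its numbers will not match the results above; the names `df`, `grid`, and `m_rf` are illustrative.

```r
library('caret')

# Small re-simulation of the same noisy logistic model (assumption: N = 300 for speed).
set.seed(37)
N <- 300
x1 <- rnorm(N)
x2 <- rnorm(N)
p <- 1 / (1 + exp(-(1 + 2 * x1 + 3 * x2 + rnorm(N))))
y <- factor(ifelse(rbinom(N, 1, p) == 0, 'neg', 'pos'), levels=c('neg', 'pos'))
df <- data.frame(x1=x1, x2=x2, y=y)

ctrl <- trainControl(
    method='LGOCV',
    p=0.8,
    number=5,                       # 5 train/test splits instead of the default 25
    classProbs=TRUE,
    summaryFunction=twoClassSummary
)

# Explicit grid: with two predictors, mtry can only be 1 or 2.
grid <- data.frame(mtry=c(1, 2))
m_rf <- train(y ~ ., data=df, method='rf', metric='ROC', trControl=ctrl, tuneGrid=grid)

print(m_rf$results)   # one row of resampled ROC/Sens/Spec per mtry value
print(m_rf$bestTune)  # the mtry value with the largest ROC
```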


### Logistic regression

In [4]:
ctrl <- trainControl(
    method='LGOCV', 
    p=0.8,
    classProbs=TRUE,
    summaryFunction=twoClassSummary
)

m <- train(y ~ ., data=T, method='glm', family='binomial', metric='ROC', trControl=ctrl)

In [5]:
print(m)

Generalized Linear Model 

1000 samples
   2 predictor
   2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 80%) 
Summary of sample sizes: 801, 801, 801, 801, 801, 801, ... 
Resampling results:

  ROC        Sens  Spec
  0.9101166  0.78  0.88
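
To see what the logistic regression estimates, the same kind of model can be fit directly with base R's `glm`, without `caret`. This is a sketch under the assumption that the data are re-simulated inline in the same way as `getData`; the variable names are illustrative.

```r
# Re-simulate data from the noisy logistic model used in this notebook.
set.seed(37)
N <- 1000
x1 <- rnorm(N)
x2 <- rnorm(N)
score <- 1 + 2 * x1 + 3 * x2 + rnorm(N)
p <- 1 / (1 + exp(-score))
y <- factor(ifelse(rbinom(N, 1, p) == 0, 'neg', 'pos'), levels=c('neg', 'pos'))
df <- data.frame(x1=x1, x2=x2, y=y)

# With a two-level factor response, glm() models the probability
# of the second level, here 'pos'.
fit <- glm(y ~ x1 + x2, data=df, family=binomial)
print(coef(fit))

# Predicted probability of 'pos' for the first few rows.
prob_pos <- predict(fit, type='response')
print(head(prob_pos))
```

The fitted coefficients are on the log-odds scale. Since the simulation adds noise to the score before the logistic transform, they will generally differ somewhat from the generating values 1, 2, and 3.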



### AdaBoost

In [6]:
ctrl <- trainControl(
    method='LGOCV', 
    p=0.8,
    classProbs=TRUE,
    summaryFunction=twoClassSummary
)

m <- train(y ~ ., data=T, method='adaboost', metric='ROC', trControl=ctrl)

In [7]:
print(m)

AdaBoost Classification Trees 

1000 samples
   2 predictor
   2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 80%) 
Summary of sample sizes: 801, 801, 801, 801, 801, 801, ... 
Resampling results across tuning parameters:

  nIter  method         ROC        Sens       Spec     
   50    Adaboost.M1    0.8723056  0.7184615  0.8357025
   50    Real adaboost  0.7197754  0.7374359  0.8578512
  100    Adaboost.M1    0.8701865  0.7261538  0.8323967
  100    Real adaboost  0.6905022  0.7384615  0.8515702
  150    Adaboost.M1    0.8673172  0.7205128  0.8277686
  150    Real adaboost  0.6787137  0.7400000  0.8489256

ROC was used to select the optimal model using the largest value.
The final values used for the model were nIter = 50 and method = Adaboost.M1.
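
Each model above was resampled on its own randomly drawn splits, so the reported ROC values are not measured on identical test sets. One way to compare models on the same splits is to create the partitions once and pass them to `trainControl` through its `index` argument, then collect the results with `resamples`. The sketch below does this for logistic regression and a random forest only. It assumes a smaller re-simulated data set and 10 splits to keep the run short, and the names `idx`, `m_glm`, `m_rf`, and `rs` are illustrative.

```r
library('caret')

# Small re-simulation of the same noisy logistic model (assumption: N = 300 for speed).
set.seed(37)
N <- 300
x1 <- rnorm(N)
x2 <- rnorm(N)
p <- 1 / (1 + exp(-(1 + 2 * x1 + 3 * x2 + rnorm(N))))
y <- factor(ifelse(rbinom(N, 1, p) == 0, 'neg', 'pos'), levels=c('neg', 'pos'))
df <- data.frame(x1=x1, x2=x2, y=y)

# Draw ten stratified 80% training sets once, so both models share them.
idx <- createDataPartition(df$y, p=0.8, times=10)

ctrl <- trainControl(
    method='LGOCV',
    index=idx,                      # reuse the same training rows for every model
    classProbs=TRUE,
    summaryFunction=twoClassSummary
)

m_glm <- train(y ~ ., data=df, method='glm', family='binomial',
               metric='ROC', trControl=ctrl)
m_rf <- train(y ~ ., data=df, method='rf',
              metric='ROC', trControl=ctrl, tuneGrid=data.frame(mtry=2))

# Collect the per-split metrics of both models side by side.
rs <- resamples(list(glm=m_glm, rf=m_rf))
print(summary(rs))
```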
