Dr. Mike uses Weka in this episode, but we want to code. Thankfully we have a library specifically for that situation: RWeka.

In [4]:
install.packages("RWeka")
library(RWeka)

When you need outputs that are categorical, or that come from a small set of labels, you classify.


# Decision Trees
A decision tree derives its output from a series of conditions arranged in a tree: decision nodes test an attribute, branches carry the outcome of each test, and leaf nodes hold the predicted class.
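
To make that structure concrete, here is a minimal hand-written sketch of what a trained tree boils down to. The function name, the features (`income`, `has_default`) and the threshold are hypothetical, made up for illustration; they are not taken from the credit dataset or from anything J48 produces. A learner such as J48 finds tests like these from the data instead of having them written by hand.

```r
# A hand-written stand-in for a learned tree: each `if` is a decision node,
# each branch is one outcome of the test, and each returned label is a leaf.
# Feature names and the threshold are hypothetical.
classify_applicant <- function(income, has_default) {
  if (has_default) {              # decision node 1
    "-"                           # leaf: Not Approved
  } else if (income >= 30000) {   # decision node 2
    "+"                           # leaf: Approved
  } else {
    "-"                           # leaf: Not Approved
  }
}

# Three made-up applicants
applicants <- data.frame(
  income      = c(45000, 20000, 60000),
  has_default = c(FALSE, FALSE, TRUE)
)

# Walk each applicant down the tree
toy_tree_labels <- mapply(classify_applicant,
                          applicants$income,
                          applicants$has_default)
print(toy_tree_labels)
```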

In [5]:
install.packages("caret")
install.packages("data.table")
library(caret)
library(data.table)

# Pound seems to have gotten hold of the unmasked version of the dataset, with headers. That one's not available to us, but the masked one works the same way
# Last column in the .data file will show a "+" for Approved and "-" for Not Approved

credit_data <- fread("crx.data", header = FALSE)

# Assign column names
colnames(credit_data) <- c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11", "A12", "A13", "A14", "A15", "Class")

# Convert string attributes to factors (categorical data)
credit_data[] <- lapply(credit_data, function(x) if (is.character(x)) as.factor(x) else x)

# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(credit_data$Class, p = 0.70, list = FALSE)
trainData <- credit_data[trainIndex, ]
testData <- credit_data[-trainIndex, ]

# We'll pick the J48 decision tree for our model
model <- J48(Class ~ ., data = trainData)

# Predict on the test set
predictions <- predict(model, newdata = testData)

# Generate the confusion matrix, along with other results.
confusionMatrix(predictions, testData$Class)

Confusion Matrix and Statistics

          Reference
Prediction  -  +
         - 85  4
         + 29 88
                                          
               Accuracy : 0.8398          
                 95% CI : (0.7824, 0.8871)
    No Information Rate : 0.5534          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6842          
                                          
 Mcnemar's Test P-Value : 2.943e-05       
                                          
            Sensitivity : 0.7456          
            Specificity : 0.9565          
         Pos Pred Value : 0.9551          
         Neg Pred Value : 0.7521          
             Prevalence : 0.5534          
         Detection Rate : 0.4126          
   Detection Prevalence : 0.4320          
      Balanced Accuracy : 0.8511          
                                          
       'Positive' Class : -               
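
The headline numbers in that report follow directly from the four counts in the matrix. The sketch below recomputes a few of them in base R, using only the counts printed above; the variable names are ours. Note that caret treated "-" as the positive class here, so sensitivity measures how many of the truly Not Approved cases were caught.

```r
# Counts copied from the J48 confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(85, 29, 4, 88), nrow = 2,
             dimnames = list(Prediction = c("-", "+"),
                             Reference  = c("-", "+")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all predictions
sensitivity <- cm["-", "-"] / sum(cm[, "-"])  # true "-" that were predicted "-"
specificity <- cm["+", "+"] / sum(cm[, "+"])  # true "+" that were predicted "+"
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```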
                                    

# K-NN 
K-Nearest Neighbours asks something like "what, in the existing dataset, have we already seen around this area?". For each new data point it finds the K closest points in the training data and predicts from their labels: a majority vote for classification, or an average for numeric targets.
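
The idea fits in a few lines of base R. This is a rough sketch on made-up points, not how Weka's IBk is implemented: the helper `knn_predict`, the toy data, and the choice of plain Euclidean distance are all assumptions for illustration.

```r
# Minimal k-NN classifier: Euclidean distance plus majority vote.
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (on a tie, the label that sorts first wins)
  names(which.max(table(nearest)))
}

# Two made-up clusters: "-" around (1, 1) and "+" around (5, 5)
toy_x <- matrix(c(1.0, 1.0,
                  1.2, 0.8,
                  0.9, 1.1,
                  5.0, 5.0,
                  5.2, 4.9,
                  4.8, 5.1), ncol = 2, byrow = TRUE)
toy_y <- c("-", "-", "-", "+", "+", "+")

knn_near_low  <- knn_predict(toy_x, toy_y, query = c(1.1, 1.0), k = 3)
knn_near_high <- knn_predict(toy_x, toy_y, query = c(4.9, 5.0), k = 3)
print(c(knn_near_low, knn_near_high))
```

Because the vote depends on raw distances, attributes on very different scales can dominate the result, which is worth keeping in mind with mixed data like the credit set.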

In [6]:
model <- IBk(Class ~ ., data = trainData, control = Weka_control(K = 5)) # Let's pick 5 for our K-number

predictions <- predict(model, newdata = testData)

confusionMatrix(predictions, testData$Class)

Confusion Matrix and Statistics

          Reference
Prediction   -   +
         - 101  13
         +  13  79
                                          
               Accuracy : 0.8738          
                 95% CI : (0.8206, 0.9159)
    No Information Rate : 0.5534          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7447          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8860          
            Specificity : 0.8587          
         Pos Pred Value : 0.8860          
         Neg Pred Value : 0.8587          
             Prevalence : 0.5534          
         Detection Rate : 0.4903          
   Detection Prevalence : 0.5534          
      Balanced Accuracy : 0.8723          
                                          
       'Positive' Class : -               
                              

# SVMs
Support Vector Machines draw the decision boundary that maximizes the separation (the margin) between classes. With a kernel function they can also handle data that is not linearly separable.
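
"Maximizing the separation" can be made concrete by measuring the margin of a boundary: the distance from the boundary to the closest training point. The sketch below compares two hand-picked linear boundaries on made-up points; both classify everything correctly, but one leaves more room. The data, the two boundaries, and the helper `geometric_margin` are hypothetical, and this does not reproduce what SMO does internally; it only illustrates the quantity an SVM tries to make as large as possible.

```r
# Geometric margin of a linear boundary w . x + b = 0:
# the smallest signed distance from any training point to the boundary.
# Labels y are coded as -1 / +1; a positive result means every point
# lies on its correct side.
geometric_margin <- function(w, b, x, y) {
  min(y * (x %*% w + b)) / sqrt(sum(w^2))
}

# Made-up, linearly separable points
svm_x <- matrix(c(1, 1,
                  2, 1,
                  4, 4,
                  5, 4), ncol = 2, byrow = TRUE)
svm_y <- c(-1, -1, 1, 1)

# Two hypothetical boundaries that both separate the classes
margin_a <- geometric_margin(w = c(1, 0), b = -3,   x = svm_x, y = svm_y)  # vertical line x1 = 3
margin_b <- geometric_margin(w = c(1, 1), b = -5.5, x = svm_x, y = svm_y)  # diagonal line x1 + x2 = 5.5

print(c(margin_a = margin_a, margin_b = margin_b))
```

Of these two candidates, an SVM would prefer the diagonal one, since its nearest points sit farther from the boundary.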

In [7]:
model <- SMO(Class ~ ., data = trainData)

predictions <- predict(model, newdata = testData)

confusionMatrix(predictions, testData$Class)

Confusion Matrix and Statistics

          Reference
Prediction  -  +
         - 93 10
         + 21 82
                                          
               Accuracy : 0.8495          
                 95% CI : (0.7932, 0.8954)
    No Information Rate : 0.5534          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.699           
                                          
 Mcnemar's Test P-Value : 0.07249         
                                          
            Sensitivity : 0.8158          
            Specificity : 0.8913          
         Pos Pred Value : 0.9029          
         Neg Pred Value : 0.7961          
             Prevalence : 0.5534          
         Detection Rate : 0.4515          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.8535          
                                          
       'Positive' Class : -               
                                    