Permalink
Find file
Fetching contributors…
Cannot retrieve contributors at this time
475 lines (388 sloc) 19 KB
title author date output
SPAM/HAM SMS classification using caret and Naive Bayes
Jesus M. Castagnetto
2015-01-03
html_document
theme keep_md toc
readable
true
true
library(knitr)
opts_chunk$set(cache=TRUE, comment="", warning=FALSE, message=FALSE)

Background

Motivation

I am currently reading the book "Machine Learning with R"[^mlr] by Brent Lantz, and also want to learn more about the caret[^caret] package, so I decided to replicate the SPAM/HAM classification example from the chapter 4 of the book using caret instead of the e1071^e1071 package used in the text.

There are other differences apart from using a different R package: instead of using as comparison the number of false positives, I decided to use the sensitivity and specificity as criteria to evaluate the prediction models. Also, I used the calculated models on a (different) second dataset to test their validity and prediction performance.

[^mlr]: Book page at Packt

[^caret]: The caret package site

Preliminary information

The dataset used in the book is a modified version of the "SMS Spam Collection v.1" created by Tiago A. Almeida and José Maria Gómez Hidalgo^smsspamcoll, as described in Chapter 4 ("Probabilistic Learning -- Clasification Using Naive Bayes") of the aforementioned book.

You can get the modified dataset from the book's page at Packt, but be aware that you will need to register to get the files. If you rather don't do that, you can get the original data files from the original creator's site.

To simplify things, we are going to use the original dataset.

For this excercise we will use the caret package to do the Naive Bayes[^nb] modeling and prediction, the tm package to generate the text corpus, the pander package to be able to output nicely formated tables, and the doMC to take advantage of parallel processing with multiple cores. Also, we will define some utility functions to simplify matters later in the code.

[^nb]: The caret package uses the klaR package for the Naive Bayes algorithm, so we are loading that library and its dependency (MASS) beforehand.

# libraries needed by caret
library(klaR)
library(MASS)
# for the Naive Bayes modelling
library(caret)
# to process the text into a corpus
library(tm)
# to get nice looking tables
library(pander)
# to simplify selections
library(dplyr)
library(doMC)
registerDoMC(cores=4)

# a utility function for % freq tables
frqtab <- function(x, caption) {
    round(100*prop.table(table(x)), 1)
}
# utility function to summarize model comparison results
sumpred <- function(cm) {
    summ <- list(TN=cm$table[1,1],  # true negatives
                 TP=cm$table[2,2],  # true positives
                 FN=cm$table[1,2],  # false negatives
                 FP=cm$table[2,1],  # false positives
                 acc=cm$overall["Accuracy"],  # accuracy
                 sens=cm$byClass["Sensitivity"],  # sensitivity
                 spec=cm$byClass["Specificity"])  # specificity
    lapply(summ, FUN=round, 2)
}

Reading and preparing the data

We start by downloading the zip file with the dataset, and reading the file into a dataframe. We then assign the appropiate names to the columns, and convert the type into a factor. Finally, we randomize the data frame.

if (!file.exists("smsspamcollection.zip")) {
download.file(url="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip",
              destfile="smsspamcollection.zip", method="curl")
}
sms_raw <- read.table(unz("smsspamcollection.zip","SMSSpamCollection"),
                      header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE)
colnames(sms_raw) <- c("type", "text")
sms_raw$type <- factor(sms_raw$type)
# randomize it a bit
set.seed(12358)
sms_raw <- sms_raw[sample(nrow(sms_raw)),]
str(sms_raw)

The modified data used in the book has 5559 SMS messages, whereas the original data used here has r nrow(sms_raw) rows (caveat: I have not checked for duplicates in the original dataset).

Preparing the data

We wil proceed in a similar fashion as described in the book, but make use of dplyr syntax to execute the text cleanup/transformation operations

First we will transform the SMS text into a corpus that can later be used in the analysis, then we will convert all text to lowercase, remove numbers, remove some common stop words in english, remove punctuation and extra whitespace, and finally, generate the document term that will be the basis for the classification task.

sms_corpus <- Corpus(VectorSource(sms_raw$text))
sms_corpus_clean <- sms_corpus %>%
    tm_map(content_transformer(tolower)) %>% 
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords(kind="en")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

Creating a classification model witn Naive Bayes

Generating the training and testing datasets

We will use the createDataPartition function to split the original dataset into a training and a testing sets, using the proportions from the book (75% training, 25% testing). This also generates the corresponding corpora and document term matrices.

According to the documentation that accompanies the data file, 86.6% of the entries correspond to legitimate messages ("ham"), and 13.4% to spam messages. We shall see if the partition procedure has preserved those proportions in the testing and training sets.

train_index <- createDataPartition(sms_raw$type, p=0.75, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]

ft_orig <- frqtab(sms_raw$type)
ft_train <- frqtab(sms_raw_train$type)
ft_test <- frqtab(sms_raw_test$type)
ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ft_df) <- c("Original", "Training set", "Test set")
pander(ft_df, style="rmarkdown",
       caption=paste0("Comparison of SMS type frequencies among datasets"))

It would seem that the procedure keeps the proportions perfectly.

Following the strategy used in the book, we will pick terms that appear at least 5 times in the training document term matrix. To do this, we first create a dictionary of terms (using the function findFreqTerms) that we will use to filter the cleaned up training and testing corpora.

As a final step before using these sets, we will convert the numeric entries in the term matrices into factors that indicate whether the term is present or not. For this, we'll use a slightly modified version of the convert_counts function that appear in the book, and apply it to each column in the matrices.

sms_dict <- findFreqTerms(sms_dtm_train, lowfreq=5)
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))

# modified sligtly fron the code in the book
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    x <- factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)

Training the two prediction models

We will now use Naive Bayes to train a couple of prediction models. Both models will be generated using 10-fold cross validation, with the default parameters.

The difference between the models will be that the first one does not use the Laplace correction and lets the training procedure figure out whether to user or not a kernel density estimate, while the second one fixes Laplace parameter to one (fL=1) and explicitly forbids the use of a kernel density estimate (useKernel=FALSE).

ctrl <- trainControl(method="cv", 10)
set.seed(12358)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                trControl=ctrl)
sms_model1

set.seed(12358)
sms_model2 <- train(sms_train, sms_raw_train$type, method="nb", 
                    tuneGrid=data.frame(.fL=1, .usekernel=FALSE),
                trControl=ctrl)
sms_model2

Testing the predictions

We now use these two models to predict the appropriate classification of the terms in the test set. In each case we will estimate how good is the prediction using the confusionMatrix function. We will consider a positive result when a message is identified as (or predicted to be) SPAM.

sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type, positive="spam")
cm1

sms_predict2 <- predict(sms_model2, sms_test)
cm2 <- confusionMatrix(sms_predict2, sms_raw_test$type, positive="spam")
cm2

We will also use our sumpred function to extract the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the prediction accuracy[^acc], the sensitivity^sens, and the specificity^spec.

Also, we will use the information from the similar models described in the book, in terms of TP, TN, TP, and FN, to estimate the rest of the parameters, and compare them with the caret derived models.

[^acc]: Accuracy, is the degree of closeness of measurements of a quantity to that quantity's actual (true) value.

[^sens]: The sensitivity, measures the proportion of actual positives which are correctly identified as such.

[^spec]: The specificity,measures the proportion of negatives which are correctly identified as such.

# from the table on page 115 of the book
tn=1203
tp=151
fn=32
fp=4
book_example1 <- list(
    TN=tn,
    TP=tp,
    FN=fn,
    FP=fp,
    acc=(tp + tn)/(tp + tn + fp + fn),
    sens=tp/(tp + fn),
    spec=tn/(tn + fp))

# from the table on page 116 of the book
tn=1204
tp=152
fn=31
fp=3
book_example2 <- list(
    TN=tn,
    TP=tp,
    FN=fn,
    FP=fp,
    acc=(tp + tn)/(tp + tn + fp + fn),
    sens=tp/(tp + fn),
    spec=tn/(tn + fp))

b1 <- lapply(book_example1, FUN=round, 2)
b2 <- lapply(book_example2, FUN=round, 2)
m1 <- sumpred(cm1)
m2 <- sumpred(cm2)
model_comp <- as.data.frame(rbind(b1, b2, m1, m2))
rownames(model_comp) <- c("Book model 1", "Book model 2", "Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
       caption="Model results when comparing predictions and test set")

Accuracy gives us an overall sense of how good the models are, and using that criteria, the ones in the book and those calculated here are very similar in how well they classify an SMS. All of them do surprisingly well taking into account the simplicity of the method.

The discussion in the book centered around the number of FP predicted by the model, but I'd rather look at the sensitivity (related to type II errors) and specificity (related to Type I errors) of the predictions (and the corresponding PPV and NPV).

In this example, the sensitivity gives us the probability of an SMS text being classified as SPAM, when it really is SPAM. Looking at this parameter, we see that even though the book's models do not differ much from the caret models in terms of accuracy, they do worse in terms of sensitivity. The text of the book argues that using the Laplace correction improves prediction, but with the cross-validated models generated using caret the opposite is true.

Of course, we gain in sensitivity, but we lose slightly in specificity, which in this example is the probability of a HAM message being classified as HAM. In other words, we increase (a bit) the misclassification of the regular SMS texts as SPAM. But the difference between the worst and the best specificity is of the order of 0.01 or ~1%.

Applying the model to a different SMS SPAM dataset

Just to check if our caret Naive Bayes models are good enough, we will test them against a different corpus. One described "Independent and Personal SMS Spam Filtering."[^brit_spam]

This dataset can be obtained from one of the authors site^brit_spam_data, as the british-english-sms-corpora.doc MSWORD document (retrieved on 2014-12-30). This document was converted to a text file using the Unix/Linux catdoc command ($ catdoc -aw british-english-sms-corpora.doc > british-english-sms-corpora.txt), and then split into two archives: one containing all the "ham" messages (british-english-sms-corpora_ham.txt), and the other containing all the "spam" message (british-english-sms-corpora_spam.txt).

[^brit_spam]: "Independent and Personal SMS Spam Filtering." in Proc. of IEEE Conference on Computer and Information Technology, Paphos, Cyprus, Aug 2011. Page 429-435. (URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6036805)

These two archive were then read and combined into a data frame. The data was randomized to get a suitable dataset for the rest of the procedures.

The proportion of the spam and ham messages can be seen in the following table.

brit_ham <- data.frame(
    type="ham",
    text=readLines("british-english-sms-corpora_ham.txt"),
    stringsAsFactors=FALSE)
brit_spam <- data.frame(
    type="spam",
    text=readLines("british-english-sms-corpora_spam.txt"),
    stringsAsFactors=FALSE)
brit_sms <- data.frame(rbind(brit_ham, brit_spam))
brit_sms$type <- factor(brit_sms$type)
set.seed(12358)
brit_sms <- brit_sms[sample(nrow(brit_sms)),]
pander(frqtab(brit_sms$type), style="rmarkdown",
       caption="Proportions in the new SMS dataset")

As before, we convert the text into a corpus, clean it up, and generate a filtered document term matrix. The term counts in the matrix are converted into factors, and we generate predictions using both caret models.

Also, we calculate the confusion matrix for both predictions.

brit_corpus <- Corpus(VectorSource(brit_sms$text))
brit_corpus_clean <- brit_corpus %>%
    tm_map(content_transformer(tolower)) %>% 
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords()) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
brit_dtm <- DocumentTermMatrix(brit_corpus_clean, list(dictionary=sms_dict))
brit_test <- brit_dtm %>% apply(MARGIN=2, FUN=convert_counts)
brit_predict1 <- predict(sms_model1, brit_test)
brit_cm1 <- confusionMatrix(brit_predict1, brit_sms$type, positive="spam")
brit_cm1

brit_predict2 <- predict(sms_model2, brit_test)
brit_cm2 <- confusionMatrix(brit_predict2, brit_sms$type, positive="spam")
brit_cm2

When comparing the predictions on this new dataset, it is suprising that we still have a very good accuracy and sensitivity. Even though the Naive Bayes approach is simplistic, and makes some assumptions that are not always correct, it tends to do well with SPAM classification, which is the reason why is so widely used for this purpose.

bm1 <- sumpred(brit_cm1)
bm2 <- sumpred(brit_cm2)
model_comp <- as.data.frame(rbind(bm1,bm2))
rownames(model_comp) <- c("Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
       caption="Applying the caret models to a new dataset")

A cautionary tale about/against self-deception

Initially, when looking for a second dataset to contrast the models, I found "The SMS Spam Corpus v.0.1 Big"^smsspamcorpus by José María Gómez Hidalgo and Enrique Puertas Sanz, which contains a total of 1002 legitimate messages and a total of 322 spam messages.

When I redid the calculations on this dataset, I found a extraordinarily good prediction rate (vide infra), and initially I was amazed at how well the model performed, how well Naive Bayes can perform this textual classification, etc. That is, until I decided to check if there was any overlap between this dataset and the first one we used: Out of 1324 rows, 1274 were the same between both datasets. I later found out that the dataset I just obtained was a previous version of the first one -- something I should've noticed before doing the analysis.

Just out of completeness I will describe the procedure used to process this dataset. The data file of the "The SMS Spam Corpus v.0.1 Big" has inconsistent formatting, so we need to clean it up a bit before using it in our analysis.

library(plyr)
smsdata <- readLines(unz("SMSSpamCorpus01.zip", "english_big.txt")) %>%
    iconv(from="latin1", to="UTF-8")
# make it easy to parse later
smsdata <- gsub(",(spam|ham)$", "|\\1", smsdata)
smsdata <- smsdata %>% strsplit("|", fixed=TRUE) %>% ldply()
colnames(smsdata) <- c("text", "type")
smsdata$type <- factor(smsdata$type)
# generate the corpus
sms_corpus2 <- Corpus(VectorSource(smsdata$text))
sms_corpus2_clean <- sms_corpus2 %>%
    tm_map(content_transformer(tolower)) %>% 
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords()) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
# and the dtm
sms_dtm2 <- DocumentTermMatrix(sms_corpus2_clean, list(dictionary=sms_dict))
sms_test2 <- sms_dtm2 %>% apply(MARGIN=2, FUN=convert_counts)

# do the predictions using the caret models
sms_predict3 <- predict(sms_model1, sms_test2)
cm3 <- confusionMatrix(sms_predict3, smsdata$type, positive="spam")
cm3

sms_predict4 <- predict(sms_model2, sms_test2)
cm4 <- confusionMatrix(sms_predict4, smsdata$type, positive="spam")
cm4

As we can see below, the two models fit the dataset like an old glove. And if you read the explanation above, it makes sense that this happens.

m3 <- sumpred(cm3)
m4 <- sumpred(cm4)
model_comp <- as.data.frame(rbind(m3,m4))
rownames(model_comp) <- c("Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
       caption="Applying the caret models to a not so different dataset")

Reproducibility information

The dataset used to generate the models is the original "SMS Spam Collection v.1" by Tiago A. Almeida and José Maria Gómez Hidalgo^smsspamcoll, as described in the chapter 4 of book "Machine Learning with R" by Brett Lantz (ISBN 978-1-78216-214-8).

The dataset used to contrast the models is the "British English SMS Corpora", obtained on 2014-12-30 from one of the dataset author's http://mtaufiqnzz.wordpress.com/british-english-sms-corpora/.

sessionInfo()