---
title: "SPAM/HAM SMS classification using caret and Naive Bayes"
author: "Jesus M. Castagnetto"
date: '2015-01-03'
output:
  html_document:
    theme: readable
    keep_md: true
    toc: true
---
```{r echo=FALSE}
library(knitr)
opts_chunk$set(cache=TRUE, comment="", warning=FALSE, message=FALSE)
```
## Background
### Motivation
I am currently reading the book "Machine Learning with R"[^mlr] by Brett Lantz,
and also want to learn more about the `caret`[^caret] package, so I decided to replicate
the SPAM/HAM classification example from chapter 4 of the book using `caret`
instead of the `e1071`[^e1071] package used in the text.
There are other differences apart from using a different R package:
instead of comparing the number of false positives, I decided
to use sensitivity and specificity as the criteria to evaluate the
prediction models. Also, I applied the fitted models to a (different) second
dataset to test their validity and prediction performance.
[^mlr]: [Book page at Packt](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r)
[^caret]: [The caret package](http://caret.r-forge.r-project.org/) site
[^e1071]: [http://cran.r-project.org/web/packages/e1071/index.html](http://cran.r-project.org/web/packages/e1071/index.html)
### Preliminary information
The dataset used in the book is a modified version of the "SMS Spam Collection v.1" created by Tiago A. Almeida and José María Gómez Hidalgo[^smsspamcoll],
as described in Chapter 4 ("*Probabilistic Learning -- Classification Using Naive Bayes*") of the aforementioned book.
[^smsspamcoll]: [http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)
You can get the modified dataset from the book's page at Packt, but be aware
that you will need to register to get the files. If you would rather not do that,
you can get the original data files from the creator's site.
To simplify things, we are going to use the original dataset.
For this exercise we will use the `caret` package to do the Naive Bayes[^nb]
modeling and prediction, the `tm` package to generate the text corpus,
the `pander` package to output nicely formatted tables,
and the `doMC` package to take advantage of parallel processing with multiple cores.
We will also define some utility functions to simplify matters later in the code.
[^nb]: The `caret` package uses the `klaR` package for the Naive Bayes algorithm, so we are loading that library and its dependency (`MASS`) beforehand.
```{r warning=FALSE, message=FALSE}
# libraries needed by caret
library(klaR)
library(MASS)
# for the Naive Bayes modelling
library(caret)
# to process the text into a corpus
library(tm)
# to get nice looking tables
library(pander)
# to simplify selections
library(dplyr)
library(doMC)
registerDoMC(cores=4)
# a utility function for % freq tables
frqtab <- function(x) {
    round(100*prop.table(table(x)), 1)
}
# utility function to summarize model comparison results
sumpred <- function(cm) {
    summ <- list(TN=cm$table[1,1],  # true negatives
                 TP=cm$table[2,2],  # true positives
                 FN=cm$table[1,2],  # false negatives
                 FP=cm$table[2,1],  # false positives
                 acc=cm$overall["Accuracy"],     # accuracy
                 sens=cm$byClass["Sensitivity"], # sensitivity
                 spec=cm$byClass["Specificity"]) # specificity
    lapply(summ, FUN=round, 2)
}
```
## Reading and preparing the data
We start by downloading the zip file with the dataset, and reading the file into
a data frame. We then assign the appropriate names to the columns, and convert
the `type` column into a factor. Finally, we randomize the order of the data frame.
```{r}
if (!file.exists("smsspamcollection.zip")) {
    download.file(url="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip",
                  destfile="smsspamcollection.zip", method="curl")
}
sms_raw <- read.table(unz("smsspamcollection.zip", "SMSSpamCollection"),
                      header=FALSE, sep="\t", quote="", stringsAsFactors=FALSE)
colnames(sms_raw) <- c("type", "text")
sms_raw$type <- factor(sms_raw$type)
# randomize it a bit
set.seed(12358)
sms_raw <- sms_raw[sample(nrow(sms_raw)),]
str(sms_raw)
```
The modified data used in the book has 5559 SMS messages, whereas the original
data used here has `r nrow(sms_raw)` rows (*caveat*: I have not checked for
duplicates in the original dataset).
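As an aside, if one wanted to check for duplicates, a minimal sketch (not evaluated here) could be:

```{r eval=FALSE}
# count SMS texts that appear more than once in the raw data
sum(duplicated(sms_raw$text))
```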
## Preparing the data
We will proceed in a similar fashion as described in the book, but will make use
of `dplyr` syntax to execute the text cleanup/transformation operations.
First we will transform the SMS text into a corpus that can later be used in the
analysis, then we will convert all text to lowercase, remove numbers, remove
some common English *stop words*, remove punctuation and extra whitespace,
and finally, generate the document term matrix that will be the basis for the
classification task.
```{r}
sms_corpus <- Corpus(VectorSource(sms_raw$text))
sms_corpus_clean <- sms_corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords(kind="en")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
```
## Creating a classification model with Naive Bayes
### Generating the training and testing datasets
We will use the `createDataPartition` function to split the original dataset
into training and testing sets, using the proportions from the book (75%
training, 25% testing). We also generate the corresponding corpora and
document term matrices.
According to the documentation that accompanies the data file, 86.6% of the
entries correspond to legitimate messages ("ham"), and 13.4% to spam messages.
We shall see if the partition procedure has preserved those proportions in the
testing and training sets.
```{r results='asis'}
train_index <- createDataPartition(sms_raw$type, p=0.75, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]
ft_orig <- frqtab(sms_raw$type)
ft_train <- frqtab(sms_raw_train$type)
ft_test <- frqtab(sms_raw_test$type)
ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ft_df) <- c("Original", "Training set", "Test set")
pander(ft_df, style="rmarkdown",
caption=paste0("Comparison of SMS type frequencies among datasets"))
```
It would seem that the procedure keeps the proportions perfectly.
Following the strategy used in the book, we will pick terms that appear at least
5 times in the training document term matrix. To do this, we first create a
dictionary of terms (using the function `findFreqTerms`) that we will use to
filter the cleaned up training and testing corpora.
As a final step before using these sets, we will convert the numeric entries in
the term matrices into factors that indicate whether the term is present or not.
For this, we'll use a slightly modified version of the `convert_counts` function
that appears in the book, and apply it to each column in the matrices.
```{r}
sms_dict <- findFreqTerms(sms_dtm_train, lowfreq=5)
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))
# modified slightly from the code in the book
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)
```
### Training the two prediction models
We will now use Naive Bayes to train a couple of prediction models.
Both models will be generated using 10-fold cross validation, with the
default parameters.
The difference between the models is that the first one does not
use the Laplace correction and lets the training procedure figure out whether
or not to use a kernel density estimate, while the second one fixes the Laplace
parameter to one (`fL=1`) and explicitly disallows the use of a kernel density
estimate (`usekernel=FALSE`).
```{r warning=FALSE, message=FALSE}
ctrl <- trainControl(method="cv", 10)
set.seed(12358)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                    trControl=ctrl)
sms_model1
set.seed(12358)
sms_model2 <- train(sms_train, sms_raw_train$type, method="nb",
                    tuneGrid=data.frame(.fL=1, .usekernel=FALSE),
                    trControl=ctrl)
sms_model2
```
### Testing the predictions
We now use these two models to predict the appropriate classification of the
messages in the test set. In each case we will estimate how good the prediction
is using the `confusionMatrix` function. We will consider a positive result when
a message is identified as (or predicted to be) SPAM.
```{r}
sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type, positive="spam")
cm1
sms_predict2 <- predict(sms_model2, sms_test)
cm2 <- confusionMatrix(sms_predict2, sms_raw_test$type, positive="spam")
cm2
```
We will also use our `sumpred` function to extract the *true positives* (TP),
*true negatives* (TN), *false positives* (FP), and *false negatives* (FN),
the prediction *accuracy*[^acc], the *sensitivity*[^sens] (also
known as *recall* or *true positive rate*), and the *specificity*[^spec]
(also known as *true negative rate*).
Also, we will use the information from the similar models described in the book,
in terms of TP, TN, FP, and FN, to estimate the rest of the parameters, and
compare them with the `caret`-derived models.
[^acc]: Accuracy is the degree of closeness of measurements of a quantity to that quantity's actual (true) value.
[^sens]: Sensitivity measures the proportion of actual positives which are correctly identified as such.
[^spec]: Specificity measures the proportion of negatives which are correctly identified as such.
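In terms of the confusion matrix counts, these three quantities are computed as follows (the same formulas used in the code chunk below for the book's models):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}$$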
```{r results='asis'}
# from the table on page 115 of the book
tn=1203
tp=151
fn=32
fp=4
book_example1 <- list(
    TN=tn,
    TP=tp,
    FN=fn,
    FP=fp,
    acc=(tp + tn)/(tp + tn + fp + fn),
    sens=tp/(tp + fn),
    spec=tn/(tn + fp))
# from the table on page 116 of the book
tn=1204
tp=152
fn=31
fp=3
book_example2 <- list(
    TN=tn,
    TP=tp,
    FN=fn,
    FP=fp,
    acc=(tp + tn)/(tp + tn + fp + fn),
    sens=tp/(tp + fn),
    spec=tn/(tn + fp))
b1 <- lapply(book_example1, FUN=round, 2)
b2 <- lapply(book_example2, FUN=round, 2)
m1 <- sumpred(cm1)
m2 <- sumpred(cm2)
model_comp <- as.data.frame(rbind(b1, b2, m1, m2))
rownames(model_comp) <- c("Book model 1", "Book model 2", "Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
caption="Model results when comparing predictions and test set")
```
Accuracy gives us an overall sense of how good the models are, and
by that criterion, the ones in the book and those calculated here are very
similar in how well they classify an SMS. All of them do surprisingly well
taking into account the simplicity of the method.
The discussion in the book centered around the number of FP predicted by the
model, but I'd rather look at the sensitivity (related to Type II errors) and
specificity (related to Type I errors) of the predictions (and the corresponding
PPV and NPV).
In this example, the sensitivity gives us the probability of an SMS
text being classified as SPAM, when it really is SPAM. Looking at this
parameter, we see that even though the book's models do not differ much
from the `caret` models in terms of accuracy, they do worse in terms of
sensitivity. The text of the book argues that using the Laplace correction
improves prediction, but with the cross-validated models generated using
`caret` the opposite is true.
Of course, we gain in sensitivity, but we lose slightly in specificity, which in
this example is the probability of a HAM message being classified as HAM. In
other words, we increase (a bit) the misclassification of the regular SMS texts
as SPAM. But the difference between the worst and the best specificity is of the
order of 0.01 or ~1%.
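As a side note, the PPV and NPV mentioned above can be read directly from the confusion matrix objects; a minimal sketch (not evaluated here), using the element names that `caret` reports in `byClass`:

```{r eval=FALSE}
# positive and negative predictive values for the two caret models
cm1$byClass[c("Pos Pred Value", "Neg Pred Value")]
cm2$byClass[c("Pos Pred Value", "Neg Pred Value")]
```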
## Applying the model to a different SMS SPAM dataset
Just to check if our `caret` Naive Bayes models are good enough, we will test
them against a different corpus: the one described in "Independent and Personal SMS Spam Filtering"[^brit_spam].
This dataset can be obtained from one of the authors' sites[^brit_spam_data],
as the `british-english-sms-corpora.doc` MS Word document (retrieved on
2014-12-30). This document was converted to a text file using the
Unix/Linux `catdoc` command (`$ catdoc -aw british-english-sms-corpora.doc > british-english-sms-corpora.txt`), and then split into two
files: one containing all the "ham" messages
(`british-english-sms-corpora_ham.txt`), and the other containing all the
"spam" messages (`british-english-sms-corpora_spam.txt`).
[^brit_spam]: "Independent and Personal SMS Spam Filtering." in Proc. of IEEE Conference on Computer and Information Technology, Paphos, Cyprus, Aug 2011. Page 429-435. (URL: [http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6036805](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6036805))
[^brit_spam_data]: http://mtaufiqnzz.wordpress.com/british-english-sms-corpora/
These two files were then read and combined into a data frame. The data was
randomized to get a suitable dataset for the rest of the procedures.
The proportions of spam and ham messages can be seen in the following table.
```{r results='asis'}
brit_ham <- data.frame(
    type="ham",
    text=readLines("british-english-sms-corpora_ham.txt"),
    stringsAsFactors=FALSE)
brit_spam <- data.frame(
    type="spam",
    text=readLines("british-english-sms-corpora_spam.txt"),
    stringsAsFactors=FALSE)
brit_sms <- data.frame(rbind(brit_ham, brit_spam))
brit_sms$type <- factor(brit_sms$type)
set.seed(12358)
brit_sms <- brit_sms[sample(nrow(brit_sms)),]
pander(frqtab(brit_sms$type), style="rmarkdown",
caption="Proportions in the new SMS dataset")
```
As before, we convert the text into a corpus, clean it up, and generate a
filtered document term matrix. The term counts in the matrix are converted
into factors, and we generate predictions using both `caret` models.
Also, we calculate the confusion matrix for both predictions.
```{r}
brit_corpus <- Corpus(VectorSource(brit_sms$text))
brit_corpus_clean <- brit_corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords()) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
brit_dtm <- DocumentTermMatrix(brit_corpus_clean, list(dictionary=sms_dict))
brit_test <- brit_dtm %>% apply(MARGIN=2, FUN=convert_counts)
brit_predict1 <- predict(sms_model1, brit_test)
brit_cm1 <- confusionMatrix(brit_predict1, brit_sms$type, positive="spam")
brit_cm1
brit_predict2 <- predict(sms_model2, brit_test)
brit_cm2 <- confusionMatrix(brit_predict2, brit_sms$type, positive="spam")
brit_cm2
```
When comparing the predictions on this new dataset, it is surprising that we
still get very good accuracy and sensitivity. Even though the Naive Bayes
approach is simplistic, and makes some assumptions that are not always correct,
it tends to do well with SPAM classification, which is the reason why it is so
widely used for this purpose.
```{r results='asis'}
bm1 <- sumpred(brit_cm1)
bm2 <- sumpred(brit_cm2)
model_comp <- as.data.frame(rbind(bm1,bm2))
rownames(model_comp) <- c("Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
caption="Applying the caret models to a new dataset")
```
## A cautionary tale about/against self-deception
Initially, when looking for a second dataset to contrast the models, I found
"The SMS Spam Corpus v.0.1 Big"[^smsspamcorpus]
by José María Gómez Hidalgo and Enrique Puertas Sanz, which contains a
total of 1002 legitimate messages and 322 spam messages.
When I redid the calculations on this dataset, I found an extraordinarily good
prediction rate (*vide infra*), and initially I was amazed at how well the
model performed, how well Naive Bayes can perform this textual classification,
etc. That is, until I decided to check if there was any overlap
between this dataset and the first one we used: Out of 1324 rows, 1274 were
the same between both datasets. I later found out that the dataset I just
obtained was a previous version of the first one -- something I should've
noticed before doing the analysis.
[^smsspamcorpus]: [http://www.esp.uem.es/jmgomez/smsspamcorpus](http://www.esp.uem.es/jmgomez/smsspamcorpus)
Just for completeness, I will describe the procedure used to process this dataset. The data file of "The SMS Spam Corpus v.0.1 Big" has inconsistent
formatting, so we need to clean it up a bit before using it in our analysis.
```{r}
library(plyr)
smsdata <- readLines(unz("SMSSpamCorpus01.zip", "english_big.txt")) %>%
    iconv(from="latin1", to="UTF-8")
# make it easy to parse later
smsdata <- gsub(",(spam|ham)$", "|\\1", smsdata)
smsdata <- smsdata %>% strsplit("|", fixed=TRUE) %>% ldply()
colnames(smsdata) <- c("text", "type")
smsdata$type <- factor(smsdata$type)
# generate the corpus
sms_corpus2 <- Corpus(VectorSource(smsdata$text))
sms_corpus2_clean <- sms_corpus2 %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords()) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
# and the dtm
sms_dtm2 <- DocumentTermMatrix(sms_corpus2_clean, list(dictionary=sms_dict))
sms_test2 <- sms_dtm2 %>% apply(MARGIN=2, FUN=convert_counts)
# do the predictions using the caret models
sms_predict3 <- predict(sms_model1, sms_test2)
cm3 <- confusionMatrix(sms_predict3, smsdata$type, positive="spam")
cm3
sms_predict4 <- predict(sms_model2, sms_test2)
cm4 <- confusionMatrix(sms_predict4, smsdata$type, positive="spam")
cm4
```
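For reference, the overlap mentioned above can be checked with something as simple as the following sketch (not evaluated here; it assumes both `sms_raw` and `smsdata` are loaded as above, and the exact counts may vary with encoding details):

```{r eval=FALSE}
# how many messages in this older corpus also appear in the main dataset?
sum(smsdata$text %in% sms_raw$text)
```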
As we can see below, the two models fit the dataset like an old glove. And if
you read the explanation above, it makes sense that this happens.
```{r results='asis'}
m3 <- sumpred(cm3)
m4 <- sumpred(cm4)
model_comp <- as.data.frame(rbind(m3,m4))
rownames(model_comp) <- c("Caret model 1", "Caret model 2")
pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
caption="Applying the caret models to a not so different dataset")
```
## Reproducibility information
The dataset used to generate the models is the original "SMS Spam Collection v.1"
by Tiago A. Almeida and José María Gómez Hidalgo[^smsspamcoll],
as described in chapter 4 of the book "Machine Learning with R" by Brett Lantz
(ISBN 978-1-78216-214-8).
The dataset used to contrast the models is the "British English SMS Corpora",
obtained on 2014-12-30 from one of the dataset authors' sites: [http://mtaufiqnzz.wordpress.com/british-english-sms-corpora/](http://mtaufiqnzz.wordpress.com/british-english-sms-corpora/).
```{r}
sessionInfo()
```