# Text Analytics and the Naïve Bayes Model

### 1. Loading Dataset

In [119]:
sms <- read.csv("datasets/spam.csv", stringsAsFactors=F)

### 2. Taking a look at the Data

In [120]:
head(sms)

,type,text
,<chr>,<chr>
1,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
2,ham,Ok lar... Joking wif u oni...
3,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4,ham,U dun say so early hor... U c already then say...
5,ham,"Nah I don't think he goes to usf, he lives around here though"
6,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv"


In [121]:
table(sms$type)


 ham spam 
4825  747 

In [122]:
#?prop.table 
round(prop.table(table(sms$type))*100, digits = 1) # gets % of each class


 ham spam 
86.6 13.4 

In [123]:
# Make Type a Factor
sms$type = factor(sms$type)

In [124]:
# install.packages("tm") #text-mine package
library(tm)

#?VectorSource # Vectorization:

#?Corpus # Create Corpus with all documents

sms_corpus = VCorpus(VectorSource(sms$text))
lapply(sms_corpus[1:3], as.character) #first 3 items

### Preprocessing

In [125]:
corpus_clean = tm_map(sms_corpus, tolower)
corpus_clean = tm_map(corpus_clean, removeNumbers)
corpus_clean = tm_map(corpus_clean, removeWords, stopwords())
corpus_clean = tm_map(corpus_clean, removePunctuation)
corpus_clean = tm_map(corpus_clean, stripWhitespace)
# inspect the first 3 SMS messages again, now cleaned
inspect(corpus_clean[1:3])

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
[1] go jurong point crazy available bugis n great world la e buffet cine got amore wat

[[2]]
[1] ok lar joking wif u oni

[[3]]
[1] free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s
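
The output above shows words fused together (`questionstd`, `ratetcs`) because punctuation is deleted rather than replaced by a space. The sketch below approximates the cleaning steps in base R to make that visible. It is not tm's exact implementation, it skips stop-word removal, and `clean_text` is a hypothetical helper name.

```r
# Base-R approximation of the cleaning pipeline (stop-word removal omitted).
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case everything
  x <- gsub("[[:digit:]]+", "", x)   # drop digits
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation without inserting a space
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # trim leading and trailing blanks
}

clean_text("question(std txt rate)T&C's apply")
clean_text("Ok lar... Joking wif u oni...")
```

If fused tokens are a concern, one option is to replace punctuation with a space instead of an empty string before collapsing whitespace.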



### Creating a sparse matrix from the corpus
Each column corresponds to a distinct word<br>
Each row corresponds to an SMS message<br>
Each cell holds the number of times that word appears in that message<br>
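
To make the layout concrete, here is a minimal base-R sketch that builds a tiny document-term matrix by hand. The three messages and the names `docs`, `vocab` and `toy_dtm` are made up for illustration. `DocumentTermMatrix()` stores the same information in a sparse triplet format rather than a dense matrix.

```r
# Three toy messages (hypothetical, already cleaned).
docs <- c("free entry win", "ok lar", "win free free")

tokens <- strsplit(docs, " ", fixed = TRUE)   # split each message into words
vocab  <- sort(unique(unlist(tokens)))        # one column per distinct word

# Count how often each vocabulary word occurs in each message.
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))

# vapply() returns words x messages, so transpose to messages x words.
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(paste0("msg", seq_along(docs)), vocab)
toy_dtm

# Share of empty cells: most words do not occur in most messages.
mean(toy_dtm == 0)
```

Even in this toy case more than half of the cells are zero, which is why a sparse representation pays off on the real data.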

In [126]:
# install.packages("SnowballC") # if the DocumentTermMatrix() line fails, try installing "SnowballC"
corpus_clean = tm_map(corpus_clean, PlainTextDocument) # convert each document back to a PlainTextDocument, which also clears any associated metadata
dtm = DocumentTermMatrix(corpus_clean)
str(dtm)

List of 6
 $ i       : int [1:42889] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:42889] 250 484 916 918 1237 1511 2785 2823 3532 5106 ...
 $ v       : num [1:42889] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 5572
 $ ncol    : int 7892
 $ dimnames:List of 2
  ..$ Docs : chr [1:5572] "character(0)" "character(0)" "character(0)" "character(0)" ...
  ..$ Terms: chr [1:7892] "‰û÷ll" "‰û÷m" "‰û÷re" "‰û÷s" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"


### Split Data into Train and Test

In [127]:
set.seed(123) # fix the seed so the split is the same on every run
# split the raw data:
# randomly sample 75% of the row indices for the training set
sample <- sample.int( n = nrow(sms), size = floor(.75*nrow(sms)), replace = F)
sms.train <- sms[sample, ]
sms.test <- sms[ -sample, ]

### Splitting the Document-Term Matrix and Corpus

In [128]:
# now split the document-term matrix
dtm.train <- dtm[sample, ]
dtm.test <- dtm[-sample, ]

In [129]:
# now the corpus
corpus.train = corpus_clean[sample]
corpus.test = corpus_clean[-sample]

In [130]:
# let's see if our split is reasonable: raw data should have about 87% of ham
round(prop.table(table(sms.train$type))*100)
round(prop.table(table(sms.test$type))*100)


 ham spam 
  87   13 


 ham spam 
  87   13 

### Selecting High Frequency Words
Keep only words that appear at least 5 times in the training data
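
As a rough sketch of what the frequency filter does, the base-R snippet below keeps only the columns of a toy count matrix whose total count reaches the threshold. The matrix and the names `toy_counts`, `min_count` and `keep` are hypothetical. `findFreqTerms()` works on the sparse matrix directly but applies the same idea of a lower bound on total term frequency.

```r
# Toy count matrix (hypothetical): rows are messages, columns are words.
toy_counts <- matrix(c(3, 0, 1,
                       2, 1, 0,
                       1, 0, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(NULL, c("free", "lar", "win")))

min_count <- 5                                # same threshold as in the notebook
totals <- colSums(toy_counts)                 # total occurrences of each word
keep   <- names(totals)[totals >= min_count]  # words that reach the threshold

reduced <- toy_counts[, keep, drop = FALSE]   # drop = FALSE keeps the matrix shape
keep
```

Note that the threshold applies to the total count of a word, not to the number of messages that contain it.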

In [131]:
ncol(dtm.train)
ncol(dtm.test)

In [179]:
# The DTMs have more than 7000 columns
# To get a manageable matrix,
# keep only words that appear at least 5 times in the training messages
freq_terms = findFreqTerms(dtm.train,5)
reduced_dtm.train = DocumentTermMatrix(corpus.train, list(dictionary = freq_terms))
reduced_dtm.test = DocumentTermMatrix(corpus.test, list(dictionary = freq_terms))

In [180]:
ncol(reduced_dtm.train)
ncol(reduced_dtm.test)

### Convert numeric values in dtm to categorical factors
`naiveBayes()` treats numeric predictors as Gaussian, so the word counts are converted to a categorical "Yes"/"No" (word present or absent)

In [181]:
# Converting DTM Numerics to factor
convert_count = function(x) {
    x = ifelse(x > 0, 1, 0)
    x = factor(x, levels = c(0,1), labels=c("No","Yes"))
    return (x)
}

In [182]:
# apply() allows us to work either with rows or columns of a matrix.
# MARGIN = 1 is for rows, and 2 for columns
reduced_dtm.train = apply(reduced_dtm.train, MARGIN = 2, convert_count)
reduced_dtm.test = apply(reduced_dtm.test, MARGIN = 2, convert_count)
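
A small self-contained check of the conversion on a toy matrix may help. `toy_counts` and `toy_flags` are hypothetical names, and `convert_count` is repeated here only so the snippet runs on its own. One detail worth knowing is that `apply()` flattens the factors it receives, so the result is a character matrix of "No"/"Yes" rather than a set of factor columns.

```r
# Same conversion as in the notebook, repeated so this snippet is self-contained.
convert_count <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Toy counts (hypothetical): 2 messages x 2 words, filled column by column.
toy_counts <- matrix(c(2, 0, 0, 1), nrow = 2,
                     dimnames = list(NULL, c("free", "lar")))

toy_flags <- apply(toy_counts, MARGIN = 2, convert_count)
toy_flags
class(toy_flags[, "free"])   # character, not factor
```

Any count above zero becomes "Yes", so a word that appears twice in a message is treated the same as a word that appears once.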

### Run Naïve Bayes Classification

In [183]:
#!install.packages("e1071")
library(e1071)
#?naiveBayes

sms_classifier = naiveBayes(reduced_dtm.train, sms.train$type)
# see how each word relates to ham or spam: P(word absent / present | class)
sms_classifier$tables[1:205]

$`‰û÷s`
              ‰û÷s
sms.train$type          No         Yes
          ham  0.998894722 0.001105278
          spam 1.000000000 0.000000000

$`‰ûò`
              ‰ûò
sms.train$type          No         Yes
          ham  0.998618403 0.001381597
          spam 1.000000000 0.000000000

$`å£wk`
              å£wk
sms.train$type          No         Yes
          ham  1.000000000 0.000000000
          spam 0.991071429 0.008928571

$abiola
              abiola
sms.train$type          No         Yes
          ham  0.997236806 0.002763194
          spam 1.000000000 0.000000000

$able
              able
sms.train$type          No         Yes
          ham  0.994197292 0.005802708
          spam 1.000000000 0.000000000

$abt
              abt
sms.train$type          No         Yes
          ham  0.994473611 0.005526389
          spam 1.000000000 0.000000000

$accept
              accept
sms.train$type          No         Yes
          ham  0.998618403 0.001381597
          spam 1.000000000 0.

In [184]:
# Predict using the classifier
sms_test.predicted = predict(sms_classifier,
                            reduced_dtm.test)
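
To illustrate how the per-word tables turn into a prediction, here is a hand-computed sketch of the naive Bayes rule for one message. All probabilities are made up for illustration and are not taken from the fitted model. The names `prior`, `p_yes` and `present` are hypothetical.

```r
# Hypothetical class priors.
prior <- c(ham = 0.87, spam = 0.13)

# Hypothetical P(word present | class) for two words.
p_yes <- rbind(free = c(ham = 0.01, spam = 0.20),
               call = c(ham = 0.05, spam = 0.30))

# A message that contains "free" but not "call".
present <- c(free = TRUE, call = FALSE)

# Use P(Yes) for present words and P(No) = 1 - P(Yes) for absent words.
cond <- p_yes
cond[!present, ] <- 1 - p_yes[!present, ]

# Naive assumption: words are independent given the class, so multiply.
lik <- apply(cond, 2, prod)

# Posterior is proportional to prior times likelihood; normalise to sum to 1.
posterior <- prior * lik / sum(prior * lik)
predicted <- names(which.max(posterior))
posterior
predicted
```

With thousands of words the product of probabilities becomes very small, so practical implementations usually add log-probabilities instead of multiplying.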

### Model Evaluation

In [185]:
# Crosstable to build confusion matrix
# checking the accuracy of the model
library(gmodels)
CrossTable(sms_test.predicted,
          sms.test$type,
          prop.chisq = FALSE, # omit chi-square contributions
          prop.t = FALSE, # omit table (cell) proportions
          dnn = c("predicted", "actual"))


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1393 

 
             | actual 
   predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1201 |        20 |      1221 | 
             |     0.984 |     0.016 |     0.877 | 
             |     0.996 |     0.107 |           | 
-------------|-----------|-----------|-----------|
        spam |         5 |       167 |       172 | 
             |     0.029 |     0.971 |     0.123 | 
             |     0.004 |     0.893 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1206 |       187 |      1393 | 
             |     0.866 |     0.134 |           | 
-------------|-----------|-----------|-----------|
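
The cross table above can be summarised with a few standard metrics. The sketch below re-enters its four counts by hand and treats spam as the positive class. The names `cm`, `accuracy`, `precision` and `recall` are chosen here for illustration.

```r
# Counts copied from the cross table above (rows = predicted, columns = actual).
cm <- matrix(c(1201,  20,
                  5, 167),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("ham", "spam"),
                             actual    = c("ham", "spam")))

accuracy  <- sum(diag(cm)) / sum(cm)                 # share of correct predictions
precision <- cm["spam", "spam"] / sum(cm["spam", ])  # predicted spam that really is spam
recall    <- cm["spam", "spam"] / sum(cm[, "spam"])  # actual spam that was caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Precision and recall correspond to the 0.971 row proportion and the 0.893 column proportion in the spam/spam cell of the table.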

 


### Improve Model
We'll try to improve the model with Laplace smoothing.<br>
The model incorrectly flags 5 legitimate (ham) messages as spam. <br>They could be important, so let's try to reduce this number.
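
Laplace (additive) smoothing adds a small constant to every count so that no conditional probability is exactly zero. The sketch below applies the usual formula to one term. The class sizes of 560 spam and 3619 ham training messages are an assumption inferred from the unsmoothed tables (for example 5/560 = 0.008928571 for the `å£wk` term), and `smoothed` is a hypothetical helper name.

```r
# Additive smoothing for a feature with n_levels categories (here No / Yes):
#   P(Yes | class) = (count_yes + laplace) / (n_class + laplace * n_levels)
smoothed <- function(count_yes, n_class, laplace, n_levels = 2) {
  (count_yes + laplace) / (n_class + laplace * n_levels)
}

# Assumed counts, inferred from the unsmoothed tables in this notebook:
# the term occurs in 5 of 560 spam messages and in 0 of 3619 ham messages.
smoothed(count_yes = 5, n_class = 560,  laplace = 0)    # unsmoothed spam estimate
smoothed(count_yes = 5, n_class = 560,  laplace = 0.1)  # smoothed spam estimate
smoothed(count_yes = 0, n_class = 3619, laplace = 0.1)  # smoothed ham estimate, no longer zero
```

Under these assumed counts the results agree with the probabilities reported for this term by the `laplace = 0.1` model in this notebook, about 0.0091 for spam and 2.76e-05 for ham.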

In [186]:
# Attempt at an improved model, using Laplace smoothing
sms_classifier_l = naiveBayes(reduced_dtm.train, sms.train$type, laplace = 0.1)
sms_classifier_l$tables[1:5]
sms_test.predicted_l = predict(sms_classifier_l,
                              reduced_dtm.test)

$`‰û÷s`
              ‰û÷s
sms.train$type           No          Yes
          ham  0.9988671530 0.0011328470
          spam 0.9998214923 0.0001785077

$`‰ûò`
              ‰ûò
sms.train$type           No          Yes
          ham  0.9985908488 0.0014091512
          spam 0.9998214923 0.0001785077

$`å£wk`
              å£wk
sms.train$type           No          Yes
          ham  9.999724e-01 2.763042e-05
          spam 9.908961e-01 9.103891e-03

$abiola
              abiola
sms.train$type           No          Yes
          ham  0.9972093280 0.0027906720
          spam 0.9998214923 0.0001785077

$able
              able
sms.train$type           No          Yes
          ham  0.9941699823 0.0058300177
          spam 0.9998214923 0.0001785077


In [192]:
# See cross table of new model
CrossTable(sms_test.predicted_l,
          sms.test$type,
          prop.chisq = FALSE,
          prop.t = FALSE,
          dnn = c("predicted","actual"))

# Ham messages flagged as spam drop from 5 to 3; missed spam stays at 20


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1393 

 
             | actual 
   predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1203 |        20 |      1223 | 
             |     0.984 |     0.016 |     0.878 | 
             |     0.998 |     0.107 |           | 
-------------|-----------|-----------|-----------|
        spam |         3 |       167 |       170 | 
             |     0.018 |     0.982 |     0.122 | 
             |     0.002 |     0.893 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1206 |       187 |      1393 | 
             |     0.866 |     0.134 |           | 
-------------|-----------|-----------|-----------|

 


In [191]:
# Compare errors with and without Laplace smoothing
# list the test messages the model got wrong (actual ham, predicted spam)
sms.test$pred = sms_test.predicted
Error_test <- subset(sms.test, type == "ham" & pred == "spam")
Error_test

sms.test$pred = sms_test.predicted_l
Error_test_l <- subset(sms.test, type == "ham" & pred == "spam")
Error_test_l
# with Laplace smoothing, fewer ham messages are flagged as spam (3 instead of 5)

,type,text,pred
,<fct>,<chr>,<fct>
495,ham,Are you free now?can i call now?,spam
2652,ham,"Text me when you get off, don't call, my phones having problems",spam
3117,ham,Now am free call me pa.,spam
4221,ham,"Plz note: if anyone calling from a mobile Co. &amp; asks u to type # &lt;#&gt; or # &lt;#&gt; . Do not do so. Disconnect the call,coz it iz an attempt of 'terrorist' to make use of the sim card no. Itz confirmd by nokia n motorola n has been verified by CNN IBN.",spam
5325,ham,"Dear Sir,Salam Alaikkum.Pride and Pleasure meeting you today at the Tea Shop.We are pleased to send you our contact number at Qatar.Rakhesh an Indian.Pls save our Number.Respectful Regards.",spam


,type,text,pred
,<fct>,<chr>,<fct>
495,ham,Are you free now?can i call now?,spam
2652,ham,"Text me when you get off, don't call, my phones having problems",spam
3117,ham,Now am free call me pa.,spam
