# Naive Bayes

The Naive Bayes classifier is a supervised classifier based on ideas from Thomas Bayes, an 18th-century statistician, philosopher and Presbyterian minister.

### Probability
The probability of an event occurring is estimated from observed data by dividing the
number of trials in which the event of interest occurred by the total number of trials.
For instance, if heads appears in 3 out of 10 coin flips, then the probability that the next coin flip will be heads is estimated as 3/10 = 0.3 or 30%.
The probabilities of all possible outcomes of a trial must always sum to 1. In our example, the estimated probability of heads is 0.3, implying that the probability of not getting heads (that is, of getting tails) is 1 - 0.3 = 0.7 or 70%. This works because heads and tails are **mutually exclusive and exhaustive events**: they cannot occur at the same time and they are the only possible outcomes.
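
As a minimal sketch of this estimate in R, the snippet below uses a hypothetical vector of ten recorded flips (the `flips` data is invented for illustration and is not part of the SMS dataset):

```r
# Hypothetical record of 10 coin flips: 3 heads ("H") and 7 tails ("T")
flips <- c("H", "T", "T", "H", "T", "T", "T", "H", "T", "T")

# Estimated probability = trials with the event / total trials
p_head <- sum(flips == "H") / length(flips)

# Heads and tails are mutually exclusive and exhaustive,
# so their probabilities sum to 1
p_tail <- 1 - p_head

print(c(heads = p_head, tails = p_tail))
```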

### Conditional Probability with Bayes' theorem

P(A|B) = P(A ∩ B) / P(B)

The formula above describes the relationship between dependent events. It is the definition of conditional probability; since P(A ∩ B) = P(B|A) P(A), it can be rewritten as P(A|B) = P(B|A) P(A) / P(B), the form commonly known as Bayes' theorem.
P(A|B) is read as "the probability of event A given that event B has occurred". This is known as conditional probability, since the probability of A occurring depends on what happened with event B.

*Bayes' theorem states that the best estimate of P(A|B) is the proportion of trials in which A occurred with B out of all the trials in which B occurred*
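
To make the proportion concrete, here is a small sketch with hypothetical message counts (all four counts are invented for illustration). It computes P(spam | "free") both directly as a proportion and through Bayes' theorem, and the two routes should agree:

```r
# Hypothetical counts, invented for illustration only
n_total         <- 100  # all messages
n_spam          <- 20   # messages labelled spam
n_free          <- 5    # messages containing the word "free"
n_spam_and_free <- 4    # spam messages containing "free"

# Direct definition: P(spam | free) = P(spam and free) / P(free)
p_direct <- (n_spam_and_free / n_total) / (n_free / n_total)

# Bayes' theorem: P(spam | free) = P(free | spam) * P(spam) / P(free)
p_free_given_spam <- n_spam_and_free / n_spam
p_spam            <- n_spam / n_total
p_free            <- n_free / n_total
p_bayes           <- p_free_given_spam * p_spam / p_free

print(c(direct = p_direct, bayes = p_bayes))
```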

### The strengths and weaknesses of the Naive Bayes algorithm
The "naive" in Naive Bayes refers to the "naive" assumption that all features in the dataset are equally important and independent. These assumptions are rarely true in real-world applications. Nevertheless, the algorithm is often very effective in practice.
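
The independence assumption is what keeps the calculation simple: the likelihood of a whole message becomes a product of per-word likelihoods. The sketch below illustrates this for a message containing two words; the prior and the likelihood values are hypothetical numbers chosen for illustration, not estimates from the SMS data.

```r
# Hypothetical class priors, invented for illustration
prior <- c(ham = 0.8, spam = 0.2)

# Hypothetical P(word present | class) for two words
likelihood <- rbind(
  free  = c(ham = 0.05, spam = 0.40),
  today = c(ham = 0.20, spam = 0.15)
)

# Naive assumption: words are independent given the class, so the
# joint likelihood is the product of the per-word likelihoods
unnormalised <- prior * apply(likelihood, 2, prod)

# Normalise so the posterior probabilities sum to 1
posterior <- unnormalised / sum(unnormalised)
print(round(posterior, 3))
```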

**Pros**
- Simple, fast, and very effective
- Does well with noisy and missing data
- Requires relatively few examples for training, but also works well with very large numbers of examples

**Cons**
- Relies on an often-faulty assumption of equally important and independent features
- Not ideal for datasets with many numeric features

### Hands-on session

In this hands-on session, we will classify SMS messages as either ham (good) or spam (bad) using a Naive Bayes classifier.
The data used for this session is available at the URL below:

http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

## Collecting the dataset

In [1]:
library(data.table)

"package 'data.table' was built under R version 3.5.3"

In [23]:
# Base R alternative:
# ham_spam_SMS_messages <- read.table(file.choose(), sep = "\t", stringsAsFactors = FALSE, header = FALSE, quote = "")
# Read the SMS file chosen interactively; it has no header row
ham_spam_SMS_messages <- fread(file.choose(), header = FALSE)



## Exploring and preparing the dataset

In [24]:
# view the first few records of the imported dataset
head(ham_spam_SMS_messages)

V1,V2
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"


In [25]:
# Give proper names to our dataframe rather than the default V1 and V2
names(ham_spam_SMS_messages) <- c("Type", "Text")

In [26]:
# verify the names have been applied
head(ham_spam_SMS_messages)

Type,Text
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"


In [27]:
# View the structure of the dataset
str(ham_spam_SMS_messages)

Classes 'data.table' and 'data.frame':	5574 obs. of  2 variables:
 $ Type: chr  "ham" "ham" "spam" "ham" ...
 $ Text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [28]:
# Since this is a classification problem, let's convert the Type field to a factor
ham_spam_SMS_messages$Type <- factor(ham_spam_SMS_messages$Type)

In [29]:
# View the converted field
str(ham_spam_SMS_messages$Type)

 Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...


In [None]:
# Re-applying factor() drops any unused levels
ham_spam_SMS_messages$Type <- factor(ham_spam_SMS_messages$Type)

In [None]:
# View the converted field again to confirm the only factor levels are ham and spam
str(ham_spam_SMS_messages$Type)

In [31]:
table(ham_spam_SMS_messages$Type)


 ham spam 
4827  747 

## Data cleaning and standardization of text data

In [32]:
# Create a corpus for the SMS text field
library(tm)
ham_spam_SMS_messages_corpus <- VCorpus(VectorSource(ham_spam_SMS_messages$Text))

"package 'tm' was built under R version 3.5.3"
Loading required package: NLP
"package 'NLP' was built under R version 3.5.2"

In [33]:
ham_spam_SMS_messages_corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5574

In [34]:
# Print a brief summary of the first three documents in the corpus
inspect(ham_spam_SMS_messages_corpus[1:3])

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 111

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 29

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 155



In [35]:
# head() on a corpus prints only a summary, not the message text
head(ham_spam_SMS_messages_corpus)

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

In [37]:
# To view the text of a single message, convert the document with as.character()
as.character(ham_spam_SMS_messages_corpus[[2]])

In [38]:
# To view the first 5 SMS messages, use lapply() with as.character
lapply(ham_spam_SMS_messages_corpus[1:5], as.character)

In [39]:
# As part of data preparation, standardize the text by converting it all to lower case
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_messages_corpus, content_transformer(tolower))

In [41]:
lapply(ham_spam_SMS_corpus_clean[1:5], as.character)

In [43]:
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removeNumbers) # remove numbers
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removeWords, stopwords()) # remove stop words
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, removePunctuation) # remove punctuation

In [None]:
library(SnowballC)
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, stemDocument) # stem the words
ham_spam_SMS_corpus_clean <- tm_map(ham_spam_SMS_corpus_clean, stripWhitespace) # remove white spaces
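
To see what these cleaning steps do to a single message, the sketch below applies rough base-R equivalents to a hypothetical string. This is an illustration only: it is not how `tm` implements its transformations, and stop-word removal and stemming are left out.

```r
# Hypothetical raw message, invented for illustration
msg <- "WINNER!! Call 0800 now  to claim"

x <- tolower(msg)                  # lower-case everything
x <- gsub("[[:digit:]]+", "", x)   # drop numbers
x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
x <- trimws(x)                     # trim leading/trailing spaces

print(x)
```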

In [45]:
# Quick check of what stemming does to a sample sentence
tex <- "am going to footballing today's"
texv2 <- VCorpus(VectorSource(tex))
test <- tm_map(texv2, stemDocument)
lapply(test, as.character)

**Document Term Matrix (DTM)**<br>This is a data structure in which rows indicate documents (SMS messages) and columns indicate terms (words).

**Term Document Matrix (TDM)**<br>This is simply the transpose of the DTM: the rows are terms and the columns are documents. It is ideal for cases where the number of documents is small while the word list is large.
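
The two layouts can be illustrated with a tiny hand-built example. The sketch below uses three hypothetical messages and base R's `table()` rather than the `tm` package, so it only mirrors the shape of the structures described above:

```r
# Three tiny hypothetical "messages", invented for illustration
docs <- c("free prize call now", "call me later", "free free entry")

# Split each message into words and record which document each came from
words  <- strsplit(docs, " ")
doc_id <- rep(seq_along(words), lengths(words))

# DTM: rows = documents, columns = terms, cells = term counts
dtm <- table(doc = doc_id, term = unlist(words))
print(dtm)

# TDM: the transpose, with rows = terms and columns = documents
tdm <- t(dtm)
print(dim(tdm))
```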

In [46]:
# Create the DTM 
ham_spam_SMS_corpus_dtm <- DocumentTermMatrix(ham_spam_SMS_corpus_clean)

In [47]:
ham_spam_SMS_corpus_dtm

<<DocumentTermMatrix (documents: 5574, terms: 7983)>>
Non-/sparse entries: 43163/44454079
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)

In [48]:
# To view the DTM run the code below
ham_spam_SMS_corpus_dtmMatrix <- as.matrix(ham_spam_SMS_corpus_dtm)
ham_spam_SMS_corpus_dtmMatrix[1:10, 1:12]

Unnamed: 0,â‘morrow,â‘rents,â“harry,â£â£,â£award,â£call,â£ea,â£k,â£million,â£minmobsmorelkpoboxhpfl,â£month,â£morefrmmob
1,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,0,0


#### Split the ham_spam_SMS_corpus_dtm to Train and Test Sets

In [49]:
ham_spam_SMS_train <- ham_spam_SMS_corpus_dtm[1:3902, ]
ham_spam_SMS_test <- ham_spam_SMS_corpus_dtm[3903:5574, ]

In [50]:
ham_spam_SMS_train_labels <- ham_spam_SMS_messages[1:3902, ]$Type
ham_spam_SMS_test_labels <- ham_spam_SMS_messages[3903:5574, ]$Type

In [52]:
# Keep only the terms that occur at least 5 times in the training set
sms_freq_words <- findFreqTerms(ham_spam_SMS_train, 5)

In [53]:
sms_dtm_freq_train <- ham_spam_SMS_train[ , sms_freq_words]
sms_dtm_freq_test <- ham_spam_SMS_test[ , sms_freq_words]

In [54]:
# Convert term counts to a categorical "Yes"/"No" presence indicator
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

The apply() function below allows a function to be applied to each row or column of a matrix or data frame. Its MARGIN parameter specifies whether to work over rows or columns. Here we use MARGIN = 2, since we are interested in the columns (MARGIN = 1 is used for rows).

In [55]:
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
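
To see what this column-wise conversion produces, the sketch below runs the same idea on a small hypothetical count matrix (the `counts` values are invented for illustration):

```r
# Same helper as in the session: counts > 0 become "Yes", otherwise "No"
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Hypothetical 3 x 2 count matrix (3 messages, 2 terms), filled by column
counts <- matrix(c(0, 2, 1,
                   3, 0, 0),
                 nrow = 3,
                 dimnames = list(NULL, c("call", "free")))

# MARGIN = 2 applies the function to each column in turn
flags <- apply(counts, MARGIN = 2, convert_counts)
print(flags)
```

The result is a character matrix with the same shape and column names as the input, which is the form passed to the classifier in the next step.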

In [57]:
# Train the model; laplace defaults to 0 (no smoothing)
library(e1071)
sms_classifier <- naiveBayes(sms_train, ham_spam_SMS_train_labels)

"package 'e1071' was built under R version 3.5.2"

### Evaluating model performance
To evaluate the SMS classifier, we need to test its predictions on unseen messages in the test data. The predict() function is used to make predictions, and the result is stored in the sms_test_pred object.

In [58]:
sms_test_pred <- predict(sms_classifier, sms_test)

To compare the predictions to the true values, we'll use the CrossTable() function in the gmodels package.

In [59]:
library(gmodels)
CrossTable(sms_test_pred, ham_spam_SMS_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))

"package 'gmodels' was built under R version 3.5.3"


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1672 

 
             | actual 
   predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1437 |        28 |      1465 | 
             |     0.981 |     0.019 |     0.876 | 
             |     0.995 |     0.123 |           | 
-------------|-----------|-----------|-----------|
        spam |         7 |       200 |       207 | 
             |     0.034 |     0.966 |     0.124 | 
             |     0.005 |     0.877 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1444 |       228 |      1672 | 
             |     0.864 |     0.136 |           | 
-------------|-----------|-----------|-----------|

 


Looking at the table, only 7 + 28 = 35 of the 1,672 SMS messages were incorrectly classified (2.1 percent). The errors comprise 7 of the 1,444 ham messages misidentified as spam and 28 of the 228 spam messages incorrectly labeled as ham, giving the classifier an overall accuracy of approximately 98%.
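
The same figures can be recomputed from the four cell counts. The sketch below copies the counts from the CrossTable output into a matrix by hand (the names `conf_mat`, `precision_spam` and `recall_spam` are introduced here for illustration) and derives accuracy along with precision and recall for the spam class:

```r
# Cell counts copied from the CrossTable output (filled by column)
conf_mat <- matrix(c(1437, 7,
                     28, 200),
                   nrow = 2,
                   dimnames = list(predicted = c("ham", "spam"),
                                   actual    = c("ham", "spam")))

# Accuracy: correctly classified messages / all messages
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Precision for spam: of the messages predicted spam, how many were spam
precision_spam <- conf_mat["spam", "spam"] / sum(conf_mat["spam", ])

# Recall for spam: of the actual spam messages, how many were caught
recall_spam <- conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])

print(round(c(accuracy = accuracy,
              precision = precision_spam,
              recall = recall_spam), 3))
```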

In [None]:
help(naiveBayes)

## Improving model performance

In [None]:
# Retrain with Laplace smoothing (try laplace = 1, 2, or 3) and check whether the model improves
sms_classifier2 <- naiveBayes(sms_train, ham_spam_SMS_train_labels, laplace = 1)

In [None]:
sms_test_pred2 <- predict(sms_classifier2, sms_test)

In [None]:
CrossTable(sms_test_pred2, ham_spam_SMS_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
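
The motivation for the `laplace` argument is that a word never seen in one class during training would otherwise get a conditional probability of exactly zero, which wipes out the whole product for that class. The sketch below shows the effect on hypothetical counts; the smoothing formula used is a common formulation assumed here for illustration, not taken from the e1071 source.

```r
# Hypothetical counts of messages containing the word "prize", by class
n_with_word <- c(ham = 0, spam = 12)
n_messages  <- c(ham = 100, spam = 50)

# Without smoothing, P(word | ham) is exactly 0
p_raw <- n_with_word / n_messages

# Assumed smoothing rule: add `laplace` to each count; the denominator
# grows by `laplace` per possible feature value (here 2: "Yes" and "No")
laplace  <- 1
p_smooth <- (n_with_word + laplace) / (n_messages + 2 * laplace)

print(rbind(raw = p_raw, smoothed = p_smooth))
```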