# Naive Bayes Classifier for Spam Detection

## Instructions

Total Points: 10

Complete this notebook and submit it. The notebook needs to be a complete project report with 

* your implementation,
* documentation including a short discussion of how your implementation works and your design choices, and
* experimental results (e.g., tables and charts with simulation results) with a short discussion of what they mean. 

Use the provided notebook cells and insert additional code and markdown cells as needed.

## Introduction

A spam detection agent gets as its percepts text messages and needs to decide if they are spam or not.
Create a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for the 
[UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to perform this task.

## Create a bag-of-words representation of the text messages [3 Points]

The first step is to tokenize the text. Here is an example of how to use the [natural language tool kit (nltk)](https://www.nltk.org/) to create tokens (separate words).

In [1]:
import nltk
# You need to install nltk and then download the tokenizer once.
# nltk.download('punkt')

file = open("smsspamcollection/SMSSpamCollection", "r")

sentence = file.readline()
print(f"text message: \"{sentence}\"")

tokens = nltk.word_tokenize(sentence)

print(f"tokens: {tokens}")

text message: "ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
"
tokens: ['ham', 'Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


Experiment with removing frequent words (called [stopwords](https://en.wikipedia.org/wiki/Stop_word)) and very infrequent words so you end up with a reasonable number of words used in the classifier. Maybe you need to remove digits or all non-letter characters. You may also use a stemming algorithm. 

Convert the tokenized data into a data structure that indicates for each document what words it contains. The data structure can be a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) with 0s and 1s, a pandas dataframe or some sparse matrix structure. Make sure the data structure can be used to split the data into training and test documents (see below).

Report the 20 most frequent and the 20 least frequent words in your data set.

In [2]:
from nltk.probability import FreqDist

In [3]:
def getFreq(tokens):
    fdist = FreqDist(tokens)
    print('Most Common:')
    print(fdist.most_common(20))
    print('Least Common:')
    print(fdist.most_common()[-20:])

Download 'stopwords', 'punkt', and 'wordnet' from nltk

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hhe99\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hhe99\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hhe99\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# English stop-word list shipped with nltk
from nltk.corpus import stopwords

In [8]:
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [9]:
stop = set(stopwords.words('english'))

In [10]:
porter = PorterStemmer()
lmtzr = WordNetLemmatizer()

In [11]:
labels=[]
messages=[]

In [12]:
while True:
    line = file.readline()
    if not line:
        break
    words = nltk.word_tokenize(line)
    # the first token is the label
    labels.append(words.pop(0))
    words_clean = []
    for word in words:
        # keep purely alphabetic tokens (drops punctuation and digits)
        if word.isalpha():
            # lower-case, lemmatize, and stem the word
            word = lmtzr.lemmatize(word.lower())
            word = porter.stem(word)
            # remove stop words (the test is applied to the stemmed form)
            if word not in stop:
                words_clean.append(porter.stem(word))
    messages.append(words_clean)
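
In the loop above the stop-word test runs after stemming, so stop words whose stem differs from the listed form are kept (`thi` and `onli` show up in the top-20 lists later in this notebook), and every kept word is passed through the stemmer twice. The first line of the file was also already consumed by the `readline()` call in the first cell, so that message never reaches the loop. Below is a minimal sketch of an alternative ordering. `parse_line` and `clean_tokens` are hypothetical helper names, and the tiny stop-word set and toy stemmer only stand in for `stopwords.words('english')` and `porter.stem`. All results reported in this notebook come from the original loop, not from this sketch.

```python
def parse_line(line):
    """Split one raw line into (label, text); the data file is tab-separated."""
    label, _, text = line.rstrip("\n").partition("\t")
    return label, text


def clean_tokens(tokens, stop_words, stem):
    """Lower-case, keep alphabetic tokens, drop stop words, then stem once."""
    cleaned = []
    for token in tokens:
        word = token.lower()
        if not word.isalpha():
            continue
        if word in stop_words:  # test the unstemmed form against the list
            continue
        cleaned.append(stem(word))
    return cleaned


# Toy stand-ins, assumptions for illustration only.
toy_stop_words = {"this", "only", "in", "is"}
toy_stem = lambda w: w[:-1] if w.endswith("s") else w

label, text = parse_line("ham\tThis is only 4 tests in bugis\n")
tokens = text.split()  # the notebook would use nltk.word_tokenize(text)
cleaned = clean_tokens(tokens, toy_stop_words, toy_stem)
print(label, cleaned)  # ham ['test', 'bugi']
```

In the notebook, iterating with `for line in open(path)` (or calling `file.seek(0)` before the loop) would include the first message as well.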

Convert the tokenized text to a dataframe

In [13]:
import pandas as pd
df = pd.DataFrame(list(zip(labels, messages)), columns=['Label', 'Message'])

In [14]:
df.head()

| | Label | Message |
|---|---|---|
| 0 | ham | [ok, lar, joke, wif, u, oni] |
| 1 | spam | [free, entri, wkli, comp, win, fa, cup, final,... |
| 2 | ham | [u, dun, say, earli, hor, u, c, alreadi, say] |
| 3 | ham | [nah, think, go, usf, life, around, though] |
| 4 | spam | [freemsg, hey, darl, week, word, back, like, f... |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5573 entries, 0 to 5572
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5573 non-null   object
 1   Message  5573 non-null   object
dtypes: object(2)
memory usage: 43.6+ KB
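
The dataframe above stores each message as a list of tokens. The assignment also mentions a 0/1 document-term matrix; the following is a minimal sketch of how such a matrix could be built from token lists like those in the `Message` column. The function name `build_document_term_matrix` and the toy messages are assumptions for illustration, and plain Python lists are used instead of a sparse structure.

```python
def build_document_term_matrix(messages):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if message i
    contains vocabulary[j] at least once, else 0."""
    vocabulary = sorted({word for message in messages for word in message})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for message in messages:
        row = [0] * len(vocabulary)
        for word in message:
            row[column[word]] = 1  # presence only, repeated words stay 1
        matrix.append(row)
    return vocabulary, matrix


toy_messages = [["free", "entri", "win"], ["ok", "lar"], ["free", "ok", "free"]]
vocab, dtm = build_document_term_matrix(toy_messages)
print(vocab)
for row in dtm:
    print(row)
```

Each row corresponds to one message, so rows can be split into training and test sets in the same way as the dataframe rows.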


## Learn parameters [3 Points]

Use 80% of the data (called training set; randomly chosen) to learn the parameters of the naive Bayes classifier (prior probabilities and likelihoods). Use [Laplacian smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) for the counts.

Report the top 20 words (highest conditional probability) for spam and for ham (i.e., not spam).

In [16]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=1)

In [17]:
train = train.reset_index()
test = test.reset_index()
train.drop('index', axis=1, inplace=True)
test.drop('index', axis=1, inplace=True)

Calculate the constants needed for the classifier

In [18]:
spam = [word for message in list(train[train['Label']=='spam']['Message']) for word in message]
ham = [word for message in list(train[train['Label']=='ham']['Message']) for word in message]

messages = list(train['Message'])
# total number of words
n_total = len([word for message in messages for word in message])

# total number of words in spam
n_spam = len(spam)

# total number of words in ham
n_ham = len(ham)

# P(Spam) and P(Ham), estimated here from each class's share of all words
p_spam = n_spam / n_total
p_ham = n_ham / n_total

In [19]:
n_total

38870

In [20]:
n_spam

7674

In [21]:
n_ham

31196

In [22]:
p_spam

0.19742732184203757

In [23]:
p_ham

0.8025726781579624
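
The priors in cell In [18] are each class's share of the words in the training set. A more common definition of the prior in naive Bayes is each class's share of the messages. A minimal sketch of that variant follows; `class_priors` is a hypothetical helper and the labels are toy data, so the numbers are not results from this data set.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(class) as the fraction of messages carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


toy_labels = ["ham", "ham", "spam", "ham"]
priors = class_priors(toy_labels)
print(priors)  # {'ham': 0.75, 'spam': 0.25}
```

In the notebook this could be called as `class_priors(train['Label'])`.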

In [24]:
# Unique words
vocabulary = set([word for message in messages for word in message])

In [25]:
n_vocabulary = len(vocabulary)

In [26]:
# number of unique words
n_vocabulary

5268

In [27]:
# Laplace smoothing
alpha=1
n_classes=2

Calculate parameters

In [28]:
params_spam = {}
params_ham = {}

In [29]:
for word in vocabulary:
    # number of occurrences of this word in spam messages
    n_word_spam = spam.count(word)
    # smoothed probability of the word given that the message is spam
    p_word_spam = (n_word_spam + alpha) / (n_spam + n_classes)
    params_spam[word] = p_word_spam
    
    n_word_ham = ham.count(word)
    p_word_ham = (n_word_ham + alpha) / (n_ham + n_classes)
    params_ham[word] = p_word_ham
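
The cell above adds `n_classes` to the denominator. In its usual form, additive (Laplace) smoothing adds `alpha` to every word count and `alpha` times the vocabulary size to the class's total word count, so that the smoothed probabilities over the vocabulary sum to 1. A minimal sketch of that form follows; `smoothed_likelihoods` is a hypothetical helper, the data is a toy example, and a `Counter` replaces the repeated `list.count` calls. The probabilities reported below were produced by the original cell, not by this sketch.

```python
from collections import Counter


def smoothed_likelihoods(class_words, vocabulary, alpha=1):
    """P(word | class) with additive smoothing over the whole vocabulary."""
    counts = Counter(class_words)
    denominator = len(class_words) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


toy_vocabulary = {"free", "call", "ok"}
toy_spam_words = ["free", "free", "call"]
toy_params = smoothed_likelihoods(toy_spam_words, toy_vocabulary)
print(toy_params["free"])  # 0.5
print(sum(toy_params.values()))
```

In the notebook the call would look like `smoothed_likelihoods(spam, vocabulary, alpha)`.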

In [30]:
sorted_params_spam = sorted(params_spam.items(), key=lambda x: x[1], reverse=True)

In [31]:
sorted_params_ham = sorted(params_ham.items(), key=lambda x:x[1], reverse=True)

In [32]:
def getTop20(arr):
    for count, ele in enumerate(arr):
        if count >= 20:
            break
        print(ele)

In [33]:
getTop20(sorted_params_spam)

('call', 0.0385617509119333)
('free', 0.02123501823866597)
('u', 0.01706618030224075)
('txt', 0.016545075560187597)
('ur', 0.015242313705054716)
('text', 0.0143303804064617)
('mobil', 0.014069828035435123)
('stop', 0.012245961438249088)
('claim', 0.011073475768629494)
('repli', 0.011073475768629494)
('prize', 0.009900990099009901)
('c', 0.009770713913496612)
('thi', 0.009119332985930172)
('get', 0.008728504429390308)
('servic', 0.008467952058363731)
('onli', 0.008337675872850442)
('send', 0.007556018759770714)
('tone', 0.00716519020323085)
('cash', 0.0070349140177175615)
('award', 0.0070349140177175615)


In [34]:
getTop20(sorted_params_ham)

('u', 0.026764536188217194)
('go', 0.011378934547086352)
('get', 0.009487787678697353)
('gt', 0.008365920892364894)
('lt', 0.008333867555612539)
('come', 0.007660747483813065)
('call', 0.007212000769280082)
('thi', 0.006795307391499455)
('ok', 0.006699147381242387)
('like', 0.006570934034232963)
('know', 0.0065388806974806075)
('love', 0.006506827360728252)
('ur', 0.006506827360728252)
('good', 0.006410667350471184)
('got', 0.0063786140137188285)
('wa', 0.006250400666709404)
('time', 0.005865760625681133)
('day', 0.00573754727867171)
('want', 0.005705493941919354)
('need', 0.004487467145329829)


In [35]:
# Classify a message
def classify(message):
    # initialize the probability to the prior probability
    p_message_spam = p_spam
    p_message_ham = p_ham
    for word in message:
        if word in params_spam:
            p_message_spam *= params_spam[word]
        if word in params_ham:
            p_message_ham *= params_ham[word]
    
    if p_message_spam > p_message_ham:
        return ['spam', p_message_spam, p_message_ham]
    else:
        return ['ham', p_message_spam, p_message_ham]
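
`classify` multiplies many small probabilities, which gives scores as small as 1e-50 in the output below and can underflow to zero for long messages. Summing logarithms instead keeps the comparison the same while avoiding that problem. This is a minimal sketch, assuming the priors and likelihoods are passed in as dictionaries; `classify_log` and the toy parameters are hypothetical. As in the original function, words without learned parameters are skipped.

```python
import math


def classify_log(message, priors, params_spam, params_ham):
    """Compare log P(class) + sum of log P(word | class) for both classes."""
    log_spam = math.log(priors["spam"])
    log_ham = math.log(priors["ham"])
    for word in message:
        # skip words that were not seen in training
        if word in params_spam and word in params_ham:
            log_spam += math.log(params_spam[word])
            log_ham += math.log(params_ham[word])
    label = "spam" if log_spam > log_ham else "ham"
    return label, log_spam, log_ham


toy_priors = {"spam": 0.2, "ham": 0.8}
toy_spam = {"free": 0.5, "ok": 0.1}
toy_ham = {"free": 0.1, "ok": 0.5}
result = classify_log(["free", "free", "unseen"], toy_priors, toy_spam, toy_ham)
print(result[0])  # spam
```

The difference `log_spam - log_ham` can also serve as a confidence margin for a prediction.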

## Evaluate the classification performance [4 Points] 

Classify the remaining 20% of the data (test set) and calculate classification accuracy. Accuracy is defined as the proportion of correctly classified test documents.

Inspect a few misclassified text messages and discuss why the classification failed.

Discuss how you deal with words in the test data that you have not seen in the training data.

In [36]:
test['Prediction'] = test['Message'].apply(classify)

In [37]:
test.head(20)

| | Label | Message | Prediction |
|---|---|---|---|
| 0 | ham | [give, fli, monkey, wot, think, certainli, min... | [ham, 9.487723465686995e-32, 1.344055342434491... |
| 1 | ham | [ye, small, kid, boost, secret, energi] | [spam, 2.2003353355928533e-18, 1.9571187436383... |
| 2 | ham | [dont, think, need, yellow, card, uk, travel, ... | [ham, 6.183935269817049e-50, 5.80330532730119e... |
| 3 | ham | [sorri, din, lock, keypad] | [ham, 1.7460728013524371e-12, 7.99255984286686... |
| 4 | ham | [hi, hope, ur, day, good, back, walk, tabl, bo... | [ham, 5.643459327235041e-45, 1.582976066998658... |
| 5 | ham | [ok, thk, got, u, wan, come, wat] | [ham, 2.7672160577987614e-23, 5.23323281517724... |
| 6 | spam | [last, chanc, claim, ur, worth, discount, ye, ... | [spam, 7.551543933791488e-39, 6.29078197600759... |
| 7 | ham | [bank, say, money] | [ham, 3.4921456027048742e-12, 1.37395909679758... |
| 8 | ham | [horribl, gal, sch, stuff, come, u, got, mc] | [ham, 4.612026762997935e-25, 7.600788770158969... |
| 9 | ham | [good, afternoon, love, ani, job, prospect, mi... | [ham, 2.603381860182261e-42, 4.203697548133950... |


In [38]:
correct = 0
total = len(test)
incorrect_index = []

for index, row in test.iterrows():
    if row['Label'] == row['Prediction'][0]:
        correct += 1
    else:
        incorrect_index.append(index)

print('Correctly classified:', correct)
print('Incorrectly classified:', total-correct)
print('Accuracy', correct/total)

Correctly classified: 1058
Incorrectly classified: 57
Accuracy 0.9488789237668162
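
Accuracy alone does not show in which direction the errors go (ham flagged as spam versus spam let through as ham). A minimal sketch of tallying the four outcome counts from label and prediction pairs follows; `confusion_counts` is a hypothetical helper and the labels are toy data, not results from this notebook.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(true_labels, predicted_labels))


toy_true = ["ham", "ham", "spam", "spam", "ham"]
toy_pred = ["ham", "spam", "spam", "ham", "ham"]
counts = confusion_counts(toy_true, toy_pred)
accuracy = (counts[("ham", "ham")] + counts[("spam", "spam")]) / len(toy_true)
print(counts[("ham", "spam")], "ham message(s) predicted as spam")
print("Accuracy", accuracy)  # Accuracy 0.6
```

In the notebook the predicted labels could be taken as `[p[0] for p in test['Prediction']]`.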


Inspect the incorrectly classified messages

In [39]:
pd.set_option('display.max_colwidth', None)
test.iloc[incorrect_index]

| | Label | Message | Prediction |
|---|---|---|---|
| 1 | ham | [ye, small, kid, boost, secret, energi] | [spam, 2.2003353355928533e-18, 1.9571187436383694e-18] |
| 24 | ham | [daili, text, favour, thi, time] | [spam, 3.650926779057771e-15, 4.803993329289673e-16] |
| 45 | ham | [sort, code, acc, bank, natwest, repli, confirm, sent, thi, right, person] | [spam, 1.0004291599560244e-32, 7.976402069057452e-33] |
| 63 | ham | [madam, regret, receiv, refer, check, dlf, rakhesh, kerala] | [spam, 5.404872890499567e-22, 1.178970281743925e-22] |
| 87 | ham | [pa, select] | [spam, 4.825027736345271e-07, 9.235254888065201e-08] |
| 105 | ham | [call, messag, miss, call] | [spam, 1.5944128667357497e-09, 2.418919949605203e-10] |
| 109 | spam | [money, wine, number, wot, next] | [ham, 6.7417681999646354e-18, 3.5290702620052495e-16] |
| 123 | ham | [u, get, pic, msg, phone] | [spam, 1.8571061378741976e-12, 2.830567934844545e-13] |
| 130 | ham | [know, complain, num, onli, bettr, directli, go, bsnl, offc, nd, appli] | [spam, 1.3171586313064376e-22, 4.2026086525196473e-23] |
| 148 | ham | [call, messag, miss, call] | [spam, 1.5944128667357497e-09, 2.418919949605203e-10] |


Some messages are misclassified by only a very small margin. For example, for messages 270, 366, and 722 the spam and ham scores differ only in the last decimal place, and message 1 in the table above scores 2.20e-18 for spam versus 1.96e-18 for ham. These messages contain many neutral words that have similar probabilities in both spam and ham. A second potential reason is that the classifier does not take into account words that it has not seen before, so part of the message is lost. A third potential reason is that the classifier does not take into account the order of the words and therefore cannot get a sense of context.

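Words in the test data that the classifier has not seen in the training data are ignored: `classify` only multiplies in a factor for words that have an entry in `params_spam` and `params_ham`, so an unseen word changes neither score.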

## Bonus task [+1 Point]

Describe how you could improve the classifier.

I would use TF-IDF to improve the classifier. TF-IDF stands for term frequency-inverse document frequency. It effectively diminishes the impact of words that occur a lot. For example, if the word "like" appears in almost every message, the classifier would treat it as less influential than words that do not appear as frequently.
Also, if the spam and ham probabilities for a message are very close, we can flag the message instead of labeling it as spam. Other signals could help as well: filtering terms, the time of the message, the contact, and the length of spam messages.

for each word:<br>
    TF = number of occurrences of the word in spam messages<br>
    IDF = log(total number of messages / total number of messages containing the word)<br>
    p(word|spam) = (TF * IDF) / (sum of TF * IDF over all words) <br>
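
The pseudocode above could be sketched in Python as follows. This is only an illustration under stated assumptions: `tfidf_likelihoods` is a hypothetical helper, the messages are toy data, and the `alpha` smoothing term is an addition not present in the pseudocode. Without it, a word that occurs in every message (IDF = 0) or never occurs in spam (TF = 0) would get probability 0 and zero out the whole product.

```python
import math
from collections import Counter


def tfidf_likelihoods(class_messages, all_messages, alpha=1):
    """P(word | class) proportional to TF * IDF, with additive smoothing."""
    n_messages = len(all_messages)
    # document frequency: number of messages that contain the word
    doc_freq = Counter(word for message in all_messages for word in set(message))
    # term frequency: occurrences of the word within this class
    tf = Counter(word for message in class_messages for word in message)
    weights = {
        word: tf[word] * math.log(n_messages / doc_freq[word])
        for word in doc_freq
    }
    denominator = sum(weights.values()) + alpha * len(weights)
    return {word: (w + alpha) / denominator for word, w in weights.items()}


toy_all = [["free", "call"], ["free", "prize"], ["ok", "call"], ["ok", "lar"]]
toy_spam_messages = toy_all[:2]
toy_tfidf = tfidf_likelihoods(toy_spam_messages, toy_all)
for word, p in sorted(toy_tfidf.items(), key=lambda x: x[1], reverse=True):
    print(word, round(p, 3))
```

The resulting dictionary has the same shape as `params_spam`, so it could be passed to the classifier in its place.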