# Bayesian classifier: Spam or Ham?

## Extracting features from text

Text analysis is often done with a technique called **bag of words**. We need to convert a collection of words into a feature vector that can then be used in mathematical expressions. The main idea is:
1. Collect a very large set of documents that contains a very large number of words.
2. Build a vocabulary that consists of all the words in this dataset. Each word is represented by a number. For example, if our vocabulary consists of 100 words, the first word will have the **id 0** and the last word will have the **id 99**.
3. With the vocabulary in place, we can convert any new text to a histogram by creating a new vector of size $(1,v)$, where $v$ is the vocabulary size. The value at position $i$ of the vector is the number of occurrences of the $i$th vocabulary word.
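
The three steps can also be sketched in plain Python, without any library. This is only an illustration: the helper names `tokenize`, `build_vocabulary` and `to_histogram` are made up for this example, and the tokenizer (lowercase tokens of two or more word characters) only approximates what a real vectorizer does.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents):
    """Map every distinct word in the corpus to an integer id, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def to_histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`; unknown words are ignored."""
    histogram = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:
            histogram[vocabulary[word]] += 1
    return histogram

# Steps 1 and 2: a (tiny) corpus and its vocabulary
corpus = ["the phone is bad", "the phone is good", "good good phone"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)

# Step 3: convert a new text to a histogram over the vocabulary
print(to_histogram("good phone, really good", vocabulary))
```

Note that the word *really* does not appear in the histogram at all, because it is not part of the vocabulary.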

Run the following code to see how we can build a vocabulary of words for text analysis. 

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)

sentences = ['This product is not good it is the worst I have ever seen.',
'This is the worst thing that has ever happend. How is it so bad?',
'This is a bad phone. It is slow and not very responsive.',
'Amazingly this product is very good! I did not expect that',
'How can this thing get any better? This is the best thing I have ever seen!',
'I thought that this will not be good, but in fact it is very good!']

vectorizer.fit(sentences)
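# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use vectorizer.get_feature_names_out() instead.
feature_names = vectorizer.get_feature_names()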

The code implements the first two steps listed above.
Identify which part is the first step and which is the second, and print the outputs to examine the results. What is the size of our vocabulary in this case, and what are the features?

In [2]:
print(len(feature_names))
print(feature_names)

37
['amazingly', 'and', 'any', 'bad', 'be', 'best', 'better', 'but', 'can', 'did', 'ever', 'expect', 'fact', 'get', 'good', 'happend', 'has', 'have', 'how', 'in', 'is', 'it', 'not', 'phone', 'product', 'responsive', 'seen', 'slow', 'so', 'that', 'the', 'thing', 'this', 'thought', 'very', 'will', 'worst']


The **vectorizer** object is responsible for converting a text segment to features. The third step from the list above (converting any new text to a histogram) can be done with the command below. **(Note: the test sentence can be any sentence you would like to try.)**



In [3]:
print(vectorizer.transform(['This product is amazingly the worst thing']).toarray())

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0
  1]]


As you may have guessed, representing a text segment with just a collection of single words is not the best choice. Think about this example:

1. This product is not good it's bad!
2. This product is not bad it's good!

Our bag of words representation cannot tell the difference between these two! The two sentences are **semantically** exact opposites, but the histogram based on single words is identical for both.

This is why, when we want to add context to our analysis, we have to consider **n-grams**. **n-grams** are sequences of $n$ consecutive words, rather than single words. For example, if we consider **2-grams**, our features will not be the collection:

{this,product,is,not,good,it's,bad}

but it will be the collection:

{this product,product is,is not,not good,not bad,good it's,bad it's,it's good,it's bad}

We immediately see that now we can get some context about the sentence.
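
The effect can be checked with a few lines of plain Python. This is a minimal sketch: the `ngrams` helper is a made-up name for this example, and it splits on whitespace only, which is cruder than the tokenization a real vectorizer uses.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams (n consecutive words joined by a space) found in `text`."""
    words = text.lower().replace("!", "").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

first = "This product is not good it's bad!"
second = "This product is not bad it's good!"

# With single words, the two histograms are identical ...
print(Counter(ngrams(first, 1)) == Counter(ngrams(second, 1)))

# ... but with 2-grams they differ, because word order now matters.
print(Counter(ngrams(first, 2)) == Counter(ngrams(second, 2)))

# 2-grams that appear only in the first sentence
print(sorted(set(ngrams(first, 2)) - set(ngrams(second, 2))))
```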

Run the following code, which uses the same corpus as the previous program **(Note: in text analysis and data mining, the term *corpus* means the whole collection of documents, i.e. the whole text dataset)**, but now the feature extraction also includes **2-grams**. Because the vectorizer is created with **ngram\_range=(1, 2)**, the vocabulary contains both single words and 2-grams.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This product is not good it is the worst I have ever seen.',
'This is the worst thing that has ever happend. How is it so bad?',
'This is a bad phone. It is slow and not very responsive.',
'Amazingly this product is very good! I did not expect that',
'How can this thing get any better? This is the best thing I have ever seen!',
'I thought that this will not be good, but in fact it is very good!']

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern='\\b\\w+\\b', min_df=1)
bigram_vectorizer.fit(sentences)
# Note: on scikit-learn 1.2 and newer, use get_feature_names_out() instead.
feature_names = bigram_vectorizer.get_feature_names()


What is the size of our vocabulary in this case, and what are the features?	

In [4]:
print(len(feature_names))
print(feature_names)

100
['a', 'a bad', 'amazingly', 'amazingly this', 'and', 'and not', 'any', 'any better', 'bad', 'bad phone', 'be', 'be good', 'best', 'best thing', 'better', 'better this', 'but', 'but in', 'can', 'can this', 'did', 'did not', 'ever', 'ever happend', 'ever seen', 'expect', 'expect that', 'fact', 'fact it', 'get', 'get any', 'good', 'good but', 'good i', 'good it', 'happend', 'happend how', 'has', 'has ever', 'have', 'have ever', 'how', 'how can', 'how is', 'i', 'i did', 'i have', 'i thought', 'in', 'in fact', 'is', 'is a', 'is it', 'is not', 'is slow', 'is the', 'is very', 'it', 'it is', 'it so', 'not', 'not be', 'not expect', 'not good', 'not very', 'phone', 'phone it', 'product', 'product is', 'responsive', 'seen', 'slow', 'slow and', 'so', 'so bad', 'that', 'that has', 'that this', 'the', 'the best', 'the worst', 'thing', 'thing get', 'thing i', 'thing that', 'this', 'this is', 'this product', 'this thing', 'this will', 'thought', 'thought that', 'very', 'very good', 'very responsiv

Try again to get the feature representation of some simple sentences based on the trained vocabulary. The command is analogous to the command for the **1-grams**.

In [5]:
print(bigram_vectorizer.transform(['This product is amazingly the worst thing.']).toarray())

[[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
  0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1]]


## Spam filter

Consider the following email:

---

Hi,

succeeded in ringing a bell anytime I get a mail
to my inbox (Not Mailing lists, spams etc.) by
using procmail to execute a "play clink.wav" on
the "right" mails.
Now my demands are growing ;-)
I use my laptop remotely very often and now I would
like the bell to sound on that when I am there.
I tried to use the KDE remote sound server and it works in
the tests, but when procmail runs it it doesn't, presumably
as it doesn't have the authorization to communicate with
the laptop, being another user?

Any hints.

BRGDS

---

It is easy for us to understand that this is not a spam email. But how do we train a computer to understand the same? We will use the bag of words technique, together with a new type of classifier called a Bayesian classifier.

We will follow the typical steps for designing most classifiers:

1. Collect annotated data. 'Annotated' means that for every email in our dataset, we know whether it is spam or ham.
2. Train a classifier with some features. The classifier will try to associate certain feature combinations with certain outcomes. In our case, the classifier tries to connect words and phrases with the 'spam' or 'ham' categories.
3. Use the trained classifier in our system. With Python, as we have seen before, we can **pickle** objects, save them to disk and use them later on.
4. The classifier that we will produce in this lab is ready to use, and it actually performs really well! Combined with Python's server-related capabilities, you could ship it with a real-world product.
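
As a reminder of how pickling works, here is a minimal sketch. The dictionary `trained_model` is only a stand-in for a trained object and the file name is made up for this example. For a real spam filter you would need to save both the fitted vectorizer and the fitted classifier, since new e-mails must go through the same vocabulary. Only load pickle files from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object is saved the same way.
trained_model = {"vocabulary": {"cheap": 0, "meeting": 1}, "threshold": 0.5}

# Hypothetical file name in the system's temporary directory
model_path = os.path.join(tempfile.gettempdir(), "spam_filter_example.pkl")

# Save the object to disk ...
with open(model_path, "wb") as f:
    pickle.dump(trained_model, f)

# ... and load it back later, for example in another program.
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model == trained_model)

# Clean up the example file
os.remove(model_path)
```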

### Creating our training set

The first step is to load our data from two folders, where one holds the spam data and the other holds the ham data. 


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import glob
import numpy as np
import os
import requests
import zipfile
import io

In [7]:
r = requests.get('https://github.com/wOOL/COM2028/raw/master/W9/spam.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

In [9]:
#get the training corpus
spam_filenames = glob.glob(os.path.join("./datasets/spamham/original/training_corpus/spam/", '*'))
ham_filenames = glob.glob(os.path.join("./datasets/spamham/original/training_corpus/ham/", '*'))
training_number_spam = len(spam_filenames) 
training_number_ham = len(ham_filenames)
print(training_number_spam)
print(training_number_ham)

1395
1400


In [10]:
#create empty data holders for the training data
emails_data = []
emails_label = []

#add the ham mails
for filename in ham_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    emails_data.append(email)
    emails_label.append(0)

#add the spam mails
for filename in spam_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    emails_data.append(email)
    emails_label.append(1)

###  Getting the features

When we deal with large real-world datasets, there is a problem we need to solve when we extract the bag of words: the frequency of common words. Words like *the*, *a*, *do*, *he*, *she* etc. will be dominant in our histograms, and we won't achieve much separability. This is why we use a technique called **tf-idf** to minimise the effect of such terms.

We quote the example from <http://tfidf.com>

Consider a document containing $100$ words wherein the word *cat* appears $3$ times. The **term frequency** (**tf**) for *cat* is then $\frac{3}{100} = 0.03$. Now, assume we have $10^7$ documents and the word *cat* appears in $10^3$ of these. Then, the **inverse document frequency** (**idf**) is calculated as $\log_{10}(\frac{10^7}{10^3}) = 4$. Thus, the tf-idf weight is the product of these quantities: $0.03 \cdot 4 = 0.12$.

You don't need to worry a lot about **tf-idf**; just think of it as a weighted way to get the bag-of-words features.
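
The arithmetic of the quoted example can be reproduced in a few lines. This is a sketch of the textbook formula only (the variable names are chosen for this example). scikit-learn's **TfidfVectorizer** uses a slightly different variant by default (natural logarithm, smoothing, and length normalisation of each vector), so its numbers will not match this calculation exactly.

```python
import math

words_in_document = 100
occurrences_of_cat = 3
total_documents = 10**7
documents_with_cat = 10**3

# term frequency: how often the word occurs, relative to the document length
tf = occurrences_of_cat / words_in_document

# inverse document frequency: rare words get a larger weight (base-10 logarithm here)
idf = math.log10(total_documents / documents_with_cat)

tfidf_weight = tf * idf
print(tf, idf, round(tfidf_weight, 2))
```

A word that appeared in every one of the $10^7$ documents would get $\log_{10}(1) = 0$ as its idf, so its weight would vanish no matter how often it occurs.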

To get our bag of words, this time we will initialise the vectorizer with n-grams of up to three words (**1-grams**, **2-grams** and **3-grams**).

In [11]:
#build the vocabulary based on all our emails - both spam and ham
pattern ='(?u)\\b[A-Za-z]{3,}'
tfidf = TfidfVectorizer(sublinear_tf=True, max_df=0.8,stop_words=None, token_pattern=pattern, ngram_range=(1, 3))
                        
#calculate features using tf-idf and create a training set 
X_train = tfidf.fit_transform(emails_data)
print("X_train is a sparse matrix with shape: %s" % str(X_train.shape))

# Note: on scikit-learn 1.2 and newer, use get_feature_names_out() instead.
features = tfidf.get_feature_names()
print('Total features: ' + str(len(features)))
print('First 9 features')
print(features[0:9])

X_train is a sparse matrix with shape: (2795, 802189)
Total features: 802189
First 9 features
['aaa', 'aaa aaaaaacxaaaaaaaaaaaaaaaaqwaaaaaaaaaaaaaaaf', 'aaa aaaaaacxaaaaaaaaaaaaaaaaqwaaaaaaaaaaaaaaaf qwaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatgaafiqbfyqbswybaaaa', 'aaa aaaaqaaaaeeaaabcaaaaqwaaaeqaaabfaaaargaaaecaaabiaaaasqaa', 'aaa aaaaqaaaaeeaaabcaaaaqwaaaeqaaabfaaaargaaaecaaabiaaaasqaa aeoaaablaaaataaaae', 'aaa aakwdwaf', 'aaa aakwdwaf abigtbhipaaysdwageg', 'aaa blackcomb', 'aaa blackcomb panasas']


**3-grams** add much more semantic context to the whole process, and since the application problem is difficult, we need this extra context. Don't worry about the pattern argument; it is just a regex (regular expression) that keeps only tokens made of at least three letters.

In this case, we are using a new method of the vectorizer, **tfidf.fit\_transform**. This convenient method first extracts the vocabulary from our corpus, and then extracts the features for each of the training samples. So with this method we avoid a for loop that converts each training e-mail to a feature vector; this is done automatically. The result, **X\_train**, is a sparse matrix with $m$ rows, where each row is a vector of size $k$. In our case, since we have 2795 files, $m=2795$, and $k$ is the number of terms (words and n-grams) in our vocabulary, which depends on how many different terms exist in the corpus. Note that we also used the parameter **max\_df=0.8** to discard terms that appear in more than 80% of the documents. This is an extra preprocessing step: terms that are present in most of the documents add complexity to our system without giving back much discriminability.
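
To make the effect of **max\_df** concrete, here is a small sketch in plain Python. The function name `filter_by_document_frequency` and the toy documents are invented for this example; the real vectorizer does this filtering internally.

```python
def filter_by_document_frequency(tokenized_docs, max_df):
    """Keep only the terms that occur in at most `max_df` (a fraction) of the documents."""
    n_docs = len(tokenized_docs)
    document_frequency = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return sorted(t for t, df in document_frequency.items() if df / n_docs <= max_df)

docs = [
    ["the", "offer", "is", "free"],
    ["the", "meeting", "is", "today"],
    ["the", "free", "offer", "ends", "today"],
    ["the", "report", "is", "late"],
    ["the", "lunch", "menu"],
]

# "the" occurs in 5 of 5 documents (100% > 80%), so it is dropped;
# "is" occurs in 3 of 5 documents (60%), so it is kept.
print(filter_by_document_frequency(docs, max_df=0.8))
```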

If you want to, you can print some of the contents of the e-mails to see exactly what kind of data we are dealing with, and to see the nature of the spam e-mails in the dataset.

### Training the naive bayesian classifier

The next step is to take the results from the vectorizer, and pass it to a classifier for the learning process. 


Notice that the vectorizer's method **tfidf.fit\_transform(emails\_data)** not only creates the vocabulary, but also returns the feature vectors for all of the e-mails in **emails\_data** as the matrix **X\_train**. The variable **X\_train** is now ready to be used for training the classifier.

We will use the multinomial **Naive Bayes classifier** from  **sklearn**. From the doc page of **sklearn.naive\_bayes.MultinomialNB**:

**sklearn.naive\_bayes.MultinomialNB** is suitable for classification with discrete features (e.g., word counts for text classification). 


The training is done in a similar way to the neural network or the SVM: we pass the classifier the matrix of our features, together with the target for each observation. In our case, this looks like the following:


In [12]:
# create a Naive Bayes classifier
clf = MultinomialNB()
 
clf.fit(X_train, emails_label)
print("Trained MultinomialNB Classifier")

Trained MultinomialNB Classifier


Remember that **emails\_label** is the list we created in the first step, where we loaded our data from the files. Target values are usually **0: ham, 1: spam**, but of course this is just a convention.

The two lines above are all we need to train our spam filter! It is in fact a great spam filter, and it only took around 50 lines of code in total!
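
To get a feeling for what happens inside **fit** and **predict**, here is a tiny multinomial Naive Bayes written from scratch. It is a simplified sketch, not scikit-learn's implementation: it works on raw word counts instead of tf-idf weights, the function names and the four training sentences are invented for this example, and it uses Laplace smoothing with `alpha=1.0` so that a word never gets probability zero.

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    model = {}
    for cls in set(labels):
        class_docs = [doc for doc, lab in zip(documents, labels) if lab == cls]
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocabulary)
        model[cls] = {
            # P(class): fraction of training documents in this class
            "log_prior": math.log(len(class_docs) / len(documents)),
            # P(word | class), smoothed with alpha
            "log_likelihood": {w: math.log((counts[w] + alpha) / total) for w in vocabulary},
        }
    return model

def predict(model, document):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for cls, params in model.items():
        score = params["log_prior"]
        for word in document.split():
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[cls] = score
    return max(scores, key=scores.get)

train_docs = ["win money now", "cheap money offer", "meeting at noon", "lunch at noon today"]
train_labels = [1, 1, 0, 0]  # 1: spam, 0: ham

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "cheap money"))
print(predict(model, "meeting today at noon"))
```

The classifier is called "naive" because it multiplies the word probabilities as if the words were independent of each other given the class (here, it adds their logarithms).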

### How good is our spam filter?

In order to test whether the spam filter works, we provide two folders with 500 e-mails from each category as a testing set. These e-mails are (of course) different from the ones we used in the training set.

After we read all the testing spam and ham data into two lists, 

**testing\_spam\_data** and **testing\_easy\_ham\_data**,

we pass them through the vectorizer, in order to convert the text into feature vectors.


In [14]:
#test how well we did 
#in first test all are spam
testing_spam_filenames =   glob.glob(os.path.join("./datasets/spamham/original/testing_spam/", '*'))
testing_spam_data = []
for filename in testing_spam_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    #bytes that cannot be decoded were already dropped by errors='ignore'
    testing_spam_data.append(email)

# extract features from raw text documents
X_test = tfidf.transform(testing_spam_data)

The next step, of course, is to pass the above features to the classifier and get the results.


In [15]:
results = clf.predict(X_test).astype(bool)

#how many were classified as spam (i.e. correctly)?
print(np.count_nonzero(results))

410


In [16]:
#test how well we did
#in the second test all are ham
testing_easy_ham_filenames = glob.glob(os.path.join("./datasets/spamham/original/testing_ham/", '*'))
testing_easy_ham_data = []
for filename in testing_easy_ham_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    #bytes that cannot be decoded were already dropped by errors='ignore'
    testing_easy_ham_data.append(email)

# extract features from raw text documents
X_test = tfidf.transform(testing_easy_ham_data)
 
# MultinomialNB's predict() returns the class labels directly

results = clf.predict(X_test)
print(results)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [17]:
misclassified_ids = np.nonzero(results)

for misclassified_id in misclassified_ids:
    print(misclassified_id)

[291 424]


In [18]:
#how many ham e-mails were wrongly classified as spam?
print(np.count_nonzero(results))

2


If you want to see how confident the classifier is for each result, you can also run **clf.predict\_proba(X\_test)**.

This will return the probability of each e-mail belonging to the *ham* or the *spam* class.
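
One thing you can do with these probabilities is to apply your own decision threshold. The sketch below uses made-up probability rows (one row per e-mail, in the order ham, spam, matching the labels 0 and 1 used above) and a hypothetical helper `flag_spam`; it is not part of scikit-learn.

```python
# Made-up example rows in the layout [P(ham), P(spam)], one row per e-mail.
probabilities = [
    [0.98, 0.02],
    [0.45, 0.55],
    [0.10, 0.90],
    [0.30, 0.70],
]

def flag_spam(probabilities, threshold=0.5):
    """Label an e-mail as spam (1) only if P(spam) exceeds `threshold`, else ham (0)."""
    return [1 if p_spam > threshold else 0 for _, p_spam in probabilities]

# Default decision: pick the more probable class
print(flag_spam(probabilities))

# Stricter threshold: fewer e-mails are flagged as spam
print(flag_spam(probabilities, threshold=0.8))
```

A higher threshold protects legitimate e-mails from being flagged, at the price of letting more spam through.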


## Exercises

**Task 1:** Try to collect a few of your own e-mails as testing sets, and report the results.

**Task 2:** Check the e-mails that were legitimate, and see why the classifier failed and classified them as spam. 

## Task 1: Find appropriate spam and ham data

Appropriate spam and ham data can be found at https://www.kaggle.com/veleon/ham-and-spam-dataset, an open-source dataset, which was stored at the relative file path ./datasets/spamham/online.

In [33]:
ham_filenames =   glob.glob(os.path.join("./datasets/spamham/online/ham", '*'))
spam_filenames =   glob.glob(os.path.join("./datasets/spamham/online/spam", '*'))
ham_test_data=[]
for filename in ham_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    #bytes that cannot be decoded were already dropped by errors='ignore'
    ham_test_data.append(email)

spam_test_data=[]
for filename in spam_filenames:
    #read the actual email data (text)
    email = open(filename, 'r', encoding="utf8", errors='ignore').read()
    #bytes that cannot be decoded were already dropped by errors='ignore'
    spam_test_data.append(email)
# extract features from raw text documents
X_test_ham = tfidf.transform(ham_test_data)
X_test_spam = tfidf.transform(spam_test_data)
 
# MultinomialNB's predict classes directly
results_ham = clf.predict(X_test_ham)
results_spam = clf.predict(X_test_spam)
incorrect_ham= []
incorrect_spam= []
for x in range(0,len(results_ham)):
  if(results_ham[x]!=0):
    incorrect_ham.append(x)

for x in range(0,len(results_spam)):
  if(results_spam[x]!=1):
    incorrect_spam.append(x)
##Printing the ham results
print("HAM Results:")
print(f"Incorrectly identified {np.count_nonzero(results_ham)} ham results are: ")
print(incorrect_ham)
accuracy = (len(results_ham)-len(incorrect_ham))/len(results_ham)
print(f"Accuracy of {accuracy}")

##Printing the spam results
print("\nSPAM Results:")
# count the misclassified spam e-mails (np.count_nonzero(results_spam) would count the correct ones)
print(f"Incorrectly identified {len(incorrect_spam)} spam results are: ")
print(incorrect_spam)
accuracy = (len(results_spam)-len(incorrect_spam))/len(results_spam)
print(f"Accuracy of {accuracy}")

HAM Results:
Incorrectly identified 12 ham results are: 
[118, 605, 633, 1118, 1203, 1225, 1287, 1407, 1691, 2198, 2276, 2323]
Accuracy of 0.9952959623676989

SPAM Results:
Incorrectly identified 81 spam results are: 
[8, 19, 25, 30, 31, 33, 43, 48, 51, 54, 62, 63, 65, 70, 77, 87, 88, 89, 104, 106, 113, 116, 118, 121, 133, 134, 136, 137, 144, 147, 151, 152, 171, 177, 192, 198, 199, 207, 219, 240, 243, 247, 259, 266, 275, 276, 280, 290, 292, 299, 308, 322, 326, 328, 340, 352, 367, 371, 376, 382, 383, 385, 394, 409, 411, 418, 421, 432, 435, 438, 439, 448, 450, 455, 460, 471, 475, 476, 488, 495, 497]
Accuracy of 0.8383233532934131


## Task 2: Determining failures in the system

To do so, we first need to look at the ham e-mails that were incorrectly identified as spam. According to the results in the cell above, e-mails such as 118, 605, 633, 1118 and 1203, among others, were incorrectly identified.

In [43]:
print(ham_test_data[118])
##Separator print statements for easier analysis
print("\n\n######################NEW MAIL START #########################\n\n")
print(ham_test_data[605])
print("\n\n######################NEW MAIL START #########################\n\n")
print(ham_test_data[633])
print("\n\n######################NEW MAIL START #########################\n\n")
print(ham_test_data[1118])
print("\n\n######################NEW MAIL START #########################\n\n")
print(ham_test_data[1203])

Return-Path: anthony@interlink.com.au
Delivery-Date: Fri Sep  6 09:06:57 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 06 Sep 2002 18:06:57 +1000
Subject: [Spambayes] Re: [Python-Dev] Getting started with GBayes testing 
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHGEEMDKAA.tim.one@comcast.net> 
Message-ID: <200209060806.g8686ve03964@localhost.localdomain>


>>> Tim Peters wrote
> > I've actually got a bunch of spam like that. The text/plain is something
> > like
> >
> > **This is a HTML message**
> >
> > and nothing else.
> 
> Are you sure that's in a text/plain MIME section?  I've seen that many times
> myself, but it's always been in the prologue (*between* MIME sections -- so
> it's something a non-MIME aware reader will show you).

*nod* I know - on my todo is to feed the prologue into the system as well.

A snippet, hopefully not enough to trigger the spam-filters.


To: into89j@gin.elax.ekorp.com
X-Mailer: Microsoft Outlook Express 4.72.1712.3
X-MimeOLE: Produced By Micro

#### Analysis
E-mails 118 and 633 make heavy use of language related to making money, and since a large number of spam e-mails are about making money, the model has a significant likelihood of classifying them as spam.

Similarly, e-mails such as 605 ask the user to pass the e-mail on and also relate to money, so the model can easily classify them incorrectly.

We could therefore tune the model to improve accuracy on e-mails like these, but this comes with a trade-off: making the filter less eager to flag legitimate e-mails as spam will typically let more spam through.
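
This trade-off can be quantified with precision and recall. The sketch below uses the counts implied by the Task 1 output: 81 misclassified indices among the spam e-mails and 12 among the ham e-mails. The totals (501 spam, 2551 ham) are not printed anywhere; they are inferred from the printed accuracies, so treat them as assumptions.

```python
# Counts from the Task 1 output; the totals are inferred from the printed accuracies.
spam_total, spam_missed = 501, 81    # spam e-mails, and those classified as ham
ham_total, ham_flagged = 2551, 12    # ham e-mails, and those classified as spam

true_positives = spam_total - spam_missed   # spam correctly caught
false_negatives = spam_missed               # spam that slipped through
false_positives = ham_flagged               # legitimate e-mails flagged as spam
true_negatives = ham_total - ham_flagged    # legitimate e-mails left alone

# Of everything flagged as spam, how much really was spam?
precision = true_positives / (true_positives + false_positives)
# Of all the spam, how much did we catch?
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (spam_total + ham_total)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

With these numbers the filter rarely flags legitimate mail, but it misses a noticeable share of the spam.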

#### In reality
To increase the reliability of the model in the real world without changing the current classification metrics, a developer could keep track of the sender's e-mail address and of how many e-mails from that address are classified as spam. If the number is substantial, the model should be stricter on all e-mails from that address, whereas if e-mails from an address are generally classified as not spam, the model should be more lenient towards e-mails from that address.
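
One possible way to sketch this idea is to keep a per-sender history and use it to shift the spam threshold that is applied to the classifier's probabilities. Everything below is hypothetical: the class name `SenderReputation`, its parameters and the example addresses are invented for illustration and were not part of the experiment above.

```python
from collections import defaultdict

class SenderReputation:
    """Track, per sender, how many e-mails were seen and how many were classified as spam."""

    def __init__(self, base_threshold=0.5, adjustment=0.2, min_emails=5):
        self.base_threshold = base_threshold  # threshold for unknown senders
        self.adjustment = adjustment          # maximum shift of the threshold
        self.min_emails = min_emails          # history needed before adjusting
        self.seen = defaultdict(int)
        self.spam = defaultdict(int)

    def record(self, sender, was_spam):
        """Store the classifier's decision for one e-mail from `sender`."""
        self.seen[sender] += 1
        if was_spam:
            self.spam[sender] += 1

    def threshold_for(self, sender):
        """Lower the threshold (stricter) for spammy senders, raise it for clean ones."""
        if self.seen[sender] < self.min_emails:
            return self.base_threshold  # not enough history yet
        spam_rate = self.spam[sender] / self.seen[sender]
        # spam_rate 0 gives base + adjustment, spam_rate 1 gives base - adjustment
        return self.base_threshold + self.adjustment * (1 - 2 * spam_rate)

reputation = SenderReputation()
for _ in range(5):
    reputation.record("newsletter@example.com", was_spam=True)
    reputation.record("colleague@example.com", was_spam=False)

print(reputation.threshold_for("newsletter@example.com"))  # stricter than the default
print(reputation.threshold_for("colleague@example.com"))   # more lenient than the default
print(reputation.threshold_for("stranger@example.com"))    # default threshold
```

An e-mail would then be flagged as spam when its spam probability exceeds the threshold for its sender.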