# DAT405/DIT407 Introduction to Data Science and AI 
## 2022-2023, Reading Period 4
## Assignment 4: Spam classification using Naïve Bayes 
The exercise takes place in this notebook environment.
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells: use Markdown cells to document and explain your results, and code cells to write the code that carries out the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you use Jupyter notebooks locally, you will have to run the commands in the cell below on your own computer; on Windows you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.

**What to submit:** 
Convert the notebook to a pdf-file and submit it. Make sure all cells are executed so all your code and its results are included. Double check the pdf displays correctly before you submit it.

In [1]:
# Download and extract data
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
# !tar -xjf 20021010_easy_ham.tar.bz2
# !tar -xjf 20021010_hard_ham.tar.bz2
# !tar -xjf 20021010_spam.tar.bz2
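
Where `wget` and `tar` are not available (for example in a local Jupyter installation on Windows), the same steps can be done with Python's standard library. The following is a minimal sketch under that assumption; the helper names `extract_archive` and `fetch_and_extract` and the `dest` argument are hypothetical and not part of the assignment.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]

def extract_archive(archive_path, dest="."):
    """Extract a .tar.bz2 archive into dest (the equivalent of `tar -xjf`)."""
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        tar.extractall(path=dest)

def fetch_and_extract(dest="."):
    """Download each corpus archive (if not already present) and unpack it."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        archive_path = dest / name
        if not archive_path.exists():
            urllib.request.urlretrieve(BASE_URL + name, str(archive_path))
        extract_archive(archive_path, dest)

# Uncomment to download and extract the three archives into the current folder:
# fetch_and_extract(".")
```

The download call is left commented out so that the cell does nothing until it is explicitly enabled, in the same way as the shell commands above.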

*The* data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [2]:
# !ls -lah

### 1. Preprocessing: 

##### 1.1 Look at a few emails from easy_ham, hard_ham and spam. Do you think you would be able to classify the emails just by inspection? How do you think a successful model can learn the difference between the different classes of emails?


In [4]:
import os

# Folders with the extracted emails
easy_ham_dir = './datasets/easy_ham/'
spam_dir = './datasets/spam/'

def read_emails(folder):
  # Read every file in the folder; the with-statement closes each file again
  emails = []
  for file in os.listdir(folder):
    with open(os.path.join(folder, file), encoding="ISO-8859-1") as f:
      emails.append(f.read())
  return emails

easy_ham = read_emails(easy_ham_dir)
spam = read_emails(spam_dir)

# Print one example email from each class
print(easy_ham[0])
print(spam[0])


From rssfeeds@jmason.org  Mon Sep 30 13:43:46 2002
Return-Path: <rssfeeds@example.com>
Delivered-To: yyyy@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id AE79816F16
	for <jm@localhost>; Mon, 30 Sep 2002 13:43:46 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Mon, 30 Sep 2002 13:43:46 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8U81fg21359 for
    <jm@jmason.org>; Mon, 30 Sep 2002 09:01:41 +0100
Message-Id: <200209300801.g8U81fg21359@dogma.slashnull.org>
To: yyyy@example.com
From: gamasutra <rssfeeds@example.com>
Subject: Priceless Rubens works stolen in raid on mansion
Date: Mon, 30 Sep 2002 08:01:41 -0000
Content-Type: text/plain; encoding=utf-8
Lines: 6
X-Spam-Status: No, hits=-527.4 required=5.0
	tests=AWL,DATE_IN_PAST_03_06,T_URI_COUNT_0_1
	version=2.50-cvs
X-Spam

Answer 1.1:

We think we would be able to classify the emails by inspection, but it would take a long time. We think a model could classify most of the emails successfully, but not with 100% accuracy.

##### 1.2 Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text (in the optional part further down you can experiment with filtering out the headers and footers). We don’t want to train and test on the same data (it might help to reflect on why if you don't recall). Split the spam and the ham datasets in a training set and a test set. (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`). Use only the easy_ham part as ham data for questions 1 and 2.

In [None]:
# Write your code here for splitting the data
from sklearn.model_selection import train_test_split

hamtrain, hamtest = train_test_split(easy_ham, test_size=0.3, random_state=0)
spamtrain, spamtest = train_test_split(spam, test_size=0.3, random_state=0)

### 2.1 Write a Python program that: 
1.	Uses the four datasets from Question 1 (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (use the [scikit-learn library](https://scikit-learn.org/stable/)) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. Use `CountVectorizer` ([Documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)) to transform the email texts into vectors. Please note that there are different types of Naïve Bayes Classifier in scikit-learn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate before you start coding. 



In [None]:
# Write your code here
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import confusion_matrix

# Fit the vocabulary on all training emails, so that words that
# only occur in spam are not dropped from the feature matrix
vect = CountVectorizer()
train_matrix = vect.fit_transform(hamtrain + spamtrain)
train_target = ['ham'] * len(hamtrain) + ['spam'] * len(spamtrain)

# Transform the test emails with the same vocabulary
hamtest_matrix = vect.transform(hamtest)
spamtest_matrix = vect.transform(spamtest)

# Create the multinomial and the bernoulli naive bayes models and train them
mnb = MultinomialNB()
mnb.fit(train_matrix, train_target)
bnb = BernoulliNB()
bnb.fit(train_matrix, train_target)

def print_rates(model, ham_matrix, spam_matrix):
    # Predict both test sets and build one confusion matrix. Passing
    # labels keeps it 2x2 even when a class is never predicted.
    pred = list(model.predict(ham_matrix)) + list(model.predict(spam_matrix))
    target = ['ham'] * ham_matrix.shape[0] + ['spam'] * spam_matrix.shape[0]
    cm = confusion_matrix(target, pred, labels=['ham', 'spam'])
    print(cm)

    # Ham is treated as the positive class:
    # TP = ham classified as ham, FN = ham classified as spam
    TP = cm[0][0]
    FN = cm[0][1]
    TPR = TP / (TP + FN)
    FNR = 1 - TPR
    print("True Positive Rate: " + str(TPR))
    print("False Negative Rate: " + str(FNR))

# Test the multinomial model with the test data
print("Multinomial Naive Bayes On Easy Ham Versus Spam:")
print_rates(mnb, hamtest_matrix, spamtest_matrix)

# Same process as above but for the bernoulli model
print("Bernoulli Naive Bayes On Easy Ham Versus Spam:")
print_rates(bnb, hamtest_matrix, spamtest_matrix)



### 2.2 Answer the following questions:
##### a) What does the CountVectorizer do?
CountVectorizer converts the text strings into a matrix that has the words (tokens) as columns and the files as rows, where each cell is the count of a word within that text file. 
##### b) What is the difference between Multinomial Naive Bayes and Bernoulli Naive Bayes
Bernoulli Naive Bayes works on binary features: it only considers whether a word occurs in the file or not, whereas Multinomial Naive Bayes takes into account how many times each word occurs.
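
The difference can be illustrated with a toy count matrix (a sketch with made-up numbers; the two columns stand for the hypothetical words "free" and "meeting"). Repeating a word changes the Multinomial model's probabilities, while the Bernoulli model, which binarizes its input by default, gives the same result.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy training counts: column 0 = "free", column 1 = "meeting"
X = np.array([[3, 0], [1, 0], [0, 2], [0, 1]])
y = ["spam", "spam", "ham", "ham"]

mnb = MultinomialNB().fit(X, y)
bnb = BernoulliNB().fit(X, y)   # binarize=0.0 by default: any count > 0 is treated as 1

once = np.array([[1, 1]])   # both words occur once
many = np.array([[5, 1]])   # "free" occurs five times

# Bernoulli sees the same binary vector [1, 1] in both cases
bnb_same = np.allclose(bnb.predict_proba(once), bnb.predict_proba(many))
# Multinomial weights "free" more heavily the more often it occurs
mnb_same = np.allclose(mnb.predict_proba(once), mnb.predict_proba(many))

print(bnb_same, mnb_same)   # prints: True False
print(mnb.predict(once)[0], mnb.predict(many)[0])
```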


### 3.1 Run the two models:
Run (don't retrain) the two models from Question 2 on spam versus hard-ham. Does the performance differ compared to question 2 when the model was run on spam versus easy-ham? If so, why?

In [None]:
# Write your code here
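
One possible way to run already trained models on new data without retraining is to call only `transform` on the fitted vectorizer and `predict` on the fitted model. The following is a minimal sketch on toy data, not on the real corpus; the helper name `evaluate` and all example texts are hypothetical, and ham is assumed to be the positive class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

def evaluate(model, vect, ham_texts, spam_texts):
    """Return (TPR, FNR) with ham as the positive class, without refitting anything."""
    X = vect.transform(ham_texts + spam_texts)   # reuse the already fitted vocabulary
    y_true = ["ham"] * len(ham_texts) + ["spam"] * len(spam_texts)
    y_pred = model.predict(X)
    # labels= fixes the row/column order, so the matrix is always 2x2
    cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
    tp, fn = cm[0, 0], cm[0, 1]   # ham kept as ham / ham flagged as spam
    tpr = tp / (tp + fn)
    return tpr, 1 - tpr

# Toy stand-ins for the real training lists
train_ham = ["meeting at noon", "project report attached"]
train_spam = ["free money now", "win a free prize"]
vect = CountVectorizer().fit(train_ham + train_spam)
model = MultinomialNB().fit(
    vect.transform(train_ham + train_spam), ["ham", "ham", "spam", "spam"]
)

# Toy stand-ins for the hard_ham and spam test lists
hard_ham = ["free lunch at the project meeting"]   # ham containing a spam-like word
spam = ["free prize money"]
print(evaluate(model, vect, hard_ham, spam))
```

Words in the new emails that are not in the fitted vocabulary are simply ignored by `transform`, which is one reason results can differ on data that looks unlike the training set.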

Answer 3.1:

### 3.2 Retrain
Retrain new Multinomial and Bernoulli Naive Bayes classifiers on the combined (easy+hard) ham and spam. Now evaluate on spam versus hard-ham as in 3.1. Also evaluate on spam versus easy-ham. Compare the performance with question 2 and 3.1. What do you observe?

In [None]:
# Write your code here
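
A possible structure for the retraining step, again as a hedged sketch on toy data: a fresh vectorizer is fitted on the combined ham and spam training texts, and both models are trained on the resulting matrix. The helper name `train_models` and the example texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

def train_models(ham_texts, spam_texts):
    """Fit a fresh vectorizer plus one Multinomial and one Bernoulli model."""
    texts = ham_texts + spam_texts
    labels = ["ham"] * len(ham_texts) + ["spam"] * len(spam_texts)
    vect = CountVectorizer()
    X = vect.fit_transform(texts)   # vocabulary now covers every training text passed in
    return vect, MultinomialNB().fit(X, labels), BernoulliNB().fit(X, labels)

# Toy stand-ins for the real training lists
easy_ham_train = ["meeting at noon", "project report attached"]
hard_ham_train = ["newsletter: free shipping on your order"]
spam_train = ["free money now", "win a free prize"]

# Combined (easy + hard) ham versus spam
vect, mnb, bnb = train_models(easy_ham_train + hard_ham_train, spam_train)
print(mnb.predict(vect.transform(["your order has shipped"])))
```

Because the vectorizer is refitted, test emails must be transformed with this new `vect`, not with the one from Question 2.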

Answer 3.2:

### 3.3 Further improvements
Do you have any suggestions for how performance could be further improved? You don't have to implement them, just present your ideas.

Answer 3.3: