# Group Assignment 3
## Team 7
### 04/02/2020

#### Let us load the data set on spam.

In [1]:
import pandas as pd
# Raw string so that the backslashes of the Windows path are not read as escape sequences
data = pd.read_csv(r"D:\Johanna\Ecole\Imperial\Machine learning\smsspamcollection\SMSSpamCollection", header=None, sep='\t', lineterminator='\n', names=['type', 'text'])
data.head()

|   | type | text |
|---|------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### To create our naïve Bayes classifier, we first remove all punctuation and all numbers from the text and convert it to lower case.

In [2]:
import string
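for p in string.punctuation:
    # regex=False: characters such as '.' or '?' must be removed literally
    data['text'] = data['text'].str.replace(p, '', regex=False)
# regex=True: r'\d+' is a pattern that matches runs of digits
data['text'] = data['text'].str.replace(r'\d+', '', regex=True)
data['text'] = data['text'].str.lower()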
data.shape

(5572, 2)

In [3]:
from sklearn.model_selection import train_test_split

y = data.iloc[:,:-1]
X = data.iloc[:, -1]

X_intermediary, X_test = train_test_split(data, test_size=2072/5572, random_state=42)
X_train, X_validate = train_test_split(X_intermediary, test_size=1000/3500, random_state=42)

print(X_train.shape, X_validate.shape, X_test.shape)

(2500, 2) (1000, 2) (2072, 2)


#### Let us consider the following class which defines a Naive Bayes Classifier for spam.

In [4]:
import numpy as np
import pandas as pd

class NaiveBayesForSpam :
    
    def train (self, hamMessages, spamMessages):
        self.words = set(' '.join (hamMessages + spamMessages).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages))/ (len(hamMessages)+len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m]))/len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m]))/len(spamMessages)
            self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
        self.likelihoods = np.array(self.likelihoods).T
        
    def train2(self, hamMessages, spamMessages):
        self.words = set(' '.join(hamMessages + spamMessages).split())
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages))/ (len(hamMessages)+len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for i, w in enumerate(self.words):
            prob1 = (1.0 + len([m for m in hamMessages if w in m]))/len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m]))/len(spamMessages)
            if prob1*20<prob2:
                self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
                spamkeywords.append(w)
        self.words = spamkeywords
        self.likelihoods = np.array(self.likelihoods).T
    
    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate (self.words):
            if w in message.lower():
                posteriors *= self.likelihoods[:,i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            posteriors = posteriors/np.linalg.norm(posteriors, ord = 1)
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]
    
    def score(self, messages, labels):
        confusion = np.zeros(4).reshape(2,2)
        for m, l in zip(messages, labels):
            if self.predict(m)[0] == 'ham' and l == 'ham':
                confusion[0,0] +=1
            elif self.predict(m)[0] == 'ham' and l == 'spam':
                confusion[0,1] +=1
            elif self.predict(m)[0] == 'spam' and l == 'ham':
                confusion[1,0] +=1
            elif self.predict(m)[0] == 'spam' and l == 'spam':
                confusion[1,1] +=1
        return (confusion[0,0] + confusion[1,1])/float(confusion.sum()), confusion

#### This class is composed of 4 functions.
#### 1. The first function: train

This function takes two lists as arguments: the ham messages and the spam messages. It sets the following attributes of the classifier:

- $words$ is the set of all words that appear at least once in a message (either spam or ham). Since it is a set, each word is contained only once.

- $priors$ is the array of the 2 prior probabilities as defined in Bayes' theorem. $priors[0]$ is the probability that a message is ham and $priors[1]$ is the probability that it is spam. Each is simply computed as the proportion of ham or spam messages in the whole set of messages.

- $likelihoods[0, :]$ is the array of likelihoods as defined in Bayes' theorem for ham messages: each entry is the probability that a given word appears knowing that the message is ham. It is computed as one plus the number of ham messages that contain this word (a substring test), divided by the number of ham messages, and it is capped at 0.95.

- $likelihoods[1, :]$ is the array of likelihoods as defined in Bayes' theorem for spam messages: each entry is the probability that a given word appears knowing that the message is spam. It is computed as one plus the number of spam messages that contain this word (a substring test), divided by the number of spam messages, and it is capped at 0.95.
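
As a minimal sketch of the quantities above, the snippet below recomputes the priors and the smoothed, capped likelihoods on a toy corpus. The helper name `word_likelihood` and the toy messages are hypothetical; the formula mirrors the one in `train`, including its substring test `w in m`.

```python
def word_likelihood(word, messages, cap=0.95):
    """Add-one smoothed share of messages containing `word`, capped at `cap`."""
    # Like `train`, this is a substring test, so "win" would also match "winner".
    count = sum(1 for m in messages if word in m)
    return min((1.0 + count) / len(messages), cap)

# Hypothetical toy corpus (already lower-cased, without punctuation).
ham = ["see you at lunch", "ok see you later", "lunch was great", "call me later"]
spam = ["win a free prize", "free entry to win"]

prior_ham = len(ham) / (len(ham) + len(spam))
prior_spam = 1.0 - prior_ham

print(prior_ham, prior_spam)
print(word_likelihood("free", ham), word_likelihood("free", spam))
```

Here `free` appears in both spam messages, so the smoothed ratio is 3/2 and only the cap brings it back to 0.95. This is why the cap is needed with this form of smoothing.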

#### 2. The second function: train2

This function takes two lists as arguments: the ham messages and the spam messages. It sets the following attributes of the classifier:

- $words$ is first the set of all words that appear at least once in a message (either spam or ham). Since it is a set, each word is contained only once. At the end, it is replaced by the list of spam key words: the words whose proportion in spam messages is more than 20 times their proportion in ham messages.

- $priors$ is the same as before.

- $likelihoods$ is the same as before but only for the spam key words (the words whose proportion in spam messages is more than 20 times their proportion in ham messages).
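
The key-word filter can be illustrated on a toy corpus. This is only a sketch: the helper names `smoothed_share` and `select_spam_keywords` and the messages are hypothetical, while the test `prob1*20 < prob2` is the one used in `train2`.

```python
def smoothed_share(word, messages):
    """Add-one smoothed share of messages containing `word` (substring test)."""
    return (1.0 + sum(1 for m in messages if word in m)) / len(messages)

def select_spam_keywords(ham, spam, ratio=20):
    """Keep the words whose spam share is more than `ratio` times the ham share."""
    vocabulary = set(" ".join(ham + spam).split())
    return sorted(
        w for w in vocabulary
        if smoothed_share(w, ham) * ratio < smoothed_share(w, spam)
    )

# Hypothetical toy corpus: 100 ham messages and 4 spam messages.
ham = ["see you later"] * 90 + ["call me later"] * 10
spam = ["free prize", "free entry", "win prize", "call now"]

print(select_spam_keywords(ham, spam))
```

In this toy corpus, `call` is not kept: it appears in 10% of the ham messages, so its spam share is not 20 times larger than its ham share.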

#### 3. The third function: predict

This function takes a new message as argument and returns the predicted category together with its posterior probability $P(X|M)$, which is proportional to $P(X)\times P(M|X)$, where $X$ is the predicted category and $M$ is the message.

To calculate the posterior probabilities, this function starts from the prior probabilities stored in the priors attribute of the classifier. According to the conditional independence assumption, it computes the likelihood $P(M|X)$ as a product over all the words of the words attribute: the likelihood of the word if it is in the message, and one minus this likelihood otherwise. The posteriors are renormalised after each word so that they sum to 1, which is the application of Bayes' theorem. The predicted category is the one with the higher posterior probability.
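
The update performed by `predict` can be followed by hand on a two-word vocabulary. The sketch below uses plain Python lists instead of NumPy arrays; the function name `posterior` and all the numerical values are hypothetical, chosen only for illustration.

```python
def posterior(message, words, likelihoods, priors):
    """Return [P(ham | message), P(spam | message)] under naive Bayes.

    `likelihoods[w]` is the pair (P(w | ham), P(w | spam)).
    """
    post = list(priors)
    for w in words:
        p_ham, p_spam = likelihoods[w]
        if w in message:           # word present: multiply by P(w | class)
            post[0] *= p_ham
            post[1] *= p_spam
        else:                      # word absent: multiply by 1 - P(w | class)
            post[0] *= 1.0 - p_ham
            post[1] *= 1.0 - p_spam
        total = post[0] + post[1]  # renormalise so that the two values sum to 1
        post = [post[0] / total, post[1] / total]
    return post

# Hypothetical priors and likelihoods for a two-word vocabulary.
priors = [0.8, 0.2]
likelihoods = {"free": (0.05, 0.5), "lunch": (0.2, 0.02)}

result = posterior("free entry today", ["free", "lunch"], likelihoods, priors)
print(result)
```

With these values, the unnormalised products are $0.8 \times 0.05 \times 0.8 = 0.032$ for ham and $0.2 \times 0.5 \times 0.98 = 0.098$ for spam, so the message is classified as spam with a posterior of $0.098/0.13$.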

#### 4. The fourth function: score

This function is used on a trained classifier and takes as arguments a list of messages and the list of their categories. It returns the accuracy and the confusion matrix. For each message, it computes the predicted category with the predict function and compares it to the category from labels. If we take ham as the positive label, $confusion[0,0]$ is the number of true positives, $confusion[0,1]$ is the number of false positives, $confusion[1,0]$ is the number of false negatives and $confusion[1,1]$ is the number of true negatives.
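
To show how this matrix is read, the sketch below takes the validation confusion matrix that the first classifier obtains further down in this notebook and derives the accuracy from it. Precision and recall (with ham as the positive label) are not computed by `score`; they are added here only as an illustration.

```python
# Validation confusion matrix of the first classifier, copied from this notebook.
# Rows: predicted class (ham, spam). Columns: actual class (ham, spam).
confusion = [[845, 24],
             [13, 118]]

tp, fp = confusion[0]   # predicted ham: actually ham / actually spam
fn, tn = confusion[1]   # predicted spam: actually ham / actually spam

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted ham that is really ham
recall = tp / (tp + fn)      # share of real ham that is predicted ham

print(accuracy, round(precision, 3), round(recall, 3))
```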

#### Let us now train 2 different classifiers with our training data.

In [5]:
hamMessages_train = X_train[X_train['type'] == 'ham']["text"].tolist()
spamMessages_train = X_train[X_train['type'] == 'spam']["text"].tolist()


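cfl1 = NaiveBayesForSpam()
cfl2 = NaiveBayesForSpam()

# The first classifier uses the full vocabulary, the second one only the spam key words
cfl1.train(hamMessages_train, spamMessages_train)
cfl2.train2(hamMessages_train, spamMessages_train)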

#### Let us have a look at the performance of these 2 classifiers.

In [8]:
messages_validate = X_validate["text"].tolist()
labels_validate = X_validate["type"].tolist()

import time
start_time = time.time()
s1 = cfl1.score(messages_validate, labels_validate)
print(s1)
print("--- %s seconds ---" % (time.time() - start_time))

(0.963, array([[845.,  24.],
       [ 13., 118.]]))
--- 87.3810453414917 seconds ---


In [9]:
import time
start_time = time.time()
s2 = cfl2.score(messages_validate, labels_validate)
print(s2)
print("--- %s seconds ---" % (time.time() - start_time))

(0.953, array([[852.,  41.],
       [  6., 101.]]))
--- 3.6581149101257324 seconds ---


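It appears that the second training gives a slightly less accurate classifier: its accuracy is 1 percentage point lower (0.953 against 0.963). Yet, scoring is more than 20 times faster (about 3.7 seconds against 87.4 seconds). Indeed, the set of words is significantly reduced, so predict has far fewer words to loop over for each message.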
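
The execution time also depends on how `score` is written: it calls `predict` again in each `elif` branch, up to four times per message. A possible variant, sketched here with a hypothetical function `score_once` and a hypothetical stand-in `toy_predict` for a trained classifier, calls `predict` only once per message and reuses the result.

```python
def score_once(predict, messages, labels):
    """Accuracy and confusion matrix, calling `predict` once per message."""
    index = {"ham": 0, "spam": 1}
    confusion = [[0, 0], [0, 0]]          # rows: predicted, columns: actual
    for message, label in zip(messages, labels):
        predicted = predict(message)[0]   # single call, result reused below
        confusion[index[predicted]][index[label]] += 1
    total = sum(map(sum, confusion))
    return (confusion[0][0] + confusion[1][1]) / total, confusion

# Hypothetical stand-in for the `predict` method of a trained classifier.
calls = []
def toy_predict(message):
    calls.append(message)                 # record each call to count them
    return ["spam", 0.9] if "free" in message else ["ham", 0.9]

accuracy, confusion = score_once(
    toy_predict,
    ["free prize", "see you later", "free lunch"],
    ["spam", "ham", "ham"],
)
print(accuracy, confusion, len(calls))
```

With a real classifier, `cfl1.predict` could be passed in place of `toy_predict`. The returned matrix has the same layout as the one returned by `score`.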

We will now compare the accuracy of both classifiers on the training data set.

In [10]:
messages_train = X_train["text"].tolist()
labels_train = X_train["type"].tolist()
# The training messages must be scored against the training labels
s1 = cfl1.score(messages_train, labels_train)
s2 = cfl2.score(messages_train, labels_train)

print(s1[0], s2[0])

0.974 0.9728


The second classifier is slightly less accurate than the first one on both the training set (0.9728 against 0.974) and the validation set (0.953 against 0.963).

For the validation set, the first classifier labels as spam 13 messages that are actually ham (6 for the second classifier).
To reduce this number, we could create the reverse of the spam key words: ham key words. This list would contain all the words whose proportion in ham messages is at least 20 times higher than their proportion in spam messages. It could increase the execution time, but the classifier would still be faster than the first one.
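
A possible way to build both lists is sketched below. It is not part of the class above: the function names `smoothed_share` and `select_keywords` and the toy corpus are hypothetical, and the ham test is simply the mirror image of the spam test used in `train2`.

```python
def smoothed_share(word, messages):
    """Add-one smoothed share of messages containing `word` (substring test)."""
    return (1.0 + sum(1 for m in messages if word in m)) / len(messages)

def select_keywords(ham, spam, ratio=20):
    """Split the vocabulary into spam key words and ham key words."""
    vocabulary = set(" ".join(ham + spam).split())
    spam_keywords, ham_keywords = [], []
    for w in sorted(vocabulary):
        p_ham, p_spam = smoothed_share(w, ham), smoothed_share(w, spam)
        if p_ham * ratio < p_spam:      # same test as in train2
            spam_keywords.append(w)
        elif p_spam * ratio < p_ham:    # mirrored test for ham key words
            ham_keywords.append(w)
    return spam_keywords, ham_keywords

# Hypothetical toy corpus: 100 ham messages and 60 spam messages.
ham = ["see you later"] * 50 + ["call me now"] * 50
spam = ["free prize now"] * 30 + ["win free entry"] * 30

spam_keywords, ham_keywords = select_keywords(ham, spam)
print(spam_keywords)
print(ham_keywords)
```

Because of the add-one smoothing, the smallest possible spam proportion is one over the number of spam messages, so a word can only become a ham key word when there are more than about 20 spam messages. The word `now`, frequent in both classes here, ends up in neither list.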

#### Let us finally test our classifier on the test set.

In [11]:
# First, we retrain the second classifier on both the training and validation sets
hamMessages_train = X_intermediary[X_intermediary['type'] == 'ham']["text"].tolist()
spamMessages_train = X_intermediary[X_intermediary['type'] == 'spam']["text"].tolist()

cfl2.train2(hamMessages_train, spamMessages_train)

messages_test = X_test["text"].tolist()
labels_test = X_test["type"].tolist()

# Returns the pair (accuracy, confusion matrix)
cfl2.score(messages_test, labels_test)

(0.9768339768339769, array([[1796.,   41.],
        [   7.,  228.]]))