# SPAM MAIL DETECTION WITH NAIVE BAYES



Imports

In [1]:
import pandas as pd
import numpy as np
import math
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

Reading the data

In [2]:
df = pd.read_csv('emails.csv') #reading data
df.drop_duplicates(inplace=True) #removing duplicate rows

spammails = df[df['spam'] == 1]
hammails = df[df['spam'] == 0]

Class distribution

In [3]:
df.groupby('spam').count()

      text
spam
0     4327
1     1368


Function that returns the three most common words for Part 1

In [4]:
def get_three_words(dataset, ngram, stopword):

    vectorizer = CountVectorizer(ngram_range=ngram, max_features=3, stop_words=stopword) #max_features=3 gives us most common 3 words
    txt = dataset.text
    X = vectorizer.fit_transform(txt)  #transform
    feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    feature_count = X.toarray().sum(axis=0)
    dictionary = dict(zip(feature_names,feature_count)) #creating dictionary

    return dictionary

In [5]:
spam_mail_unigram = get_three_words(spammails,(1,1),None)
print('Most common 3 words in spam mails with stop-words: ', spam_mail_unigram)
spam_mail_unigram_sw = get_three_words(spammails,(1,1), 'english')
print('Most common 3 words in spam mails without stop-words : ', spam_mail_unigram_sw)

ham_mail_unigram = get_three_words(hammails, (1,1), None)
print('Most common 3 words in ham mails with stop-words: ',ham_mail_unigram)
ham_mail_unigram_sw = get_three_words(hammails, (1,1), 'english')
print('Most common 3 words in ham mails without stop-words : ', ham_mail_unigram_sw)

Most common 3 words in spam mails with stop-words:  {'and': 6517, 'the': 8975, 'to': 8165}
Most common 3 words in spam mails without stop-words :  {'business': 844, 'com': 999, 'subject': 1574}
Most common 3 words in ham mails with stop-words:  {'and': 20805, 'the': 40935, 'to': 33369}
Most common 3 words in ham mails without stop-words :  {'ect': 11410, 'enron': 13329, 'subject': 8545}


## PART 1
With stop-words included, the lists are dominated by stop-words. We can't tell whether a mail is ham or spam from these three words, because they are the same in both classes. In terms of word occurrences, the ham mail / spam mail ratio is about 3.2. When we multiply the spam counts by 3.2, the count for "and" comes out very close to its ham count, while "to" and "the" remain more frequent in ham. With stop-words included, the three most common words don't give us any useful idea.

Without stop-words, we can see that spam senders focus on business and direct links (we get this idea from "com"). Nowadays big companies use bridge pages or clickable buttons to reach a link, but spammers put the direct link into the mail. "ect" and "enron" look odd, but they come from the email addresses: ECT = Enron Capital & Trade, and HOU = Houston.
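
The ratio comparison in Part 1 can be reproduced with a few lines of arithmetic. This is a minimal sketch that hard-codes the counts printed by the cells above; the variable names are illustrative and are not part of the notebook.

```python
# Counts copied from the outputs above (class sizes and top-3 word counts).
n_ham, n_spam = 4327, 1368
spam_counts = {"and": 6517, "the": 8975, "to": 8165}
ham_counts = {"and": 20805, "the": 40935, "to": 33369}

ratio = n_ham / n_spam  # ham mails per spam mail
print(f"ham/spam mail ratio: {ratio:.2f}")

# Scale each spam count by the class ratio and compare it with the ham count.
scaled = {word: count * ratio for word, count in spam_counts.items()}
for word in spam_counts:
    print(f"{word}: scaled spam count {scaled[word]:.0f} vs ham count {ham_counts[word]}")
```

A scaled count close to the ham count means the word is used at a similar per-mail rate in both classes, so it carries little information for classification.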


In [6]:
def bow(ngram, stp_wrds):
    vectorizer = CountVectorizer(ngram_range=ngram, min_df=0.01, stop_words=stp_wrds)
    txt = df.text.fillna("")
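    X = vectorizer.fit_transform(txt)  # learn the vocabulary and build the count matrix

    arr2 = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    new_df = pd.concat([df, pd.DataFrame(X.toarray(), columns=arr2)], axis=1, join='inner')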

    pd.options.mode.chained_assignment = None  # default='warn'
    global train,test
    train, test = train_test_split(new_df, test_size=0.2)

    global train2,test2
    if(ngram == (1,1)):
        train.columns.values[0] = "text1"
        train.columns.values[train.columns.get_loc("text")] = "text1"
        train.columns.values[0] = "text"

    train2, test2 = train.copy(), test.copy()
    train['text'] = train['text'].str.replace(r'\W', ' ', regex=True) #removes punctuation
    train['text'] = train['text'].str.lower() #lowercase strings
    train['text'] = train['text'].str.split()
    global vocabulary
    vocabulary = []
    for text in train['text']:
        for word in text:
            vocabulary.append(word)

    vocabulary = list(set(vocabulary))
    global spam_mails, ham_mails
    spam_mails = train[train['spam'] == 1]
    ham_mails = train[train['spam'] == 0]

#### Bayes Theorem

If we know the probability of B occurring when A has already occurred, i.e. **P(B|A)**, but we want to know the reverse, the probability of A occurring when B has already occurred, i.e. **P(A|B)**, we can use Bayes' theorem, which states: **P(A|B) = P(A,B) / P(B) = P(B|A) P(A) / P(B)**

Event B can be further split into two mutually exclusive events, "B and A" and "B and not A": **P(B) = P(B,A) + P(B, not A)**

So we can write:

**P(A|B) = P(B|A) P(A) / {P(B|A) P(A) + P(B | not A) P(not A)}**, and this is how Bayes' theorem is usually written.

Let's now use Bayes' theorem to create the naive Bayes classifier. We will use the example of a spam filter that classifies a mail as spam or non-spam.

Let's say S is the event that a mail is spam and D is the event that the mail contains the word "Discount". Using Bayes' theorem, we can find the probability of a mail being spam if it contains the word "Discount":

**P(S|D) = P(D|S) P(S) / {P(D|S) P(S) + P(D | not S) P(not S)}**

Here the numerator is the probability that a mail is spam and contains "Discount", and the denominator is just the probability that a mail contains "Discount". Now, if a mail being spam and not spam are equally likely events, then **P(S) = P(not S) = 0.5**

Therefore **P(S|D) = P(D|S) / {P(D|S) + P(D | not S)}**

So if 60% of the spam mails contain the word "Discount" and only 2% of the non-spam mails contain it, then **P(S|D)** = *0.60 / 0.62 = 96.77%*
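
The "Discount" calculation can be checked numerically. The helper below is a minimal sketch of the formula for P(S|D); the function name and its parameters are illustrative and do not appear elsewhere in this notebook.

```python
def posterior_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Return P(S|D) from P(D|S), P(D|not S) and the prior P(S) via Bayes' theorem."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    return numerator / denominator

# Equal priors, as in the text: 0.60 / 0.62
p_equal = posterior_spam_given_word(0.60, 0.02)
print(f"P(S|D) with equal priors: {p_equal:.4f}")

# Same likelihoods, but with the class proportions of this dataset as the prior.
p_data = posterior_spam_given_word(0.60, 0.02, p_spam=1368 / (1368 + 4327))
print(f"P(S|D) with dataset prior: {p_data:.4f}")
```

With a prior below 0.5 the posterior drops, which is why the classifier further down starts from the class frequencies `p_spam` and `p_ham` rather than assuming equal priors.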

In [7]:
def naive_bayes():
    global p_spam, p_ham, parameters_spam, parameters_ham
    p_spam = len(spam_mails) / len(train)
    p_ham = len(ham_mails) / len(train)
    # n_Spam
    n_words_per_spam_message = spam_mails['text'].apply(len)
    n_spam = n_words_per_spam_message.sum()
    # n_Ham
    n_words_per_ham_message = ham_mails['text'].apply(len)
    n_ham = n_words_per_ham_message.sum()
    # n_Vocabulary
    n_vocabulary = len(vocabulary)
    alpha = 1 #laplace smoothing
    #Initiate parameters
    parameters_spam = {unique_word:0 for unique_word in vocabulary}
    parameters_ham = {unique_word:0 for unique_word in vocabulary}

    # Calculate parameters
    for word in vocabulary:
        if(word in spam_mails and word != "text"):
            n_word_given_spam = spam_mails[word].sum() # spam_messages already defined
            p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
            parameters_spam[word] = p_word_given_spam

            n_word_given_ham = ham_mails[word].sum() # ham_messages already defined
            p_word_given_ham =(n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
            parameters_ham[word] = p_word_given_ham
            #print(p_word_given_ham)
        else:
            continue
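
The cell above estimates each word probability as (count of the word in the class + alpha) / (total words in the class + alpha * vocabulary size). The toy sketch below applies the same formula to made-up counts (the words and numbers are hypothetical) to show what the smoothing term does: a word never seen in a class still gets a small non-zero probability.

```python
def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)

# Hypothetical toy counts for one class: 3 distinct words, 10 tokens in total.
toy_counts = {"discount": 6, "meeting": 4, "invoice": 0}
total = sum(toy_counts.values())
vocab_size = len(toy_counts)

probs = {w: smoothed_probability(c, total, vocab_size) for w, c in toy_counts.items()}
for word, p in probs.items():
    print(f"P({word} | class) = {p:.4f}")
```

In this toy setup, where the total is the sum of the per-word counts, the smoothed probabilities still sum to 1.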

Function that classifies mails

In [8]:
def classify_test_set(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = math.log2(p_spam)
    p_ham_given_message = math.log2(p_ham)

    for word in message:
        if (word in parameters_spam):
            if(parameters_spam[word] != 0):
                p_spam_given_message += math.log2(parameters_spam[word])
        if (word in parameters_ham):
            if(parameters_ham[word] !=0):
                p_ham_given_message += math.log2(parameters_ham[word])

    if p_ham_given_message > p_spam_given_message:
        return 0
    elif p_spam_given_message > p_ham_given_message:
        return 1
    else:
        return 1 #equal probability is classified as spam
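
`classify_test_set` adds logarithms instead of multiplying probabilities. A short sketch with hypothetical numbers (a 400-word message where every word has likelihood 1e-4) illustrates the reason: the plain product underflows to zero in floating point, while the sum of logs stays finite and can still be compared between classes.

```python
import math

# Hypothetical message of 400 words, each with likelihood 1e-4 under a class.
word_probs = [1e-4] * 400

product = 1.0
for p in word_probs:
    product *= p  # underflows to 0.0 long before the loop ends

log_score = sum(math.log2(p) for p in word_probs)  # stays finite

print("product of probabilities:", product)
print("sum of log2 probabilities:", log_score)
```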

In [9]:
def print_score():
    test['prediction'] = test['text'].apply(classify_test_set)

    print("Accuracy:", accuracy_score(test['spam'], test['prediction']))
    print("Precision:", precision_score(test['spam'], test['prediction']))
    print("Recall:", recall_score(test['spam'], test['prediction']))
    print("F1 score:", f1_score(test['spam'], test['prediction']))

In [10]:
print("Unigram scores")
bow((1,1), None)
naive_bayes()
print_score()
print()
print("Unigram scores without stopwords")
bow((1,1), "english")
naive_bayes()
print_score()
print()
print("Bigram scores")
bow((2,2), None)
naive_bayes()
print_score()
print()
print("Bigram scores without stopwords")
bow((2,2), "english")
naive_bayes()
print_score()

Unigram scores
Accuracy: 0.9744042365401588
Precision: 0.9878542510121457
Recall: 0.9037037037037037
F1 score: 0.9439071566731141

Unigram scores without stopwords
Accuracy: 0.9779346866725508
Precision: 0.9871794871794872
Recall: 0.9130434782608695
F1 score: 0.948665297741273

Bigram scores
Accuracy: 0.7625772285966461
Precision: 1.0
Recall: 0.014652014652014652
F1 score: 0.02888086642599278

Bigram scores without stopwords
Accuracy: 0.7811120917917035
Precision: 1.0
Recall: 0.027450980392156862
F1 score: 0.0534351145038168


#### Unigram Analysis
For our data, removing stop-words has almost no effect on our scores. Removing stop-words reduces the total number of words in the mails, which helps us compute predictions faster. Accuracy, precision, recall and F1 are all relatively close to each other, although recall (about 0.90 to 0.91) is lower than precision (about 0.99), so missed spam is more common than ham flagged as spam. With scores this close, accuracy is a reasonable metric to select for our data.

#### Bigram Analysis
Using bigrams changes our scores dramatically. The wrong predictions are *spam* mails classified as *ham*, which is why we have a remarkably low recall score. On the other hand, the algorithm doesn't predict *ham* mails as *spam*, which gives a precision score of 1.0. The reason is that ham mails have a better structure: grammatically correct sentences share more common adjacent words. Spam mails, in contrast, have messy sentences with a lot of typos. The bigram model is good at predicting ham mails but not good at predicting spam.
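
How a precision of 1.0 can coexist with a recall near zero is easiest to see from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an assumption: they were chosen because they reproduce the "Bigram scores" output above.

```python
# Assumed confusion-matrix counts (inferred from the printed bigram scores,
# not printed by the notebook itself).
tp, fp, fn, tn = 4, 0, 269, 860

precision = tp / (tp + fp)   # no ham was flagged as spam
recall = tp / (tp + fn)      # almost all spam was missed
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
```

Under these counts the classifier labels nearly every mail as ham, so accuracy mostly reflects the share of ham in the test set.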

In [11]:
spam_mails2 = train2[train2['spam'] == 1]
ham_mails2 = train2[train2['spam'] == 0]

def td_idf(dataframe, type, stp_wrds):
    tfidf_vectorizer = TfidfVectorizer(stop_words=stp_wrds)
    tfidf_vector = tfidf_vectorizer.fit_transform(dataframe["text"])
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=dataframe["text"], columns=tfidf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
    tfidf_df.loc["0"] = (tfidf_df > 0).sum()
    tfidf_df.drop("text", axis=1)

    tfidf_df = tfidf_df.sort_index()
    if(type == "presence"):
        print()
        tfidf_df = tfidf_df.sort_values(by = "0", axis = 1, ascending=False )
    if(type == "absence"):
        tfidf_df = tfidf_df.sort_values(by = "0", axis = 1)
    print(tfidf_df.loc["0"].head(10))

In [12]:
print("10 words whose presence most strongly predicts that the mail is spam")
td_idf(spam_mails2, "presence", None)
print("10 words whose absence most strongly predicts that the mail is spam")
td_idf(spam_mails2, "absence", None)
print("10 words whose presence most strongly predicts that the mail is ham")
td_idf(ham_mails2, "presence", None)
print("10 words whose absence most strongly predicts that the mail is ham")
td_idf(ham_mails2, "absence", None)


10 words whose presence most strongly predicts that the mail is spam

subject    1113.0
to          946.0
the         885.0
and         802.0
you         791.0
of          788.0
your        779.0
for         738.0
is          716.0
in          691.0
Name: 0, dtype: float64
10 words whose absence most strongly predicts that the mail is spam
interact         1.0
stow             1.0
hsa              1.0
hsdl             1.0
storyclick       1.0
storyacquired    1.0
storms           1.0
https            1.0
hue              1.0
huffman          1.0
Name: 0, dtype: float64
10 words whose presence most strongly predicts that the mail is ham

subject    3418.0
to         3180.0
the        3142.0
and        2802.0
for        2733.0
you        2730.0
of         2608.0
on         2560.0
in         2529.0
is         2394.0
Name: 0, dtype: float64
10 words whose absence most strongly predicts that the mail is ham
invention        1.0
knights          1.0
knobs            1.0
knocked          1.0


The presence lists are dominated by stop-words, which will not be helpful for our algorithm. We could rather use absence for our algorithm.

In [13]:
print("10 words whose presence most strongly predicts that the mail is ham without stopwords")
td_idf(ham_mails2, "presence", "english")
print("10 words whose presence most strongly predicts that the mail is spam without stopwords")
td_idf(spam_mails2, "presence", "english")

10 words whose presence most strongly predicts that the mail is ham without stopwords

subject     3418.0
vince       2195.0
enron       2044.0
cc          1724.0
kaminski    1554.0
thanks      1416.0
2000        1383.0
pm          1340.0
ect         1278.0
know        1224.0
Name: 0, dtype: float64
10 words whose presence most strongly predicts that the mail is spam without stopwords

subject        1113.0
com             368.0
http            313.0
just            299.0
business        269.0
click           262.0
information     259.0
time            252.0
email           249.0
best            246.0
Name: 0, dtype: float64


Without stop-words, the presence lists now contain useful words. We can use these words to predict faster: when we see them, we can decide earlier whether a mail is spam or ham, which saves time.

**Why might it make sense to remove stop words when interpreting the model?**
If we have a huge amount of data or limited memory, removing stop-words helps a lot. It also removes low-level information from the text so that more focus goes to the important information. For our dataset, removing stop-words doesn't have any negative effect: it helped us compute the unigram model faster and increased the bigram scores a tiny bit.


**Why might it make sense to keep stop words?**
We should remove stop-words only if they don't add any new information to our problem. For example, if we are training a model for a sentiment analysis task, we might not want to remove the stop-words.


Movie review: “The movie was not good at all.”


Text after removal of stop words: “movie good”

We can clearly see that the original review of the movie was negative. However, after the removal of stop words, the review reads as positive.
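
The movie-review example can be reproduced with a few lines of plain Python. This is only a sketch: `STOP_WORDS` is a small hand-picked set chosen for this sentence (an assumption, not scikit-learn's full English list), and `remove_stop_words` is a hypothetical helper.

```python
# Small hand-picked stop-word set for illustration (an assumption,
# not scikit-learn's full English stop-word list).
STOP_WORDS = {"the", "was", "not", "at", "all"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, strip basic punctuation, and drop stop words."""
    tokens = [token.strip(".,!?\"'").lower() for token in text.split()]
    return " ".join(t for t in tokens if t and t not in stop_words)

review = "The movie was not good at all."
print(remove_stop_words(review))  # movie good
```

Because "not" is treated as a stop word, the negation disappears and only "movie good" is left for a sentiment model to work with.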
