# Bayes-Theorem based Spam Filter

## Ling-Spam emails

### 1 Explore and clean the data

#### 1. Load the lingspam-emails.csv.bz2 dataset.

Browse a handful of emails, both spam and non-spam ones, to see what kind of text we are working
with here.

Hint: check out textwrap module to print long strings on multiple lines.


In [13]:
# imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

raw = pd.read_csv("../data/lingspam-emails.csv.bz2", sep="\t")
np.shape(raw)
#pd.set_option("display.max_colwidth", -1) # uncomment to see all the text
raw

Unnamed: 0,spam,files,message
0,False,3-1msg1.txt,Subject: re : 2 . 882 s - > np np > date : su...
1,False,3-1msg2.txt,Subject: s - > np + np the discussion of s - ...
2,False,3-1msg3.txt,Subject: 2 . 882 s - > np np . . . for me it ...
3,False,3-375msg1.txt,"Subject: gent conference "" for the listserv ""..."
4,False,3-378msg1.txt,Subject: query : causatives in korean could a...
...,...,...,...
2888,True,spmsgc50.txt,Subject: . international driver ' s license n...
2889,True,spmsgc51.txt,Subject: new on 95 . 8 capital fm this is new...
2890,True,spmsgc52.txt,Subject: re : new medical technology company ...
2891,True,spmsgc53.txt,Subject: re : your request for an overview ye...


#### 2. Ensure the data is clean: remove all cases with missing spam and empty message field. We do not care about the file names.

In [14]:
ls = raw.dropna(subset=["spam", "message"])
ls.isna().sum()

spam       0
files      0
message    0
dtype: int64

### 2 Create Document-term matrix (DTM)

#### 1. Choose ∼ 10 words which might be good to distinguish between spam/non-spam. Use these four: viagra, deadline, million, and and. Choose more words yourself (you may want to return here and reconsider your choice later).

We chose: rates, sex, cash, insurance, free, and money

#### 2. Convert your messages into DTM. We do not use the full 60k-words DTM here but only a baby-DTM of the 10 words you picked above. You may add the DTM columns to the original data frame, or keep those in a separate structure.

In [15]:
list_of_words = ["viagra", "deadline", "million", "and", "toll", "prizes", "cash", "insurance", "free", "money"]

for w in list_of_words:
    ls[w] = ls.message.str.lower().str.contains(w)
    ls[w] = ls[w] * 1 #1 = true, 0 = false
    
ls.head(10)

Unnamed: 0,spam,files,message,viagra,deadline,million,and,toll,prizes,cash,insurance,free,money
0,False,3-1msg1.txt,Subject: re : 2 . 882 s - > np np > date : su...,0,0,0,1,0,0,0,0,0,0
1,False,3-1msg2.txt,Subject: s - > np + np the discussion of s - ...,0,0,0,0,0,0,0,0,0,0
2,False,3-1msg3.txt,Subject: 2 . 882 s - > np np . . . for me it ...,0,0,0,0,0,0,0,0,0,0
3,False,3-375msg1.txt,"Subject: gent conference "" for the listserv ""...",0,0,0,1,0,0,0,0,0,1
4,False,3-378msg1.txt,Subject: query : causatives in korean could a...,0,0,0,1,0,0,0,0,0,0
5,False,3-378msg2.txt,Subject: l2 learning / cultural empathy a gra...,0,0,0,1,0,0,0,0,0,0
6,False,3-378msg3.txt,Subject: psycholinguistics teaching for an un...,0,0,0,1,0,0,0,0,0,0
7,False,3-378msg4.txt,Subject: german corpora i am looking for on-l...,0,0,0,0,0,0,0,0,0,0
8,False,3-378msg5.txt,"Subject: t hi , help ! i have to design an ex...",0,0,0,1,0,0,0,0,0,0
9,False,3-379msg1.txt,Subject: job - university of utah the linguis...,0,0,0,1,0,0,0,0,0,0


#### 3. Split your work data (i.e. the DTM) and target (the spam indicator) into training and validation chunks (80/20 is a good split).


In [16]:
x = ls.drop(["spam", "files", "message"], axis=1)
y = ls.spam * 1
train_X, test_X, train_y, test_y = train_test_split(x,y, test_size=0.2)

### 3. Estimate and validate

#### 1. Design a scheme for your variable names that describes these probabilities so that a) you understand what they mean; and b) the others (including your grader) will understand those!

Hint: you may get some ideas from the Python notes, Section 2.3 Base Language.

The first task is to compute these probabilities. Use only training data for this task.

Vairable name for spam/not spam: Pr_S1, Pr_S0

Vairable name for spam email with the spam word: Pr_S1W1

Vairable name for non-spam email with the spam word: Pr_S0W1

Vairable name for spam email without the spam word: Pr_S1W0

Vairable name for non-spam email without the spam word: Pr_S0W0

Vairable name for probability of the spam word is marked as spam: Pr_W1S1
    
Vairable name for probability of the spam word is not marked as spam: Pr_W1S0

#### 2. Compute the priors, the unconditional probabilities for an email being spam and non-spam, Pr(category = S) and Pr(category = NS). These probabilities are based on the spam variable alone, not on the text.


In [17]:
Pr_S1 = train_y.mean()
Pr_S0 = 1 - train_y.mean()
print("Probability of being spam in training data", round(Pr_S1, 3))
print("Probability of being not spam in training data", round(Pr_S0, 3))

Probability of being spam in training data 0.165
Probability of being not spam in training data 0.835


#### 3. For each word w, compute the normalizers, Pr(w = 1) and Pr(w = 0).

Hint: this is Pr(million = 1) = 0.0484. But note this value (and the following hints) depends on
your random training/validation split!


In [18]:
for word in list_of_words: 
    Pr_W1 = np.mean(train_X[word])
    Pr_W0 = 1 - Pr_W1
    
    print("Pr(", word,"= 1):", round(Pr_W1, 3))
    print("Pr(", word,"= 0):", round(Pr_W0, 3))

Pr( viagra = 1): 0.0
Pr( viagra = 0): 1.0
Pr( deadline = 1): 0.153
Pr( deadline = 0): 0.847
Pr( million = 1): 0.048
Pr( million = 0): 0.952
Pr( and = 1): 0.942
Pr( and = 0): 0.058
Pr( toll = 1): 0.019
Pr( toll = 0): 0.981
Pr( prizes = 1): 0.008
Pr( prizes = 0): 0.992
Pr( cash = 1): 0.036
Pr( cash = 0): 0.964
Pr( insurance = 1): 0.009
Pr( insurance = 0): 0.991
Pr( free = 1): 0.184
Pr( free = 0): 0.816
Pr( money = 1): 0.084
Pr( money = 0): 0.916


#### 4. For each word w, compute Pr(w = 1|category = S) and Pr(w = 1|category = NS). These probabilities are based on both the spam-variable and on the DTM component that corresponds to the word w.

Hint: Pr(million = 1|category = S) = 0.252


In [19]:
for word in list_of_words:
    
    temp = train_X.copy()
    temp["spam"] = train_y
    
    Pr_W1S1 = temp[temp.spam == 1]
    Pr_W1S1 = Pr_W1S1[word].mean()
    
    Pr_W1S0 = temp[temp.spam == 0]
    Pr_W1S0 = Pr_W1S0[word].mean()
    
    print("Pr(", word, "= 1|category = S ):", round(Pr_W1S1, 3))
    print("Pr(", word, "= 1|category = NS ):", round(Pr_W1S0, 3))

Pr( viagra = 1|category = S ): 0.003
Pr( viagra = 1|category = NS ): 0.0
Pr( deadline = 1|category = S ): 0.0
Pr( deadline = 1|category = NS ): 0.184
Pr( million = 1|category = S ): 0.241
Pr( million = 1|category = NS ): 0.01
Pr( and = 1|category = S ): 0.919
Pr( and = 1|category = NS ): 0.946
Pr( toll = 1|category = S ): 0.1
Pr( toll = 1|category = NS ): 0.004
Pr( prizes = 1|category = S ): 0.042
Pr( prizes = 1|category = NS ): 0.001
Pr( cash = 1|category = S ): 0.163
Pr( cash = 1|category = NS ): 0.011
Pr( insurance = 1|category = S ): 0.034
Pr( insurance = 1|category = NS ): 0.004
Pr( free = 1|category = S ): 0.622
Pr( free = 1|category = NS ): 0.097
Pr( money = 1|category = S ): 0.362
Pr( money = 1|category = NS ): 0.029


#### 5. Finally, compute the probabilities of interest, Pr(category = S|w = 1) and Pr(category = S|w = 0). Compute this value using Bayes theorem, not directly by counting! For the check, you may also compute Pr(category = NS|w = 1) and Pr(category = NS|w = 0)

Hint: Pr(category = S|million = 1) = 0.843. But note this number depends on your random
testing-validation split!


In [20]:
for word in list_of_words:
    Pr_W1 = np.mean(train_X[word])
    Pr_W0 = 1 - Pr_W1
    
    temp = train_X.copy()
    temp["spam"] = train_y
    
    Pr_W1S1 = temp[temp.spam == 1]
    Pr_W1S1 = Pr_W1S1[word].mean()
    
    Pr_W1S0 = temp[temp.spam == 0]
    Pr_W1S0 = Pr_W1S0[word].mean()
    
    Pr_W0S1 = 1 - Pr_W1S1
    Pr_W0S0 = 1 - Pr_W1S0
        
    Pr_S1W1 = (Pr_W1S1 * Pr_S1) / Pr_W1
    Pr_S1W0 = (Pr_W0S1 * Pr_S1) / Pr_W0
    
    Pr_S0W1 = (Pr_W1S0 * Pr_S0) / Pr_W1
    Pr_S0W0 = (Pr_W0S0 * Pr_S0) / Pr_W0
        
    print("Pr(category = S |", word, " = 1) = ", round(Pr_S1W1, 3))
    print("Pr(category = S |", word, " = 0) = ", round(Pr_S1W0, 3))

    print("Pr(category = NS |", word, " = 1) = ", round(Pr_S0W1, 3))
    print("Pr(category = NS |", word, " = 0) = ", round(Pr_S0W0, 3))

Pr(category = S | viagra  = 1) =  1.0
Pr(category = S | viagra  = 0) =  0.164
Pr(category = NS | viagra  = 1) =  0.0
Pr(category = NS | viagra  = 0) =  0.836
Pr(category = S | deadline  = 1) =  0.0
Pr(category = S | deadline  = 0) =  0.194
Pr(category = NS | deadline  = 1) =  1.0
Pr(category = NS | deadline  = 0) =  0.806
Pr(category = S | million  = 1) =  0.821
Pr(category = S | million  = 0) =  0.131
Pr(category = NS | million  = 1) =  0.179
Pr(category = NS | million  = 0) =  0.869
Pr(category = S | and  = 1) =  0.161
Pr(category = S | and  = 0) =  0.23
Pr(category = NS | and  = 1) =  0.839
Pr(category = NS | and  = 0) =  0.77
Pr(category = S | toll  = 1) =  0.844
Pr(category = S | toll  = 0) =  0.151
Pr(category = NS | toll  = 1) =  0.156
Pr(category = NS | toll  = 0) =  0.849
Pr(category = S | prizes  = 1) =  0.889
Pr(category = S | prizes  = 0) =  0.159
Pr(category = NS | prizes  = 1) =  0.111
Pr(category = NS | prizes  = 0) =  0.841
Pr(category = S | cash  = 1) =  0.747
Pr(categ

#### 6. Which of these probabilities have to sum to one? (E.g. Pr(category = 1) +Pr(category = 0) = 1.) Which ones do not? Explain!


Each of the 10 words have 4 different Pr, and the pairs that have W = 1 (so when the word is present in the email) sum to one and for the pairs that have W = 0 (so when the word is not present) sum to one as well. 

For example: 

Pr(category = S | money  = 1) + Pr(category = NS | money  = 1) = (0.72 + 0.28) = 1

Pr(category = S | money  = 0) + Pr(category = NS | money  = 0) = (0.112 + 0.888) = 1

#### 7. For each email in your validation set, predict whether it is predicted to be spam or nonspam. Hint: you should check if it contains the word w and use the appropriate probability, Pr(category = S|w = 1) or Pr(category = S|w = 0).


In [21]:
for words in list_of_words:
    Pr_W1 = np.mean(train_X[words])
    Pr_W0 = 1 - Pr_W1
    
    temp = train_X.copy()
    temp["spam"] = train_y
    
    Pr_W1S1 = temp[temp.spam == 1]
    Pr_W1S1 = Pr_W1S1[words].mean()
    
    Pr_W1S0 = temp[temp.spam == 0]
    Pr_W1S0 = Pr_W1S0[word].mean()
    
    Pr_W0S1 = 1 - Pr_W1S1
    Pr_W0S0 = 1 - Pr_W1S0
    
    predict = []
    
    for index, row in test_X.iterrows():
        if row[words] == 1:    # w = 1
            Pr_S1W1 = (Pr_W1S1 * Pr_S1) / Pr_W1
            
            if Pr_S1W1 > 0.5: # when it is spam
                predict.append(1)
            else:
                predict.append(0)
        elif row[words] == 0: 
            Pr_S1W0 = (Pr_W0S1 * Pr_S1) / Pr_W0
            
            if Pr_S1W0 > 0.5: # when it is spam
                predict.append(1)
            else:
                predict.append(0)

    cm = confusion_matrix(test_y, predict)
    print("\n" + words)
    print(cm)
    
    print("Accuracy: ", round(np.mean(test_y == predict),3)) #accuracy

    from sklearn.metrics import precision_score
    print("Precision: ", round(precision_score(test_y, predict, zero_division = 0),3)) #precision

    from sklearn.metrics import recall_score
    print("Recall: ", round(recall_score(test_y, predict),3)) #recall


viagra
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

deadline
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

million
[[475   4]
 [ 76  24]]
Accuracy:  0.862
Precision:  0.857
Recall:  0.24

and
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

toll
[[479   0]
 [ 86  14]]
Accuracy:  0.851
Precision:  1.0
Recall:  0.14

prizes
[[478   1]
 [ 96   4]]
Accuracy:  0.832
Precision:  0.8
Recall:  0.04

cash
[[466  13]
 [ 72  28]]
Accuracy:  0.853
Precision:  0.683
Recall:  0.28

insurance
[[477   2]
 [ 98   2]]
Accuracy:  0.827
Precision:  0.5
Recall:  0.02

free
[[446  33]
 [ 39  61]]
Accuracy:  0.876
Precision:  0.649
Recall:  0.61

money
[[469  10]
 [ 58  42]]
Accuracy:  0.883
Precision:  0.808
Recall:  0.42


#### 8. Print the resulting confusion matrix and compute accuracy, precision and recall.

Each confusion matrix for each word is above, and the resulting accurate, precison, and recall.

#### 9. Which steps above constitute model training? In which steps do you use trained model? What is a trained model in this case? Explain!

Hint: a trained model is all you need to make predictions.

Step 7 uses model training to calculate the different probablities that a word will be spam or not. From there we no longer need the training data, as we only need to know which probability is over our set threshold to determine if the email is spam or not. Train_X is the trained model, as it is the set of emails we use to calcuate probaility and determine our predictions of whether train_X emails are spam or not. 

#### 10. Comment the overall performance of the model–how do accuracy, precision and recall look like?


We determined that if more than 5 of the words were determined to be part of a spam email, then the email is spam email (not spam otherwise). Our accuracy is relatively high, at 83.9%, our precision was also high at 94.1%, and our recall is at 14.8%, which is similar to the original percentage of spam emails in our test data (19.7%). 

#### 11. Explain why do you see very low recall while the other indicators do not look that bad.

Somewhat explained above, recall is the relative amount of posititve results in the data. Our positive result is spam emails, which only consist of 19.7% of the test data, so an accurate model would show a percent close to this percent, if it had predicted a similar relative percent of spam emails.

#### 12. Explain why some words work well and others not:

(a) why does “million” improve accuracy?

Million improves accuracy as it is a word that commonly appears in spam emails and less often in regular emails. The word 'million' is not likely to be used in a professional setting, and is often used as bait to get the victim to click on the email.

(b) why does “viagra” not work?

Viagra does not work very well because the frequency of it in the whole data frame is very low, so the training data almost always never has any/enough emails for it to be a work to accurately identify spam emails.

(c) why does “deadline” not work?

Deadline does not work as well as there are many more emails that use the word 'deadline' that are not spam emails than emails that are spam and use the word deadline. For example, the word 'deadline' is likely to be used in conversations between colleagues when discussing work. But in this case specifically, there is no deadline word in the training set for spam emails. 

(d) why does “and” not work?

And is a word that is frequently used in both spam non-spam emails, so it is not a good word to identify spam emails with and will be commonly misidentified both ways. 

Hint: You may just see where in which emails these words occur, and how frequently. These are all
different reasons!


#### 13. Add such smoothing to the model. You can either literally add two such lines of data, or alternatively manipulate the way you compute the probabilities.

In [22]:
alpha = 3

for word in list_of_words:
    Pr_W1 = np.mean(train_X[word])
    Pr_W0 = 1 - Pr_W1
    
    temp = train_X.copy()
    temp["spam"] = train_y
    
    spam_temp = temp[temp.spam == 1]
    Pr_W1S1 = spam_temp[spam_temp == 1].count()
    Pr_W1S1 = (Pr_W1S1[words]) / (len(spam_temp))
    
    nospam_temp = temp[temp.spam == 0]
    Pr_W1S0 = nospam_temp[nospam_temp == 1].count()
    Pr_W1S0 = (Pr_W1S0[words]) / (len(nospam_temp))
    
    Pr_W0S1 = 1 - Pr_W1S1
    Pr_W0S0 = 1 - Pr_W1S0
        
    Pr_S1W1 = ((Pr_W1S1 * Pr_S1) + alpha) / (Pr_W1 + 2*alpha)
    Pr_S1W0 = ((Pr_W0S1 * Pr_S1) + alpha) / (Pr_W0 + 2*alpha)
    
    Pr_S0W1 = ((Pr_W1S0 * Pr_S0) + alpha) / (Pr_W1 + 2*alpha)
    Pr_S0W0 = ((Pr_W0S0 * Pr_S0) + alpha) / (Pr_W0 + 2*alpha)
        
    print("Pr(category = S |", word, " = 1) = ", round(Pr_S1W1, 3))
    print("Pr(category = S |", word, " = 0) = ", round(Pr_S1W0, 3))

    print("Pr(category = NS |", word, " = 1) = ", round(Pr_S0W1, 3))
    print("Pr(category = NS |", word, " = 0) = ", round(Pr_S0W0, 3))

Pr(category = S | viagra  = 1) =  0.51
Pr(category = S | viagra  = 0) =  0.444
Pr(category = NS | viagra  = 1) =  0.504
Pr(category = NS | viagra  = 0) =  0.544
Pr(category = S | deadline  = 1) =  0.497
Pr(category = S | deadline  = 0) =  0.454
Pr(category = NS | deadline  = 1) =  0.492
Pr(category = NS | deadline  = 0) =  0.557
Pr(category = S | million  = 1) =  0.506
Pr(category = S | million  = 0) =  0.447
Pr(category = NS | million  = 1) =  0.5
Pr(category = NS | million  = 0) =  0.548
Pr(category = S | and  = 1) =  0.441
Pr(category = S | and  = 0) =  0.513
Pr(category = NS | and  = 1) =  0.436
Pr(category = NS | and  = 0) =  0.629
Pr(category = S | toll  = 1) =  0.508
Pr(category = S | toll  = 0) =  0.445
Pr(category = NS | toll  = 1) =  0.502
Pr(category = NS | toll  = 0) =  0.546
Pr(category = S | prizes  = 1) =  0.509
Pr(category = S | prizes  = 0) =  0.444
Pr(category = NS | prizes  = 1) =  0.503
Pr(category = NS | prizes  = 0) =  0.545
Pr(category = S | cash  = 1) =  0.507
P

#### 14. Repeat the tasks above: compute the probabilities, do predictions, compute the accuracy, precision, recall for all words.

In [23]:
alpha = 3

for words in list_of_words:
    Pr_W1 = np.mean(train_X[words])
    Pr_W0 = 1 - Pr_W1
    
    temp = train_X.copy()
    temp["spam"] = train_y
    
    spam_temp = temp[temp.spam == 1]
    Pr_W1S1 = spam_temp[spam_temp == 1].count()
    Pr_W1S1 = (Pr_W1S1[words]) / (len(spam_temp))
    
    nospam_temp = temp[temp.spam == 0]
    Pr_W1S0 = nospam_temp[nospam_temp == 1].count()
    Pr_W1S0 = (Pr_W1S0[words]) / (len(nospam_temp))
    
    Pr_W0S1 = 1 - Pr_W1S1
    Pr_W0S0 = 1 - Pr_W1S0
    
    predict = []
    
    for index, row in test_X.iterrows():
        if row[words] == 1:    # w = 1
            Pr_S1W1 = ((Pr_W1S1 * Pr_S1) + alpha) / (Pr_W1 + 2*alpha)
            #Pr_S1W1 = (Pr_W1S1 * Pr_S1) / Pr_W1
            #print(Pr_S1W1)
            
            if Pr_S1W1 > 0.5: # when it is spam
                predict.append(1)
            else:
                predict.append(0)
        elif row[words] == 0: 
            Pr_S1W0 = ((Pr_W0S1 * Pr_S1) + alpha) / (Pr_W0 + 2*alpha)
            #Pr_S1W0 = (Pr_W0S1 * Pr_S1) / Pr_W0
            #print(Pr_S1W0)
            
            if Pr_S1W0 > 0.5: # when it is spam
                predict.append(1)
            else:
                predict.append(0)

    cm = confusion_matrix(test_y, predict)
    print("\n" + words)
    print(cm)
    
    print("Accuracy: ", round(np.mean(test_y == predict),3)) #accuracy

    from sklearn.metrics import precision_score
    print("Precision: ", round(precision_score(test_y, predict, zero_division = 0),3)) #precision

    from sklearn.metrics import recall_score
    print("Recall: ", round(recall_score(test_y, predict),3)) #recall


viagra
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

deadline
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

million
[[475   4]
 [ 76  24]]
Accuracy:  0.862
Precision:  0.857
Recall:  0.24

and
[[479   0]
 [100   0]]
Accuracy:  0.827
Precision:  0.0
Recall:  0.0

toll
[[479   0]
 [ 86  14]]
Accuracy:  0.851
Precision:  1.0
Recall:  0.14

prizes
[[478   1]
 [ 96   4]]
Accuracy:  0.832
Precision:  0.8
Recall:  0.04

cash
[[466  13]
 [ 72  28]]
Accuracy:  0.853
Precision:  0.683
Recall:  0.28

insurance
[[477   2]
 [ 98   2]]
Accuracy:  0.827
Precision:  0.5
Recall:  0.02

free
[[446  33]
 [ 39  61]]
Accuracy:  0.876
Precision:  0.649
Recall:  0.61

money
[[469  10]
 [ 58  42]]
Accuracy:  0.883
Precision:  0.808
Recall:  0.42


#### 15. Comment on the results. Does smoothing improve the overall performance?

Smoothing does not change the overall performance of accuracy, precision, and recall, even if we increase the value of alpha. We believe that this smoothing is not making a great enough change in the data to change any of the predictions. However, if we look at the probalites in 13, and we see that as alpha increases, the probabilties become closer to 50% (again, because of our threshold, though the proabilites are becoming more smaller in differences, those that were greater than out threshold are still greater so the predictions do not change).