# Predicting Ham vs Spam emails

In [1]:
# Load the functions we wrote in Coding_naive_bayes.ipynb
%run Coding_naive_bayes.ipynb
import pandas
import pathlib

### 1. Imports and pre-processing data

We load the data into a pandas DataFrame, then preprocess it by adding a column that holds the unique lowercase words of each email.

In [2]:
# Paths and column names
dir_path = pathlib.Path.cwd()
name_dataset = "emails.csv"

column_emails = "text"
column_words = "words"
column_label = "spam"

# Read dataset
emails = pandas.read_csv(dir_path.parents[0] / name_dataset)
emails[:10]

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
5,"Subject: great nnews hello , welcome to medzo...",1
6,Subject: here ' s a hot play in motion homela...,1
7,Subject: save your money buy getting this thin...,1
8,Subject: undeliverable : home based business f...,1
9,Subject: save your money buy getting this thin...,1


In [3]:
# Helpers (Preprocess) =========================================

def split_string_into_unique_words(string):
    return list(set(string.split()))

def process_series_email(series_text):
    """ Converts text to lower-case then returns list of unique words """
    series_words = series_text.copy() # copies original series
    series_words = series_words.str.lower()
    series_words = series_words.apply(split_string_into_unique_words)

    return series_words
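
As a quick check of what these helpers produce, the sketch below runs the same logic on two made-up strings. The `toy` series is hypothetical and not taken from `emails.csv`. Note that splitting on whitespace keeps punctuation attached to a word unless it is already separated by spaces, as it is in this dataset.

```python
import pandas

def split_string_into_unique_words(string):
    return list(set(string.split()))

def process_series_email(series_text):
    """Lower-cases each text, then returns its list of unique words."""
    return series_text.str.lower().apply(split_string_into_unique_words)

# Hypothetical toy input, not taken from emails.csv
toy = pandas.Series(["Subject: Buy buy NOW now", "hello , world"])
toy_words = process_series_email(toy)

# set() has no stable order, so sort before printing
print(sorted(toy_words[0]))  # ['buy', 'now', 'subject:']
print(sorted(toy_words[1]))  # [',', 'hello', 'world']
```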

In [4]:
emails[column_words] = process_series_email(emails[column_emails])
emails[:10]

Unnamed: 0,text,spam,words
0,Subject: naturally irresistible your corporate...,1,"[provided, in, gaps, change, aim, changes, int..."
1,Subject: the stock trading gunslinger fanny i...,1,"[kansas, ramble, segovia, herald, libretto, ea..."
2,Subject: unbelievable new homes made easy im ...,1,"[of, in, time, credit, complete, 1, advantage,..."
3,Subject: 4 color printing special request add...,1,"[of, now, goldengraphix, canyon, version, our,..."
4,"Subject: do not have money , get software cds ...",1,"[old, ', d, t, cds, be, finish, along, great, ..."
5,"Subject: great nnews hello , welcome to medzo...",1,"[of, 5, in, miilion, um, customers, pleased, a..."
6,Subject: here ' s a hot play in motion homela...,1,"[ensuring, toois, predictions, states, shouid,..."
7,Subject: save your money buy getting this thin...,1,"[of, now, in, provided, right, tried, cialis, ..."
8,Subject: undeliverable : home based business f...,1,"[on, ptt, mon, recipient, telecom, unknown, 20..."
9,Subject: save your money buy getting this thin...,1,"[of, aicohoi, in, provided, right, now, tried,..."


### 2. Calculate the priors

Our label column is binary, with spam being 1 and ham being 0. Let's calculate the prior probabilities of an email being ham or spam, using only the labels.

In [5]:
label_spam = 1
label_ham = 0

# Dataset metadata
num_emails = len(emails)
counts_label = emails[column_label].value_counts()
num_spam = counts_label[label_spam]
print(counts_label)

print("Number of emails:", num_emails)
print("Number of spam emails:", num_spam)

# Calculating the prior probability an email is spam.
dict_priors = calculate_frequency_average(emails[column_label])
print("Probability of spam:", dict_priors[label_spam])

0    4360
1    1368
Name: spam, dtype: int64
Number of emails: 5728
Number of spam emails: 1368
Probability of spam: 0.2388268156424581
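
The prior is simply the share of emails carrying each label. `calculate_frequency_average` is defined in the other notebook, so the sketch below is only an assumed equivalent using plain pandas, with a stand-in series rebuilt from the counts printed above.

```python
import pandas

# Hypothetical stand-in for emails["spam"], built from the printed counts
labels = pandas.Series([0] * 4360 + [1] * 1368, name="spam")

# normalize=True divides each count by the total number of rows
priors = labels.value_counts(normalize=True).to_dict()

print(priors[1])  # 1368 / 5728
```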


### 3. Training the model

We first calculate word frequencies over the whole text, then train the model by calculating the frequency of each word under each label.

In [6]:
dict_frequencies_whole_text = construct_frequency_dict_from_series(
    emails[column_emails])
dict_model = calculate_labeled_frequencies(
    dict_frequencies_whole_text, emails, column_label, column_words)

In [7]:
# Some examples (1 is spam, and 0 is ham)
print(dict_model['lottery'])
print(dict_model['sale'])
print(dict_model['already'])

{1: 8.000010682682143, 0: 1.0682682143736557e-05}
{1: 38.00005661821536, 0: 41.00005661821536}
{1: 64.0002382238118, 0: 317.0002382238118}
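
Each entry above looks like a per-label count plus a very small term that keeps it from being exactly zero. The sketch below shows one way such a table could be built. It is an assumption about the general approach, not the code in `Coding_naive_bayes.ipynb`: the function name, the toy data, and the fixed `smoothing` value are all hypothetical, and the notebook appears to use a different small term for each word.

```python
import pandas

# Hypothetical toy data with the same column layout as the notebook
toy = pandas.DataFrame({
    "spam": [1, 1, 0, 0, 0],
    "words": [["win", "lottery"], ["lottery", "sale"], ["sale", "meeting"],
              ["meeting", "already"], ["already", "sale"]],
})

def count_words_per_label(frame, column_label, column_words, smoothing=1e-5):
    """Count, for each word, the emails of each label that contain it."""
    labels = frame[column_label].unique().tolist()
    model = {}
    for label, words in zip(frame[column_label], frame[column_words]):
        for word in words:
            # Start every label at `smoothing` so no count is exactly zero
            entry = model.setdefault(word, {l: smoothing for l in labels})
            entry[label] += 1
    return model

toy_model = count_words_per_label(toy, "spam", "words")
print(toy_model["lottery"])
print(toy_model["sale"])
```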


### 4. Using the model to make predictions

For a single word, we can compute the probability that an email containing it is spam, given our data. We can also add new words to our model by calculating their word frequencies.

In [8]:
print(predict_bayes('lottery', label_spam, dict_model))
print(predict_bayes('sale', label_spam, dict_model))
print(predict_bayes('already', label_spam, dict_model))

0.9999986646682983
0.4810126854437434
0.1679794178226178


In [9]:
list_emails = [
    "lottery sale",
    "Hi mom how are you",
    "Hi MOM how aRe yoU afdjsaklfsdhgjasdhfjklsd",
    "meet me at the lobby of the hotel at nine am",
    "enter the lottery to win three million dollars",
    "buy cheap lottery easy money now",
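    # No comma after the next string, so Python concatenates it with the
    # following one: the list has 8 items, matching the 8 outputs below.
    "buy cheap lottery easy money"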
    "Grokking Machine Learning by Luis Serrano",
    "asdfgh"]

# Adding new words to our dictionary
dict_frequencies_new_words = construct_frequency_dict_from_strings(list_emails)
dict_model = setup_naive_bayes(
    dict_frequencies_new_words, emails, column_label, column_words, column_emails)
cout = "Probability email is spam: "
for email in list_emails:
    print(cout, predict_naive_bayes(email, label_spam, dict_model, emails[column_label]))

Probability email is spam:  0.9999999005387217
Probability email is spam:  0.09520065375112553
Probability email is spam:  0.25112821625045023
Probability email is spam:  3.4107286996145865e-11
Probability email is spam:  0.9999999987178892
Probability email is spam:  0.9999999999343989
Probability email is spam:  0.9999999996071451
Probability email is spam:  0.5000000000000001
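
To score a whole email, naive Bayes multiplies the prior by the likelihood of each word under each label, then normalizes. The sketch below is a textbook version under stated assumptions: the function name is hypothetical, unseen words are ignored, and the counts are the three entries printed earlier. It is not the notebook's `predict_naive_bayes` and need not reproduce the numbers above. For instance, it falls back to the prior for a text with no known words.

```python
import math

def naive_bayes_spam_probability(text, model, num_spam, num_ham):
    """Textbook naive Bayes: prior times per-word likelihoods, normalized."""
    words = set(text.lower().split())
    # Work in log space so long emails do not underflow to 0.0
    log_spam = math.log(num_spam)
    log_ham = math.log(num_ham)
    for word in words:
        if word not in model:
            continue  # assumption: unseen words are ignored
        log_spam += math.log(model[word][1] / num_spam)
        log_ham += math.log(model[word][0] / num_ham)
    # Normalize spam / (spam + ham) after shifting by the larger log score
    top = max(log_spam, log_ham)
    spam, ham = math.exp(log_spam - top), math.exp(log_ham - top)
    return spam / (spam + ham)

# Counts copied from the dict_model examples printed earlier
printed_model = {
    "lottery": {1: 8.000010682682143, 0: 1.0682682143736557e-05},
    "sale": {1: 38.00005661821536, 0: 41.00005661821536},
    "already": {1: 64.0002382238118, 0: 317.0002382238118},
}

for text in ["lottery sale", "already", "asdfgh"]:
    print(text, naive_bayes_spam_probability(text, printed_model, 1368, 4360))
```

With a single known word, the prior and the per-label totals cancel, so the result reduces to the single-word ratio of counts.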


### 5. Do our results make sense?
The "Grokking Machine Learning by Luis Serrano" classification was surprising. Or was it? Let's check how often a word like "serrano" appears in spam emails.

In [10]:
print(dict_model['serrano'])
print(predict_bayes('serrano', label_spam, dict_model))

{1: 1.0000005876034477, 0: 5.876034475869477e-07}
0.9999994123972429


Hmm, that seems pretty high. But if we look closer at the training data, the following email was labeled spam and contains "serrano"!

> Subject: important announcement : your application was approved  we tried to contact you last week about refinancing your home at a lower rate .  i would like to inform you know that you have been pre - approved .  here are the results :  * account id : [ 987 - 528 ]  * negotiable amount : $ 153 , 367 to $ 690 , 043  * rate : 3 . 70 % - 5 . 68 %  please fill out this quick form and we will have a broker contact you as soon as possible .  regards ,  shannon **serrano** senior account manager  lyell national lenders , llc .  database deletion :  www . lend - bloxz . com / r . php

Talk about bad luck. This highlights the importance of cleaning the data before you train!
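
A lookup like the one that surfaced this email can be sketched as a filter on the words column. The helper name and the toy frame below are hypothetical. On the real data the same call would be made on `emails`.

```python
import pandas

# Hypothetical toy frame with the notebook's column names
toy = pandas.DataFrame({
    "text": ["regards , shannon serrano senior account manager",
             "meeting moved to nine am",
             "win the lottery now"],
    "spam": [1, 0, 1],
})
toy["words"] = toy["text"].str.lower().apply(lambda s: list(set(s.split())))

def emails_containing(frame, word, column_words="words"):
    """Return the rows whose word list contains `word`."""
    mask = frame[column_words].apply(lambda words: word in words)
    return frame[mask]

hits = emails_containing(toy, "serrano")
print(hits[["text", "spam"]])
```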