# Coding a spam classifier with naive Bayes

### 1. Imports and pre-processing data

We load the data into a Turi Create SFrame and then preprocess it by adding a column that holds the list of distinct words in each email.

In [241]:
import turicreate
import numpy as np

In [242]:
emails = turicreate.SFrame('./emails.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [243]:
emails

text,spam
Subject: naturally irresistible your ...,1
Subject: the stock trading gunslinger f ...,1
Subject: unbelievable new homes made easy im ...,1
Subject: 4 color printing special request ...,1
"Subject: do not have money , get software cds ...",1
"Subject: great nnews hello , welcome to ...",1
Subject: here ' s a hot play in motion homeland ...,1
Subject: save your money buy getting this thing ...,1
Subject: undeliverable : home based business for ...,1
Subject: save your money buy getting this thing ...,1


In [244]:
def process_email(text):
    # Split on whitespace and keep each distinct word once (order is not preserved)
    return list(set(text.split()))

emails['words'] = emails['text'].apply(process_email)
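
To see what this preprocessing step produces, here is a minimal sketch that applies the same function to a single string, outside of Turi Create. The sample text is made up for illustration, and the result is sorted only to make the printed order deterministic, since a set has no fixed order.

```python
def process_email(text):
    # Split on whitespace and keep each distinct word once
    return list(set(text.split()))

# Hypothetical sample text, not taken from the dataset
sample = "buy cheap buy now cheap"
distinct_words = sorted(process_email(sample))
print(distinct_words)  # ['buy', 'cheap', 'now']
```

Because repeated words are collapsed, the classifier later counts how many emails contain a word, not how often the word occurs.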

In [245]:
#emails['word_count'] = turicreate.text_analytics.count_words(emails['text'])

In [246]:
emails

text,spam,words
Subject: naturally irresistible your ...,1,"[all, through, portfolio, its, guaranteed, ,, to, ..."
Subject: the stock trading gunslinger f ...,1,"[and, merrill, is, nameable, clockwork, ..."
Subject: unbelievable new homes made easy im ...,1,"[pre, and, all, show, being, visit, loan, 454, ..."
Subject: 4 color printing special request ...,1,"[and, golden, 5110, 626, color, ca, an, canyon, ..."
"Subject: do not have money , get software cds ...",1,"[comedies, all, old, tradgedies, be, money, ..."
"Subject: great nnews hello , welcome to ...",1,"[va, groundsel, allusion, ag, tosher, confide, ..."
Subject: here ' s a hot play in motion homeland ...,1,"[precise, all, chain, limited, indicating, ..."
Subject: save your money buy getting this thing ...,1,"[right, want, just, money, is, within, it, ..."
Subject: undeliverable : home based business for ...,1,"[unknown, grownups, co, telecom, is, mts, 000, ..."
Subject: save your money buy getting this thing ...,1,"[right, want, just, money, is, within, it, ..."


### 2. Coding Naive Bayes

We start by counting how many spam and ham emails contain a given word.

We check for the words 'money' and 'easy'.

In [247]:
def count_spam_ham(word):
    email_count = {'spam': 0, 'ham': 0}
    for email in emails:
        if word in email['words']:
            if email['spam']:
                email_count['spam'] += 1
            else:
                email_count['ham'] += 1
    return email_count

# In case it's a dictionary
'''
def count_spam_ham(word):
    email_count = {'spam': 0, 'ham': 0}
    for email in emails:
        if word in email['word_count']:
            if email['spam']:
                email_count['spam'] += 1
            else:
                email_count['ham'] += 1
    return email_count
'''

In [248]:
print(count_spam_ham('money'))
print(count_spam_ham('easy'))

{'ham': 87, 'spam': 280}
{'ham': 61, 'spam': 110}


Now we write a function that takes a list of words. For each word it looks up the number of spam emails and the number of ham emails that contain it, multiplies the spam counts together and the ham counts together, and returns the spam product divided by the sum of the two products. This is the naive Bayes estimate of the probability that an email containing all of these words is spam.
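
In terms of the counts, the quantity computed by the function below can be written as follows. The symbols $s_i$ and $h_i$ are shorthand introduced only for this note: they stand for the number of spam and ham emails that contain the $i$-th word.

```latex
P(\text{spam} \mid w_1, \dots, w_n) \approx
\frac{\prod_{i=1}^{n} s_i}{\prod_{i=1}^{n} s_i + \prod_{i=1}^{n} h_i}
```

Dividing each $s_i$ and $h_i$ by $s_i + h_i$ turns the counts into the per-word probabilities $P(\text{spam} \mid w_i)$ and $P(\text{ham} \mid w_i)$. Those common factors cancel between numerator and denominator, so multiplying counts gives the same result as multiplying probabilities. Note that this form agrees with the usual naive Bayes posterior when spam and ham are treated as equally likely beforehand.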

In [253]:
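def prob_spam_bayes(word):
    # Returns the probability that the email is spam given that it contains a word
    counts = count_spam_ham(word)
    spam, ham = counts['spam'], counts['ham']
    # If the word never appears in the dataset, fall back to 0.5
    if spam==0 and ham==0:
        return 0.5
    return 1.0*spam/(spam+ham)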

In [401]:
def prob_spam_naive_bayes(words):
    # Look up the spam and ham email counts for every word
    email_counts = [count_spam_ham(word) for word in words]
    # Multiply the counts across words, separately for each class
    spam = np.prod([count['spam'] for count in email_counts])
    ham = np.prod([count['ham'] for count in email_counts])
    # If both products are zero there is no evidence either way
    if spam==0 and ham==0:
        return 0.5
    return 1.0*spam/(spam+ham)

# In case the email comes as a string
def prob_spam_naive_bayes_string(email):
    words = email.split()
    print(words)
    return prob_spam_naive_bayes(words)

### Testing with some sample emails

We verify that the classifier returns small probabilities for non-spammy words and large probabilities for spammy words.

In [402]:
prob_spam_naive_bayes(['money', 'easy'])

0.8530201899908605
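
As a sanity check, this result can be reproduced by hand from the counts printed earlier for 'money' (280 spam, 87 ham) and 'easy' (110 spam, 61 ham). The snippet below is only a sketch of the arithmetic and does not touch the dataset; the variable names are chosen for this illustration.

```python
# Counts reported earlier by count_spam_ham
money = {'spam': 280, 'ham': 87}
easy = {'spam': 110, 'ham': 61}

spam_product = money['spam'] * easy['spam']  # 280 * 110 = 30800
ham_product = money['ham'] * easy['ham']     # 87 * 61 = 5307

prob = 1.0 * spam_product / (spam_product + ham_product)
print(prob)
```

This prints approximately 0.853, in agreement with the value returned by the classifier.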

In [403]:
prob_spam_naive_bayes(['mom','friend','school'])

0.008857887217413228

In [404]:
prob_spam_naive_bayes(['prince','viagra'])

1.0

In [405]:
prob_spam_naive_bayes_string('hi mom how are you please buy apples')

['hi', 'mom', 'how', 'are', 'you', 'please', 'buy', 'apples']


0.0

In [406]:
prob_spam_naive_bayes_string('buy cheap viagra get lottery')

['buy', 'cheap', 'viagra', 'get', 'lottery']


1.0

In [407]:
prob_spam_naive_bayes_string('enter in the lottery now win three million dollars')

['enter', 'in', 'the', 'lottery', 'now', 'win', 'three', 'million', 'dollars']


1.0

In [408]:
prob_spam_naive_bayes_string('lets meet at the hotel lobby at nine am tomorrow')

['lets', 'meet', 'at', 'the', 'hotel', 'lobby', 'at', 'nine', 'am', 'tomorrow']


0.0

In [409]:
prob_spam_naive_bayes_string('hi mom make easy money')

['hi', 'mom', 'make', 'easy', 'money']


0.08279582746750283

In [410]:
prob_spam_naive_bayes_string('hi mom')

['hi', 'mom']


0.03860711582134747

In [411]:
prob_spam_naive_bayes_string('make easy money')

['make', 'easy', 'money']


0.6921082499793675

In [412]:
prob_spam_naive_bayes_string('subject')

['subject']


0.06958657388456815

In [413]:
prob_spam_naive_bayes_string('wadlidoo hi mom')

['wadlidoo', 'hi', 'mom']


0.5
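
Several of the results above are exactly 0.0, 1.0, or 0.5. A likely explanation is that a single word with a zero count in one class zeroes that class's whole product, and a word that appears in no email at all zeroes both products. The sketch below illustrates this with made-up counts (they are not taken from the dataset) and a hypothetical helper named `combine`.

```python
def combine(counts):
    # counts: list of (spam_count, ham_count) pairs, one pair per word
    spam, ham = 1, 1
    for s, h in counts:
        spam *= s
        ham *= h
    if spam == 0 and ham == 0:
        return 0.5
    return 1.0 * spam / (spam + ham)

# Made-up counts for illustration only
both_seen = combine([(50, 40), (30, 60)])    # every word seen in both classes
never_spam = combine([(50, 40), (0, 7)])     # one word never seen in spam
never_ham = combine([(50, 40), (9, 0)])      # one word never seen in ham
never_seen = combine([(50, 40), (0, 0)])     # one word never seen at all

print(both_seen, never_spam, never_ham, never_seen)
```

Only the first case gives an intermediate value; the other three give 0.0, 1.0, and 0.5 regardless of how informative the remaining words are. Starting every count at 1, as the model in the next section does, avoids these all-or-nothing outcomes.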

### 3. Training an actual model (for efficiency)

Our plan is to build a dictionary that records, for every word, the number of spam emails and the number of ham emails in which it appears. Both counts start at 1, so no word ever has a zero count in either class.

In [293]:
model = {}

# Training process
for email in emails:
    for word in email['words']:
        # Start both counts at 1 so that no count is ever zero
        if word not in model:
            model[word] = {'spam': 1, 'ham': 1}
        if email['spam']:
            model[word]['spam'] += 1
        else:
            model[word]['ham'] += 1
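
To make the structure of `model` concrete, here is a small sketch of the same training loop run on a made-up dataset of three emails, with plain Python dictionaries standing in for the SFrame rows. The names `toy_emails` and `toy_model` and their contents are assumptions for illustration.

```python
# Made-up stand-in for the SFrame rows: each has a word list and a spam label
toy_emails = [
    {'words': ['buy', 'cheap'], 'spam': 1},
    {'words': ['hi', 'mom'], 'spam': 0},
    {'words': ['buy', 'mom'], 'spam': 0},
]

toy_model = {}
for email in toy_emails:
    for word in email['words']:
        # Both counts start at 1, as in the training loop above
        if word not in toy_model:
            toy_model[word] = {'spam': 1, 'ham': 1}
        if email['spam']:
            toy_model[word]['spam'] += 1
        else:
            toy_model[word]['ham'] += 1

print(toy_model['buy'])  # spam: 2, ham: 2
print(toy_model['mom'])  # spam: 1, ham: 3
```

Each entry is therefore one more than the true number of emails in that class containing the word.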

In [398]:
def predict(email):
    words = set(email.split())
    # Multiply the counts as plain Python integers, which cannot overflow
    prod_spams = 1
    prod_hams = 1
    for word in words:
        # Words that were never seen in training are ignored
        if word in model:
            prod_spams *= model[word]['spam']
            prod_hams *= model[word]['ham']
    return 1.0*prod_spams/(prod_spams + prod_hams)

In [414]:
predict('hi mom how are you viagra lottery scam money easy')

0.7379367444666222

In [415]:
predict('enter the lottery to win three million dollars')

0.38569290647197135

In [416]:
predict('meet me at the lobby of the hotel at nine am')

0.02490194297492509

In [417]:
predict('buy cheap lottery easy money now')

0.9913514898646872
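
For long emails the products of counts grow very large, and converting them to floating point can eventually fail. One common alternative, sketched below, is to add logarithms instead of multiplying counts. This is not the approach used in the notebook; the function name `predict_log` and the tiny `toy_model` are hypothetical and only serve to keep the example self-contained.

```python
import math

# Made-up counts in the same shape as the trained model
toy_model = {
    'buy':   {'spam': 20, 'ham': 5},
    'cheap': {'spam': 30, 'ham': 2},
    'mom':   {'spam': 1,  'ham': 40},
}

def predict_log(email, model):
    log_spam = 0.0
    log_ham = 0.0
    for word in set(email.split()):
        if word in model:  # unknown words are ignored
            log_spam += math.log(model[word]['spam'])
            log_ham += math.log(model[word]['ham'])
    diff = log_ham - log_spam
    # spam / (spam + ham) == 1 / (1 + ham / spam) == 1 / (1 + exp(diff))
    if diff > 700:  # exp would overflow; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

print(predict_log('buy cheap', toy_model))
print(predict_log('hi mom', toy_model))
```

The sketch relies on every count being at least 1, since the logarithm of 0 is undefined; a model whose counts all start at 1 satisfies this.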