# Coding a spam classifier with naive Bayes

### 1. Imports and pre-processing data

In [182]:
import turicreate
import numpy as np

In [183]:
emails = turicreate.SFrame('./emails.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [184]:
emails

text,spam
Subject: naturally irresistible your ...,1
Subject: the stock trading gunslinger f ...,1
Subject: unbelievable new homes made easy im ...,1
Subject: 4 color printing special request ...,1
"Subject: do not have money , get software cds ...",1
"Subject: great nnews hello , welcome to ...",1
Subject: here ' s a hot play in motion homeland ...,1
Subject: save your money buy getting this thing ...,1
Subject: undeliverable : home based business for ...,1
Subject: save your money buy getting this thing ...,1


In [185]:
def process_email(text):
    return list(set(text.split()))
emails['words'] = emails['text'].apply(process_email)
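
To see what `process_email` produces, here is a small self-contained sketch on a made-up string (the sample text is an assumption, not a row from `emails.csv`). Because the words pass through a `set`, each word is kept once and the order is not guaranteed. Punctuation tokens and capitalized forms such as `Subject:` survive as separate words.

```python
def process_email(text):
    # Split on whitespace and keep each distinct token once.
    return list(set(text.split()))

sample = 'Subject: buy now , buy cheap'  # hypothetical example text
words = process_email(sample)

# Sort only for display, since set order is arbitrary.
print(sorted(words))
```

The repeated word `buy` appears only once in the result, which is what the per-email counting in the next section relies on.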

In [186]:
#emails['word_count'] = turicreate.text_analytics.count_words(emails['text'])

In [187]:
emails

text,spam,words
Subject: naturally irresistible your ...,1,"[all, through, portfolio, its, guaranteed, ,, to, ..."
Subject: the stock trading gunslinger f ...,1,"[and, merrill, is, nameable, clockwork, ..."
Subject: unbelievable new homes made easy im ...,1,"[pre, and, all, show, being, visit, loan, 454, ..."
Subject: 4 color printing special request ...,1,"[and, golden, 5110, 626, color, ca, an, canyon, ..."
"Subject: do not have money , get software cds ...",1,"[comedies, all, old, tradgedies, be, money, ..."
"Subject: great nnews hello , welcome to ...",1,"[va, groundsel, allusion, ag, tosher, confide, ..."
Subject: here ' s a hot play in motion homeland ...,1,"[precise, all, chain, limited, indicating, ..."
Subject: save your money buy getting this thing ...,1,"[right, want, just, money, is, within, it, ..."
Subject: undeliverable : home based business for ...,1,"[unknown, grownups, co, telecom, is, mts, 000, ..."
Subject: save your money buy getting this thing ...,1,"[right, want, just, money, is, within, it, ..."


### 2. Coding Naive Bayes

We start by counting how many spam and ham emails contain a given word.

We check for the words 'money' and 'easy'.

In [203]:
def count_spam_ham(word):
    email_count = {'spam': 0, 'ham': 0}
    for email in emails:
        if word in email['words']:
            if email['spam']:
                email_count['spam'] += 1
            else:
                email_count['ham'] += 1
    return email_count

# In case it's a dictionary
'''
def count_spam_ham(word):
    email_count = {'spam': 0, 'ham': 0}
    for email in emails:
        if word in email['word_count']:
            if email['spam']:
                email_count['spam'] += 1
            else:
                email_count['ham'] += 1
    return email_count
'''

In [189]:
print(count_spam_ham('money'))
print(count_spam_ham('easy'))

{'ham': 87, 'spam': 280}
{'ham': 61, 'spam': 110}
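
`count_spam_ham` scans every email each time it is called, which gets slow when many words are looked up. One possible alternative is to build all the counts in a single pass. The sketch below does this on a toy list of dictionaries standing in for the SFrame rows; the data and the names `toy_emails` and `build_counts` are illustrative assumptions.

```python
from collections import defaultdict

# Toy stand-in for the rows of `emails`: each has 'words' and a 'spam' flag.
toy_emails = [
    {'words': ['easy', 'money', 'now'], 'spam': 1},
    {'words': ['money', 'lottery'], 'spam': 1},
    {'words': ['mom', 'money'], 'spam': 0},
    {'words': ['mom', 'school'], 'spam': 0},
]

def build_counts(rows):
    # Map each word to the number of spam and ham emails that contain it.
    counts = defaultdict(lambda: {'spam': 0, 'ham': 0})
    for row in rows:
        label = 'spam' if row['spam'] else 'ham'
        for word in set(row['words']):  # count each word once per email
            counts[word][label] += 1
    return counts

word_counts = build_counts(toy_emails)
print(word_counts['money'])  # 2 spam emails, 1 ham email
print(word_counts['mom'])    # 0 spam emails, 2 ham emails
```

After the index is built, each lookup is a dictionary access instead of a full scan of the dataset.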


Now we write two functions. The first handles a single word. The second takes a list of words: for each word it looks up how many spam and ham emails contain it, multiplies the spam counts together, and multiplies the ham counts together. It then returns the spam product divided by the sum of the two products, which serves as the estimated probability that an email containing all of these words is spam.
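
Written out, with $s_i$ and $h_i$ denoting the number of spam and ham emails that contain the word $w_i$ (symbols introduced here for illustration), the quantity computed by the code in this section is:

$$
P(\text{spam} \mid w_1, \dots, w_n) \approx \frac{\prod_{i=1}^{n} s_i}{\prod_{i=1}^{n} s_i + \prod_{i=1}^{n} h_i}
$$

For a single word this is exactly $s_1 / (s_1 + h_1)$, the fraction of emails containing the word that are spam. For several words, the full naive Bayes posterior also carries the class sizes $N_{\text{spam}}$ and $N_{\text{ham}}$:

$$
P(\text{spam} \mid w_1, \dots, w_n) = \frac{N_{\text{spam}}^{\,1-n} \prod_{i=1}^{n} s_i}{N_{\text{spam}}^{\,1-n} \prod_{i=1}^{n} s_i + N_{\text{ham}}^{\,1-n} \prod_{i=1}^{n} h_i}
$$

The two expressions agree when the dataset contains equally many spam and ham emails. Otherwise the simpler ratio is best read as an approximation.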

In [205]:
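def prob_spam_bayes(word):
    # Returns the probability that the email is spam given that it contains a word
    counts = count_spam_ham(word)
    spam, ham = counts['spam'], counts['ham']
    if spam + ham == 0:
        # Word never seen in the dataset: no evidence either way
        return 0.5
    return 1.0*spam/(spam+ham)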

In [229]:
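def prob_spam_naive_bayes(words):
    # Count the spam and ham emails containing each word
    email_counts = [count_spam_ham(word) for word in words]
    # Multiply the per-word counts within each class
    spam = np.prod([count['spam'] for count in email_counts])
    ham = np.prod([count['ham'] for count in email_counts])
    # If both products are zero there is no evidence either way
    if spam==0 and ham==0:
        return 0.5
    return 1.0*spam/(spam+ham)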

# In case the email comes as a string
def prob_spam_naive_bayes_string(email):
    words = email.split()
    print(words)
    return prob_spam_naive_bayes(words)
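
With many words, the products in `prob_spam_naive_bayes` can become very large, and `np.prod` on integer counts uses fixed-width integers that can wrap around silently. A common workaround is to sum logarithms instead of multiplying. The sketch below is one way to do that, not the notebook's method. `prob_spam_from_counts` is a hypothetical helper that takes the list of count dictionaries directly, so the example runs without the dataset.

```python
import math

def prob_spam_from_counts(email_counts):
    # email_counts: list of {'spam': n, 'ham': m} dictionaries, one per word.
    spam_counts = [c['spam'] for c in email_counts]
    ham_counts = [c['ham'] for c in email_counts]
    spam_zero = any(s == 0 for s in spam_counts)
    ham_zero = any(h == 0 for h in ham_counts)
    # Mirror how the product-based version behaves when a product is zero.
    if spam_zero and ham_zero:
        return 0.5
    if spam_zero:
        return 0.0
    if ham_zero:
        return 1.0
    # log(product) equals sum(log); the difference is the log-odds of spam.
    log_odds = (sum(math.log(s) for s in spam_counts)
                - sum(math.log(h) for h in ham_counts))
    # Numerically stable logistic: spam/(spam+ham) == 1/(1 + ham/spam).
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)

# Counts reported earlier for 'money' and 'easy'.
p = prob_spam_from_counts([{'spam': 280, 'ham': 87}, {'spam': 110, 'ham': 61}])
print(p)  # approximately 0.853
```

For short word lists the result should match the product-based version up to floating-point rounding. The difference only matters once the raw products no longer fit in a machine integer.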

### Testing with some sample emails
We verify that for non-spammy words, the classifier gives us small probabilities, and for spammy words it gives us large probabilities.

In [208]:
prob_spam_naive_bayes(['money', 'easy'])

0.8530201899908605

In [209]:
prob_spam_naive_bayes(['mom','friend','school'])

0.008857887217413228

In [210]:
prob_spam_naive_bayes(['nigeria','prince','viagra'])

1.0

In [231]:
prob_spam_naive_bayes_string('hi mom how are you please buy apples')

['hi', 'mom', 'how', 'are', 'you', 'please', 'buy', 'apples']


0.0

In [234]:
prob_spam_naive_bayes_string('buy cheap viagra get lottery')

['buy', 'cheap', 'viagra', 'get', 'lottery']


1.0

In [235]:
prob_spam_naive_bayes_string('enter in the lottery now win three million dollars')

['enter', 'in', 'the', 'lottery', 'now', 'win', 'three', 'million', 'dollars']


1.0

In [236]:
prob_spam_naive_bayes_string('lets meet at the hotel lobby at nine am tomorrow')

['lets', 'meet', 'at', 'the', 'hotel', 'lobby', 'at', 'nine', 'am', 'tomorrow']


0.0

In [237]:
prob_spam_naive_bayes_string('hi mom make easy money')

['hi', 'mom', 'make', 'easy', 'money']


0.08279582746750283

In [238]:
prob_spam_naive_bayes_string('hi mom')

['hi', 'mom']


0.03860711582134747

In [239]:
prob_spam_naive_bayes_string('make easy money')

['make', 'easy', 'money']


0.6921082499793675

In [240]:
prob_spam_naive_bayes_string('subject')

['subject']


0.06958657388456815
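
Several outputs above are exactly 0.0 or 1.0. This happens whenever one of the words has a count of zero in one class, since a single zero wipes out the whole product. A standard remedy is Laplace (add-one) smoothing. The following is a minimal sketch of a simplified form of that idea, which adds a constant to every count. It assumes the counts are already available as dictionaries; the function name and the `alpha` parameter are illustrative, not part of the notebook.

```python
def smoothed_prob_spam(email_counts, alpha=1.0):
    # Add alpha to every count so no single word can force the result to 0 or 1.
    spam = 1.0
    ham = 1.0
    for count in email_counts:
        spam *= count['spam'] + alpha
        ham *= count['ham'] + alpha
    return spam / (spam + ham)

# Hypothetical counts: the second word never appears in spam.
counts = [{'spam': 280, 'ham': 87}, {'spam': 0, 'ham': 40}]
print(smoothed_prob_spam(counts))
```

With smoothing, an unseen word lowers the spam estimate without driving it all the way to zero, so the other words in the email still contribute.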