## Exercise: Spam I am

Based on Paul Graham's "A Plan for Spam" algorithm, we'll create a small spam filter using the following corpus.

In [6]:
spam_corpus = [["i", "am", "spam", "spam", "i", "am"], ["i", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

However, since I'm too lazy to evaluate each "message" separately, I'll just merge the spam messages into one token list and the ham messages into another before evaluating the corpus.

In [7]:
spam_corpus = ["i", "am", "spam", "spam", "i", "am", "i", "do", "not", "like", "that", "spamiam"]
ham_corpus = ["do", "i", "like", "green", "eggs", "and", "ham", "i", "do"]
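The same flattening can be done programmatically instead of by hand. This is a minimal sketch using `itertools.chain.from_iterable`; the `*_messages` and `*_flat` names are my own and not part of the notebook.

```python
from itertools import chain

# Per-message token lists, copied from the first cell.
spam_messages = [["i", "am", "spam", "spam", "i", "am"],
                 ["i", "do", "not", "like", "that", "spamiam"]]
ham_messages = [["do", "i", "like", "green", "eggs", "and", "ham"],
                ["i", "do"]]

# chain.from_iterable concatenates the inner lists in order.
spam_flat = list(chain.from_iterable(spam_messages))
ham_flat = list(chain.from_iterable(ham_messages))

print(len(spam_flat), len(ham_flat))  # 12 9
```

Keeping the nested lists around also means the message counts (`len(spam_messages)`, `len(ham_messages)`) stay available for later steps.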

And we're also going to want to have a combined token list with all the words that exist in either corpus.

In [8]:
all_tokens = spam_corpus + ham_corpus
all_tokens = list(dict.fromkeys(all_tokens))

So the first step towards our little spam filter will be to count up the occurrences of each word and store them in a dictionary, once for the spam corpus and once for the ham corpus.

In [9]:
spam_counts = {}
ham_counts = {}

# initializing every token to 0
for token in all_tokens:
    spam_counts[token] = 0
    ham_counts[token] = 0

# - the spam corpus
for token in spam_corpus:
    spam_counts[token] += 1

# - the ham corpus
for token in ham_corpus:
    ham_counts[token] += 1
    
print(spam_counts)
print(ham_counts)

{'i': 3, 'am': 2, 'spam': 2, 'do': 1, 'not': 1, 'like': 1, 'that': 1, 'spamiam': 1, 'green': 0, 'eggs': 0, 'and': 0, 'ham': 0}
{'i': 2, 'am': 0, 'spam': 0, 'do': 2, 'not': 0, 'like': 1, 'that': 0, 'spamiam': 0, 'green': 1, 'eggs': 1, 'and': 1, 'ham': 1}
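The counting step can also be sketched with `collections.Counter`, which returns 0 for missing keys and so needs no zero-initialization pass. The `spam_tokens`, `ham_tokens`, and `*_counter` names are illustrative, not from the notebook.

```python
from collections import Counter

# Flattened corpora, copied from the cells above.
spam_tokens = ["i", "am", "spam", "spam", "i", "am",
               "i", "do", "not", "like", "that", "spamiam"]
ham_tokens = ["do", "i", "like", "green", "eggs", "and", "ham", "i", "do"]

spam_counter = Counter(spam_tokens)
ham_counter = Counter(ham_tokens)

print(spam_counter["i"], ham_counter["i"])  # 3 2
print(spam_counter["green"])                # 0, since missing keys count as zero
```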


The next step will be to compute the probabilities for each of the tokens in our corpus.

In [10]:
def prob(token):
    
    # if the token hasn't been seen, we'll assume it's innocent
    if token not in all_tokens:
        return 0.4
    
    good = 2 * ham_counts[token] # biased to reduce false positives (i.e. good emails getting lost in spam)
    bad = spam_counts[token] # just the number of times the word was repeated in the spam corpus
    
    n_bad = 2 # number of spam messages (not spam words)
    n_good = 2 # number of good messages
    
    # using a count threshold of 1 instead of Graham's 5,
    # which makes sense for a corpus as small as ours
    if good + bad >= 1:
        return max(0.01, min(0.99, min(1.0, bad / n_bad) / (min(1.0, good / n_good) + min(1.0, bad / n_bad))))
    else:
        # too few occurrences to judge: treat like an unseen token
        # (returning 0 here would force the combined P(spam) to 0)
        return 0.4

# finding the probabilities
probabilities = {}
for token in all_tokens:
    probabilities[token] = prob(token)

print(probabilities)

{'i': 0.5, 'am': 0.99, 'spam': 0.99, 'do': 0.3333333333333333, 'not': 0.99, 'like': 0.3333333333333333, 'that': 0.99, 'spamiam': 0.99, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01}
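As a check on the formula, take the token `do`, which appears once in the spam corpus and twice in the ham corpus, so `bad = 1` and `good = 2 * 2 = 4`:

$$p(\text{do}) = \frac{\min(1, 1/2)}{\min(1, 4/2) + \min(1, 1/2)} = \frac{0.5}{1 + 0.5} = \frac{1}{3} \approx 0.333$$

The token `am` appears twice in spam and never in ham, so `bad = 2` and `good = 0`, and the clamp takes over:

$$p(\text{am}) = \frac{\min(1, 2/2)}{\min(1, 0/2) + \min(1, 2/2)} = \frac{1}{0 + 1} = 1 \rightarrow 0.99$$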


The next step is to figure out what are the most interesting words in a sample email, and Graham limited his list to the top 15 values with the probabilities furthest from 50-50.

In [11]:
# grabbing the first 15 values furthest from 50%
def get_top_interesting_values(values, limit=15):
    # calculate the probability that the message is spam given each word separately
    probabilities = {}
    for i in values:
        probabilities[i] = prob(i)
        
    # sort the probabilities by how "interesting" each value is
    # i.e. how far away from 50% it is
    sorted_values = sorted(probabilities.items(), key=lambda x: -abs(x[1]-0.5))
    
    # keep at most `limit` of the most "interesting" probabilities
    # (the earlier counter-based loop let one extra value through)
    return [p for _, p in sorted_values[:limit]]
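The "furthest from 50-50" selection can also be sketched without a full sort, using `heapq.nlargest` with a distance-from-0.5 key. The function name `top_interesting` and the `sample` dictionary are hypothetical; the probabilities in `sample` are taken from the output above.

```python
import heapq

def top_interesting(token_probs, limit=15):
    """Return at most `limit` probabilities, ranked by distance from 0.5."""
    return heapq.nlargest(limit, token_probs.values(),
                          key=lambda p: abs(p - 0.5))

# A few per-token probabilities from the corpus above.
sample = {"i": 0.5, "do": 1 / 3, "not": 0.99, "green": 0.01}

# With limit=2, only the two extreme values (0.99 and 0.01) survive.
print(top_interesting(sample, limit=2))
```

For 15 values out of a short email the difference from sorting is negligible; the point is only that the limit is applied exactly.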

Finally, we can take the most interesting words from an email and use them to evaluate if we think the email is spam or not.

In [12]:
from functools import reduce

def is_spam(featured_probs):
    # combined probability: prod(p) / (prod(p) + prod(1 - p))
    prod = reduce(lambda x, y: x * y, featured_probs)
    compl = reduce(lambda x, y: x * y, [1 - p for p in featured_probs])
    return prod / (prod + compl)
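The combining formula, prod(p) / (prod(p) + prod(1 - p)), is what Bayes' rule gives if the tokens are treated as independent and spam and ham are assumed equally likely beforehand. Below is a hedged sketch of the same formula computed in log space, which avoids multiplying many small numbers together. The name `combine_log` is my own, and the sketch assumes every probability is strictly between 0 and 1, as the clamping to [0.01, 0.99] in `prob` guarantees for seen tokens.

```python
import math

def combine_log(probs):
    """Combine per-token spam probabilities; assumes 0 < p < 1 for every p."""
    log_prod = sum(math.log(p) for p in probs)
    log_compl = sum(math.log(1.0 - p) for p in probs)
    # prod / (prod + compl) == 1 / (1 + compl / prod)
    return 1.0 / (1.0 + math.exp(log_compl - log_prod))

# Probabilities for "i", "not", "spam" from the table above.
print(combine_log([0.5, 0.99, 0.99]))
```

With at most 15 clamped values the exponent stays small, so `math.exp` will not overflow here; a much longer list would need a guard.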

To test everything, we can use the following helper function:

In [13]:
def evaluate(email):
    values = email.split()
    top_15 = get_top_interesting_values(values)
    print("P(spam) = " + str(is_spam(top_15)))

For a couple of trivial samples, we get the following:

In [15]:
spam_email = "i not spam"
mixed_email = "i do not like green eggs and ham"
ham_email = "green eggs and ham"
foreign_email = "green eggs and Sam"

evaluate(spam_email)
evaluate(mixed_email)
evaluate(ham_email)
evaluate(foreign_email)

P(spam) = 0.9998979800040808
P(spam) = 2.576524716472776e-07
P(spam) = 1.0410203448479832e-08
P(spam) = 6.870729626826629e-07


Unless there's something particularly Bayesian about how he combines the per-word probabilities for an email (the second portion of his code block), nothing strikes me as particularly Bayesian about his approach, barring potentially its adaptability to an expanding corpus. For the most part it looks like a frequentist approach, with the counts tweaked to bias it against false positives. Perhaps the tweaks he's added could be considered priors? Otherwise, it's just his interest in expanding the corpus over time that feels at all Bayesian.

## Exercise: Rain, Rain, Go Away!

Here's the AIMA implementation of the Bayes net:

In [19]:
from probability import BayesNet, enumeration_ask

# Utility variables
T, F = True, False

cloudy = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.5}),
    ('Rain', 'Cloudy', {T: 0.8, F: 0.2}),
    ('WetGrass', 'Sprinkler Rain', {(T, T): .99, (T, F): 0.90, (F, T): 0.90, (F, F): 0.0})
    ])

In a full joint probability distribution for this domain, none of the 4 variables can be assumed independent of the others, so the table needs an entry for every combination of Cloudy, Sprinkler, Rain, and WetGrass. That is $2^4 = 16$ probabilities, of which 15 are independent because they must sum to 1.

In a Bayesian network, however, each variable depends directly only on its parents: Sprinkler and Rain are conditionally independent of each other given Cloudy, and WetGrass is conditionally independent of Cloudy given Sprinkler and Rain. That leaves 1 + 2 + 2 + 4 = 9 independent values, one for each number in the tables in the code above.
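The two parameter counts can be checked with a few lines of arithmetic. This sketch assumes all four variables are Boolean, as in the network above; the `parents` dictionary is my own encoding of the parent counts.

```python
# Number of parents for each node in the network above.
parents = {"Cloudy": 0, "Sprinkler": 1, "Rain": 1, "WetGrass": 2}

# Full joint over n Boolean variables: 2**n entries, one fixed by normalization.
full_joint_values = 2 ** len(parents) - 1

# Each Boolean node needs one number per combination of its parents' values.
bayes_net_values = sum(2 ** k for k in parents.values())

print(full_joint_values, bayes_net_values)  # 15 9
```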

P(Cloudy) is just given as P(C) or <0.5, 0.5>

In [20]:
print(enumeration_ask('Cloudy', dict(), cloudy).show_approx())

False: 0.5, True: 0.5


P(Sprinkler|cloudy) is also right off the table: P(S|C) or <0.10, 0.90>

In [21]:
print(enumeration_ask('Sprinkler',dict(Cloudy=T), cloudy).show_approx())      

False: 0.9, True: 0.1


P(Cloudy|sprinkler and not raining) = $P(+c|+s,-r) = P(+c,+s,-r) / P(+s,-r)$

$$= \frac{P(+c)P(+s|+c)P(-r|+c)}{\sum_c P(c)P(+s|c)P(-r|c)}$$

In [22]:
numerator = 0.5 * 0.1 * 0.2
denominator = 0.1 * 0.5 * 0.2 + 0.5 * 0.8 * 0.5
print(numerator / denominator)
print(1 - numerator / denominator)
print(enumeration_ask('Cloudy',dict(Sprinkler=T,Rain=F), cloudy).show_approx()) 

0.04761904761904762
0.9523809523809523
False: 0.952, True: 0.0476
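For a cross-check that doesn't depend on the `probability` module, inference by enumeration can be sketched in plain Python by summing the joint over all 16 worlds. The helper names (`bern`, `joint`, `query`) are hypothetical; the numbers are copied from the network definition above.

```python
from itertools import product

# Each table maps parent values to P(node = True).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def bern(p_true, value):
    """Probability of `value` for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """P(c, s, r, w) as the product of the four conditional tables."""
    return bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

def query(var, evidence):
    """Return P(var = True | evidence) by brute-force enumeration."""
    names = ("Cloudy", "Sprinkler", "Rain", "WetGrass")
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=4):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            totals[world[var]] += joint(*values)
    return totals[True] / (totals[True] + totals[False])

print(round(query("Cloudy", {"Sprinkler": True, "Rain": False}), 4))  # 0.0476
```

Enumerating every world is exponential in the number of variables, which is fine for four Booleans but is exactly the cost the network structure is meant to avoid on larger problems.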


P(WetGrass | cloudy, sprinkler, raining): all three of the other variables are observed, so there is nothing left to sum over, and the factors that don't involve W cancel when normalizing:

$P(W | +c, +s, +r) = \alpha P(+c) P(+s|+c) P(+r|+c) P(W | +s, +r) = P(W | +s, +r)$

In [24]:
# unnormalized P(+w, +c, +s, +r)
val1 = 0.99 * (0.5 * 0.1 * 0.8)

In [25]:
# unnormalized P(-w, +c, +s, +r)
val2 = 0.01 * (0.5 * 0.1 * 0.8)

In [27]:
alpha = 1 / (val1 + val2)

In [30]:
prob_wet = val1 * alpha
print(round(prob_wet, 4))

0.99

In [None]:
prob_dry = val2 * alpha
print(round(prob_dry, 4))

0.01

In [38]:
print(enumeration_ask('WetGrass',dict(Sprinkler=T,Rain=T,Cloudy=T), cloudy).show_approx())

False: 0.01, True: 0.99


P(Cloudy|grass is not wet)= $$ P(+c|-w) = P(+c,-w) / P(-w)$$

$$=\frac{\sum_s \sum_r P(+c,-w,s,r)}{\sum_s \sum_r \sum_c P(c, -w, s, r)}$$

$$=\frac{\sum_s \sum_r P(-w|s,r)P(s|+c)P(r|+c)P(+c)}{\sum_s \sum_r \sum_c P(-w|s,r)P(s|c)P(r|c)P(c)}$$
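The double sum above can be evaluated term by term in plain Python as a check. Note that it has four (s, r) terms, including s = F, r = F, where $P(-w|-s,-r) = 1 - 0.0 = 1.0$. The names `pick`, `p_c_and_dry`, `numer`, and `denom` are my own; the numbers come from the network definition.

```python
from itertools import product

P_C = 0.5                          # P(+c)
P_S = {True: 0.1, False: 0.5}      # P(+s | c)
P_R = {True: 0.8, False: 0.2}      # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}  # P(+w | s, r)

def pick(p_true, value):
    """Probability of `value` for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def p_c_and_dry(c):
    """Sum over s and r of P(-w|s,r) P(s|c) P(r|c) P(c)."""
    total = 0.0
    for s, r in product((True, False), repeat=2):
        total += (1.0 - P_W[(s, r)]) * pick(P_S[c], s) * pick(P_R[c], r) * pick(P_C, c)
    return total

numer = p_c_and_dry(True)                       # P(+c, -w)
denom = p_c_and_dry(True) + p_c_and_dry(False)  # P(-w)
print(round(numer, 4), round(denom, 4), round(numer / denom, 3))  # 0.1274 0.3529 0.361
```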

In [39]:
# one term per (s, r) pair; the s = F, r = F term uses P(-w | -s, -r) = 1 - 0.0 = 1.0
numerator = (0.01) * (0.1 * 0.8 * 0.5) + (0.1) * (0.9 * 0.8 * 0.5) + (0.1) * (0.1 * 0.2 * 0.5) + (1.0) * (0.9 * 0.2 * 0.5)
denominator = (0.01) * (0.1 * 0.8 * 0.5 + 0.5 * 0.2 * 0.5) + (0.10) * (0.9 * 0.8 * 0.5 + 0.5 * 0.2 * 0.5) + (0.10) * (0.1 * 0.2 * 0.5 + 0.5 * 0.8 * 0.5) + (1.0) * (0.9 * 0.2 * 0.5 + 0.5 * 0.8 * 0.5)
print(round(numerator / denominator, 3))

0.361

In [41]:
print(enumeration_ask('Cloudy', dict(WetGrass=F), cloudy).show_approx())

False: 0.639, True: 0.361


My first pass at this gave 0.595, which didn't match the algorithmic answer. The approach from the Udacity Probabilistic Inference video on Enumeration was right, but I had dropped the $s = F, r = F$ term from both sums. The table lists $P(+w|-s,-r) = 0.0$, so it is easy to forget that $P(-w|-s,-r) = 1.0$ and that this term is actually the largest one. With all four terms included, my theoretical answer and my algorithmic answer agree.