# Homework 2

## Bayes Spam Filter

First, I make a list of all mail documents.

In [13]:
# mail corpus                                                                                                                                                                                
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]                                                                                        
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]                                                                                                               
all_mail = spam_corpus + ham_corpus

Then make lists of spam and ham words and get the lengths of each

In [14]:
# mail docs
spam = []
for mail in spam_corpus:
    [spam.append(word.lower()) for word in mail]

ham = []
for mail in ham_corpus:
    [ham.append(word.lower()) for word in mail]
    
# corpus lengths
n_good = len(ham_corpus)
n_bad = len(spam_corpus)



Using the list of all documents and the spam and ham lists, I created a hashtable of words and their frequencies in both the spam corpus and the ham corpus.

In [15]:
# hashtable of good and bad words and their frequencies
good_words_freqs = {}
bad_words_freqs = {}

# count each word and its frequency for each spam and ham
for mail in all_mail:
    [bad_words_freqs.update({word.lower(): spam.count(word.lower())}) for word in mail]

for mail in all_mail:
    [good_words_freqs.update({word.lower(): ham.count(word.lower())}) for word in mail]

This function implements Paul Graham's probability calculation algorithm

In [16]:
def probability_calc(word):
    '''
    Calculate spam probability for each token
    '''
    g = 2 * good_words_freqs[word] # double the count of ham words to reduce false positives
    b = bad_words_freqs[word]
    if g + b > 5:
        return max(0.01, min(0.99, min(1.0, b/n_bad) / min(1.0, g/n_good) + min(1.0, b/n_bad)))
    
    return 0

Using this algorithm, probabilities are assigned to each word in both corpora. If a word occurs only in the ham corpus, it is assigned a probability of 0.01. If it is found only in the spam corpus, it gets a probability of 0.9. If found in both, Graham's algorithm is used to find a probability. If it is not in either corpus, it is assigned a probability of 0.4.

In [17]:
# Probability of spam hashtable for each word 
all_word_probs = {}

for mail in all_mail:
    for word in mail:
        if word not in ham:
            all_word_probs[word] = 0.9
        elif word not in spam:
            all_word_probs[word] = 0.1
        elif word in spam and word in ham:
            p = probability_calc(word)
            all_word_probs[word] = p

        elif word not in spam and word not in ham:
            all_word_probs[word] = 0.4

This scoring function finds the probability that a document is spam by finding the product of the probabilities of its words.

In [18]:
def score(mail, probabilities):
    prob = 1
    for word in mail:
        prob*=all_word_probs[word.lower()]
    return prob/(prob + (1-prob))

In [22]:
print("Ham: ")
[print(mail, " ", score(mail, all_word_probs)) for mail in ham_corpus]

print("Spam: ")
[print(mail, " ", score(mail, all_word_probs)) for mail in spam_corpus]

Ham: 
['do', 'i', 'like', 'green', 'eggs', 'and', 'ham']   0.0
['i', 'do']   0.0
Spam: 
['I', 'am', 'spam', 'spam', 'I', 'am']   0.6430436100000001
['I', 'do', 'not', 'like', 'that', 'spamiam']   0.0


[None, None]

## 2

In [7]:
import sys                                                                                                      
sys.path.append('/home/james/Documents/Calvin/CS-344/cs344-code/tools/aima')                                    
from probability import BayesNet, enumeration_ask, elimination_ask, gibbs_ask, rejection_sampling, likelihood_weighting                                                                      


In [23]:
# Utility variables                                                                                                                                                                          
T, F = True, False                                                                                                                                                                           

wet_lawns = BayesNet([                                                                                                                                                                       
  ('Cloudy', '', 0.5),                                                                                                                                                                     
  ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.5}),                                                                                                                                               
  ('Rain', 'Cloudy', {T: 0.8, F: 0.2}),                                                                                                                                                    
  ('WetGrass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.90, (F, T): 0.9, (F, F): 0.0})                                                                                                   
  ])                                                                                                                                                                                       

#d.                                                                                                                                                                                         
print('P(Cloudy)')                                                                                                                                                                           
# P(Cloudy) = 0.5 from the network
print(enumeration_ask('Cloudy', dict(),  wet_lawns).show_approx())                                                                                                                           


print('P(Sprinker | cloudy')                                                                                                                                                                 
# P(Sprinkler | Cloudy) = 0.5
print(enumeration_ask('Sprinkler', dict(Cloudy=T), wet_lawns).show_approx())                                                                                                                  

print('P(Cloudy| the sprinkler is running and it’s not raining)')                                                                                                                            
print(enumeration_ask('Cloudy', dict(Sprinker=T, Rain=F), wet_lawns).show_approx())                                                                                                          

print('P(WetGrass | it’s cloudy, the sprinkler is running and it’s raining)')                                                                                                                
print(enumeration_ask('WetGrass', dict(Cloudy=T, Sprinkler=T, Rain=T), wet_lawns).show_approx())                                                                                              

print('P(Cloudy | the grass is not wet)')                                                                                                                                                    
print(enumeration_ask('Cloudy', dict(WetGrass=F), wet_lawns).show_approx()) 

P(Cloudy)
False: 0.5, True: 0.5
P(Sprinker | cloudy
False: 0.9, True: 0.1
P(Cloudy| the sprinkler is running and it’s not raining)
False: 0.8, True: 0.2
P(WetGrass | it’s cloudy, the sprinkler is running and it’s raining)
False: 0.091, True: 0.909
P(Cloudy | the grass is not wet)
False: 0.639, True: 0.361


### 2. b.                                                                                                                       
A full join probability distribution would have 2*2*4 = 16 values                                                                                                                            
### 2. c.                                                                                                                                                                                    
The Bayes Network for this domain has 9 independent values with the conditional independence relations   