# CS 344 Homework 2
## Elizabeth Koning
March 7, 2019

## Problem 1

In [1]:
spam_corpus = [["i", "am", "spam", "spam", "i", "am"], ["i", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

min_threshold = 1
unseen_prob = 0.4


class SpamFilter:

    def __init__(self, spam, not_spam):
        # number of messages in each corpus
        self.ngood = len(not_spam)
        self.nbad  = len(spam)

        # flatten lists
        spam = sum(spam, [])
        not_spam = sum(not_spam, [])

        # list of words in either corpus
        keys = list( set(spam) | set(not_spam) ) # union of key lists

        # store counts of tokens in each list
        self.spam = {token:spam.count(token) for token in keys}
        self.not_spam = {token:not_spam.count(token) for token in keys}

        # third hash table -- probability that message containing word is spam
        self.probs = {}
        for token in keys:
            g = float(2 * self.not_spam[token])
            b = float(self.spam[token])
            if g + b > min_threshold:
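                # fraction of messages per corpus containing the token, capped at 1
                spam_ratio = min(1.0, b / self.nbad)
                ham_ratio = min(1.0, g / self.ngood)
                # clamp so that no single token is ever fully decisive
                self.probs[token] = max(0.01, min(0.99, spam_ratio / (ham_ratio + spam_ratio)))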

    def filter(self, text):
        product = 1.0
        comp_product = 1.0

        for token in text:
            if token in self.probs:
                probability = self.probs[token]
            else:
                probability = unseen_prob
            product *= probability
            comp_product *= (1.0 - probability)

        print(text, product / (product + comp_product))

        return product / (product + comp_product)
    
    
spam_filter = SpamFilter(spam_corpus, ham_corpus)

# examples
print(spam_filter.filter(["i"])) # single word
print(spam_filter.filter(["blah"])) # single word not in either corpus
print(spam_filter.filter(["spamiam", "am"]))
print(spam_filter.filter(["i", "blah"]))
print(spam_filter.filter(["i", "am"]))
print(spam_filter.filter(["i", "do"])) # ham list
print(spam_filter.filter(["i", "am", "spam", "spam", "i", "am"])) # spam list
print(spam_filter.filter(["do", "i", "like", "green", "eggs", "and", "ham"])) # ham list

['i'] 0.5
0.5
['blah'] 0.4
0.4
['spamiam', 'am'] 0.9850746268656716
0.9850746268656716
['i', 'blah'] 0.4
0.4
['i', 'am'] 0.99
0.99
['i', 'do'] 0.3333333333333333
0.3333333333333333
['i', 'am', 'spam', 'spam', 'i', 'am'] 0.9999999895897965
0.9999999895897965
['do', 'i', 'like', 'green', 'eggs', 'and', 'ham'] 2.6025508824397714e-09
2.6025508824397714e-09


What makes this approach to SPAM Bayesian?

This approach is Bayesian because it treats probability as a degree of belief: the filter's output is our degree of belief that the given message is spam, updated by the evidence of the tokens the message contains.

The system works with probabilities instead of hard rules. In his post, Graham explains the practical advantages of that choice; the connection to Bayesian statistics is that the result of the analysis is itself a probability rather than a yes-or-no verdict.
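
To make the link to Bayes' rule concrete, the sketch below recombines per-token probabilities the same way `filter` does. The helper name `combine` is hypothetical. Reading the formula as Bayes' rule rests on two assumptions that the answer above does not state: equal priors for spam and ham, and tokens that are independent given the class (the usual naive Bayes assumption).

```python
def combine(token_probs):
    """Combine per-token spam probabilities into a single belief.

    Assumes equal priors and conditionally independent tokens, in which
    case Bayes' rule reduces to product / (product + complement product).
    """
    spam_evidence = 1.0
    ham_evidence = 1.0
    for p in token_probs:
        spam_evidence *= p
        ham_evidence *= 1.0 - p
    return spam_evidence / (spam_evidence + ham_evidence)


# Per-token values from the filter above: the unseen-token default for
# "spamiam" (0.4) and the clamped value for "am" (0.99).
belief = combine([0.4, 0.99])
print(round(belief, 4))  # 0.9851
```

A single strong token ("am" at 0.99) outweighs a mildly innocent one ("spamiam" at 0.4), which is why the combined belief stays close to 0.99.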

## Problem 2

### 2a

In [2]:
from probability import BayesNet, enumeration_ask, elimination_ask, gibbs_ask

# Utility variables
T, F = True, False

grass = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T:0.1, F:0.5}),
    ('Rain', 'Cloudy', {T:0.8, F:0.2}),
    ('WetGrass', 'Sprinkler Rain', {(T, T):0.99, (T,F):0.9, (F,T):0.9, (F,F):0.0}),
    ])

print("i. " + enumeration_ask('Cloudy', dict(), grass).show_approx())

print("ii. " + enumeration_ask('Sprinkler', dict(Cloudy=T), grass).show_approx())

print("iii. " + enumeration_ask('Cloudy', dict(Sprinkler=T, Rain=F), grass).show_approx())

print("iv. " + enumeration_ask('WetGrass', dict(Cloudy=T, Sprinkler=T, Rain=T), grass).show_approx())

print("v. " + enumeration_ask('Cloudy', dict(WetGrass=F), grass).show_approx())

ImportError: No module named 'probability'

### 2b
2^4 = 16
The full joint probability distribution has one entry for each combination of the 4 variables, each of which can be true or false, giving 2^4 = 16 values. Strictly, only 15 of them are independent, because the entries must sum to 1.
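
As a quick check of the table size, every true/false assignment to the four variables can be listed. This is a minimal sketch; the names `variables` and `assignments` are illustrative only.

```python
from itertools import product

variables = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]

# One tuple per row of the full joint table.
assignments = list(product([True, False], repeat=len(variables)))
print(len(assignments))  # 16
```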


### 2c
9
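The number of independent values in the Bayesian network is found by counting one value per row of each conditional probability table in the multiply connected network shown in the figure: 1 (Cloudy) + 2 (Sprinkler) + 2 (Rain) + 4 (WetGrass) = 9.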
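
The same count can be reproduced from the parent lists alone, since a Boolean node with k Boolean parents needs 2^k independent values. A small sketch, where the `parents` dictionary is transcribed from the network in 2a and its name is illustrative:

```python
# Parents of each node, transcribed from the network definition in 2a.
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

# A Boolean node needs one independent value per combination of parent values.
rows = {node: 2 ** len(ps) for node, ps in parents.items()}
print(rows)
print(sum(rows.values()))  # 9
```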

### 2d
i. P(Cloudy) = <0.5, 0.5>

ii. P(Sprinkler | cloudy) = <0.1, 0.9>

iii. P(Cloudy | the sprinkler is running and it’s not raining) = alpha * <0.5*0.1*0.2, 0.5*0.5*0.8> = alpha * <0.01, 0.2> = <0.0476, 0.9524>

iv. P(WetGrass | it’s cloudy, the sprinkler is running and it’s raining) = alpha * <0.5*0.1*0.8*0.99, 0.5*0.1*0.8*0.01> = alpha * <0.0396, 0.0004> = <0.99, 0.01>

v. P(Cloudy | not WetGrass) = alpha * sum over s, r of P(C, s, r, not w)

    = alpha * sum(s)( sum(r)( P(C) * P(s | C) * P(r | C) * P(not w | s, r) ) )

    Each term below is P(C) * P(s, r | C) * P(not w | s, r), where P(s, r | C) = P(s | C) * P(r | C).

       (s, r):      TT              TF              FT              FF

    = alpha * <0.5*0.08*0.01 + 0.5*0.02*0.10 + 0.5*0.72*0.10 + 0.5*0.18*1.00,     (Cloudy)

               0.5*0.10*0.01 + 0.5*0.40*0.10 + 0.5*0.10*0.10 + 0.5*0.40*1.00>     (not Cloudy)

    = alpha * <0.1274, 0.2255>

    = <0.361, 0.639>
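
Because the `probability` import failed in 2a, the hand computations can be cross-checked with a small brute-force enumeration in plain Python. This is only a sketch: `bernoulli`, `joint`, and `ask` are hypothetical helpers written for this check, not the `enumeration_ask` API, and the CPT values are copied from the network defined in 2a.

```python
from itertools import product

VARS = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]

# P(variable = True | parents), copied from the network in 2a.
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}   # keyed by Cloudy
P_RAIN = {True: 0.8, False: 0.2}        # keyed by Cloudy
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)


def bernoulli(p_true, value):
    """Probability of `value` for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(c, s, r, w):
    """Joint probability of one full assignment, via the chain rule of the network."""
    return (bernoulli(P_CLOUDY, c) * bernoulli(P_SPRINKLER[c], s)
            * bernoulli(P_RAIN[c], r) * bernoulli(P_WET[(s, r)], w))


def ask(query, evidence):
    """Return [P(query=True | evidence), P(query=False | evidence)]."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        # Keep only the worlds consistent with the evidence.
        if all(world[var] == val for var, val in evidence.items()):
            totals[world[query]] += joint(*values)
    norm = totals[True] + totals[False]  # this is 1 / alpha
    return [totals[True] / norm, totals[False] / norm]


queries = [
    ("i", "Cloudy", {}),
    ("ii", "Sprinkler", {"Cloudy": True}),
    ("iii", "Cloudy", {"Sprinkler": True, "Rain": False}),
    ("iv", "WetGrass", {"Cloudy": True, "Sprinkler": True, "Rain": True}),
    ("v", "Cloudy", {"WetGrass": False}),
]
for label, query, evidence in queries:
    print(label, [round(p, 4) for p in ask(query, evidence)])
```

Summing the full joint over every consistent world is exponential in the number of variables, which is acceptable for four Boolean variables but is not how a larger network would be queried.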