# CSE 101: Computer Science Principles
#### Stony Brook University
#### Kevin McDonnell (ktm@cs.stonybrook.edu)
## Module 21: Machine Learning for Spam Filtering



### Note

* See the course website for the [lecture slides](https://drive.google.com/file/d/1D6MZXbXSmO8peqvFvpZqYnzBX1THx9nC/view?usp=sharing), which are separate from this Colab file.

Download [`quote1.txt`](https://drive.google.com/file/d/1FZ0dO1HH8O9d38Qxmzi_H2ndhq5sd5VP/view?usp=sharing) to try the following code.

In [None]:
import string

def tokenize(st):
    tokens = []
    for word in st.split():
        tokens.append(word.strip(string.punctuation).lower())
    return tokens

res = tokenize('With confidence, you have won even before you have started.')
res

['with',
 'confidence',
 'you',
 'have',
 'won',
 'even',
 'before',
 'you',
 'have',
 'started']

In [None]:
import string

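def wf(filename):
    """Return a dictionary mapping each token in the file to its count."""
    count = {}
    # Use a with-block so the file is closed when the loop finishes
    with open(filename) as f:
        for line in f:
            # tokenize() is defined in the cell above
            for word in tokenize(line):
                count.setdefault(word, 0)
                count[word] += 1
    return count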


freqs = wf('quote1.txt')
freqs

{'': 2,
 '106': 1,
 '43': 1,
 'are': 1,
 'bc': 2,
 'before': 1,
 'cicero': 1,
 'confidence': 2,
 'defeated': 1,
 'even': 1,
 'have': 3,
 'if': 1,
 'in': 2,
 'life': 1,
 'marcus': 1,
 'no': 1,
 'of': 1,
 'race': 1,
 'self': 1,
 'started': 1,
 'the': 1,
 'tullius': 1,
 'twice': 1,
 'with': 1,
 'won': 1,
 'you': 4}
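
The `''` key in the output above comes from tokens made up entirely of punctuation (a standalone dash, for instance): `strip(string.punctuation)` reduces them to the empty string, which is then counted like any other word. The following is a minimal sketch of one way to drop such tokens. The name `tokenize_nonempty` is hypothetical and is not part of the course code.

```python
import string

def tokenize_nonempty(st):
    """Like tokenize(), but skip tokens that are empty after stripping."""
    tokens = []
    for word in st.split():
        cleaned = word.strip(string.punctuation).lower()
        if cleaned:  # '' is falsy, so punctuation-only tokens are dropped
            tokens.append(cleaned)
    return tokens

print(tokenize_nonempty('Cicero - 106 BC'))
```

Whether to drop empty tokens is a design choice; for word counts they carry no information, so filtering them keeps the dictionary cleaner.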

In [None]:
import heapq

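class WordQueue(object):
    """Priority queue of words, ordered so that the words whose
    spamicity is farthest from 0.5 are popped first."""

    def __init__(self):
        self._data = []

    def insert(self, item):
        # item is a (spamicity, word) pair; ignore words already queued
        words = [val[2] for val in self._data]
        if item[1] in words:
            return
        # heapq is a min-heap, so negate the distance from 0.5 to make
        # the largest distance come out first
        heapq.heappush(self._data, (-abs(item[0] - 0.5), item[0], item[1]))

    def pop(self):
        # Returns a (priority, spamicity, word) tuple
        return heapq.heappop(self._data)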
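
To see the ordering that `WordQueue` produces, here is a small usage sketch. The class is repeated (with the same logic) so the cell runs on its own, and the words and spamicity values are made up purely for illustration.

```python
import heapq

class WordQueue:
    # Same logic as the WordQueue class above, repeated so this cell is self-contained
    def __init__(self):
        self._data = []

    def insert(self, item):
        spam_prob, word = item
        if any(entry[2] == word for entry in self._data):
            return  # each word is stored at most once
        # Negated distance from 0.5: the min-heap pops the largest distance first
        heapq.heappush(self._data, (-abs(spam_prob - 0.5), spam_prob, word))

    def pop(self):
        return heapq.heappop(self._data)

queue = WordQueue()
# Illustrative (spamicity, word) pairs; 'winner' is inserted twice on purpose
samples = [(0.55, 'hello'), (0.99, 'winner'), (0.05, 'meeting'),
           (0.80, 'offer'), (0.99, 'winner')]
for item in samples:
    queue.insert(item)

order = []
while queue._data:          # peeking at _data is only for this demo
    order.append(queue.pop()[2])
print(order)
```

Words with spamicity near 0 or near 1 come out first, and a word near 0.5 comes out last, regardless of insertion order.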

In [None]:
def load_probabilities(filename):
    prob = {}
    with open(filename) as f:
        for line in f:
            p, w = line.split()
            prob[w] = float(p)
    return prob

In [None]:
def spamicity(word, pbad, pgood):
    # Only words that appear in both probability tables get a score
    if word in pbad and word in pgood:
        return pbad[word] / (pbad[word] + pgood[word])
    else:
        return None
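
As a small worked example of the ratio `pbad / (pbad + pgood)`, the sketch below applies it to two tiny hand-made tables. The function is restated with the parameter name `word` used throughout so the cell runs on its own, and the probabilities are invented for illustration, not taken from `bad.txt` or `good.txt`.

```python
def spamicity(word, pbad, pgood):
    # Score is defined only for words present in both tables
    if word in pbad and word in pgood:
        return pbad[word] / (pbad[word] + pgood[word])
    return None

# Made-up probabilities, for illustration only
pbad = {'free': 0.8, 'meeting': 0.1}
pgood = {'free': 0.2, 'meeting': 0.4, 'lunch': 0.3}

free_score = spamicity('free', pbad, pgood)        # 0.8 / (0.8 + 0.2)
meeting_score = spamicity('meeting', pbad, pgood)  # 0.1 / (0.1 + 0.4)
lunch_score = spamicity('lunch', pbad, pgood)      # 'lunch' is missing from pbad

print(free_score, meeting_score, lunch_score)
```

A word that is much more likely in the "bad" table scores close to 1, a word more likely in the "good" table scores close to 0, and a word missing from either table gets `None` and is skipped.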

In [None]:
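def combined_probability(queue, max_words):
    # p accumulates the product of the spamicities, q the product of
    # their complements. Assumes the queue holds at least max_words
    # items; pop() raises IndexError otherwise.
    p = q = 1.0
    for i in range(max_words):
        # pop() returns (priority, spamicity, word); index 1 is the spamicity
        x = queue.pop()[1]
        p *= x
        q *= (1.0 - x)
    return p / (p + q)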
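
The combining step computes p / (p + q), where p is the product of the individual spamicities and q is the product of their complements. The sketch below runs the same arithmetic on plain lists so the effect is easy to check by hand. The helper name `combine` and the input values are hypothetical.

```python
def combine(spamicities):
    """Combine individual spamicities into one score using p / (p + q)."""
    p = q = 1.0
    for x in spamicities:
        p *= x          # product of the spamicities
        q *= (1.0 - x)  # product of the complements
    return p / (p + q)

mostly_spam = combine([0.9, 0.8])  # 0.72 / (0.72 + 0.02)
mostly_good = combine([0.1, 0.2])  # 0.02 / (0.02 + 0.72)
mixed = combine([0.9, 0.1])        # equal evidence both ways

print(mostly_spam, mostly_good, mixed)
```

Several words that agree push the combined score strongly toward 0 or 1, while conflicting evidence of equal strength lands near 0.5.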

In [None]:
def pspam(filename, max_words):
    queue = WordQueue()
    pbad = load_probabilities('bad.txt')
    pgood = load_probabilities('good.txt')
    with open(filename) as message:
        for line in message:
            for word in tokenize(line):
                sp = spamicity(word, pbad, pgood)
                if sp is not None:
                    queue.insert((sp, word))
    return combined_probability(queue, max_words)

To try the spam-filtering code, you will need the following files:
* [`bad.txt`](https://drive.google.com/file/d/1UmUPIYckMsMxKKLjiR5Dz2ixUDK7fVH-/view?usp=sharing)
* [`good.txt`](https://drive.google.com/file/d/10ggYU3GggTVBLYcwUK4FzS0MnptEzDbt/view?usp=sharing)
* [`msg1.txt`](https://drive.google.com/file/d/1Hv2c7SAEFZc1l_5kob6NcEgV3goi20EC/view?usp=sharing)
* [`msg2.txt`](https://drive.google.com/file/d/1W8mfmypGOc9Dbw8o11DFinrVKVotPkdL/view?usp=sharing)
* [`msg3.txt`](https://drive.google.com/file/d/1A1nyonoO5ZplwXQRH_xtD2aEc0Ir5ZIA/view?usp=sharing)
* [`msg4.txt`](https://drive.google.com/file/d/1zgxXylsO0_NnFoq8wbbyWz_GzRSovhlu/view?usp=sharing)

In [None]:
print(pspam('msg1.txt', 15))
print(pspam('msg2.txt', 15))
print(pspam('msg3.txt', 15))
print(pspam('msg4.txt', 15))

0.9293048326577117
4.400695206741519e-05
0.05810198935535336
3.758445064217253e-15
