# Naive Bayes

Bayes' theorem lets us incorporate prior assumptions into the probability of an event, given observed evidence:

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

Read as a word problem, it's quite confusing:
> the probability of `A` given `B` is equal to the probability of `B` given `A` times the probability of `A` divided by the probability of `B`

To see how Bayes' theorem works, let's build an example.  Imagine that we receive messages at random.  Let `S` be the event 'the message is spam', and `V` be the event 'the message contains the word viagra'.  Then Bayes' theorem tells us that the probability that a message is spam, conditional on it containing the word viagra, is:

$$ P(S \mid V) = \frac{P(V \mid S) \, P(S)}{P(V \mid S) \, P(S) + P(V \mid -S) \, P(-S)} $$

The numerator is the probability that a message is spam *and* contains *viagra*.  The denominator is just the probability that a message contains *viagra*.  Hence, you can think of this calculation as simply representing the proportion of *viagra* messages that are spam.

If we have a data set of messages properly labeled as either spam or not spam, then we can estimate `P(V|S)` and `P(V|-S)` from it.  If we further assume that any message is equally likely to be spam or not spam, so that `P(S) = P(-S) = 0.5`, the priors cancel and the formula simplifies to:

$$ P(S \mid V) = \frac{P(V \mid S)}{P(V \mid S) + P(V \mid -S)} $$

For example, if 50% of spam messages have the word *viagra*, but only 1% of nonspam messages do, then the probability that any given *viagra*-containing message is spam is:

$$ \frac{0.5}{0.5 + 0.01} \approx 98\% $$
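The single-word calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior_spam` and its parameters are illustrative and do not appear elsewhere in this notebook. It keeps the prior `p_spam` as an explicit argument so that the effect of the equal-prior assumption is visible.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """P(S | V) for a single word, using the full form of Bayes' theorem."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    return numerator / denominator

# 50% of spam and 1% of nonspam messages contain the word, equal priors
print(round(posterior_spam(0.5, 0.01), 4))  # 0.9804
```

Passing a smaller prior, for example `p_spam=0.1`, lowers the posterior, which shows how much the result leans on the equal-prior assumption.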

That is quite acceptable.  Next, let's expand our vocabulary to many words `w_1, ..., w_n`.  To move this into the realm of probability theory, we'll write `X_i` for the event 'a message contains the word `w_i`'.  Our estimate would look like:

$$ P(X_i \mid S) $$

This represents the probability that a spam message contains the `i`th word, and a similar estimate:

$$ P(X_i \mid -S) $$

represents the probability that a nonspam message contains the `i`th word.  The key to naive Bayes is the assumption that the presence or absence of each word is independent of every other word, conditional on the message being spam (or not spam).  Intuitively, this assumption means that knowing whether a certain spam message contains the word 'viagra' gives you no information about whether the same message contains the word 'rolex'.  In math terms, this means that:

$$ P(X_1 = x_1, \ldots, X_n = x_n \mid S) = P(X_1 = x_1 \mid S) \times \cdots \times P(X_n = x_n \mid S) $$

This is an extreme assumption.  Let `X_1` and `X_2` be the events for the words 'viagra' and 'rolex'.  If half of all spam messages are for viagra and the other half are for cheap rolexes, the naive Bayes estimate is:

$$ P(X_1 = 1, X_2 = 1 \mid S) = P(X_1 = 1 \mid S) P(X_2 = 1 \mid S) = .5 \times .5 = .25 $$

This estimate assumes away the knowledge that, in this example, viagra and rolex never actually occur together, so the true joint probability is 0.  Despite this unrealistic assumption, the model often performs well and is used in actual spam filters.
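To make the gap between the naive estimate and reality concrete, here is a small sketch on a made-up corpus. The list `spam_messages` and the helper `fraction_containing` are hypothetical and exist only for this illustration.

```python
# Hypothetical toy corpus: half of the spam mentions viagra, half mentions
# rolex, and no message mentions both.
spam_messages = [{"viagra"}, {"rolex"}, {"viagra"}, {"rolex"}]

def fraction_containing(messages, *words):
    """Fraction of messages that contain every one of the given words."""
    hits = sum(1 for m in messages if all(w in m for w in words))
    return hits / len(messages)

p_viagra = fraction_containing(spam_messages, "viagra")
p_rolex = fraction_containing(spam_messages, "rolex")

naive_joint = p_viagra * p_rolex                                   # independence assumed
observed_joint = fraction_containing(spam_messages, "viagra", "rolex")

print(naive_joint, observed_joint)  # 0.25 0.0
```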

Next, we can apply the single-word formula to the whole vector of word indicators `X = (X_1, ..., X_n)`, again assuming equal priors:

$$ P(S \mid X = x) = \frac{P(X = x \mid S)}{[P(X = x \mid S) + P(X = x \mid -S)]} $$

The naive Bayes assumption allows us to compute each of the probabilities on the right simply by multiplying together the individual probability estimates for each vocabulary word.  In practice, you usually want to avoid multiplying lots of probabilities together, because of a problem called *underflow*: computers don't deal well with floating-point numbers that are too close to zero.  Recalling from algebra that:

$$ \log (ab) = \log a + \log b $$
$$ \exp (\log x) = x $$

we usually compute `p_1 *...* p_n` as the equivalent (but floating-point-friendlier):

$$ \exp (\log (p_1) + ... + \log (p_n)) $$
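The following sketch shows the underflow problem directly. The list `probs` is made-up data: one thousand probabilities of 0.01, whose true product (`1e-2000`) is far smaller than the smallest positive double.

```python
import math

# Hypothetical data: 1000 word probabilities of 0.01 each
probs = [0.01] * 1000

# Direct multiplication collapses to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps a finite, usable number
log_sum = sum(math.log(p) for p in probs)

print(product)                 # 0.0
print(math.isfinite(log_sum))  # True
```

Note that `math.exp(log_sum)` would also return `0.0` here, so for long products it is safer to compare the log values themselves and only exponentiate a difference of logs.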

The only challenge left is coming up with estimates for:

$$ P(X_i \mid S) $$
$$ P(X_i \mid -S) $$

These are the probabilities that a spam (or nonspam) message contains the word `w_i`.  If we have a fair number of training messages labeled as spam and not-spam, an obvious first try is to estimate `P(X_i|S)` simply as the fraction of spam messages containing word `w_i`.

This causes a big problem though.  Imagine that in our training set the vocabulary word 'data' only occurs in nonspam messages.  Then we'd estimate `P('data' | S) = 0`.  The result is that our Naive Bayes classifier would always assign spam probability `0` to *any* message containing the word 'data', even a message like 'data on cheap viagra and authentic rolex watches'.  To avoid this problem, we usually use some kind of smoothing.

In particular, we'll choose a pseudocount `k` and estimate the probability of seeing the `i`th word in a spam message as:

$$ P(X_i \mid S) = \frac{k + \text{number of spams containing } w_i}{2k + \text{number of spams}} $$

The estimate for `P(X_i | -S)` is analogous.  That is, when computing the spam probabilities for the `i`th word, we assume we also saw `k` additional spams containing the word and `k` additional spams not containing the word.
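A quick sketch of the smoothed estimate, using made-up counts: suppose the word 'data' appears in 0 of 98 spam messages. The function name `smoothed_probability` is illustrative only.

```python
def smoothed_probability(count_with_word, total_messages, k=0.5):
    """(k + count) / (2k + total): the smoothed estimate of P(X_i | class)."""
    return (count_with_word + k) / (total_messages + 2 * k)

# Hypothetical counts: 'data' appears in 0 of 98 spam messages
print(smoothed_probability(0, 98, k=0))  # 0.0   (no smoothing: the word vetoes spam)
print(smoothed_probability(0, 98, k=1))  # 0.01  (small but nonzero)
```

With any `k > 0` the estimate stays strictly between 0 and 1, so a single unseen word can no longer force the spam probability to zero.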

Now, let's see what this looks like in Python.  First, we need to build a function to tokenize messages into distinct words:

In [32]:
from collections import Counter, defaultdict
from code_python3.machine_learning import split_data
import math, random, re, glob

def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)

Next, we need to count the words in a labeled training set of messages:

In [21]:
def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])    # word -> [spam_count, non_spam_count]
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

Next, we'll turn the counts into estimated probabilities using the smoothing we described before:

In [22]:
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets
    w, p(w | spam) and p(w | ~spam)"""
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
             for w, (spam, non_spam) in counts.items()]

The last piece is to use these word probabilities to assign probabilities to messages:

In [23]:
def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0

    for word, prob_if_spam, prob_if_not_spam in word_probs:

        # for each word in the message,
        # add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)

        # for each word that's not in the message
        # add the log probability of _not_ seeing it
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
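One caveat with the final step above: for a large vocabulary, both `math.exp` calls can underflow to `0.0`, and the division then fails. The sketch below is an alternative that is not part of the original code. It computes the same ratio from the difference of the two log values, still assuming equal priors. The name `posterior_from_logs` and the cutoff of 700 are choices made for this illustration.

```python
import math

def posterior_from_logs(log_prob_if_spam, log_prob_if_not_spam):
    """P(spam | message) from two log-likelihoods, assuming equal priors.

    Algebraically equal to exp(a) / (exp(a) + exp(b)), rewritten as
    1 / (1 + exp(b - a)) so that only the difference is exponentiated.
    """
    diff = log_prob_if_not_spam - log_prob_if_spam
    if diff > 700:          # math.exp would overflow; the posterior is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Both likelihoods are far too small to represent directly,
# yet the posterior is still well defined.
print(round(posterior_from_logs(-2000.0, -2005.0), 4))  # 0.9933
```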

We can put this all together into a class:

In [24]:
class NaiveBayesClassifier:

    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):

        # count spam and non-spam messages
        num_spams = len([is_spam
                         for message, is_spam in training_set
                         if is_spam])
        num_non_spams = len(training_set) - num_spams

        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts,
                                             num_spams,
                                             num_non_spams,
                                             self.k)

    def classify(self, message):
        return spam_probability(self.word_probs, message)

Using the data set from http://spamassassin.apache.org/old/publiccorpus/, let's get to work:

In [33]:
path = r'/home/jovyan/work/spam_data/*/*'

def get_subject_data(path):

    data = []

    # regex for stripping out the leading "Subject:" and any spaces after it
    subject_regex = re.compile(r"^Subject:\s+")

    # glob.glob returns every filename that matches the wildcarded path
    for fn in glob.glob(path):
        is_spam = "ham" not in fn

        with open(fn,'r',encoding='ISO-8859-1') as file:
            for line in file:
                if line.startswith("Subject:"):
                    subject = subject_regex.sub("", line).strip()
                    data.append((subject, is_spam))
                    
    return data

Now we can split the data into training data and test data, and then we're ready to build a classifier:

In [36]:
def p_spam_given_word(word_prob):
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

def train_and_test_model(path):

    data = get_subject_data(path)
    random.seed(0)      # just so you get the same answers as me
    train_data, test_data = split_data(data, 0.75)

    classifier = NaiveBayesClassifier()
    classifier.train(train_data)

    classified = [(subject, is_spam, classifier.classify(subject))
                  for subject, is_spam in test_data]

    counts = Counter((is_spam, spam_probability > 0.5) # (actual, predicted)
                     for _, is_spam, spam_probability in classified)

    print(counts)

    classified.sort(key=lambda row: row[2])
    spammiest_hams = list(filter(lambda row: not row[1], classified))[-5:]
    hammiest_spams = list(filter(lambda row: row[1], classified))[:5]

    print("spammiest_hams", spammiest_hams)
    print("hammiest_spams", hammiest_spams)

    words = sorted(classifier.word_probs, key=p_spam_given_word)

    spammiest_words = words[-5:]
    hammiest_words = words[:5]

    print("spammiest_words", spammiest_words)
    print("hammiest_words", hammiest_words)
    
train_and_test_model(path)

Counter({(False, False): 716, (True, True): 85, (True, False): 49, (False, True): 26})
spammiest_hams [('Species at risk of extinction growing', False, 0.8958889624800298), ('Cell phones coming soon', False, 0.9666801692557617), ('Adam dont job for no one, see.', False, 0.9758486261566025), ('2000+ year old Greek computer reinterpreted', False, 0.9767939458812925), ('Save up to 70% on international calls!', False, 0.9776715683050723)]
hammiest_spams [('I was so scared... my very first DP', True, 5.406261641572762e-05), ('Re: Hi', True, 0.0009722322165778127), ('*****SPAM*****', True, 0.0021267760018624494), ('http://www.efi.ie/', True, 0.007758971914390464), ('Outstanding Opportunities for "Premier Producers"', True, 0.008355413144240641)]
spammiest_words [('zzzz', 0.02837837837837838, 0.0002294630564479119), ('money', 0.033783783783783786, 0.0002294630564479119), ('rates', 0.033783783783783786, 0.0002294630564479119), ('systemworks', 0.033783783783783786, 0.0002294630564479119), ('adv

This gives 85 true positives, 49 false negatives, 26 false positives, and 716 true negatives.  The model's recall is `85/(85+49) = 63%` and its precision is `85/(85+26) = 77%`, which seems pretty suboptimal.  But with work it should get better results.

*note: the author's implementation got a precision result of 73%*
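The usual summary metrics can be computed directly from the four counts printed above. This is a small sketch; the helper names are illustrative and not part of the notebook's code.

```python
def precision(tp, fp):
    """Of the messages flagged as spam, the fraction that really are spam."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the actual spam messages, the fraction that were flagged."""
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    """Fraction of all messages classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# counts taken from the Counter output above
tp, fn, fp, tn = 85, 49, 26, 716

print(round(precision(tp, fp), 3))         # 0.766
print(round(recall(tp, fn), 3))            # 0.634
print(round(accuracy(tp, fp, fn, tn), 3))  # 0.914
print(round(f1_score(tp, fp, fn), 3))      # 0.694
```

Accuracy looks high mostly because the test set is dominated by nonspam messages, which is why precision and recall are the more informative numbers here.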

There are several ways to improve accuracy:

- expand the data set
- look at the message content, not just the subject line
- modify the classifier to accept an optional `min_count` threshold and ignore tokens that don't appear at least that many times
- modify the classifier to take an optional stemmer function that converts words to equivalence classes of words, since the tokenizer has no notion of similar words; the Porter stemmer is one option
- add more inputs, such as a search for a domain name or some other property that may be important
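Two of these ideas, the `min_count` threshold and the optional stemmer, can be sketched as follows. Everything here is hypothetical: `crude_stem` is a toy stand-in for a real stemmer such as the Porter stemmer, `drop_rare_words` is an invented helper, and this `tokenize` is a variant of the one defined earlier with an extra `stemmer` parameter. The `counts` argument is assumed to have the same `word -> [spam_count, non_spam_count]` shape that `count_words` produces.

```python
import re

def crude_stem(word):
    """Toy stemmer (illustration only): strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(message, stemmer=None):
    """Lowercase, extract words, and optionally map each word to its stem."""
    words = re.findall("[a-z0-9']+", message.lower())
    if stemmer is not None:
        words = [stemmer(w) for w in words]
    return set(words)

def drop_rare_words(counts, min_count=0):
    """Keep only words seen at least min_count times across both classes."""
    return {word: pair for word, pair in counts.items()
            if sum(pair) >= min_count}

print(tokenize("Offers offered: offer", stemmer=crude_stem))  # {'offer'}

# Hypothetical counts in the [spam_count, non_spam_count] format
toy_counts = {"money": [5, 0], "zzzz": [1, 0], "data": [0, 3]}
print(sorted(drop_rare_words(toy_counts, min_count=2)))       # ['data', 'money']
```

Whether either change helps on this data set would need to be measured by rerunning the train and test split.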