# Jonathan Halverson
# Tuesday, May 16, 2017
# Naive Bayes

Here we consider writing a Naive Bayes classifier for detecting spam emails versus ham. We wish to compute $P(y|x_1, x_2, ..., x_n)$, where $y$ means the class (spam or ham) and $x_i$ is word $i$ of the vocabulary. According to Bayes' theorem:

$$P(y|x_1, x_2, ..., x_n)=\frac{P(y)P(x_1, x_2, ..., x_n|y)}{P(x_1, x_2, ..., x_n)}$$

Next we assume that the features are conditionally independent given the class (this is the naive part):

$$P(x_i|y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i|y)$$

Then

$$P(y|x_1, x_2, ..., x_n)=\frac{P(y)\prod_i P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$

The denominator does not depend on $y$, so it can be dropped, and the predicted class is then

$$\hat{y}= \arg\max_y P(y)\prod_i P(x_i|y)$$

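The $P(x_i|y)$ and the priors $P(y)$ are estimated from the training data for each class. The model can then be evaluated on the feature vector of any test example. The notebook cells that follow apply the same approach to Wikipedia biographies rather than emails, with scientists labeled 0 and film directors labeled 1.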
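
As a minimal sketch of the decision rule above, the following standard-library code estimates $P(y)$ and $P(x_i|y)$ from a handful of made-up documents and picks the class with the largest score. It assumes a Bernoulli model (each word is either present or absent in a document), add-one smoothing, and sums of logarithms instead of a product to avoid underflow. The function names and the toy spam/ham data are hypothetical and are not part of the notebook.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate log P(y) and P(word present | y) from tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    n_docs = {c: 0 for c in classes}
    doc_freq = {c: defaultdict(int) for c in classes}
    for words, c in zip(docs, labels):
        n_docs[c] += 1
        for w in set(words):          # presence/absence, not counts
            doc_freq[c][w] += 1
    log_prior = {c: math.log(n_docs[c] / float(len(docs))) for c in classes}
    # additive smoothing keeps every probability strictly between 0 and 1
    p_word = {c: {w: (doc_freq[c][w] + alpha) / (n_docs[c] + 2.0 * alpha)
                  for w in vocab} for c in classes}
    return vocab, log_prior, p_word

def predict(words, vocab, log_prior, p_word):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y)."""
    present = set(words)
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in vocab:
            p = p_word[c][w]
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# made-up training data, for illustration only
docs = [["free", "offer", "win"], ["win", "cash"],
        ["meeting", "agenda"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_bernoulli_nb(docs, labels)
prediction = predict(["win", "free", "cash"], *model)
print(prediction)
```

Working in log space changes nothing about the argmax, since the logarithm is monotonic, but it keeps the score finite when the vocabulary has thousands of words.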

In [1]:
import re
import requests
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [2]:
def scrape_and_tokenize(person):
    # download and parse the biography
    base_url = 'https://en.wikipedia.org/wiki/'
    r = requests.get(base_url + person)
    soup = BeautifulSoup(r.content, 'lxml')

    # extract the text of each paragraph
    raw_text = ''
    for paragraph in soup.find_all('p'):
        raw_text += paragraph.get_text()

    # replace non-alphabetical characters with spaces, lowercase, split on whitespace
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()

    # drop stopwords, single letters and words seen only once, then stem the rest
    count = Counter(words)
    porter = PorterStemmer()
    stops = set(stopwords.words("english"))  # a set gives fast membership tests
    words = [porter.stem(word) for word in words
             if (word not in stops) and (count[word] > 1) and (len(word) > 1)]
    return words
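
To see what the filtering steps in `scrape_and_tokenize` do without downloading anything, here is a rough offline sketch applied to a single made-up sentence. It assumes a tiny hypothetical stop list in place of NLTK's English stopwords and skips the stemming step, so it only illustrates the regex cleanup and the count and length filters.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list used only for this illustration;
# the notebook itself uses NLTK's English stopwords.
STOPS = {"the", "a", "of", "and", "was", "in"}

def tokenize(raw_text, stops=STOPS):
    # replace anything that is not a letter with a space, then lowercase and split
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    # keep words that are not stopwords, occur more than once and have 2+ letters
    count = Counter(words)
    return [w for w in words if w not in stops and count[w] > 1 and len(w) > 1]

sample = "Newton's laws of motion: the laws describe motion and force in 1687."
tokens = tokenize(sample)
print(tokens)  # ['laws', 'motion', 'laws', 'motion']
```

Words that appear only once in the text ("describe", "force") are discarded along with the stopwords, digits and the stray "s" left over from "Newton's".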

In [3]:
einstein = scrape_and_tokenize('Albert_Einstein')
newton = scrape_and_tokenize('Isaac_Newton')
darwin = scrape_and_tokenize('Charles_Darwin')
spielberg = scrape_and_tokenize('Steven_Spielberg')
allen = scrape_and_tokenize('Woody_Allen')
cameron = scrape_and_tokenize('James_Cameron')
jordan = scrape_and_tokenize('Michael_Jordan')
brady = scrape_and_tokenize('Tom_Brady')
williams = scrape_and_tokenize('Serena_Williams')

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([' '.join(einstein), ' '.join(newton), ' '.join(spielberg), ' '.join(allen)])
X.toarray()

array([[2, 4, 0, ..., 4, 2, 0],
       [0, 0, 3, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 2],
       [0, 0, 0, ..., 2, 0, 0]])

In [5]:
y = [0, 0, 1, 1]

In [6]:
vec = CountVectorizer()
Xv = vec.fit_transform(['apple soup', 'chad narly', 'donut king'])
Xv.toarray()

array([[1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 1, 0],
       [0, 0, 1, 1, 0, 0]])

In [7]:
vec.get_feature_names()

[u'apple', u'chad', u'donut', u'king', u'narly', u'soup']

In [8]:
from sklearn.naive_bayes import BernoulliNB

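# alpha is the additive (Laplace) smoothing parameter
# note: binarize=None assumes X is already binary, but X holds word counts here;
# binarize=0.0 would threshold the counts to 0/1 before fitting
clf = BernoulliNB(alpha=1.0, binarize=None, fit_prior=True, class_prior=None)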
clf.fit(X, y)

BernoulliNB(alpha=1.0, binarize=None, class_prior=None, fit_prior=True)

In [9]:
obs = vectorizer.transform([' '.join(cameron)])
clf.predict(obs)

  neg_prob = np.log(1 - np.exp(self.feature_log_prob_))

array([0])

In [10]:
obs.toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
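
One detail worth keeping in mind when reading `obs`: `CountVectorizer.transform` only counts words that are in the vocabulary learned during fitting, and words it has never seen are ignored. A minimal sketch with made-up strings (the names `toy_vec` and `row` are not from the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer()
toy_vec.fit(['apple soup', 'donut king'])  # vocabulary: apple, donut, king, soup

# 'pie' was never seen during fit, so only 'apple' is counted
row = toy_vec.transform(['apple pie']).toarray()
print(row)  # [[1 0 0 0]]
```

A new biography is therefore represented only by the stems it shares with the four training biographies.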

In [13]:
clf.predict(X[2:3])

array([0])
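
Note that `X[2:3]` is the Spielberg row, which was labeled 1 during training, yet the classifier returns 0. A likely cause, offered here as an assumption rather than a verified diagnosis of this run, is that `BernoulliNB` was fitted with `binarize=None` on raw word counts. With counts greater than 1 the smoothed estimate of $P(x_i|y)$ can exceed 1, and `np.log(1 - np.exp(self.feature_log_prob_))` is then undefined. The toy data below is made up to illustrate the effect.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# made-up count matrix: two documents per class, two features
X_counts = np.array([[4, 0], [4, 1], [0, 3], [0, 2]])
y_toy = [0, 0, 1, 1]

# binarize=None: counts are used as if they were 0/1 indicators
raw = BernoulliNB(alpha=1.0, binarize=None).fit(X_counts, y_toy)
# binarize=0.0: every count above 0 is mapped to 1 before fitting
binar = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_counts, y_toy)

raw_max = np.exp(raw.feature_log_prob_).max()    # (8 + 1) / (2 + 2) = 2.25
bin_max = np.exp(binar.feature_log_prob_).max()  # (2 + 1) / (2 + 2) = 0.75
pred = binar.predict(X_counts)

print(raw_max, bin_max, pred)
```

With `binarize=0.0` the estimated probabilities stay below 1 and the toy training rows are classified correctly. Whether the same change fixes the biography example would need to be checked by rerunning the notebook. `MultinomialNB` is the usual alternative when the counts themselves should be modeled.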