# Jonathan Halverson
# Tuesday, May 16, 2017
# Naive Bayes

In [1]:
import numpy as np

Here we consider a Naive Bayes classifier, with detecting spam emails versus ham as the motivating example. We wish to compute $P(y|x_1, x_2, ..., x_n)$, where $y$ is the class (spam or ham) and $x_i$ is word $i$ of the vocabulary. The same approach is applied below to classify Wikipedia biographies as scientists or filmmakers. According to Bayes' theorem:

$$P(y|x_1, x_2, ..., x_n)=\frac{P(y)P(x_1, x_2, ..., x_n|y)}{P(x_1, x_2, ..., x_n)}$$

Next we make the assumption of conditional independence (this is the naive part): given the class, each word is independent of all the other words:

$$P(x_i|y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i|y)$$

Then

$$P(y|x_1, x_2, ..., x_n)=\frac{P(y)\prod_i P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$

The denominator does not depend on $y$, so it can be dropped when comparing classes. The predicted class is then

$$\hat{y} = \underset{y}{\arg\max}\; P(y)\prod_i P(x_i|y)$$

The $P(x_i|y)$ are pre-computed using the training data for each class. The model can then be evaluated for a given test set feature vector.
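As a minimal sketch of the decision rule, the snippet below evaluates $P(y)\prod_i P(x_i|y)$ for each class and picks the largest. The priors and word likelihoods are hand-picked, hypothetical numbers (not estimated from any data), and the function name `classify` is an assumption for illustration. The product is computed as a sum of logarithms, a common way to avoid numerical underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical, hand-picked probabilities for a two-word vocabulary;
# these numbers are illustrative and not estimated from any data.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham": {"free": 0.05, "meeting": 0.20},
}

def classify(words):
    """Return the class maximizing log P(y) + sum_i log P(x_i|y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in words:
            score += math.log(likelihoods[label][word])
        scores[label] = score
    return max(scores, key=scores.get)

prediction = classify(["free", "free", "meeting"])
```

With these numbers the unnormalized scores are $0.4 \times 0.3 \times 0.3 \times 0.05 = 0.0018$ for spam and $0.6 \times 0.05 \times 0.05 \times 0.2 = 0.00015$ for ham, so `prediction` is `"spam"`.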

In [2]:
import re
import requests
from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [3]:
def scrape_and_tokenize(person):
     # download and parse the biography
     base_url = 'https://en.wikipedia.org/wiki/'
     r = requests.get(base_url + person)
     soup = BeautifulSoup(r.content, 'lxml')

     # extract the text of each paragraph
     raw_text = ''
     for paragraph in soup.find_all('p'):
          raw_text += paragraph.get_text()

     # keep only alphabetical characters and split on whitespace
     letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
     words = letters_only.lower().split()

     # count the words and filter based on count and stopwords, apply stemming
     count = Counter(words)
     porter = PorterStemmer()
     stops = set(stopwords.words("english"))
     words = [porter.stem(word) for word in words if (word not in stops) and (count[word] > 1) and (len(word) > 1)]
     return words
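To see what the cleaning steps do without downloading anything, here is a reduced sketch that runs on a literal string. It keeps the regular expression, lowercasing, count filter, and length filter from the function above, but leaves out the NLTK stopword removal and stemming. The helper name `tokenize`, its `min_count` parameter, and the sample sentence are assumptions made for this illustration.

```python
import re
from collections import Counter

def tokenize(raw_text, min_count=2):
    # replace every non-letter with a space, lowercase, split on whitespace
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    # keep words that occur at least min_count times and have more than one letter
    count = Counter(words)
    return [w for w in words if count[w] >= min_count and len(w) > 1]

# made-up sample text
sample = "Newton's laws (1687): the laws of motion. Motion, a law!"
tokens = tokenize(sample)
```

Here `tokens` is `['laws', 'laws', 'motion', 'motion']`. The apostrophe in "Newton's" becomes a space, which leaves a stray `s`; the length filter exists to discard fragments like that.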

In [4]:
einstein = scrape_and_tokenize('Albert_Einstein')
newton = scrape_and_tokenize('Isaac_Newton')
darwin = scrape_and_tokenize('Charles_Darwin')
spielberg = scrape_and_tokenize('Steven_Spielberg')
allen = scrape_and_tokenize('Woody_Allen')
cameron = scrape_and_tokenize('James_Cameron')
jordan = scrape_and_tokenize('Michael_Jordan')
brady = scrape_and_tokenize('Tom_Brady')
williams = scrape_and_tokenize('Serena_Williams')

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words=stopwords.words("english"))
X = vectorizer.fit_transform([' '.join(einstein), ' '.join(newton), ' '.join(darwin), ' '.join(spielberg), ' '.join(allen), ' '.join(cameron)])
X.toarray()

array([[2, 4, 0, ..., 2, 0, 0],
       [0, 0, 3, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 3],
       [0, 0, 0, ..., 0, 2, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

The first three biographies are scientists (class 0) while the last three are filmmakers (class 1).

In [6]:
y = np.array([0, 0, 0, 1, 1, 1])

In [7]:
vectorizer.get_feature_names()[:10]

[u'aarau',
 u'abandon',
 u'abbey',
 u'abl',
 u'abraham',
 u'absenc',
 u'absorpt',
 u'abstract',
 u'absurd',
 u'abus']

In [8]:
X.toarray().shape

(6, 2304)

### Work a smaller case

In [9]:
vec = CountVectorizer()
Xv = vec.fit_transform(['apple soup soup', 'table stamp', 'donut king'])
Xv.toarray()

array([[1, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 1, 1, 0, 0, 0]])

In [10]:
vec.get_feature_names()

[u'apple', u'donut', u'king', u'soup', u'stamp', u'table']
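The count matrix for this small case can be rebuilt by hand, which makes the column ordering explicit: columns follow the alphabetically sorted vocabulary, and each entry is the number of times that word occurs in the document. The sketch below uses plain whitespace splitting, which is a simplification; `CountVectorizer` applies its own tokenization and lowercasing, and the two agree here only because the input is already clean.

```python
docs = ['apple soup soup', 'table stamp', 'donut king']

# vocabulary: the sorted set of all whitespace-separated tokens
vocab = sorted({word for doc in docs for word in doc.split()})

# one row per document, one column per vocabulary word
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]
```

The resulting `vocab` is `['apple', 'donut', 'king', 'soup', 'stamp', 'table']` and `matrix` has the same rows as the array printed above.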

### Now fit the model

In [11]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# alpha is the smoothing parameter
#clf = BernoulliNB(alpha=1.0, binarize=None, fit_prior=True, class_prior=None)
clf = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
clf.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
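To make the role of `alpha` concrete, here is a from-scratch sketch of a multinomial model with additive smoothing. It estimates $P(x_i|y) = (N_{yi} + \alpha) / (N_y + \alpha n)$, where $N_{yi}$ is the count of word $i$ in class $y$, $N_y$ is the total word count in class $y$, and $n$ is the vocabulary size. The toy data and the function names `fit` and `predict` are assumptions for illustration; this is not scikit-learn's implementation.

```python
import math

# Toy word counts: rows are documents, columns are vocabulary words.
X_toy = [[2, 1, 0],
         [1, 0, 0],
         [0, 1, 3]]
y_toy = [0, 0, 1]
alpha = 1.0

def fit(X, y, alpha):
    """Return log priors and smoothed log likelihoods per class."""
    n_words = len(X[0])
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # total count of each word over all documents of class c
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_words
        log_lik[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_lik

def predict(x, log_prior, log_lik):
    """Pick the class with the largest log P(y) + sum_i x_i log P(x_i|y)."""
    scores = {c: log_prior[c] + sum(n * lw for n, lw in zip(x, log_lik[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

log_prior, log_lik = fit(X_toy, y_toy, alpha)
label = predict([0, 0, 2], log_prior, log_lik)
```

The third word never appears in class 0, yet its smoothed probability is $1/7$ rather than zero. Without smoothing, a single unseen word would drive the whole product for that class to zero.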

Let's check predictions on the training data:

In [12]:
obs = vectorizer.transform([' '.join(cameron), ' '.join(darwin)])
clf.predict(obs)

array([1, 0])

In [13]:
obs.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 3]])

In [14]:
obs.toarray().shape

(2, 2304)

Now make predictions using data that the model has not seen:

In [16]:
kubrick = scrape_and_tokenize('Stanley_Kubrick')
karle = scrape_and_tokenize('Jerome_Karle')
obs = vectorizer.transform([' '.join(kubrick), ' '.join(karle)])
clf.predict(obs)

array([1, 0])

Kubrick is a filmmaker (class 1) and Karle is a scientist (class 0), so the model predicts both classes correctly.