# Document Classification with Naive Bayes
Based on [Jurafsky: Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/)

Multinomial Naive Bayes: the word counts of a document are assumed to follow a multinomial distribution given the document's class. This is a common choice for text classification.

Assumes Bag-of-Words: each document is represented by the unordered collection of its word counts. Word order is ignored.

## The Model
* Corpus of $N$ documents $\{d_n\}_1^N$
* Vocabulary $V$: set of words $\{w_j\}_1^J$
* Feature based on count of vocabulary word $w_j$ in document $d_n$: $f_{nj}=f(w_j;d_n)$  
* $K$ Classes $\{c_k\}_1^K$

The Naive Bayes classifier returns the class $\hat{c}_n$ which has the maximum posterior probability given the document $d_n$.
$$
\DeclareMathOperator*{\argmax}{argmax}
\hat{c}_n=\argmax_{\{c_k\}} P(c_k|d_n)
$$

Bayes' Theorem expresses the posterior probability $P(c_k|d_n)$ of class $c_k$ given document $d_n$ as the product of the likelihood $P(d_n|c_k)$ and the prior $P(c_k)$ (the probability of class $c_k$ before any observation), divided by the evidence $P(d_n)$:
$$\hat{c}_n=\argmax_{\{c_k\}} P(c_k|d_n)=\argmax_{\{c_k\}} \frac{P(d_n|c_k)P(c_k)}{P(d_n)}$$

$P(d_n)$ can be dropped because it is the same for all classes $\{c_k\}_1^K$ and therefore does not change which class attains the maximum.

$$\hat{c}_n=\argmax_{\{c_k\}} P(c_k|d_n)=\argmax_{\{c_k\}} \overbrace{P(d_n|c_k)}^{\text{likelihood}}\overbrace{P(c_k)}^{\text{prior}}$$

Represent document $d_n$ as a set of features $f_{n1}, f_{n2},...,f_{nJ}$, which encode the counts of the respective vocabulary words $w_1, w_2,...,w_J$ in document $d_n$:

$$\hat{c}_n=\argmax_{\{c_k\}} P(f_{n1}, f_{n2},...,f_{nJ}|c_k)P(c_k)$$

Naive assumption: the features $f_{nj}$ are conditionally independent given the class $c_k$, and therefore

$$P(f_{n1}, f_{n2},...,f_{nJ}|c_k) = P(f_{n1}|c_k)\cdot P(f_{n2}|c_k)\cdot ... \cdot P(f_{nJ}|c_k)$$

$$c_n^{NB}=\argmax_{\{c_k\}} P(c_k) \prod_{j} P(f_{nj}|c_k)$$

The same decision is computed in log space to avoid underflow and increase speed. The logarithm is monotonic, so the maximising class does not change:

$$c_n^{NB}=\argmax_{\{c_k\}} \log P(c_k) + \sum_{j} \log P(f_{nj}|c_k)$$
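
The underflow problem is easy to reproduce. The following is a minimal sketch using made-up numbers (400 tokens, each with a hypothetical likelihood of 0.01), not values from this notebook:

```python
import math

# Hypothetical per-token likelihoods: 400 tokens, each with probability 0.01.
probs = [0.01] * 400

# The direct product is 1e-800, far below the smallest positive double,
# so it underflows to exactly 0.0.
product = math.prod(probs)

# The sum of logs stays an ordinary floating-point number.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -1842.07
```

Once every class product has collapsed to 0.0 the classes can no longer be compared, whereas the log sums remain comparable.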

## Application to Word Tokens of Document $d_n$
Applying this to the word tokens $\{t_i\}$ of document $d_n$ requires looking up, for each token $t_i$, the corresponding vocabulary word $w_j$. Tokens without a corresponding word in the vocabulary are simply ignored.

$$c_n^{NB}=\argmax_{\{c_k\}} \log P(c_k) + \sum_{t_i} \log P(w_j|c_k)$$
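
A token-by-token version of this decision rule might look like the sketch below. The class names, the three-word vocabulary and all probabilities are invented for illustration, and `classify` is a hypothetical helper, not part of the notebook code further down:

```python
import math

# Hypothetical toy parameters for two classes (not estimated from any data).
log_prior = {"pos": math.log(0.4), "neg": math.log(0.6)}
log_likelihood = {
    "pos": {"fun": math.log(0.5), "boring": math.log(0.1), "plot": math.log(0.4)},
    "neg": {"fun": math.log(0.1), "boring": math.log(0.6), "plot": math.log(0.3)},
}

def classify(tokens):
    """Return the best class and the log score of every class."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            # Tokens that are not in the vocabulary are skipped.
            if t in log_likelihood[c]:
                score += log_likelihood[c][t]
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify("a fun fun plot".split())
print(label)  # pos
```

The token "a" is not in the toy vocabulary and contributes nothing, while "fun" occurs twice and is therefore added twice.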

## Training

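The prior of class $c_k$ is estimated as the fraction of training documents that belong to it:

$$\hat{P}(c_k)=\frac{N_k}{N}$$
where $N_k$ is the number of documents of class $c_k$ and $N$ is the total number of documents (samples).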

$P(w_{j}|c_k)$ is estimated as the fraction of all tokens in the documents of class $c_k$ that are the vocabulary word $w_j$:

$$P(w_{j}|c_k)=\frac{count(w_j,c_k)}{\sum_{w_{j'}\in V}count(w_{j'},c_k)}$$

Apply additive smoothing with parameter $\alpha$ to avoid zero probabilities for vocabulary words that do not appear in every class. Without it, a single word unseen in class $c_k$ would force the whole product for that class to zero.

$$P(w_j|c_k)=\frac{count(w_j,c_k)+\alpha}{\sum_{w\in V}(count(w,c_k)+\alpha)}=\frac{count(w_j,c_k)+\alpha}{\left(\sum_{w\in V}count(w,c_k)\right)+\alpha\cdot|V|}$$

Laplace Smoothing: $\alpha = 1$  
Lidstone Smoothing: $0 < \alpha < 1$ 

Words missing in the vocabulary are simply ignored.
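
To make the effect of $\alpha$ concrete, here is a small sketch with an invented three-word vocabulary and three invented tokens for one class. The helper name `smoothed_likelihoods` is an assumption made for this example only:

```python
from collections import Counter

def smoothed_likelihoods(tokens_of_class, vocabulary, alpha=1.0):
    """Additively smoothed estimate of P(w | c) for every vocabulary word."""
    counts = Counter(tokens_of_class)
    # Denominator: total count of vocabulary words in the class plus alpha * |V|.
    denom = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocab = ["fun", "boring", "plot"]            # hypothetical vocabulary
neg_tokens = ["boring", "boring", "plot"]    # hypothetical tokens of one class

p = smoothed_likelihoods(neg_tokens, vocab, alpha=1.0)
print(p)  # fun: 1/6, boring: 3/6, plot: 2/6
```

The word "fun" never occurs in the class, yet it receives probability 1/6 instead of 0, and the three estimates still sum to 1.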

In [6]:
import numpy as np
import pandas as pd

## The Data
Sentiment classification task ('-': negative, '+': positive)

In [7]:
docs = ['just plain boring',
         'entirely predictable and lacks energy',
         'no surprises and very few laughs',
         'very powerful',
         'the most fun file of the summer'
        ]

classes = ['-',
     '-',
     '-',
     '+',
     '+'
    ]

In [8]:
N = len(docs)
clss = sorted(set(classes))  # sorted, so the class order is reproducible between runs
K = len(clss)

## An Illustrative Implementation

In [9]:
alpha = 1.0

### Organise the Target Classes

In [11]:
map_cls = {label:i for i,label in enumerate(clss)}

T = np.array([map_cls[label] for label in classes])

In [12]:
clss

['+', '-']

In [13]:
T

array([1, 1, 1, 0, 0])

### Calculate the Log Priors

In [15]:
class_counts = [0.0]*K  # number of training documents per class
for t in T:
    class_counts[t] += 1
print(class_counts)
log_priors = [np.log(count/N) for count in class_counts]  # log(N_k / N)
print(log_priors)

[2.0, 3.0]
[-0.916290731874155, -0.5108256237659907]
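
As a cross-check of the two values printed above, the same log priors can be obtained without an explicit loop. This sketch repeats the class index array `T` from the earlier cell so that it runs on its own. The names `class_counts_vec` and `log_priors_vec` are introduced only for this example:

```python
import numpy as np

T = np.array([1, 1, 1, 0, 0])  # class indices, copied from the output above
K = 2

# np.bincount counts how often each index 0..K-1 occurs in T.
class_counts_vec = np.bincount(T, minlength=K)
log_priors_vec = np.log(class_counts_vec / len(T))

print(class_counts_vec)  # [2 3]
print(log_priors_vec)    # log(2/5) and log(3/5)
```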


### The Vocabulary

In [18]:
V = set([token for doc in docs for token in doc.split()])
n_V = len(V)

map_V = {word:pos for pos, word in enumerate(V)}
map_V

{'and': 0,
 'surprises': 1,
 'just': 2,
 'boring': 3,
 'file': 4,
 'few': 5,
 'no': 6,
 'lacks': 7,
 'summer': 8,
 'of': 9,
 'fun': 10,
 'very': 11,
 'plain': 12,
 'entirely': 13,
 'energy': 14,
 'predictable': 15,
 'the': 16,
 'laughs': 17,
 'most': 18,
 'powerful': 19}

### Convert the training documents into word counts

In [19]:
def doc_to_word_counts(doc):
    counts = [0]*n_V
    for token in doc.split():
        try:
            counts[map_V[token]] += 1
        except KeyError: # simply ignore words not in the vocabulary
            pass
    return counts
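
The behaviour of this conversion is easiest to see on a tiny input. The sketch below uses a hypothetical three-word vocabulary and a standalone variant of the function (`to_counts`, which takes the index as an argument) so that it can be run in isolation:

```python
vocab_index = {"the": 0, "fun": 1, "summer": 2}  # hypothetical mini vocabulary

def to_counts(doc, index):
    """Count vocabulary words in doc; out-of-vocabulary tokens are ignored."""
    counts = [0] * len(index)
    for token in doc.split():
        pos = index.get(token)  # None for tokens that are not in the vocabulary
        if pos is not None:
            counts[pos] += 1
    return counts

print(to_counts("the fun of the summer", vocab_index))  # [2, 1, 1]
```

The token "of" is dropped because it is not in the mini vocabulary, and "the" is counted twice.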

In [20]:
X = np.array([doc_to_word_counts(doc) for doc in docs])

$X$ now contains for each document (row) the counts per vocabulary word (column):

In [21]:
df_X = pd.DataFrame(X, columns=list(map_V))  # column labels as a list, in the same order as the map_V positions
df_X.index = docs
df_X

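| | and | surprises | just | boring | file | few | no | lacks | summer | of | fun | very | plain | entirely | energy | predictable | the | laughs | most | powerful |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| just plain boring | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| entirely predictable and lacks energy | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| no surprises and very few laughs | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| very powerful | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| the most fun file of the summer | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |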


### Training - Calculate the Likelihoods for the vocabulary words given class $c$
$$P(w_{i}|c)=\frac{count(w_i,c)+\alpha}{\sum_{w\in V}(count(w,c)+\alpha)}$$

First calculate $count(w_i,c)+\alpha$ for each class $c$ and vocabulary word $w_i$:

In [22]:
counts_w_c = np.ones((K,n_V)) * alpha  # initialise with the +alpha counts for the Smoothing
for label,i in map_cls.items():  # loop over all labels and their positions in the clss list
    for doc in X[T == i]:  # select the count vectors of all docs of class i
        counts_w_c[i] += doc  # and add up the word counts

In [23]:
df_counts_w_c = pd.DataFrame(counts_w_c, columns=list(map_V))  # column labels as a list, in the same order as the map_V positions
df_counts_w_c.index = clss
df_counts_w_c

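| | and | surprises | just | boring | file | few | no | lacks | summer | of | fun | very | plain | entirely | energy | predictable | the | laughs | most | powerful |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| + | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 2.0 | 2.0 |
| - | 3.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 |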
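
The accumulation loop above can also be written as a single matrix product between a one-hot encoding of the class indices and the count matrix. The sketch below shows this on small stand-in data (3 documents, 4 vocabulary words, 2 classes, all hypothetical) and compares the result with an explicit loop:

```python
import numpy as np

# Hypothetical stand-in data: rows are documents, columns are vocabulary words.
X = np.array([[1, 0, 2, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0]])
T = np.array([0, 1, 1])  # class index of each document
K, alpha = 2, 1.0

# one_hot[n, k] is 1 if document n belongs to class k, else 0.
one_hot = np.eye(K)[T]
# (K, N) @ (N, n_V) sums the count vectors per class; alpha is added to every cell.
counts_vectorised = one_hot.T @ X + alpha

# Reference result from an explicit loop over the classes.
counts_loop = np.ones((K, X.shape[1])) * alpha
for k in range(K):
    counts_loop[k] += X[T == k].sum(axis=0)

print(np.array_equal(counts_vectorised, counts_loop))  # True
```

For a corpus of five documents the loop is perfectly adequate. The matrix form mainly matters when the number of documents is large.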


Calculate the denominator $\sum_{w\in V}(count(w,c)+\alpha)$

In [24]:
sums_per_c = np.sum(counts_w_c, axis=1)  # the alpha was already considered in the initialisation of counts_w_c! 
sums_per_c

array([29., 34.])

Calculate $\log P(w_{i}|c)$ per class

In [25]:
log_likelihood_w_c = np.log(counts_w_c / sums_per_c[:,None])
log_likelihood_w_c

array([[-3.36729583, -3.36729583, -3.36729583, -3.36729583, -2.67414865,
        -3.36729583, -3.36729583, -3.36729583, -2.67414865, -2.67414865,
        -2.67414865, -2.67414865, -3.36729583, -3.36729583, -3.36729583,
        -3.36729583, -2.26868354, -3.36729583, -2.67414865, -2.67414865],
       [-2.42774824, -2.83321334, -2.83321334, -2.83321334, -3.52636052,
        -2.83321334, -2.83321334, -2.83321334, -3.52636052, -3.52636052,
        -3.52636052, -2.83321334, -2.83321334, -2.83321334, -2.83321334,
        -2.83321334, -3.52636052, -2.83321334, -3.52636052, -3.52636052]])
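
A quick sanity check on these values: for each class, the likelihoods must sum to 1 over the vocabulary. The sketch below hard-codes the smoothed counts from the table above (in the column order shown there) so that it runs on its own. The names `smoothed_counts` and `log_lik` are introduced only for this check:

```python
import numpy as np

# Smoothed counts count(w, c) + alpha, copied from the table above.
smoothed_counts = np.array([
    [1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 1, 2, 2],  # class '+'
    [3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1],  # class '-'
], dtype=float)

# Normalise each row by its sum and take the log, as in the cell above.
log_lik = np.log(smoothed_counts / smoothed_counts.sum(axis=1, keepdims=True))

# Back in probability space every row must sum to 1.
row_sums = np.exp(log_lik).sum(axis=1)
print(np.allclose(row_sums, 1.0))  # True
```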

### Inference
1. Convert the document into a word count vector
2. Calculate the unnormalised log posterior scores $\log P(c) + \sum_{t_i} \log P(w_i|c)$ for all classes $c$
3. Select the maximum score and map it to the corresponding class label

In [26]:
def infer(doc):
    # convert incoming document to word count vector
    x = np.array(doc_to_word_counts(doc))
    # calculate the unnormalised log posterior score per class
    # elements of x are the counts of each vocabulary word in doc (0 if the word does not appear)
    # -> log_likelihood_w_c * x weights each log likelihood by how often the word occurs
    class_log_scores = np.array(log_priors) + np.sum(log_likelihood_w_c * x, axis=1)
    # select the class with the highest score
    class_index = np.argmax(class_log_scores)
    # map class index back to label
    return clss[class_index]

In [30]:
infer('predictable with no fun')

'-'
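
The scores compared during inference are unnormalised log posteriors. If actual posterior probabilities are wanted, they can be recovered by exponentiating and normalising, ideally after subtracting the maximum for numerical stability. The sketch below does this for the example document, rebuilding the two scores from the smoothed counts and class sums shown earlier. `normalise_log_scores` is a hypothetical helper, not part of the notebook:

```python
import math

def normalise_log_scores(log_scores):
    """Turn unnormalised log posteriors into probabilities that sum to 1."""
    m = max(log_scores)  # subtracting the maximum avoids underflow in exp
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# 'predictable with no fun': 'with' is out of vocabulary and ignored.
# Smoothed counts for predictable, no, fun are 1, 1, 2 for '+' (sum 29)
# and 2, 2, 1 for '-' (sum 34); the priors are 2/5 and 3/5.
score_pos = math.log(2 / 5) + math.log(1 / 29) + math.log(1 / 29) + math.log(2 / 29)
score_neg = math.log(3 / 5) + math.log(2 / 34) + math.log(2 / 34) + math.log(1 / 34)

posterior = normalise_log_scores([score_pos, score_neg])
print(posterior)  # roughly [0.35, 0.65]
```

Under these numbers the negative class wins with a posterior of roughly 0.65, which agrees with the label returned by `infer`.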