# Naive Bayes

## Model Specification

Bayes' theorem yields:

\begin{align}
P(y|x_1,\dots, x_N) \propto P(y)P(x_1,\dots, x_N|y).
\end{align}

Under the **Naive Bayes assumption**, $P(x_1,\dots, x_N|y)=\prod_{i=1}^NP(x_i|y)$. That is, Naive Bayes handles the curse of dimensionality by simply assuming that all features are conditionally independent given the class $y$. Different distributional assumptions about $P(x_i|y)$ give rise to different versions of Naive Bayes.
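
As a minimal sketch of how the factorization is used, the posterior can be computed in log space, where the product over features becomes a sum. The function name `naive_bayes_posterior` and the numbers below are made up for illustration.

```python
import numpy as np

def naive_bayes_posterior(log_prior, log_likelihoods):
    """Combine a class prior with per-feature likelihoods under the naive assumption.

    log_prior:       shape (K,),   log P(y = k)
    log_likelihoods: shape (K, N), log P(x_i | y = k) for the observed x_i
    Returns the normalized posterior P(y = k | x_1, ..., x_N), shape (K,).
    """
    # The sum of logs replaces the product over features.
    joint = log_prior + log_likelihoods.sum(axis=1)
    # Subtract the maximum before exponentiating to avoid underflow.
    joint = joint - joint.max()
    posterior = np.exp(joint)
    return posterior / posterior.sum()

# Hypothetical numbers: two classes, three features.
log_prior = np.log(np.array([0.5, 0.5]))
log_likelihoods = np.log(np.array([[0.8, 0.6, 0.9],
                                   [0.2, 0.4, 0.1]]))
posterior = naive_bayes_posterior(log_prior, log_likelihoods)
print(posterior)
```

Working in log space matters in practice because the product of many small likelihoods underflows quickly as the number of features grows.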

### Variants and Generalizations

**Gaussian Naive Bayes**

The likelihood of the features is assumed to be Gaussian:

\begin{align}
P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}}\exp\left(-\frac{(x_i-\mu_{iy})^2}{2\sigma_{iy}^2}\right).
\end{align}
The parameters $\mu_{iy}$ and $\sigma_{iy}$ are estimated in the intuitive way, by the sample mean and sample standard deviation of feature $i$ over the samples belonging to the corresponding class $y$, respectively.
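
The estimation and prediction steps can be sketched from scratch in a few lines of NumPy. This is an illustrative implementation under the Gaussian assumption above, not the `sklearn` code; the helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, and the sketch assumes every per-class variance is strictly positive.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Return classes, class priors, and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == k).mean() for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) for k in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximizing log P(y) + sum_i log N(x_i; mu_iy, sigma_iy^2)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                 # shape (n, K, N)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)   # shape (n, K)
    return classes[scores.argmax(axis=1)]

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])
params = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb([[-0.8, -1]], *params)
print(pred)
```

A feature that is constant within a class would give a zero variance and break the division, which is why practical implementations usually add a small constant to the variances.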

**Bernoulli Naive Bayes**

Bernoulli Naive Bayes, or the **Bernoulli event model**, works with an **occurrence vector**, i.e. $x_i=1$ when the $i$-th word is in the document, and $0$ otherwise. The generative story is that the document type is generated first, and then each word independently appears or does not appear in the document with some probability. The parameters are $\phi_k$, the probability that the document is in class $k$, and $\phi_{i|y=k}$, the probability of feature $i$ appearing in the document given that the document belongs to class $k$ ($k=1,\dots, K$). The smoothed version of the maximum likelihood estimate is as simple as inspecting the **occurrence frequency** over the $N$ training documents (here $N$ counts documents, not features):
\begin{align}
\hat{\phi}_k &= \frac{\sum_{n=1}^N1(\text{the $n$-th document is in class $k$})+\alpha}{N+K\alpha}\\
\hat{\phi}_{i|y=k} &=\frac{\sum_{n=1}^N1(\text{the $n$-th document is in class $k$ and it has feature $i$})+\alpha}{\sum_{n=1}^N1(\text{the $n$-th document is in class $k$})+2\alpha}.
\end{align}
The smoothing prior $\alpha \ge 0$ accounts for features not present in the training samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called **Laplace smoothing**, while $\alpha < 1$ is called **Lidstone smoothing**.
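
The smoothed estimates can be computed directly from counts. The sketch below is illustrative only: the function name `bernoulli_nb_estimates` and the toy data are made up. The feature estimate adds $2\alpha$ to the class count, one $\alpha$ for each of the two outcomes of a binary feature.

```python
import numpy as np

def bernoulli_nb_estimates(X, y, alpha=1.0):
    """Smoothed estimates for binary occurrence vectors X of shape (n_docs, n_features)."""
    classes = np.unique(y)
    n_docs, n_classes = X.shape[0], len(classes)
    # Number of documents in each class.
    class_count = np.array([(y == k).sum() for k in classes])
    # Number of documents in each class that contain feature i.
    feature_count = np.array([X[y == k].sum(axis=0) for k in classes])
    phi_k = (class_count + alpha) / (n_docs + n_classes * alpha)
    phi_i_given_k = (feature_count + alpha) / (class_count[:, None] + 2 * alpha)
    return classes, phi_k, phi_i_given_k

# Toy data: four documents, three vocabulary words, two classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])
classes, phi_k, phi_i_given_k = bernoulli_nb_estimates(X, y, alpha=1.0)
print(phi_k)
print(phi_i_given_k)
```

In this toy data the first word never occurs in class 1, yet its smoothed estimate stays strictly positive, which is exactly what the smoothing is for.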

**Multinomial Naive Bayes**

Multinomial Naive Bayes, or the **multinomial event model**, typically represents data as **word count vectors** (although tf-idf vectors are also known to work well in practice). The generative story is that the document type is generated first, and then each word is independently 'written'. The parameters are $\phi_k$, the probability that the document is in class $k$, and $\phi_{i|y=k}$, the probability of word $i$ appearing as the 'next word' in the document given that the document belongs to class $k$ ($k=1,\dots, K$). The smoothed version of the maximum likelihood estimate just inspects the **counting frequency**:
\begin{align}
\hat{\phi}_k &= \frac{\sum_{n=1}^N1(\text{the $n$-th document is in class $k$})+\alpha}{N+K\alpha}\\
\hat{\phi}_{i|y=k} &=\frac{\sum_{n=1}^N\sum_{j=1}^{m_n}1(\text{the $n$-th document is in class $k$ and the $j$-th word in the document is $i$})+\alpha}{\sum_{n=1}^N1(\text{the $n$-th document is in class $k$})m_n+|V|\alpha},
\end{align}
where $m_n$ is the total number of words in the $n$-th document and $|V|$ is the size of the vocabulary.
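
As a small sketch of the word-level estimate, the counts can be aggregated from a document-by-word count matrix. The function name `multinomial_nb_estimates` and the toy counts are made up for illustration.

```python
import numpy as np

def multinomial_nb_estimates(counts, y, alpha=1.0):
    """Smoothed word probabilities per class from a (n_docs, |V|) word-count matrix."""
    classes = np.unique(y)
    vocab_size = counts.shape[1]
    # Occurrences of each word summed over the documents of each class, shape (K, |V|).
    word_count = np.array([counts[y == k].sum(axis=0) for k in classes])
    # Total number of words in each class, i.e. the sum of m_n over its documents.
    total_words = word_count.sum(axis=1, keepdims=True)
    phi = (word_count + alpha) / (total_words + vocab_size * alpha)
    return classes, phi

# Toy data: four documents over a three-word vocabulary.
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [0, 1, 3],
                   [0, 0, 2]])
y = np.array([0, 0, 1, 1])
classes, phi = multinomial_nb_estimates(counts, y, alpha=1.0)
print(phi)
```

Because $\alpha$ is added once per vocabulary word, each row of `phi` still sums to one.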

## Theoretical Properties

### Advantages

- The independence assumption of Naive Bayes is (surprisingly) powerful when the dimension $p$ of the feature space is high. In fact, despite these rather optimistic assumptions, Naive Bayes classifiers often outperform far more sophisticated alternatives. The intuition is that, although the individual class density estimates may be biased, this bias might not hurt the posterior probabilities as much, especially near the decision region. Even in the presence of strong dependence, Naive Bayes can still be optimal if the dependences are distributed evenly among classes, or if the dependences cancel each other out; see the reference under Further Reading below.
- If a component $X_j$ of $X$ is discrete, then an appropriate histogram estimate can be used. This provides a seamless way of mixing variable types in a feature vector.
- Naive Bayes supports 'online learning', i.e. it is an 'incremental learner': it can update its model one training example at a time, without re-processing all previous examples.

### Disadvantages
- Due to the over-simplifying assumption of independence, the class probability outputs, e.g. from `predict_proba` in `sklearn`, are not to be taken too seriously.
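
One way to see the effect is to feed the model perfectly dependent features. In the sketch below (synthetic data, chosen only for illustration) a single weakly informative feature is copied ten times; each copy is counted as separate evidence, so the predicted probabilities move toward 0 or 1 even though no information was added.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# One weakly informative feature, two balanced classes (synthetic data).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
y = np.repeat([0, 1], 200)

X_one = x[:, None]                     # the feature on its own
X_dup = np.repeat(X_one, 10, axis=1)   # ten perfectly dependent copies

proba_one = GaussianNB().fit(X_one, y).predict_proba(X_one)
proba_dup = GaussianNB().fit(X_dup, y).predict_proba(X_dup)

# Average confidence in the predicted class, with and without the copies.
conf_one = proba_one.max(axis=1).mean()
conf_dup = proba_dup.max(axis=1).mean()
print(conf_one, conf_dup)
```

If calibrated probabilities are needed, a common remedy is to post-process the scores, for example with `sklearn.calibration.CalibratedClassifierCV`.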

### Relation to Other Models

- Both Naive Bayes and Generalized Additive Models factorize the joint probability into individual marginal probabilities through some sort of independence assumption. But Naive Bayes makes further parametric distributional assumptions on the marginal probabilities. This difference is similar to the difference between [linear discriminant analysis](LDA.ipynb) and [logistic regression](logistic_regression).
- Algorithms that try to learn $p(y|x)$ directly are called **discriminative algorithms**, while algorithms that try to model $p(x|y)$ (together with $p(y)$) are called **generative learning algorithms**. Naive Bayes is a generative learning algorithm; see more discussion in the [LDA notebook](LDA and QDA.ipynb).

## Empirical Performance

### Advantages

- Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

### Disadvantages

## Implementation Details and Practical Tricks

**`BernoulliNB` in `sklearn`** (the interface of **`MultinomialNB`** is similar and hence omitted)

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Six random binary samples with 100 features each, and their class labels
X = np.random.randint(2, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

clf = BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
clf.fit(X, Y)
print(clf.predict(X[2:3]))
```

```
[3]
```


**Parameters**

- **`alpha`**:

    Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
    

- **`binarize`**:

    Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
    

- **`fit_prior`**:

    Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
    

- **`class_prior`**:

    Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
    

**Attributes**

- **`class_log_prior_`**:

    Log probability of each class (smoothed). As discussed above, probability estimates from Naive Bayes should not be trusted with great confidence.
    

- **`feature_log_prob_`**:

    Empirical log probability of features given a class, $\log P(x_i|y)$.
    

- **`class_count_`**:

    Number of samples encountered for each class during fitting. This value is weighted by the sample weight when provided.
    

- **`feature_count_`**:

    Number of samples encountered for each `(class, feature)` during fitting. This value is weighted by the sample weight when provided.

**Selected Method**

- **`partial_fit(X, y[, classes, sample_weight])`**: Incremental fit on a batch of samples. That is, Naive Bayes models in `sklearn` can be used to tackle large-scale classification problems for which the full training set might not fit in memory.
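
A minimal sketch of batched training with `partial_fit`, using made-up random count data: because the model only accumulates counts, feeding the data in chunks should give the same fitted parameters as a single call to `fit`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 20))   # hypothetical word counts
y = rng.integers(0, 3, size=300)         # hypothetical class labels

# Reference model fitted on all data at once.
clf_batch = MultinomialNB().fit(X, y)

# Same model fitted in three chunks of 100 samples each.
clf_stream = MultinomialNB()
classes = np.unique(y)   # all labels must be known on the first call
for start in range(0, len(X), 100):
    clf_stream.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

same_params = np.allclose(clf_batch.feature_log_prob_, clf_stream.feature_log_prob_)
print(same_params)
```

In a real out-of-core setting each chunk would be read from disk or a stream rather than sliced from an in-memory array.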

**`GaussianNB` in `sklearn`**

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Fit on the full training set in one call
clf = GaussianNB(priors=None)
clf.fit(X, Y)
print(clf.predict([[-0.8, -1]]))

# Equivalent incremental fit; the class labels must be given on the first call
clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))
print(clf_pf.predict([[-0.8, -1]]))
```

```
[1]
[1]
```


**Parameters**

- **`priors`**: shape `(n_classes,)`

    Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

**Selected Attributes**

- **`theta_`**: shape `(n_classes, n_features)`

    Mean of each feature per class.

- **`sigma_`**: shape `(n_classes, n_features)`

    Variance of each feature per class. Recent `sklearn` releases expose this attribute as `var_` instead.
    
The methods of `GaussianNB` are similar to the other two NB algorithms in `sklearn`, and hence omitted.

**Constructing the Vocabulary or Dictionary**

In text classification, it is of practical importance to choose the dictionary by which to construct the feature vector. 

- Rather than looking through an English dictionary for the list of all English words, it is more common to look through the training set and encode in our feature vector only the words that occur at least once. Apart from reducing the number of words modeled and hence reducing our computational and space requirements, this also has the advantage of allowing us to model/include as a feature many words that may appear in the corpus but that you won't find in a dictionary.
- Sometimes we also exclude very high-frequency words, such as "the", "of", and "and". These high-frequency, "content-free" words are called **stop words**; since they occur in so many documents, they do little to indicate which class a document belongs to.
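
Both choices can be sketched with `CountVectorizer` from `sklearn`, which builds the vocabulary from the training documents and can drop a built-in English stop word list. The three documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a zyzzyva is a kind of weevil",
]

# Vocabulary = words seen at least once in the corpus, minus English stop words.
vectorizer = CountVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)        # sparse (n_docs, |V|) count matrix
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Raising `min_df` would additionally drop rare words, and passing `binary=True` would produce occurrence vectors suited to the Bernoulli event model instead of counts.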

## Use Cases

Naive Bayes works well in problems of text classification and spam filtering. Since it is very fast and straightforward to implement, it is usually the first method to try in such tasks.
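
A first baseline for such a task might look like the sketch below, which chains a bag-of-words vectorizer with `MultinomialNB`. The four training messages and their labels are invented for illustration, and far too few for a real spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set.
train_docs = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Word counts feed the multinomial event model with Laplace smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_docs, train_labels)

predictions = model.predict(["claim your free prize", "monday project meeting"])
print(predictions)
```

Swapping `CountVectorizer` for `TfidfVectorizer` is a one-line change if tf-idf features are preferred.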

## Results Interpretation, Metrics and Visualization

## References

- ESL Chapter 6.6.3
- [`sklearn` Document 1.9](http://scikit-learn.org/stable/modules/naive_bayes.html)
- Data Science for Business, Chapter 9

### Further Reading

- H. Zhang (2004). [The optimality of Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html). Proc. FLAIRS.
- C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265.
- A. McCallum and K. Nigam (1998). [A comparison of event models for Naive Bayes text classification.](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529) Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
- V. Metsis, I. Androutsopoulos and G. Paliouras (2006). [Spam filtering with Naive Bayes – Which Naive Bayes?](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542) 3rd Conf. on Email and Anti-Spam (CEAS).

## Misc.