For the Bayes' theorem, refer to page
{% page-ref page="../../probability-statistics-and-data-analysis/methods-theorems-and-laws/the-bayes-theorem.md" %}
Naive Bayes is a probabilistic classifier based on (surprise surprise!) Bayes' theorem; it uses a Maximum A Posteriori (MAP) estimate to assign class labels.
In a nutshell, its properties are:
- It assumes independence among the features used for classification;
- It is typically used for text classification;
- It is fast;
- It requires little training data;
- The class probabilities it outputs are unreliable, even though the predicted class is often correct.
Given a target variable $$y$$ (the class) and features $$x_1, x_2, \ldots, x_n$$, by Bayes' theorem we can write

$$P(y | x_1, \ldots, x_n) = \frac{P(y) P(x_1, \ldots, x_n | y)}{P(x_1, \ldots, x_n)} \ ,$$
$$P(y)$$ (the prior) being the frequency of class $$y$$.
The naive assumption of the algorithm is that the features are independent of each other given the class, that is, the likelihood on the right-hand side can be factorised into the product of the likelihoods of the single features:

$$P(x_1, \ldots, x_n | y) = \prod_{i=1}^n P(x_i | y) \ .$$
This way, we can simplify and write

$$P(y | x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^n P(x_i | y)}{P(x_1, \ldots, x_n)} \ .$$
We apply the MAP estimation (see the page below) to find the $$y$$ that maximises the posterior. The denominator only gives a constant of normalisation, so the maximising value is found via

$$\hat y = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y) \ .$$
What this means is that the classifier assigns the class label $$\hat y$$ as the one which maximises the posterior probability.
{% page-ref page="../../probability-statistics-and-data-analysis/methods-theorems-and-laws/the-maximum-likelihood-maximum-a-posteriori-and-expectation-maximisation-estimation-methods.md" %}
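As a sketch, the decision rule above can be implemented in a few lines, working in log space to avoid numerical underflow. The interface below (`priors` as a dict, `likelihoods[y][i]` as a callable returning $$P(x_i | y)$$) is purely illustrative, not a library API:

```python
import math

def predict(priors, likelihoods, x):
    """Naive Bayes decision rule: pick the class y maximising
    log P(y) + sum_i log P(x_i | y). `likelihoods[y][i]` is a callable
    returning P(x_i | y) -- an illustrative interface, not a library API."""
    def log_posterior(y):
        return math.log(priors[y]) + sum(
            math.log(likelihoods[y][i](xi)) for i, xi in enumerate(x)
        )
    return max(priors, key=log_posterior)
```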
The different Naive Bayes classifiers differ in the assumptions they make for the likelihood distribution $$P(x_i | y)$$.
In a Gaussian Naive Bayes, it is assumed to be a gaussian:

$$P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left(- \frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\right) \ ,$$
with parameters $$\mu_y$$ and $$\sigma_y$$ estimated from the training data.
In a Bernoulli Naive Bayes, used when features are binary, it is assumed that

$$P(x_i | y) = P(i | y) x_i + \left(1 - P(i | y)\right)(1 - x_i) \ ,$$

with $$P(i | y)$$ the probability that feature $$i$$ appears in a sample of class $$y$$.
In a Multinomial Naive Bayes, with feature vector $$(x_1, \ldots, x_n)$$ counting how many times each feature occurs in the sample (word counts, for example), the likelihood is assumed to be multinomial:

$$P(x_1, \ldots, x_n | y) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i p_{yi}^{x_i} \ ,$$

with $$p_{yi}$$ the probability of feature $$i$$ under class $$y$$.
Note that the multinomial Naive Bayes classifier becomes a linear classifier when expressed in logarithmic scale:

$$\log P(y | x_1, \ldots, x_n) \propto \log \left( P(y) \prod_i p_{yi}^{x_i} \right) = \log P(y) + \sum_i x_i \log p_{yi} = b + \mathbf{w}_y \cdot \mathbf{x} \ ,$$
with $$b = \log P(y)$$ and $$w_{yi} = \log p_{yi}$$.
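This linearity is easy to check numerically. With made-up per-feature probabilities (illustrative values only, not from the text), the log-score is affine in the count vector:

```python
import math

# Illustrative multinomial NB parameters for one class (invented values)
log_prior = math.log(0.5)                               # b = log P(y)
log_p = [math.log(0.2), math.log(0.5), math.log(0.3)]   # w_yi = log p_yi

def log_score(x):
    """b + w_y . x : the multinomial NB log-posterior up to a constant."""
    return log_prior + sum(xi * wi for xi, wi in zip(x, log_p))
```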
If in the training data a value of feature $$x_i$$ never occurs together with a given class, its frequency-based likelihood estimate is zero and, entering the product, it wipes out the contributions of all the other features: the posterior for that class is zero regardless of the rest of the sample.
A correction to remedy this problem (regularised Naive Bayes) is obtained by adding a pseudocount to the frequency used in the calculation of the likelihood, so as to have a small but non-zero probability. While in general we would calculate it as

$$\hat p_i = \frac{n_i}{n} \ ,$$

where $$n_i$$ is the number of times feature $$x_i$$ appears in the samples of the class and $$n$$ the total count over those samples, the smoothed estimate is

$$\hat p_i = \frac{n_i + \alpha}{n + \alpha d} \ ,$$

where $$d$$ is the number of possible feature values and $$\alpha > 0$$ the smoothing parameter ($$\alpha = 1$$ gives Laplace smoothing).
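A minimal sketch of the smoothed estimate, assuming we already have raw counts per feature value for one class (the function name is mine):

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing (Laplace for alpha=1): maps each feature value's
    count n_i to (n_i + alpha) / (n + alpha * d), so values never seen in
    training still get a small non-zero probability."""
    n = sum(counts.values())  # total count for this class
    d = len(counts)           # number of possible feature values
    return {v: (c + alpha) / (n + alpha * d) for v, c in counts.items()}
```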
This small example, as well as the ones below, is taken and reworked from the Wikipedia page on the topic.
The problem is classifying whether a person is male (M) or female (F) based on height (h, in feet), weight (w, in pounds) and foot size (f, in inches). This is the training data we assume to have collected:
Gender | h (feet) | w (lbs) | f (inches)
---|---|---|---
M | 6 | 180 | 12
M | 5.92 | 190 | 11
M | 5.58 | 170 | 12
M | 5.92 | 165 | 10
F | 5 | 100 | 6
F | 5.5 | 150 | 8
F | 5.42 | 130 | 7
F | 5.75 | 150 | 9
We use a gaussian assumption, so we assume the likelihood for each feature to be

$$P(x | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left(- \frac{(x - \mu_y)^2}{2 \sigma_y^2}\right)$$
and we estimate the parameters of said gaussians via MLE (see the page linked at the bottom), obtaining:
Gender | mean h | var h | mean w | var w | mean f | var f
---|---|---|---|---|---|---
M | 5.855 | 0.0350 | 176.25 | 122.92 | 11.25 | 0.9167
F | 5.4175 | 0.0972 | 132.5 | 558.33 | 7.5 | 1.67
The two classes are equiprobable because we got the same number of training points for each, so $$P(M) = P(F) = 0.5$$, and these are the priors for each class. Note that we could also take the priors from the population, assuming that each gender is equiprobable.
Now, given a new sample point whose height is 6 feet, weight 130 lbs and foot size 8 inches, we want to classify its gender, so we determine which class maximises the posterior:

$$P(y | h, w, f) = \frac{P(y) P(h, w, f | y)}{P(h, w, f)} \ , \quad y \in \{M, F\} \ ,$$
where, under the Naive Bayes assumption,

$$P(h, w, f | y) = P(h | y) P(w | y) P(f | y)$$
and

$$P(h, w, f) = P(M) P(h | M) P(w | M) P(f | M) + P(F) P(h | F) P(w | F) P(f | F) \ ,$$
which is just a normalising constant, so it can be ignored. Now,

$$P(h | M) = \frac{1}{\sqrt{2 \pi \sigma_{h,M}^2}} \exp\left(- \frac{(6 - \mu_{h,M})^2}{2 \sigma_{h,M}^2}\right) \approx 1.5789 \ .$$
In the same way we compute $$P(w | M) = 5.9881 \cdot 10^{-6}$$ and $$P(f | M) = 1.3112 \cdot 10^{-3}$$, so that in the end we obtain, up to the normalising constant, $$P(M | h, w, f) = 6.1984 \cdot 10^{-9}$$. Similarly, we get $$P(F | h, w, f) = 5.3779 \cdot 10^{-4}$$, which is larger, so we predict that the sample is female.
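The whole computation can be reproduced in a few lines of Python; the parameter values below are the MLE estimates from the table above:

```python
import math

def gaussian(x, mu, var):
    """Gaussian likelihood P(x | y) with class-conditional mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# (mean, variance) per feature, estimated by MLE from the training table
params = {
    "M": {"h": (5.855, 3.5033e-2), "w": (176.25, 1.2292e2), "f": (11.25, 9.1667e-1)},
    "F": {"h": (5.4175, 9.7225e-2), "w": (132.5, 5.5833e2), "f": (7.5, 1.6667)},
}
priors = {"M": 0.5, "F": 0.5}
sample = {"h": 6.0, "w": 130.0, "f": 8.0}

# Posterior numerators P(y) * prod_i P(x_i | y); the evidence is a common factor
numerators = {
    y: priors[y] * math.prod(gaussian(sample[k], *params[y][k]) for k in sample)
    for y in params
}
print(max(numerators, key=numerators.get))  # F
```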
Spam filtering is a common application of a Naive Bayes classifier, and it is a case of text classification.
Some words are more frequent than others in spam e-mails (for example "Viagra" is definitely a recurring word in spam e-mails). The user manually and continuously trains the filter of their e-mail provider by indicating whether a mail is spam or not. For each word in the training mails, the filter then adjusts the probability that the word appears in a spam or legitimate e-mail.
Let $$S$$ be the event that an e-mail is spam and $$w$$ a word; then we compute the probability that an e-mail is spam given that it contains $$w$$ as

$$P(S | w) = \frac{P(w | S) P(S)}{P(w | S) P(S) + P(w | \neg S) P(\neg S)} \ ,$$
where $$P(S)$$, the prior, is the probability that a message is spam in general, and $$P(w | S)$$ is the probability that $$w$$ appears in spam messages.
An unbiased filter will assume $$P(S) = P(\neg S) = 0.5$$; biased filters will assume a higher probability for a mail being spam. $$P(w | S)$$ is approximated by the frequency of mails containing word $$w$$ among those identified as spam in the learning phase, and similarly for $$P(w | \neg S)$$.
Now, this is valid for a single word, but a functional spam classifier uses several words together with the Naive Bayes hypothesis, assuming that the presence of each word is an independent event. Note that this is a crude assumption, as in natural language word co-occurrence is key; nevertheless, it is an idealisation that makes the calculation tractable in the Naive Bayes fashion.
So, with more words considered and assuming the priors are the same (see references),

$$P(S | w_1, \ldots, w_n) = \frac{\prod_i P(S | w_i)}{\prod_i P(S | w_i) + \prod_i \left(1 - P(S | w_i)\right)} \ .$$
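Under these assumptions the combination rule is a one-liner. A sketch (the function name is mine; `p_words` holds the per-word estimates $$P(S | w_i)$$):

```python
import math

def spam_probability(p_words):
    """Combine per-word spam probabilities P(S | w_i) under the Naive Bayes
    hypothesis with equal priors P(S) = P(not S) = 0.5."""
    p = math.prod(p_words)                        # prod_i P(S | w_i)
    q = math.prod(1.0 - pi for pi in p_words)     # prod_i (1 - P(S | w_i))
    return p / (p + q)
```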
Given texts which can fall into categories $$C_1, C_2, \ldots$$ (for example literary genres), we use

$$P(C | w_1, \ldots, w_n) = \frac{P(C) \prod_i P(w_i | C)}{P(w_1, \ldots, w_n)} \ ,$$
with $$C$$ being the genre, $$w_i$$ the words, and the denominator an irrelevant (constant) factor.
With a bag of words approach, if we have a training set $$D$$ and a vocabulary $$V$$ containing all the words in the documents, considering $$D_i$$ the subset of texts in category $$C_i$$, then

$$P(C_i) = \frac{|D_i|}{|D|}$$
(the fraction of samples in category $$C_i$$). Now, we concatenate all documents in $$D_i$$, obtaining $$n_i$$ words in total, and estimate

$$P(w | C_i) = \frac{n_{w,i} + 1}{n_i + |V|} \ ,$$

where $$n_{w,i}$$ is the number of occurrences of $$w$$ in the concatenation
(we use Laplace smoothing). The predicted category is then

$$\hat C = \arg\max_{C_i} P(C_i) \prod_j P(w_j | C_i) \ .$$
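A self-contained sketch of this bag-of-words classifier; the toy corpus and category names are made up:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (category, word list). Returns priors P(C_i) = |D_i|/|D|
    and Laplace-smoothed likelihoods P(w | C_i) = (n_wi + 1) / (n_i + |V|)."""
    vocab = {w for _, words in docs for w in words}
    cats = {c for c, _ in docs}
    priors = {c: sum(1 for cc, _ in docs if cc == c) / len(docs) for c in cats}
    counts = {c: Counter(w for cc, ws in docs if cc == c for w in ws) for c in cats}
    likelihood = {
        c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
            for w in vocab}
        for c in cats
    }
    return priors, likelihood, vocab

def classify(words, priors, likelihood, vocab):
    """Predict argmax_C log P(C) + sum_j log P(w_j | C), ignoring unseen words."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(likelihood[c][w]) for w in words if w in vocab
        )
    return max(priors, key=score)

# Toy example (invented data)
docs = [
    ("sport", ["ball", "goal", "team"]),
    ("sport", ["ball", "match"]),
    ("politics", ["vote", "law", "party"]),
]
priors, likelihood, vocab = train(docs)
print(classify(["ball", "team"], priors, likelihood, vocab))  # sport
```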
- P. Graham, *A Plan for Spam*, 2002
{% page-ref page="../../probability-statistics-and-data-analysis/methods-theorems-and-laws/the-maximum-likelihood-maximum-a-posteriori-and-expectation-maximisation-estimation-methods.md" %}