### Bayes Theorem in Action
Let’s see how this works in practice with an example. Suppose you are building a classifier that says whether a text is about Sports or Not sports. Our training data has 5 sentences:

| Text | Tag |
| --- | --- |
| A great game | Sports |
| The election was over | Not sports |
| Very clean match | Sports |
| A clean but forgettable game | Sports |
| It was a close election | Not sports |

So here comes the Naive part: you assume that every word in a sentence is independent of the other ones. This means that you’re no longer looking at entire sentences, but rather at individual words. So for our purposes, “this was a fun party” is the same as “a fun party this was” and “party fun a was this”. Let’s take “a very close game” as our sample sentence. Mathematically, the above idea can be represented as follows:

P(a very close game)=P(a)×P(very)×P(close)×P(game)
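To make the "word order doesn't matter" idea concrete, here is a minimal sketch that reduces a sentence to word counts (a bag of words). The helper name `bag_of_words` is hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization, not something the example above prescribes.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

original = bag_of_words("a very close game")
shuffled = bag_of_words("game close very a")

# Both sentences collapse to the same counts, so the model cannot tell them apart.
print(original == shuffled)
```

Because the model only ever sees these counts, any reordering of the same words gets exactly the same probability.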

This assumption is very strong but super useful. It’s what makes this model work well with little data or data that may be mislabeled. The next step is to apply the same assumption to the class-conditional probabilities that Bayes’ theorem needs:

P(a very close game | Sports)=P(a|Sports)×P(very|Sports)×P(close|Sports)×P(game|Sports)

Each factor is now the probability of a single word given a class, and individual words show up often enough in our training data that we can estimate these probabilities by counting!

### Calculating probabilities
The final step is just to calculate every probability and see which one turns out to be larger. Here, calculating a probability just means counting how often each word appears in the training data for each class.

First, you calculate the a priori probability of each tag: for a given sentence in our training data, the probability of Sports is P(Sports) = 3/5 and Not sports is P(Not Sports) = 2/5. That’s easy enough.
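The priors can be checked with a few lines of counting. This is a sketch only: the variable names (`training_data`, `tag_counts`, `priors`) are chosen for illustration, and `Fraction` is used so the results stay as exact ratios.

```python
from collections import Counter
from fractions import Fraction

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

# Count how many training sentences carry each tag.
tag_counts = Counter(tag for _, tag in training_data)

# Prior of a tag = sentences with that tag / all sentences.
priors = {tag: Fraction(n, len(training_data)) for tag, n in tag_counts.items()}

for tag, p in priors.items():
    print(tag, p)
```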

Then, calculating P(game | Sports) means counting how many times the word ‘game’ appears in Sports texts (2) and dividing by the total number of words in Sports texts (11). Therefore, P(game | Sports) = 2/11. However, you run into a problem here: ‘close’ doesn’t appear in any Sports texts. That means that P(close | Sports) = 0.

This is rather inconvenient since you are going to be multiplying it with the other probabilities, so you’ll end up with P(a|Sports) × P(very|Sports) × 0 × P(game|Sports)

This equals 0, since in a multiplication, if one of the terms is zero, the whole calculation is nullified. Doing things this way simply doesn’t give us any information at all, so you have to find a way around.
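The zero-probability problem is easy to reproduce. The sketch below estimates the unsmoothed word probabilities for the Sports class by plain counting; the names `sports_texts` and `likelihood` are illustrative, and lowercasing with whitespace splitting is an assumed tokenization.

```python
from collections import Counter
from fractions import Fraction
from math import prod

sports_texts = ["A great game", "Very clean match", "A clean but forgettable game"]

# Word counts over all Sports sentences, and the total number of words (11).
word_counts = Counter(w for text in sports_texts for w in text.lower().split())
total_words = sum(word_counts.values())

def likelihood(word):
    """Unsmoothed estimate of P(word | Sports)."""
    return Fraction(word_counts[word], total_words)

sentence = "a very close game".split()
likelihoods = {w: likelihood(w) for w in sentence}
product = prod(likelihoods.values())

print(likelihoods)  # 'close' has probability 0
print(product)      # the single zero wipes out the whole product
```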

### Laplace Smoothing
In Laplace smoothing you add 1 to every word count so it’s never zero. To balance this, you add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match']

Since the number of possible words is 14 (We’ve counted them!), applying smoothing you get that P(game | Sports) = (2+1)/(11+14). The full results are:

| Word | P(word \| Sports) | P(word \| Not sports) |
| --- | --- | --- |
| a | (2+1)/(11+14) | (1+1)/(9+14) |
| very | (1+1)/(11+14) | (0+1)/(9+14) |
| close | (0+1)/(11+14) | (1+1)/(9+14) |
| game | (2+1)/(11+14) | (0+1)/(9+14) |
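The smoothed values can be recomputed from the training data. This is a minimal sketch of add-one (Laplace) smoothing; the helper name `smoothed` and the tokenization (lowercase, whitespace split) are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

# Per-class word counts.
counts = {"Sports": Counter(), "Not sports": Counter()}
for text, tag in training_data:
    counts[tag].update(text.lower().split())

# The vocabulary is every distinct word seen in any class (14 words here).
vocabulary = set().union(*counts.values())

def smoothed(word, tag):
    """Laplace-smoothed P(word | tag): (count + 1) / (class words + vocabulary size)."""
    class_total = sum(counts[tag].values())
    return Fraction(counts[tag][word] + 1, class_total + len(vocabulary))

for word in "a very close game".split():
    print(word, smoothed(word, "Sports"), smoothed(word, "Not sports"))
```

Note that `Fraction` reduces ratios automatically, so (2+1)/(11+14) prints as 3/25.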

Now you just multiply all the probabilities, including the prior of each class, and see which result is bigger. Each product is proportional to the posterior probability of its class:

P(Sports | a very close game) ∝ P(a|Sports)×P(very|Sports)×P(close|Sports)×P(game|Sports)×P(Sports)

= 2.76 × 10⁻⁵

= 0.0000276

P(Not Sports | a very close game) ∝ P(a|Not Sports)×P(very|Not Sports)×P(close|Not Sports)×P(game|Not Sports)×P(Not Sports)

= 0.572 × 10⁻⁵

= 0.00000572

The class with the highest score is taken as the most likely class, so our classifier tags “a very close game” as Sports. This decision rule is known as Maximum A Posteriori (MAP).
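Putting the pieces together, the whole decision can be sketched in a few lines. This is an illustrative implementation under the same assumptions as the worked example (lowercased, whitespace-split words and add-one smoothing); the function name `score` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import prod

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

tag_counts = Counter(tag for _, tag in training_data)
word_counts = {tag: Counter() for tag in tag_counts}
for text, tag in training_data:
    word_counts[tag].update(text.lower().split())
vocabulary = set().union(*word_counts.values())

def score(sentence, tag):
    """Prior of the tag times the Laplace-smoothed likelihood of each word."""
    prior = Fraction(tag_counts[tag], len(training_data))
    class_total = sum(word_counts[tag].values())
    likelihoods = [
        Fraction(word_counts[tag][w] + 1, class_total + len(vocabulary))
        for w in sentence.lower().split()
    ]
    return prior * prod(likelihoods)

scores = {tag: score("a very close game", tag) for tag in tag_counts}
predicted = max(scores, key=scores.get)  # MAP: pick the highest-scoring tag

for tag, s in scores.items():
    print(f"{tag}: {float(s):.3g}")
print("Predicted:", predicted)
```

The scores are unnormalized, which is fine for classification: dividing both by the same evidence term would not change which one is larger.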

In practice, we often work with log probabilities, because calculating the likelihood for each class involves multiplying many small numbers together, which can cause numerical underflow. Taking the log of each probability and summing the logs avoids this, and since the logarithm is monotonic, the class with the highest score stays the same.
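A quick sketch of why this matters: with enough factors, the plain product falls below what a float can represent, while the sum of logs stays comfortably in range. The 200 factors of 0.01 are made-up numbers, chosen only to trigger the underflow.

```python
import math

# Hypothetical word probabilities for a long document: 200 factors of 0.01.
probabilities = [0.01] * 200

# Multiplying directly underflows: the true value (1e-400) is too small for a float.
product = math.prod(probabilities)

# Summing logs gives a finite score that can still be compared across classes.
log_score = sum(math.log(p) for p in probabilities)

print(product)
print(log_score)
```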

Naive Bayes is also easy to retrain: the model is just a set of counts, so when new data becomes available you simply update the counts and the probabilities derived from them. This is helpful when the data changes frequently.
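One way to picture this kind of updating is a model that stores only counts and recomputes probabilities on demand. The class below is a hypothetical sketch (the name `CountingNB` and its methods are invented for illustration), and the extra sentence "A close match" is a made-up new training example.

```python
from collections import Counter, defaultdict
from fractions import Fraction

class CountingNB:
    """Keeps raw counts so new examples can be folded in at any time."""

    def __init__(self):
        self.tag_counts = Counter()
        self.word_counts = defaultdict(Counter)

    def update(self, text, tag):
        """Add one labeled sentence to the counts."""
        self.tag_counts[tag] += 1
        self.word_counts[tag].update(text.lower().split())

    def prior(self, tag):
        """P(tag), computed from the current counts."""
        return Fraction(self.tag_counts[tag], sum(self.tag_counts.values()))

model = CountingNB()
for text, tag in [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]:
    model.update(text, tag)

prior_before = model.prior("Sports")
model.update("A close match", "Sports")  # hypothetical new example arrives
prior_after = model.prior("Sports")

print(prior_before, prior_after)
```

No earlier data has to be revisited: the new sentence only increments a handful of counters.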

### Types of NB Classifiers
- Gaussian Naive Bayes: continuous real-valued features
- Multinomial Naive Bayes: discrete features (e.g., word counts)
- Bernoulli Naive Bayes: binary feature vectors (e.g., word present/absent)
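The three variants mostly differ in what a feature looks like and how its likelihood is modeled. The sketch below shows the same text as count features and as binary features, plus the normal density that a Gaussian variant would use for a real-valued feature. The small vocabulary, the sample text, and the function name `gaussian_likelihood` are all assumptions made for illustration.

```python
import math
from collections import Counter

vocabulary = ["a", "clean", "game", "match", "very"]  # assumed toy vocabulary
text = "a clean game a very clean game"               # made-up sample text
counts = Counter(text.split())

# Multinomial: how many times each vocabulary word occurs.
multinomial_features = [counts[w] for w in vocabulary]

# Bernoulli: only whether each vocabulary word occurs at all.
bernoulli_features = [int(counts[w] > 0) for w in vocabulary]

def gaussian_likelihood(x, mean, variance):
    """Normal density, used as P(feature value | class) for continuous features."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(multinomial_features)
print(bernoulli_features)
print(gaussian_likelihood(0.0, 0.0, 1.0))
```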
### Advantages
Naive Bayes is fast and scales well to high-dimensional datasets. This is why it is widely used for text data, where feature vectors are very high-dimensional.

Naive Bayes can be used for both binary and multiclass classification.

It is a simple algorithm that depends on doing a bunch of counts.

It can be easily trained on a small dataset.

In log space, multinomial and Bernoulli Naive Bayes have a linear decision boundary, so they are closely related to linear models such as linear SVMs and logistic regression.

### Disadvantages
It treats all features as unrelated, so it cannot learn relationships between features. For example, say Remo is going to a party and is picking clothes from his cupboard. Remo likes to wear a white shirt. For jeans, he likes brown ones. But Remo doesn’t like wearing a white shirt with brown jeans. Naive Bayes can learn the importance of individual features but can’t capture relationships among them, like the one above (Remo doesn’t like white shirt + brown jeans).