# Naive Bayes Classifier

Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes’ theorem. It is primarily used for text classification, which involves high-dimensional training data sets. A few examples are spam filtering, sentiment analysis, and classifying news articles. It is known not only for its simplicity but also for its effectiveness: building models and making predictions with Naive Bayes is fast. Naive Bayes is a good first algorithm to consider for a text classification problem, so it is worth learning thoroughly.

These classifiers are called **naive** because they assume that features are conditionally independent of each other given the class.

Imagine two people, Alice and Bob, whose word usage patterns you know. To keep the example simple, let’s assume that Alice often uses the three words [love, great, wonderful] and Bob often uses the words [dog, ball, wonderful].

Let’s assume you received an anonymous email whose sender can be either Alice or Bob. Let’s say the content of the email is “I love beach sand. Additionally the sunset at beach offers wonderful view”.

## Can you guess who the sender might be?

Well, if you guessed Alice, you are correct. Perhaps your reasoning was that the content has the words love and wonderful, both of which are used by Alice.

Now let’s add probabilities to the data we have. Suppose Alice and Bob use the following words with the probabilities shown below. Now, can you guess who is the sender of the content “Wonderful Love”?



<img src="images/a.png">

## Now what do you think?

If you guessed Bob, you are correct. If you know the mathematics behind it, good for you. If not, don’t worry, we will work through it here.


This is where we apply **Bayes’ Theorem**.

## Bayes’ Theorem

<img src = 'images/b.png'>

It tells us how often A happens given that B happens, written P(A|B), when we know how often B happens given that A happens, written P(B|A), and how likely A and B are on their own.

- P(A|B) is “Probability of A given B”, the probability of A given that B happens
- P(A) is Probability of A
- P(B|A) is “Probability of B given A”, the probability of B given that A happens
- P(B) is Probability of B

For example, if P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then:

- P(Fire|Smoke) means how often there is fire when we see smoke. 
- P(Smoke|Fire) means how often we see smoke when there is fire.
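
As a minimal sketch of the formula, the snippet below plugs numbers into Bayes' theorem for the fire and smoke case. The three input probabilities are hypothetical values chosen only for illustration, and the helper name `bayes` is an assumption of this sketch.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, for illustration only.
p_fire = 0.01              # P(Fire): fires are rare
p_smoke = 0.10             # P(Smoke): smoke is fairly common
p_smoke_given_fire = 0.90  # P(Smoke|Fire): most fires produce smoke

p_fire_given_smoke = bayes(p_smoke_given_fire, p_fire, p_smoke)
print(f"P(Fire|Smoke) = {p_fire_given_smoke:.2f}")  # prints P(Fire|Smoke) = 0.09
```

With these made-up inputs, seeing smoke raises the probability of fire from 1% to 9%.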

### Now can you apply this to our Alice and Bob example?

#### A simple example.

<img src ='images/nb.gif'>

To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.

#### Thus, we can write:

<img src='images/nb2.gif'>

#### Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

<img src = 'images/nb3.gif'>

<img src = "images/nb4.gif">

Having formulated our **prior probability**, we are now ready to classify a new object (WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new case belongs to that particular color. To measure this **likelihood**, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we count the points in the circle belonging to each class label. From this we calculate the likelihood.

<img src ="images/nb5.gif">

From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

<img src ="images/nbe.gif">

<img src ="images/nb6.gif">

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).

<img src ="images/nbe2.gif">

#### Finally, we classify X as RED since its class membership achieves the largest posterior probability.
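
The arithmetic behind this conclusion can be checked with a short sketch. It uses only the counts stated above (40 GREEN and 20 RED objects, with 1 GREEN and 3 RED inside the circle); the variable names are illustrative, and the posterior is left unnormalized because only the comparison matters.

```python
from fractions import Fraction

# Counts taken from the example above.
total = {"GREEN": 40, "RED": 20}   # all objects per class
nearby = {"GREEN": 1, "RED": 3}    # objects inside the circle around X

n = sum(total.values())
posterior = {}
for label in total:
    prior = Fraction(total[label], n)                   # 40/60 for GREEN, 20/60 for RED
    likelihood = Fraction(nearby[label], total[label])  # 1/40 for GREEN, 3/20 for RED
    posterior[label] = prior * likelihood               # unnormalized posterior

prediction = max(posterior, key=posterior.get)
print(posterior["GREEN"], posterior["RED"], prediction)  # prints 1/60 1/20 RED
```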

Let’s see how this works in practice with a simple example. Suppose we are building a classifier that says whether a text is about sports or not. Our training data has 5 sentences:
	
<table> <tr> <th>Text</th> <th>Tag</th> </tr> <tr> <td>A great game</td> <td>Sports</td> </tr> <tr> <td>The election was over</td> <td>Not sports</td> </tr><tr> <td> Very clean match</td><td> Sports</td></tr><tr> <td> A clean but forgettable game</td><td> Sports</td></tr> <tr> <td> It was a close election</td><td>Not sports</td></tr></table>


### Feature Extraction

The first thing we need to do when creating a machine learning model is to decide what to use as features. We call features the pieces of information that we take from the text and give to the algorithm so it can work its magic. For example, if we were doing classification on health, some features could be a person’s height, weight, gender, and so on. We would exclude things that maybe are known but aren’t useful to the model, like a person’s name or favorite color.

In this case though, we don’t even have numeric features. We just have text. We need to somehow convert this text into numbers that we can do calculations on.

So what do we do? Simple! We use word frequencies. That is, we ignore word order and sentence construction, treating every document as a bag of the words it contains (a set in which repeated words are counted). Our features will be the counts of each of these words. Even though it may seem too simplistic an approach, it works surprisingly well.
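
A minimal sketch of this feature extraction, assuming we lowercase the text and split on whitespace (a real tokenizer would also handle punctuation). The function name `word_counts` is illustrative.

```python
from collections import Counter

def word_counts(text):
    """Lowercase, split on whitespace, and count each word (order is ignored)."""
    return Counter(text.lower().split())

features = word_counts("A clean but forgettable game")
print(features["game"])      # prints 1
print(features["election"])  # prints 0 (a Counter returns 0 for unseen words)

# Word order does not change the features.
print(word_counts("this party was fun") == word_counts("party fun was this"))  # prints True
```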

### Being Naive

So here comes the Naive part: we assume that **every word in a sentence is independent of the other ones**. This means that we’re no longer looking at entire sentences, but rather at individual words. So for our purposes, “this was a fun party” is the same as “this party was fun” and “party fun was this”.

#### We write this as:

<img src ="https://monkeylearn.com/blog/wp-content/ql-cache/quicklatex.com-18b07be1820ddfc98fb0b70b09a12092_l3.svg">

This assumption is **very strong but super useful**. It’s what makes this model work well with little data or data that may be mislabeled. The next step is just applying this to what we had before:

<img src ="https://monkeylearn.com/blog/wp-content/ql-cache/quicklatex.com-3f0b8947db7eb1d9c48206461c26d27e_l3.svg">

And now, all of these individual words actually show up several times in our training data, and we can calculate them!

## Calculating probabilities

The final step is just to calculate every probability and see which one turns out to be larger.

Calculating a probability is just counting in our training data.

First, we calculate the **a priori probability** of each tag: for a given sentence in our training data, the probability that it is Sports, P(Sports), is 3/5. Then, P(Not Sports) is 2/5. That’s easy enough.

Then, calculating P(game | Sports) means counting how many times the word “game” appears in Sports texts (2) and dividing by the total number of words in Sports texts (11). Therefore, P(game | Sports) = 2/11.
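
The counting can be sketched directly from the five training sentences. The helper names `prior` and `likelihood` are assumptions of this sketch, and exact fractions are used so the results match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

# Number of sentences per tag, and word counts per tag.
doc_counts = Counter(tag for _, tag in training_data)
word_counts = {tag: Counter() for tag in doc_counts}
for text, tag in training_data:
    word_counts[tag].update(text.lower().split())

def prior(tag):
    """P(tag): fraction of training sentences carrying this tag."""
    return Fraction(doc_counts[tag], len(training_data))

def likelihood(word, tag):
    """P(word | tag): occurrences of the word divided by all words under the tag."""
    return Fraction(word_counts[tag][word], sum(word_counts[tag].values()))

print(prior("Sports"))               # prints 3/5
print(prior("Not sports"))           # prints 2/5
print(likelihood("game", "Sports"))  # prints 2/11
```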

However, we run into a problem when we apply this to the sentence “a very close game”: “close” doesn’t appear in any Sports text! That means that P(close | Sports) = 0. This is rather inconvenient since we are going to be multiplying it with the other probabilities, so we’ll end up with P(a | Sports) × P(very | Sports) × 0 × P(game | Sports). This equals 0, since a product is zero whenever one of its factors is zero. Doing things this way simply doesn’t give us any information at all, so we have to find a way around it.

## Laplace Smoothing

How do we do it? By using something called Laplace smoothing: we add 1 to every count so it’s never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].

Since the number of possible words is 14 (I counted them!), applying smoothing we get that P(game | Sports) = (2+1)/(11+14). The full results are:

<table> <tr> <th>Word</th> <th>P(word | Sports)</th> <th>P(word | Not Sports)</th></tr> <tr> <td>a</td> <td>(2+1)/(11+14)</td> <td>(1+1)/(9+14)</td></tr> <tr> <td>very</td> <td>(1+1)/(11+14)</td> <td>(0+1)/(9+14)</td></tr><tr> <td>close</td> <td>(0+1)/(11+14)</td> <td>(1+1)/(9+14)</td></tr><tr> <td>game</td> <td>(2+1)/(11+14)</td> <td>(0+1)/(9+14)</td></tr></table>

Now we just multiply all the probabilities and see which one is bigger:

<img src ="https://monkeylearn.com/blog/wp-content/ql-cache/quicklatex.com-5468b798612b8637bb76b25b9754de26_l3.svg">

The class with the highest probability is considered as the most likely class. This is also known as Maximum A Posteriori (MAP).
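
The whole calculation can be reproduced with a small sketch built on the training sentences and the smoothed counts from the table. It is one possible way to write it, not a reference implementation: the function names are illustrative, and each product includes the prior P(tag), as a MAP estimate does.

```python
from collections import Counter
from fractions import Fraction

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

doc_counts = Counter(tag for _, tag in training_data)
word_counts = {tag: Counter() for tag in doc_counts}
for text, tag in training_data:
    word_counts[tag].update(text.lower().split())
vocabulary = {word for counts in word_counts.values() for word in counts}

def smoothed_likelihood(word, tag):
    """Laplace smoothing: (count + 1) / (words under tag + vocabulary size)."""
    total = sum(word_counts[tag].values())
    return Fraction(word_counts[tag][word] + 1, total + len(vocabulary))

def score(text, tag):
    """Prior times the product of smoothed word likelihoods."""
    result = Fraction(doc_counts[tag], len(training_data))
    for word in text.lower().split():
        result *= smoothed_likelihood(word, tag)
    return result

scores = {tag: score("a very close game", tag) for tag in doc_counts}
prediction = max(scores, key=scores.get)
for tag, value in scores.items():
    print(f"{tag}: {float(value):.3e}")
print("Prediction:", prediction)  # prints Prediction: Sports
```

Under these assumptions the Sports score is 54/1953125 (about 2.76e-5) and the Not sports score is 8/1399205 (about 5.72e-6), so the sentence is tagged Sports.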

## Types of Naive Bayes Algorithm

- Gaussian Naive Bayes: continuous real-valued features
- Multinomial Naive Bayes: discrete features (e.g., word counts)
- Bernoulli Naive Bayes: binary feature vectors (e.g., word present/absent)

## How to Best Prepare Your Data for Naive Bayes
- **Categorical Inputs**: Naive Bayes assumes label attributes, such as binary, categorical, or nominal values.
- **Gaussian Inputs**: If the input variables are real-valued, a Gaussian distribution is assumed. In that case the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean).
- **Classification Problems**: Naive Bayes is a classification algorithm suitable for binary and multiclass classification.
- **Log Probabilities**: The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
- **Update Probabilities**: When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
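
The log-probability tip can be illustrated with a short sketch. The list of 1000 identical likelihoods is a made-up input chosen to force underflow, not data from a real model.

```python
import math

# Hypothetical input: 1000 word likelihoods of 0.001 each.
probs = [1e-3] * 1000

# Multiplying directly underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and comparable across classes.
log_score = sum(math.log(p) for p in probs)

print(product)                # prints 0.0
print(round(log_score, 1))
```

Because the logarithm is monotonic, the class with the largest sum of log probabilities is the same class that would have had the largest product.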

## Advantages and Disadvantages of Naive Bayes Classifier
### Advantages
- Naive Bayes is fast and highly scalable for **high-dimensional datasets**. Hence it is widely used for **text data**, whose feature vectors are very high dimensional.
- Naive Bayes can be used for binary and multiclass classification. It comes in different variants such as GaussianNB, MultinomialNB, and BernoulliNB.
- It is a simple algorithm that depends on doing a bunch of counts.
- Great choice for text classification problems. It’s a popular choice for spam email classification.
- It can be easily trained on a small dataset.

### Disadvantages
- It considers all the features to be **unrelated**, so it cannot learn the relationship between features. For example, say Remo is going to a party and is choosing clothes from his cupboard. Remo likes to wear a white shirt, and he likes to wear brown jeans, but he doesn’t like wearing a white shirt with brown jeans. Naive Bayes can learn the importance of individual features but can’t determine the relationship among features.

## Further Reading 
- [Naive Bayes user guide (sklearn)](http://scikit-learn.org/stable/modules/naive_bayes.html)
- [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/index.php)