## Table of Contents:
* [Naive Bayes](#naive_bayes)
* [Bayes Theorem](#bayes_theorem)
* [Different types of Naive Bayes Classifier](#naive_bayes_types)
* [Questions](#questions)

## Naive Bayes <a class="anchor" id="naive_bayes"></a>
--> Naive Bayes is a generative model. <br>
--> Naive Bayes is a set of supervised learning algorithms based on applying Bayes' theorem. <br>
--> Tasks: mainly classification (regression variants exist but are uncommon)

--> <b>Assumptions:</b> <br>
Features are conditionally independent given the class (no correlation between features within a class).

## Bayes Theorem <a class="anchor" id="bayes_theorem"></a>
$ 
\begin{align}
P(A|B) = \frac{P(B|A) P(A)}{P(B)} 
\end{align}
$

A and B are events with $P(B) > 0$ (they do not need to be independent) <br>
$ P(A|B) $ - conditional probability of A given that B happened ~ <b>posterior</b><br>
$ P(B|A) $ - conditional probability of B given that A happened ~ <b>likelihood</b><br>
$ P(A) $ - marginal probability of A on its own ~ <b>prior</b><br>
$ P(B) $ - marginal probability of B on its own ~ <b>evidence</b><br>
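
As a small worked example of the formula, the sketch below computes a posterior from a prior and two likelihoods. The function name `posterior` and all numbers are hypothetical, chosen only for illustration; the evidence $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) = P(B|A) P(A) / P(B).

    The evidence P(B) is expanded as P(B|A) P(A) + P(B|not A) P(not A).
    """
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05
p_a_given_b = posterior(0.01, 0.9, 0.05)
print(p_a_given_b)
```

With these made-up inputs the posterior comes out at roughly 0.15: even a strong likelihood is pulled down by a small prior.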

Written in terms of <b>ML use cases:</b> <br>
$ 
\begin{align}
P(y|X) = \frac{P(X|y) P(y)}{P(X)} 
\end{align}
$

$y$ is the target class. <br>
$X$ is the feature vector ~ $[x_1, x_2, x_3, ..., x_d]$ <br>
$d$ is the number of features
<br>

Let's say <br>
we have three features X = $[x_1, x_2, x_3]$, and <br>
there are two target classes $y_1$ and $y_2$. <br>
Then the equations will be: <br>

$ 
\begin{align}
P(y=y_1|X=[x_1,x_2,x_3]) \\ 
& = \frac{P(X=[x_1,x_2,x_3]|y=y_1) P(y=y_1)}{P(X=[x_1,x_2,x_3])}  \\ 
& = \frac{P(X=x_1|y=y_1) P(X=x_2|y=y_1) P(X=x_3|y=y_1) P(y=y_1)}{P(X=[x_1,x_2,x_3])}
\end{align}
$

<br>
<br>

$ 
\begin{align}
P(y=y_2|X=[x_1,x_2,x_3]) \\ 
& = \frac{P(X=[x_1,x_2,x_3]|y=y_2) P(y=y_2)}{P(X=[x_1,x_2,x_3])} \\ 
& = \frac{P(X=x_1|y=y_2) P(X=x_2|y=y_2) P(X=x_3|y=y_2) P(y=y_2)}{P(X=[x_1,x_2,x_3])} 
\end{align}
$

<br>
<br>

Now compare $ P(y=y_1|X=[x_1,x_2,x_3]) $ and $ P(y=y_2|X=[x_1,x_2,x_3]) $ <br>
Whichever is higher decides whether $[x_1, x_2, x_3]$ (~ test/inference data) is assigned to $y_1$ or $y_2$. <br>

<br>
For this comparison, the denominator is the same in both equations, <br>
so we can drop it and rewrite the above two equations as: <br>

$ 
\begin{align}
P(y=y_1|X=[x_1,x_2,x_3]) \propto P(X=x_1|y=y_1) P(X=x_2|y=y_1) P(X=x_3|y=y_1) P(y=y_1)
\end{align}
$

$ 
\begin{align}
P(y=y_2|X=[x_1,x_2,x_3]) \propto P(X=x_1|y=y_2) P(X=x_2|y=y_2) P(X=x_3|y=y_2) P(y=y_2)
\end{align}
$

So, in a concise manner, we can write, <br><br>

$K$ is the number of classes, <br>
$d$ is the number of features <br>

$
\begin{align}
& P(y_k|X) = \frac{P(X|y_k) P(y_k)}{P(X)} \\ 
& P(y_k|X) = \frac{P(X = [x_1, x_2, .. ,x_d] |y_k) P(y_k)}{P(X = [x_1, x_2, .., x_d])} \\
& P(y_k|X) \propto P(X = [x_1, x_2, .. ,x_d] |y_k) P(y_k) \\
& P(y_k|X) \propto P(y_k) P(X = [x_1, x_2, .., x_d] |y_k) \\
& P(y_k|X) \propto P(y_k) P(X = x_1 |y_k) P(X = x_2 |y_k) .. P(X = x_d |y_k) \\
& P(y_k|X) \propto P(y_k) \prod_{i=1}^{d} p(x_i | y_k)\\
\end{align}
$
<br>
$
\begin{align}
\hat{y} = \underset{k \in \{1, .., K\}}{\mathrm{arg\,max}} P(y_k) \prod_{i=1}^{d} p(x_i | y_k)
\end{align}
$
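
The decision rule above can be sketched in a few lines of Python. The priors and per-feature likelihoods below are hypothetical, hand-picked values (not estimated from any data), and the names `priors`, `likelihoods`, and `unnormalized_posterior` are illustrative only.

```python
from math import prod

# Hypothetical class priors P(y_k) for two classes.
priors = {"y1": 0.6, "y2": 0.4}

# likelihoods[k][i] stands for p(x_i | y_k), already evaluated at the
# observed feature values x_1, x_2, x_3 (hypothetical numbers).
likelihoods = {
    "y1": [0.5, 0.2, 0.9],
    "y2": [0.3, 0.7, 0.8],
}


def unnormalized_posterior(k: str) -> float:
    """Return P(y_k) * prod_i p(x_i | y_k), i.e. the posterior up to 1/P(X)."""
    return priors[k] * prod(likelihoods[k])


scores = {k: unnormalized_posterior(k) for k in priors}

# arg max over classes gives the predicted label.
y_hat = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers proper posteriors P(y_k | X).
total = sum(scores.values())
posteriors = {k: s / total for k, s in scores.items()}
print(y_hat, posteriors)
```

Here the scores are 0.054 for `y1` and 0.0672 for `y2`, so the sample is assigned to `y2` even though `y1` has the larger prior.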

<br> <br>

Here the <b>ASSUMPTION</b> is, <br>
the features are conditionally independent given the class label. It allows us to write the class-conditional density $P(X|y_k)$ as a <b>product</b> $\prod_{i=1}^{d} p(x_i | y_k)$, which gives: <br>
$ P(y_k|X) \propto P(y_k) \prod_{i=1}^{d} p(x_i | y_k) $

==> maximum a posteriori (MAP) hypothesis. <br>
MAP(h) = argmax(P(h|data)) ~ the most probable hypothesis given the data <br>
$
\begin{align}
MAP(h) = \underset{k}{\mathrm{arg\,max}} \; P(X|y_k) P(y_k)
\end{align}
$
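
In practice the MAP rule is usually evaluated in log space, because a product of many small probabilities underflows to zero in floating point. The sketch below illustrates this with hypothetical numbers (2000 features with a constant per-feature likelihood); `map_class` is an illustrative helper, not part of any library.

```python
import math


def map_class(log_priors, log_likelihoods):
    """Return (best class, scores) where score_k = log P(y_k) + sum_i log p(x_i | y_k)."""
    scores = {k: log_priors[k] + sum(log_likelihoods[k]) for k in log_priors}
    return max(scores, key=scores.get), scores


# Hypothetical setup: d = 2000 features, each with a small likelihood.
d = 2000
log_priors = {"y1": math.log(0.5), "y2": math.log(0.5)}
log_likelihoods = {
    "y1": [math.log(0.01)] * d,
    "y2": [math.log(0.02)] * d,
}

# The plain product underflows to 0.0, so the classes cannot be compared ...
naive_product = 0.5 * math.prod([0.01] * d)

# ... while the log-space scores stay finite and keep the same ordering.
best, scores = map_class(log_priors, log_likelihoods)
print(naive_product, best)
```

Because the logarithm is monotonic, the arg max over log scores picks the same class as the arg max over the original products.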

## Different types of NB classifiers <a class="anchor" id="naive_bayes_types"></a>

1) Categorical NB <br>
Used when all the features are categorical. [It can also be applied to continuous data after the values are discretized (binned).]<br>
[NaiveBayesCategorical.ipynb](https://github.com/jaydeepchakraborty/NLP/blob/master/NaiveBayesCategorical.ipynb)
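
A minimal from-scratch sketch of a categorical Naive Bayes is shown below; it is not the code from the linked notebook. It assumes Laplace (add-`alpha`) smoothing, which the text above does not mention, and the toy dataset and function names are hypothetical.

```python
import math
from collections import Counter


def fit_categorical_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-feature category probabilities."""
    n_features = len(X[0])
    class_counts = Counter(y)
    # value_counts[k][i][v] = number of class-k rows whose feature i equals v
    value_counts = {k: [Counter() for _ in range(n_features)] for k in class_counts}
    categories = [set() for _ in range(n_features)]
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            value_counts[label][i][v] += 1
            categories[i].add(v)
    log_prior = {k: math.log(c / len(y)) for k, c in class_counts.items()}

    def log_likelihood(k, i, v):
        # Add-alpha smoothing (an assumption) avoids log(0) for unseen pairs.
        num = value_counts[k][i][v] + alpha
        den = class_counts[k] + alpha * len(categories[i])
        return math.log(num / den)

    return log_prior, log_likelihood


def predict(model, row):
    log_prior, log_likelihood = model
    scores = {
        k: lp + sum(log_likelihood(k, i, v) for i, v in enumerate(row))
        for k, lp in log_prior.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: (outlook, windy) -> decision
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"), ("sunny", "no")]
y = ["play", "play", "stay", "stay", "play"]
model = fit_categorical_nb(X, y)
print(predict(model, ("rainy", "yes")))
```

Each feature contributes one smoothed frequency ratio per class, and the prediction is the arg max of the summed log terms.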


2) Gaussian NB <br>
Used when all the features are continuous; each feature is modeled with a per-class Gaussian (normal) distribution.<br>
[NaiveBayesGaussian.ipynb](https://github.com/jaydeepchakraborty/NLP/blob/master/NaiveBayesGaussian.ipynb)
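
The following is a minimal sketch of Gaussian Naive Bayes, not the code from the linked notebook. It assumes a per-class, per-feature normal density and adds a small variance floor (`var_floor`, a hypothetical parameter) to avoid division by zero; the data is made up.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for every class."""
    model = {}
    for k in set(y):
        rows = [row for row, label in zip(X, y) if label == k]
        columns = list(zip(*rows))
        model[k] = {
            "log_prior": math.log(len(rows) / len(y)),
            "mean": [mean(c) for c in columns],
            "var": [pvariance(c) + var_floor for c in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, row):
    scores = {
        k: p["log_prior"]
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, p["mean"], p["var"]))
        for k, p in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data with two well-separated classes.
X = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (5.0, 6.2), (5.1, 5.8), (4.9, 6.0)]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, (1.1, 2.0)), predict(model, (5.0, 6.0)))
```

The only change from the categorical case is how $p(x_i | y_k)$ is evaluated: a density lookup instead of a frequency ratio.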

3) Bernoulli NB <br>
Features are binary (0 and 1 only). <br>
Bernoulli NB models whether each feature is present or absent. <br>
[NaiveBayesBernoulli.ipynb](https://github.com/jaydeepchakraborty/NLP/blob/master/NaiveBayesBernoulli.ipynb)


Text/NLP ~ word's presence <br>

It assumes that all the features are binary, i.e. they take only two values: 0 can represent 
"word does not occur in the document" and 1 "word occurs in the document".
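
A minimal sketch of Bernoulli Naive Bayes on word-presence features follows; it is not the code from the linked notebook. Note that absent words also contribute, through the factor $1 - p$. The smoothing parameter `alpha`, the vocabulary, and the documents are hypothetical.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(x_i = 1 | y_k) for every class."""
    model = {}
    for k in set(y):
        rows = [row for row, label in zip(X, y) if label == k]
        n = len(rows)
        # Add-alpha smoothing over the two outcomes (0 and 1) of each feature.
        p = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[k] = (math.log(n / len(y)), p)
    return model


def predict(model, row):
    scores = {}
    for k, (log_prior, p) in model.items():
        # Present words contribute log p, absent words contribute log (1 - p).
        scores[k] = log_prior + sum(
            math.log(pi) if x == 1 else math.log(1.0 - pi) for x, pi in zip(row, p)
        )
    return max(scores, key=scores.get)


# Hypothetical vocabulary: ["free", "win", "meeting"]; 1 = word occurs in the document.
X = [(1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1), (1, 0, 1)]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = fit_bernoulli_nb(X, y)
print(predict(model, (1, 1, 0)), predict(model, (0, 0, 1)))
```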

4) Multinomial NB <br>
Features are non-negative integer counts. <br>
Multinomial Naive Bayes considers a feature vector in which each entry is the number of times a given term appears, i.e. its frequency.  <br>
[NaiveBayesMultinomial.ipynb](https://github.com/jaydeepchakraborty/NLP/blob/master/NaiveBayesMultinomial.ipynb)

Text/NLP ~ word's frequency <br>

It is used when we have discrete data (e.g. the frequency of words present in the document). In text learning 
we use the count of each word to predict the class or label.
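
A minimal sketch of Multinomial Naive Bayes on word counts follows; it is not the code from the linked notebook. It assumes add-`alpha` smoothing, and the vocabulary and counts are hypothetical. Each word's log probability is weighted by how often the word occurs in the document.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    n_features = len(X[0])
    model = {}
    for k in set(y):
        rows = [row for row, label in zip(X, y) if label == k]
        totals = [sum(col) for col in zip(*rows)]  # total count of each word in class k
        denom = sum(totals) + alpha * n_features
        log_theta = [math.log((t + alpha) / denom) for t in totals]
        model[k] = (math.log(len(rows) / len(y)), log_theta)
    return model


def predict(model, counts):
    scores = {
        k: log_prior + sum(c * lt for c, lt in zip(counts, log_theta))
        for k, (log_prior, log_theta) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical vocabulary: ["free", "win", "meeting"]; entries are word counts.
X = [(3, 1, 0), (2, 2, 0), (0, 0, 3), (1, 0, 2)]
y = ["spam", "spam", "ham", "ham"]
model = fit_multinomial_nb(X, y)
print(predict(model, (2, 1, 0)), predict(model, (0, 0, 2)))
```

Compared with the Bernoulli variant, absent words contribute nothing here (count 0), while repeated words count multiple times.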

### Resources
1) https://remykarem.github.io/blog/naive-bayes.html
2) https://www.youtube.com/watch?v=IvTCdrx1SHQ
3) https://github.com/ConsciousML/Naive-Bayes-Classifier-from-scratch/blob/master/Naive%20Bayes%20Classifier.ipynb
4) https://machinelearningmastery.com/naive-bayes-for-machine-learning/
5) https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html

## QUESTIONS <a class="anchor" id="questions"></a>

1) <b>Is Naive Bayes affected by imbalanced data? If yes, how to resolve it?</b> <br>
Yes, Naive Bayes is affected by imbalanced data. The posterior is proportional to the 
likelihood times the class prior, so a heavily skewed prior can dominate the prediction.

<br>

For example, with the priors below, the prior of class y0 is roughly 578 times higher than that of class y1; this difference biases Naive Bayes towards class y0.

'P_y1:- 0.001727485630620034' <br>
'P_y0:- 0.9982725143693799'

Possible solutions: <br>
ignore the prior probabilities, i.e. use uniform priors (the simplest option) <br>
undersampling the majority class <br>
oversampling the minority class <br>
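
The effect of the prior can be sketched with the two prior values quoted above. The likelihoods below are hypothetical: the sample is assumed to be 100 times more likely under y1 than under y0, yet the empirical prior still pushes the prediction to y0.

```python
import math

# Class priors quoted above.
empirical_log_prior = {
    "y0": math.log(0.9982725143693799),
    "y1": math.log(0.001727485630620034),
}
# "Ignoring the prior" is equivalent to using the same prior for every class.
uniform_log_prior = {"y0": math.log(0.5), "y1": math.log(0.5)}

# Hypothetical likelihoods P(X | y_k) for a single sample.
log_likelihood = {"y0": math.log(0.001), "y1": math.log(0.1)}


def predict(log_prior):
    scores = {k: log_prior[k] + log_likelihood[k] for k in log_prior}
    return max(scores, key=scores.get)


with_empirical_prior = predict(empirical_log_prior)
with_uniform_prior = predict(uniform_log_prior)
print(with_empirical_prior, with_uniform_prior)
```

With the empirical prior the sample is labeled y0; with uniform priors the likelihood decides and the label becomes y1. Whether that trade-off is desirable depends on the cost of each kind of error.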

2) <b>How to combine categorical and numerical features?</b> <br>
https://medium.com/analytics-vidhya/use-naive-bayes-algorithm-for-categorical-and-numerical-data-classification-935d90ab273f
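
One common approach, which is not necessarily the one used in the linked article, follows directly from the independence assumption: score numerical features with a Gaussian density and categorical features with a frequency table, then add all the log terms. The sketch below uses hypothetical, pre-estimated parameters and made-up feature names (`age`, `color`).

```python
import math


def log_gaussian(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


# Hypothetical per-class parameters, as if already estimated from training data.
params = {
    "y1": {"prior": 0.5, "age_mean": 25.0, "age_var": 16.0, "color": {"red": 0.7, "blue": 0.3}},
    "y2": {"prior": 0.5, "age_mean": 50.0, "age_var": 25.0, "color": {"red": 0.2, "blue": 0.8}},
}


def predict(age, color):
    scores = {
        k: math.log(p["prior"])
        + log_gaussian(age, p["age_mean"], p["age_var"])  # numerical feature
        + math.log(p["color"][color])  # categorical feature
        for k, p in params.items()
    }
    return max(scores, key=scores.get)


print(predict(27.0, "blue"), predict(48.0, "red"))
```

Because each feature only contributes its own factor $p(x_i | y_k)$, different features can use different distributions without changing the decision rule.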