# Naive Bayes


---


## References

[Geeks for Geeks - Naive Bayes Classifiers](https://www.geeksforgeeks.org/naive-bayes-classifiers/)

[Geeks for Geeks - Multinomial Naive Bayes](https://www.geeksforgeeks.org/multinomial-naive-bayes/)

[Geeks for Geeks - Gaussian Naive Bayes](https://www.geeksforgeeks.org/gaussian-naive-bayes/)

[Geeks for Geeks - Bernoulli Naive Bayes](https://www.geeksforgeeks.org/bernoulli-naive-bayes/)

[Scikit Learn - Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)


---


## Notes


#### Characteristics
- Supervised Learning
- Classification
- Based on Bayes' Theorem and conditional probability


#### Assumptions
- Conditional independence among features
    - Given label, all features are independent
- Features follow specific distributions
    - Gaussian, Multinomial, or Bernoulli


#### Input & Output
- **Input**: features $X$ 
- **Output**: predicted label $y$

#### Parameters
- $P(y)$ for all labels $y$
- $P(x\mid y)$ for all labels $y$ and feature value $x$

#### Runtime Complexity
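- **Training**: $O(d)$ per sample, so $O(nd)$ over $n$ samples
- **Inference**: $O(d)$ per candidate label, so $O(Kd)$ per sample

where $d$ is the number of features and $K$ is the number of labels.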

#### Pros & Cons
- **Advantages**: 
    - few parameters to estimate
    - fast training and inference
    - performs well on categorical features
- **Disadvantages**: 
    - the independence assumption rarely holds exactly on real-world data
    - a feature value never seen with a label during training gets zero probability unless smoothing is applied


#### Applications
- spam filtering
- sentiment analysis
- classifying texts

---

## Mathematics

### Bayes' Theorem

$$P(y\mid X) = \frac{P(X\mid y) P(y)}{P(X)}=\frac{P(x_1,x_2,\ldots,x_d \mid y)P(y)}{P(x_1,x_2,\ldots,x_d)}$$

- $P(y \mid X)$: posterior probability of label $y$ given features $X$
- $P(X \mid y)$: likelihood of features $X$ given label $y$
- $P(y)$: prior probability of label $y$
- $P(X)$: probability of features $X$ (the evidence)

With conditional independence between $x_1, x_2, \ldots, x_d$ given $y$, only the likelihood factorizes; the evidence stays a joint probability:
$$P(y\mid x_1, \ldots, x_d) = \frac{P(y)P(x_1\mid y)P(x_2\mid y)\cdots P(x_d\mid y)}{P(x_1,x_2,\ldots,x_d)} = \frac{P(y)\prod\limits_{i=1}^{d}P(x_i\mid y)}{P(x_1,x_2,\ldots,x_d)}$$

The denominator does not depend on $y$, so it can be dropped when looking for the label with the highest posterior probability given the input features:
$$P(y\mid x_1, \ldots, x_d)\propto P(y)\prod\limits_{i=1}^{d}P(x_i\mid y)$$
$$\hat{y}=\arg\max_{y}\bigg(P(y)\prod\limits_{i=1}^{d}P(x_i\mid y)\bigg)$$
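The decision rule above can be sketched in plain Python. This is only an illustration under assumptions: the `predict` function, the dictionary layout of `likelihoods`, and the toy probabilities are all hypothetical and not from the references. Scores are summed in log space, a common way to avoid underflow when many small probabilities are multiplied.

```python
import math

def predict(priors, likelihoods, x):
    """Return the label maximizing log P(y) + sum_i log P(x_i | y).

    priors: dict mapping label -> P(y)
    likelihoods: dict mapping label -> list with one dict per feature,
                 each mapping a feature value -> P(x_i = value | y)
    x: list of feature values
    """
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            # Assumes every probability is > 0 (e.g. after smoothing);
            # math.log raises ValueError on 0.
            score += math.log(likelihoods[label][i][value])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy parameters: two labels, two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

print(predict(priors, likelihoods, [1, 0]))  # spam: 0.4*0.8*0.7 beats ham: 0.6*0.1*0.4
```

Because the logarithm is monotonic, the label with the largest log score is the same label that maximizes the product.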

All three Naive Bayes models below estimate the prior in the same way:
$$P(y)=\frac{N_y}{n}$$
where
- $N_y$: number of training samples with label $y$
- $n$: total number of training samples
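As a minimal sketch, the prior estimate is just a relative frequency over the training labels. The helper name `estimate_priors` and the toy labels are made up for illustration.

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(y) as (samples with label y) / (total samples)."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}

# Hypothetical training labels: 2 spam, 3 ham.
labels = ["spam", "ham", "ham", "spam", "ham"]
priors = estimate_priors(labels)
print(priors)
```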

### Multinomial Naive Bayes

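Used when features represent counts or frequencies
- text classification

With Laplace (add-one) smoothing, the likelihood is estimated as

$$P(x_i\mid y) = \frac{N_{x_i,y}+1}{N_y+V}$$

where
- $N_{x_i,y}$: count of word $x_i$ across all training documents with label $y$
- $N_{y}$: total count of words in label $y$ (a word count here, not the sample count used in the prior)
- $V$: vocabulary size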
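A small sketch of this estimate on tokenized documents follows. The function name `fit_multinomial` and the toy corpus are hypothetical; the add-one term in the numerator and the vocabulary size in the denominator mirror the formula above.

```python
from collections import Counter

def fit_multinomial(docs, labels):
    """Estimate P(word | label) with add-one smoothing.

    docs: list of token lists; labels: one label per document.
    Returns (vocab, probs) where probs[label][word] = P(word | label).
    """
    vocab = sorted({word for doc in docs for word in doc})
    word_counts = {}
    for doc, label in zip(docs, labels):
        word_counts.setdefault(label, Counter()).update(doc)

    V = len(vocab)
    probs = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())  # total word count in this label
        # Counter returns 0 for unseen words, so they get 1 / (total + V).
        probs[label] = {w: (counts[w] + 1) / (total + V) for w in vocab}
    return vocab, probs

# Hypothetical toy corpus.
docs = [["free", "money", "free"], ["meeting", "today"], ["free", "meeting"]]
labels = ["spam", "ham", "ham"]
vocab, probs = fit_multinomial(docs, labels)

print(probs["spam"]["free"])   # (2 + 1) / (3 + 4)
print(probs["ham"]["money"])   # (0 + 1) / (4 + 4), nonzero despite being unseen
```

Thanks to smoothing, a word that never appears with a label still receives a small nonzero probability, and the estimates for each label sum to 1 over the vocabulary.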

### Gaussian Naive Bayes

Used for continuous numerical features

$$P(x_i\mid y)=\frac{1}{\sqrt{2\pi \sigma_{i,y}^2}}e^{-\frac{(x_i-\mu_{i,y})^2}{2\sigma_{i,y}^2}}$$

where
- $\mu_{i,y}$: mean of feature $x_i$ for label $y$
- $\sigma_{i,y}^2$: variance of feature $x_i$ for label $y$
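The Gaussian likelihood and its parameters can be sketched as follows, estimating one mean and one variance per label and per feature. The names `gaussian_pdf` and `fit_gaussian` and the toy data are hypothetical. Using the population variance (`statistics.pvariance`) is an assumption of this sketch; a feature with zero variance within a label would cause a division by zero, which practical implementations typically guard against.

```python
import math
from statistics import fmean, pvariance

def gaussian_pdf(x, mean, var):
    """Gaussian density at x for the given mean and variance (var > 0)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(X, y):
    """Return params[label] = [(mean, variance), ...] with one pair per feature."""
    params = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        columns = list(zip(*rows))  # transpose: one tuple per feature
        params[label] = [(fmean(col), pvariance(col)) for col in columns]
    return params

# Hypothetical toy data: two continuous features, two labels.
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]]
y = ["a", "a", "b", "b"]
params = fit_gaussian(X, y)

mean, var = params["a"][0]           # first feature under label "a"
print(mean, var)
print(gaussian_pdf(2.0, mean, var))  # density at the mean
```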

### Bernoulli Naive Bayes

Used for binary features
- presence or absence in text classification

$$P(x_i\mid y) = P(i\mid y)(x_i) + (1-P(i\mid y))(1-x_i)=\begin{cases}P(i\mid y)&x_i=1\\1-P(i\mid y)&x_i=0\end{cases}$$

where
- $P(i\mid y)$: probability that event $i$ happens given label $y$
- $x_i$: indicator variable of event $i$
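A minimal sketch of the Bernoulli likelihood for a whole binary feature vector, multiplying the per-feature term above across features. The function name `bernoulli_likelihood` and the probabilities in `p_spam` are made up for illustration.

```python
def bernoulli_likelihood(x, p):
    """P(x | y) for a binary vector x, where p[i] = P(i | y)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        # Contributes p_i when x_i == 1 and (1 - p_i) when x_i == 0.
        result *= p_i * x_i + (1 - p_i) * (1 - x_i)
    return result

# Hypothetical P(i | spam) for three binary features.
p_spam = [0.8, 0.1, 0.5]
likelihood = bernoulli_likelihood([1, 0, 1], p_spam)
print(likelihood)  # 0.8 * (1 - 0.1) * 0.5, up to floating-point rounding
```

Note that an absent feature ($x_i=0$) still contributes a factor of $1-P(i\mid y)$, so absence counts as evidence.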


---


## Comments