# **Naive Bayes**

Naive Bayes is a family of probabilistic classification algorithms based on **Bayes' Theorem**. It assumes that all features are conditionally independent given the class label, which simplifies the calculations but rarely holds exactly in real-world datasets.

---

## **1. Why "Naive"?**

The algorithm is called "naive" because it makes the simplifying assumption that all features in the dataset are independent of each other, given the class label. While this assumption is rarely true in practice, the algorithm often performs well despite this limitation.
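
Written out for features $x_1, \dots, x_n$ and class $C$ (symbols introduced here for illustration), the independence assumption says the joint likelihood factorizes into per-feature terms, so the posterior can be scored up to a constant as:
$$
P(x_1, \dots, x_n|C) = \prod_{i=1}^n P(x_i|C)
\quad\Rightarrow\quad
P(C|x_1, \dots, x_n) \propto P(C) \prod_{i=1}^n P(x_i|C)
$$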

---

## **2. Conditional Probability**

**Definition**:  
Conditional probability is the probability of an event occurring given that another event has already occurred. It is expressed as:  
$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$
Where:  
- $P(A|B)$: Probability of event $A$ given $B$.  
- $P(A \cap B)$: Probability of both $A$ and $B$ occurring.  
- $P(B)$: Probability of $B$ occurring, which must be greater than zero.
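
As a small illustration of the definition, the sketch below computes $P(A|B)$ by counting outcomes. The die roll and the events `A` and `B` are hypothetical choices made for this example, and the helper names `prob` and `conditional` are not from any library.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) is zero."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / p_b


p_a_given_b = conditional(A, B)
print(p_a_given_b)  # 2/3
```

Knowing that the roll is greater than 3 raises the probability of an even roll from 1/2 to 2/3, because two of the three remaining outcomes are even.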

---

## **3. Bayes' Theorem**

**Definition**:  
Bayes' Theorem relates the conditional and marginal probabilities of events. It is expressed as:  
$$
P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
$$
Where:  
- $P(H|E)$: Posterior probability (probability of hypothesis $H$ given evidence $E$).  
- $P(E|H)$: Likelihood (probability of evidence $E$ given hypothesis $H$).  
- $P(H)$: Prior probability (initial probability of hypothesis $H$).  
- $P(E)$: Marginal probability (probability of evidence $E$).

### **Interpretation**  
Bayes' Theorem allows us to update our belief about a hypothesis as new evidence is introduced.
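
To make the update concrete, here is a minimal sketch for a binary hypothesis. The scenario (a diagnostic test) and all three input numbers are invented for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(H|E) for a binary hypothesis H using Bayes' Theorem.

    prior               -- P(H)
    likelihood          -- P(E|H)
    false_positive_rate -- P(E|not H)
    """
    # P(E) via the law of total probability over H and not-H.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Assumed numbers: 1% of people have the condition, the test detects it
# 95% of the time, and it gives a false positive 5% of the time.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior is small: most positive results come from the much larger group without the condition.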

---

## **4. Types of Naive Bayes**

### **a. Gaussian Naive Bayes**
- **Description**: Assumes that the continuous features follow a Gaussian (normal) distribution.
- **Likelihood Formula**: For a feature $x_i$, the probability is:  
$$
P(x_i|C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x_i - \mu_C)^2}{2\sigma_C^2}\right)
$$
  Where:
  - $\mu_C$: Mean of the feature for class $C$.  
  - $\sigma_C^2$: Variance of the feature for class $C$.

- **When to Use**: Suitable for datasets with continuous numerical features that follow a normal distribution.
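
The likelihood formula above can be sketched in a few lines of standard-library Python. The helper names `fit_class` and `gaussian_likelihood` are hypothetical, the height values are made up, and the variance is the population (maximum-likelihood) estimate, which is one common choice rather than the only one.

```python
import math


def fit_class(values):
    """Estimate the mean and population variance of one feature for one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_likelihood(x, mean, var):
    """Evaluate the Gaussian density P(x | C) with the class's mean and variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * var))


# Hypothetical heights (cm) observed for a single class.
heights = [170, 172, 168, 174, 166]
mu, sigma2 = fit_class(heights)
print(mu, sigma2)  # 170.0 8.0
print(round(gaussian_likelihood(170, mu, sigma2), 3))  # 0.141
```

A full classifier would repeat this for every feature and class, then combine the per-feature likelihoods with the class prior, usually by summing logarithms to avoid numerical underflow.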

---

### **b. Multinomial Naive Bayes**
- **Description**: Designed for discrete count data, such as word counts in text classification.
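- **Formula**: The probability of a document $D$ belonging to class $C$ is:  
$$
P(C|D) \propto P(C) \prod_{i=1}^n P(w_i|C)^{x_i}
$$
  Where $w_i$ is the $i$-th feature (for example, a vocabulary word), $x_i$ is its count in $D$, and $P(w_i|C)$ is the probability of that feature occurring in class $C$.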

- **When to Use**: Ideal for text data or data with discrete counts, such as word frequency in document classification tasks.
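
A minimal sketch of multinomial scoring on word counts follows. The toy corpus and the names `train`, `log_posterior`, and `predict` are hypothetical. The add-alpha (Laplace) smoothing is an addition not covered in the text above; it is assumed here so that a word unseen in one class does not force that class's probability to zero.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs.
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize", "money"], "spam"),
    (["meeting", "schedule", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_docs}
for tokens, label in train:
    word_counts[label].update(tokens)


def log_posterior(tokens, c, alpha=1.0):
    """Unnormalized log P(c | tokens) with add-alpha smoothing (an assumption)."""
    total = sum(word_counts[c].values())
    score = math.log(class_docs[c] / len(train))  # log prior
    for w, count in Counter(tokens).items():
        if w not in vocab:
            continue  # skip words never seen during training
        p_w = (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
        # A word appearing `count` times contributes its log-probability `count` times.
        score += count * math.log(p_w)
    return score


def predict(tokens):
    """Return the class with the highest unnormalized log posterior."""
    return max(class_docs, key=lambda c: log_posterior(tokens, c))


print(predict(["win", "money"]))      # spam
print(predict(["meeting", "notes"]))  # ham
```

Working in log space turns the product into a sum, which keeps long documents from underflowing to zero.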

---

### **c. Bernoulli Naive Bayes**
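- **Description**: Assumes binary features, where each feature records only the presence or absence of something.
- **Formula**: For a binary feature vector $X = (x_1, \dots, x_n)$, the probability for class $C$ is:  
$$
P(C|X) \propto P(C) \prod_{i=1}^n p_{i,C}^{x_i} \cdot (1 - p_{i,C})^{(1 - x_i)}
$$
  Where $x_i \in \{0, 1\}$ and $p_{i,C} = P(x_i = 1|C)$ is the probability that feature $i$ is present in class $C$.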
- **When to Use**: Suitable for binary feature datasets, such as a bag-of-words model indicating the presence or absence of specific words.
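
The sketch below shows one way the Bernoulli model might be fitted and scored. The data, the feature meanings, and the names `fit`, `log_posterior`, and `predict` are hypothetical, and the add-alpha smoothing is an assumption added to keep every estimated probability strictly between 0 and 1.

```python
import math

# Hypothetical binary features per document:
# [contains "free", contains "meeting", contains "offer"]
X = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]
y = ["spam", "spam", "ham", "ham"]


def fit(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(x_i = 1 | C) for each feature."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Add-alpha smoothing over the two possible values (0 and 1).
        p = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        model[c] = (prior, p)
    return model


def log_posterior(model, x, c):
    """Unnormalized log P(c | x) for a binary feature vector x."""
    prior, p = model[c]
    score = math.log(prior)
    for x_i, p_i in zip(x, p):
        # Present features add log(p_i); absent features add log(1 - p_i).
        score += math.log(p_i) if x_i else math.log(1 - p_i)
    return score


def predict(model, x):
    """Return the class with the highest unnormalized log posterior."""
    return max(model, key=lambda c: log_posterior(model, x, c))


model = fit(X, y)
print(predict(model, [1, 0, 0]))  # spam
print(predict(model, [0, 1, 0]))  # ham
```

Note that an absent feature still changes the score here, which is the main practical difference from a count-based model that only looks at the features that occur.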

---

## **5. When to Use Which Naive Bayes Variant**

1. **Gaussian Naive Bayes**:
   - Use for continuous features that are approximately normally distributed.
   - Example: Predicting outcomes based on continuous measurements like age, weight, or height.

2. **Multinomial Naive Bayes**:
   - Use for discrete count data.
   - Example: Text classification, such as spam detection or sentiment analysis.

3. **Bernoulli Naive Bayes**:
   - Use for binary feature data.
   - Example: Binary document classification based on the presence or absence of specific words.

---

## **6. Key Strengths and Weaknesses**

### **Strengths**:
- Simple, fast, and easy to implement.
- Works well on high-dimensional data.
- Can be trained on relatively little data, because it only estimates a small set of per-class parameters for each feature.

### **Weaknesses**:
- Assumes independence among features, which may not hold in real-world data.
- Gaussian Naive Bayes may struggle with continuous features that do not follow a normal distribution.

Naive Bayes is an effective baseline for a wide range of classification problems, especially when the chosen variant's assumptions match the characteristics of the dataset.
