# A Little Math

### 1. Bayes’ Theorem

Bayes’ theorem tells us how to calculate the probability of something (say $A$) happening when we already know something else (say $B$) happened:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

* $P(A \mid B)$ = probability of $A$ given $B$
* $P(B \mid A)$ = probability of $B$ given $A$
* $P(A)$ = overall probability of $A$
* $P(B)$ = overall probability of $B$
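
In practice $P(B)$ is often not given directly. One common way to obtain it, assuming $A$ either happens or does not happen (written $\neg A$ here, a symbol not used elsewhere in these notes), is the law of total probability:

$$
P(B) = P(B \mid A) \cdot P(A) + P(B \mid \neg A) \cdot P(\neg A)
$$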

---

### 2. Example (Bayes in daily life)

Suppose:

* 1% of emails are spam.
* If an email is spam, there’s a 90% chance it contains the word *“offer”*.
* If an email is not spam, there’s a 20% chance it contains the word *“offer”*.

Now, if an email has the word *“offer”*, what’s the probability it’s spam?
That’s exactly where Bayes’ theorem helps.
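
The question above can be worked through numerically. The sketch below is a minimal illustration using only the three numbers from the example; the variable names are made up for this snippet.

```python
# Numbers taken from the example above.
p_spam = 0.01                # P(Spam)
p_offer_given_spam = 0.90    # P("offer" | Spam)
p_offer_given_ham = 0.20     # P("offer" | Not Spam)

# Overall chance of seeing "offer", summed over spam and non-spam emails.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

# Bayes' theorem: P(Spam | "offer").
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer

print(f"P(offer) = {p_offer:.3f}")
print(f"P(Spam | offer) = {p_spam_given_offer:.4f}")
```

The result is about 0.0435, or roughly 4.3%. Even though *“offer”* is far more common in spam, spam itself is so rare (1%) that most emails containing the word are still not spam.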

---

### 3. Extending to Many Features

In real problems (like text classification), we don’t look at just one word, but many words (like *offer*, *win*, *money*, etc.).

So, if our class is **Spam** or **Not Spam**, and we have features (words) $x_1, x_2, x_3, \dots, x_n$, Bayes says:

$$
P(\text{Class} \mid x_1, x_2, \dots, x_n) 
= \frac{P(x_1, x_2, \dots, x_n \mid \text{Class}) \cdot P(\text{Class})}{P(x_1, x_2, \dots, x_n)}
$$

---

### 4. The “Naive” Assumption

Here’s the tricky part:
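Estimating $P(x_1, x_2, \dots, x_n \mid \text{Class})$ directly is **very hard** because it is a joint probability over all features together. The number of possible feature combinations grows exponentially with $n$ (for $n$ yes/no words there are $2^n$ combinations per class), so no realistic dataset contains enough examples of each one.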

So Naive Bayes makes a **simplifying assumption**:
👉 All features are independent given the class.

This means:

$$
P(x_1, x_2, \dots, x_n \mid \text{Class}) \approx P(x_1 \mid \text{Class}) \cdot P(x_2 \mid \text{Class}) \cdot \dots \cdot P(x_n \mid \text{Class})
$$

---

### 5. Naive Bayes Formula

Now, the formula becomes:

$$
P(\text{Class} \mid x_1, x_2, \dots, x_n) \propto P(\text{Class}) \cdot \prod_{i=1}^{n} P(x_i \mid \text{Class})
$$

The denominator $P(x_1, x_2, \dots, x_n)$ is the same for all classes, so we usually ignore it when comparing.

So the rule is:
👉 For each class, multiply its prior probability ($P(\text{Class})$) by the likelihood of each feature given that class ($P(x_i \mid \text{Class})$).
👉 Pick the class with the highest score.
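
The rule above can be sketched in a few lines of Python. All priors and likelihoods here are hypothetical numbers chosen for illustration, not estimates from real data, and the function names are made up for this snippet.

```python
# Hypothetical probabilities, for illustration only.
priors = {"Spam": 0.4, "Not Spam": 0.6}
likelihoods = {
    "Spam": {"offer": 0.30, "win": 0.20, "money": 0.25},
    "Not Spam": {"offer": 0.05, "win": 0.02, "money": 0.10},
}

def score(words, cls):
    """Prior of the class times the likelihood of each observed word."""
    result = priors[cls]
    for word in words:
        result *= likelihoods[cls][word]
    return result

words = ["offer", "win"]
scores = {cls: score(words, cls) for cls in priors}

# Pick the class with the highest score.
prediction = max(scores, key=scores.get)

# Dividing by the sum of scores restores the ignored denominator.
total = sum(scores.values())
posteriors = {cls: s / total for cls, s in scores.items()}

print(prediction, posteriors)
```

With many features the product of small probabilities can underflow, so implementations commonly add log-probabilities instead of multiplying. The ranking of classes stays the same.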

# About Algorithm

### What is Naive Bayes?

* **Naive Bayes** is a **simple algorithm** used to **classify things into categories** (like spam vs not spam, positive review vs negative review, etc.).
* It is based on **probability** — it tries to find which category is **most likely** for a given piece of data.
* The word **“Naive”** comes from the fact that it assumes all features (clues) are independent of each other once the category is known, which is rarely exactly true in real data.
* The word **“Bayes”** comes from **Bayes’ Theorem**, the math rule it uses.

---

### How it works (super short steps):

1. Look at your data (features/clues).
2. For each possible category, calculate a probability score.
3. Choose the category with the **highest score**.

---

👉 In short:
**Naive Bayes is a simple probability-based method that guesses the category of data by multiplying probabilities of clues and picking the most likely category.**

# Variants of Naive Bayes

### 1. **Bernoulli Naive Bayes**

* **What it is:** Works with **binary features** (yes/no, 0/1, present/absent).
* **When to use:** When data is about **whether a feature exists or not** (not how many times).
* **Tiny Example:**

  * Email classification:

    * Feature = “Does the email contain the word *offer*?”
    * Answer = Yes (1) or No (0).
  * Bernoulli NB looks at the **presence/absence** of words, not counts.
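
A minimal sketch of how a Bernoulli-style likelihood could be computed. The vocabulary, the probabilities, and the function name are hypothetical and exist only for this illustration. Note that absent words also contribute, through $1 - p$.

```python
# Hypothetical P(word is present | class) values, for illustration only.
p_present = {
    "Spam": {"offer": 0.8, "win": 0.6, "meeting": 0.1},
    "Not Spam": {"offer": 0.2, "win": 0.1, "meeting": 0.5},
}

def bernoulli_likelihood(email_words, cls):
    """P(features | class) when each vocabulary word is a yes/no feature."""
    present = set(email_words)  # duplicates collapse: only presence matters
    likelihood = 1.0
    for word, p in p_present[cls].items():
        # A present word contributes p; an absent word contributes (1 - p).
        likelihood *= p if word in present else (1.0 - p)
    return likelihood

email = ["offer", "offer", "win"]  # the repeated "offer" changes nothing
spam_like = bernoulli_likelihood(email, "Spam")
ham_like = bernoulli_likelihood(email, "Not Spam")
print(spam_like, ham_like)
```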

---

### 2. **Multinomial Naive Bayes**

* **What it is:** Works with **counts of features** (how many times something appears).
* **When to use:** For **text data** where word frequency matters.
* **Tiny Example:**

  * Email classification:

    * Word “offer” appears **3 times**.
    * Word “win” appears **1 time**.
  * Multinomial NB uses these **counts** to calculate probabilities.
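
A minimal sketch of a count-based likelihood. The training counts and function names are hypothetical. The add-one (Laplace) smoothing is an assumption added here so that a word never seen in a class does not get probability zero; it is not part of the description above.

```python
import math

# Hypothetical total word counts seen in training emails of each class.
train_counts = {
    "Spam": {"offer": 30, "win": 15, "meeting": 5},
    "Not Spam": {"offer": 5, "win": 5, "meeting": 40},
}

def word_prob(word, cls, alpha=1.0):
    """Estimate P(word | class) from counts with add-alpha smoothing."""
    counts = train_counts[cls]
    total = sum(counts.values())
    vocab_size = len(counts)
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

def multinomial_log_likelihood(email_counts, cls):
    """Each occurrence of a word adds one log P(word | class) term."""
    return sum(n * math.log(word_prob(w, cls)) for w, n in email_counts.items())

# "offer" appears 3 times and "win" once, as in the example above.
email_counts = {"offer": 3, "win": 1}
spam_ll = multinomial_log_likelihood(email_counts, "Spam")
ham_ll = multinomial_log_likelihood(email_counts, "Not Spam")
print(spam_ll, ham_ll)
```

The multinomial coefficient is left out because it depends only on the email's counts, not on the class, so it does not affect which class scores higher.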

---

### 3. **Gaussian Naive Bayes**

* **What it is:** Works with **continuous (numeric) features**, assuming values follow a **bell curve (Gaussian distribution)**.
* **When to use:** When features are **real numbers** like height, weight, temperature, exam scores, etc.
* **Tiny Example:**

  * Predict if a fruit is “apple” or “orange” using **weight** and **diameter** (continuous values).
  * Gaussian NB assumes that, within each class, each feature follows its own normal distribution (with its own mean and spread).
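
A minimal sketch of the Gaussian likelihood for the fruit example. The means and standard deviations are hypothetical values, not measurements, and the sketch assumes both classes are equally likely beforehand so that the priors can be left out.

```python
import math

# Hypothetical (mean, standard deviation) per class and feature.
params = {
    "apple": {"weight": (150.0, 20.0), "diameter": (7.0, 0.5)},
    "orange": {"weight": (200.0, 25.0), "diameter": (8.0, 0.6)},
}

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gaussian_likelihood(sample, cls):
    """Product of per-feature densities (the naive independence assumption)."""
    likelihood = 1.0
    for feature, value in sample.items():
        mean, std = params[cls][feature]
        likelihood *= gaussian_pdf(value, mean, std)
    return likelihood

fruit = {"weight": 155.0, "diameter": 7.1}
best = max(params, key=lambda cls: gaussian_likelihood(fruit, cls))
print(best)
```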

---

### Quick Summary

* **Bernoulli:** Features are **yes/no** → good for binary text data.
* **Multinomial:** Features are **counts** → good for word frequency in documents.
* **Gaussian:** Features are **continuous numbers** → good for measurements.
