# Probabilities to Predictions: A Beginner's Journey through Bayes

# Probability

---

## 1) Random Experiments
An experiment is called a **random experiment** if:
- It has **more than one possible outcome**
- It is **not possible to predict** the outcome in advance

---

## 2) Trial
A **trial** refers to a **single execution** of a random experiment.  
Each trial produces **one outcome**.

---

## 3) Outcome
An **outcome** is a **single possible result** of a trial.

---

## 4) Sample Space
The **sample space** of a random experiment is the **set of all possible outcomes** that can occur.  
*One random experiment will have one sample space.*

---

## 5) Event
An **event** is a **specific set of outcomes** from a random experiment.  
- It is a **subset of the sample space**
- An event can include **a single or multiple outcomes**

> **Note**: The thing you want to measure with probability is called the "event".
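
To make the vocabulary above concrete, here is a minimal sketch for one roll of a six-sided die. The fair-die assumption and the names `sample_space`, `event_even`, and `probability` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

# Sample space: every possible outcome of one roll of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event: a subset of the sample space (here, "the roll is even").
event_even = {outcome for outcome in sample_space if outcome % 2 == 0}

def probability(event, space):
    """Probability of an event, assuming all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))

p_even = probability(event_even, sample_space)
print(p_even)  # 1/2
```

Each call to a random number generator would be one trial; the value it returns would be one outcome.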

---

# Random Variable

A **random variable** is a **function**, not a traditional variable.  
It **maps outcomes (from sample space) to real numbers**.

### Function View:
- **Input (Sample Space)** → **Logic** → **Output (Real Number)**

## Random Variable: Input & Output

- **Input**: An outcome from the sample space of a random process  
- **Output**: A real number assigned to each possible outcome


---

## Probability Distribution of a Random Variable

A **probability distribution** lists every value a random variable can take, along with the probability of each value. For a discrete random variable, these probabilities sum to 1.
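
As a small sketch of both ideas, take two tosses of a coin and let the random variable X be the number of heads. The coin being fair and the names `number_of_heads` and `distribution` are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin (fairness is an assumption).
sample_space = list(product("HT", repeat=2))

def number_of_heads(outcome):
    """Random variable X: maps an outcome such as ('H', 'T') to a real number."""
    return outcome.count("H")

# Probability distribution of X: each value of X paired with its probability.
counts = Counter(number_of_heads(outcome) for outcome in sample_space)
distribution = {
    value: Fraction(n, len(sample_space)) for value, n in sorted(counts.items())
}

for value, prob in distribution.items():
    print(f"P(X = {value}) = {prob}")
```

The function `number_of_heads` is the "logic" step: outcomes go in, real numbers come out.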

---

## Joint Probability

**Joint Probability** is the probability that **two events happen together**.

> For example, calculating the probability of both Event A and Event B occurring simultaneously.

---

## Marginal / Simple / Unconditional Probability

**Marginal probability** is the probability of a **single event** occurring, **without considering** the influence or presence of any other event.

> It answers questions like: "What is the chance of Event A happening?" without any condition or dependency.


### Pandas Crosstab for Probability

```python
pd.crosstab(index, columns, normalize='all', margins=True)
```
- **normalize='all'**: This gives you the joint probability across the entire table.

- **margins=True**: This adds marginal probabilities (totals for rows and columns).
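
A runnable sketch of the call above, using a small made-up table of days. The column names `weather` and `played` and the data itself are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per day.
df = pd.DataFrame({
    "weather": ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy", "sunny", "sunny"],
    "played": ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"],
})

# Inner cells hold joint probabilities; the "All" row and column hold marginals.
table = pd.crosstab(df["weather"], df["played"], normalize="all", margins=True)
print(table)

p_sunny_and_yes = table.loc["sunny", "yes"]  # joint: P(sunny and played)
p_sunny = table.loc["sunny", "All"]          # marginal: P(sunny)
```

Dividing a joint cell by its marginal (for example `p_sunny_and_yes / p_sunny`) gives the corresponding conditional probability.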

### 📘 What is Conditional Probability?

**Conditional Probability** is the probability of an event **A** occurring **given** that another event **B** has already occurred.

It helps answer questions like:  
> "What is the probability of A happening if we already know B has happened?"

---

### 🧠 Key Idea:

- When we compute conditional probability, we **restrict** our focus only to the part of the data where event **B** has happened.
- Then, we check how often **A** also occurs within that restricted set.

---

### 🧮 Formula:

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$

Where:
- **P(A ∩ B)** is the probability that both A and B happen (joint probability).
- **P(B)** is the probability that B happens (marginal probability).

---

### 📌 Example Interpretation:

If you're trying to find the probability that a person has a disease (A) **given** that their test result is positive (B), then:

- **P(A | B)** tells you how likely it is the person is actually sick **knowing** their test is positive.
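
The "restrict, then count" idea can be sketched in plain Python. The ten records below are invented for illustration; A is "has the disease" and B is "test is positive".

```python
# Hypothetical records: (has_disease, test_positive) for ten people.
records = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, True), (False, False), (False, False),
]

# Restrict attention to the cases where B (positive test) happened ...
b_cases = [r for r in records if r[1]]
# ... then count how often A (disease) also happened within that subset.
a_and_b_cases = [r for r in b_cases if r[0]]

p_a_given_b = len(a_and_b_cases) / len(b_cases)

# The same number via the formula P(A ∩ B) / P(B).
p_a_and_b = len(a_and_b_cases) / len(records)
p_b = len(b_cases) / len(records)

print(p_a_given_b, p_a_and_b / p_b)
```

Both routes agree because the total number of records cancels out in the ratio.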


## 🎯 Understanding Event Relationships in Probability

---

### ✅ 1. Independent Events

**Definition:**  
Two events **A** and **B** are said to be **independent** if the occurrence of one does **not affect** the probability of the other.

**Mathematically:**  
$$
P(A \cap B) = P(A) \cdot P(B)
$$

**Example:**  
- Tossing a coin and rolling a die.  
  Getting "Heads" on the coin does **not influence** the outcome of the die roll.

---

### 🔁 2. Dependent Events

**Definition:**  
Two events are **dependent** if the occurrence of one **does affect** the probability of the other.

**Example:**  
- Drawing two cards from a deck **without replacement**.  
  The outcome of the first draw **affects** the probabilities in the second draw because the deck now has fewer cards.

---

### ❌ 3. Mutually Exclusive Events (Disjoint Events)

**Definition:**  
Two events are **mutually exclusive** if **they cannot occur at the same time**.

**Mathematically:**  
$$
P(A \cap B) = 0
$$

**Example:**  
- Tossing a coin and getting both **Heads** and **Tails** in a single toss.  
  Not possible — they are **mutually exclusive** outcomes.

---

### 🧠 Summary Table

| Concept                | Can Happen Together? | Does One Affect the Other? | Example                              |
|------------------------|----------------------|-----------------------------|--------------------------------------|
| Independent Events     | ✅ Yes               | ❌ No                        | Coin toss & die roll                 |
| Dependent Events       | ✅ Yes               | ✅ Yes                       | Drawing cards without replacement    |
| Mutually Exclusive     | ❌ No                | ✅ Yes (if one occurs, other can't) | Getting Heads or Tails in one toss   |

---

🔍 **Note:**  
- Mutually exclusive events with nonzero probabilities are **not independent**: if one happens, the probability of the other becomes zero.
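
The product rule for independence can be checked by enumerating a small sample space. This sketch assumes a fair coin and a fair die; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: one coin toss combined with one die roll (12 equally likely outcomes).
space = list(product("HT", range(1, 7)))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def heads(o):
    return o[0] == "H"

def tails(o):
    return o[0] == "T"

def six(o):
    return o[1] == 6

# Independent: P(heads and six) equals P(heads) * P(six).
independent = prob(lambda o: heads(o) and six(o)) == prob(heads) * prob(six)

# Mutually exclusive: P(heads and tails) is 0, not P(heads) * P(tails).
p_both = prob(lambda o: heads(o) and tails(o))

print(independent, p_both, prob(heads) * prob(tails))
```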


## 📘 Bayes' Theorem

---

### 📖 Definition

**Bayes' Theorem** is a way to calculate the **probability of a hypothesis** (event A) **given some observed evidence** (event B).  
It helps us update our beliefs based on new data.

In simple terms:  
> **What is the probability of A happening, given that B has already happened?**

---

### 🧮 Formula

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

Where:

- **P(A|B)** = Posterior Probability (Probability of A given B has occurred)  
- **P(B|A)** = Likelihood (Probability of B given A has occurred)  
- **P(A)** = Prior Probability (Initial belief about A)  
- **P(B)** = Marginal Probability (Total probability of B)

---

### 💡 Intuition

Bayes' theorem updates **prior beliefs** (P(A)) after observing **new evidence** (B), turning them into **posterior beliefs** (P(A|B)).

---

### 📦 Example: Medical Test

Suppose:

- 1% of people have a rare disease → P(Disease) = 0.01  
- If a person has the disease, the test is positive 99% of the time (its sensitivity) → P(Positive | Disease) = 0.99  
- If a person doesn't have the disease, the test has a 5% false positive rate → P(Positive | No Disease) = 0.05

**Question:** What is the probability that a person actually has the disease if they test positive?

---

#### Step 1: Known Values

- P(Disease) = 0.01  
- P(No Disease) = 0.99  
- P(Positive | Disease) = 0.99  
- P(Positive | No Disease) = 0.05  

---

#### Step 2: Calculate Total Probability of Positive (P(Positive))

$$
P(\text{Positive}) = P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive} \mid \text{No Disease}) \cdot P(\text{No Disease})
$$

$$
= (0.99 \cdot 0.01) + (0.05 \cdot 0.99) = 0.0099 + 0.0495 = 0.0594
$$

---

#### Step 3: Apply Bayes' Theorem

$$
P(\text{Disease} \mid \text{Positive}) = \frac{0.99 \cdot 0.01}{0.0594} = \frac{0.0099}{0.0594} \approx 0.1667
$$

✅ **Interpretation:**  
Even with a positive result, the chance the person actually has the disease is only **~16.67%**. This happens because the disease is rare, and false positives are relatively common.
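
The three steps above can be sketched as a small function. The name `posterior` and its parameter names are illustrative, and the function assumes a binary hypothesis (disease or no disease).

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) via Bayes' theorem for a binary hypothesis A and evidence B."""
    # Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
print(round(p, 4))  # 0.1667
```

Raising `prior` to 0.10 in the same call shows how strongly the base rate drives the result.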

---

### 📊 When to Use Bayes' Theorem

- When you want to **update your belief** after seeing new data  
- In **medical diagnosis**, **spam filtering**, **machine learning**, **fraud detection**, etc.

---

### 🚀 Advanced Tip

In **machine learning**, Bayes' Theorem is used in:
- **Naive Bayes Classifier**
- **Bayesian Inference**
- **Bayesian Networks**

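Of these, only Naive Bayes assumes that features are independent given the class; Bayesian networks model conditional dependencies between variables explicitly. All of them use probabilities to make predictions and decisions.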

---

### 🧠 Summary

| Term         | Meaning                                    |
|--------------|--------------------------------------------|
| Prior        | Belief before seeing evidence (P(A))       |
| Likelihood   | How likely evidence is, given A (P(B \| A)) |
| Evidence     | Total probability of B (P(B))              |
| Posterior    | Updated belief after evidence (P(A \| B))   |

---

📌 **Bayes' Theorem helps convert intuition into rational belief updating.**


### Naive Bayes

Naive Bayes is a **supervised machine learning algorithm** used primarily for **classification tasks**.  
It is a **probabilistic classifier** based on Bayes' Theorem with a strong assumption of **independence between features**.

Despite its simplicity, Naive Bayes often performs surprisingly well in practice and is widely used in:
- Text classification (spam detection, sentiment analysis)
- Recommender systems
- Filtering systems

---

### Bayes’ Theorem (Foundation of Naive Bayes)

Bayes' Theorem describes the probability of an event based on prior knowledge of conditions related to the event.

**Formula**:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$


- **P(A | B)**: Posterior probability of class A given predictor B  
- **P(B | A)**: Likelihood (probability of predictor B given class A)  
- **P(A)**: Prior probability of class A  
- **P(B)**: Evidence (marginal probability of predictor B)  

---

### Key Components of Naive Bayes

1. **Naive Assumption**  
   The algorithm assumes that features are **independent** given the class label, which simplifies computation:

$$
P(x_1, x_2, \ldots, x_n \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdot \ldots \cdot P(x_n \mid y)
$$

2. **Classification Rule**  
To classify an instance `x`, the algorithm computes the probability of each class `y` given `x`, and predicts the class with the highest probability:

$$
\hat{y} = \underset{y}{\arg\max} \; P(y \mid x)
$$

Using Bayes' Theorem, and dropping the denominator `P(x)` because it is the same for every class:

$$
\hat{y} = \underset{y}{\arg\max} \; P(x \mid y) \cdot P(y)
$$


3. **Estimating Probabilities**  
- **Class Prior Probability** `P(y)`
- **Conditional Probabilities** `P(xi | y)` for each feature `xi`

---

### Summary

Naive Bayes is fast, interpretable, and effective for many classification problems, especially where the independence assumption roughly holds.


### Handling Numerical Data in Naive Bayes

Naive Bayes can handle **numerical data** by assuming that features follow a specific probability distribution.  
The most common variant for this purpose is **Gaussian Naive Bayes**, which assumes that numerical features are distributed according to a **normal (Gaussian) distribution**.

---

### Key Concepts

1. **Assumption of Normal Distribution**  
   For each class, Naive Bayes assumes that numerical features follow a Gaussian distribution:

$$
P(x \mid y) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

Where:
- `μ` is the mean of the feature for class `y`
- `σ²` is the variance

2. **Parameter Estimation**  
The model calculates the mean (`μ`) and variance (`σ²`) for each feature within each class during training.

3. **Probability Density Function (PDF)**  
Used to compute the likelihood of a feature value belonging to a class.

4. **Independence Assumption**  
Each feature is assumed to be independent given the class label.
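
A minimal sketch of parameter estimation and the Gaussian likelihood for a single feature. The height-like values and class labels are invented, and the comparison at the end assumes equal class priors so that likelihoods alone decide the class.

```python
import math
from statistics import fmean, pvariance

def gaussian_pdf(x, mean, variance):
    """Gaussian density used as the likelihood P(x | y)."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

# Hypothetical training values of one feature, grouped by class label.
feature_by_class = {
    "short": [150.0, 155.0, 160.0],
    "tall": [180.0, 185.0, 190.0],
}

# Parameter estimation: one (mean, variance) pair per class for this feature.
params = {
    label: (fmean(values), pvariance(values))
    for label, values in feature_by_class.items()
}

# Likelihood of a new value under each class (equal priors assumed).
x_new = 158.0
likelihoods = {
    label: gaussian_pdf(x_new, mean, var) for label, (mean, var) in params.items()
}
best = max(likelihoods, key=likelihoods.get)
print(best)  # short
```

With several features, the per-feature likelihoods would be multiplied together (or their logs summed), which is where the independence assumption enters.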

---

### What if Data is Not Gaussian?

If features are **not normally distributed**, several strategies can be applied:

- **Data Transformation**: Apply log, square root, or Box-Cox transformations to approximate normality.
- **Alternative Distributions**: Use different distributions (e.g., Multinomial or Bernoulli) for features where Gaussian doesn't fit.
- **Discretization**: Convert continuous variables into categorical bins.
- **Kernel Density Estimation (KDE)**: Estimate distributions non-parametrically.
- **Use Other Models**: If distributional assumptions are invalid, consider tree-based models like Random Forests.

---

### Underflow Problem

- **Underflow** occurs when very small probabilities are multiplied together, leading to numerical values too close to zero for the computer to represent accurately.
- **Solution**: Use **log-probabilities**:


$$
\log(ab) = \log(a) + \log(b)
$$

This avoids multiplying tiny numbers directly and stabilizes computation.
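
A short demonstration of the problem and the fix. The list of 100 identical likelihoods is a made-up stand-in for the per-feature probabilities of one document or row.

```python
import math

# 100 hypothetical per-feature likelihoods, each very small.
likelihoods = [1e-5] * 100

# The direct product underflows to exactly 0.0 in double precision.
direct_product = math.prod(likelihoods)

# Summing logs keeps the same information in a representable range.
log_score = sum(math.log(p) for p in likelihoods)

print(direct_product)  # 0.0
print(log_score)       # a large negative but finite number
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability.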

---

### Laplace (Additive) Smoothing

Used to avoid zero probabilities during classification.

- **Problem**: If a feature value was never seen in training for a given class, its probability becomes zero.
- **Solution**: Add a small constant `α` (alpha) to all counts:


$$
P(\text{feature} \mid \text{class}) = \frac{\text{count} + \alpha}{\text{total} + \alpha \cdot n}
$$

Where:
- `α` is typically 1 (Laplace) or < 1 (Lidstone)
- `n` is the number of unique feature values

- **Trade-off**:
  - Very small `α` → risk of high variance
  - Very large `α` → high bias and underfitting
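
The formula and the effect of `α` can be sketched in a few lines. The function name `smoothed_probability` and the counts used below are hypothetical.

```python
def smoothed_probability(count, total, n_values, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (total + alpha * n_values)

# Hypothetical case: a value never seen with this class (count = 0) out of
# 10 training rows, for a feature with 3 possible values.
unsmoothed = 0 / 10
laplace = smoothed_probability(0, 10, 3, alpha=1.0)    # 1 / 13
lidstone = smoothed_probability(0, 10, 3, alpha=0.1)   # 0.1 / 10.3
heavy = smoothed_probability(0, 10, 3, alpha=100.0)    # 100 / 310

print(unsmoothed, lidstone, laplace, heavy)
```

As `α` grows, the estimate is pulled toward the uniform value 1/3 regardless of what the data say, which is the bias side of the trade-off.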

---

### Summary

Naive Bayes handles numerical data well under the Gaussian assumption but provides flexibility with transformations and smoothing techniques to adapt to various data types.


### Laplace Smoothing as a Hyperparameter

- **Laplace (Additive) Smoothing** can act as a **hyperparameter**.
- By changing its value, you can control **overfitting or underfitting** in your model.
- It works like a **tuning knob**: a small value keeps variance high (risk of overfitting); a large value lowers variance but increases bias (risk of underfitting).

---

### Types of Naive Bayes

1. **Gaussian Naive Bayes**  
   - Used when all features are **continuous numerical values**.
2. **Categorical Naive Bayes**  
   - Designed for purely **categorical features**.
3. **Bernoulli Naive Bayes**  
   - Suitable for **binary/boolean features**.
4. **Multinomial Naive Bayes**  
   - Ideal for **count features**, like word counts in text classification.
5. **Complement Naive Bayes**  
   - A variant of Multinomial NB, designed to work better with **imbalanced datasets**.

---

### Why Laplace Smoothing is **Not** Applied to Gaussian Naive Bayes

- **Laplace Smoothing** is meant for **categorical data** to handle **zero-frequency** problems.
- **Gaussian Naive Bayes** works on **continuous data** using **probability density functions**, not frequency-based estimates.
- Hence, **Laplace smoothing is not needed or applicable** in Gaussian NB.

---

### Assumptions of Gaussian Naive Bayes

1. **Feature Independence (Naive Assumption)**  
   - Features are **independent** given the class label.  
   - The presence/absence or value of one feature does **not affect** another feature once the class label is known.

2. **Gaussian Distribution of Features**  
   - Assumes each **numerical feature** follows a **normal distribution** within each class.
   - Probability Density Function:
     
     $$
        P(x \mid y) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
     $$
     
     Where:
     - `μ` is the mean of the feature for a class
     - `σ²` is the variance

3. **Class-Conditional Independence**  
   - The joint probability of features given a class is the **product** of individual feature probabilities:
    $$
      P(x_1, x_2, \ldots, x_n \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdot \ldots \cdot P(x_n \mid y)
    $$

---

### Summary

- Gaussian Naive Bayes is **fast and effective** for numerical features.
- Laplace smoothing is not used in Gaussian NB as it is designed for **discrete/categorical** data models.
- The assumptions of **independence** and **normality** are central to its performance.


### Limitations of Naive Bayes

1. **Feature Independence Assumption**  
   - Naive Bayes assumes all features are independent given the class label.  
   - In real-world datasets, features are often **correlated**, violating this assumption and leading to **suboptimal performance**.

2. **Gaussian Distribution Assumption**  
   - **Gaussian Naive Bayes** assumes that continuous features follow a **normal distribution**.  
   - If the data is **skewed**, **non-Gaussian**, or **categorical**, performance may suffer unless proper preprocessing is applied.

3. **Sensitivity to Imbalanced Data**  
   - Performs poorly when one class **significantly outnumbers** another.  
   - **Class priors** can dominate predictions, leading to **biased results**.

4. **Not Suitable for Complex Relationships**  
   - Naive Bayes **cannot capture feature interactions** or complex patterns due to its **naive independence assumption**.

5. **Handling of Continuous Features**  
   - Gaussian NB models each continuous feature as **Gaussian** within a class.  
   - If the assumption fails, **transformations** (e.g., log, Box-Cox) or **discretization** are needed.

6. **Zero Probability Problem**  
   - If a feature value does not appear in the training data for a class, the model assigns **zero probability** to it.  
   - This is handled using **Laplace (Additive) Smoothing**.

7. **Poor Performance with High-Dimensional Data**  
   - In high-dimensional spaces with many correlated features, the independence assumption becomes more problematic.  
   - Naive Bayes may underperform compared to more **complex classifiers** that model interactions (e.g., SVM, Random Forest).


### Categorical Naive Bayes

#### When to Use:
- Use **Categorical Naive Bayes** when **all input features are categorical** (e.g., gender, color, category).

#### Laplace Smoothing:
- Formula: 

$$
P(x_i \mid y) = \frac{\text{count}(x_i, y) + \alpha}{n_y + \alpha \cdot k}
$$

- `n_y` = total observations for class `y`  
- `k` = number of categories for that feature  
- Example: If a "Gender" feature has two categories (Male, Female), then `k = 2`.
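
Putting the smoothed estimate together with class priors and log-probabilities gives a small from-scratch classifier. This is a sketch for illustration only, not how any particular library implements it; the function names `train` and `predict` and the toy data are hypothetical, and `k` is taken as the number of distinct values seen in training for each feature.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels, alpha=1.0):
    """Collect the counts needed for class priors and smoothed likelihoods."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    categories = defaultdict(set)        # feature index -> distinct values (gives k)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            categories[i].add(value)
    return class_counts, value_counts, categories, alpha

def predict(model, row):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    class_counts, value_counts, categories, alpha = model
    n_total = sum(class_counts.values())
    scores = {}
    for label, n_y in class_counts.items():
        score = math.log(n_y / n_total)  # log prior
        for i, value in enumerate(row):
            k = len(categories[i])
            count = value_counts[(i, label)][value]
            score += math.log((count + alpha) / (n_y + alpha * k))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "yes", "no", "no", "yes"]

model = train(rows, labels)
prediction = predict(model, ("sunny", "yes"))
print(prediction)  # yes
```

Note that "sunny" never occurs with the class "no" in this data; without smoothing, that class would receive probability zero and `math.log` would fail.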

---

#### Assumptions:
1. **Feature Independence** (Naive Assumption):  
   Each feature is assumed to be independent given the class label.

2. **Categorical Feature Distribution**:  
   Assumes features follow a **categorical** or **multinomial** distribution.

3. **No Missing Values**:  
   Requires clean data without missing categorical entries.

4. **Class Conditional Independence**:  
   The probability of a class is computed from the **independent contribution** of each feature.

---

#### Limitations:

1. **Feature Correlation Ignored**  
   - Fails to capture relationships or interactions between features.

2. **Imbalanced Class Problem**  
   - Performs poorly with highly imbalanced class distributions.

3. **Zero Probability Issue**  
   - If a category is not seen during training, it assigns **zero probability**, leading to classification errors.  
   - Can be handled with **Laplace Smoothing**.

4. **Poor for Complex Data Structures**  
   - Not suitable for datasets with high interactions or dependencies between features.

5. **High Cardinality Issue**  
   - Struggles when categorical features have a large number of unique values.

6. **Not Suitable for Continuous Features**  
   - Cannot handle numerical data **without preprocessing** (e.g., binning or discretization).


### Multinomial Naive Bayes

#### When to Use:
- When all features take **discrete values**, especially in **text data**, such as:
  - Document classification
  - Sentiment analysis
  - Spam detection

---

#### Assumptions:
1. **Feature Independence**: Each feature is conditionally independent given the class.
2. **Discrete Feature Values**: Features are **count-based** (e.g., word frequencies or term counts).
3. **Class-Conditional Distribution**: Feature counts for each class follow a **multinomial distribution**.
4. **Non-Negative Features**: All feature values must be **non-negative**. The model is defined for integer counts, although fractional counts such as tf-idf may also work in practice; negative values are not allowed.

---

#### Limitations:
1. **Independence Assumption**: May not hold in real-world datasets with correlated features.
2. **Not Suitable for Continuous Data**: Needs preprocessing like discretization or binning.
3. **Imbalanced Class Sensitivity**: Struggles with skewed class distributions.
4. **Zero Probability Issue**: Unseen feature values lead to zero probability unless **Laplace Smoothing** is applied.
5. **Poor Handling of Rare Words**: Rare terms may be underweighted, even if important.
6. **Scalability Issues**: May struggle with **very high-dimensional spaces** (e.g., vocabularies with millions of words).

---

### Bernoulli Naive Bayes

#### When to Use:
- When features follow a **Bernoulli distribution** (binary 0 or 1).
- Particularly useful for text classification where:
  - Presence or **absence** of a word matters more than frequency.
  - Can outperform Multinomial NB in certain sparse binary settings.

---

#### Assumptions:
1. **Binary Feature Values**: Features are binary (0 or 1).
2. **Feature Independence**: Like all Naive Bayes variants.
3. **Class-Conditional Distribution**: Follows **Bernoulli distribution** for each class.
4. **No Negative Values**: All feature values must be binary (0 or 1), no negatives or continuous values.

---

#### Limitations:
1. **Independence Assumption**: Ignores feature dependencies.
2. **Binary Data Restriction**: Only works with binary features, requiring transformation from continuous or count data.
3. **Zero Probability Problem**: Similar to other Naive Bayes variants.
4. **Imbalanced Class Issue**: Sensitive to skewed class distribution.
5. **Feature Representation Sensitivity**: Depends heavily on how features are binarized.
6. **Less Effective with High Cardinality**: Sparse or irrelevant binary features can reduce performance.

---

### Extra Tip: Out-of-Core Naive Bayes with `partial_fit`
- Useful when **data is too large to fit in memory**.
- `partial_fit()` allows you to train in **mini-batches** or **chunks**.


# Code Examples

```python
# Import libraries
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris, fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# 1. Gaussian Naive Bayes - for continuous numerical data
def gaussian_nb_example():
    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

    model = GaussianNB()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    print("GaussianNB Accuracy:", accuracy_score(y_test, predictions))

# 2. Multinomial Naive Bayes - for text data (discrete word counts)
def multinomial_nb_example():
    categories = ['alt.atheism', 'sci.space']
    data = fetch_20newsgroups(subset='train', categories=categories)

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data.data)
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = MultinomialNB()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    print("MultinomialNB Accuracy:", accuracy_score(y_test, predictions))

# 3. Bernoulli Naive Bayes - for binary features (presence/absence of words)
def bernoulli_nb_example():
    categories = ['alt.atheism', 'sci.space']
    data = fetch_20newsgroups(subset='train', categories=categories)

    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(data.data)
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = BernoulliNB()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    print("BernoulliNB Accuracy:", accuracy_score(y_test, predictions))

# 4. Complement Naive Bayes - better suited for imbalanced text data
def complement_nb_example():
    categories = ['alt.atheism', 'sci.space']
    data = fetch_20newsgroups(subset='train', categories=categories)

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data.data)
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = ComplementNB()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    print("ComplementNB Accuracy:", accuracy_score(y_test, predictions))

# Run all examples
gaussian_nb_example()
multinomial_nb_example()
bernoulli_nb_example()
complement_nb_example()
```

Output:

```
GaussianNB Accuracy: 0.9777777777777777
MultinomialNB Accuracy: 0.9875776397515528
BernoulliNB Accuracy: 0.984472049689441
ComplementNB Accuracy: 0.9875776397515528
```

