## 1. Probability

* **Frequentist:** Uses hypothesis testing and p-values to check whether an observed difference is statistically significant.
* **Bayesian:** Combines prior knowledge with the observed data to calculate the probability that one design is better than the other.



## 2. Probabilistic Models


**Decision rules:**

**Maximum a posteriori (MAP) decision rule:**

Considers both likelihood and prior: choose the class y that maximizes P(x|y) P(y), which is proportional to the posterior P(y|x).

**Maximum likelihood (ML) decision rule:**

Considers only the likelihood: choose the class y that maximizes P(x|y). This is the MAP rule under the assumption that all class priors are equal.
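The two rules can disagree when the classes are unbalanced. A minimal sketch, assuming made-up likelihoods and priors for a single observation (the class names and all numbers are hypothetical, chosen only to illustrate the difference):

```python
def ml_decision(likelihoods):
    """ML rule: pick the class with the highest likelihood P(x | class)."""
    return max(likelihoods, key=likelihoods.get)


def map_decision(likelihoods, priors):
    """MAP rule: pick the class with the highest P(x | class) * P(class)."""
    return max(likelihoods, key=lambda c: likelihoods[c] * priors[c])


# Hypothetical numbers: the observation is more likely under "spam",
# but "ham" is far more common overall.
likelihoods = {"spam": 0.30, "ham": 0.10}
priors = {"spam": 0.20, "ham": 0.80}

ml_choice = ml_decision(likelihoods)            # compares 0.30 with 0.10
map_choice = map_decision(likelihoods, priors)  # compares 0.30 * 0.20 with 0.10 * 0.80
print(ml_choice, map_choice)
```

With equal priors the prior factor is the same for every class, so both functions return the same answer.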



## 3. Probability Distributions for Continuous Variables

**Continuous variables:** quantities such as age or height, which can take any numeric value within a range.

---

### Gaussian (Normal) distribution:

* Characterized by mean (μ) and standard deviation (σ).
* Probability density for a value x:

$$
P(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

* Multivariate Gaussian extends this to vectors with mean vector μ and covariance matrix Σ.
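The univariate density formula translates directly into code. A small sketch (the function name `gaussian_pdf` is just an illustrative choice):

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)       # density at the mean of a standard normal
one_sigma = gaussian_pdf(1.0, mu=0.0, sigma=1.0)  # density one standard deviation away
print(peak, one_sigma)
```

Note that the result is a density, not a probability: for a small σ the value at the mean can exceed 1.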

---

### Example: Classification with Gaussians

* Two classes with Gaussian distributions.
* Each class has parameters (mean μ and std dev σ).
* Find a decision boundary to separate classes.
* Using likelihood ratio of the two Gaussians, find where LR=1 → decision boundary.
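As a worked version of the last step, write the two classes as ⊕ and ⊖ with parameters (μ⊕, σ⊕) and (μ⊖, σ⊖). The likelihood ratio of the two Gaussian densities is:

$$
LR(x) = \frac{P(x|\oplus)}{P(x|\ominus)} = \frac{\sigma_\ominus}{\sigma_\oplus} \exp \left(-\frac{(x-\mu_\oplus)^2}{2\sigma_\oplus^2} + \frac{(x-\mu_\ominus)^2}{2\sigma_\ominus^2}\right)
$$

Setting LR(x) = 1 and taking logarithms gives the equation for the decision boundary:

$$
\frac{(x-\mu_\oplus)^2}{2\sigma_\oplus^2} - \frac{(x-\mu_\ominus)^2}{2\sigma_\ominus^2} = \ln \frac{\sigma_\ominus}{\sigma_\oplus}
$$

In the special case σ⊕ = σ⊖ the right-hand side is zero and the quadratic terms cancel, leaving (assuming the means differ):

$$
(x-\mu_\oplus)^2 = (x-\mu_\ominus)^2 \quad \Rightarrow \quad x = \frac{\mu_\oplus + \mu_\ominus}{2}
$$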

---

### Mixture Model

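The data come from a mixture of two Gaussians, one per class (⊕ and ⊖):

$$
P(x|\oplus) = \frac{1}{\sqrt{2\pi}\sigma_\oplus} \exp \left(-\frac{(x-\mu_\oplus)^2}{2\sigma_\oplus^2}\right)
$$

$$
P(x|\ominus) = \frac{1}{\sqrt{2\pi}\sigma_\ominus} \exp \left(-\frac{(x-\mu_\ominus)^2}{2\sigma_\ominus^2}\right)
$$

Classify x with the likelihood ratio P(x|⊕) / P(x|⊖): predict ⊕ when the ratio is greater than 1 and ⊖ when it is less than 1.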

---

### Decision boundaries scenarios:

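* If the σ's are equal, the ML decision boundary (LR = 1) is the midpoint between the two means.
* If the σ's differ, the equation LR = 1 is quadratic in x and has two solutions. The class with the larger σ is then predicted in both tails, so its decision region is non-contiguous.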
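Both scenarios can be checked numerically by solving LR = 1 for x. The sketch below expands that condition into a quadratic a·x² + b·x + c = 0; the function name and the example parameters are hypothetical, and the ML rule (equal priors) is assumed:

```python
import math


def decision_boundaries(mu1, sigma1, mu2, sigma2):
    """Return the sorted x values where the two Gaussian densities are equal (LR = 1)."""
    # Coefficients obtained by taking logs of LR = 1 and multiplying out.
    a = sigma2 ** 2 - sigma1 ** 2
    b = 2.0 * (sigma1 ** 2 * mu2 - sigma2 ** 2 * mu1)
    c = (sigma2 ** 2 * mu1 ** 2 - sigma1 ** 2 * mu2 ** 2
         - 2.0 * sigma1 ** 2 * sigma2 ** 2 * math.log(sigma2 / sigma1))
    if a == 0:
        # Equal sigmas: the quadratic term vanishes and a single boundary remains.
        return [-c / b] if b != 0 else []
    root = math.sqrt(b ** 2 - 4.0 * a * c)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


equal = decision_boundaries(0.0, 1.0, 2.0, 1.0)    # same sigma, means 0 and 2
unequal = decision_boundaries(0.0, 1.0, 0.0, 2.0)  # same mean, sigmas 1 and 2
print(equal, unequal)
```

In the second call the narrower class wins between the two boundaries and the wider class wins on both sides of them, which is the non-contiguous case.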

---

### Maximum Likelihood Estimation (MLE)

* Used to find parameters (μ, σ) that make the observed data most likely.
* Example: given data points assumed to be Gaussian, estimate the mean by maximizing the likelihood function. The maximizer is the sample mean.
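For a Gaussian the maximization has a closed form: the sample mean for μ, and the root of the average squared deviation (dividing by n, not n - 1) for σ. A short sketch on a made-up sample, with a log-likelihood function to confirm that nearby parameter values score lower:

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(data)

mu_hat = sum(data) / n                                           # MLE of the mean
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)  # MLE of sigma (divides by n)


def log_likelihood(xs, mu, sigma):
    """Gaussian log-likelihood of the whole sample under (mu, sigma)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in xs
    )


best = log_likelihood(data, mu_hat, sigma_hat)
shifted = log_likelihood(data, mu_hat + 0.5, sigma_hat)  # a worse mean
widened = log_likelihood(data, mu_hat, sigma_hat * 1.5)  # a worse sigma
print(mu_hat, sigma_hat, best, shifted, widened)
```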

---



## 4. Probability Distributions for Categorical Variables

**Categorical variables:** variables that take one of a fixed set of values, such as the words in a document.

---

* **Multivariate Bernoulli distribution:** models presence or absence of words (0 or 1).
* **Multinomial distribution:** models counts of word occurrences (how many times each word appears).

Formulas:

* Multivariate Bernoulli (each x_i is 0 or 1; θ_i is the probability that word i is present):

$$
P(X=(x_1,...,x_k)) = \prod_{i=1}^k \theta_i^{x_i} (1-\theta_i)^{1-x_i}
$$

* Multinomial (total n words, so the counts x_i sum to n and the θ_i sum to 1):

$$
P(X=(x_1,...,x_k)) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^k \theta_i^{x_i}
$$
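Both formulas can be evaluated in a few lines. A minimal sketch, where the parameter vectors are hypothetical values picked for illustration:

```python
import math


def bernoulli_pmf(x, theta):
    """Multivariate Bernoulli: x[i] is 0 or 1, theta[i] = P(word i is present)."""
    return math.prod(t if xi == 1 else 1.0 - t for xi, t in zip(x, theta))


def multinomial_pmf(x, theta):
    """Multinomial: x[i] is the count of word i, and theta sums to 1."""
    coeff = math.factorial(sum(x))
    for xi in x:
        coeff //= math.factorial(xi)  # stays an integer after every division
    return coeff * math.prod(t ** xi for xi, t in zip(x, theta))


# Hypothetical parameters for a three-word vocabulary.
p_presence = bernoulli_pmf((1, 1, 0), (0.5, 0.8, 0.1))  # 0.5 * 0.8 * (1 - 0.1)
p_counts = multinomial_pmf((3, 1, 0), (0.5, 0.3, 0.2))  # 4 * 0.5**3 * 0.3
print(p_presence, p_counts)
```

Note that the Bernoulli parameters need not sum to 1, because each word has its own independent present/absent trial, whereas the multinomial parameters must.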

---



## 5. Naïve Bayes Model

Assumes **features are independent given the class**. This is a simplification, but it works well in practice.

* Used in spam classification: assumes that, given the class of an email, its words occur independently of each other.

---

**Note:** This assumption is usually false because words depend on each other (e.g., "scientific" and "experiment" often appear together), but the model still works well.

---

### Using Naïve Bayes for classification:

* Use decision rules (ML or MAP).
* ML assumes uniform class distribution; MAP uses class priors.

---

### Example:

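* Vocabulary: words a, b, c.
* The email contains 3 occurrences of 'a', 1 of 'b', and 0 of 'c'.

Classify it as spam or ham using:

* Multivariate Bernoulli: uses binary presence (1 or 0), so the email is x = (1, 1, 0).
* Multinomial: uses counts of words, so the email is x = (3, 1, 0).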
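A sketch of this example under the ML decision rule. The notes do not give the per-class word probabilities, so every parameter value below is hypothetical and chosen only to show that the two models can reach different decisions for the same email:

```python
import math

counts = {"a": 3, "b": 1, "c": 0}  # the email from the example

# Hypothetical parameters: P(word present | class) for the Bernoulli model.
bernoulli_theta = {
    "spam": {"a": 0.5, "b": 0.6, "c": 0.3},
    "ham": {"a": 0.7, "b": 0.3, "c": 0.3},
}
# Hypothetical parameters: P(word | class) for the multinomial model (each row sums to 1).
multinomial_theta = {
    "spam": {"a": 0.3, "b": 0.5, "c": 0.2},
    "ham": {"a": 0.6, "b": 0.2, "c": 0.2},
}


def bernoulli_likelihood(counts, theta):
    """Only presence or absence matters; absent words contribute (1 - theta)."""
    return math.prod(theta[w] if counts[w] > 0 else 1.0 - theta[w] for w in theta)


def multinomial_likelihood(counts, theta):
    """The multinomial coefficient is identical for every class, so it is omitted."""
    return math.prod(theta[w] ** counts[w] for w in theta)


def ml_classify(likelihood_fn, counts, thetas):
    """ML rule: return the class whose parameters give the highest likelihood."""
    return max(thetas, key=lambda c: likelihood_fn(counts, thetas[c]))


bernoulli_label = ml_classify(bernoulli_likelihood, counts, bernoulli_theta)
multinomial_label = ml_classify(multinomial_likelihood, counts, multinomial_theta)
multinomial_lr = (multinomial_likelihood(counts, multinomial_theta["spam"])
                  / multinomial_likelihood(counts, multinomial_theta["ham"]))
print(bernoulli_label, multinomial_label, multinomial_lr)
```

With these made-up numbers the three occurrences of 'a' count three times in the multinomial model but only once in the Bernoulli model, which is why the repeated word can tip the decision.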

---

### Training Naïve Bayes

* Estimate parameters (probabilities of words given class) from training data.
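* Use frequency counts. For the multinomial model, θ = (number of times the word appears in the class) / (total words in the class). For the multivariate Bernoulli model, θ = (number of documents in the class that contain the word) / (number of documents in the class).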
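The counting can be sketched for a single class as follows. The tiny tokenized corpus is invented for illustration, and the estimates are left unsmoothed:

```python
from collections import Counter

# Hypothetical training emails of one class, already split into words.
spam_docs = [["a", "b", "a"], ["b", "b"], ["a", "c", "a", "a"]]
vocab = ["a", "b", "c"]

# Multinomial model: share of all word occurrences in the class.
word_counts = Counter(w for doc in spam_docs for w in doc)
total_words = sum(word_counts.values())
multinomial_theta = {w: word_counts[w] / total_words for w in vocab}

# Multivariate Bernoulli model: share of documents that contain the word at least once.
doc_counts = Counter(w for doc in spam_docs for w in set(doc))
bernoulli_theta = {w: doc_counts[w] / len(spam_docs) for w in vocab}

print(multinomial_theta, bernoulli_theta)
```

The same procedure is repeated for every class to obtain one parameter vector per class.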

---

### Zero frequency problem

If a word never appears with a class in training data, its probability estimate is zero, making the whole product zero.

**Solution:** Smoothing (Laplace correction):

$$
\hat{\theta} = \frac{d + 1}{n + k}
$$

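where d = number of successes (occurrences of the outcome), n = number of trials, and k = number of categories (k = 2 for a Bernoulli parameter, k = vocabulary size for a multinomial parameter).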
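A small sketch of the correction applied to multinomial word counts, where k is the vocabulary size. The counts are hypothetical:

```python
def laplace_estimate(d, n, k):
    """Smoothed estimate (d + 1) / (n + k): one pseudo-count per category."""
    return (d + 1) / (n + k)


# Hypothetical word counts for one class; 'c' was never seen in training.
counts = {"a": 5, "b": 3, "c": 0}
n = sum(counts.values())  # total number of word occurrences (trials)
k = len(counts)           # number of categories (vocabulary size)

raw = {w: d / n for w, d in counts.items()}
smoothed = {w: laplace_estimate(d, n, k) for w, d in counts.items()}
print(raw, smoothed)
```

The smoothed estimates still sum to 1, and no word keeps a probability of exactly zero.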

---


## 6. Logistic Regression

* Used for **binary classification**.
* Estimates the probability p of belonging to a class.
* Logistic function:

$$
f(x) = \frac{e^x}{1 + e^x}
$$

* Model:

$$
\hat{p}(x) = \frac{1}{1 + e^{-(w \cdot x - t)}}
$$

where w is the weight vector and t is the threshold: the predicted probability is exactly 0.5 when w · x = t.

* Uses maximum likelihood to find w and t.
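The maximum likelihood fit has no closed form, so it is found numerically. The sketch below uses plain batch gradient ascent on the log-likelihood, which is only one of several possible optimizers and is an assumption of this example; the data, learning rate, and iteration count are likewise hypothetical:

```python
import math


def predict(x, w, t):
    """p_hat(x) = 1 / (1 + exp(-(w . x - t)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) - t
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lr=0.5, epochs=5000):
    """Maximize the log-likelihood of the labels by batch gradient ascent."""
    w = [0.0] * len(xs[0])
    t = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_t = 0.0
        for x, y in zip(xs, ys):
            err = y - predict(x, w, t)  # derivative of the log-likelihood w.r.t. (w . x - t)
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_t -= err               # t enters the model with a minus sign
        w = [wj + lr * g / len(xs) for wj, g in zip(w, grad_w)]
        t += lr * grad_t / len(xs)
    return w, t


# Hypothetical 1-D data: larger x tends to mean class 1, with some overlap.
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, t = fit(xs, ys)
print(w, t, predict([0.5], w, t), predict([4.0], w, t))
```

The classes overlap on purpose: with perfectly separable data the likelihood keeps increasing as the weights grow, so the unregularized fit would not settle on finite values.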

---



## 7. Gaussian Mixture Models (GMM)

* Data generated by several Gaussian distributions with unknown labels.
* Goal: estimate parameters (means μj, covariances Σj) of each Gaussian without knowing the class labels.

---

### Chicken and egg problem:

To classify the points, we need the parameters. To estimate the parameters, we need the class labels.

---

### Expectation-Maximization (EM) algorithm:

An iterative method for solving the problem above.

1. **Initialize** parameters randomly.
2. **E-step:** Compute probabilities of each point belonging to each class given parameters.
3. **M-step:** Update parameters using these probabilities.
4. Repeat until convergence.
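The four steps can be sketched for the one-dimensional case, where each covariance Σj reduces to a single standard deviation. Several details are assumptions of this sketch rather than part of the notes: the function names, the initial σ of 1.0, the mixing weights, the small floor on σ, and stopping when the log-likelihood no longer improves. The optional `init_mus` argument is added so the usage example is reproducible:

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def em_gmm_1d(xs, k=2, init_mus=None, max_iters=200, tol=1e-9, seed=0):
    """Fit a 1-D mixture of k Gaussians by EM; returns (weights, mus, sigmas)."""
    # 1. Initialize: random data points as means unless a starting guess is supplied.
    mus = list(init_mus) if init_mus is not None else random.Random(seed).sample(xs, k)
    sigmas = [1.0] * k          # assumed starting spread; should match the data scale
    weights = [1.0 / k] * k     # mixing proportions, initially uniform
    prev_ll = -math.inf
    for _ in range(max_iters):
        # 2. E-step: probability that each point belongs to each Gaussian.
        resp, ll = [], 0.0
        for x in xs:
            joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(joint)
            resp.append([j / total for j in joint])
            ll += math.log(total)
        # 4. Stop once the log-likelihood no longer improves.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # 3. M-step: re-estimate each Gaussian from its softly assigned points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weights, mus, sigmas


# Hypothetical unlabeled data with two visible clusters.
xs = [-1.2, -0.5, -0.1, 0.0, 0.3, 0.8, 1.1, 4.6, 5.2, 5.5, 5.9, 6.0, 6.4, 7.1]
weights, mus, sigmas = em_gmm_1d(xs, k=2, init_mus=[1.0, 4.0])
print(weights, mus, sigmas)
```

EM only finds a local optimum, so in practice it is usually run from several random initializations and the fit with the highest log-likelihood is kept.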