# Generative Models for Classification

## Classification

Suppose responses $y$ are of the form "yes" or "no". Say $y \in \{+1, -1\}$.

Suppose further that each response comes with an observable $x \in D$, for some domain $D$.

We want to come up with some $g : D \to \{+1, -1\}$ that predicts $y$ from $x$.

## The generative model

Suppose the responses $y_i$ are independent and each takes one of $c$ classes.  
Given its class $y_i$, each $x_i$ has a class-specific distribution with parameter $\theta_{y_i}$.

$y_i \stackrel{iid}{\sim} \mathrm{Discrete}(p_1, \dots, p_c)$

$X_i \mid y_i \sim p(x_i \mid \theta_{y_i})$

**case 1**: we abstract away $\theta_k$ and the domain of $X_i$  
**case 2**: $X_i \in \mathbb{R}^d$, $\theta_k = (\mu_k, \Sigma_k)$, $p(x_i | \theta_k) = \mathcal{N}_d(x_i | \mu_k, \Sigma_k)$
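
To make the two-stage sampling concrete, here is a minimal sketch of drawing from the model in case 2. It assumes NumPy; the function name `sample_generative` and all parameter values are made up for illustration, and classes are indexed $0, \dots, c-1$ rather than $1, \dots, c$.

```python
import numpy as np

def sample_generative(n, p, mus, Sigmas, seed=0):
    """Draw n pairs (x_i, y_i): first y_i ~ Discrete(p), then x_i | y_i ~ N(mu_{y_i}, Sigma_{y_i})."""
    rng = np.random.default_rng(seed)
    c = len(p)
    # class labels, indexed 0..c-1 (the notes index them 1..c)
    y = rng.choice(c, size=n, p=p)
    # one Gaussian draw per label, using that label's mean and covariance
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in y])
    return X, y

# illustrative parameters for c = 3 classes in d = 2 dimensions
p = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]]), 0.5 * np.eye(2)]

X, y = sample_generative(500, p, mus, Sigmas)
print(X.shape, np.bincount(y, minlength=3) / len(y))
```

The printed class frequencies should be close to `p` for large `n`.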

### Case 1

we want to find $p(y = k \mid x)$  
$p(y = k \mid x) = \frac{p(y = k) p(x | y = k)}{\sum_j p(y = j) p(x | y = j)}$

let $a_k = \log \big( p(y = k) \, p(x \mid y = k) \big)$  
then $p(y = k | x) = \frac{e^{a_k}}{\sum_j e^{a_j}}$,
the softmax function

in the case where $k \in \{1, 2\}$, we have  
$p(y = 1 | x) = \frac{1}{1 + e^{-(a_1 - a_2)}}$  
$= \frac{1}{1 + e^{-a}}$  
where $a = a_1 - a_2$, the log odds $\log \frac{p(y = 1 | x)}{p(y = 2 | x)}$
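
A small numerical sketch of the two formulas above, using only the standard library. The log-priors and log-likelihoods are made-up numbers; subtracting $\max_j a_j$ before exponentiating is a common stability trick that leaves the ratio unchanged, not something the derivation requires.

```python
import math

def posterior(log_prior, log_lik):
    """Return p(y = k | x) for every class k via the softmax of a_k."""
    # a_k = log p(y = k) + log p(x | y = k)
    a = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    m = max(a)  # shift by the max so exp() cannot overflow
    w = [math.exp(ak - m) for ak in a]
    z = sum(w)
    return [wk / z for wk in w]

def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# two-class example with illustrative values
log_prior = [math.log(0.3), math.log(0.7)]
log_lik = [-1.2, -2.5]

post = posterior(log_prior, log_lik)
a = (log_prior[0] + log_lik[0]) - (log_prior[1] + log_lik[1])  # log odds a_1 - a_2

print(post[0], sigmoid(a))  # the two values agree up to rounding
```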

### Case 2

suppose we have two classes with identical $\Sigma_1 = \Sigma_2 = \Sigma$

then $a = \log p_1 - \log p_2 - \frac{d}{2} \log 2 \pi - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \frac{d}{2} \log 2 \pi + \frac{1}{2} \log |\Sigma| + \frac{1}{2} (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2)$  
$= \log p_1 - \log p_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 + x^\top \Sigma^{-1} (\mu_1 - \mu_2)$

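let $w_0 = \log p_1 - \log p_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2$ collect the terms that do not depend on $x$, and let $w_1 = \Sigma^{-1} (\mu_1 - \mu_2)$ be the coefficient of $x$  
then $a = w_0 + w_1^\top x$, which is a linear function of $x$, so the decision boundary $a = 0$ is a hyperplane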
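
The cancellation can be checked numerically. The sketch below assumes NumPy and uses made-up parameter values; `log_gaussian` and `linear_weights` are hypothetical helper names. It takes $w_1 = \Sigma^{-1} (\mu_1 - \mu_2)$ and $w_0$ as the $x$-free terms of the last line of the derivation, then compares $w_0 + w_1^\top x$ with the log odds computed directly from the two Gaussian log-densities.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log density of N_d(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def linear_weights(p1, p2, mu1, mu2, Sigma):
    """Return (w0, w1) with a = w0 + w1 @ x when both classes share Sigma."""
    w1 = np.linalg.solve(Sigma, mu1 - mu2)
    w0 = (np.log(p1) - np.log(p2)
          - 0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
          + 0.5 * mu2 @ np.linalg.solve(Sigma, mu2))
    return w0, w1

# illustrative values
p1, p2 = 0.4, 0.6
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.2, -0.7])

w0, w1 = linear_weights(p1, p2, mu1, mu2, Sigma)
a_linear = w0 + w1 @ x
a_direct = (np.log(p1) + log_gaussian(x, mu1, Sigma)) - (np.log(p2) + log_gaussian(x, mu2, Sigma))

print(a_linear, a_direct)  # equal up to floating-point error
```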

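if $\Sigma_1 \neq \Sigma_2$, the $x^\top \Sigma_k^{-1} x$ terms and the $\log |\Sigma_k|$ terms no longer cancel, so $a$ gains a quadratic term $-\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$ and the decision boundary is quadratic in $x$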
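
For unequal covariances, the same expansion gives a quadratic, a linear, and a constant part. The sketch below (NumPy assumed, all values made up, helper names hypothetical) writes out those three parts and checks them against the log odds computed directly from the densities.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log density of N_d(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def quadratic_log_odds(x, p1, p2, mu1, mu2, S1, S2):
    """Log odds a = a_1 - a_2 split into quadratic, linear and constant parts."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)  # precision matrices
    quad = -0.5 * x @ (P1 - P2) @ x                # the quadratic term in x
    lin = x @ (P1 @ mu1 - P2 @ mu2)                # linear term
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    const = (np.log(p1) - np.log(p2) - 0.5 * (ld1 - ld2)
             - 0.5 * mu1 @ P1 @ mu1 + 0.5 * mu2 @ P2 @ mu2)
    return quad + lin + const

# illustrative values
p1, p2 = 0.4, 0.6
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 0.5]])
x = np.array([0.2, -0.7])

a_quad = quadratic_log_odds(x, p1, p2, mu1, mu2, S1, S2)
a_direct = (np.log(p1) + log_gaussian(x, mu1, S1)) - (np.log(p2) + log_gaussian(x, mu2, S2))
print(a_quad, a_direct)  # equal up to floating-point error
```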

#### Maximum likelihood estimation

$L = \prod_k \big(\prod_{y_i = k} p_k \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

$\ell = \sum_k \sum_{y_i = k} \big( \log p_k + \log \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

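the log-likelihood is a sum over classes, so $(\mu_k, \Sigma_k)$ can be estimated separately for each class, and the $p_k$ are maximized subject to $\sum_k p_k = 1$

$\hat{p}_k = \frac{n_k}{n}$, where $n_k$ is the number of observations with $y_i = k$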

$\hat{\mu}_k = \frac{1}{n_k} \sum_{y_i = k} x_i$

$\hat{\Sigma}_k =\frac{1}{n_k} \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$
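
A minimal sketch of these three estimators, assuming NumPy, integer labels, and synthetic data; `fit_gaussian_classes` is a hypothetical name. Note the division by $n_k$ (the MLE), not $n_k - 1$.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Per-class MLEs: class proportions, class means, class covariances."""
    classes = np.unique(y)
    n = len(y)
    priors, means, covs = [], [], []
    for k in classes:
        Xk = X[y == k]                 # rows with y_i = k
        n_k = len(Xk)
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        priors.append(n_k / n)
        means.append(mu_k)
        covs.append(diff.T @ diff / n_k)  # sum of outer products divided by n_k
    return classes, np.array(priors), np.array(means), np.array(covs)

# synthetic data: 3 classes in 3 dimensions, class k shifted by k
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = rng.normal(size=(200, 3)) + y[:, None]

classes, priors, means, covs = fit_gaussian_classes(X, y)
print(priors)
print(means)
```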

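if $\Sigma_k = \Sigma$ $\forall k$, then we have the same $\hat{p}_k$, $\hat{\mu}_k$, but we get $\hat{\Sigma} = \frac{1}{n} \sum_k \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$, the pooled within-class covariance: each $x_i$ is centered at its own class mean $\hat{\mu}_k$, not at the overall mean  
this is the same as the weighted average $\sum_k \frac{n_k}{n} \hat{\Sigma}_k$, which reduces to $\frac{1}{c} \sum_k \hat{\Sigma}_k$ only when all classes have the same size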