# Generative Models for Classification

## Classification

Suppose responses $y$ are of the form "yes" or "no". Say $y \in \{+1, -1\}$.

Suppose further that with each response there is an observable $x \in D$, some domain.

We want to come up with some $g : D \to \{+1, -1\}$ that predicts $y$ from $x$.

## The generative model

Suppose each response $t_i$ is independent and can take on one of $c$ classes.  
For each $t_i$, let $x_i$ have some distribution given its corresponding $t_i$.

$t_i \stackrel{iid}{\sim} Discrete(p_1, ..., p_c)$

$X_i \mid t_i \sim P(x_i | \theta_{t_i})$

**case 1**: we abstract away $\theta_k$ and the domain of $X_i$  
**case 2**: $X_i \in \mathbb{R}^d$, $\theta_k = (\mu_k, \Sigma_k)$, $p(x_i | \theta_k) = \mathcal{N}_d(x_i | \mu_k, \Sigma_k)$

### Case 1

we want to find $p(y = k \mid x)$  
$p(y = k \mid x) = \frac{p(y = k) p(x | y = k)}{\sum_j p(y = j) p(x | y = j)}$

let $a_k = \log \big( p(y = k) \, p(x | y = k) \big)$  
then $p(y = k | x) = \frac{e^{a_k}}{\sum_j e^{a_j}}$,
the softmax function

in the case where $k \in \{1, 2\}$, we have  
$p(y = 1 | x) = \frac{1}{1 + e^{-(a_1 - a_2)}}$  
$= \frac{1}{1 + e^{-a}}$  
where $a = a_1 - a_2$, the log odds $\log \frac{p(y = 1 | x)}{p(y = 2 | x)}$
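
as a quick numerical check of the two-class reduction, here is a minimal sketch in plain Python. The scores `a1` and `a2` are made-up values, and the function names are illustrative rather than from any particular library.

```python
import math

def softmax(a):
    """Map log-joint scores a_k to posterior probabilities p(y = k | x)."""
    m = max(a)  # subtracting the max avoids overflow and leaves the ratios unchanged
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical log-joint scores a_k = log(p(y = k) p(x | y = k)) for one x.
a1 = math.log(0.3 * 0.5)
a2 = math.log(0.7 * 0.1)

posterior = softmax([a1, a2])   # [p(y = 1 | x), p(y = 2 | x)]
two_class = sigmoid(a1 - a2)    # p(y = 1 | x) via the log odds a = a1 - a2
```

`posterior[0]` and `two_class` agree up to floating-point error, which is the statement that the two-class softmax is the logistic function of the log odds.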

### Case 2

suppose we have two classes with identical $\Sigma_1 = \Sigma_2 = \Sigma$

then $a = \log p_1 - \log p_2 - \frac{d}{2} \log 2 \pi - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \frac{d}{2} \log 2 \pi + \frac{1}{2} \log |\Sigma| + \frac{1}{2} (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2)$  
$= \log p_1 - \log p_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 + x^\top \Sigma^{-1} (\mu_1 - \mu_2)$

let $w_0$ be the terms that do not depend on $x$ and $w_1$ be the coefficient of $x$  
then we have $w_0 + w_1^\top x$, which is a linear function of $x$
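
the linear form can be checked numerically. The sketch below assumes $d = 1$ (so $\Sigma$ is a scalar variance) and uses made-up parameter values; it compares $\sigma(w_0 + w_1 x)$ against the posterior computed directly from Bayes' rule.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical parameters: two classes, d = 1, shared variance.
p1, p2 = 0.4, 0.6
mu1, mu2 = 1.0, -0.5
var = 2.0

# w0 collects the terms without x; w1 is the coefficient of x.
w0 = math.log(p1) - math.log(p2) - 0.5 * mu1 ** 2 / var + 0.5 * mu2 ** 2 / var
w1 = (mu1 - mu2) / var

def posterior_linear(x):
    """p(y = 1 | x) from the logistic function of the linear log odds."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def posterior_bayes(x):
    """p(y = 1 | x) directly from Bayes' rule with Gaussian class densities."""
    j1 = p1 * normal_pdf(x, mu1, var)
    j2 = p2 * normal_pdf(x, mu2, var)
    return j1 / (j1 + j2)
```

for any `x`, `posterior_linear(x)` and `posterior_bayes(x)` should match up to rounding, because the $x^2$ terms cancel when the variance is shared.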

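if $\Sigma_1 \neq \Sigma_2$, the $x^\top \Sigma^{-1} x$ terms no longer cancel and we then have a quadratic term $-\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$, so $a$ is quadratic rather than linear in $x$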

#### Maximum likelihood estimation

$L = \prod_k \big(\prod_{y_i = k} p_k \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

$\ell = \sum_k \sum_{y_i = k} \big( \log p_k + \log \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

so we can treat each class separately and get separate MLEs for each class

$\hat{p}_k = \frac{n_k}{n}$

$\hat{\mu}_k = \frac{1}{n_k} \sum_{y_i = k} x_i$

$\hat{\Sigma}_k =\frac{1}{n_k} \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$
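
the per-class estimates above can be sketched in plain Python as follows. The function name `fit_gaussian_classes` and the tiny data set are made up for illustration; points are stored as lists of coordinates.

```python
from collections import defaultdict

def fit_gaussian_classes(xs, ys):
    """Per-class MLEs: prior p_k, mean mu_k, covariance Sigma_k (1/n_k normalisation)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n, d = len(xs), len(xs[0])
    params = {}
    for k, pts in groups.items():
        n_k = len(pts)
        # Class mean: coordinate-wise average of the points labelled k.
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        # Class covariance: average outer product of the centred points.
        sigma = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in pts) / n_k
                  for b in range(d)] for a in range(d)]
        params[k] = (n_k / n, mu, sigma)
    return params

# Made-up data: two classes in R^2.
xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]]
ys = [1, 1, 1, 1, 2, 2]
params = fit_gaussian_classes(xs, ys)
```

`params[k]` holds the triple $(\hat{p}_k, \hat{\mu}_k, \hat{\Sigma}_k)$ for class $k$; each class is fitted using only its own points.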

if $\Sigma_k = \Sigma$ $\forall k$, then we have the same $\hat{p}_k$, $\hat{\mu}_k$, but we get the pooled estimate $\hat{\Sigma} = \frac{1}{n} \sum_k \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$, where each point is centred at its own class mean  
this is the same as $\sum_k \frac{n_k}{n} \hat{\Sigma}_k$, a weighted average of the per-class estimates

### Back to the generative model

let score $y_i = w^\top \phi(x_i)$ (from here on $y_i$ denotes a score or fitted value, and $t_i$ the class label)

for now let $\phi(x_i) = \begin{bmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{id} \end{bmatrix}$ (the leading $1$ carries the intercept $w_0$; can also include nonlinear terms in general)

$w = f(p_k, \mu_k, \Sigma_k)$, i.e. $w$ is a function of the generative parameters

so our estimation scheme looks like  
data $\to$ $p_k, \mu_k, \Sigma_k$ $\to$ $w$ $\to$ $\hat{y}_i$ $\to$ $\hat{t}_i$

we can skip some intermediate steps and just estimate $w$ from the data directly

## Logistic regression

$t_i \stackrel{indep}{\sim} Bernoulli(\cdot)$

where the parameter is some function of $w^\top \phi(x_i)$

assumption on how the data are generated: $p_i = \sigma(w^\top \phi(x_i))$

note that we do not impose any distribution on the data matrix $\Phi$

### Maximum likelihood estimation

$L = \prod_i y_i^{t_i} (1 - y_i)^{1 - t_i}$

where $y_i = \sigma(w^\top \phi(x_i))$

$\ell = \sum_i t_i \log y_i + \sum_i (1 - t_i) \log (1 - y_i)$

to take the derivative:  
$\sigma'(x) = \frac{e^{-x}}{1 + e^{-x}} \frac{1}{1 + e^{-x}}$
$= \sigma(x) (1 - \sigma(x))$

then we have

$\partial_w \ell = \sum_i \frac{t_i}{y_i} y_i (1 - y_i) \phi(x_i) - 
\sum_i (1 - t_i) \frac{1}{1 - y_i} y_i (1 - y_i) \phi(x_i)$

$= \sum_i \phi(x_i) \big( t_i (1 - y_i) - (1 - t_i) y_i \big)$  
$= \sum_i \phi(x_i) \big( t_i - t_i y_i - y_i + y_i t_i \big)$  
$= \sum_i \phi(x_i) (t_i - y_i)$  
$= \Phi^\top (t - y)$, which we set to $0$ to find the maximum

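we can't solve $\Phi^\top (t - y) = 0$ analytically, since $y$ depends on $w$ through $\sigma$, so we maximize $\ell$ iteratively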
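
the gradient formula $\Phi^\top (t - y)$ can be sanity-checked against finite differences. This is a minimal sketch: the data, the test point `w`, and the feature map $\phi(x) = (1, x)$ with a constant term are all assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_lik(w, Phi, t):
    """l(w) = sum_i t_i log y_i + (1 - t_i) log(1 - y_i)."""
    total = 0.0
    for phi, ti in zip(Phi, t):
        y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
        total += ti * math.log(y) + (1 - ti) * math.log(1 - y)
    return total

def analytic_grad(w, Phi, t):
    """Phi^T (t - y), accumulated one row of Phi at a time."""
    g = [0.0] * len(w)
    for phi, ti in zip(Phi, t):
        y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
        for j, pj in enumerate(phi):
            g[j] += pj * (ti - y)
    return g

def numeric_grad(w, Phi, t, h=1e-6):
    """Central finite differences of log_lik, one coordinate at a time."""
    g = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        g.append((log_lik(up, Phi, t) - log_lik(down, Phi, t)) / (2 * h))
    return g

# Made-up data; each row of Phi is phi(x_i) = (1, x_i).
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [0, 0, 1, 0, 1, 1]
w = [0.3, -0.7]  # arbitrary point at which to compare the two gradients

g_exact = analytic_grad(w, Phi, t)
g_approx = numeric_grad(w, Phi, t)
```

the two gradients should agree closely; a large gap would point to a slip in the derivation or in the code.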

#### Gradient ascent

1. initialize $w_0$
2. choose step size $\eta$
3. until convergence, do $w_{i+1} = w_i + \eta \nabla_w f(w_i)$

variants

* stochastic gradient ascent (compute gradient from subset)
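
a minimal sketch of the (full-batch) gradient ascent loop, applied to the logistic regression log-likelihood with gradient $\Phi^\top (t - y)$. The data, the feature map $\phi(x) = (1, x)$, the step size `eta = 0.1`, and the stopping rule are all assumptions chosen for illustration; a step size that is too large can make the iteration diverge.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, Phi, t):
    """Gradient of the log-likelihood, Phi^T (t - y)."""
    g = [0.0] * len(w)
    for phi, ti in zip(Phi, t):
        y = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
        for j, pj in enumerate(phi):
            g[j] += pj * (ti - y)
    return g

def gradient_ascent(Phi, t, eta=0.1, tol=1e-8, max_iter=10000):
    """Repeat w <- w + eta * gradient until the gradient is nearly zero."""
    w = [0.0] * len(Phi[0])          # step 1: initialize
    for _ in range(max_iter):
        g = gradient(w, Phi, t)
        w = [wj + eta * gj for wj, gj in zip(w, g)]  # step 3: ascent update
        if max(abs(gj) for gj in g) < tol:
            break
    return w

# Made-up, non-separable data, so the maximum likelihood estimate is finite.
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [0, 0, 1, 0, 1, 1]
w_hat = gradient_ascent(Phi, t)
```

at the returned `w_hat` the gradient is close to zero, i.e. $\Phi^\top (t - y) \approx 0$. A stochastic variant would compute `gradient` on a random subset of the rows at each step.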

#### Newton-Raphson

1. initialize $w_0$
2. until convergence, do $w_{i+1} = w_i - H(w_i)^{-1} \nabla f(w_i)$

where $H$ is the hessian $H(w) = \nabla \nabla^\top f(w)$  
$H_{ij} = \frac{\partial^2 f}{\partial w_i \partial w_j}$

**e.g.** 

let $f(x) = (x - 2) (x - 3)$  
$= x^2 - 5x + 6$

roots at $2$ and $3$

$f'(x) = 2x - 5$  
$f''(x) = 2$

let $x_0 = 4$  
then $x_1 = 4 - \frac{2 (4) - 5}{2} = 2.5$  
then $x_2 = 2.5 - \frac{0}{2} = 2.5$

so $f(x)$ has an optimum at 2.5 (a minimum, since $f'' > 0$), midway between the roots
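
the same iteration as a short sketch; `newton_1d` is an illustrative helper, and the derivatives are passed in as plain functions.

```python
def newton_1d(grad, hess, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson for a stationary point: x <- x - f'(x) / f''(x)."""
    x = x0
    path = [x]
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        path.append(x)
        if abs(step) < tol:
            break
    return x, path

# f(x) = x^2 - 5x + 6, so f'(x) = 2x - 5 and f''(x) = 2.
x_opt, path = newton_1d(lambda x: 2 * x - 5, lambda x: 2.0, x0=4.0)
# path == [4.0, 2.5, 2.5]
```

note that the iteration finds where $f' = 0$, not where $f = 0$, which is why it lands on 2.5 and not on a root.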

**e.g.**

let $f(x) = 5 x_1^2 + 6 x_1 x_2 + 3 x_2^2$

then $\nabla f = \begin{bmatrix} 10 x_1 + 6 x_2 \\ 6 x_1 + 6 x_2 \end{bmatrix}$

then $H = \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}$

let $x^{(0)} = (1, 1)$

then $x^{(1)} = 
\begin{bmatrix} 1 \\ 1 \end{bmatrix} - 
\begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}^{-1} 
\begin{bmatrix} 16 \\ 12 \end{bmatrix} = 
\begin{bmatrix} 0 \\ 0 \end{bmatrix}$

since $f$ is quadratic, we only need one step to get to the optimum
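
the single step can be reproduced with a small sketch; `solve_2x2` is an illustrative helper that solves the linear system instead of forming $H^{-1}$ explicitly.

```python
def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def grad_f(x):
    """Gradient of f(x) = 5 x1^2 + 6 x1 x2 + 3 x2^2."""
    return [10 * x[0] + 6 * x[1], 6 * x[0] + 6 * x[1]]

H = [[10.0, 6.0], [6.0, 6.0]]  # constant Hessian of the quadratic

x0 = [1.0, 1.0]
step = solve_2x2(H, grad_f(x0))            # H^{-1} grad f(x0)
x1 = [x0[0] - step[0], x0[1] - step[1]]    # one Newton step
# x1 == [0.0, 0.0]
```

the gradient at `x1` is zero, so a second step would not move.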

### Back to logistic regression

$\nabla_w \ell = \Phi^\top (t - y)$

$H = -\sum_i \phi(x_i) y_i (1 - y_i) \phi(x_i)^\top$  
$= -\sum_i \phi(x_i) \sqrt{y_i (1 - y_i)} \sqrt{y_i (1 - y_i)} \phi(x_i)^\top$  
$= -\Phi^\top R \Phi$

where $R = diag(y_i (1 - y_i))$

### Newton-Raphson for logistic regression

1. initialize $w^{(0)}$
2. until convergence, do $w^{(k+1)} = w^{(k)} + (\Phi^\top R \Phi)^{-1} \Phi^\top (t - y)$, with $y$ and $R$ evaluated at $w^{(k)}$
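
a minimal sketch of this iteration, built from the generic Newton update $w - H^{-1} \nabla_w \ell$ with $\nabla_w \ell = \Phi^\top (t - y)$ and $H = -\Phi^\top R \Phi$. It assumes two parameters (feature map $\phi(x) = (1, x)$) so that the linear system can be solved by hand; the data and the function name `newton_logistic` are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def newton_logistic(Phi, t, tol=1e-10, max_iter=25):
    """Newton-Raphson for a two-parameter logistic regression."""
    w = [0.0, 0.0]
    for _ in range(max_iter):
        y = [sigmoid(w[0] * p[0] + w[1] * p[1]) for p in Phi]
        # Gradient Phi^T (t - y).
        g = [sum(p[j] * (ti - yi) for p, ti, yi in zip(Phi, t, y))
             for j in range(2)]
        # A = Phi^T R Phi with R = diag(y_i (1 - y_i)); the Hessian is -A.
        r = [yi * (1 - yi) for yi in y]
        A = [[sum(ri * p[a] * p[b] for p, ri in zip(Phi, r)) for b in range(2)]
             for a in range(2)]
        # Solve A * step = g by Cramer's rule, so step = -H^{-1} g.
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        step = [(g[0] * A[1][1] - A[0][1] * g[1]) / det,
                (A[0][0] * g[1] - A[1][0] * g[0]) / det]
        w = [w[0] + step[0], w[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return w

# Made-up, non-separable data; each row of Phi is phi(x_i) = (1, x_i).
Phi = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [0, 0, 1, 0, 1, 1]
w_hat = newton_logistic(Phi, t)
```

on data like this the iteration typically settles in a handful of steps, far fewer than gradient ascent needs. If the classes were perfectly separable, the weights would grow without bound and $\Phi^\top R \Phi$ would become nearly singular, so this sketch would need a safeguard.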