# Generative Models for Classification

### Classification

Suppose responses $y$ are of the form "yes" or "no". Say $y \in \{+1, -1\}$.

Suppose further that with each response there is an observable $x \in D$, some domain.

We want to come up with some $g : D \to \{+1, -1\}$ that predicts $y$ from $x$.

## The generative model

Suppose each response $t_i$ is independent and can take on one of $c$ classes.  
For each $t_i$, let $x_i$ have some distribution given its corresponding $t_i$.

$t_i \stackrel{iid}{\sim} Discrete(p_1, ..., p_c)$

$X_i \mid t_i \sim P(x_i | \theta_{t_i})$

**case 1**: we abstract away $\theta_k$ and the domain of $X_i$  
**case 2**: $X_i \in \mathbb{R}^d$, $\theta_k = (\mu_k, \Sigma_k)$, $p(x_i | \theta_k) = \mathcal{N}_d(x_i | \mu_k, \Sigma_k)$

### Case 1

we want to find $p(y = k \mid x)$  
$p(y = k \mid x) = \frac{p(y = k) p(x | y = k)}{\sum_j p(y = j) p(x | y = j)}$

let $a_k = \log \big( p(y = k) \, p(x | y = k) \big)$  
then $p(y = k | x) = \frac{e^{a_k}}{\sum_j e^{a_j}}$,
the softmax function

in the case where $k \in \{1, 2\}$, we have  
$p(y = 1 | x) = \frac{1}{1 + e^{-(a_1 - a_2)}}$  
$= \frac{1}{1 + e^{-a}}$  
where $a = a_1 - a_2$, the log odds $\log \frac{p(y = 1 | x)}{p(y = 2 | x)}$
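The softmax and its two-class sigmoid special case can be checked numerically. The following is a minimal sketch; the prior and likelihood values are made up for illustration, and `posterior_from_log_joint` is a hypothetical helper name.

```python
import numpy as np

def posterior_from_log_joint(a):
    """Softmax of a_k = log p(y=k) + log p(x|y=k), computed stably."""
    a = np.asarray(a, dtype=float)
    shifted = a - a.max()        # subtracting a constant leaves the softmax unchanged
    e = np.exp(shifted)
    return e / e.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# hypothetical priors p(y=k) and likelihood values p(x|y=k) for one fixed x
prior = np.array([0.3, 0.7])
lik = np.array([0.05, 0.01])
a = np.log(prior) + np.log(lik)

post = posterior_from_log_joint(a)   # p(y=k | x) for k = 1, 2
two_class = sigmoid(a[0] - a[1])     # sigmoid of the log odds
```

With two classes, `post[0]` and `two_class` agree, which is the identity $p(y = 1 | x) = \frac{1}{1 + e^{-a}}$.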

### Case 2

suppose we have two classes with identical $\Sigma_1 = \Sigma_2 = \Sigma$

then $a = \log p_1 - \log p_2 - \frac{d}{2} \log 2 \pi - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \frac{d}{2} \log 2 \pi + \frac{1}{2} \log |\Sigma| + \frac{1}{2} (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2)$  
$= \log p_1 - \log p_2 - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 + x^\top \Sigma^{-1} (\mu_1 - \mu_2)$

let $w_0$ be the terms that do not depend on $x$ and $w_1$ be the coefficient of $x$  
then we have $w_0 + w_1^\top x$, which is a linear function of $x$

if $\Sigma_1 \neq \Sigma_2$, we then have a quadratic term $-\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$
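As a sanity check on the shared-covariance algebra, the sketch below computes $w_0$ and $w_1$ and compares $w_0 + w_1^\top x$ with the log odds evaluated directly from the two Gaussian densities. All parameter values are hypothetical, as are the helper names `lda_weights` and `log_gauss`.

```python
import numpy as np

def lda_weights(p1, p2, mu1, mu2, Sigma):
    """Intercept w0 and slope w1 of the log odds when both classes share Sigma."""
    Sinv = np.linalg.inv(Sigma)
    w1 = Sinv @ (mu1 - mu2)
    w0 = (np.log(p1) - np.log(p2)
          - 0.5 * mu1 @ Sinv @ mu1
          + 0.5 * mu2 @ Sinv @ mu2)
    return w0, w1

def log_gauss(x, mu, Sigma):
    """Log density of N_d(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

# hypothetical parameters and query point
p1, p2 = 0.4, 0.6
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.2, -0.7])

w0, w1 = lda_weights(p1, p2, mu1, mu2, Sigma)
a_linear = w0 + w1 @ x
a_direct = (np.log(p1) + log_gauss(x, mu1, Sigma)) - (np.log(p2) + log_gauss(x, mu2, Sigma))
```

The two values of $a$ match up to floating-point error, because the normalizing constants and the $x^\top \Sigma^{-1} x$ terms cancel.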

#### Maximum likelihood estimation

$L = \prod_k \big(\prod_{y_i = k} p_k \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

$\ell = \sum_k \sum_{y_i = k} \big( \log p_k + \log \mathcal{N}(x_i | \mu_k, \Sigma_k) \big)$

so we can treat each class separately and get separate MLEs for each class

$\hat{p}_k = \frac{n_k}{n}$

$\hat{\mu}_k = \frac{1}{n_k} \sum_{y_i = k} x_i$

$\hat{\Sigma}_k =\frac{1}{n_k} \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$

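if $\Sigma_k = \Sigma$ $\forall k$, then we have the same $\hat{p}_k$, $\hat{\mu}_k$, but we get the pooled within-class estimate $\hat{\Sigma} = \frac{1}{n} \sum_k \sum_{y_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^\top$, where each point is centered at its own class mean $\hat{\mu}_k$  
this is the same as the weighted average $\sum_k \frac{n_k}{n} \hat{\Sigma}_k$, which reduces to $\frac{1}{K} \sum_k \hat{\Sigma}_k$ only when all classes have the same size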
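A minimal sketch of these estimators on synthetic data follows. The function name `fit_gaussian_classes` and the data are assumptions for illustration; the shared-covariance estimate is computed as the within-class scatter divided by $n$, i.e. $\sum_k \frac{n_k}{n} \hat{\Sigma}_k$.

```python
import numpy as np

def fit_gaussian_classes(X, labels):
    """MLEs of class priors, means, per-class covariances, and the pooled covariance."""
    n, d = X.shape
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((d, d))
    for k in np.unique(labels).tolist():
        Xk = X[labels == k]
        n_k = len(Xk)
        mu_k = Xk.mean(axis=0)
        centered = Xk - mu_k
        S_k = centered.T @ centered / n_k   # MLE divides by n_k, not n_k - 1
        priors[k], means[k], covs[k] = n_k / n, mu_k, S_k
        pooled += (n_k / n) * S_k           # weighted average of class covariances
    return priors, means, covs, pooled

# synthetic two-class data (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(3.0, 1.0, size=(70, 2))])
labels = np.array([0] * 30 + [1] * 70)

priors, means, covs, pooled = fit_gaussian_classes(X, labels)
```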

### Back to the generative model

let score $y_i = w^\top \phi(x_i)$

for now let $\phi(x_i) = \begin{bmatrix} x_{i1} \\ \vdots \\ x_{id} \end{bmatrix}$ (can also include nonlinear terms in general)

$w = f(p_k, \mu_k, \Sigma_k)$, i.e. the weights are a function of the class priors, means, and covariances

so our estimation scheme looks like  
data $\to$ $p_k, \mu_k, \Sigma_k$ $\to$ $w$ $\to$ $\hat{y}_i$ $\to$ $\hat{t}_i$

we can skip some intermediate steps and just estimate $w$ from the data directly

## Logistic regression

$t_i \stackrel{indep}{\sim} Bernoulli(\cdot)$

where the parameter is some function of $w^\top \phi(x_i)$

assumption on how the data are generated: $p_i = \sigma(w^\top \phi(x_i))$

note that we do not impose any distribution on the data matrix $\Phi$

### Maximum likelihood estimation

$L = \prod_i y_i^{t_i} (1 - y_i)^{1 - t_i}$

where $y_i = \sigma(w^\top \phi(x_i))$

$\ell = \sum_i t_i \log y_i + \sum_i (1 - t_i) \log (1 - y_i)$

to take the derivative:  
$\sigma'(x) = \frac{e^{-x}}{1 + e^{-x}} \frac{1}{1 + e^{-x}}$
$= \sigma(x) (1 - \sigma(x))$

then we have

$\partial_w \ell = \sum_i \frac{t_i}{y_i} y_i (1 - y_i) \phi(x_i) - 
\sum_i (1 - t_i) \frac{1}{1 - y_i} y_i (1 - y_i) \phi(x_i)$

$= \sum_i \phi(x_i) \big( t_i (1 - y_i) - (1 - t_i) y_i \big)$  
$= \sum_i \phi(x_i) \big( t_i - t_i y_i - y_i + y_i t_i \big)$  
$= \sum_i \phi(x_i) (t_i - y_i)$  
$= \Phi^\top (t - y) = 0$

can't solve this analytically

#### Gradient ascent

1. initialize $w_0$
2. choose step size $\eta$
3. until convergence, do $w_{i+1} = w_i + \eta \nabla_w f(w_i)$

variants

* stochastic gradient ascent (compute gradient from subset)
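The gradient ascent loop can be sketched for the logistic log likelihood, using $\nabla_w \ell = \Phi^\top (t - y)$ from above. The data, the step size `eta`, and the stopping rule are assumptions chosen for this small example, not recommended defaults.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_lik(w, Phi, t):
    y = sigmoid(Phi @ w)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient_ascent(Phi, t, eta=0.01, tol=1e-8, max_iter=50_000):
    """Full-batch gradient ascent on the logistic log likelihood."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        grad = Phi.T @ (t - sigmoid(Phi @ w))
        if np.linalg.norm(grad) < tol:
            break
        w = w + eta * grad
    return w

# synthetic data with an intercept column (hypothetical)
rng = np.random.default_rng(0)
n = 200
Phi = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([-0.5, 1.5])
t = (rng.uniform(size=n) < sigmoid(Phi @ w_true)).astype(float)

w_hat = gradient_ascent(Phi, t)
```

A stochastic variant would replace `Phi` and `t` inside the loop with a random subset of rows at each iteration.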

#### Newton-Raphson

1. initialize $w_0$
2. until convergence, do $w_{i+1} = w_i - H(w_i)^{-1} \nabla f(w_i)$

where $H$ is the hessian $H(w) = \nabla \nabla^\top f(w)$  
$H_{ij} = \frac{\partial^2 f}{\partial w_i \partial w_j}$

**e.g.** 

let $f(x) = (x - 2) (x - 3)$  
$= x^2 - 5x + 6$

roots at $2$ and $3$

$f'(x) = 2x - 5$  
$f''(x) = 2$

let $x_0 = 4$  
then $x_1 = 4 - \frac{2 (4) - 5}{2} = 2.5$  
then $x_2 = 2.5 - \frac{0}{2} = 2.5$

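so $f(x)$ has an optimum (a minimum, since $f'' > 0$) at 2.5; note that the iteration solves $f'(x) = 0$, so it finds the stationary point rather than the roots of $f$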

**e.g.**

let $f(x) = 5 x_1^2 + 6 x_1 x_2 + 3 x_2^2$

then $\nabla f = \begin{bmatrix} 10 x_1 + 6 x_2 \\ 6 x_1 + 6 x_2 \end{bmatrix}$

then $H = \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}$

let $x^{(0)} = (1, 1)$

then $x^{(1)} = 
\begin{bmatrix} 1 \\ 1 \end{bmatrix} - 
\begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}^{-1} 
\begin{bmatrix} 16 \\ 12 \end{bmatrix} = 
\begin{bmatrix} 0 \\ 0 \end{bmatrix}$

since $f$ is quadratic, we only need one step to get to the optimum
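Both worked examples can be reproduced with a few lines of code. This is a sketch of the plain Newton-Raphson iteration with no step-size control; the helper name `newton` and the tolerance are assumptions.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson for a stationary point: x <- x - H(x)^{-1} grad(x)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    steps = 0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
        steps += 1
    return x, steps

# 1-D example: f(x) = x^2 - 5x + 6, f'(x) = 2x - 5, f''(x) = 2
x_1d, steps_1d = newton(lambda x: 2 * x - 5,
                        lambda x: np.array([[2.0]]),
                        4.0)

# 2-D example: f(x) = 5 x1^2 + 6 x1 x2 + 3 x2^2, gradient H x, constant Hessian H
H = np.array([[10.0, 6.0], [6.0, 6.0]])
x_2d, steps_2d = newton(lambda x: H @ x, lambda x: H, [1.0, 1.0])
```

In both cases the iteration stops after a single update, as expected for quadratic objectives.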

### back to logistic regression

$\nabla_w \ell = \Phi^\top (t - y)$

$H = -\sum_i \phi(x_i) y_i (1 - y_i) \phi(x_i)^\top$  
$= -\sum_i \phi(x_i) \sqrt{y_i (1 - y_i)} \sqrt{y_i (1 - y_i)} \phi(x_i)^\top$  
$= -\Phi^\top R \Phi$

where $R = diag(y_i (1 - y_i))$

### Newton-Raphson for logistic regression

1. initialize $w^{(0)}$
2. until convergence, do $w^{(i+1)} = w^{(i)} + (\Phi^\top R \Phi)^{-1} \Phi^\top (t - y)$, which is $w^{(i)} - H^{-1} \nabla_w \ell$ with $R$ and $y$ evaluated at $w^{(i)}$
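A minimal sketch of this iteration, written as $w \leftarrow w - H^{-1} \nabla_w \ell$ with $\nabla_w \ell = \Phi^\top (t - y)$ and $-H = \Phi^\top R \Phi$. The synthetic data and the function name `logistic_newton` are assumptions, and no safeguard is included for separable data, where the maximum likelihood estimate does not exist.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_newton(Phi, t, tol=1e-10, max_iter=25):
    """Newton-Raphson for the logistic regression MLE."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y)                 # gradient of the log likelihood
        if np.linalg.norm(grad) < tol:
            break
        R = y * (1 - y)                        # diagonal of R
        neg_H = Phi.T @ (R[:, None] * Phi)     # Phi^T R Phi = -H
        w = w + np.linalg.solve(neg_H, grad)   # w - H^{-1} grad
    return w

# synthetic data (hypothetical)
rng = np.random.default_rng(0)
n = 200
Phi = np.column_stack([np.ones(n), rng.normal(size=n)])
t = (rng.uniform(size=n) < sigmoid(Phi @ np.array([-0.5, 1.5]))).astype(float)

w_hat = logistic_newton(Phi, t)
```

Using `np.linalg.solve` avoids forming the inverse of $\Phi^\top R \Phi$ explicitly.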

#### Other types of models

* ordinal regression
* count regression
* proportion regression (e.g., beta regression)
* robust regression
* ...

would be good to have one model/framework for all types of predictions

### Exponential family of distributions

$X \mid \eta \sim P(x | \eta) \in \text{exponential family}$ iff $p(x | \eta) = h(x) g(\eta) e^{\eta^\top u(x)}$

**def** $\eta \in \mathbb{R}^d$ are the *natural parameters*

**def** $u(x) \in \mathbb{R}^d$ are the *sufficient statistics*

**def** $\theta = E[u(X)] \in \mathbb{R}^d$ are the *expectation parameters*

**theorem** there exists a 1-1 function $\psi$ such that $\eta = \psi(\theta)$ and $\theta = \psi^{-1}(\eta)$ if $u$ are linearly independent

**e.g.** bernoulli distribution

$f(x) = p^x (1-p)^{1-x}$  
$= e^{x \log p} e^{(1-x) \log (1-p)}$  
$= e^{x \log p} e^{\log (1-p)} e^{-x \log (1-p)}$  
$= (1-p) e^{x \log \frac{p}{1-p}}$

$u(x) = x$  
$h(x) = 1$  
$\eta = \log \frac{p}{1-p}$

we can see that $1 - p = \frac{1}{1 + e^\eta}$,  
so $g(\eta) = \frac{1}{1 + e^\eta}$

$f(x) = \frac{1}{1 + e^\eta} e^{\eta x}$

$p$ is the *standard parameter*

**e.g.** normal distribution

$f(x) = (2 \pi \sigma^2)^{-1/2} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$  
$= (2 \pi \sigma^2)^{-1/2} e^{-\frac{\mu^2}{2 \sigma^2} - \frac{x^2}{2 \sigma^2} + \frac{\mu x}{\sigma^2}}$

$h(x) = 1$  
$\eta_1 = \frac{\mu}{\sigma^2}$  
$\eta_2 = -\frac{1}{2 \sigma^2}$  
$u_1(x) = x$  
$u_2(x) = x^2$  
$g(\eta_1, \eta_2) = (-\eta_2 / \pi)^{1/2} e^{\frac{\eta_1^2}{4 \eta_2}}$

**e.g.** poisson distribution

$f(x) = \frac{1}{x!} \lambda^x e^{-\lambda}$

$\eta = \log \lambda$  
$\lambda = e^{\eta}$  
$u(x) = x$

**theorem**

1. if $X_1, ..., X_n \stackrel{iid}{\sim} P(x)$ where $P$ is in the exponential family, then $L(\eta) = \Big(\prod_i h(x_i)\Big) g(\eta)^n e^{\eta^\top \sum_i u(x_i)}$

2. $\theta = E[u(X)] = -\frac{\partial}{\partial \eta} \log g(\eta)$  
$Cov(u(X)) = E\big[ (u(X) - \theta) (u(X) - \theta)^\top \big] = -\frac{\partial^2}{\partial \eta \partial \eta^\top} \log g(\eta)$

3. if the sample is of size 1, then the maximum likelihood estimate occurs when $u(x) = \theta$  
if sample is iid, then $\frac{1}{n} \sum_i u(X_i)$ is the maximum likelihood estimate for $\theta$

**corollary** $\hat{\theta}_{MLE} = \frac{1}{n} \sum u(X_i)$

in the exponential family, the maximum likelihood estimate coincides with our intuition
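Part 2 of the theorem can be checked numerically by finite differences on $\log g(\eta)$. The sketch below uses $g(\eta) = \frac{1}{1 + e^\eta}$ for the Bernoulli and $g(\eta) = e^{-e^\eta}$ for the Poisson (which follows from $\lambda = e^\eta$); the step size `h`, the parameter values, and the sample are assumptions.

```python
import numpy as np

def log_g_bernoulli(eta):
    return -np.log1p(np.exp(eta))      # log of 1 / (1 + e^eta)

def log_g_poisson(eta):
    return -np.exp(eta)                # log of exp(-e^eta)

def neg_derivs(log_g, eta, h=1e-4):
    """Central finite differences for -d/d eta and -d^2/d eta^2 of log g."""
    d1 = (log_g(eta + h) - log_g(eta - h)) / (2 * h)
    d2 = (log_g(eta + h) - 2 * log_g(eta) + log_g(eta - h)) / h**2
    return -d1, -d2

p = 0.3
mean_b, var_b = neg_derivs(log_g_bernoulli, np.log(p / (1 - p)))   # expect p, p(1-p)

lam = 2.5
mean_p, var_p = neg_derivs(log_g_poisson, np.log(lam))             # expect lam, lam

# corollary: the MLE of theta is the sample mean of u(X) = X
rng = np.random.default_rng(0)
x = rng.poisson(lam, size=1000)
theta_hat = x.mean()
```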

**e.g.** bernoulli distribution

$u(X) = X$

$E[X] = p$

so $\hat{p}_{MLE} = \frac{1}{n} \sum X_i$

**proof of (3)**

recall that $L(\eta) \propto g(\eta)^n e^{\eta^\top \sum_i u(x_i)}$  
$\implies \ell(\eta) = n \log g(\eta) + \eta^\top \sum_i u(x_i) + C$  
$\implies \ell'(\eta) = n \partial_\eta \log g(\eta) + \sum_i u(x_i)$  
$\implies$ maximum likelihood estimate occurs when 
$\frac{1}{n} \sum_i u(X_i) = -\partial_\eta \log g(\eta) = \theta$ (by part 2)

**proof of (2)**

$\int h(x) g(\eta) e^{\eta^\top u(x)} dx = 1$ since this is a pdf

then taking the derivative of both sides w.r.t. $\eta$:  
$(\partial_\eta g(\eta)) \int h(x) e^{\eta^\top u(x)} dx + g(\eta) \int h(x) e^{\eta^\top u(x)} u(x) dx = 0$  
$\implies (\partial_\eta g(\eta)) (g(\eta))^{-1} + E[u(X)] = 0$, using $\int h(x) e^{\eta^\top u(x)} dx = (g(\eta))^{-1}$  
$\implies -E[u(X)] = \partial_\eta \log g(\eta)$

**proof of (2) cont'd**

we have
$\partial_{\eta_i} \log g(\eta) = -g(\eta) \int h(x) e^{\eta^\top u(x)} u_i(x) dx$

taking the derivative w.r.t $\eta_j$:

$\partial_{\eta_i} \partial_{\eta_j} \log g(\eta) = -\partial_{\eta_j} g(\eta) \int h(x) e^{\eta^\top u(x)} u_i(x) dx - g(\eta) \int h(x) e^{\eta^\top u(x)} u_i(x) u_j(x) dx$  
$= g(\eta)E[u_j(X)] (g(\eta))^{-1} E[u_i(X)] - E[u_i(X) u_j(X)]$  
$= -Cov(u_i(X), u_j(X))$

#### Scaled exponential family

$\eta \in \mathbb{R}$, $u(X) = X$

$p(x) = \frac{1}{s} h(\frac{x}{s}) g(\eta) e^{\frac{1}{s} \eta x}$

then $-\partial_\eta \log g(\eta) = \frac{1}{s} E[X] = \frac{1}{s} E[u(X)] = \frac{\theta}{s}$  
and $-\partial_\eta^2 \log g(\eta) = \frac{Var(X)}{s^2}$

**e.g.** Bernoulli

$p(x) = \sigma(-\eta) e^{\eta x}$  
$\theta = \mu = E[X]$  
$\eta = \log \frac{\mu}{1 - \mu}$, $\mu = \sigma(\eta)$

so this is in the scaled exponential family with $s = 1$

**e.g.** Poisson

$p(x) = \frac{1}{x!} e^{-e^\eta} e^{\eta x}$  
$\eta = \psi(\lambda) = \log \lambda$  
$\lambda = \theta = e^\eta$

so again $s=1$

**e.g.** normal

$p(x) = (2 \pi \sigma^2)^{-1/2} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$

let $s = \sigma^2$  
then $p(x) = (2 \pi s)^{-1/2} e^{-\frac{1}{2s} x^2} e^{-\frac{\mu^2}{2s}} e^{\frac{1}{s} \mu x}$  

then $\eta = \mu$, so $\theta = E[X] = \mu = \eta$ and $\psi$ is the identity

### Count/Poisson regression

Since our responses $t_i$ are counts, a plausible model is $t_i \sim Poisson(y_i)$ where $y_i > 0$. One function that forces positive values is $y_i = e^{a_i}$ where we can say $a_i$ is our linear combination $a_i = w^\top \phi(x_i)$. Then we get

$p(t_i | y_i) = \frac{y_i^{t_i} e^{-y_i}}{t_i!}$

In order to find $\hat{w}_{MLE}$, we use Newton-Raphson:

$\ell(w) = \sum_i t_i \log y_i - \sum_i y_i - \sum_i \log t_i !$  
$= \sum_i t_i a_i - \sum_i e^{a_i} - \sum_i \log t_i !$

$\implies \nabla_w \ell = \sum_i t_i \phi(x_i) - \sum_i e^{a_i} \phi(x_i)$  
$= \sum_i (t_i - y_i) \phi(x_i)$  
$= \Phi^\top (t - y)$

$\nabla_w \nabla_w^\top \ell = - \sum_i \phi(x_i) \partial_w^\top y_i$  
$= -\sum_i \phi(x_i) y_i \phi(x_i)^\top$  
$= -\Phi^\top diag(y) \Phi$

(the weights are $y_i$ rather than $y_i (1 - y_i)$ as in logistic regression, since $\partial_{a_i} e^{a_i} = e^{a_i} = y_i$)

so the update step is:

$w^{(i+1)} = w^{(i)} - (\Phi^\top diag(y^{(i)}) \Phi)^{-1} \Phi^\top (y^{(i)} - t)$
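A minimal sketch of Newton-Raphson for Poisson regression with $y_i = e^{a_i}$, where the gradient is $\Phi^\top (t - y)$ and the negative Hessian is $\sum_i \phi(x_i) y_i \phi(x_i)^\top$. The synthetic data and the name `poisson_newton` are assumptions; a practical implementation would typically add step-size control.

```python
import numpy as np

def poisson_newton(Phi, t, tol=1e-10, max_iter=50):
    """Newton-Raphson for the Poisson regression MLE with y = exp(Phi w)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        y = np.exp(Phi @ w)
        grad = Phi.T @ (t - y)                 # gradient of the log likelihood
        if np.linalg.norm(grad) < tol:
            break
        neg_H = Phi.T @ (y[:, None] * Phi)     # sum_i phi_i y_i phi_i^T = -H
        w = w + np.linalg.solve(neg_H, grad)   # w - H^{-1} grad
    return w

# synthetic count data (hypothetical)
rng = np.random.default_rng(1)
n = 300
Phi = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
w_true = np.array([0.5, 1.0])
t = rng.poisson(np.exp(Phi @ w_true)).astype(float)

w_hat = poisson_newton(Phi, t)
```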

### Generalized linear model

there is an underlying linear function: $a_i = w^\top \phi(x_i)$

predictions are some function of the linear outputs: $y_i = f(a_i)$

* $f$ is the activation function
* $f^{-1}$ is the link function
* $y_i$ is the mean parameter $y_i = \theta_i = E[t_i]$

$\eta_i = \psi(y_i)$

$t_i \mid \eta_i \sim P$ where $P$ is in the 1 dimensional scaled exponential family

if we choose $f = \psi^{-1}$, then $\eta_i = a_i = w^\top \phi(x_i)$ (canonical link function)

$L(w) = \prod_i \frac{1}{s} h(\frac{t_i}{s}) g(\eta_i) e^{\frac{1}{s} \eta_i t_i}$

$\implies \ell(w) = -n \log s + \sum_i \log h(\frac{t_i}{s}) + \sum_i \log g(\eta_i) + \frac{1}{s} \sum_i \eta_i t_i$  
$\implies \nabla_w \ell = \sum_i (\partial_{\eta_i} \log g(\eta_i)) \psi'(y_i) f'(a_i) \phi(x_i) + \frac{1}{s} \sum_i t_i \psi'(y_i) f'(a_i) \phi(x_i)$  
$= \frac{1}{s} \sum_i (t_i - y_i) \psi'(y_i) f'(a_i) \phi(x_i)$

recall that $-\partial_\eta \log g(\eta) = \frac{\theta}{s}$ in the scaled family (and $\theta$ is $y_i$ in this case)

if using canonical link function, then $\psi(f(a)) = a$, so $\psi'(y_i) f'(a_i) = 1$ and

$\nabla_w \ell = \frac{1}{s} \sum_i (t_i - y_i) \phi(x_i)$  
$= \frac{1}{s} \Phi^\top (t - y)$

the second derivative ...

$\nabla_w \nabla_w^\top \ell$  
$= \frac{1}{s} \sum_i (t_i - y_i) \psi'(y_i) f''(a_i) \phi(x_i) \phi(x_i)^\top$  
$+ \frac{1}{s} \sum_i (t_i - y_i) (f'(a_i))^2 \psi''(y_i) \phi(x_i) \phi(x_i)^\top$  
$-\frac{1}{s} \sum_i \psi'(y_i) (f'(a_i))^2 \phi(x_i) \phi(x_i)^\top$

if using canonical link function, then the first two terms cancel out and third term simplifies to  
$-\frac{1}{s} \sum_i f'(a_i) \phi(x_i) \phi(x_i)^\top$
$= -\frac{1}{s} \Phi^\top diag(f'(a)) \Phi$

then the Newton-Raphson update step (in the canonical case) becomes

$w^{(i+1)} = w^{(i)} + (\Phi^\top diag(f'(a^{(i)})) \Phi)^{-1} \Phi^\top (t - y^{(i)})$
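Because the canonical-link update depends on the model only through $f$ and $f'$, one routine covers several models. The sketch below is a generic version under that assumption; `glm_newton` is a hypothetical name, and the check uses the identity activation (normal responses), where a single Newton step should reproduce least squares.

```python
import numpy as np

def glm_newton(Phi, t, f, f_prime, tol=1e-10, max_iter=50):
    """Newton-Raphson for a canonical-link GLM; the factor 1/s cancels in the update."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        a = Phi @ w
        grad = Phi.T @ (t - f(a))
        if np.linalg.norm(grad) < tol:
            break
        A = Phi.T @ (f_prime(a)[:, None] * Phi)   # Phi^T diag(f'(a)) Phi
        w = w + np.linalg.solve(A, grad)
    return w

# synthetic normal responses (hypothetical)
rng = np.random.default_rng(0)
n = 100
Phi = np.column_stack([np.ones(n), rng.normal(size=n)])
t = Phi @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)

# identity activation: f(a) = a, f'(a) = 1
w_glm = glm_newton(Phi, t, f=lambda a: a, f_prime=lambda a: np.ones_like(a))
w_ols = np.linalg.lstsq(Phi, t, rcond=None)[0]
```

Passing the sigmoid with derivative $y (1 - y)$, or the exponential with derivative $y$, gives the logistic and Poisson updates respectively.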

## Bayesian logistic regression

need a prior for weights $w$  
since $w$ can take on any value in $\mathbb{R}^d$, an appropriate prior might be normal  
we also can't come up with a good conjugate prior, so ...

$w \sim \mathcal{N}_d(m_0, S_0)$

choosing $m_0 = 0$ induces shrinkage  
a simple choice of $S_0$ might be $S_0 = \alpha^{-1} I$

$p(w | \Phi, t) \propto p(w) p(t | w, \Phi)$  
$\propto (2 \pi)^{-d/2} |S_0|^{-1/2} e^{-\frac{1}{2} (w - m_0)^\top S_0^{-1} (w - m_0)} \prod_i y_i^{t_i} (1 - y_i)^{1 - t_i}$

can't get a recognizable form for the posterior  
approximate the posterior using a known form

### MAP solution

$\hat{w}_{MAP} = \arg\max \log p(w | \Phi, t)$

$\log p(w | \Phi, t) = -\frac{1}{2} w^\top S_0^{-1} w + w^\top S_0^{-1} m_0 + \sum_i t_i \log y_i + \sum_i (1 - t_i) \log (1 - y_i) + C$

taking the gradient w.r.t. $w$ yields:

$\nabla_w \log p(w | \cdot) = -S_0^{-1} w + S_0^{-1} m_0 + \Phi^\top (t - y)$  
$= \Phi^\top (t - y) - S_0^{-1} (w - m_0)$

$\nabla_w \nabla_w^\top \log p(w | \cdot) = -\Phi^\top diag(y (1 - y)) \Phi - S_0^{-1}$

then we can use Newton-Raphson:

$w^{(i+1)} = w^{(i)} + (\Phi^\top diag(y^{(i)} (1-y^{(i)})) \Phi + S_0^{-1})^{-1} (\Phi^\top (t - y^{(i)}) - S_0^{-1} (w^{(i)} - m_0))$
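A minimal sketch of this MAP iteration with $S_0 = \alpha^{-1} I$ and $m_0 = 0$. The data, the two values of $\alpha$, and the name `map_newton` are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_newton(Phi, t, m0, S0, tol=1e-10, max_iter=50):
    """Newton-Raphson for the MAP weights under a N(m0, S0) prior."""
    S0_inv = np.linalg.inv(S0)
    w = np.array(m0, dtype=float)
    for _ in range(max_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y) - S0_inv @ (w - m0)          # gradient of log posterior
        if np.linalg.norm(grad) < tol:
            break
        A = Phi.T @ ((y * (1 - y))[:, None] * Phi) + S0_inv  # negative Hessian
        w = w + np.linalg.solve(A, grad)
    return w

# synthetic data (hypothetical)
rng = np.random.default_rng(0)
n, d = 100, 3
Phi = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
t = (rng.uniform(size=n) < sigmoid(Phi @ np.array([0.5, -1.0, 2.0]))).astype(float)
m0 = np.zeros(d)

w_weak = map_newton(Phi, t, m0, S0=np.eye(d) / 1.0)      # alpha = 1
w_strong = map_newton(Phi, t, m0, S0=np.eye(d) / 100.0)  # alpha = 100
```

The larger $\alpha$ is, the more strongly the estimate is pulled toward $m_0 = 0$, which is the shrinkage mentioned above.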

#### Laplace approximation

set $\hat{\mu} = \arg\max p(w | \cdot) = \hat{w}_{MAP}$  
and $\hat{\Sigma} = -H^{-1}$, the inverse of the negative Hessian (curvature) of $\log p(w | \cdot)$ at $\hat{w}_{MAP}$  
and approximate $w \mid \cdot \sim \mathcal{N}(\hat{\mu}, \hat{\Sigma})$

##### method

let $f(x)$ be an unnormalized posterior  
$p(x) = \frac{1}{z} f(x)$ for some normalizing term $z$
($z = \int f(x) dx$)  
if $p(x)$ is normal pdf, then $\log p(x)$ is quadratic  
so we can get a normal pdf approximation of $p(x)$ with second order taylor series

set $g(x) = \log f(x)$

$g(x) \approx g(x_0) + g'(x_0) (x - x_0) + \frac{1}{2} g''(x_0) (x - x_0)^2$

if we choose $x_0 = \arg\max p(x)$ and $p(x)$ is differentiable, $g'(x_0) = 0$

we can set $\hat{\sigma}^2 = -\frac{1}{g''(x_0)}$  
alternatively $\beta = -g''(x_0)$

in higher dimensions:

* $g(x) \approx g(x_0) + \frac{1}{2} (x - x_0)^\top H (x - x_0)$, where $H$ is the Hessian of $g$ at $x_0$
* again, choose $x_0 = \arg\max g(x)$
* then approximate $x \sim \mathcal{N}_d(x_0, -H^{-1}|_{x_0})$

##### evidence function approximation

$\int f(x) dx \approx f(x_0) \int e^{-\frac{1}{2 \sigma^2} (x - x_0)^2} dx$  
$= f(x_0) (2 \pi \sigma^2)^{1/2}$
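As a one-dimensional illustration (an example chosen here, not from the notes), take $f(x) = x^a e^{-x}$ on $x > 0$, whose exact integral is $\Gamma(a + 1)$. Then $g(x) = a \log x - x$, the mode is $x_0 = a$, and $g''(x_0) = -1/a$, so $\hat{\sigma}^2 = a$ and the Laplace estimate is Stirling's formula. The helper name `laplace_evidence_1d` is hypothetical.

```python
import math

def laplace_evidence_1d(g, g2, x0):
    """Laplace estimate of the integral of exp(g) around its mode x0.

    g is the log of the unnormalized density, g2 its second derivative.
    """
    sigma2 = -1.0 / g2(x0)
    return math.exp(g(x0)) * math.sqrt(2 * math.pi * sigma2), sigma2

a = 10.0
g = lambda x: a * math.log(x) - x     # log f(x)
g2 = lambda x: -a / x**2              # g''(x)

z_hat, sigma2 = laplace_evidence_1d(g, g2, x0=a)
z_exact = math.gamma(a + 1)           # exact normalizing constant
ratio = z_hat / z_exact
```

For $a = 10$ the estimate falls slightly below the exact value, with a relative error under one percent.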

### Laplace approximation for the distribution of $w$ in bayesian logistic regression:

* $w | \Phi, t \sim \mathcal{N} \Big(\hat{w}_{MAP}, \hat{\Sigma}_{MAP} \Big)$
* $\hat{\Sigma}_{MAP} = \big(\Phi^\top diag(y (1-y)) \Phi + S_0^{-1} \big)^{-1}$
* $y = \sigma(\Phi \hat{w}_{MAP})$
* evidence 
    * $f(w) \approx f(\hat{w}_{MAP}) e^{-\frac{1}{2} (w - \hat{w})^\top \hat{\Sigma}^{-1} (w - \hat{w})}$
    * $f(w) = \mathcal{N}(w | m_0, S_0) L(w)$
    * evidence $= \int f(w) dw = f(\hat{w}) (2 \pi)^{d/2} |\hat{\Sigma}|^{1/2}$

#### predictive distribution

find $p(t_{n+1} | t, \Phi, \phi(x_{n+1}))$

$p(t_{n+1} | \cdot) = \int p(w | \cdot) p(t_{n+1} | w, \cdot) dw$  
$= \int \mathcal{N}(w | \hat{w}, \hat{\Sigma}) \sigma(w^\top \phi(x_{n+1})) dw$  

no closed form solution

approximate $\sigma(z) \approx \Phi(\sqrt{\pi / 8} z)$

$\approx \int \mathcal{N}(w | \cdot) \Phi(\sqrt{\pi/8} w^\top \phi(x_{n+1})) dw$

if $z = w^\top \phi(x_{n+1})$ and $w \sim \mathcal{N}(\hat{w}, \hat{\Sigma})$  
then $z \sim \mathcal{N}(\hat{w}^\top \phi(x_{n+1}), \phi(x_{n+1})^\top \hat{\Sigma} \phi(x_{n+1}))$

then we get  
$= \int \mathcal{N}(z | \mu_z, \sigma^2_z) \Phi(\sqrt{\pi / 8} z) dz$

where $\mu_z$ and $\sigma^2_z$ are the mean and variance of $z$ from above

using $\int \Phi(\alpha t) \mathcal{N}(t | \mu, \sigma^2) dt = \Phi\Big(\frac{\mu}{(\alpha^{-2} + \sigma^2)^{1/2}}\Big)$ with $\alpha = \sqrt{\pi / 8}$, we get

$= \Phi \bigg(\frac{\mu_z}{\sqrt{\frac{8}{\pi} + \sigma_z^2}} \bigg)$

we can approximate this back to the sigmoid function to get  
$\approx \sigma \bigg( \sqrt{8 / \pi} \frac{\mu_z}{\sqrt{\frac{8}{\pi} + \sigma^2_z}} \bigg) = \sigma \bigg( \frac{\mu_z}{\sqrt{1 + \pi \sigma^2_z / 8}} \bigg)$

for predicting class, we might say if $p(t_{n+1} = 1 | \cdot) > 0.5$ we predict $t_{n+1} = 1$ and $0$ otherwise  
but the only value inside $\sigma(\cdot)$ that can be negative is $\mu_z$, so the predicted class depends only on the sign of $\mu_z = \hat{w}^\top \phi(x_{n+1})$; the variance $\sigma_z^2$ only pulls the probability toward $0.5$
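The quality of the probit-based approximation can be checked against a Monte Carlo estimate of $\int \mathcal{N}(z | \mu_z, \sigma_z^2) \sigma(z) dz$. The sketch uses the form $\sigma\big(\mu_z / \sqrt{1 + \pi \sigma_z^2 / 8}\big)$, which equals $\sigma\big(\sqrt{8/\pi} \, \mu_z / \sqrt{8/\pi + \sigma_z^2}\big)$; the values of $\mu_z$ and $\sigma_z^2$ and the sample size are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(mu_z, var_z):
    """Approximate p(t = 1) as sigma(mu_z / sqrt(1 + pi * var_z / 8))."""
    return sigmoid(mu_z / np.sqrt(1.0 + np.pi * var_z / 8.0))

# hypothetical mean and variance of z = w^T phi(x_{n+1})
mu_z, var_z = 1.0, 4.0
approx = predictive_prob(mu_z, var_z)

# Monte Carlo reference: average sigma(z) over draws z ~ N(mu_z, var_z)
rng = np.random.default_rng(0)
z = rng.normal(mu_z, np.sqrt(var_z), size=200_000)
mc = sigmoid(z).mean()

plug_in = sigmoid(mu_z)   # ignores the uncertainty in w
```

The approximation stays on the same side of $0.5$ as the plug-in value `plug_in` but is closer to $0.5$, so the class decision is unchanged while the reported probability is less extreme.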