# Dirichlet Process

**e.g.** how to determine number of clusters for a clustering problem

* model comparison
* evidence maximization
    * comparable (equivalent?) to BIC
* would be nice to do this as a part of the model fitting 
    * instead of fitting multiple models and comparing them

**def** dirichlet process

$G \sim DP(\alpha, H)$ where $\alpha > 0$ is the concentration parameter and $H$ is a base measure  
iff $\forall$ finite partition of the sample space $A_1 \cup A_2 \cup \cdots \cup A_m$,  
$(G(A_1), G(A_2), ..., G(A_m)) \sim Dirichlet(\alpha H(A_1), \alpha H(A_2), ..., \alpha H(A_m))$

$\forall$ partition, distribution induced by sampling $G$ and evaluating it on the partition is dirichlet

1. If we have a DP and we sample $G$ and then sample $\theta_1, ..., \theta_n$,  
then $G \mid \theta_1, ..., \theta_n$ is a dirichlet process with parameters $\alpha + n, \frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} (\frac{\sum \delta_{\theta_i}}{n})$

2. $\theta_{n+1} \mid \theta_1, ..., \theta_n \sim \frac{1}{\alpha + n} (\alpha H + \sum \delta_{\theta_i})$

if we draw a $G$ and then a $\theta \mid G$, and we have a measurable set $B$, then  
$P(\theta \in B) = \int_G \int_\theta p(G) p(\theta \mid G) I_B(\theta) \, d\theta \, dG$  
$= E[I_B(\theta)] = E[G(B)] = H(B)$ (the mean of the Dirichlet on the partition $\{B, B^c\}$)
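
A quick Monte Carlo check of $E[G(A_k)] = H(A_k)$ on a finite partition. This is only a sketch: the base-measure masses `h` and the value of `alpha` are made-up numbers, and the helper name is hypothetical.

```python
import numpy as np

def sample_partition_masses(alpha, h, size, rng):
    """Draw (G(A_1), ..., G(A_m)) ~ Dirichlet(alpha * h), `size` times."""
    h = np.asarray(h, dtype=float)
    return rng.dirichlet(alpha * h, size=size)

rng = np.random.default_rng(0)
h = np.array([0.2, 0.3, 0.5])   # assumed masses H(A_1), H(A_2), H(A_3)
g = sample_partition_masses(alpha=2.0, h=h, size=20000, rng=rng)
mean_g = g.mean(axis=0)         # Monte Carlo estimate of E[G(A_k)], close to h
```

Each row of `g` is one draw of $G$ restricted to the partition; averaging over rows should recover `h` up to sampling noise.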

to calculate the posterior  
first focus on one partition $A_1 \cup \cdots \cup A_r$

let $(G(A_1), ..., G(A_r)) = (g_1, ..., g_r)$ and $(H(A_1), ..., H(A_r)) = (h_1, ..., h_r)$

prior: $p(g) = Dirichlet(\alpha h)$

* $\propto \prod_k g_k^{\alpha h_k - 1}$

likelihood: $p(\theta \mid g) = \prod_k g_k^{n_k}$

posterior: $\propto \prod_k g_k^{\alpha h_k + n_k - 1}$  
$\implies Dirichlet(\alpha h + n)$

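here $n = (n_1, ..., n_r)$ is the vector of counts, where $n_k$ is the number of samples that fall in cell $A_k$ of the partition (elsewhere $n = \sum_k n_k$ denotes the total sample size)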
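
The conjugate update on a single partition is just adding counts to the Dirichlet parameters. A minimal sketch with made-up numbers (`alpha`, `h`, and `counts` are illustrative, and the helper name is hypothetical):

```python
import numpy as np

def dirichlet_posterior_params(alpha, h, counts):
    """Parameters of the posterior Dirichlet(alpha * h + n) on one partition."""
    return alpha * np.asarray(h, dtype=float) + np.asarray(counts, dtype=float)

alpha = 2.0
h = np.array([0.5, 0.25, 0.25])   # assumed H(A_k)
counts = np.array([3, 0, 1])      # n_k: samples falling in each A_k
post = dirichlet_posterior_params(alpha, h, counts)   # [4.0, 0.5, 1.5]
post_mean = post / post.sum()     # posterior mean of (g_1, ..., g_r)
```

The parameters sum to $\alpha + \sum_k n_k$, and the posterior mean is a weighted average of $h$ and the empirical frequencies.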

let $\alpha^{new} = \alpha + n$ and $H^{new} = \frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} (\frac{\sum \delta_{\theta_i}}{n})$

consider new partition over the same space $B_1 \cup \cdots \cup B_s$

$H^{new}(B_j) = \frac{\alpha}{\alpha+n} H(B_j) + \frac{1}{\alpha + n} \sum_i I_{B_j}(\theta_i)$  
the sum in the second term is just $n(B_j)$, the number of samples that fall in $B_j$

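then sampling from posterior $DP(\alpha^{new}, H^{new})$  
induced samples over $\{B_j\}$ have distribution $Dirichlet(\alpha^{new} h^{new})$ with $h^{new}_j = H^{new}(B_j)$, and since $\alpha^{new} H^{new}(B_j) = \alpha H(B_j) + n(B_j)$ this is the same Dirichlet as the finite-partition posterior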

to evaluate $\theta_{n+1} \mid \theta_1, ..., \theta_n \sim \frac{1}{\alpha + n} (\alpha H + \sum \delta_{\theta_i})$

fix partition $B$  
then $P(\theta_{n+1} \in B_j \mid \theta_1, ..., \theta_n) = H^{new} (B_j) = \frac{\alpha}{\alpha + n} H(B_j) + \frac{1}{\alpha + n} \sum_i I_{B_j}(\theta_i)$

1. as $n \to \infty$ the predictive distribution is dominated by the empirical term $\sum_i \delta_{\theta_i}$, so sampling concentrates on a discrete set
2. CRP (Chinese restaurant process) $\theta_{n+1} \mid \theta_1, ..., \theta_n \sim \frac{1}{\alpha + n} (\alpha H + \sum_i \delta_{\theta_i})$ (Blackwell-MacQueen)
    * for each $i$, $z_i = \begin{cases} k & \text{w.p. } \frac{n_k}{\alpha + n} \\ r + 1 & \text{w.p. } \frac{\alpha}{\alpha + n} \end{cases}$  
    $n_k = $ number of $\theta_i$ equal to the $k$-th distinct value $\theta^*_k$  
    $k$ ranges from 1 to $r$  
    choose new cluster with probability $\alpha / (\alpha + n)$ or choose existing cluster with probability proportional to cluster sizes
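3. given sample $\theta_1, ..., \theta_n$ of size $n$ and an infinite number of possible clusters, let $X_i = 1$ if $\theta_i$ starts a new cluster (0 otherwise); the expected number of clusters is  
$E[\sum_{i=1}^n X_i] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{\alpha}{\alpha + i - 1} \leq 1 + \alpha \sum_{i=1}^{n-1} i^{-1} = O(\alpha \log n)$  
so it grows proportionally to $\log n$ (similar to BIC)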
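
The seating rule and the expected cluster count above can be simulated directly. This is a sketch only; function names are hypothetical and labels are 0-based rather than 1-based.

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Sample cluster labels z_1..z_n from a CRP with concentration alpha."""
    z = np.empty(n, dtype=int)
    counts = []                                   # counts[k] = size of cluster k
    for i in range(n):                            # i points are already seated
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / (alpha + i))
        if k == len(counts):
            counts.append(1)                      # new cluster, w.p. alpha / (alpha + i)
        else:
            counts[k] += 1                        # existing cluster, w.p. n_k / (alpha + i)
        z[i] = k
    return z, np.array(counts)

def expected_num_clusters(n, alpha):
    """sum_{i=1}^n alpha / (alpha + i - 1)."""
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1))
```

For example, `sample_crp(200, 1.5, np.random.default_rng(0))` returns the labels and cluster sizes, and the number of clusters can be compared against `expected_num_clusters(200, 1.5)` over repeated runs.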

DP mixture models

$G \sim DP(\alpha, H)$  
$\theta_i \sim G$  
$x_i \sim F(\theta_i)$

**e.g.**

$H = \mathcal{N}(0, 1)$  
$X_i \mid \theta^*_k \sim \mathcal{N}(\theta^*_k, \Sigma)$

First, draw $\theta^*_k \sim H$ for $k = 1, 2, ...$  
Draw $Z_i$ using the CRP, which assigns each point to one of the clusters $k$  
Use $\theta^*_{Z_i}$ to draw $X_i$
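
A sketch of this generative process in one dimension, assuming $H = \mathcal{N}(0, 1)$ and a scalar noise standard deviation `sigma` in place of $\Sigma$. The $\theta^*_k$ are drawn lazily when a cluster is first opened, which gives the same distribution as drawing them all up front; the function name is hypothetical.

```python
import numpy as np

def sample_dp_mixture(n, alpha, sigma, rng):
    """Generate x_1..x_n from a 1-D DP mixture of normals via the CRP."""
    x = np.empty(n)
    z = np.empty(n, dtype=int)
    counts, thetas = [], []
    for i in range(n):
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):                     # first visit to a new cluster
            counts.append(0)
            thetas.append(rng.normal(0.0, 1.0))  # theta*_k ~ H = N(0, 1)
        counts[k] += 1
        z[i] = k
        x[i] = rng.normal(thetas[k], sigma)      # x_i ~ N(theta*_{z_i}, sigma^2)
    return x, z, np.array(thetas)
```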

**Gibbs sampling**

* unobserved: $\{z_i\}, \{\theta_k^*\}$
* observed: $\{x_i\}$
* $p(z_i = k \mid z^{(-i)}, \{\theta_k^*\}, \{x_i\})$

the CRP gives the prior term, with counts $n_k$ computed from $z^{(-i)}$ (i.e. excluding point $i$, so $n - 1$ other points):  
if $z_i$ goes to an existing cluster, $z_i = k$ w.p. $\frac{n_k}{\alpha + n - 1}$  
if $z_i$ goes to a new cluster, $z_i = r + 1$ w.p. $\frac{\alpha}{\alpha + n - 1}$

also need to consider the likelihood given the cluster  
so $p(z_i = k \mid x_i) \propto \frac{n_k}{\alpha + n - 1} p(x_i \mid \theta^*_k)$  
$p(z_i = r + 1 \mid x_i) \propto \frac{\alpha}{\alpha + n - 1} p(x_i \mid \theta_{r+1}^*)$  
where $\theta^*_{r+1} \sim H$ has to be sampled (the common denominator cancels when normalizing)

to update the parameters, consider cluster $k$ with parameter $\theta^*_k$ and members $z_{i_1} = z_{i_2} = z_{i_3} = k$  
$p(\theta^*_k \mid x_{i_1}, x_{i_2}, x_{i_3}) \propto p(\theta^*_k) \prod_j p(x_{i_j} \mid \theta^*_k)$  
and we can calculate all of these terms since we know the prior and likelihood
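
One sweep of the sampler could look like the sketch below, for a 1-D model with $H = \mathcal{N}(0, 1)$ and $x_i \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$. Two assumptions that are not in the notes: when point $i$ is alone in its cluster, its current parameter is reused as the new-cluster candidate instead of a fresh draw from $H$ (as in auxiliary-variable samplers), and the $\theta^*_k$ update uses the conjugate normal posterior. The function name is hypothetical.

```python
import numpy as np

def gibbs_sweep(x, z, thetas, alpha, sigma, rng):
    """Resample every z_i, then every theta*_k. Labels in z are 0..r-1."""
    z = np.array(z, dtype=int)
    thetas = list(thetas)
    for i in range(len(x)):
        k_old = z[i]
        z[i] = -1                                    # remove point i: z^(-i)
        counts = np.bincount(z[z >= 0], minlength=len(thetas))
        if counts[k_old] == 0:
            # point i was a singleton: keep its parameter as the candidate
            # and delete the now-empty cluster, shifting later labels down
            theta_new = thetas.pop(k_old)
            counts = np.delete(counts, k_old)
            z[z > k_old] -= 1
        else:
            theta_new = rng.normal(0.0, 1.0)         # theta*_{r+1} ~ H
        params = np.array(thetas + [theta_new])
        # log of (n_k or alpha) * N(x_i | theta, sigma^2), up to a constant
        log_w = np.log(np.append(counts, alpha)) - 0.5 * ((x[i] - params) / sigma) ** 2
        w = np.exp(log_w - log_w.max())
        k = rng.choice(len(w), p=w / w.sum())
        if k == len(thetas):
            thetas.append(theta_new)                 # open the new cluster
        z[i] = k
    for k in range(len(thetas)):
        xk = x[z == k]
        precision = 1.0 + len(xk) / sigma**2         # prior precision is 1
        mean = (xk.sum() / sigma**2) / precision
        thetas[k] = rng.normal(mean, 1.0 / np.sqrt(precision))
    return z, thetas
```

Repeated sweeps give a sequence of states $(\{z_i\}, \{\theta^*_k\})$; working with log weights avoids underflow when a point is far from every cluster.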

***what is the output of this Gibbs sampling?***

**stick breaking construction**

equivalent to CRP

$\forall k$, $v_k \sim Beta(1, \alpha)$  
$\pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$, represents class proportions  
$\theta_k^* \sim H$  
$G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$
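
A sketch of the stick-breaking weights, truncated to a finite number of atoms so it can be computed (the truncation and $H = \mathcal{N}(0, 1)$ are assumptions made for the example; function names are hypothetical):

```python
import numpy as np

def sticks_to_weights(v):
    """pi_k = v_k * prod_{i<k} (1 - v_i)."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

def sample_truncated_dp(alpha, num_atoms, rng):
    """First `num_atoms` weights and atoms of G = sum_k pi_k delta_{theta*_k}."""
    v = rng.beta(1.0, alpha, size=num_atoms)      # v_k ~ Beta(1, alpha)
    pi = sticks_to_weights(v)
    atoms = rng.normal(0.0, 1.0, size=num_atoms)  # theta*_k ~ H, assumed N(0, 1)
    return pi, atoms
```

With sticks `[0.5, 0.5, 0.5]` the weights are `[0.5, 0.25, 0.125]`: each stick takes half of what is left. The weights of a truncated draw sum to slightly less than 1; the remainder is the mass of the atoms that were cut off.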

DP mixture of diagonal normal samples

$\forall k$, $v_k \sim Beta(1, \alpha)$  
$\pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$  
$\mu^*_k \sim \mathcal{N}(0, \sigma_1^2 I)$

$\forall i$, $z_i \sim Discrete(\{\pi\})$  
$x_i \sim \mathcal{N}(\mu^*_{z_i}, \sigma^2_2 I)$
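
To simulate from this model one has to cut the infinite sum somewhere. The sketch below truncates at `T` components by setting the last stick to 1 so the weights sum to one; the truncation, the argument names, and the function name are assumptions for the example.

```python
import numpy as np

def sample_truncated_dp_gmm(n, dim, alpha, sigma1, sigma2, T, rng):
    """Draw n points from a T-truncated DP mixture of isotropic normals."""
    v = rng.beta(1.0, alpha, size=T)              # v_k ~ Beta(1, alpha)
    v[-1] = 1.0                                   # last stick takes the remainder
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = pi / pi.sum()                            # guard against rounding error
    mu = rng.normal(0.0, sigma1, size=(T, dim))   # mu*_k ~ N(0, sigma1^2 I)
    z = rng.choice(T, size=n, p=pi)               # z_i ~ Discrete(pi)
    x = mu[z] + sigma2 * rng.standard_normal((n, dim))  # x_i ~ N(mu*_{z_i}, sigma2^2 I)
    return x, z, mu, pi
```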

**mean field approximation**

let $q(\{v_k\}, \{\mu^*_k\}, \{z_i\}) = (\prod_k q(v_k)) (\prod_k q(\mu^*_k)) (\prod_i q(z_i))$  
enforce truncation in $q(\cdot)$  
for some fixed $T$, $q(v_T = 1) = 1 \implies q(z_i = k) = 0$ $\forall k > T$

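for $\mu \sim Beta(a, b)$: $p(\mu) \propto \mu^{a - 1} (1 - \mu)^{b - 1}$ (for $Beta(1, \alpha)$ this is $(1 - \mu)^{\alpha - 1}$)

$E[\mu] = \frac{a}{a + b}$  
$E[\log \mu] = \psi(a) - \psi(a + b)$  
$E[\log (1 - \mu)] = \psi(b) - \psi(a + b)$  
where $\psi$ is the digamma function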

mean field variational solution for $v_s$

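generic mean field construction:  
$\log q(v_s) = E_{q(v_{-s}, \{\mu^*_k\}, \{z_i\})}[\log L] + const$, where $L$ is the joint density and only the terms involving $v_s$ matter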

$v_s \sim Beta(1, \alpha) \implies p(v_s) \propto (1 - v_s)^{\alpha - 1}$

$z_i \sim Discrete(\pi) \implies p(z_i) = \prod_k \pi_k^{z_{ik}}$  
$= \prod_k (1 - v_k)^{I(z_i > k)} v_k^{I(z_i = k)}$  

$g(v_s) = E[(\alpha-1) \log (1 - v_s) + \sum_i (I(z_i > s) \log (1 - v_s) + I(z_i = s) \log v_s)]$  
let $q(z_i) = \prod_k \gamma_{ik}^{z_{ik}}$ where $\gamma_{ik}$ are parameters  
then $g(v_s) = (\alpha - 1) \log (1 - v_s) + \log (1 - v_s) (\sum_i \sum_{k=s+1}^T \gamma_{ik}) + \log v_s \sum_i \gamma_{is}$  
then $q(v_s) \propto v_s^{\sum_i \gamma_{is}} (1 - v_s)^{\alpha + \sum_i \sum_{k=s+1}^T \gamma_{ik} - 1}$  
$\implies q(v_s) = Beta(v_s \mid 1 + \sum_i \gamma_{is}, \alpha + \sum_i \sum_{k=s+1}^T \gamma_{ik})$
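
The resulting update is a pair of column sums over the responsibilities. A sketch, assuming `gamma` is an $n \times T$ array with `gamma[i, k]` $= q(z_i = k)$ (0-based columns); only the first $T - 1$ sticks get a Beta factor because $v_T$ is fixed to 1 by the truncation. The function name is hypothetical.

```python
import numpy as np

def update_stick_posteriors(gamma, alpha):
    """Beta parameters (a_s, b_s) of q(v_s) for s = 1..T-1."""
    col = gamma.sum(axis=0)                    # sum_i gamma_{is}
    tail = np.cumsum(col[::-1])[::-1] - col    # sum_i sum_{k>s} gamma_{ik}
    a = 1.0 + col[:-1]
    b = alpha + tail[:-1]
    return a, b

# four points with hard assignments to components 1, 2, 3, 3
gamma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
a, b = update_stick_posteriors(gamma, alpha=2.0)   # a = [2, 2], b = [5, 4]
```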