# Chapter 6: Bayesian methods

## Flow of story
- matrix factorization
- LDA setup: $w$, $z$, and the graph. Connection to MF
- Dirichlet prior. Topic modeling examples.
- Marginal probability and intractability
- EM, MCMC

## 0. Mathematical foundation

### 0.1 Multinomial distribution

$\text{Multinomial}(m_1,...,m_K \mid \vec{\pi})$ is the multinomial distribution

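\begin{align}
\text{Multinomial}(m_1,...,m_K\mid \vec{\pi}) &= \frac{n!}{m_1!...m_K!}\prod_{k=1}^K \pi_k^{m_k} \\
&= \frac{\Gamma(n+1)}{\prod_{k=1}^K \Gamma(m_k+1)}\prod_{k=1}^K \pi_k^{m_k}
\end{align}

where $m_k$ is the number of observations in class $k$, $n=\sum_{k=1}^K m_k$ is the total number of observations, and $\pi_k \in [0,1]$ is the probability that an observation is drawn from class $k$, with $\sum_{k=1}^K \pi_k = 1$. The normalization can also be written in terms of the multivariate beta function $B(\vec{\alpha})$

\begin{equation}
B(\vec{\alpha})=\frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^K\alpha_k)}
\end{equation}

where $\Gamma(x)$ is the gamma function, which generalizes the factorial: $\Gamma(n+1)=n!$. Setting $\alpha_k=m_k+1$ gives $\frac{n!}{m_1!...m_K!}=\frac{\Gamma(n+1)}{\Gamma(n+K)}\frac{1}{B(\vec{m}+1)}$. The multinomial distribution generalizes the binomial distribution to $K$ classes.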
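As a quick numerical check of the multinomial probability, the following is a minimal sketch using only the Python standard library. The function name `multinomial_pmf` is hypothetical, and the code assumes every class with a non-zero count has a strictly positive probability.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given class probabilities `probs`.

    Assumes probs[k] > 0 wherever counts[k] > 0.
    """
    n = sum(counts)
    # log of n! / (m_1! ... m_K!) via the gamma function: Gamma(m + 1) = m!
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(m + 1) for m in counts)
    # log of prod_k pi_k^{m_k}; classes with zero count contribute nothing
    log_prob = sum(m * math.log(p) for m, p in zip(counts, probs) if m > 0)
    return math.exp(log_coeff + log_prob)

# With K = 2 the result should agree with the binomial distribution.
binomial = math.comb(3, 2) * 0.3 ** 2 * 0.7
print(multinomial_pmf([2, 1], [0.3, 0.7]), binomial)
```

Working in log space avoids overflow of the factorials when the counts are large.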

### 0.2 Dirichlet distribution

$\text{Dir}(\vec{\pi}\mid \vec{\alpha})$ is the Dirichlet distribution

\begin{equation}
\text{Dir}(\vec{\pi}\mid \vec{\alpha}) = \frac{1}{B(\vec{\alpha})}\prod_{k=1}^K \pi_k^{\alpha_k-1}
\end{equation}

where $\alpha_k > 0$, $\pi_k \ge 0$ and $\sum_{k=1}^K \pi_k=1$. The Dirichlet distribution is therefore defined over the simplex of $\vec{\pi}$.

Consider observing $m_k$ samples in cluster $k$, with $n=\sum_{k=1}^K m_k$. Assuming a Dirichlet prior on $\vec{\pi}$, the posterior of $\vec{\pi}$ given the counts $m_1,...,m_K$ is

\begin{align}
p(\vec{\pi} \mid m_1,...,m_K) &\propto p(m_1,...,m_K \mid \vec{\pi})p(\vec{\pi}) \\
&= \text{Multinomial}(m_1,...,m_K \mid \vec{\pi}) \times \text{Dir}(\vec{\pi}\mid \vec{\alpha}) \\
&= \frac{n!}{m_1!...m_K!}\pi_1^{m_1}...\pi_K^{m_K}\Big[\frac{1}{B(\vec{\alpha})}\prod_{k=1}^K \pi_k^{\alpha_k-1}\Big] \\
&\propto \prod_{k=1}^K \pi_k^{\alpha_k+m_k-1}
\end{align}

which is again a Dirichlet distribution, $\text{Dir}(\vec{\pi}\mid \vec{\alpha}+\vec{m})$. The multinomial distribution and the Dirichlet distribution have the same functional form in $\vec{\pi}$, except that the former is a distribution over $\vec{m}$ and the latter over $\vec{\pi}$. Because the posterior stays in the same family as the prior, the Dirichlet distribution is called the **conjugate prior** of the multinomial distribution.
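Conjugacy means that updating a Dirichlet prior with multinomial counts only requires adding the counts to the prior parameters. A minimal sketch, with hypothetical helper names `dirichlet_posterior` and `dirichlet_mean` and made-up example counts:

```python
def dirichlet_posterior(alpha, counts):
    """Parameters of the Dirichlet posterior after observing multinomial counts."""
    return [a + m for a, m in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_k divided by the sum of all alpha_k."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]   # flat prior over K = 3 clusters
counts = [3, 0, 1]        # illustrative observed counts m_k
posterior = dirichlet_posterior(prior, counts)
print(posterior)
print(dirichlet_mean(posterior))
```

The posterior mean lies between the prior mean and the empirical frequencies, and moves towards the latter as more data are observed.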

#### Connection with Gaussian mixture

Given a dataset $\{x_i\}_{i=1}^N$, a clustering problem concerns choosing the number of clusters and assigning each point to a cluster. Suppose we know the probability $\pi_k$ that a random data point is drawn from cluster $k$, and assume that a data point from cluster $k$ follows a Gaussian distribution $N(\mu_k, \Sigma_k)$. The conditional likelihood of observing the data $x_1,...,x_N$ is then

\begin{equation}
P(x_1,...,x_N \mid \vec{\pi}, \vec{\mu}, \vec{\Sigma}) = \prod_{i=1}^N\sum_{k=1}^K \pi_k N(x_i \mid \mu_k, \Sigma_k)
\end{equation}

where the vector notation collects the parameters of all $K$ clusters, e.g. $\vec{\mu}=(\mu_1,...,\mu_K)$. For a multidimensional clustering problem, each $\mu_k$ is itself a vector and each $\Sigma_k$ a covariance matrix.
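The mixture likelihood can be evaluated directly. The sketch below restricts itself to one-dimensional data, so each $\Sigma_k$ is a scalar variance; the function names are hypothetical and only the standard library is used.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gmm_log_likelihood(xs, pis, mus, sigma2s):
    """log of prod_i sum_k pi_k N(x_i | mu_k, sigma2_k)."""
    total = 0.0
    for x in xs:
        mixture = sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigma2s))
        total += math.log(mixture)
    return total

# Two clusters centred at -2 and +2 with equal weights (illustrative values).
data = [-2.1, -1.9, 2.0, 2.2]
print(gmm_log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]))
```

The log of the product is accumulated as a sum of logs, which is numerically safer than multiplying many small densities.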

If we do not know the values of $\pi_k$, $\mu_k$ and $\Sigma_k$ a priori but only know their prior distributions, we can introduce priors for those parameters and integrate them out to obtain the marginal probability of the data

\begin{equation}
P(x_1,...,x_N) = \int d\vec{\pi}d\vec{\mu}d\vec{\Sigma}p(\vec{\pi})p(\vec{\mu})p(\vec{\Sigma})\Big[\prod_{i=1}^N\sum_{k=1}^K \pi_k N(x_i\mid \mu_k, \Sigma_k)\Big]
\end{equation}

where $p(\cdot)$ denotes the priors of the parameters of the clustering problem. For example, the priors can take the following forms:

- $p(\mu_k) = N(\overline \mu_k, 1)$
- $p(\Sigma_k)$ = Wishart distribution
- $p(\vec{\pi}) = \text{Dir}(\vec{\pi}\mid \alpha_1,...,\alpha_K)$

The Wishart distribution is a distribution over symmetric, positive-definite random matrices.


### 0.3 Dirichlet process (DP)

A Dirichlet Process (DP) generalizes the Dirichlet distribution to infinite dimension. It is a stochastic process whose realizations are themselves distributions

\begin{equation}
G \sim \text{DP}(\alpha,H)
\end{equation}

A DP has two arguments: the concentration parameter $\alpha$ and the base distribution $H$. A realization of the DP is a distribution whose similarity to the base distribution is controlled by $\alpha$.
 
The sampled distribution $G$ can be written as

\begin{equation}
G(x)\equiv \sum_{k=1}^\infty \pi_k \delta_{\theta_k}(x) \label{DP_sampled_distr}\tag{Eq 3}
\end{equation}

where 

\begin{align}
\vec{\pi} &\sim \lim_{K\to\infty}\text{Dir}\Big(\frac{\alpha}{K},...,\frac{\alpha}{K}\Big) \\
\theta_k &\sim H, \quad k=1,...,\infty
\end{align}

and $\delta_{\theta_k}(x)=1$ if $x=\theta_k$ and zero otherwise.

In **Eq 3**, the point masses $\theta_k$ (or atoms) are drawn from a continuous space $\Omega$ and the weights $\pi_k$ are drawn from an infinite-dimensional Dirichlet distribution. Because **Eq 3** is discrete, there is a finite probability of resampling any existing value $\theta_j$, which adds to the mass $\pi_j$ at $j$.

To get a better intuition, consider $\alpha \to 0$. In this limit the distribution of $\vec{\pi}$ peaks at the corners of the (infinite-dimensional) simplex, i.e. a random draw from $\text{Dir}(\alpha/K,...,\alpha/K)$ gives a $\vec{\pi}$ that equals 1 at one random entry $q$ and zero at the rest, and $G$ in **Eq 3** reduces to a single atom, $G=\delta_{\theta_q}$. For finite $K$, setting $\alpha = K$ gives $\text{Dir}(1,...,1)$, whose density is constant, so $\vec{\pi}$ is uniformly distributed over the simplex. In the opposite limit $\alpha \to \infty$, the weights spread over many atoms drawn from $H$ and the mass that $G$ assigns to any region of $\Omega$ converges to the mass assigned by $H$, i.e. $G \to H$.
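The effect of $\alpha$ can be explored with a finite-$K$ approximation of **Eq 3**: draw the weights from $\text{Dir}(\alpha/K,...,\alpha/K)$ and the atoms from $H$. The sketch below is an approximation for illustration only, not an exact DP sampler. It assumes a standard normal base distribution, and all function names are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalising independent Gamma(params[i], 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    if total == 0.0:
        # Every draw underflowed (very small parameters): put all mass on one atom.
        k = rng.randrange(len(params))
        return [1.0 if i == k else 0.0 for i in range(len(params))]
    return [x / total for x in draws]

def sample_truncated_dp(alpha, K, base_sampler, rng):
    """Finite-K approximation of G ~ DP(alpha, H): returns (weights, atoms)."""
    weights = sample_dirichlet([alpha / K] * K, rng)
    atoms = [base_sampler(rng) for _ in range(K)]
    return weights, atoms

def draw_from_G(weights, atoms, n, rng):
    """Draw n values from the discrete distribution G = sum_k weights[k] * delta_{atoms[k]}."""
    return rng.choices(atoms, weights=weights, k=n)

rng = random.Random(0)
for alpha in (0.5, 5.0, 500.0):
    weights, atoms = sample_truncated_dp(alpha, 100, lambda r: r.gauss(0.0, 1.0), rng)
    samples = draw_from_G(weights, atoms, 20, rng)
    # Small alpha: few distinct values; large alpha: almost all values distinct.
    print(alpha, len(set(samples)))
```

Because $G$ is discrete, repeated values appear among the samples even though the atoms themselves come from a continuous base distribution.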

A DP can be thought of as the infinite limit of a Dirichlet distribution, obtained by allowing infinitely many categories, $K\to\infty$. Instead of defining infinitely many parameters $\alpha_1,\alpha_2,...$ in this limit, we encode them in a base distribution $H$. To understand this, let's partition the space $\Omega$ into finitely many partitions $\{A_1,...,A_K\}$. This is equivalent to binning the x-axis of the distribution $G$. The fraction of the mass of all atoms in partition $A_k$, denoted by $p_k$, follows the Dirichlet distribution

\begin{align}
(p_1,...,p_K) &\sim \text{Dir}(\alpha H(A_1), \alpha H(A_2),...,\alpha H(A_K)), \\
\text{Dir}(p_1,...,p_K \mid \alpha H(A_1),...,\alpha H(A_K)) &= \frac{1}{B(\alpha H(A_1),...,\alpha H(A_K))}\prod_{k=1}^K p_k^{\alpha H(A_k)-1}
\end{align}

Note that $\sum_{k=1}^K p_k=1$.

#### Bayesian updating DP

Given the current mass distribution over the partitions

\begin{equation}
(p_1,...,p_K) \sim \text{Dir}(\alpha H(A_1), \alpha H(A_2),...,\alpha H(A_K))
\end{equation}

if a new data point $X_1$ is observed to fall within partition $A_j$, then the mass distribution is updated as

\begin{equation}
(p_1,...,p_j,...,p_K \mid X_1 \in A_j) \sim \text{Dir}(\alpha H(A_1), \alpha H(A_2),...,\alpha H(A_j)+1,...,\alpha H(A_K))
\end{equation}

The DP is updated accordingly as 

\begin{equation}
G \mid X_1 \sim \text{DP}\Big(\alpha+1, \frac{\alpha H + \delta_{X_1}}{\alpha+1}\Big)
\end{equation}

In words, observing a new data point at $x=X_1$ adds a point mass at $X_1$ to the base distribution $H$, so the updated base distribution puts more probability there. The concentration parameter also increases from $\alpha$ to $\alpha+1$, so the sampled distribution given the new data point, $G \mid X_1$, stays closer to the updated base distribution.
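On a finite partition the update amounts to adding one to the Dirichlet parameter of each partition that receives an observation. A minimal sketch, where the function name and the example base masses $H(A_k)$ are assumptions made for illustration:

```python
def dp_posterior_on_partition(alpha, base_masses, observed_cells):
    """Dirichlet parameters and mean masses of a DP on a partition after observations.

    base_masses[k] is H(A_k); observed_cells lists the partition index of each data point.
    """
    counts = [0] * len(base_masses)
    for j in observed_cells:
        counts[j] += 1
    # Each observation in A_j adds 1 to alpha * H(A_j).
    params = [alpha * h + c for h, c in zip(base_masses, counts)]
    n = len(observed_cells)
    # Posterior mean mass in each partition: (alpha * H(A_k) + count_k) / (alpha + n).
    mean = [p / (alpha + n) for p in params]
    return params, mean

params, mean = dp_posterior_on_partition(2.0, [0.5, 0.3, 0.2], [1])
print(params)  # the parameter of the second partition is increased by one
print(mean)
```

The posterior mean is a weighted average of the base distribution and the empirical distribution of the data, with weights $\alpha/(\alpha+n)$ and $n/(\alpha+n)$.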

### 0.4 Bayesian Mixture Model

### 0.5 Dirichlet Process Mixture

### 0.6 Hierarchical Dirichlet Process (HDP) [ref](https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture20.pdf)

HDP adds one more level of DP over the shared base measure $G_0$, which enables data in different groups to share a countably infinite set of cluster identities while exhibiting group-specific cluster proportions. Drawing $G_0$ from a second-level DP with concentration parameter $\gamma$ and base measure $H$ guarantees that $G_0$ is discrete, so its atoms are shared by all groups. HDP mixture models therefore yield exactly this grouped-data characteristic.

## 1. Topic modeling

At a high level, topic modeling views a document as a composite of $K$ topics. Each topic in turn is a composite of words. In a matrix factorization approach to topic modeling, given $D$ documents and $N$ words, the $D \times N$ document-term frequency matrix is factorized into the $D\times K$ document-topic matrix, which describes the distribution of topics in each document, and the $K\times N$ topic-word matrix, which describes the distribution of words in each topic.

The following convention will be used for the remaining sections:

- number of topics: $K$, index of each topic: $k$
- number of documents: $D$, index of each document: $d$
- number of words: $N$, index of each word: $j$
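To make the shapes concrete, the sketch below multiplies a small document-topic matrix by a topic-word matrix. The numbers are made up for illustration. When the rows of both factors are probability distributions, each row of the product is again a distribution over words.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (rows x inner) and b (inner x cols)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]

# D = 3 documents, K = 2 topics, N = 4 words (illustrative values)
doc_topic = [[0.7, 0.3],
             [0.2, 0.8],
             [0.5, 0.5]]
topic_word = [[0.5, 0.3, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]]

doc_word = matmul(doc_topic, topic_word)  # D x N
for row in doc_word:
    print(row, sum(row))
```

Each entry of `doc_word` is the probability of a word in a document, obtained by summing over the topics that could have produced it.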

### 1.1 Topic modeling using Latent Dirichlet Allocation (LDA)

In LDA, the document-topic distribution and the topic-word distribution are each given a Dirichlet prior. For a given document, a realization from the document-topic Dirichlet distribution describes the proportion of each topic in that document, whereas for a given topic, a realization from the topic-word Dirichlet distribution describes the proportion of each word in that topic

\begin{align}
& \theta_d \sim \text{Dir}(\alpha),\ \ d \in \{1,...,D\} \\
& \phi_k \sim \text{Dir}(\beta),\ \ k \in \{1,...,K\}
\end{align}

where $\theta_d$ and $\phi_k$ are vectors of dimension $K$ and $N$ respectively. $\theta_d$ describes the probability of each topic in document $d$ and $\phi_k$ the probability of each word in topic $k$. Once $\theta_d$ and $\phi_k$ are generated, the topic and word at each position of each document can be generated from multinomial distributions

\begin{align}
z_{dj} \sim \text{Multinomial}(\theta_d) \\
w_{dj} \sim \text{Multinomial}(\phi_{z_{dj}})
\end{align}

where $z_{dj}$ is the topic of the $j$-th word in document $d$, and $w_{dj}$ is the $j$-th word in document $d$, generated from the $\phi$ of the topic $z_{dj}$ assigned to that position. Note that a single draw from the multinomial distribution gives one particular category with the probability described by that distribution. For this reason, $\text{Categorical}(\cdot)$ is sometimes used instead of $\text{Multinomial}(\cdot)$.
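The generative process above can be simulated directly. The following is a minimal sketch using only the standard library, assuming symmetric scalar hyperparameters $\alpha$ and $\beta$ and a fixed document length. The function names and the `doc_len` parameter are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalising independent Gamma(params[i], 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(D, K, N, doc_len, alpha, beta, seed=0):
    """Simulate the LDA generative process; returns (theta, phi, z, w)."""
    rng = random.Random(seed)
    phi = [sample_dirichlet([beta] * N, rng) for _ in range(K)]      # topic-word
    theta = [sample_dirichlet([alpha] * K, rng) for _ in range(D)]   # document-topic
    z, w = [], []
    for d in range(D):
        # z_dj ~ Categorical(theta_d)
        z_d = rng.choices(range(K), weights=theta[d], k=doc_len)
        # w_dj ~ Categorical(phi_{z_dj})
        w_d = [rng.choices(range(N), weights=phi[k])[0] for k in z_d]
        z.append(z_d)
        w.append(w_d)
    return theta, phi, z, w

theta, phi, z, w = generate_corpus(D=3, K=2, N=5, doc_len=8, alpha=0.5, beta=0.5)
print(z[0])
print(w[0])
```

Only `w` would be observed in practice; `theta`, `phi` and `z` are the latent quantities that inference has to recover.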

One can use maximum likelihood to derive the optimal $\alpha$ and $\beta$ given a corpus of documents. We first start with the joint likelihood of the latent variables and the words of a document $d$

\begin{equation}
P(\theta_d, \vec{z}_d, \vec{w}_d \mid \alpha, \beta) = p(\theta_d \mid \alpha)\prod_{j=1}^{N_d}p(z_{dj}\mid \theta_d)p(w_{dj}\mid z_{dj},\beta)
\end{equation}

To get the likelihood of observing the corpus $C$ given $\alpha$ and $\beta$, integrate over $\theta_d$ and sum over $z_{dj}$

\begin{align}
P(\vec{w}_d \mid \alpha, \beta) &= \int d\theta_d p(\theta_d \mid \alpha)\Big(\prod_{j=1}^{N_d}\sum_{z_{dj}}p(z_{dj}\mid \theta_d)p(w_{dj}\mid z_{dj},\beta)\Big) \\
P(C=\{\vec{w}_d\}_{d=1}^D \mid \alpha, \beta) &= \prod_{d=1}^D \int d\theta_d p(\theta_d\mid \alpha)\Big(\prod_{j=1}^{N_d}\sum_{z_{dj}}p(z_{dj}\mid \theta_d)p(w_{dj}\mid z_{dj}, \beta)\Big)
\end{align}

where

\begin{align}
p(\theta_d \mid \alpha) &= \text{Dir}(\theta_d \mid \alpha) \\
p(w_{dj} \mid z_{dj}, \beta) &= \text{Multinomial}(\phi_{z_{dj}}) \\
p(z_{dj} \mid \theta_d) &= \text{Multinomial}(\theta_d)
\end{align}

Note that $P(\vec{w}\mid \alpha, \beta)$ is directly controlled by two parameters, $\alpha$ and $\beta$. See the figure below.

**insert figure**

Since $w$ is generated by a multinomial distribution parameterized by $\phi$, which itself is sampled from the Dirichlet distribution parameterized by $\beta$, the figure above should include an intermediate node $\phi$ after $\beta$ and before $w$. The $w$ node is colored grey because it is the only observable (data) of the process.

#### Inference

Goal: infer the parameters $\alpha$ and $\beta$ based on the observable $w$ (the words).

**Step 1**: Update $\phi$

\begin{equation}
\phi_k^{(t+1)} \mid \vec{w}, \vec{z}^{(t)} \sim \text{Dir}(\beta + \vec{n}_k)
\end{equation}

where $\vec{n}_k$ is the vector of counts of each word $i=1,...,N$ assigned to topic $k$.

**Step 2**: Update $\theta$

\begin{equation}
\theta_d^{(t+1)} \mid \vec{w}, \vec{z}^{(t)} \sim \text{Dir}(\alpha^{(t)} + \vec{m}_d)
\end{equation}

where $\vec{m}_d$ is the vector of counts of each topic $k=1,...,K$ in document $d$.

**Step 3**: Update $\vec{z}_d^{(t+1)}$ by sampling each $z_{dj}$ from

\begin{equation}
P(z_{dj}=k \mid \vec{w}, \theta^{(t+1)}, \phi^{(t+1)}) = \frac{\theta_{dk}^{(t+1)}\phi_{kw_{dj}}^{(t+1)}}{\sum_{k'=1}^K \theta_{dk'}^{(t+1)}\phi_{k'w_{dj}}^{(t+1)}}
\end{equation}
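Steps 1 to 3 can be sketched as one sweep of a Gibbs sampler. The code below is a minimal illustration under the assumptions that $\alpha$ and $\beta$ are fixed symmetric scalars and that words and topics are encoded as integers. The function names are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalising independent Gamma(params[i], 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def topic_posterior(theta_d, phi, word):
    """P(z = k | word, theta_d, phi) for every topic k (the Step 3 conditional)."""
    weights = [theta_d[k] * phi[k][word] for k in range(len(theta_d))]
    total = sum(weights)
    return [x / total for x in weights]

def gibbs_sweep(w, z, K, N, alpha, beta, rng):
    """One pass over Steps 1-3; w[d][j] is a word id, z[d][j] its current topic."""
    D = len(w)
    n = [[0] * N for _ in range(K)]  # n[k][v]: count of word v assigned to topic k
    m = [[0] * K for _ in range(D)]  # m[d][k]: count of topic k in document d
    for d in range(D):
        for word, topic in zip(w[d], z[d]):
            n[topic][word] += 1
            m[d][topic] += 1
    # Step 1: phi_k ~ Dir(beta + n_k)
    phi = [sample_dirichlet([beta + c for c in n[k]], rng) for k in range(K)]
    # Step 2: theta_d ~ Dir(alpha + m_d)
    theta = [sample_dirichlet([alpha + c for c in m[d]], rng) for d in range(D)]
    # Step 3: resample every topic assignment from its conditional
    new_z = [[rng.choices(range(K), weights=topic_posterior(theta[d], phi, word))[0]
              for word in w[d]] for d in range(D)]
    return theta, phi, new_z

rng = random.Random(0)
w = [[0, 1, 2, 0], [3, 3, 2]]   # toy corpus with N = 4 word ids
z = [[0, 0, 1, 1], [1, 0, 1]]   # arbitrary initial topic assignments
for _ in range(10):
    theta, phi, z = gibbs_sweep(w, z, K=2, N=4, alpha=0.5, beta=0.1, rng=rng)
print(z)
```

Repeating the sweep produces a Markov chain whose samples of `theta`, `phi` and `z` approximate their joint posterior for the fixed hyperparameters.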

**Step 4**: Update $\alpha^{(t+1)}$ by the Metropolis-Hastings algorithm:

1. Sample a proposal $\alpha'$ from $q(\alpha' \mid \alpha^{(t)}) = N(\alpha^{(t)}, \sigma^2(\alpha^{(t)}))$
2. Let \begin{equation}
r = \frac{p(\alpha' \mid \theta^{(t)}, \vec{w}, \vec{z}^{(t)})}{p(\alpha^{(t)} \mid \theta^{(t)},\vec{w},\vec{z}^{(t)})} \cdot \frac{q(\alpha^{(t)} \mid \alpha')}{q(\alpha' \mid \alpha^{(t)})}
\end{equation}
3. Set $\alpha^{(t+1)}=\alpha'$ with probability $\min(1,r)$, i.e. always accept if $r \ge 1$ and accept with probability $r$ otherwise. If the proposal is rejected, keep $\alpha^{(t+1)}=\alpha^{(t)}$. When the proposal variance does not depend on $\alpha$, the proposal is symmetric and the second factor equals 1.
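A minimal sketch of this update for a scalar, symmetric $\alpha$ is given below. It makes several assumptions that are not in the text: a flat prior on $\alpha>0$, a Gaussian proposal with a fixed standard deviation `sigma` (so the proposal ratio is 1), and a target that depends on $\alpha$ only through the current $\theta_d$. All function names are hypothetical.

```python
import math
import random

def log_dirichlet_symmetric(theta, alpha):
    """Log density of Dir(theta | alpha, ..., alpha); theta must be strictly positive."""
    K = len(theta)
    return (math.lgamma(K * alpha) - K * math.lgamma(alpha)
            + (alpha - 1.0) * sum(math.log(t) for t in theta))

def log_target(alpha, thetas):
    """Unnormalised log p(alpha | thetas), assuming a flat prior on alpha > 0."""
    if alpha <= 0.0:
        return -math.inf
    return sum(log_dirichlet_symmetric(th, alpha) for th in thetas)

def mh_step(alpha, thetas, sigma, rng):
    """One Metropolis-Hastings update of alpha with a symmetric Gaussian proposal."""
    proposal = rng.gauss(alpha, sigma)
    log_r = log_target(proposal, thetas) - log_target(alpha, thetas)
    # Accept with probability min(1, r); otherwise keep the current value.
    if rng.random() < math.exp(min(0.0, log_r)):
        return proposal
    return alpha

rng = random.Random(0)
thetas = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]  # illustrative theta_d
alpha = 1.0
for _ in range(500):
    alpha = mh_step(alpha, thetas, 0.3, rng)
print(alpha)
```

Proposals with $\alpha' \le 0$ have zero target density and are therefore always rejected, which keeps the chain inside the valid range.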

### 1.2 Topic modeling using Hierarchical Dirichlet Process (HDP)

We start with a corpus of $M$ documents, $K$ topics and $N$ words. Again, a document consists of a collection of topics and each topic consists of a collection of words. In the case of HDP, $K$ can be infinite.

The $n$-th word in document $m$, $w_{mn}$ is assumed to be drawn from a multinomial (categorical) distribution parametrized by $\theta_{mn}$

\begin{equation}
w_{mn} \mid \theta_{mn} \sim \text{Multinomial}(\theta_{mn})
\end{equation}

The parameter $\theta_{mn}$ is drawn from a distribution $G_m$ that corresponds to document $m$. $G_m$ is a distribution over a countably infinite number of topics, i.e. the sample space $\Omega$ of $G_m$ is not $\mathbb{R}$. This implies that the same value of $\theta_{mn}$ can be sampled multiple times, as opposed to sampling from a continuous distribution, e.g. a Gaussian, in which case no two samples share exactly the same value. **This is the source of clustering in sampling from a DP**.

\begin{align}
& \theta_{mn} \mid G_m \sim G_m \\
& G_m \mid \alpha_0, G_0 \sim \text{DP}(\alpha_0, G_0)
\end{align}

$G_m$ is a distribution over a countably infinite number of topics because it is sampled from a DP whose base distribution $G_0$ is discrete to begin with. $G_0$ is a base distribution common to all documents, and therefore all $G_m$ have the same $\Omega$, i.e. the same universe of topics. $G_0$ is itself drawn from another DP with discrete base distribution $H$

\begin{equation}
G_0 \mid \gamma, H \sim \text{DP}(\gamma, H)
\end{equation}

The discrete base distribution $H$ is drawn from a Dirichlet distribution parameterized by $\beta$

\begin{equation}
H \sim \text{Dirichlet}(\beta)
\end{equation}

where $H$ is a distribution whose dimension is the size of the vocabulary, hence is discrete.
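The sharing of topics across documents can be illustrated with a finite truncation of the two-level construction. This is an approximation for illustration, not an exact HDP sampler: the global weights of $G_0$ over $K$ atoms are drawn from $\text{Dir}(\gamma/K,...,\gamma/K)$, and each document's weights over the same atoms are drawn from a Dirichlet whose parameters are $\alpha_0$ times the global weights. Atoms are represented only by their index, and the function names are hypothetical.

```python
import random

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalising independent Gamma draws."""
    # Guard against a zero parameter, which gammavariate does not accept.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]

def sample_truncated_hdp(gamma, alpha0, K, M, rng):
    """Return (global_weights, doc_weights) over K shared atoms for M documents."""
    # Weights of G_0 over the shared atoms (truncated at K atoms)
    global_weights = sample_dirichlet([gamma / K] * K, rng)
    # Weights of each G_m over the same atoms, centred on the global weights
    doc_weights = [sample_dirichlet([alpha0 * b for b in global_weights], rng)
                   for _ in range(M)]
    return global_weights, doc_weights

rng = random.Random(0)
global_weights, doc_weights = sample_truncated_hdp(gamma=5.0, alpha0=5.0, K=10, M=3, rng=rng)
print([round(x, 3) for x in global_weights])
for row in doc_weights:
    print([round(x, 3) for x in row])
```

All documents place their mass on the same $K$ atoms but with different proportions, which is the grouped-data behaviour that HDP is designed to capture.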

Note that $P(\vec{w}_m \mid \theta_m)$ is directly controlled by one parameter, $\theta_m$. See the figure below.

**insert figure**

In the figure, $\phi_{ij}$ corresponds to the $\theta_{mn}$ used here.

#### Inference

Since $G$ is non-parametric, i.e. the number of parameters scales with the amount of data, using EM is difficult. An alternative is MCMC (Metropolis-Hastings and/or Gibbs sampling).

### 1.3 Topic modeling using spherical Hierarchical Dirichlet Process (sHDP)
sHDP leverages word embeddings within HDP. The embedding vector of word $j$ in document $d$ (with dimension $M$) is normalized to unit length and is assumed to be generated from a von Mises-Fisher (vMF) distribution with center $\mu_k$ (also with dimension $M$). The centers $\mu_k$ are themselves drawn from a vMF distribution.

**insert figure**

#### Inference
Again, computation of the full posterior distribution is intractable, so a variational method is used. In a mean-field approach, the posterior distribution $p$ is approximated by a fully factorized variational distribution $q$:

\begin{equation}
p(z,\beta,\pi,\mu,\kappa \mid X) \approx q(z,\beta,\pi,\mu,\kappa)=q(z)q(\beta)q(\pi)q(\mu)q(\kappa)
\end{equation}

The next step is to maximize the evidence lower bound (ELBO) $F_q$, which is equivalent to minimizing the variational free energy $-F_q$ and hence the KL divergence between $q$ and $p$

\begin{equation}
F_q = \langle \log p(X,z,\beta,\pi,\mu,\kappa)\rangle_q - H_q
\end{equation}

where $H_q = \langle \log q(z,\beta,\pi,\mu,\kappa)\rangle_q$ is the negative entropy of $q$.
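
The link between this objective and the KL divergence can be made explicit. Writing $Z=(z,\beta,\pi,\mu,\kappa)$ as a shorthand introduced here only for brevity, the log evidence decomposes as

\begin{align}
\log p(X) &= \Big[\langle \log p(X,Z)\rangle_q - \langle \log q(Z)\rangle_q\Big] + \text{KL}\big(q(Z)\,\|\,p(Z\mid X)\big)
\end{align}

Since $\log p(X)$ does not depend on $q$ and the KL term is non-negative, the bracketed term is a lower bound on $\log p(X)$, and increasing it over $q$ decreases the KL divergence by the same amount.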