Topic Model with Latent Dirichlet Allocation
==

### Background

In natural language processing, **Latent Dirichlet Allocation (LDA)** is a widely used topic model that automatically discovers the topics a corpus contains and explains similarities between documents. LDA interests us because it is a three-level hierarchical Bayesian model and because topic modeling is a classic problem in natural language processing.

In this report, we first describe the mechanism of Latent Dirichlet Allocation. We then implement LDA with two methods: Variational Inference and Collapsed Gibbs Sampling. We try to optimize the performance of our implementation with Cython. Additionally, we generate a test data set based on different topics and visualize the results of topic discovery.

### Algorithm Description

LDA uses a generative model to explain how the observed words in a corpus of many documents are generated from latent variables. The following figure shows the graphical model representation of LDA.

<img src = 'LDA.png'>

The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents and N the number of words in a document. We define the following terms:

* $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distribution,
* $\beta_k$ is the word distribution for topic k,
* $\theta_m$ is the topic distribution for document m,
* $z_{mw}$ is the topic of word w in document m.

LDA assumes the following generative process for each document **m** in a corpus:
1. Choose N ~ Poisson($\xi$).
2. Choose $\theta$ ~ Dir($\alpha$).
3. For each of the N words $w_n$:
    1. Choose a topic $z_n$ ~ Multinomial($\theta$).
    2. Choose a word $w_n$ from $p(w_n |z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
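
The generative process above can be sketched in a few lines of NumPy. This is only an illustration, not the implementation used later in this report: the function name `generate_document`, the toy sizes (3 topics, 8 vocabulary words), and the values of `alpha` and `xi` are all assumptions made for the example.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.

    alpha : (K,) Dirichlet parameter
    beta  : (K, V) matrix, row k is the word distribution of topic k
    xi    : mean of the Poisson document length (illustrative value)
    """
    n_words = rng.poisson(xi)                              # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                           # 2. theta ~ Dir(alpha)
    z = rng.choice(len(alpha), size=n_words, p=theta)      # 3.1 topic for each word
    vocab_size = beta.shape[1]
    w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z], dtype=int)  # 3.2 word given topic
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])            # assumed toy prior over 3 topics
beta = rng.dirichlet(np.ones(8), size=3)     # assumed toy topics over 8 words
theta, z, w = generate_document(alpha, beta, xi=20, rng=rng)
print(theta, z, w)
```

Each call returns the document's topic mixture, the per-word topic assignments, and the word indices, so the latent variables can be compared against what an inference method recovers.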

#### The Dirichlet Distribution

The **Dirichlet distribution** is the multivariate generalization of the beta distribution; it is a distribution over discrete probability distributions. Dirichlet distributions are often used as conjugate priors for the categorical and multinomial distributions in Bayesian statistics.

A *k*-dimensional Dirichlet random variable $\theta$ takes values in the (k-1)-simplex (a k-vector $\theta$ lies in the (k-1)-simplex if $\theta_i \ge 0$ and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on the simplex:

$$p(\theta| \alpha) = \frac { \Gamma (\sum _{ i=1 }^{ k }{ { \alpha  }_{ i } } ) }{ \prod _{ i=1 }^{ k }{ \Gamma ({ \alpha  }_{ i }) }  } { { \theta  }_{ 1 } }^{ { \alpha  }_{ 1 }-1 }\cdot \cdot \cdot { { \theta  }_{ k } }^{ { \alpha  }_{ k }-1 }$$
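
As a quick check of the density formula, the following sketch evaluates it in log space using only the standard library. The helper name `dirichlet_log_pdf` is hypothetical and introduced here for illustration; working with `math.lgamma` avoids overflow of the Gamma function for large $\alpha_i$.

```python
import math

def dirichlet_log_pdf(theta, alpha):
    """Log of the Dirichlet density p(theta | alpha) at a point inside the simplex."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(t <= 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must lie in the interior of the simplex")
    # log of the normalizing constant: log Gamma(sum alpha_i) - sum log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of the product of theta_i^(alpha_i - 1)
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel

# With alpha = (1, 1, 1) the density is constant on the 2-simplex and equals Gamma(3) = 2.
density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
print(density)
```

For k = 2 the same function reduces to the beta density, which gives a second easy sanity check.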

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of N topics z, and a set of N words w is given by:

$$
p(\theta, z, w|\alpha, \beta)=p(\theta|\alpha)\prod _{ n=1 }^{ N }{ p(z_n|\theta)p(w_n|z_n,\beta) } 
$$

where $p(z_n|\theta)$ is simply $\theta_i$ for the unique i such that $z_n^i = 1$. Integrating over $\theta$ and summing over z, we obtain the marginal distribution of a document:

$$
p(w|\alpha, \beta) = \int { p(\theta |\alpha )\left(\prod _{ n=1 }^{ N }{ \sum _{ z_n }{ p({ z }_{ n }|\theta )p({ w }_{ n }|{ z }_{ n },\beta ) }  } \right)d\theta  } 
$$
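
The integral over $\theta$ has no closed form in general, which is why approximate inference is needed. To make the formula concrete, here is a naive Monte Carlo sketch that averages the inner product over draws of $\theta$ from the prior. It is not one of the inference methods used in this report, the function name `log_marginal_mc` and the toy inputs are assumptions, and the estimator becomes very noisy for long documents.

```python
import numpy as np

def log_marginal_mc(w, alpha, beta, n_samples=5000, seed=0):
    """Naive Monte Carlo estimate of log p(w | alpha, beta) for one document.

    w     : sequence of word indices
    alpha : (K,) Dirichlet parameter
    beta  : (K, V) matrix, row k is the word distribution of topic k
    """
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)   # (S, K) draws from p(theta | alpha)
    # For each draw and each word: sum_k theta_k * beta[k, w_n]  -> shape (S, N)
    word_probs = thetas @ beta[:, w]
    log_lik = np.log(word_probs).sum(axis=1)        # log of the product over words, per draw
    # Average the likelihoods in a numerically stable way (log-mean-exp)
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))

alpha = np.array([1.0, 1.0])
beta = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
# For a one-word document the exact value is sum_k (alpha_k / sum(alpha)) * beta[k, w] = 0.5
estimate = np.exp(log_marginal_mc([0], alpha, beta))
print(estimate)
```

When all topics share the same word distribution, the sum over topics no longer depends on $\theta$ and the estimate equals the plain product of word probabilities, which is a useful correctness check.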

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$
p(D|\alpha, \beta) = \prod _{ d=1 }^{ M }{ \int { p(\theta_d |\alpha )\left(\prod _{ n=1 }^{ N_d }{ \sum _{ z_{dn} }{ p({ z }_{ dn }|\theta_d )p({ w }_{ dn }|{ z }_{ dn },\beta ) }  } \right)d\theta_d  }  } 
$$

Inferring the latent parameters is a Bayesian inference problem. Next, we use Variational Inference and Gibbs Sampling to estimate them.