Topic Model with Latent Dirichlet Allocation
==

## 1.Background

In natural language processing, **Latent Dirichlet Allocation (LDA)** is a widely used topic model that can automatically discover the topics a corpus contains and explain similarities between documents. LDA is intriguing to us because it is a three-level hierarchical Bayesian model and because topic modeling is a classic problem in natural language processing.

In the following report, we first describe the mechanism of Latent Dirichlet Allocation. We then implement LDA with two methods: Variational Inference and Collapsed Gibbs Sampling. We try to optimize the performance of our implementation with Cython. Additionally, we generate a test data set based on different topics and visualize the result of topic discovery.

## 2.Algorithm Description

LDA uses a generative model to explain how the observed words in a corpus of many documents are generated from latent variables. The following figure shows the graphical model representation of LDA.

<img src = 'LDA.png'>

The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents, N the number of words in a document, and V the size of the vocabulary in the corpus. We also define the following terms:

* $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distribution,
* $\beta_k$ is the word distribution for topic k,
* $\theta_i$ is the topic distribution for document i,
* $z_{mw}$ is the topic of word w in document m.

LDA assumes the following generative process for each document **m** in a corpus:
1. Choose N ~ Poisson($\xi$).
2. Choose $\theta$ ~ Dir($\alpha$).
3. For each of the N words $w_n$:
    1. Choose a topic $z_n$ ~ Multinomial($\theta$).
    2. Choose a word $w_n$ from $p(w_n |z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
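
The generative process above can be sketched in a few lines of NumPy. This is a minimal illustration rather than part of the implementation in this report: the function name `generate_document`, the Poisson mean `xi`, and the toy values of `alpha` and `beta` are assumptions chosen for the example.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Draw one document (a list of word ids) from the LDA generative process.

    alpha : shape (K,), Dirichlet parameter for the topic mixture
    beta  : shape (K, V), each row is the word distribution of one topic
    xi    : mean of the Poisson distribution over the document length
    """
    K, V = beta.shape
    N = rng.poisson(xi)                 # 1. document length
    theta = rng.dirichlet(alpha)        # 2. topic mixture for this document
    words, topics = [], []
    for _ in range(N):                  # 3. for each word position
        z = rng.choice(K, p=theta)      #    a. choose a topic
        w = rng.choice(V, p=beta[z])    #    b. choose a word given the topic
        topics.append(z)
        words.append(w)
    return words, topics, theta

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])                  # toy prior over K = 2 topics
beta = np.array([[0.7, 0.2, 0.1, 0.0],        # toy word distributions, V = 4
                 [0.0, 0.1, 0.2, 0.7]])
words, topics, theta = generate_document(alpha, beta, xi=20, rng=rng)
```

Each call returns the word ids together with the latent `topics` and `theta` that produced them; in inference only the words are observed.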

### The Dirichlet Distribution

The **Dirichlet distribution** is the multivariate generalization of the beta distribution; it is a distribution over discrete probability distributions. Dirichlet distributions are often used as conjugate priors for the categorical and multinomial distributions in Bayesian statistics.

A *k*-dimensional Dirichlet random variable $\theta$ takes values in the (k-1)-simplex (a k-vector $\theta$ lies in the (k-1)-simplex if $\theta_i \ge 0$ and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on the simplex:

$$p(\theta| \alpha) = \frac { \Gamma (\sum _{ i=1 }^{ k }{ { \alpha  }_{ i } } ) }{ \prod _{ i=1 }^{ k }{ \Gamma ({ \alpha  }_{ i }) }  } { { \theta  }_{ 1 } }^{ { \alpha  }_{ 1 }-1 }\cdot \cdot \cdot { { \theta  }_{ k } }^{ { \alpha  }_{ k }-1 }$$
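
To make the density concrete, the following sketch draws a few samples from a Dirichlet distribution and evaluates the logarithm of the density above. The helper name `dirichlet_logpdf` and the values of `alpha` are illustrative assumptions.

```python
import math
import numpy as np

def dirichlet_logpdf(theta, alpha):
    """Log of the Dirichlet density p(theta | alpha) given above."""
    log_norm = math.lgamma(float(np.sum(alpha))) - sum(math.lgamma(a) for a in alpha)
    return log_norm + float(np.sum((alpha - 1.0) * np.log(theta)))

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=5)    # each row is a point on the 2-simplex

# With alpha = (1, 1, 1) the density is constant on the simplex and equals
# Gamma(3) / Gamma(1)^3 = 2.
uniform_density = math.exp(dirichlet_logpdf(np.array([0.2, 0.3, 0.5]), np.ones(3)))
```

Every sampled row is non-negative and sums to one, so each sample is itself a valid discrete probability distribution.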

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of N topics z, and a set of N words w is given by:

$$
p(\theta, z, w|\alpha, \beta)=p(\theta|\alpha)\prod _{ n=1 }^{ N }{ p(z_n|\theta)p(w_n|z_n,\beta) } 
$$

where $p(z_n|\theta)$ is simply $\theta_i$ for the unique i such that $z_n^i = 1$. Integrating over $\theta$ and summing over z, we obtain the marginal distribution of a document:

$$
p(w|\alpha, \beta) = \int { p(\theta |\alpha )\left(\prod _{ n=1 }^{ N }{ \sum _{ z_n }{ p({ z }_{ n }|\theta )p({ w }_{ n }|{ z }_{ n },\beta ) }  } \right)d\theta  } 
$$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$$
p(D|\alpha, \beta) = \prod _{ d=1 }^{ M }{ \int { p(\theta_d |\alpha )\left(\prod _{ n=1 }^{ N_d }{ \sum _{ z_{dn} }{ p({ z }_{ dn }|\theta_d )p({ w }_{ dn }|{ z }_{ dn },\beta ) }  } \right)d\theta_d  }  } 
$$

Inferring the latent parameters is a problem of Bayesian inference. Next, we use Variational Inference and Gibbs Sampling to estimate the latent parameters.

## 3.Implementation - Gibbs Sampling

In the original paper, Blei, Ng and Jordan (2002) gave a variational inference approximation of the posterior distribution, because the exact posterior is intractable to compute. Since then, Gibbs Sampling has also become a commonly used way to infer the latent parameters in LDA, and it is the method we implement here. Gibbs Sampling is a Markov Chain Monte Carlo method in which the next state is reached by sampling each variable in turn from its full conditional distribution given all other variables and the data.

Since the Dirichlet distribution is the conjugate prior of the multinomial distribution, the posteriors of $\theta_i$ and $\beta_k$ are also Dirichlet distributions. Their posterior means are:

$$
\theta_{i,k} = \frac { { n }_{ i }^{ k }+{ \alpha  }_{ k } }{ \sum _{ k'=1 }^{ K }{ \left( { n }_{ i }^{ k' }+{ \alpha  }_{ k' } \right) }  } 
$$

$$
\beta_{k,w} = \frac { { n }_{ w }^{ k }+{ \phi  }_{ w } }{ \sum _{ w'=1 }^{ V }{ \left( { n }_{ w' }^{ k }+{ \phi  }_{ w' } \right) }  } 
$$

where ${ n }_{ i }^{ k }$ is the number of words in document i that have been assigned to topic k, ${ n }_{ w }^{ k }$ is the number of times word w has been assigned to topic k across all documents in the corpus, and $\alpha_k$ and $\phi_w$ are the parameters of the Dirichlet priors on the document-topic and topic-word distributions.

The estimates of $\theta$ and $\beta$ depend only on the assignment of each word to a topic, $z_i$. Therefore, we need only estimate $z_i$. To do so, we define the following counts:

* $n_m$: the number of words in document m, not including the current one
* $n_{mz}$: the number of words from document m assigned to topic z, not including the current one
* $n_{zw}$: the number of instances of word w assigned to topic z, not including the current one
* $n_z$: the total number of words assigned to topic z, not including the current one

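Then, the full conditional distribution of the topic assignment $z_i$ of the current word, given the assignments $z_{-i}$ of all other words, is:

$$
p(z_i=j|z_{-i},w)\propto \frac { n_{jw} + \phi }{ n_j + V\phi } \cdot \frac { n_{mj} + \alpha }{ n_m + K\alpha } 
$$

Here the counts are those defined above evaluated at topic z = j, w is the current word, m is its document, and $\alpha$ and $\phi$ are symmetric Dirichlet priors on the document-topic and topic-word distributions.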
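
As a small worked example of the conditional distribution above, the sketch below evaluates it for one word with K = 2 topics and V = 3 vocabulary words. All counts and hyperparameter values are made up for illustration and are assumed to already exclude the current word.

```python
import numpy as np

K, V = 2, 3
alpha, phi = 0.5, 0.1

n_zw = np.array([[4, 1, 0],    # word counts per topic (rows: topics)
                 [0, 2, 3]])
n_z = n_zw.sum(axis=1)          # total words per topic: [5, 5]
n_mz = np.array([3, 1])         # topic counts in the current document m
n_m = n_mz.sum()                # words in document m: 4

w = 0                           # vocabulary id of the current word
p = ((n_zw[:, w] + phi) / (n_z + V * phi)) * ((n_mz + alpha) / (n_m + K * alpha))
p = p / p.sum()                 # normalize over the K topics
```

With these numbers the unnormalized weights are proportional to 4.1 × 3.5 and 0.1 × 1.5, so topic 0 receives a probability of about 0.99: the word has been seen almost only under topic 0, and the document leans toward topic 0 as well.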

With this conditional distribution, we can implement LDA by Gibbs Sampling.

### Parameters

document:    $m = 1,...,M$

topic assigned to word:       $z = 1,...,K$

word:        $w = 1,...,N_V$

vocabulary : $v = 1,...,V$

Z: topic assigned to word w

$\theta: M \times K$ 

$\beta: K \times V$ 

$Multinomial(\theta)$: distribution over topics for a given document

$Multinomial(\beta)$: distribution over words for a given topic

$n_m$, $n_{mz}$, $n_{zw}$, $n_z$: as defined above

### Word count function

In [7]:
import numpy as np

def words_count_doc(corpus):
    """
    Count the total number of words in each document in the corpus.

    Parameters
    ----------
    corpus : a list-like, contains bag-of-words of each document,
             given as (word_id, count) pairs

    Returns
    -------
    n_m : a np.array, shape (M,)
         the total number of words in each document
    """
    # Sum the counts (second element of each pair); an empty document gives 0.
    return np.array([sum(t for _, t in doc) for doc in corpus])
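
The functions in this section treat each document as a bag-of-words list of `(word_id, count)` pairs, as implied by how the code unpacks each document. A toy corpus in that format (the values are made up) and the per-document totals it should produce look like this:

```python
import numpy as np

# Toy corpus: 2 documents over a vocabulary of V = 4 word ids.
corpus = [
    [(0, 2), (1, 1), (3, 1)],   # document 0: word 0 twice, words 1 and 3 once
    [(1, 3), (2, 2)],           # document 1: word 1 three times, word 2 twice
]

# Same result as words_count_doc: sum the counts in each document.
n_m = np.array([np.sum(doc, axis=0)[1] for doc in corpus])
print(n_m)  # [4 5]
```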

### Initialize empty parameters

In [8]:
def empty_parameters(corpus, K, V):
    """
    Initialize empty parameters z_mw, n_mz, n_zw, n_z.

    Parameters:
    -----------
    corpus : a list-like, contains bag-of-words of each document
    K : int, the number of topics
    V : int, the size of the vocabulary
    
    Returns:
    --------
    z_mw : list, the topic of word w in document m (initially empty)
    n_mz : np.array, shape (M, K), the number of words from document m assigned to topic z
    n_zw : np.array, shape (K, V), the number of instances of word w assigned to topic z
    n_z : np.array, shape (K,), the total number of words assigned to topic z
    """
    z_mw = []
    n_mz = np.zeros((len(corpus), K))
    n_zw = np.zeros((K, V))
    n_z = np.zeros(K)
    return z_mw, n_mz, n_zw, n_z

### Initialize parameters based on words in documents

In [9]:
def initial_parameters(corpus, K, V):
    """
    Initialize parameters for the corpus by assigning a random topic
    to every distinct word of every document.

    Parameters:
    -----------
    corpus: a list-like, contains bag-of-words of each document
    K : int, the number of topics
    V : int, the size of the vocabulary

    Returns:
    --------
    z_mw : the topic of word w in document m
    n_mz : the number of words from document m assigned to topic z
    n_zw : the number of instances of word w assigned to topic z
    n_z : the total number of words assigned to topic z
    
    """
    z_mw, n_mz, n_zw, n_z = empty_parameters(corpus, K, V)
    for m, doc in enumerate(corpus):
        z_n = []
        # w is the vocabulary id of the word, t its count in document m
        for w, t in doc:
            z = np.random.randint(0, K)  # uniform random initial topic
            z_n.append(z)
            n_mz[m, z] += t
            n_zw[z, w] += t
            n_z[z] += t
        z_mw.append(np.array(z_n))
    return z_mw, n_mz, n_zw, n_z
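
After initialization, and after every Gibbs update, the count arrays should stay mutually consistent. The following helper is not part of the original code; `check_counts` is a hypothetical name for a sanity check that may help when debugging, and the arrays below are hand-built for illustration.

```python
import numpy as np

def check_counts(n_mz, n_zw, n_z, n_m):
    """Return True if the count arrays agree with each other."""
    return bool(
        np.allclose(n_mz.sum(axis=1), n_m)       # per-document totals
        and np.allclose(n_mz.sum(axis=0), n_z)   # per-topic totals from documents
        and np.allclose(n_zw.sum(axis=1), n_z)   # per-topic totals from words
        and (n_mz >= 0).all()
        and (n_zw >= 0).all()
    )

# Hand-built counts for M = 2 documents, K = 2 topics, V = 3 words.
n_mz = np.array([[2.0, 1.0], [0.0, 3.0]])
n_zw = np.array([[1.0, 1.0, 0.0], [0.0, 2.0, 2.0]])
n_z = np.array([2.0, 4.0])
n_m = np.array([3, 3])
ok = check_counts(n_mz, n_zw, n_z, n_m)
```

A failed check usually points to a count that was decremented for the old topic but not incremented for the new one, or the reverse.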

### Gibbs Sampling

In [11]:
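def gibbs_sampling(corpus, max_iter, K, V, n_zw, n_z, n_mz, n_m, z_mw, alpha, phi):
    """
    Run collapsed Gibbs sampling and record beta and theta after each sweep.

    alpha is the Dirichlet prior on the document-topic distributions and
    phi the Dirichlet prior on the topic-word distributions. The count
    arrays and z_mw are updated in place.
    """
    beta_gibbs = []
    theta_gibbs = []
    
    for i in range(max_iter):
        if i%1000 == 0:
            print(i)  # progress indicator
        for m, doc in enumerate(corpus):
            for n, (w, t) in enumerate(doc):
                #exclude the current word
                z = z_mw[m][n]
                n_mz[m, z] -= t
                n_m[m] -= t
                n_zw[z, w] -= t
                n_z[z] -= t
        
                new_z = sample_topic(K, n_zw, n_z, n_mz, n_m, alpha, phi, w, m)

                #include the current word
                z_mw[m][n] = new_z
                n_mz[m, new_z] += t
                n_zw[new_z, w] += t
                n_z[new_z] += t
                n_m[m] += t

        #update beta (topic-word distributions) with the word prior phi
        beta_gibbs.append(update_beta(V, n_zw, n_z, phi))
        #update theta (document-topic distributions) with the topic prior alpha
        theta_gibbs.append(update_theta(K, n_mz, n_m, alpha))
    return beta_gibbs, theta_gibbs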

def sample_topic(K, n_zw, n_z, n_mz, n_m, alpha, phi, w, m):
    """
    Sample a new topic for the current word w in document m
    from its full conditional distribution.
    """
    V = n_zw.shape[1]  # vocabulary size
    # Unnormalized conditional probability of each of the K topics
    p_z = ((n_zw[:, w] + phi)/(n_z + V * phi)) * ((n_mz[m] + alpha)/(n_m[m] + K * alpha))
    new_z = np.random.multinomial(1, p_z/p_z.sum()).argmax()
    return new_z

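def update_beta(V, n_zw, n_z, phi):
    """
    Update beta, the K x V matrix of topic-word distributions,
    as the posterior mean under the word prior phi.
    """
    beta = (n_zw + phi)/(n_z[:,None] + V * phi)
    return beta

def update_theta(K, n_mz, n_m, alpha):
    """
    Update theta, the M x K matrix of document-topic distributions,
    as the posterior mean under the topic prior alpha.
    """
    theta = (n_mz + alpha)/(n_m[:, None] + K * alpha)
    return theta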
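
`gibbs_sampling` returns one estimate of beta and theta per iteration. One common way to summarize such output, sketched here as an assumption rather than as the procedure used in this report, is to discard an initial burn-in portion of the chain and average the rest. The names `summarize_samples` and `top_words`, the `burn_in` value and the random stand-in samples are all illustrative.

```python
import numpy as np

def summarize_samples(samples, burn_in):
    """Average a list of equally shaped arrays after dropping the first burn_in."""
    kept = np.asarray(samples[burn_in:])
    return kept.mean(axis=0)

def top_words(beta, n_top):
    """Return, for each topic, the ids of the n_top most probable words."""
    return np.argsort(-beta, axis=1)[:, :n_top]

# Stand-in for beta_gibbs: 10 draws of a (K = 2) x (V = 4) topic-word matrix.
rng = np.random.default_rng(0)
beta_gibbs = [rng.dirichlet(np.ones(4), size=2) for _ in range(10)]

beta_hat = summarize_samples(beta_gibbs, burn_in=5)
print(top_words(beta_hat, n_top=2).shape)  # (2, 2)
```

Averaging is only meaningful within a single chain, since topic labels are arbitrary and can differ between independently initialized runs.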