# Generating/initializing random topics and words

- A collection of introductory blog posts (at the bottom of the page): https://devo-evo.lab.asu.edu/methods/?q=node/42
- The original Edwin Chen github repo on Sarah Palin: https://github.com/echen/sarah-palin-lda 

In [1]:
documents = []

documents.append("I like eating broccoli, and munching on avocados. "+ \
                "I would drink lots of Gatorade, diet Coke, Ginger ale, " + \
                "and devour cake. Cheesecake is delicious.")
documents.append("Puppies and kittens are adorable. Hamsters are cute. Koalas " + \
                "are so furry. My pet chihuahua and your pet Spaniel play with " + \
                "the Golden Retriever.")
documents.append("Puppies and kittens are adorable. Hamsters are cute. Koalas " + \
                "are so furry. My pet chihuahua and your pet Spaniel play with " + \
                "the Golden Retriever.")
documents.append("Puppies and kittens are adorable. Hamsters are cute. Koalas " + \
                "are so furry. My pet chihuahua and your pet Spaniel play with " + \
                "the Golden Retriever.")

### Build a vocabulary and translate documents into integer words

In [2]:
import re
import numpy as np

# Get the vocabulary using regular expressions
def docs2vocab( documents ):
    
    alldocs = ""
    for doc in documents:
        alldocs += doc
    vocabulary = re.split(r'; |, | |\.', alldocs)
    vocabulary = sorted( elt for elt in set(vocabulary) if elt != '' )
    
    return vocabulary

# For a single document, translate it into its vocabulary integers
def doc2vocab( doc, vocab ):
    
    word2vocab = dict( zip( vocab, range(len(vocab))))
    doc_words = re.split(r'; |, | |\.', doc)
    translated_words = [ word2vocab[word] for word in doc_words if word in word2vocab ]
    
    return translated_words

# Take integer words, and translate them back to a list of ascii words
def vocab2doc( translated, vocab ):
    return [ vocab[ int_word ] for int_word in translated ]

### Examples of how to use the above functions

In [3]:
# Create a vocabulary from documents
vocabulary = docs2vocab( documents )

# Integer words, ascii words, and both words in list format
document = documents[0]
int_words = doc2vocab(document, vocabulary)
ascii_words = vocab2doc( int_words, vocabulary )
int_ascii = list( zip( int_words, ascii_words ) )

### Make some random assignments

The topic distribution is $\theta \sim Dir( \alpha )$, so we'd like $P(\theta|\alpha)$. One $\theta$ is sampled per document; therefore, with $M$ documents there are $M$ of them.

In [9]:
alphas = [5,5]
theta = np.random.dirichlet( alphas )
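
Since one $\theta$ is drawn per document, all $M$ draws can be generated in a single call. The following is a minimal sketch, assuming $M = 4$ documents and a seeded generator for repeatability; the names `rng`, `M`, and `thetas_per_doc` are illustrative and do not appear in the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
alphas = [5, 5]                  # same concentration parameters as above
M = 4                            # assumed number of documents

# One row per document; each row is a distribution over the k = len(alphas) topics
thetas_per_doc = rng.dirichlet(alphas, size=M)

print(thetas_per_doc.shape)        # (M, k)
print(thetas_per_doc.sum(axis=1))  # every row sums to one (up to rounding)
```

Larger values in `alphas` concentrate the rows around the uniform distribution, while values below one push each document toward a single dominant topic.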

Given a topic distribution $\theta$ drawn from $P(\theta|\alpha)$, we'd like to take $N$ samples from the distribution specified by $\theta$, each of which is called $z_n$. So,

$z_n \sim Multi( \theta )$

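Formally, $P(z_n | \theta) = \theta_1^{z_n^{(1)}} \theta_2^{z_n^{(2)}} \cdots \theta_k^{z_n^{(k)}}$, a multinomial distribution with a single trial and parameter $\theta$. Here $z_n$ is a one-hot vector, so $z_n^{(i)} = 1$ for the sampled topic and $0$ otherwise, and the multinomial coefficient is $1$.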

In [5]:
N = 100
z_n = np.random.multinomial( 1, theta, size=N )
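
Each draw with a single trial comes back as a one-hot row, so it is often convenient to convert the rows to topic indices. The sketch below assumes a seeded generator and uses the illustrative names `z_onehot`, `z_index`, and `topic_counts`, none of which come from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
theta = rng.dirichlet([5, 5])    # topic distribution for one document
N = 100                          # number of words in the document

z_onehot = rng.multinomial(1, theta, size=N)  # shape (N, k); each row is one-hot
z_index = z_onehot.argmax(axis=1)             # topic index in {0, ..., k-1} per word
topic_counts = z_onehot.sum(axis=0)           # number of words assigned to each topic

# The empirical topic frequencies should be close to theta for large N
print(theta, topic_counts / N)
```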

Given the topic $z_n$ and the topic-word matrix $\beta$, the $n^{th}$ word is drawn from a multinomial distribution whose parameter is the row of $\beta$ for topic $z_n$. Here, $\beta$ is a matrix of size $k \times V$, since there are $k$ topics, i.e. $z_n \in \{1, \dots, k\}$, and $V$ words in the vocabulary.

For each fixed topic, $P( w | z_n, \beta )$ is a proper distribution over words, so every row of $\beta$ sums to one; the columns need not.

In [None]:
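betas = np.random.dirichlet( np.ones(len(vocabulary)), size=len(alphas) )  # illustrative k x V matrix; each row sums to one
w_n = [ np.random.choice( len(vocabulary), p=betas[z.argmax()] ) for z in z_n ]  # one word index per topic draw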
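
To make the shape of $\beta$ concrete, the sketch below builds a toy topic-word matrix and draws one word per topic assignment. It assumes $k = 2$ topics, a hypothetical vocabulary of $V = 6$ words, and a randomly generated $\beta$; a fitted model would supply its own matrix.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
k, V, N = 2, 6, 100              # topics, toy vocabulary size, words per document

# Hypothetical topic-word matrix: row i is topic i's distribution over the V words
betas = rng.dirichlet(np.ones(V), size=k)

theta = rng.dirichlet([5, 5])            # topic distribution for one document
z = rng.choice(k, size=N, p=theta)       # topic index for each of the N words

# Each word is drawn from the row of betas selected by its topic
w = np.array([rng.choice(V, p=betas[topic]) for topic in z])

print(betas.sum(axis=1))  # rows sum to one
print(betas.sum(axis=0))  # columns generally do not
```

The integers in `w` index into the vocabulary, so they can be mapped back to ASCII words in the same way as the output of `doc2vocab`.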

With the given topic distribution $\theta = [\theta_1, \theta_2, \theta_3, \cdots, \theta_k]$, the probability that the $n^{th}$ word is from topic $i$ is $P( z_n = i | \theta )$, which is simply $\theta_i$. That is to say, in one-hot notation, the probability that $z_n^{(i)} = 1$, or $P(z_n^{(i)}=1 | \theta, \alpha)$, is simply the parameter $\theta_i$.

Then, we can write $P( z_n, \theta | \alpha ) = P( z_n | \theta) P(\theta | \alpha)$, because $z_n$ depends on $\alpha$ only through $\theta$. If we look over $N$ words and the draws are independent given $\theta$, then we have $P( \theta, z | \alpha ) = P(\theta | \alpha) \prod_n P( z_n | \theta )$.
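
The factorization $P(\theta | \alpha) \prod_n P( z_n | \theta )$ is easiest to evaluate in log space, where the product becomes a sum. The following is a minimal sketch, assuming topics are stored as integer indices; the helper names `log_dirichlet_pdf` and `log_joint` are illustrative and not part of the original notebook.

```python
import math
import numpy as np

def log_dirichlet_pdf(theta, alphas):
    """Log density of Dir(alphas) evaluated at the probability vector theta."""
    alphas = np.asarray(alphas, dtype=float)
    # Log of the normalizing constant Gamma(sum(alpha)) / prod(Gamma(alpha_i))
    log_norm = math.lgamma(alphas.sum()) - sum(math.lgamma(a) for a in alphas)
    return log_norm + float(np.sum((alphas - 1.0) * np.log(theta)))

def log_joint(theta, z_index, alphas):
    """log P(theta, z | alpha) = log P(theta | alpha) + sum_n log theta[z_n]."""
    return log_dirichlet_pdf(theta, alphas) + float(np.sum(np.log(theta[z_index])))

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
alphas = [5, 5]
theta = rng.dirichlet(alphas)
z_index = rng.choice(len(alphas), size=100, p=theta)  # topic index per word

print(log_joint(theta, z_index, alphas))
```

Working with logs avoids numerical underflow, since a product of $N$ probabilities shrinks quickly as $N$ grows.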
