Code that generates a document:
    * For n_words iterations:
        Randomly select a topic based on the topic probabilities for that doc
        With the selected topic, randomly select a word based on the word probabilities for that topic

In [17]:
np.random.dirichlet(np.ones(30))

array([0.0001648 , 0.05273598, 0.06448435, 0.00685215, 0.00318632,
       0.01910815, 0.0095655 , 0.01695564, 0.05596396, 0.09974797,
       0.00601108, 0.00217723, 0.07407779, 0.02503357, 0.02863812,
       0.02343912, 0.06564616, 0.02484568, 0.01163379, 0.00124848,
       0.04806407, 0.04158466, 0.05770537, 0.03806463, 0.00574891,
       0.09112793, 0.03394238, 0.05648604, 0.02901851, 0.00674166])
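The array above is a single draw from a symmetric Dirichlet with 30 components, which is why it can be used directly as a set of probabilities. A minimal check of that property is sketched below. The seeded `default_rng` generator and the name `sample` are choices made here for reproducibility, not something the notebook itself uses.

```python
import numpy as np

# Seeded generator so the draw is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)

# One draw from a Dirichlet with all concentration parameters equal to 1.
sample = rng.dirichlet(np.ones(30))

print(sample.shape)                    # one probability per component
print(np.isclose(sample.sum(), 1.0))   # entries sum to 1
print(bool((sample >= 0).all()))       # entries are non-negative
```

Every draw is a valid probability vector, so it can be passed as the `p` argument of `np.random.choice`.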

In [13]:
import numpy as np

docs = [[0.98, 0.01, 0.01],
        [0.01, 0.98, 0.01],
        [0.01, 0.01, 0.98]]
topics = [[ 0.4,      0.4,   0.01,        0.01,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.01,     0.01,    0.4,         0.4,    0.01,       0.01,
            0.1,     0.04,   0.01,        0.01],
          [0.02,     0.02,   0.01,        0.01,     0.4,        0.4,
           0.02,      0.1,   0.01,        0.01]]
words =  ['cat', 'kitten',  'dog', 'puppy',  'deep', 'learning',
          'fur',  'image',  'GPU', 'asparagus']


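def make_doc(topic_probs=None, n_words=40, verbose=True):
    """Generate one document as a space-separated string of n_words words."""
    if topic_probs is None:
        # No topic mixture given: draw one from a symmetric Dirichlet prior
        topic_probs = np.random.dirichlet(np.ones(len(topics)))
    if verbose:
        print('topic_probs:', topic_probs)
    results = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic
        i_topic = np.random.choice(len(topics), p=topic_probs)
        topic = topics[i_topic]
        word = np.random.choice(words, p=topic)
        results.append(word)
    return ' '.join(results)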

documents = []
for i, doc in enumerate(docs):
    documents.append(make_doc(topic_probs=doc, n_words=10, verbose=False))
    print(documents[i])

cat cat cat cat cat kitten cat deep kitten cat
puppy dog dog dog puppy dog dog puppy dog puppy
learning deep deep deep learning image learning learning learning puppy


We can also draw random document-topic distributions and topic-word distributions from Dirichlet priors, as below. This is one way we could initialize a **Latent Dirichlet Allocation** model before running training iterations.

In [2]:
num_docs = 3
num_topics = 3
doc_params = []
for i in range(num_docs):
    doc_params.append(list(np.random.dirichlet(np.ones(num_topics))))
doc_params = np.array(doc_params)

In [3]:
num_words = 10
topic_params = []
for i in range(num_topics):
    topic_params.append(list(np.random.dirichlet(np.ones(num_words))))
topic_params = np.array(topic_params)

In [4]:
np.sum(topic_params[0])

1.0

In [5]:
topic_params.shape

(3, 10)
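The two loops above can likely be collapsed into single calls, since `np.random.dirichlet` accepts a `size` argument that stacks several draws into rows. The sketch below assumes the same sizes as the cells above; the names `doc_params_alt` and `topic_params_alt` are hypothetical and only there to avoid overwriting the arrays already built.

```python
import numpy as np

num_docs, num_topics, num_words = 3, 3, 10

# Each row is one independent Dirichlet draw.
doc_params_alt = np.random.dirichlet(np.ones(num_topics), size=num_docs)
topic_params_alt = np.random.dirichlet(np.ones(num_words), size=num_topics)

print(doc_params_alt.shape)          # one row per document
print(topic_params_alt.shape)        # one row per topic
print(topic_params_alt.sum(axis=1))  # each row sums to 1 (up to float error)
```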

As our model learns the document-topic and topic-word distribution parameters, we can measure the likelihood of a particular document occurring given those parameters. We're essentially aggregating up from the probability of seeing each word in the document to the probability of the entire document.
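In symbols, this aggregation can be written as below. The notation is an assumption introduced here, not something the notebook defines: $\theta_d$ is the topic distribution of document $d$ (the `doc_params` argument), $\phi_k$ is the word distribution of topic $k$ (a row of `topic_params`), $K$ is the number of topics, and $w_{d,n}$ is the $n$-th of the $N_d$ words in the document.

$$
p(d \mid \theta_d, \phi) = \prod_{n=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k} \, \phi_{k,\, w_{d,n}}
$$

The inner sum is the probability of a single word, averaged over the topics that could have produced it. The outer product multiplies those per-word probabilities together.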

In [6]:
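def get_doc_prob(doc_words, doc_params, topic_params):
    total_prob = 1
    for word in doc_words:
        # Probability of this word: sum over topics of P(topic | doc) * P(word | topic)
        word_prob = 0
        word_index = words.index(word)
        # Use the topic_params argument, not the global `topics`, to count topics
        for i in range(len(topic_params)):
            word_prob += (doc_params[i]) * (topic_params[i][word_index])
        # Given the parameters, words are independent draws, so probabilities multiply
        total_prob *= word_prob
    return total_prob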

In [7]:
get_doc_prob(documents[0].split(), docs[0], topics)

5.520390620941468e-06
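The inner loop over topics in `get_doc_prob` is a dot product, so the same quantity can be computed with one matrix-vector multiplication. The sketch below is one way to do that. The function name `get_doc_prob_vectorized` and the variables `doc_mix` and `topic_word` are hypothetical, and the example document is the first one printed earlier.

```python
import numpy as np

words = ['cat', 'kitten', 'dog', 'puppy', 'deep', 'learning',
         'fur', 'image', 'GPU', 'asparagus']
doc_mix = np.array([0.98, 0.01, 0.01])   # topic probabilities for one doc
topic_word = np.array([                  # word probabilities per topic
    [0.4, 0.4, 0.01, 0.01, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
    [0.01, 0.01, 0.4, 0.4, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
    [0.02, 0.02, 0.01, 0.01, 0.4, 0.4, 0.02, 0.1, 0.01, 0.01],
])


def get_doc_prob_vectorized(doc_words, doc_mix, topic_word):
    # Probability of every vocabulary word under this doc's topic mixture.
    word_marginals = np.asarray(doc_mix) @ np.asarray(topic_word)
    # Look up the probability of each word that actually occurs, then multiply.
    idx = [words.index(w) for w in doc_words]
    return float(np.prod(word_marginals[idx]))


doc = 'cat cat cat cat cat kitten cat deep kitten cat'.split()
print(get_doc_prob_vectorized(doc, doc_mix, topic_word))
```

For a fixed document and parameters this should agree with the loop version up to floating-point rounding.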

From there, we can aggregate from the probability of each document up to the probability of the entire collection of documents. This gives us a way to measure how well our LDA model is doing, and a target to optimize when fitting it. We make iterative updates to the parameters (in practice with a fairly involved procedure, such as a variational expectation-maximization algorithm) to gradually improve the likelihood of the document/word outcomes we observe under the selected parameters.

In [8]:
def get_dataset_prob(documents, docs_params, topics_params):
    total_doc_prob = 1
    for i, document in enumerate(documents):
        # Row i of docs_params holds the topic probabilities for document i
        doc_prob = get_doc_prob(document.split(), docs_params[i], topics_params)
        total_doc_prob *= doc_prob
        print(doc_prob)  # per-document probability, shown for inspection
    return total_doc_prob

In [9]:
get_dataset_prob(documents, docs, topics)

5.520390620941468e-06
3.552880107928436e-08
8.611442246976467e-05


1.688986798789035e-17

In [10]:
get_dataset_prob(documents, doc_params, topic_params)

1.7981358807381574e-10
1.0249552632938032e-12
1.510377909215812e-11


2.783639830994419e-33
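These products shrink quickly: three ten-word documents already land around 1e-17 to 1e-33. With realistic document lengths the raw product underflows to zero in floating point, so likelihoods are usually compared in log space, where products become sums. The sketch below illustrates the idea. The function names `get_doc_log_prob` and `get_dataset_log_prob` are hypothetical, and the parameters and documents are copied from the cells above.

```python
import math
import numpy as np

words = ['cat', 'kitten', 'dog', 'puppy', 'deep', 'learning',
         'fur', 'image', 'GPU', 'asparagus']
docs = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
topics = [[0.4, 0.4, 0.01, 0.01, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
          [0.01, 0.01, 0.4, 0.4, 0.01, 0.01, 0.1, 0.04, 0.01, 0.01],
          [0.02, 0.02, 0.01, 0.01, 0.4, 0.4, 0.02, 0.1, 0.01, 0.01]]
documents = ['cat cat cat cat cat kitten cat deep kitten cat',
             'puppy dog dog dog puppy dog dog puppy dog puppy',
             'learning deep deep deep learning image learning learning learning puppy']


def get_doc_log_prob(doc_words, doc_params, topic_params):
    # Per-word probabilities under this doc's topic mixture.
    word_marginals = np.asarray(doc_params) @ np.asarray(topic_params)
    idx = [words.index(w) for w in doc_words]
    # Sum of logs instead of product of probabilities.
    return float(np.sum(np.log(word_marginals[idx])))


def get_dataset_log_prob(documents, docs_params, topics_params):
    return sum(get_doc_log_prob(d.split(), docs_params[i], topics_params)
               for i, d in enumerate(documents))


print(get_dataset_log_prob(documents, docs, topics))

# A 2000-word document: the raw product underflows, the log stays finite.
long_doc = ['cat'] * 2000
p_cat = (np.asarray(docs[0]) @ np.asarray(topics))[0]
print(np.prod(np.full(len(long_doc), p_cat)))        # 0.0 due to underflow
print(get_doc_log_prob(long_doc, docs[0], topics))   # large negative, finite
```

Maximizing the log-likelihood selects the same parameters as maximizing the likelihood, because the logarithm is monotonic.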

In [11]:
doc_params

array([[0.23676883, 0.2719823 , 0.49124887],
       [0.8164061 , 0.16713261, 0.01646129],
       [0.28597635, 0.12655035, 0.5874733 ]])