# 94-775/95-865: Topic Modeling with Latent Dirichlet Allocation

Author: George H. Chen (georgechen [at symbol] cmu.edu)

The beginning part of this demo is a shortened and modified version of sklearn's LDA & NMF demo (http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html).

We'll use NumPy.

In [None]:
import numpy as np
np.set_printoptions(precision=5, suppress=True)

## Latent Dirichlet Allocation

We first load in 10,000 posts from the 20 Newsgroups dataset.

In [None]:
from sklearn.datasets import fetch_20newsgroups
num_articles = 10000
data = fetch_20newsgroups(shuffle=True, random_state=0,
                          remove=('headers', 'footers', 'quotes')).data[:num_articles]

We can verify that there are 10,000 posts, and we can look at an example post.

In [None]:
len(data)

In [None]:
# you can take a look at what individual documents look like by replacing what index we look at
print(data[2])

We now fit a `CountVectorizer` model that computes, for each post, its raw word-count histogram (the "term frequencies" we saw in week 1).

The output of the following cell is the term-frequencies matrix, where rows index different posts/text documents, and columns index 1000 different vocabulary words. A note about the arguments to `CountVectorizer`:

- `max_df`: we only keep words that appear in at most this fraction of the documents
- `min_df`: we only keep words that appear in at least this many documents
- `stop_words`: which stop words to remove (here, scikit-learn's built-in English list)
- `max_features`: among words that don't get removed due to the above three arguments, we keep the `max_features` most frequently occurring words
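
As a small illustration of how these arguments interact, the sketch below runs `CountVectorizer` on a hypothetical four-sentence corpus. The sentences and the names `toy_docs`, `toy_vectorizer`, and `loose_vectorizer` are made up for this example and are not part of the demo.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from the 20 Newsgroups data)
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

# Same filtering arguments as in the demo: drop English stop words,
# keep only words that appear in at least 2 documents
toy_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_docs)
kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
print(toy_counts.toarray())

# Without stop-word removal, max_df alone can filter very common words:
# "the" appears in 3 of the 4 documents (75%), which exceeds max_df=0.7
loose_vectorizer = CountVectorizer(max_df=0.7, min_df=1)
loose_vectorizer.fit(toy_docs)
print('the' in loose_vectorizer.vocabulary_)
```

Words that appear in only one toy document are dropped by `min_df=2`, so the last document ends up with an all-zero row.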

In [None]:
vocab_size = 1000
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer does tokenization and can remove terms that occur too frequently, not frequently enough, or that are stop words

# document frequency (df) means number of documents a word appears in
tf_vectorizer = CountVectorizer(max_df=0.95,
                                min_df=2,
                                stop_words='english',
                                max_features=vocab_size)
tf = tf_vectorizer.fit_transform(data)

We can verify that there are 10,000 rows (corresponding to posts), and 1000 columns (corresponding to words).

In [None]:
tf.shape

A note about the `tf` matrix: it is actually stored as a sparse matrix (rather than the 2D NumPy array you're more familiar with). These matrices are often very large and the vast majority of their entries are 0, so we can save space by storing only the nonzero entries and their positions.

In [None]:
type(tf)
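
To make the idea of sparse storage concrete, here is a minimal sketch using SciPy's `csr_matrix` on a hypothetical 3-by-4 array (the array `dense` is made up for illustration; `CountVectorizer` typically returns this compressed sparse row format).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small matrix that is mostly zeros
dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 1],
                  [3, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.nnz)      # number of explicitly stored entries
print(sparse.data)     # the stored nonzero values, row by row
print(sparse.indices)  # the column index of each stored value

# Converting back recovers the original array
recovered = sparse.toarray()
```

Only 3 of the 12 entries are stored here; for a 10,000-by-1000 term-frequency matrix the savings are much larger.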

To convert `tf` to a 2D NumPy array, you can run `tf.toarray()` (this returns a new array and does not modify the original `tf` variable).

In [None]:
type(tf.toarray())

In [None]:
tf.toarray().shape

We can figure out what words the different columns correspond to by using the `get_feature_names_out()` function (called `get_feature_names()` in scikit-learn versions before 1.0, and removed in 1.2); the output is in the same order as the column indices. In particular, given a column index, we can index into the following list to figure out which word it corresponds to.

In [None]:
print(tf_vectorizer.get_feature_names_out().tolist())

We can also go in reverse: given a word, we can figure out which column index it corresponds to. To do this, we use the `vocabulary_` attribute.

In [None]:
tf_vectorizer.vocabulary_['bus']

We can figure out what the raw counts are for the 0-th post as follows.

In [None]:
tf[0].toarray()

We now fit an LDA model to the data.

In [None]:
num_topics = 10

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=num_topics, learning_method='online', random_state=0)
lda.fit(tf)

The fitting procedure determines every topic's distribution over words; this information is stored in the `components_` attribute. There's a catch: we have to normalize each row to get probability distributions (without this normalization, what the model stores are pseudocounts for how often different words appear per topic).

In [None]:
lda.components_.shape

In [None]:
lda.components_.sum(axis=1)

In [None]:
topic_word_distributions = np.array([row / row.sum() for row in lda.components_])

We can verify that each topic's word distribution sums to 1.

In [None]:
topic_word_distributions.sum(axis=1)
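
The same row normalization can also be written without a Python loop by using NumPy broadcasting. The sketch below uses a hypothetical `pseudocounts` array standing in for `lda.components_`.

```python
import numpy as np

# Hypothetical pseudocounts for 2 topics over a 4-word vocabulary
pseudocounts = np.array([[2.0, 1.0, 1.0, 0.0],
                         [0.5, 0.5, 3.0, 1.0]])

# keepdims=True keeps the sums as a column, so each row is divided by its own total
row_sums = pseudocounts.sum(axis=1, keepdims=True)
distributions = pseudocounts / row_sums

print(distributions)
print(distributions.sum(axis=1))
```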

We can also print out what the probabilities for the different words are for a specific topic. This isn't very easy to interpret.

In [None]:
print(topic_word_distributions[0])

Instead, people usually look at the most probable words per topic and try to use these words to interpret what the different topics correspond to.

In [None]:
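num_top_words = 20

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2;
# we look up the vocabulary once rather than on every loop iteration
feature_names = tf_vectorizer.get_feature_names_out()

print('Displaying the top %d words per topic and their probabilities within the topic...' % num_top_words)
print()

for topic_idx in range(num_topics):
    print('[Topic ', topic_idx, ']', sep='')
    # sort word indices from most to least probable within this topic
    sort_indices = np.argsort(-topic_word_distributions[topic_idx])
    for rank in range(num_top_words):
        word_idx = sort_indices[rank]
        print(feature_names[word_idx], ':', topic_word_distributions[topic_idx, word_idx])
    print()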
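
The key step above is sorting word indices by decreasing probability with `np.argsort`. Here is that step in isolation on a hypothetical five-word vocabulary (the words and probabilities are made up for illustration).

```python
import numpy as np

# Hypothetical vocabulary and one topic's word distribution
toy_vocab = np.array(['ball', 'game', 'team', 'vote', 'law'])
toy_topic = np.array([0.10, 0.30, 0.40, 0.15, 0.05])

top_k = 3
# argsort sorts in increasing order, so negating the values gives decreasing order
top_indices = np.argsort(-toy_topic)[:top_k]
top_words = toy_vocab[top_indices]

for word, prob in zip(top_words, toy_topic[top_indices]):
    print(word, ':', prob)
```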

We can use the `transform()` function to figure out, for each document, what fraction of it is explained by each of the topics.

In [None]:
doc_topic_matrix = lda.transform(tf)

In [None]:
doc_topic_matrix.shape

In [None]:
doc_topic_matrix[0]
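
Each row of the document-topic matrix is a distribution over topics, so one simple way to summarize it is to pick each document's most heavily weighted topic. The sketch below does this on a hypothetical 3-document, 3-topic matrix (`toy_doc_topic` is made up and stands in for `doc_topic_matrix`).

```python
import numpy as np

# Hypothetical document-topic matrix: rows are documents, columns are topics
toy_doc_topic = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.1, 0.8],
                          [0.3, 0.4, 0.3]])

# Each row should sum to 1 since it is a distribution over topics
row_sums = toy_doc_topic.sum(axis=1)

# Index of the highest-weight topic for each document
dominant_topic = toy_doc_topic.argmax(axis=1)

# How many documents have each topic as their dominant one
docs_per_topic = np.bincount(dominant_topic, minlength=toy_doc_topic.shape[1])
print(dominant_topic, docs_per_topic)
```

Picking a single dominant topic discards information for documents that mix several topics (like the third row here), so it is best treated as a rough summary.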

## Computing co-occurrences of words

Here, we count the number of newsgroup posts in which two words both occur. This part of the demo should feel like a review of co-occurrence analysis from earlier in the course, except now we use scikit-learn's built-in `CountVectorizer`. Conceptually everything else is the same as before.

In [None]:
word1 = 'year'
word2 = 'team'

word1_column_idx = tf_vectorizer.vocabulary_[word1]
word2_column_idx = tf_vectorizer.vocabulary_[word2]

In [None]:
tf.toarray()

In [None]:
tf[:, word1_column_idx].toarray()

In [None]:
documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)

In [None]:
documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)

In [None]:
documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2

Next, we compute the log of the conditional probability of word 1 appearing given that word 2 appeared. We add a small fudge factor `eps` to the numerator; in this case it's not actually needed, but sometimes two words never co-occur, and without the fudge factor we would run into a numerical issue from taking the log of 0.

In [None]:
eps = 0.1
np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())
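
To see why the fudge factor matters, here is a small sketch on hypothetical indicator arrays for 5 documents (the arrays `has_word1`, `has_word2`, and `has_word3` are made up for illustration).

```python
import numpy as np

# Hypothetical indicators: does each of 5 documents contain the word?
has_word1 = np.array([True, True, False, False, True])
has_word2 = np.array([True, False, False, True, True])
has_word3 = np.array([False, True, False, False, False])  # never co-occurs with word 2

eps = 0.1

# Words 1 and 2 co-occur in 2 documents; word 2 appears in 3 documents
both_1_and_2 = has_word1 & has_word2
log_prob_1_given_2 = np.log2((both_1_and_2.sum() + eps) / has_word2.sum())

# Words 3 and 2 never co-occur; eps keeps the log finite instead of -inf
both_3_and_2 = has_word3 & has_word2
log_prob_3_given_2 = np.log2((both_3_and_2.sum() + eps) / has_word2.sum())

print(log_prob_1_given_2, log_prob_3_given_2)
```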

In [None]:
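def prob_see_word1_given_see_word2(word1, word2, vectorizer, eps=0.1):
    # Despite the name, this returns the log (base 2) of the smoothed conditional probability
    # of seeing word1 in a document given that word2 is in the document.
    # Note: this reads the global term-frequency matrix `tf`, which must have been produced by `vectorizer`.
    word1_column_idx = vectorizer.vocabulary_[word1]
    word2_column_idx = vectorizer.vocabulary_[word2]
    documents_with_word1 = (tf[:, word1_column_idx].toarray().flatten() > 0)
    documents_with_word2 = (tf[:, word2_column_idx].toarray().flatten() > 0)
    # elementwise product of boolean arrays acts as a logical AND
    documents_with_both_word1_and_word2 = documents_with_word1 * documents_with_word2
    return np.log2((documents_with_both_word1_and_word2.sum() + eps) / documents_with_word2.sum())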
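
If many word pairs are needed, one alternative (not what the demo does) is to compute all co-occurrence counts at once with a matrix product. The sketch below uses a hypothetical dense 4-document, 3-word matrix `toy_tf`; the same idea should carry over to the sparse `tf`, though that is an assumption rather than something tested here.

```python
import numpy as np

# Hypothetical term-frequency matrix: 4 documents x 3 words
toy_tf = np.array([[2, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 3]])

# 1 if the word appears in the document at all, 0 otherwise
presence = (toy_tf > 0).astype(int)

# Entry (i, j) counts the documents that contain both word i and word j
cooccurrence = presence.T @ presence

# The diagonal holds each word's document frequency
doc_freq = np.diag(cooccurrence)

# Entry (i, j) is the smoothed log2 probability of word i given word j
eps = 0.1
log_cond = np.log2((cooccurrence + eps) / doc_freq[np.newaxis, :])
print(cooccurrence)
```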

## Topic coherence

The code below shows how to implement the topic coherence calculation from lecture.

In [None]:
average_coherence = 0
# look up the vocabulary once (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx in range(num_topics):
    print('[Topic ', topic_idx, ']', sep='')
    sort_indices = np.argsort(topic_word_distributions[topic_idx])[::-1]
    coherence = 0.
    for top_word_idx1 in sort_indices[:num_top_words]:
        word1 = feature_names[top_word_idx1]
        for top_word_idx2 in sort_indices[:num_top_words]:
            word2 = feature_names[top_word_idx2]
            if top_word_idx1 != top_word_idx2:
                # add log P(word1 | word2) for every ordered pair of distinct top words
                coherence += prob_see_word1_given_see_word2(word1, word2, tf_vectorizer, 0.1)
    print('Coherence:', coherence)
    print()
    average_coherence += coherence
average_coherence /= num_topics
print('Average coherence:', average_coherence)
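
Written as a formula, the quantity that the loop above accumulates for topic $k$ is the following, where the notation ($T_k$, $D(\cdot)$, $\epsilon$) is introduced here only to describe the code and may differ from the notation used in lecture:

$$\text{coherence}(k) = \sum_{v \in T_k} \; \sum_{\substack{w \in T_k \\ w \neq v}} \log_2 \frac{D(v, w) + \epsilon}{D(w)}$$

Here $T_k$ is the set of the `num_top_words` most probable words for topic $k$, $D(v, w)$ is the number of documents containing both $v$ and $w$, $D(w)$ is the number of documents containing $w$, and $\epsilon = 0.1$ is the fudge factor. Each term is at most slightly above 0, so coherence values are typically negative, with values closer to 0 indicating top words that co-occur more often.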

## Number of unique words

The code below shows how to implement the number-of-unique-words calculation from lecture.

In [None]:
average_number_of_unique_top_words = 0
# look up the vocabulary once (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx1 in range(num_topics):
    print('[Topic ', topic_idx1, ']', sep='')
    sort_indices1 = np.argsort(topic_word_distributions[topic_idx1])[::-1]
    num_unique_top_words = 0
    for top_word_idx1 in sort_indices1[:num_top_words]:
        word1 = feature_names[top_word_idx1]
        break_ = False
        for topic_idx2 in range(num_topics):
            if topic_idx1 != topic_idx2:
                sort_indices2 = np.argsort(topic_word_distributions[topic_idx2])[::-1]
                for top_word_idx2 in sort_indices2[:num_top_words]:
                    word2 = feature_names[top_word_idx2]
                    if word1 == word2:
                        break_ = True
                        break
                if break_:
                    break
        else:
            # for/else: this runs only if the loop over other topics never hit `break`,
            # i.e., no other topic has word1 among its top words
            num_unique_top_words += 1
    print('Number of unique top words:', num_unique_top_words)
    print()

    average_number_of_unique_top_words += num_unique_top_words
average_number_of_unique_top_words /= num_topics
print('Average number of unique top words:', average_number_of_unique_top_words)
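
The nested loops above can be hard to follow, so here is a shorter sketch of the same idea using `collections.Counter`: count how many topics each top word appears in, then keep the words that appear in exactly one. The lists in `toy_top_words` are hypothetical and stand in for the per-topic top words.

```python
from collections import Counter

# Hypothetical top-word lists for 3 topics
toy_top_words = [
    ['game', 'team', 'year'],
    ['god', 'people', 'year'],
    ['windows', 'file', 'team'],
]

# For each word, count the number of topics whose top words include it
word_topic_counts = Counter(word for words in toy_top_words for word in set(words))

# A top word is unique to its topic if no other topic lists it
num_unique = [sum(1 for word in words if word_topic_counts[word] == 1)
              for words in toy_top_words]
average_unique = sum(num_unique) / len(num_unique)

print(num_unique, average_unique)
```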