## Topic Modeling ##

*Lauren Klein wrote this lesson in 2019 drawing on writing by [Ted Underwood](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) and [Matthew Jockers](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), this [video of a talk by David Mimno](https://vimeo.com/53080123), and [this notebook](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html) by Radim Rehurek. It was supplemented in 2020 with additional materials from Dan Sinykin, and revised again in 2021 by Lauren Klein.* 

## What is Topic Modeling? ##

In both the Li and Bamman paper, and the Antoniak et al. paper, we've seen how topic modeling plays a major role. What is topic modeling? At its most basic level, topic modeling is an automated method for extracting the themes, or "topics," from large sets of documents--like GPT-3 generated fiction, or birth stories, or as we'll explore today, articles in the Emory Wheel.

There are numerous kinds of topic models, but the most popular and widely-used kind is latent Dirichlet allocation (LDA). It's so popular, in fact, that "LDA" and "topic model" are sometimes used interchangeably, even though LDA is only one type.

LDA math is pretty complicated. We're not going to get very deep into the math just yet (or maybe not ever, depending on the time). But first we are going to introduce two important concepts that will help us conceptually understand how LDA topic models work.

### 1) LDA is an Unsupervised Algorithm 
Topic modeling is a kind of machine learning. Machine learning always sounds complicated, but it really just means that computer algorithms are performing tasks without being explicitly programmed to do so and that they are "learning" how to perform these tasks by being fed training data. In the field of machine learning, algorithms are typically split into two broad categories: supervised and unsupervised. These categories describe how the algorithms are "trained" or how they "learn." LDA is an unsupervised algorithm.

If an algorithm is supervised, that means a researcher is helping to guide it with some kind of information, like labels. For example, if you wanted to create an algorithm that could identify pictures of cats vs pictures of dogs, you could train it with a bunch of pictures of cats that were clearly labeled CAT and a bunch of pictures of dogs that were clearly labeled DOG. The algorithm would then be able to learn which features are specific to cats vs dogs because you explicitly told it: this is a picture of a cat; this is a picture of a dog.

If an algorithm is unsupervised, that means a researcher does not train it with outside information. There are no labels. The algorithm just learns that pictures of cats are more similar to each other and pictures of dogs are more similar to each other. The algorithm doesn't really know that one cluster is cats and one cluster is dogs; it just knows that there are two distinct clusters.

Because LDA is an unsupervised algorithm, we don't tell our topic model which words or topics to look for. We only tell the topic model how many topics (or clusters of words) we want returned. The topic model doesn't know anything about Frida Kahlo, Nella Larsen, and Jackie Robinson. It doesn't know anything about art, literature, and sports.

### 2) LDA is a Probabilistic Model 
LDA fundamentally relies on statistics and probabilities. Rather than calculating precise and unchanging metrics about a given corpus, a topic model makes a series of very sophisticated guesses about the corpus. These guesses will change slightly every time we run the topic model. This is important to remember as we analyze, interpret, and make arguments based on our results. All of our results in this lesson will be probabilities, and they'll change slightly every time we re-run the topic model.

When we tell the topic model that we want to extract 15 topics from the Emory Wheel, here's what the topic model does:

The topic model starts off with a slightly silly, backwards assumption. The topic model assumes that every single one of the 4000-some-odd articles in the corpus was written by someone who exclusively drew their words from 15 mystery topics, or 15 clusters of words. To spin it in a slightly different way with a different medium, the topic model assumes that there was one master artist with 15 different paints on her palette, who created all the articles by dipping her brush into these 15 paints alone, applying and blending them onto each canvas in different proportions. The topic model is trying to discover the 15 mystery topics that created all the Wheel articles, as well as the mixture of these topics that makes up each individual article.

The topic model begins by taking a completely wild guess about the 15 topics, but then it iterates through all the words in all the articles and makes better and better guesses. If the word "student" keeps showing up with the words "stress" and "exam," and if all three words keep showing up in the same kinds of articles, then the topic model starts to suspect that these three words should belong to the same topic. If the word "film" keeps showing up with "Atlanta" and "industry," then the topic model suspects that they should belong to the same topic, too. The topic model finally arrives at its best guesses for the 15 topics that most likely created all the Emory Wheel articles.


## LDA explained again in more abstract terms

Probabilistic topic models begin with an assumption and a definition. 

The assumption: all documents contain a mixture of different topics.

The definition: a topic is a collection of words, each with a different probability of occurrence in a particular document (or other chunk of text) discussing that topic. 




Here's a nice illustration, created by Ted Underwood, that shows this assumed relationship between topics and documents. 

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

Above we see an example of the basic assumption of topic modeling: one topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.” 

The three documents are assumed to contain both topics in different proportions.
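
To make that assumption concrete, here is a minimal sketch of the "backwards" story the model tells itself about how a document gets written. Everything in it is invented for illustration: the two topics, their word probabilities, the 25/75 mixture, and the helper name `generate_document` are assumptions, not anything gensim provides.

```python
import random

# Two made-up topics: each maps words to a probability (invented numbers)
topics = {
    "leadership": {"organize": 0.3, "committee": 0.3, "direct": 0.2, "lead": 0.2},
    "metals": {"mercury": 0.45, "arsenic": 0.45, "lead": 0.1},
}

def generate_document(topic_mixture, length, rng):
    """Draw each word by first picking a topic, then picking a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the toy output is repeatable
doc = generate_document({"leadership": 0.25, "metals": 0.75}, length=12, rng=rng)
print(doc)
```

Note that "lead" can come out of either topic, which is exactly the ambiguity the model has to resolve when it works backward from real documents.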

But here is the thing: we can’t directly observe topics. All we actually have are the documents that attest to their existence. So in other words:

**Topic modeling is a way of extrapolating backward from a collection of documents to infer the topics that could have generated them.** 

There is simply no way to infer the exact topics in a set of documents; there are too many unknowns. So (probabilistic) topic modeling works backwards. It pretends that the problem is mostly solved. 

**How does this play out in actual life?**

Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic 1 or topic 2?

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

We can’t know for sure. But one way to guess is to consider two questions. This is the first: 

* How often does “lead” appear in topic 1 elsewhere? If “lead” often occurs in discussions of 1, then this instance of “lead” might belong to 1 as well. 

But a word can be common in more than one topic, as it is in topics 1 and 2 above. And we don’t want to assign “lead” to a topic about leadership (topic 1) if this document is mostly about heavy metal contamination (topic 2). So we also need to consider a second question:

* How common is topic 1 in the rest of the document?

To answer these questions, here’s what we’ll do:

For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. Here’s the actual formula:

![LDA formula](https://tedunderwood.files.wordpress.com/2012/04/ldaformula.png)

There are also a few Greek letters scattered in there, but they aren’t important for our purposes. Technically, they’re called “hyperparameters,” but you can think of them simply as fudge factors. 

In other words: there’s some chance that this word belongs to topic Z even if it is nowhere else associated with Z; the fudge factors keep that possibility open. (If you want to understand hyperparameters beyond the "fudge factor" explanation, see "[Rethinking LDA: Why Priors Matter](http://people.cs.umass.edu/~mimno/publications.html).")

The overall emphasis on probability in this technique, of course, is why it’s called *probabilistic topic modeling*.
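
Here is a rough sketch of that multiplication in Python, using the word "lead" and two topics. All of the counts are invented, and the function name `topic_score`, the values chosen for the fudge factors (`alpha`, `beta_w`), and the vocabulary size are assumptions for illustration only.

```python
def topic_score(word_in_topic, tokens_in_topic, doc_words_in_topic,
                alpha=0.1, beta_w=0.01, vocab_size=1000):
    """Unnormalized score that one occurrence of a word belongs to topic Z."""
    # how common is this word type in topic Z (plus a small fudge factor)?
    word_given_topic = (word_in_topic + beta_w) / (tokens_in_topic + beta_w * vocab_size)
    # how common is topic Z in the rest of this document (plus a small fudge factor)?
    topic_in_doc = doc_words_in_topic + alpha
    return word_given_topic * topic_in_doc

# Invented counts: "lead" is more frequent in topic 1 overall,
# but this document is mostly made up of topic 2 words
score_1 = topic_score(word_in_topic=50, tokens_in_topic=1000, doc_words_in_topic=2)
score_2 = topic_score(word_in_topic=10, tokens_in_topic=1000, doc_words_in_topic=40)

# Normalize the two scores so they read as probabilities
total = score_1 + score_2
print("P(topic 1):", round(score_1 / total, 3))
print("P(topic 2):", round(score_2 / total, 3))
```

With these made-up counts, the document-level evidence outweighs the word-level evidence, so topic 2 gets the higher probability.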

### Enter Sampling ###

Now, suppose that instead of having the problem mostly solved, we had only a wild guess about which word belonged to which topic. We could still use the strategy I've just described to improve our guess, by making it more internally consistent. 

We could go through the collection, word by word, and reassign each word to a topic, guided by the formula above. 

And in fact, that's what LDA actually does.

And as we do that, two things happen:

1) Words will gradually become more common in topics where they are already common. And also,

2) Topics will become more common in documents where they are already common. 

Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.
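
The loop described above can be sketched in plain Python on a tiny invented corpus. This is a toy version of the sampling procedure from the text (often called collapsed Gibbs sampling), not a description of gensim's internals; the three documents, the two-topic setting, and the `ALPHA` and `BETA` values are all assumptions.

```python
import random

docs = [
    ["lead", "committee", "organize", "direct", "lead"],
    ["mercury", "arsenic", "lead", "mercury"],
    ["committee", "organize", "mercury", "arsenic", "lead"],
]
NUM_TOPICS, ALPHA, BETA = 2, 0.1, 0.01
vocab = sorted({w for doc in docs for w in doc})
rng = random.Random(42)  # fixed seed so reruns give the same result

# Start with a wild guess: assign every word occurrence to a random topic
assignments = [[rng.randrange(NUM_TOPICS) for _ in doc] for doc in docs]

# Count tables that summarize the current guess
topic_word = [{w: 0 for w in vocab} for _ in range(NUM_TOPICS)]
topic_total = [0] * NUM_TOPICS
doc_topic = [[0] * NUM_TOPICS for _ in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        z = assignments[d][i]
        topic_word[z][w] += 1
        topic_total[z] += 1
        doc_topic[d][z] += 1

for sweep in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Take this word's current assignment out of the counts
            z = assignments[d][i]
            topic_word[z][w] -= 1
            topic_total[z] -= 1
            doc_topic[d][z] -= 1
            # Score every topic: (word frequency in topic) * (topic frequency in doc)
            weights = [
                (topic_word[k][w] + BETA) / (topic_total[k] + BETA * len(vocab))
                * (doc_topic[d][k] + ALPHA)
                for k in range(NUM_TOPICS)
            ]
            # Reassign the word by sampling a topic in proportion to those scores
            z = rng.choices(range(NUM_TOPICS), weights=weights)[0]
            assignments[d][i] = z
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1

for k in range(NUM_TOPICS):
    top = sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)[:3]
    print("Topic", k, top)
```

Changing the seed will change the result, which is the same run-to-run variation we should expect from a real topic model.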

For a slightly more in-depth explanation of how LDA works, see [this video](https://vimeo.com/53080123) (start around 5:35). 

### A brief historical / technical digression... ###

Topic modeling began as a US military project in the 1990s. The goal was to automatically detect changes in newswire text so that governmental and military organizations could be alerted to emerging geopolitical events. (For more on this history, see [Binder](https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18).)


In the early 2000s, a team of computer science researchers released [MALLET](http://mallet.cs.umass.edu/topics.php), short for **MA**chine **L**earning for **L**anguag**E** **T**oolkit. As the name suggests, MALLET is a software toolkit that enables a range of NLP techniques. Today, people mostly only use it for topic modeling, which it remains very very good at.

With that said, MALLET is written in Java, which means that it's not ideal for working in Python and Jupyter notebooks. None other than Maria Antoniak has written a convenient Python package that allows you to use MALLET in a Jupyter notebook. Her package is called [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), and I'm working on getting it set up for our JupyterHub.

Until then, we'll be using [gensim](https://radimrehurek.com/gensim/about.html), a native Python library for topic modeling that was created in the early 2010s by a computer science PhD student, Radim Rehurek. Its ease of use has made the use of topic models explode--although, I should note, most people end up returning to MALLET for research-level code. 

## Let's go! ##

In [None]:
# import and setup modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging 
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

# a helpful function that returns the first `n` elements of the stream as plain list.
# we'll use this later
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [None]:
# import some more modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

### Tokenizing ###

As previously discussed, many NLP tasks require that you first tokenize your corpus. We've used tokenizers built into both NLTK and spaCy already for this course.  

Here, however, we're going to write our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together. 

In [None]:
# here's some nice dense python for you:
# this defines our tokenize function for future use
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

### Further pre-processing our corpus ###

This is the other necessary step before running a topic model. You need to write a function that iterates through your corpus and returns each document in the format (title, tokens). 

**Side note that we've not yet discussed tuples.** Tuples exist in many programming languages, including R. For our purposes, just know that tuples are sequences of objects--  just like lists-- but they cannot be changed. In Python, you indicate a tuple with parentheses. 
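
Here's a quick illustration of how a tuple behaves. The filename and tokens below are invented placeholders, not files from our corpus.

```python
# a (title, tokens) tuple, written with parentheses
doc = ("article-001.txt", ["emory", "wheel", "student"])

# tuples can be unpacked into separate names
title, tokens = doc
print(title, len(tokens))

# tuples cannot be changed in place; trying to do so raises a TypeError
try:
    doc[0] = "renamed.txt"
except TypeError as err:
    print("TypeError:", err)
```

This unpacking pattern (`title, tokens = doc`) is the same one we'll use when looping over the documents in the corpus.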

In any case, the gensim documentation tells us that we want to define a pre-processing function like this:

In [None]:
# A function to yield each doc in a base directory as a `(filename, tokens)` tuple.

def iter_docs(base_dir):
    # list every file in the directory, skipping hidden files such as .DS_Store
    for doc in os.listdir(base_dir):
        if not doc.startswith('.'):
            # os.path.join works whether or not base_dir ends with a slash
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()

            tokens = tokenize(text)
            yield doc, tokens

In [None]:
# set up the stream for later processing 
stream = iter_docs('../corpora/emory-wheel/articles/')

# while we're at it, take a look at what this looks like for the first five docs
for doc, tokens in itertools.islice(stream, 5):
    print(doc, tokens[:10])  # print the doc title and its first ten tokens

The next step is to create a Dictionary (not to be confused with a Python dictionary) which maps each word to a numerical ID. 

This mapping step is required because most algorithms, including gensim's implementation of LDA, rely on numerical libraries that work with vectors indexed by integers, not by strings. Also, many functions need to know the vector/matrix dimensionality in advance.
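
To see what this kind of mapping looks like, here is a hand-rolled sketch in plain Python. The two tiny documents and the helper name `to_bow` are made up for illustration, and the order in which gensim hands out ID numbers may differ from this version.

```python
docs = [
    ["student", "exam", "stress", "student"],
    ["film", "atlanta", "industry", "student"],
]

# Assign each new word the next unused integer ID
token2id = {}
for tokens in docs:
    for token in tokens:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(tokens):
    """Count word IDs in one document, returning sorted (word_id, count) pairs."""
    counts = {}
    for token in tokens:
        word_id = token2id[token]
        counts[word_id] = counts.get(word_id, 0) + 1
    return sorted(counts.items())

print(token2id)
print(to_bow(docs[0]))
```

Each document becomes a short list of (word ID, count) pairs, which is the integer-indexed form that numerical libraries expect.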

The mapping can be constructed automatically by giving gensim's Dictionary class a stream of tokenized documents, like so:

In [None]:
# set up the stream 
# this is the one line you'd change here with another corpus and/or corpus location
stream = iter_docs('../corpora/emory-wheel/articles/')

# all of the rest is standard from the gensim documentation
doc_stream = (tokens for _, tokens in stream)
              
id2word_wheel = gensim.corpora.Dictionary(doc_stream) 

print(id2word_wheel)


The Dictionary (id2word_wheel) now contains all words that appeared in the corpus, along with how many times they appeared. 

gensim's Dictionary exposes a handy attribute for mapping tokens to their ID numbers, not unlike the scikit-learn vectorizer:

In [None]:
id2word_wheel.token2id

There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter the words. 

gensim also provides functions for this:

In [None]:
# this line below would, for example, filter out 50 most frequent words
# it's commented out here because I don't want to use it in this case
# but it's handy to know about 
# id2word_wheel.filter_n_most_frequent(50)

# this line filters out words that appear in only 1 doc, keeping the rest
# I will use this one 
# note how no_below and no_above take different values
id2word_wheel.filter_extremes(no_below=2, no_above=1.0)

id2word_wheel

Note that by removing the words that only appeared in a single document, we went from 118,786 unique words (or tokens) to 43,169. That's not a huge number for a topic model, but we'll see how it goes... 
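
If it helps to see the idea behind this kind of filtering, here is a small sketch in plain Python. It only mimics the two thresholds used above (a minimum number of documents and a maximum fraction of documents); the three toy documents and the helper name `keep_word` are assumptions, and this is not gensim's own code.

```python
docs = [
    ["student", "exam", "stress"],
    ["student", "film", "atlanta"],
    ["film", "student", "industry"],
]

# Document frequency: the number of documents each word appears in
doc_freq = {}
for tokens in docs:
    for token in set(tokens):
        doc_freq[token] = doc_freq.get(token, 0) + 1

def keep_word(token, no_below=2, no_above=1.0):
    """Keep words found in at least `no_below` docs and at most `no_above` (a fraction) of docs."""
    df = doc_freq[token]
    return df >= no_below and df <= no_above * len(docs)

kept = sorted(token for token in doc_freq if keep_word(token))
print(kept)
```

Notice that `no_below` is a raw count of documents while `no_above` is a proportion, which is why the two take such different values.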

Since a streamed corpus and a dictionary are all we need to create the vectors for our topic model, we can get started. 

In [None]:
# a class we need; this is the same for every topic model you create with gensim. 
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
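    def __len__(self):
        # clip_docs is None when the whole corpus is streamed, so count the files instead
        if self.clip_docs is not None:
            return self.clip_docs
        return sum(1 for doc in os.listdir(self.dump_file) if not doc.startswith('.'))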

In [None]:
# create a stream of bag-of-words vectors
wheel_corpus = Corpus('../corpora/emory-wheel/articles/', id2word_wheel)

# print the first vector in the stream to see what it looks like; 
# this is in the format (word_id, count in first doc)

vector = next(iter(wheel_corpus))

vector  

In [None]:
# now we're ready to run our topic model!

%time lda_model = gensim.models.LdaModel(wheel_corpus, num_topics=15, id2word=id2word_wheel, passes=5) 

# note that passes should be higher -- usually in the 50-100 range -- 
# but in the interests of time we'll only do 5 


In [None]:
# some additional helpful functions built into gensim

# how to store corpus to disk
from gensim.corpora import MmCorpus
MmCorpus.serialize('./wheel.corpus.mm', wheel_corpus)

# how to store dictionary to disk
id2word_wheel.save('./wheel.dictionary')

# how to store model to disk 
lda_model.save('./lda_wheel-15topics_5iters.model')

You can also load in a saved model. This is very helpful to know about, since generating new topic models takes time. 

Here, we're going to load in a (slightly) better topic model of the Emory Wheel with the same number of topics (15), but 50 iterations.

In [None]:
# load an old model; in this case, a topic model of the Emory Wheel with 50 iterations
lda_model = gensim.models.LdaModel.load('./lda_wheel-15topics_50iters.model')

In [None]:
# gensim comes with a bunch of functions that make interacting with the output of the topic
# model a little easier. this one shows the topics. 

# show the topics, in the format (number of topics to show, number of terms)
# note that all words are in all topics, just some topics consist of very very small
# proportions of that word

# as you can tell already, even the top words in each topic are only a very small proportion
# of that topic, since we are dealing with about 14K unique words

lda_model.show_topics(15, 20)

In [None]:
# let's format the words a little more nicely; 
# the formatted=False parameter returns tuples of (word, probability)

topics = lda_model.show_topics(15, 20, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

## Topics and Labels

Now you can see, perhaps, that what we call a "topic" is really just a list of the most probable words for that topic, which are sorted in descending order of probability. The most probable word for the topic is the first word. 
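
As a tiny illustration of that idea, here is a made-up topic over a five-word vocabulary. The probabilities are invented; a real topic from our model spreads its probability over every word in the dictionary.

```python
# Invented probabilities for a tiny five-word vocabulary
topic = {"team": 0.40, "game": 0.25, "season": 0.20, "album": 0.10, "officer": 0.05}

# Rank words from most to least probable, as in the printed topic lists
ranked = sorted(topic.items(), key=lambda pair: pair[1], reverse=True)
top_words = [word for word, prob in ranked[:3]]
print(top_words)

# A topic is a probability distribution, so its probabilities add up to 1
print(round(sum(topic.values()), 6))
```

The printed "topic" is only the head of this ranking; the low-probability words are still part of the topic, just far down the list.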

Topic models start to get more powerful when we, as human researchers, analyze the most probable words for every topic and summarize what these words have in common. This summary can then be used as a descriptive label for the topic. Remember, since an LDA topic model is an unsupervised algorithm, it doesn't know what these words mean in relationship to one another. It's up to us, as the human researchers, to make meaning out of the topics.

How might you label the following topics?

✨Topic 3:✨

`team, emory, game, eagles, said, university, season, second, time, points, senior, win, junior, year, sophomore, teams, freshman, college, run, play, `

In [None]:
# your answer here

✨Topic 6:✨

`killers, song, album, music, songs, rae, band, stage, concert, performance, crowd, audience, artists, lyrics, sound, dance, makeshift, collapsing, nextyear, weighty, `

In [None]:
# your answer here

✨Topic 13:✨

`officer, subject, said, april, epd, reported, complainant, emory, case, individual, student, wallet, responded, assigned, investigator, number, second, stage, campus, driver, `



## Refining the topic model

These are decent topics, but they're not amazing. Here are a few things you might want to try in order to fine-tune your model:

* Filtering some of the most common words (see the filtering function above)
* Generating fewer topics (we could try 10, for instance). 

Feel free to try those things on your own. 

But for the purposes of this class, let's take a bit of a closer look at the probabilities attached to each word in a single topic. 

In [None]:
# T13 looks coherent
topic = topics[13]

# the first item of the topic is the topic number
topic_num = topic[0]

# the next item is another list with pairs of words and probabilities
topic_pairs = topic[1]
for pair in topic_pairs:
    print(pair[0] + ": " + str(pair[1]))

# since all topics contain all words, the sum of all of the probabilities of each 
# topic should be 1


Let's flip it around and look at the topical composition of a single document. 

NB: MALLET provides this output automatically, but with gensim there's a bit more work required.

In [None]:
tokens = [] 

# open one file
with open('../corpora/emory-wheel/articles/2014-10-02-Atlanta-Food-Truck-Park-Brings-Enriching-Epicurian-Experience.txt', "r") as file:
    text = file.read()
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the Wheel dictionary, created above
doc_bow = id2word_wheel.doc2bow(tokens)

# get the topics that the doc consists of
doc_topics = lda_model.get_document_topics(doc_bow)

doc_topics
    



In [None]:
# now we can cross-reference to find those topics and words

for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")

    topic_words = "Top words in topic: "
    select_topics = topics[topic]
    
    for pair in select_topics[1]:
        topic_words += pair[0] + ", "
    
    print(topic_words)
 

### Evaluating Topics ###

Gensim has several built-in methods for evaluating topics, bundled into a class called [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html). The fastest one to calculate is called u_mass, and in this case, the closer to zero (positive or negative), the better the score. 

Let's see how our model performs: 

In [None]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=wheel_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence
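
For intuition about what a score like this measures, here is a rough sketch of a UMass-style calculation in plain Python: it rewards topics whose top words tend to appear in the same documents. The four toy documents and the helper names `doc_count` and `umass` are assumptions, and gensim's own implementation differs in its details (for example, in how it smooths and averages), so don't expect these numbers to match `get_coherence()`.

```python
import math

# Each toy document is represented as the set of words it contains
docs = [
    {"team", "game", "season", "win"},
    {"team", "game", "points"},
    {"album", "song", "band"},
    {"team", "season", "song"},
]

def doc_count(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def umass(top_words):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs of top words, with l before m."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_docs = doc_count(top_words[m], top_words[l])
            score += math.log((co_docs + 1) / doc_count(top_words[l]))
    return score

coherent = umass(["team", "game", "season"])  # words that share documents
mixed = umass(["team", "album", "points"])    # words that rarely share documents
print(coherent, mixed)
```

The word list whose members co-occur scores closer to zero than the mixed list, which matches the rule of thumb above.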

Here's a review essay by Hanna Wallach et al. that summarizes a few methods of evaluation, including some involving humans in the loop: ["Evaluation Methods for Topic Models"](http://dirichlet.net/pdf/wallach09evaluation.pdf).

Another way to evaluate topics is just to look at them.

The pyLDAvis library lets you do this in a single line. It's very satisfying! 

In [None]:
# LDA visualization tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this; newer pyLDAvis releases name this module pyLDAvis.gensim_models

# just reformat the corpus for pyLDAvis 
from gensim.corpora import MmCorpus
wheel_mm_corpus = MmCorpus('./wheel.corpus.mm')

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda_model, wheel_mm_corpus, id2word_wheel)