# Topic Modeling #


## What is Topic Modeling? ##

At its most basic level, topic modeling is an automated method for extracting the themes, or "topics," from large sets of documents, like newswires, literary scholarship, or, as we'll explore here, articles in the Emory Wheel.

There are numerous kinds of topic models, but the most popular and widely-used kind is latent Dirichlet allocation (LDA). It's so popular, in fact, that "LDA" and "topic model" are sometimes used interchangeably, even though LDA is only one type.

LDA math is pretty complicated. We're not going to get very deep into the math just yet (or maybe not ever, depending on the time). Instead, we are going to introduce two important concepts that will help us understand how LDA topic models work.

### 1) LDA is an Unsupervised Algorithm
Topic modeling is a kind of machine learning. Machine learning always sounds complicated, but it really just means that computer algorithms are performing tasks without being explicitly programmed to do so and that they are "learning" how to perform these tasks by being fed training data. In the field of machine learning, algorithms are typically split into two broad categories: supervised and unsupervised. These categories describe how the algorithms are "trained" or how they "learn." LDA is an unsupervised algorithm.

If an algorithm is supervised, that means a researcher is helping to guide it with some kind of information, like labels. For example, if you wanted to create an algorithm that could identify pictures of cats vs pictures of dogs, you could train it with a bunch of pictures of cats that were clearly labeled CAT and a bunch of pictures of dogs that were clearly labeled DOG. The algorithm would then be able to learn which features are specific to cats vs dogs because you explicitly told it: this is a picture of a cat; this is a picture of a dog.

If an algorithm is unsupervised, that means a researcher does not train it with outside information. There are no labels. The algorithm just learns that pictures of cats are more similar to each other and pictures of dogs are more similar to each other. The algorithm doesn't really know that one cluster is cats and one cluster is dogs; it just knows that there are two distinct clusters.

Because LDA is an unsupervised algorithm, we don't tell our topic model which words or topics to look for. We only tell the topic model how many topics (or clusters of words) we want returned. The topic model doesn't know anything about Frida Kahlo, Nella Larsen, and Jackie Robinson. It doesn't know anything about art, literature, and sports.

### 2) LDA is a Probabilistic Model
LDA fundamentally relies on statistics and probabilities. Rather than calculating precise and unchanging metrics about a given corpus, as we've done thus far, a topic model makes a series of very sophisticated guesses about the corpus. These guesses will change slightly every time we run the topic model. This is important to remember as we analyze, interpret, and make arguments based on our results. All of our results in this lesson will be probabilities, and they'll change slightly every time we re-run the topic model.

When we tell the topic model that we want to extract 15 topics from the Emory Wheel, here's what the topic model does:

The topic model starts off with a slightly silly, backwards assumption. The topic model assumes that every single one of the 4000-some-odd articles in the corpus was written by someone who exclusively drew their words from 15 mystery topics, or 15 clusters of words. To spin it in a slightly different way with a different medium, the topic model assumes that there was one master artist with 15 different paints on her palette, who created all the articles by dipping her brush into these 15 paints alone, applying and blending them onto each canvas in different proportions. The topic model is trying to discover the 15 mystery topics that created all the Wheel articles, as well as the mixture of these topics that makes up each individual article.
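
To make that backwards assumption concrete, here is a minimal sketch of the "master artist" at work, using only Python's standard library. The two toy topics, their word probabilities, and the names `sample_dirichlet` and `generate_document` are all made up for illustration; a real model would have many more topics and words, and it would have to infer them rather than being handed them.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, n_words, alpha, rng):
    """Write one 'article' by repeatedly picking a topic, then a word from it."""
    topic_names = list(topics)
    # the article's own mixture of topics (how much of each paint it uses)
    mixture = sample_dirichlet(alpha, len(topic_names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=mixture)[0]
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return mixture, words

# hypothetical toy topics: each maps words to probabilities that sum to 1
toy_topics = {
    "campus_life": {"student": 0.5, "stress": 0.3, "exam": 0.2},
    "film": {"film": 0.5, "atlanta": 0.3, "industry": 0.2},
}

rng = random.Random(340)
mixture, words = generate_document(toy_topics, n_words=12, alpha=0.5, rng=rng)
print([round(p, 2) for p in mixture])
print(" ".join(words))
```

Nobody actually writes articles this way, of course. The point is that if we pretend they do, we can ask which topics and mixtures would most plausibly have produced the articles we really have.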

The topic model begins by taking a completely wild guess about the 15 topics, but then it iterates through all the words in all the articles and makes better and better guesses. If the word "student" keeps showing up with the words "stress" and "exam," and if all three words keep showing up in the same kinds of articles, then the topic model starts to suspect that these three words should belong to the same topic. If the word "film" keeps showing up with "Atlanta" and "industry," then the topic model suspects that they should belong to the same topic, too. The topic model finally arrives at its best guesses for the 15 topics that most likely created all the Emory Wheel articles.


## LDA explained again in more abstract terms

Probabilistic topic models begin with an assumption and a definition.

The assumption: all documents contain a mixture of different topics.

The definition: a topic is a collection of words, each with a different probability of occurrence in a particular document (or other chunk of text) discussing that topic.




Here's a nice illustration, created by Ted Underwood, that shows this assumed relationship between topics and documents.

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

Above we see an example of the basic assumption of topic modeling: one topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.”

The three documents are assumed to contain both topics in different proportions.
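
As a rough worked example of what "different proportions" means: if a document is a mixture of topics, then the chance of seeing a word like "lead" in it is a weighted sum over the topics. The probabilities below are invented for illustration (they are not taken from the figure), and `word_probability` is just a hypothetical helper name.

```python
# hypothetical word probabilities for two topics; the numbers are made up
topics = {
    "leadership": {"organize": 0.3, "committee": 0.3, "direct": 0.2, "lead": 0.2},
    "poison": {"mercury": 0.45, "arsenic": 0.45, "lead": 0.1},
}

def word_probability(word, doc_mixture):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topics[topic].get(word, 0.0)
        for topic, share in doc_mixture.items()
    )

# two hypothetical documents that blend the same topics differently
doc_a = {"leadership": 0.9, "poison": 0.1}
doc_b = {"leadership": 0.2, "poison": 0.8}

for name, doc in [("A", doc_a), ("B", doc_b)]:
    print(name, round(word_probability("lead", doc), 3))
```

Document A gets 0.9 × 0.2 + 0.1 × 0.1 = 0.19 and document B gets 0.2 × 0.2 + 0.8 × 0.1 = 0.12, so the same word is expected at different rates depending on the mix.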

But here is the thing: we can’t directly observe topics. All we actually have are the documents that attest to their existence. So in other words:

**Topic modeling is a way of extrapolating backward from a collection of documents to infer the topics that could have generated them.**

### You're speaking in prose (I mean Bayesian statistics)! ###

When you begin to train a topic model, the model can only make a wild guess as to which words belong to which topic, since it doesn't know the topics in advance.

But over time, the model "learns" which words are more likely to be associated with which topics, and "updates its priors."

This is the fundamental idea of Bayes' Rule, which underlies Bayesian statistics. Bayes' Rule (or Bayes' Law or Bayes' Theorem) provides us with a way of incorporating prior belief into an assessment of the probability of something happening, or being true. It is often employed recursively--so, you register your preheld belief (or "prior"), run the experiment, and then update what you thought you knew.
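
Here is a small sketch of that update step, with exact fractions so the arithmetic is easy to follow. The two hypotheses and the likelihoods for the word "menu" are hypothetical numbers chosen for illustration, and `bayes_update` is our own helper, not part of any library.

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Return the posterior P(hypothesis | evidence) for each hypothesis."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}

# prior: before reading, we think the article is equally likely to be about either topic
prior = {"sports": Fraction(1, 2), "dining": Fraction(1, 2)}

# hypothetical chance of seeing the word "menu" under each topic
likelihood_menu = {"sports": Fraction(1, 100), "dining": Fraction(9, 100)}

# we read one "menu" and update our belief
posterior = bayes_update(prior, likelihood_menu)
print(posterior["dining"])    # 9/10

# used recursively: the old posterior becomes the new prior for the next "menu"
posterior_2 = bayes_update(posterior, likelihood_menu)
print(posterior_2["dining"])  # 81/82
```

Each new piece of evidence nudges the belief further, which is the sense in which a model "updates its priors."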

When we do that with topic modeling, two things happen:

1) Words will gradually become more common in topics where they are already common. And also,

2) Topics will become more common in documents where they are already common.

Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.
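
For the curious, the "assign randomly, then keep improving" loop can be sketched in a few dozen lines. What follows is a toy collapsed Gibbs sampler, which is one common way of fitting LDA (as far as I know it is the approach MALLET takes, while gensim's `LdaModel` uses a different method, online variational Bayes). The function name `toy_lda`, the hyperparameter values, and the four miniature documents are all assumptions made for illustration; this is not how you would model a real corpus.

```python
import random
from collections import Counter

def toy_lda(docs, num_topics, passes=200, alpha=0.1, beta=0.01, seed=0):
    """A minimal collapsed Gibbs sampler for LDA on a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # step 1: assign every word token to a random topic
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # bookkeeping: how often each topic appears in each doc / with each word
    doc_topic = [Counter(zs) for zs in assignments]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for doc, zs in zip(docs, assignments):
        for word, z in zip(doc, zs):
            topic_word[z][word] += 1
            topic_total[z] += 1

    # step 2: keep revisiting each token and re-guessing its topic
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                # remove this token's current assignment from the counts
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # a topic is favored if it is already common in this doc
                # AND this word is already common in that topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                # record the new guess
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return topic_word, doc_topic

toy_docs = [
    ["student", "stress", "exam", "student", "exam"],
    ["exam", "stress", "student", "stress"],
    ["film", "atlanta", "industry", "film"],
    ["industry", "film", "atlanta", "atlanta"],
]
topic_word, doc_topic = toy_lda(toy_docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print("T" + str(k), [w for w, c in counts.most_common(3) if c > 0])
```

The two factors in `weights` are exactly the two tendencies listed above: words pile up in topics where they are already common, and topics pile up in documents where they are already common. On a corpus this tiny and tidy the two topics will usually separate into "student" words and "film" words, but the sampler is random, so nothing guarantees it on any given run.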

For a slightly more in-depth explanation of how LDA works, see [this video](https://vimeo.com/53080123). (Start around 5:35.)

### A brief historical / technical digression... ###

Topic modeling began as a US military project in the 1990s. The goal was to automatically detect changes in newswire text so that governmental and military organizations could be alerted to emerging geopolitical events. (For more on this history, see [Binder](https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18).)


In the early 2000s, a team of computer science researchers released [MALLET](http://mallet.cs.umass.edu/topics.php), short for **MA**chine **L**earning for **L**anguag**E** **T**oolkit. As the name suggests, MALLET is a software toolkit that enables a range of NLP techniques. Today, people mostly use it for topic modeling, which it remains very good at.

With that said, MALLET is written in Java, which means that it's not ideal for working in Python and Colab notebooks. Maria Antoniak, whose "Birth Stories" paper we'll soon read for this class, has written a convenient Python package that allows you to use MALLET in a Colab notebook. Her package is called [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), and you should think about using that if you end up using topic modeling as one of your methods for your final project.

For today, though, we'll be using the slightly less accurate but decidedly easier to install [gensim](https://radimrehurek.com/gensim/about.html), a native Python library for topic modeling that was created in the early 2010s by a computer science PhD student, Radim Rehurek. Its ease of use has made the use of topic models explode--although, I should reiterate, most people end up returning to MALLET for research-level code.

## Let's go! ##

In [None]:
%pip install gensim

In [None]:
# import and set up the modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging, since topic modeling takes a while and it's good to know what's going on
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

# a helpful function that returns the first `n` elements of the stream as a plain list.
# we'll use this later
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [None]:
# import some more gensim modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

## Load in our data

For today's lesson, we'll be using a dataset of articles from the *Emory Wheel* between 2014 and 2019. This dataset was created by Honggang Min and Kexin Guan for their final project in the 2019 iteration of QTM 340, and was generously transferred back to me for future class use.

The documents are individual .txt files that are stored in a zip file on my Google Drive. Below is code that gets the data from my Google Drive and unzips it into a folder, which is further processed below.

In [None]:
# For downloading large files from Google Drive
# https://github.com/wkentaro/gdown
import gdown

# then download the zip file
gdown.download('https://drive.google.com/uc?export=download&id=1MbziyCKDr4FMFEDcC_b537I7j7ymnsUN', quiet=False)

# unzip it
!unzip wheel-cleaner.zip

## Processing the data

There is always some sort of pre-processing involved in running any text analysis method. However, the particulars of this step are always a little different.

To generate a topic model using gensim, each document needs to be fed into the model in the format: (title, tokens).

(Tokens in this context means the specific unit that is processed. In this case, the tokens are just single words).

The next few cells define a function to tokenize our documents, which we then use to pre-process, or *tokenize*, our documents.

### Step 1: Define our tokenizing function ###

As previously discussed, gensim requires that we first tokenize our corpus.

Here, we're going to define our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together.
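
If you'd like to see roughly what this kind of tokenizing does without installing anything, here is a stand-in written with only the standard library. It is an approximation, not gensim's implementation: the regular expression is simpler than gensim's, `TOY_STOPWORDS` is a tiny hypothetical stop list (gensim's `STOPWORDS` is far longer), and the length limits are meant to mirror what I understand to be `simple_preprocess`'s defaults.

```python
import re

# a tiny, hypothetical stop list; gensim's STOPWORDS is far longer
TOY_STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "at"}

def toy_tokenize(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs of a sensible length, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and t not in TOY_STOPWORDS
    ]

sample = "The Wheel reviewed 3 food trucks in Atlanta. Food is the topic!"
toy_tokens = toy_tokenize(sample)
print(toy_tokens)
# ['wheel', 'reviewed', 'food', 'trucks', 'atlanta', 'food', 'topic']
```

Notice that "Food" and "food" end up as the same token, and that the number, the punctuation, and the stop words are gone.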

In [None]:
# here's some nice dense python for you:
# this defines our tokenize function for future use
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

### Step 2: Process the docs ###

Now that we have our tokenizing function, we can use it in the function below. This one iterates through each document (a newspaper article) in our corpus (our Emory Wheel dataset) and returns each document in the format (title, tokens), as required.

The gensim documentation tells us that we want to define a pre-processing function like this:

In [None]:
# A function to yield each doc in a base directory as a `(filename, tokens)` tuple.

def iter_docs(base_dir):
    docs = os.listdir(base_dir)

    for doc in docs:
        # skip hidden files (anything whose name starts with a dot)
        if not doc.startswith('.'):
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()
                tokens = tokenize(text)

                yield doc, tokens

The next step is to create a data structure that maps each word in the dataset to a numerical ID.

This mapping step is required because most algorithms, including gensim's implementation of LDA, rely on numerical libraries that work with vectors indexed by numbers, not by words.
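
To see what such a mapping amounts to, here is a hand-rolled sketch on two toy documents. The function name `build_id_mapping` and the documents are made up for illustration, and the particular ID numbers gensim assigns to words may well differ from the ones this sketch produces.

```python
def build_id_mapping(tokenized_docs):
    """Give each distinct token the next unused integer ID, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [
    ["food", "truck", "atlanta"],
    ["tennis", "match", "atlanta"],
]
toy_token2id = build_id_mapping(toy_docs)
print(toy_token2id)
# {'food': 0, 'truck': 1, 'atlanta': 2, 'tennis': 3, 'match': 4}
```

Note that "atlanta" appears in both documents but gets only one ID. From here on, the model can refer to word 2 instead of the string "atlanta".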

The mapping can be constructed automatically, like so:

In [None]:
# set up the stream
# this is the one line you'd change here with another corpus and/or corpus location
stream = iter_docs('wheel-clean/')

# all of the rest is standard from the gensim documentation
doc_stream = (tokens for _, tokens in stream)

id2word_wheel = gensim.corpora.Dictionary(doc_stream)

print(id2word_wheel)


The Dictionary (id2word_wheel) now contains all words that appeared in the corpus, along with an ID number for each of the words.

Let's take a look:

In [None]:
id2word_wheel.token2id

There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter the words.

gensim also provides functions for this:

In [None]:
# The line below would, for example, filter out the 50 most frequent words.
# It's commented out here because I don't want to use it in this case,
# but it's handy to know about.
# id2word_wheel.filter_n_most_frequent(50)

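# this line filters out words that appear in only 1 doc, keeping the rest
# I will use this one
# Note that no_below and no_above take different data types:
# no_below is an absolute number of documents (an int), while
# no_above is a fraction of the whole corpus (a float between 0 and 1)
id2word_wheel.filter_extremes(no_below=2, no_above=1.0)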

id2word_wheel

Note that by removing the words that only appeared in a single document, we went from 77,892 unique words (or tokens) to 33,830. That's not a huge number for a topic model, but we'll see how it goes...
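
The logic behind this kind of filtering is simple enough to sketch by hand. The version below is a simplified illustration, not gensim's code: `filter_by_doc_frequency` is a hypothetical helper, and it ignores the cap on total vocabulary size that gensim's `filter_extremes` also applies. It keeps a word only if it appears in at least `no_below` documents (a count) and in no more than `no_above` of all documents (a fraction).

```python
from collections import Counter

def filter_by_doc_frequency(tokenized_docs, no_below=2, no_above=1.0):
    """Return the tokens found in >= no_below docs and in <= no_above of all docs."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["emory", "food", "truck"],
    ["emory", "tennis", "match"],
    ["emory", "food", "menu"],
]

# same settings as above: drop words found in only one document
print(sorted(filter_by_doc_frequency(toy_docs, no_below=2, no_above=1.0)))
# ['emory', 'food']

# also drop words found in more than 70% of documents
print(sorted(filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.7)))
# ['food']
```

Lowering `no_above` is one way to get rid of words like "emory" that show up nearly everywhere and so tell us little about any one topic.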

Since a streamed corpus and a dictionary are all we need to create the vectors for our topic model, we can get started.

In [None]:
# a class we need; this is the same for every topic model you create with gensim.
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs

    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)

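    def __len__(self):
        # if the corpus isn't clipped, count the non-hidden files in the corpus folder
        if self.clip_docs is not None:
            return self.clip_docs
        return sum(1 for name in os.listdir(self.dump_file) if not name.startswith('.'))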

In [None]:
# create a stream of bag-of-words vectors
# here's another place where you'd change the name/location of the corpus if you want to
# run a topic model on something else
wheel_corpus = Corpus('wheel-clean/', id2word_wheel)

# print the first vector in the stream to see what it looks like;
# this is in the format (word_id, count in first doc)

vector = next(iter(wheel_corpus))

vector

Note that the vector above, which contains counts of the words that appear in the first document in the corpus, does not contain counts for all of the words in the corpus. Rather, it is what is known as a "sparse" vector, in that it omits all of the words in the corpus with a count of zero (that is, the words in the corpus that do not appear in this document).
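
Here is a small sketch of the difference between the sparse form and the "dense" form it stands in for. The five-word vocabulary and the helper `to_sparse_bow` are hypothetical; the helper only imitates the general shape of what `doc2bow` returns.

```python
from collections import Counter

def to_sparse_bow(tokens, token2id):
    """Return sorted (word_id, count) pairs, skipping words that aren't in the mapping."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# hypothetical five-word vocabulary
toy_token2id = {"atlanta": 0, "food": 1, "match": 2, "tennis": 3, "truck": 4}
doc = ["food", "truck", "food", "atlanta", "sodexo"]

sparse = to_sparse_bow(doc, toy_token2id)
print(sparse)
# [(0, 1), (1, 2), (4, 1)]

# the equivalent dense vector has a slot for every word in the vocabulary
dense = [0] * len(toy_token2id)
for word_id, count in sparse:
    dense[word_id] = count
print(dense)
# [1, 2, 0, 0, 1]
```

With a five-word vocabulary the savings are trivial, but with tens of thousands of words and articles that each use only a few hundred of them, leaving out the zeros makes a big difference.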

In [None]:
# now we're ready to run our topic model!

%time lda_model = gensim.models.LdaModel(wheel_corpus, num_topics=20, id2word=id2word_wheel, passes=5)

# note that passes should be higher -- usually in the 50-100 range --
# but in the interests of time we'll only do 5

In [None]:
# some additional helpful functions built into LdaModel

# how to store corpus to LOCAL disk (for Colab users, see below)
# from gensim.corpora import MmCorpus
# MmCorpus.serialize('./wheel.corpus.mm', wheel_corpus)

# how to store dictionary to disk
# id2word_wheel.save('./wheel.dictionary')

# how to store model to disk
# lda_model.save('./lda_wheel-20topics_50iters.model')

You can also load in a saved model. This is very helpful to know about, since generating new topic models takes time.

Here, we're going to load in a (slightly) better topic model of the Emory Wheel with the same number of topics (20), but 50 passes.

In [None]:
# how you would load an old model from your own google drive
# lda_model = gensim.models.LdaModel.load('/content/gdrive/My Drive/corpora/lda_wheel-20topics_5iters.model')

# how we are all going to load in the same model -- from my google drive
gdown.download('https://drive.google.com/uc?export=download&id=1ZnoYyOWfs4tBezNvMj3TTTa0ylXNoUJU', quiet=False)

# unzip it
# !unzip lda_wheel-20topics_50iters.zip

# load it
lda_model_50iters = gensim.models.LdaModel.load('lda_wheel-20topics_50iters.model')

In [None]:
# gensim comes with a bunch of functions that make interacting with the output
# of the topic model a little easier. this one shows the topics.

# first, let's be sure to format the words nicely;
import textwrap

# this is the actual loading in of the topics
topics = lda_model_50iters.show_topics(20, 20, formatted=False)

# this is the print nicely part
for topic in topics:
    topic_num = topic[0]
    topic_words = ""

    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "

    # wrap each topic's line at 100 characters (the width belongs to fill, not print)
    print(textwrap.fill("T" + str(topic_num) + ": " + topic_words, 100) + "\n")

## Topics and Labels

Now you can see, perhaps, that what we call a "topic" is really just a list of the most probable words for that topic, which are sorted in descending order of probability. The most probable word for the topic is the first word.

Topic models start to get more powerful when we, as human researchers, analyze the most probable words for every topic and summarize what these words have in common. This summary can then be used as a descriptive label for the topic. Remember, since an LDA topic model is an unsupervised algorithm, it doesn't know what these words mean in relationship to one another. It's up to us, as the human researchers, to make meaning out of the topics.

How might you label the following topics?

✨Topic 7:✨

```
food, water, dining, emory, restaurant, meal, market, employees,
options, menu, sodexo, cox, according, service, workers, chicken,
hurricane, campus, waste, sustainability
```



In [None]:
# your answer here

✨Topic 8:✨

```
sga, student, said, council, president, students, cc, college,
vice, wheel, committee, legislature, government, year, graduate, spc,
ma, executive, board, funding
```

In [None]:
# your answer here

✨Topic 0:✨

```
university, match, singles, won, set, team, win, doubles, matches,
emory, williams, state, college, winning, court, championship, ncaa,
sophomore, tennis, freshman
```

## Refining the topic model

These are decent topics, but they're not amazing. Here are a few things you might want to try in order to fine-tune your model:

* Filtering some of the most common words (see the filtering function above)
* Generating more or fewer topics (we could try 10, for instance).

Feel free to try those things on your own.

But for the purposes of this class, let's take a bit of a closer look at the probabilities attached to each word in a single topic.

In [None]:
# T0 looks coherent
topic = topics[0]

# the first item of the topic is the topic number
topic_num = topic[0]

print("For Topic 0, of all 30K+ words, these are the top words in the topic, sorted by proportion of the topic.\n")
# the next item is another list with pairs of words and percentages
topic_pairs = topic[1]
for pair in topic_pairs:
    print(pair[0] + ": " + str(pair[1]))

# since all topics contain all words, the sum of all of the percentages of each
# topic should be 1


Let's flip it around and look at the topical composition of a single document.

NB: MALLET provides this output automatically, but with gensim there's a bit more work required.

In [None]:
import textwrap

tokens = []

# open one article
with open('wheel-clean/2014-10-02-Atlanta-Food-Truck-Park-Brings-Enriching-Epicurian-Experience.txt', "r") as file:
    text = file.read()
    print(textwrap.fill(text, 100))
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the Wheel dictionary, created above
doc_bow = id2word_wheel.doc2bow(tokens)

# get the topics that the doc consists of
# doc_topics = lda_model.get_document_topics(doc_bow)
doc_topics = lda_model_50iters.get_document_topics(doc_bow)

# sort doc_topics by percentage
doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)

print("\nArticle topics:")

# now we can cross-reference to find those topics and words
for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")

        #  str(round(prob, 2)))

    topic_words = "Top words in topic: "
    select_topics = topics[topic]

    for pair in select_topics[1]:
        topic_words += pair[0] + ", "

    print(topic_words)

### Evaluating Topics ###

Gensim has several built-in methods for evaluating topics included as a model called [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html). The fastest one to calculate is called u_mass, and in this case, the closer to zero (positive or negative), the better the score.
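
The intuition behind u_mass is that a topic is coherent if its top words tend to show up in the same documents. Below is a sketch of the score as I understand it from the original formulation: for each pair of top words, take the log of (number of documents containing both, plus 1) divided by (number of documents containing the higher-ranked word), and add these up. Gensim's implementation differs in some details (such as the smoothing constant and how scores are averaged), so don't expect these numbers to match `CoherenceModel`. The function name and the toy documents are made up, and the sketch assumes every top word appears in at least one document.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """Score one topic from how often its top words share a document."""
    doc_sets = [set(tokens) for tokens in tokenized_docs]

    def doc_count(*words):
        # number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            both = doc_count(top_words[m], top_words[l])
            score += math.log((both + 1) / doc_count(top_words[l]))
    return score

toy_docs = [
    ["food", "menu", "dining"],
    ["food", "menu", "campus"],
    ["tennis", "match", "campus"],
    ["tennis", "match", "court"],
]

coherent_score = umass_coherence(["food", "menu"], toy_docs)
mixed_score = umass_coherence(["food", "tennis"], toy_docs)
print(round(coherent_score, 3), round(mixed_score, 3))
```

Here "food" and "menu" always appear together, so that pair scores higher than "food" and "tennis," which never share a document.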

Let's see how our model performs:

In [None]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=wheel_corpus, coherence='u_mass')
# cm = CoherenceModel(model=lda_model_50iters, corpus=wheel_corpus, coherence='u_mass')


coherence = cm.get_coherence()  # get coherence value

coherence

Here's a review essay by Hanna Wallach et al. that summarizes a few methods of evaluation, including some involving humans in the loop: ["Evaluation Methods for Topic Models"](http://dirichlet.net/pdf/wallach09evaluation.pdf).

Another way to evaluate topics is just to look at them.

The pyLDAvis library lets you do this in a single line. It's very satisfying!

In [None]:
# Install the pyLDAvis package if you haven't already
!pip install pyLDAvis

# LDA visualization tools
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis # Import the gensim_models submodule

pyLDAvis.enable_notebook()

# Check for and remove empty documents from the corpus
wheel_corpus = [doc for doc in wheel_corpus if doc]

# Use prepare function from the gensim_models submodule for Gensim models
vis = gensimvis.prepare(lda_model, wheel_corpus, id2word_wheel)

vis

*Lauren Klein wrote this lesson in 2019 drawing on writing by [Ted Underwood](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) and [Matthew Jockers](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), this [video of a talk by David Mimno](https://vimeo.com/53080123), and [this notebook](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html) by Radim Rehurek. It was supplemented in 2020 with additional materials from Dan Sinykin, and revised again in 2021, 2022, 2024, and 2025 by Lauren Klein.*