## Topic Modeling Homework ##

*The text in the first half of this notebook restates much of what was covered in Thursday's lecture. Please read through it again since, as I explained in class, topic modeling is a heady concept that can take a while to settle in your brain.*

*Alternatively, or in addition, you may wish to watch [this video](https://vimeo.com/53080123) for a third (!) explanation of how topic modeling works. (Start around 3:30.)*

*If, in the end, topic modeling still seems confusing to you, that is also fine. It may make more sense after you've had a chance to play around with the output of an actual model--which is what the second half of this notebook allows you to do.*

*There are three exercises embedded in this notebook, all located towards the end. Please complete the exercises and when you're done, upload your completed notebook to Canvas.*

## Part 1: Topic Modeling Explained ##

*Per above, as the first part of your homework, please just read through these next few sections.*

### What is Topic Modeling? ###

In both the Li and Bamman paper, and the Antoniak et al. paper, we've seen how topic modeling plays a major role. What is topic modeling? At its most basic level, topic modeling is an automated method for extracting the themes, or "topics," from large sets of documents--like GPT-3 generated fiction, or birth stories, or as we'll explore today, articles in the Emory Wheel.

There are numerous kinds of topic models, but the most popular and widely-used kind is latent Dirichlet allocation (LDA). It's so popular, in fact, that "LDA" and "topic model" are sometimes used interchangeably, even though LDA is only one type.

LDA math is pretty complicated. We're not going to get very deep into the math just yet (or maybe not ever, depending on the time). But first we are going to introduce two important concepts that will help us conceptually understand how LDA topic models work.

### 1) LDA is an Unsupervised Algorithm 
Topic modeling is a kind of machine learning. Machine learning always sounds complicated, but it really just means that computer algorithms are performing tasks without being explicitly programmed to do so and that they are "learning" how to perform these tasks by being fed training data. In the field of machine learning, algorithms are typically split into two broad categories: supervised and unsupervised. These categories describe how the algorithms are "trained" or how they "learn." LDA is an unsupervised algorithm.

If an algorithm is supervised, that means a researcher is helping to guide it with some kind of information, like labels. For example, if you wanted to create an algorithm that could identify pictures of cats vs pictures of dogs, you could train it with a bunch of pictures of cats that were clearly labeled CAT and a bunch of pictures of dogs that were clearly labeled DOG. The algorithm would then be able to learn which features are specific to cats vs dogs because you explicitly told it: this is a picture of a cat; this is a picture of a dog.

If an algorithm is unsupervised, that means a researcher does not train it with outside information. There are no labels. The algorithm just learns that pictures of cats are more similar to each other and pictures of dogs are more similar to each other. The algorithm doesn't really know that one cluster is cats and one cluster is dogs; it just knows that there are two distinct clusters.

Because LDA is an unsupervised algorithm, we don't tell our topic model which words or topics to look for. We only tell the topic model how many topics (or clusters of words) we want returned. The topic model doesn't know anything about Frida Kahlo, Nella Larsen, and Jackie Robinson. It doesn't know anything about art, literature, and sports.

### 2) LDA is a Probabilistic Model 
LDA fundamentally relies on statistics and probabilities. Rather than calculating precise and unchanging metrics about a given corpus, a topic model makes a series of very sophisticated guesses about the corpus. These guesses will change slightly every time we run the topic model. This is important to remember as we analyze, interpret, and make arguments based on our results. All of our results in this lesson will be probabilities, and they'll change slightly every time we re-run the topic model.

When we tell the topic model that we want to extract 15 topics from the Emory Wheel, here's what the topic model does:

The topic model starts off with a slightly silly, backwards assumption. The topic model assumes that every single one of the 4000-some-odd articles in the corpus was written by someone who exclusively drew their words from 15 mystery topics, or 15 clusters of words. To spin it in a slightly different way with a different medium, the topic model assumes that there was one master artist with 15 different paints on her palette, who created all the articles by dipping her brush into these 15 paints alone, applying and blending them onto each canvas in different proportions. The topic model is trying to discover the 15 mystery topics that created all the Wheel articles, as well as the mixture of these topics that makes up each individual article.
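
The "backwards assumption" described above can be sketched in a few lines of plain Python. This is only an illustration: the two topics, their words, and their probabilities are made up for the example, and `generate_document` is a hypothetical helper, not part of gensim. Each word of the pretend article is produced by first picking a topic according to the article's mixture, and then picking a word from that topic.

```python
import random

# Hypothetical toy topics: each maps words to probabilities that sum to 1.
topics = {
    "sports": {"team": 0.5, "game": 0.3, "season": 0.2},
    "music": {"song": 0.4, "album": 0.4, "band": 0.2},
}

def generate_document(topic_mixture, n_words, rng):
    """Draw n_words by first picking a topic, then picking a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"sports": 0.8, "music": 0.2}, n_words=12, rng=rng)
print(doc)
```

A topic model is handed only the output of a process like this (the words) and tries to work backwards to both the topics and the mixture.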

The topic model begins by taking a completely wild guess about the 15 topics, but then it iterates through all the words in all the articles and makes better and better guesses. If the word "student" keeps showing up with the words "stress" and "exam," and if all three words keep showing up in the same kinds of articles, then the topic model starts to suspect that these three words should belong to the same topic. If the word "film" keeps showing up with "Atlanta" and "industry," then the topic model suspects that they should belong to the same topic, too. The topic model finally arrives at its best guesses for the 15 topics that most likely created all the Emory Wheel articles.


### LDA explained again in more concrete terms

Probabilistic topic models begin with an assumption and a definition. 

The assumption: all documents contain a mixture of different topics.

The definition: a topic is a collection of words, each with a different probability of occurrence in a particular document (or other chunk of text) discussing that topic. 

Here's a nice illustration, created by Ted Underwood, that shows this assumed relationship between topics and documents. 

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

Above we see an example of the basic assumption of topic modeling: one topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.” 

The three documents are assumed to contain both topics in different proportions.

But here is the thing: we can’t directly observe topics. All we actually have are the documents that attest to their existence. So in other words:

**Topic modeling is a way of extrapolating backward from a collection of documents to infer the topics that could have generated them.** 

There is simply no way to infer the exact topics in a set of documents; there are too many unknowns. So (probabilistic) topic modeling works backwards. It pretends that the problem is mostly solved. 

**How does this play out in actual life?**

Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic 1 or topic 2?

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

We can’t know for sure. But one way to guess is to consider two questions. This is the first: 

* How often does “lead” appear in topic 1 elsewhere? If “lead” often occurs in discussions of 1, then this instance of “lead” might belong to 1 as well. 

But a word can be common in more than one topic, as it is in topics 1 and 2 above. And we don’t want to assign “lead” to a topic about leadership (topic 1) if this document is mostly about heavy metal contamination (topic 2). So we also need to consider a second question:

* How common is topic 1 in the rest of the document?

To answer these questions, here’s what we’ll do:

For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. Here’s the actual formula:

![LDA formula](https://tedunderwood.files.wordpress.com/2012/04/ldaformula.png)

There are also a few Greek letters scattered in there, but they aren’t important for our purposes. Technically, they’re called “hyperparameters,” but you can think of them simply as fudge factors. 

In other words: there’s some chance that this word belongs to topic Z even if it is nowhere else associated with Z; the fudge factors keep that possibility open. (If you want to understand hyperparameters beyond the "fudge factor" explanation, see "[Rethinking LDA: Why Priors Matter](http://people.cs.umass.edu/~mimno/publications.html).")

The overall emphasis on probability in this technique, of course, is why it’s called *probabilistic topic modeling*.
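
To make the two questions concrete, here is a minimal sketch of the calculation for the single unassigned word "lead." Everything in it is an assumption for illustration: the counts are invented, `topic_scores` is a hypothetical helper, and the fudge factors `alpha` and `beta` are given arbitrary small values. List index 0 stands for topic 1 (leadership) and index 1 for topic 2 (heavy metals).

```python
def topic_scores(word_topic_counts, topic_totals, doc_topic_counts,
                 vocab_size, alpha=0.1, beta=0.01):
    """Return normalized probabilities that one word token belongs to each topic.

    word_topic_counts[z]: how often this word type W is assigned to topic z elsewhere
    topic_totals[z]:      total number of word tokens assigned to topic z
    doc_topic_counts[z]:  number of other words in document D assigned to topic z
    """
    scores = []
    for z in range(len(topic_totals)):
        # Question 1: how common is this word in topic z?
        word_in_topic = (word_topic_counts[z] + beta) / (topic_totals[z] + beta * vocab_size)
        # Question 2: how common is topic z in the rest of the document?
        topic_in_doc = doc_topic_counts[z] + alpha
        scores.append(word_in_topic * topic_in_doc)
    total = sum(scores)
    return [s / total for s in scores]

# Invented counts: "lead" is common in both topics,
# but document D is mostly about topic 2 (index 1).
probs = topic_scores(
    word_topic_counts=[40, 30],
    topic_totals=[1000, 1000],
    doc_topic_counts=[3, 27],
    vocab_size=500,
)
print(probs)
```

With these made-up numbers, "lead" is slightly more common in topic 1 overall, but the document's heavy lean toward topic 2 outweighs that, so topic 2 gets the higher probability.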

**Enter Sampling**

Now, suppose that instead of having the problem mostly solved, we had only a wild guess which word belonged to which topic. We could still use the strategy I've just described to improve our guess, by making it more internally consistent. 

We could go through the collection, word by word, and reassign each word to a topic, guided by the formula above. 

And in fact, that's what LDA actually does.

And as we do that, two things happen:

1) Words will gradually become more common in topics where they are already common. And also,

2) Topics will become more common in documents where they are already common. 

Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.
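
The loop described above (guess randomly, then revisit every word and reassign it) can be written out as a toy sampler in plain Python. This is a teaching sketch under stated assumptions, not how MALLET or gensim are implemented: the function name `gibbs_lda`, the tiny corpus, and the values of `alpha`, `beta`, and `iterations` are all made up for the example.

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: returns (word_topic, doc_topic) count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    word_topic = {}                                # word -> count per topic
    topic_totals = [0] * num_topics                # tokens assigned to each topic
    doc_topic = [[0] * num_topics for _ in docs]   # per-document topic counts
    assignments = []                               # current topic of every token

    # Step 1: assign every word token to a random topic (the "wild guess").
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            word_topic.setdefault(w, [0] * num_topics)[z] += 1
            topic_totals[z] += 1
            doc_topic[d][z] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly revisit each token and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Set this token aside so it doesn't vote for itself.
                word_topic[w][old] -= 1
                topic_totals[old] -= 1
                doc_topic[d][old] -= 1

                # (how common is w in topic k) * (how common is topic k in doc d)
                weights = [
                    (word_topic[w][k] + beta) / (topic_totals[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = new
                word_topic[w][new] += 1
                topic_totals[new] += 1
                doc_topic[d][new] += 1

    return word_topic, doc_topic

toy_docs = [
    ["team", "game", "season", "team", "win"],
    ["song", "album", "band", "song", "concert"],
    ["game", "win", "team", "season", "game"],
    ["album", "band", "concert", "song", "album"],
]
word_topic, doc_topic = gibbs_lda(toy_docs, num_topics=2)
print(doc_topic)
```

Each row of `doc_topic` shows how many of that document's words ended up in each topic. Because the procedure is random, the exact split depends on the seed, which is the same reason real topic model output changes from run to run.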

## A brief historical / technical digression... ##

Topic modeling began as a US military project in the 1990s. The goal was to automatically detect changes in newswire text so that governmental and military organizations could be alerted to emerging geopolitical events. (For more on this history, see [Binder](https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18).)


In the early 2000s, a team of computer science researchers released [MALLET](http://mallet.cs.umass.edu/topics.php), short for **MA**chine **L**earning for **L**anguag**E** **T**oolkit. As the name suggests, MALLET is a software toolkit that enables a range of NLP techniques. Today, people mostly only use it for topic modeling, which it remains very very good at.

With that said, MALLET is written in Java, which means that it's not ideal for working in Python and Jupyter notebooks. None other than Maria Antoniak has written a convenient Python package that allows you to use MALLET in a Jupyter notebook. Her package is called [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), and I'm working on getting it set up for our JupyterHub.

Until then, we'll be using [gensim](https://radimrehurek.com/gensim/about.html), a native Python library for topic modeling (among other tasks) that was created in the early 2010s by a computer science PhD student, Radim Rehurek. While it's more convenient than MALLET, the topics it generates are generally considered to be less good/coherent than those generated by MALLET, so most people end up returning to MALLET for research-level code. 

## Part II: Topic Modeling with Gensim 

The code to generate topic models with gensim is definitely a little complicated. When we're able to use MALLET, you'll appreciate how much clearer it is to set up and query the model. 

How I recommend you approach the code below is simply to run each cell one-by-one until you get the output of the model. (I'll note where that is). Please don't concern yourself with the code that does the setup and processing, except to a) run it; and b) note that we're seeing yet another version of the standard load libraries / pre-process text / tokenize pipeline. 

### Load libraries and configure logging

In [None]:
# import and setup modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging since topic modeling can take some time and it's nice to get status updates
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

# a helpful function that returns the first `n` elements of the stream as a plain list.
# we'll use this later but we'll declare it now
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [None]:
# import some more modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

### Define functions for pre-processing and tokenizing our text ###

As previously discussed, many NLP tasks require that you tokenize your corpus. We've used tokenizers built into both NLTK and spaCy already for this course.  

Here, however, we're going to write our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together. 

We'll define the tokenizing function first, and then use it in our pre-processing function (the second function defined in the cell below). 

For topic modeling with gensim, we need to pre-process our documents so that they end up in the format (filename, tokens). We're using this format because that's what the gensim documentation tells us to use. In fact, both of these functions come nearly verbatim from the gensim documentation. 

In [None]:
# this defines our tokenize function for future use
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS] # more list comprehension syntax! 

# this defines a function that yields each doc in a base directory as a `(filename, tokens)` tuple.
def iter_docs(base_dir):
    docs = os.listdir(base_dir)

    for doc in docs:
        # skip hidden files (names that start with a dot)
        if not doc.startswith('.'):
            # os.path.join works whether or not base_dir ends with a slash
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()

            tokens = tokenize(text)
            yield doc, tokens
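
If you'd like to see what a tokenizing step like this does without loading gensim, here is a rough stand-in written with only the standard library. It is an approximation, not gensim's actual code: `toy_tokenize` and `TOY_STOPWORDS` are hypothetical names, the stopword list is far shorter than gensim's, and the exact rules of `simple_preprocess` differ in their details.

```python
import re

# A tiny, made-up stopword list; gensim's STOPWORDS is much longer.
TOY_STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def toy_tokenize(text, min_len=2, max_len=15):
    """Lowercase, split on non-letters, drop very short/long tokens and stopwords."""
    candidates = re.findall(r"[a-z]+", text.lower())
    return [
        tok for tok in candidates
        if min_len <= len(tok) <= max_len and tok not in TOY_STOPWORDS
    ]

tokens = toy_tokenize("The Eagles won the game in the second half!")
print(tokens)  # ['eagles', 'won', 'game', 'second', 'half']
```

Note how "The" and "the" both disappear: lower-casing happens first, so the stopword check catches both versions.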

### Construct the id2word dictionary

The next step in generating a topic model with gensim is to create a dictionary (not to be confused with a Python dictionary) which maps each word to a numerical ID. 

This mapping step is required because almost all models involving text, including this one, work with vectors indexed by integers, not by strings. Also, many functions need to know the vector/matrix dimensionality in advance.
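
Here is a hand-rolled sketch of that idea, using made-up documents and hypothetical names (`toy_docs`, `token2id`, `to_bow`). It assigns each new word the next unused integer and then turns a document into (word ID, count) pairs, which is the "bag-of-words" format used later in this notebook. The IDs gensim assigns to real words may come out in a different order.

```python
from collections import Counter

toy_docs = [
    ["team", "game", "team"],
    ["song", "album", "song", "team"],
]

# Assign each new word the next unused integer ID.
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc):
    """Convert a token list into sorted (word_id, count) pairs, skipping unknown words."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(token2id)             # {'team': 0, 'game': 1, 'song': 2, 'album': 3}
print(to_bow(toy_docs[1]))  # [(0, 1), (2, 2), (3, 1)]
```

The number of entries in `token2id` is the vector dimensionality mentioned above: every document can be described by a vector with one slot per word ID.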

The mapping can be constructed automatically by giving gensim's Dictionary class a so-called "stream" of tokenized documents, as in the cells below:


In [None]:
# set up the document "stream" for use below 
## NOTE PATH MAY NEED TO CHANGE DEPENDING ON THE RELATIVE LOCATION OF YOUR CORPUS
stream = iter_docs('../corpora/emory-wheel/articles/') 

# take a look at what the "stream" looks like for the first five docs
for doc, tokens in itertools.islice(stream, 5):
    print(doc, tokens[:10])  # print the doc title and its first ten tokens

Now we'll have gensim construct the id2word dictionary on the basis of the stream we've set up:

In [None]:
# code to create id2word dictionary from the gensim documentation

# set up the stream (again)
stream = iter_docs('../corpora/emory-wheel/articles/') 

# get the tokens from the stream
doc_stream = (tokens for _, tokens in stream)
              
# send to the dictionary constructor 
id2word_wheel = gensim.corpora.Dictionary(doc_stream) 

# print out a sign that it's done
id2word_wheel

### Mapping tokens to ID numbers

The gensim dictionary (id2word_wheel, which I've named for the Emory Wheel data it contains) now contains all words that appeared in the corpus, along with how many times they appeared. 

gensim provides a handy attribute for mapping tokens to their ID numbers, not unlike the scikit-learn vectorizer. Let's take a look:

In [None]:
id2word_wheel.token2id

## Filtering out common words / uncommon words ##

There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter some of the words. 

gensim provides two basic functions for this: one that filters out the top n most frequent words, and another that filters out words that appear in fewer than a certain number of documents, or in more than a certain proportion of them. 

We'll try out both below, even though you may end up not needing / wanting to use both (or either) of these functions for your particular purpose. This is one of the places where experience (and trial and error) will tell you what's appropriate. 

In [None]:
# this line filters out the 10 most frequent words
id2word_wheel.filter_n_most_frequent(10)

# this line filters out words that appear in only 1 doc, keeping the rest
# note that no_below is an absolute number of docs, while no_above is a fraction of the corpus
id2word_wheel.filter_extremes(no_below=2, no_above=1.0)

id2word_wheel

Note that by removing the words that only appeared in a single document, we went from 118,786 unique words (or tokens) to 43,158. That's not a huge number for a topic model, but we'll see how it goes... 
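
To see how the two thresholds behave, here is a small sketch of document-frequency filtering in plain Python. It is an illustration of the idea rather than gensim's implementation: `filter_by_doc_frequency` is a hypothetical helper, the toy documents are invented, and it ignores the other options that gensim's `filter_extremes` offers.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=2, no_above=1.0):
    """Keep words found in at least `no_below` docs and at most a `no_above` fraction of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["emory", "team", "game"],
    ["emory", "song", "album"],
    ["emory", "team", "exam"],
    ["emory", "song", "stress"],
]

kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['song', 'team']
```

Here "emory" is dropped because it appears in every document (more than 75% of them), and "game," "album," "exam," and "stress" are dropped because each appears in only one.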

### Final set-up

We need to do only a few more things in order to run our topic model. 1) Define a special class for working with corpora (again, per the gensim documentation); and 2) use the Corpus class to create a stream of bag-of-words vectors. These two chunks of code are as follows: 

In [None]:
# Defining the Corpus class 
# this is the same for every topic model you create with gensim. 
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
    def __len__(self):
        return self.clip_docs

In [None]:
# Creating our stream of bag-of-words vectors
## MODIFY PATH IF YOU'RE GETTING FILE NOT FOUND ERRORS
wheel_corpus = Corpus('../corpora/emory-wheel/articles/', id2word_wheel) 

wheel_corpus

### Running our topic model! 

At long last, we're ready to run our topic model. Let's do it!

In [None]:
# run the model (using logging since it may take a while)
# note the num_topics and the passes parameters; these are the most important parameters for topic modeling

%time lda_model = gensim.models.LdaModel(wheel_corpus, num_topics=15, id2word=id2word_wheel, passes=5) 

# note that passes should be higher -- usually in the 50-100 range -- 
# but in the interests of time we'll only do 5 

### Helpful functions for saving your model

Because topic models can take a long time to run, it can be helpful to save your model and/or its components so that it can be loaded back in at a later date. Here's how you do those things:

In [None]:
# how to store corpus to disk
from gensim.corpora import MmCorpus
MmCorpus.serialize('./wheel.corpus.mm', wheel_corpus) 

# how to store dictionary to disk
id2word_wheel.save('./wheel.dictionary')

# how to store model to disk 
lda_model.save('./lda_wheel-15topics_5iters.model')

## IF THESE FUNCTIONS DON'T WORK IT'S LIKELY BECAUSE YOU DID NOT CLOSE-AND-HALT THEN REOPEN YOUR NOTEBOOK 
## IN YOUR MY-WORK FOLDER!!!

### A helpful function for loading in a saved model

You can also load in a saved model. 

Here, we're going to load in a (slightly) better topic model of the Emory Wheel with the same number of topics (15), but 50 iterations. I did not filter out any of the top n words, but I did filter out all of the words that appeared in only one article.

In [None]:
# load a saved model; in this case, a topic model of the Emory Wheel with 50 iterations
lda_model = gensim.models.LdaModel.load('../models/lda_wheel-15topics_50iters.model')

## IF YOU GET ERRORS HERE, CHECK YOUR PATH!!!

### Interacting with your model

gensim comes with a bunch of built-in methods that make interacting with the output of the topic model a little easier. Here are some of the most useful:

In [None]:
# show the topics in the format (number of topics to show, number of terms)

# as you can tell already, even the top words in each topic are only a very small proportion
# of that topic, since we are dealing with about 40K unique words

lda_model.show_topics(15, 10)

In [None]:
# let's format the words a little more nicely; 
# the formatted=False parameter returns tuples of (word, probability)

topics = lda_model.show_topics(15, 10, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

### Labeling topics

Now you can see (I hope!) that what we call a "topic" is really just a list of the most probable words for that topic, which are sorted in descending order of probability. The most probable word for the topic is the first word. 

Topic models start to get more powerful when we, as human researchers, analyze the most probable words for every topic and summarize what these words have in common. This summary can then be used as a descriptive label for the topic. 

Remember, since an LDA topic model is an unsupervised algorithm, it doesn't know what these words mean in relationship to one another. It's up to us, as the human researchers, to make meaning out of the topics.

## EXERCISE 1: LABELING TOPICS

How might you label the following topics generated by the model above?

✨Topic 3:✨

`team, emory, game, eagles, said, university, season, second, time, points, senior, win, junior, year, sophomore, teams, freshman, college, run, play, `

In [None]:
# your answer here -- a single word or short phrase is fine

✨Topic 6:✨

`killers, song, album, music, songs, rae, band, stage, concert, performance, crowd, audience, artists, lyrics, sound, dance, makeshift, collapsing, nextyear, weighty, `

In [None]:
# your answer here -- a single word or short phrase is fine

✨Topic 13:✨

`officer, subject, said, april, epd, reported, complainant, emory, case, individual, student, wallet, responded, assigned, investigator, number, second, stage, campus, driver, `



In [None]:
# your answer here -- a single word or short phrase is fine

### Refining the model

These are decent topics, but they're not amazing. Here are a few things you might want to try in order to fine-tune your model:

* Filtering some of the most common words (see the filtering function above)
* Generating fewer topics (we could try 10, for instance). 

Most work with topic modeling involves a fair amount of trial and error before you arrive at an appropriate number of topics and the best ways to filter your corpus. 

Feel free to try those things on your own. 

### Topics and word probabilities

Now let's take a bit of a closer look at the probabilities attached to each word in a single topic. We'll look at topic 13, the one that seems to be about police and crime.

In [None]:
# T13 looks coherent
topic = topics[13]

print("Topic 13")

# the first item in the topic list is the topic number
topic_num = topic[0]

# the next item in the topic list is another list with pairs of words and probabilities
# this is what we want to examine
topic_pairs = topic[1]
for idx, pair in enumerate(topic_pairs):
    print(str(idx) + ". " + pair[0] + ": " + str(pair[1]))

# since every topic contains every word, the probabilities of all the words
# in a single topic should sum to 1


## EXERCISE 2: UNDERSTANDING WORD PROBABILITIES

In a sentence or two, please explain what the output of the cell just above is telling us.

In [None]:
# your answer here

### Documents and topic probabilities / proportions

Another way we can use the output of a topic model is to examine the probabilities of topics in each document. 

While MALLET provides this output automatically, there's a bit more work required to display it in gensim. When we're able to use MALLET later on in the semester, you will appreciate how much easier that is. 

For now, though, we'll do it this way:

In [None]:
tokens = [] 

# open one file
with open('../corpora/emory-wheel/articles/2014-10-02-Atlanta-Food-Truck-Park-Brings-Enriching-Epicurian-Experience.txt', "r") as file:
    text = file.read()
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the Wheel dictionary, created above
doc_bow = id2word_wheel.doc2bow(tokens)

# get the topics that the doc consists of
doc_topics = lda_model.get_document_topics(doc_bow)

# now, format this a bit more nicely so we can understand the output 
for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document")

You will note that this is not a list of all 15 topics. This is because gensim leaves out any topic whose proportion in the document is very, very small: anything below a minimum probability threshold (0.01 by default) is treated as zero and does not appear.
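
The way small topics drop out of the list can be pictured as a simple threshold filter. The sketch below is an assumption-laden illustration, not gensim's code: `visible_topics` is a hypothetical helper, the proportions are invented, and the cutoff of 0.01 is assumed to match gensim's default `minimum_probability`.

```python
def visible_topics(topic_probs, minimum_probability=0.01):
    """Return (topic_id, probability) pairs at or above the threshold."""
    return [
        (topic_id, prob)
        for topic_id, prob in enumerate(topic_probs)
        if prob >= minimum_probability
    ]

# Invented proportions for a five-topic model; they sum to 1.
toy_probs = [0.62, 0.004, 0.30, 0.003, 0.073]
shown = visible_topics(toy_probs)
print(shown)  # [(0, 0.62), (2, 0.3), (4, 0.073)]
```

Topics 1 and 3 are still part of the document's mixture; they are just too small to be reported. If you want to see more of them, `get_document_topics` accepts a `minimum_probability` argument that you can lower.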

Let's try to make this output even more meaningful by also displaying the words associated with each topic.

In [None]:
# cross-reference the topic proportions with the words to get more meaningful output
for topic, prob in doc_topics:
    topic_words = ""
    select_topics = topics[topic]
    
    for pair in select_topics[1]:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document. Top words: " + topic_words)

## EXERCISE 3: EXAMINING THE TOPICAL COMPOSITION OF DOCUMENTS

Copying and modifying the code in the two cells above, print out the topical composition of another document in our corpus. (You may need to take a look at the directory that contains all of the articles to determine the name of a file to load). 

In [None]:
# your code here

Which topic appears in the highest proportion in your document? (In other words, which topic has the highest probability in the document you have selected?)

In [None]:
# your answer here -- can just be the topic number. no need to write code to print it out.

### Evaluating Topics ###

Gensim has several built-in methods for evaluating topics, collected in a class called [CoherenceModel](https://radimrehurek.com/gensim/models/coherencemodel.html). The fastest one to calculate is called u_mass, and in this case, the closer to zero (in either direction, positive or negative), the better the score. 

Let's see how our model performs: 

In [None]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=wheel_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence

As with many metrics, this number is mostly helpful in a relative context. One way people use metrics like this is to generate models with varying numbers of topics and look for the number of topics that yields the best coherence score. 
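
For some intuition about what a score like u_mass is measuring, here is a stripped-down version of the underlying idea: for each pair of a topic's top words, check how often the two words show up in the same document. This is a sketch under assumptions, not gensim's implementation: `umass_coherence` is a hypothetical helper, the documents are invented, and gensim's own calculation may aggregate the pairs differently, so the numbers are not comparable to the output of the cell above.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Sum log((D(earlier, later) + 1) / D(earlier)) over ordered pairs of top words,
    where D(...) counts the documents containing all of the given words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` is always the higher-ranked word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

toy_docs = [
    ["team", "game", "season"],
    ["team", "game", "win"],
    ["song", "album", "band"],
    ["song", "album", "team"],
]

coherent = umass_coherence(["team", "game", "season"], toy_docs)
mixed = umass_coherence(["team", "album", "band"], toy_docs)
print(coherent, mixed)
```

The first word list hangs together (its words tend to share documents), so it scores closer to zero than the second, which mixes sports and music words that rarely appear together.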

With that said, the debate about how best to evaluate topics is far from settled. Here's a review essay by Hanna Wallach et al. that summarizes a few additional methods of evaluation, including some involving humans in the loop: ["Evaluation Methods for Topic Models"](http://dirichlet.net/pdf/wallach09evaluation.pdf).

**OK. That's it. You did it! Time to upload your notebook to Canvas.**