# Topic modeling

Having written [a lot of prose to explain topic modeling](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) elsewhere, I won't repeat myself at length here.

Suffice it to say that this notebook demonstrates an implementation of LDA in python, using the ```gensim``` module.

Topic modeling is an area where sheer compute power starts to matter more than it has in most of our other work, and I don't think ```gensim``` is necessarily the fastest implementation. If you wanted to apply topic modeling to a large corpus, it might be worthwhile figuring out how to use gensim in a "distributed" way, or exploring another implementation, such as [```MALLET.```](http://mallet.cs.umass.edu) MALLET is the most commonly-used implementation in digital humanities, and there's [a good Programming Historian tutorial.](http://programminghistorian.org/lessons/topic-modeling-and-mallet) However, MALLET requires Java, and I wanted to limit the number of installation problems we confront.


In [20]:
import gensim
import os, math
import pandas as pd
import nltk

nltk.download('stopwords')
nltk.download('punkt')
# You may not have the stopwords downloaded yet.
# You can comment this out after it runs once.
        

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tunder/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tunder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Load a corpus

I've provided three corpora: ```tinywikicorpus.csv```, ```smallwikicorpus.csv```, and ```mediumwikicorpus.csv.```

This stuff gets compute-intensive pretty fast, so let's start with the small one. This has 250 Wikipedia pages, each on a separate line of the file -- and only the first 250 words of each page. The tiny corpus has 160 words of 160 pages; the medium corpus has 400 words from 400 pages.

Obviously, this is not a huge corpus! But in real-life applications, you have to distribute topic modeling over multiple cores, and even then it's common to wait several hours for a result. That doesn't adapt very well to a classroom experiment.

In [30]:
# Very simply, reading the corpus from a text file.
# Each page is on a separate line.

relativepath = os.path.join('..', 'data', 'smallwikicorpus.txt')
wikicorpus = []
with open(relativepath, encoding = 'utf-8') as f:
    for line in f:
        wikicorpus.append(line.strip())

### Prepare the corpus for topic modeling

In part this is a simple tokenizing job. We have represented Wikipedia pages as single strings; gensim is going to expect each document to be a *list* of words. So we need to split the document into words.

But in the process of doing that, we also want to get rid of extremely common words, which make a topic model difficult to read and interpret.

To do this, we create a list of "stopwords." We also remove punctuation, and lowercase everything.

In [31]:
from nltk.corpus import stopwords

# We're going to borrow a list of stopwords from nltk.

# This list of "stopwords" removed from the corpus is not
# a trivial, generic decision; your choice of stopwords can
# in practice significantly affect the result. Here's a place where
# the open-ended character of an unsupervised learning algorithm
# becomes tricky.

# stopwords = {'a', 'an', 'the', 'of', 'and', 'in', 'to', 'by', 'on', 'for', 'it', 'at', 'me', 'from', 'with', '.', ','}
# in case you can't access nltk

from nltk.tokenize import word_tokenize
import string

stopped = set(stopwords.words('english'))
punctuation = set(string.punctuation)
stopped = stopped.union(punctuation)

more_stops = {"paul", "john", "jack", "\'s", "nt",
              "``", "\'the", ";", '“', 'pb', "mary", 
              "henry", "arthur", "polly", "alice", 
              "jane", "jean", "michael", "harold",
             "tom", "richard"}
# When you're topic-modeling fiction, personal names
# present a special problem.

stopped = stopped.union(more_stops)
punctuation.add('“')
punctuation.add('”')
punctuation.add('—')

def strip_punctuation(atoken):
    global punctuation
    punct_stripped = ''.join([ch for ch in atoken if ch not in punctuation])
    return punct_stripped

def clean_text(atext):
    global stopped
    clean_version = [strip_punctuation(x) for x in word_tokenize(atext.lower())]
    rejoined = ' '.join(clean_version)
    tokenized = [x for x in word_tokenize(atext.lower()) if not x in stopped]
    return tokenized

clean_corpus = []
for atext in wikicorpus:
    clean_version = clean_text(atext)
    if len(clean_version) > 1:
        clean_corpus.append(clean_version)
    
print("The clean_corpus contains " + str(len(clean_corpus)) + " texts.")

The clean_corpus contains 250 texts.


### Build a dictionary and create the doc-term matrix

The math inside ```gensim``` runs quicker if we know, at the outset, how many words we're dealing with, and represent each word as an integer. So the first stage in building a model is to build a dictionary, which stores words as the values of integer keys.

In [34]:
from gensim import corpora

dictionary = corpora.Dictionary(clean_corpus)
dictionary.filter_extremes(no_below = 4, no_above = 0.11)

# The filter_extremes method allows us to remove words from the dictionary.
# In this case we remove words that occur in fewer than 4 documents, or more
# than 11% of the documents in the corpus. This is, in effect, another
# form of stopwording.

# If you had a much larger corpus, you might increase no_below to 10 or 20.

print('Dictionary made.')
print(len(dictionary), "words.")
print(len(clean_corpus), "documents.")
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_corpus if len(doc) > 1]
print('Doc-term matrix extracted.')


Dictionary made.
1470 words.
250
Doc-term matrix extracted.


In [47]:
# Just to show you what's in the dictionary.

print(dictionary[1069])
print(dictionary[880])

national
climate


In [46]:
# And what our corpus looks like now.
# Each tuple contains a word ID, and the number of occurrences of that word.

print(doc_term_matrix[4])

[(2, 2), (25, 1), (56, 1), (72, 1), (90, 1), (96, 1), (108, 1), (173, 1), (174, 1), (235, 2), (246, 1), (255, 1), (274, 1), (317, 1), (395, 11), (398, 1), (400, 1), (543, 2), (568, 1), (677, 1), (697, 1), (718, 4), (777, 1), (819, 1), (828, 1), (853, 1), (861, 1), (880, 2), (882, 3), (925, 1), (999, 1), (1054, 1), (1069, 8), (1099, 1), (1132, 1), (1137, 1), (1153, 1), (1161, 2), (1190, 1), (1203, 1), (1236, 1), (1256, 1), (1276, 2), (1299, 1), (1316, 1), (1357, 2), (1359, 1), (1361, 1), (1376, 1), (1397, 1), (1424, 1)]


### Actually running LDA

The first line here creates an LDA-modeling demon.
The second line asks the demon to create a model of our corpus.

```num_topics``` and ```passes``` are both parameters you may want to fiddle with. Sixteen topics is a pretty small number. In a larger corpus that would be increased. For our medium corpus, you might try 20 or 25. As with clustering, there are strategies that can attempt to optimize the "right" number, but this is in reality a matter of judgement.

```passes``` sets the number of iterations. More is better, up to a thousand or so. But for a classroom experiment, we probably don't want to go over 200.

In [48]:
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = 16, id2word = dictionary, passes = 50)

In [51]:
def pretty_print_topics(topiclist):
    for topicnum, topic in topiclist:
        cleanwords = []
        pieces = topic.split(' + ')
        for p in pieces:
            numword = p.split('*')
            word = numword[1].strip('"')
            cleanwords.append(word)
        print(topicnum, ' '.join(cleanwords))

pretty_print_topics(ldamodel.print_topics(num_topics=16, num_words=10))

0 team debut club professional league de 2011 football january season
1 set product speech often usually form tour due 2 left
2 society massachusetts soviet local england india vehicle state korean december
3 design air mobile aircraft designed community douglas operating engine us
4 music lake rock game popular great 1960s games royal century
5 class working species land congress orchids often president code white
6 video game format steel football media research includes audio players
7 album band song studio songs music record track single records
8 film housing public garden scheme hong father kong built complex
9 language state driver effect chart ohio places condition saint college
10 bay police gold valley white river south north became de
11 soviet union french government town local german forces william cases
12 system water b length japanese stay business book event government
13 park national area university berlin open group organisation services million
14 film rock series

Is that impressive? Probably not. The value of topic modeling depends heavily on the size of the corpus, and we are deliberately using small corpora to avoid frying your laptops.

If it ran quickly enough you might try increasing the number of iterations to 200. See if those topics seem to make more sense. If *that* runs quickly enough, you might try loading the mediumwikicorpus.csv, to see if you get even more interpretable topics. But it will probably take 10-15 minutes to run, at a minimum.

#### Other things you can do

One of the nice things about the gensim module is that it allows you to update an existing model; you can even add documents to the corpus and update the model.

In addition to getting the top words for a given topic (topic distribution across terms), you can get the distribution of a document across topics, or the distribution of a word across topics. For more on these options, see [the documentation.](https://radimrehurek.com/gensim/models/ldamodel.html)

In [50]:
ldamodel.update(doc_term_matrix, iterations = 50)

In [55]:
ldamodel.get_document_topics(doc_term_matrix[6])

[(0, 0.40098275226084501), (1, 0.4694650330026649), (7, 0.11742533925777134)]

In [58]:
ldamodel.get_term_topics('rock')

[(4, 0.015614891031639226), (14, 0.013205793321285565)]