# Topic modeling

Having written [a lot of prose to explain topic modeling](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) elsewhere, I won't repeat myself at length here.

Suffice it to say that this notebook demonstrates an implementation of LDA in Python, using the ```gensim``` module.

Topic modeling is an area where sheer compute power starts to matter more than it has in most of our other work, and I don't think ```gensim``` is necessarily the fastest implementation. If you wanted to apply topic modeling to a large corpus, it might be worthwhile figuring out how to use gensim in a "distributed" way, or exploring another implementation, such as [```MALLET```](http://mallet.cs.umass.edu). MALLET is the most commonly used implementation in digital humanities, and there's [a good Programming Historian tutorial](http://programminghistorian.org/lessons/topic-modeling-and-mallet). However, MALLET requires Java, and I wanted to limit the number of installation problems we confront.


In [2]:
import gensim
import os, math
import pandas as pd
import nltk

# You may not have the stopwords and tokenizer data downloaded yet.
# You can comment this out after it runs once.
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rmorriss/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/rmorriss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load a corpus

I've provided three corpora: ```tinywikicorpus.csv```, ```smallwikicorpus.csv```, and ```mediumwikicorpus.csv```.

This stuff gets compute-intensive pretty fast, so let's start with the small one. It has 250 Wikipedia pages, each on a separate line of the file, and only the first 250 words of each page. The tiny corpus has the first 160 words of each of 160 pages; the medium corpus has the first 400 words of each of 400 pages.

Obviously, this is not a huge corpus! But in real-life applications, you have to distribute topic modeling over multiple cores, and even then it's common to wait several hours for a result. That doesn't adapt very well to a classroom experiment.

In [11]:
# Very simply, reading the corpus from a text file.
# Each page is on a separate line.

relativepath = os.path.join('..', 'data', 'smallwikicorpus.txt')
wikicorpus = []
with open(relativepath, encoding = 'utf-8') as f:
    for line in f:
        wikicorpus.append(line.strip())

print(wikicorpus[0])

Research Design and Standards Organization The Research Design and Standards Organisation (RDSO) is an ISO 9001 research and development organisation under the Ministry of Railways of India, which functions as a technical adviser and consultant to the Railway Board, the Zonal Railways, the Railway Production Units, RITES and IRCON International in respect of design and standardisation of railway equipment and problems related to railway construction, operation and maintenance. History. To enforce standardisation and co-ordination between various railway systems in British India, the Indian Railway Conference Association (IRCA) was set up in 1903. It was followed by the establishment of the Central Standards Office (CSO) in 1930, for preparation of designs, standards and specifications. However, till independence in 1947, most of the designs and manufacture of railway equipments was entrusted to foreign consultants. After independence, a new organisation called Railway Testing and Resea

In [55]:
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)
poefic.head()
# fictioncorpus = [' '.join(x.split()[0:1200]) for x in poefic.text[0:200]]
fictioncorpus = [' '.join(x.split()[0:1200]) for x in poefic.text]
print(len(fictioncorpus))

1027


### Prepare the corpus for topic modeling

In part this is a simple tokenizing job. We have represented Wikipedia pages as single strings; gensim is going to expect each document to be a *list* of words. So we need to split the document into words.

But in the process of doing that, we also want to get rid of extremely common words, which make a topic model difficult to read and interpret.

To do this, we create a list of "stopwords." We also remove punctuation, and lowercase everything.

In [47]:
from nltk.corpus import stopwords

# We're going to borrow a list of stopwords from nltk.

# This list of "stopwords" removed from the corpus is not
# a trivial, generic decision; your choice of stopwords can
# in practice significantly affect the result. Here's a place where
# the open-ended character of an unsupervised learning algorithm
# becomes tricky.

# stopwords = {'a', 'an', 'the', 'of', 'and', 'in', 'to', 'by', 'on', 'for', 'it', 'at', 'me', 'from', 'with', '.', ','}
# in case you can't access nltk

from nltk.tokenize import word_tokenize
import string

stopped = set(stopwords.words('english'))
punctuation = set(string.punctuation)
stopped = stopped.union(punctuation)

more_stops = {"paul", "john", "jack", "\'s", "nt",
              "``", "\'the", ";", '“', 'pb', "mary", 
              "henry", "arthur", "polly", "alice", 
              "jane", "jean", "michael", "harold",
             "tom", "richard", "<pb>"}
# When you're topic-modeling fiction, personal names
# present a special problem.

stopped = stopped.union(more_stops)
punctuation.add('“')
punctuation.add('”')
punctuation.add('—')

def strip_punctuation(atoken):
    # Drop every character that appears in the punctuation set.
    return ''.join([ch for ch in atoken if ch not in punctuation])

def clean_text(atext):
    # First pass: lowercase, tokenize, and strip punctuation from each token.
    clean_version = [strip_punctuation(x) for x in word_tokenize(atext.lower())]
    # Second pass: rejoin and re-tokenize, which discards tokens that were
    # emptied by punctuation stripping, then remove stopwords.
    rejoined = ' '.join(clean_version)
    tokenized = [x for x in word_tokenize(rejoined) if x not in stopped]
    return tokenized

clean_corpus = []
for atext in fictioncorpus:
    clean_version = clean_text(atext)
    if len(clean_version) > 1:
        clean_corpus.append(clean_version)
    
print("The clean_corpus contains " + str(len(clean_corpus)) + " texts.")

The clean_corpus contains 200 texts.
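
The commented-out stopword set above hints at a fallback for anyone who can't access nltk. A minimal sketch of that fallback, using only the standard library, might look like the following. The names `FALLBACK_STOPWORDS` and `simple_clean_text` are hypothetical, and the regex tokenizer is a rough stand-in for `word_tokenize`, not an equivalent: it keeps contractions like "don't" whole and drops digits and non-ASCII letters.

```python
import re

# A tiny, hand-picked stopword set; a stand-in for nltk's English list.
FALLBACK_STOPWORDS = {'a', 'an', 'the', 'of', 'and', 'in', 'to', 'by',
                      'on', 'for', 'it', 'at', 'me', 'from', 'with'}

def simple_clean_text(atext, stopwords=FALLBACK_STOPWORDS):
    # Lowercase, then keep only runs of ASCII letters (allowing internal
    # apostrophes). This drops punctuation and digits in one step.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", atext.lower())
    # Remove stopwords, keeping the remaining tokens in order.
    return [t for t in tokens if t not in stopwords]

sample = "The doctor, it seems, walked to the castle by the river."
print(simple_clean_text(sample))
```

Because the stopword list is so short, a model built this way will probably surface more function words than one built with nltk's list, which is another reminder that the choice of stopwords shapes the result.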


### Build a dictionary and create the doc-term matrix

The math inside ```gensim``` runs faster if we know, at the outset, how many words we're dealing with, and represent each word as an integer. So the first stage in building a model is to build a dictionary, which maps each integer ID to a word.
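
To make the idea concrete, here is a rough pure-Python sketch of what a dictionary and a bag-of-words representation amount to. It is an illustration, not gensim's implementation: the toy documents and the names `word_to_id`, `id_to_word`, and `to_bow` are invented here, and gensim may assign IDs in a different order.

```python
from collections import Counter

# Two toy documents, already tokenized (made up for illustration).
docs = [['ship', 'captain', 'sail', 'ship'],
        ['captain', 'castle', 'garden']]

# Assign each distinct word an integer ID, in order of first appearance.
word_to_id = {}
for doc in docs:
    for word in doc:
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)

# The reverse mapping: integer keys, words as values.
id_to_word = {i: w for w, i in word_to_id.items()}

def to_bow(doc):
    # Count occurrences of each known word and return sorted
    # (word ID, count) tuples; unknown words are skipped.
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())

print(id_to_word)
print([to_bow(doc) for doc in docs])
```

Word order is thrown away in this representation; only the counts survive, which is all that LDA uses.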

In [48]:
from gensim import corpora

dictionary = corpora.Dictionary(clean_corpus)
dictionary.filter_extremes(no_below = 4, no_above = 0.11)

# The filter_extremes method allows us to remove words from the dictionary.
# In this case we remove words that occur in fewer than 4 documents, or more
# than 11% of the documents in the corpus. This is, in effect, another
# form of stopwording.

# If you had a much larger corpus, you might increase no_below to 10 or 20.

print('Dictionary made.')
print(len(dictionary), "words.")
print(len(clean_corpus), "documents.")
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_corpus if len(doc) > 1]
print('Doc-term matrix extracted.')


Dictionary made.
3555 words.
200 documents.
Doc-term matrix extracted.
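
The filtering rule described in the comments above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not gensim's code: the function name `filter_by_doc_frequency` and the toy documents are made up, it ignores gensim's cap on total vocabulary size (`keep_n`), and gensim's rounding at the upper boundary may differ.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    # Document frequency: the number of documents containing each word.
    # set(doc) ensures a word is counted once per document.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # no_above is a fraction of the corpus; convert it to a document count.
    max_docs = no_above * len(docs)
    # Keep words that are neither too rare nor too common.
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [['ship', 'captain', 'sea'],
        ['ship', 'castle', 'sea'],
        ['ship', 'garden', 'sea'],
        ['ship', 'garden', 'storm']]

kept = filter_by_doc_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy corpus "ship" is dropped because it appears in every document, and "captain", "castle", and "storm" are dropped because each appears in only one.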


In [49]:
# Just to show you what's in the dictionary.

print(dictionary[1069])
print(dictionary[880])

remembered
notes


In [50]:
# And what our corpus looks like now.
# Each tuple contains a word ID, and the number of occurrences of that word.

print(doc_term_matrix[0])

[(22, 1), (34, 1), (55, 1), (81, 1), (116, 2), (119, 1), (145, 1), (198, 1), (205, 1), (245, 1), (285, 1), (297, 1), (304, 1), (353, 1), (356, 1), (370, 1), (397, 1), (426, 1), (434, 3), (452, 1), (471, 1), (498, 1), (499, 1), (508, 1), (538, 1), (541, 1), (546, 1), (563, 1), (573, 1), (606, 1), (643, 1), (716, 1), (722, 1), (724, 1), (727, 2), (738, 1), (748, 1), (803, 1), (806, 1), (820, 1), (834, 1), (840, 1), (841, 1), (853, 1), (902, 1), (931, 1), (937, 1), (947, 1), (957, 1), (959, 1), (985, 1), (987, 1), (991, 2), (1017, 1), (1019, 1), (1021, 1), (1065, 1), (1089, 1), (1157, 1), (1165, 1), (1177, 1), (1178, 1), (1182, 2), (1237, 1), (1298, 1), (1348, 1), (1393, 1), (1444, 1), (1453, 1), (1503, 1), (1514, 1), (1539, 1), (1584, 2), (1622, 1), (1657, 5), (1676, 1), (1682, 1), (1709, 1), (1750, 1), (1768, 1), (1820, 1), (1866, 1), (1889, 1), (1900, 1), (1972, 1), (1973, 1), (2020, 1), (2025, 1), (2095, 1), (2123, 1), (2135, 1), (2174, 1), (2256, 1), (2258, 1), (2270, 1), (2294, 1), 

### Actually running LDA

The first line here gives us a short name for gensim's LDA model class.
The second line uses that class to build a model of our corpus.

```num_topics``` and ```passes``` are both parameters you may want to fiddle with. Sixteen topics is a pretty small number. In a larger corpus that would be increased. For our medium corpus, you might try 20 or 25. As with clustering, there are strategies that can attempt to optimize the "right" number, but this is in reality a matter of judgement.

```passes``` sets the number of times the algorithm works through the whole corpus during training. More is better, up to a thousand or so. But for a classroom experiment, we probably don't want to go over 200.

In [51]:
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = 16, id2word = dictionary, passes = 150)

In [52]:
def pretty_print_topics(topiclist):
    for topicnum, topic in topiclist:
        cleanwords = []
        pieces = topic.split(' + ')
        for p in pieces:
            numword = p.split('*')
            word = numword[1].strip('"')
            cleanwords.append(word)
        print(topicnum, ' '.join(cleanwords))

pretty_print_topics(ldamodel.print_topics(num_topics=16, num_words=10))

0 joan th visitors rain passage bell accident ha bridge cheeks
1 doctor count ye valley cat mortal particular hills plain snow
2 moral boys « » theatre courage society liberty crowd passion
3 violet aunt miller squire natural maam coal safe pain bitter
4 prince bill princess captain ship boats banks lord sail shot
5 dick aunt ball creek boys francis molly marriage captain dr
6 king lord witness salt horses fifteen wound sarah bull counsel
7 music gate diana dinner dull lovers tender blind indian baby
8 ter african slave pain dat major hell ship de capable
9 de madame garden bob bread news carry article wrath german
10 violet study colonel castle doctor william smith portrait beach rome
11 clay madame glass river robert crowd cab hall forest carriage
12 falls sin sins marry evil dan angel cousin modern smith
13 lord myles uncle major gate promise walls betty nice paused
14 french battle enemy thou race village city hill army houses
15 beauty mark jerome cottage sam sunshine ghost anger 

Is that impressive? Probably not. The value of topic modeling depends heavily on the size of the corpus, and we are deliberately using small corpora to avoid frying your laptops.

If it ran quickly enough, you might try increasing ```passes``` to 200. See if those topics seem to make more sense. If *that* runs quickly enough, you might try loading ```mediumwikicorpus.csv```, to see if you get even more interpretable topics. But it will probably take 10-15 minutes to run, at a minimum.
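
The `pretty_print_topics` function above throws away the weight attached to each word. If you want to keep the weights, one option is to parse the topic strings with a regular expression. The sketch below assumes the strings look like `0.012*"ship" + 0.009*"captain"`; the example topics and their numbers are made up, and `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Made-up examples in the format that print_topics returns:
# a topic number paired with a string of weight*"word" terms.
example_topics = [
    (0, '0.012*"ship" + 0.009*"captain" + 0.007*"sail"'),
    (1, '0.015*"garden" + 0.011*"castle" + 0.006*"gate"'),
]

# Capture the weight before the asterisk and the quoted word after it.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return a list of (word, weight) pairs from one topic string."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

for topicnum, topic_string in example_topics:
    print(topicnum, parse_topic(topic_string))
```

Keeping the weights makes it easier to see whether a topic is dominated by one or two words or spread evenly across many.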

#### Other things you can do

One of the nice things about the gensim module is that it allows you to update an existing model; you can even add documents to the corpus and update the model.

In addition to getting the top words for a given topic (the topic's distribution across terms), you can get the distribution of a document across topics, or the distribution of a word across topics. For more on these options, see [the documentation](https://radimrehurek.com/gensim/models/ldamodel.html).

In [50]:
ldamodel.update(doc_term_matrix, iterations = 50)

In [55]:
ldamodel.get_document_topics(doc_term_matrix[6])

[(0, 0.40098275226084501), (1, 0.4694650330026649), (7, 0.11742533925777134)]
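
A short sketch of how you might read that output: each tuple pairs a topic number with the share of the document assigned to that topic. The shares listed do not quite add up to 1, because by default gensim leaves out topics whose share falls below a small threshold (the `minimum_probability` setting). The variable names below are just for illustration.

```python
# The output printed above for document 6: (topic number, share) pairs.
doc_topics = [(0, 0.40098275226084501),
              (1, 0.4694650330026649),
              (7, 0.11742533925777134)]

# The dominant topic is the pair with the largest share.
dominant_topic, dominant_share = max(doc_topics, key=lambda pair: pair[1])

# Whatever is not reported is spread over the topics that were left out.
reported = sum(share for _, share in doc_topics)

print("Dominant topic:", dominant_topic)
print("Share of dominant topic:", round(dominant_share, 3))
print("Share spread over unlisted topics:", round(1 - reported, 3))
```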

In [58]:
ldamodel.get_term_topics('rock')

[(4, 0.015614891031639226), (14, 0.013205793321285565)]