## Topic Modeling ##

*This lesson draws on blog posts by [Ted Underwood](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) and [Matthew Jockers](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), this [video of a talk by David Mimno](https://vimeo.com/53080123), and [this notebook](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html) by Radim Rehurek.*

Let's take a look at this feature, "[30 Years of American Anxieties](https://pudding.cool/2018/11/dearabby/)," which both Zoey and Derek talked about in their discussion posts. 

It looks at 20,000 letters published in the "Dear Abby" advice column over the course of roughly 30 years in order to determine the topics that recur year in and year out.

It also segmented these topics by subject: husbands, wives, sons, daughters, friends, and bosses.

So how did they do it?

Here's what they say:

*After initially exploring the corpus, we began to identify a number of common themes which Dear Abby’s readers frequently brought up, and decided to focus on three: sex, LGBTQ issues, and religion. For each relevant issue, we created a list of relevant keywords for each issue and used those to first create a broad grouping of question, before breaking them down into categories.*

This is a little vague, but it sounds like they created their "list of relevant keywords for each issue" by hand. It sounds like a lot of work!

As it turns out, there is an automated method for extracting the topics in a large set of documents like this: **topic modeling.** 

The end result is very much the same as the Dear Abby feature--a set of keywords associated with a set of topics--but the method by which those keywords are determined is quite different.

But how is it different?

For one, it's *unsupervised*. 

This means that a person does not tell the model what to look for in advance. 

(The process employed by the Dear Abby team sounds like it was *supervised*, since they created the list of keywords themselves.) 

So now let's compare the Dear Abby project to Cameron Blevins's [topic model of Martha Ballard's diary](http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/).

How is it different? 

Blevins didn't know what the diary contained. He didn't read it first. Instead, he used a topic model to extract a set of topics, and then further analyzed them. 

Or, for another example, consider the project that Alex talked about in his discussion post for today.

Today we're going to use topic modeling to create a topic model of the Colored Conventions Corpus. 

### But first, a history lesson... ###

Topic modeling began as a US military project in the 1990s. The goal was to automatically detect changes in newswire text so that governmental and military organizations could be alerted to emerging geopolitical events. (For more on this history, see [Binder](https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18).)


In the early 2000s, a team of computer science researchers released [MALLET](http://mallet.cs.umass.edu/topics.php), a software toolkit for generating topic models (among other document classification and clustering techniques), and so the technique began to see more mainstream use. 

Then, in the early 2010s, a computer science PhD student released the first version of [gensim](https://radimrehurek.com/gensim/about.html), which is what we'll be using today. It's a Python library for topic modeling, and its ease of use has made the use of topic models explode. 

### So how does probabilistic topic modeling work anyway? ###

Here's what David Blei has to say:

*Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.* 

As an aside, his ACM paper, "[Probabilistic Topic Models](https://mimno.infosci.cornell.edu/info6150/readings/Blei2012.pdf)," is another great example of technical writing that is both incredibly informative and clear. 


Probabilistic topic models begin with an assumption and a definition. 

The assumption: all documents contain a mixture of different topics. 

The definition: a topic is a collection of words, each with a different probability of occurrence in a particular document (or other chunk of text) discussing that topic. 




Here's a nice illustration, created by Ted Underwood, that shows this assumed relationship between topics and documents. 

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

Above we see an example of the basic assumption of topic modeling: one topic might contain many occurrences of “organize,” “committee,” “direct,” and “lead.” Another might contain a lot of “mercury” and “arsenic,” with a few occurrences of “lead.” 

The three documents are assumed to contain both topics in different proportions.
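To make that assumption concrete, here is a minimal sketch that runs it forward: it "writes" a document by repeatedly picking a topic according to the document's mixture and then picking a word from that topic. The two topics, their word probabilities, and the function name `generate_document` are all made up for illustration; they loosely echo the picture above and are not taken from any real model.

```python
import random

# two hypothetical topics; each maps words to made-up probabilities
TOPICS = {
    "leadership": {"organize": 0.3, "committee": 0.3, "direct": 0.2, "lead": 0.2},
    "heavy metals": {"mercury": 0.45, "arsenic": 0.45, "lead": 0.10},
}

def generate_document(topic_mix, length, seed=0):
    """Generate `length` words from a mixture of the topics above."""
    rng = random.Random(seed)
    topic_names = list(topic_mix)
    words = []
    for _ in range(length):
        # step 1: pick a topic according to this document's proportions
        topic = rng.choices(topic_names, weights=[topic_mix[t] for t in topic_names])[0]
        # step 2: pick a word according to that topic's probabilities
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

# a document that is mostly about heavy metals, with a little leadership mixed in
doc = generate_document({"leadership": 0.2, "heavy metals": 0.8}, length=12)
print(doc)
```

Note that "lead" can come out of either topic, which is exactly the ambiguity that topic modeling has to untangle when it works in the other direction.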

But here is the thing: we can’t directly observe topics. All we actually have are the documents that attest to their existence. So in other words:

**Topic modeling is a way of extrapolating backward from a collection of documents to infer the topics that could have generated them.** 

There is simply no way to infer the exact topics in a set of documents; there are too many unknowns. So (probabilistic) topic modeling works backwards. It pretends that the problem is mostly solved. 

**How does this play out in actual life?**

Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic 1 or topic 2?

![topics and docs](https://tedunderwood.files.wordpress.com/2012/04/shapeart.png)

We can’t know for sure. But one way to guess is to consider two questions. This is the first: 

* How often does “lead” appear in topic 1 elsewhere? If “lead” often occurs in discussions of 1, then this instance of “lead” might belong to 1 as well. 

But a word can be common in more than one topic, as it is in topics 1 and 2 above. And we don’t want to assign “lead” to a topic about leadership (topic 1) if this document is mostly about heavy metal contamination (topic 2). So we also need to consider a second question:

* How common is topic 1 in the rest of the document?

To answer these questions, here’s what we’ll do:

For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. Here’s the actual formula:

![LDA formula](https://tedunderwood.files.wordpress.com/2012/04/ldaformula.png)

There are also a few Greek letters scattered in there, but they aren’t important for our purposes. Technically, they’re called “hyperparameters,” but you can think of them simply as fudge factors. 

In other words: there’s some chance that this word belongs to topic Z even if it is nowhere else associated with Z; the fudge factors keep that possibility open. (If you want to understand hyperparameters beyond the "fudge factor" explanation, see "[Rethinking LDA: Why Priors Matter](http://people.cs.umass.edu/~mimno/publications.html).")

The overall emphasis on probability in this technique, of course, is why it’s called *probabilistic topic modeling*.
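If it helps to see the multiplication spelled out, here is a rough sketch of that calculation for a single occurrence of "lead." All of the counts are invented, and the values of `alpha` and `beta` (the fudge factors) are arbitrary choices for illustration, not defaults from any library.

```python
def topic_probabilities(word_in_topic, topic_totals, doc_topic_counts,
                        vocab_size, alpha=0.1, beta=0.01):
    """Probability that one word token belongs to each topic Z.

    word_in_topic[z]    -- how often this word type W is assigned to Z elsewhere
    topic_totals[z]     -- total number of word tokens assigned to Z
    doc_topic_counts[z] -- other words in document D already assigned to Z
    """
    weights = []
    for z in range(len(topic_totals)):
        # how common is W in topic Z? (beta keeps this above zero)
        word_given_topic = (word_in_topic[z] + beta) / (topic_totals[z] + vocab_size * beta)
        # how common is topic Z in document D? (alpha keeps this above zero)
        topic_in_doc = doc_topic_counts[z] + alpha
        weights.append(word_given_topic * topic_in_doc)
    total = sum(weights)
    return [w / total for w in weights]  # rescale so the values sum to 1

# "lead" is a bit more common in topic 1 (index 0) overall,
# but this document is mostly about topic 2 (index 1)
probs = topic_probabilities(word_in_topic=[40, 30], topic_totals=[1000, 1000],
                            doc_topic_counts=[2, 18], vocab_size=500)
print(probs)
```

Even though "lead" shows up slightly more often in topic 1 across the collection, the document's own leaning toward topic 2 wins out here, which is the point of asking both questions.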

### Enter Sampling ###

Now, suppose that instead of having the problem mostly solved, we had only a wild guess which word belonged to which topic. We could still use the strategy I've just described to improve our guess, by making it more internally consistent. 

We could go through the collection, word by word, and reassign each word to a topic, guided by the formula above. 

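And in fact, that's what LDA actually does, at least when it is fit with Gibbs sampling, as it is in MALLET. (gensim's LdaModel gets to a similar result by a different procedure, online variational Bayes, but the intuition carries over.)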

(LDA is the most commonly used algorithm for topic modeling. It stands for Latent Dirichlet Allocation.) 

In any case, as we do that, two things happen:

1) Words will gradually become more common in topics where they are already common. And also,

2) Topics will become more common in documents where they are already common. 

Thus our model will gradually become more consistent as topics focus on specific words and documents. But it can’t ever become perfectly consistent, because words and documents don’t line up in one-to-one fashion. So the tendency for topics to concentrate on particular words and documents will eventually be limited by the actual, messy distribution of words across documents.

That’s how topic modeling works in practice. You assign words to topics randomly and then just keep improving the model, to make your guess more internally consistent, until the model reaches an equilibrium that is as consistent as the collection allows.
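Here is a toy version of that loop, written from scratch so you can see every step. Treat it as a sketch for building intuition, not as what gensim runs: the function name `toy_gibbs_lda`, the three tiny documents, and the values of `alpha` and `beta` are all assumptions made for this example.

```python
import random

def toy_gibbs_lda(docs, num_topics=2, sweeps=100, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # start from a wild guess: a random topic for every word token
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # count tables that summarize the current guess
    word_topic = {}                                # word -> count per topic
    topic_total = [0] * num_topics                 # tokens per topic
    doc_topic = [[0] * num_topics for _ in docs]   # tokens per topic, per doc
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            z = assignments[d][i]
            word_topic.setdefault(word, [0] * num_topics)[z] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # pretend we don't know this one token's topic
                old = assignments[d][i]
                word_topic[word][old] -= 1
                topic_total[old] -= 1
                doc_topic[d][old] -= 1

                # score each topic: (word in topic) x (topic in document)
                weights = [
                    (word_topic[word][z] + beta) / (topic_total[z] + vocab_size * beta)
                    * (doc_topic[d][z] + alpha)
                    for z in range(num_topics)
                ]
                # reassign the token by sampling in proportion to those scores
                new = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = new
                word_topic[word][new] += 1
                topic_total[new] += 1
                doc_topic[d][new] += 1

    return assignments, doc_topic

docs = [
    ["organize", "committee", "lead", "direct", "committee"],
    ["mercury", "arsenic", "lead", "lead", "mercury"],
    ["committee", "organize", "lead", "arsenic"],
]
assignments, doc_topic = toy_gibbs_lda(docs)
for doc, counts in zip(docs, doc_topic):
    print(counts, doc)
```

Because the reassignment is random, the result depends on the seed, and with a corpus this tiny the topics may or may not separate cleanly. Try changing `seed` or `sweeps` and watch the per-document counts move.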

For a slightly more in-depth explanation of how LDA works, see [this video](https://vimeo.com/53080123). (Start around 5:35.) 

### Let's do it! ###

In [1]:
# import and setup modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging 
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

# a helpful function that returns the first `n` elements of the stream as plain list.
# we'll use this later
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [2]:
# import some more modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

If you haven't already, please download and unzip the Colored Conventions Corpus: [link](http://coloredconventions.org/intro-corpus).

### Tokenizing ###

Many NLP tasks require that you first tokenize your corpus. We actually already tokenized something when we chunked our song lyrics by line. 

Here is [another example of tokenizing](https://programminghistorian.org/en/lessons/sentiment-analysis) that uses nltk to tokenize a document by sentence instead. (Scroll down to where it discusses the nltk word_tokenize module). **ATTENTION! This may be helpful to you for your next homework!**

Here, however, we're going to write our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together. 

In [3]:
# here's some nice dense python for you
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
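
If you'd like to see roughly what that one-liner is doing without installing anything, here is a stand-in that uses only the standard library. It is an approximation, not gensim's code: `rough_tokenize` and `TINY_STOPWORDS` are names invented for this sketch, the stopword set is a tiny hand-picked sample (gensim's STOPWORDS list is much longer), and the length limits mirror simple_preprocess's defaults of 2 and 15 characters.

```python
import re

# a tiny, hand-picked stopword set for illustration only
TINY_STOPWORDS = {"the", "of", "and", "a", "in", "to", "was"}

def rough_tokenize(text, min_len=2, max_len=15):
    # lowercase first, so "Convention" and "convention" count as the same word
    words = re.findall(r"[a-z]+", text.lower())
    # drop very short or very long tokens, then drop stopwords
    return [w for w in words
            if min_len <= len(w) <= max_len and w not in TINY_STOPWORDS]

print(rough_tokenize("The Convention of Colored Men was held in Gainesville."))
# → ['convention', 'colored', 'men', 'held', 'gainesville']
```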

### Further pre-processing our corpus ###

This is the other necessary step before running a topic model. You need to write a function that iterates through your corpus and returns each document in the format (title, tokens). 

**Side note: we've not yet discussed tuples.** Tuples exist in many programming languages. For our purposes, just know that tuples are sequences of objects, just like lists, but they cannot be changed once they are created. In Python, you indicate a tuple with parentheses. 
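
A quick illustration, using a filename from the corpus and a shortened, made-up token list:

```python
# a tuple holding a title and a list of tokens
doc = ("1873.NJ-06.03.NEWB.ART.01.txt", ["jersey", "convention", "received"])

# "unpacking" pulls the pieces back out into separate variables
title, tokens = doc
print(title)
print(tokens)

# unlike a list, a tuple won't let you swap out one of its items
try:
    doc[0] = "renamed.txt"
except TypeError as err:
    print("can't change a tuple:", err)
```

This unpacking pattern (`title, tokens = ...`) is the one we'll lean on when looping over the corpus.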

In any case, we will want a pre-processing function like this:

In [4]:
# A function to yield each doc in CCP Corpus as a `(filename, tokens)` tuple.

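def iter_docs(base_dir):
    # note: os.listdir returns file names in arbitrary order
    docs = os.listdir(base_dir)

    for doc in docs:
        # skip hidden files such as .DS_Store
        if not doc.startswith('.'):
            # os.path.join works whether or not base_dir ends with a slash
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()
                tokens = tokenize(text) 
        
                yield doc, tokens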

In [5]:
# set up the stream for later processing 
stream = iter_docs('./2019-09-ccp-corpus-0.3/ccprecords/')

# while we're at it, take a look at what this looks like for the first five docs
for doc, tokens in itertools.islice(stream, 5):
    print(doc, tokens[:10])  # print the doc title and its first ten tokens

1873.NJ-06.03.NEWB.ART.01.txt ['jersey', 'convention', 'received', 'new', 'citizens', 'state', 'convention', 'new', 'jersey', 'meet']
1884.FL-02.05.GAIN.MIN.01.txt ['proceedings', 'state', 'conference', 'colored', 'men', 'florida', 'held', 'gainesville', 'february', 'address']
1855.CT-04-18.HART.ART.01.txt ['colored', 'men', 'convention', 'according', 'adjournment', 'september', 'colored', 'men', 'connecticut', 'assembled']
1868.IA-02.12.DESM.MIN.01.txt ['proceedings', 'iowa', 'state', 'colored', 'convention', 'held', 'city', 'des', 'moines', 'february']
1872.NE-04.13.OMAH.ART.01.txt ['colored', 'convention', 'following', 'omalia', 'nebraska', 'paper', 'nebraska', 'send', 'delegates', 'new']


The next step is to create a Dictionary (not to be confused with a Python dictionary) which maps each word to a numerical ID. 

This mapping step is required because most algorithms, including gensim's implementation of LDA, rely on numerical libraries that work with vectors indexed by integers, not by strings. Also, many need to know the vector/matrix dimensionality in advance.

The mapping can be constructed automatically by giving gensim's Dictionary class a stream of tokenized documents, like so:

In [6]:
# creating the CCP Corpus Dictionary

doc_stream = (tokens for _, tokens in iter_docs('./2019-09-ccp-corpus-0.3/ccprecords/'))
              
id2word_ccp = gensim.corpora.Dictionary(doc_stream) 

print(id2word_ccp)

INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(23844 unique tokens: ['ability', 'according', 'advance', 'aim', 'alien']...) from 147 documents (total 469668 corpus positions)


Dictionary(23844 unique tokens: ['ability', 'according', 'advance', 'aim', 'alien']...)


The Dictionary (id2word_ccp) now contains all words that appeared in the corpus, along with how many times they appeared. 

gensim provides a handy attribute, `token2id`, for mapping tokens to their ID numbers, viz:

In [7]:
print(id2word_ccp.token2id)
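
Under the hood, a mapping like this is just a Python dictionary from strings to integers. Here is a bare-bones sketch of how one could be built by hand. The function name `build_token2id` and the two mini-documents are invented for illustration, and the specific ID numbers gensim assigns to each word may differ from the ones this sketch produces.

```python
def build_token2id(tokenized_docs):
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            # give each new word the next unused integer
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

mini_docs = [
    ["colored", "convention", "state"],
    ["state", "convention", "iowa"],
]
mini_token2id = build_token2id(mini_docs)
print(mini_token2id)
# → {'colored': 0, 'convention': 1, 'state': 2, 'iowa': 3}
```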



There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter the words. 

gensim also provides functions for this:

In [8]:
# filter out 50 most frequent words
# id2word_ccp.filter_n_most_frequent(50)

# filter out words that appear in only 1 doc, keeping the rest
# note that no_below is an absolute number of documents, while no_above is a
# fraction of the corpus (1.0 = 100% of documents)
id2word_ccp.filter_extremes(no_below=2, no_above=1.0)

print(id2word_ccp)

INFO : discarding 9830 tokens: [('aves', 1), ('conventionising', 1), ('etrurian', 1), ('inters', 1), ('molded', 1), ('mounding', 1), ('ordains', 1), ('pitied', 1), ('rarae', 1), ('sunt', 1)]...
INFO : keeping 14014 tokens which were in no less than 2 and no more than 147 (=100.0%) documents
INFO : resulting dictionary: Dictionary(14014 unique tokens: ['ability', 'according', 'advance', 'aim', 'alien']...)


Dictionary(14014 unique tokens: ['ability', 'according', 'advance', 'aim', 'alien']...)


Note that by removing the words that only appeared in a single document, we went from 23,844 unique words (or tokens) to 14,014. That's not a huge number for a topic model, and as you'll see, there are probably other methods that would work better for this corpus. We'll explore some of those next class. 
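To see the logic behind that filtering, here is a small stand-alone sketch. It mimics the idea of `no_below` and `no_above` on three made-up documents; `keep_tokens` is a name invented for this example, and the code is a simplification rather than gensim's implementation.

```python
def keep_tokens(tokenized_docs, no_below=2, no_above=1.0):
    num_docs = len(tokenized_docs)

    # document frequency: in how many documents does each word appear?
    doc_freq = {}
    for tokens in tokenized_docs:
        for token in set(tokens):  # set() so a word counts once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1

    # no_below is a count of documents; no_above is a fraction of the corpus
    return {token for token, df in doc_freq.items()
            if df >= no_below and df <= no_above * num_docs}

mini_docs = [
    ["convention", "state", "cuba"],
    ["convention", "state", "kentucky"],
    ["convention", "school", "kentucky"],
]

# drops "cuba" and "school", which each appear in only one document
print(sorted(keep_tokens(mini_docs, no_below=2, no_above=1.0)))

# also drops "convention", which appears in more than 90% of documents
print(sorted(keep_tokens(mini_docs, no_below=2, no_above=0.9)))
```

The second call hints at why you might lower `no_above` for this corpus: a word like "convention" that turns up everywhere doesn't help tell topics apart.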

But for now, since a streamed corpus and a dictionary are all we need to create the vectors for our topic model, we can get started. 

In [9]:
# a class we need; this is the same for every topic model you create with gensim. 
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
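    def __len__(self):
        # clip_docs is None when we use the whole corpus, so count the files instead
        if self.clip_docs is not None:
            return self.clip_docs
        return sum(1 for doc in os.listdir(self.dump_file) if not doc.startswith('.'))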

In [10]:
# create a stream of bag-of-words vectors
ccp_corpus = Corpus('./2019-09-ccp-corpus-0.3/ccprecords/', id2word_ccp)

# print the first vector in the stream to see what it looks like; 
# this is in the format (word_id, count in first doc)

vector = next(iter(ccp_corpus))
print(vector)  

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 3), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 2), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 3), (29, 2), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 3), (37, 2), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 3), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 2), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 2), (88, 1), (89, 1), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 2), (96, 1), (97, 1), (98, 2), (99, 4), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 2), (106, 8), (107, 2), (108, 1), (109, 1), (110, 1),
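
A "bag of words" vector like the one above is nothing more than a count of how often each word ID appears in a document. Here is a minimal sketch of the idea using the standard library. The name `to_bow` and the tiny `token2id` mapping are made up for this example; it is meant to show the shape of the result, not to reproduce gensim's doc2bow exactly.

```python
from collections import Counter

def to_bow(tokens, token2id):
    # count each known word by its ID; words missing from the mapping are skipped
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # return (word_id, count) pairs, sorted by word ID
    return sorted(counts.items())

mini_token2id = {"colored": 0, "convention": 1, "state": 2}
bow = to_bow(["state", "convention", "state", "omaha"], mini_token2id)
print(bow)
# → [(1, 1), (2, 2)]
```

Notice that word order is thrown away entirely, and that "omaha" vanishes because it isn't in the mapping. Both are true of the real vectors, too.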

In [11]:
# now we're ready to run our topic model!

%time lda_model = gensim.models.LdaModel(ccp_corpus, num_topics=15, id2word=id2word_ccp, passes=5) 

# note that passes should be higher -- usually in the 50-100 range -- 
# but in the interests of time we'll only do 5 


INFO : using symmetric alpha at 0.06666666666666667
INFO : using symmetric eta at 0.06666666666666667
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 15 topics, 5 passes over the supplied corpus of 147 documents, updating model once every 147 documents, evaluating perplexity every 147 documents, iterating 50x with a convergence threshold of 0.001000
INFO : -10.752 per-word bound, 1725.1 perplexity estimate based on a held-out corpus of 147 documents with 456969 words
INFO : PROGRESS: pass 0, at document #147/147
INFO : topic #11 (0.067): 0.010*"people" + 0.010*"mr" + 0.009*"colored" + 0.009*"committee" + 0.008*"state" + 0.008*"men" + 0.007*"convention" + 0.006*"shall" + 0.005*"president" + 0.005*"rights"
INFO : topic #12 (0.067): 0.013*"colored" + 0.011*"convention" + 0.010*"mr" + 0.008*"state" + 0.008*"people" + 0.007*"committee" + 0.007*"shall" + 0.005*"men" + 0.005*"resolved" + 0.004*"states"
INFO : topic #10 (0.067): 0.010*"committee" +

CPU times: user 29 s, sys: 363 ms, total: 29.3 s
Wall time: 21.4 s


In [12]:


# some additional helpful functions built into gensim

# how to store corpus to disk
from gensim.corpora import MmCorpus
MmCorpus.serialize('./ccp.corpus.mm', ccp_corpus)

# how to store dictionary to disk
id2word_ccp.save('./ccp.dictionary')

# how to store model to disk 
lda_model.save('./lda_ccp-15topics_5iters.model')

INFO : storing corpus in Matrix Market format to ./ccp.corpus.mm
INFO : saving sparse matrix to ./ccp.corpus.mm
INFO : PROGRESS: saving document #0
INFO : saved 147x14014 matrix, density=8.384% (172720/2060058)
INFO : saving MmCorpus index to ./ccp.corpus.mm.index
INFO : saving Dictionary object under ./ccp.dictionary, separately None
INFO : saved ./ccp.dictionary
INFO : saving LdaState object under ./lda_ccp-15topics_5iters.model.state, separately None
INFO : saved ./lda_ccp-15topics_5iters.model.state
INFO : saving LdaModel object under ./lda_ccp-15topics_5iters.model, separately ['expElogbeta', 'sstats']
INFO : storing np array 'expElogbeta' to ./lda_ccp-15topics_5iters.model.expElogbeta.npy
INFO : not storing attribute id2word
INFO : not storing attribute dispatcher
INFO : not storing attribute state
INFO : saved ./lda_ccp-15topics_5iters.model


You can also load in a saved model. This is very helpful to know about, since generating new topic models takes time. 

Here, we're going to load in a (slightly) better topic model of the CCP Corpus with the same number of topics (15), but 50 iterations.

In [13]:
# load an old model; in this case, a topic model of the ccp with 50 iterations
lda_model = gensim.models.LdaModel.load('./lda_ccp-15topics_50iters.model')

INFO : loading LdaModel object from ./lda_ccp-15topics_50iters.model
INFO : loading expElogbeta from ./lda_ccp-15topics_50iters.model.expElogbeta.npy with mmap=None
INFO : setting ignored attribute id2word to None
INFO : setting ignored attribute state to None
INFO : setting ignored attribute dispatcher to None
INFO : loaded ./lda_ccp-15topics_50iters.model
INFO : loading LdaState object from ./lda_ccp-15topics_50iters.model.state
INFO : loaded ./lda_ccp-15topics_50iters.model.state


In [14]:
# gensim comes with a bunch of functions that make interacting with the output of the topic
# model a little easier. this one shows the topics. 

# show the topics, in the format (number of topics to show, number of terms)
# note that all words are in all topics, just some topics consist of very very small
# proportions of that word

# as you can tell already, even the top words in each topic are only a very small proportion
# of that topic, since we are dealing with about 14K unique words

lda_model.show_topics(15, 20)

[(0,
  '0.019*"mr" + 0.013*"committee" + 0.009*"convention" + 0.007*"people" + 0.006*"government" + 0.006*"state" + 0.006*"slavery" + 0.006*"report" + 0.005*"motion" + 0.005*"resolution" + 0.005*"men" + 0.005*"cuba" + 0.004*"said" + 0.004*"colored" + 0.003*"adopted" + 0.003*"country" + 0.003*"let" + 0.003*"spanish" + 0.003*"right" + 0.003*"citizens"'),
 (1,
  '0.003*"indiana" + 0.002*"mulattoes" + 0.002*"indianapolis" + 0.002*"persons" + 0.002*"negroes" + 0.001*"speakers" + 0.001*"prevents" + 0.001*"expenditure" + 0.001*"embodies" + 0.001*"loaf" + 0.001*"debarring" + 0.001*"concedes" + 0.001*"sustained" + 0.001*"agitating" + 0.001*"preventing" + 0.001*"extremely" + 0.001*"contain" + 0.001*"episcopal" + 0.001*"lots" + 0.001*"ancestors"'),
 (2,
  '0.012*"shall" + 0.010*"president" + 0.009*"convention" + 0.009*"association" + 0.009*"state" + 0.009*"colored" + 0.008*"county" + 0.007*"committee" + 0.007*"people" + 0.007*"resolved" + 0.006*"men" + 0.006*"rights" + 0.004*"following" + 0.004*"

In [16]:
# let's format the words a little more nicely; 
# the formatted=False parameter returns tuples of (word, probability)

topics = lda_model.show_topics(15, 20, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: mr, committee, convention, people, government, state, slavery, report, motion, resolution, men, cuba, said, colored, adopted, country, let, spanish, right, citizens, 
T1: indiana, mulattoes, indianapolis, persons, negroes, speakers, prevents, expenditure, embodies, loaf, debarring, concedes, sustained, agitating, preventing, extremely, contain, episcopal, lots, ancestors, 
T2: shall, president, convention, association, state, colored, county, committee, people, resolved, men, rights, following, man, motion, slavery, rev, white, citizens, great, 
T3: kentucky, school, state, normal, frankfort, district, wm, louisville, colored, teachers, committee, lexington, laws, weeks, schools, convention, simmons, education, common, ky, 
T4: convention, committee, shall, mr, people, resolved, president, report, motion, appointed, society, rev, new, john, following, states, read, business, subject, seconded, 
T5: colored, state, people, men, convention, states, citizens, white, rights, committee,

These are not amazing topics. It might just be that the documents themselves are too similar to each other for topic modeling to be the best approach. (The sameyness is something that the CCP team suggested.) 

But a few other things we might want to try before we throw topic modeling out the window:
* Filtering some of the most common words (see the filtering function above)
* Generating fewer topics (we could try 5, for instance). 
* Anything else we might consider?

Feel free to try those things on your own. 

Next, let's take a bit of a closer look at the probabilities attached to each word in a single topic. 


In [17]:
# T9 looks decent
topic = topics[9]

# this is the topic number
topic_num = topic[0]

topic_pairs = topic[1]
for pair in topic_pairs:
    print(pair[0] + ": " + str(pair[1]))

# since every topic contains every word, the probabilities of all ~14K words
# in a topic sum to 1; the 20 shown here are only the largest ones


colored: 0.014209137
people: 0.010636336
white: 0.008736604
south: 0.007967107
conference: 0.0077905497
race: 0.0076308916
negro: 0.007405244
said: 0.0060578035
men: 0.005750052
man: 0.00489279
work: 0.0045141783
great: 0.0044996077
labor: 0.0039110696
time: 0.0038835607
southern: 0.0037885576
country: 0.0037708979
states: 0.0035611608
good: 0.0033553445
let: 0.0032570495
years: 0.0032372389


Let's flip it around and look at the document composition. 

MALLET provides this output automatically, but with gensim there's a bit more work required.

In [20]:
tokens = [] 

# open one file
with open('./2019-09-ccp-corpus-0.3/ccprecords/1851.NY-07.22.ALBA.MIN.01.txt', "r") as file:
    text = file.read()
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the CCP dictionary, created above
doc_bow = id2word_ccp.doc2bow(tokens)

# get the topics that the doc consists of
doc_topics = lda_model.get_document_topics(doc_bow)

doc_topics
    



[(4, 0.028083513), (8, 0.011763256), (13, 0.03993155), (14, 0.91510516)]

In [22]:
# now we can cross-reference to find those topics and words

for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")

    topic_words = "Top words in topic: "
    select_topics = topics[topic]
    
    for pair in select_topics[1]:
        topic_words += pair[0] + ", "
    
    print(topic_words)
 

T4: 2.81% of document.
Top words in topic: convention, committee, shall, mr, people, resolved, president, report, motion, appointed, society, rev, new, john, following, states, read, business, subject, seconded, 
T8: 1.18% of document.
Top words in topic: mr, convention, committee, league, shall, motion, state, president, rev, adopted, colored, resolved, members, th, rights, people, resolution, equal, county, secretary, 
T13: 3.99% of document.
Top words in topic: mr, convention, colored, committee, men, state, rights, motion, people, president, slavery, country, right, said, resolution, new, states, called, government, man, 
T14: 91.51% of document.
Top words in topic: convention, state, people, colored, committee, resolved, shall, resolution, men, man, new, mr, rights, citizens, th, political, states, great, country, free, 


### Evaluating Topics ###

Gensim has several built-in methods for evaluating topics, including something called [topic coherence](https://rare-technologies.com/what-is-topic-coherence/), which is one of the most helpful measures. 

One way to determine whether you've selected the appropriate number of topics is to calculate the coherence score for different numbers of topics. The higher the score, the better. But this can be time-consuming. 

Let's just see how it works.

In [23]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=ccp_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence

-0.8975078393034888
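
For a feel of what a score like this is measuring, here is a simplified sketch of the UMass idea: words that belong together in a topic should tend to show up in the same documents. This follows the general recipe (a sum of log ratios of document co-occurrence counts) but is not gensim's exact computation, so don't expect it to reproduce the number above. The function name `umass_coherence` and the four mini-documents are invented for illustration, and the sketch assumes every top word appears in at least one document.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # compare each word with every word ranked above it in the topic
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            # the +1 avoids log(0) when two words never co-occur
            score += math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
    return score

mini_docs = [
    ["school", "teachers", "education"],
    ["school", "education", "kentucky"],
    ["cuba", "spanish", "government"],
    ["cuba", "government", "slavery"],
]

coherent = umass_coherence(["school", "education", "teachers"], mini_docs)
jumbled = umass_coherence(["school", "cuba", "teachers"], mini_docs)
print(coherent, jumbled)
```

The first word list hangs together in the mini-corpus and the second doesn't, so the first gets the higher score.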

Here's a review essay by Hanna Wallach et al. that summarizes a few methods of evaluation, including some involving humans in the loop: ["Evaluation Methods for Topic Models"](http://dirichlet.net/pdf/wallach09evaluation.pdf).

Another way to evaluate topics is just to look at them.

The pyLDAvis library lets you do this in a single line. It's very satisfying! 

In [24]:
# LDA visualization tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (in pyLDAvis 3.3 and later, this module is named pyLDAvis.gensim_models)

# just reformat the corpus for pyLDAvis 
from gensim.corpora import MmCorpus
ccp_mm_corpus = MmCorpus('./ccp.corpus.mm')

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda_model, ccp_mm_corpus, id2word_ccp)

INFO : loaded corpus index from ./ccp.corpus.mm.index
INFO : initializing cython corpus reader from ./ccp.corpus.mm
INFO : accepted corpus with 147 documents, 14014 features, 172720 non-zero entries
INFO : NumExpr defaulting to 4 threads.
