## Topic Modeling ##

*This lesson draws on blog posts by [Ted Underwood](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/) and [Matthew Jockers](http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/), this [video of a talk by David Mimno](https://vimeo.com/53080123), and [this notebook](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html) by Radim Rehurek.*

In [None]:
# import and setup modules we'll be using in this notebook
import logging # for logging status etc
import itertools # helpful library for iterating through things

import numpy as np # this is a powerful python math package that many others are based on
import gensim # our topic modeling library
import os # for file i/o

# configure logging 
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

# a helpful function that returns the first `n` elements of the stream as plain list.
# we'll use this later
def head(stream, n=10):
    return list(itertools.islice(stream, n))

In [None]:
# import some more modules for processing the corpus
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

If you haven't already, please download and unzip the Colored Conventions Corpus: [link](http://coloredconventions.org/intro-corpus).

### Tokenizing ###

Many NLP tasks require that you first tokenize your corpus. We actually already tokenized something when we chunked our song lyrics by line. 

Here is [another example of tokenizing](https://programminghistorian.org/en/lessons/sentiment-analysis) that uses nltk to tokenize a document by sentence instead. (Scroll down to where it discusses the nltk word_tokenize module). **ATTENTION! This may be helpful to you for your next homework!**

Here, however, we're going to write our own quick tokenizing function that makes use of gensim's [simple_preprocess function](https://radimrehurek.com/gensim/utils.html), which breaks a document into a list of lowercase tokens. The lower-casing is important for topic modeling since we want both uppercase and lowercase versions of the same word to be counted together. 

In [None]:
# here's some nice dense python for you
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
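
To see what a tokenizer like this does to a sentence, here is a rough stand-in written with only the standard library. It is a sketch, not gensim's code: `toy_tokenize` and `TOY_STOPWORDS` are made-up names, the stopword list is a tiny sample, and the regex only approximates `simple_preprocess` (which, by default, also drops tokens shorter than 2 or longer than 15 characters).

```python
import re

# A tiny, hypothetical stopword list; gensim's STOPWORDS is much longer.
TOY_STOPWORDS = {"the", "of", "and", "a", "to", "in", "was"}

def toy_tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    # mimic simple_preprocess's default length limits (2 to 15 characters)
    return [w for w in words if 2 <= len(w) <= 15 and w not in TOY_STOPWORDS]

sample = "The Convention of 1843 was held in Buffalo, and the Convention met daily."
tokens = toy_tokenize(sample)
print(tokens)
```

Notice that both "Convention" tokens come out identical, and that the year and the punctuation disappear.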

### Further pre-processing our corpus ###

This is the other necessary step before running a topic model. You need to write a function that iterates through your corpus and returns each document in the format (title, tokens). 

**Side note that we've not yet discussed tuples.** Tuples exist in many programming languages, though base R does not have them. All you need to know is that tuples are sequences of objects, just like lists, but they cannot be changed once they are created. In Python, you indicate a tuple with parentheses.
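
Here is a quick sketch of a tuple in action. The filename is one from the corpus; the two tokens are made up for illustration.

```python
# A (title, tokens) tuple, written with parentheses
doc = ("1843.NY-08.15.BUFF.MIN.01.txt", ["convention", "buffalo"])

# Tuples can be "unpacked" into separate variables in one line
title, tokens = doc
print(title, tokens)

# Trying to change an item in a tuple raises a TypeError
try:
    doc[0] = "renamed.txt"
except TypeError as err:
    error_message = str(err)
    print("Can't do that:", error_message)
```

Unpacking is what makes loops like `for doc, tokens in ...` work when each item in the stream is a two-item tuple.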

In any case, we will want a pre-processing function like this:

In [None]:
# A function to yield each doc in CCP Corpus as a `(filename, tokens)` tuple.

def iter_docs(base_dir):
    docs = os.listdir(base_dir)

    for doc in docs:
        # skip hidden files such as .DS_Store
        if not doc.startswith('.'):
            # os.path.join works whether or not base_dir ends with a slash
            with open(os.path.join(base_dir, doc), "r") as file:
                text = file.read()
                tokens = tokenize(text) 
        
                yield doc, tokens

In [None]:
# set up the stream for later processing 
stream = iter_docs('./2019-09-ccp-corpus-0.3/ccprecords/')

# while we're at it, take a look at what this looks like for the first five docs
for doc, tokens in itertools.islice(stream, 5):
    print(doc, tokens[:10])  # print the doc title and its first ten tokens

The next step is to create a Dictionary (not to be confused with a Python dictionary) which maps each word to a numerical ID. 

This mapping step is required because most algorithms, including gensim's implementation of LDA, rely on numerical libraries that work with vectors indexed by integers, not by strings. Also, many need to know the vector/matrix dimensionality in advance.
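
To make the idea concrete, here is a hand-rolled sketch of such a mapping in plain Python. It is not gensim's implementation (gensim may number the words differently), and `build_token_ids`, `to_bow`, and the two toy documents are invented for illustration.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Give each distinct token the next unused integer ID."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["convention", "delegates", "convention"],
        ["delegates", "resolved", "vote"]]

token2id = build_token_ids(docs)
bow = to_bow(docs[0], token2id)
print(token2id)
print(bow)
```

Once every word has an ID, a document becomes a short list of `(word_id, count)` pairs, and the total number of IDs tells the model how long its vectors need to be.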

The mapping can be constructed automatically by giving gensim's Dictionary class a stream of tokenized documents, like so:

In [None]:
# creating the CCP Corpus Dictionary

doc_stream = (tokens for _, tokens in iter_docs('./2019-09-ccp-corpus-0.3/ccprecords/'))
              
id2word_ccp = gensim.corpora.Dictionary(doc_stream) 

print(id2word_ccp)

The Dictionary (id2word_ccp) now contains all words that appeared in the corpus, along with counts of how often they appeared (for example, the number of documents that contain each word). 

The Dictionary's `token2id` attribute (a regular Python dictionary) maps tokens to their ID numbers, viz:

In [None]:
print(id2word_ccp.token2id)

There aren't many things you need to do in order to tune your topic model, but one important thing to consider is whether you should filter the words. 

gensim also provides functions for this:

In [None]:
# filter out 50 most frequent words
# id2word_ccp.filter_n_most_frequent(50)

# filter out words in only 1 doc, keeping the rest
# note how no_below and no_above take different values
id2word_ccp.filter_extremes(no_below=2, no_above=1.0)

print(id2word_ccp)
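
If the difference between `no_below` and `no_above` is hard to picture, here is a plain-Python sketch of the same idea: `no_below` is an absolute number of documents, while `no_above` is a fraction of all documents. This is an approximation for illustration, not gensim's code; `filter_by_doc_freq` and the toy documents are made up.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below=2, no_above=1.0):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (a fraction) of docs."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {token for token, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

docs = [["convention", "delegates", "vote"],
        ["convention", "resolved"],
        ["convention", "delegates"]]

kept_default = filter_by_doc_freq(docs, no_below=2, no_above=1.0)
kept_strict = filter_by_doc_freq(docs, no_below=2, no_above=0.7)
print(kept_default)
print(kept_strict)
```

With `no_above=1.0` nothing is dropped for being too common; lowering it to `0.7` removes "convention", which appears in every toy document.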

Note that by removing the words that only appeared in a single document, we went from 23,844 unique words (or tokens) to 14,014. That's not a huge number for a topic model, and as you'll see, there are probably other methods that would work better for this corpus. We'll explore some of those next class. 

But for now, since a streamed corpus and a dictionary are all we need to create the vectors for our topic model, we can get started. 

In [None]:
# a class we need; this is the same for every topic model you create with gensim

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
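    def __len__(self):
        # len() must return an int, so count the docs when clip_docs is None
        if self.clip_docs is None:
            return sum(1 for doc in os.listdir(self.dump_file) if not doc.startswith('.'))
        return self.clip_docs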
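
Why wrap the stream in a class at all? A plain generator can only be read once, but training with several passes has to go through the corpus again and again. The sketch below shows the difference using only the standard library; `doc_generator` and `RestartableStream` are made-up names, not part of gensim.

```python
def doc_generator():
    """A toy stand-in for iter_docs: yields two tiny tokenized documents."""
    yield ["convention", "delegates"]
    yield ["vote", "resolved"]

class RestartableStream:
    """Starts a fresh generator every time someone iterates over it."""
    def __init__(self, make_stream):
        self.make_stream = make_stream

    def __iter__(self):
        return self.make_stream()

# A bare generator is used up after one pass
gen = doc_generator()
gen_first = len(list(gen))
gen_second = len(list(gen))

# An object with __iter__ can be looped over as many times as we like
stream = RestartableStream(doc_generator)
stream_first = len(list(stream))
stream_second = len(list(stream))

print(gen_first, gen_second)
print(stream_first, stream_second)
```

The `Corpus` class above follows the same pattern: each call to `__iter__` re-reads the files from disk, so nothing has to be held in memory between passes.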

In [None]:
# create a stream of bag-of-words vectors
ccp_corpus = Corpus('./2019-09-ccp-corpus-0.3/ccprecords/', id2word_ccp)

# print the first vector in the stream to see what it looks like; 
# this is in the format (word_id, count in first doc)

vector = next(iter(ccp_corpus))
print(vector)  

In [None]:
# now we're ready to run our topic model!

%time lda_model = gensim.models.LdaModel(ccp_corpus, num_topics=15, id2word=id2word_ccp, passes=10) 

# note that passes should be higher -- people usually do 50-100 -- 
# but in the interests of time we'll only do 10 


In [None]:
# some helpful gensim functions for saving your work to disk

# how to store corpus to disk
from gensim.corpora import MmCorpus
MmCorpus.serialize('./ccp.corpus.mm', ccp_corpus)

# how to store dictionary to disk
id2word_ccp.save('./ccp.dictionary')

# how to store model to disk 
lda_model.save('./lda_ccp-15topics_10iters.model')

In [None]:
# load an old model; in this case, a topic model of the ccp with 50 iterations
lda_model = gensim.models.LdaModel.load('./lda_ccp-15topics_50iters.model')

In [None]:
# show the top 20 terms in each topic

lda_model.show_topics(15, 20)

In [None]:
# let's format the words nicely; 
# the formatted=False parameter returns each topic as a tuple of 
# (topic number, [(word, probability), ...])

topics = lda_model.show_topics(15, 20, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

In [None]:
# let's take a bit of a closer look at the probabilities attached to each word 
# in a single topic 

# T0 looks decent
topic = topics[0]

# this is the topic number
topic_num = topic[0]

topic_pairs = topic[1]
for pair in topic_pairs:
    print(pair[0] + ": " + str(pair[1]))

# every topic assigns a probability to every word in the dictionary, so a topic's 
# probabilities sum to 1 across the whole vocabulary (not just the top 20 shown here)
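
A toy example may help here. The numbers below are invented (a real topic spreads its probability over thousands of words), but they show why the top few words of a topic add up to less than 1 while the whole vocabulary adds up to 1.

```python
# A hypothetical topic over a five-word vocabulary
topic_word_probs = {
    "convention": 0.40,
    "delegates": 0.25,
    "vote": 0.20,
    "resolved": 0.10,
    "buffalo": 0.05,
}

# Across the whole vocabulary, the probabilities sum to 1
total = sum(topic_word_probs.values())

# The top 3 words only account for part of that
top3 = sorted(topic_word_probs.items(), key=lambda pair: pair[1], reverse=True)[:3]
top3_mass = sum(prob for _, prob in top3)

print(round(total, 2), round(top3_mass, 2))
```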


In [None]:
# let's flip it around and look at the document composition 
# Mallet does this automatically, but with gensim's built-in topic modeling 
# algorithm, we need to do it manually

tokens = [] 

# open one file
with open('./2019-09-ccp-corpus-0.3/ccprecords/1843.NY-08.15.BUFF.MIN.01.txt', "r") as file:
    text = file.read()
    tokens = tokenize(text) # remember this from above

# create the bag of words for the document on the basis of the CCP dictionary, created above
doc_bow = id2word_ccp.doc2bow(tokens)

# get the topics that the doc consists of
doc_topics = lda_model.get_document_topics(doc_bow)

doc_topics
    



In [None]:
# now we can cross-reference to find those topics and words

for topic, prob in doc_topics:
    print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")

    topic_words = "Top words in topic: "
    select_topics = topics[topic]
    
    for pair in select_topics[1]:
        topic_words += pair[0] + ", "
    
    print(topic_words)
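
Here is one possible variation on the cell above, sketched with made-up data in the same shapes that `get_document_topics` and `show_topics(formatted=False)` return. It sorts the document's topics from largest to smallest share, and it looks topics up by ID in a Python dictionary instead of by list position, which only lines up with topic IDs when every topic is included in order. The names `topics_by_id`, `ranked`, and `summary` are just for this sketch.

```python
# Hypothetical (topic_id, share of document) pairs
doc_topics = [(2, 0.12), (7, 0.61), (11, 0.26)]

# Hypothetical (topic_id, [(word, probability), ...]) pairs
topics = [
    (2, [("vote", 0.03), ("resolved", 0.02)]),
    (7, [("convention", 0.05), ("delegates", 0.04)]),
    (11, [("school", 0.04), ("education", 0.03)]),
]

# Look topics up by ID rather than by their position in the list
topics_by_id = {topic_id: pairs for topic_id, pairs in topics}

# Largest share first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)

summary = []
for topic_id, share in ranked:
    words = ", ".join(word for word, _ in topics_by_id[topic_id])
    summary.append("T{}: {:.0%} ({})".format(topic_id, share, words))

for line in summary:
    print(line)
```

The shares need not add up to exactly 100%, because gensim leaves out topics whose share falls below a small minimum probability.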
 

### Evaluating Topics ###

Gensim has several built-in methods for evaluating topics, including something called [topic coherence](https://rare-technologies.com/what-is-topic-coherence/). 

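While it's time-consuming, one way to determine whether you've selected the appropriate number of topics is to calculate the coherence score for different numbers of topics. The higher the score, the better (with `u_mass`, the measure used here, scores are typically negative, so "higher" means closer to zero).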

In [None]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=ccp_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence
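
Comparing different numbers of topics boils down to a loop: score each candidate, then keep the best one. The sketch below shows only that loop. `pick_num_topics` is a made-up helper, and `fake_score` is a stand-in with invented numbers so the cell runs instantly; in real use you would replace it with a function that trains an `LdaModel` with `num_topics=k` and returns that model's `get_coherence()` value, as in the cell above.

```python
def pick_num_topics(candidates, score_fn):
    """Score every candidate number of topics and return the best one plus all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)  # higher coherence is better
    return best_k, scores

def fake_score(k):
    """Stand-in for a real coherence calculation; the values are invented."""
    return -1.0 - abs(k - 15) / 10

best_k, scores = pick_num_topics([5, 10, 15, 20, 25], fake_score)

for k, score in scores.items():
    print(k, score)
print("best number of topics:", best_k)
```

Keep in mind that each real score means training a whole new model, which is why this approach is slow.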

But another way to evaluate topics is to look at them.

In [None]:
# LDA visualization tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (newer pyLDAvis releases name this module pyLDAvis.gensim_models)

# just reformat the corpus for pyLDAvis 
ccp_mm_corpus = MmCorpus('./ccp.corpus.mm')

pyLDAvis.enable_notebook()

pyLDAvis.gensim.prepare(lda_model, ccp_mm_corpus, id2word_ccp)