# Topic Modeling with Gensim

This notebook explores two topic modeling algorithms with the Gensim package:

1. LDA - Latent Dirichlet Allocation, an algorithm using probabilistic graphical models
2. LSI - Latent Semantic Indexing, also called LSA, Latent Semantic Analysis, an algorithm using Singular Value Decomposition (SVD), a dimensionality reduction technique 
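
To make the SVD idea behind LSI a little more concrete, the sketch below approximates the dominant singular direction of a tiny term-document count matrix by power iteration, using only plain Python. The matrix, the term names, and the helper `top_singular_direction` are invented for illustration. Gensim's `LsiModel` relies on its own, far more scalable SVD routine, so treat this as a picture of the underlying idea rather than of the library's implementation.

```python
import math

# Hypothetical term-document count matrix (rows = terms, columns = documents).
terms = ["inflation", "price", "islands", "pacific"]
A = [
    [3, 2, 0],  # inflation
    [2, 1, 0],  # price
    [0, 0, 4],  # islands
    [0, 0, 2],  # pacific
]

def top_singular_direction(matrix, iterations=500):
    """Approximate the leading left singular vector and singular value
    of `matrix` by power iteration on matrix * matrix^T."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    for _ in range(iterations):
        # v = A^T u, then u = A v: one multiplication by A A^T
        v = [sum(matrix[r][c] * u[r] for r in range(n_rows)) for c in range(n_cols)]
        u = [sum(matrix[r][c] * v[c] for c in range(n_cols)) for r in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    # for a unit vector u, the singular value is the length of A^T u
    v = [sum(matrix[r][c] * u[r] for r in range(n_rows)) for c in range(n_cols)]
    sigma = math.sqrt(sum(x * x for x in v))
    return u, sigma

u, sigma = top_singular_direction(A)
for term, weight in zip(terms, u):
    print(term, round(weight, 3))
print("singular value:", round(sigma, 3))
```

In this made-up matrix the strongest direction loads on 'islands' and 'pacific', which is the kind of weighted word list that an LSI topic reports.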

### Gensim

The Gensim Python library can be installed with pip or pip3. Documentation for the Gensim API, as well as tutorials, can be found on the [web site](https://radimrehurek.com/gensim/index.html).


### A toy corpus

A very small corpus will be used to illustrate the steps involved in running the algorithms with Gensim. In fact, the corpus is so small that it is unlikely that good results will be achieved. The corpus consists of only 4 documents, each representing the text of one section of a textbook. The code block below reads in the anatomy, business law, economics, and geography texts, creating a list of docs.

In [1]:
import re

num_docs = 4
docs = []

with open('../school_texts/anat.txt', 'r') as f:
    doc_anat = f.read().lower()
    doc_anat = doc_anat.replace('\n', ' ')
    docs.append(doc_anat)

with open('../school_texts/buslaw.txt', 'r') as f:
    doc_buslaw = f.read().lower()
    doc_buslaw = doc_buslaw.replace('\n', ' ')
    docs.append(doc_buslaw)
    
with open('../school_texts/econ.txt', 'r') as f:
    doc_econ = f.read().lower()
    doc_econ = doc_econ.replace('\n', ' ')
    docs.append(doc_econ)
    
with open('../school_texts/geog.txt', 'r') as f:
    doc_geog = f.read().lower()
    doc_geog = doc_geog.replace('\n', ' ')
    docs.append(doc_geog)
    
# look at part of each document
for i in range(num_docs):
    print(docs[i][:50])

the autonomic nervous system the autonomic nervous
16.1 theory of contract remedies purpose of remedi
22 | inflation inflation is a general and ongoing 
chapter 13 the pacific and antarctica the immense 


In [2]:
# gensim and nltk imports
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

In [3]:
NUM_TOPICS = 4

In [4]:
# preprocess docs
def preprocess(docs, stopwords):
    """
    Tokenize, remove stopwords and non-alpha tokens.
    param: docs - a list of raw text documents
    param: stopwords - a collection of words to drop
    return: a list of token lists, one per document
    """
    
    stop_set = set(stopwords)  # set membership tests are faster than list lookups
    processed_docs = []
    for doc in docs:
        tokens = [t for t in word_tokenize(doc.lower()) if t not in stop_set
                 and t.isalpha()]
        processed_docs.append(tokens)
        
    return processed_docs


In [5]:
preprocessed_docs = preprocess(docs, stopwords.words('english'))

In [6]:
for i in range(num_docs):
    print(preprocessed_docs[i][:5])

['autonomic', 'nervous', 'system', 'autonomic', 'nervous']
['theory', 'contract', 'remedies', 'purpose', 'remedies']
['inflation', 'inflation', 'general', 'ongoing', 'rise']
['chapter', 'pacific', 'antarctica', 'immense', 'tropical']


In [7]:
# the dictionary maps words to id numbers
dictionary = corpora.Dictionary(preprocessed_docs)

In [8]:
print('len of dictionary:', len(dictionary))
print('some items:', dictionary[0], dictionary[4053])

len of dictionary: 4054
some items: abdominal zone


In [9]:
# represent the doc tokens in numeric form
corpus = [dictionary.doc2bow(tokens) for tokens in preprocessed_docs]

In [10]:
# each doc in the corpus is now a bag of words
# printing the first few 'words' in the bag of words confirms that word order is lost
print(corpus[0][:5])
print(dictionary[4], dictionary[2], dictionary[1])

[(0, 4), (1, 2), (2, 1), (3, 1), (4, 1)]
accelerator absorbed ability
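
The dictionary and bag-of-words steps above can be sketched in plain Python to show what the `(id, count)` pairs mean. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical stand-ins, not Gensim functions, and the choice to number new tokens alphabetically within each document is an assumption made to resemble the ids printed above.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token.
    Unseen tokens are numbered alphabetically within each document
    (an assumption chosen to resemble the ids shown in the notebook)."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# the first five tokens of the first two preprocessed documents
sample_docs = [
    ['autonomic', 'nervous', 'system', 'autonomic', 'nervous'],
    ['theory', 'contract', 'remedies', 'purpose', 'remedies'],
]

vocab = build_vocabulary(sample_docs)
bows = [to_bag_of_words(doc, vocab) for doc in sample_docs]
print(vocab)
print(bows[0])
print(bows[1])
```

Each pair records how often a word occurs in a document and nothing about where it occurs, which is why the representation is called a bag of words.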


In [11]:
# build an LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

In [12]:
print("LDA Model Results")
for i in range(NUM_TOPICS):
    print("\nTopic #%s:" % i, lda_model.print_topic(i, 10))

# the results below show for each topic, the top 10 words associated with that topic, along with the importance factor

LDA Model Results

Topic #0: 0.014*"inflation" + 0.006*"damages" + 0.005*"price" + 0.005*"one" + 0.005*"contract" + 0.005*"system" + 0.005*"prices" + 0.004*"goods" + 0.004*"party" + 0.004*"would"

Topic #1: 0.010*"inflation" + 0.008*"islands" + 0.007*"system" + 0.005*"rate" + 0.005*"goods" + 0.005*"one" + 0.005*"would" + 0.005*"sympathetic" + 0.004*"many" + 0.004*"price"

Topic #2: 0.012*"inflation" + 0.007*"contract" + 0.007*"system" + 0.007*"party" + 0.006*"price" + 0.005*"damages" + 0.005*"would" + 0.004*"sympathetic" + 0.004*"goods" + 0.004*"prices"

Topic #3: 0.006*"islands" + 0.006*"inflation" + 0.005*"one" + 0.005*"party" + 0.005*"contract" + 0.005*"system" + 0.004*"would" + 0.004*"island" + 0.004*"damages" + 0.004*"many"


In [13]:
for i in range(NUM_TOPICS):
    top_words = [t[0] for t in lda_model.show_topic(i, 9)]
    print("\nTopic", str(i), ':', top_words)


Topic 0 : ['inflation', 'damages', 'price', 'one', 'contract', 'system', 'prices', 'goods', 'party']

Topic 1 : ['inflation', 'islands', 'system', 'rate', 'goods', 'one', 'would', 'sympathetic', 'many']

Topic 2 : ['inflation', 'contract', 'system', 'party', 'price', 'damages', 'would', 'sympathetic', 'goods']

Topic 3 : ['islands', 'inflation', 'one', 'party', 'contract', 'system', 'would', 'island', 'damages']


In [14]:
# look at weights for top 10 words in topic 0
lda_model.show_topic(0, 10)

[('inflation', 0.014359514),
 ('damages', 0.0059469584),
 ('price', 0.005468406),
 ('one', 0.0049800146),
 ('contract', 0.004808123),
 ('system', 0.0046616443),
 ('prices', 0.0045778817),
 ('goods', 0.0044693653),
 ('party', 0.004247469),
 ('would', 0.0042141583)]

In [15]:
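# note: log_perplexity returns a per-word likelihood bound (a negative, log-scale value),
# not the perplexity itself; perplexity is 2 ** (-bound), so bounds closer to 0 are better
print("LDA Model 1 Perplexity:", lda_model.log_perplexity(corpus))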

from gensim.models.coherencemodel import CoherenceModel

coherence1 = CoherenceModel(model=lda_model,
                           texts=preprocessed_docs, dictionary=dictionary, coherence='c_v')
print('Coherence score:', coherence1.get_coherence())

LDA Model 1 Perplexity: -8.259309803241289
Coherence score: 0.3787863721173812
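
The value printed as "Perplexity" above is negative because `log_perplexity` returns a per-word likelihood bound on a log scale. A minimal sketch of the conversion to an actual perplexity follows, assuming the bound is expressed in base 2, as Gensim's own logging of a perplexity estimate suggests. The helper name `bound_to_perplexity` is made up for this example.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a base-2 per-word likelihood bound to perplexity."""
    return 2 ** (-per_word_bound)

# the bound printed for the first LDA model
bound_model1 = -8.259309803241289
perplexity_model1 = bound_to_perplexity(bound_model1)
print("perplexity estimate:", round(perplexity_model1, 1))
```

Lower perplexity is better, so a bound closer to zero indicates a better fit. With only four documents, and with the score computed on the training corpus itself, the number says little about how the model would generalize.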


## Visualization

The pyLDAvis package enables visualization of topics and documents. The package can be installed with pip or pip3.

In [16]:
import pyLDAvis
# note: pyLDAvis 3.3 and later renamed this module to pyLDAvis.gensim_models
from pyLDAvis import gensim
pyLDAvis.enable_notebook()

In [17]:
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

vis

## Interpreting the visualization

Each bubble in the left part of the visualization represents a topic. The larger the bubble, the more prevalent that topic is in the corpus. A good topic model will have well-separated, non-overlapping bubbles that are spread across the quadrants rather than clustered in one. The closer two bubbles are, the more similar their topics. Overlapping bubbles can be an indication that there are too many topics.
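
Bubble distance reflects how different two topics' word distributions are. The sketch below computes the Jensen-Shannon divergence between small, invented topic-word distributions. pyLDAvis's default layout is understood to be based on this divergence followed by a projection to two dimensions, but that is an assumption to check against the pyLDAvis documentation; the helper names and the probabilities here are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, between 0 and 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# invented distributions over a four-word vocabulary
topic_a = [0.5, 0.3, 0.2, 0.0]
topic_b = [0.4, 0.4, 0.2, 0.0]  # similar to topic_a
topic_c = [0.0, 0.0, 0.1, 0.9]  # very different from topic_a

print("a vs b:", round(js_divergence(topic_a, topic_b), 4))
print("a vs c:", round(js_divergence(topic_a, topic_c), 4))
```

Topics with a small divergence would be drawn close together or overlapping, while topics with a large divergence would sit far apart.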

When the cursor is moved over a bubble, the important words in that topic are highlighted in red on the right side of the visualization. The red bars indicate the frequency of each term within the topic.

Visualizing the topic models in this way helps identify problems in the topic model:

* topics include unimportant words such as 'would', 'one', 'may', 'many', 'also'; using an expanded set of stop words would help alleviate some of this
* words and their plurals show up in topics, see 'island' and 'islands'
* the same words appear in multiple topics; for example 'island' and 'islands'; this is often an indication that the number of topics is too large
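
The overlap between topics can also be checked directly. The sketch below counts how many topics each word appears in, using the four top-word lists printed by the first LDA model earlier in the notebook. The variable names are chosen for this example only.

```python
from collections import Counter

# top 9 words per topic, copied from the first LDA model's output
top_words = {
    0: ['inflation', 'damages', 'price', 'one', 'contract', 'system', 'prices', 'goods', 'party'],
    1: ['inflation', 'islands', 'system', 'rate', 'goods', 'one', 'would', 'sympathetic', 'many'],
    2: ['inflation', 'contract', 'system', 'party', 'price', 'damages', 'would', 'sympathetic', 'goods'],
    3: ['islands', 'inflation', 'one', 'party', 'contract', 'system', 'would', 'island', 'damages'],
}

# number of topics in which each word appears
topic_counts = Counter(word for words in top_words.values() for word in set(words))

shared = sorted(w for w, n in topic_counts.items() if n > 1)
in_all = sorted(w for w, n in topic_counts.items() if n == len(top_words))
unique = sorted(w for w, n in topic_counts.items() if n == 1)

print("words in more than one topic:", shared)
print("words in every topic:", in_all)
print("words in exactly one topic:", unique)
```

A long list of shared words and a short list of unique ones is a quick numeric hint that the topics are not well separated.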

Documentation for the pyLDAvis visualization package is [available here](https://pyldavis.readthedocs.io/en/latest/modules/API.html).

## Run LDA again

Another LDA model is built in the next code block, with the following changes:

* additional words were added to the standard stopwords
* the number of topics is kept at 4, matching the number of documents


In [18]:
enhanced_stopwords = stopwords.words('english') 
enhanced_stopwords += ['could', 'may', 'would', 'many', 'also']
preprocessed_docs2 = preprocess(docs, enhanced_stopwords)

# the dictionary maps words to id numbers
dictionary2 = corpora.Dictionary(preprocessed_docs2)

# represent the doc tokens in numeric form
corpus2 = [dictionary2.doc2bow(tokens) for tokens in preprocessed_docs2]

# build another LDA model, again with 4 topics
lda_model2 = models.LdaModel(corpus=corpus2, num_topics=4, id2word=dictionary2)
vis2 = pyLDAvis.gensim.prepare(lda_model2, corpus2, dictionary2)
vis2

## Comparing the two models

We know from this small corpus that there really are four topics:

1. the autonomic nervous system
2. theory of contract remedies
3. the Pacific and Antarctica
4. inflation

Visual inspection of the topics reveals several problems:

* The topic labeled '1' in the visualization is a mix of key words from all 4 documents.
* The topic labeled '2' seems closest to the inflation topic, with key words such as inflation, system, rate, price, goods, and damages.
* The topic labeled '3' mixes the key words inflation, system, islands, party, damages, and sympathetic, which seem to come from all 4 documents.
* The topic labeled '4' seems to be a weak topic with low estimated frequencies.

These results are not impressive. Only topic '2' seems to have homed in on a topic that is consistent with the documents.

In [19]:
print("LDA Model 2 Perplexity:", lda_model2.log_perplexity(corpus2))

coherence2 = CoherenceModel(model=lda_model2,
                           texts=preprocessed_docs2, dictionary=dictionary2, coherence='c_v')
print('Coherence score:', coherence2.get_coherence())

LDA Model 2 Perplexity: -8.053633718873671
Coherence score: 0.4341028452886194


## Conclusion

Overall, LDA gave interesting results, but the results don't conform to human intuition about the corpus, and the scores were not high. The main problem with applying LDA to this corpus is that the corpus is tiny. There are not enough words in enough contexts for the algorithm to learn very much. In fact, it's surprising that it learned as much as it did.

The purpose of demonstrating LDA on this small corpus is to show the steps involved in a notebook that will run quickly. A later project notebook will show results from running LDA on a larger corpus.

## LSI

In [20]:
# build an LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

In [21]:
print("LSI Model Results")
for i in range(NUM_TOPICS):
    print("\nTopic #%s:" % i, lsi_model.print_topic(i, 10))

LSI Model Results

Topic #0: 0.628*"inflation" + 0.233*"price" + 0.219*"goods" + 0.191*"prices" + 0.185*"rate" + 0.136*"index" + 0.106*"would" + 0.106*"one" + 0.105*"year" + 0.103*"interest"

Topic #1: 0.440*"system" + 0.272*"sympathetic" + -0.225*"inflation" + 0.210*"autonomic" + 0.178*"parasympathetic" + 0.171*"fibers" + 0.149*"receptors" + 0.142*"nervous" + 0.123*"postganglionic" + 0.123*"ganglia"

Topic #2: 0.350*"party" + 0.350*"damages" + 0.330*"contract" + -0.218*"inflation" + 0.186*"islands" + 0.181*"breach" + 0.177*"nonbreaching" + 0.172*"would" + 0.154*"may" + 0.128*"one"

Topic #3: 0.463*"islands" + 0.238*"island" + -0.184*"party" + -0.184*"damages" + -0.182*"contract" + 0.163*"antarctica" + 0.146*"ozone" + 0.138*"many" + 0.134*"pacific" + 0.121*"world"


In [22]:
coherence3 = CoherenceModel(model=lsi_model,
                           texts=preprocessed_docs, dictionary=dictionary, coherence='c_v')
print('Coherence score:', coherence3.get_coherence())

Coherence score: 0.6470303253793954
