# **Topic Modeling Example**

This topic modeling example uses the CMU Book Summary Dataset, which is available as booksummaries.txt on D2L.

To use this notebook with the dataset, download the book summaries file and put it somewhere the notebook can access. Note that you may need to change the path stored in the variable `data`.

Preprocessing has a large effect on how informative the resulting topics are.
What are some common preprocessing steps used before applying topic modeling methods?
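As a warm-up before the full NLTK pipeline, here is a minimal sketch of three common steps (lowercasing, dropping non-alphabetic tokens, removing stopwords) using only the standard library. The function `toy_preprocess` and the tiny stopword list are hypothetical and only for illustration; the actual notebook cell uses NLTK's tokenizer, stopword list, POS tagger, and lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer).
TOY_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def toy_preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: runs of letters
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in TOY_STOPWORDS]


print(toy_preprocess("The detective finds 2 clues in the old house."))
```

Note that this sketch does not lemmatize, so "finds" and "clues" keep their inflected forms.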

In [None]:
import nltk
import numpy as np
nltk.download('popular')
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

data = "booksummaries.txt"

stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()


# Tokenize and POS-tag, then keep alphabetic tokens that are neither stopwords nor
# tagged "IN" (preposition or subordinating conjunction); lowercase and lemmatize the rest.
def preprocess(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tokenized = [token.lower() for token, pos in tagged if token.isalpha() and pos != "IN" and token.lower() not in stop]
    normalized = [lemma.lemmatize(word) for word in tokenized]
    return normalized


# Each line holds tab-separated fields; the summary text is the field at index 6.
book_summaries = []
with open(data, encoding="utf-8") as f:
    for line in f:
        temp = line.split("\t")
        book_summaries.append(preprocess(temp[6]))


Topic modeling also tends to perform better if we remove extremes. The `filter_extremes` method in gensim is used to remove words that appear too frequently (not necessarily stop words) or too rarely in the corpus.

In this example, the `filter_extremes` method is used to remove words that appear in fewer than 10 documents or in more than 50% of the documents.
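The filtering rule can be sketched in plain Python on a toy corpus. This is only an illustration of the idea, not gensim's implementation: `toy_filter_extremes` is a hypothetical helper, the thresholds are scaled down to fit four documents, and the `keep_n` cap is left out.

```python
from collections import Counter


def toy_filter_extremes(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # fraction converted to a document count
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["king", "ship", "one"],
    ["king", "war", "one"],
    ["murder", "police", "one"],
    ["murder", "king", "one"],
]
kept = toy_filter_extremes(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "one" is dropped because it appears in every document, while "ship", "war", and "police" are dropped because each appears in only one.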

In [None]:
from gensim.corpora import Dictionary

# Filter infrequent or too frequent words.
dictionary = Dictionary(book_summaries)
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)
corpus = [dictionary.doc2bow(summary) for summary in book_summaries]

Now we can fit an LDA model to our corpus. The `id2word` argument is optional, but it makes the topics easier to inspect later.

Without `id2word`, LDA operates only on the integer word IDs in the corpus. These IDs are not human-readable, so we map them back to their corresponding words to interpret the topics generated by the model.
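To make the ID mapping concrete, the sketch below builds a bag-of-words representation by hand for two tiny documents. The names `token2id`, `id2token`, and `bows` are illustrative, and assigning IDs in order of first appearance is an assumption made for simplicity; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter

docs = [["king", "ship", "king"], ["ship", "war"]]

# Assign each distinct token an integer ID in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Reverse mapping, playing the role that id2word plays for the model.
id2token = {token_id: token for token, token_id in token2id.items()}

# Bag-of-words: sorted (token_id, count) pairs for each document.
bows = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

print(bows)
print([[(id2token[i], n) for i, n in bow] for bow in bows])
```

The first printed line is what the model sees; the second is the same data after mapping IDs back to words.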

In [None]:
from gensim.models import LdaModel
from pprint import pprint

model = LdaModel(corpus=corpus, id2word=dictionary, random_state=1, iterations=500, num_topics=10)
top_topics = list(model.top_topics(corpus))
pprint(top_topics)



[([(0.007172093, 'one'),
   (0.006260412, 'world'),
   (0.006104435, 'time'),
   (0.0047533745, 'book'),
   (0.004753092, 'life'),
   (0.0045972797, 'story'),
   (0.0039292523, 'human'),
   (0.0036683064, 'also'),
   (0.0035807546, 'people'),
   (0.003492045, 'first'),
   (0.0032407825, 'new'),
   (0.0032233659, 'way'),
   (0.0031646132, 'would'),
   (0.0029774294, 'two'),
   (0.0029199943, 'find'),
   (0.0028721034, 'year'),
   (0.0028581873, 'novel'),
   (0.00264098, 'take'),
   (0.0025657848, 'begin'),
   (0.0025502192, 'back')],
  -0.8932805154124046),
 ([(0.007390181, 'life'),
   (0.0061942693, 'mother'),
   (0.0060234047, 'family'),
   (0.0060004354, 'father'),
   (0.0055693286, 'one'),
   (0.00553425, 'friend'),
   (0.0053656716, 'love'),
   (0.0051521845, 'go'),
   (0.0049002254, 'find'),
   (0.004680428, 'school'),
   (0.0044564395, 'new'),
   (0.004296794, 'get'),
   (0.004284089, 'time'),
   (0.004252952, 'tell'),
   (0.004212913, 'year'),
   (0.0040181857, 'woman'),
   (0.0

In [None]:
for idx in range(10):
    print("Topic #%s:" % idx, model.print_topic(idx, 10))
print("=" * 20)

Topic #0: 0.007*"life" + 0.006*"mother" + 0.006*"family" + 0.006*"father" + 0.006*"one" + 0.006*"friend" + 0.005*"love" + 0.005*"go" + 0.005*"find" + 0.005*"school"
Topic #1: 0.005*"book" + 0.005*"novel" + 0.005*"alex" + 0.004*"also" + 0.004*"one" + 0.004*"tom" + 0.004*"jake" + 0.004*"charlie" + 0.003*"simon" + 0.003*"chapter"
Topic #2: 0.005*"one" + 0.005*"book" + 0.005*"war" + 0.005*"new" + 0.004*"state" + 0.004*"also" + 0.004*"human" + 0.004*"world" + 0.004*"people" + 0.003*"earth"
Topic #3: 0.007*"king" + 0.006*"find" + 0.006*"ship" + 0.005*"one" + 0.004*"monk" + 0.004*"take" + 0.004*"book" + 0.004*"story" + 0.004*"time" + 0.004*"two"
Topic #4: 0.007*"one" + 0.006*"world" + 0.006*"time" + 0.005*"book" + 0.005*"life" + 0.005*"story" + 0.004*"human" + 0.004*"also" + 0.004*"people" + 0.003*"first"
Topic #5: 0.012*"murder" + 0.009*"case" + 0.007*"david" + 0.007*"police" + 0.006*"one" + 0.006*"man" + 0.005*"george" + 0.005*"found" + 0.005*"find" + 0.005*"killer"
Topic #6: 0.006*"richard

**Exercise: Try changing, adding, or removing some preprocessing steps. How do the topics change?**

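Gensim also provides a probability distribution over topics for each document. Each entry gives the share of the document that the model attributes to a particular topic. The full distribution sums to 1, but gensim omits topics whose probability falls below a minimum threshold, so the values listed for a document can sum to slightly less than 1.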

In [None]:
#get document topics restricted to the first 10 docs
num_documents_to_display = 10
document_topics = model.get_document_topics(corpus[:num_documents_to_display])

#display the document topics
for i, doc_topics in enumerate(document_topics):
    print(f"Document {i + 1}:")
    pprint(doc_topics)
    print()

Document 1:
[(1, 0.040953256),
 (3, 0.13384454),
 (4, 0.31740758),
 (6, 0.020126289),
 (7, 0.2810832),
 (9, 0.20576574)]

Document 2:
[(0, 0.31120548),
 (1, 0.51216626),
 (2, 0.037406337),
 (3, 0.072301),
 (4, 0.03857037),
 (6, 0.027546467)]

Document 3:
[(0, 0.3877), (2, 0.2861788), (3, 0.14573894), (4, 0.10244911), (9, 0.07699939)]

Document 4:
[(1, 0.032403637), (2, 0.27764016), (4, 0.6894216)]

Document 5:
[(2, 0.5923614), (4, 0.40548846)]

Document 6:
[(0, 0.5092008), (2, 0.05386992), (4, 0.37353978), (7, 0.06168125)]

Document 7:
[(0, 0.08626161), (4, 0.3028647), (7, 0.08069305), (8, 0.5289057)]

Document 8:
[(0, 0.061200928),
 (1, 0.061960027),
 (5, 0.07761176),
 (6, 0.55218655),
 (8, 0.088604875),
 (9, 0.1577691)]

Document 9:
[(2, 0.15813407), (3, 0.43972838), (5, 0.37517282)]

Document 10:
[(0, 0.13127612),
 (1, 0.047344062),
 (3, 0.6132021),
 (6, 0.03326719),
 (7, 0.07536973),
 (9, 0.09671484)]
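The listed probabilities can be checked directly. The sketch below sums the entries shown for Document 5 in the output above. The small remainder is presumably the mass of topics that `get_document_topics` left out because they fell under its `minimum_probability` threshold (an assumption about the cause; passing `minimum_probability=0` should list every topic).

```python
# Values copied from the "Document 5" output in this notebook.
doc5_topics = [(2, 0.5923614), (4, 0.40548846)]

shown_mass = sum(prob for _, prob in doc5_topics)
hidden_mass = 1.0 - shown_mass  # spread across the topics that were not listed

print(f"shown: {shown_mass:.4f}, hidden: {hidden_mass:.4f}")
```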



## **Measuring Coherence**

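In gensim, you can measure the coherence of topics generated by LDA using the `CoherenceModel` class from the `gensim.models.coherencemodel` module. Several coherence measures are available.

For every measure, a higher score indicates more coherent, more interpretable topics. Scores are only comparable within the same measure: `c_v` scores typically fall between 0 and 1, while `u_mass` scores are typically negative, with values closer to 0 being better.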

The `c_v` coherence scores a topic by how often its top words co-occur within a sliding context window over the texts. The co-occurrence counts are turned into normalized pointwise mutual information (NPMI) values, each top word is represented by its vector of NPMI values against the topic's top words, and these vectors are compared by cosine similarity. The similarities are averaged to obtain the coherence score for that topic, and the topic scores are averaged to obtain the score for the model.
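Window-based coherence measures build on pairwise association scores such as normalized pointwise mutual information (NPMI). The sketch below computes NPMI for a word pair on a toy corpus. It is a simplification under stated assumptions: probabilities are estimated from whole-document co-occurrence rather than sliding windows, and `npmi` is a hypothetical helper, not a gensim function.

```python
import math


def npmi(word_a, word_b, docs):
    """NPMI of two words, with probabilities estimated from document co-occurrence."""
    n = len(docs)
    p_a = sum(word_a in doc for doc in docs) / n
    p_b = sum(word_b in doc for doc in docs) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # both words occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


docs = [
    {"murder", "police", "killer"},
    {"murder", "police"},
    {"king", "ship"},
    {"king", "ship", "police"},
]
print(round(npmi("murder", "police", docs), 3))
print(round(npmi("murder", "king", docs), 3))
```

Words that tend to appear together get a positive score, and words that never share a document get the minimum score of -1.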

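The `u_mass` coherence is based on document co-occurrence counts rather than a sliding window. For each pair of top words in a topic, it takes the logarithm of the conditional probability of the lower-ranked word given the higher-ranked one: the number of documents containing both words (plus a small smoothing constant) divided by the number of documents containing the higher-ranked word. The `u_mass` coherence of a topic is the average of these values over all word pairs in the topic. Measures based on pointwise mutual information are available separately as `c_uci` and `c_npmi`.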
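A UMass-style score can be sketched on a toy corpus using log conditional probabilities estimated from document counts. This is a minimal illustration, not gensim's pipeline: `umass_topic_score` and the smoothing constant `eps` are assumptions, the top words are assumed to be ordered from most to least probable, and every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations


def umass_topic_score(top_words, docs, eps=1e-12):
    """Average log conditional probability over ordered pairs of a topic's top words."""
    def doc_count(*words):
        return sum(all(w in doc for w in words) for doc in docs)

    n = len(docs)
    scores = []
    # Each word is conditioned on every word ranked before it.
    for earlier, later in combinations(top_words, 2):
        joint = doc_count(earlier, later) / n
        scores.append(math.log((joint + eps) / (doc_count(earlier) / n)))
    return sum(scores) / len(scores)


docs = [
    {"murder", "police", "killer"},
    {"murder", "police"},
    {"king", "ship"},
    {"king", "ship", "police"},
]
coherent = umass_topic_score(["murder", "police", "killer"], docs)
mixed = umass_topic_score(["murder", "king", "ship"], docs)
print(round(coherent, 3), round(mixed, 3))
```

Both scores are negative, and the topic whose words actually share documents scores closer to 0.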

In [None]:
from gensim.models.coherencemodel import CoherenceModel

#c_v requires the tokenized texts; since we filtered extremes in the dictionary,
#keep only the words that are still in it
filtered_words = set(dictionary.values())
filtered_texts = [
    [word for word in doc if word in filtered_words]
    for doc in book_summaries
]

# Calculate the coherence score (pass coherence='u_mass' to use the UMass measure instead)
coherence_model = CoherenceModel(model=model, texts=filtered_texts, dictionary=dictionary, coherence='c_v')
coherence = coherence_model.get_coherence()
print("The coherence score is:", coherence)

The coherence score is: 0.2999683155451401


## **Visualization**

In [None]:
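#!pip install pyLDAvis
import pyLDAvis
#pyLDAvis 3.x renamed the gensim adapter module; fall back to the old name on older versions
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(model, corpus, dictionary)
vis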



**Exercise: Try running LSA from Gensim on this data. How do the topics differ from the LDA topics?**

We can also perform LSA with scikit-learn instead.
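LSA amounts to a truncated singular value decomposition (SVD) of the document-term matrix. As a rough illustration of what that computes, the sketch below finds only the leading component of a tiny matrix by power iteration in plain Python. The helper `top_singular_triplet` and the toy counts are hypothetical, and scikit-learn's `TruncatedSVD` uses different, more efficient solvers.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Power iteration for the leading singular value and vectors of a small dense matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # initial direction in term space
    for _ in range(iterations):
        # u = A v: project each document onto the current term direction.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u: pull the result back into term space, then renormalize.
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, u, v


# Rows are documents; columns are counts of "murder", "police", "king", "ship".
dtm_toy = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
sigma, doc_scores, term_weights = top_singular_triplet(dtm_toy)
print(round(sigma, 3))
print([round(x, 3) for x in doc_scores])
print([round(x, 3) for x in term_weights])
```

Here `doc_scores` plays the role of one column of the document-topic matrix and `term_weights` the role of one row of the component matrix: the leading component loads on "murder" and "police" and scores the first two documents highly.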

In [None]:
#perform SVD on the document-term matrix
import numpy as np
import pandas as pd
from gensim import matutils
from sklearn.decomposition import TruncatedSVD

#corpus2dense returns a dense (terms x documents) array,
#so transpose it to get a documents x terms matrix
dtm = np.array(matutils.corpus2dense(corpus, num_terms=len(dictionary)), dtype=float).T

svd = TruncatedSVD(n_components=10)
doc_topic_matrix = svd.fit_transform(dtm)

#create a pandas DataFrame from the doc_topic_matrix
df = pd.DataFrame(doc_topic_matrix, columns=["Topic" + str(i) for i in range(10)])

#inspect the first few rows of the DataFrame; each row represents a document
print(df.head(10), '\n')

#feature names (terms) from the Gensim Dictionary, ordered by word ID to match the matrix columns
terms = [dictionary[i] for i in range(len(dictionary))]

#get the topic-word contributions; SVD weights can be negative, so rank terms
#by their raw weight rather than dividing by a row sum that may be negative
for i, comp in enumerate(svd.components_):
    terms_comp = list(zip(terms, comp))
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:10]
    topic_str = "Topic #{}: {}".format(i, ' + '.join(["{:.3f}*'{}'".format(weight, term) for term, weight in sorted_terms]))
    print(topic_str + '\n')

     Topic0    Topic1    Topic2    Topic3    Topic4    Topic5    Topic6  \
0  3.722577  1.376190  0.486538 -1.624921  0.256548 -0.169981 -0.237847   
1  7.412378 -1.939929  2.983865 -4.385954  2.584215  0.009786  1.306758   
2  1.748346  0.616566  0.417420 -0.139057 -0.257538  0.056446  0.215195   
3  3.803363  0.609263 -0.710584 -0.088643 -0.178610 -0.117717  1.062154   
4  3.399859 -1.592972  1.284432 -1.420248  0.992637  0.029560  0.344733   
5  0.914273 -0.491636  0.329553 -0.147690  0.643775  0.210980 -0.312713   
6  1.505289 -0.520330  0.487330 -0.662778  0.464127  0.061806  0.249010   
7  1.740669 -0.076497 -0.129620 -0.596571 -0.546300 -0.260982 -0.449808   
8  0.357986 -0.200926  0.119294 -0.105072  0.020863 -0.079875 -0.078935   
9  1.678758 -0.099019 -0.231543 -0.764726  0.112042 -0.510814 -0.264586   

     Topic7    Topic8    Topic9  
0  0.158940  0.013901 -0.434231  
1 -1.263470 -1.079877 -0.885399  
2 -0.227670 -0.069438 -0.202586  
3 -0.729579 -0.937814 -0.945188  
4 -0