In [1]:
# https://www.machinelearningplus.com/nlp/gensim-tutorial/

### 11. How to create Topic Models with LDA?

The objective of topic models is to extract the underlying topics from a given collection of text documents. Each document in the collection is treated as a mixture of topics, and each topic as a mixture of related words.

Topic modeling can be done by algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI).

In both cases you need to provide the number of topics as input. The topic model, in turn, will provide the topic keywords for each topic and the percentage contribution of topics in each document.
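
To make the "combination" idea concrete, here is a minimal sketch with made-up numbers. The topic names, words, and probabilities are all hypothetical and are not output from any model. Under this view, the probability of seeing a word in a document is the topic-weighted average of that word's probability in each topic.

```python
# Hypothetical toy numbers, chosen only to illustrate the mixture idea.
# Each topic is a probability distribution over words.
topic_word = {
    "sports":   {"ball": 0.50, "team": 0.40, "vote": 0.10},
    "politics": {"ball": 0.05, "team": 0.15, "vote": 0.80},
}

# Each document is a probability distribution over topics.
doc_topic = {"sports": 0.7, "politics": 0.3}

# p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)
vocab = ["ball", "team", "vote"]
word_prob = {
    w: sum(doc_topic[t] * topic_word[t][w] for t in doc_topic)
    for w in vocab
}

for w in vocab:
    print(f"{w}: {word_prob[w]:.3f}")
```

A topic model works in the opposite direction: it sees only the word counts and estimates both tables, given the number of topics you ask for.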

The quality of topics depends heavily on the quality of text processing and on the number of topics you provide to the algorithm. The earlier post on how to build the best topic models explains the procedure in more detail. However, I recommend first understanding the basic steps involved and how to interpret the results in the example below.

Step 0: Load the necessary packages and import the stopwords.

In [18]:
# Step 0: Import packages and stopwords
from gensim.models import LdaModel, LdaMulticore
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess, lemmatize  # note: lemmatize was removed in gensim 4.0, so this import needs gensim < 4
from nltk.corpus import stopwords
import re
import logging
from pprint import pprint
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)
stop_words = stopwords.words('english')
stop_words = stop_words + ['com', 'edu', 'subject', 'lines', 'organization', 'would', 'article', 'could']

In [3]:
# Step 1: Download the text8 dataset and load its documents (lists of tokens) into a list
dataset = api.load("text8")
data = [d for d in dataset]

Step 2: Prepare the downloaded data by removing stopwords and lemmatizing it. For lemmatization, gensim requires the pattern package, so be sure to run pip install pattern in your terminal or prompt before running this. If pip fails with 'OSError: mysql_config not found', then on OSX run: brew install mariadb

I have set up lemmatization so that only Nouns (NN), Adjectives (JJ) and Adverbs (RB) are retained, because I prefer only such words as topic keywords. This is a personal choice.
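
As a rough illustration of what the tag pattern `(NN|JJ|RB)` lets through, the sketch below applies it to a few hard-coded (token, tag) pairs. The pairs are hypothetical and stand in for the output of a POS tagger. This shows only the regular-expression behaviour and makes no claim about how gensim applies the pattern internally.

```python
import re

# Same pattern as passed to allowed_tags in the commented-out cell below.
allowed_tags = re.compile('(NN|JJ|RB)')

# Hypothetical (token, Penn Treebank tag) pairs, hard-coded for illustration;
# in practice a POS tagger would produce these.
tagged = [
    ("anarchism", "NN"), ("originated", "VBD"), ("as", "IN"),
    ("term", "NN"), ("abuses", "NNS"), ("first", "RB"), ("early", "JJ"),
]

# re.match anchors at the start of the tag, so NNS (plural noun) also passes.
kept = [tok for tok, tag in tagged if allowed_tags.match(tag)]
print(kept)
```

Because the match is on the tag prefix, related tags such as NNS, JJR, or RBR pass as well, while verb tags such as VBD do not.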

In [4]:
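# The gensim lemmatize() version below raised a StopIteration error,
# so it is left commented out; the spaCy version in the next cell is used instead.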
# Step 2: Prepare Data (Remove stopwords and lemmatize)
data_processed = []

# for i, doc in enumerate(data[:100]):
#     doc_out = []
#     for wd in doc:
#         if wd not in stop_words:  # remove stopwords
#             lemmatized_word = lemmatize(wd, allowed_tags=re.compile('(NN|JJ|RB)'))  # lemmatize
#             if lemmatized_word:
#                 doc_out = doc_out + [lemmatized_word[0].split(b'/')[0].decode('utf-8')]
#         else:
#             continue
#     data_processed.append(doc_out)

# Print a small sample    
# print(data_processed[0][:5]) 
#> ['anarchism', 'originated', 'term', 'abuse', 'first']

In [5]:
import spacy

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
#         print(sent, doc)
        texts_out.append([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if
                               token.pos_ in allowed_postags])
    return texts_out


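# Initialize the spaCy 'en' model, keeping only the tagger component (for efficiency)
# Run in terminal: python3 -m spacy download en
# Note: the 'en' shortcut is a spaCy 2.x convention; spaCy 3+ uses full names such as 'en_core_web_sm'
nlp = spacy.load('en', disable=['parser', 'ner'])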

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_processed = lemmatization(data[:100], allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

data_processed is now a list of lists of words. You can use it to create the Dictionary and Corpus, which are then used as inputs to the LDA model.

In [7]:
# Step 3: Create the Inputs of LDA model: Dictionary and Corpus
dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]

2019-03-24 15:38:00,100 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-03-24 15:38:00,555 : INFO : built Dictionary(43849 unique tokens: ['abacus', 'ability', 'able', 'abnormal', 'abolish']...) from 100 documents (total 575401 corpus positions)
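
For intuition about what the Dictionary and the bag-of-words corpus contain, here is a simplified pure-Python sketch on two toy documents. The documents and the helper name `toy_doc2bow` are hypothetical, and the id assignment is a simplification that may differ in detail from gensim's.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, for illustration only).
toy_docs = [
    ["topic", "model", "topic", "word"],
    ["word", "document", "topic"],
]

# Give each distinct token an integer id (simplified scheme: new tokens
# of each document are numbered in alphabetical order).
token2id = {}
for doc in toy_docs:
    for tok in sorted(set(doc)):
        token2id.setdefault(tok, len(token2id))

def toy_doc2bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [toy_doc2bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Each document becomes a sparse list of (id, count) pairs; word order is discarded, and that list is the only input the topic model sees.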


The Dictionary and Corpus are now created. The next cell shows the corpus in human-readable form; after that, let’s build an LDA topic model with 7 topics using LdaMulticore(). The choice of 7 topics is arbitrary for now.

In [12]:
# dct.token2id
[[(dct[id], count) for id, count in line] for line in corpus]

[[('abacus', 1),
  ('ability', 4),
  ('able', 7),
  ('abnormal', 1),
  ('abolish', 1),
  ('abolition', 1),
  ('about', 1),
  ('absence', 2),
  ('absolute', 2),
  ('abstention', 1),
  ('absurdity', 1),
  ('abundance', 1),
  ('abuse', 2),
  ('academic', 1),
  ('accept', 4),
  ('access', 2),
  ('accidentally', 1),
  ('accommodation', 1),
  ('accord', 3),
  ('accordance', 1),
  ('account', 1),
  ('accumulate', 1),
  ('accurately', 1),
  ('accusation', 1),
  ('achieve', 1),
  ('achievement', 2),
  ('act', 7),
  ('acting', 1),
  ('action', 12),
  ('active', 4),
  ('activism', 1),
  ('activist', 1),
  ('activity', 7),
  ('adam', 1),
  ('address', 1),
  ('adherent', 1),
  ('adjective', 1),
  ('adjust', 1),
  ('adopt', 2),
  ('adult', 4),
  ('advance', 2),
  ('advocate', 12),
  ('affect', 3),
  ('affection', 1),
  ('affinity', 1),
  ('african', 1),
  ('afterwards', 1),
  ('age', 7),
  ('aggression', 2),
  ('aggressive', 1),
  ('agitate', 1),
  ('agitation', 1),
  ('agitator', 1),
  ('agree', 1)

In [13]:
# Step 4: Train the LDA model
lda_model = LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=7,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

# save the model
lda_model.save('lda_model.model')

# See the topics
lda_model.print_topics(-1)

2019-03-24 15:39:23,735 : INFO : using asymmetric alpha [0.26219156, 0.19027454, 0.14931786, 0.12287004, 0.104381524, 0.090729296, 0.080235206]
2019-03-24 15:39:23,736 : INFO : using symmetric eta at 0.14285714285714285
2019-03-24 15:39:23,746 : INFO : using serial LDA version on this node
2019-03-24 15:39:23,785 : INFO : running online LDA training, 7 topics, 10 passes over the supplied corpus of 100 documents, updating every 7000 documents, evaluating every ~0 documents, iterating 100x with a convergence threshold of 0.001000
2019-03-24 15:39:23,796 : INFO : training LDA model using 7 processes
2019-03-24 15:39:23,853 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #100/100, outstanding queue size 1
2019-03-24 15:39:24,824 : INFO : topic #6 (0.080): 0.002*"be" + 0.001*"have" + 0.000*"agave" + 0.000*"american" + 0.000*"not" + 0.000*"also" + 0.000*"year" + 0.000*"use" + 0.000*"state" + 0.000*"s"
2019-03-24 15:39:24,826 : INFO : topic #5 (0.091): 0.015*"be" + 0.004*"hav

2019-03-24 15:39:28,380 : INFO : topic #1 (0.190): 0.003*"be" + 0.001*"have" + 0.000*"american" + 0.000*"s" + 0.000*"use" + 0.000*"other" + 0.000*"year" + 0.000*"not" + 0.000*"also" + 0.000*"first"
2019-03-24 15:39:28,381 : INFO : topic #0 (0.262): 0.009*"be" + 0.002*"have" + 0.001*"not" + 0.001*"state" + 0.001*"use" + 0.001*"other" + 0.001*"first" + 0.001*"s" + 0.001*"also" + 0.001*"most"
2019-03-24 15:39:28,383 : INFO : topic diff=0.102739, rho=0.119438
2019-03-24 15:39:28,384 : INFO : PROGRESS: pass 7, dispatched chunk #0 = documents up to #100/100, outstanding queue size 1
2019-03-24 15:39:28,872 : INFO : topic #6 (0.080): 0.001*"be" + 0.001*"have" + 0.000*"agave" + 0.000*"american" + 0.000*"not" + 0.000*"also" + 0.000*"year" + 0.000*"use" + 0.000*"state" + 0.000*"s"
2019-03-24 15:39:28,874 : INFO : topic #5 (0.091): 0.014*"be" + 0.003*"have" + 0.001*"other" + 0.001*"also" + 0.001*"not" + 0.001*"use" + 0.001*"s" + 0.001*"american" + 0.001*"time" + 0.001*"first"
2019-03-24 15:39:28,

[(0,
  '0.008*"be" + 0.002*"have" + 0.001*"not" + 0.001*"state" + 0.001*"use" + 0.001*"other" + 0.001*"first" + 0.001*"s" + 0.001*"also" + 0.001*"most"'),
 (1,
  '0.003*"be" + 0.001*"have" + 0.000*"american" + 0.000*"s" + 0.000*"use" + 0.000*"other" + 0.000*"year" + 0.000*"not" + 0.000*"also" + 0.000*"first"'),
 (2,
  '0.004*"be" + 0.001*"have" + 0.000*"also" + 0.000*"not" + 0.000*"state" + 0.000*"use" + 0.000*"many" + 0.000*"first" + 0.000*"can" + 0.000*"most"'),
 (3,
  '0.005*"be" + 0.001*"have" + 0.001*"also" + 0.001*"not" + 0.001*"use" + 0.000*"s" + 0.000*"american" + 0.000*"first" + 0.000*"many" + 0.000*"see"'),
 (4,
  '0.051*"be" + 0.011*"have" + 0.005*"not" + 0.004*"also" + 0.004*"use" + 0.004*"other" + 0.004*"s" + 0.003*"state" + 0.003*"american" + 0.003*"first"'),
 (5,
  '0.013*"be" + 0.003*"have" + 0.001*"other" + 0.001*"also" + 0.001*"not" + 0.001*"use" + 0.001*"s" + 0.001*"american" + 0.001*"time" + 0.001*"first"'),
 (6,
  '0.001*"be" + 0.000*"have" + 0.000*"agave" + 0.000*

lda_model.print_topics() shows which words contribute to each of the 7 topics, along with the weight of each word’s contribution to that topic.

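You can see words like ‘also’ and ‘many’ appearing across different topics, along with very common words such as ‘be’ and ‘have’. The spaCy lemmatization step above does not filter against stop_words, so I would add such words to the stop_words list, remove them from data_processed, and further tune the topic model for the optimal number of topics.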
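
One way to spot such words systematically is to parse the topic strings and intersect their word sets. The sketch below does this for three of the topic strings copied from the print_topics output above (topics 0, 2 and 3); the helper name `top_words` is made up for this example.

```python
import re

# Topic strings copied from the print_topics output above (topics 0, 2, 3).
topics = {
    0: '0.008*"be" + 0.002*"have" + 0.001*"not" + 0.001*"state" + 0.001*"use" + '
       '0.001*"other" + 0.001*"first" + 0.001*"s" + 0.001*"also" + 0.001*"most"',
    2: '0.004*"be" + 0.001*"have" + 0.000*"also" + 0.000*"not" + 0.000*"state" + '
       '0.000*"use" + 0.000*"many" + 0.000*"first" + 0.000*"can" + 0.000*"most"',
    3: '0.005*"be" + 0.001*"have" + 0.001*"also" + 0.001*"not" + 0.001*"use" + '
       '0.000*"s" + 0.000*"american" + 0.000*"first" + 0.000*"many" + 0.000*"see"',
}

def top_words(topic_string):
    """Extract the quoted words from a gensim topic string."""
    return set(re.findall(r'\*"([^"]+)"', topic_string))

# Words that appear in the top-10 list of every topic carry little signal.
shared = set.intersection(*(top_words(s) for s in topics.values()))
print(sorted(shared))
```

The resulting words are candidates for the stop_words list; review them before adding, since a word like ‘first’ or ‘use’ may still be meaningful for your corpus.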

LdaMulticore() supports parallel processing. Alternatively, you could try LdaModel() and see what topics it gives.
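
A side note on the training log above: it reports `using asymmetric alpha [0.26219156, 0.19027454, ...]` and `using symmetric eta at 0.14285714285714285`. The sketch below reproduces those numbers under the assumption that the asymmetric prior gives topic i a weight proportional to 1 / (i + sqrt(num_topics)). That formula is an assumption chosen because it matches the logged values; check the gensim source for your version before relying on it.

```python
import math

num_topics = 7

# Unnormalized weight for topic i: 1 / (i + sqrt(num_topics)).
# Assumption: this mirrors how the 'asymmetric' alpha in the log was built.
raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
total = sum(raw)
alpha = [r / total for r in raw]

# Symmetric eta as reported in the log: every word gets 1 / num_topics.
eta = 1.0 / num_topics

print([round(a, 4) for a in alpha])
print(eta)
```

The practical reading is that alpha='asymmetric' expects earlier topics to be more prevalent in documents than later ones, while the symmetric eta treats all words alike within a topic.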

### 12. How to interpret the LDA Topic Model’s output?
The lda_model object supports indexing. That is, if you pass a document in bag-of-words form (a list of (word id, count) pairs) to lda_model, it returns 3 things (because the model was trained with per_word_topics=True):

1. The topic(s) that document belongs to, along with the percentage contribution.
2. The topic(s) each word in that document belongs to.
3. The topic(s) each word in that document belongs to AND the phi values.

So, what is a phi value?

The phi value is the probability of the word belonging to that particular topic, scaled by the number of times the word occurs in the document. As a result, the phi values for a given word sum (approximately) to the number of times that word occurred in that document.

For example, in the output below, the first document shown is corpus[5]: the word with id=1 ('ability') belongs to topic number 4 and its phi value is 2.998. That means 'ability' appeared 3 times in that document.

In [14]:
for c in lda_model[corpus[5:8]]:
    print("Document Topics      : ", c[0])      # [(Topics, Perc Contrib)]
    print("Word id, Topics      : ", c[1][:3])  # [(Word id, [Topics])]
    print("Phi Values (word id) : ", c[2][:2])  # [(Word id, [(Topic, Phi Value)])]
    print("Word, Topics         : ", [(dct[wd], topic) for wd, topic in c[1][:2]])   # [(Word, [Topics])]
    print("Phi Values (word)    : ", [(dct[wd], topic) for wd, topic in c[2][:2]])  # [(Word, [(Topic, Phi Value)])]
    print("------------------------------------------------------\n")

Document Topics      :  [(4, 0.99983835)]
Word id, Topics      :  [(1, [4]), (6, [4]), (10, [4])]
Phi Values (word id) :  [(1, [(4, 2.9978013)]), (6, [(4, 1.9995625)])]
Word, Topics         :  [('ability', [4]), ('about', [4])]
Phi Values (word)    :  [('ability', [(4, 2.9978013)]), ('about', [(4, 1.9995625)])]
------------------------------------------------------

Document Topics      :  [(4, 0.99984616)]
Word id, Topics      :  [(1, [4]), (6, [4]), (13, [4])]
Phi Values (word id) :  [(1, [(4, 5.995603)]), (6, [(4, 1.999563)])]
Word, Topics         :  [('ability', [4]), ('about', [4])]
Phi Values (word)    :  [('ability', [(4, 5.995603)]), ('about', [(4, 1.999563)])]
------------------------------------------------------

Document Topics      :  [(4, 0.9998578)]
Word id, Topics      :  [(2, [4]), (6, [4]), (13, [4])]
Phi Values (word id) :  [(2, [(4, 0.99949414)]), (6, [(4, 0.9997814)])]
Word, Topics         :  [('able', [4]), ('about', [4])]
Phi Values (word)    :  [('able', [(4, 0.
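
The relationship between phi values and word counts can be checked with a small sketch. The numbers are copied from the first two documents in the output above; the helper name `approx_counts` is made up for this example.

```python
# Values copied from the first two documents in the output above, in the
# shape [(word_id, [(topic_id, phi_value), ...]), ...].
phi_doc_a = [(1, [(4, 2.9978013)]), (6, [(4, 1.9995625)])]
phi_doc_b = [(1, [(4, 5.995603)]), (6, [(4, 1.999563)])]

def approx_counts(word_phis):
    """Sum phi over the listed topics per word and round to an integer count."""
    return {
        word_id: round(sum(phi for _topic, phi in topic_phis))
        for word_id, topic_phis in word_phis
    }

print(approx_counts(phi_doc_a))
print(approx_counts(phi_doc_b))
```

The sums fall slightly short of whole numbers, presumably because small contributions from the other topics are not listed in the output.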

### 13. How to create an LSI topic model using gensim?
The syntax for using an LSI model is similar to how we built the LDA model, except that we will use the LsiModel().
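
One difference from LDA is worth knowing before reading the output: LSI is based on a singular value decomposition, so word weights in a topic can be negative. The overall sign of a topic is arbitrary; what matters is which words share a sign and how large the weights are. The sketch below splits one topic string, copied from this notebook's LSI output (topic 3), into its positive and negative groups.

```python
import re

# Topic 3 as printed by this notebook's run of lsi_model.print_topics(-1).
lsi_topic = ('0.226*"state" + 0.211*"lincoln" + -0.178*"player" + '
             '-0.174*"football" + -0.163*"line" + -0.151*"american" + '
             '-0.148*"play" + 0.147*"have" + -0.141*"ball" + 0.118*"atheism"')

# Parse into (word, signed weight) pairs.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', lsi_topic)]

positive = [w for w, wt in pairs if wt > 0]
negative = [w for w, wt in pairs if wt < 0]
print("positive:", positive)
print("negative:", negative)
```

Read this way, the topic contrasts documents heavy in one group of words with documents heavy in the other, which is a different kind of object from an LDA topic's list of probabilities.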

In [21]:
# other info on LSI with gensim: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
# https://nlpforhackers.io/topic-modeling/

In [19]:
from gensim.models import LsiModel

# Build the LSI Model
lsi_model = LsiModel(corpus=corpus, id2word=dct, num_topics=7, decay=0.5)

# View Topics
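pprint(lsi_model.print_topics(-1))
# Note: the '#>' lines below do not match the output of this run, which is printed after the log further down.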
#> [(0, '0.262*"also" + 0.197*"state" + 0.197*"american" + 0.178*"first" + '
#>   '0.151*"many" + 0.149*"time" + 0.147*"year" + 0.130*"person" + 0.130*"world" '
#>   '+ 0.124*"war"'),
#>  (1, '0.937*"agave" + 0.164*"asia" + 0.100*"aruba" + 0.063*"plant" + 0.053*"var" '
#>   '+ 0.052*"state" + 0.045*"east" + 0.044*"congress" + -0.042*"first" + '
#>   '0.041*"maguey"'),
#>  (2, '0.507*"american" + 0.180*"football" + 0.179*"player" + 0.168*"war" + '
#>   '0.150*"british" + -0.140*"also" + 0.114*"ball" + 0.110*"day" + '
#>   '-0.107*"atheism" + -0.106*"god"'),
#>  (3, '-0.362*"apollo" + 0.248*"lincoln" + 0.211*"state" + -0.172*"player" + '
#>   '-0.151*"football" + 0.127*"union" + -0.125*"ball" + 0.124*"government" + '
#>   '-0.116*"moon" + 0.116*"jews"'),
#>  (4, '-0.363*"atheism" + -0.334*"god" + -0.329*"lincoln" + -0.230*"apollo" + '
#>   '-0.215*"atheist" + -0.143*"abraham" + 0.136*"island" + -0.132*"aristotle" + '
#>   '0.124*"aluminium" + -0.119*"belief"'),
#>  (5, '-0.360*"apollo" + 0.344*"atheism" + -0.326*"lincoln" + 0.226*"god" + '
#>   '0.205*"atheist" + 0.139*"american" + -0.130*"lunar" + 0.128*"football" + '
#>   '-0.125*"moon" + 0.114*"belief"'),
#>  (6, '-0.313*"lincoln" + 0.226*"apollo" + -0.166*"football" + -0.163*"war" + '
#>   '0.162*"god" + 0.153*"australia" + -0.148*"play" + -0.146*"ball" + '
#>   '0.122*"atheism" + -0.122*"line"')]

2019-03-24 15:54:28,363 : INFO : using serial LSI version on this node
2019-03-24 15:54:28,365 : INFO : updating model with new documents
2019-03-24 15:54:28,365 : INFO : preparing a new chunk of documents
2019-03-24 15:54:28,444 : INFO : using 100 extra samples and 2 power iterations
2019-03-24 15:54:28,449 : INFO : 1st phase: constructing (43849, 107) action matrix
2019-03-24 15:54:28,470 : INFO : orthonormalizing (43849, 107) action matrix
2019-03-24 15:54:29,541 : INFO : 2nd phase: running dense svd on (107, 100) matrix
2019-03-24 15:54:29,567 : INFO : computing the final decomposition
2019-03-24 15:54:29,568 : INFO : keeping 7 factors (discarding 22.019% of energy spectrum)
2019-03-24 15:54:29,584 : INFO : processed documents up to #100
2019-03-24 15:54:29,586 : INFO : topic #0(3425.455): 0.905*"be" + 0.194*"have" + 0.083*"not" + 0.077*"use" + 0.076*"also" + 0.067*"other" + 0.061*"s" + 0.053*"state" + 0.050*"first" + 0.046*"most"
2019-03-24 15:54:29,588 : INFO : topic #1(577.507):

[(0,
  '0.905*"be" + 0.194*"have" + 0.083*"not" + 0.077*"use" + 0.076*"also" + '
  '0.067*"other" + 0.061*"s" + 0.053*"state" + 0.050*"first" + 0.046*"most"'),
 (1,
  '0.934*"agave" + 0.166*"asia" + 0.099*"aruba" + 0.082*"state" + '
  '0.065*"plant" + 0.050*"east" + 0.050*"century" + 0.049*"island" + '
  '0.046*"congress" + 0.045*"var"'),
 (2,
  '-0.484*"american" + -0.202*"war" + -0.162*"state" + -0.155*"player" + '
  '-0.147*"football" + -0.145*"b" + -0.143*"british" + 0.130*"be" + '
  '-0.122*"day" + -0.112*"australia"'),
 (3,
  '0.226*"state" + 0.211*"lincoln" + -0.178*"player" + -0.174*"football" + '
  '-0.163*"line" + -0.151*"american" + -0.148*"play" + 0.147*"have" + '
  '-0.141*"ball" + 0.118*"atheism"'),
 (4,
  '-0.342*"apollo" + -0.276*"lincoln" + 0.161*"american" + -0.151*"would" + '
  '-0.135*"god" + 0.134*"island" + -0.133*"atheism" + -0.121*"not" + '
  '-0.120*"alexander" + -0.111*"man"'),
 (5,
  '-0.334*"apollo" + 0.178*"football" + 0.172*"atheism" + 0.167*"play" + '
  '