* Example source: https://www.analyticsvidhya.com/blog/2021/06/part-3-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/
* https://github.com/sethns/Latent-Dirichlet-Allocation-LDA-/blob/main/Topic%20Modeling%20_%20Extracting%20Topics_%20Using%20Genism.ipynb

The workflow for running LDA in Python

1. After importing the required libraries, we will compile all the documents into one list to form the corpus.

2. We will perform the following text preprocessing steps (either the spaCy or NLTK library can be used for preprocessing):

    * Convert the text into lowercase
    * Split the text into words
    * Remove the stop words
    * Remove punctuation, symbols, and special characters
    * Normalize the words (I’ll be using lemmatization for normalization)


The next step is to convert the cleaned text into a numerical representation, where the process differs between the gensim and sklearn packages:

3. For sklearn: Use either the count vectorizer or the TF-IDF vectorizer to transform the cleaned text into a Document Term Matrix (DTM), which is a numerical array.

4. For gensim: We don’t need to explicitly create the DTM from scratch. The gensim library has an internal mechanism to create the DTM.


The only requirement for the gensim package is that we need to pass the cleaned data in the form of tokenized words.

5. Next, we pass the vectorized corpus to the LDA model of whichever package is used (gensim or sklearn).

In [21]:
# for text preprocessing
import re
import spacy

# NLTK data is needed once: nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# import numpy for matrix operation
import numpy as np

# Importing Gensim
import gensim
from gensim import corpora

In [14]:
D1 = 'I want to watch a movie this weekend.'
D2 =  'I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.'
D3 =  'I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.'
D4 =  'Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been long!'
D5 =  'This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.'
corpus = [D1, D2, D3, D4, D5]
corpus

['I want to watch a movie this weekend.',
 'I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.',
 'I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.',
 'Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been long!',
 'This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.']

In [24]:
# Apply preprocessing to the corpus
# Stop words
stop = set(stopwords.words('english'))
# Punctuation
exclude = set(string.punctuation)

# Lemmatization
lemma = WordNetLemmatizer()

# One function for all the steps:
def clean(doc):
    # convert text to lower case, split into words, and drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # remove punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # normalize each word by lemmatizing it
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# cleaned data stored in a new list, one list of tokens per document
clean_corpus = [clean(doc).split() for doc in corpus]
clean_corpus

[['want', 'watch', 'movie', 'weekend'],
 ['went',
  'shopping',
  'yesterday',
  'new',
  'zealand',
  'world',
  'test',
  'championship',
  'beating',
  'india',
  'eight',
  'wicket',
  'southampton'],
 ['don’t',
  'watch',
  'cricket',
  'netflix',
  'amazon',
  'prime',
  'good',
  'movie',
  'watch'],
 ['movie',
  'nice',
  'way',
  'chill',
  'however',
  'time',
  'would',
  'like',
  'paint',
  'read',
  'good',
  'book',
  'it’s',
  'long'],
 ['blueberry',
  'milkshake',
  'good',
  'try',
  'reading',
  'dr',
  'joe',
  'dispenza’s',
  'book',
  'work',
  'gamechanger',
  'book',
  'helped',
  'learn',
  'much',
  'thought',
  'impact',
  'biology',
  'rewire',
  'brain']]
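
Note that 'don’t', 'it’s', and 'dispenza’s' survive the cleaning: the text uses the typographic apostrophe (’), which is not part of `string.punctuation` and does not match stop words spelled with the ASCII apostrophe. One possible way to handle this is to normalize the apostrophe before cleaning. The sketch below is only an illustration: `STOP_WORDS` is a tiny hand-written stand-in (not NLTK's list), and `normalize_apostrophes` and `clean_tokens` are hypothetical helpers that are not part of the notebook.

```python
import string

# Stand-in stop-word set for illustration only; the notebook uses NLTK's list.
STOP_WORDS = {"i", "to", "it's", "been", "don't"}

def normalize_apostrophes(text):
    """Replace typographic apostrophes (U+2019) with the ASCII apostrophe."""
    return text.replace("\u2019", "'")

def clean_tokens(doc):
    """Lowercase, drop stop words, then strip punctuation from each token."""
    words = normalize_apostrophes(doc).lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    stripped = ["".join(ch for ch in w if ch not in string.punctuation) for w in kept]
    # Discard tokens that were nothing but punctuation.
    return [w for w in stripped if w]

print(clean_tokens("I don’t watch cricket."))
print(clean_tokens("It’s been long!"))
```

With the apostrophe normalized, the contractions match the stop-word set and are dropped like any other stop word.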

# Implementation of LDA using Gensim

3. Creating the Document Term Matrix

After preprocessing the text, we don't need to create the document term matrix (DTM) from scratch: the gensim library has an internal mechanism to create it.
The only requirement for the gensim package is that we pass the cleaned data in the form of tokenized words.

In [28]:
# Creating the term dictionary of our corpus (specific to gensim's syntax),
# where every unique term is assigned an index.
dict_ = corpora.Dictionary(clean_corpus)
# The dictionary has 52 unique words from the cleaned corpus.
for i in dict_.values():
    print(i)

movie
want
watch
weekend
beating
championship
eight
india
new
shopping
southampton
test
went
wicket
world
yesterday
zealand
amazon
cricket
don’t
good
netflix
prime
book
chill
however
it’s
like
long
nice
paint
read
time
way
would
biology
blueberry
brain
dispenza’s
dr
gamechanger
helped
impact
joe
learn
milkshake
much
reading
rewire
thought
try
work


The next step is to convert the corpus (the list of documents) into a document-term matrix using the dictionary prepared above. (The vectorizer used here is Bag of Words.)

In [30]:
# Converting the list of documents (corpus) into a Document Term Matrix using the dictionary
doc_term_matrix = [dict_.doc2bow(i) for i in clean_corpus]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1)],
 [(0, 1), (2, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)],
 [(0, 1),
  (20, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1)],
 [(20, 1),
  (23, 2),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1)]]

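This output implies:

For each document, we have (index, frequency) pairs: the dictionary index of each word and the number of times it occurs in that document.

In the first document, word 0 occurs once, word 1 occurs once, and so on; in the third document, word 2 ("watch") occurs twice.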
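
To make the (index, frequency) pairs concrete, here is a plain-Python sketch of the same idea. It is not gensim's implementation: `build_dictionary` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones gensim chooses.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["want", "watch", "movie", "weekend"],
    ["watch", "cricket", "good", "movie", "watch"],
]
token2id = build_dictionary(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)

# Dense row for one document: one count per dictionary entry (zeros included).
dense = [dict(bows[1]).get(i, 0) for i in range(len(token2id))]
print(dense)
```

The sparse pairs omit zero counts, while the dense row is what a full count matrix would hold for the same document.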
Next, we implement the LDA model by creating the object and passing the required arguments:

4. Implementation of LDA

In [38]:
# Creating the object for LDA model using gensim Library

Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix.
Ldamodel = Lda(doc_term_matrix, num_topics=6, id2word = dict_, passes=1, random_state=0, eval_every=None)
Ldamodel.print_topics()

[(0,
  '0.136*"watch" + 0.084*"movie" + 0.060*"good" + 0.060*"amazon" + 0.060*"netflix" + 0.060*"prime" + 0.060*"cricket" + 0.060*"don’t" + 0.029*"want" + 0.027*"weekend"'),
 (1,
  '0.074*"weekend" + 0.070*"want" + 0.065*"movie" + 0.063*"watch" + 0.015*"don’t" + 0.015*"good" + 0.015*"book" + 0.015*"cricket" + 0.015*"prime" + 0.015*"new"'),
 (2,
  '0.052*"book" + 0.028*"good" + 0.028*"blueberry" + 0.028*"try" + 0.028*"helped" + 0.028*"reading" + 0.028*"joe" + 0.028*"work" + 0.028*"dispenza’s" + 0.028*"brain"'),
 (3,
  '0.020*"championship" + 0.020*"wicket" + 0.020*"southampton" + 0.020*"yesterday" + 0.020*"went" + 0.020*"india" + 0.020*"shopping" + 0.020*"new" + 0.020*"world" + 0.020*"zealand"'),
 (4,
  '0.051*"movie" + 0.051*"book" + 0.051*"paint" + 0.051*"long" + 0.051*"like" + 0.051*"read" + 0.051*"time" + 0.051*"chill" + 0.051*"would" + 0.051*"however"'),
 (5,
  '0.019*"weekend" + 0.019*"watch" + 0.019*"movie" + 0.019*"good" + 0.019*"want" + 0.019*"don’t" + 0.019*"book" + 0.019*"cri

This output means that each of the 52 unique words is assigned a weight within each topic; by default, `print_topics` shows the ten highest-weighted words per topic. In other words, it shows which words dominate each topic.

Now, we find the topics from Corpus:

In [42]:
print(Ldamodel.print_topics(num_topics=6, num_words=5))
# num_topics: how many topics to show
# num_words: how many words to show per topic

[(0, '0.136*"watch" + 0.084*"movie" + 0.060*"good" + 0.060*"amazon" + 0.060*"netflix"'), (1, '0.074*"weekend" + 0.070*"want" + 0.065*"movie" + 0.063*"watch" + 0.015*"don’t"'), (2, '0.052*"book" + 0.028*"good" + 0.028*"blueberry" + 0.028*"try" + 0.028*"helped"'), (3, '0.020*"championship" + 0.020*"wicket" + 0.020*"southampton" + 0.020*"yesterday" + 0.020*"went"'), (4, '0.051*"movie" + 0.051*"book" + 0.051*"paint" + 0.051*"long" + 0.051*"like"'), (5, '0.019*"weekend" + 0.019*"watch" + 0.019*"movie" + 0.019*"good" + 0.019*"want"')]


This returns the six topics (indexed 0 to 5), each with its five highest-weighted words and their respective weights.
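
If the weights are needed as numbers rather than as a formatted string, the string can be parsed. The sketch below assumes the `weight*"word" + ...` format shown in the output above; `parse_topic` is a hypothetical helper, not a gensim function. (When the model object is at hand, gensim's `show_topics(formatted=False)` returns the word and weight pairs directly and is usually the simpler route.)

```python
def parse_topic(topic_str):
    """Parse '0.136*"watch" + 0.084*"movie"' into [('watch', 0.136), ('movie', 0.084)]."""
    pairs = []
    for term in topic_str.split(" + "):
        # Each term looks like 0.136*"watch": weight, asterisk, quoted word.
        weight, word = term.split("*", 1)
        pairs.append((word.strip('"'), float(weight)))
    return pairs

# Topic 0 as printed above (five words).
topic_0 = '0.136*"watch" + 0.084*"movie" + 0.060*"good" + 0.060*"amazon" + 0.060*"netflix"'
for word, weight in parse_topic(topic_0):
    print(f"{word:10s} {weight:.3f}")
```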

Now, we assign these resultant topics to the documents via:

In [45]:
# Print the topic distribution of each document, numbered from 0
for count, i in enumerate(Ldamodel[doc_term_matrix]):
    print("doc :", count, i)

doc : 0 [(0, 0.28205714), (1, 0.5845116), (2, 0.033334), (3, 0.033336144), (4, 0.033424977), (5, 0.033336177)]
doc : 1 [(0, 0.011905564), (1, 0.011906144), (2, 0.94046915), (3, 0.011907217), (4, 0.011905454), (5, 0.01190652)]
doc : 2 [(0, 0.9166039), (1, 0.016687563), (2, 0.016676312), (3, 0.016667562), (4, 0.016697042), (5, 0.016667575)]
doc : 3 [(0, 0.011138678), (1, 0.011119543), (2, 0.011127169), (3, 0.011111964), (4, 0.94439065), (5, 0.011111975)]
doc : 4 [(2, 0.9602892)]


# Conclusion:
Each of the five documents is assigned topic weights, which tell us the dominant topic for that document. Documents and topics are numbered from 1 below, whereas the output above indexes them from 0.
From the output above we can see:

1. Document 1 has its highest weight, 58.4%, on Topic 2.
2. Topic 3 dominates Document 2 with a weight of 94%. Similarly, Topic 1 is the main topic for Document 3 with ~92% weight.
3. Document 4 is dominated by Topic 5 with 94%, and Topic 3 dominates Document 5 with 96%.
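
The dominant topic can also be read off programmatically. The sketch below works on the (topic, weight) pairs printed above, with the weights rounded for brevity; `dominant_topic` is a hypothetical helper, and the indices are 0-based as in the gensim output.

```python
def dominant_topic(topic_weights):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_weights, key=lambda pair: pair[1])

# Document-topic weights copied (rounded) from the output above.
doc_topics = [
    [(0, 0.282), (1, 0.585), (2, 0.033), (3, 0.033), (4, 0.033), (5, 0.033)],
    [(0, 0.012), (1, 0.012), (2, 0.940), (3, 0.012), (4, 0.012), (5, 0.012)],
    [(0, 0.917), (1, 0.017), (2, 0.017), (3, 0.017), (4, 0.017), (5, 0.017)],
    [(0, 0.011), (1, 0.011), (2, 0.011), (3, 0.011), (4, 0.944), (5, 0.011)],
    [(2, 0.960)],
]

for doc_id, weights in enumerate(doc_topics):
    topic_id, weight = dominant_topic(weights)
    print(f"doc {doc_id}: topic {topic_id} ({weight:.1%})")
```

The last document lists a single topic because gensim omits topics whose weight falls below the model's `minimum_probability` threshold (0.01 by default).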