### All Rights Reserved. This notebook is proprietary content of machinelearningplus.com. This can be shared solely for educational purposes, with due credits to machinelearningplus.com

Altered for Colab for DSMA course.

<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> Topic Modeling with Gensim (Python)</h1>
</div>

Topic modeling is a technique for extracting hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python's Gensim package. The challenge, however, is *how to extract good-quality topics that are clear, segregated and meaningful.* This depends heavily on the quality of text preprocessing and on the strategy for finding the optimal number of topics. This tutorial attempts to tackle both of these problems.

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Content</h2>
</div>

1. Introduction
2. Prerequisites – Download nltk stopwords and spacy model
3. Import Packages
4. What does LDA do?
5. Prepare Stopwords
6. Import Newsgroups Data
7. Remove emails and newline characters
8. Tokenize words and Clean-up text
9. Creating Bigram and Trigram Models
10. Remove Stopwords, Make Bigrams and Lemmatize
11. Create the Dictionary and Corpus needed for Topic Modeling
12. Building the Topic Model
13. View the topics in LDA model
14. Compute Model Perplexity and Coherence Score
15. Visualize the topics-keywords
16. Building LDA Mallet Model
17. How to find the optimal number of topics for LDA?
18. Finding the dominant topic in each sentence
19. Find the most representative document for each topic
20. Topic distribution across documents

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 3. Import Packages</h2>
</div>

The core packages used in this tutorial are `re`, `gensim`, `spacy` and `pyLDAvis`. Besides these, we will also be using `matplotlib`, `numpy` and `pandas` for data handling and visualization. Let's import them.

In [1]:
!pip install pyLDAvis
!pip install gensim
# pip install spacy==2.2.0


In [2]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis==3.3.1
import pyLDAvis
# Older tutorials use `import pyLDAvis.gensim`; newer pyLDAvis releases name the module gensim_models
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)





<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 11. Create the Dictionary and Corpus needed for Topic Modeling</h2>
</div>

The two main inputs to the LDA topic model are the dictionary (`id2word`) and the corpus. Let's create them.

In [12]:
# Read the tokenized text (a list of token lists, one per document)
import json
with open('/content/ldalist.json', encoding='utf-8') as f:
    data_lemmatized = json.load(f)

In [13]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 2), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 2), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2)]]


Gensim creates a unique id for each word in the document. The produced corpus shown above is a mapping of (word_id, word_frequency).

For example, (0, 1) above means that word id 0 occurs once in the first document. Likewise, (10, 2) means that word id 10 occurs twice, and so on.

This is used as the input by the LDA model.
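
To make the `(word_id, word_frequency)` format concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is an illustration under stated assumptions, not gensim's implementation: the helper `build_bow`, the `token2id` mapping and the toy documents are all hypothetical, and ids are simply handed out in sorted token order as each document is seen.

```python
from collections import Counter

def build_bow(documents):
    """Map tokenized documents to lists of (word_id, count) pairs.

    Hypothetical helper for illustration only; in the notebook this job
    is done by corpora.Dictionary and id2word.doc2bow.
    """
    token2id = {}
    corpus = []
    for tokens in documents:
        counts = Counter(tokens)
        # Give the next free id to each unseen token, in sorted order
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # One (word_id, count) pair per distinct token, sorted by id
        corpus.append(sorted((token2id[t], c) for t, c in counts.items()))
    return token2id, corpus

toy_docs = [["dog", "cat", "dog"], ["cat", "bird"]]
token2id, toy_corpus = build_bow(toy_docs)
print(token2id)    # {'cat': 0, 'dog': 1, 'bird': 2}
print(toy_corpus)  # [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
```

In the toy output, `(1, 2)` in the first document says that the word with id 1 (`dog`) appears twice there, which is the same reading as for the real corpus above.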


If you want to see what word a given id corresponds to, pass the id as a key to the dictionary.

In [14]:
id2word[231]

'現實感'

Or, you can see a human readable form of the corpus itself.

In [15]:
corpus[:1][0][:10]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1)]

In [16]:
# Human-readable format of the corpus (term, frequency)
[[(id2word[word_id], freq) for word_id, freq in doc] for doc in corpus[:1]]

[[('上市', 1),
  ('上班', 1),
  ('下廚', 1),
  ('下班', 1),
  ('交', 1),
  ('伴侶', 1),
  ('公司', 1),
  ('出櫃', 1),
  ('分手', 1),
  ('助理職', 1),
  ('動物', 2),
  ('同事', 1),
  ('吧歡迎', 1),
  ('呆板', 1),
  ('單身', 2),
  ('回家', 1),
  ('回照', 1),
  ('外表', 1),
  ('密切', 1),
  ('尊重', 1),
  ('尋找', 1),
  ('小小', 1),
  ('小說', 1),
  ('帶', 1),
  ('帶走', 2),
  ('平台', 1),
  ('廢', 1),
  ('強勢', 1),
  ('影片', 1),
  ('徹底', 1),
  ('心力交瘁', 1),
  ('性別', 1),
  ('性格', 1),
  ('房', 1),
  ('房子', 1),
  ('打包', 1),
  ('打理', 1),
  ('收入', 1),
  ('放假', 1),
  ('放棄', 1),
  ('月', 1),
  ('東西', 1),
  ('桃園', 1),
  ('機車人', 1),
  ('正經', 1),
  ('沒辦法', 1),
  ('沒關係', 1),
  ('減下來', 1),
  ('溫柔', 1),
  ('無妨', 1),
  ('照片', 1),
  ('照顧', 1),
  ('煩躁', 1),
  ('煮', 1),
  ('爽快', 1),
  ('特別', 1),
  ('狀態', 2),
  ('狗狗', 1),
  ('獨居', 1),
  ('獨立', 1),
  ('生', 1),
  ('用心', 1),
  ('男人', 1),
  ('留戀', 1),
  ('直接', 1),
  ('簡單', 1),
  ('經營', 1),
  ('缺點', 1),
  ('美食', 1),
  ('老人', 1),
  ('耍', 1),
  ('聯繫', 1),
  ('職場', 1),
  ('胖子', 1),
  ('舒適', 1),
  ('菸', 1),
  ('處於', 1),


Alright, without digressing further let's jump back on track with the next step: Building the topic model.

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 12. Building the Topic Model</h2>
</div>

We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well. 

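Apart from that, `alpha` and `eta` are hyperparameters that affect the sparsity of the topics. According to the gensim docs, both default to a symmetric 1.0/num_topics prior. The model below sets `alpha='auto'` instead, which asks gensim to learn an asymmetric prior from the corpus.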

`chunksize` is the number of documents to be used in each training chunk.  `update_every` determines how often the model parameters should be updated and `passes` is the total number of training passes.

In [17]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
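
To get a feel for how `chunksize`, `update_every` and `passes` interact, the sketch below does the arithmetic for a made-up corpus. It assumes a single-worker run in which the model is updated once every `update_every` chunks; the helper `count_updates` and the corpus size of 2500 documents are hypothetical and not part of gensim or of this dataset.

```python
import math

def count_updates(num_docs, chunksize, update_every, passes):
    """Estimate how many online updates a training run performs.

    Assumption: one update per `update_every` chunks, and every pass
    iterates over the whole corpus.
    """
    docs_per_update = chunksize * update_every
    updates_per_pass = math.ceil(num_docs / docs_per_update)
    return updates_per_pass * passes

# The symmetric default prior is simply 1.0/num_topics for every topic
num_topics = 10
symmetric_prior = [1.0 / num_topics] * num_topics

# num_docs=2500 is a made-up corpus size, used only for illustration
total_updates = count_updates(num_docs=2500, chunksize=100,
                              update_every=1, passes=10)
print(symmetric_prior[0])  # 0.1
print(total_updates)       # 250
```

Under these assumptions, a larger `chunksize` means fewer but better-informed updates per pass, while more `passes` repeats the whole schedule.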

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 13. View the topics in LDA model</h2>
</div>


In [18]:
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.008*"專長" + 0.004*"不菸" + 0.004*"電腦" + 0.004*"意思" + 0.004*"開玩笑" + '
  '0.004*"線上" + 0.004*"投射" + 0.003*"衝浪" + 0.003*"道理" + 0.003*"劇情"'),
 (1,
  '0.008*"旅伴" + 0.008*"活潑" + 0.008*"順利" + 0.005*"敏感" + 0.005*"烘焙" + 0.005*"廚房" '
  '+ 0.004*"上進" + 0.004*"新北" + 0.004*"牡羊" + 0.004*"台北市"'),
 (2,
  '0.014*"重視" + 0.007*"正向" + 0.005*"可靠" + 0.005*"室友" + 0.004*"生理" + 0.004*"芋頭" '
  '+ 0.004*"異性戀" + 0.004*"電視" + 0.004*"普通" + 0.004*"新鮮"'),
 (3,
  '0.009*"勝過" + 0.007*"女" + 0.006*"台" + 0.006*"介" + 0.005*"正負" + 0.005*"私訊" + '
  '0.005*"必要" + 0.005*"少女" + 0.004*"平台" + 0.004*"真相"'),
 (4,
  '0.004*"作品" + 0.004*"劈腿" + 0.004*"念" + 0.004*"老師" + 0.003*"香水" + 0.003*"設計" '
  '+ 0.003*"評價" + 0.003*"認證" + 0.003*"美國" + 0.003*"結局"'),
 (5,
  '0.009*"月亮" + 0.007*"疾病" + 0.005*"休假" + 0.004*"圈內人" + 0.004*"看過" + '
  '0.004*"依靠" + 0.004*"駕照" + 0.004*"刺青" + 0.004*"女友" + 0.004*"熊"'),
 (6,
  '0.006*"認同" + 0.006*"菜" + 0.005*"紅娘" + 0.005*"擁有" + 0.004*"行程" + 0.004*"不同" '
  '+ 0.004*"講話" + 0.004*"上菜" + 0.004*"眼" + 0.004*"熱情
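
Each topic is printed as a weighted combination of keywords, written as `weight*"keyword"` terms joined by `+`. As a small sketch of how to read such a string, the hypothetical helper `parse_topic` below splits it into `(keyword, weight)` pairs with a regular expression. The sample string is the first three terms of topic 0 from the output above.

```python
import re

# Matches one weight*"keyword" term in a topic string
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Split a print_topics()-style string into (keyword, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

# First three terms of topic 0 from the output above
topic_0 = '0.008*"專長" + 0.004*"不菸" + 0.004*"電腦"'
pairs = parse_topic(topic_0)
print(pairs)  # [('專長', 0.008), ('不菸', 0.004), ('電腦', 0.004)]
```

If you only need the pairs and not the formatted string, `lda_model.show_topic(topic_id)` returns them directly as a list of `(word, probability)` tuples.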

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 14. Compute Model Perplexity and Coherence Score</h2>
</div>


Model perplexity and [topic coherence](https://rare-technologies.com/what-is-topic-coherence/) provide a convenient measure to judge how good a given topic model is. In my experience, topic coherence score, in particular, has been more helpful.

In [19]:
# Compute the per-word likelihood bound (this is what log_perplexity returns)
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # perplexity = 2**(-bound); a lower perplexity is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -8.964043551334216

Coherence Score:  0.485386310603768
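
Note that the value printed as "Perplexity" is negative. According to the gensim docs, `log_perplexity` returns a per-word likelihood bound, and the perplexity itself is `2**(-bound)`. The short sketch below applies that conversion to the value printed above; the helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound to perplexity, 2**(-bound)."""
    return 2 ** (-per_word_bound)

bound = -8.964043551334216  # value printed by log_perplexity above
perplexity = bound_to_perplexity(bound)
print(perplexity)
```

For this model the result is roughly 500. A bound closer to zero gives a lower perplexity, which is the direction you want when comparing models on the same corpus.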


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 15. Visualize the topics-keywords</h2>
</div>

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool for this than the pyLDAvis package's interactive chart, which is designed to work well with Jupyter notebooks.

In [20]:
# Visualize the topics
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # older tutorials use pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis

