# LDA Topic Modeling with Gensim

Heavily modified and expanded from the Gensim tutorial

By Jon Chun
28 Mar 2022 Updated

# Setup and Configuration

In [None]:
!pip install pyldavis

In [None]:
!pip install contractions

## Configure Jupyter Notebook

In [50]:
## Configure Jupyter Notebook

# Ignore warnings

import warnings
warnings.filterwarnings('ignore')

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Import Libraries

In [51]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## [INPUT] Connect Google gDrive to this Jupyter Notebook

In [52]:
# [INPUT REQUIRED]: Authorize access to Google gDrive via popup windows

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  # google.colab can only be imported inside a Colab runtime
  IN_COLAB = False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive')
else:
  print("Not running in Colab: skipping the Google gDrive mount")

Attempting to attach your Google gDrive to this Colab Jupyter Notebook
Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


# Data

You have (2) ways to get data in this tutorial, but if you're following
this tutorial just to learn about LDA I encourage you to consider picking a
corpus on a subject that you are familiar with. Qualitatively evaluating the
output of an LDA model is challenging and can require you to understand the
subject matter of your corpus (depending on your goal with the model).

Reference for comparison: the NeurIPS corpus contains 1740 documents, and not particularly long ones (http://www.cs.nyu.edu/~roweis/data.html).

So keep in mind that this tutorial is not geared towards efficiency, and be
careful before applying the code to a large dataset.


## Option (a): Put Text Datafile in your GDrive project directory

If you have a Text Datafile ready to use, just copy it into your GDrive folder that is the project directory for this LDA exercise (listed below)

In [58]:
print(f'Put your Textfile in the current project directory:\n')
!pwd

Put your Textfile in the current project directory:

/gdrive/MyDrive/cdh/nlp_topic_modeling_lda


**[SKIP] to next Section [LDA MODEL]**

In [59]:
%cd recent_texts

/gdrive/MyDrive/cdh/nlp_topic_modeling_lda/recent_texts


In [60]:
!ls -altr hpotter*

-rw------- 1 root root 1146084 Mar  3 17:09 hpotter7_deathly_hollows_utf8.txt
-rw------- 1 root root  991744 Mar  3 17:09 hpotter6_the_half_blood_prince_utf8.txt
-rw------- 1 root root 1517476 Mar  3 17:09 hpotter5_order_of_the_phoenix_utf8.txt
-rw------- 1 root root 1123157 Mar  3 17:09 hpotter4_the_goblet_of_fire_utf8.txt
-rw------- 1 root root  624699 Mar  3 17:09 hpotter3_the_prisoner_of_azkaban_utf8.txt
-rw------- 1 root root  498919 Mar  3 17:09 hpotter2_chamber_of_secrets_utf8.txt
-rw------- 1 root root  448834 Mar  3 17:09 hpotter1_sorcerers_stone_utf8.txt


In [61]:
!cat hpotter* > all_7hpotter_books_utf8.txt

## Option (b): Use 'wget' Command

If you don't have a Textfile ready, let's just grab one to be able to continue learning about Gensim LDA.

`wget` (short for "web get") is a Unix utility that downloads almost any publicly available file on the Internet, given its URL.

Option (b)(i): If you have a URL to an open, freely available text document enter it in the next code cell below

Option (b)(ii): If you don't have a URL:

* go to https://www.gutenberg.org/

* search for a longish novel (e.g. 'middlemarch' at https://www.gutenberg.org/ebooks/145)

* click on the 'Plain Text UTF-8' version to open the plain text of the novel, then copy its URL and paste it in the next code cell below

In [189]:
# Assign the variable url_text to the link address of the plain-text file
#  (e.g. url_text = 'https://www.gutenberg.org/files/145/145-0.txt')

url_text = 'https://www.gutenberg.org/files/145/145-0.txt'

!wget $url_text

--2022-03-28 18:06:09--  https://www.gutenberg.org/files/145/145-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1865681 (1.8M) [text/plain]
Saving to: ‘145-0.txt.3’


2022-03-28 18:06:10 (9.52 MB/s) - ‘145-0.txt.3’ saved [1865681/1865681]



In [191]:
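# What filename was just downloaded?
#   (`ls -tr` lists oldest first, so the newest file is normally at the bottom;
#    wget keeps the server's timestamp, so repeated downloads tie on time and
#    the 'Saving to:' line in the wget output above is the reliable place to look)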

!ls -altr *145*

-rw------- 1 root root 1865681 Mar 31  2021 145-0.txt.3
-rw------- 1 root root 1865681 Mar 31  2021 145-0.txt.2
-rw------- 1 root root 1865681 Mar 31  2021 145-0.txt.1
-rw------- 1 root root 1865681 Mar 31  2021 145-0.txt


In [192]:
# [OPTIONAL]: Make your Textdata filename more user-friendly

textfile_downloaded = '145-0.txt'
textfile_name = 'middlemarch.txt'

!mv $textfile_downloaded $textfile_name

!ls -altr middle*

-rw------- 1 root root 1865681 Mar 28 18:07 middlemarch.txt



# LDA Model

Introduces Gensim's LDA model. The original Gensim tutorial demonstrates it on the NIPS corpus; this notebook applies it to the text file prepared in the Data section.


The purpose of this tutorial is to demonstrate how to train and tune an LDA model.

In this tutorial we will:

* Load input data.
* Pre-process that data.
* Transform documents into bag-of-words vectors.
* Train an LDA model.

This tutorial will **not**:

* Explain how Latent Dirichlet Allocation works
* Explain how the LDA model performs inference
* Teach you all the parameters and options for Gensim's LDA implementation

If you are not familiar with the LDA model or how to use it in Gensim, I (Olavur Mortensen)
suggest you read up on that before continuing with this tutorial. Basic
understanding of the LDA model should suffice. Examples:

* Introduction to Latent Dirichlet Allocation (http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation)
* Gensim tutorial: Topics and Transformations (`run_topics_and_transformations.py` in the Gensim core examples)
* Gensim's LDA model API docs: `gensim.models.LdaModel`

I would also encourage you to consider each step when applying the model to
your data, instead of just blindly applying my solution. The different steps
will depend on your data and possibly your goal with the model.





## Read Textfile

In [65]:
!pwd

/gdrive/MyDrive/cdh/nlp_topic_modeling_lda/recent_texts


In [196]:
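# NOTE: `!cd ..` would run in a throwaway subshell and leave the notebook's
#   working directory unchanged (use `%cd ..` to actually move), so we simply
#   list the file from the current directory
!ls *7concatenated.txt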

potter_7concatenated.txt


In [195]:
textfile_name = 'potter_7concatenated.txt'
textfile_name

'potter_7concatenated.txt'

In [197]:
with open(textfile_name, 'r', encoding='ascii', errors='ignore') as f:
    book_all_str = f.read()

type(book_all_str)

str

In [198]:
# Get char count
len(book_all_str)

6296045

In [199]:
book_all_str[:1000]

"Harry Potter and the Sorcerer's Stone\nby J.K. Rowling\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, and\ntheir greatest fear was that somebody would discover it. They didn't\nthink they could bear it if 

## Split the Book/Corpus into Paragraphs/Documents

In [200]:
# Split into paragraphs
book_parags_ls = book_all_str.split('\n\n')

# Replace stray/embedded \n with a space
book_parags_ls = [aparag.replace('\n', ' ') for aparag in book_parags_ls]
book_parags_ls[:5]

["Harry Potter and the Sorcerer's Stone by J.K. Rowling",
 'CHAPTER ONE',
 'THE BOY WHO LIVED',
 "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.",
 'Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.']

In [135]:
# Get paragraph count
len(book_parags_ls)

40029

In [201]:
# Verify first paragraph (after title and chapter headings)

book_parags_ls[3]
print('\n')
# Paragraph char count
len(book_parags_ls[3])

"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense."





262

In [202]:
# Length thresholds, in characters

MIN_LEN_PARAG = 5
MIN_LEN_DOC = 1000

# Keep only paragraphs longer than MIN_LEN_PARAG characters
book_parags_ls = [x for x in book_parags_ls if len(x) > MIN_LEN_PARAG]

# Trim any leading/trailing/multiple embedded whitespaces
book_parags_ls = [' '.join(x.split()) for x in book_parags_ls]

len(book_parags_ls)

39904

In [203]:
# Agglomerate paragraphs into Documents of more than MIN_LEN_DOC=1000 chars

doc_now_str = ''
docs = []

for parag_now_str in book_parags_ls:
  # NOTE: paragraphs are joined with no separator, so the last word of one
  #   paragraph fuses with the first word of the next (e.g. 'RowlingCHAPTER'
  #   in the output below); append `parag_now_str + ' '` to keep them apart
  doc_now_str += parag_now_str
  if len(doc_now_str) > MIN_LEN_DOC:
    docs.append(doc_now_str)
    doc_now_str = ''

# Fold any leftover short tail into the last Document
if doc_now_str:
  if docs:
    docs[-1] += doc_now_str
  else:
    docs.append(doc_now_str)

print(f'There are now {len(docs)} Documents of {MIN_LEN_DOC} chars or more')

There are now 5375 Documents of 1000 chars or more
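The agglomeration step above can also be written as a small reusable function. The version below is only a sketch: the name `agglomerate` and its `sep` parameter are my own, and unlike the cell above it joins paragraphs with a space, so words at paragraph boundaries stay separate. Rerunning the notebook with it would therefore change the documents and token counts shown in the recorded outputs.

```python
def agglomerate(paragraphs, min_len=1000, sep=' '):
    """Group consecutive paragraphs into documents longer than min_len chars.

    Hypothetical helper (not part of the original notebook). Paragraphs are
    joined with `sep` so boundary words do not fuse together.
    """
    docs = []
    current = []        # paragraphs collected for the document in progress
    current_len = 0     # total characters collected so far (separators excluded)
    for parag in paragraphs:
        current.append(parag)
        current_len += len(parag)
        if current_len > min_len:
            docs.append(sep.join(current))
            current, current_len = [], 0
    if current:
        # Leftover paragraphs are too short for their own document:
        # fold them into the last one, or keep them if nothing else exists.
        tail = sep.join(current)
        if docs:
            docs[-1] = docs[-1] + sep + tail
        else:
            docs.append(tail)
    return docs


sample = ['CHAPTER ONE', 'THE BOY WHO LIVED', 'Mr. and Mrs. Dursley were proud.']
print(agglomerate(sample, min_len=20))
```

With the notebook's variables this would be called as `docs = agglomerate(book_parags_ls, MIN_LEN_DOC)`.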


In [204]:
# View the first 5 docs

docs[:5]

["Harry Potter and the Sorcerer's Stone by J.K. RowlingCHAPTER ONETHE BOY WHO LIVEDMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Pott

In [205]:
# View the first 500 chars in the 50th doc

docs[49][:500]

'"Who on earth wants to talk to you this badly?" Dudley asked Harry in amazement.On Sunday morning, Uncle Vernon sat down at the breakfast table looking tired and rather ill, but happy."No post on Sundays," he reminded them cheerfully as he spread marmalade on his newspapers, "no damn letters today --"Something came whizzing down the kitchen chimney as he spoke and caught him sharply on the back of the head. Next moment, thirty or forty letters came pelting out of the fireplace like bullets. The '

In [128]:
# View the last 100 chars in the last doc

docs[-1][-100:]

' scar on his forehead.I know he will.The scar had not pained Harry for nineteen years. All was well.'

## Pre-process and vectorize the documents

As part of preprocessing, we will:

* Expand Contractions
* Tokenize (split the documents into tokens)
* Define stopwords and filter out
* Lemmatize the tokens.
* Compute bigrams.
* Compute a bag-of-words representation of the data.

After expanding contractions, we tokenize the text using a regular expression
tokenizer from NLTK. We remove numeric tokens and tokens that are only a single
character, as they don't tend to be useful, and the dataset contains a lot of them.

**Important:** This tutorial uses the nltk library for preprocessing, although you
can replace it with something else if you want.




### Expand Contractions

In [143]:
# Expand Contractions (e.g. didn't -> did not)

import contractions

# Test
contractions.fix("yall're happy now", slang=False)  # slang defaults to True



"yall're happy now"

In [144]:
# Expand all Contractions document by document
docs = [contractions.fix(adoc) for adoc in docs]
docs[:5]

["Harry Potter and the Sorcerer's Stone by J.K. RowlingCHAPTER ONETHE BOY WHO LIVEDMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you would expect to be involved in anything strange or mysterious, because they just did not hold with such nonsense.Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They did not think they could bear it if anyone found out about th

### Tokenize Text

In [210]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
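If you prefer to avoid the NLTK dependency for this step, the same three operations (lowercase and split on `\w+`, drop purely numeric tokens, drop single-character tokens) can be sketched with the standard library. The function name `simple_tokenize` is my own, and I am assuming that `re.findall(r'\w+', ...)` splits text the same way as `RegexpTokenizer(r'\w+')` for ordinary prose.

```python
import re


def simple_tokenize(text):
    """Lowercase, split on runs of word characters, then drop numeric and
    single-character tokens (hypothetical stand-in for the NLTK-based cell)."""
    tokens = re.findall(r'\w+', text.lower())
    tokens = [tok for tok in tokens if not tok.isnumeric()]
    return [tok for tok in tokens if len(tok) > 1]


print(simple_tokenize("Mr. Dursley, of number 4, had a son."))
```

Note that punctuation-bearing abbreviations lose their dots ("Mr." becomes "mr"), exactly as in the token list printed by the notebook.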

In [213]:
# Verify docs

# content and token count of first doc
docs[0]
print(f'\n\nThere are {len(docs[0])} tokens in the first document')

['harry',
 'potter',
 'and',
 'the',
 'sorcerer',
 'stone',
 'by',
 'rowlingchapter',
 'onethe',
 'boy',
 'who',
 'livedmr',
 'and',
 'mrs',
 'dursley',
 'of',
 'number',
 'four',
 'privet',
 'drive',
 'were',
 'proud',
 'to',
 'say',
 'that',
 'they',
 'were',
 'perfectly',
 'normal',
 'thank',
 'you',
 'very',
 'much',
 'they',
 'were',
 'the',
 'last',
 'people',
 'you',
 'expect',
 'to',
 'be',
 'involved',
 'in',
 'anything',
 'strange',
 'or',
 'mysterious',
 'because',
 'they',
 'just',
 'didn',
 'hold',
 'with',
 'such',
 'nonsense',
 'mr',
 'dursley',
 'was',
 'the',
 'director',
 'of',
 'firm',
 'called',
 'grunnings',
 'which',
 'made',
 'drills',
 'he',
 'was',
 'big',
 'beefy',
 'man',
 'with',
 'hardly',
 'any',
 'neck',
 'although',
 'he',
 'did',
 'have',
 'very',
 'large',
 'mustache',
 'mrs',
 'dursley',
 'was',
 'thin',
 'and',
 'blonde',
 'and',
 'had',
 'nearly',
 'twice',
 'the',
 'usual',
 'amount',
 'of',
 'neck',
 'which',
 'came',
 'in',
 'very',
 'useful',
 '



There are 266 tokens in the first document


### Customize Stopwords

In [214]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

stopwords_ls = stopwords.words('english')

print(f'\nStopwords from index 20 onward:')
stopwords_ls[20:]
print(f'\n\nThere are [{len(stopwords_ls)}] English stopwords imported from NLTK')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True


Stopwords from index 20 onward:


['himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don



There are [179] English stopwords imported from NLTK


In [215]:
# Remove all stopwords that contain an apostrophe (optional - could also remove fragments like 'll', 're', and 've' but not necessary)

stopwords_ls = [x for x in stopwords_ls if "'" not in x]
stopwords_ls[20:]
print(f'\n\nThere are now {len(stopwords_ls)} stopwords after removing words with contractions')

['herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 'should',
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 



There are now 153 stopwords after removing words with contractions


**[CUSTOMIZE] Stopwords list by adding or deleting tokens**

In [216]:
# [CUSTOMIZE] Stopwords to ADD or DELETE from default NLTK English stopword list

STOPWORDS_ADD_SET = {'bazinga', 'woo-hoo'}  # Edit this set to add new words to the stopwords list
STOPWORDS_DEL_SET = {'the'}  # Edit this set to remove existing words from the stopwords list

stopwords_en_ls = list(set(stopwords_ls).difference(STOPWORDS_DEL_SET).union(STOPWORDS_ADD_SET))
print(f'Final Count after Customized Add/Del: {len(stopwords_en_ls)} Stopwords')

Final Count after Customized Add/Del: 154 Stopwords
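As far as the cells shown here go, `stopwords_en_ls` is built but never applied to `docs`, which may explain why function words such as 'she' and 'do' surface in the topics further down. A minimal sketch of the missing filtering step follows; the helper name `remove_stopwords` and the toy data are my own.

```python
def remove_stopwords(tokenized_docs, stopwords):
    """Return a copy of tokenized_docs with every stopword removed.

    Hypothetical helper; `tokenized_docs` is a list of token lists.
    """
    stop_set = set(stopwords)  # set membership tests are O(1)
    return [[tok for tok in doc if tok not in stop_set] for doc in tokenized_docs]


demo_docs = [['harry', 'and', 'ron', 'were', 'late'],
             ['she', 'saw', 'the', 'owl']]
demo_stopwords = ['and', 'were', 'she', 'the']
print(remove_stopwords(demo_docs, demo_stopwords))
```

In this notebook the call would be `docs = remove_stopwords(docs, stopwords_en_ls)`, run after tokenizing and before lemmatizing. A hyphenated entry such as 'woo-hoo' can never match, because the `\w+` tokenizer has already split it into 'woo' and 'hoo'.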


### Lemmatize

We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a
stemmer in this case because it produces more readable words. Output that is
easy to read is very desirable in topic modelling.




In [217]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [218]:
%%time

# NOTE: 0m24s @03:48 on 20220228 Colab Pro 

# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

CPU times: user 4.8 s, sys: 28.4 ms, total: 4.83 s
Wall time: 4.87 s


We find bigrams in the documents. Bigrams are sets of two adjacent words.
Using bigrams we can get phrases like "machine_learning" in our output
(spaces are replaced with underscores); without bigrams we would only get
"machine" and "learning".

Note that in the code below, we find bigrams and then add them to the
original data, because we would like to keep the words "machine" and
"learning" as well as the bigram "machine_learning".

**Important:** Computing n-grams of a large dataset can be very computationally
and memory intensive.




### Identify Bi- and Tri-Grams

In [220]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

2022-03-28 18:17:29,876 : INFO : collecting all words and their counts
2022-03-28 18:17:29,883 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-03-28 18:17:32,407 : INFO : collected 351552 word types from a corpus of 1054368 words (unigram + bigrams) and 5375 sentences
2022-03-28 18:17:32,413 : INFO : using 351552 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
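To make the "find bigrams, then append them to the original tokens" idea concrete, here is a deliberately simplified sketch in plain Python. It keeps an adjacent pair purely on its raw count, whereas Gensim's `Phrases` also scores each pair against a threshold (the `threshold=10.0` in the log above), so the two will not select the same bigrams. The name `add_frequent_bigrams` is made up for illustration.

```python
from collections import Counter


def add_frequent_bigrams(tokenized_docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times.

    Simplified illustration only: uses raw pair counts, not Gensim's scoring.
    """
    pair_counts = Counter()
    for doc in tokenized_docs:
        pair_counts.update(zip(doc, doc[1:]))  # adjacent token pairs
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    augmented = []
    for doc in tokenized_docs:
        bigrams = ['_'.join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        augmented.append(doc + bigrams)  # keep the unigrams, add the bigrams
    return augmented


demo = [['uncle', 'vernon', 'sat', 'down'],
        ['uncle', 'vernon', 'was', 'happy'],
        ['harry', 'sat', 'up']]
print(add_frequent_bigrams(demo, min_count=2))
```

Because the original unigrams are kept, 'uncle' and 'vernon' remain available to the model alongside 'uncle_vernon'.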


We remove rare words and common words based on their *document frequency*.
Below we remove words that appear in fewer than 20 documents or in more than
50% of the documents. Consider trying to remove words based only on their
frequency, or maybe combining that with this approach.




In [222]:
#@markdown Minimum number of documents a word must appear in: (default 20)

Min_No_Documents = 20 #@param {type:"slider", min:5, max:100, step:1}

#@markdown Max percent of documents a word can appear in: (default 0.50 or 50%)

Max_Percent_Documents = 0.5 #@param {type:"slider", min:0.05, max:0.9, step:0.01}


In [223]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than Min_No_Documents documents, or in more than Max_Percent_Documents of the documents.
dictionary.filter_extremes(no_below=Min_No_Documents, no_above=Max_Percent_Documents)

2022-03-28 18:17:55,151 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2022-03-28 18:17:56,471 : INFO : built Dictionary(22417 unique tokens: ['a', 'about', 'also', 'although', 'amount']...) from 5375 documents (total 1097259 corpus positions)
2022-03-28 18:17:56,636 : INFO : discarding 18478 tokens: [('a', 3793), ('and', 5321), ('be', 2828), ('beefy', 3), ('but', 3618), ('craning', 16), ('director', 2), ('drill', 8), ('finer', 5), ('for', 3354)]...
2022-03-28 18:17:56,639 : INFO : keeping 3939 tokens which were in no less than 20 and no more than 2687 (=50.0%) documents
2022-03-28 18:17:56,662 : INFO : resulting dictionary: Dictionary(3939 unique tokens: ['about', 'also', 'although', 'amount', 'another']...)
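The filtering rule reported in the log above ("no less than 20 and no more than 2687 (=50.0%) documents") can be sketched without Gensim. This is an illustration of the rule only, under my own function name, and it ignores other options that `filter_extremes` offers.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5):
    """Return the sorted tokens whose document frequency df satisfies
    no_below <= df <= no_above * number_of_documents.

    Hypothetical helper mirroring the rule described in the log.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return sorted(tok for tok, df in doc_freq.items()
                  if no_below <= df <= no_above * n_docs)


demo = [['harry', 'the', 'wand'],
        ['ron', 'the', 'wand'],
        ['hermione', 'the', 'book'],
        ['harry', 'the', 'owl']]
print(filter_by_document_frequency(demo, no_below=2, no_above=0.5))
```

Here 'the' is dropped for appearing in every document, and 'ron', 'hermione', 'book' and 'owl' are dropped for appearing in only one.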


Finally, we transform the documents to a vectorized form. We simply compute
the frequency of each word, including the bigrams.




In [224]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

Let's see how many tokens and documents we have to train on.




In [225]:
# Corpus is a list of documents, each a list of (token_id, count) tuples from the dictionary

len(corpus)

5375

In [177]:
type(corpus[0])

list

In [178]:
corpus[0][:10]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1)]
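Each pair above is a `(token_id, count)` tuple. To see how such a representation arises, here is a small standard-library sketch. The helpers `build_token_ids` and `to_bow` are hypothetical, and the IDs are assigned in order of first appearance, which is my assumption and not necessarily how Gensim's `Dictionary` numbers tokens.

```python
from collections import Counter


def build_token_ids(tokenized_docs):
    """Map each distinct token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one token list to sorted (token_id, count) pairs,
    silently skipping tokens that are not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


demo = [['owl', 'post', 'owl'], ['post', 'wand']]
token2id = build_token_ids(demo)
print(token2id)
print([to_bow(doc, token2id) for doc in demo])
```

Word order is discarded: only which tokens occur, and how often, survives into the corpus.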

In [226]:
print('Number of unique tokens: %d' % len(dictionary))  # Orig 1864
print('Number of documents: %d' % len(corpus))          # Orig 1740

Number of unique tokens: 3939
Number of documents: 5375


## Training

We are ready to train the LDA model. We will first discuss how to set some of
the training parameters.

First of all, the elephant in the room: how many topics do I need? There is
really no easy answer for this, it will depend on both your data and your
application. I have used 10 topics here because I wanted to have a few topics
that I could interpret and "label", and because that turned out to give me
reasonably good results. You might not need to interpret all your topics, so
you could use a large number of topics, for example 100.

``chunksize`` controls how many documents are processed at a time in the
training algorithm. Increasing chunksize will speed up training, at least as
long as the chunk of documents easily fits into memory. The original tutorial
set ``chunksize = 2000``, which was more than its 1740 documents, so all the
data was processed in one go. This corpus has 5375 documents, so with the same
setting each pass is split into three chunks (the training log reports
"updating model once every 2000 documents"). Chunksize can however influence
the quality of the model, as discussed in Hoffman and co-authors [2]; the
original author reports that the difference was not substantial for that corpus.
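A quick back-of-the-envelope check of how `chunksize` interacts with corpus size: the sketch below counts chunks per pass and total chunk-level updates, assuming one model update per chunk, as the training log's "updating model once every 2000 documents" suggests. The function name `updates_per_run` is my own, and 20 is the `passes` value used in the training cell.

```python
import math


def updates_per_run(num_docs, chunksize, passes):
    """Return (chunks per pass, total chunk updates over all passes),
    assuming the model is updated once per chunk."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass, chunks_per_pass * passes


print(updates_per_run(5375, 2000, 20))  # this notebook's corpus
print(updates_per_run(1740, 2000, 20))  # the original tutorial's corpus
```

For this corpus that is 3 chunks per pass (2000, 2000 and 1375 documents), against a single chunk per pass in the original tutorial.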

``passes`` controls how often we train the model on the entire corpus.
Another word for passes might be "epochs". ``iterations`` is somewhat
technical, but essentially it controls how often we repeat a particular loop
over each document. It is important to set the number of "passes" and
"iterations" high enough.

I suggest the following way to choose iterations and passes. First, enable
logging (as described in many Gensim tutorials), and set ``eval_every = 1``
in ``LdaModel``. When training the model look for a line in the log that
looks something like this (it is emitted at DEBUG level, so the
``level=logging.INFO`` configured at the top of this notebook would hide it)::

   2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set ``passes = 20`` you will see this line 20 times. Make sure that by
the final passes, most of the documents have converged. So you want to choose
both passes and iterations to be high enough for this to happen.

We set ``alpha = 'auto'`` and ``eta = 'auto'``. Again this is somewhat
technical, but essentially we are automatically learning two parameters in
the model that we usually would have to specify explicitly.




In [227]:
#@markdown How many Topics do you want to find?

No_of_Topics = 10 #@param {type:"slider", min:2, max:200, step:1}

#@markdown Default 10-50 depending upon how large the text and diverse the vocabulary

In [None]:
%%time

# NOTE: 2m01s @20220328 with Google Colab/CPU on Harry Potter

# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = No_of_Topics
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

2022-03-28 18:19:30,976 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2022-03-28 18:19:30,980 : INFO : using serial LDA version on this node
2022-03-28 18:19:30,990 : INFO : running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 5375 documents, updating model once every 2000 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-03-28 18:19:30,995 : INFO : PROGRESS: pass 0, at document #2000/5375
2022-03-28 18:19:39,605 : INFO : optimized alpha [0.055956107, 0.06970318, 0.085161895, 0.053631734, 0.0656931, 0.063605644, 0.049110614, 0.063020125, 0.06691721, 0.051566537]
2022-03-28 18:19:39,610 : INFO : merging changes from 2000 documents into a model of 5375 documents
2022-03-28 18:19:39,621 : INFO : topic #6 (0.049): 0.008*"have" + 0.008*"there" + 0.007*"their" + 0.006*"if" + 0.006*"mr" + 0.006*"is" + 0.006*"been" + 0.006*"my" + 0.005*"didn" +

We can compute the topic coherence of each topic. Below we display the
average topic coherence and print the topics in order of topic coherence.

Note that we use the "Umass" topic coherence measure here (see
`gensim.models.ldamodel.LdaModel.top_topics`). Gensim also has an
implementation of the "AKSW" topic coherence measure (see the
accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).
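For intuition about what a UMass-style score measures, it is commonly written as a sum, over ordered pairs of a topic's top words, of log((D(w_j, w_i) + 1) / D(w_i)), where D counts the documents containing the given words and w_i is ranked above w_j. The sketch below implements that textbook form in plain Python. It is not Gensim's code, which may differ in smoothing and in how pair scores are aggregated, so do not expect its numbers to match `top_topics`; the name `umass_coherence` is made up for illustration.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """Textbook UMass-style coherence for one topic.

    top_words: the topic's words, most probable first.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for j in range(1, len(top_words)):
        for i in range(j):
            co_occurrences = doc_count(top_words[j], top_words[i])
            score += math.log((co_occurrences + 1) / doc_count(top_words[i]))
    return score


demo_docs = [['harry', 'wand'], ['harry'], ['harry'], ['wand']]
print(umass_coherence(['harry', 'wand'], demo_docs))
```

Words that rarely share a document push the score further below zero, so values closer to zero indicate a more coherent topic under this measure.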

If you are familiar with the subject of the articles in this dataset, you can
see that the topics below make a lot of sense. However, they are not without
flaws. We can see that there is substantial overlap between some topics,
others are hard to interpret, and most of them have at least some terms that
seem out of place. If you were able to do better, feel free to share your
methods on the blog at http://rare-technologies.com/lda-training-tips/ !




In [185]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.

from pprint import pprint
pprint(top_topics)

print('\n\n')
for i,atopic in enumerate(top_topics):
  print(f'Topic #{i}: coherence = {atopic[1]}')

avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('\nAverage topic coherence: %.4f.' % avg_topic_coherence)

2022-03-28 17:56:42,676 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2022-03-28 17:56:42,722 : INFO : CorpusAccumulator accumulated stats from 2000 documents
2022-03-28 17:56:42,773 : INFO : CorpusAccumulator accumulated stats from 3000 documents
2022-03-28 17:56:42,823 : INFO : CorpusAccumulator accumulated stats from 4000 documents
2022-03-28 17:56:42,878 : INFO : CorpusAccumulator accumulated stats from 5000 documents


[([(0.029319314, 'ron'),
   (0.029076837, 'she'),
   (0.023293182, 'her'),
   (0.021932947, 'hermione'),
   (0.010585379, 'do'),
   (0.007714369, 'are'),
   (0.0073170606, 'did'),
   (0.007317019, 'got'),
   (0.007127823, 'do_not'),
   (0.006803478, 'back'),
   (0.0067026764, 'just'),
   (0.0066377646, 'well'),
   (0.0063258954, 'know'),
   (0.005776355, 'about'),
   (0.0054152827, 'me'),
   (0.0053171036, 'would'),
   (0.0052627763, 'will'),
   (0.0051741274, 'we'),
   (0.005088116, 'looking'),
   (0.005057784, 'them')],
  -0.9431404662814848),
 ([(0.015426293, 'do'),
   (0.013684746, 'we'),
   (0.012124019, 'would'),
   (0.0112393955, 'are'),
   (0.010038636, 'know'),
   (0.009065553, 'if'),
   (0.009002145, 'did'),
   (0.008753977, 'me'),
   (0.008684587, 'ron'),
   (0.008467633, 'hermione'),
   (0.007663319, 'will'),
   (0.007631141, 'snape'),
   (0.007528578, 'do_not'),
   (0.0074608214, 'think'),
   (0.0073579396, 'about'),
   (0.0072945524, 'so'),
   (0.0072705885, 'this'),
   (

# Visualize

In [187]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [188]:
# vis_data = gensimvis.prepare(lda, corpus, dictionary)
vis_data = gensimvis.prepare(model, corpus, dictionary)
pyLDAvis.display(vis_data)

## Things to experiment with

* ``no_above`` and ``no_below`` parameters in ``filter_extremes`` method.
* Adding trigrams or even higher order n-grams.
* Consider whether using a hold-out set or cross-validation is the way to go for you.
* Try other datasets.

## Where to go from here

* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).
* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).
* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).
* If you haven't already, read [1] and [2] (see references).

## References

1. "Latent Dirichlet Allocation", Blei et al. 2003.
2. "Online Learning for Latent Dirichlet Allocation", Hoffman et al. 2010.


