# Topic Modeling for Fun and Profit

[Source](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html)

In this notebook we'll:

* vectorize a streamed corpus
* run topic modeling on streamed vectors, using gensim
* explore how to choose, evaluate and tweak topic modeling parameters
* persist trained models to disk, for later re-use

In the [previous notebook 1 - Streamed Corpora](https://radimrehurek.com/gensim_3.8.3/auto_examples/howtos/run_compare_lda.html) we used the 20newsgroups corpus to demonstrate data preprocessing and streaming.

Now we'll switch to the English Wikipedia and do some topic modeling. Link: https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html#sphx-glr-auto-examples-core-run-corpora-and-vector-spaces-py

In [None]:
from datetime import datetime

# datetime object containing current date and time
now = datetime.now()

print("Begun at", now)

Begun at 2024-04-10 03:43:05.685010


In [None]:
!pip install six cython numpy scipy ipython[notebook]
!pip install nltk gensim pattern requests textblob
!python -m textblob.download_corpora lite
!pip install --upgrade gensim
!pip install --upgrade smart_open

Collecting jedi>=0.16 (from ipython[notebook])
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.1
Collecting pattern
  Downloading Pattern-3.6.0.tar.gz (22.2 MB)
  Preparing metadata (setup.py) ... done
Collecting backports.csv (from pattern)
  Downloading backports.csv-1.0.7-py2.py3-none-any.whl (12 kB)
Collecting mysqlclient (from pattern)
  Downloading mysqlclient-2.2.4.tar.gz (90 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... 

In [None]:
!rm -f download_data.py && wget 'https://raw.githubusercontent.com/piskvorky/topic_modeling_tutorial/master/download_data.py'
#
# The older datasets are no longer available, use the latest one.
!sed -i 's/20140623/latest/g' download_data.py
#
# wikimedia sometimes refuses to connect due to excessive load
# use a mirror site instead. see https://dumps.wikimedia.org/mirrors.html
!sed -i 's|dumps.wikimedia.org|dumps.wikimedia.your.org|g' download_data.py

--2024-04-10 03:44:28--  https://raw.githubusercontent.com/piskvorky/topic_modeling_tutorial/master/download_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2101 (2.1K) [text/plain]
Saving to: ‘download_data.py’


2024-04-10 03:44:28 (19.4 MB/s) - ‘download_data.py’ saved [2101/2101]



In [None]:
!rm -rf ./data
!mkdir ./data
!python download_data.py ./data

2024-04-10 03:44:29,633 : MainThread : INFO : running download_data.py ./data
2024-04-10 03:44:29,634 : MainThread : INFO : downloading http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz into ./data/20news-bydate.tar.gz
2024-04-10 03:44:30,728 : MainThread : INFO : downloaded 14464277 bytes
2024-04-10 03:44:30,728 : MainThread : INFO : downloading http://dumps.wikimedia.your.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 into ./data/simplewiki-latest-pages-articles.xml.bz2
2024-04-10 03:44:34,527 : MainThread : INFO : downloaded 235367506 bytes
2024-04-10 03:44:34,528 : MainThread : INFO : finished running download_data.py


In [None]:
# import and setup modules we'll be using in this notebook
import logging
import itertools

import numpy as np
import gensim

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

def head(stream, n=10):
    """Convenience fnc: return the first `n` elements of the stream, as plain list."""
    return list(itertools.islice(stream, n))

In [None]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import WikiCorpus, MmCorpus
path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
corpus_path = get_tmpfile("wiki-corpus.mm")
wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping

texts = [' '.join(txt) for txt in wiki.get_texts()]
print(texts[0])
print(texts[1])

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
INFO:gensim.utils:Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2024-04-10T03:44:36.509490', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'platform': 'Linux-6.1.58+-x86_64-with-glibc2.35', 'event': 'created'}
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.wikicorpus:finished iterating over Wikipedia corpus of 106 documents with 452944 positions (total 206 articles, 453267 positions before pruning articles shorter than 50 words)
INFO:gensim.corpora.dictionary:built Dictionary<34212 unique tokens: 

anarchism is political philosophy that advocates self governed societies based on voluntary institutions these are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations anarchism considers the state to be undesirable unnecessary and harmful while anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy many types and traditions of anarchism exist not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and in

In [None]:
from smart_open import open as sopen  # `smart_open.smart_open()` is deprecated in favour of `open()`
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora.wikicorpus import _extract_pages, filter_wiki

def tokenize(text):
    """Lowercase and tokenize `text`, dropping gensim's built-in stopwords."""
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

def iter_wiki(dump_file):
    """Yield each article from the Wikipedia dump, as a `(title, tokens)` 2-tuple."""
    ignore_namespaces = 'Wikipedia Category File Portal Template MediaWiki User Help Book Draft'.split()
    # open in binary mode; the .bz2 compression is handled transparently
    with sopen(dump_file, 'rb') as dump:
        for title, text, pageid in _extract_pages(dump):
            text = filter_wiki(text)
            tokens = tokenize(text)
            if len(tokens) < 50 or any(title.startswith(ns + ':') for ns in ignore_namespaces):
                continue  # ignore short articles and various meta-articles
            yield title, tokens

In [None]:
# only use simplewiki in this tutorial (fewer documents)
# the full wiki dump is exactly the same format, but larger
wiki_file = './data/simplewiki-latest-pages-articles.xml.bz2'
stream = iter_wiki(wiki_file)
for title, tokens in itertools.islice(stream, 8):
    print(title, tokens[:10])  # print the article title and its first ten tokens

April ['april', 'fourth', 'month', 'year', 'julian', 'gregorian', 'calendars', 'comes', 'march', 'months']
August ['august', 'aug', 'eighth', 'month', 'year', 'gregorian', 'calendar', 'coming', 'july', 'september']
Art ['painting', 'renoir', 'work', 'art', 'art', 'creative', 'activity', 'expresses', 'imaginative', 'technical']
A ['writing', 'cursive', 'font', 'letter', 'english', 'alphabet', 'small', 'letter', 'lower', 'case']
Air ['fan', 'air', 'air', 'refers', 'earth', 'atmosphere', 'air', 'mixture', 'gases', 'tiny']
Autonomous communities of Spain ['spain', 'divided', 'parts', 'called', 'autonomous', 'communities', 'autonomous', 'means', 'autonomous', 'communities']
Alan Turing ['statue', 'alan', 'turing', 'turing', 'idea', 'bombe', 'mechanical', 'details', 'added', 'built']
Alanis Morissette ['alanis', 'nadine', 'morissette', 'born', 'june', 'grammy', 'award', 'winning', 'canadian', 'american']


In [None]:
# a dictionary maps integer ids to words; a small hand-written example:
id2word = {0: 'word', 2: 'profit', 300: 'another_word'}

In [None]:
doc_stream = (tokens for _, tokens in iter_wiki(wiki_file))

In [None]:
%time id2word_wiki = gensim.corpora.Dictionary(doc_stream)
print(id2word_wiki)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary<0 unique tokens: []>
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary<168992 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #20000 to Dictionary<246465 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #30000 to Dictionary<307692 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #40000 to Dictionary<366887 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #50000 to Dictionary<433236 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
INFO:gensim.corpora.dictionary:adding document #60000 to Dictionary<469090 unique tokens: ['abdicated', 'abdicates', 'abraham', 'a

CPU times: user 11min 52s, sys: 1.73 s, total: 11min 53s
Wall time: 11min 59s
Dictionary<650295 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>


In [None]:
# ignore words that appear in fewer than 20 documents or in more than 10% of the documents
id2word_wiki.filter_extremes(no_below=20, no_above=0.1)
print(id2word_wiki)

INFO:gensim.corpora.dictionary:discarding 610151 tokens: [('alvares', 4), ('american', 20610), ('aperire', 1), ('april', 10648), ('arbroath', 17), ('born', 24070), ('chakri', 16), ('city', 15421), ('cosmonauts', 18), ('davidians', 7)]...
INFO:gensim.corpora.dictionary:keeping 40144 tokens which were in no less than 20 and no more than 9180 (=10.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary<40144 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>


Dictionary<40144 unique tokens: ['abdicated', 'abdicates', 'abraham', 'additionally', 'adolf']...>
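To make the pruning rule concrete, here is a minimal pure-Python sketch of document-frequency filtering on a toy corpus. The helper `filter_by_doc_freq` and the toy documents are illustrative assumptions, not gensim's implementation; the thresholds follow the rule reported in the log above (keep tokens found in no fewer than `no_below` documents and in no more than `no_above` of all documents).

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens that occur in at least `no_below` documents and
    in at most `no_above` (a fraction between 0 and 1) of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = int(no_above * len(docs))  # fraction -> absolute document count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['cat', 'sat']
```

Here "the" is dropped for being too common (4 of 4 documents), while "dog", "ran", "bird" and "flew" are dropped for being too rare (1 document each).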


In [None]:
now = datetime.now()

print("Done with SimpleWiki at", now)

Done with SimpleWiki at 2024-04-10 03:56:58.478357



**Question 1:** Print all words and their ids from id2word_wiki where the word starts with "human".

**Note for advanced users:** In fully online scenarios, where the documents can only be streamed once (no repeating the stream), we can't exhaust the document stream just to build a dictionary. In this case we can map strings directly into their integer hash, using a hashing function such as MurmurHash or MD5. This is called the "[hashing trick](https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick)". A dictionary built this way is more difficult to debug, because there may be hash collisions: multiple words represented by a single id. See the documentation of [HashDictionary](https://radimrehurek.com/gensim/corpora/hashdictionary.html) for more details.
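The hashing trick can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for the hash function, and the names `hash_token`, `doc2bow_hashed` and the `id_range` default are hypothetical, not the `HashDictionary` API.

```python
import zlib

def hash_token(token, id_range=32000):
    """Map a token to an integer id without any stored vocabulary."""
    return zlib.crc32(token.encode("utf-8")) % id_range

def doc2bow_hashed(tokens, id_range=32000):
    """Bag-of-words over hashed ids; tokens that collide share one count."""
    counts = {}
    for token in tokens:
        token_id = hash_token(token, id_range)
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

tokens = ["blood", "cell", "blood", "hematocyte"]
hashed_bow = doc2bow_hashed(tokens)
print(hashed_bow)

# With only two possible ids, three distinct tokens must collide somewhere.
tiny_bow = doc2bow_hashed(tokens, id_range=2)
print(tiny_bow)
```

No pass over the corpus is needed before vectorizing, which is the point of the trick. The price is that an id can no longer be mapped back to a unique word.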

## Vectorization
A streamed corpus and a dictionary are all we need to create [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) vectors:

In [None]:
doc = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."
bow = id2word_wiki.doc2bow(tokenize(doc))
print(bow)

[(989, 1), (1176, 2), (1262, 1), (3368, 2)]
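The result has only four entries because tokens that are missing from the dictionary (for example, words pruned by `filter_extremes` or never seen in the corpus) are ignored. A minimal sketch of this conversion in plain Python, with a hypothetical `doc2bow_sketch` helper and made-up ids:

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert tokens to a sorted list of (token_id, count) pairs.
    Tokens that are not in `token2id` are silently dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# hypothetical ids, chosen for illustration only
toy_token2id = {"blood": 0, "cell": 1, "normally": 2, "produced": 3}
toy_tokens = ["blood", "cell", "called", "hematocyte", "cell", "produced",
              "hematopoiesis", "normally", "found", "blood"]
print(doc2bow_sketch(toy_tokens, toy_token2id))  # [(0, 2), (1, 2), (2, 1), (3, 1)]
```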


In [None]:
print(id2word_wiki[10882])

naruhito


In [None]:
# NOTE: this class shadows the `WikiCorpus` imported from gensim.corpora earlier
class WikiCorpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        """
        Parse the first `clip_docs` Wikipedia documents from file `dump_file`.
        Yield each document in turn, as a bag-of-words vector: a list of
        `(token_id, count)` 2-tuples built with `dictionary`.

        """
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs

    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_wiki(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        # only meaningful when `clip_docs` is set; with clip_docs=None, len() raises TypeError
        return self.clip_docs

# create a stream of bag-of-words vectors
wiki_corpus = WikiCorpus(wiki_file, id2word_wiki)
vector = next(iter(wiki_corpus))
print(vector)  # print the first vector in the stream

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 1), (8, 1), (9, 2), (10, 2), (11, 3), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 5), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 4), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 3), (36, 3), (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 5), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 10), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 1), (70, 2), (71, 2), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 2), (87, 1), (88, 2), (89, 1), (90, 2), (91, 1), (92, 2), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 2), (100, 2), (101, 2), (102, 4), (103, 2), (104, 1), (105, 1), (106, 2), (107, 1), (108, 1), (109, 2), (110, 1)
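The corpus is wrapped in a class with `__iter__` rather than in a plain generator because a generator can be consumed only once, while multi-pass training needs to walk the corpus repeatedly. The following sketch illustrates the difference with toy data; `token_stream` and `RestartableCorpus` are hypothetical names used only for this example.

```python
def token_stream():
    """Stand-in for a streamed source such as a Wikipedia dump (toy data)."""
    yield ["april", "fourth", "month"]
    yield ["august", "eighth", "month"]

class RestartableCorpus:
    """Iterable that re-opens its source at the start of every loop."""
    def __init__(self, source_factory):
        self.source_factory = source_factory

    def __iter__(self):
        return iter(self.source_factory())

one_shot = token_stream()      # a generator object
first_pass = list(one_shot)    # consumes the generator
second_pass = list(one_shot)   # nothing left to read

corpus = RestartableCorpus(token_stream)
passes = [len(list(corpus)) for _ in range(3)]  # every pass restarts the stream

print(len(first_pass), len(second_pass), passes)  # 2 0 [2, 2, 2]
```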

In [None]:
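len(vector)  # number of distinct dictionary words in the first article
max([pair[1] for pair in vector])  # highest count of any single word

# position (within `vector`) of the word that occurs 15 times
index = [pair[1] for pair in vector].index(15)
index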

628

In [None]:
# what is the most common word in that first article?

(most_index, most_count) = max(vector, key=lambda pair: pair[1])
print(id2word_wiki[most_index], most_count)

week 15


In [None]:
%time gensim.corpora.MmCorpus.serialize('./data/wiki_bow.mm', wiki_corpus)

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to ./data/wiki_bow.mm
INFO:gensim.matutils:saving sparse matrix to ./data/wiki_bow.mm
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving document #1

CPU times: user 11min 59s, sys: 3.6 s, total: 12min 3s
Wall time: 12min 9s


In [None]:
mm_corpus = gensim.corpora.MmCorpus('./data/wiki_bow.mm')
print(mm_corpus)

INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_bow.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_bow.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 40144 features, 8783660 non-zero entries


MmCorpus(91800 documents, 40144 features, 8783660 non-zero entries)


In [None]:
print(next(iter(mm_corpus)))

[(0, 1.0), (1, 2.0), (2, 1.0), (3, 1.0), (4, 2.0), (5, 1.0), (6, 2.0), (7, 1.0), (8, 1.0), (9, 2.0), (10, 2.0), (11, 3.0), (12, 1.0), (13, 1.0), (14, 1.0), (15, 1.0), (16, 2.0), (17, 1.0), (18, 5.0), (19, 1.0), (20, 1.0), (21, 1.0), (22, 1.0), (23, 1.0), (24, 1.0), (25, 1.0), (26, 2.0), (27, 4.0), (28, 1.0), (29, 1.0), (30, 1.0), (31, 2.0), (32, 1.0), (33, 1.0), (34, 1.0), (35, 3.0), (36, 3.0), (37, 1.0), (38, 1.0), (39, 2.0), (40, 1.0), (41, 1.0), (42, 1.0), (43, 1.0), (44, 1.0), (45, 1.0), (46, 1.0), (47, 2.0), (48, 1.0), (49, 1.0), (50, 5.0), (51, 1.0), (52, 1.0), (53, 1.0), (54, 1.0), (55, 1.0), (56, 1.0), (57, 1.0), (58, 1.0), (59, 1.0), (60, 10.0), (61, 2.0), (62, 1.0), (63, 1.0), (64, 1.0), (65, 1.0), (66, 1.0), (67, 2.0), (68, 1.0), (69, 1.0), (70, 2.0), (71, 2.0), (72, 1.0), (73, 2.0), (74, 1.0), (75, 1.0), (76, 1.0), (77, 1.0), (78, 1.0), (79, 1.0), (80, 1.0), (81, 1.0), (82, 1.0), (83, 1.0), (84, 2.0), (85, 1.0), (86, 2.0), (87, 1.0), (88, 2.0), (89, 1.0), (90, 2.0), (91, 1.

## Semantic transformations
Topic modeling in gensim is realized via transformations. A transformation is something that takes a corpus and spits out another corpus on output, using `corpus_out = transformation_object[corpus_in]` syntax. What exactly happens in between is determined by what kind of transformation we're using -- options are Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP) etc.
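The bracket syntax can be illustrated with a toy transformation written in plain Python. This is a sketch of the idea only, not gensim's code: `ScaleTransformation`, `LazyCorpus` and `is_single_document` are hypothetical names, and the "model" merely multiplies weights by a constant.

```python
def is_single_document(obj):
    """Toy check: a document is a list of (id, weight) tuples."""
    return isinstance(obj, list) and all(isinstance(x, tuple) for x in obj)

class ScaleTransformation:
    """Toy transformation: multiplies every weight by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, bow_or_corpus):
        if is_single_document(bow_or_corpus):
            return [(term_id, weight * self.factor) for term_id, weight in bow_or_corpus]
        return LazyCorpus(self, bow_or_corpus)  # defer the work until iteration

class LazyCorpus:
    """Applies `transformation` to each document only when iterated over."""
    def __init__(self, transformation, corpus):
        self.transformation = transformation
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transformation[doc]

toy_corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
double = ScaleTransformation(2.0)
lazy = double[toy_corpus]             # corpus in, corpus out; nothing computed yet
stacked = double[double[toy_corpus]]  # transformations can be chained

print(list(lazy))     # [[(0, 2.0), (2, 6.0)], [(1, 4.0)]]
print(list(stacked))  # [[(0, 4.0), (2, 12.0)], [(1, 8.0)]]
```

Because the wrapped corpus is only read during iteration, the transformed corpus never has to fit in memory at once.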

Some transformations need to be initialized (=trained) before they can be used. For example, let's train an LDA transformation model, using our bag-of-words WikiCorpus as training data:

In [None]:
from gensim.utils import SaveLoad
class ClippedCorpus(SaveLoad):
    def __init__(self, corpus, max_docs=None):
        """
        Return a corpus that is the "head" of input iterable `corpus`.

        Any documents after `max_docs` are ignored. This effectively limits the
        length of the returned corpus to <= `max_docs`. Set `max_docs=None` for
        "no limit", effectively wrapping the entire input corpus.

        """
        self.corpus = corpus
        self.max_docs = max_docs

    def __iter__(self):
        return itertools.islice(self.corpus, self.max_docs)

    def __len__(self):
        if self.max_docs is None:  # "no limit": the whole input corpus
            return len(self.corpus)
        return min(self.max_docs, len(self.corpus))

# ClippedCorpus is part of gensim since 0.10.1; on older versions, use the class defined above
# (copied from https://github.com/piskvorky/gensim/blob/0.10.1/gensim/utils.py#L467) or upgrade gensim
clipped_corpus = gensim.utils.ClippedCorpus(mm_corpus, 4000)  # use fewer documents during training, LDA is slow
%time lda_model = gensim.models.LdaModel(clipped_corpus, num_topics=10, id2word=id2word_wiki, passes=4)

INFO:gensim.models.ldamodel:using symmetric alpha at 0.1
INFO:gensim.models.ldamodel:using symmetric eta at 0.1
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 10 topics, 4 passes over the supplied corpus of 4000 documents, updating model once every 2000 documents, evaluating perplexity every 4000 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #2000/4000
INFO:gensim.models.ldamodel:merging changes from 2000 documents into a model of 4000 documents
INFO:gensim.models.ldamodel:topic #4 (0.100): 0.003*"country" + 0.003*"countries" + 0.002*"example" + 0.002*"language" + 0.002*"german" + 0.002*"usually" + 0.002*"mario" + 0.002*"person" + 0.002*"things" + 0.002*"president"
INFO:gensim.models.ldamodel:topic #5 (0.100): 0.003*"number" + 0.002*"example" + 0.002*"usually" + 0.002*"country" + 0.002*"government" + 0.002*"word" + 0.002*

CPU times: user 46.2 s, sys: 23.1 s, total: 1min 9s
Wall time: 46.2 s


In [None]:
_ = lda_model.print_topics(-1)  # print a few most important words for each LDA topic

INFO:gensim.models.ldamodel:topic #0 (0.100): 0.008*"album" + 0.007*"music" + 0.007*"band" + 0.007*"movie" + 0.005*"released" + 0.005*"award" + 0.004*"series" + 0.004*"song" + 0.004*"film" + 0.004*"movies"
INFO:gensim.models.ldamodel:topic #1 (0.100): 0.005*"jpg" + 0.005*"windows" + 0.005*"capital" + 0.004*"century" + 0.004*"music" + 0.004*"game" + 0.004*"games" + 0.004*"bc" + 0.004*"rural" + 0.003*"file"
INFO:gensim.models.ldamodel:topic #2 (0.100): 0.010*"water" + 0.010*"rgb" + 0.010*"hex" + 0.006*"color" + 0.005*"food" + 0.005*"plants" + 0.004*"sea" + 0.004*"animals" + 0.004*"red" + 0.003*"green"
INFO:gensim.models.ldamodel:topic #3 (0.100): 0.014*"actor" + 0.014*"president" + 0.011*"actress" + 0.008*"singer" + 0.005*"player" + 0.005*"league" + 0.005*"writer" + 0.005*"prime" + 0.005*"british" + 0.004*"politician"
INFO:gensim.models.ldamodel:topic #4 (0.100): 0.007*"language" + 0.005*"person" + 0.004*"word" + 0.004*"words" + 0.004*"languages" + 0.004*"usually" + 0.004*"example" + 0.0

In [None]:
now = datetime.now()

print("LDA Topic Models computed at", now)

LDA Topic Models computed at 2024-04-10 04:09:54.155878


More information on the model parameters is available in the [gensim docs](https://radimrehurek.com/gensim/models/lsimodel.html).

Transformations can be stacked. For example, here we'll train a TFIDF model, and then train Latent Semantic Analysis on top of TFIDF:

In [None]:
%time tfidf_model = gensim.models.TfidfModel(mm_corpus, id2word=id2word_wiki)

INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #10000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #20000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #30000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #40000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #50000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #60000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #70000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #80000
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #90000
INFO:gensim.utils:TfidfModel lifecycle event {'msg': 'calculated IDF weights for 91800 documents and 40144 features (8783660 matrix non-zeros)', 'datetime': '2024-04-10T04:10:03.493705', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'p

CPU times: user 9.06 s, sys: 214 ms, total: 9.28 s
Wall time: 9.32 s
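As a rough illustration of what the TFIDF step computes, the sketch below reweights a toy bag-of-words corpus in plain Python. It assumes the commonly documented default scheme (raw term frequency times `log2(N / df)`, followed by L2 normalization of each document); the function `tfidf_sketch` and the toy data are hypothetical, and the real `TfidfModel` offers other weighting options.

```python
import math

def tfidf_sketch(corpus):
    """Return L2-normalized tf * log2(N / df) weights for a list of BoW docs."""
    num_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    idf = {t: math.log2(num_docs / df) for t, df in doc_freq.items()}

    weighted = []
    for doc in corpus:
        # terms present in every document get idf == 0 and are dropped
        vec = [(t, tf * idf[t]) for t, tf in doc if idf[t] > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted

toy_bow = [
    [(0, 1), (1, 2)],          # term 0 occurs in every document
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1), (2, 1)],
]
toy_tfidf = tfidf_sketch(toy_bow)
for doc in toy_tfidf:
    print([(t, round(w, 3)) for t, w in doc])
```

Term 0 carries no information about which document it came from, so it disappears from every vector, however often it occurs.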


In [None]:
%time lsi_model = gensim.models.LsiModel(tfidf_model[mm_corpus], id2word=id2word_wiki, num_topics=200)

INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:preparing a new chunk of documents
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (40144, 300) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (40144, 300) action matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (300, 20000) matrix
INFO:gensim.models.lsimodel:computing the final decomposition
INFO:gensim.models.lsimodel:keeping 200 factors (discarding 15.013% of energy spectrum)
INFO:gensim.models.lsimodel:processed documents up to #20000
INFO:gensim.models.lsimodel:topic #0(15.817): 0.225*"footballer" + 0.225*"actor" + 0.218*"politician" + 0.206*"actress" + 0.188*"german" + 0.185*"singer" + 0.165*"french" + 0.160*"writer" + 0.144*"player" + 0.139*"british"
INFO:gensim.models.lsimodel:topic #1(10.686): 0.183*"footballer" + 0.

CPU times: user 2min 6s, sys: 21.2 s, total: 2min 28s
Wall time: 1min 56s



The LSI transformation goes from a space of high dimensionality (~TFIDF, tens of thousands) into a space of low dimensionality (a few hundred; here 200). For this reason it can also be seen as **dimensionality reduction**.

As always, the transformations are applied "lazily", so the resulting output corpus is streamed as well:

In [None]:
print(next(iter(lsi_model[tfidf_model[mm_corpus]])))

[(0, 0.22382995199186093), (1, 0.07502013345985202), (2, 0.0728078974966595), (3, -0.0019364768357310511), (4, 0.046147896026444146), (5, 0.05001920887859849), (6, 0.07490221584802599), (7, 0.05789168706277461), (8, 0.019038390516643425), (9, 0.020226992341137662), (10, 0.06913626586730945), (11, 0.0069941468609778555), (12, 0.06325476807518049), (13, 0.0322296449418242), (14, -0.014625906253587468), (15, -0.011722849282167769), (16, 0.04592667345978194), (17, -0.0032141374884588395), (18, 0.048882046675151426), (19, 0.007431499117849479), (20, 0.05591532189555523), (21, 0.07554520359808363), (22, 0.004999627992401104), (23, 0.04419881442027389), (24, -0.04106893887649706), (25, -0.03646815062924315), (26, 0.03485125805595076), (27, 0.027070115723071286), (28, -0.031059253805146608), (29, -0.04614605231632702), (30, -0.008657474574264752), (31, -0.013994759495503752), (32, -0.03116342308172094), (33, 0.04496359470994195), (34, 0.003027478649069748), (35, 0.003869282802852891), (36, -0.

In [None]:
# cache the transformed corpora to disk, for use in later notebooks
%time gensim.corpora.MmCorpus.serialize('./data/wiki_tfidf.mm', tfidf_model[mm_corpus])
%time gensim.corpora.MmCorpus.serialize('./data/wiki_lsa.mm', lsi_model[tfidf_model[mm_corpus]])
# gensim.corpora.MmCorpus.serialize('./data/wiki_lda.mm', lda_model[mm_corpus])

INFO:gensim.corpora.mmcorpus:storing corpus in Matrix Market format to ./data/wiki_tfidf.mm
INFO:gensim.matutils:saving sparse matrix to ./data/wiki_tfidf.mm
INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving documen

CPU times: user 46.4 s, sys: 1.4 s, total: 47.8 s
Wall time: 48.6 s


INFO:gensim.matutils:PROGRESS: saving document #0
INFO:gensim.matutils:PROGRESS: saving document #1000
INFO:gensim.matutils:PROGRESS: saving document #2000
INFO:gensim.matutils:PROGRESS: saving document #3000
INFO:gensim.matutils:PROGRESS: saving document #4000
INFO:gensim.matutils:PROGRESS: saving document #5000
INFO:gensim.matutils:PROGRESS: saving document #6000
INFO:gensim.matutils:PROGRESS: saving document #7000
INFO:gensim.matutils:PROGRESS: saving document #8000
INFO:gensim.matutils:PROGRESS: saving document #9000
INFO:gensim.matutils:PROGRESS: saving document #10000
INFO:gensim.matutils:PROGRESS: saving document #11000
INFO:gensim.matutils:PROGRESS: saving document #12000
INFO:gensim.matutils:PROGRESS: saving document #13000
INFO:gensim.matutils:PROGRESS: saving document #14000
INFO:gensim.matutils:PROGRESS: saving document #15000
INFO:gensim.matutils:PROGRESS: saving document #16000
INFO:gensim.matutils:PROGRESS: saving document #17000
INFO:gensim.matutils:PROGRESS: saving doc

CPU times: user 1min 29s, sys: 3.08 s, total: 1min 32s
Wall time: 1min 34s


In [None]:
tfidf_corpus = gensim.corpora.MmCorpus('./data/wiki_tfidf.mm')
# `tfidf_corpus` is now exactly the same as `tfidf_model[wiki_corpus]`
print(tfidf_corpus)

lsi_corpus = gensim.corpora.MmCorpus('./data/wiki_lsa.mm')
# and `lsi_corpus` now equals `lsi_model[tfidf_model[wiki_corpus]]` = `lsi_model[tfidf_corpus]`
print(lsi_corpus)

INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_tfidf.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_tfidf.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 40144 features, 8783660 non-zero entries
INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_lsa.mm.index
INFO:gensim.corpora._mmreader:initializing cython corpus reader from ./data/wiki_lsa.mm
INFO:gensim.corpora._mmreader:accepted corpus with 91800 documents, 200 features, 18359998 non-zero entries


MmCorpus(91800 documents, 40144 features, 8783660 non-zero entries)
MmCorpus(91800 documents, 200 features, 18359998 non-zero entries)


In [None]:
now = datetime.now()

print("LSI Topic Models computed at", now)

LSI Topic Models computed at 2024-04-10 04:14:22.898972


## Transforming unseen documents
We can use the trained models to transform new, unseen documents into the semantic space:

In [None]:
text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."

# transform text into the bag-of-words space
bow_vector = id2word_wiki.doc2bow(tokenize(text))
print([(id2word_wiki[word_id], count) for word_id, count in bow_vector])

[('normally', 1), ('blood', 2), ('produced', 1), ('cell', 2)]


In [None]:
# transform into LDA space
lda_vector = lda_model[bow_vector]
print(lda_vector)
# print the document's single most prominent LDA topic
print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))

[(0, 0.014289404), (1, 0.014288319), (2, 0.01429191), (3, 0.014287765), (4, 0.014288522), (5, 0.014288369), (6, 0.014288897), (7, 0.014287696), (8, 0.8714012), (9, 0.014287937)]
0.006*"body" + 0.005*"earth" + 0.005*"person" + 0.005*"energy" + 0.005*"blood" + 0.004*"things" + 0.004*"light" + 0.003*"god" + 0.003*"example" + 0.003*"water"


**Question 2**: print text transformed into TFIDF space.

For stacked transformations, apply the same stack during transformation as was applied during training:

In [None]:
# transform into LSI space
lsi_vector = lsi_model[tfidf_model[bow_vector]]
print(lsi_vector)
# print the document's single most prominent LSI topic (not interpretable like LDA!)
print(lsi_model.print_topic(max(lsi_vector, key=lambda item: abs(item[1]))[0]))

[(0, 0.020720004660641587), (1, 0.013357516771328265), (2, -0.009379332782649701), (3, -0.015672874403302495), (4, 0.012308826662221144), (5, 0.022699981295838098), (6, -0.022526631153698864), (7, 0.019038195679010304), (8, -0.0012577086915277324), (9, 0.0014327420506452253), (10, 0.003144889803628971), (11, -0.0030415043510007914), (12, -0.0013726238369506367), (13, 0.017716179961230406), (14, -0.01441679191590546), (15, -0.003794538556045766), (16, -0.0061187163008749115), (17, -0.007431193772153609), (18, -0.008753079511298117), (19, -0.016543257261259398), (20, -0.016521488252354867), (21, -0.00583841055182623), (22, -0.04051803652323055), (23, -0.0014276752036127853), (24, 0.0520993609955044), (25, 0.0069610756826950825), (26, 0.030211649107315077), (27, -0.03561156244716797), (28, -0.010036503299963367), (29, 0.019675195617331234), (30, -0.009719054808061366), (31, 0.005839746042498598), (32, 0.0008693477420477864), (33, -0.023744623460863257), (34, -0.0032109367000043333), (35, 
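Note that the LSI cell selects the topic by `abs(...)`, while the LDA cell compares the raw values. LDA output is a probability distribution, so the largest value wins; LSI weights are signed, so a strongly negative weight can be the most prominent one. A small sketch with made-up vectors (the helper `dominant_topic` is hypothetical):

```python
def dominant_topic(topic_vector, signed=False):
    """Return the (topic_id, weight) pair with the largest weight.
    With `signed=True`, compare magnitudes so strong negative weights count."""
    if signed:
        return max(topic_vector, key=lambda item: abs(item[1]))
    return max(topic_vector, key=lambda item: item[1])

toy_lda = [(0, 0.05), (1, 0.90), (2, 0.05)]   # non-negative, sums to 1
toy_lsi = [(0, 0.02), (1, -0.31), (2, 0.12)]  # signed weights

print(dominant_topic(toy_lda))               # (1, 0.9)
print(dominant_topic(toy_lsi, signed=True))  # (1, -0.31)
print(dominant_topic(toy_lsi))               # (2, 0.12), misses the strongest topic
```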

In [None]:
# store all trained models to disk
lda_model.save('./data/lda_wiki.model')
lsi_model.save('./data/lsi_wiki.model')
tfidf_model.save('./data/tfidf_wiki.model')
id2word_wiki.save('./data/wiki.dictionary')

INFO:gensim.utils:LdaState lifecycle event {'fname_or_handle': './data/lda_wiki.model.state', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-04-10T04:14:22.976755', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'platform': 'Linux-6.1.58+-x86_64-with-glibc2.35', 'event': 'saving'}
INFO:gensim.utils:saved ./data/lda_wiki.model.state
INFO:gensim.utils:LdaModel lifecycle event {'fname_or_handle': './data/lda_wiki.model', 'separately': "['expElogbeta', 'sstats']", 'sep_limit': 10485760, 'ignore': ['id2word', 'dispatcher', 'state'], 'datetime': '2024-04-10T04:14:23.041766', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'platform': 'Linux-6.1.58+-x86_64-with-glibc2.35', 'event': 'saving'}
INFO:gensim.utils:storing np array 'expElogbeta' to ./data/lda_wiki.model.expElogbeta.npy
INFO:gensim.utils:not storing attribute id2word
INFO:gensim.utils:not storing attribute dispatcher
INFO:ge

In [None]:

# load the same model back; the result is equal to `lda_model`
same_lda_model = gensim.models.LdaModel.load('./data/lda_wiki.model')

INFO:gensim.utils:loading LdaModel object from ./data/lda_wiki.model
INFO:gensim.utils:loading expElogbeta from ./data/lda_wiki.model.expElogbeta.npy with mmap=None
INFO:gensim.utils:setting ignored attribute id2word to None
INFO:gensim.utils:setting ignored attribute dispatcher to None
INFO:gensim.utils:setting ignored attribute state to None
INFO:gensim.utils:LdaModel lifecycle event {'fname': './data/lda_wiki.model', 'datetime': '2024-04-10T04:14:23.745980', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'platform': 'Linux-6.1.58+-x86_64-with-glibc2.35', 'event': 'loaded'}
INFO:gensim.utils:loading LdaState object from ./data/lda_wiki.model.state
INFO:gensim.utils:LdaState lifecycle event {'fname': './data/lda_wiki.model.state', 'datetime': '2024-04-10T04:14:23.749084', 'gensim': '4.3.2', 'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]', 'platform': 'Linux-6.1.58+-x86_64-with-glibc2.35', 'event': 'loaded'}


## Evaluation
Topic modeling is an **unsupervised task**; we do not know in advance what the topics ought to look like. This makes evaluation tricky: whereas in supervised learning (classification, regression) we simply compare predicted labels to expected labels, there are no "expected labels" in topic modeling.

Each topic modeling method (LSI, LDA...) has its own way of measuring internal quality (perplexity, reconstruction error...). But these measures are artifacts of the particular approach taken (Bayesian training, matrix factorization...), and mostly of academic interest. There's no way to compare such scores across different types of topic models, either. The best way to really evaluate the quality of an unsupervised task is to **evaluate how it improves the superordinate task, the one we're actually training it for**.

For example, when the ultimate goal is to retrieve semantically similar documents, we manually tag a set of similar documents and then see how well a given semantic model maps those similar documents together.

Such manual tagging can be resource intensive, so people have been looking for clever ways to automate it. In [Reading tea leaves: How humans interpret topic models](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf), Chang *et al.* suggest a "word intrusion" method that works well for models where the topics are meant to be "human interpretable", such as LDA. For each trained topic, they take its first ten words, then substitute one of them with another, randomly chosen word (intruder!) and see whether a human can reliably tell which one it was. If so, the trained topic is **topically coherent** (good); if not, the topic has no discernible theme (bad):

## Misplaced Words



In [None]:
# select the top 50 words for each LDA topic
# (show_topic() in gensim 4.x yields (word, probability) pairs, so keep the first element)
top_words = [[word for word, _ in lda_model.show_topic(topicno, topn=50)] for topicno in range(lda_model.num_topics)]
print(top_words)

[[0.008350637, 0.007397485, 0.006941463, 0.0065830783, 0.0049006, 0.004516132, 0.0041595628, 0.0041440614, 0.004063852, 0.004038208, 0.0038332418, 0.0038062783, 0.003426539, 0.0031350215, 0.0031331906, 0.0029130662, 0.0029055623, 0.0026874328, 0.0025749619, 0.0025652256, 0.0025554355, 0.0025443295, 0.0025236185, 0.002298591, 0.0022937448, 0.0022016324, 0.002194404, 0.0021291524, 0.0021244825, 0.0021002106, 0.0020945922, 0.0020711545, 0.0020437608, 0.0020324371, 0.0020308578, 0.0020105361, 0.002003498, 0.0020023382, 0.0019649477, 0.0019646515, 0.0019359563, 0.0019309411, 0.0019019216, 0.001898027, 0.001871013, 0.0018475604, 0.0018174546, 0.0017710046, 0.0017300264, 0.0017087219], [0.004720031, 0.0046042735, 0.0045501823, 0.004401547, 0.0040799337, 0.0038040925, 0.0036388119, 0.0036215037, 0.0035789288, 0.0034694197, 0.00331112, 0.0032919012, 0.0031988383, 0.0031890618, 0.003051889, 0.002947789, 0.002916951, 0.0027302834, 0.002703499, 0.0026302882, 0.0026198893, 0.002464489, 0.0024533728

In [None]:
# collect the top 50 words of every topic into one large set
all_words = set(itertools.chain.from_iterable(top_words))

print("Can you spot the misplaced word in each topic?")

# for each topic, replace a word at a different index, to make it more interesting
replace_index = np.random.randint(0, 10, lda_model.num_topics)

replacements = []
for topicno, words in enumerate(top_words):
    other_words = all_words.difference(words)
    replacement = np.random.choice(list(other_words))
    replacements.append((words[replace_index[topicno]], replacement))
    words[replace_index[topicno]] = replacement
    print(topicno, ' '.join(str(w) for w in words[:10]))

Can you spot the misplaced word in each topic?
0 0.008350637 0.007397485 0.006941463 0.0065830783 0.0049006 0.004516132 0.0041595628 0.0041440614 0.0045501823 0.004038208
1 0.004720031 0.0046042735 0.0045501823 0.004401547 0.0040799337 0.0038040925 0.0036388119 0.00282801 0.0035789288 0.0034694197
2 0.010379708 0.006941463 0.01022815 0.006335118 0.0048161442 0.0046633324 0.0038361708 0.0037611716 0.0035633536 0.0034463282
3 0.013647125 0.013625862 0.010835143 0.008190733 0.0050404808 0.0050088046 0.0047563515 0.004664614 0.00457027 0.0024304516
4 0.0071540223 0.004780012 0.0042741518 0.0039119376 0.0037019884 0.003685616 0.0035745662 0.0033098771 0.0031256536 0.0017638981
5 0.0054650092 0.004256164 0.0023579265 0.0034891535 0.003452445 0.0033750415 0.0031091657 0.0031069184 0.002869573 0.0028384211
6 0.0016881936 0.0060467264 0.0055051385 0.0049580876 0.004928199 0.00439853 0.0038965896 0.0038204468 0.003677861 0.003040766
7 0.013562761 0.0019019216 0.011208905 0.010848967 0.01074089 0

In [None]:
print("Actual replacements were:")
print(list(enumerate(replacements)))

Actual replacements were:
[(0, (0.004063852, 0.0045501823)), (1, (0.0036215037, 0.00282801)), (2, (0.01024255, 0.006941463)), (3, (0.0043451795, 0.0024304516)), (4, (0.0030545723, 0.0017638981)), (5, (0.0036209396, 0.0023579265)), (6, (0.0063656745, 0.0016881936)), (7, (0.012099842, 0.0019019216)), (8, (0.0049812603, 0.002144262)), (9, (0.0068101785, 0.0017219953))]
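
To turn the guessing game above into a number, one option is to record which word each human annotator picked and compute, per topic, the fraction of picks that hit the planted intruder. The following is a minimal sketch under that assumption. The function `intrusion_precision`, the example words and the annotator guesses are all made up for illustration and do not come from this notebook's data.

```python
def intrusion_precision(intruders, guesses):
    """Fraction of human guesses that match the true intruder, per topic.

    intruders: dict topic_id -> the word that was planted in that topic
    guesses:   dict topic_id -> list of words picked by human annotators
    """
    precision = {}
    for topicno, intruder in intruders.items():
        picked = guesses.get(topicno, [])
        if not picked:
            continue  # nobody rated this topic, so it gets no score
        hits = sum(word == intruder for word in picked)
        precision[topicno] = hits / len(picked)
    return precision


# hypothetical example: two topics, three annotators each
intruders = {0: "guitar", 1: "river"}
guesses = {0: ["guitar", "guitar", "blood"], 1: ["cell", "energy", "river"]}

scores = intrusion_precision(intruders, guesses)
print(scores)
```

A topic whose score is close to 1 has an intruder that stands out, which suggests a coherent theme; a score near chance level suggests the opposite.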


In [None]:
# evaluate on 1k documents **not** used in LDA training
doc_stream = (tokens for _, tokens in iter_wiki(wiki_file))  # generator
test_docs = list(itertools.islice(doc_stream, 8000, 9000))

In [None]:
def intra_inter(model, test_docs, num_pairs=10000):
    # split each test document at its own midpoint and compute topics for each half
    part1 = [model[id2word_wiki.doc2bow(tokens[: len(tokens) // 2])] for tokens in test_docs]
    part2 = [model[id2word_wiki.doc2bow(tokens[len(tokens) // 2 :])] for tokens in test_docs]

    # print computed similarities (uses cossim)
    print("average cosine similarity between corresponding parts (higher is better):")
    print(np.mean([gensim.matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)]))

    random_pairs = np.random.randint(0, len(test_docs), size=(num_pairs, 2))
    print("average cosine similarity between {:,} random parts (lower is better):".format(num_pairs))
    print(np.mean([gensim.matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs]))
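
The evaluation above rests on two small steps: cutting each document in half, and comparing two sparse topic vectors by cosine similarity. The sketch below shows both in plain Python on toy data. `sparse_cossim` is a hypothetical stand-in written for illustration, not gensim's own implementation, and the tokens and weights are invented.

```python
import math


def sparse_cossim(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty or all-zero vector is treated as similar to nothing
    dot = sum(weight * d2.get(idx, 0.0) for idx, weight in d1.items())
    return dot / (norm1 * norm2)


# split one tokenized document at its own midpoint
tokens = ["blood", "cell", "body", "energy", "light", "water"]
mid = len(tokens) // 2
first_half, second_half = tokens[:mid], tokens[mid:]
print(first_half, second_half)

# toy topic vectors for the two halves (made-up weights)
half1 = [(0, 0.9), (3, 0.1)]
half2 = [(0, 0.8), (2, 0.2)]
same_doc_sim = sparse_cossim(half1, half2)
unrelated_sim = sparse_cossim(half1, [(1, 1.0)])  # no shared topics
print(same_doc_sim, unrelated_sim)
```

A model that captures document themes should give high similarity to halves of the same document and low similarity to halves of unrelated ones, which is what the two averages printed by `intra_inter` are meant to show.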

In [None]:
print("LDA results:")
intra_inter(lda_model, test_docs)

LDA results:
average cosine similarity between corresponding parts (higher is better):
0.5163370637185376
average cosine similarity between 10,000 random parts (lower is better):
0.46533279287134977


In [None]:
print("LSI results:")
intra_inter(lsi_model, test_docs)

LSI results:
average cosine similarity between corresponding parts (higher is better):
0.06461499295266829
average cosine similarity between 10,000 random parts (lower is better):
0.008616671789049894


In [None]:
now = datetime.now()

print("Ended at", now)

Ended at 2024-04-10 04:15:39.087591
