Topic Modeling
===

We use `gensim` to train an LDA topic model on the WikiText-103 training set and `pyLDAvis` to visualize it.

`gensim` LdaMulticore documentation: https://radimrehurek.com/gensim/models/ldamulticore.html

`pyLDAvis` repository: https://github.com/bmabey/pyLDAvis

In [1]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np

from collections import Counter
from tqdm import tqdm

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib
import pylab as pl
from IPython.display import display, HTML

In [2]:
# data is stored relative to the root of the git repository
git_root_dir = !git rev-parse --show-toplevel
git_root_dir = Path(git_root_dir[0].strip())
git_root_dir

PosixPath('/home/levon003/repos/nlp-for-hci-workshop')
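
The `!git rev-parse` line above relies on IPython shell syntax, so it will not run in a plain Python script. A minimal sketch of one alternative walks up the parent directories until it finds a `.git` entry. `find_git_root` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains a .git entry."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        # .git is a directory in a normal clone and a file in worktrees/submodules
        if (candidate / ".git").exists():
            return candidate
    return None  # not inside a git repository


if __name__ == "__main__":
    print(find_git_root(Path.cwd()))
```

Unlike the `git` call, this does not need the `git` executable on the path; it only inspects the filesystem.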

In [3]:
wikitext_dir = git_root_dir / 'data' / 'wikitext-103'
train = wikitext_dir / "wiki.train.tokens"
valid = wikitext_dir / "wiki.valid.tokens"
test = wikitext_dir / "wiki.test.tokens"
assert train.exists() and valid.exists() and test.exists()

## Topic Modeling

In [4]:
import gensim
import logging
# gensim reports progress only through the logging module, so we configure logging to get feedback during training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [35]:
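# read the training file line by line (each line is a paragraph or heading), skipping blank and very short lines
sentences = []
with open(train, 'r', encoding='utf-8') as infile: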
    for line in infile:
        tokens = line.strip().split()
        if len(tokens) > 2:
            sentences.append(tokens)
len(sentences)

1151408

In [37]:
# Keep only long "sentences" (each line is really a paragraph); these will be our documents.
# Treating individual Wikipedia paragraphs as separate documents is somewhat unusual:
# a more useful topic model would probably use whole articles as documents.
# Since we split by paragraph, we might expect more "functional" topics.
long_sentences = [sentence for sentence in sentences if len(sentence) > 8]
len(long_sentences)

874892
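
The comment above suggests article-level documents as an alternative. A minimal sketch of that grouping follows. It assumes top-level titles appear as lines of the form ` = Title = ` and section headings as ` = = Section = = `; this assumption should be checked against the data before use. `is_article_title` and `group_by_article` are hypothetical helpers.

```python
from typing import Iterable, List


def is_article_title(tokens: List[str]) -> bool:
    """True for ['=', 'Title', '='], False for ['=', '=', 'Section', '=', '=']."""
    return (
        len(tokens) >= 3
        and tokens[0] == "="
        and tokens[-1] == "="
        and tokens[1] != "="
        and tokens[-2] != "="
    )


def group_by_article(lines: Iterable[str]) -> List[List[str]]:
    """Concatenate the tokens of all body lines between two top-level titles."""
    articles: List[List[str]] = []
    for line in lines:
        tokens = line.strip().split()
        if not tokens:
            continue  # blank line
        if is_article_title(tokens):
            articles.append([])  # start a new article; the title itself is dropped
        elif articles and tokens[0] != "=":
            articles[-1].extend(tokens)  # body text; section headings are skipped
    return articles
```

Passing the open training file to `group_by_article` would give one token list per article, which could then replace `long_sentences` in the cells that follow.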

In [38]:
# A gensim Dictionary maps each unique token in the corpus to an integer id
dct = gensim.corpora.Dictionary(long_sentences)
len(dct)

267572

In [40]:
prev_size = len(dct)
dct.filter_extremes(no_below=5, no_above=0.2, keep_n=80000)
new_size = len(dct)
print(f"Removed {prev_size - new_size} dictionary entries. New size: {new_size}")

Removed 187572 dictionary entries. New size: 80000
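
To make the three `filter_extremes` arguments concrete, here is a rough pure-Python sketch of the filtering rule as we understand it: drop tokens that appear in fewer than `no_below` documents or in more than a `no_above` fraction of documents, then keep the `keep_n` tokens with the highest document frequency. `filter_vocabulary` is a hypothetical stand-in for illustration, not gensim's implementation.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_vocabulary(docs: Iterable[List[str]], no_below: int,
                      no_above: float, keep_n: int) -> Set[str]:
    docs = list(docs)
    # document frequency: the number of documents containing each token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    survivors = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # of the survivors, keep the keep_n tokens with the highest document frequency
    survivors.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return set(survivors[:keep_n])


toy_docs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a", "b", "e"], ["a", "c"]]
print(filter_vocabulary(toy_docs, no_below=2, no_above=0.7, keep_n=10))
```

In the toy data, `a` occurs in every document and is dropped as too common, while `d` and `e` occur once each and are dropped as too rare.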


In [41]:
# save the dictionary, since it's somewhat slow to compute it in the first place
dct.save((git_root_dir / "data" / "wikitext-103-lda.dict").as_posix())

In [5]:
dct = gensim.corpora.Dictionary.load((git_root_dir / "data" / "wikitext-103-lda.dict").as_posix())
len(dct)

2019-03-14 21:32:09,695 : INFO : loading Dictionary object from /home/levon003/repos/nlp-for-hci-workshop/data/wikitext-103-lda.dict
2019-03-14 21:32:09,721 : INFO : loaded /home/levon003/repos/nlp-for-hci-workshop/data/wikitext-103-lda.dict


80000

In [43]:
# convert each document to a bag of words; long_sentences already contains only documents with more than 8 tokens
corpus = [dct.doc2bow(sentence) for sentence in tqdm(long_sentences)]
corpus[0][:10]

100%|██████████| 874892/874892 [00:52<00:00, 16742.64it/s]


[(0, 1),
 (1, 2),
 (2, 2),
 (3, 1),
 (4, 2),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 2)]
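
Each pair in the output above is `(token_id, count)` for one document. A minimal pure-Python sketch of the same bag-of-words idea, using a toy vocabulary rather than the real dictionary (`to_bag_of_words` and `token2id` are hypothetical names for illustration):

```python
from collections import Counter
from typing import Dict, List, Tuple


def to_bag_of_words(tokens: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Count in-vocabulary tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


token2id = {"the": 0, "ship": 1, "navy": 2}  # toy vocabulary, not the real dictionary
bow = to_bag_of_words(["the", "ship", "and", "the", "navy"], token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```

Tokens outside the vocabulary (here `and`) are silently ignored, which is also why filtering the dictionary shrinks the corpus.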

In [44]:
# save the corpus, since producing it is somewhat slow
gensim.corpora.MmCorpus.serialize((git_root_dir / "data" / "wikitext-103-lda.mm").as_posix(), corpus)

Output from the above serialization:
```
<snip>
2019-03-14 20:38:46,579 : INFO : saved 874892x80000 matrix, density=0.079% (55589479/69991360000)
2019-03-14 20:38:46,580 : INFO : saving MmCorpus index to /home/levon003/repos/nlp-for-hci-workshop/data/wikitext-103-lda.mm.index
```
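
The density figure in the log can be checked by hand: it is the number of non-zero entries divided by the size of the full document-by-term matrix. The numbers below are copied from the log line above.

```python
num_docs = 874_892      # documents (rows)
num_terms = 80_000      # dictionary entries (columns)
non_zero = 55_589_479   # non-zero (document, term) cells reported by the log

density = non_zero / (num_docs * num_terms)
avg_distinct_terms = non_zero / num_docs

print(f"matrix cells: {num_docs * num_terms}")
print(f"density: {density:.3%}")
print(f"average distinct terms per document: {avg_distinct_terms:.1f}")
```

The low density is the reason the corpus is stored in a sparse format rather than as a dense matrix.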

In [6]:
mm = gensim.corpora.MmCorpus((git_root_dir / "data" / "wikitext-103-lda.mm").as_posix())

2019-03-14 21:32:16,968 : INFO : loaded corpus index from /home/levon003/repos/nlp-for-hci-workshop/data/wikitext-103-lda.mm.index
2019-03-14 21:32:16,969 : INFO : initializing cython corpus reader from /home/levon003/repos/nlp-for-hci-workshop/data/wikitext-103-lda.mm
2019-03-14 21:32:16,970 : INFO : accepted corpus with 874892 documents, 80000 features, 48314042 non-zero entries


In [8]:
# LDA training produces a lot of logging, so we redirect it to a file and keep only critical messages in the notebook.
# logging.basicConfig() does nothing once the root logger has handlers (it was configured in In [4]),
# so we replace the handlers explicitly instead.
root_logger = logging.getLogger()
for handler in list(root_logger.handlers):
    root_logger.removeHandler(handler)

file_logger = logging.FileHandler((git_root_dir/"data"/"lda.log").as_posix(), mode='w')
file_logger.setFormatter(logging.Formatter('%(asctime)s : %(levelname)s : %(message)s'))
root_logger.addHandler(file_logger)

# the console handler only shows critical messages
console = logging.StreamHandler()
console.setLevel(logging.CRITICAL)
root_logger.addHandler(console)
logging.info("Reset logging to put all non-critical output in file.")
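
The same file-plus-console pattern can be tried in isolation. This is a small sketch on a separate named logger, so it does not disturb the notebook's root logger. The logger name and the temporary log path are made up for the demo.

```python
import logging
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "example.log"  # hypothetical location for this demo

logger = logging.getLogger("lda_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo isolated from the root logger

# everything at INFO and above goes to the file
file_handler = logging.FileHandler(log_path.as_posix(), mode="w")
file_handler.setFormatter(logging.Formatter("%(asctime)s : %(levelname)s : %(message)s"))
logger.addHandler(file_handler)

# only CRITICAL messages reach the console
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.CRITICAL)
logger.addHandler(console_handler)

logger.info("written to the file only")
logger.critical("written to the file and the console")
file_handler.flush()

print(log_path.read_text())
```

The level set on a handler filters what that handler emits, while the level set on the logger decides which records are created at all.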

In [9]:
%%time
lda = gensim.models.LdaMulticore(mm, id2word=dct, num_topics=100, passes=10, chunksize=5000)

CPU times: user 2h 12min 39s, sys: 5min 3s, total: 2h 17min 42s
Wall time: 6h 40min 12s


In [12]:
lda.save((git_root_dir/"data"/"models"/"wikitext-103-t100.lda").as_posix())

In [None]:
lda = gensim.models.LdaMulticore.load((git_root_dir/"data"/"models"/"wikitext-103-t100.lda").as_posix())

In [10]:
lda.show_topics()

[(74,
  '0.029*"King" + 0.016*"Henry" + 0.012*"Edward" + 0.011*"king" + 0.011*"he" + 0.010*"John" + 0.010*"Queen" + 0.010*"William" + 0.009*"England" + 0.009*"Prince"'),
 (54,
  '0.019*"Harrison" + 0.015*"Williams" + 0.015*"1968" + 0.014*"1967" + 0.014*"1970" + 0.013*"1969" + 0.013*"1966" + 0.012*"1971" + 0.012*"1973" + 0.011*"1965"'),
 (67,
  '0.039*"album" + 0.036*"song" + 0.016*"number" + 0.012*"single" + 0.011*"music" + 0.011*"songs" + 0.010*"It" + 0.008*"track" + 0.008*"chart" + 0.008*"released"'),
 (88,
  '0.028*"–" + 0.021*"club" + 0.015*"season" + 0.014*"League" + 0.012*"first" + 0.011*"team" + 0.010*"1" + 0.010*"against" + 0.010*"Cup" + 0.009*"goal"'),
 (11,
  '0.011*"Division" + 0.008*"attack" + 0.008*"British" + 0.007*"German" + 0.006*"Infantry" + 0.006*"their" + 0.006*"forces" + 0.005*":" + 0.005*"Army" + 0.005*"Battalion"'),
 (99,
  '0.037*"ship" + 0.034*"ships" + 0.023*"Navy" + 0.015*"class" + 0.014*"fleet" + 0.011*"crew" + 0.010*"naval" + 0.009*"Admiral" + 0.009*"U" + 0.
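
Each topic above is printed as a single string of `weight*"word"` terms joined by `+`. If the weights are needed as numbers, the string can be parsed, as in the sketch below (`parse_topic` is a hypothetical helper). Calling `lda.show_topic(topic_id)` is likely the simpler route when the model is at hand, since it returns `(word, probability)` pairs directly.

```python
import re
from typing import List, Tuple

# matches one term such as 0.039*"album"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]*)"')


def parse_topic(topic: str) -> List[Tuple[str, float]]:
    """Turn '0.039*"album" + 0.036*"song"' into [('album', 0.039), ('song', 0.036)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic)]


topic_67 = '0.039*"album" + 0.036*"song" + 0.016*"number"'  # first terms of topic 67 above
parsed = parse_topic(topic_67)
print(parsed)  # [('album', 0.039), ('song', 0.036), ('number', 0.016)]
```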

In [11]:
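# print the 15 most probable terms of every topic, one topic per line
for i in range(lda.num_topics):
    term_ids = [term_id for term_id, prob in lda.get_topic_terms(i, topn=15)]
    # dct[term_id] fills the id-to-token mapping on demand; dct.id2token can be empty on a freshly loaded dictionary
    print(" ".join(dct[term_id] for term_id in term_ids))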

= Early Tropical Post Storm career life season Battle present years – First Pre Other
– : ; 1 - 3 2 4 A ISBN 0 & D 10 00
she until She On May June April August her November July December September October March
or may treatment disease medical blood be health risk cause brain hospital have effects cancer
bridge steel ” “ thick iron mm armor Bridge belt inches deck protected armour plates
video song Jackson music Madonna performed dance performance Beyoncé wearing during Gaga Spears MTV also
novel story first book stories fiction ; published comic science Adams novels who written writer
match Championship title Team event WWE World defeated ring ! after against team On won
@,@ 000 $ million 1 £ over 500 than sold 100 200 year 2 cost
California Los Smith Angeles American National Chicago named Center ; Thompson John Association 1988 &
are two side into three each upper one above right lower ; front wall left
Division attack British German Infantry their forces : Army Battalion two Brigad

Man Clark Jack Gordon Batman Bruce Green Knight Lee Leslie Tracy who Max Mark Men
century early 20th 19th area American first population land Museum trade became late people region
Club breed Russell registered Clarke dogs Horse whale Princeton dog Ivy Cornell Data Holland Shepard
ship ships Navy class fleet crew naval Admiral U Sea two British HMS Royal port


In [13]:
import pyLDAvis
# note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [14]:
%%time
prepared_data = pyLDAvis.gensim.prepare(lda, mm, dct)

CPU times: user 37min 37s, sys: 6.27 s, total: 37min 43s
Wall time: 30min


In [15]:
pyLDAvis.save_html(prepared_data, (git_root_dir/"data"/"lda_vis.html").as_posix())

In [16]:
pyLDAvis.display(prepared_data)