# Topic Modeling: Beyond LDA

While LDA's ability to surface latent themes in large collections of texts is quite useful (hence the model's popularity), LDA models nevertheless have a number of limitations. To name a few: LDA models don't account for the passage of time, they have difficulty capturing relationships among the generated topics, and their topics become considerably less useful when the model is applied to corpora of short documents.

In this chapter, we'll present some alternative approaches to topic modeling that help to mitigate these limitations of LDA modeling.  

## Dynamic Topic Modeling 


In developing Dynamic Topic Modeling, or DTM, [Blei and Lafferty](https://dl.acm.org/doi/pdf/10.1145/1143844.1143859) wanted to account for the possibility that content within a collection of texts could evolve over time, something traditional LDA topic modeling doesn't consider. To do so, they developed a form of topic modeling that could trace the evolution of the topics generated over time. In DTM, then, we can see our topics develop as time passes.
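As a rough sketch of the idea, following the notation of Blei and Lafferty's paper rather than anything Gensim-specific, DTM chains each topic's parameters across time slices with Gaussian noise. Here $\beta_{t,k}$ is the vector of natural parameters for topic $k$ at time slice $t$, $\sigma^2$ controls how quickly a topic may drift, and the second expression maps those parameters to word probabilities:

```latex
\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\left(\beta_{t-1,k},\, \sigma^2 I\right),
\qquad
p(w \mid \beta_{t,k}) = \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}
```

A small $\sigma^2$ keeps each topic close to its previous version, so adjacent time slices tend to produce similar word lists.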

### Gensim

In order to run Dynamic Topic Modeling in Python, we'll be installing the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library. (We'll also want to make sure we've installed Gensim's dependencies.)

In [1]:
pip install gensim



From Gensim, we'll import [`LdaSeqModel`](https://radimrehurek.com/gensim/models/ldaseqmodel.html), the library's built-in Dynamic Topic Modeling class.

In [2]:
from gensim.models import LdaSeqModel

Some relevant parameters for `LdaSeqModel` are listed below. For an exhaustive list of parameters, see [here](https://radimrehurek.com/gensim/models/ldaseqmodel.html).

#### LdaSeqModel Parameters: 

- **corpus**: The collection of document vectors we'll use to fit our LDA Sequence model.


- **id2word**: Allows us to map word IDs onto words, and helps determine the size of our vocabulary.



- **time_slice**: A list giving the number of documents in each period of time we want our model to consider, in chronological order. The counts should add up to the number of documents in the corpus.


- **num_topics**: The total number of topics we'd like our model to determine.


- **passes**: This parameter functions in a similar manner to the `max_iter` parameter we set when running a conventional LDA model with scikit-learn. Like `max_iter`, `passes` is set to 10 by default. As was the case before, we'll almost always want to set `passes` to a number higher than 10.  


- **em_max_iter**: Sets a maximum threshold on the number of iterations until we reach convergence of the  [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) algorithm.
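To make the `time_slice` parameter more concrete, here is a minimal sketch of how a list of per-period document counts carves an ordered corpus into consecutive index ranges. The helper `slice_bounds` is hypothetical (it is not part of Gensim) and only illustrates the bookkeeping:

```python
from itertools import accumulate


def slice_bounds(time_slice):
    """Return (start, end) document-index pairs for each time slice.

    Hypothetical helper for illustration; not a Gensim function.
    """
    ends = list(accumulate(time_slice))   # running totals: where each slice ends
    starts = [0] + ends[:-1]              # each slice starts where the last one ended
    return list(zip(starts, ends))


time_slice = [2, 4, 3]
bounds = slice_bounds(time_slice)
print(bounds)  # [(0, 2), (2, 6), (6, 9)]
```

With `time_slice=[2, 4, 3]`, the first two documents form the first period, the next four the second, and the last three the third, which is why the documents need to be in chronological order.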




Just to get a handle on the `LdaSeqModel` class, we'll fit the model on gensim's common corpus and dictionary, passing the common dictionary as the `id2word` parameter.

In [3]:
from gensim.test.utils import common_corpus
from gensim.test.utils import common_dictionary

corpus = common_corpus
dictionary = common_dictionary

In [None]:
ldaseq = LdaSeqModel(corpus=common_corpus, id2word=dictionary, time_slice=[2, 4, 3], num_topics=2)



As you can see below, the topics generated with the Gensim common corpus and dictionary don't appear to vary much over time:

In [5]:
ldaseq.print_topics(time=0)

[[('system', 0.21103477186241593),
  ('user', 0.16397962016469478),
  ('interface', 0.11745461526922607),
  ('computer', 0.11727548956705906),
  ('response', 0.0719533519106377),
  ('time', 0.0719533519106377),
  ('eps', 0.0719533519106377),
  ('survey', 0.05147858004212989),
  ('trees', 0.030901334149172184),
  ('graph', 0.03071583509559593),
  ('minors', 0.03071583509559593),
  ('human', 0.03058386302219727)],
 [('trees', 0.3228488866688078),
  ('graph', 0.322041720667678),
  ('minors', 0.21890836764303392),
  ('computer', 0.03684688982026238),
  ('human', 0.017880791435300917),
  ('interface', 0.017880791435300917),
  ('survey', 0.017880791435300917),
  ('response', 0.009142352178863018),
  ('system', 0.009142352178863018),
  ('time', 0.009142352178863018),
  ('user', 0.009142352178863018),
  ('eps', 0.009142352178863018)]]

In [6]:
ldaseq.print_topics(time=1)

[[('system', 0.21184048867551472),
  ('user', 0.1640590986122984),
  ('interface', 0.117102571455535),
  ('computer', 0.11695025075276108),
  ('response', 0.07192461562115725),
  ('time', 0.07192461562115725),
  ('eps', 0.07192461562115725),
  ('survey', 0.0514466411016321),
  ('trees', 0.03087175342171288),
  ('graph', 0.030686309902832887),
  ('minors', 0.030686309902832887),
  ('human', 0.030582729311408343)],
 [('trees', 0.32293164889527337),
  ('graph', 0.3221423932188621),
  ('minors', 0.21897804755601968),
  ('computer', 0.03678739908968738),
  ('human', 0.017849582746292062),
  ('interface', 0.017849582746292062),
  ('survey', 0.017849582746292062),
  ('response', 0.009122352600256255),
  ('system', 0.009122352600256255),
  ('time', 0.009122352600256255),
  ('user', 0.009122352600256255),
  ('eps', 0.009122352600256255)]]

In [7]:
ldaseq.print_topics(time=2)

[[('system', 0.21192796388747198),
  ('user', 0.16410763095116246),
  ('interface', 0.11710083582672116),
  ('computer', 0.1170165264256437),
  ('response', 0.0718891670848555),
  ('time', 0.0718891670848555),
  ('eps', 0.0718891670848555),
  ('survey', 0.05141562743955698),
  ('trees', 0.03084796899147348),
  ('graph', 0.03066260767539272),
  ('minors', 0.03066260767539272),
  ('human', 0.030590729872618242)],
 [('graph', 0.3223834884555004),
  ('trees', 0.32177799829599324),
  ('minors', 0.21992300080023172),
  ('computer', 0.036777841094811504),
  ('human', 0.01784728082781973),
  ('interface', 0.01784728082781973),
  ('survey', 0.01784728082781973),
  ('response', 0.009119165774000784),
  ('system', 0.009119165774000784),
  ('time', 0.009119165774000784),
  ('user', 0.009119165774000784),
  ('eps', 0.009119165774000784)]]
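One way to put a number on how little the topics move is to compare a word's probability at two time slices. The sketch below copies the top three words of the first topic from the `time=0` and `time=2` outputs above; `largest_shift` is a hypothetical helper, not a Gensim function:

```python
# Top three words of the first topic, copied from the outputs above.
topic_t0 = [
    ("system", 0.21103477186241593),
    ("user", 0.16397962016469478),
    ("interface", 0.11745461526922607),
]
topic_t2 = [
    ("system", 0.21192796388747198),
    ("user", 0.16410763095116246),
    ("interface", 0.11710083582672116),
]


def largest_shift(before, after):
    """Return the word whose probability changed most, with the absolute change."""
    before, after = dict(before), dict(after)
    shifts = {w: abs(after[w] - before[w]) for w in before if w in after}
    word = max(shifts, key=shifts.get)
    return word, shifts[word]


word, shift = largest_shift(topic_t0, topic_t2)
print(word, round(shift, 5))
```

Even the largest change among these words is under 0.001, which matches the impression that the topics are nearly static.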

And that's alright: we have no reason to expect any changes! In order to see some real variation over time, let's walk through how to apply LDA Sequence topic modeling to our UN general debate data.

In [8]:
import pandas as pd

un_df = pd.read_json('un-general-debates.json')

It might be interesting to see how the results of our LDA Sequence model compare to what we saw with the `post-soviet` key when running a conventional LDA model with our UN data.


To home in on the transition from pre-Soviet to post-Soviet (and for the sake of limiting excessive run times), let's look specifically at the 660 documents in the UN dataset from the two years leading up to, and the two years following, the dissolution of the Soviet Union. 

In [10]:
a = un_df['speech_year'] < 1990

In [11]:
b = un_df['speech_year'] > 1993

In [12]:
# keep only the speeches from 1990 through 1993; combining both masks in one step
# avoids filtering with a mask built from the unfiltered frame
un_df = un_df[~a & ~b]


In [13]:
len(un_df)

660

In [14]:
un_df['speech_year'].describe()

count     660.000000
mean     1991.546970
std         1.119251
min      1990.000000
25%      1991.000000
50%      1992.000000
75%      1993.000000
max      1993.000000
Name: speech_year, dtype: float64

#### Setting up our Corpus

For background on how Gensim represents corpora, see [Corpora and Vector Spaces, Gensim style](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html).

The collection of text documents we're looking at is contained within the 'speech_text' key of our UN dataframe:

In [15]:
un_df['speech_text']

290     ﻿On behalf of my delegation and on my own beha...
291     ﻿It is truly an honour for me to address a bod...
292     ﻿Mr. President, allow me to begin by warmly co...
293     ﻿At the outset, I would like to congratulate y...
294     ﻿Mr. President, it is my high honour to addres...
                              ...                        
3031    ﻿On behalf of my Government and on my own beha...
3032    ﻿\nIt is a source of particular pleasure for m...
3033    ﻿Mr. President, please accept my congratulatio...
3034    ﻿Mr. President, allow me, at the outset of my ...
3035    ﻿\nOn behalf of the Malawi delegation. I have ...
Name: speech_text, Length: 660, dtype: object

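To set up our corpus with gensim, we'll define a class whose `__iter__` method loops over each speech in `speech_text`, lowercases it, and splits it into tokens on white space. The `doc2bow` method of gensim's `Dictionary` class then converts each tokenized document into a bag-of-words vector. Note that the class looks up `dictionary` only when we iterate over it, so the UN dictionary we build in the next section must exist before we loop over the corpus or fit the model.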

In [16]:
class Corpus(object):
    def __iter__(self):
        for line in un_df['speech_text']:
            yield dictionary.doc2bow(line.lower().split())
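To see what a bag-of-words vector looks like, here is a pure-Python sketch that mimics the shape of what `doc2bow` returns: a list of `(token_id, count)` pairs, with tokens missing from the dictionary ignored. The function `toy_doc2bow` and the three-word `token2id` mapping are made up for illustration and are not Gensim's implementation:

```python
from collections import Counter


def toy_doc2bow(tokens, token2id):
    """Mimic the output shape of doc2bow: sorted (token_id, count) pairs.

    Tokens that are not in the mapping are silently dropped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical three-word vocabulary.
token2id = {"peace": 0, "world": 1, "security": 2}

bow = toy_doc2bow("World peace and world security".lower().split(), token2id)
print(bow)  # [(0, 1), (1, 2), (2, 1)]
```

The word "and" is not in the toy vocabulary, so it leaves no trace in the vector, and "world" appears with a count of 2.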

One beneficial aspect of gensim is its ability to load one vectorized document within a corpus into memory at a time, rather than require the entire corpus be stored in RAM. This capability is particularly useful when looking at exceptionally large corpora.

In [17]:
corpus_memory_friendly = Corpus()
print(corpus_memory_friendly)

<__main__.Corpus object at 0x1197ace50>


In [18]:
for vector in corpus_memory_friendly:
    print(vector)



#### Setting Up Our Dictionary

For our stoplist, we'll use the standard ["english"](https://gist.github.com/sebleier/554280) list of stopwords, in addition to "also," "united," "nations," and "-". We'll also remove any words that appear in only one document of our corpus. After filtering out stop words and these single-document words, we can use the `compactify` method to reassign word IDs so that the removals leave no gaps in the ID sequence.

In [28]:
from gensim import corpora

# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in un_df['speech_text'])
# remove stop words and words that appear in only one document
stoplist = set('also - united nations i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now'.split())
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
# dictionary.dfs maps each token id to the number of documents containing that token
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)

INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(43938 unique tokens: ['(habitat),', '(unhcr)', '-', '--', '1960s']...) from 660 documents (total 2009288 corpus positions)


Dictionary(24701 unique tokens: ['(habitat),', '(unhcr)', '1960s', '1970s', '1990s.']...)
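The filtering step can be easier to follow on a toy example. The sketch below reproduces the same logic in plain Python on three made-up documents: it counts how many documents contain each token, drops stop words and tokens found in a single document, and then assigns contiguous IDs to the survivors (the effect `compactify` has on a Gensim dictionary). All data and names here are illustrative assumptions:

```python
from collections import Counter

# Three made-up documents, already tokenized.
docs = [
    "peace and security for all nations".split(),
    "peace and development".split(),
    "trade trade trade development".split(),
]
stoplist = {"and", "for", "all", "nations"}

# Document frequency: number of documents containing each token (not total occurrences).
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens that are not stop words and that appear in more than one document.
kept = sorted(t for t, df in doc_freq.items() if t not in stoplist and df > 1)

# Assign contiguous ids to the surviving tokens, leaving no gaps.
token2id = {token: i for i, token in enumerate(kept)}
print(token2id)  # {'development': 0, 'peace': 1}
```

Note that "trade" occurs three times but only in one document, so it is removed along with "security".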


#### Setting up our Time Slices

Let's say we want to look at the evolution of our UN general debate topics over the two years leading up to, and the two years following, the dissolution of the Soviet Union. Our filtered dataset contains the following number of speeches for each year:

- 1990 = 156


- 1991 = 162


- 1992 = 167


- 1993 = 175 


In [29]:
time_slice = [156, 162, 167, 175]
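Rather than typing the counts by hand, we could derive them from the data. Because the time slices are read off the corpus in order, the documents also need to be sorted chronologically before the corpus is built. The following is a minimal sketch on a made-up stand-in for `un_df` (the six rows are hypothetical); the same two calls should apply to the real frame if it isn't already sorted by year:

```python
import pandas as pd

# Toy stand-in for un_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "speech_year": [1991, 1990, 1992, 1990, 1991, 1991],
    "speech_text": ["b1", "a1", "c1", "a2", "b2", "b3"],
})

# Sort chronologically so that consecutive rows belong to consecutive time slices.
toy_df = toy_df.sort_values("speech_year", kind="stable").reset_index(drop=True)

# Count documents per year, in chronological order.
time_slice = toy_df["speech_year"].value_counts().sort_index().tolist()
print(time_slice)  # [2, 3, 1]
```

A quick sanity check is that `sum(time_slice)` equals the number of rows in the frame.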

#### Putting it all together: Running the Model

In [21]:
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

In [30]:
ldaseq = LdaSeqModel(corpus=corpus_memory_friendly, 
                     id2word=dictionary, 
                     time_slice=time_slice,
                     passes = 50,
                     em_max_iter=5,
                     num_topics=3)



With our model run, let's print out our topics over each time slice: 

In [34]:
ldaseq.print_topics(time=0)

[[('world', 0.007490298177521934),
  ('must', 0.006549445558411537),
  ('new', 0.006244327471864768),
  ('people', 0.005065612755036083),
  ('international', 0.004782458186102162),
  ('us', 0.0046574217503769135),
  ('countries', 0.00441736467373003),
  ('economic', 0.003935317468113478),
  ('peace', 0.003810425712643679),
  ('one', 0.0036811995633896245),
  ('development', 0.003283022027559971),
  ('many', 0.0031241478354439556),
  ('human', 0.0030681736834799493),
  ('states', 0.0029200976750825445),
  ('country', 0.0027416360669084944),
  ('government', 0.002739811942797668),
  ('would', 0.0027223647105326146),
  ('great', 0.0023232114841494427),
  ('political', 0.002316781646675052),
  ('democracy', 0.0023016256362888205)],
 [('international', 0.01321983465075864),
  ('peace', 0.0063542257126587166),
  ('world', 0.006285159993353706),
  ('economic', 0.006241892643327187),
  ('countries', 0.005843954519435114),
  ('new', 0.004461861285234271),
  ('security', 0.0042970183117730125),
  ...


In [35]:
ldaseq.print_topics(time=1)

[[('world', 0.007525992417898458),
  ('must', 0.006582235531003249),
  ('new', 0.006263348248554922),
  ('people', 0.005084335863601741),
  ('us', 0.004720532439211315),
  ('international', 0.004683729390100844),
  ('countries', 0.00464784511820726),
  ('economic', 0.003946587705482039),
  ('peace', 0.0038322201217897227),
  ('one', 0.0035345891444205314),
  ('development', 0.0032953394919181932),
  ('many', 0.0031375508863386984),
  ('human', 0.003076328958185783),
  ('states', 0.0029161221191692983),
  ('government', 0.0028056633569699782),
  ('country', 0.00275339162228974),
  ('would', 0.0027306122051111415),
  ('political', 0.0023248312961721843),
  ('great', 0.0023052407832702985),
  ('democracy', 0.002302915573184192)],
 [('international', 0.013263640291491588),
  ('peace', 0.006364334340705556),
  ('world', 0.006300332278450357),
  ('economic', 0.006254516533521765),
  ('countries', 0.005863526302179306),
  ('new', 0.004466224275532365),
  ('security', 0.004312423124295029),
  ...
  

In [36]:
ldaseq.print_topics(time=2)

[[('world', 0.0075775231516884934),
  ('must', 0.00662442830134853),
  ('new', 0.006290277434746623),
  ('people', 0.005105871080849557),
  ('us', 0.004785919739885581),
  ('international', 0.004779281630947735),
  ('countries', 0.004512687256552781),
  ('economic', 0.00396164956260618),
  ('peace', 0.0038550855010333488),
  ('one', 0.003600677408349859),
  ('development', 0.0033080532555901155),
  ('many', 0.0031506057200634116),
  ('human', 0.003086249424135719),
  ('states', 0.002912197783690612),
  ('country', 0.0027687833421748314),
  ('would', 0.0027416723854016673),
  ('government', 0.002734371687995242),
  ('political', 0.0023368886596403563),
  ('democracy', 0.0023037558366195486),
  ('time', 0.0022978147038505495)],
 [('international', 0.013328639855462922),
  ('peace', 0.0063767628996250995),
  ('world', 0.006327077941908732),
  ('economic', 0.0062755822361628055),
  ('countries', 0.005897229696867381),
  ('new', 0.004478791165627977),
  ('security', 0.004336200959949073),
  ...
 

In [33]:
ldaseq.print_topics(time=3)

[[('world', 0.007616953171275815),
  ('must', 0.006653124321057768),
  ('new', 0.006310910319582272),
  ('people', 0.005119295611629798),
  ('international', 0.004823005642591637),
  ('us', 0.004765673362271388),
  ('countries', 0.00452653672569581),
  ('economic', 0.003972867164769877),
  ('peace', 0.0038671806577720244),
  ('one', 0.0036617806005182526),
  ('development', 0.0033150915238955897),
  ('many', 0.0031589858443702664),
  ('human', 0.0030926678068101507),
  ('states', 0.0029093833647568863),
  ('country', 0.0027794168368489086),
  ('would', 0.0027494802795279788),
  ('government', 0.0026818455708699667),
  ('political', 0.002347128882270167),
  ('time', 0.0023185411694922804),
  ('democracy', 0.002304950773052879)],
 [('international', 0.013387531182957),
  ('peace', 0.006388295625500919),
  ('world', 0.006351946564518667),
  ('economic', 0.006296760212557907),
  ('countries', 0.005929076957719861),
  ('new', 0.004491391541158278),
  ('security', 0.004357861642997383),
  ...
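Scanning four long lists by eye is tedious, so it can help to follow a single word across time slices. The sketch below copies the probabilities of "international" and "peace" in the second topic from the four outputs above; `word_trajectory` is a hypothetical helper. With a fitted model, a dictionary like each of these could presumably be built with `dict(ldaseq.print_topics(time=t)[1])`, assuming the output format shown above:

```python
# Probabilities in the second topic, copied from the four outputs above (time=0..3).
topics_by_time = [
    {"international": 0.01321983465075864, "peace": 0.0063542257126587166},
    {"international": 0.013263640291491588, "peace": 0.006364334340705556},
    {"international": 0.013328639855462922, "peace": 0.0063767628996250995},
    {"international": 0.013387531182957, "peace": 0.006388295625500919},
]


def word_trajectory(topics_by_time, word):
    """Return the probability of `word` at each time slice (0.0 if absent)."""
    return [topic.get(word, 0.0) for topic in topics_by_time]


trajectory = word_trajectory(topics_by_time, "international")
relative_change = (trajectory[-1] - trajectory[0]) / trajectory[0]
print(f"{relative_change:.1%}")  # 1.3%
```

The weight of "international" rises in every slice, but only by a little over one percent in total across the four years.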

## Short Text Topic Modeling

While LDA topic modeling works well when the texts in our corpus are reasonably long (around fifty words or more), LDA models run into issues when applied to shorter texts. The cause is a core assumption of LDA: that each text is a *mixture of topics*. That assumption makes sense for longer texts, but shorter texts, like social media posts, often consist of only a *single topic*. 
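Before choosing a model, it can be worth checking how much of a corpus actually falls below that rough fifty-word mark. The sketch below is a simple heuristic, not an established test: `short_fraction`, the threshold default, and the three sample documents are all assumptions made for illustration.

```python
def short_fraction(docs, min_tokens=50):
    """Return the share of documents with fewer than `min_tokens` whitespace tokens."""
    if not docs:
        return 0.0
    short = sum(1 for doc in docs if len(doc.split()) < min_tokens)
    return short / len(docs)


# Made-up examples: two short posts and one long document.
docs = [
    "Launch delayed again",
    "word " * 120,
    "Anyone know a good X11 window manager?",
]
print(short_fraction(docs))
```

If most documents turn out to be short, a single-topic-per-document model is likely a better fit than LDA.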

In [None]:
import spacy
import gsdmm

from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib
import pandas as pd
import numpy as np
import ast

In [31]:
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names

## References

Alghamdi, Rubayyi and Khalid Alfalqi. 2015. "A Survey of Topic Modeling in Text Mining." *Int. J. Adv. Comput. Sci. Appl.(IJACSA)*, 6(1).

Blei, David M. and John D. Lafferty. 2006. "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning* (pp. 113-120).

https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883

https://markroxor.github.io/gensim/static/notebooks/ldaseqmodel.html