# Topic Modeling: Beyond LDA

While LDA topic modeling's ability to pick up on latent themes in large collections of texts can be quite useful (hence the model's popularity), LDA models nevertheless have a number of limitations. To name a few, LDA models don't account for the passage of time, the models have difficulty determining any relationships among generated topics, and topics become considerably less useful when the model is applied to shorter corpera with shorter document lengths.

In this chapter, we'll present some alternative approaches to topic modeling that help to mitigate these limitations of LDA modeling.  

## Dynamic Topic Modeling 


In developing Dynamic Topic Modeling, or DTM, [Blei and Lafferty](https://dl.acm.org/doi/pdf/10.1145/1143844.1143859) wanted to account for the possibility that content within a collection of texts could evolve over time, something traditional LDA topic modeling doesn't consider. To do so, they developed a form of topic modeling that could trace the evolution of the topics generated over time. In DTM, then, we can see our topics develop as time passes.

### Gensim

In order to run Dynamic Topic Modeling in Python, we'll be installing the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library. (We'll also want to make sure we've installed Gensim's dependencies.)

In [1]:
pip install gensim



From Gensim, we'll import [ldaseq](https://radimrehurek.com/gensim/models/ldaseqmodel.html), the library's built-in Dynamic Topic Modeling function.  

In [2]:
from gensim.models import LdaSeqModel

Some relavant parameters for `LdaSeqModel` are listed below. For an exhaustive list of parameters, see [here](https://radimrehurek.com/gensim/models/ldaseqmodel.html).

#### LdaSeqModel Parameters: 

- **corpus**: The collection of document vectors we'll use to fit our LDA Sequence model.


- **id2word**: Allows us to map word IDs onto words, and helps determinine the size of our vocabulary.



- **time_slice**: The number of documents we'd like to include within each period of time we want our model to consider. 


- **num_topics**: The total number of topics we'd like our model to determine.


- **passes**: This parameter functions in a similar manner to the `max_iter` parameter we set when running a conventional LDA model with scikit-learn. Like `max_iter`, `passes` is set to 10 by default. As was the case before, we'll almost always want to set `passes` to a number higher than 10.  


- **em_max_iter**: Sets a maximum threshold on the number of iterations until we reach convergence of the  [Expectation-Maximization](https://machinelearningmastery.com/expectation-maximization-em-algorithm/) algorithm.



Just to get a handle on the `LdaSeqModel` function, we'll run the model with gensim's common corpus and dictionary. We'll use the common dictionary for the `id2word` parameter when we run our model.

In [3]:
from gensim.test.utils import common_corpus
from gensim.test.utils import common_dictionary

corpus = common_corpus
dictionary = common_dictionary

In [4]:
ldaseq = LdaSeqModel(corpus=common_corpus, id2word=dictionary, time_slice=[2, 4, 3], num_topics=2)

  convergence = np.fabs((bound - old_bound) / old_bound)


As you can see below, the topics generated with the Gensim common corpus and dictionary don't appear to vary much over time:

In [5]:
ldaseq.print_topics(time=0)

[[('system', 0.3500872945637462),
  ('human', 0.11526039385770404),
  ('eps', 0.11526039385770404),
  ('trees', 0.11526039385770404),
  ('computer', 0.05712939613987601),
  ('interface', 0.0352860182461808),
  ('response', 0.0352860182461808),
  ('survey', 0.0352860182461808),
  ('time', 0.0352860182461808),
  ('user', 0.0352860182461808),
  ('graph', 0.0352860182461808),
  ('minors', 0.0352860182461808)],
 [('graph', 0.16474522918783052),
  ('computer', 0.1194370345800401),
  ('minors', 0.11838819066794232),
  ('user', 0.087735576731536),
  ('interface', 0.0707031415981611),
  ('response', 0.0707031415981611),
  ('survey', 0.0707031415981611),
  ('system', 0.0707031415981611),
  ('time', 0.0707031415981611),
  ('trees', 0.0707031415981611),
  ('eps', 0.05352918972819657),
  ('human', 0.031945929515487775)]]

In [7]:
ldaseq.print_topics(time=1)

[[('trees', 0.3243608170592472),
  ('graph', 0.3234742151110741),
  ('minors', 0.220496813175908),
  ('computer', 0.03785777155563546),
  ('survey', 0.02257670871125338),
  ('response', 0.01049634623121019),
  ('time', 0.01049634623121019),
  ('user', 0.010429346903999211),
  ('system', 0.010393481443914823),
  ('interface', 0.00982587347322914),
  ('human', 0.00982162121086277),
  ('eps', 0.009770658892455517)],
 [('system', 0.21248932211335092),
  ('user', 0.16452232678346376),
  ('computer', 0.11765051434847837),
  ('interface', 0.11764919238744044),
  ('eps', 0.0703905784058816),
  ('response', 0.06926666546634111),
  ('time', 0.06926666546634111),
  ('survey', 0.050163248656948614),
  ('trees', 0.032665889445099405),
  ('graph', 0.03249576148851793),
  ('minors', 0.03249576148851793),
  ('human', 0.030944073949618966)]]

In [6]:
ldaseq.print_topics(time=2)

[[('system', 0.35108063747596746),
  ('human', 0.11513026486254897),
  ('eps', 0.11513026486254897),
  ('trees', 0.11513026486254897),
  ('computer', 0.05709400423341178),
  ('interface', 0.035204937671853384),
  ('response', 0.035204937671853384),
  ('survey', 0.035204937671853384),
  ('time', 0.035204937671853384),
  ('user', 0.035204937671853384),
  ('graph', 0.035204937671853384),
  ('minors', 0.035204937671853384)],
 [('graph', 0.16655770842238538),
  ('minors', 0.11952449872404979),
  ('computer', 0.11817651800551945),
  ('user', 0.08750640621056115),
  ('interface', 0.07050340763971134),
  ('response', 0.07050340763971134),
  ('survey', 0.07050340763971134),
  ('system', 0.07050340763971134),
  ('time', 0.07050340763971134),
  ('trees', 0.07050340763971134),
  ('eps', 0.05336328226104547),
  ('human', 0.03185114053817071)]]

And that's alright: we have no reason to expect any changes! In order to see some real time variance, let's walk through how to apply LDA Sequence topic modeling to our UN general debate data.

In [7]:
import pandas as pd

un_df = pd.read_json('un-general-debates.json')

In [8]:
len(un_df)

3214

Because LDA Sequence models take quite a while to run, we can cut down on runtime by looking at a random sample of 25% of the original dataset:

In [9]:
un_df = un_df.sample(804)

In [10]:
un_df

Unnamed: 0,index,speech_year,country_code,speech_text
2573,5870,1981,ISL,"Mr. President, I should like to join my collea..."
984,2089,1988,AFG,﻿It gives me great pleasure to express to Mt. ...
1905,3971,1987,BWA,"﻿\nMr. President, your great country, the Germ..."
10,10,1989,SUR,﻿\n \n\nLike many other countries in the devel...
389,844,1991,DEU,﻿This session of the General Assembly is takin...
...,...,...,...,...
2782,6257,1992,NPL,I have the pleasure to extend to Mr. Ganev the...
2407,5564,1993,TTO,It is with\nimmense pride that I congratulate ...
1780,3846,1987,CPV,"﻿We should like first of all, Sir, to say how ..."
753,1522,1996,MMR,﻿May I begin by extending to\nMr. Razali the w...


In [11]:
un_df.sort_values(by='speech_year')

Unnamed: 0,index,speech_year,country_code,speech_text
2316,4763,1980,BTN,﻿On behalf of my delegation and on my own beha...
2307,4754,1980,BDI,"﻿Mr. President, may I, on behalf of the delega..."
2350,4797,1980,AUS,"﻿Mr. President, on behalf of the Australian de..."
2289,4736,1980,NZL,"﻿Sir, may I congratulate you on your election ..."
2358,4805,1980,LCA,"﻿May I extend congratulations, on behalf of my..."
...,...,...,...,...
1648,3529,1999,ZAF,On behalf of our Government and\nin my capacit...
1760,3641,1999,SVN,Let me take this\nopportunity to congratulate ...
1660,3541,1999,POL,"Please accept my\ncongratulations, Sir, on you..."
1734,3615,1999,GNB,"Guinea-Bissau is gratified,\nSir, at your assu..."


It might be interesting to see how the results of our LDA Sequence model compare to what we saw with the `post-soviet` key when running a conventional LDA model with our UN data.

#### Setting up our Corpus

[Corpora and Vector Spaces, Gensim style](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html)

The collection of text documents we're looking at is contained within the 'speech_text' key of our UN dataframe:

In [12]:
print(str(un_df['speech_text']))

2573    Mr. President, I should like to join my collea...
984     ﻿It gives me great pleasure to express to Mt. ...
1905    ﻿\nMr. President, your great country, the Germ...
10      ﻿\n \n\nLike many other countries in the devel...
389     ﻿This session of the General Assembly is takin...
                              ...                        
2782    I have the pleasure to extend to Mr. Ganev the...
2407    It is with\nimmense pride that I congratulate ...
1780    ﻿We should like first of all, Sir, to say how ...
753     ﻿May I begin by extending to\nMr. Razali the w...
2227    Allow me, at the outset, to congratulate you a...
Name: speech_text, Length: 804, dtype: object


To set up our corpus with gensim, we'll define a function to iterate over each line of `speech_text`, and generate new tokenized words every time we hit white space. The `doc2bow` method available through gensim.dictionary lets us convert our documents into bag-of-words vectors. 

In [14]:
class Corpus(object):
    def __iter__(self):
        for line in un_df['speech_text']:
            yield dictionary.doc2bow(line.lower().split())

One beneficial aspect of gensim is its ability to load one vectorized document within a corpus into memory at a time, rather than require the entire corpus be stored in RAM. This capability is particularly useful when looking at exceptionally large corpora.

In [15]:
corpus_memory_friendly = Corpus()
print(corpus_memory_friendly)

<__main__.Corpus object at 0x133bad250>


In [16]:
for vector in corpus_memory_friendly:
    print(vector)



#### Setting Up Our Dictionary

For our stoplist, we'll use the standard ["english"](https://gist.github.com/sebleier/554280) list of stopwords, in addition to "also," "united," "nations," "-," and a number of other years and phrases whose removal would benefit our model. We'll also remove any words that occur in our corpus only once. We can use the `compactify` method to remove extra spaces between words after filtering our tokens for stop words and once-only occurences.

In [23]:
from gensim import corpora
from six import iteritems
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in un_df['speech_text'])
# remove stop words and words that appear only once
stoplist = set('338 36 4. 5. 6. 7. 8. 9. 1980. 23. 24. 242 3. 17. 18. 19. 1945 1960s. 1975 1980 20. 21. 22. international national mr. 1 10. 11. 12. 13. 14. 15. 16. (1967) (1973). 48 1963 1971 1979 1981 1987 1987, 1990. 2000, 2,000 435 (unctad), (plo), (imf) (gatt). (1978). 1991 10 1996. 1998 1998, (csce). 50 60 1991. 40 45 1925 1947, 1947 1948 1966, so also - united nations i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don should now'.split())
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids) 
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)

Dictionary(29152 unique tokens: ['abide', 'abolition', 'accept', 'accepted', 'accordance']...)


#### Setting up our Time Slices

Let's say we want to look at the evolution of our UN general debate topics over four periods of five years, from 1980-1999:

- 1980-1984 = 40+35+36+44+39 = **194**


- 1985-1989 = 40+31+43+38+43 = **195**


- 1990-1994 = 41+47+37+41+47 = **213**


- 1995-1999 = 41+42+40+40+39 = **202** 


In [26]:
time_slice = [194, 195, 213, 202]

#### Running the LDA Sequence Model

In [27]:
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

In [28]:
ldaseq = LdaSeqModel(corpus=corpus_memory_friendly, 
                     id2word=dictionary, 
                     time_slice=time_slice,
                     passes = 100,
                     em_max_iter=3,
                     num_topics=10)



With our model run, let's print out our topics over each time slice: 

In [31]:
ldaseq.print_topics(time=0,  top_terms = 10)

[[('japan', 0.3206225329485179),
  ('northern', 0.06882528841966253),
  ('strongly', 0.048478546803078004),
  ('japan,', 0.030072683274910725),
  ('functions', 0.025418348221621215),
  ('resolved', 0.024703875612475196),
  ('japanese', 0.023299282875011307),
  ('promptly', 0.016998147834482027),
  ("japan's", 0.01671320401740911),
  ('dispatched', 0.014314342992498046)],
 [('african', 0.021974390006605664),
  ('government', 0.02145955115585987),
  ('sierra', 0.0205511182150015),
  ('liberia', 0.019493585616539234),
  ('community', 0.01875616100044433),
  ('bahamas', 0.018550502273844468),
  ('vincent', 0.014876782575641652),
  ('ethiopia', 0.014236346228165234),
  ('liberian', 0.013495807194655244),
  ('grenadines', 0.0127909102350182)],
 [('states', 0.010344547942750781),
  ('countries', 0.00788566846624124),
  ('nuclear', 0.006736710671784701),
  ('peace', 0.005852576353326711),
  ('people', 0.0058078825572129514),
  ('world', 0.005681828231417625),
  ('republic', 0.00546846542670397

In [32]:
ldaseq.print_topics(time=1, top_terms = 10)

[[('japan', 0.30944475713426245),
  ('northern', 0.0724895749642994),
  ('strongly', 0.04962504524697425),
  ('japan,', 0.03068247251851244),
  ('functions', 0.026079333304968158),
  ('resolved', 0.02486917390712611),
  ('japanese', 0.02340260515893813),
  ('promptly', 0.01717833770088256),
  ("japan's", 0.016938253289153316),
  ('dispatched', 0.014524065819900518)],
 [('african', 0.02215970766816701),
  ('government', 0.02160197216322086),
  ('sierra', 0.020498317589957793),
  ('liberia', 0.019696904166259518),
  ('community', 0.018888693319723624),
  ('bahamas', 0.01852534759342041),
  ('vincent', 0.01485383383838579),
  ('ethiopia', 0.014344587961097749),
  ('liberian', 0.013580371920435257),
  ('grenadines', 0.012761486661159612)],
 [('states', 0.010373835122726648),
  ('countries', 0.008495085351695313),
  ('nuclear', 0.006742436016814167),
  ('peace', 0.005875580404490807),
  ('people', 0.005823005945804174),
  ('world', 0.005698230038002355),
  ('republic', 0.005488604598719764)

In [33]:
ldaseq.print_topics(time=2, top_terms = 10)

[[('japan', 0.30957757529582136),
  ('northern', 0.07662287162990616),
  ('strongly', 0.049932828704531806),
  ('japan,', 0.030775502668000797),
  ('functions', 0.02564537612329232),
  ('resolved', 0.024668330967035638),
  ('japanese', 0.02327929213816111),
  ('promptly', 0.01708286207699411),
  ("japan's", 0.01679401442899101),
  ('dispatched', 0.014452481609676276)],
 [('african', 0.022403353234398096),
  ('government', 0.021795069687032716),
  ('sierra', 0.02038852051008029),
  ('liberia', 0.019945385185491882),
  ('community', 0.01905564800240102),
  ('bahamas', 0.018348098892264873),
  ('vincent', 0.014757673987625373),
  ('ethiopia', 0.014519714947224734),
  ('liberian', 0.013678401686370871),
  ('grenadines', 0.01267752305675072)],
 [('states', 0.010415985580443715),
  ('countries', 0.008146058660105811),
  ('nuclear', 0.006752511278224654),
  ('peace', 0.005909734343235575),
  ('world', 0.005722300990496996),
  ('people', 0.005638306963826615),
  ('republic', 0.0055185752443729

In [34]:
ldaseq.print_topics(time=3, top_terms = 10)

[[('japan', 0.313137398717648),
  ('northern', 0.07940354678400134),
  ('strongly', 0.0499245368055028),
  ('japan,', 0.030649379585877518),
  ('functions', 0.025184089436399024),
  ('resolved', 0.02435097093973134),
  ('japanese', 0.023144801352555654),
  ('promptly', 0.01694457521847881),
  ("japan's", 0.016577458419746956),
  ('dispatched', 0.014315419357596296)],
 [('african', 0.02259245007595249),
  ('government', 0.022156084587159033),
  ('sierra', 0.02031109422207415),
  ('liberia', 0.020124572157524608),
  ('community', 0.01917340087654538),
  ('bahamas', 0.01801584611701424),
  ('ethiopia', 0.014727227907361253),
  ('vincent', 0.01469044721098491),
  ('liberian', 0.013737294359158938),
  ('grenadines', 0.012619052628448223)],
 [('states', 0.010459639046638351),
  ('countries', 0.007269286851723156),
  ('nuclear', 0.006766626521058904),
  ('peace', 0.005942150859034565),
  ('world', 0.005744025451445045),
  ('people', 0.005552961691464973),
  ('republic', 0.00554611630854571),


###### Short Text Topic Modeling

While LDA topic modeling works well when the texts in our corpus are considerably lengthy (around fifty words or more), LDA models run into some issues when applied to shorter texts. This happens because of a major assumption of LDA modeling: that each text is a *mixture of topics*. While this makes sense in the case of longer texts, shorter texts, like social media posts, often consist of only a *single topic*. 

In [1]:
import spacy
import gsdmm

from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib
import pandas as pd
import numpy as np
import ast

[example code](https://github.com/mamrou/short_text_topic_modeling/blob/master/notebook_sttm_example.ipynb)

In [2]:
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names

Are our topics evenly distributed?

In [3]:
df_targets = pd.DataFrame({'targets': targets})
order_list = df_targets.targets.value_counts()
order_list

1    593
0    593
2    564
Name: targets, dtype: int64

In [4]:
def extract_first_sentence(data_subject):
    list_first_sentence = []
    for text in data:
        first_sentence = text.split(".")[0].replace("\n", "")
        list_first_sentence.append(first_sentence)
    return list_first_sentence


In [5]:
def extract_subject(data):
    c = 0
    s = "Subject:"
    list_subjects = []
    for new in data_subject:    
        lines = new.split("\n")
        b = 0 # loop out at the first "Subject:", they may be several and we want first one only
        for line in lines:
            if s in line and b == 0:
                subject = " ".join(line.split(":")[1:]).strip()
                subject = subject.replace('Re', '').strip()
                list_subjects.append(subject)
                c += 1
                b = 1
    return list_subjects

In [6]:
def concatenate(list_first_sentence, list_subjects):
    list_docs = []
    for i in range(len(list_first_sentence)):
        list_docs.append(list_subjects[i] + " " + list_first_sentence[i])
    return list_docs


In [7]:
list_first_sentence = extract_first_sentence(data)
list_subjects = extract_subject(data_subject)
list_docs = concatenate(list_first_sentence, list_subjects)

In [8]:
df = pd.DataFrame(columns=['content', 'topic_id', 'topic_true_name'])
df['content'] = list_docs
df['topic_id'] = targets

def true_topic_name(x, target_names):
    return target_names[x].split('.')[-1]

df['topic_true_name'] = df['topic_id'].apply(lambda x: true_topic_name(x, target_names))
df.head()

Unnamed: 0,content,topic_id,topic_true_name
0,Elevator to the top floor Reading from a Amoco...,1,space
1,"Title for XTerm Yet again,the escape sequences...",0,x
2,From Israeli press. Madness. Before getting ex...,2,mideast
3,Accounts of Anti-Armenian Human Right Violatio...,2,mideast
4,How many israeli soldiers does it take to kill...,2,mideast


In [49]:
import nltk
nltk.download('all')




In [74]:
s0 = str(df['content'])

In [75]:
s0

'0       Elevator to the top floor Reading from a Amoco...\n1       Title for XTerm Yet again,the escape sequences...\n2       From Israeli press. Madness. Before getting ex...\n3       Accounts of Anti-Armenian Human Right Violatio...\n4       How many israeli soldiers does it take to kill...\n                              ...                        \n1745    Atlas revisited   I found it very interesting ...\n1746       How to get 24bit color with xview frames ? Yes\n1747    Deir Yassin You apparently think you are some ...\n1748    Vulcan?  (No, not the guy with the ears!) The ...\n1749    REPOST  XView slider Hi Xperts,this is a repos...\nName: content, Length: 1750, dtype: object'

In [76]:
tokenized_data = nltk.word_tokenize(s0)

In [77]:
tokenized_data

['0',
 'Elevator',
 'to',
 'the',
 'top',
 'floor',
 'Reading',
 'from',
 'a',
 'Amoco',
 '...',
 '1',
 'Title',
 'for',
 'XTerm',
 'Yet',
 'again',
 ',',
 'the',
 'escape',
 'sequences',
 '...',
 '2',
 'From',
 'Israeli',
 'press',
 '.',
 'Madness',
 '.',
 'Before',
 'getting',
 'ex',
 '...',
 '3',
 'Accounts',
 'of',
 'Anti-Armenian',
 'Human',
 'Right',
 'Violatio',
 '...',
 '4',
 'How',
 'many',
 'israeli',
 'soldiers',
 'does',
 'it',
 'take',
 'to',
 'kill',
 '...',
 '...',
 '1745',
 'Atlas',
 'revisited',
 'I',
 'found',
 'it',
 'very',
 'interesting',
 '...',
 '1746',
 'How',
 'to',
 'get',
 '24bit',
 'color',
 'with',
 'xview',
 'frames',
 '?',
 'Yes',
 '1747',
 'Deir',
 'Yassin',
 'You',
 'apparently',
 'think',
 'you',
 'are',
 'some',
 '...',
 '1748',
 'Vulcan',
 '?',
 '(',
 'No',
 ',',
 'not',
 'the',
 'guy',
 'with',
 'the',
 'ears',
 '!',
 ')',
 'The',
 '...',
 '1749',
 'REPOST',
 'XView',
 'slider',
 'Hi',
 'Xperts',
 ',',
 'this',
 'is',
 'a',
 'repos',
 '...',
 'Name'

## References

Amrouche, Matyas. 2019. “Short Text Topic Modeling.” Medium, Towards Data Science, towardsdatascience.com/short-text-topic-modeling-70e50a57c883.

Alghamdi, Rubayyi and Khalid Alfalqi. 2015. "A Survey of Topic Modeling in Text Mining." *Int. J. Adv. Comput. Sci. Appl.(IJACSA)*, 6(1).

Blei, David M. and John D. Lafferty. 2006. "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning* (pp. 113-120).

Brownlee, Jason. 2019. “A Gentle Introduction to Expectation-Maximization (EM Algorithm).” Machine Learning Mastery, machinelearningmastery.com/expectation-maximization-em-algorithm/.

“Dynamic Topic Models Tutorial.” Ldaseqmodel, markroxor.github.io/gensim/static/notebooks/ldaseqmodel.html.


https://towardsdatascience.com/exploring-the-un-general-debates-with-dynamic-topic-models-72dc0e307696