# Topic Modeling: Beyond LDA

While LDA topic modeling's ability to pick up on latent themes in large collections of texts can be quite useful (hence the model's popularity), LDA models nevertheless have a number of limitations. To name a few: LDA models don't account for the passage of time, they have difficulty determining any relationships among the generated topics, and their topics become considerably less useful when the model is applied to smaller corpora made up of shorter documents.

In this chapter, we'll present some alternative approaches to topic modeling that help to mitigate these limitations of LDA modeling.  

## Dynamic Topic Modeling 


In developing Dynamic Topic Modeling, or DTM, [Blei and Lafferty](https://dl.acm.org/doi/pdf/10.1145/1143844.1143859) wanted to account for the possibility that content within a collection of texts could evolve over time, something traditional LDA topic modeling doesn't consider. To do so, they developed a form of topic modeling that could trace the evolution of the topics generated over time. In DTM, then, we can see our topics develop as time passes.
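To make the idea of an evolving topic concrete, here is a minimal sketch of the chaining step described in Blei and Lafferty's paper: a topic's unconstrained word parameters at each time slice are the previous slice's parameters plus Gaussian noise, and a softmax maps them to word probabilities. The function names, the `drift` value, and the four-word vocabulary are illustrative assumptions, not part of Gensim or of the paper's code.

```python
import math
import random


def softmax(values):
    """Map unconstrained parameters to a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(initial_params, num_slices, drift=0.1, seed=0):
    """Return one word distribution per time slice for a single topic.

    Each slice's parameters are the previous slice's parameters plus
    Gaussian noise, so the topic drifts gradually instead of jumping.
    """
    rng = random.Random(seed)
    params = list(initial_params)
    distributions = [softmax(params)]
    for _ in range(num_slices - 1):
        params = [p + rng.gauss(0.0, drift) for p in params]
        distributions.append(softmax(params))
    return distributions


# Hypothetical vocabulary, used only to label the printed probabilities
vocabulary = ["peace", "security", "trade", "climate"]
for t, dist in enumerate(evolve_topic([1.0, 0.5, 0.0, -0.5], num_slices=3)):
    print(t, {word: round(p, 3) for word, p in zip(vocabulary, dist)})
```

A small `drift` keeps neighboring time slices similar, which is what lets us read the result as one topic changing over time and not as unrelated topics.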

### Gensim

In order to run Dynamic Topic Modeling in Python, we'll be installing the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library. (We'll also want to make sure we've installed Gensim's dependencies.)

In [1]:
pip install gensim



From Gensim, we'll import [`LdaSeqModel`](https://radimrehurek.com/gensim/models/ldaseqmodel.html), the library's built-in Dynamic Topic Modeling class.

In [2]:
from gensim.models import LdaSeqModel

Some relevant parameters for `LdaSeqModel` are listed below. For an exhaustive list of parameters, see [here](https://radimrehurek.com/gensim/models/ldaseqmodel.html).

#### LdaSeqModel Parameters: 

- **corpus**: The collection of document vectors we'll use to fit our LDA Sequence model.


- **id2word**: Allows us to map word IDs onto words, and helps determine the size of our vocabulary.


- **time_slice**: A list giving the number of documents in each period of time we want our model to consider, in chronological order. The entries should add up to the number of documents in the corpus.


- **num_topics**: The total number of topics we'd like our model to determine.
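Since `time_slice` only holds counts, the model relies on the corpus already being in chronological order: the first entry covers the first documents, the second entry the next ones, and so on. The sketch below illustrates that reading with plain Python lists. `split_by_time_slice` is a hypothetical helper written for this example, not a Gensim function.

```python
def split_by_time_slice(documents, time_slice):
    """Split a chronologically ordered list of documents into consecutive slices."""
    if sum(time_slice) != len(documents):
        raise ValueError(
            f"time_slice covers {sum(time_slice)} documents, "
            f"but the corpus has {len(documents)}"
        )
    slices, start = [], 0
    for size in time_slice:
        slices.append(documents[start:start + size])
        start += size
    return slices


# Nine placeholder documents, assumed to be sorted from oldest to newest
documents = [f"doc_{i}" for i in range(9)]
for period, docs in enumerate(split_by_time_slice(documents, [2, 4, 3])):
    print(period, docs)
```

Running a check like this before fitting can catch a `time_slice` that no longer matches the corpus after filtering.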



Just to get a handle on `LdaSeqModel`, we'll run the model with Gensim's small built-in test corpus and dictionary (`common_corpus` and `common_dictionary`). We'll use the common dictionary for the `id2word` parameter when we run our model.

In [3]:
from gensim.test.utils import common_corpus
from gensim.test.utils import common_dictionary

corpus = common_corpus
dictionary = common_dictionary

In [4]:
ldaseq = LdaSeqModel(corpus=common_corpus, id2word=dictionary, time_slice=[2, 4, 3], num_topics=2)



As you can see below, the topics generated with the Gensim common corpus and dictionary don't appear to vary much over time:

In [5]:
ldaseq.print_topics(time=0)

[[('system', 0.21157474442727292),
  ('user', 0.16464543889319655),
  ('computer', 0.11780520871853883),
  ('interface', 0.11780520871853883),
  ('eps', 0.06967385482884698),
  ('response', 0.06951097868862414),
  ('time', 0.06951097868862414),
  ('survey', 0.051180555909905195),
  ('trees', 0.032452567340626),
  ('graph', 0.032452567340626),
  ('minors', 0.032452567340626),
  ('human', 0.030935329104574284)],
 [('trees', 0.32492222725659636),
  ('graph', 0.3235563263440136),
  ('minors', 0.2205714063302323),
  ('computer', 0.0382488601600507),
  ('survey', 0.022503013800186142),
  ('human', 0.0100605551936047),
  ('interface', 0.0100605551936047),
  ('response', 0.0100605551936047),
  ('time', 0.0100605551936047),
  ('user', 0.0100605551936047),
  ('system', 0.00994769507044868),
  ('eps', 0.00994769507044868)]]

In [6]:
ldaseq.print_topics(time=1)

[[('system', 0.21240589151067885),
  ('user', 0.1646899189988433),
  ('computer', 0.11749357289440712),
  ('interface', 0.11749357289440712),
  ('eps', 0.06963604929122659),
  ('response', 0.0694731541329969),
  ('time', 0.0694731541329969),
  ('survey', 0.05114236405938337),
  ('trees', 0.03241859405373477),
  ('graph', 0.03241859405373477),
  ('minors', 0.03241859405373477),
  ('human', 0.030936539923855384)],
 [('trees', 0.3249256535151684),
  ('graph', 0.32373545741107845),
  ('minors', 0.22060806640237762),
  ('computer', 0.03819443709873786),
  ('survey', 0.022472092171356966),
  ('human', 0.010041390232247207),
  ('interface', 0.010041390232247207),
  ('response', 0.010041390232247207),
  ('time', 0.010041390232247207),
  ('user', 0.010041390232247207),
  ('system', 0.00992867112002237),
  ('eps', 0.00992867112002237)]]

In [7]:
ldaseq.print_topics(time=2)

[[('system', 0.21241936929164823),
  ('user', 0.16472754916179352),
  ('computer', 0.11755004609140896),
  ('interface', 0.11755004609140896),
  ('eps', 0.06960644191233825),
  ('response', 0.06944356260480058),
  ('time', 0.06944356260480058),
  ('survey', 0.05111543816120229),
  ('trees', 0.032396675636444675),
  ('graph', 0.032396675636444675),
  ('minors', 0.032396675636444675),
  ('human', 0.030953957171264507)],
 [('graph', 0.32400816741793875),
  ('trees', 0.32379355787054775),
  ('minors', 0.22150884150941086),
  ('computer', 0.038181772854098815),
  ('survey', 0.0224696342220385),
  ('human', 0.010037636236713735),
  ('interface', 0.010037636236713735),
  ('response', 0.010037636236713735),
  ('time', 0.010037636236713735),
  ('user', 0.010037636236713735),
  ('system', 0.009924922471198317),
  ('eps', 0.009924922471198317)]]
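One way to put a number on how little the topics move is to compare the probability of each word across two time slices. The sketch below does this for the top three words of the first topic, with the values copied from the `time=0` and `time=2` outputs above. `largest_shift` is a hypothetical helper for this illustration.

```python
def largest_shift(topic_then, topic_now):
    """Return the word whose probability changed most, and the size of the change."""
    then, now = dict(topic_then), dict(topic_now)
    shifts = {word: abs(now[word] - then[word]) for word in then if word in now}
    word = max(shifts, key=shifts.get)
    return word, shifts[word]


# Top three words of the first topic, copied from the printed outputs
topic_0_time_0 = [
    ("system", 0.21157474442727292),
    ("user", 0.16464543889319655),
    ("computer", 0.11780520871853883),
]
topic_0_time_2 = [
    ("system", 0.21241936929164823),
    ("user", 0.16472754916179352),
    ("computer", 0.11755004609140896),
]

word, shift = largest_shift(topic_0_time_0, topic_0_time_2)
print(word, shift)
```

For these three words the largest change is below 0.001, which matches the impression that the topics are essentially static.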

And that's alright: we have no reason to expect any changes! To see topics that actually vary over time, let's walk through how to apply LDA Sequence topic modeling to our UN general debate data.

In [9]:
import pandas as pd

un_df = pd.read_json('un-general-debates.json')

#### Cleaning the data

In [40]:
a = un_df['speech_year'] < 1989

In [41]:
a.describe()

count      3214
unique        2
top       False
freq       1882
Name: speech_year, dtype: object

In [42]:
b = un_df['speech_year'] > 1994

In [43]:
b.describe()

count      3214
unique        2
top       False
freq       2323
Name: speech_year, dtype: object

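In [44]:
# Drop speeches given before 1989
un_df = un_df[~a]

In [45]:
# Drop speeches given after 1994; the mask is rebuilt on the filtered frame so its index matches
un_df = un_df[un_df['speech_year'] <= 1994]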


In [46]:
len(un_df)

991

In [47]:
un_df['speech_year'].describe()

count     991.000000
mean     1991.594349
std         1.707656
min      1989.000000
25%      1990.000000
50%      1992.000000
75%      1993.000000
max      1994.000000
Name: speech_year, dtype: float64

#### Setting up our Corpus

For background on how Gensim represents corpora, see [Corpora and Vector Spaces, Gensim style](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html).

In [10]:
un_df['speech_text']

0       ﻿It is indeed a pleasure for me and the member...
1       ﻿\nMay I begin by congratulating you. Sir, on ...
2       ﻿\nMr. President, it is a particular pleasure ...
3       ﻿\nDuring the debate at the fortieth session o...
4       ﻿I should like at the outset to express my del...
                              ...                        
3209    I should like to congratulate\nMr. Essy on his...
3210    It is with pleasure that I begin this speech b...
3211    Allow me first of all, Sir, to congratulate yo...
3212    It is a great pleasure to attend the beginning...
3213    Your election,\nSir, to the presidency of the ...
Name: speech_text, Length: 3214, dtype: object

In [37]:
# Trying the CountVectorizer from sklearn

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [38]:
vectorizer = CountVectorizer(lowercase   = True,
                             ngram_range = (1,1),
                             max_df      = .90,
                             stop_words   = 'english',
                             max_features = 1000)

In [39]:
vectorizer.fit(un_df['speech_text'])
un_word_counts = vectorizer.transform(un_df['speech_text'])

In [40]:
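# Based on Gensim link above, trying to convert to corpus

from gensim.matutils import Sparse2Corpus
from gensim.matutils import corpus2csc
import scipy.sparse

# sklearn stores one document per row, so columns must not be read as documents
corpus = Sparse2Corpus(un_word_counts, documents_columns=False)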

In [200]:
un_word_counts

<3214x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 1299630 stored elements in Compressed Sparse Row format>

#### Setting Up Our Dictionary

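In [28]:
# Collect the speeches as a list of strings
# (calling str() on the Series would only return its truncated printout)

documents = un_df['speech_text'].tolist()

In [29]:
# Tokenize with spacy

import spacy

In [30]:
nlp = spacy.load("en_core_web_md")

In [31]:
# Tokenize each speech, keeping lowercased alphabetic tokens that are not stop words
tokenized_docs = [[token.lower_ for token in doc if token.is_alpha and not token.is_stop]
                  for doc in nlp.pipe(documents)]

In [None]:
# Preview the first few tokens of the first speech
print(tokenized_docs[0][:10])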

In [11]:
# Use gensim to create dictionary

from gensim.corpora import Dictionary
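The corpus built earlier came from `CountVectorizer`, so its word IDs are the vectorizer's column indices. Any `id2word` mapping passed to the model has to agree with those indices; a `Dictionary` built from a separate tokenization would number the words differently. Below is a minimal sketch of one way to get a matching mapping. It assumes scikit-learn's fitted `vocabulary_` attribute (a dict from term to column index) and uses a small hypothetical stand-in for it. Whether `LdaSeqModel` accepts a plain dict for `id2word` in your Gensim version is also an assumption worth checking.

```python
def invert_vocabulary(term_to_index):
    """Turn a {term: column index} mapping into a {word ID: word} mapping."""
    index_to_term = {index: term for term, index in term_to_index.items()}
    if len(index_to_term) != len(term_to_index):
        raise ValueError("two terms share the same column index")
    return index_to_term


# Stand-in for vectorizer.vocabulary_ (hypothetical terms and indices)
term_to_index = {"nations": 2, "peace": 3, "development": 0, "international": 1}

id2word = invert_vocabulary(term_to_index)
print(id2word)
```

In the notebook, the stand-in would be replaced by `vectorizer.vocabulary_` after the vectorizer has been fitted.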

#### Setting up our Time Slices

Let's say we want to look at the evolution of our UN general debate topics by comparing speeches given in the three years before and three years after the fall of the Soviet Union:



- 1989-1991 = 153+156+162 = **471**


- 1992-1994 = 167+175+178 = **520**


In [49]:
time_slice = [471, 520]
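The totals above can also be derived from the data instead of being added up by hand. The sketch below counts documents per period from a list of years. It assumes the six counts quoted above belong to 1989 through 1994 in that order, and `build_time_slice` is a hypothetical helper, not a Gensim function.

```python
from collections import Counter


def build_time_slice(years, boundaries):
    """Count documents per period; boundaries are inclusive (start, end) year pairs."""
    per_year = Counter(years)
    return [
        sum(count for year, count in per_year.items() if start <= year <= end)
        for start, end in boundaries
    ]


# Rebuild a list of years from the per-year counts quoted above
# (assumption: the counts are listed in year order)
counts = {1989: 153, 1990: 156, 1991: 162, 1992: 167, 1993: 175, 1994: 178}
years = [year for year, n in counts.items() for _ in range(n)]

time_slice = build_time_slice(years, [(1989, 1991), (1992, 1994)])
print(time_slice)  # [471, 520]
```

With the real data, `years` could come from `un_df['speech_year']`. The slices only line up with the corpus if the speeches are sorted by year before the corpus is built, for example with `un_df.sort_values('speech_year')`.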

#### Putting it all together: Running the Model

In [None]:
# This appears to run, but fitting is very slow on a corpus of this size

ldaseq = LdaSeqModel(corpus=corpus, time_slice=time_slice, num_topics=5)

## Short Text Topic Modeling

While LDA topic modeling works well when the texts in our corpus are reasonably long (around fifty words or more), LDA models run into issues when applied to shorter texts. This happens because of a major assumption of LDA modeling: that each text is a *mixture of topics*. While this makes sense for longer texts, shorter texts, like social media posts, often consist of only a *single topic*.
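The difference between the two assumptions is easiest to see in how a document would be generated under each. The sketch below is a deliberately simplified illustration: topics are picked uniformly at random (LDA itself draws per-document topic proportions from a Dirichlet prior), and the topic names and word lists are hypothetical.

```python
import random

# Hypothetical topics, each with a tiny vocabulary
TOPICS = {
    "space": ["orbit", "launch", "shuttle", "nasa"],
    "windows": ["server", "widget", "display", "xterm"],
}


def mixed_topic_document(length, rng):
    """LDA-style assumption: every word position picks its own topic."""
    words = []
    for _ in range(length):
        topic = rng.choice(list(TOPICS))
        words.append(rng.choice(TOPICS[topic]))
    return words


def single_topic_document(length, rng):
    """Short-text assumption: one topic is picked for the whole document."""
    topic = rng.choice(list(TOPICS))
    return [rng.choice(TOPICS[topic]) for _ in range(length)]


rng = random.Random(0)
print(mixed_topic_document(8, rng))
print(single_topic_document(8, rng))
```

With only a handful of words per text, a model has very little evidence for estimating a per-document mixture, which is why short-text approaches assign each document to a single topic.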

In [30]:
# Working on it...Code's giving me trouble but that's nothing new. 

import spacy
import gsdmm

from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib
import pandas as pd
import numpy as np
import ast

In [31]:
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names

## References

Alghamdi, Rubayyi and Khalid Alfalqi. 2015. "A Survey of Topic Modeling in Text Mining." *Int. J. Adv. Comput. Sci. Appl.(IJACSA)*, 6(1).

Blei, David M. and John D. Lafferty. 2006. "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning* (pp. 113-120).

https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883

https://markroxor.github.io/gensim/static/notebooks/ldaseqmodel.html