# Topic Modeling: Beyond LDA

While LDA's ability to surface latent themes in large collections of texts is quite useful (hence the model's popularity), LDA models nevertheless have a number of limitations. To name a few: they don't account for the passage of time, they have difficulty capturing relationships among the generated topics, and their topics become considerably less useful when the corpus consists of short documents.

In this chapter, we'll present some alternative approaches to topic modeling that help to mitigate these limitations of LDA modeling.  

## Dynamic Topic Modeling 


In developing Dynamic Topic Modeling, or DTM, [Blei and Lafferty](https://dl.acm.org/doi/pdf/10.1145/1143844.1143859) wanted to account for the possibility that content within a collection of texts could evolve over time, something traditional LDA topic modeling doesn't consider. To do so, they developed a form of topic modeling that could trace the evolution of the topics generated over time. In DTM, then, we can see our topics develop as time passes.
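
To build some intuition for what "evolving" means here: in Blei and Lafferty's formulation, a topic's word scores in one time slice are the previous slice's scores plus a small amount of Gaussian noise, and the scores are then normalized into probabilities. The toy simulation below illustrates only that idea and is not Gensim's implementation. The function names, the four-word vocabulary, and the `sigma` value are assumptions chosen for illustration.

```python
import math
import random


def softmax(values):
    """Map unconstrained scores to probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_topic_drift(vocab, n_slices=3, sigma=0.05, seed=0):
    """Simulate one topic whose word scores follow a Gaussian random walk."""
    rng = random.Random(seed)
    beta = [rng.gauss(0.0, 1.0) for _ in vocab]  # scores for the first slice
    history = []
    for _ in range(n_slices):
        history.append(dict(zip(vocab, softmax(beta))))
        # Next slice: previous scores plus a small Gaussian perturbation.
        beta = [b + rng.gauss(0.0, sigma) for b in beta]
    return history


vocab = ["graph", "minors", "system", "user"]  # illustrative words only
history = simulate_topic_drift(vocab)
for t, dist in enumerate(history):
    print(t, {word: round(prob, 3) for word, prob in dist.items()})
```

Because `sigma` is small, neighbouring slices end up with similar but not identical word probabilities, which is the kind of gradual change DTM is designed to capture.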

### Gensim

In order to run Dynamic Topic Modeling in Python, we'll be installing the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library. (We'll also want to make sure we've installed Gensim's dependencies.)

In [1]:
pip install gensim



From Gensim, we'll import [`LdaSeqModel`](https://radimrehurek.com/gensim/models/ldaseqmodel.html), the library's built-in Dynamic Topic Modeling class.

In [25]:
from gensim.models import LdaSeqModel

Some relevant parameters for `LdaSeqModel` are listed below. For an exhaustive list of parameters, see [the Gensim documentation](https://radimrehurek.com/gensim/models/ldaseqmodel.html).

#### LdaSeqModel Parameters: 

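- **corpus**: The documents to model, as a stream of bag-of-words vectors (or a sparse matrix of shape number of documents by number of terms). The documents should be ordered chronologically so that they line up with `time_slice`.


- **id2word**: Allows us to map word IDs onto words. Helps with determining the size of our vocabulary.


- **time_slice**: A list giving the number of documents in each consecutive time slice. Each slice might represent, for example, one year of documents.


- **num_topics**: The number of latent topics to extract from the corpus.


- **chunksize**: The number of documents processed together in each chunk.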
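
Of these parameters, `time_slice` tends to need the most preparation: it holds the number of documents in each consecutive time period, so the documents themselves have to be sorted chronologically first. The sketch below shows one way to derive it with the standard library only. The helper name `build_time_slices` and the toy `(year, tokens)` data are hypothetical and not part of Gensim.

```python
from collections import Counter


def build_time_slices(dated_docs):
    """Sort (period, tokens) pairs by period and count documents per period."""
    ordered = sorted(dated_docs, key=lambda pair: pair[0])  # stable sort
    counts = Counter(period for period, _ in ordered)
    periods = sorted(counts)
    time_slice = [counts[period] for period in periods]
    texts = [tokens for _, tokens in ordered]
    return texts, periods, time_slice


# Toy data: each document is tagged with the year it was written.
dated_docs = [
    (2021, ["graph", "trees"]),
    (2020, ["human", "interface"]),
    (2020, ["user", "system"]),
    (2022, ["graph", "minors"]),
    (2021, ["system", "response", "time"]),
]

texts, periods, time_slice = build_time_slices(dated_docs)

# The slice sizes must account for every document in the corpus.
assert sum(time_slice) == len(texts)
print(periods, time_slice)  # [2020, 2021, 2022] [2, 2, 1]
```

The sorted `texts` would then be converted to bag-of-words vectors (for instance with a Gensim `Dictionary`) before being passed as `corpus`, with `time_slice` passed alongside.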

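Just to get a handle on `LdaSeqModel`, we'll run the model with Gensim's common corpus and dictionary, a tiny test dataset bundled with the library. We'll pass the common dictionary as the `id2word` parameter when we run our model. Note that the `time_slice` values below sum to nine, the number of documents in the common corpus.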

In [29]:
from gensim.test.utils import common_corpus
from gensim.test.utils import common_dictionary

corpus = common_corpus
dictionary = common_dictionary

In [33]:
ldaseq = LdaSeqModel(corpus=common_corpus, id2word=dictionary, time_slice=[2, 4, 3], num_topics=2, chunksize=1)



In [34]:
ldaseq.print_topics(time=0)

[[('graph', 0.24397931991219668),
  ('minors', 0.24397931991219668),
  ('human', 0.09001845665111263),
  ('interface', 0.09001845665111263),
  ('survey', 0.09001845665111263),
  ('computer', 0.05535520920529867),
  ('response', 0.03110513016949502),
  ('system', 0.03110513016949502),
  ('time', 0.03110513016949502),
  ('user', 0.03110513016949502),
  ('eps', 0.03110513016949502),
  ('trees', 0.03110513016949502)],
 [('system', 0.212521467190357),
  ('user', 0.16521007402047225),
  ('trees', 0.11814820948685073),
  ('response', 0.06745061829382003),
  ('time', 0.06745061829382003),
  ('eps', 0.06745061829382003),
  ('graph', 0.06745061829382003),
  ('human', 0.050716353804357746),
  ('interface', 0.050716353804357746),
  ('survey', 0.050716353804357746),
  ('minors', 0.050716353804357746),
  ('computer', 0.031452360909608915)]]

In [35]:
ldaseq.print_topics(time=1)

[[('graph', 0.24452847533931363),
  ('minors', 0.24452847533931363),
  ('human', 0.08985079871146644),
  ('interface', 0.08985079871146644),
  ('survey', 0.08985079871146644),
  ('computer', 0.05523817104959015),
  ('response', 0.031025413689563866),
  ('system', 0.031025413689563866),
  ('time', 0.031025413689563866),
  ('user', 0.031025413689563866),
  ('eps', 0.031025413689563866),
  ('trees', 0.031025413689563866)],
 [('system', 0.21255385122072792),
  ('user', 0.16484640415976715),
  ('trees', 0.11872939728000363),
  ('response', 0.06742129842200774),
  ('time', 0.06742129842200774),
  ('eps', 0.06742129842200774),
  ('graph', 0.06742129842200774),
  ('human', 0.05068476120535119),
  ('interface', 0.05068476120535119),
  ('survey', 0.05068476120535119),
  ('minors', 0.05068476120535119),
  ('computer', 0.03144610883006575)]]

In [36]:
ldaseq.print_topics(time=2)

[[('graph', 0.2450045719364962),
  ('minors', 0.2450045719364962),
  ('human', 0.08970051065008355),
  ('interface', 0.08970051065008355),
  ('survey', 0.08970051065008355),
  ('computer', 0.05511309118233615),
  ('response', 0.030962705499070146),
  ('system', 0.030962705499070146),
  ('time', 0.030962705499070146),
  ('user', 0.030962705499070146),
  ('eps', 0.030962705499070146),
  ('trees', 0.030962705499070146)],
 [('system', 0.2125540016819885),
  ('user', 0.1649510553212659),
  ('trees', 0.11875577524080093),
  ('response', 0.06740263821463657),
  ('time', 0.06740263821463657),
  ('eps', 0.06740263821463657),
  ('graph', 0.06740263821463657),
  ('human', 0.05066598546277992),
  ('interface', 0.05066598546277992),
  ('survey', 0.05066598546277992),
  ('minors', 0.05066598546277992),
  ('computer', 0.03146467304627889)]]
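
Comparing three long printouts by eye is tedious, so it can help to pull out a single word's probability across time slices. The sketch below does this for `'graph'` in the first topic, using values copied from the outputs above (truncated to the top two words). The helper `word_trajectory` is a hypothetical convenience function, not part of Gensim.

```python
def word_trajectory(topics_by_time, topic_id, word):
    """Return the probability of `word` in topic `topic_id` at each time slice."""
    return [dict(topics[topic_id]).get(word, 0.0) for topics in topics_by_time]


# One entry per time slice; each entry is a list of topics, and each topic is
# a list of (word, probability) pairs, matching the shape of the printed output.
topics_by_time = [
    [[("graph", 0.24397931991219668), ("minors", 0.24397931991219668)]],
    [[("graph", 0.24452847533931363), ("minors", 0.24452847533931363)]],
    [[("graph", 0.2450045719364962), ("minors", 0.2450045719364962)]],
]

trajectory = word_trajectory(topics_by_time, topic_id=0, word="graph")
print(trajectory)

# With the fitted model from this section, the same structure could be built as:
# topics_by_time = [ldaseq.print_topics(time=t) for t in range(3)]
```

On this toy corpus the probability of `'graph'` rises only in the third decimal place, which is unsurprising given that the nine documents carry no real temporal signal.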

In [7]:
from gensim.test.utils import datapath

In [4]:
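# Imports used by the news-corpus cells below. (The older
# gensim.models.wrappers.DtmModel wrapper was removed in Gensim 4.)
from gensim.corpora import Dictionary, bleicorpus
from gensim.models import ldaseqmodel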

In [5]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import PunktSentenceTokenizer, RegexpTokenizer


In [None]:
# Load the pre-built dictionary and the Blei-format corpus for the news dataset.
try:
    dictionary = Dictionary.load('news_dictionary')
except FileNotFoundError:
    raise ValueError("SKIP: Please download the Corpus/news_dictionary dataset.")
corpus = bleicorpus.BleiCorpus('news_corpus')

In [None]:
# Number of documents in each of the three consecutive time periods.
time_slice = [438, 430, 456]

In [None]:
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)

## Short Text Topic Modeling

LDA topic modeling works well when the texts in our corpus are reasonably long (around fifty words or more), but it runs into trouble when applied to shorter texts. The cause is a core assumption of LDA modeling: that each text is a *mixture of topics*. That assumption makes sense for longer texts, but shorter texts, like social media posts, often address only a *single topic*.
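
The single-topic assumption can be made concrete with a mixture model that assigns each document to exactly one cluster. The sketch below follows the collapsed Gibbs sampling scheme used by GSDMM (the Gibbs Sampling Dirichlet Multinomial Mixture model, which is presumably what the `gsdmm` import in the next cell refers to). It is a simplified, pure-Python illustration rather than the `gsdmm` package's API: the function name `gsdmm_sketch`, its default values, and the toy documents are all assumptions.

```python
import random
from collections import Counter


def gsdmm_sketch(docs, K=4, alpha=0.1, beta=0.1, n_iters=10, seed=0):
    """Assign each tokenized document to exactly one of at most K clusters."""
    rng = random.Random(seed)
    D = len(docs)
    V = len({word for doc in docs for word in doc})  # vocabulary size

    labels = [rng.randrange(K) for _ in docs]   # random starting cluster
    doc_count = [0] * K                         # documents per cluster
    token_count = [0] * K                       # tokens per cluster
    word_count = [Counter() for _ in range(K)]  # per-cluster word counts
    for doc, z in zip(docs, labels):
        doc_count[z] += 1
        token_count[z] += len(doc)
        word_count[z].update(doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            # Take the document out of its current cluster.
            z = labels[d]
            doc_count[z] -= 1
            token_count[z] -= len(doc)
            word_count[z].subtract(doc)

            # Score every cluster: popular clusters and clusters that already
            # contain this document's words receive higher weight.
            weights = []
            for k in range(K):
                p = (doc_count[k] + alpha) / (D - 1 + K * alpha)
                seen = Counter()  # repeats of a word within this document
                for i, word in enumerate(doc):
                    p *= (word_count[k][word] + beta + seen[word]) / (
                        token_count[k] + V * beta + i
                    )
                    seen[word] += 1
                weights.append(p)

            # Sample a single cluster for the whole document and add it back.
            z = rng.choices(range(K), weights=weights)[0]
            labels[d] = z
            doc_count[z] += 1
            token_count[z] += len(doc)
            word_count[z].update(doc)
    return labels


docs = [
    ["cat", "dog"],
    ["dog", "cat", "pet"],
    ["rocket", "orbit"],
    ["orbit", "moon", "rocket"],
]
labels = gsdmm_sketch(docs, K=3, n_iters=5, seed=1)
print(labels)  # one cluster id per document
```

Unlike LDA, which gives each document a distribution over topics, this returns a single label per document. `K` acts as an upper bound on the number of clusters, since some may end up empty after sampling.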

In [30]:
# Work in progress: imports for the short-text topic modeling example.

import spacy
import gsdmm

from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib
import pandas as pd
import numpy as np
import ast

In [31]:
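# Restrict the 20 Newsgroups data to three clearly distinct categories.
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

# One copy with headers, footers, and quotes stripped; one copy with them kept.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

# Integer category label for each document, plus the matching category names.
targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names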

## References

Alghamdi, Rubayyi and Khalid Alfalqi. 2015. "A Survey of Topic Modeling in Text Mining." *International Journal of Advanced Computer Science and Applications (IJACSA)*, 6(1).

Blei, David M. and John D. Lafferty. 2006. "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning* (pp. 113-120).

https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883

https://markroxor.github.io/gensim/static/notebooks/ldaseqmodel.html