# Topic Modeling: Beyond LDA

LDA topic modeling's ability to pick up on latent themes in large collections of texts is quite useful, which explains the model's popularity, but LDA models nevertheless have a number of limitations. To name a few: LDA models don't account for the passage of time, they have difficulty capturing relationships among the topics they generate, and their topics become considerably less useful when the model is applied to smaller corpora of shorter documents.

In this chapter, we'll present some alternative approaches to topic modeling that help to mitigate these limitations of LDA modeling.  

## Dynamic Topic Modeling 


In developing Dynamic Topic Modeling, or DTM, [Blei and Lafferty](https://dl.acm.org/doi/pdf/10.1145/1143844.1143859) wanted to account for the possibility that content within a collection of texts could evolve over time, something traditional LDA topic modeling doesn't consider. To do so, they developed a form of topic modeling that traces how each generated topic changes from one time period to the next. In DTM, then, we can see our topics develop as time passes.

### Gensim

In order to run Dynamic Topic Modeling in Python, we'll be installing the [Gensim](https://radimrehurek.com/gensim/index.html) topic modeling library. We'll also want to make sure we've installed Gensim's dependencies. 

In [1]:
pip install gensim



In [2]:
pip install smart-open



From Gensim, we'll import [ldaseqmodel](https://radimrehurek.com/gensim/models/ldaseqmodel.html), the module that provides the library's built-in Dynamic Topic Modeling class, `LdaSeqModel`.

In [9]:
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus
import numpy
from gensim.matutils import hellinger

import gensim.downloader as api

In [12]:
try:
    dictionary = Dictionary.load('data/news_dictionary')
except FileNotFoundError as e:
    raise ValueError("SKIP: Please download the Corpus/news_dictionary dataset.") from e
corpus = bleicorpus.BleiCorpus('data/news_corpus')
# It's very important that your corpus is saved in the order of your time slices!



In [13]:
# Number of documents in each time slice, in chronological order
time_slice = [438, 430, 456]
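
The `time_slice` list and the ordering requirement go together: the first 438 documents in the corpus are treated as the first period, the next 430 as the second, and so on. As a rough sketch of how such a list could be derived, assuming each document comes with a date, the snippet below sorts a handful of made-up `(year, text)` pairs and counts the documents per year. The `dated_docs` data is purely hypothetical and is not the news corpus loaded above.

```python
from collections import Counter

# Hypothetical (year, text) pairs; real data would come from your own corpus.
dated_docs = [
    (2005, "chelsea win league title"),
    (2004, "mobile phone sales rise"),
    (2006, "new search engine launched"),
    (2004, "band tops uk chart"),
    (2005, "government wins vote"),
]

# Sort documents chronologically so their order matches the time slices.
dated_docs.sort(key=lambda pair: pair[0])

# Count documents per period; the counts, in chronological order, form time_slice.
counts = Counter(year for year, _ in dated_docs)
time_slice = [counts[year] for year in sorted(counts)]

print(time_slice)  # [2, 2, 1]
```

Whatever the source of the dates, the entries of `time_slice` should add up to the number of documents in the corpus.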



In [14]:
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)




In [15]:
ldaseq.print_topics(time=0)

[[('best', 0.0064699199352853745),
  ('film', 0.0046696348871453405),
  ('music', 0.0037431748326163796),
  ('users', 0.0035332218935211685),
  ('use', 0.003357881613604223),
  ('mobile', 0.0033464010016513493),
  ('number', 0.003134513814075107),
  ('net', 0.002801237575047667),
  ('last', 0.002730599219249108),
  ('first', 0.002662086215837613),
  ('million', 0.00262103365811086),
  ('tv', 0.002486489364106573),
  ('uk', 0.0023887279979344497),
  ('phone', 0.002331825251081979),
  ('top', 0.002329741843296296),
  ('information', 0.0023156870935885825),
  ('used', 0.0022992825186095964),
  ('search', 0.002269481138485555),
  ('show', 0.002259953203175781),
  ('band', 0.002236894206645714)],
 [('game', 0.00602000929605834),
  ('club', 0.003994651876693231),
  ('chelsea', 0.0038912733368930076),
  ('players', 0.003779559095340811),
  ('last', 0.0037256236854489176),
  ('first', 0.0037147325049830733),
  ('league', 0.003423070085586388),
  ('think', 0.003386028605100013),
  ('united', 0.

In [18]:
ldaseq.print_topic_times(topic=2)

[[('world', 0.003226564248493927),
  ('first', 0.00251849513004872),
  ('make', 0.0023643297172749626),
  ('next', 0.002248303128048816),
  ('set', 0.002241134731520339),
  ('games', 0.0022395343314237756),
  ('technology', 0.0020615009605426914),
  ('much', 0.001933529034166009),
  ('time', 0.0019304698909921566),
  ('says', 0.0018294442561707707),
  ('president', 0.001732627955899744),
  ('like', 0.001713090897948452),
  ('uk', 0.0016947364991772244),
  ('report', 0.0015568530832991774),
  ('take', 0.0015289887109389854),
  ('year', 0.0015275098927167306),
  ('european', 0.0015227195805888826),
  ('tv', 0.0015156940048914641),
  ('bbc', 0.0014819914271543078),
  ('international', 0.0014754052019566727)],
 [('world', 0.003256410401115555),
  ('first', 0.002483641568147494),
  ('make', 0.0023734800644187315),
  ('next', 0.0022754553324343643),
  ('games', 0.0022717957491604447),
  ('set', 0.0022226226850457407),
  ('technology', 0.002098170012102668),
  ('much', 0.00195653973044246),
 

In [13]:
words = [dictionary[word_id] for word_id, count in corpus[558]]
print(words)

['set', 'time,"', 'chairman', 'decision', 'news', 'director', 'former', 'vowed', '"it', 'results', 'club', 'third', 'home', 'paul', 'saturday.', 'south', 'conference', 'leading', '"some', 'survival', 'needed', 'coach', "don't", 'every', 'trouble', 'desperate', 'eight', 'first', 'win', 'going', 'park', 'near', 'chance', 'manager', 'league', 'milan', 'games', 'go', 'game', 'foot', 'say', 'upset', "i'm", 'poor', 'season.', 'executive', 'road', '24', 'debut', 'portsmouth.', 'give', 'claiming', 'steve', 'break', 'rivals', 'boss', 'kevin', 'premiership', 'little', 'left', 'table.', 'life', 'join', 'years.', 'bring', 'season,', 'director.', 'became', 'st', 'according', 'official', 'hope', 'shocked', 'though', 'phone', 'charge', '14', 'website.', 'time,', 'claimed', 'kept', 'bond', 'appointment', 'unveil', 'november', 'picked', 'confirmed,', 'believed', 'deep', 'position', 'surprised', 'negotiations', 'talks', 'gmt', 'middlesbrough', 'replaced', 'appear', 'football,', '"i\'m', 'charge.', 'sain

In [14]:
doc = ldaseq.doc_topics(558)
print(doc)

[5.46298825e-05 5.46298825e-05 5.46298825e-05 9.99781480e-01
 5.46298825e-05]
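
Each entry in this array is the proportion of document 558 assigned to one of the five topics. As a small sketch, with the values printed above copied by hand into a plain list (the variable names here are our own), we can pick out the dominant topic:

```python
# Topic proportions copied from the doc_topics(558) output above.
doc_558 = [5.46298825e-05, 5.46298825e-05, 5.46298825e-05,
           9.99781480e-01, 5.46298825e-05]

# Index of the largest proportion, i.e. the topic that dominates the document.
dominant_topic = max(range(len(doc_558)), key=lambda i: doc_558[i])

print(dominant_topic, round(doc_558[dominant_topic], 4))  # 3 0.9998
```

Document 558 is almost entirely assigned to the topic at index 3, which fits the football vocabulary in its word list.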


In [15]:
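doc_football_1 = ['economy', 'bank', 'mobile', 'phone', 'markets', 'buy', 'football', 'united', 'giggs']
# Convert the tokens to a bag-of-words vector using the trained dictionary
doc_football_1 = dictionary.doc2bow(doc_football_1)
# Infer the topic proportions of this unseen document
doc_football_1 = ldaseq[doc_football_1]
print(doc_football_1)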

[0.00110497 0.00110497 0.28760389 0.32584256 0.38434361]


In [16]:
doc_football_2 = ['arsenal', 'fourth', 'wenger', 'oil', 'middle', 'east', 'sanction', 'fluctuation']
doc_football_2 = dictionary.doc2bow(doc_football_2)
doc_football_2 = ldaseq[doc_football_2]

In [17]:
hellinger(doc_football_1, doc_football_2)

0.3708314381087008
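
The Hellinger distance ranges from 0 (identical distributions) to 1 (no shared probability mass), so a smaller value means two documents have more similar topic proportions. A minimal sketch of the standard formula, written in plain Python rather than taken from Gensim's source, might look like this. The function name `hellinger_distance` is our own.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("Both distributions must have the same length.")
    # Sum of squared differences between the square roots of the probabilities.
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Identical distributions are at distance 0; non-overlapping ones at distance 1.
same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Read this way, the value of roughly 0.37 above suggests that the two football documents share a fair amount of their topic mass without being identical.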

In [18]:
doc_government_1 = ['tony', 'government', 'house', 'party', 'vote', 'european', 'official', 'house']
doc_government_1 = dictionary.doc2bow(doc_government_1)
doc_government_1 = ldaseq[doc_government_1]

hellinger(doc_football_1, doc_government_1)

0.5832871030021028

In [19]:
ldaseq.print_topic_times(1)


[[('games', 0.0030960428821932623),
  ('use', 0.0030513819041573867),
  ('make', 0.0026425481484576747),
  ('used', 0.002512950927273868),
  ('first', 0.0024090568962250084),
  ('bbc', 0.002392128312440404),
  ('technology', 0.0021596236304352668),
  ('like', 0.0021132297702587),
  ('next', 0.0020291895275067755),
  ('law', 0.0020103860316839253),
  ('mobile', 0.00199904809885942),
  ('police', 0.0019562567886351407),
  ('home', 0.0018999170629895186),
  ('uk', 0.0018614826923549562),
  ('way', 0.0018095329990182067),
  ('last', 0.0017889330070435848),
  ('using', 0.001766340299299561),
  ('work', 0.001759745370395607),
  ('government', 0.0016859354450111698),
  ('rights', 0.0016725928367058124)],
 [('games', 0.0033398135070882976),
  ('use', 0.0028781838964077305),
  ('make', 0.0025582169710486814),
  ('used', 0.0025085443295097456),
  ('bbc', 0.0024502671107112252),
  ('first', 0.0024151565516086254),
  ('technology', 0.0022497541869066643),
  ('like', 0.002136215193883239),
  ('poli

## Short Text Topic Modeling

While LDA topic modeling works well when the texts in our corpus are reasonably long (around fifty words or more), it runs into trouble when applied to shorter texts. The cause is a major assumption of LDA modeling: that each text is a *mixture of topics*. That assumption makes sense for longer texts, but shorter texts, like social media posts, often consist of only a *single topic*.
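
The difference between the two assumptions can be sketched as a toy generative process. In the LDA-style version, every word draws its own topic from the document's mixture; in the single-topic version, one topic is drawn once and produces every word. The topics, vocabularies, and function names below are invented for illustration, and the document's mixture is simply given rather than sampled, so this is a simplification and not a full implementation of either model.

```python
import random

# Hypothetical toy topics with disjoint vocabularies (illustration only).
TOPICS = {
    "sport": ["goal", "match", "league", "coach"],
    "politics": ["vote", "party", "election", "minister"],
}


def mixture_document(topic_weights, length, rng):
    """LDA-style: each word first draws its own topic from the document's mixture."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(TOPICS[topic]))
    return words


def single_topic_document(topic_weights, length, rng):
    """Single-topic assumption: one topic is drawn once and generates every word."""
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    topic = rng.choices(names, weights=weights, k=1)[0]
    return topic, [rng.choice(TOPICS[topic]) for _ in range(length)]


rng = random.Random(0)
weights = {"sport": 0.5, "politics": 0.5}

long_doc = mixture_document(weights, 50, rng)          # words may come from both topics
topic, short_doc = single_topic_document(weights, 8, rng)  # words come from one topic only

print(long_doc[:10])
print(topic, short_doc)
```

With only a handful of words per document, a model has very little evidence for estimating a full mixture, which is one reason the single-topic assumption tends to suit short texts better.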

In [30]:
# Work in progress: imports for the short-text topic modeling example.

import spacy
import gsdmm

from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib
import pandas as pd
import numpy as np
import ast

In [31]:
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names

## References

Alghamdi, Rubayyi and Khalid Alfalqi. 2015. "A Survey of Topic Modeling in Text Mining." *Int. J. Adv. Comput. Sci. Appl.(IJACSA)*, 6(1).

Blei, David M. and John D. Lafferty. 2006. "Dynamic Topic Models." In *Proceedings of the 23rd International Conference on Machine Learning* (pp. 113-120).

https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883

https://markroxor.github.io/gensim/static/notebooks/ldaseqmodel.html