# What other Topic Models are There?

*What other Topic Models are There?*
___

## Contents:

1. [What we (often) use](#1.-What-we-(often)-use)
    - LDA
    - STM
2. [Beyond the BOW approach](#2.-Beyond-the-Bag-of-Words-approach)
    - CTM
3. [Textual Information as Networks](#3.-Textual-Information-as-Networks)
    - TopSBM
4. [Discussion](#4.-Discussion)
5. [Sources](#5.-Sources)

# 1. What we (often) use

*What we (often) use*
___

## A very short introduction to LDA and STM

- basic idea: use text as data and try to understand what a text is about
- three main components and a "target": words, documents, corpora and *topics*
- closely related to dimensionality reduction
    - tf-idf
    - LSI/pLSI
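
As a minimal sketch of the tf-idf weighting listed above (the toy corpus and the plain `log(N / df)` variant of the inverse document frequency are assumptions made for illustration, not a fixed standard):

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "topic models describe documents",
    "network models describe ties",
    "documents contain words",
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)

def tfidf(tokens):
    """Return {word: tf * idf} for one document, with idf = log(N / df)."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}

weights = [tfidf(tokens) for tokens in tokenized]
```

Words that occur in only one document ("topic") receive a higher weight than words shared across documents ("models").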

*What we (often) use - LDA and STM*
___
- LDA [[1](#5.-Sources)] assumes a fixed set of latent topics underlying a corpus: each topic is a probability distribution over the vocabulary, and each document is a mixture of topics


- this way we get 
    - probabilities for documents to belong to certain topics
    - a characterization of topics by frequent words
    - information about the topic proportions in our corpus
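
The generative assumptions behind LDA can be sketched in a few lines. This is an illustration only: the two topics, their word probabilities and the Dirichlet parameter `alpha` are hypothetical, and fitting LDA means inferring such quantities from observed documents rather than sampling from them.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document under LDA's generative assumptions."""
    theta = sample_dirichlet(alpha, rng)  # topic proportions of this document
    words = []
    for _ in range(n_words):
        # First pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two hypothetical topics, each a probability distribution over words.
topics = [
    {"gene": 0.5, "cell": 0.3, "protein": 0.2},
    {"network": 0.6, "node": 0.3, "tie": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Here `theta` corresponds to the document's topic probabilities, and the word distributions in `topics` are what characterizes each topic by its frequent words.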

___
<img src="../images/blei_tm.png" width="1100" height="1100">

___
[[2](#5.-Sources)]

___
- STM [[3](#5.-Sources)] extends the correlated topic model (also abbreviated CTM), which in turn improved on LDA
    - introduction of a linear term (a regression on document metadata) for topic probabilities
    - covariates (e.g. publication date and/or source) can be used to get a better representation of topic prevalence
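
How a linear term can drive topic prevalence is sketched below, assuming a logistic-normal form: a linear predictor per topic plus Gaussian noise, mapped to proportions with a softmax. The coefficient values in `gamma`, the covariate coding and the independent noise per topic are simplifying assumptions for illustration; this is not the estimation procedure of the `stm` package.

```python
import math
import random

def softmax(values):
    """Map real-valued scores to proportions that sum to one."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def topic_proportions(covariates, gamma, noise_sd, rng):
    """Logistic-normal draw: theta = softmax(covariates . gamma + noise)."""
    scores = []
    for topic_coefs in gamma:
        mean = sum(x * g for x, g in zip(covariates, topic_coefs))
        scores.append(rng.gauss(mean, noise_sd))
    return softmax(scores)

# Hypothetical coefficients: one row per topic, columns = (intercept, years since 2000).
gamma = [
    [0.0, 0.00],  # reference topic
    [0.0, 0.15],  # topic that becomes more prevalent over time
]
rng = random.Random(1)
# noise_sd=0.0 switches the noise off so that only the covariate effect is visible.
early = topic_proportions([1.0, 0.0], gamma, noise_sd=0.0, rng=rng)
late = topic_proportions([1.0, 20.0], gamma, noise_sd=0.0, rng=rng)
```

With these made-up coefficients, the expected share of the second topic grows with the publication year, which is the kind of covariate effect on prevalence that STM is designed to estimate.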

___
LDA                        |  STM
:-------------------------:|:-------------------------:
<img src="../images/lda_full.png" width="1200" height="1200">   |  <img src="../images/stm_full.png" width="1200" height="1200">

___
Own images after Blei et al. 2003 [[1](#5.-Sources)] and Stewart et al. 2013 [[3](#5.-Sources)] 

___
## Pros and Cons

- LDA is widely applied and can be used in R and Python
- does not allow covariates


- STM is only implemented in R
- covariates (supposedly) make the model more interpretable
- not as widely used as LDA (yet)


- both rely on the BOW approach
- both are questionable for short documents
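
What relying on the BOW approach means in practice can be shown with a small sketch (the example sentences are made up): two sentences with different meanings end up with the same representation because word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# Both sentences yield identical counts, so a BOW model cannot tell them apart.
same_representation = (a == b)
```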

# 2. Beyond the Bag of Words approach

*Beyond the BOW approach*
___
## Contextualized Topic Modeling (CTM):
- CTM [[4](#5.-Sources)] uses pre-trained language models to overcome the BOW approach by using semantic and syntactic context information
- The main point of interest: pre-trained language models, specifically **BERT** [[5](#5.-Sources)] (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers)

*Beyond the BOW approach - CTM*
___
- Transformers are deep learning architectures that can predict outcomes* from contextual information
- used e.g. for translation tasks
- computationally expensive to train but relatively cheap to apply once trained
- competitive or even state-of-the-art performance in many language modelling tasks
- why they work so well is not yet fully understood

___
\* E.g.: what is the sentence *y*, given the sentence *x* before it and the sentence *z* after it?

In [5]:
import os

In [1]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file

In [2]:
qt = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")

In [10]:
path_data = '../data/'

fname_data = 'corpus.txt'
filename = os.path.join(path_data, fname_data)

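# Read one document per line; strip newlines and skip empty lines
with open(filename, 'r', encoding='utf8') as f:
    x = [line.strip() for line in f if line.strip()]
texts = [h.split() for h in x]  # tokenized copy (not used by the CTM calls below)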

In [15]:
training_dataset = qt.fit(text_for_contextual=x, text_for_bow=x) 


In [16]:
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=21) # 21 topics; 768 = embedding size of the model chosen above

In [17]:
ctm.fit(training_dataset) # run the model

Epoch: [100/100]	 Seen Samples: [6300/6300]	Train Loss: 4462.190972222223	Time: 0:00:14.718173: : 100it [26:50, 16.11s/it]
Sampling: [20/20]: : 20it [04:47, 14.37s/it]


In [19]:
ctm.get_topics(10)

defaultdict(list,
            {0: ['targets',
              'june',
              'catalytic',
              'passing',
              'top',
              'knowledge',
              'production',
              'extent',
              'performing',
              'authors'],
             1: ['atoms',
              'assessment',
              'concentrations',
              'informatics',
              'evolutionary',
              'complete',
              'operations',
              'circle',
              'dynamical',
              'infection'],
             2: ['diffraction',
              'right',
              'polarizability',
              'alpha',
              'measured',
              'refraction',
              'real',
              'frequency',
              'spacing',
              'requires'],
             3: ['and',
              'of',
              'the',
              'in',
              'for',
              'that',
              'as',
              'to',
              '

*Beyond the BOW approach - CTM*
___
### Why should we care about CTM?
- contextual information increases topic coherence compared to LDA, as reported in [[4](#5.-Sources)]
- can use pre-trained models for different domains and languages
- multi-language topic modeling [[6](#5.-Sources)]
- there are already implementations (at least for Python) [[7](#5.-Sources)]

# 3. Textual Information as Networks

*Textual Information as Networks*
___
## hSBM - Topic models based on Stochastic Block Models

- Block modeling is a method of community detection used in social network analysis (SNA) [[8](#5.-Sources)]
- the underlying network is a (weighted) bipartite network built from the word-document matrix*
- hierarchical stochastic block modeling (hSBM) [[9](#5.-Sources)] serves as the network-based counterpart to LDA

___
\* Words and documents are nodes that are connected if a word occurs within a document. This way, words can be linked via documents and vice versa. The word frequency is reflected in weighted ties.
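
The network construction described in the footnote can be sketched as a weighted edge list. The toy documents are made up, and the plain dictionary is an assumption for illustration; it is not the data structure used by the hSBM implementation.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = {
    "doc_1": "topic models assign topic labels",
    "doc_2": "block models assign block labels",
}

# Edge list of the bipartite network: (document, word) -> weight (word frequency).
edges = {}
for doc_id, text in docs.items():
    for word, count in Counter(text.split()).items():
        edges[(doc_id, word)] = count

# Words that occur in more than one document link those documents.
word_to_docs = {}
for doc_id, word in edges:
    word_to_docs.setdefault(word, set()).add(doc_id)
shared = {w for w, d in word_to_docs.items() if len(d) > 1}
```

Ties only run between a document node and a word node, never between two documents or two words, which is what makes the network bipartite.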

*Textual Information as Networks - hSBM*
___
<img src="../images/gerlach_tm.png" width="700" height="700">

[[9](#5.-Sources)]

___
### Why should we care about hSBM?

- the model makes fewer assumptions about the topic distribution and is therefore better suited to known properties of textual data such as Zipf's law
- achieves a shorter minimum description length* than LDA in most settings reported in [[9](#5.-Sources)]
- there is a (mostly) ready-to-use implementation in Python: TopSBM [[10](#5.-Sources), [11](#5.-Sources)]
- the number of topics can be inferred from the model
- combines NLP and SNA

___
\* Measures how parsimonious a model is in describing the data, lower is better.

# 4. Discussion

*Discussion*
___
- Why do we (maybe) still use LDA and STM?
- What is missing?
- Should we use different models in the future?
- How should models be compared and validated (for our use cases)?
- ...

# 5. Sources

- [[1]](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.


- [[2]](https://dl.acm.org/doi/10.1145/2133806.2133826) Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.


- [[3]](https://www.jstatsoft.org/article/view/v091i02) Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. "stm: An R package for structural topic models." Journal of Statistical Software 91 (2019): 1-40.


- [[4]](https://arxiv.org/abs/2004.03974) Bianchi, Federico, Silvia Terragni, and Dirk Hovy. "Pre-training is a hot topic: Contextualized document embeddings improve topic coherence." arXiv preprint arXiv:2004.03974 (2020).


- [[5]](https://arxiv.org/abs/1810.04805v2) Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

- [[6]](https://arxiv.org/abs/2004.07737) Bianchi, Federico, et al. "Cross-lingual contextualized topic models with zero-shot learning." arXiv preprint arXiv:2004.07737 (2020).


- [[7]](https://github.com/MilaNLProc/contextualized-topic-models) Contextualized Topic Models on GitHub.


- [[8]](https://methods.sagepub.com/book/the-sage-handbook-of-social-network-analysis/n31.xml) Van Duijn, Marijtje AJ, and Mark Huisman. "Statistical models for ties and actors." The SAGE handbook of social network analysis (2011): 459-483.


- [[9]](https://www.science.org/doi/10.1126/sciadv.aaq1360) Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. "A network approach to topic models." Science advances 4.7 (2018): eaaq1360.


- [[10]](https://topsbm.github.io/) Topic Models based on Stochastic Block Models, blog on GitHub.


- [[11]](https://github.com/martingerlach/hSBM_Topicmodel) hSBM Topic Model on GitHub.