# Lab6.1-Topic Modeling Introduction

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Credits: this notebook is an adaptation of a blog by Shivam Bansal:

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

This lab provides a basic introduction to topic modeling and classification. It consists of the following notebooks:

* Lab6.1-Topic-modeling-introduction.ipynb (this notebook)
* Lab6.2-Topic-modeling-gensim.ipynb (how to create and apply topic models using gensim)
* Lab6.3-Topic-modeling-sklearn.ipynb (how to create topic models using sklearn, plus BERT topic modeling)
* Lab6.4-Topic-classification-BERT.ipynb (how to fine-tune transformer models for topic classification)
* Lab6-assignment-topic-classification.ipynb

We suggest you work through these notebooks in this order. In this notebook, we explain some of the basics.

Topic modeling is a clustering task that groups documents on the basis of their topic. A topic can be defined as a main area of interest that the text is about. The topics are not defined in advance. Topic modeling typically assumes that any set of documents can be split into groups or clusters that use similar words.

An example of topic modeling done at the VU for the National Science Agenda can be seen here:

http://i.amcat.nl/renwa/index.html


There are many approaches for obtaining topics from text, such as methods based on term frequency and inverse document frequency (TF-IDF) and Non-negative Matrix Factorization (NMF). Latent Dirichlet Allocation (LDA) is among the most popular topic modeling techniques.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
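The generative story can be illustrated with a small sketch. The two topics, their word probabilities, and the function name `generate_document` are made up for illustration. In LDA proper, the topic mixture of each document is itself drawn from a Dirichlet distribution; here it is fixed by hand to keep the example short.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "sports": {"ball": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.5, "law": 0.4, "team": 0.1},
}

def generate_document(topic_mixture, length, rng):
    """Generate words by first drawing a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    words = []
    for _ in range(length):
        # Step 1: pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Step 2: pick a word according to that topic's word distribution.
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

rng = random.Random(0)  # fixed seed so the run is repeatable
doc = generate_document({"sports": 0.8, "politics": 0.2}, length=10, rng=rng)
print(doc)
```

LDA works in the opposite direction: it only observes documents like `doc` and tries to recover the topics and the mixture that could have produced them.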

LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. The following matrix shows a corpus of N documents D1, D2, D3 … DN and a vocabulary of M words W1, W2 … WM. The value of cell (i, j) gives the frequency count of word Wj in document Di.

<img src="images/lda2.1.png">
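As a minimal sketch, a document-term matrix for a tiny corpus can be built with the standard library alone. The three example sentences and the whitespace tokenization are assumptions made for illustration; a real pipeline would normally add preprocessing such as lowercasing and stop word removal.

```python
from collections import Counter

# Hypothetical toy corpus of N = 3 documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Naive whitespace tokenization and a sorted vocabulary of M unique words.
tokenized = [d.split() for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})

# Cell (i, j) holds the frequency of word j in document i.
doc_term_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    doc_term_matrix.append([counts[w] for w in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```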


LDA converts this document-term matrix into two lower-dimensional matrices, M1 and M2.
M1 is a document-topic matrix and M2 is a topic-term matrix, with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.

<img src="images/lda2.2.png">

<img src="images/lda2.3.png">
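The dimensions can be checked with a toy example. The values below are invented, with N = 3, K = 2 and M = 4; the helper `matmul` is a plain-Python stand-in for a matrix library. Multiplying M1 by M2 gives an (N, M) matrix of expected word probabilities per document, which is the sense in which the two small matrices approximate the (normalized) document-term matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Document-topic matrix, shape (N, K); each row sums to 1.
M1 = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]

# Topic-term matrix, shape (K, M); each row sums to 1.
M2 = [
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]

# Shape (N, M): one word distribution per document.
approx = matmul(M1, M2)
print(len(approx), len(approx[0]))
for row in approx:
    print([round(p, 2) for p in row])
```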


Notice that these two matrices already provide topic-word and document-topic distributions. However, these initial distributions need to be improved, which is the main aim of LDA. LDA uses sampling techniques to improve these matrices.

It iterates through each word “w” in each document “d” and tries to replace the current topic-word assignment with a new one. A new topic “k” is assigned to word “w” with a probability P, which is the product of two probabilities, p1 and p2.

For every topic, two probabilities p1 and p2 are calculated. p1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. p2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from this word w.

The current topic-word assignment is then replaced by a new topic, chosen with a probability proportional to the product of p1 and p2. In this step, the model assumes that all existing word-topic assignments except the one for the current word are correct. The product is essentially the probability that topic t generated word w, so it makes sense to resample the current word’s topic using it.

After a number of iterations, a steady state is reached in which the document-topic and topic-term distributions no longer change much. This is the convergence point of LDA.
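The sampling loop described above can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not necessarily how libraries implement LDA: the function name `lda_gibbs`, the toy documents, and the default values of `alpha` and `beta` are all made up. The two hyperparameters smooth p1 and p2 so that no topic ever gets zero weight.

```python
import random

def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents given as token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: topic counts per document, word counts per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment; all others stay fixed.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1

                # Weight of each topic: p1 * p2, smoothed by alpha and beta.
                # p1 is left unnormalized because its denominator is the
                # same for every topic.
                weights = []
                for k in range(num_topics):
                    p1 = doc_topic[d][k] + alpha
                    p2 = (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    weights.append(p1 * p2)

                # Sample a new topic in proportion to the weights, restore counts.
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return doc_topic, topic_word

docs = [
    "ball team goal team ball".split(),
    "vote law vote party law".split(),
    "team goal ball goal".split(),
    "party vote law party".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(doc_topic)
```

Each row of `doc_topic` holds the number of words in that document assigned to each topic; dividing a row by its sum gives the document-topic distribution.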

## Parameters of LDA

Alpha and beta hyperparameters: alpha represents document-topic density and beta represents topic-word density. With a higher value of alpha, documents are composed of more topics; with a lower value, documents contain fewer topics. Similarly, with a higher value of beta, topics are composed of a large number of the words in the corpus; with a lower value, they are composed of few words.
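The effect of alpha can be made visible by sampling topic mixtures from a symmetric Dirichlet distribution. The sketch below is an illustration only: the helper names, the choice of 5 topics, and the two alpha values are assumptions. It reports the average weight of the largest topic in a sampled document. The same reasoning applies to beta and the word distribution of a topic.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def average_largest_share(alpha, k=5, samples=1000, seed=0):
    """Average weight of the dominant topic across sampled topic mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(samples)) / samples

low = average_largest_share(alpha=0.1)
high = average_largest_share(alpha=10.0)
print(f"alpha=0.1: {low:.2f}, alpha=10: {high:.2f}")
```

With a low alpha, one topic tends to dominate each document; with a high alpha, the weight is spread more evenly over all topics.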

Number of topics: the number of topics to be extracted from the corpus. Researchers have developed approaches to estimate an optimal number of topics using the Kullback-Leibler (KL) divergence score. We will not discuss this in detail, as it is rather mathematical. For more background, one can refer to this[1] original paper on the use of KL divergence.
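For reference, KL divergence itself is simple to compute for two discrete distributions. The sketch below shows only the definition, not the model-selection procedure from the cited paper; the two example distributions are invented, and the function assumes that `q` is nonzero wherever `p` is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]

print(kl_divergence(topic_a, topic_a))  # identical distributions
print(kl_divergence(topic_a, topic_b))  # clearly different distributions
```

The divergence is zero for identical distributions and grows as the distributions differ more. It is not symmetric: KL(p || q) generally differs from KL(q || p).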

Number of topic terms: the number of terms used to describe a single topic. It is generally chosen according to the task. If the goal is to extract themes or concepts, a higher number is recommended; if the goal is to extract features or terms, a lower number is recommended.
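Selecting the terms of a topic amounts to sorting one row of the topic-term matrix and keeping the highest-weighted entries. The helper `top_terms`, the vocabulary, and the weights below are hypothetical and serve only to illustrate the idea.

```python
def top_terms(topic_term_row, vocabulary, n):
    """Return the n highest-weighted terms of a single topic."""
    ranked = sorted(zip(vocabulary, topic_term_row), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

vocabulary = ["ball", "team", "goal", "vote", "law"]
topic_row = [0.35, 0.30, 0.25, 0.06, 0.04]  # one row of a topic-term matrix

print(top_terms(topic_row, vocabulary, n=3))
```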

Number of iterations / passes: the maximum number of iterations the LDA algorithm is allowed in order to converge.

## End of this notebook