# Session 2: Topic Modelling

In this session, we will focus on a popular kind of probabilistic machine learning algorithm. The algorithm itself is called 'Latent Dirichlet Allocation' (LDA), but because it is the most widely used algorithm for 'topic modelling', it is often simply referred to as 'topic modelling'.

The aim of topic modelling is quite intuitive. We show the computer a selection of **documents**, and we ask it: What are these documents about? The computer examines the vocabulary of all the documents, and sorts the words into various **topics**. Now a human thinks of a 'topic' as a real-world thing or process which becomes a 'topic' of conversation. We all stand by a forest, point at it, and talk about how beautiful it is. The forest is the 'topic'. A computer cannot do this, so it approaches the question from another angle. It considers a 'topic' to be a set of words that tend to co-occur. When we talk about forests, we tend to use the words 'tree', 'fox', 'path', 'dark', 'big', 'natural' and so on. We do not tend to use the words 'fricassée', 'printer' or 'transubstantiation'. When we use a computer to perform topic modelling, it sorts all the words in our corpus into clusters of co-occurring words. These clusters are the **topics**.
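
To make the idea of co-occurrence concrete, here is a minimal sketch that counts how many documents each pair of words shares. The four toy documents are invented for illustration, and this simple counting is not what LDA does internally; it only shows the kind of signal that a topic model picks up on.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document is a list of words (invented for illustration)
toy_docs = [
    ["tree", "fox", "path", "dark"],
    ["tree", "path", "natural", "big"],
    ["printer", "paper", "ink"],
    ["printer", "ink", "office"],
]

# Count how many documents contain each unordered pair of words
pair_counts = Counter()
for doc in toy_docs:
    # Sorting makes each pair appear in one canonical (alphabetical) order
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

print(pair_counts[("path", "tree")])     # 2
print(pair_counts[("ink", "printer")])   # 2
print(pair_counts[("printer", "tree")])  # 0
```

Words that share many documents ('path' and 'tree') are candidates for the same topic, while words that never appear together ('printer' and 'tree') are not.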

# The Algorithm

The LDA Algorithm can be summarised diagrammatically:

![LDA plate notation](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)

$M$ denotes the number of documents

$N$ is the number of words in a given document (document $i$ has $N_{i}$ words)

$\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions

$\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution

$\theta_{i}$ is the topic distribution for document $i$

$\phi_{k}$ is the word distribution for topic $k$

$z_{ij}$ is the topic for the $j$-th word in document $i$

$w_{ij}$ is the specific word observed at position $j$ in document $i$.

This diagram explains how the model works *generatively*. This is a generative model, because it learns how to create new documents based on the word/topic mixture of the corpus it is trained on. But topic models are not generally *used* to generate text (partly because they have [no concept of word order](https://en.wikipedia.org/wiki/Bag-of-words_model)). Instead, once the topic model is trained, its internal parameters are examined to inform the user about the structure of the corpus, or the model is applied to the text to provide data for further analysis.
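
The generative story in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration rather than how Gensim implements LDA. It assumes symmetric priors (a single number for $\alpha$ and for $\beta$), and the vocabulary, the function names `sample_dirichlet` and `generate_corpus`, and all parameter values are invented for the example.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, num_topics, doc_lengths, alpha, beta, seed=0):
    """Generate toy documents by following the LDA generative process."""
    rng = random.Random(seed)
    # phi_k: one word distribution per topic, drawn from a Dirichlet(beta) prior
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for n_words in doc_lengths:
        # theta_i: the topic distribution for this document, from Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(n_words):
            z = rng.choices(range(num_topics), weights=theta)[0]  # z_ij: pick a topic
            w = rng.choices(vocab, weights=phi[z])[0]             # w_ij: pick a word
            doc.append(w)
        corpus.append(doc)
    return corpus

vocab = ["tree", "fox", "path", "printer", "ink", "paper"]
toy_corpus = generate_corpus(vocab, num_topics=2, doc_lengths=[8, 5, 10],
                             alpha=0.5, beta=0.5)
for doc in toy_corpus:
    print(" ".join(doc))
```

The output is a jumble of words with no grammar, which is exactly what a bag-of-words model would be expected to produce. Training a topic model runs this story in reverse: given only the words $w_{ij}$, it estimates the hidden $\theta_{i}$, $\phi_{k}$ and $z_{ij}$.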

I step through the diagram in the [slides](slides/topic-modelling.pdf).

# Topic Modelling in Python

Now we have some grasp of what the algorithm is doing, we can learn to apply it in Python using the Gensim package.

If you want to do topic modelling in R, you can get very similar results using [comparable code with the help of the MALLET or LDA packages](https://www.tidytextmining.com/topicmodeling.html).

## Data

For this tutorial we are going to use a small corpus of books from Project Gutenberg, which come included in the Natural Language Toolkit, a very useful text-analysis package for Python. Execute the cell below to import the NLTK and download the Gutenberg books.

In [5]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
gutenberg = nltk.corpus.gutenberg



In [12]:
books = gutenberg.fileids()

print('Downloaded books:')
print("  - " + "\n  - ".join(books))

Downloaded books:
  - austen-emma.txt
  - austen-persuasion.txt
  - austen-sense.txt
  - bible-kjv.txt
  - blake-poems.txt
  - bryant-stories.txt
  - burgess-busterbrown.txt
  - carroll-alice.txt
  - chesterton-ball.txt
  - chesterton-brown.txt
  - chesterton-thursday.txt
  - edgeworth-parents.txt
  - melville-moby_dick.txt
  - milton-paradise.txt
  - shakespeare-caesar.txt
  - shakespeare-hamlet.txt
  - shakespeare-macbeth.txt
  - whitman-leaves.txt


## Preprocessing

We will be performing topic modelling using the Gensim library. It requires that the texts be turned into a 'bag of words' model first. A 'bag of words' model is a very simple model of texts, where each row is a *document*, and each column represents a particular *word*. Let's imagine that document 7 is *Moby Dick* and word 2223 is *whale*. In our big bag-of-words table, we would expect the number in row 7, column 2223 to be high, say $2000$. If document 64 is *Pride and Prejudice*, we would expect the number in column 2223 to be $0$, since whales are never mentioned in that novel.
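
As a rough illustration of such a table, the sketch below builds one by hand for two tiny documents using only the standard library. The documents and the names `toy_docs`, `vocab` and `table` are invented for the example.

```python
from collections import Counter

# Two invented toy documents, already split into words
toy_docs = {
    "doc_a": ["the", "whale", "saw", "the", "ship"],
    "doc_b": ["the", "ball", "was", "a", "success"],
}

# Vocabulary: one column per distinct word, in a fixed (alphabetical) order
vocab = sorted({word for words in toy_docs.values() for word in words})

# One row per document, holding the count of each vocabulary word
table = {}
for name, words in toy_docs.items():
    counts = Counter(words)
    table[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in table.items():
    print(name, row)
```

Most cells in a real table of this kind are zero, which is why Gensim stores each document as a short list of `(word id, count)` pairs instead of a full row.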

When you are topic modelling larger texts, it can be useful to split them into smaller chunks. The NLTK gutenberg corpus has a useful method that splits all the texts into paragraphs. Then we need to 'tokenise' them (split them into individual words), and then convert them into the 'bag of words model' (the big table saying how many times each word appears in each paragraph).

In [None]:
paragraph_sentences = [gutenberg.paras(book) for book in books]

In [None]:
def flatten(paragraph_sentences):
    """Convert each paragraph (a list of sentences) into a flat list of words"""
    flat_paras = []
    for para in paragraph_sentences:
        # Each sentence is itself a list of words, so unpack two levels
        words = [word for sent in para for word in sent]
        flat_paras.append(words)
    return flat_paras

In [None]:
paragraphs = [flatten(ps) for ps in paragraph_sentences]

docs = {}
for book, para_list in zip(books, paragraphs):
    for n, paragraph in enumerate(para_list):
        # Key each paragraph by its book and position, e.g. 'austen-emma.txt_0'
        docs[f"{book}_{n}"] = paragraph

total = 0
print("PARAGRAPHS IN EACH NOVEL:\n")
for book,para in zip(books, paragraphs):
  print(f'{book:25s}  ::  {len(para)}')
  total += len(para)
print(f"\nNUMBER OF DOCUMENTS: {total}")

In [None]:
import gensim
from gensim.corpora import Dictionary # This will create the 'bag of words'

# Initialise the dictionary (this works out the vocab)
dictionary = Dictionary(docs.values())

# Filter out rare words (in fewer than 5 documents) and very common words (in more than 50% of documents)
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=None)

# Create bag-of-words matrix
bag_of_words = [dictionary.doc2bow(doc) for doc in docs.values()]
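
To see what the `no_below` and `no_above` thresholds do, here is a rough pure-Python sketch of filtering by document frequency. It is an illustration of the idea, not Gensim's own code, and details such as rounding may differ. The function name `filter_by_document_frequency` and the toy documents are invented for the example.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep words found in at least `no_below` documents and in at most
    the fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    return {word for word, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["the", "whale", "sea"],
    ["the", "whale", "ship"],
    ["the", "ball", "dance"],
    ["the", "ball", "sea"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['ball', 'sea', 'whale']
```

Here 'the' is dropped because it appears in every document, and 'ship' and 'dance' are dropped because they appear in only one. Words like these tell the model little about which documents belong together.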

## Training

Now that we have our corpus in the proper format, we can initialise and train a topic model on it. This will produce all the different things described in the diagram above: all the different probability distributions describing which topics are likely to appear where and which words are likely to appear in each topic.

In [None]:
from gensim.models import LdaModel

# We need to set some hyperparameters here
num_topics = 50
chunksize = 2000
passes = 20
iterations = 400
eval_every = None

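# Get mapping of word id numbers to the actual words out of the dictionary
# (id2token is only filled in once the dictionary has been accessed, so look up one item first)
_ = dictionary[0]
id2word = dictionary.id2token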

# Now initialise and train the model (in Gensim you do this in one step, rather than defining the model then calling the .fit() method)
topic_model = LdaModel(
    corpus=bag_of_words,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

## Inference

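Now that we have trained the model, we can apply it to a particular paragraph and see how it decomposes the paragraph into topics.

In [None]:
# Topic mixture of the first paragraph in the corpus, as (topic id, probability) pairs
topic_model.get_document_topics(bag_of_words[0])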
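
`get_document_topics` returns a list of `(topic id, probability)` pairs. The sketch below shows one way to pull out the most prominent topics from such a list. The helper name `top_topics_for_document` and the example values are invented for illustration and are not output from the model trained above.

```python
def top_topics_for_document(doc_topics, n=3):
    """Return the n most probable (topic_id, probability) pairs, highest first."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical output in the shape returned by get_document_topics();
# the topic ids and probabilities are invented for this example
example_doc_topics = [(3, 0.12), (17, 0.55), (24, 0.08), (41, 0.25)]

print(top_topics_for_document(example_doc_topics, n=2))  # [(17, 0.55), (41, 0.25)]
```

Note that topics with a very small probability are left out of the list by default, so the probabilities shown for a document may not sum exactly to 1.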


Although we trained the model on the *paragraphs*, we could apply it to a whole book and see how it does.

Should we train the model again on just the books?

In [None]:
# Treat the whole of 'Alice in Wonderland' as a single document
alice_in_wonderland = dictionary.doc2bow(gutenberg.words('carroll-alice.txt'))
alice_topics = topic_model.get_document_topics(alice_in_wonderland)

## Inspecting and Evaluating the Model

The model comes with numerous methods we can use to explore its structure. You can see all the methods that come with the model [in the official documentation](https://radimrehurek.com/gensim/models/ldamodel.html).

In [None]:
# To see the top words for a particular topic (here topic 0)
# topic_model.show_topic(0, topn=10)

# To see the top words for a selection of 10 topics
# topic_model.show_topics(num_topics=10, num_words=10)

# Get the topics ranked by coherence score, highest first
# topic_model.top_topics(corpus=bag_of_words)

# Calculate the per-word likelihood bound of the model on a subset of the corpus
# (This is meaningless unless you have multiple models to compare)
# topic_model.log_perplexity(bag_of_words[:1000])
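
When comparing several models, one common approach is to evaluate each on documents set aside from the corpus. The sketch below shows one possible way to set aside a random sample and to turn the value returned by `log_perplexity` into a perplexity, following the $2^{-\text{bound}}$ formula given in the Gensim documentation. The helper names `split_held_out` and `bound_to_perplexity` are invented for illustration, and the corpus here is a stand-in list of numbers, not real bag-of-words data.

```python
import random

def split_held_out(corpus, held_out_fraction=0.1, seed=0):
    """Randomly split a corpus into a training part and a held-out part."""
    rng = random.Random(seed)
    indices = list(range(len(corpus)))
    rng.shuffle(indices)
    n_held_out = max(1, int(len(corpus) * held_out_fraction))
    held_out = [corpus[i] for i in indices[:n_held_out]]
    train = [corpus[i] for i in indices[n_held_out:]]
    return train, held_out

def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity (lower is better)."""
    return 2 ** (-per_word_bound)

# Stand-in corpus of 20 'documents'; in the notebook this would be bag_of_words
toy_corpus = list(range(20))
train, held_out = split_held_out(toy_corpus, held_out_fraction=0.25)
print(len(train), len(held_out))  # 15 5

# A hypothetical bound of -8.0 corresponds to a perplexity of 256.0
print(bound_to_perplexity(-8.0))  # 256.0
```

With real data, each candidate model would be trained on `train` and scored on `held_out`, so that the models are compared on text none of them has seen.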