<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Topic Modelling

In this exercise, we will learn how to use topic modelling to find the 'latent' topics in a large corpus of text.

By the end of the session, you will have learned:
- how to use the popular *gensim* library to create topic models
- why it is important to choose the correct granularity of entities when calculating topic models
- how to experiment with different parameters to find the optimal topic model
- how to evaluate the quality of the resulting topic models with quantitative methods



In [None]:
'''
step 1. import necessary libraries
'''
import gensim
from gensim import corpora
import string
import os
from pathlib import Path
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
nltk.download('stopwords')
nltk.download('wordnet')

import warnings
warnings.filterwarnings('ignore')

In [None]:
# !python -m spacy download en_core_web_sm

## Getting the Data

We have a few hundred news articles, and we want to know what 'topics' they are talking about. To build a topic model, we first need to build our corpus: we loop through the data directory, read each text file, and append its contents to the corpus. If you need to unzip the data file in Colab, you can use `!unzip file.zip`.

In [None]:
'''
step 2. read in files (from directory) for analysis
'''

# the r prefix makes this a raw string literal, so Windows path backslashes won't cause problems
data_folder = Path(r'datasets/news')

# read each file from the directory into an array and name it corpus
corpus = []
filenames = []

for filename in data_folder.iterdir():
    if filename.is_file():
        # the context manager closes the file even if reading fails
        with open(filename, 'r', encoding='iso8859_2') as fp:
            corpus.append(fp.read())

        # keep the filename for later use
        filenames.append(filename.name)

print(len(corpus))

## Preprocess the Text Data

We will need to preprocess the text data before we can use it to train our model. Typical preprocessing steps include: tokenization, normalization, stopword removal, stemming/lemmatization.
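To make these steps concrete before applying them to the news articles, here is a minimal pure-Python sketch of normalization, tokenization and stopword removal on a single sentence. The tiny stopword set and the helper name `simple_preprocess_demo` are assumptions for illustration only, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Tiny hand-picked stopword list, for illustration only (the notebook uses NLTK's list).
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "on"}

def simple_preprocess_demo(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    # Normalization: lowercase and remove punctuation characters.
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Stopword removal.
    return [tok for tok in tokens if tok not in stopwords]

tokens = simple_preprocess_demo("The price of oil is rising, and markets react!")
print(tokens)
```

Only the content-bearing words survive, which is what we want to feed into the topic model.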

In [None]:
stop = set(stopwords.words('english'))

# Use the spacy stopwords instead
# from spacy.lang.en.stop_words import STOP_WORDS as stopwords
# stop = set(stopwords)

exclude = set(string.punctuation)
lemma = WordNetLemmatizer()


In [None]:
nltk.download('omw-1.4')

In [None]:
def preprocess(doc):
    # lowercase the text and drop punctuation characters
    punc_free = ''.join([ch for ch in doc.lower() if ch not in exclude])
    # remove stopwords
    stop_free = ' '.join([i for i in punc_free.split() if i not in stop])
    # reduce each remaining word to its lemma (treated as a noun by default)
    normalized = ' '.join(lemma.lemmatize(word) for word in stop_free.split())
    #stemmed = ' '.join(stemmer.stem(word) for word in normalized.split())
    return normalized

processed_docs = [preprocess(doc).split() for doc in corpus]

In [None]:
print(processed_docs[0])

In [None]:
print(corpus[0])

Once we have the preprocessed list of tokens, we can then create a Dictionary, which is basically a word to integer id mapping.

In [None]:
# Build the dictionary (optionally removing rare and common tokens).
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(processed_docs)

# Optional: filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
#dictionary.filter_extremes(no_below=20, no_above=0.5)

Finally, we transform the documents into a vectorized form (also known as a document-term matrix). Here we convert each document to a bag-of-words, which is just the frequency count of each word in that document.
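The idea behind the dictionary and the bag-of-words conversion can be sketched in plain Python. This is an illustration of the concept, not gensim's implementation: the two toy documents and the helper `to_bow` are hypothetical, and the ids assigned here may differ from the ones gensim would choose.

```python
from collections import Counter

# Two toy tokenized documents (made up for illustration).
docs = [
    ["oil", "price", "rise", "oil"],
    ["market", "price", "fall"],
]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows_demo = [to_bow(doc) for doc in docs]
print(token2id)
print(bows_demo)
```

Each document becomes a short list of `(token_id, count)` pairs, and words that do not occur in a document are simply absent from its list.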

In [None]:
bows = [dictionary.doc2bow(processed_doc) for processed_doc in processed_docs]

In [None]:
print(bows[5])

Let’s see how many tokens and documents we have to train on.

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(bows))

## Train LDA Model

We are ready to train the LDA model. We will first discuss how to set some of the training parameters.

First question: how many topics do I need? There is really no easy answer for this; it depends on both your data and your application. Here we use 5, but you can choose any number. We will see later how we can use perplexity or coherence to decide on the 'optimal' number of topics.

`chunksize` controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory.

`passes` controls how many times we train the model on the entire corpus. Another word for passes might be “epochs”.

`iterations` is somewhat technical, but essentially it controls how often we repeat a particular inference loop over each document. It is important to set the number of “passes” and “iterations” high enough for the model to converge.
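As a rough illustration of how `chunksize` and `passes` interact, the sketch below counts how many chunks the trainer processes in total. It assumes single-process training in which the model is updated once per chunk; the helper `count_chunk_updates` and the corpus size of 600 are hypothetical, not values taken from this notebook.

```python
import math

def count_chunk_updates(num_docs, chunksize, passes):
    """Total number of chunks processed: chunks per pass times number of passes."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass * passes

# Hypothetical corpus of 600 documents: 3 chunks per pass, 20 passes.
total_updates = count_chunk_updates(num_docs=600, chunksize=250, passes=20)
print(total_updates)
```

When `chunksize` is larger than the corpus, there is only one chunk per pass, which is one reason `passes` needs to be high enough on small datasets.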

In [None]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 250
passes = 20
iterations = 50
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
id2word = dictionary

np.random.seed(10)

ldamodel = LdaModel(
    corpus=bows,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

The above LDA model is built with 5 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword below.

In [None]:
ldamodel.print_topics()

We can also obtain, for every document, its topic ids with the corresponding probabilities.
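Once each document has a list of `(topic_id, probability)` pairs, a common next step is to group documents by their dominant topic. The sketch below shows one way to do that; the file names and probabilities are made up for illustration and are not output from the model.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions as (topic_id, probability) pairs.
doc_topics = {
    "a.txt": [(0, 0.10), (2, 0.85), (4, 0.05)],
    "b.txt": [(1, 0.60), (3, 0.40)],
    "c.txt": [(2, 0.55), (0, 0.45)],
}

docs_by_topic = defaultdict(list)
for name, dist in doc_topics.items():
    # The dominant topic is the pair with the highest probability.
    top_id, top_prob = max(dist, key=lambda pair: pair[1])
    docs_by_topic[top_id].append(name)

print(dict(docs_by_topic))
```

Reading a few documents from each group is a quick sanity check on whether the topics make sense.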

In [None]:
print('\nFile name and its corresponding topic id with probability:')
for index, bow in enumerate(bows):
    # get the topic distribution of this document from the LDA model
    t = ldamodel.get_document_topics(bow)
    # sort by probability in descending order so the top contributing topic comes first
    sorted_t = sorted(t, key=lambda x: x[1], reverse=True)
    # print the filename with its sorted topic distribution
    print(filenames[index], sorted_t)


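One way to measure our model is the perplexity score: the lower the perplexity, the better. Note that gensim's `log_perplexity` returns the per-word likelihood bound rather than the perplexity itself, so a higher (less negative) printed value corresponds to a lower perplexity.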
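The relation between the per-word bound returned by `log_perplexity` and the perplexity, which gensim's documentation gives as 2^(-bound), can be sketched with two made-up bound values. These numbers are not results from this corpus, and `bound_to_perplexity` is a hypothetical helper name.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Made-up bound values, for illustration only.
for bound in (-8.0, -7.5):
    print(bound, bound_to_perplexity(bound))
```

The less negative bound gives the smaller perplexity, so when comparing models on this value, closer to zero is better.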

In [None]:
# log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound)
per_word_bound = ldamodel.log_perplexity(bows)
print(per_word_bound)

We can also use the coherence score to measure our model. The higher the coherence score, the better.
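To give some intuition for what a coherence score rewards, here is a simplified, UMass-style measure written in plain Python. It is a different and much simpler measure than the `c_v` coherence used in this notebook, and it is not gensim's implementation; the function name and toy documents are assumptions for illustration. The idea is that a topic's top words should tend to occur in the same documents.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of a topic's top words.

    D(...) counts the documents that contain all the given words.
    Every word in top_words is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    # combinations keeps list order, so w_j is ranked above w_i.
    for w_j, w_i in combinations(top_words, 2):
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j)))
    return sum(scores) / len(scores)

toy_docs = [
    ["oil", "price", "market"],
    ["oil", "price"],
    ["football", "goal", "match"],
    ["goal", "match", "oil"],
]

coherent = umass_style_coherence(["oil", "price", "market"], toy_docs)
mixed = umass_style_coherence(["oil", "football", "goal"], toy_docs)
print(coherent, mixed)
```

The word list drawn from a single theme scores higher than the list that mixes two themes, which mirrors how a higher coherence score suggests more interpretable topics.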

In [None]:
from gensim.models.coherencemodel import CoherenceModel

lda_coherence = CoherenceModel(model=ldamodel,
                               texts=processed_docs,
                               dictionary=dictionary,
                               coherence='c_v')
coherence_score = lda_coherence.get_coherence()
print(coherence_score)

## Optimal Number of Topics

We can search for the number of topics that gives the best scores, that is, the highest coherence and the lowest perplexity, using the following code.

In [None]:
def compute_coherence_values(id2word, corpus, texts, limit, start=2, step=3):

    coherence_values = []
    perplexity_values = []
    topics_num = []
    model_list = []

    for num_topics in tqdm(range(start, limit, step)):
        np.random.seed(10)
        ldamodel = LdaModel(
            corpus=corpus,
            id2word=id2word,
            chunksize=250,
            alpha='auto',
            eta='auto',
            iterations=50,
            num_topics=num_topics,
            passes=20,
            eval_every=None  # skip perplexity evaluation during training
        )
        model_list.append(ldamodel)
        coherencemodel = CoherenceModel(model=ldamodel, texts=texts,
                                        dictionary=id2word, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        perplexity_values.append(ldamodel.log_perplexity(corpus))
        topics_num.append(num_topics)

    return model_list, coherence_values, perplexity_values, topics_num

In [None]:
# search through k-topics in steps
start, limit, step = 1, 7, 1
# start, limit, step = 5, 50, 5

model_list, coherence_values, perplexity_values, topics_num = compute_coherence_values(id2word,
                                                                           corpus=bows,
                                                                           texts=processed_docs,
                                                                           start=start, limit=limit, step=step)

In [None]:
# Show Perplexity and Coherence graph
x = range(start, limit, step)

fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('Num Topics')
ax1.set_ylabel('Log Perplexity (per-word bound)', color=color)
ax1.plot(x, perplexity_values, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Coherence Score', color=color)
ax2.plot(x, coherence_values, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout() # otherwise the right y-label is slightly clipped
plt.show()

If we prefer, we can view the scores in a pandas DataFrame.

In [None]:
import pandas as pd
topics_df = pd.DataFrame({"Topics Num": topics_num, "Coherence Value": coherence_values, "Perplexity Value": perplexity_values})
topics_df
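Picking the candidate with the highest coherence can also be done programmatically. The sketch below uses made-up scores for illustration; in the notebook the two lists would be `topics_num` and `coherence_values`. Treat the result as a starting point, since the highest-scoring model still needs a human check of its topics.

```python
# Made-up scores for illustration only.
demo_topics_num = [2, 3, 4, 5, 6]
demo_coherence = [0.41, 0.47, 0.52, 0.49, 0.45]

# Index of the highest coherence value, then the matching number of topics.
best_index = max(range(len(demo_coherence)), key=demo_coherence.__getitem__)
best_k = demo_topics_num[best_index]
print(best_k, demo_coherence[best_index])
```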

## Visualizing Topic Model

We can visualize the LDA results using a nice tool called pyLDAvis.

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

In [None]:
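import pyLDAvis
try:
    # pyLDAvis 3.3 and later renamed the gensim helper module
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(ldamodel, bows, dictionary)
pyLDAvis.display(vis_data)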

Can you guess what each of the identified topics is about, based on your 'human' interpretation?

### Exercise 1

1. What do you observe about the top words of these topics? Are they distinctive enough to tell what each topic is about? What can you do with these words to improve the model?

<details><summary>Click here for answer</summary>
    
We can see that words like "said", "mr", "year" and "people" appear as top words across topics. These words are not very useful for differentiating between the topics, so we should remove them. We can do so by treating them as stopwords and updating our stopword list, then re-running the preprocessing and training cells.

```python

domain_stopwords = ["said", "year", "mr", "people"]
stop.update(domain_stopwords)

```
    