In natural language processing (NLP), topic modeling is a text mining technique that applies unsupervised machine learning to large sets of texts to produce a summary set of terms that represents the collection’s primary topics. Topic models specifically identify co-occurring keywords across a text dataset in order to classify documents among a set of automatically generated topics. In this way, topic models can function as part of a text classification pipeline by thematically annotating collections of documents across text corpora.

## How topic models work

Topic modeling essentially treats each individual document in a collection of texts as a bag of words. This means that the topic modeling algorithm ignores word order and context, focusing only on term frequency and co-occurrence within individual documents. In fact, topic modeling algorithms, namely latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), build on this representation or on its weighted variant, term frequency-inverse document frequency (TF-IDF). TF-IDF is a modification of bag of words intended to address the issues resulting from common yet semantically irrelevant words by accounting for each word’s prevalence across the documents in a text corpus.
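
To make the difference between raw counts and TF-IDF concrete, here is a minimal sketch using only the Python standard library. The three toy documents are made up for illustration, and the formula shown (relative term frequency times the log of inverse document frequency) is only one common TF-IDF variant, not necessarily the one any particular library uses.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized (hypothetical example data).
docs = [
    ["the", "ghost", "of", "christmas", "past"],
    ["the", "ghost", "of", "christmas", "present"],
    ["the", "orphan", "asked", "for", "more"],
]

# Bag of words: per-document term counts; word order is discarded.
bags = [Counter(doc) for doc in docs]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for bag in bags for term in bag)


def tf_idf(term, bag, n_docs=len(docs)):
    """Relative term frequency in one document times log inverse document frequency."""
    tf = bag[term] / sum(bag.values())
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


# "the" occurs in every document, so its weight is zero, while
# "orphan" occurs in only one document and receives a positive weight.
print(tf_idf("the", bags[2]))
print(tf_idf("orphan", bags[2]))
```

A word that appears in every document contributes nothing under this weighting, which is the behavior the paragraph above describes for common yet semantically irrelevant words.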

Topic models are not synonymous with bag of words or TF-IDF, however. While a bag of words merely enumerates the words within a collection of documents, topic models group commonly co-occurring words into sets of topics. Each topic is modeled as a probability distribution or weighting across a vocabulary of words. Each document in the collection is then represented in terms of those topics. In this way, topic models essentially attempt to reverse engineer the discourses (i.e. topics) that produced the documents in question.
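
The idea that a topic is a distribution over words, and a document a mixture of topics, can be sketched in a few lines. Everything below is hypothetical: the two topics, their names, and all weights are invented for illustration rather than learned from any corpus.

```python
# Hypothetical toy model: two topics over a five-word vocabulary.
# Each topic is a probability distribution over words (weights sum to 1).
topics = {
    "ghosts": {"ghost": 0.5, "spirit": 0.3, "chain": 0.2, "orphan": 0.0, "workhouse": 0.0},
    "poverty": {"ghost": 0.0, "spirit": 0.0, "chain": 0.1, "orphan": 0.5, "workhouse": 0.4},
}

# A document is represented as a mixture of topics (weights sum to 1).
doc_topic_mixture = {"ghosts": 0.75, "poverty": 0.25}


def word_probability(word, mixture, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[name][word] for name, weight in mixture.items())


for word in ["ghost", "orphan", "chain"]:
    print(word, round(word_probability(word, doc_topic_mixture, topics), 3))
```

Training a topic model runs this logic in reverse: given only the observed words, it estimates the topic distributions and the per-document mixtures that most plausibly generated them.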

In this tutorial, we’ll use the Natural Language Toolkit (NLTK) and Gensim to generate topic models of Charles Dickens' novels in Python. We will also walk through various text preprocessing techniques (namely tokenization, stopword removal, and lemmatization) in order to improve our final topic models. There are several topic modeling techniques, but the most popular by far is LDA, not to be confused with linear discriminant analysis. As such, this tutorial addresses how to generate an LDA model.

**Prerequisites**

- Create an IBM Cloud® account
- Create a Kaggle account
- Install NLTK

# Steps
# Step 1: Set up your environment

While there are a number of tools to choose from, we’ll walk you through how to set up an IBM account to use a Jupyter notebook. Jupyter notebooks are widely used tools in data science to combine code, text, and visualizations to formulate well-formed analyses.

**Log in to watsonx.ai using your IBM Cloud account.**

**Create a watsonx.ai project.**

        a. Click the hamburger menu at the top left of the screen, and then select Projects > View all projects.
        b. Click the New project button.
        c. Select Create an empty project.
        d. Enter a project name in the Name field.
        e. Select Create.
        
**Create a Jupyter notebook.**

        a. In your project environment, select the Assets tab.
        b. Click the blue New asset button.
        c. Scroll down in the pop-up window and select Jupyter notebook editor.
        d. Enter a name for your notebook in the Name field.
        e. Click the blue Create button.

This will open a notebook environment for you to load your data set and copy code from this tutorial to build and evaluate LDA topic models. In order to view how each block of code affects the data, each step’s code block is best inserted as a separate cell in your watsonx.ai project notebook.


# Step 2: Install and import relevant libraries

We'll need a few libraries for this tutorial, principally NLTK and Gensim. Make sure to import the ones below. If they're not installed, you can resolve this with a quick pip install, included at the top of the code.

In [2]:
# download necessary libraries and packages for our topic modeling algorithm
%pip install nltk -U
%pip install spacy -U
%pip install gensim
%pip install pyldavis
%pip install gutenbergpy

import os
import nltk
import re
import string
import gensim
import numpy as np

# for cleaning prefatory matter from Project Gutenberg texts 
from gutenbergpy import textget

# for tokenization
from nltk.tokenize import word_tokenize
nltk.download("punkt")
nltk.download('wordnet')

# for stopword removal
from nltk.corpus import stopwords
nltk.download('stopwords')

# for lemmatization and POS tagging
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')

# for LDA
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# for LDA evaluation
import pyLDAvis
import pyLDAvis.gensim_models as gensimvisualize


# Step 3: Load data
For this tutorial, we will use a [Charles Dickens corpus](https://www.kaggle.com/datasets/fuzzyfroghunter/dickens) available through Kaggle and sourced from Project Gutenberg. With this dataset, we will train an LDA model and learn basic methods for finetuning the model.

We will be using the Kaggle API to load the dataset directly into our watsonx.ai notebook. This requires creating a free Kaggle account. Once you have done so, you can generate your API key. Directions for generating your key are found in the [Kaggle API documentation](https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md) under the section "API credentials."

Once you have generated your API key, use the following code to install the Kaggle API and load the dataset. Remember to change the username and key strings to your own Kaggle account username and API key.

In [None]:
%pip install kaggle

os.environ["KAGGLE_USERNAME"] = "username"
os.environ["KAGGLE_KEY"] = "apiKey"

!kaggle datasets download fuzzyfroghunter/dickens --unzip

# Step 4: Preprocess data

Before we can generate LDA models of our text collection, we need to reformat the text files. This is necessary, not only to make certain the text is in a machine-readable format for processing by the LDA algorithm, but also in order to reduce noise in the final generated topic models.

In the following script, we've defined a function that removes line breaks and whitespace, formats the text to all lowercase, tokenizes the text, filters out non-alphabetic characters, removes stopwords, and lemmatizes textual data. The script also defines a wordnet_pos_tags() function, which maps the part-of-speech tags assigned by NLTK's tagger to the corresponding WordNet categories; this is a necessary step for lemmatization.

Before proceeding, here is a quick overview of some of these techniques, which are common in many text mining and NLP tasks:

Tokenization - This breaks down unstructured text data into smaller units called tokens that can be read by the machine. A token can range from a single character or individual word to much larger textual units. In this tutorial, we use word tokenization. For a more in-depth guide to tokenization, see the Python tokenization tutorial [link].

Stopword removal - A stoplist is a non-universal list of words removed from text during preprocessing. A stoplist typically consists of the most commonly used words in a language, which are believed to add little value to, and potentially obfuscate, NLP output. Rather than create an original collection of stopwords, we will simply load the NLTK English language stoplist. We also add some corpus-specific words to it using the list's extend() method (stop_words.extend() in the script).

Lemmatization - This is the process of reducing inflectional variants to one base word form. Lemmatization relies on a fairly robust morphological analysis to determine word variants, namely part-of-speech (POS) tagging. POS tagging essentially assigns each word a tag signifying its respective syntactic function. Stemming is another process with the shared aim of reducing inflectional variants, although stemming is a more heuristic process that involves simply stripping suffixes from words. Although this tutorial uses lemmatization, you can gain a more in-depth look at stemming in the Python stemming tutorial [link].

Note the corpus-specific preprocessing techniques in this script. For instance, we use the gutenbergpy Python library on GitHub [link] to remove Project Gutenberg's headers, as well as an original pattern-detection method targeting the afterword of legal matter in Project Gutenberg texts. In filtering out non-alphabetic tokens, we also filter out Roman numerals, as these are often used in the .txt files to signify chapter headings. We've also appended a list of words frequent in the Project Gutenberg transcriptions of Dickens' novels but which may not be included in the NLTK stopword list, such as honorifics or archaic forms of address. This highlights the importance of knowing one's corpus and implementing preprocessing techniques specific to it.
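
Before the full NLTK pipeline, a stripped-down sketch using only the standard library may help show what the filtering steps do. The stoplist and sample sentence here are made up, toy_preprocess() is a hypothetical helper, and lemmatization is omitted entirely; only the Roman-numeral pattern is taken from the tutorial's script.

```python
import re

# Tiny illustrative stoplist (an assumption; the tutorial loads NLTK's English list).
STOP_WORDS = {"the", "a", "of", "and", "was", "to", "in", "mr", "chapter"}

# The same pattern the tutorial uses to drop Roman-numeral chapter headings.
ROMAN_NUMERAL = re.compile(r"^[ivxlcdm]+$")


def toy_preprocess(text):
    """Lowercase, keep alphabetic runs, then drop Roman numerals and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if not ROMAN_NUMERAL.match(t) and t not in STOP_WORDS]


sample = "CHAPTER XII. Mr. Pickwick was in a mild mood."
print(toy_preprocess(sample))  # ['pickwick', 'mood']
```

Note that 'mild' disappears along with 'xii': the Roman-numeral pattern matches any token spelled only with the letters i, v, x, l, c, d, and m. This is a small trade-off of the approach, and one more reason to inspect the output of each preprocessing step against your own corpus.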

In [54]:
import os
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gutenbergpy import textget

# map Penn Treebank POS tags to WordNet POS constants for lemmatization
def wordnet_pos_tags(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# function to standardize, tokenize, and lemmatize loaded .txt files
def txt_preprocess_pipeline(text):
    # read the file contents into a string variable
    working_txt = text.read()
    # strip Project Gutenberg headers
    main_txt = textget.strip_headers(working_txt.encode('utf-8')).decode('utf-8')
    # remove the Project Gutenberg end-of-text marker phrase (case-insensitive)
    main_txt = re.sub(r'end of the project gutenberg', '', main_txt, flags=re.IGNORECASE)
    # lowercase
    standard_txt = main_txt.lower()
    # remove multiple white spaces and line breaks
    clean_txt = re.sub(r'\n', ' ', standard_txt)
    clean_txt = re.sub(r'\s+', ' ', clean_txt)
    clean_txt = clean_txt.strip()
    # tokenize text
    tokens = word_tokenize(clean_txt)
    # remove remaining non-alphabetic tokens (account for roman numerals)
    filtered_tokens_alpha = [word for word in tokens if word.isalpha() and not re.match(r'^[ivxlcdm]+$', word)]
    # load NLTK stopword list
    stop_words = stopwords.words('english')
    # corpus-specific stoplist customization
    stop_words.extend(['thee', 'thou', 'thy', 'ye', 'computer', 'gutenberg', 'http', 'chapter', 'mr', 'mrs', 'ms', 'dr'])
    # remove stopwords
    filtered_tokens_final = [w for w in filtered_tokens_alpha if w not in stop_words]
    # define lemmatizer
    lemmatizer = WordNetLemmatizer()
    # conduct POS tagging
    pos_tags = nltk.pos_tag(filtered_tokens_final)
    # lemmatize word-tokens via assigned POS tags
    lemma_tokens = [lemmatizer.lemmatize(token, wordnet_pos_tags(pos_tag)) for token, pos_tag in pos_tags]
    return lemma_tokens

# function to iterate through .txt files with preprocessing function
def iterate_txt_files(txt_dir):
    texts = []
    for filename in os.listdir(txt_dir):
        if filename.endswith('.txt'):
            with open(os.path.join(txt_dir, filename), 'r', encoding='utf-8') as file:
                txt_tokens = txt_preprocess_pipeline(file)
                texts.append(txt_tokens)
    return texts


Now that we've defined our preprocessing functions, we can iterate through each text. This function will produce a list of lists—the larger list being the corpus as a whole, with each contained sub-list an individual text document. Each item within a sub-list will be a token from that text. Thus, the function output will follow this format:

[['doc1_token1', 'doc1_token2', 'doc1_tokenX' ...],['docX_token1', 'docX_token2', 'docX_tokenX'...]...]

To confirm the text is processed, we will print the first list item, which should be the first tokenized text document.

In [55]:
# specify working directory
work_dir = os.getcwd()
# specify path to text corpus via work_dir
file_dir = f'{work_dir}/dickens'

# iterate through each text
texts = iterate_txt_files(file_dir)
# print first processed text
print(texts[:1])


[['christmas', 'carol', 'charles', 'dickens', 'illustrate', 'george', 'alfred', 'williams', 'new', 'york', 'platt', 'peck', 'baker', 'taylor', 'company', 'illustration', 'tim', 'blood', 'horse', 'way', 'church', 'introduction', 'combine', 'quality', 'realist', 'idealist', 'dickens', 'possess', 'remarkable', 'degree', 'together', 'naturally', 'jovial', 'attitude', 'toward', 'life', 'general', 'seem', 'give', 'remarkably', 'happy', 'feel', 'toward', 'christmas', 'though', 'privation', 'hardship', 'boyhood', 'could', 'allow', 'little', 'real', 'experience', 'day', 'day', 'dickens', 'give', 'first', 'formal', 'expression', 'christmas', 'thought', 'series', 'small', 'book', 'first', 'famous', 'christmas', 'carol', 'one', 'perfect', 'chrysolite', 'success', 'book', 'immediate', 'thackeray', 'write', 'listen', 'objection', 'regard', 'book', 'seem', 'national', 'benefit', 'every', 'man', 'woman', 'read', 'personal', 'kindness', 'volume', 'put', 'forth', 'attractive', 'manner', 'illustration', 
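
Printing an entire novel's tokens is hard to read, so a lighter sanity check can be useful. The following is a minimal sketch with a hypothetical summarize_corpus() helper and made-up toy data; in the notebook you would pass it the real texts list instead.

```python
from collections import Counter


def summarize_corpus(texts, top_n=3):
    """Return (token count, vocabulary size, most common tokens) for each document."""
    summary = []
    for tokens in texts:
        counts = Counter(tokens)
        summary.append((len(tokens), len(counts), counts.most_common(top_n)))
    return summary


# Hypothetical stand-in for the real `texts` list of token lists.
toy_texts = [
    ["christmas", "carol", "ghost", "christmas", "scrooge"],
    ["oliver", "workhouse", "oliver", "fagin"],
]

for n_tokens, n_types, common in summarize_corpus(toy_texts, top_n=1):
    print(n_tokens, n_types, common)
```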

# Step 5: Generate topic models

Now that we've prepared our data, we can create our topic models. Gensim's LdaModel() provides a readily accessible method for topic model generation. Below is a fairly basic script for producing LDA models with Gensim. There are a few customizations, however, that have been included to improve our final predictive models.

The first customization filters the dictionary with filter_extremes(no_below = 2), which removes words that appear in only one text. The remaining customizations are all parameters to Gensim's LdaModel(). Although they are defined in Gensim's documentation, we can provide a brief overview here:

- random_state: This is comparable to 'seed' in many packages and libraries. LDA is probabilistic, not deterministic. This means that, even were we to train models with the same parameters on the same corpus, our models may vary minutely each time. random_state helps mitigate this variation and thereby aid reproducibility.

- chunksize: This specifies the number of texts the modeling function considers at a time. 

- num_topics: The number of topics into which the model distributes words and documents. How to choose an appropriate number of topics is an ongoing area of research.

- passes: The number of times the algorithm passes through the whole corpus. This is comparable to 'epochs' in other packages and libraries.

- iterations: This requires a more thorough understanding of the math behind LDA. To be brief, it specifies the maximum number of times the machine will pass a text through the E-step in the LDA algorithm. You may think of the E-step as the stage of the LDA algorithm in which the machine uses the observed data to estimate the values of unobserved (latent) variables, here each text's topic distribution. You want to set this high enough that these estimates can converge for each text, that is, stop changing meaningfully from one iteration to the next.
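
Before training, the script in this step converts each tokenized text into a bag-of-words vector with corpora.Dictionary and doc2bow(). As a rough illustration of what that conversion produces, here is a pure-Python sketch. The helper names build_dictionary() and to_bow() are hypothetical, and the ids are assigned in order of first appearance, which need not match Gensim's own numbering.

```python
def build_dictionary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Convert one document to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = {}
    for token in tokens:
        if token in token2id:
            idx = token2id[token]
            counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())


# Made-up toy corpus of two tokenized documents.
toy_texts = [["pip", "joe", "forge"], ["joe", "pip", "joe"]]
token2id = build_dictionary(toy_texts)
print(token2id)
print(to_bow(["joe", "pip", "joe", "estella"], token2id))
```

The model never sees the words themselves during training, only these id and count pairs, which is why word order and context play no part in the resulting topics.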

Now let's execute the models. Note that this process may take several minutes.

In [14]:
# load dictionary, filter out one-time words
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below = 2)

# create corpus as BoW
corpus = [dictionary.doc2bow(text) for text in texts]
    
# train LDA model 
lda_model = LdaModel(corpus=corpus, id2word=dictionary, random_state=4583, chunksize=20, num_topics=7, passes=200, iterations=400)

# print LDA topics
for topic in lda_model.print_topics(num_topics=7, num_words=10):
    print(topic)

Your output should look roughly similar to this:

(0, '0.033*"joe" + 0.012*"duke" + 0.012*"pip" + 0.011*"herbert" + 0.011*"hugh" + 0.009*"earl" + 0.008*"edward" + 0.007*"madame" + 0.007*"locksmith" + 0.007*"henry"')
(1, '0.012*"richard" + 0.011*"kit" + 0.010*"quilp" + 0.009*"bounderby" + 0.007*"swiveller" + 0.006*"nell" + 0.005*"leicester" + 0.005*"bucket" + 0.005*"louisa" + 0.004*"dick"')
(2, '0.009*"jasper" + 0.008*"rosa" + 0.006*"neville" + 0.004*"edwin" + 0.004*"parson" + 0.003*"watkins" + 0.003*"minor" + 0.003*"helena" + 0.003*"canon" + 0.003*"baron"')
(3, '0.058*"pickwick" + 0.029*"sam" + 0.024*"nicholas" + 0.022*"weller" + 0.015*"winkle" + 0.012*"ralph" + 0.009*"tupman" + 0.009*"kate" + 0.008*"wery" + 0.007*"wardle"')
(4, '0.045*"dorrit" + 0.043*"clennam" + 0.021*"arthur" + 0.019*"merdle" + 0.017*"scrooge" + 0.016*"fanny" + 0.012*"caleb" + 0.011*"tackleton" + 0.011*"amy" + 0.011*"carrier"')
(5, '0.083*"oliver" + 0.032*"bumble" + 0.027*"jew" + 0.022*"fagin" + 0.015*"noah" + 0.014*"brownlow" + 0.012*"dodger" + 0.010*"sowerberry" + 0.009*"monk" + 0.008*"giles"')
(6, '0.027*"peggotty" + 0.025*"micawber" + 0.023*"bella" + 0.016*"copperfield" + 0.015*"traddles" + 0.013*"fledgeby" + 0.013*"steerforth" + 0.013*"agnes" + 0.012*"murdstone" + 0.011*"venus"')

Again, LDA models are probabilistic, and so your output may differ in small ways, such as distribution values that vary by .002 or so. Nevertheless, if you are using the same parameters and random_state value, your output should generally match this.
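
If you want to work with these topics programmatically rather than read them as strings, one option is to parse the printed format. The sketch below uses a regular expression and a hypothetical parse_topic() helper; it assumes the 'weight*"word" + ...' layout shown in the output above. Gensim's show_topic() method should also return word and weight pairs directly, which avoids string parsing altogether.

```python
import re


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


example = '0.058*"pickwick" + 0.029*"sam" + 0.024*"nicholas"'
print(parse_topic(example))
```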

This output may seem uninformative--it's just a list of names. But, if we are familiar with our dataset, we can make a little sense out of a few topics. For instance, the topic keywords in topic 3 include 'pickwick,' 'sam,' 'weller,' 'winkle,' and 'tupman'. These are all characters in The Pickwick Papers. Similarly, topic 5 lists characters in the novel Oliver Twist (e.g. 'oliver,' 'bumble,' 'jew,' 'fagin,' etc.). Topic 6 consists largely of terms ostensibly lifted from David Copperfield (e.g. 'peggotty,' 'micawber,' 'copperfield'), alongside a few names from Our Mutual Friend ('bella,' 'fledgeby,' 'venus').

Other topics, admittedly, are less decipherable. For instance, topic 0 contains several titles or terms of address--"madame," "earl," and "duke." Yet how are these terms related to Pip, Herbert, and Joe, who are all characters from Great Expectations, or even Edward, a character name found in several Dickens novels?

Topic models can defy ready interpretation. Moreover, to assume topics are valuable only because we can interpret them, or worthless because we cannot, prevents our learning anything new about our dataset through topic models. Perhaps a topic is uninterpretable at the moment because it has provided some new insight into the corpus we did not previously perceive. But how else might we evaluate our topics?

# Step 6: Evaluate models

Researchers use both qualitative and quantitative methods to evaluate models. The former, often employed in real-world use cases, "eyeballs" top key terms to examine each topic's interpretability. This requires significant domain knowledge, however. Quantitative metrics include log-likelihood and coherence score. These measure, respectively, how probable the data are under the model and how coherent the model's topics are. For now, we'll focus on the latter.

Topic coherence is a widely popular method for evaluating LDA topics. As its name suggests, topic coherence aims to measure how coherent, and therefore how interpretable, a topic is. With the measure used in this tutorial, the coherence score is a value between 0 and 1, with 1 being perfect coherence and 0 being none.

Topic coherence methods generally sort each topic's key terms from highest to lowest term weight. They then select the first n terms in each respective topic and measure the degree of similarity of these terms within each topic. How does the algorithm measure similarity? There are myriad methods for measuring topic coherence. The Cv method is one widely adopted method that can be readily implemented via CoherenceModel() in Gensim.

The Cv method passes over our entire corpus, counting term frequency and co-occurrence for the top n terms within each topic. It then uses these values to calculate normalized pointwise mutual information (NPMI) between the top words of each topic. In brief, NPMI is a statistical measure of how much more (or less) often two events co-occur than we would expect if they were independent, scaled to a range of -1 to 1. These NPMI values produce a word vector for each of the top key words considered. Cv then measures the similarity between these vectors using cosine similarity. The final output coherence score is the mean of these similarities.
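
To give a feel for the NPMI step, here is a minimal sketch that computes NPMI for a pair of words from a toy corpus. It is a simplification under stated assumptions: the npmi() helper is hypothetical, co-occurrence is counted at the level of whole documents (Gensim's Cv implementation may count it differently), and the vector and cosine-similarity stages of Cv are left out.

```python
import math


def npmi(word_a, word_b, documents):
    """NPMI from document-level co-occurrence, in the range [-1, 1].

    PMI  = log( P(a, b) / (P(a) * P(b)) )
    NPMI = PMI / -log( P(a, b) )
    """
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_a = sum(word_a in d for d in doc_sets) / n
    p_b = sum(word_b in d for d in doc_sets) / n
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # both words occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Made-up toy corpus of four tokenized documents.
toy_docs = [
    ["pickwick", "sam", "weller"],
    ["pickwick", "sam"],
    ["oliver", "fagin"],
    ["oliver", "bumble"],
]

print(npmi("pickwick", "sam", toy_docs))
print(npmi("oliver", "fagin", toy_docs))
print(npmi("pickwick", "oliver", toy_docs))
```

Words that always appear together score near 1, words that never share a document score -1, and partially overlapping words fall in between.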

As mentioned, we can check our model's coherence score using gensim's CoherenceModel():

In [51]:
coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(coherence_score)



The resultant coherence score should be around 0.442. Admittedly, this is a fairly poor score. Let's look at how we can improve the model's coherence.

# Step 7: Finetuning our LDA model

Improving our LDA model is a matter of tuning its parameters, potentially even cleaning our data. But how do we determine which parameters to tune? And how do we determine the degree by which to tune each?

Perhaps the most crucial factor in improving LDA models is data noise reduction. But we've already done a lot to prepare our data and strip out noisy features. Thus, altering the number of topics is perhaps the next most crucial step.

Determining the appropriate number of topics involves a bit of trial and error. The most desirable approach would be to generate hundreds of models with different numbers of topics, compare their coherence scores, and select the number of topics with the highest coherence. Generating hundreds of models is computationally expensive, however, and so beyond the scope of this tutorial. So for now, let's just select a much higher number of topics: 40.
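
For readers who do want to try the search described above on a small scale, the loop itself is simple. The sketch below uses a hypothetical select_num_topics() helper and separates the search from the scoring so that it runs on its own. The placeholder_score() function is a made-up stand-in, not a real coherence score; in the notebook you would replace it with a function that trains a model and returns its coherence.

```python
def select_num_topics(candidates, score_model):
    """Score each candidate topic count and return (best_k, all_scores)."""
    scores = {k: score_model(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def placeholder_score(k):
    """Made-up stand-in that peaks at k = 30; NOT a real coherence score."""
    return 1 - abs(k - 30) / 100


# In the notebook, the scoring function might instead look like this:
# def score_model(k):
#     model = LdaModel(corpus=corpus, id2word=dictionary,
#                      num_topics=k, random_state=4583)
#     cm = CoherenceModel(model=model, texts=texts,
#                         dictionary=dictionary, coherence='c_v')
#     return cm.get_coherence()

best_k, scores = select_num_topics([10, 20, 30, 40], placeholder_score)
print(best_k)
```

Each candidate requires training a full model, so even a handful of candidates can take a long time with the passes and iterations values used in this tutorial.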

In [59]:
# load dictionary, filter out one-time words
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below = 2)

# create corpus as BoW
corpus = [dictionary.doc2bow(text) for text in texts]
    
# train LDA model 
lda_model = LdaModel(corpus=corpus, id2word=dictionary, random_state=4583, chunksize=20, num_topics=40, passes=200, iterations=400)

# print LDA topics
for topic in lda_model.print_topics(num_topics=40, num_words=10):
    print(topic)

(0, '0.073*"clemency" + 0.046*"alfred" + 0.025*"client" + 0.023*"warden" + 0.012*"michael" + 0.011*"ben" + 0.008*"orchard" + 0.007*"ca" + 0.007*"thimble" + 0.006*"mister"')
(1, '0.039*"richard" + 0.028*"bucket" + 0.027*"leicester" + 0.016*"charley" + 0.015*"caddy" + 0.013*"jo" + 0.010*"trooper" + 0.009*"ca" + 0.008*"ladyship" + 0.007*"chancery"')
(2, '0.000*"battalion" + 0.000*"basilisk" + 0.000*"bedevilment" + 0.000*"beautifullest" + 0.000*"beamingly" + 0.000*"beadles" + 0.000*"begat" + 0.000*"bathing" + 0.000*"barnet" + 0.000*"barrows"')
(3, '0.226*"nicholas" + 0.115*"ralph" + 0.078*"kate" + 0.059*"newman" + 0.035*"tim" + 0.033*"mulberry" + 0.025*"la" + 0.018*"arthur" + 0.018*"madame" + 0.012*"ned"')
(4, '0.000*"battalion" + 0.000*"basilisk" + 0.000*"bedevilment" + 0.000*"beautifullest" + 0.000*"beamingly" + 0.000*"beadles" + 0.000*"begat" + 0.000*"bathing" + 0.000*"barnet" + 0.000*"barrows"')
(5, '0.000*"battalion" + 0.000*"basilisk" + 0.000*"bedevilment" + 0.000*"beautifullest" + 0

You'll see that a lot of the produced topics have distribution values of 0.000 and consist of the same words:

('0.000*"battalion" + 0.000*"basilisk" + 0.000*"bedevilment" + 0.000*"beautifullest" + 0.000*"beamingly" + 0.000*"beadles" + 0.000*"begat" + 0.000*"bathing" + 0.000*"barnet" + 0.000*"barrows"')

This is not atypical when generating topics with an LDA model, especially given the small size of our dataset. As such, ignore these topics for now.
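
Rather than scanning forty lines of output by eye, we can flag these repeated topics automatically. The following is a small sketch with a hypothetical find_duplicate_topics() helper; it assumes input in the (topic id, term string) form shown in the output above, and the shortened strings are for illustration only.

```python
from collections import defaultdict


def find_duplicate_topics(topics):
    """Group topic ids whose printed term strings are identical."""
    groups = defaultdict(list)
    for topic_id, terms in topics:
        groups[terms].append(topic_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Shortened, illustrative stand-ins for the printed topics.
toy_topics = [
    (0, '0.073*"clemency" + 0.046*"alfred"'),
    (2, '0.000*"battalion" + 0.000*"basilisk"'),
    (4, '0.000*"battalion" + 0.000*"basilisk"'),
    (5, '0.000*"battalion" + 0.000*"basilisk"'),
]

print(find_duplicate_topics(toy_topics))  # [[2, 4, 5]]
```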

We can go back and calculate this new model's coherence score using our previous script. Simply run that coherence score script again, and it will calculate the score for this new model. You can see that this new LDA model has a coherence score around .533. So increasing the number of generated topics improved model coherence quite a bit. Nevertheless, while this score is better, it is still undesirable. How else might we improve our model?

# Step 8: Visualize topics

Visualizing a model's topics provides another means of evaluating the model. For instance, by mapping topics onto a coordinate space, we can examine how similar, diverse, and distinct they are in relation to one another. In this way, visualization provides another means of "eyeballing" topic coherence and diversity. A common LDA topic visualization tool is the pyLDAvis library. We can generate our visualization with this code:

In [60]:
dickens_visual = gensimvisualize.prepare(lda_model, corpus, dictionary, mds='mmds')
pyLDAvis.display(dickens_visual)



Your code should generate an interactive display that looks roughly similar to the following image:

[image of pyLDAvis output]

The left-hand space maps our topics while the right-hand bar graph displays the most frequent terms in our corpus. As you move your mouse over different topic circles, the right-hand graph changes. The red bars visualize each term's estimated frequency within the selected topic, while the blue bars indicate that term's overall frequency. If we move our mouse away from any of the topics, the bar graph returns to showing only the most frequent terms in the corpus.

Looking at this graph, we see that some of the most frequent terms are 'pickwick,' 'nicholas,' 'sam,' and so forth. This is, perhaps, unsurprising, given these are the names of main characters in some of Dickens' lengthier novels. The disproportionate frequency with which 'pickwick' appears is all the more understandable given it is the name of a main character and social club around which Dickens' novel The Pickwick Papers centers.

We may remember how such names populated our initial seven topics. If we move our mouse over some of the larger circles in this display (as well as examine the output of forty topics from our newer model), we can see that these same character names have prominent positions in the topics. The prominence of words like 'pickwick' can be detrimental, however. Though central to one novel, its disproportionate frequency grants it too much weight. By contrast, if we examine some of the smaller topic circles, we see these topics are composed of words that appear far less frequently, sometimes as few as fifty times or so (compared to the 2500+ appearances of 'pickwick').

Over-weighted terms like 'pickwick' and rarer terms can constitute noise inhibiting model performance. We already attempted to remove some of these words through stopword removal during preprocessing. Moreover, you'll remember we added the no_below = 2 parameter to filter out words that appear in only one text document. Nevertheless, we may need stricter measures to filter out frequently used words that appear in only two or three texts (such as 'pickwick'). Additionally, it may help to filter out words that appear in every text, such as 'charles' and 'dickens'. These two measures together can potentially tighten the scope of our model. By increasing our no_below value, we can focus on the terms that are frequent throughout the entire corpus, removing terms that may simply have high frequency because of two or three texts. By adding a no_above value, we can filter out terms that appear in every text, such as author and publisher info, that may create noise.

These two parameters are added to the dictionary variable in gensim:

- no_below: This modification directs the machine to ignore words that appear in fewer than a given number of texts. Stopword removal reduces noise by removing the most frequent words that have little semantic meaning, such as articles and conjunctions. This modification similarly helps reduce noise by eliminating words that appear in only a handful of texts, and so tell us little about the corpus as a whole. The no_below value indicates the number of text documents in which a word must appear in order to be considered. We previously set it to two in order to ignore words that appear in only one text. We will now increase it to five to ignore words that appear in four or fewer texts.

- no_above: This modification directs the machine to ignore words that appear in more than a certain proportion of texts in the corpus. While no_below is an absolute number of texts, no_above is a proportion of texts. For example, we can set ours to .9 for now. This means that the machine will ignore words that appear in more than ninety percent of the documents across the entire corpus.
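
The effect of these two thresholds can be sketched without Gensim. The helper below, filter_by_document_frequency(), is a hypothetical approximation of the filtering rule described above: it keeps a word only if it appears in at least no_below texts and in no more than a no_above share of all texts. Gensim's own rounding and its other defaults may differ, and the toy corpus is made up.

```python
from collections import Counter


def filter_by_document_frequency(texts, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in no more
    than a `no_above` fraction of all documents."""
    n_docs = len(texts)
    # Count each token once per document, however often it occurs there.
    doc_freq = Counter(token for tokens in texts for token in set(tokens))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }


# Made-up toy corpus of four tokenized documents.
toy_texts = [
    ["dickens", "pickwick", "street", "pickwick", "pickwick"],
    ["dickens", "pickwick", "street", "oliver"],
    ["dickens", "street", "oliver"],
    ["dickens", "oliver"],
]

kept = filter_by_document_frequency(toy_texts, no_below=3, no_above=0.9)
print(sorted(kept))  # ['oliver', 'street']
```

Here 'dickens' is dropped because it appears in every text, and 'pickwick' is dropped because it appears in only two texts, no matter how many times it occurs within them.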

By adding these two values, we are essentially tightening the semantic scope by which our model generates topics. Run the following script with the added parameters:

In [57]:
# load dictionary, filter out one-time words
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below = 5, no_above= .9)

# create corpus as BoW
corpus = [dictionary.doc2bow(text) for text in texts]
    
# train LDA model 
lda_model = LdaModel(corpus=corpus, id2word=dictionary, random_state=4583, chunksize=20, num_topics=40, passes=200, iterations=400)

# print LDA topics
for topic in lda_model.print_topics(num_topics=40, num_words=10):
    print(topic)

(0, '0.408*"captain" + 0.043*"fisherman" + 0.032*"village" + 0.010*"parlour" + 0.008*"ca" + 0.000*"tarpaulin" + 0.000*"tarnish" + 0.000*"tart" + 0.000*"tantalise" + 0.000*"tar"')
(1, '0.000*"tattered" + 0.000*"taunt" + 0.000*"tantalise" + 0.000*"tape" + 0.000*"taper" + 0.000*"tar" + 0.000*"tarnish" + 0.000*"tarpaulin" + 0.000*"tart" + 0.000*"tastefully"')
(2, '0.000*"tattered" + 0.000*"taunt" + 0.000*"tantalise" + 0.000*"tape" + 0.000*"taper" + 0.000*"tar" + 0.000*"tarnish" + 0.000*"tarpaulin" + 0.000*"tart" + 0.000*"tastefully"')
(3, '0.000*"tattered" + 0.000*"taunt" + 0.000*"tantalise" + 0.000*"tape" + 0.000*"taper" + 0.000*"tar" + 0.000*"tarnish" + 0.000*"tarpaulin" + 0.000*"tart" + 0.000*"tastefully"')
(4, '0.059*"polly" + 0.052*"train" + 0.041*"engine" + 0.041*"line" + 0.036*"junction" + 0.035*"tunnel" + 0.032*"missis" + 0.031*"lamp" + 0.027*"station" + 0.026*"sniff"')
(5, '0.050*"richard" + 0.026*"george" + 0.025*"bucket" + 0.024*"guardian" + 0.019*"miss" + 0.014*"charley" + 0.01

Once the new model is generated, go back and run the coherence score script. It should output a score around .607. So tightening our semantic scope seems to have improved topic coherence. Looking through the topics, some may even make a bit more sense now. For instance, one topic has the following key terms:

('0.196*"king" + 0.035*"duke" + 0.033*"queen" + 0.027*"prince" + 0.026*"england" + 0.022*"parliament" + 0.020*"henry" + 0.018*"army" + 0.015*"john" + 0.015*"edwin"')

This is clearly related to eighteenth-century English politics and government. Another contains the following key terms:

('0.059*"polly" + 0.052*"train" + 0.041*"engine" + 0.041*"line" + 0.036*"junction" + 0.035*"tunnel" + 0.032*"missis" + 0.031*"lamp" + 0.027*"station" + 0.026*"sniff"')

Polly is a character in Dickens' novel Dombey and Son, in which trains serve as a prominent plot device. This topic, then, appears to summarize that novel.

Of course, while a coherence score of .607 is better than our initial score, it is not outstanding. Moreover, other topics generated by this latest model are less interpretable. More finetuning is therefore necessary to improve our model's topics. Some possible approaches may be narrowing the model's semantic scope or altering the number of topics. But in order to finetune our model, it helps to know our model's intended purpose.

# Step 9. Text classification

Up to now, we have looked at how we can finetune our LDA model using a combination of qualitative and quantitative evaluation metrics. But for what purpose have we trained this model? For what task can we use our model's topics? One potential use case is text classification. Using our LDA model, we can classify all of the documents in our collection according to their share in different topics. To view each document's distribution over our model's topics, we can run the following code:

In [52]:
for i, doc in enumerate(corpus):
    doc_topics = lda_model.get_document_topics(doc)
    print(f"Document {i}: {doc_topics}")

Document 0: [(2, 0.48748216), (8, 0.15674685), (13, 0.09316654), (26, 0.074498475), (30, 0.12822926), (33, 0.052366953)]
Document 1: [(2, 0.89645076), (26, 0.032256767), (33, 0.06428684)]
Document 2: [(0, 0.115223795), (2, 0.41612628), (8, 0.2186829), (26, 0.02413957), (33, 0.22508238)]
Document 3: [(2, 0.31269974), (8, 0.06567101), (26, 0.57629734), (33, 0.0452741)]
Document 4: [(1, 0.21820545), (2, 0.3966066), (8, 0.13915363), (26, 0.0510072), (33, 0.19216517)]
Document 5: [(2, 0.6944995), (8, 0.24499126), (33, 0.05377284)]
Document 6: [(2, 0.41154274), (8, 0.122974664), (24, 0.21241716), (26, 0.023879942), (33, 0.22464754)]
Document 7: [(2, 0.4477418), (8, 0.1802635), (13, 0.2113019), (26, 0.032091513), (30, 0.020367874), (33, 0.108048305)]
Document 8: [(0, 0.01657209), (2, 0.32046065), (7, 0.06413563), (13, 0.012753192), (24, 0.09831223), (26, 0.3565997), (30, 0.023541987), (33, 0.10719005)]
Document 9: [(24, 0.6382965), (26, 0.36019194)]
Document 10: [(15, 0.9969169)]
Document 11:

The printed list shows each document's distribution over topics. For example, the distribution over topics for Document 1 is:

Document 1: [(6, 0.82226485), (16, 0.016346056), (26, 0.15616614)]

This means that, according to our model, Document 1 is 82% composed of Topic 6, 2% composed of Topic 16, and 16% composed of Topic 26. This distribution is determined by the number of words from each topic found in Document 1.

We can potentially use these distributions for text classification. For instance, we can organize documents into clusters according to their topic distributions. All documents whose primary share is in a given topic are given one label, while documents whose primary share is in a different topic are given another. An alternative method for classification may be to use the topic distributions as metadata labels. If cataloging these documents in a database, we can label all documents that share in a politically oriented topic as "political texts." These are only two potential approaches for text classification via topic models.
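
The first approach, labeling each document by the topic in which it has its primary share, can be sketched as follows. The distributions are copied from a few lines of the sample output printed earlier in this step, in the (topic id, probability) format that `get_document_topics` returns. The names `doc_topic_distributions`, `dominant_topic`, and `clusters` are illustrative assumptions, not part of Gensim.

```python
from collections import defaultdict

# A few document-topic distributions copied from the sample output above,
# keyed by document number. Each entry is a list of (topic_id, probability).
doc_topic_distributions = {
    1: [(2, 0.89645076), (26, 0.032256767), (33, 0.06428684)],
    3: [(2, 0.31269974), (8, 0.06567101), (26, 0.57629734), (33, 0.0452741)],
    5: [(2, 0.6944995), (8, 0.24499126), (33, 0.05377284)],
    9: [(24, 0.6382965), (26, 0.36019194)],
    10: [(15, 0.9969169)],
}

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Group document numbers under the topic that dominates each document.
clusters = defaultdict(list)
for doc_id, doc_topics in doc_topic_distributions.items():
    clusters[dominant_topic(doc_topics)].append(doc_id)

for topic_id, doc_ids in sorted(clusters.items()):
    print(f"Topic {topic_id}: documents {doc_ids}")
```

Counting how many documents land in each cluster is also a quick way to spot a single topic that dominates the corpus.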

If our topics are not coherent and distinct, however, this may impede classification. One problem with using topic models for classification is that, if the topics are not sufficiently coherent and distinct, every document within a corpus may share in a given topic. To illustrate this, we can review our generated list of document-topic distributions. A cursory glance shows that twenty-one of our thirty-one documents are primarily distributed among Topic 6. Turning back to our word-topic distributions, we see that Topic 6 is defined as:

(6, '0.005*"miss" + 0.004*"oh" + 0.003*"ha" + 0.003*"kit" + 0.002*"winkle" + 0.002*"fellow" + 0.002*"brass" + 0.002*"party" + 0.002*"rejoin" + 0.002*"coach"')

The terms in this topic all have a low probability. This means that the top key words in Topic 6 all have a low probability of appearing in that topic. But Topic 6 also has a high probability across the majority of documents in our corpus. This may suggest that Topic 6 is non-cohesive. In other words, think of Topic 6 as a bucket that contains a lot of words used throughout the corpus but that may not share a coherent and overarching semantic relationship (at least to human users). Indeed, if we regenerate the pyLDAvis visualization for this new model, we can see that Topic 6 (labeled "1" in the visual) occupies the largest and most central space, representing its distribution across the corpus:

[image]

In this visual, we also see that 'miss' is the most salient term in the corpus, with over 5000 instances. Although we included the abbreviated 'ms' in our custom stopword list, we did not account for its non-abbreviated form. The same is true for 'doctor'; our custom stoplist contains 'dr' as a form of address, but not the non-abbreviated 'doctor.' Other top terms in this latest visual that also appear in Topic 6 have little semantic meaning, notably 'oh' and 'ha'. If we sift through our model's other topics, we find archaic or vernacularized forms of words from the NLTK stoplist, such as 'em' (being the vernacular of 'them'). Non-abbreviated forms of address, expressive words, and domain-specific variants of NLTK stopwords are all words we can add to our custom stoplist. This is one additional approach for further finetuning our model.
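
One way to extend the stoplist with the words identified above is sketched below. The variable name `custom_stopwords` and the helper `remove_stopwords` are hypothetical stand-ins, since the names used in your own preprocessing step may differ, and the sample tokens are invented for illustration.

```python
# Stand-in for the existing custom stoplist (assumed to already hold the
# abbreviated forms of address mentioned above).
custom_stopwords = {"ms", "dr"}

# Words identified from the visualization and topic lists: non-abbreviated
# forms of address, expressive words, and a vernacular variant of "them".
additional_stopwords = {"miss", "doctor", "oh", "ha", "em"}
custom_stopwords |= additional_stopwords

def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]

# Invented sample of already-tokenized, lowercased text.
sample_tokens = ["oh", "miss", "havisham", "doctor", "manette", "em", "garden"]
filtered_tokens = remove_stopwords(sample_tokens, custom_stopwords)
print(filtered_tokens)
```

After extending the stoplist, the preprocessing, dictionary, and model training steps would need to be rerun for the change to affect the topics.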

On the other hand, a single topic's looming presence across our corpus may be expected. We are, admittedly, working with a very small corpus. Moreover, all of our documents are novels authored by the same individual, and so born out of the same geo-political era, bearing similar themes, tropes, and language. Given the corpus' small size and singular origin, we can reasonably expect a certain degree of conceptual homogeneity revealed through our model.

This is where domain knowledge is important. When training and finetuning an LDA topic model, one needs to be familiar with the nature of the documents being analyzed. Expected distributions for a corpus of social media posts, for example, may not reflect those of a literary corpus. Model parameters suitable for one dataset will not necessarily extrapolate to another. As this tutorial illustrates, topic modeling is an iterative and domain-specific process of experimentation. The ideal model depends on one's dataset and intended overall purpose.

# Latent semantic analysis

LDA is not the only topic modelling approach. Latent semantic analysis (LSA) is another topic modeling algorithm, and one on which LDA builds. To put it briefly, LSA takes the document-term matrix produced by bag of words or TF-IDF and reduces its dimensions through singular value decomposition (SVD). LSA then calculates the cosine similarity between documents in this reduced matrix. In this way, LSA essentially calculates similarity between documents according to the frequency and co-occurrence of words, and thereby produces keyword topic lists that can be used to classify and sort documents.
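
The pipeline described above (TF-IDF weighting, dimensionality reduction, cosine similarity) can be illustrated with a small dependency-free sketch. This is a toy under stated assumptions, not how a library implements LSA: the four documents are invented, every function name is hypothetical, and power iteration on the document-by-document Gram matrix stands in for a library SVD routine. The eigenvectors of that Gram matrix are the right singular vectors of the term-document matrix, and its eigenvalues are the squared singular values.

```python
import math
from collections import Counter

# Invented toy corpus: two documents about trains, two about royalty.
docs = [
    ["train", "station", "engine"],
    ["train", "engine", "tunnel"],
    ["king", "queen", "parliament"],
    ["king", "queen", "parliament", "army"],
]

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenpairs(matrix, k, iterations=1000):
    """Top-k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        vec = [1.0 / math.sqrt(n)] * n
        for _step in range(iterations):
            nxt = [dot(row, vec) for row in work]
            norm = math.sqrt(dot(nxt, nxt))
            vec = [x / norm for x in nxt]
        eigval = dot(vec, [dot(row, vec) for row in work])
        pairs.append((eigval, vec))
        # Remove the component just found so the next pass finds the next one.
        work = [[work[i][j] - eigval * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return pairs

vectors = tfidf_vectors(docs)
# Gram matrix: dot products between the TF-IDF vectors of every document pair.
gram = [[sum(w * v.get(t, 0.0) for t, w in u.items()) for v in vectors] for u in vectors]

# Document coordinates in the reduced 2-dimensional space (V scaled by S).
pairs = top_eigenpairs(gram, k=2)
reduced = [[math.sqrt(val) * vec[d] for val, vec in pairs] for d in range(len(docs))]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

similarity = [[cosine(u, v) for v in reduced] for u in reduced]
for row in similarity:
    print([round(value, 2) for value in row])
```

In practice, a library SVD routine would replace the hand-written power iteration. The sketch only shows that documents sharing vocabulary end up close together in the reduced space, while documents with disjoint vocabularies do not.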

# Summary and next steps

In this tutorial, you trained and finetuned an LDA topic model with Python's NLTK and Gensim. We have explored both qualitative and quantitative methods for improving our LDA model's topics. We have also introduced topic modelling's potential use in text classification and analysis. You can now experiment with applying these scripts and methods to other text corpora, modifying preprocessing methods and model parameters to fit your own set of documents.

To continue learning, we recommend exploring this content: