# Topic Modeling in Gensim

## Introduction to Topic Modeling

Topic modeling is a less transparent process than, for example, TF-IDF. TF-IDF is rules-based and its algorithm is easy to inspect, whereas topic modeling is an unsupervised, probabilistic approach.

Topic modeling groups words based on co-occurrence probabilities. It sorts the words from a collection of documents into clusters of words that are likely to co-occur, i.e. words that often appear close to one another. These clusters are the topics.

You decide in advance how many topics make up the documents. The model then estimates the probability that each word in the documents belongs to each topic. It iterates over the corpus repeatedly, adjusting its probability scores each time.

As with other text analysis methods, there are different algorithms for implementing the topic modeling process. One of the most popular implementations of topic modeling is LDA (Latent Dirichlet Allocation).

LDA treats each document as a mixture of topics in certain proportions, and each topic as a mixture of keywords, again in certain proportions. Once you provide the algorithm with the number of topics, all it does is rearrange the distribution of topics within the documents and the distribution of keywords within the topics until it reaches a good composition of topic-keyword distributions. LDA is non-deterministic: unless you fix the random seed, each run can produce a different result. Fortunately, the results usually converge towards a stable state. Run it several times and compare the results.
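To make the idea of "topics in a certain proportion" concrete, here is a minimal sketch in plain Python. The two topics, their word probabilities and the document's topic proportions are all made-up numbers, not output from a real model. The sketch only shows the arithmetic that links the two kinds of distribution: the probability of a word in a document is the sum, over topics, of the topic's share of the document times the word's share of the topic.

```python
# Illustrative, made-up numbers: two topics over a four-word vocabulary.
topic_word = {
    "astronomy": {"star": 0.4, "planet": 0.3, "telescope": 0.3, "film": 0.0},
    "cinema": {"star": 0.3, "planet": 0.0, "telescope": 0.0, "film": 0.7},
}

# One document described as a mixture of the two topics.
doc_topic = {"astronomy": 0.75, "cinema": 0.25}


def word_probabilities(doc_topic, topic_word):
    """Return p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    vocabulary = next(iter(topic_word.values())).keys()
    return {
        word: sum(doc_topic[topic] * topic_word[topic][word] for topic in doc_topic)
        for word in vocabulary
    }


doc_word = word_probabilities(doc_topic, topic_word)
for word, prob in sorted(doc_word.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {prob:.3f}")
```

LDA works in the opposite direction: it only sees the words in the documents and tries to infer both tables at once.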

This notebook covers the LDA implementation in Gensim. [MALLET](https://mimno.github.io/Mallet/) is another popular LDA implementation: some argue it is more efficient and gives better topic segregation. It’s also quite intuitive to use, but it’s difficult to set up through JupyterHub.


**Topic Modeling Outputs** 

In this notebook we’ll look at two main topic modeling outputs:

_Topics (probability distributions over the words in the corpus)_: each topic is a list of terms that are likely to co-occur frequently, with each word being more or less characteristic of the topic

_Distributions of topics over documents_: lists the most characteristic topics for each document and how strongly the document is associated with each topic

> “A topic model provides a different perspective on a collection. It creates a set of probability distributions over the vocabulary of the collection, which, when combined together in different proportions, best match the content of the collection. We can sort the words in each of these distributions in descending order by probability, take some arbitrary number of most-probable words, and get a sense of what (if anything) the topic is “about.” Each of the text segments also has its own distribution over the topics, and we can sort these segments by their probability within a given topic to get a sense of how that topic is used.” (Nguyen et al. “How We Do Things With Words”, p. 9)


**Interpreting topic models**

Topic modeling is useful to get a sense of the patterns across a corpus and to identify patterns you might want to look into further. 

> “Topic models (e.g., LDA, Blei et al., 2003) are usually unsupervised and therefore less biased toward human-defined categories. They are especially suited for insight-driven analysis, because they are constrained in ways that make their output interpretable. Although there is no guarantee that a “topic” will correspond to a recognizable theme or event or discourse, they often do so in ways that other methods do not. Their easy applicability without supervision and ready interpretability make topic models good for exploration. Topic models are less successful for many performance-driven applications. Raw word features are almost always better than topics for search and document classification. LSTMs and other neural network models are better as language models. Continuous word embeddings have more expressive power to represent fine-grained semantic similarities between words.” (Nguyen et al. “How We Do Things With Words”, p. 9)

Topic modeling was introduced to the wider public as an information retrieval and categorization tool: as a way of categorizing documents by identifying their overarching themes. This created a sense that topic modeling outputs coherent categories that represent themes in documents. 

But it’s important to keep in mind that topic models identify consistent patterns across a corpus. They will identify features of language that are likely to co-occur frequently. These patterns can seem like “topics”, like coherent themes, because (according to the distributional hypothesis) words that share the same context, words that frequently occur close together, generate meaning and are proxies of meaning. When a document contains words such as “telescope”, “sky”, “planet”, “star”, this can indicate that the document probably relates to astronomy. However, a document that contains words such as “star”, “famous”, “fans”, “film” suggests a different kind of meaning, or a different topic being discussed. Topic modeling picks up on these different groupings. 

> “Indeed calling these models “topic models” is retrospective — the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA.” (Blei, “Probabilistic Topic Models” footnote p. 79)
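The “star” example above can be sketched as a simple co-occurrence count. This is not how LDA itself works, just an illustration of the statistical signal it relies on. The four mini documents are invented for the purpose.

```python
from collections import Counter
from itertools import combinations

# Four made-up mini "documents", already tokenized.
docs = [
    ["telescope", "sky", "planet", "star"],
    ["planet", "star", "sky"],
    ["star", "famous", "fans", "film"],
    ["film", "fans", "famous"],
]

# Count in how many documents each pair of words appears together.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1


def cooccurring_with(word):
    """Return the words that share at least one document with `word`, with counts."""
    neighbours = Counter()
    for (first, second), count in pair_counts.items():
        if first == word:
            neighbours[second] += count
        elif second == word:
            neighbours[first] += count
    return neighbours


print(cooccurring_with("star").most_common())
print(cooccurring_with("film").most_common())
```

“star” co-occurs with both groups of words, while “film” never appears alongside “planet” or “telescope”. That kind of asymmetry is what lets a model separate the two senses into different topics.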

But consistent patterns of words don’t necessarily simply equate with themes or topics, what a text is “about”. If there is one document or set of documents in your corpus that consistently repeats particular words or phrases this will be picked up as a pattern. Sometimes it’s significant, sometimes it’s not. For example:

- If you are comparing documents of different types (e.g. newspapers v transcripts, or plays v novels, etc.), models will likely pick up on the different language conventions that typify each genre, which may or may not be relevant for your analysis. Similarly, if some documents contain distinctive language or dialect features, these can be grouped together as a topic. 

> “Longer or extended poems that outsize the majority of other documents in the subset pull one or more topics toward language specific to that particular poem. (…) one poem with high levels of repetition can pull a topic away from the rest of the corpus, along with other poems with high frequency repetitions of particular phrases” (Rhody, “Topic Modeling and Figurative Language”)

- Models can pick up on recurring preprocessing errors, e.g. if there are OCR misspellings or words repeatedly split at the end of lines 

> “After fitting the model, it may be necessary to circle back to an earlier phase. Topic models find consistent patterns. When authors repeatedly use a particular theme or discourse, that repetition creates a consistent pattern. But other factors can also create similar patterns, which look as good to the algorithm. We might notice a topic that has highest probability on French stopwords, indicating that we need to do a better job of filtering by language. We might notice a topic of word fragments, such as “ing,” “tion,” “inter,” indicating that we are not handling end-of-line hyphenation correctly. We may need to add to our stoplist or change how we curate multi-word terms.” (Nguyen et al. “How We Do Things With Words”, p. 10)
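If a topic of word fragments does show up, one possible fix is to rejoin hyphenated line breaks before tokenizing. The function below is a minimal sketch of that idea. Its name and the regular expression are our own choices, and the sample sentence is invented.

```python
import re


def join_hyphenated_line_breaks(text):
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, optional spaces, a newline, optional spaces,
    # then a lowercase letter: treat it as one word broken across two lines.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*([a-z])", r"\1\2", text)


sample = "The court was an inter-\nnational institu-\ntion, he said."
print(join_hyphenated_line_breaks(sample))
```

Note that this will also glue together genuine compounds that happen to break at a line end (e.g. "well-" / "known" becomes "wellknown"), so it is worth checking the results on your own texts.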

**Tuning Topic Models**

One of the main challenges with topic modeling is generating outputs that are meaningful and relevant. Two of the most important factors in generating meaningful topics are preprocessing and the number of topics. There is no general rule for how many topics is a good number. This will take some tuning and experimentation.

> “One of the most common questions about topic models is how many topics to use, usually with the implicit assumption that there is a “right” number that is inherent in the collection. We prefer to think of this parameter as more like the scale of a map or the magnification of a microscope. The “right” number is determined by the needs of the user, not by the collection. If the analyst is looking for a broad overview, a relatively small number of topics may be best. If the analyst is looking for fine-grained phenomena, a larger number is better.” (Nguyen et al. “How We Do Things With Words”, p. 9-10)

# Imports

In [None]:
import re
import spacy
!python -m spacy download en_core_web_md

from pprint import pprint

from pathlib import Path  
import glob

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


#import pyLDAvis
#import pyLDAvis.gensim_models as gensimvis

## 1-Setting up for Building the Models

There's some set-up involved for using Gensim to create topic models. 

If we follow some preprocessing steps (most of which should be familiar to us by now) this will increase the quality of the models. 

Gensim also expects the data to be structured in a certain way to generate models (as a dictionary and corpus created from that dictionary).

We'll go over these steps which involve: preprocessing then making a dictionary and corpus.

### Preprocessing

**Lemmatize the files**

In [None]:
#This loops over multiple files in a directory,
#but the kernel might crash if it runs out of memory.
#If the kernel crashes, you might have to lemmatize one file at a time (cf. below)

#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

#Collect the texts to lemmatize, skipping output files from earlier runs of this cell
filepath = 'kafka-corpus'
text_files = [file for file in glob.glob(f'{filepath}/*.txt')
              if not file.endswith('-lemmatized.txt')]

#Loop through the files and open each one as a spaCy document
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    print(file)
    document = nlp(text)
        
    #Lemmatize
    outname = file.replace('.txt', '-lemmatized.txt')
    with open(outname, 'w', encoding='utf8') as out:   
        for token in document:
            # Get the lemma for each token
            out.write(token.lemma_.lower())
            # Insert white space between each token
            out.write(' ')

In [None]:
#Lemmatize a single file
#The kernel may crash on large single files
#Either chunk the text into smaller units and lemmatize each chunk,
#or skip the lemmatization phase

#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

#Open your text and create spaCy document
filepath = 'kafka-corpus/kafka_the-trial.txt'
with open(filepath, 'r', encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')
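The comment above suggests chunking large texts before lemmatizing them. Here is one possible way to do that, as a sketch in plain Python: it splits on blank lines so that paragraphs stay whole. The function name `chunk_text` and the default `max_chars` value are our own assumptions, not something spaCy prescribes.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into chunks of whole paragraphs, each at most max_chars long.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = []
    current_length = 0
    for paragraph in text.split("\n\n"):
        # +2 accounts for the blank line that rejoins the paragraphs
        added_length = len(paragraph) + (2 if current else 0)
        if current and current_length + added_length > max_chars:
            chunks.append("\n\n".join(current))
            current = []
            current_length = 0
            added_length = len(paragraph)
        current.append(paragraph)
        current_length += added_length
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_text(sample, max_chars=40):
    print(repr(chunk))
```

Each chunk can then be passed to `nlp` in turn, with the lemmas from every chunk written to the same output file.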

**Tokenize your text either using gensim built-in tokenizing or using your own tokenizing function**

In [None]:
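# Tokenize using gensim built-in tokenization

#Put all texts into a single list
#Loop through the texts and tokenize them with gensim's simple_preprocess
#Note: "*.txt" also matches the "-lemmatized.txt" files created above;
#use "*-lemmatized.txt" instead to read only the lemmatized versions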
directory_path = 'kafka-corpus/'
all_docs = []

for filepath in Path(directory_path).glob("*.txt"):
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        tokenized_text = gensim.utils.simple_preprocess(text)
        all_docs.append(tokenized_text)

#See the first document as tokenized list of words
all_docs[0]

In [None]:
# Tokenize using custom tokenizing function

#Put all texts into a single list
#Loop through the texts and tokenize them with custom tokenizing function
from pathlib import Path
directory_path = 'kafka-corpus/'
all_docs = []

def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    tokenized = [word for word in split_words if word.isalpha()]
    return tokenized

for filepath in Path(directory_path).glob("*.txt"):
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        tokenized_text = tokenize(text)
        all_docs.append(tokenized_text)

#See the first document as tokenized list of words
all_docs[0]
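To see what the custom tokenizing function actually keeps and drops, it helps to run it on a single sentence. The sketch below repeats the same logic on an invented sample sentence.

```python
import re


def tokenize(text):
    """Same logic as the custom tokenizing function above."""
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    return [word for word in split_words if word.isalpha()]


sample = "Josef K. hadn't done anything wrong; he was arrested in 1914."
print(tokenize(sample))
```

Punctuation and the number are dropped, but the contraction is split into "hadn" and "t" and the initial "K." survives as a one-letter token. By default, gensim's `simple_preprocess` additionally discards tokens shorter than 2 or longer than 15 characters, which is one reason the two approaches can give slightly different results.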

**Remove stopwords**

In [None]:
#Stopwords: refer to "Preprocessing" notebook for more details on stopwords

#Load custom stopwords list (this one is the default spaCy list)
#Open your txt file and convert it to a Python list, skipping empty lines
with open("custom-stopwords.txt", "r", encoding="utf-8") as file_object:
    custom_stopwords = [line.strip() for line in file_object if line.strip()]

custom_stopwords

In [None]:
def remove_stopwords(list_of_tokens, stopwords):
    #Convert to a set so that each membership check is fast
    stopwords = set(stopwords)
    return [token for token in list_of_tokens if token not in stopwords]

all_docs_no_stop = []

for doc in all_docs: 
    nostop = remove_stopwords(doc, custom_stopwords)
    all_docs_no_stop.append(nostop)
    
all_docs_no_stop[0]
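After removing stopwords, it can be useful to check which tokens are still the most frequent: very common words that carry little meaning in your corpus are candidates for the custom stopwords list. Below is a minimal sketch. The helper name `most_common_tokens` is our own and the two example documents are made up; in the notebook you would pass `all_docs_no_stop` instead.

```python
from collections import Counter


def most_common_tokens(docs, top_n=20):
    """Return the top_n most frequent tokens across all documents, with counts."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts.most_common(top_n)


# Made-up tokenized documents standing in for all_docs_no_stop
example_docs = [
    ["k", "say", "court", "say", "door"],
    ["say", "castle", "k", "village"],
]
print(most_common_tokens(example_docs, top_n=3))
```

Whether a frequent word such as "say" should be removed is a judgment call that depends on your research question.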

## Creating Bigrams and Trigrams

Bigrams are two words frequently occurring together that need to be grouped together to make sense (e.g. "black hole", "European Union"). Trigrams are three words frequently occurring together that need to be grouped together to make sense. Identifying bigrams and trigrams in our corpus can improve the quality of the models.

In [None]:
# Build the bigram and trigram models
# min_count: minimum number of times words occur together to be considered a bigram
# threshold: the higher the number, the fewer ngrams will be identified
bigram = gensim.models.Phrases(all_docs_no_stop, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[all_docs_no_stop], threshold=100)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

data_bigrams = make_bigrams(all_docs_no_stop)
#make_trigrams applies the bigram model itself, so pass it the original tokens
data_bigrams_trigrams = make_trigrams(all_docs_no_stop)

#You can find the ngrams by searching for words linked with an underscore
#(command + F and search for underscore)
#If you're not satisfied with the ngrams you're getting (capturing too many
#or too few), then modify the min_count and threshold parameters
print(data_bigrams_trigrams[0])
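To get a feel for how `min_count` and `threshold` interact, here is a sketch of the phrase score from Mikolov et al. (2013), which, as far as we understand, gensim's default scorer is based on. The function name `phrase_score` and all the counts are invented for illustration; check the gensim documentation for the exact scoring used by your version.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram "a b" (formula from Mikolov et al. 2013).

    count_a, count_b: how often each word occurs in the corpus
    count_ab: how often the two words occur next to each other
    vocab_size: number of distinct entries in the vocabulary
    """
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Made-up counts for two candidate pairs that co-occur equally often
vocab_size = 10_000
rare_pair = phrase_score(count_a=40, count_b=30, count_ab=25,
                         vocab_size=vocab_size, min_count=5)
common_words = phrase_score(count_a=900, count_b=1200, count_ab=25,
                            vocab_size=vocab_size, min_count=5)

threshold = 100
for name, score in [("rare pair", rare_pair), ("common words", common_words)]:
    print(f"{name}: score={score:.2f}, kept as bigram: {score > threshold}")
```

Two words that are rare on their own but usually appear together score highly, while two very common words that sometimes end up side by side do not. Lowering the threshold lets more of these borderline pairs through.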

## Create Corpus and Dictionary needed for Topic Modeling

Creating a dictionary from the corpus restructures the text in a way that prepares it for topic modeling. In this stage we create an id (key) for each unique word in the corpus and, for each document, associate that id with the frequency of the word in the document. The dictionary creates the unique ids (keys) and the corpus maps the word ids to their frequencies.

In [None]:
# Create Dictionary (associates a key to each unique word)
id2word = corpora.Dictionary(data_bigrams_trigrams)

# Create Corpus (associates a word frequency with the key of each unique word from the Dictionary)
corpus = [id2word.doc2bow(text) for text in data_bigrams_trigrams]

#Print corpus for first document
#You will see a list of (unique word ID, its frequency) pairs
print(corpus[0])

In [None]:
#See what word is associated with a particular key/ID number
print(id2word[39])

In [None]:
# Human-readable format of corpus (term and its frequency)
#List the unique words and their frequencies for the first document
[[(id2word[word_id], freq) for word_id, freq in doc] for doc in corpus[:1]]
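If the dictionary and corpus structures still feel abstract, the following plain-Python sketch mimics what they contain. It is not gensim's implementation (gensim may assign ids in a different order), and the helper names and example documents are made up.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Turn one tokenized document into a sorted list of (token id, frequency) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


example_docs = [
    ["court", "door", "court", "judge"],
    ["castle", "door", "village"],
]
token2id = build_dictionary(example_docs)
example_corpus = [doc_to_bow(doc, token2id) for doc in example_docs]
print(token2id)
print(example_corpus)
```

Word order is lost at this point: each document is reduced to a "bag of words", which is all that LDA uses.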

# 2-Building the Topic Models

In [None]:
"""
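Parameters: 
corpus and id2word: the corpus and dictionary we created above

num_topics: the number of topics

random_state: seed for the random number generator, so that results are reproducible

passes: the number of passes through the whole training corpus

chunksize: the number of documents to be used in each training chunk

update_every: how often the model parameters are updated (0 for batch learning, 1 or more for online learning)

alpha='auto': learn the document-topic prior from the corpus

per_word_topics: also compute the most likely topics for each word

See gensim documentation for more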

"""

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=100,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# Topics (probability distributions over the words in the corpus)
#This lists the topics (Topic 0, 1, 2 etc.)
#and prints the words most characteristic of each topic,
#each preceded by its probability score (how strongly it is characteristic of the topic)
#Change num_words to get more or fewer words for each topic
pprint(lda_model.print_topics(num_words=10))

In [None]:
# Distributions of topics over documents: 
#which topics are associated with each document
#This returns, for each document, a list of the most characteristic topics
#and their weight of association (topic probability) with that document

topics_per_document=[lda_model.get_document_topics(item) for item in corpus]
topics_per_document
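A common next step is to pick the single most characteristic topic for each document. The sketch below assumes the output has the shape described above, i.e. one list of (topic id, probability) pairs per document. The numbers are invented and the helper name `dominant_topic` is our own.

```python
# Made-up output in the shape returned by get_document_topics:
# one list of (topic id, probability) pairs per document
example_topics_per_document = [
    [(0, 0.62), (3, 0.35)],
    [(1, 0.91)],
    [(2, 0.48), (4, 0.47)],
]


def dominant_topic(doc_topics):
    """Return the (topic id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


for doc_index, doc_topics in enumerate(example_topics_per_document):
    topic_id, probability = dominant_topic(doc_topics)
    print(f"Document {doc_index}: topic {topic_id} ({probability:.2f})")
```

Keep an eye on cases like the third document, where two topics are almost tied: reducing a document to one dominant topic hides that mixture.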

## Visualizing models

In [None]:
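# Visualize the topics
#pyLDAvis must be installed for this cell to run
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
lda_display = gensimvis.prepare(lda_model, corpus, id2word)
pyLDAvis.display(lda_display)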

Each bubble on the left-hand side of the plot represents a topic. The larger the bubble, the more prevalent that topic is and the more documents are associated with it.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant. Slightly overlapping topics are not a bad thing, though: they reveal connections between topics.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over the bubbles, the words and bars on the right-hand side will update. These words are the words characteristic of that topic.

This visualization can give you a sense of how to tune your models by adjusting your parameters (e.g. increasing or decreasing the number of topics). If words are not meaningful, you can also add them to your custom stopwords list (cf. Preprocessing notebook).

## Tuning topic models

You can refine your models using the visualization above. If you want to go more in depth, you can use perplexity and coherence scores. Perplexity measures how well the model predicts the words in the corpus, and coherence measures how coherent the topics are (the more coherent, the more meaningful and interpretable the topics). They can be helpful guides for identifying an optimal number of topics. You can try building different models with different numbers of topics and seeing how the scores change (especially the coherence score).

In [None]:
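# Compute Perplexity
# log_perplexity returns the per-word likelihood bound (a negative number);
# perplexity is 2^(-bound), and the lower the perplexity the better.
bound = lda_model.log_perplexity(corpus)
print('\nPer-word likelihood bound: ', bound)
print('Perplexity: ', 2 ** (-bound))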

In [None]:
# Compute Coherence Score
#the higher the better
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_bigrams_trigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
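To give an intuition for what a coherence score rewards, here is a plain-Python sketch of a different, simpler measure than the `c_v` score used above: a UMass-style coherence (after Mimno et al. 2011), averaged over word pairs. It is swapped in here only because it fits in a few lines, and its values are not comparable to `c_v` values. The documents and "topics" are invented.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average UMass-style score over all pairs of a topic's top words.

    top_words: a topic's words, most probable first
               (each word must occur in at least one document)
    docs: tokenized documents (lists of tokens)
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_frequency(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    scores = []
    # combinations keeps the input order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        joint = doc_frequency(earlier, later)
        scores.append(math.log((joint + 1) / doc_frequency(earlier)))
    return sum(scores) / len(scores)


docs = [
    ["telescope", "sky", "planet", "star"],
    ["planet", "star", "sky"],
    ["star", "famous", "fans", "film"],
    ["film", "fans", "famous"],
]
coherent = umass_coherence(["star", "planet", "sky"], docs)
mixed = umass_coherence(["planet", "film", "telescope"], docs)
print(f"coherent topic: {coherent:.3f}")
print(f"mixed topic: {mixed:.3f}")
```

A topic whose top words tend to appear in the same documents scores higher than a topic that mixes words from unrelated documents. That is the property to look for when comparing models with different numbers of topics.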