# Topic Modeling and Latent Dirichlet Allocation (LDA)

Topic modeling is a common NLP task that attempts to find the topics within a collection of text documents. It is an unsupervised approach: the model only tells you which words frequently occur together. It is then up to you to deduce each topic from those frequent, co-occurring words.

One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such text include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and e-mails of customer complaints.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. It is also really hard to manually read through such large volumes and compile the topics.

What is required is an automated algorithm that can read through the text documents and output the topics discussed.

In this tutorial, we will take a real example, the ‘20 Newsgroups’ dataset, and use LDA to extract the naturally discussed topics.

I will be using Latent Dirichlet Allocation (LDA) from the Gensim package, along with Mallet’s implementation (via Gensim). Mallet has an efficient implementation of LDA that is known to run faster and give better topic segregation.

We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is. 


LDA’s approach to topic modeling is to consider each document as a collection of topics in a certain proportion, and each topic as a collection of keywords, again in a certain proportion.

Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.

When I say topic, what is it actually, and how is it represented?

A topic is nothing but a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can identify what the topic is all about.
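The "documents are mixtures of topics, topics are mixtures of keywords" view can be made concrete with a small generative sketch. It is only an illustration under stated assumptions: the topic names, keyword weights, and the helper `generate_document` are made up for this example and are not part of Gensim.

```python
import random

def generate_document(topic_proportions, topics, n_words, seed=0):
    """Generate a toy document the way LDA assumes documents are written.

    topic_proportions: dict topic -> share of the document (hypothetical values)
    topics: dict topic -> dict keyword -> weight within that topic
    """
    rng = random.Random(seed)
    topic_names = list(topic_proportions)
    topic_weights = [topic_proportions[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        # First pick a topic according to the document's topic proportions...
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # ...then pick a keyword according to that topic's keyword proportions.
        vocab = list(topics[topic])
        word = rng.choices(vocab, weights=[topics[topic][w] for w in vocab])[0]
        words.append(word)
    return words

# Hypothetical topics and weights, chosen only for illustration.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "win": 0.2},
    "autos": {"car": 0.6, "engine": 0.4},
}
doc = generate_document({"sports": 0.7, "autos": 0.3}, topics, n_words=8)
print(doc)
```

Training LDA runs this story in reverse: given only the words, it tries to recover plausible topic proportions for each document and keyword proportions for each topic.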

The following are key factors in obtaining well-separated topics:

- The quality of text processing.
- The variety of topics the text talks about.
- The choice of topic modeling algorithm.
- The number of topics fed to the algorithm.
- The algorithm’s tuning parameters.



#### Assumptions:

1. Each document is just a collection of words, or a “bag of words”. The order of the words and their grammatical role (subject, object, verb, ...) are not considered in the model.

2. Words like am/is/are/of/a/the/but/… carry no information about the “topics” and can therefore be eliminated from the documents as a preprocessing step. In fact, we can usually eliminate words that occur in at least 80% to 90% of the documents while losing very little information. For example, if our corpus contains only medical documents, words like human, body, health, etc. might be present in most of the documents and can be removed, as they add no specific information that would make a document stand out.

3. We know beforehand how many topics we want: ‘k’ is decided in advance.

4. When updating the topic assignment of the current word, all other topic assignments are assumed to be correct. The assignment of the current word is then updated using our model of how documents are generated.
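Assumption 2 can be sketched in a few lines. This is a minimal illustration, assuming documents are already tokenized into lists of words; the helper name `drop_common_words`, the toy documents, and the 0.8 cutoff are choices made for this example.

```python
from collections import Counter

def drop_common_words(docs, stop_words, max_doc_fraction=0.8):
    """Remove stop words and words appearing in more than max_doc_fraction of docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear at least once?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    too_common = {w for w, df in doc_freq.items() if df / n_docs > max_doc_fraction}
    blocked = set(stop_words) | too_common
    return [[w for w in doc if w not in blocked] for doc in docs]

# Toy medical corpus: "health" appears in every document, so it carries no signal.
toy_docs = [
    ["the", "patient", "health", "heart", "surgery"],
    ["the", "health", "diet", "vitamin"],
    ["the", "health", "virus", "vaccine"],
    ["the", "body", "health", "muscle", "exercise"],
]
filtered = drop_common_words(toy_docs, stop_words={"the"}, max_doc_fraction=0.8)
print(filtered)
```

In the toy corpus, both the stop word "the" and the corpus-wide word "health" are dropped, leaving only the words that distinguish one document from another.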


#### How does LDA work?
There are 2 parts in LDA:

- The words that belong to a document, which we already know.
- The words that belong to a topic, or the probability of words belonging to a topic, which we need to calculate.

The algorithm to find the latter:

1. Go through each document and randomly assign each word in the document to one of k topics (k is chosen beforehand).
2. For each document d, go through each word w and compute:
    1. p(topic t | document d): the proportion of words in document d that are assigned to topic t, excluding the current word. It captures how many words of d belong to t. If a lot of words from d belong to t, it is more probable that word w belongs to t as well:

       p(topic t | document d) = (#words in d assigned to t + alpha) / (#words in d assigned to any topic + k * alpha)

    2. p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from word w. It captures how many documents are in topic t because of word w.

       LDA represents documents as a mixture of topics and, similarly, a topic as a mixture of words. If a word w has a high probability of being in topic t, all the documents containing w will be more strongly associated with t as well. Conversely, if w is not very probable in t, the documents containing w will have a low probability of being in t, because the rest of the words in d belong to some other topics and d therefore has a higher probability for those topics. So even if w gets added to t, it won’t bring many such documents to t.
3. Update the probability of word w belonging to topic t as

   p(word w with topic t) = p(topic t | document d) * p(word w | topic t)

   and reassign w to a topic according to these probabilities.
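The procedure described above corresponds to what is usually called collapsed Gibbs sampling. The sketch below is a minimal, unoptimized illustration in plain Python, not the variational algorithm that Gensim's `LdaModel` runs. The function name `gibbs_lda`, the smoothing parameter `beta` for the word-topic counts, and the toy corpus are assumptions made for this example.

```python
import random

def gibbs_lda(docs, k, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of documents, each a list of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * k for _ in docs]                # words in doc d assigned to topic t
    topic_word = [[0] * vocab_size for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                              # all words assigned to topic t
    assignments = []

    # Step 1: randomly assign each word to one of the k topics.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_d)

    # Steps 2 and 3: revisit every word and resample its topic.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Exclude the current word from all counts.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                weights = []
                for t in range(k):
                    p_t_given_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + k * alpha)
                    p_w_given_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_given_d * p_w_given_t)

                # Draw the new topic in proportion to the combined probability.
                t_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1
    return doc_topic, topic_word

# Toy corpus: each document is a list of word ids from a vocabulary of 6 words.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 3, 4, 4]]
doc_topic, topic_word = gibbs_lda(toy_docs, k=2, vocab_size=6)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up in each topic, so the row sums always equal the document lengths. Normalizing a row gives that document's topic proportions.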

<!-- For this exercise we use the news headlines published over a period of eighteen years from Australian sources ABC from [kaggle](https://www.kaggle.com/therohk/million-headlines/data) -->

In [1]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
# import pyLDAvis
# import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
# NLTK Stop words
# If the stop word corpus is missing, run nltk.download('stopwords') once first
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

## Data Pre-processing

In [2]:
# Import Dataset
data = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')

In [9]:
data.shape

(11314, 3)

In [10]:
print(data.target_names.unique())

['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space'
 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware'
 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc'
 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x'
 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian'
 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']


In [11]:
data.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


In [7]:
# randomly select a subset
data = data.sample(n = 1000)

In [15]:
# Convert to list
data_text = data.content.values.tolist()

# Remove Emails (raw strings avoid invalid-escape warnings for \S and \s)
data_text = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data_text]

# Remove new line characters
data_text = [re.sub(r'\s+', ' ', sent) for sent in data_text]

# Remove distracting single quotes
data_text = [re.sub(r"'", "", sent) for sent in data_text]

In [16]:
pprint(data_text[:1])

['From: (Doug Loss) Subject: Re: Crazy? or just Imaginitive? Organization: '
 'Electrical and Computer Engineering, Carnegie Mellon Lines: 22 In article '
 'writes: > >Unfortunately H. Beam Piper killed him self just weeks short of '
 'having his >first book published, and have his ideas see light.. Such a '
 'waste. > > Piper lived in my town (Williamsport, PA) when he killed himself. '
 'It was in the early 60s. He had had more than a few books published by that '
 'time, but he was down on his luck financially. Rumor was that he was hunting '
 'urban pigeons with birdshot for food. He viewed himself as a resourceful '
 'man, and (IMO) decided to check out gracefully if he couldnt support '
 'himself. The worst part is that John Campbell, the long-time editor of '
 'Astounding/Analog SF magazine had cut a check for Pipers most recent story, '
 'and said check was in the mail. If Campbell had known Pipers straits, Im '
 'sure he would have phoned to say hang on. Campbell was like that

### Tokenize words and Clean-up text

Let’s tokenize each document into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s simple_preprocess() is great for this: it lowercases the text and drops punctuation while tokenizing. Additionally, I have set deacc=True to strip accent marks from the tokens.

In [21]:
# Initialize spacy language model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#### Customized stop words

In [16]:
# New stop words list
customize_stop_words = [
    'doi', 'preprint', 'copyright', 'org', 'https', 'et', 'al', 'author', 'figure', 'table', 'e-mail', 'file',
    'rights', 'reserved', 'permission', 'use', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI',
    '-PRON-', 'usually'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

In [17]:
# spaCy-based tokenizer used to check word frequencies
def spacy_tokenizer(sentence):
    return [word.lemma_ for word in nlp(sentence) if not (word.like_num or word.is_stop or word.is_punct or word.is_space or len(word)==1)]

In [18]:
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, min_df=2)
data_vectorized = vectorizer.fit_transform(data_text)

In [19]:
# most frequent words
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
word_count = pd.DataFrame({'word': vectorizer.get_feature_names_out(), 'count': np.asarray(data_vectorized.sum(axis=0))[0]})
word_count.sort_values('count', ascending=False).set_index('word')[:20].sort_values('count', ascending=True).plot(kind='barh')

In [20]:
# convert sentences to words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; punctuation is dropped by the tokenizer

In [21]:
data_words = list(sent_to_words(data_text))
print(data_words[:1])

[['from', 'andrew', 'byler', 'subject', 're', 'revelations', 'babylon', 'organization', 'freshman', 'civil', 'engineering', 'carnegie', 'mellon', 'pittsburgh', 'pa', 'lines', 'hal', 'heydt', 'writes', 'that', 'was', 'only', 'the', 'fall', 'of', 'the', 'western', 'empire', 'the', 'eastern', 'empire', 'continued', 'for', 'another', 'years', 'and', 'key', 'element', 'in', 'its', 'fall', 'was', 'the', 'christian', 'sack', 'of', 'constantinople', 'note', 'that', 'said', 'the', 'fall', 'of', 'rome', 'not', 'of', 'the', 'empire', 'the', 'roman', 'empire', 'lasted', 'until', 'with', 'its', 'transfered', 'capital', 'in', 'constantinople', 'the', 'main', 'reason', 'for', 'its', 'fall', 'was', 'not', 'so', 'much', 'the', 'sack', 'of', 'constantinople', 'by', 'the', 'men', 'of', 'the', 'th', 'crusade', 'who', 'were', 'not', 'christians', 'they', 'had', 'been', 'excommunicated', 'down', 'to', 'the', 'last', 'man', 'after', 'attacking', 'the', 'christian', 'city', 'of', 'zara', 'in', 'croatia', 'but

### Creating Bigram and Trigram Models

Bigrams are two words that frequently occur together in the document. Trigrams are three words that frequently occur together.

Gensim’s Phrases model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold. The higher the values of these parameters, the harder it is for words to be combined into bigrams.
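To see why larger min_count and threshold values make phrases rarer, here is a minimal sketch of the scoring rule. It mirrors, to the best of my understanding, the default scorer described in the Gensim documentation, but treat the formula and the function name `phrase_score` as assumptions of this example. The counts are made up for illustration.

```python
def phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Score a candidate bigram; it is merged only if the score exceeds the threshold."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

threshold = 100

# Hypothetical counts: two rare words that almost always appear together.
rare_pair = phrase_score(worda_count=20, wordb_count=25, bigram_count=15,
                         len_vocab=10000, min_count=5)

# Hypothetical counts: two very common words that co-occur often, but mostly by chance.
common_pair = phrase_score(worda_count=3000, wordb_count=5000, bigram_count=400,
                           len_vocab=10000, min_count=5)

print(rare_pair > threshold, common_pair > threshold)
```

The rare pair clears the threshold while the common pair does not, even though the common pair co-occurs far more often. Raising min_count shrinks the numerator and raising threshold lifts the bar, so both changes produce fewer phrases.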

In [22]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

In [23]:
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [24]:
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['from', 'andrew', 'byler', 'subject_re', 'revelations', 'babylon', 'organization', 'freshman', 'civil', 'engineering', 'carnegie_mellon', 'pittsburgh_pa', 'lines', 'hal', 'heydt', 'writes', 'that', 'was', 'only', 'the', 'fall', 'of', 'the', 'western', 'empire', 'the', 'eastern', 'empire', 'continued', 'for', 'another', 'years', 'and', 'key', 'element', 'in', 'its', 'fall', 'was', 'the', 'christian', 'sack', 'of', 'constantinople', 'note', 'that', 'said', 'the', 'fall', 'of', 'rome', 'not', 'of', 'the', 'empire', 'the', 'roman', 'empire', 'lasted', 'until', 'with', 'its', 'transfered', 'capital', 'in', 'constantinople', 'the', 'main', 'reason', 'for', 'its', 'fall', 'was', 'not', 'so', 'much', 'the', 'sack', 'of', 'constantinople', 'by', 'the', 'men', 'of', 'the', 'th', 'crusade', 'who', 'were', 'not', 'christians', 'they', 'had', 'been', 'excommunicated', 'down', 'to', 'the', 'last', 'man', 'after', 'attacking', 'the', 'christian', 'city', 'of', 'zara', 'in', 'croatia', 'but', 'rather

#### Remove Stopwords, Make Bigrams and Lemmatize

The bigram model is ready. Let’s define functions to remove the stopwords, make bigrams, and lemmatize, and then call them sequentially.

In [25]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

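def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    POS tags are described at https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        # Re-join the tokens so spaCy can tag them in context
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out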

In [26]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

In [27]:
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

In [28]:
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['revelation', 'freshman', 'civil', 'engineering', 'carnegie_mellon', 'line', 'hal', 'heydt', 'write', 'fall', 'western', 'empire', 'eastern', 'empire', 'continue', 'year', 'key', 'element', 'fall', 'note', 'say', 'empire', 'last', 'transfered', 'capital', 'constantinople', 'main', 'reason', 'fall', 'much', 'sack', 'constantinople', 'man', 'crusade', 'christian', 'excommunicate', 'last', 'man', 'attack', 'christian', 'city', 'rather', 'disastorous', 'defeat', 'battle', 'mazinkert', 'turk', 'breach', 'frontier', 'matter', 'time', 'empire', 'fall', 'inability', 'empire', 'hold', 'seljuk', 'middle', 'quite', 'obvious', 'student', 'history', 'sack', 'constantinople', 'hasten', 'inevitable', 'want', 'save', 'empire', 'cooperate', 'crusader', 'come', 'battle', 'crusade', 'obstinacy', 'cooperate', 'people', 'consider', 'heretic', 'even', 'heretic', 'fighting', 'cause', 'empire', 'christendom', 'battle', 'turkish', 'horde', 'horde', 'later', 'sack', 'constantinople', 'balkan', 'hungary', 'ukr

#### Create the Dictionary and Corpus needed for Topic Modeling

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.

Gensim creates a unique id for each word in the corpus. The produced corpus shown below is a mapping of (word_id, word_frequency).

For example, (0, 1) below implies that word id 0 occurs once in the first document. Likewise, word id 2 occurs three times, and so on.

This is used as the input by the LDA model.
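The (word_id, word_frequency) representation can be reproduced in a few lines of plain Python. This is only a sketch of the idea: the helpers `build_vocab` and `to_bow` are hypothetical, and ids are assigned here in order of first appearance, which is not necessarily how Gensim's Dictionary numbers words.

```python
from collections import Counter

def build_vocab(docs):
    """Give every distinct word an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn a tokenized document into sorted (word_id, word_frequency) pairs."""
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())

toy_docs = [["empire", "fall", "empire"], ["fall", "city"]]
vocab = build_vocab(toy_docs)
bow_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)       # {'empire': 0, 'fall': 1, 'city': 2}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded entirely, which is exactly the bag-of-words assumption LDA relies on.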

In [29]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 3), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 4), (15, 1), (16, 2), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 7), (24, 1), (25, 1), (26, 1), (27, 4), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 1), (37, 2), (38, 1), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 1), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 3), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1)]]


In [30]:
id2word[0]

'attack'

In [31]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('attack', 1),
  ('balkan', 1),
  ('battle', 3),
  ('breach', 1),
  ('capital', 1),
  ('carnegie_mellon', 1),
  ('caucasus', 1),
  ('cause', 1),
  ('christendom', 1),
  ('christian', 2),
  ('city', 1),
  ('civil', 1),
  ('come', 1),
  ('consider', 1),
  ('constantinople', 4),
  ('continue', 1),
  ('cooperate', 2),
  ('crusade', 2),
  ('crusader', 1),
  ('defeat', 1),
  ('disastorous', 1),
  ('eastern', 1),
  ('element', 1),
  ('empire', 7),
  ('engineering', 1),
  ('even', 1),
  ('excommunicate', 1),
  ('fall', 4),
  ('fighting', 1),
  ('freshman', 1),
  ('frontier', 1),
  ('hal', 1),
  ('hasten', 1),
  ('heretic', 2),
  ('heydt', 1),
  ('history', 1),
  ('hold', 1),
  ('horde', 2),
  ('hungary', 1),
  ('inability', 1),
  ('inevitable', 1),
  ('key', 1),
  ('last', 2),
  ('later', 1),
  ('line', 1),
  ('main', 1),
  ('man', 2),
  ('matter', 1),
  ('mazinkert', 1),
  ('middle', 1),
  ('much', 1),
  ('note', 1),
  ('obstinacy', 1),
  ('obvious', 1),
  ('people', 1),
  ('quite', 1),
  (

### Building the Topic Model

We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior.

chunksize is the number of documents to be used in each training chunk. update_every determines how often the model parameters should be updated, and passes is the total number of training passes.

LDA has 3 important parameters:

- alpha: document-topic density factor
- beta (called eta in Gensim): word density in a topic
- k (num_topics in Gensim): number of topics to cluster the documents into

In [32]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

#### View the topics in LDA model

The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics(), as shown next.

In [33]:
# Print the top 10 keywords of each of the 20 topics
pprint(lda_model.print_topics())
# doc_lda = lda_model[corpus]

[(0,
  '0.041*"cd" + 0.007*"st" + 0.002*"targa" + 0.002*"remixe" + 0.000*"eq" + '
  '0.000*"vamp" + 0.000*"duran" + 0.000*"transvision" + 0.000*"nitzer" + '
  '0.000*"ebb"'),
 (1,
  '0.021*"blue" + 0.018*"vote" + 0.018*"dream" + 0.017*"implement" + '
  '0.016*"relationship" + 0.015*"gordon_bank" + 0.013*"seal" + '
  '0.013*"homosexual" + 0.013*"pocket" + 0.012*"excuse"'),
 (2,
  '0.036*"motorcycle" + 0.019*"bruin" + 0.017*"root" + 0.014*"brian" + '
  '0.011*"ranger" + 0.009*"uunet" + 0.008*"maine" + 0.006*"phase" + '
  '0.005*"boss" + 0.005*"inferior"'),
 (3,
  '0.020*"line" + 0.020*"write" + 0.014*"get" + 0.013*"article" + 0.013*"know" '
  '+ 0.012*"make" + 0.012*"say" + 0.011*"think" + 0.011*"go" + 0.010*"people"'),
 (4,
  '0.051*"bike" + 0.029*"turkish" + 0.024*"logic" + 0.023*"turk" + 0.017*"pit" '
  '+ 0.016*"street" + 0.016*"dog" + 0.015*"fair" + 0.015*"truck" + '
  '0.014*"ensure"'),
 (5,
  '0.079*"game" + 0.077*"team" + 0.057*"player" + 0.042*"play" + 0.029*"win" + '
  '0.027*"

#### How to interpret this?

For example, suppose a topic is represented as 0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn". Your own topic numbering and keywords will differ, since the model is trained on a random sample of the data.

It means the top 10 keywords that contribute to this topic are ‘car’, ‘power’, ‘light’, and so on, and the weight of ‘car’ in this topic is 0.016.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? You may summarise it as either ‘cars’ or ‘automobiles’.

Likewise, can you go through the remaining topic keywords and judge what the topic is?

#### Compute Model Perplexity and Coherence Score

Model perplexity and topic coherence provide convenient measures to judge how good a given topic model is. In my experience, the topic coherence score, in particular, has been more helpful.

In [34]:
# Compute Perplexity
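# log_perplexity() returns the per-word log-likelihood bound, not the perplexity itself.
# Perplexity is 2**(-bound): a bound closer to zero means lower (better) perplexity.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))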


Perplexity:  -12.159954067483701
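The number printed above is a per-word log-likelihood bound rather than a perplexity. As I understand the Gensim documentation, the perplexity estimate is obtained as 2 raised to the negative bound; the helper name `perplexity_from_bound` is just a label for this sketch.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# The bound reported by lda_model.log_perplexity(corpus) in the cell above.
bound = -12.159954067483701
perplexity = perplexity_from_bound(bound)
print(round(perplexity))
```

When comparing models, a bound closer to zero corresponds to a lower perplexity, and lower perplexity is better. Keep in mind that the value here is computed on the training corpus, so it says little about how the model generalizes.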


In [35]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4812691560239295


#### Visualize the topics-keywords

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. The pyLDAvis package’s interactive chart is an excellent tool for this, and it is designed to work well with Jupyter notebooks.

In [None]:
# !pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [40]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis



In [4]:
data.head(50)

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space
5,From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...,16,talk.politics.guns
6,From: bmdelane@quads.uchicago.edu (brian manni...,13,sci.med
7,From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...,3,comp.sys.ibm.pc.hardware
8,From: holmes7000@iscsvax.uni.edu\nSubject: WIn...,2,comp.os.ms-windows.misc
9,From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje...,4,comp.sys.mac.hardware


So how do we interpret pyLDAvis’s output?

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

We have successfully built a good looking topic model.

Given our prior knowledge of the number of natural topics in the document, finding the best model was fairly straightforward.

### Topic modelling with BERT  

BERTopic has three main algorithm components:
1. Embed Documents: Extract document embeddings with Sentence Transformers. Since we are working with short text documents, we need document-level embeddings, which BERTopic lets us obtain conveniently through its default sentence transformer model (paraphrase-MiniLM-L6-v2 at the time of writing).
2. Cluster Documents: Create groups of similar documents with UMAP (to reduce the dimensionality of the embeddings) and HDBSCAN (to identify and cluster semantically similar documents).
3. Create Topic Representation: Extract and reduce topics with c-TF-IDF (class-based term frequency, inverse document frequency). If you are unfamiliar with TF-IDF, all you need to know to grasp what is going on is this: it allows for comparing the importance of words between documents by combining the frequency of a word in a given document with a measure of how prevalent the word is in the entire corpus. If we instead treat all documents in a single cluster as one document and then perform TF-IDF, the result is an importance score for each word within a cluster. The more important words are within a cluster, the more representative they are of that topic. Therefore, we can obtain keyword-based descriptions for each topic. This is very useful for inferring meaning from the groupings yielded by any unsupervised clustering technique.
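The class-based TF-IDF idea from step 3 can be sketched without BERTopic itself. The version below is a simplified illustration: it assumes the weighting tf * log(1 + A / f), where tf is a word's share of its cluster, f is the word's frequency over all clusters, and A is the average number of words per cluster. The function name `class_tfidf` and the toy clusters are made up for this example, and BERTopic's own implementation may differ in its details.

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """clusters: dict mapping cluster id -> list of tokenized documents."""
    # Treat all documents in a cluster as one long document and count its words.
    counts = {c: Counter(w for doc in docs for w in doc) for c, docs in clusters.items()}

    # Frequency of every word across all clusters.
    total_freq = Counter()
    for cluster_counts in counts.values():
        total_freq.update(cluster_counts)

    # Average number of words per cluster.
    avg_words = sum(total_freq.values()) / len(clusters)

    scores = {}
    for c, cluster_counts in counts.items():
        cluster_size = sum(cluster_counts.values())
        scores[c] = {
            w: (n / cluster_size) * math.log(1 + avg_words / total_freq[w])
            for w, n in cluster_counts.items()
        }
    return scores

# Toy clusters, as if produced by the clustering step.
clusters = {
    0: [["game", "team", "win"], ["team", "player", "the"]],
    1: [["engine", "car", "the"], ["car", "drive"]],
}
scores = class_tfidf(clusters)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_words)  # {0: 'team', 1: 'car'}
```

A word such as "the" that shows up in every cluster is down-weighted by the logarithmic term, while a word concentrated in one cluster rises to the top of that cluster's description.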

In [None]:
# !pip install bertopic
# !pip install bertopic[visualization]

In [10]:
from bertopic import BERTopic

In [11]:
topic_model = BERTopic(min_topic_size=70, n_gram_range=(1,3), verbose=True)

In [5]:
topics, _ = topic_model.fit_transform(data_text)

In [13]:
freq = topic_model.get_topic_info()
freq.head(10)

Unnamed: 0,Topic,Count,Name
0,0,892,0_the_to_of_and
1,1,108,1_the_to_and_in


In [23]:
topic_nr = freq.iloc[1]["Topic"] # select a frequent topic
topic_model.get_topic(topic_nr)

[('the', 0.061614601895700435),
 ('to', 0.02906812325047156),
 ('and', 0.027099973271320396),
 ('in', 0.025387093183087617),
 ('of', 0.02523615989598884),
 ('he', 0.01891689283123492),
 ('that', 0.018160163280074318),
 ('for', 0.015790882488020297),
 ('is', 0.015627992999941914),
 ('was', 0.013540629117362939)]

In [25]:
topic_model.visualize_topics()