# Discovering and Visualizing Topics in Texts

In [1]:
import pandas as pd
import spacy
import numpy as np



**COMMENTS**
* Could not find the data

The most classic applications of Natural Language Processing involve supervised Machine Learning. In typical cases such as text classification, named entity recognition, and question answering, NLPers have access to a collection of texts with their labels. However, in real-life scenarios, we're often less lucky. Many text collections do not come with metadata labels that tell you what the texts are about. When people answer open-ended survey questions, for example, they don't tag their answers with the topics they discuss. In such cases, we can make use of unsupervised techniques called topic models.

Topic models are a family of models that are able to discover the topics in a collection of texts. In this context, "topics" refers to groups of related words that often occur together in the same text. For example, in a collection of newspaper articles a topic model may identify one topic that is made up of words such as "politician", "law", and "parliament", and another characterized by words such as "player", "match" and "penalty". Topic models can only find such clusters of related words; it is our task as humans to interpret these topics and give them labels such as "politics" and "football". 

One of the most popular such models is Latent Dirichlet Allocation (LDA). LDA is a generative model that sees every text as a mixture of topics. Each of these topics is responsible for some of the words in the text. For example, the "football" topic will generate the word "penalty" with a high probability, while the "politics" topic will have a much higher probability for "politician" than for "penalty". Other words, such as "the" and "an", will have similar probabilities in all topics. LDA takes its name from the Dirichlet probability distribution, which is the prior distribution it assumes for the mixture of topics in a text.
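To make the generative story concrete, here is a minimal sketch of how LDA imagines a document being produced. The vocabulary, the two topic distributions, and the `alpha` values are made up for illustration; a real model learns the topics from data rather than having them written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and two hand-written topics (word distributions).
vocab = ["politician", "law", "parliament", "player", "match", "penalty", "the"]
topics = np.array([
    [0.30, 0.25, 0.25, 0.01, 0.01, 0.03, 0.15],  # "politics"
    [0.01, 0.01, 0.01, 0.30, 0.27, 0.25, 0.15],  # "football"
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Sample one document following LDA's generative story."""
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        z = rng.choice(len(topics), p=theta)
        # 3. Pick a word from the chosen topic's word distribution.
        w = rng.choice(len(vocab), p=topics[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta.round(2), words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly generated them.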

## Data

One of the contexts where topic modelling is extremely useful is that of open-ended survey questions. It allows us to explore the variation in topics that people's answers contain. As our example data set, let's therefore take a look at an extensive set of answers from the Grand Débat National in France, the public debate organized by president Macron. The goal of the debate was to better understand the French people's needs and opinions after the mass demonstrations of the Yellow vests movement. The results of this debate are now [available as open data](https://granddebat.fr/pages/donnees-ouvertes). For our experiments, we'll download one of the csv files about the ecological transition and load the contents into a [pandas](https://pandas.pydata.org/) dataframe.

In [2]:
#%matplotlib inline
#import pandas as pd

#f = "data/topics/LA_TRANSITION_ECOLOGIQUE.csv"
#f = "ministere-de-la-transition-ecologique-datasets-2025-03-14-17-46.csv"
#df = pd.read_csv(f)

Each of the rows in this data frame contains some metadata and a respondent's answers to a list of questions about the ecological transition. Some of these questions are multiple choice, while other ones are open-ended. 

In [3]:
#df.columns

We'll focus on the last of the questions, which gives the most freedom to the respondents: it asks them whether they have any additional comments about the ecological transition. We hope LDA will help us analyze what topics their answers focus on. The first few answers to this question already give us an idea of the variety of topics people bring up: alternative energy sources ("les centrales géothermiques"), politics ("une vrai politique écologique") and education ("pédagogie").

In [4]:
#question = "Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?"
#df[question].head(10)

## Preprocessing

Before we train a topic model, we need to tokenize our texts. Let's do this with the [spaCy](https://spacy.io/) NLP library. Because we're only going to do some basic preprocessing, we don't need to download any of its statistical models. We'll just initialize a blank model for French instead.

In [5]:
import spacy

#nlp = spacy.blank("fr")
nlp = spacy.blank("en")

First we remove all the rows from the data frame that don't have a response for our target question (the `NaN`s above), then we take all the texts in the target column. Next, we use spaCy to perform our first preprocessing pass. 
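Because the survey file is not loaded in this notebook, the filtering step can be sketched on a made-up data frame instead. The column name `comment` and the toy answers are hypothetical stand-ins for the target question and its responses.

```python
import pandas as pd

# Hypothetical miniature survey table; "comment" stands in for the target question.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "comment": ["More geothermal plants", None, "Better education", float("nan")],
})

# Keep only the rows that have an answer in the target column, then select that column.
answered = toy_df.dropna(subset=["comment"])
toy_texts = answered["comment"].tolist()
print(toy_texts)
```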

In [8]:
# Acquire different data: English product reviews stored as newline-delimited JSON
import json

nlp = spacy.load("en_core_web_lg")  # tokenize, lemmatize using the large model
data_json = 'data/sentiment_analysis/review_corpus_en.ndjson'  # product reviews
titles = []
bodies = []
with open(data_json, 'r') as fd:
    for line in fd:
        if not line.strip():
            continue  # skip blank lines
        dct = json.loads(line)  # parse one JSON record per line (safer than eval)
        titles.append(dct["title"])
        bodies.append(dct["body"])

data_df = pd.DataFrame({"title": titles, "body": bodies})
texts = data_df["body"]
#spacy_docs = list(nlp.pipe(texts))
spacy_docs = [nlp(t) for t in texts]

Tokens have two key properties:
* `pos_` - the part-of-speech tag of the token in its sentence (noun, verb, etc.)
* `lemma_` - the base (dictionary) form of the word

In [9]:
text = data_df.loc[0, "body"]
text

"Works as mentioned. Fits the GPS perfectly. I shall use it with a Panavise Custom InDash Mount for Honda Odyssey '05. Shipped the same day - received in 2 business days!"

In [10]:
spacy_doc = nlp(text)
spacy_doc

Works as mentioned. Fits the GPS perfectly. I shall use it with a Panavise Custom InDash Mount for Honda Odyssey '05. Shipped the same day - received in 2 business days!

In [11]:
for token in spacy_doc: 
    print(token, token.pos_, token.lemma_) 

Works NOUN work
as SCONJ as
mentioned VERB mention
. PUNCT .
Fits VERB fit
the DET the
GPS NOUN gps
perfectly ADV perfectly
. PUNCT .
I PRON I
shall AUX shall
use VERB use
it PRON it
with ADP with
a DET a
Panavise PROPN Panavise
Custom PROPN Custom
InDash PROPN InDash
Mount PROPN Mount
for ADP for
Honda PROPN Honda
Odyssey PROPN Odyssey
' NUM '
05 NUM 05
. PUNCT .
Shipped VERB ship
the DET the
same ADJ same
day NOUN day
- PUNCT -
received VERB receive
in ADP in
2 NUM 2
business NOUN business
days NOUN day
! PUNCT !


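**Lemmatization** reduces each word to its dictionary form (its lemma), for example "received" to "receive". It is similar in purpose to stemming, but returns a real word rather than a truncated stem.

* ``spacy_docs`` is a list of ``spacy.tokens.doc.Doc`` objects.
* A ``spacy.tokens.doc.Doc`` is a sequence of ``spacy.tokens.token.Token`` objects. Each ``Token`` has a ``lemma_`` attribute that holds its lemma.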

In [12]:
a = [[word.lemma_ for word in doc
      if not word.is_punct and not word.is_stop] for doc in spacy_docs]
a[0]

['work',
 'mention',
 'fit',
 'gps',
 'perfectly',
 'shall',
 'use',
 'Panavise',
 'Custom',
 'InDash',
 'Mount',
 'Honda',
 'Odyssey',
 '05',
 'ship',
 'day',
 'receive',
 '2',
 'business',
 'day']

Now that we have a list of spaCy documents, we transform them to lists of tokens. Rather than the original tokens, we're going to work with their lemmas. This will allow our model to generalize better, as it will be able to see that "géothermiques" and "géothermique" are actually just two forms of the same word. This is the full list of our initial preprocessing steps:

- we remove all words of 3 characters or fewer (these are often fairly uninteresting from a topical point of view),
- we drop all stopwords, and
- we take the lemmas of the remaining words and lowercase them.

In [13]:
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]
print(docs[-10:])

[['quality', 'decent', 'price', 'problem', 'magnetic', 'strap', 'little', 'center', 'live', 'problem', 'strip', 'leather', 'make', 'difficult', 'button', 'different', 'case', 'well', 'access', 'button'], ['trash', 'megabyte', 'drive', 'adapter', 'convert', 'drive', 'work', 'motherboard'], ['relatively', 'decent', 'fitting', 'case', 'ipad', 'home', 'button', 'work', 'little', 'tend', 'constantly', 'mash', 'want', 'mash', 'probably', 'common', 'problem', 'case', 'design', 'case', 'rugged', 'protect', 'ipad', 'abuse', 'factor', 'lack', 'lack', 'screen', 'protection', 'kensington', 'case', 'keyboard', 'lose', 'magnet', 'clasp', 'smack', 'screen', 'constantly', 'decide', 'case', 'instead', 'hope', 'adjustable', 'angle', 'work', 'consider', 'fact', 'cover', 'know', 'pretty', 'obvious', 'easy', 'tunnel', 'vision', 'shop', 'interested', 'rubber', 'case', 'adjustable', 'design', 'sadly', 'adjustable', 'design', 'great', 'clunky', 'lock', 'place', 'hope', 'attract', 'dirt', 'dust', 'hair', 'get'

Next, we also want to take frequent bigrams into account. After all, French has many multiword units, such as "poids lourds" (trucks), that actually form one word rather than two. This is the first step where we use the [Gensim](https://radimrehurek.com/gensim/) library, a great NLP library for topic modelling. First we identify the frequent bigrams in the corpus, then we append them to the list of tokens for the documents in which they appear. This means the bigrams will not be in their correct position in the text, but that's fine: topic models are bag-of-words models that ignore word position anyway.
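The idea of finding frequent bigrams and appending them can be sketched in plain Python. This is a simplification that only counts adjacent pairs; Gensim's `Phrases` additionally applies a collocation score with a threshold, which this sketch leaves out. The toy documents and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["hard", "drive", "quiet", "fast"],
    ["hard", "drive", "fail", "month"],
    ["external", "hard", "drive", "fast"],
]

def frequent_bigrams(documents, min_count):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def append_bigrams(documents, bigrams):
    """Append each frequent bigram found in a document as a single 'a_b' token."""
    result = []
    for doc in documents:
        found = ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in bigrams]
        result.append(doc + found)
    return result

bigrams = frequent_bigrams(toy_docs, min_count=3)
print(append_bigrams(toy_docs, bigrams)[0])
# ['hard', 'drive', 'quiet', 'fast', 'hard_drive']
```

As in the notebook, the original tokens are kept and the joined token is added at the end of the document.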

**Not sure if bigrams are a concern in English?**

In [14]:
from gensim.models import Phrases

# Learn which adjacent word pairs occur often enough to be treated as phrases
bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:  # bigrams can be recognized by the "_" that joins the individual words
            docs[idx].append(token)  # Does not remove the original tokens, just adds the phrase

In [15]:
bigram[docs[0]]

['work',
 'mention',
 'fit_perfectly',
 'shall',
 'panavise',
 'custom',
 'indash',
 'mount',
 'honda',
 'odyssey',
 'ship',
 'receive',
 'business',
 'day',
 'fit_perfectly']

In [16]:
docs[2]

['2011',
 'black',
 'hard',
 'drives',
 'quiet',
 'fast',
 'activity',
 'build',
 'fast',
 'computer',
 'amazed',
 'little',
 'delay',
 'experience',
 'hard',
 'drive',
 'activity',
 'buy',
 'expensive',
 'black',
 'model',
 'well',
 'quality.1',
 '2013',
 'recieve',
 'black',
 'desktop',
 'sata',
 '7200',
 'inch',
 'internal',
 'desktop',
 'hard',
 'drive',
 'retail',
 'amazon',
 'manufacturers',
 'product',
 'thank',
 'manufacturer',
 'packaging',
 'unit',
 'damage',
 'transit',
 'work',
 'run',
 'window',
 'drive',
 'system',
 'system',
 'recognize',
 'immediately',
 'take',
 'awhile',
 'locate',
 'format',
 'drive',
 'control',
 'panal',
 'storage',
 'devices',
 'intend',
 'backup',
 'quiet',
 'hard_drive',
 'hard_drive']

Next, we move on to the final Gensim-specific preprocessing steps. First, we create a dictionary representation of the documents. This dictionary will map each word to a unique ID and help us **create bag-of-words representations** of each document. These bag-of-words representations contain the ids of the words in the document, together with their frequency. Additionally, we can remove the least and most frequent words from the vocabulary. This improves the quality of our topic model and speeds up its training. The minimum is expressed as an absolute number of documents a word must occur in (`no_below`), while the maximum is the proportion of documents a word is allowed to occur in (`no_above`).

**Notes**

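Indexing ``dictionary`` with an integer id returns the corresponding word; the reverse mapping from word to id is available as ``dictionary.token2id``. ``dictionary.doc2bow(docs[2])`` produces a "bag of words" representation of the document (``docs[2]``): a list of ``(word id, count)`` pairs giving the number of occurrences of each word.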
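To show what these two steps compute, here is a minimal pure-Python sketch of a word-to-id mapping, a bag-of-words conversion, and a document-frequency filter. It only mirrors the idea: the toy documents and the helper names `doc2bow` and `keep_tokens_by_doc_freq` are hypothetical, and Gensim's own id assignment and extra options are not reproduced.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["drive", "fast", "drive", "quiet"],
    ["camera", "lens", "fast"],
    ["camera", "drive"],
]

# Assign ids in order of first appearance (Gensim's actual id order may differ).
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow(doc):
    """Return sorted (token id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def keep_tokens_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc))
    limit = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= limit}

print(doc2bow(toy_docs[0]))
# [(0, 2), (1, 1), (2, 1)]
print(sorted(keep_tokens_by_doc_freq(toy_docs, no_below=2, no_above=0.7)))
# ['camera', 'drive', 'fast']
```

Note that the filter works on the number of documents a word appears in, not on its total count: "drive" occurs three times overall but in only two documents.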

In [17]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print('Number of unique words in original documents:', len(dictionary))

dictionary.filter_extremes(no_below=3, no_above=0.25)
print('Number of unique words after removing rare and common words:', len(dictionary))

print("Example representation of document 3:", dictionary.doc2bow(docs[2]))

Number of unique words in original documents: 13181
Number of unique words after removing rare and common words: 3872
Example representation of document 3: [(15, 1), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 1), (34, 4), (35, 1), (36, 1), (37, 2), (38, 1), (39, 3), (40, 2), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1)]


Then we create bag-of-word representations for each document in the corpus:

In [33]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

## Training

Now it's time to train our topic model. We do this with the following parameters: 

- corpus: the bag-of-word representations of our documents
- id2word: the mapping from indices to words
- num_topics: the number of topics we want the model to identify
- chunksize: the number of documents the model sees for every update
- passes: the number of times we show the total corpus to the model during training
- random_state: we use a seed to ensure reproducibility.

On a corpus of this size, the training typically takes only a few seconds.

In [19]:
from gensim.models import LdaModel

%time model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, chunksize=1000, passes=5, random_state=1)

CPU times: user 3.38 s, sys: 17.8 ms, total: 3.39 s
Wall time: 3.4 s


## Results

Let's take a look at what the model has learnt. We do this by printing out the ten words that are most characteristic for each of the topics. This shows some interesting patterns already: while some topics are fairly general or mixed (such as 2 and 4), others point to recognizable recurring themes: product failures and returns (topic 1), cameras and photography (topic 3), drives and media players (topic 5), and keyboards and cards (topic 6).

In [20]:
for (topic, words) in model.print_topics():
    print(topic+1, ":", words)

1 : 0.021*"time" + 0.018*"battery" + 0.016*"product" + 0.015*"cable" + 0.013*"long" + 0.013*"return" + 0.010*"month" + 0.009*"plug" + 0.008*"customer" + 0.008*"fail"
2 : 0.018*"mouse" + 0.012*"great" + 0.010*"quality" + 0.009*"look" + 0.008*"need" + 0.008*"thing" + 0.008*"play" + 0.008*"small" + 0.007*"cable" + 0.007*"price"
3 : 0.058*"camera" + 0.018*"video" + 0.017*"picture" + 0.013*"photo" + 0.012*"canon" + 0.010*"great" + 0.010*"quality" + 0.010*"lens" + 0.009*"take" + 0.009*"want"
4 : 0.042*"monitor" + 0.014*"trackball" + 0.013*"tripod" + 0.013*"great" + 0.012*"year" + 0.011*"warranty" + 0.011*"screw" + 0.008*"quality" + 0.008*"head" + 0.007*"phone"
5 : 0.024*"drive" + 0.022*"player" + 0.018*"file" + 0.011*"play" + 0.009*"ipod" + 0.008*"hard" + 0.008*"unit" + 0.007*"thing" + 0.007*"video" + 0.007*"connector"
6 : 0.036*"keyboard" + 0.033*"card" + 0.016*"speed" + 0.011*"key" + 0.010*"lens" + 0.008*"time" + 0.008*"light" + 0.007*"color" + 0.006*"fast" + 0.006*"write"
7 : 0.044*"sound

Another way of inspecting the topics is by visualizing them. This can be done with the [pyLDAvis](https://github.com/bmabey/pyLDAvis) library. PyLDAvis will show us how popular the topics are in our corpus, how similar the topics are, and which are the most salient words for each topic. Note that it's important to set `sort_topics=False` on the call to pyLDAvis. If you don't, it will order the topics differently than Gensim.

In [21]:
import pyLDAvis.gensim  # in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import warnings

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)

Finally, let's inspect the topics the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, which makes its results very interpretable.

In [22]:
for (text, doc) in zip(texts[:20], docs[:20]):
    print(text)
    print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.1])
    

Works as mentioned. Fits the GPS perfectly. I shall use it with a Panavise Custom InDash Mount for Honda Odyssey '05. Shipped the same day - received in 2 business days!
[(5, 0.109370254), (8, 0.15812883), (9, 0.6786147)]
This works with my ipad. It took a minute to set up but it has been working without too many issues.
[(9, 0.25829175), (10, 0.6083382)]
1-2-2011. I have two of 1 TB Black Hard Drives.  They are QUIET and fast enough for most activities. I built a new fast computer and was amazed at how little delay I am experiencing from the hard drive activity.  I bought the more expensive BLACK models due to better quality.1-22-2013. I recieved WD Black Desktop 2TB SATA 6.0 GB/s 7200 RPM 3.5-Inch Internal Desktop Hard Drive Retail Kit in an Amazon box that "just fit" the WDC manufacturers product box.  Thanks to the manufacturers packaging, the unit was not damaged in transit. It works well. I'm running Window 8 Pro and it is my third drive in the system.  The system recognized it i
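
The `(topic, probability)` pairs printed above can be reduced to a single dominant topic per document, which is one possible way to group related texts. The sketch below uses the pairs from the first two documents in the output; the helper name `dominant_topic` is hypothetical.

```python
# (topic, probability) pairs copied from the first two documents' output above.
doc_topics = [
    [(5, 0.109370254), (8, 0.15812883), (9, 0.6786147)],
    [(9, 0.25829175), (10, 0.6083382)],
]

def dominant_topic(pairs):
    """Return the (topic, probability) pair with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])

for pairs in doc_topics:
    topic, prob = dominant_topic(pairs)
    print(f"dominant topic {topic} (p={prob:.2f})")
# dominant topic 9 (p=0.68)
# dominant topic 10 (p=0.61)
```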

## Conclusions

Many collections of unstructured texts don't come with any labels. Topic models such as Latent Dirichlet Allocation are a useful technique to discover the most prominent topics in such documents. Gensim makes training these topic models easy, and pyLDAvis presents the results in a visually attractive way. Together they form a powerful toolkit to better understand what's inside large sets of documents, and to explore subsets of related texts. While these results are often very revealing already, it's also possible to use them as a starting point, for example for a labelling exercise for supervised text classification. Although traditional topic models lack richer semantic information (they don't use word embeddings, for instance), they should be in every NLPer's toolkit as a really quick way of getting insights into large collections of documents.