# Discovering and Visualizing Topics in Texts

The classic applications of Natural Language Processing involve supervised Machine Learning. In typical cases of text classification, named entity recognition, question answering, etc., NLPers have access to a collection of texts with their labels. However, in real-life scenarios, we're often less lucky. Many text collections do not come with metadata labels that tell you what the texts are about. When people answer open-ended survey questions, for example, they don't tag their answers with the topics they discuss. In such cases, we can make use of unsupervised techniques called topic models.

Topic models are a family of models that are able to discover the topics in a collection of texts. In this context, "topics" refers to groups of related words that often occur together in the same text. For example, in a collection of newspaper articles a topic model may identify one topic that is made up of words such as "politician", "law", and "parliament", and another characterized by words such as "player", "match" and "penalty". Topic models can only find such clusters of related words; it is our task as humans to interpret these topics and give them labels such as "politics" and "football". 

One of the most popular such models is Latent Dirichlet Allocation (LDA). LDA is a generative model that sees every text as a mixture of topics. Each of these topics is responsible for some of the words in the text. For example, the "football" topic will generate the word "penalty" with a high probability, while the "politics" topic will have a much higher probability for "politician" than for "penalty". Other words, such as "the" and "an", will have similar probabilities in all topics. LDA takes its name from the Dirichlet probability distribution, which is the prior distribution it assumes for the mixture of topics in a text.
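
To make this generative story concrete, here is a minimal sketch in plain Python. The topic names, word probabilities and Dirichlet parameters are all made up for illustration. A real LDA model learns these distributions from data, and this toy sampler is not how Gensim trains a model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, alpha, length, rng):
    """Generate a toy document following LDA's generative story."""
    topic_names = list(topic_word_probs)
    # 1. Draw the document's topic mixture from the Dirichlet prior.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(topic_names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return mixture, words


# Hypothetical topic-word probabilities, invented for this example.
toy_topics = {
    "football": {"penalty": 0.5, "player": 0.4, "the": 0.1},
    "politics": {"politician": 0.5, "law": 0.4, "the": 0.1},
}

rng = random.Random(0)
mixture, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(mixture)
print(words)
```

Training an LDA model amounts to running this story in reverse: given only the words, it infers plausible topic mixtures and topic-word distributions.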

## Data

One of the contexts where topic modelling is extremely useful is that of open-ended survey questions. It allows us to explore the variation in topics that people's answers contain. As our example data set, let's therefore take a look at an extensive set of answers from the Grand Débat National in France, the public debate organized by President Macron. The goal of the debate was to better understand the French people's needs and opinions after the mass demonstrations of the Yellow Vests movement. The results of this debate are now [available as open data](https://granddebat.fr/pages/donnees-ouvertes). For our experiments, we'll download one of the CSV files about the ecological transition and load the contents into a [pandas](https://pandas.pydata.org/) dataframe.

In [1]:
%pylab inline
import pandas as pd

f = "data/topics/LA_TRANSITION_ECOLOGIQUE.csv"
df = pd.read_csv(f)

Populating the interactive namespace from numpy and matplotlib



Each of the rows in this data frame contains some metadata and a respondent's answers to a list of questions about the ecological transition. Some of these questions are multiple choice, while others are open-ended.

In [2]:
df.columns

Index(['reference', 'title', 'createdAt', 'publishedAt', 'updatedAt',
       'trashed', 'trashedStatus', 'authorId', 'authorType', 'authorZipCode',
       'Quel est aujourd'hui pour vous le problème concret le plus important dans le domaine de l'environnement ?',
       'Que faudrait-il faire selon vous pour apporter des réponses à ce problème ?',
       'Diriez-vous que votre vie quotidienne est aujourd'hui touchée par le changement climatique ?',
       'Si oui, de quelle manière votre vie quotidienne est-elle touchée par le changement climatique ?',
       'À titre personnel, pensez-vous pouvoir contribuer à protéger l'environnement ?',
       'Si oui, que faites-vous aujourd'hui pour protéger l'environnement et/ou que pourriez-vous faire ?',
       'Qu'est-ce qui pourrait vous inciter à changer vos comportements comme par exemple mieux entretenir et régler votre chauffage, modifier votre manière de conduire ou renoncer à prendre votre véhicule pour de très petites distances ?',
   

We'll focus on the last of the questions, which gives the most freedom to the respondents: it asks them whether they have any additional comments about the ecological transition. We hope LDA will help us analyze what topics their answers focus on. The first few answers to this question already give us an idea of the variety of topics people bring up: alternative energy sources ("les centrales géothermiques"), politics ("une vrai politique écologique") and education ("pédagogie").

In [3]:
question = "Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?"
df[question].head(10)

0               Multiplier les centrales géothermiques
1    Les problèmes auxquels se trouve confronté l’e...
2                                                  NaN
3                                                  NaN
4      Une vrai politique écologique et non économique
5    Les bonnes idées ne grandissent que par le par...
6    Pédagogie dans ce sens là dés la petite école ...
7                                                  NaN
8    faire de l'écologie incitative et non punitive...
9    Développer le ferroutage pour les poids lourds...
Name: Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?, dtype: object

## Preprocessing

Before we train a topic model, we need to tokenize our texts. Let's do this with the [spaCy](https://spacy.io/) NLP library. Because we're only going to do some basic preprocessing, we don't need to download any of its statistical models. We'll just initialize a blank model for French instead.

In [4]:
import spacy

nlp = spacy.blank("fr")

First we remove all the rows from the data frame that don't have a response for our target question (the `NaN`s above), then we take all the texts in the target column. Next, we use spaCy to perform our first preprocessing pass. 

In [5]:
texts = df[df[question].notnull()][question]
%time spacy_docs = list(nlp.pipe(texts))

CPU times: user 2min 34s, sys: 1.56 s, total: 2min 36s
Wall time: 2min 38s


Now that we have a list of spaCy documents, we transform them to lists of tokens. Instead of the original tokens, we're going to work with their lemmas. This will allow our model to generalize better, as it will be able to see that "géothermiques" and "géothermique" are actually just two forms of the same word. This is the full list of our initial preprocessing steps:

- we remove all words of three characters or fewer (these are often fairly uninteresting from a topical point of view),
- we drop all stopwords, and
- we take the lemmas of the remaining words and lowercase them.

In [6]:
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]
print(docs[:3])

[['multiplier', 'centrale', 'géothermique'], ['problème', 'trouver', 'confronter', 'ensemble', 'planète', 'dénoncer', 'parfaire', 'désordre', 'gilet', 'jaune', 'france', 'surpopulation', 'mondial', 'cette', 'population', 'passer', 'd’1,5', 'milliard', 'habitant', '1900', 'milliard', '2020', 'monter', 'bientôt', 'milliard', '2040', 'avec', 'progrès', 'communication', 'village', 'mondial', 'individu', 'fondre', 'asie', 'fondre', 'afrique', 'passer', 'quartier', 'campagne', 'pays', 'aspirer', 'vivre', 'blâmer', 'lotir', 'concitoyen', 'logement', 'nourriture', 'bien', 'consommation', 'déplacement', 'voilà', 'mère', 'problème', 'solution', 'problème', 'stabilisation', 'croissance', 'démographique', 'partager', 'richesse', 'partager', 'terrer', 'partager', 'protection', 'biodiversité', 'règlement', 'conflit', 'lutter', 'déforestation', 'lutter', 'dérèglement', 'climatique', 'règlement', 'conflit', 'stabilisation', 'migration', 'concurrencer', 'commercial', 'mondial', 'etc.', 'français', 'eur

Next, we also want to take frequent bigrams into account. After all, French has many multiword units, such as "poids lourds" (trucks), that actually form one word rather than two. This is the first step where we use the [Gensim](https://radimrehurek.com/gensim/) library, a great NLP library for topic modelling. First we identify the frequent bigrams in the corpus, then we append them to the list of tokens for the documents in which they appear. This means the bigrams will not be in their correct position in the text, but that's fine: topic models are bag-of-words models that ignore word position anyway.

In [7]:
from gensim.models import Phrases

# Learn which adjacent word pairs co-occur often enough to count as one unit.
bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:  # bigrams can be recognized by the "_" that joins the individual words
            docs[idx].append(token)

In [8]:
docs[2]

['vrai', 'politique', 'écologique', 'économique', 'vrai_politique']
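
The idea of appending frequent bigrams can be sketched without Gensim. The version below is a deliberately simplified illustration: it keeps every adjacent pair that reaches a raw count cutoff, whereas Gensim's `Phrases` additionally scores candidate pairs with a collocation statistic. The helper names and toy documents are hypothetical.

```python
from collections import Counter


def frequent_bigrams(docs, min_count):
    """Return adjacent word pairs that occur at least min_count times in the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def append_bigrams(doc, bigrams):
    """Return a copy of doc with every frequent bigram appended as 'first_second'."""
    extra = ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in bigrams]
    return doc + extra


# Toy documents, invented for this example.
toy_docs = [
    ["développer", "ferroutage", "poids", "lourds"],
    ["taxer", "poids", "lourds"],
    ["poids", "lourds", "polluer"],
]
bigrams = frequent_bigrams(toy_docs, min_count=3)
augmented = [append_bigrams(doc, bigrams) for doc in toy_docs]
print(augmented[1])  # ['taxer', 'poids', 'lourds', 'poids_lourds']
```

As in the notebook code, the original unigrams stay in the document and the joined bigram is added at the end.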

Next, we move on to the final Gensim-specific preprocessing steps. First, we create a dictionary representation of the documents. This dictionary maps each word to a unique ID and helps us create bag-of-words representations of each document. These bag-of-words representations contain the IDs of the words in the document, together with their frequency. Additionally, we can remove the least and most frequent words from the vocabulary. This improves the quality of our topic model and speeds up its training. The minimum frequency (`no_below`) is expressed as an absolute number of documents a word must occur in, while the maximum frequency (`no_above`) is the proportion of documents a word is allowed to occur in.

In [9]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print('Number of unique words in original documents:', len(dictionary))

dictionary.filter_extremes(no_below=3, no_above=0.25)
print('Number of unique words after removing rare and common words:', len(dictionary))

print("Example representation of document 3:", dictionary.doc2bow(docs[2]))

Number of unique words in original documents: 30517
Number of unique words after removing rare and common words: 11708
Example representation of document 3: [(87, 1), (88, 1), (89, 1), (90, 1), (91, 1)]


Then we create bag-of-words representations for each document in the corpus:

In [10]:
corpus = [dictionary.doc2bow(doc) for doc in docs]
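
To show what the dictionary and the bag-of-words conversion do conceptually, here is a rough pure-Python sketch. The function names `build_vocabulary` and `to_bow` are hypothetical, the toy documents are invented, and the ID assignment (alphabetical here) will not match the IDs Gensim produces.

```python
from collections import Counter


def build_vocabulary(docs, no_below, no_above):
    """Map each kept word to an integer ID, filtering on document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(docs)
    kept = sorted(w for w, df in doc_freq.items() if no_below <= df <= max_docs)
    return {word: idx for idx, word in enumerate(kept)}


def to_bow(doc, token2id):
    """Convert a token list into sorted (word_id, count) pairs, skipping unknown words."""
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())


# Toy documents, invented for this example.
toy_docs = [
    ["énergie", "nucléaire", "énergie"],
    ["énergie", "solaire"],
    ["énergie", "nucléaire", "taxe"],
    ["énergie", "taxe", "taxe"],
]
token2id = build_vocabulary(toy_docs, no_below=2, no_above=0.75)
print(token2id)                       # {'nucléaire': 0, 'taxe': 1}
print(to_bow(toy_docs[3], token2id))  # [(1, 2)]
```

Here "énergie" is dropped because it occurs in every document, and "solaire" because it occurs in only one.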

## Training

Now it's time to train our topic model. We do this with the following parameters: 

- corpus: the bag-of-words representations of our documents
- id2word: the mapping from indices to words
- num_topics: the number of topics we want the model to identify
- chunksize: the number of documents the model sees for every update
- passes: the number of times we show the total corpus to the model during training
- random_state: we use a seed to ensure reproducibility.

On a corpus of this size, the training will typically take one or two minutes.

In [11]:
from gensim.models import LdaModel

%time model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, chunksize=1000, passes=5, random_state=1)

CPU times: user 1min 36s, sys: 1.42 s, total: 1min 37s
Wall time: 55.3 s
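
As a back-of-the-envelope illustration of how `chunksize` and `passes` interact, the sketch below estimates how many updates the model makes during training. It assumes one update per chunk, and the corpus size is a hypothetical number, not the size of the data set above.

```python
import math


def updates_during_training(num_docs, chunksize, passes):
    """Estimate the number of model updates, assuming one update per chunk."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass * passes


# 150,000 is a hypothetical corpus size, chosen only for illustration.
print(updates_during_training(num_docs=150_000, chunksize=1000, passes=5))  # 750
```

A smaller `chunksize` or more `passes` means more updates, which tends to lengthen training.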


## Results

Let's take a look at what the model has learnt. We do this by printing out the ten words that are most characteristic for each of the topics. This shows some interesting patterns already: while some topics are more general (such as 3), others point to some very relevant recurring themes: electric vehicles (topic 1), (alternative) energy (topic 2), agriculture (topic 6), waste and recycling (topic 7) and taxes (topic 9).

In [12]:
for (topic, words) in model.print_topics():
    print(topic+1, ":", words)

1 : 0.046*"voiturer" + 0.039*"électrique" + 0.035*"véhiculer" + 0.016*"voiturer_électrique" + 0.016*"diesel" + 0.013*"batterie" + 0.013*"polluer" + 0.012*"solution" + 0.011*"faire" + 0.010*"exemple"
2 : 0.062*"énergie" + 0.038*"nucléaire" + 0.016*"production" + 0.016*"renouvelable" + 0.015*"solaire" + 0.015*"développer" + 0.014*"éolien" + 0.014*"centrale" + 0.014*"électricité" + 0.012*"éolienne"
3 : 0.046*"écologique" + 0.041*"transition" + 0.025*"transition_écologique" + 0.022*"falloir" + 0.021*"faire" + 0.012*"citoyen" + 0.011*"prendre" + 0.010*"taxer" + 0.009*"politique" + 0.009*"écologie"
4 : 0.027*"ville" + 0.016*"zone" + 0.014*"france" + 0.013*"centrer" + 0.013*"grand" + 0.010*"pays" + 0.010*"commercial" + 0.010*"urbain" + 0.010*"pollution" + 0.009*"voir"
5 : 0.019*"animal" + 0.014*"environnement" + 0.012*"public" + 0.010*"placer" + 0.009*"santé" + 0.009*"biodiversité" + 0.007*"protection" + 0.007*"mettre" + 0.007*"chasser" + 0.007*"environnemental"
6 : 0.063*"produit" + 0.027*"a
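
The topic descriptions printed above are plain strings. If you want the words and weights as data, one option is to parse them, as in the minimal sketch below. The `parse_topic` helper is hypothetical and the input is a shortened copy of topic 2. Gensim can also return such pairs directly, for example via `show_topics(formatted=False)`.

```python
import re


def parse_topic(description):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]


# First four terms of topic 2 from the output above.
topic_2 = '0.062*"énergie" + 0.038*"nucléaire" + 0.016*"production" + 0.016*"renouvelable"'
for word, weight in parse_topic(topic_2):
    print(f"{word:<15}{weight:.3f}")
```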

Another way of inspecting the topics is by visualizing them. This can be done with the [pyLDAvis](https://github.com/bmabey/pyLDAvis) library. PyLDAvis will show us how popular the topics are in our corpus, how similar the topics are, and which are the most salient words for each topic. Note that it's important to set `sort_topics=False` in the call to pyLDAvis. If you don't, it will order the topics differently from Gensim.

In [13]:
# Note: in pyLDAvis 3.3 and later, this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
import warnings

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)



Finally, let's inspect the topics the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, which makes its results very interpretable.

In [14]:
for (text, doc) in zip(texts[:20], docs[:20]):
    print(text)
    print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.1])
    

Multiplier les centrales géothermiques
[(2, 0.7749154)]
Les problèmes auxquels se trouve confronté l’ensemble de la planète et que dénoncent, dans le plus parfait désordre, les gilets jaunes de France ne sont-ils pas dus, avant tout, à la surpopulation mondiale ? Cette population est passée d’1,5 milliards d’habitants en 1900 à 7 milliards en 2020 et montera bientôt à 10 milliards vers 2040.  Avec les progrès de la communication dans ce village mondial, chaque individu, du fin fond de l’Asie au fin fond de l’Afrique, en passant par les « quartiers » et les « campagnes » de notre pays, aspire à vivre – et on ne peu l’en blâmer – comme les moins mal lotis de nos concitoyens (logement, nourriture, biens de consommation, déplacement,etc.).  Voilà la mère de tous les problèmes. Si tel est bien le cas, la solution à tous les problèmes (stabilisation de la croissance démographique, partage des richesses, partage des terres, partage de l’eau, protection de la biodiversité, règlement des confli
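
Once every document has a list of (topic, probability) pairs, a natural next step is to keep only the dominant topic, for example to group related answers. The sketch below shows one way to do that. The `dominant_topic` helper and the toy distributions are hypothetical, and the input uses the same 0-based topic IDs as the model output before the `+1` shift.

```python
def dominant_topic(topic_probs, threshold=0.1):
    """Return the 1-based ID and probability of the strongest topic, or None if none passes the threshold."""
    candidates = [(prob, topic) for topic, prob in topic_probs if prob > threshold]
    if not candidates:
        return None
    prob, topic = max(candidates)
    return topic + 1, prob


# Hypothetical (topic, probability) pairs, invented for this example.
toy_distributions = [
    [(1, 0.77), (4, 0.08)],
    [(0, 0.05), (2, 0.06)],
    [(2, 0.41), (8, 0.45)],
]
labels = [dominant_topic(d) for d in toy_distributions]
print(labels)  # [(2, 0.77), None, (9, 0.45)]
```

Documents without any topic above the threshold get `None`, which makes them easy to set aside for manual inspection.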

## Conclusions

Many collections of unstructured texts don't come with any labels. Topic models such as Latent Dirichlet Allocation are a useful technique to discover the most prominent topics in such documents. Gensim makes training these topic models easy, and pyLDAvis presents the results in a visually attractive way. Together they form a powerful toolkit to better understand what's inside large sets of documents, and to explore subsets of related texts. While these results are often very revealing already, it's also possible to use them as a starting point, for example for a labelling exercise for supervised text classification. Although traditional topic models lack richer semantic information (they don't use word embeddings, for instance), they should be in every NLPer's toolkit as a really quick way of getting insights into large collections of documents.