In [None]:
#|hide
#|default_exp discovering_topics

# Discovering and Visualizing Topics in Texts
(follows: )

Often texts are just that: texts, without metadata or labels that tell us what they are about. In such cases we can use topic models, a form of unsupervised machine learning, to discover the topics discussed in the texts.

Topics: groups of related words that often occur together in texts. Topic models find these clusters of related words; humans then interpret the clusters and assign them labels.

Popular topic model: Latent Dirichlet Allocation (LDA). It places a prior (a Dirichlet probability distribution) on the mixture of topics that each text will have. LDA is often used to analyze the answers to open-ended survey questions.
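
To build some intuition for the Dirichlet prior, the following minimal sketch samples topic mixtures using only the Python standard library. The function name `sample_dirichlet`, the concentration values and the seed are illustrative assumptions and not part of the original analysis. The point is only that a small concentration parameter tends to put most of the probability mass on a few topics, whereas a large one tends to spread it more evenly.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior.

    Uses the standard construction: draw independent Gamma(alpha, 1)
    variables and normalize them so that they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical concentration values, chosen only to contrast the two regimes.
sparse_mix = sample_dirichlet(alpha=0.1, num_topics=10, rng=rng)
flat_mix = sample_dirichlet(alpha=10.0, num_topics=10, rng=rng)

print("alpha=0.1 :", [round(p, 2) for p in sparse_mix])
print("alpha=10.0:", [round(p, 2) for p in flat_mix])
```

Each printed list is one possible topic mixture for a single text; the entries always sum to 1.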

Here we will use the data from the Grand Débat National in France.

In [None]:
#| export
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

In [None]:
#| export
f = "/home/peter/Documents/data/nlp/la-transition-ecologique.csv"
df = pd.read_csv(f)


In [None]:
#| export
df.columns

Index(['id', 'reference', 'title', 'createdAt', 'publishedAt', 'updatedAt',
       'trashed', 'trashedStatus', 'authorId', 'authorType', 'authorZipCode',
       'QUXVlc3Rpb246MTYw - Quel est aujourd'hui pour vous le problème concret le plus important dans le domaine de l'environnement ?',
       'QUXVlc3Rpb246MTYx - Que faudrait-il faire selon vous pour apporter des réponses à ce problème ?',
       'QUXVlc3Rpb246MTQ2 - Diriez-vous que votre vie quotidienne est aujourd'hui touchée par le changement climatique ?',
       'QUXVlc3Rpb246MTQ3 - Si oui, de quelle manière votre vie quotidienne est-elle touchée par le changement climatique ?',
       'QUXVlc3Rpb246MTQ4 - À titre personnel, pensez-vous pouvoir contribuer à protéger l'environnement ?',
       'QUXVlc3Rpb246MTQ5 - Si oui, que faites-vous aujourd'hui pour protéger l'environnement et/ou que pourriez-vous faire ?',
       'QUXVlc3Rpb246MTUw - Qu'est-ce qui pourrait vous inciter à changer vos comportements comme par exemple mieux en

We will focus on the contents of the last, open question of the questionnaire:

In [None]:
#| export
question = "QUXVlc3Rpb246MTU5 - Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?"
df[question].head(10)

0               Multiplier les centrales géothermiques
1    Les problèmes auxquels se trouve confronté l’e...
2                                                  NaN
3                                                  NaN
4      Une vrai politique écologique et non économique
5    Les bonnes idées ne grandissent que par le par...
6    Pédagogie dans ce sens là dés la petite école ...
7                                                  NaN
8    faire de l'écologie incitative et non punitive...
9                                                  NaN
Name: QUXVlc3Rpb246MTU5 - Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?, dtype: object

## Preprocessing

Before we can train a model, we need to tokenize the texts. For this we use the spaCy NLP library. The original author used a blank model, which no longer works here, so we load the pretrained French pipeline `fr_core_news_sm` instead.

In [None]:
#| export
import spacy
nlp = spacy.load('fr_core_news_sm')

There are 4 NaNs in the first 10 answers alone, so we drop the missing values and keep all remaining texts in the target column.

In [None]:
#| export
texts = df[df[question].notnull()][question]

Next we use spaCy to perform the first pre-processing pass:

In [None]:
#| export
%time spacy_docs = list(nlp.pipe(texts))

CPU times: user 8min 2s, sys: 3.2 s, total: 8min 6s
Wall time: 8min 6s


Now we have a list of spaCy documents that we need to transform into lists of tokens. We will work with lemmas so that inflected forms of the same word are counted together. These are the pre-processing steps:

- remove all words of 3 characters or fewer (short words are interesting for sentiment analysis, but not so much for topic analysis)
- drop all stopwords
- take the lemmas of all remaining words and lowercase them
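
The three rules above can be tried out on a toy sentence without loading spaCy. In this sketch the `Token` tuple, the `preprocess` helper and the hand-written lemmas and stopword flags are assumptions made for illustration; they are not output of the `fr_core_news_sm` pipeline.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for the spaCy token attributes used here."""
    orth: str      # surface form of the word
    lemma: str     # dictionary form
    is_stop: bool  # True if the word is a stopword


def preprocess(tokens):
    """Keep lowercased lemmas of non-stopwords longer than 3 characters."""
    return [t.lemma.lower() for t in tokens if len(t.orth) > 3 and not t.is_stop]


# "L'eau est une ressource rare" with hand-assigned lemmas and stopword flags.
sentence = [
    Token("L'", "le", True),
    Token("eau", "eau", False),
    Token("est", "être", True),
    Token("une", "un", True),
    Token("Ressource", "Ressource", False),
    Token("rare", "rare", False),
]

result = preprocess(sentence)
print(result)  # ['ressource', 'rare']
```

Note that the content word "eau" is dropped by the length rule even though it is not a stopword, which is the price of this simple filter.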

In [None]:
#| export
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]

`docs` is a list of lists; each inner list contains the lemmas of one survey participant's answer.

We also want to take frequent bigrams into account when topic modelling. In French they often carry important meaning ("poids lourds" = "trucks").

For this we use the Python Gensim library:

- identify frequent bigrams in the corpus
- append these to the list of tokens for the documents in which they appear
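
The idea behind these two steps can be illustrated in plain Python before handing the work to Gensim. This is a simplified sketch and not Gensim's implementation: the helper names `find_bigrams` and `append_bigrams`, the scoring formula and the `min_count` and `threshold` values are assumptions chosen for a tiny toy corpus.

```python
from collections import Counter


def find_bigrams(docs, min_count=2, threshold=0.5):
    """Return adjacent word pairs that co-occur unusually often.

    The score rewards pairs that are frequent relative to the frequencies
    of their two words (an illustrative formula, assumed for this sketch).
    """
    unigrams = Counter(w for doc in docs for w in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    frequent = set()
    for (a, b), n_ab in pairs.items():
        if n_ab < min_count:
            continue
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            frequent.add((a, b))
    return frequent


def append_bigrams(doc, frequent):
    """Return the document with its frequent bigrams appended as 'a_b' tokens."""
    joined = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
    return doc + joined


toy_docs = [
    ["taxer", "poids", "lourd", "autoroute"],
    ["poids", "lourd", "pollution"],
    ["poids", "lourd", "train"],
    ["train", "pollution", "taxer"],
]

frequent = find_bigrams(toy_docs)
augmented = append_bigrams(toy_docs[0], frequent)
print(frequent)   # {('poids', 'lourd')}
print(augmented)  # ['taxer', 'poids', 'lourd', 'autoroute', 'poids_lourd']
```

The original unigrams stay in the document; the bigram is simply added as an extra token, which is harmless for a bag-of-words model.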

In [None]:
#| export
import re
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
  for token in bigram[docs[idx]]:
    if '_' in token: # bigrams can be picked out by using the '_' that joins the individual words
      docs[idx].append(token) # appended to the end, but topic modelling is BoW, so order is not important!

Let's have a look at the fifth document:

In [None]:
#| export
docs[4]

['pédagogie',
 'sens',
 'petit',
 'école',
 'sensibilisation',
 'parc',
 'naturel',
 'enfant',
 'devenir',
 'prescripteur',
 'génération',
 'futur',
 'urgence',
 'parc_naturel',
 'génération_futur']

Perfect: this document contains two bigrams that are frequent across the corpus, `parc_naturel` and `génération_futur`.

Next, the final Gensim-specific pre-processing steps:

- create a dictionary representation of the documents; the dictionary maps each word to a unique ID so that we can build BoW representations of each document. The dictionary contains the IDs of the words in the documents and their frequencies;
- remove the least and most frequent words from the vocabulary, which makes training faster and tends to improve quality. The minimum frequency is expressed as an absolute number of documents (`no_below`), the maximum frequency as the proportion of documents a word is allowed to occur in (`no_above`):
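
To make the two thresholds concrete, here is a minimal pure-Python sketch of this kind of vocabulary filtering on a toy corpus. The helper `filter_vocabulary` is hypothetical and only mirrors the documented meaning of `no_below` and `no_above`; it is not Gensim's code and ignores its other options.

```python
from collections import Counter


def filter_vocabulary(docs, no_below, no_above):
    """Keep words that occur in at least `no_below` documents and in
    at most a fraction `no_above` of all documents."""
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    max_docs = no_above * len(docs)
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["énergie", "nucléaire", "taxe"],
    ["énergie", "solaire", "taxe"],
    ["énergie", "éolien"],
    ["énergie", "nucléaire"],
]

kept = filter_vocabulary(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['nucléaire', 'taxe']
```

Here "énergie" is removed because it occurs in every document, while "solaire" and "éolien" are removed because they occur in only one.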

In [None]:
#| export
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print(f"Number of unique words in original documents: {len(dictionary)}")

dictionary.filter_extremes(no_below=3, no_above=0.25)
print(f"Number of unique words after removing rare and common words: {len(dictionary)}")

# Let's look at an example document:
print(f"Example representation of document 5: {dictionary.doc2bow(docs[5])}")

Number of unique words in original documents: 80955
Number of unique words after removing rare and common words: 32718
Example representation of document 5: [(191, 1), (192, 1), (193, 1), (194, 1), (195, 1), (196, 1), (197, 1), (198, 1), (199, 1), (200, 1)]


Next, we create bag-of-words (BoW) representations for each of the documents in the corpus:

In [None]:
#| export
corpus = [dictionary.doc2bow(doc) for doc in docs]
corpus[5]

[(191, 1),
 (192, 1),
 (193, 1),
 (194, 1),
 (195, 1),
 (196, 1),
 (197, 1),
 (198, 1),
 (199, 1),
 (200, 1)]
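
Each pair in this output is a word ID together with the number of times that word occurs in the document. The following sketch reproduces the idea in plain Python on a toy vocabulary; the helpers `build_token2id` and `doc_to_bow` are hypothetical names, and the IDs they assign will not match those of the Gensim dictionary.

```python
from collections import Counter


def build_token2id(docs):
    """Assign consecutive integer IDs to words in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (word_id, count) pairs; words not in the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["énergie", "nucléaire", "énergie"], ["transport", "train"]]
token2id = build_token2id(toy_docs)

bow = doc_to_bow(["énergie", "train", "énergie", "inconnu"], token2id)
print(token2id)  # {'énergie': 0, 'nucléaire': 1, 'transport': 2, 'train': 3}
print(bow)       # [(0, 2), (3, 1)]
```

Word order is lost in this representation, which is why appending the bigrams at the end of each document earlier does no harm.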

## Training

In [None]:
#| export
from gensim.models import LdaModel

%time model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, chunksize=1000, passes=5, random_state=1)

CPU times: user 2min 55s, sys: 843 ms, total: 2min 55s
Wall time: 1min 28s


## Results

What did the model learn? We start by printing the 10 most characteristic words for each of the topics. Some of the topics are general, but others are more precise:

In [None]:
#| export
for (topic, words) in model.print_topics():
  print(topic + 1, ":", words)

1 : 0.034*"agriculture" + 0.024*"animal" + 0.020*"pesticide" + 0.020*"agriculteur" + 0.018*"produit" + 0.014*"santé" + 0.012*"environnement" + 0.011*"agricole" + 0.011*"interdire" + 0.011*"production"
2 : 0.048*"voiture" + 0.038*"véhicule" + 0.037*"électrique" + 0.017*"solution" + 0.017*"batterie" + 0.016*"diesel" + 0.014*"falloir" + 0.014*"voiture_électrique" + 0.014*"moteur" + 0.013*"pollution"
3 : 0.012*"france" + 0.011*"bien" + 0.008*"année" + 0.007*"grand" + 0.007*"pays" + 0.006*"pouvoir" + 0.006*"français" + 0.006*"voir" + 0.006*"temps" + 0.005*"aucun"
4 : 0.061*"énergie" + 0.032*"nucléaire" + 0.018*"production" + 0.017*"renouvelable" + 0.016*"éolien" + 0.015*"solaire" + 0.013*"énergie_renouvelable" + 0.012*"électricité" + 0.012*"centrale" + 0.012*"développer"
5 : 0.074*"transport" + 0.017*"taxer" + 0.015*"avion" + 0.014*"train" + 0.014*"camion" + 0.014*"route" + 0.013*"commun" + 0.012*"transport_commun" + 0.011*"ligne" + 0.011*"routier"
6 : 0.030*"écologique" + 0.026*"transition

Some interesting topics:

- agriculture (topic 1)
- vehicles (topic 2)
- energy (topic 4)
- waste and recycling (topic 8)
- tax incentives (topic 9)

In [None]:
#| export
import pyLDAvis.gensim_models
import warnings

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning)

pyLDAvis.gensim_models.prepare(model, corpus, dictionary, sort_topics=False)


Let's check the topics the model assigns to some individual documents. LDA typically assigns a high probability to only a small number of topics per document:
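
The filtering used in the cell that follows can be tried in isolation on a made-up topic distribution. In this sketch the helper `prominent_topics` and the probabilities in `example` are hypothetical; real values come from the trained model as (topic index, probability) pairs.

```python
def prominent_topics(topic_probs, min_prob=0.1):
    """Keep topics above `min_prob`, renumbered from 1, strongest first."""
    kept = [(topic + 1, prob) for topic, prob in topic_probs if prob > min_prob]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical (topic_index, probability) pairs for a single document.
example = [(0, 0.02), (3, 0.62), (4, 0.05), (8, 0.29), (9, 0.02)]

result = prominent_topics(example)
print(result)  # [(4, 0.62), (9, 0.29)]
```

Adding 1 to the topic index keeps the numbering consistent with the topic list printed earlier, which also starts at 1.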

In [None]:
#| export
for (text, doc) in zip(texts.iloc[:10], docs[:10]):
    print(text)
    print([(topic+1, prob) for (topic, prob) in model[dictionary.doc2bow(doc)] if prob > 0.1])

Multiplier les centrales géothermiques
[(4, 0.77493656)]
Les problèmes auxquels se trouve confronté l’ensemble de la planète et que dénoncent, dans le plus parfait désordre, les gilets jaunes de France ne sont-ils pas dus, avant tout, à la surpopulation mondiale ? Cette population est passée d’1,5 milliards d’habitants en 1900 à 7 milliards en 2020 et montera bientôt à 10 milliards vers 2040.  Avec les progrès de la communication dans ce village mondial qu'est notre planète, chaque individu, du fin fond de l’Asie au fin fond de l’Afrique, en passant par les « quartiers » et les « campagnes » de notre pays, aspire à vivre – et on ne peu l’en blâmer – comme les moins mal lotis de nos concitoyens (logement, nourriture, biens de consommation, déplacement, etc.).  Voilà la mère de tous les problèmes. Si tel est bien le cas, la solution à tous les problèmes (stabilisation de la croissance démographique, partage des richesses, partage des terres, partage de l’eau, protection de la biodiversit