# Topic modeling : LDA

In [36]:
# from jyquickhelper import add_notebook_menu
# add_notebook_menu()

## Dataset : Grand Débat National (Great national debate)

The aim of this exercise is to be familiar with the text-mining and topic models such as LDA. One of the contexts where topic modeling is very useful is in open-ended questions. It allows us to explore the variation of topics addressed in people's responses. For this we will use the french "Grand Débat National" dataset. This dataset presents a complete set of responses from the [Grand Débat National](https://granddebat.fr/), the public debate organized by President Macron. The purpose of the debate was to better understand the needs and opinions of the French people following the Yellow Vests protests. The results of this debate are now available as [open data](https://granddebat.fr/pages/donnees-ouvertes).

## 1. Import data

**Question 1 :** Download one of the ecological transition csv files and load the content into a pandas dataframe. Name this variable `raw_data`

In [37]:
import pandas as pd

raw_data = pd.read_csv('/Users/ibrahim/Desktop/AIS/S2/ML 2 -  Bayesian and Unsupervised Methods/Baysian 3/data_debat_1.csv',low_memory=False)

In [38]:
raw_data.head()

Unnamed: 0,id,createdAt,publishedAt,updatedAt,authorId,authorType,authorZipCode,QUXVlc3Rpb246NTc= - Pensez-vous que vos actions en faveur de l'environnement peuvent vous permettre de faire des économies ?,"QUXVlc3Rpb246NTU= - Diriez-vous que vous connaissez les aides et dispositifs qui sont aujourd'hui proposés par l'Etat, les collectivités, les entreprises et les associations pour l'isolation et le chauffage des logements, et pour les déplacements ?",QUXVlc3Rpb246NTg= - Pensez-vous que les taxes sur le diesel et sur l’essence peuvent permettre de modifier les comportements des utilisateurs ?,QUXVlc3Rpb246NjI= - À quoi les recettes liées aux taxes sur le diesel et l’essence doivent-elles avant tout servir ?,"QUXVlc3Rpb246NjE= - Selon vous, la transition écologique doit être avant tout financée :",QUXVlc3Rpb246NjA= - Et qui doit être en priorité concerné par le financement de la transition écologique ?,"QUXVlc3Rpb246NTk= - Que faudrait-il faire pour protéger la biodiversité et le climat tout en maintenant des activités agricoles et industrielles compétitives par rapport à leurs concurrents étrangers, notamment européens ?"
0,00001c11-2b88-11e9-bf56-fa163eeb11e1,2019-02-08 10:57:52,2019-02-08 11:02:12,2019-02-08 11:02:12,VXNlcjpjMDhhMmI0YS0yYjg3LTExZTktYmY1Ni1mYTE2M2...,Citoyen / Citoyenne,38100,Oui,Oui,Oui,À financer des aides pour accompagner les Fran...,Les deux,Tout le monde,Cofinancer un plan d’investissement pour chang...
1,00007c25-2711-11e9-94d2-fa163eeb11e1,2019-02-02 18:35:57,2019-02-02 18:35:57,2019-02-02 18:35:57,VXNlcjo3MjdhYWQ4Mi0yNzEwLTExZTktOTRkMi1mYTE2M2...,Citoyen / Citoyenne,64240,Oui,Oui,Oui,À financer des aides pour accompagner les Fran...,Par le budget général de l’État,Les entreprises,Cofinancer un plan d’investissement pour chang...
2,0000b21f-1f25-11e9-94d2-fa163eeb11e1,2019-01-23 16:38:58,2019-01-23 16:38:58,2019-01-23 16:38:58,VXNlcjo1NjM1NjBmNi0xZTVkLTExZTktOTRkMi1mYTE2M2...,Citoyen / Citoyenne,91630,Oui,Non,Oui,À financer des aides pour accompagner les Fran...,Par la fiscalité écologique,Tout le monde,Taxer les produits importés qui dégradent l’en...
3,0000c8c4-2265-11e9-94d2-fa163eeb11e1,2019-01-27 19:54:39,2019-01-27 19:55:33,2019-01-27 19:55:33,VXNlcjo3YzcyMmYyZS0yMjY0LTExZTktOTRkMi1mYTE2M2...,Citoyen / Citoyenne,92100,Non,Non,Non,À baisser d’autres impôts comme par exemple l’...,,Les entreprises,Modifier les accords commerciaux
4,0001704a-434b-11e9-bf56-fa163eeb11e1,2019-03-10 16:41:41,2019-03-10 16:41:41,2019-03-10 16:41:41,VXNlcjo4NmJlOTEwNS00MzRhLTExZTktYmY1Ni1mYTE2M2...,Citoyen / Citoyenne,92320,Oui,Non,Non,À financer des aides pour accompagner les Fran...,Par le budget général de l’État,Tout le monde,Cofinancer un plan d’investissement pour chang...


We will focus on the last question: ``Y a-t-il d'autres points sur la transition écologique sur lesquels vous souhaiteriez vous exprimer ?`` We hope that our LDA model will help us to analyze the topics on which their responses are focused. Let's take a look on the data :

In [39]:
question = "QUXVlc3Rpb246NjI= - À quoi les recettes liées aux taxes sur le diesel et l’essence doivent-elles avant tout servir ?"
raw_data[question].head(10)

0    À financer des aides pour accompagner les Fran...
1    À financer des aides pour accompagner les Fran...
2    À financer des aides pour accompagner les Fran...
3    À baisser d’autres impôts comme par exemple l’...
4    À financer des aides pour accompagner les Fran...
5    À baisser d’autres impôts comme par exemple l’...
6    À financer des aides pour accompagner les Fran...
7    À financer des investissements en faveur du cl...
8    À financer des aides pour accompagner les Fran...
9    À financer des aides pour accompagner les Fran...
Name: QUXVlc3Rpb246NjI= - À quoi les recettes liées aux taxes sur le diesel et l’essence doivent-elles avant tout servir ?, dtype: object

As we note, there is a lot of missing data (like any open-ended question, people decide whether or not to write a comment). A cleanup step is necessary.

## 2. Clean and vectorize documents

Before training our LDA model, we need to tokenize our text. We will tokenize with the [spaCy]  (https://spacy.io/) library because we will only perform some basic preprocessing. We will just initialize a blank template for the French language.

In [40]:
import spacy

nlp = spacy.blank("fr")

Let's remove all the rows from the dataframe that don't have an answer for our question (the `NaNs above). This new dataframe will be called ``texts``.

In [41]:
texts = raw_data[raw_data[question].notna()]

First preprocessing with spacy :

In [42]:
spacy_docs = list(nlp.pipe(texts))

We now have a list of spaCy documents. We will transform each spaCy document into a list of tokens. Instead of the original tokens, we will work with lemmas instead. This will allow our model to generalize better

Here is the full list of preprocessing used: 
 
- remove all **words less than 3 characters**,
- remove all **stop-words**, and
- lemmatize** the remaining words and,
- put these words in **minuscule**.

In [43]:
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 3 and not t.is_stop] for doc in spacy_docs]
print(docs[:3])

[[], ['createdat'], ['publishedat']]


In order to preserve some word order in our modeling, we will take into account frequent bigrams. For this, we will use the [Gensim](https://radimrehurek.com/gensim/)library. We would like to point out that the Gensim library is an excellent NLP library for topics modeling. 

Here is the chosen process: 

- We first identify the frequent bigrams in the corpus, 
- then we add them to the list of tokens for the documents in which they appear. This means that the bigrams will not be in their correct position in the text, but this is not a problem: topic models are bag-of-words models that ignore the position of words anyway.

In [44]:
import re
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:  # les bigrammes peuvent être reconnus par "_" qui concatène les mots
            docs[idx].append(token)

Let's move on to the last steps of the Gensim specific preprocessing. First, we will create a dictionary representation of the documents. This dictionary will map each word to a unique identifier and will help us create word-sack representations of each document. These bag-of-words representations contain the identifiers of the words in the document as well as their frequency. In addition, we can remove the least frequent and most frequent words from the vocabulary. This will improve the quality of our model and speed up its training. The minimum frequency of a word is expressed as an absolute number, the maximum frequency is the proportion of documents in which a word can appear.

In [45]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
print('Number of unique words in original documents :', len(dictionary))

dictionary.filter_extremes(no_below=3, no_above=0.25)
print('Number of unique words after removing rare and common words :', len(dictionary))

print("Example representation of document 3 :", dictionary.doc2bow(docs[2]))

Number of unique words in original documents : 61
Number of unique words after removing rare and common words : 1
Example representation of document 3 : []


Next, we create bag-of-words representations for each document in the corpus see method [doc2bow](https://radimrehurek.com/gensim/corpora/dictionary.html) :

In [46]:
corpus = [ dictionary.doc2bow(doc) for doc in docs]

## 3. Topic Modeling with LDA

Now it's time to train our LDA! To do this, we use the following parameters: 

- **corpus**: the bag-of-words representations of our documents
- **id2token**: the mapping of indexes to words
- **num_topics** : the number of topics the model should identify (let's set <font color = "red"><b>10</b></font>)
- **chunksize**: the number of documents the model sees on each update (let's set to <font color = "red"><b>1,000</b></font>)
- **passes**: the number of times we show the total corpus to the model during training (let's set to <font color = "red"><b>5</b></font>)
- **random_state**: we use a seed to ensure reproducibility (let's set to <font color = "red"><b>1</b></font>)

On a corpus of this size, training usually takes one or two minutes.

**Question 2 :**

In [47]:
from gensim.models import LdaModel

lda_model = LdaModel(corpus,
                     num_topics=10,
                     id2word=dictionary,
                     random_state=1,
                     chunksize=1000,
                     passes=5)

In [48]:
from gensim.models import LdaMulticore

lda_model_multi = LdaModel(corpus,
                     num_topics=10,
                     id2word=dictionary,
                     random_state=1,
                     chunksize=1000,
                     passes=5)

## 4. Results and visualization

**Question 3 :** Let's see what the model has learned. To do this, let's display the ten most characteristic words for each topic.

In [49]:
for (topic, words) in lda_model.print_topics():
    print("***********")
    print("* topic", topic+1, "*")
    print("***********")
    print(topic+1, ":", words)
    print()

***********
* topic 1 *
***********
1 : 1.000*"-vous"

***********
* topic 2 *
***********
2 : 1.000*"-vous"

***********
* topic 3 *
***********
3 : 1.000*"-vous"

***********
* topic 4 *
***********
4 : 1.000*"-vous"

***********
* topic 5 *
***********
5 : 1.000*"-vous"

***********
* topic 6 *
***********
6 : 1.000*"-vous"

***********
* topic 7 *
***********
7 : 1.000*"-vous"

***********
* topic 8 *
***********
8 : 1.000*"-vous"

***********
* topic 9 *
***********
9 : 1.000*"-vous"

***********
* topic 10 *
***********
10 : 1.000*"-vous"



In [50]:
for (topic, words) in lda_model_multi.print_topics():
    print("***********")
    print("* topic", topic+1, "*")
    print("***********")
    print(topic+1, ":", words)
    print()

***********
* topic 1 *
***********
1 : 1.000*"-vous"

***********
* topic 2 *
***********
2 : 1.000*"-vous"

***********
* topic 3 *
***********
3 : 1.000*"-vous"

***********
* topic 4 *
***********
4 : 1.000*"-vous"

***********
* topic 5 *
***********
5 : 1.000*"-vous"

***********
* topic 6 *
***********
6 : 1.000*"-vous"

***********
* topic 7 *
***********
7 : 1.000*"-vous"

***********
* topic 8 *
***********
8 : 1.000*"-vous"

***********
* topic 9 *
***********
9 : 1.000*"-vous"

***********
* topic 10 *
***********
10 : 1.000*"-vous"



Another way to observe topics is to **visualize** them. This can be done with the library [pyLDAvis](https://github.com/bmabey/pyLDAvis). PyLDAvis will show us how popular the topics are in our corpus, how similar the topics are, and which words are most important for that topic. Note that it is important to set ``sort_topics = False`` on the call to pyLDAvis. If you don't, the topics will be sorted differently than in Gensim. This may take a few minutes to load.

**Question 5 :**

In [51]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import warnings

pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning) 

gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)

  default_term_info = default_term_info.sort_values(


Finally, let's look at the topics that the model recognizes in some of the individual documents. Here we see how LDA tends to assign a high probability to a small number of topics for each document, making its results highly interpretable.

In [52]:
# Nous en affichons que 8
i = 0
for (text, doc) in zip(texts[:8], docs[:8]):
    i += 1
    print("***********")
    print("* doc", i, "  *")
    print("***********")
    print(text)
    print([(topic+1, prob) for (topic, prob) in lda_model[dictionary.doc2bow(doc)] if prob > 0.1])
    print()

***********
* doc 1   *
***********
id
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 2   *
***********
createdAt
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 3   *
***********
publishedAt
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 4   *
***********
updatedAt
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 5   *
***********
authorId
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 6   *
***********
authorType
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1), (10, 0.1)]

***********
* doc 7   *
***********
authorZipCode
[(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.1), 

Many collections of unstructured text are not accompanied by labels. Topic models such as LDA are a useful technique for discovering the most important topics in these documents. **Gensim** facilitates learning about these topics and **pyLDAvis** presents the results in a visually appealing way. Together, they form a powerful toolkit for better understanding what's inside large document sets and for exploring subsets of related texts. While these results are often already quite revealing, it is also possible to use them as a starting point, for example, for a labeling exercise for supervised text classification. In sum, thematic models should be in every data scientist's toolbox as a very quick way to gain insight into large document collections.