# Topic Modeling

The goal of this notebook is two-fold:
* identify the main topics of the corpus through an unsupervised approach (topic modeling and clustering)
* build a text classifier based on these topics (supervised learning).

### Contents

__Preliminaries__
* a. Overview of text modeling techniques
* b. Imports

__1. LDA__

__2. Clustering__
* a. Tf-Idf for Clustering
* b. K-Means
* c. DBSCAN

__3. Topic classifier__

What's next

# Preliminaries

__a. Overview of text modeling techniques__

* [LDA](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)

* Clustering. To build clusters, we first have to transform our news articles into a numerical representation that models can handle. Here we go for good old tf-idf vectors (we will test more state-of-the-art techniques in the next steps). Algorithms: K-Means and DBSCAN seem like the most relevant options to begin with.

* ?

__b. Imports__

If you have not already done so, please download these NLTK dependencies by running the following lines:

> nltk.download('wordnet')

> nltk.download('stopwords')

In [1]:
# Classic packages

import os 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import random 
random.seed(a=2905) # set random seed 
import pickle


# NLP packages

import gensim
from gensim import corpora

import spacy
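try: 
    nlp = spacy.load("fr_core_news_sm") # load pre-trained models for French
    print("fr_core_news_sm loaded") # printed only once the model is actually loaded
except OSError: # spaCy raises OSError when the model cannot be found
    nlp = spacy.load('fr') # shortcut to fr_core_news_sm in older spaCy versions
    print("fr loaded")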

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer # not adapted to French?
from nltk.stem.snowball import FrenchStemmer # already something 
# --> Lemmatizer would be better --> use spaCy lemmatizer

# ML with sklearn
import sklearn.cluster
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer 
from sklearn.feature_extraction.text import  TfidfVectorizer


fr_core_news_sm loaded


In [2]:
# data
news_df=pd.read_csv("./articles.csv")

# 1. Latent Dirichlet Allocation (LDA)


Main idea: Each document is represented as a distribution over topics, and each topic is represented as a distribution over words.
The number of topics is not known in advance: we set it arbitrarily, as a hyperparameter, and check whether the resulting topics are relevant.
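
To make the two distributions concrete, here is a minimal sketch with made-up numbers (the topic names, words, and probabilities are hypothetical and not taken from the corpus). Under LDA's generative view, the probability of a word in a document is the topic-weighted sum of the per-topic word probabilities.

```python
# Toy illustration of the LDA view of a document (all numbers are invented).

# Each topic is a probability distribution over words.
topic_word = {
    "politics": {"election": 0.5, "president": 0.4, "rocket": 0.1},
    "space": {"election": 0.05, "president": 0.05, "rocket": 0.9},
}

# Each document is a probability distribution over topics.
doc_topic = {"politics": 0.7, "space": 0.3}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())


vocabulary = ["election", "president", "rocket"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}
print(doc_word)
```

Training an LDA model amounts to estimating both tables from the observed word counts, which is what gensim does further down.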

Code freely adapted from this [TDS post](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21).



__Text cleaning for LDA__

* tokenize words (here using the spaCy parser for French)
* lemmatize (using the spaCy French lemmatizer; NLTK's WordNetLemmatizer is not adapted to French)
* remove stop words (using the default NLTK stop word list for French)
* apply the pipeline to titles and article texts

In [3]:
## spaCy tokenizer for LDA

from spacy.lang.fr import French
parser = French() # blank French pipeline: tokenizer only, no statistical model needed

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [4]:
# stopwords removal
fr_stop = set(nltk.corpus.stopwords.words('french'))
my_fr_stop=fr_stop.union({'un' ,'deux','trois','quatre','cinq','six','sept','huit','neuf','dix','onze', 'douze', 
                          'treize','quatorze', 'quinze', 'seize', 'vingt', 'trente', 
                          'quarante', 'cinquante','soixante','cent'})
#'mille'  'million' 'milliard' 'billion' are not added to the stopwords list because they are more discriminative

print(my_fr_stop)

{'une', 'es', 'aurions', 'fussions', 'serez', 'serait', 's', 'au', 'quarante', 'y', 'serais', 'fusses', 'étaient', 'serions', 'notre', 'dans', 'étantes', 'dix', 'il', 'sa', 'trois', 'suis', 'ayante', 'aies', 'ayantes', 'des', 'serai', 'ton', 'ne', 'te', 'aurez', 'auras', 'c', 'soixante', 'eûmes', 'ses', 'je', 'mon', 'cent', 'pas', 'soit', 'seriez', 'que', 'quatorze', 'ayant', 'fussiez', 'neuf', 'par', 'cinq', 'eût', 'ces', 'ayez', 'aviez', 'fussent', 'en', 'étante', 'auraient', 'avez', 'ait', 'ou', 'deux', 'avais', 'as', 'seront', 'même', 'qui', 'avec', 'un', 'étant', 'tu', 'leur', 'fûmes', 'aient', 'avaient', 'tes', 'étiez', 'lui', 'son', 'sur', 'le', 'est', 'eussions', 'vos', 'qu', 'étés', 'fût', 'se', 'aurai', 'cinquante', 'eurent', 'été', 'eussent', 'elle', 'treize', 'douze', 'quinze', 't', 'n', 'six', 'étée', 'mais', 'la', 'fut', 'l', 'seraient', 'mes', 'toi', 'aurons', 'sommes', 'les', 'eusse', 'êtes', 'étions', 'd', 'm', 'eus', 'sont', 'quatre', 'du', 'eues', 'seize', 'trente', 

In [5]:
# Comparison: nltk french stemmer vs spacy french lemmatizer

# spacy
print("\nLemmas")
doc = nlp(u"les manifestations qui ont agitées la France ces derniers mois")
for token in doc:
    print(token, '-->', token.lemma_)
print("\nStems")
stemmer = FrenchStemmer()
for w in "les manifestations qui ont agitées la France ces derniers mois".split():
    print(w, '-->', stemmer.stem(w))


Lemmas
les --> le
manifestations --> manifestation
qui --> qui
ont --> avoir
agitées --> agiter
la --> le
France --> France
ces --> ce
derniers --> dernier
mois --> mois

Stems
les --> le
manifestations --> manifest
qui --> qui
ont --> ont
agitées --> agit
la --> la
France --> franc
ces --> ce
derniers --> derni
mois --> mois


In [57]:
# Tokenize, drop short tokens and stop words, then lemmatize with spaCy
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4] # arbitrary: keep tokens of 5+ characters
    # note: this filters on fr_stop, not the extended my_fr_stop, so number words such as 'trois' are kept
    tokens = [token for token in tokens if token not in fr_stop]
    doc = nlp(' '.join(tokens))
    tokens = [token.lemma_ for token in doc]
    return tokens  #[t.encode('utf-8') for t in tokens] 

In [58]:
title_tokens = []
text_tokens = []

## Apply on titles ##

for t in news_df.title:
    tokens = prepare_text_for_lda(t)
    title_tokens.append(tokens)
    if random.random() >0.99:
        print(tokens)
        
## Apply on texts ##

for t in news_df.text:
    tokens = prepare_text_for_lda(t)
    text_tokens.append(tokens)
    if random.random() >0.999:
        print(tokens)
        
print(title_tokens[:5])
print(text_tokens[0][:10])

['décembre', 'attentat', 'contre', 'carrero', 'blanco']
['tavernier', 'noiret', 'union', 'sacrer']
['michel', 'hermon']
['droit']
['henri', 'castrier', 'regner', 'royaume']
['maltraitance']
['france', 'activite', 'industriel', 'hausse', 'ralentir']
['avril', 'réacteur', 'central', 'explose', 'après', 'voici', 'premier', 'histoire', 'oser', 'affronter', 'radiation', 'impression', 'entrer', 'tombeau', 'humanité', 'janvier', 'photographe', 'russe', 'victoria', 'ivleva', 'prépare', 'pénétrer', 'central', 'nucléaire', 'tchernobyl', 'maudire', 'voilà', 'presque', 'explosion', 'terrifiante', 'ravager', 'réacteur', 'disperser', 'tout', 'europe', 'poison', 'radioactif', 'rester', 'quart', 'heure', 'escalader', 'vestige', 'matérial', 'tordu', 'entasser', 'fondu', 'revêtir', 'premier', 'combinaison', 'coton', 'blanc', 'second', 'plastique', 'botte', 'gant', 'toqu', 'masqu', 'respirateur', 'victoria', 'ivleva', 'marche', 'interdire', 'homme', 'youri', 'kovsar', 'mikhailov', 'accompagnent', 'enfer'

[['tintin', 'espac'], ['suicide', 'robert', 'boulin'], ['pierre', 'contre', 'certitude'], ['otage', 'soudain', 'mercredi'], ['secret', 'planète', 'rouge']]
['trois', 'semaine', 'station', 'soviétique', 'jean-loup', 'chrétien', 'premier', 'ouest-européen', 'sortir', 'espace']


__Perform LDA with Gensim__

The number of topics is set arbitrarily, just as one would arbitrarily set k in k-means.

In [10]:
dic_title = corpora.Dictionary(title_tokens)
dic_text = corpora.Dictionary(text_tokens)
corpus_title = [dic_title.doc2bow(token) for token in title_tokens]
corpus_text = [dic_text.doc2bow(token) for token in text_tokens]

# save each bag-of-words corpus and its dictionary
with open('corpus_title.pkl', 'wb') as f:
    pickle.dump(corpus_title, f)
dic_title.save('dic_title.gensim')
with open('corpus_text.pkl', 'wb') as f:
    pickle.dump(corpus_text, f)
dic_text.save('dic_text.gensim')
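
For readers unfamiliar with gensim, the sketch below mimics in plain Python what `corpora.Dictionary` and `doc2bow` compute: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is an illustration only: the helper names are hypothetical, and the ids gensim assigns may be in a different order.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_ids = {}
    for tokens in documents:
        for token in tokens:
            token_ids.setdefault(token, len(token_ids))
    return token_ids


def to_bow(tokens, token_ids):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token_ids[t] for t in tokens if t in token_ids)
    return sorted(counts.items())


docs = [["suicide", "robert", "boulin"], ["secret", "planète", "rouge", "secret"]]
token_ids = build_token_ids(docs)
bows = [to_bow(d, token_ids) for d in docs]
print(bows)
```

Word order is lost in this representation, which is why LDA is described as a bag-of-words model.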

In [11]:
print([k for k in dic_title.values()])

['espace', 'tintin', 'boulin', 'robert', 'suicide', 'certitude', 'contre', 'pierre', 'mercredi', 'otages', 'soudain', 'planète', 'rouge', 'secret', 'ammar', 'forcée', 'marche', 'champion', 'discret', 'olympique', 'quinon', 'algérie', 'faillite', 'sanglante', 'israël', 'ébranlé', 'homme', 'objets', 'ponge', 'besse', 'george', 'pourquoi', 'africain', 'longue', 'mémoire', 'malgré', 'antigone', 'benazir', 'bhutto', 'janvier', 'pakistan', 'soupçons', 'triangle', 'campagne', 'ombre', 'darwin', 'trompé', 'dayan', 'symbole', 'impopulaire', 'leader', 'pérès', 'shimon', 'ainsi', 'exclusif', 'finit', 'phnom', 'témoignage', 'explique', 'simone', 'bureau', 'informatique', 'livré', 'jérusalem', 'nouveau', 'seuls', 'chopinet', 'express', 'retrouve', 'touvier', 'tutelle', 'agent', 'fausses', 'france', 'star', 'vrais', 'complexe', 'pinochet', 'challenger', 'drame', 'beineix', 'coluche', 'nourricier', 'marchandage', 'terrorisme', 'enquête', 'reprend', 'espagne', 'goytisolo', 'sarrasin', 'attentat', 'bla

In [13]:
# get the topics! 

NUM_TOPICS = 10

for (name, corpus, dictionary) in [('title', corpus_title, dic_title), ('text', corpus_text, dic_text)]:
    print('\nTopics')
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
    # one file per corpus, so the text model does not overwrite the title model
    ldamodel.save('lda_{}_{}topics.gensim'.format(name, NUM_TOPICS))
    topics = ldamodel.print_topics(num_words=4)
    for topic in topics:
        print(topic)


Topics
(0, '0.020*"retour" + 0.010*"encore" + 0.010*"heure" + 0.007*"berlin"')
(1, '0.010*"homme" + 0.010*"affaire" + 0.010*"boulin" + 0.007*"cessa"')
(2, '0.007*"france" + 0.007*"comment" + 0.007*"jospin" + 0.007*"victoire"')
(3, '0.020*"express" + 0.010*"drame" + 0.007*"monde" + 0.007*"marche"')
(4, '0.009*"robert" + 0.009*"suicide" + 0.009*"heures" + 0.009*"express"')
(5, '0.009*"allemagne" + 0.009*"petit" + 0.009*"chute" + 0.006*"semaine"')
(6, '0.014*"france" + 0.006*"décembre" + 0.006*"nouvelle" + 0.006*"mode"')
(7, '0.011*"pouvoir" + 0.007*"jours" + 0.007*"guerre" + 0.007*"extraits"')
(8, '0.010*"europe" + 0.007*"pierre" + 0.007*"patron" + 0.007*"patrick"')
(9, '0.018*"contre" + 0.015*"france" + 0.012*"francais" + 0.009*"europe"')

Topics
(0, '0.008*"affaire" + 0.008*"suite" + 0.008*"ombre" + 0.008*"deutsche"')
(1, '0.009*"express" + 0.006*"henry" + 0.006*"reseau" + 0.006*"patron"')
(2, '0.010*"suicide" + 0.007*"israël" + 0.007*"jospin" + 0.007*"pierre"')
(3, '0.034*"france" + 

In [None]:
# to be continued
# https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21

__Analysis__

* at first sight, the key words associated with the same topic do not always have much in common
* some of the words used to define topics do not look like relevant, high-level key words
* maybe the corpus is too small or too heterogeneous to obtain relevant topics this way!

--> *Improvement ideas*
- try clustering methods and put a label on the resulting topics
- improve dictionary
- improve this first LDA model (more ideas [here](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/))
- get more data on specific topics

# 2. Clustering 

K-Means clustering and DBSCAN with tf-idf representation
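
As a preview of the target workflow, here is a minimal sketch on a hypothetical four-document toy corpus (not the news articles). The parameter values (`n_clusters=2`, `eps=0.7`, `min_samples=2`) are illustrative assumptions, not tuned settings for this dataset.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two political and two space-related snippets.
toy_docs = [
    "election president vote",
    "president vote campaign",
    "rocket space orbit",
    "space orbit station",
]

# Rows are l2-normalized tf-idf vectors, one per document.
X_toy = TfidfVectorizer().fit_transform(toy_docs)

# K-Means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# DBSCAN infers the number of clusters from density; a label of -1 marks noise.
dbscan_labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X_toy.toarray())

print(kmeans_labels, dbscan_labels)
```

The practical difference is the hyperparameter to choose: K-Means asks for the number of clusters, whereas DBSCAN asks for a neighborhood radius and a minimum neighborhood size, and can leave outlier documents unassigned.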



### a. Build a tf-idf matrix to represent the corpus

__Naive approach based on word count__

In [14]:
#instantiate CountVectorizer() 
cv=CountVectorizer() 
 
# this step generates word counts for the words in your docs 
word_count_vector=cv.fit_transform(news_df.text)

print(word_count_vector.shape)
# Houston, we have a dimensionality problem

(726, 43869)


In [15]:
# tf-idf with all the words
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)

# print idf values 
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"]) 
 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

Unnamed: 0,idf_weights
de,1.000000
la,1.022254
le,1.036419
des,1.047896
les,1.052235
et,1.053685
un,1.074214
en,1.077181
du,1.081649
une,1.124338


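The lowest idf weights go to stop words (de, la, le, ...), which appear in almost every article and say nothing about its topic. We have to restrict the vocabulary to the words with the highest tf-idf scores.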
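
The idf weights printed above can be checked by hand. With `smooth_idf=True`, scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The sketch below applies that formula; the document frequencies passed in are illustrative values, not counts read from the corpus.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """idf as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 726  # number of articles, from the shape of the count matrix

print(smooth_idf(n_docs, 726))  # a word present in every article
print(smooth_idf(n_docs, 710))  # a word missing from 16 articles
print(smooth_idf(n_docs, 1))    # a word present in a single article
```

A word present in all 726 articles gets a weight of exactly 1, which is the value shown for "de": the more widespread a word, the closer its idf is to this floor.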

__Tf-Idf vectorizer with constraints__

In [16]:
# fit tf-idf on all texts with constraints on document frequency and number of tokens
tfidf_vectorizer=TfidfVectorizer(max_df = 700, max_features=500, smooth_idf=True,use_idf=True) 
X = tfidf_vectorizer.fit_transform(news_df.text) # X is a matrix

In [17]:
# take a look at the result
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
print(tfidf_vectorizer.get_feature_names())
print(X.shape)

['000', '10', '100', '11', '12', '15', '16', '17', '1990', '1997', '20', '30', '31', '50', '500', '60', 'abord', 'accord', 'action', 'affaire', 'affaires', 'afin', 'afrique', 'ai', 'aide', 'ailleurs', 'ainsi', 'ait', 'algérie', 'allemagne', 'alors', 'américain', 'américaine', 'américains', 'amérique', 'an', 'ancien', 'année', 'années', 'ans', 'août', 'après', 'argent', 'armes', 'armée', 'assez', 'au', 'aucun', 'aucune', 'aujourd', 'aura', 'aurait', 'aussi', 'autant', 'autour', 'autre', 'autres', 'aux', 'avaient', 'avait', 'avant', 'avec', 'avenir', 'avoir', 'avons', 'avril', 'bas', 'beaucoup', 'bien', 'bon', 'bonne', 'bout', 'camp', 'campagne', 'capitale', 'car', 'cas', 'cause', 'ce', 'cela', 'celle', 'celui', 'centre', 'certaines', 'certains', 'ces', 'cet', 'cette', 'ceux', 'chacun', 'chaque', 'chef', 'chez', 'chine', 'chose', 'chômage', 'ci', 'cinq', 'comme', 'comment', 'commerce', 'commission', 'compte', 'conseil', 'contre', 'contrôle', 'corps', 'coup', 'coups', 'cours', 'crise', 'c

Improvement ideas:

* get rid of numbers, whether written as digits ("000") or spelled out ("cinq")
* get rid of non-topic words (destructive approach)
* build a custom vocabulary for the corpus, based on the previous NER extraction 
* use lemmas (root words) instead of full words (e.g. français and france go back to the same entity)
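
The first two ideas can be sketched as a simple filter over the feature list. This is a minimal illustration: `keep_feature`, `NUMBER_WORDS`, and the sample inputs are hypothetical names and values, and the number-word list is deliberately short.

```python
import re

# Hypothetical, deliberately short list of spelled-out numbers.
NUMBER_WORDS = {"un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix"}
DIGITS_ONLY = re.compile(r"^\d+$")


def keep_feature(word, stop_words=frozenset()):
    """Drop pure-digit tokens, spelled-out numbers, and stop words."""
    if DIGITS_ONLY.match(word):
        return False
    return word not in NUMBER_WORDS and word not in stop_words


features = ["000", "1990", "accord", "afrique", "cinq", "avec"]
filtered = [w for w in features if keep_feature(w, stop_words={"avec"})]
print(filtered)
```

A filtered list like this one could then be passed to `TfidfVectorizer` through its `vocabulary` parameter, so that only the retained words become features.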


__Tf-idf with lemmatized words__

Same idea as the preprocessing for LDA?

__Tf-idf with entity-based vocabulary__

In [18]:
PER_df=pd.read_csv('articles_PER.csv')
LOC_df=pd.read_csv('articles_LOC.csv')

In [19]:
PER_df.describe()

Unnamed: 0.1,Unnamed: 0,PER_count
count,287.0,287.0
mean,143.0,5.487805
std,82.993976,4.539534
min,0.0,3.0
25%,71.5,3.0
50%,143.0,4.0
75%,214.5,6.0
max,286.0,35.0


In [27]:
PER_highest=PER_df[PER_df['PER_count']>5]
print(list(PER_highest.PER_entity))
PER_highest.head()

['françois mitterrand', 'jacques chirac', 'ii', 'de gaulle', 'président de la république', 'lionel jospin', 'général de gaulle', 'chirac', 'yasser arafat', 'mitterrand', 'menahem begin', 'bill clinton', 'charles pasqua', 'moshe dayan', 'arafat', 'hier', 'staline', 'alain juppé', 'edouard balladur', '-major', 'washington', 'begin', '-aviv', '-ci', 'itzhak rabin', 'sadate', 'lénine', 'golda meir', 'fini', 'helmut kohl', 'pierre bérégovoy', 'martine aubry', 'kennedy', 'hitler', "valéry giscard d'estaing", 'nasser', 'rabin', 'raymond barre', 'expert', 'shimon peres', 'napoléon', 'eisenhower', 'hasard', 'giscard', 'george bush', 'mao', 'saddam hussein', 'boulin', 'delmas', 'david ben gourion', 'robert badinter', 'georges pompidou', 'anouar el-sadate', 'voulez', 'jean-paul ii', 'christ', 'jimmy carter', 'michel rocard', 'voltaire', 'pierre mendès france', 'pierre mauroy', 'robert', 'regardez', 'marx', 'françoise giroud', 'soudain', 'mendès france', 'pompidou', 'louis xiv', 'mikhaïl gorbatche

Unnamed: 0.1,Unnamed: 0,PER_entity,PER_count
0,0,françois mitterrand,35
1,1,jacques chirac,31
2,2,ii,29
3,3,de gaulle,26
4,4,président de la république,26


In [21]:
LOC_df.describe()

Unnamed: 0.1,Unnamed: 0,LOC_count
count,439.0,439.0
mean,219.0,9.621868
std,126.87264,17.268576
min,0.0,3.0
25%,109.5,3.0
50%,219.0,5.0
75%,328.5,8.5
max,438.0,158.0


In [25]:
LOC_highest=LOC_df[LOC_df['LOC_count']>6]
print(list(LOC_highest.LOC_entity))
LOC_highest.head()

['paris', 'france', 'la france', 'etat', 'etats-unis', 'europe', '-', "l'europe", 'allemagne', 'américains', 'état', 'de france', 'israël', 'londres', 'jérusalem', 'français', 'grande-bretagne', 'moscou', 'terre', 'washington', 'chine', 'afrique', '»', 'la chine', 'amérique', 'new york', 'algérie', 'allemands', 'angleterre', 'occident', 'berlin', 'italie', 'japon', 'beyrouth', 'espagne', 'genève', 'russie', 'suisse', 'pékin', "l'amérique", 'bruxelles', 'soleil', "l'italie", 'f', 'gaza', 'liban', 'cisjordanie', 'syrie', 'la terre', 'jordanie', 'iran', 'sinaï', 'asie', 'mercredi', 'damas', 'cher', 'lyon', 'européens', 'lune', "l'afrique", 'russes', 'pologne', 'egypte', 'algériens', 'israéliens', 'rome', 'c?ur', 'anglais', 'alger', 'inde', 'cuba', 'arabie saoudite', 'matignon', 'marseille', 'irak', 'vienne', "l'empire", 'madrid', 'pacifique', 'hongrie', 'suède', 'bordeaux', "l'algérie", 'tunisie', 'strasbourg', 'atlantique', 'britanniques', 'soudan', "quai d'orsay", 'mexique', 'belgique',

Unnamed: 0.1,Unnamed: 0,LOC_entity,LOC_count
0,0,paris,158
1,1,france,156
2,2,la france,144
3,3,etat,124
4,4,etats-unis,116


In [None]:
vocabulary=['paris', 'france', 'etat', 'etats-unis', 'europe', 'allemagne', 'américains', 'état', 'de france', 'israël', 'londres', 'jérusalem', 'français', 'grande-bretagne', 'moscou', 'terre', 'washington', 'chine', 'afrique', '»', 'la chine', 'amérique', 'new york', 'algérie', 'allemands', 'angleterre', 'occident', 'berlin', 'italie', 'japon', 'beyrouth', 'espagne', 'genève', 'russie', 'suisse', 'pékin', "l'amérique", 'bruxelles', 'soleil', "l'italie", 'f', 'gaza', 'liban', 'cisjordanie', 'syrie', 'la terre', 'jordanie', 'iran', 'sinaï', 'asie', 'mercredi', 'damas', 'cher', 'lyon', 'européens', 'lune', "l'afrique", 'russes', 'pologne', 'egypte', 'algériens', 'israéliens', 'rome', 'c?ur', 'anglais', 'alger', 'inde', 'cuba', 'arabie saoudite', 'matignon', 'marseille', 'irak', 'vienne', "l'empire", 'madrid', 'pacifique', 'hongrie', 'suède', 'bordeaux', "l'algérie", 'tunisie', 'strasbourg', 'atlantique', 'britanniques', 'soudan', "quai d'orsay", 'mexique', 'belgique', 'sienne', 'maroc', 'chinois', 'japonais', 'italiens', 'téhéran', 'los angeles', "l'asie", 'champagne', 'syriens', 'turquie', 'munich', 'egyptiens', 'afrique du sud', 'libye', 'avais', 'danemark', 'américain', 'hexagone', 'portugal', 'au caire', 'koweït', 'méditerranée', 'le soleil', 'canal de suez', 'vatican', 'vichy', 'kremlin', "l'espagne", 'caire', 'roumanie', 'tchécoslovaquie', 'norvège', 'tokyo', 'le japon', 'autriche', 'canada', 'versailles', "côte-d'ivoire", 'kenya', 'maghreb', 'pakistan', 'laos', 'sénégal', 'francfort', 'vietnam', 'loire', 'toulouse', 'californie', 'prague', 'grèce', 'nigeria', 'gatt', 'du japon', 'madagascar', 'corée', 'congo', 'gabon', 'croyez', 'australie', 'voudrais', 'cambodge', 'luxembourg']
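
The raw entity lists contain noise: punctuation-only entries ('-', '»'), single letters ('f'), and the same place with and without an article ('france', 'la france'). The sketch below shows one possible clean-up pass. The helper name, the article list, and the length threshold are assumptions made for illustration.

```python
import re

# Leading French articles/prepositions seen in the raw list ("la france", "l'europe", "du japon").
LEADING_ARTICLE = re.compile(r"^(?:la |le |les |l'|de |du |au )")


def clean_entities(entities, min_length=2):
    """Strip leading articles, drop very short or non-alphabetic entries, and deduplicate."""
    cleaned = []
    seen = set()
    for entity in entities:
        name = LEADING_ARTICLE.sub("", entity.strip().lower())
        if len(name) < min_length or not any(ch.isalpha() for ch in name):
            continue
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


raw = ["paris", "france", "la france", "-", "»", "l'europe", "f", "du japon", "new york"]
print(clean_entities(raw))
```

This pass does not catch NER false positives that look like ordinary words (for example 'mercredi', 'cher', or 'avais' in the list above); those would still need a manual review or a stop list. Multi-word entries such as 'new york' would also only match if the vectorizer is configured to produce n-grams, since the default analyzer works on single words.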