# Topic Modeling

The goal of this notebook is two-fold:
* identify the main topics of the corpus through approach (unsupervised clustering)
* build a text classifier based on these topics (supervised learning).

### Contents

__Preliminaries__
* a. Overview of text modeling techniques
* b. Imports

__1. LDA__

__2. Clustering__
* a. Tf-Idf for Clustering
* b. K-Means
* c. DBSCAN

__3. Topic classifier__

What's next

# Preliminaries

__a. Overview of text modeling techniques__

* [LDA](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)

* Clustering. In order to build clusters we have to transform our news articles into a numerical representation that models can handle. Here we go for old good tf-idf vectors (we will test more state-of-the-art techniques in the next steps). Algorithms: K-Means and DBSCAN seem like the most relevant options to begin with.

* ?

__b. Imports__

If you don't have already done so, please download NLTK WordNet models by runnung the following line:

> nltk.download('wordnet')

In [21]:
# Classic packages

import os 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import random 
random.seed(a=2905) # set random seed 
import pickle


# NLP packages

import gensim
from gensim import corpora

import spacy
try: 
    print("fr_core_news_sm loaded")
    nlp = spacy.load("fr_core_news_sm") # load pre-trained models for French
except:
    print("fr loaded")
    nlp=spacy.load('fr') # fr calls fr_core_news_sm 

import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

# ML with sklearn
import sklearn.cluster
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer 


fr_core_news_sm loaded


[nltk_data] Downloading package wordnet to /Users/eva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# data
news_df=pd.read_csv("./articles.csv")

# 1. Latent Dirichlet Allocation (LDA)


Main idea: Each document is represented as a distribution over topics, and each topic is represented as a distribution over words.
Here we do not set the number of topics in advance, we rather set it arbitrarily like a threshold and see if the results are relevant.

Code freely adapted from this [TDS post](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21).



__Text cleaning for LDA__

* tokenize words (here using spacy parser for French)
* lemmatize (using NLTK WordNetLemmatizer)
* stopwords removal (using the default NLTK stopwords list for French)
* apply pipeline on titles

In [4]:
## spacy LDA

spacy.load('fr')
from spacy.lang.fr import French
parser = French()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

In [11]:
## NLTK lemmatizer

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word, language='french')

# stopwords removal
nltk.download('stopwords')
fr_stop = set(nltk.corpus.stopwords.words('french'))

[nltk_data] Downloading package stopwords to /Users/eva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [44]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in fr_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens  #[t.encode('utf-8') for t in tokens] 

In [45]:
title_tokens = []
text_tokens = []

## Apply on titles ##

for t in news_df.title:
    tokens = prepare_text_for_lda(t)
    title_tokens.append(tokens)
    if random.random() >0.99:
        print(tokens)
        
## Apply on titles ##

for t in news_df.title:
    tokens = prepare_text_for_lda(t)
    text_tokens.append(tokens)
    if random.random() >0.995:
        print(tokens)
        
print(title_tokens[:5])
print(text_tokens[:5])

['quelle', 'politique', 'étrangère']
['heure', 'comptes']
['dernier', 'dialogue']
['vent', 'meilleur', 'decennie']
['couac', 'fusion', 'matra', 'aerospatiale']
['action', 'humanitaire', 'fatiguée']
['tapie', 'prisonnier', 'mensonge']
['kurdistan', 'oasis', 'enfer']
[['tintin', 'espace'], ['suicide', 'robert', 'boulin'], ['pierre', 'contre', 'certitude'], ['otages', 'soudain', 'mercredi'], ['secret', 'planète', 'rouge']]
[['tintin', 'espace'], ['suicide', 'robert', 'boulin'], ['pierre', 'contre', 'certitude'], ['otages', 'soudain', 'mercredi'], ['secret', 'planète', 'rouge']]


__Perform LDA with Gensim__

Fixage arbitraire du nombre de topics, comme on fixerait arbitrairement k dans un k-means.

In [46]:
dic_title = corpora.Dictionary(title_tokens)
dic_text = corpora.Dictionary(text_tokens)
corpus_title = [dictionary.doc2bow(token) for token in title_tokens]
corpus_text = [dictionary.doc2bow(token) for token in text_tokens]

pickle.dump(corpus_title, open('corpus_title.pkl', 'wb'))
dictionary.save('dictionary_title.gensim')
pickle.dump(corpus_title, open('corpus_text.pkl', 'wb'))
dictionary.save('dictionary_text.gensim')

In [47]:
print([k for k in dic_title.values()])

['espace', 'tintin', 'boulin', 'robert', 'suicide', 'certitude', 'contre', 'pierre', 'mercredi', 'otages', 'soudain', 'planète', 'rouge', 'secret', 'ammar', 'forcée', 'marche', 'champion', 'discret', 'olympique', 'quinon', 'algérie', 'faillite', 'sanglante', 'israël', 'ébranlé', 'homme', 'objets', 'ponge', 'besse', 'george', 'pourquoi', 'africain', 'longue', 'mémoire', 'malgré', 'antigone', 'benazir', 'bhutto', 'janvier', 'pakistan', 'soupçons', 'triangle', 'campagne', 'ombre', 'darwin', 'trompé', 'dayan', 'symbole', 'impopulaire', 'leader', 'pérès', 'shimon', 'ainsi', 'exclusif', 'finit', 'phnom', 'témoignage', 'explique', 'simone', 'bureau', 'informatique', 'livré', 'jérusalem', 'nouveau', 'seuls', 'chopinet', 'express', 'retrouve', 'touvier', 'tutelle', 'agent', 'fausses', 'france', 'star', 'vrais', 'complexe', 'pinochet', 'challenger', 'drame', 'beineix', 'coluche', 'nourricier', 'marchandage', 'terrorisme', 'enquête', 'reprend', 'espagne', 'goytisolo', 'sarrasin', 'attentat', 'bla

In [53]:
# get the topics! 

NUM_TOPICS = 10

for dictionary in [dic_title, dic_text]:
    print('\nTopics')
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save('model5.gensim')
    topics = ldamodel.print_topics(num_words=4)
    for topic in topics:
        print(topic)


Topics
(0, '0.007*"leader" + 0.007*"blanco" + 0.007*"carrero" + 0.007*"avortement"')
(1, '0.014*"chopinet" + 0.007*"seuls" + 0.007*"finit" + 0.007*"livré"')
(2, '0.008*"challenger" + 0.008*"pinochet" + 0.008*"complexe" + 0.008*"marchandage"')
(3, '0.007*"témoignage" + 0.007*"enquête" + 0.007*"reprend" + 0.007*"ombre"')
(4, '0.007*"certitude" + 0.007*"robert" + 0.007*"contre" + 0.007*"pierre"')
(5, '0.007*"ammar" + 0.007*"secret" + 0.007*"forcée" + 0.007*"attentat"')
(6, '0.007*"africain" + 0.007*"malgré" + 0.007*"bhutto" + 0.007*"benazir"')
(7, '0.008*"trompé" + 0.008*"dayan" + 0.008*"symbole" + 0.008*"espagne"')
(8, '0.008*"janvier" + 0.001*"drame" + 0.001*"ébranlé" + 0.001*"quinon"')
(9, '0.007*"tintin" + 0.007*"boulin" + 0.007*"espace" + 0.007*"discret"')

Topics
(0, '0.013*"chopinet" + 0.007*"certitude" + 0.007*"robert" + 0.007*"suicide"')
(1, '0.013*"objets" + 0.007*"leader" + 0.007*"décembre" + 0.007*"blanco"')
(2, '0.007*"rouge" + 0.007*"planète" + 0.007*"soudain" + 0.007*"mémo

__Analysis__

* at first sight, key words associated to the same topic do not always have much in common 
* some of the words used to define topics do not seem like relevant, high level key words
* maybe the corpus is to small or to heterogeneous to obtain relevant topics this way!

--> *Improvement ideas*
- try clustering methods and put a word on the topics
- improve dictionary
- improve this first LDA model (more ideas [here](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/))
- get more data on specific topics

# 2. Clustering 

K-Means clustering and DBSCAN with tf-idf representation



__a. Build a tf-idf matrix to represent the corpus__

In [57]:
#instantiate CountVectorizer() 
cv=CountVectorizer() 
 
# this steps generates word counts for the words in your docs 
word_count_vector=cv.fit_transform(news_df.text)

In [58]:
print(word_count_vector.shape)
# Houston, we have a dimensionality problem

(726, 43869)


In [59]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)

TfidfTransformer()

In [60]:
# print idf values 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"]) 
 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

Unnamed: 0,idf_weights
de,1.000000
la,1.022254
le,1.036419
des,1.047896
les,1.052235
et,1.053685
un,1.074214
en,1.077181
du,1.081649
une,1.124338


We have to select the words with highest tf-idf count.