# Florian Guillot & Julien Donche: Project 7
### Topic modeling and keyword extraction for Holi.io
### Jedha Full Stack, dsmf-paris-13
### 08-2021

This is our final project as Jedha students. 
The idea was submitted by Holi.io's founder, Clément Sirvente.
The specifications from Holi.io can be found [here](https://github.com/FlorianG-dev/Jedha_certification/blob/master/7_Holi/Project_initialization.pdf). This is project number 1: Topic modeling.

---

This notebook is the **second** notebook in a series of two.

# **1) Initialization**
----
## **1.1) Importing and configuring the libraries we will use**
----

In [14]:
''' To run if you work on a notebook : 
!pip install pyLDAvis -q 
!pip install gensim -q
!pip install spacy -q
'''

In [1]:
import numpy as np
import pandas as pd
import os
import joblib

import tqdm

# Gensim
import gensim
from gensim.models import Phrases
from gensim.models import CoherenceModel
from gensim.corpora.dictionary import Dictionary

# Plotting tools; we chose pyLDAvis for visualization
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)



## **1.2) Data collection**
---

In [3]:
df = pd.read_csv("Data/smallMind_clean_data_without_stop_words.csv")
df.head()

| Unnamed: 0 | id | category | subcategory | title | nid | text | text_cleaned | nlp_ready |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | AAGH0ET | The royals are free to shop wherever they cho... | The royals are free to shop wherever they choo... | royal free shop choose tend family royal warra... |
| 1 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | AAJgNsz | ZOLOTE, Ukraine — Lt. Ivan Molchanets peeked o... | ZOLOTE Ukraine Lt Ivan Molchanets peeked over ... | zolote ukraine lt ivan molchanets peek parapet... |
| 2 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | AACk2N6 | I had to be perfect. In order to s... | I had to be perfect In order to shed my perfec... | perfect order shed perfectionism know major li... |
| 3 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | AAAKEkt | As you get older, little growths called skin t... | As you get older little growths called skin ta... | old little growth skin tag start pop body reco... |
| 4 | N2073 | sports | football_nfl | Should NFL be able to fine players for critici... | AAJ4lap | The officiating in the Packers' 23-22 Monday n... | The officiating in the Packers Monday night wi... | officiating packers monday night win lions egr... |


# **2) Preprocessing**
---

## **2.1) Dropping NaN rows** (due to lemmatization) and converting "nlp_ready" into a list of strings
---

In [3]:
df = df.dropna() # Dropping NaN rows
texts = df["nlp_ready"].str.split().tolist()

## **2.2) Recognizing & adding Bigrams**
---
We append each detected bigram to the texts in which it occurs, provided the bigram appears at least 20 times across our articles (`min_count=20`). We could not do this earlier because of the cleaning steps we chose.

In [6]:
%time multigrams = Phrases(texts, min_count=20) # Phrases automatically detects common phrases
for idx in range(len(texts)):
    for token in multigrams[texts[idx]]:
        if '_' in token:  # An underscore means the token is a multigram, so we add it to the document.
            texts[idx].append(token)

Wall time: 12.7 s
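To make the idea concrete, here is a minimal pure-Python sketch of what "appending frequent bigrams" means. It is an illustration under simplifying assumptions, not gensim's implementation: `Phrases` also applies a score threshold, which this sketch leaves out, and the function name `add_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter


def add_frequent_bigrams(docs, min_count):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times.

    Simplified illustration: only raw pair counts are used, no scoring.
    """
    pair_counts = Counter(
        (left, right) for doc in docs for left, right in zip(doc, doc[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    # Unigrams are kept; the bigram tokens are added at the end of each document.
    return [
        doc + [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        for doc in docs
    ]


toy_docs = [
    ["new", "york", "city", "council"],
    ["visit", "new", "york", "today"],
    ["york", "minster", "visit"],
]
augmented = add_frequent_bigrams(toy_docs, min_count=2)
print(augmented[0])
```

As in our loop, the original unigrams stay in the document, so a bigram adds information without removing the words it is made of.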


## **2.3) Creating the Dictionary and the corpus**
---
Both are needed by the LDA model.

In [7]:
common_dictionary = Dictionary(texts)

# We filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
common_dictionary.filter_extremes(no_below=20, no_above=0.5)

# We transform the documents to a vectorized form. We simply compute the frequency of each word, including the bigrams.
common_corpus = [common_dictionary.doc2bow(text) for text in texts] # Here we use doc2bow to create the corpus, as recommended for LDA

In [19]:
print('Number of unique tokens: %d' % len(common_dictionary))
print('Number of documents: %d' % len(common_corpus))

Number of unique tokens: 30084
Number of documents: 50001
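The two steps above (vocabulary filtering, then bag-of-words encoding) can be sketched in plain Python on a toy corpus. This is a simplified illustration, not gensim's code: the helper names `build_vocab` and `to_bow` are hypothetical, token ids are assigned alphabetically here, and gensim's additional `keep_n` cap is ignored.

```python
from collections import Counter


def build_vocab(docs, no_below, no_above):
    """Keep tokens whose document frequency lies in [no_below, no_above * n_docs]."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [
    ["match", "team", "win", "team"],
    ["team", "election", "vote"],
    ["team", "vote", "win"],
    ["team", "recipe"],
]
# "team" is in every document (too common); "match", "election", "recipe" are too rare.
vocab = build_vocab(toy_docs, no_below=2, no_above=0.5)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

A document made only of filtered-out words ends up as an empty bag, which is why the number of documents does not change after filtering.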


# **3) Fine-tuning our LDA model**
---

## **3.1) Creating a scoring function**
---

We create the function used for parameter optimization. It scores one combination of:
 - k : number of topics 
 - alpha : prior on the topic distribution of each document
 - eta : prior on the word distribution of each topic
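In LDA, alpha and eta are the concentration parameters of Dirichlet priors. As a rough intuition, a small value favors sparse distributions (a document dominated by a few topics, or a topic dominated by a few words), while a value close to 1 favors flatter ones. The sketch below illustrates this with the standard library only; the helper names are hypothetical and the sampling is a generic Dirichlet draw, not what gensim does internally.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0:  # guard against all draws underflowing to zero
            return [d / total for d in draws]


def mean_largest_share(concentration, size=25, n_samples=2000, seed=0):
    """Average weight of the dominant component over many samples."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(concentration, size, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples


sparse = mean_largest_share(0.01)  # smallest numeric value in our grid
flat = mean_largest_share(0.91)    # largest numeric value in our grid
print(f"alpha=0.01 -> dominant topic share ~ {sparse:.2f}")
print(f"alpha=0.91 -> dominant topic share ~ {flat:.2f}")
```

With 25 components, the dominant share is much larger for the small concentration, which is why these two parameters have a visible effect on how "focused" the topics look.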

In [20]:
# Supporting function: trains an LDA model and returns its c_v topic coherence.
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    # Use the corpus and dictionary passed as arguments (not the global ones),
    # so that each validation set is actually evaluated.
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100, # Number of documents to be used in each training chunk.
                                           passes=10, # Number of passes through the corpus during training.
                                           alpha=a,
                                           eta=b)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

## **3.2) Finding the best parameters for our couple model/dataset**
---

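We create a loop, inspired by grid search, to try every combination of the following parameters:
 - k : number of topics 
 - alpha : prior on the topic distribution of each document
 - eta : prior on the word distribution of each topic (named `beta` in the code)

**Warning: this search takes several days to run.**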
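Before launching such a long run, it is worth checking how many models will be trained. The sketch below rebuilds the grid with `itertools.product`, using the same values and the same nesting order as our loop. The numeric steps are written out with `round` instead of `np.arange`, so they are clean decimals (an assumption made for readability; `np.arange` yields values such as 0.9099999999999999).

```python
from itertools import product

topics_range = [30, 25, 20, 15]
numeric_steps = [round(0.01 + 0.3 * i, 2) for i in range(4)]  # 0.01, 0.31, 0.61, 0.91
alpha_values = numeric_steps + ["symmetric", "asymmetric"]
beta_values = numeric_steps + ["symmetric"]
corpus_titles = ["100% Corpus", "75% Corpus"]

# Same order as the nested loops: corpus, then topics, then alpha, then beta.
grid = list(product(corpus_titles, topics_range, alpha_values, beta_values))

print(len(grid))   # 2 corpora * 4 topic counts * 6 alphas * 5 betas
print(grid[38])    # combination stored in row 38 of the results file
```

Because the order matches the loop, the row index in `lda_tuning_results.csv` can be mapped back to a parameter combination this way.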

In [41]:

# Topics range
topics_range = [30, 25, 20, 15]

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(common_corpus)
corpus_sets = [common_corpus,
               # ClippedCorpus expects an integer number of documents
               gensim.utils.ClippedCorpus(common_corpus, int(num_of_docs * 0.75))]
corpus_title = ['100% Corpus', '75% Corpus']
model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Set to False to skip the search, which can take a long time to run
run_grid_search = True
if run_grid_search:
    # One iteration per (corpus, k, alpha, beta) combination: 2 * 4 * 6 * 5 = 240
    pbar = tqdm.tqdm(total=len(corpus_sets) * len(topics_range) * len(alpha) * len(beta))
    
    # Iterate through validation corpora
    for i in range(len(corpus_sets)):
        # Iterate through number of topics
        for k in topics_range:
            # Iterate through alpha values
            for a in alpha:
                # Iterate through beta values
                for b in beta:
                    # Get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=common_dictionary, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    pbar.update(1)
    pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False) # We save the results to a CSV file
    pbar.close()




  0%|          | 1/240 [08:22<33:20:32, 502.23s/it]
  1%|          | 2/240 [17:15<34:25:32, 520.73s/it]
  1%|▏         | 3/240 [25:53<34:12:01, 519.50s/it]
  2%|▏         | 4/240 [34:39<34:13:23, 522.05s/it]
  2%|▏         | 5/240 [43:03<33:39:18, 515.57s/it]
  2%|▎         | 6/240 [51:44<33:36:58, 517.17s/it]
  3%|▎         | 7/240 [1:00:36<33:47:21, 522.07s/it]
  3%|▎         | 8/240 [1:09:23<33:44:22, 523.55s/it]
  4%|▍         | 9/240 [1:18:11<33:40:52, 524.90s/it]
  4%|▍         | 10/240 [1:26:47<33:22:12, 522.31s/it]
  5%|▍         | 11/240 [1:34:50<32:27:42, 510.32s/it]
  5%|▌         | 12/240 [1:43:07<32:03:33, 506.20s/it]
  5%|▌         | 13/240 [1:51:23<31:43:40, 503.18s/it]
  6%|▌         | 14/240 [1:59

### **Here are the best parameters our grid search found**

In [18]:
df_lda = pd.read_csv('lda_tuning_results.csv') # We read the CSV file with the results
df_lda.loc[df_lda['Coherence'].idxmax()] # idxmax returns an index label, so we select with .loc


Validation_Set           100% Corpus
Topics                            25
Alpha                           0.31
Beta              0.9099999999999999
Coherence                   0.632185
Name: 38, dtype: object

# **4) Building the model with the optimized parameters**
---

## **4.1) Training the model**
---

In [22]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=common_corpus,
                                           id2word=common_dictionary,
                                           num_topics=25, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.31,
                                           eta=0.91,  # Beta found by the grid search (stored as 0.9099999999999999)
                                           per_word_topics=True)



## **4.2) Assessing performance**
---

In [53]:
# log_perplexity returns a per-word likelihood bound (negative; the closer to 0, the better).
# The corresponding perplexity is 2 ** (-bound), for which lower is better.
print('\nPerplexity: ', lda_model.log_perplexity(common_corpus)) 

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, corpus=common_corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -8.207692628329141

Coherence Score:  0.6321854058803597
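The first value printed above is negative because gensim's `log_perplexity` returns a per-word likelihood bound, not the perplexity itself. To our knowledge, gensim derives the perplexity it reports in its logs as 2 raised to minus that bound, so the conversion can be sketched as follows (the variable names are ours):

```python
# Per-word bound printed by the cell above
per_word_bound = -8.207692628329141

# Perplexity as 2 ** (-bound): lower perplexity means a better fit
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

This number is mostly useful for comparing models trained on the same corpus and dictionary; on its own it says little about how interpretable the topics are, which is why we optimized coherence instead.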


## **4.3) Visualizing and saving the model**
---
This visualization will be used in our live app directly as an HTML page

In [56]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, common_corpus, common_dictionary,mds="tsne")
vis



![model](figures/model.PNG)

We export the model and the visualization to files for our app:
 - We save the model with joblib
 - We save the visualization as an HTML page

In [61]:
joblib.dump(lda_model, "lda_model_25.joblib")
pyLDAvis.save_html(vis, 'lda_25_bis.html')

['lda_model_25.joblib']