# Evaluate Topic Model in Python: Latent Dirichlet Allocation (LDA)

## Model Implementation

1. Loading data
2. Data Cleaning
3. Phrase Modeling: Bi-grams and Tri-grams
4. Data transformation: Corpus and Dictionary
5. Base Model Performance
6. Hyperparameter Tuning
7. Final Model
8. Visualize Results

#### What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These scores help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. To make that precise, it helps to first define coherence itself.

#### What is coherence?
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is "the game is a team sport", "the game is played with a ball", "the game demands great physical efforts".
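
To make the idea concrete, the following is a minimal sketch of one simple document co-occurrence score, in the spirit of the UMass coherence measure. It is not the `c_v` measure used later in this notebook. The function name `umass_coherence` and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_i)) over pairs of top words.

    D(w) is the number of documents containing w and D(w_i, w_j) the number
    containing both, with w_i ranked before w_j. Higher means the words
    co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        w_earlier, w_later = top_words[i], top_words[j]
        denom = doc_freq(w_earlier)
        if denom == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_earlier, w_later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy reference documents (hypothetical)
toy_docs = [
    ["game", "team", "ball", "sport"],
    ["game", "ball", "player"],
    ["team", "sport", "ball"],
    ["voltage", "circuit", "chip"],
    ["circuit", "voltage", "current"],
]
coherent = umass_coherence(["game", "ball", "team"], toy_docs)
mixed = umass_coherence(["game", "voltage", "team"], toy_docs)
print(coherent > mixed)
```

A topic whose top words appear in the same documents scores higher than one that mixes unrelated words, which is the intuition behind every coherence measure.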

## 1. Loading Data 

In [4]:
import pandas as pd
import os 
import numpy as np

os.chdir('C:/Users/alexa/Desktop/Eiemplos LDA/2.- LDA Evaluation')

# Read data into papers
papers = pd.read_csv('./data/NIPS Papers/papers.csv')
papers.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


## 2. Data Cleaning

In [5]:
# Remove the metadata columns, keeping only the paper text
papers = papers.drop(columns=['id', 'title', 'abstract', 
                              'event_type', 'pdf_name', 'year'])

# Sample only 100 papers
papers = papers.sample(100)       # <------- sample: random sample of rows

# Print out the first rows of papers
papers.head()

Unnamed: 0,paper_text
2765,Playing Pinball with non-invasive BCI\n\nMicha...
1924,Distributed Occlusion Reasoning for Tracking\n...
3352,Semi-Supervised Learning with Adversarially\nM...
4636,Action from Still Image Dataset and Inverse Op...
2043,Computing the Solution Path for the\nRegulariz...


In [8]:
papers['paper_text'][2765]

'Playing Pinball with non-invasive BCI\n\nMichael W. Tangermann\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nMatthias Krauledat\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nschroedm@cs.tu-berlin.de\n\nkraulem@cs.tu-berlin.de\n\nKonrad Grzeska\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nMax Sagebaum\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nkonradg@cs.tu-berlin.de\n\nmax.sagebaum@first.fraunhofer.de\n\nCarmen Vidaurre\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nBenjamin Blankertz\nMachine Learning Laboratory\nBerlin Institute of Technology\nBerlin, Germany\n\nvidcar@cs.tu-berlin.de\n\nblanker@cs.tu-berlin.de\n\n?\nKlaus-Robert Muller\nMachine Learning Laboratory, Berlin Institute of Technology, Berlin, Germany\nkrm@cs.tu-berlin.de\n\nAbstract\nCompared to invasive Brain-Computer Interfaces (BCI), non-inva

### 2.1. Remove punctuation / lower casing

In [3]:
# Load the regular expression library
import re

# Remove punctuation (raw string avoids an invalid-escape warning for "\.")
papers['paper_text_processed'] = papers['paper_text'].map(lambda x: re.sub(r'[,\.!?]', '', x))

# Convert the paper text to lowercase
papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: x.lower())

# Print out the first rows of papers
papers['paper_text_processed'].head()

4597    neural network routing for random multistage\n...
3463    tiled convolutional neural networks\nquoc v le...
4090    iterative ranking from pair-wise comparisons\n...
4044    priors for diversity in generative\nlatent var...
1731    sub-microwatt analog vlsi\nsupport vector mach...
Name: paper_text_processed, dtype: object

### 2.2. Tokenize words and further clean-up text

In [4]:
import gensim 
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences: 
        # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data = papers.paper_text_processed.values.tolist()
data_words = list(sent_to_words(data))

print(data_words[:1][0][:30])

['neural', 'network', 'routing', 'for', 'random', 'multistage', 'interconnection', 'networks', 'mark', 'goudreau', 'princeton', 'university', 'and', 'nee', 'research', 'institute', 'inc', 'independence', 'way', 'princeton', 'nj', 'lee', 'giles', 'nec', 'research', 'institute', 'inc', 'independence', 'way', 'princeton']
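
For readers curious about what `simple_preprocess` does to each document, the following is a rough standard-library approximation, not Gensim's actual implementation: lowercase the text, optionally strip accents, split it into alphabetic tokens, and drop tokens that are very short or very long. The name `rough_preprocess` and the sample sentence are hypothetical; the length limits of 2 and 15 mirror Gensim's documented defaults.

```python
import re
import unicodedata

def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, optional accent removal, length filter."""
    text = text.lower()
    if deacc:
        # Decompose accented characters and drop the combining marks
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Runs of letters only, so digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess("Café routing: 3 naïve networks!", deacc=True)
print(tokens)
```

Punctuation disappears because of the tokenization itself, while `deacc=True` only controls whether accented letters are reduced to their base letters.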


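## 3. Phrase Modeling: Bi-grams and Tri-grams

Bigrams are two words that frequently occur together in the documents; trigrams are three such words. Detected phrases are joined with an underscore, for example 'back_bumper', 'oil_leakage' or 'maryland_college_park'. In this corpus, the final model's topics contain bigrams such as 'action_potential' and 'firing_rate'.

Gensim's Phrases model can detect bigrams, trigrams, quadgrams and longer phrases. Its two most important arguments are min_count, which ignores words and bigrams whose total count is below this value, and threshold, the score a word pair must reach to be accepted as a phrase (a higher threshold yields fewer phrases).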

In [5]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
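
To make the roles of `min_count` and `threshold` more concrete, here is a small sketch of the kind of score used to decide whether two adjacent words form a phrase. It follows the formula from Mikolov et al. (2013), on which Gensim's default scorer is based. The function names and toy sentences are assumptions for illustration, the vocabulary size is simplified to the number of distinct words, and the real `Phrases` class handles pruning, delimiters and other details not shown here.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score adjacent pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

def detect_phrases(sentences, min_count=1, threshold=1.0):
    """Return the word pairs whose score exceeds the threshold."""
    scores = phrase_scores(sentences, min_count)
    return sorted(pair for pair, s in scores.items() if s > threshold)

# Toy sentences (hypothetical)
toy_sentences = [
    ["action", "potential", "of", "the", "neuron"],
    ["the", "action", "potential", "is", "short"],
    ["action", "potential", "and", "firing", "rate"],
    ["the", "firing", "rate", "of", "the", "network"],
]
phrases_low = detect_phrases(toy_sentences, min_count=1, threshold=2.0)
phrases_high = detect_phrases(toy_sentences, min_count=1, threshold=3.0)
print(phrases_low, phrases_high)
```

With these toy sentences, a threshold of 2.0 keeps "action potential" and "firing rate" while rejecting the frequent but uninformative pair "of the"; raising the threshold to 3.0 keeps nothing.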

### 3.1. Remove Stopwords, Make Bigrams and Lemmatize

In [6]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alexa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [7]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts] # Returns only the non-stopword tokens

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokens and keep only the given POS tags (see https://spacy.io/api/annotation).

    Relies on the global spaCy pipeline `nlp`, which is loaded in the next cell.
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [8]:
%%time

# Calling those functions in order
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1][0][:30])

['neural', 'network', 'route', 'random', 'network', 'abstract', 'route', 'scheme', 'use', 'neural', 'network', 'develop', 'aid', 'establishing', 'point', 'point', 'communication', 'route', 'network', 'min', 'network', 'type', 'examine', 'hopfield', 'hopfield', 'work', 'problem', 'establish', 'route', 'random']
Wall time: 44.9 s


## 4. Data Transformation: Corpus and Dictionary

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let's create them.

**doc2bow:** Converts a document (a list of words) into the bag-of-words format, i.e. a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document; apply tokenization, stemming etc. before calling this method.

In [9]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts] # ----> (token_id, token_count)

# View
print(corpus[:1][0][:30])

[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 3), (16, 2), (17, 3), (18, 1), (19, 2), (20, 3), (21, 1), (22, 2), (23, 1), (24, 8), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]
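
As a rough mental model of what `Dictionary` and `doc2bow` produce, the sketch below builds a token-to-id mapping and converts each document into sorted `(token_id, token_count)` pairs using only the standard library. The helper names `build_vocab` and `to_bow` are hypothetical, and Gensim's real id assignment may order tokens differently.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in texts:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy documents (hypothetical)
toy_texts = [["neural", "network", "route", "network"], ["route", "scheme", "neural"]]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_texts]
print(toy_corpus)
```

Word order is discarded in this representation; only which tokens occur and how often is passed on to the LDA model.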


### 4.1. Building the base topic model

We have everything required to train the base LDA model. In addition to the corpus and dictionary, you need to provide the number of topics. Apart from that, alpha and eta are hyperparameters that affect the sparsity of the document-topic and topic-word distributions, respectively. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior (we'll use the defaults for the base model).

In [10]:
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)
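
To build some intuition for how a Dirichlet prior affects sparsity, the sketch below draws topic mixtures from symmetric Dirichlet distributions with different concentration values, using only the standard library. It illustrates the prior alone and is not part of the LDA training above; the helper names, the seed and the 0.05 cutoff are assumptions.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def avg_active_topics(alpha, num_topics=10, n_samples=1000, cutoff=0.05, seed=100):
    """Average number of topics that receive more than `cutoff` of the probability mass."""
    rng = random.Random(seed)
    active = 0
    for _ in range(n_samples):
        mixture = sample_dirichlet(alpha, num_topics, rng)
        active += sum(1 for p in mixture if p > cutoff)
    return active / n_samples

sparse = avg_active_topics(0.01)
dense = avg_active_topics(0.91)
for a in (0.01, 0.31, 0.91):
    print(a, round(avg_active_topics(a), 2))
```

Smaller concentration values push most of the mass onto a few topics, while values closer to 1 spread it across many; the same reasoning applies to eta and the words within a topic.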

### 4.2. View the topics in LDA model

In [11]:
from pprint import pprint

# Print the keywords of the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.015*"inference" + 0.014*"message" + 0.014*"variable" + 0.014*"learn" + '
  '0.011*"image" + 0.009*"feature" + 0.008*"line" + 0.007*"set" + '
  '0.007*"potential" + 0.007*"method"'),
 (1,
  '0.022*"sample" + 0.011*"number" + 0.011*"use" + 0.010*"approach" + '
  '0.008*"time" + 0.008*"set" + 0.008*"classifier" + 0.008*"show" + '
  '0.007*"current" + 0.007*"base"'),
 (2,
  '0.026*"network" + 0.015*"neuron" + 0.010*"neural" + 0.009*"use" + '
  '0.009*"image" + 0.008*"set" + 0.007*"model" + 0.006*"value" + 0.005*"input" '
  '+ 0.005*"output"'),
 (3,
  '0.016*"model" + 0.010*"experiment" + 0.009*"image" + 0.008*"set" + '
  '0.008*"number" + 0.007*"datum" + 0.007*"use" + 0.007*"show" + '
  '0.007*"distribution" + 0.007*"order"'),
 (4,
  '0.011*"use" + 0.010*"image" + 0.010*"feature" + 0.010*"learn" + 0.008*"set" '
  '+ 0.008*"model" + 0.008*"function" + 0.006*"datum" + 0.006*"problem" + '
  '0.006*"method"'),
 (5,
  '0.012*"model" + 0.012*"set" + 0.008*"network" + 0.008*"function" +
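
Each entry returned by `print_topics()` pairs a topic id with a string of `weight*"word"` terms joined by ` + `. If those terms are needed as data, a small parser such as the hypothetical `parse_topic_string` below can split them. This is only a sketch for the string format shown above; Gensim's `show_topic` method returns word and weight pairs directly.

```python
import re

# One term looks like 0.015*"inference"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic_string):
    """Split a print_topics() string into a list of (word, weight) tuples."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

example = '0.015*"inference" + 0.014*"message" + 0.014*"variable" + 0.014*"learn"'
terms = parse_topic_string(example)
print(terms)
```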

## 5. Base Model Performance

#### Compute Model Perplexity and Coherence Score

Let's calculate the baseline coherence score:

In [12]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.2503859950174131
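
The heading mentions perplexity, although only coherence is computed above. As a reminder of what the quantity means, the sketch below computes perplexity from the probabilities a model assigns to held-out tokens. The function name and the toy probabilities are assumptions for illustration. For the trained model itself, Gensim provides `lda_model.log_perplexity(corpus)`, which returns a per-word likelihood bound rather than the perplexity value.

```python
import math

def perplexity(token_probs):
    """exp of the negative mean log-probability of the tokens; lower is better."""
    if not token_probs or any(p <= 0 for p in token_probs):
        raise ValueError("need a non-empty list of positive probabilities")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

# A model that spreads its mass evenly over 4 words (hypothetical values)
uniform = perplexity([0.25] * 8)
# A model that is confident about each observed token (hypothetical values)
confident = perplexity([0.9, 0.8, 0.9, 0.7])
print(uniform, confident)
```

A model that is as uncertain as a uniform choice among four words has a perplexity of about 4, and a more confident model scores lower.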


## 6. Hyperparameter Tuning

First, let's differentiate between model hyperparameters and model parameters:

- `Model hyperparameters` can be thought of as settings for a machine learning algorithm that the data scientist chooses before training. Examples would be the number of trees in a random forest or, in our case, the number of topics K.

- `Model parameters` can be thought of as what the model learns during training, such as the weights for each word in a given topic.

Now that we have the baseline coherence score for the default LDA model, let's perform a series of sensitivity tests to help determine the following model hyperparameters: 
- Number of Topics (K)
- Dirichlet hyperparameter alpha: Document-Topic Density
- Dirichlet hyperparameter beta: Word-Topic Density (called `eta` in Gensim)

The code below evaluates every combination of these values (a full grid search) on each validation corpus set. Only the full corpus is used here, because the clipped subsets are commented out. We'll use `C_v` as the metric for comparing performance.

In [13]:
# Supporting function: train one LDA model and return its c_v coherence
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    # Use the dictionary passed in; coherence is computed against the global data_lemmatized texts
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

**Let's call the function, and iterate it over the range of topics, alpha, and beta parameter values**

In [14]:
%%time

import numpy as np
import tqdm 

#tqdm --> Instantly make your loops show a smart progress meter - just wrap any iterable with tqdm(iterable), and you're done!

grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [# gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.25)), 
               # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.5)), 
               # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
               corpus]

corpus_title = ['100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Can take a long time to run
if 1 == 1:
    pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))
    
    # iterate through validation corpora
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    
                    pbar.update(1)
    pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
    pbar.close()

100%|██████████| 270/270 [3:39:46<00:00, 48.84s/it]     

Wall time: 3h 39min 46s
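
As a sanity check on the size of this search, the sketch below rebuilds the same grid with `itertools.product` and counts the combinations. The alpha and beta values are written out by hand instead of using `np.arange`, which is an assumption made to keep the example dependency-free.

```python
from itertools import product

topics_range = range(2, 11, 1)  # 2, 3, ..., 10
# Same values that np.arange(0.01, 1, 0.3) produces: 0.01, 0.31, 0.61, 0.91
numeric_values = [round(0.01 + 0.3 * i, 2) for i in range(4)]
alpha_values = numeric_values + ['symmetric', 'asymmetric']
beta_values = numeric_values + ['symmetric']

# Every (num_topics, alpha, beta) combination that the nested loops visit
param_grid = list(product(topics_range, alpha_values, beta_values))
print(len(param_grid))  # 9 * 6 * 5 = 270, matching the progress bar total
```

At roughly 49 seconds per model, as reported by the progress bar, these 270 combinations account for the wall time of about 3 hours 40 minutes.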





## 7. Final Model

Based on an external evaluation of the tuning results (done in Excel; the code is still to be added), train the final model with the selected hyperparameters: 8 topics, alpha=0.01 and eta=0.9.
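
The selection step itself is not shown in the source. One way to do it in Python rather than in a spreadsheet might look like the sketch below, which picks the row with the highest coherence from data shaped like `lda_tuning_results.csv`. The rows in `sample_csv` are made-up placeholder values for illustration only, not results from this notebook, and `best_configuration` is a hypothetical helper.

```python
import csv
import io

def best_configuration(csv_file):
    """Return the row (as a dict) with the highest 'Coherence' value."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("no tuning results found")
    return max(rows, key=lambda row: float(row['Coherence']))

# Placeholder data with the same columns as the tuning output (values are invented)
sample_csv = io.StringIO(
    "Validation_Set,Topics,Alpha,Beta,Coherence\n"
    "100% Corpus,2,0.01,0.01,0.10\n"
    "100% Corpus,3,symmetric,0.31,0.30\n"
    "100% Corpus,4,asymmetric,symmetric,0.20\n"
)
best = best_configuration(sample_csv)
print(best['Topics'], best['Alpha'], best['Beta'])
```

With the real file, `open('lda_tuning_results.csv', newline='')` would take the place of `sample_csv`. Note that the Alpha and Beta columns mix numbers with strings such as `symmetric`, so they are left as text here.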

In [15]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=8, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       alpha=0.01,
                                       eta=0.9)

In [16]:
from pprint import pprint

# Print the keywords of the 8 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.000*"set" + 0.000*"use" + 0.000*"learn" + 0.000*"image" + 0.000*"model" + '
  '0.000*"feature" + 0.000*"function" + 0.000*"method" + 0.000*"problem" + '
  '0.000*"figure"'),
 (1,
  '0.010*"current" + 0.005*"voltage" + 0.003*"time" + 0.003*"dynamic" + '
  '0.003*"action_potential" + 0.003*"constant" + 0.002*"cell" + 0.002*"unit" + '
  '0.002*"circuit" + 0.002*"chip"'),
 (2,
  '0.016*"network" + 0.010*"neuron" + 0.006*"neural" + 0.005*"route" + '
  '0.004*"template" + 0.004*"transformation" + 0.004*"firing_rate" + '
  '0.003*"syllable" + 0.003*"face" + 0.003*"router"'),
 (3,
  '0.013*"model" + 0.007*"set" + 0.006*"algorithm" + 0.005*"order" + '
  '0.005*"result" + 0.005*"show" + 0.004*"problem" + 0.004*"use" + 0.004*"let" '
  '+ 0.004*"number"'),
 (4,
  '0.011*"use" + 0.010*"model" + 0.009*"image" + 0.009*"learn" + '
  '0.009*"function" + 0.008*"set" + 0.007*"feature" + 0.006*"method" + '
  '0.006*"show" + 0.006*"datum"'),
 (5,
  '0.010*"set" + 0.008*"cost" + 0.007*"function" +

## 8. Visualize Results

In [17]:
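import pyLDAvis
# The gensim adapter was renamed in pyLDAvis 3.x; fall back to the old name on older releases
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)

LDAvis_prepared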