Original idea from: https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/


In this post, we discuss techniques for visualizing the output of an LDA topic model built with the gensim package. I use only a portion of the 20 Newsgroups dataset, since the focus is on approaches to visualizing the results.

Let’s begin by importing the packages and the 20 Newsgroups dataset.

In [1]:
import sys
# !{sys.executable} -m spacy download en_core_web_sm
import re, numpy as np, pandas as pd
from pprint import pprint

# Gensim
import gensim, spacy, logging, warnings
import gensim.corpora as corpora
from gensim.utils import simple_preprocess  # `lemmatize` was removed in gensim 4.0 and is not used here
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'also', 'may', 'take', 'come'])

%matplotlib inline
warnings.filterwarnings("ignore",category=DeprecationWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

## Import NewsGroups Dataset


In [2]:
# pd.set_option('display.max_colwidth', None)  # optional: show the full width of the column content

In [3]:
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')


In [4]:
#df = df.sample(10)

In [5]:
df.head()

| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |


## Tokenize Sentences and Clean

Remove the email addresses, newline characters and single quotes, then split each document into a list of lowercase words with gensim’s simple_preprocess(). Tokenization itself drops punctuation; setting deacc=True additionally strips accent marks.
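To make the effect of the three substitutions concrete, here is a minimal sketch that applies the same regular expressions to a made-up header line (the address `someone@example.com` and the helper name `clean_text` are illustrative assumptions, not part of the notebook). It stops before tokenization, so it needs only the standard library.

```python
import re

def clean_text(text):
    """Apply the same three substitutions used in sent_to_words()."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@'
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs, including newlines
    text = re.sub(r"'", "", text)           # drop single quotes
    return text

# Hypothetical input, shaped like the start of a newsgroup post.
sample = "From: someone@example.com (where's my thing)\nSubject: WHAT car is this!?"
cleaned = clean_text(sample)
print(cleaned)
```

Punctuation and capital letters survive this step; they are handled afterwards by simple_preprocess().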

In [6]:
def sent_to_words(sentences):
    for sent in sentences:
        sent = re.sub(r'\S*@\S*\s?', '', sent)  # remove emails
        sent = re.sub(r'\s+', ' ', sent)  # remove newline chars
        sent = re.sub(r"'", "", sent)  # remove single quotes
        sent = gensim.utils.simple_preprocess(str(sent), deacc=True) 
        yield(sent)  

# Convert to list
data = df.content.values.tolist()
data_words = list(sent_to_words(data))
print(data_words[:1])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]


## Build the Bigram, Trigram Models and Lemmatize


Let’s form the bigrams and trigrams using the Phrases model. The trained model is then wrapped in Phraser(), which applies the detected phrases faster.

Next, lemmatize each word to its root form, keeping only nouns, adjectives, verbs and adverbs.

We keep only these POS tags because they contribute the most to the meaning of the sentences. Here, I use spaCy for lemmatization.

In [7]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# !python3 -m spacy download en_core_web_sm  # run in terminal once
def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove Stopwords, Form Bigrams, Trigrams and Lemmatization"""
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []    
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]    
    return texts_out

data_ready = process_words(data_words)  # processed Text Data!
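
To see why a higher threshold yields fewer phrases, it can help to look at the score being thresholded. The sketch below reimplements, in plain Python, the default ("original") scoring formula documented for gensim's Phrases. All counts are made-up numbers for illustration, not values measured on this corpus, and the function name `phrase_score` is hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default Phrases score: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts for illustration only.
vocab_size = 50_000
min_count = 5

# Two words that almost always appear together.
tight = phrase_score(count_a=40, count_b=45, count_ab=35,
                     vocab_size=vocab_size, min_count=min_count)
# Two frequent words that co-occur only occasionally.
loose = phrase_score(count_a=4000, count_b=3000, count_ab=60,
                     vocab_size=vocab_size, min_count=min_count)

# A pair is kept as a phrase only if its score exceeds the threshold.
kept_by_threshold = {}
for threshold in (0.1, 100, 1000):
    kept_by_threshold[threshold] = [
        name for name, score in (("tight", tight), ("loose", loose)) if score > threshold
    ]
    print(threshold, kept_by_threshold[threshold])
```

With these numbers, the rarely co-occurring pair is rejected once the threshold passes its small score, and a threshold of 1000 rejects both pairs.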

# Build the topic model

To build the LDA topic model using LdaModel(), you need the corpus and the dictionary. Let’s create them first and then build the model. The trained topics (keywords and weights) are printed below as well.
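
As a rough picture of what the dictionary and the corpus contain, the following standard-library sketch mimics the idea behind corpora.Dictionary and doc2bow(): every distinct token gets an integer id, and every document becomes a list of (token id, count) pairs. The helper names and the two toy documents are assumptions for illustration, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id (assigned in sorted token order here)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in the dictionary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Two toy, already-preprocessed documents.
docs = [["car", "engine", "car"], ["drive", "system", "engine"]]
token2id = build_dictionary(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

The LDA model only ever sees these id/count pairs, which is why the dictionary must be passed along to turn topic ids back into words.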



In [8]:
number_topics = 20

In [9]:
# Create Dictionary
id2word = corpora.Dictionary(data_ready)

# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in data_ready]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=number_topics, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)

pprint(lda_model.print_topics())

[(0,
  '0.122*"information" + 0.109*"far" + 0.106*"person" + 0.100*"address" + '
  '0.087*"require" + 0.079*"sense" + 0.064*"pretty" + 0.053*"phone" + '
  '0.046*"stuff" + 0.031*"division"'),
 (1,
  '0.274*"experience" + 0.041*"brave" + 0.000*"evidence" + 0.000*"reason" + '
  '0.000*"faith" + 0.000*"explain" + 0.000*"claim" + 0.000*"physical" + '
  '0.000*"valid" + 0.000*"never"'),
 (2,
  '0.141*"work" + 0.116*"problem" + 0.075*"sure" + 0.062*"however" + '
  '0.059*"file" + 0.056*"buy" + 0.043*"wrong" + 0.043*"technology" + '
  '0.038*"lose" + 0.033*"correct"'),
 (3,
  '0.000*"wollt" + 0.000*"overjoyed" + 0.000*"palaestinens" + 0.000*"quaelt" + '
  '0.000*"schneller" + 0.000*"sein" + 0.000*"vaeter" + 0.000*"wehrmacht" + '
  '0.000*"ihr" + 0.000*"bradly"'),
 (4,
  '0.187*"thing" + 0.124*"call" + 0.118*"car" + 0.075*"name" + 0.052*"model" + '
  '0.051*"small" + 0.040*"bring" + 0.039*"history" + 0.034*"body" + '
  '0.032*"early"'),
 (5,
  '0.084*"drive" + 0.069*"system" + 0.050*"tell" + 0

# Get most relevant documents - LDA Gensim

In [10]:
# See the discussion here:
# https://stackoverflow.com/questions/23509699/understanding-lda-transformed-corpus-in-gensim/37708396?noredirect=1#comment77429460_37708396
# https://stackoverflow.com/questions/45310925/how-to-get-a-complete-topic-distribution-for-a-document-using-gensim-lda

In [11]:
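# Get the full document-topic contribution matrix.
# inference() returns unnormalized topic weights (gamma) for every document,
# so divide each row by its sum to obtain proportions that add up to 1.
matrix_documents_topic_contribution, _ = lda_model.inference(corpus)
matrix_documents_topic_contribution /= matrix_documents_topic_contribution.sum(axis=1)[:, None]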
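
The division above is a row-wise normalization. As a minimal sketch of the same operation without numpy, assuming a tiny made-up weight matrix (the values and the helper name `normalize_rows` are hypothetical):

```python
# Hypothetical unnormalized per-document topic weights (one row per document).
gamma = [
    [0.5, 1.5, 2.0],
    [3.0, 0.5, 0.5],
]

def normalize_rows(rows):
    """Divide every entry by its row total so that each row sums to 1."""
    return [[value / sum(row) for value in row] for row in rows]

theta = normalize_rows(gamma)
for row in theta:
    print(row, sum(row))
```

In the numpy version, `sum(axis=1)[:, None]` turns the vector of row totals into a column, so each row is divided by its own total.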

In [12]:
matrix_documents_topic_contribution = pd.DataFrame(matrix_documents_topic_contribution)

In [13]:
matrix_documents_topic_contribution.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.001563,0.001563,0.001563,0.001563,0.720319,0.001563,0.001563,0.001563,0.001563,0.120832,0.001563,0.001563,0.001563,0.001563,0.101036,0.001563,0.001563,0.032812,0.001563,0.001563
1,0.001191,0.072623,0.001191,0.001191,0.025,0.048802,0.001191,0.001191,0.048804,0.120212,0.001191,0.001191,0.001191,0.001191,0.096427,0.001191,0.001191,0.001191,0.572654,0.001191
2,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.011287,0.505692,0.000538,0.000538,0.000538,0.000538,0.473881,0.000538,0.000538,0.000538,0.000538,0.000538
3,0.567398,0.002174,0.002174,0.002174,0.045652,0.045646,0.002174,0.002174,0.002174,0.306522,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174
4,0.001191,0.001191,0.001191,0.001191,0.025,0.667864,0.001191,0.001191,0.001191,0.23591,0.001191,0.001191,0.001191,0.001191,0.052178,0.001191,0.001191,0.001191,0.001191,0.001191


In [14]:
# Add each document's text as the last column
contents = pd.Series(df['content']).reset_index(drop=True)

In [15]:
matrix_documents_topic_contribution = pd.concat([matrix_documents_topic_contribution, contents], axis=1)


In [16]:
matrix_documents_topic_contribution.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,content
0,0.001563,0.001563,0.001563,0.001563,0.720319,0.001563,0.001563,0.001563,0.001563,0.120832,...,0.001563,0.001563,0.001563,0.101036,0.001563,0.001563,0.032812,0.001563,0.001563,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,0.001191,0.072623,0.001191,0.001191,0.025,0.048802,0.001191,0.001191,0.048804,0.120212,...,0.001191,0.001191,0.001191,0.096427,0.001191,0.001191,0.001191,0.572654,0.001191,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.000538,0.011287,0.505692,...,0.000538,0.000538,0.000538,0.473881,0.000538,0.000538,0.000538,0.000538,0.000538,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,0.567398,0.002174,0.002174,0.002174,0.045652,0.045646,0.002174,0.002174,0.002174,0.306522,...,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,0.002174,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,0.001191,0.001191,0.001191,0.001191,0.025,0.667864,0.001191,0.001191,0.001191,0.23591,...,0.001191,0.001191,0.001191,0.052178,0.001191,0.001191,0.001191,0.001191,0.001191,From: jcm@head-cfa.harvard.edu (Jonathan McDow...
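
With the contributions and the text side by side, the most relevant documents for a topic are the rows with the largest value in that topic's column. A minimal standard-library sketch, using the topic 9 column of the five rows printed above (the helper name `top_documents` is hypothetical):

```python
import heapq

# Contribution of topic 9 to documents 0-4, copied from the table above.
topic_9 = {0: 0.120832, 1: 0.120212, 2: 0.505692, 3: 0.306522, 4: 0.23591}

def top_documents(contributions, k):
    """Return the k (doc_id, weight) pairs with the largest weights."""
    return heapq.nlargest(k, contributions.items(), key=lambda item: item[1])

best = top_documents(topic_9, k=2)
print(best)
```

On the full DataFrame the same selection can be made per topic column; the dictionary here only keeps the example self-contained.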


# Topic similarity metric

## Single corpus

In [17]:
from gensim.models import KeyedVectors 

ruta_word_embedding = 'data/wiki.multi.en.vec'
word_embedding_model = KeyedVectors.load_word2vec_format(ruta_word_embedding)

# Choose the number of top keywords and top documents to consider in the metric

topn_terms = 20
topk_documents = 20
relevance_lambda = 0.6 
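
The notebook does not say how relevance_lambda is used inside topicvisexplorer. Assuming it plays the role of λ in the LDAvis relevance formula, relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)), the sketch below shows how λ changes which terms rank first. The probabilities are made-up values, and this is an illustration of that formula, not the library's implementation.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """LDAvis-style relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Hypothetical (p(w|t), p(w)) pairs for two terms within one topic.
terms = {
    "system": (0.069, 0.050),     # frequent in the topic, but also frequent overall
    "bricklin": (0.010, 0.0002),  # rarer in the topic, but almost exclusive to it
}

rankings = {}
for lam in (1.0, 0.6, 0.0):
    rankings[lam] = sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)
    print(lam, rankings[lam])
```

With λ = 1 terms are ranked purely by their probability within the topic; lower values increasingly favour terms that are distinctive for the topic.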



In [19]:
import importlib



In [20]:
import topicvisexplorer
importlib.reload(topicvisexplorer)
vis = topicvisexplorer.TopicVisExplorer("borrar_nombre")
topic_similarity_matrix = vis.calculate_topic_similarity_on_single_corpus(
    word_embedding_model, lda_model, corpus, id2word,
    matrix_documents_topic_contribution,
    topn_terms, topk_documents, relevance_lambda)



Calculating for omega =  0.0
Calculating for omega =  0.01
Calculating for omega =  0.02
...
Calculating for omega =  0.32

KeyboardInterrupt: 

## Multi corpora

In [None]:
from gensim.models import KeyedVectors 

ruta_word_embedding = 'data/wiki.multi.en.vec'
word_embedding_model = KeyedVectors.load_word2vec_format(ruta_word_embedding)

# Choose the number of top keywords and top documents to consider in the metric

topn_terms = 20
topk_documents = 20
relevance_lambda = 0.6 



In [None]:
import topicvisexplorer
import importlib
importlib.reload(topicvisexplorer)

vis = topicvisexplorer.TopicVisExplorer("borrar_nombre")
topic_similarity_matrix_multicorpora = vis.calculate_topic_similarity_on_multi_corpora(
    word_embedding_model, lda_model, lda_model, corpus, corpus, id2word, id2word,
    matrix_documents_topic_contribution, matrix_documents_topic_contribution,
    topn_terms, topk_documents, relevance_lambda)

### Show visualization - Single corpus

In [None]:
#import topicvisexplorer
import importlib
importlib.reload(topicvisexplorer)
vis = topicvisexplorer.TopicVisExplorer("borrar_nombre")
vis.prepare_single_corpus(lda_model, corpus, id2word, matrix_documents_topic_contribution, topic_similarity_matrix)


In [None]:
#save data
vis.save_single_corpus_data("single_corpus_data_newsgroup_lda_gensim_20_topics.pkl")

In [34]:
vis.run()

 * Serving Flask app "borrar_nombre" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


2020-12-31 18:24:59,005 : INFO :  * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
2020-12-31 18:25:05,207 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:05] "GET /singlecorpus HTTP/1.1" 200 -
2020-12-31 18:25:07,267 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /singlecorpus HTTP/1.1" 200 -
2020-12-31 18:25:07,296 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/js/jquery.min.js HTTP/1.1" 200 -
2020-12-31 18:25:07,601 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/css/bootstrap-table.min.css HTTP/1.1" 200 -
2020-12-31 18:25:07,608 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/css/LDAvis.css HTTP/1.1" 200 -
2020-12-31 18:25:07,614 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/css/bootstrap.min.css HTTP/1.1" 200 -
2020-12-31 18:25:07,618 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/css/nouislider.css HTTP/1.1" 200 -
2020-12-31 18:25:07,620 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/js/popper.min.js HTTP/1.1" 200 -
2020-12-31 18:25:07,659 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/js/bootstrap.min.js HTTP/1.1" 200 -
2020-12-31 18:25:07,918 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/js/d3.v5.min.js HTTP/1.1" 200 -
2020-12-31 18:25:07,923 : INFO : 127.0.0.1 - - [31/Dec/2020 18:25:07] "GET /static/js/sankey.js HTTP/1

### Show visualization - Multi corpora



In [None]:
import importlib


In [None]:
import topicvisexplorer
importlib.reload(topicvisexplorer)



In [None]:
vis = topicvisexplorer.TopicVisExplorer("borrar_nombre")
vis.prepare_multi_corpora(lda_model, lda_model, corpus, corpus, id2word, id2word,
                          matrix_documents_topic_contribution, matrix_documents_topic_contribution,
                          topic_similarity_matrix_multicorpora)

In [None]:
vis.save_multi_corpora_data("multi_corpora_data_newsgroup_lda_gensim_20_topics.pkl")

In [28]:
vis.run()

 * Serving Flask app "borrar_nombre" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


2020-12-31 18:18:01,664 : INFO :  * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
2020-12-31 18:18:04,882 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:04] "GET /multicorpora HTTP/1.1" 200 -
2020-12-31 18:18:04,921 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:04] "GET /static/js/jquery.min.js HTTP/1.1" 200 -
2020-12-31 18:18:05,005 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/css/bootstrap-table.min.css HTTP/1.1" 200 -
2020-12-31 18:18:05,224 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/css/bootstrap.min.css HTTP/1.1" 200 -
2020-12-31 18:18:05,227 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/css/LDAvis.css HTTP/1.1" 200 -
2020-12-31 18:18:05,227 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/css/nouislider.css HTTP/1.1" 200 -
2020-12-31 18:18:05,237 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/js/popper.min.js HTTP/1.1" 200 -
2020-12-31 18:18:05,312 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/js/bootstrap.min.js HTTP/1.1" 200 -
2020-12-31 18:18:05,323 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/js/d3.v5.min.js HTTP/1.1" 200 -
2020-12-31 18:18:05,538 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/js/sankey.js HTTP/1.1" 200 -
2020-12-31 18:18:05,542 : INFO : 127.0.0.1 - - [31/Dec/2020 18:18:05] "GET /static/js/nouislider.js HTTP/1.1" 200