## Visualization of Topics

We use the pyLDAvis visualization by [Ben Mabey](https://github.com/bmabey/pyLDAvis), who ported the original R package LDAvis to Python.

pyLDAvis shows topics as circles in a 2D plot, which approximates topic similarity: the more similar two topics are, the closer their circles appear. The size of each circle corresponds to the prevalence of the topic in the corpus.

With no topic selected, the visualization shows the 30 most salient terms (not the most frequent!), where saliency measures how informative a term is about the topics. A word that is frequent in one topic but also frequent across the entire corpus gets a lower saliency score than a word that is frequent in that topic alone. Conceptually, it is similar to TF-IDF.
When a topic is selected, the chart shows the most relevant terms for that topic. Relevance is similar in spirit to saliency, but it is computed per topic: it is a weighted combination of the log of a term's probability within the topic and the log of its lift, where lambda = 1 ranks only by the probability of the term and lambda = 0 ranks only by lift (the ratio of a term’s probability within a topic to its marginal probability across the corpus).
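
To make the effect of lambda concrete, here is a minimal sketch of the relevance score as defined in the LDAvis paper, `lambda * log(p(w|t)) + (1 - lambda) * log(p(w|t) / p(w))`. The function name, the two terms, and their probabilities are made up for illustration; pyLDAvis computes these values internally from the fitted model.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term for a topic: lam weights log probability against log lift."""
    log_prob = math.log(p_term_given_topic)
    log_lift = math.log(p_term_given_topic / p_term)
    return lam * log_prob + (1 - lam) * log_lift

# Hypothetical toy numbers: (probability within the topic, probability in the corpus)
terms = {
    "data": (0.050, 0.040),    # frequent in the topic, but also everywhere else
    "tensor": (0.010, 0.001),  # rarer, but concentrated in this topic
}

def rank(lam):
    """Return the terms ordered from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam=lam), reverse=True)

by_probability = rank(1.0)  # lambda = 1: rank by probability within the topic
by_lift = rank(0.0)         # lambda = 0: rank by lift
print(by_probability, by_lift)
```

With lambda = 1 the common word "data" comes first; with lambda = 0 the topic-specific word "tensor" overtakes it. Intermediate values trade off the two orderings.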

In [None]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
# Note: pyLDAvis 3.0 and later renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [None]:
# used to hush the warnings that appear in pyLDAvis
import warnings
warnings.filterwarnings("ignore")

Make sure that uninformative words, such as stop words, have been removed from the tokens, since every remaining token ends up in the dictionary.
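
As a rough illustration of frequency-based filtering, the sketch below drops tokens by document frequency in plain Python. It is meant to mirror the documented meaning of the `no_below` (absolute number of documents) and `no_above` (fraction of documents) arguments of gensim's `Dictionary.filter_extremes`, which the function below calls; the helper name and the toy documents are hypothetical, and this is not gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below=1, no_above=0.9):
    """Keep tokens found in at least `no_below` documents (absolute count)
    and in at most a `no_above` fraction of all documents."""
    num_docs = len(tokenized_docs)
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    keep = {
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }
    return [[token for token in doc if token in keep] for doc in tokenized_docs]

# Hypothetical toy corpus: "the" is in every document, "cat" in two, the rest in one
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
    ["the", "bird", "sang"],
]
filtered = filter_by_document_frequency(docs, no_below=2, no_above=0.9)
print(filtered)
```

Here "the" is removed because it occurs in more than 90% of the documents, and every word that occurs in only one document is removed by `no_below=2`.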

Decide on the number of topics you wish to observe by setting the parameter `num_topics`.

In [None]:
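def pyldavis_prep(tokens, num_topics=5, num_words=10):
    """Fit an LDA model on tokenized documents and print the top words per topic."""
    dictionary = corpora.Dictionary(tokens)
    # no_below is an absolute document count, no_above a fraction of the corpus:
    # keep every token that occurs in at most 90% of the documents
    dictionary.filter_extremes(no_below=1, no_above=0.9)

    # bag-of-words representation of each document
    corpus = [dictionary.doc2bow(text) for text in tokens]

    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    # formatted=False returns (topic_id, [(word, probability), ...]) tuples
    topics = ldamodel.show_topics(num_topics=num_topics, num_words=num_words, formatted=False)

    for topic_id, terms in topics:
        # number topics from 1, as pyLDAvis does in its plot
        print("Topic {}:".format(topic_id + 1))
        for word, prob in terms:
            print('   {}: {:.3f}'.format(word, prob))

    return corpus, dictionary, ldamodel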

The function below renders an interactive visualization of the topics from the LDA model.

You can select a topic by clicking on its circle in the plot or by entering the topic number in the control area at the top.

On the right, you see the most relevant terms for the selected topic. If you hover over a term in the bar chart on the right, the topic circles resize to show how that term is distributed across the topics.
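
One way to think about the resizing is as the distribution of a single term across topics, P(topic | term), which follows from Bayes' rule. The sketch below assumes that this is the quantity the circles encode; the function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
def topic_distribution_for_term(p_term_given_topic, p_topic):
    """Bayes' rule: P(topic | term) is proportional to P(term | topic) * P(topic)."""
    joint = [p_w_t * p_t for p_w_t, p_t in zip(p_term_given_topic, p_topic)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical numbers for one term in a 3-topic model
p_term_given_topic = [0.02, 0.001, 0.001]  # P(term | topic) for topics 1, 2, 3
p_topic = [0.2, 0.5, 0.3]                  # P(topic), the overall topic prevalence
shares = topic_distribution_for_term(p_term_given_topic, p_topic)
print(shares)
```

In this toy case most of the term's mass falls on the first topic, even though that topic is the least prevalent one overall, so its circle would dominate for this term.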

In [None]:
def pyldavis_vis(corpus, dictionary, ldamodel, save_to_html=None):
    """Display the pyLDAvis plot for a fitted model; optionally save it as HTML."""
    # sort_topics=False keeps the topic order of the LDA model
    lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
    # pyLDAvis may emit a FutureWarning here, which you can ignore
    if save_to_html:
        pyLDAvis.save_html(lda_display, save_to_html)
    return pyLDAvis.display(lda_display)