# 5.3 Visualize the Model

We will visualize the model in two different ways, from two different perspectives. The package pyLDAvis is used to visualize the topics and their most prevalent lemmas. First install pyLDAvis.

In [1]:
%pip install pyldavis

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip show pyldavis

Name: pyLDAvis
Version: 3.4.0
Summary: Interactive topic model visualization. Port of the R package.
Home-page: https://github.com/bmabey/pyLDAvis
Author: Ben Mabey
Author-email: ben@benmabey.com
License: BSD-3-Clause
Location: c:\users\veldhuis\anaconda3\lib\site-packages
Requires: funcy, gensim, jinja2, joblib, numexpr, numpy, pandas, scikit-learn, scipy, setuptools
Required-by: 
Note: you may need to restart the kernel to use updated packages.


pyLDAvis 3.4.0 contains a line that is no longer compatible with Pandas 2. The following code block takes care of that issue by patching one file (see [this comment](https://github.com/bmabey/pyLDAvis/issues/247#issuecomment-1517214945)). If you use a later version of pyLDAvis, or if you have already patched the file in question, you may skip this step.
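
Whether the patch is needed can be decided from the installed version number. The following is a minimal sketch; `needs_patch` is a hypothetical helper (not part of pyLDAvis), and treating every version up to and including 3.4.0 as affected is an assumption. It only handles plain numeric version strings such as `3.4.0`.

```python
from importlib import metadata


def needs_patch(version: str) -> bool:
    """Return True for pyLDAvis 3.4.0 or older (assumed to be affected)."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts <= (3, 4, 0)


try:
    installed = metadata.version("pyLDAvis")
    print(installed, "needs patch:", needs_patch(installed))
except metadata.PackageNotFoundError:
    print("pyLDAvis is not installed in this environment")
```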

In [3]:
import os
import site

dir_list = site.getsitepackages()

for d in dir_list:
    fname = os.path.join(d, 'pyLDAvis', '_prepare.py')
    backup = fname + '.old'
    if os.path.isfile(fname):
        try:
            corrected = []
            with open(fname, 'r') as f:
                lines = f.read().splitlines()
            if not os.path.isfile(backup):
                os.rename(fname, fname + '.old')  # backup
            for l in lines:
                l = l.replace("drop('saliency', 1)", "drop('saliency', axis=1)")
                corrected.append(l)
            with open(fname, "w") as outfile:
                outfile.write("\n".join(corrected))
            print(f"Patched: {fname}")
        except Exception as e:
            print(f"Error patching {fname}: {e}")

Patched: C:\Users\veldhuis\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py


In [4]:
import os
import gensim
import warnings
import pickle
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import MDS, TSNE
from scipy.spatial.distance import pdist, squareform
import pyLDAvis.gensim_models
from bokeh.models import ColumnDataSource, OpenURL, TapTool, HoverTool, CustomJS, Title, MultiChoice
from bokeh.plotting import figure, output_file, output_notebook, save, show
from bokeh.layouts import column
from bokeh.io import reset_output
output_notebook()

Create directory (if necessary) for visualization output.

In [5]:
os.makedirs('vis', exist_ok=True)

In [6]:
ldamodel = gensim.models.ldamodel.LdaModel.load('output/ldasaved')
dictionary = gensim.corpora.Dictionary.load('output/ldadict')
corpus = gensim.corpora.MmCorpus('output/ldacorpus')

# Get the Model


In [7]:
with open('output/topic_model.p', 'rb') as r:
    topic_model = pickle.load(r)

In [8]:
# show_topics() returns at most 10 topics by default, so take the number of topics from the model itself.
ntopics = ldamodel.num_topics
ldamodel.show_topics(ntopics, formatted=False)

[(0,
  [('manû[unit]n', 0.07460955),
   ('ṣarpu[silver]n', 0.05077602),
   ('hurāṣu[gold]n', 0.047348257),
   ('biltu[load]n', 0.04248671),
   ('šiqlu[unit]n', 0.034077335),
   ('ammatu[unit]n', 0.026702428),
   ('eleppu[ship]n', 0.021947471),
   ('šawiru[ring]n', 0.01901914),
   ('pūtu[forehead]n', 0.018668577),
   ('kāru[quay]n', 0.017603515)]),
 (1,
  [('sisû[horse]n', 0.044651676),
   ('karābu[pray]v', 0.039131243),
   ('dullu[trouble]n', 0.030765027),
   ('ṭūbu[goodness]n', 0.027118457),
   ('šīru[flesh]n', 0.014573547),
   ('balāṭu[live]v', 0.013999887),
   ('ilu[god]n', 0.013509653),
   ('našû[lifted]aj', 0.013248313),
   ('šattu[year]n', 0.011965697),
   ('wadû[know]v', 0.011748885)]),
 (2,
  [('rabû[big]aj', 0.05999384),
   ('ilūtu[divinity]n', 0.03386138),
   ('immeru[sheep]n', 0.03187009),
   ('šalmu[intact]aj', 0.026799057),
   ('luʾʾû[sullied]aj', 0.022594161),
   ('pû[mouth]n', 0.018162629),
   ('bīru[divination]n', 0.016008219),
   ('qātu[hand]n', 0.014118909),
   ('kīnu

# pyLDAvis
Use pyLDAvis to visualize the topic model. By default, pyLDAvis orders the topics by [prevalence](https://github.com/bmabey/pyLDAvis/issues/59) (topic 1 is the most prevalent topic), so that the topic numbers in the visualization do not agree with the topic numbers in the LDA model. The advantage of ordering the topics by prevalence is that new instances of the LDA model are more comparable (that is, the same topic will receive the same number). To keep the order of the model instead, use `sort_topics=False` in the `prepare` command, as is done below. Note that pyLDAvis is a port of an R package, and so the numbering in the visualization begins with 1 (not with 0). The topic numbers in the Document/Topic and Topic/Term matrices below will be adjusted to be compatible with the pyLDAvis visualization.
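
The adjustment from the 0-based numbering of the model to the 1-based numbering of the visualization amounts to adding 1 to every topic number, provided the topic order is left unchanged. A minimal sketch, assuming a document's topics are given as `(topic_id, probability)` pairs; `to_one_based` and the sample values are hypothetical.

```python
def to_one_based(doc_topics):
    """Shift topic ids in (topic_id, probability) pairs from 0-based to 1-based."""
    return [(topic_id + 1, prob) for topic_id, prob in doc_topics]


# Hypothetical topic distribution of one document, numbered as in the model (from 0).
sample_doc_topics = [(0, 0.70), (2, 0.25), (5, 0.05)]
print(to_one_based(sample_doc_topics))
```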

pyLDAvis needs a large output box. The `%%html` lines below create such a box (for the code see [here](http://stackoverflow.com/questions/18770504/resize-ipython-notebook-output-window)).

In [9]:
%%html
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height:1000px;  /* your desired max-height here */
}
.output_scroll {
    box-shadow:none !important;
    -webkit-box-shadow:none !important;
}
</style>


In [10]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary, sort_topics=False)
os.makedirs('vis', exist_ok=True)
pyLDAvis.save_html(vis, 'vis/lda_terms.html')
pyLDAvis.display(vis)

# Visualize the Documents 1: Using MDS
While pyLDAvis is an excellent tool for exploring the topic/term aspect of a topic model (the words and their probabilities in each topic) it does not provide access to the document/topic aspect (the probability distribution of topics in each document). The visualization below plots all the documents according to their (cosine) distances (using Multi-Dimensional Scaling) in the Document/Term DataFrame. Each document (data point in the visualization) is colored according to the most prevalent topic and the size of the dot represents the probability of the most prevalent topic in that document.

Compute the distances between each of the documents. Use either the Document/Topic Dataframe or the Document/Term Dataframe (constructed below) to measure distance.

Since the data is already in list format, CountVectorizer does not need to preprocess or tokenize. The only way to prevent CountVectorizer from doing so is by creating dummy functions for the preprocessor and the tokenizer. The function DoNothing() simply returns the argument it receives.

In [None]:
def DoNothing(x):
    return x

In [None]:
df = topic_model['df']
texts = topic_model['texts']
cv = CountVectorizer(analyzer='word', preprocessor=DoNothing, tokenizer=DoNothing, 
                     token_pattern=None)
dtm = cv.fit_transform(texts)
dtm_df = pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names_out(), index = df.index.values)
dtm_df.head()
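
What a document/term matrix built from pre-tokenized input contains can be illustrated without scikit-learn. This is a sketch only, assuming every document is a list of lemmas; `document_term_matrix` and the toy documents are hypothetical and not taken from the corpus.

```python
from collections import Counter


def document_term_matrix(token_lists):
    """Return the sorted vocabulary and, per document, a row of lemma counts."""
    vocabulary = sorted({lemma for tokens in token_lists for lemma in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[lemma] for lemma in vocabulary])
    return vocabulary, rows


# Hypothetical pre-tokenized documents.
toy_texts = [["gold", "silver", "gold"], ["horse", "gold"]]
toy_vocabulary, toy_rows = document_term_matrix(toy_texts)
print(toy_vocabulary)
for row in toy_rows:
    print(row)
```

Each row corresponds to one document and each column to one lemma, which is the layout that the distance computation in the next step expects.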

In [None]:
dist = squareform(pdist(dtm_df, 'cosine'))
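
Cosine distance compares the proportions of lemmas in two documents and ignores document length. The sketch below spells out the formula (1 minus the cosine of the angle between the two count vectors) in plain Python; the function name and the toy vectors are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equally long count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


counts_a = [2, 0, 1]
counts_b = [4, 0, 2]  # same proportions as counts_a, but twice as long
counts_c = [0, 3, 0]  # shares no lemmas with counts_a
print(cosine_distance(counts_a, counts_b))  # very close to 0
print(cosine_distance(counts_a, counts_c))
```

A document that contains no lemmas at all would lead to a division by zero here, so such rows are best checked for beforehand.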

Compute the position of each document using Multi-Dimensional Scaling. The variable `pos` holds the `x` and `y`  coordinates. Execution of the following cell may take several minutes.

In [None]:
seed = 15
mds = MDS(
    n_components=2, 
    max_iter=3000,
    random_state=seed, 
    dissimilarity="precomputed", 
    n_jobs=1,
    normalized_stress='auto')
pos = mds.fit_transform(dist)
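
Multi-Dimensional Scaling looks for coordinates whose mutual distances reproduce the given distance matrix as closely as possible. The mismatch is commonly measured as stress. The following is a simplified illustration of raw stress in plain Python, not scikit-learn's implementation; `raw_stress` and the toy data are hypothetical.

```python
import math


def raw_stress(target, layout):
    """Sum of squared differences between target distances and layout distances."""
    total = 0.0
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            layout_distance = math.dist(layout[i], layout[j])
            total += (target[i][j] - layout_distance) ** 2
    return total


# Three hypothetical documents whose distances form a 3-4-5 right triangle.
toy_target = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
exact_layout = [(0, 0), (3, 0), (0, 4)]     # reproduces all distances
squeezed_layout = [(0, 0), (1, 0), (0, 1)]  # distorts all distances
print(raw_stress(toy_target, exact_layout))
print(raw_stress(toy_target, squeezed_layout))
```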

Create lists of x and y values (coordinates).

In [None]:
mds_x = [x for x, y in pos]
mds_y = [y for x, y in pos]

Create lists of the most prevalent topic, the probability of the most prevalent topic, and the text name for each document. These lists are used in the tooltips of the Bokeh visualization.

In [None]:
d_t_df = topic_model['d_t_df']
prevalent_topic = d_t_df.idxmax(axis=1)
probability = d_t_df.max(axis=1)
designation = list(df['designation'])
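
The same selection can be traced by hand on a toy table. The sketch below uses plain dictionaries instead of a DataFrame and also applies the size scaling used in the drawing function further down (probability divided by the largest probability, times 15); the helper name and the values are hypothetical.

```python
def most_prevalent(topic_probs):
    """Return (topic, probability) for the topic with the highest probability."""
    topic = max(topic_probs, key=topic_probs.get)
    return topic, topic_probs[topic]


# Hypothetical rows of a document/topic table (topic number -> probability).
toy_table = {
    "doc1": {1: 0.10, 2: 0.85, 3: 0.05},
    "doc2": {1: 0.40, 2: 0.25, 3: 0.35},
}
top = {name: most_prevalent(row) for name, row in toy_table.items()}
largest = max(prob for _, prob in top.values())
dot_sizes = {name: prob / largest * 15 for name, (_, prob) in top.items()}
print(top)
print(dot_sizes)
```

A document dominated by a single topic thus gets a large dot, while a document with an even mix of topics gets a small one.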

# Main Visualization Function
Interactive features in Bokeh, such as a drop-down menu, use a callback function that is activated when a certain event takes place. This event can be a mouse movement, a click, or a change in the drop-down menu. Custom callback functions need to be written in JavaScript.

Draw the visualization. The visualization provides various tools for further exploration:
- tooltips (provides topic, probability, text name and URL)
- box zoom
- wheel zoom
- pan
- reset
- link to document edition
- save the visualization

In addition, the visualization has a drop-down menu that allows the user to select one or more topics.
This function was written with feedback from ChatGPT.
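
The rule applied by the callback is simple: if nothing is selected, every data point keeps its color; otherwise the points of unselected topics are greyed out and made more transparent. The rule is restated here in Python for readability only (the callback that Bokeh runs in the browser has to be JavaScript); `highlight` and the toy lists are hypothetical.

```python
def highlight(topics, base_colors, selected):
    """Return (alphas, colors) for all data points, given the selected topics."""
    alphas, colors = [], []
    for topic, color in zip(topics, base_colors):
        show = len(selected) == 0 or topic in selected
        alphas.append(0.5 if show else 0.1)
        colors.append(color if show else "grey")
    return alphas, colors


toy_topics = ["1", "2", "1", "3"]
toy_colors = ["orange", "olive", "orange", "firebrick"]
print(highlight(toy_topics, toy_colors, selected=[]))
print(highlight(toy_topics, toy_colors, selected=["1"]))
```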

In [None]:
def drawviz(title, outputfile):
    colormap = {
        str(i): color for i, color in enumerate([
            'grey', 'orange', 'olive', 'firebrick', 'gold', 'red', 'fuchsia', 'green',
            'blue', 'purple', 'aqua', 'yellow', 'indigo', 'blueviolet', 'beige', 'navy',
            'chocolate', 'azure', 'coral', 'crimson', 'darkblue', 'darkkhaki', 
            'darkseagreen', 'darkturquoise', 'deeppink', 'black'
        ])
    }

    colormap_source = ColumnDataSource(data=dict(
        topic=list(colormap.keys()),
        color=list(colormap.values())
    ))

    d_mds = dict(
        x=mds_x,
        y=mds_y,
        id_text=list(df.id_text),
        size=probability / max(probability) * 15,
        probability=probability,
        topic=[str(n) for n in prevalent_topic],  # ensure string topics
        color=[colormap[str(n)] for n in prevalent_topic],
        alpha=[0.5] * len(mds_x),
        designation=designation
    )

    source_mds = ColumnDataSource(data=d_mds)

    p = figure(width=1000, height=1000,
               tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
               title=title)

    p.add_tools(HoverTool(tooltips=[
        ("url", "http://oracc.org/" + "@id_text"),
        ("topic, probability", "@topic, @probability"),
        ("designation", "@designation")
    ]))

    p.circle('x', 'y', color='color', fill_alpha='alpha', size='size', source=source_mds)
    p.axis.visible = False

    # MultiChoice selector instead of sliders
    topic_selector = MultiChoice(title="Select Topics",
                                 options=[str(i+1) for i in range(ntopics)],
                                 value=[])

    callback = CustomJS(args=dict(
        source=source_mds,
        selector=topic_selector,
        colormap=colormap_source
    ), code="""
        const selected = selector.value;
        const data = source.data;
        const topics = data['topic'];
        const alpha = data['alpha'];
        const colors = data['color'];
        const cmap_data = colormap.data;

        let cmap = {};
        for (let i = 0; i < cmap_data['topic'].length; i++) {
            cmap[cmap_data['topic'][i]] = cmap_data['color'][i];
        }

        for (let i = 0; i < topics.length; i++) {
            const currentTopic = topics[i];
            const highlight = selected.length === 0 || selected.includes(currentTopic);
            alpha[i] = highlight ? 0.5 : 0.1;
            colors[i] = highlight ? cmap[currentTopic] : 'grey';
        }

        source.change.emit();
    """)

    topic_selector.js_on_change('value', callback)

    # Tap tool
    taptool = p.select(type=TapTool)
    taptool.callback = OpenURL(url="http://oracc.museum.upenn.edu/@id_text")

    instructions = [
        "Highlight one or more topics using the dropdown. If none are selected, all are shown.",
        "Hover over a data point for more info. Click to view the document edition.",
        "Use the toolbar to zoom, pan, reset, or save as .png."
    ]
    for line in instructions:
        p.add_layout(Title(text=line), 'below')

    layout = column(topic_selector, p)
    reset_output()
    output_file(outputfile)
    save(layout)
    show(layout)


In [None]:
drawviz('Visualize with MDS', 'vis/lda_mds.html')

In [None]:
from bokeh.models import Slider  # the other Bokeh names are imported at the top of the notebook

# Palette and plot data at module level, so that the cells below can reuse
# them with other x and y coordinates.
palette = [
    'grey', 'orange', 'olive', 'firebrick', 'gold', 'red', 'fuchsia', 'green',
    'blue', 'purple', 'aqua', 'yellow', 'indigo', 'blueviolet', 'beige', 'navy',
    'chocolate', 'azure', 'coral', 'crimson', 'darkblue', 'darkkhaki',
    'darkseagreen', 'darkturquoise', 'deeppink', 'black'
]
colormap = {str(i): color for i, color in enumerate(palette)}

d_mds = dict(
    x=mds_x,
    y=mds_y,
    id_text=list(df.id_text),
    size=list(probability / probability.max() * 15),
    probability=list(probability),
    topic=[str(n) for n in prevalent_topic],
    color=[colormap[str(n)] for n in prevalent_topic],
    alpha=[0.5] * len(mds_x),
    designation=designation
)

instructions = [
    "Highlight up to two topics with the sliders. If both sliders are at 0, all topics are shown.",
    "Hover over a data point for more info. Click to view the document edition.",
    "Use the toolbar to zoom, pan, reset, or save as .png."
]

In [None]:
def drawviz(data, title, outputfile, ntopics=ntopics, instructions=instructions):
    # Keep the original colors in a separate column so the callback can restore them.
    data = dict(data, base_color=list(data['color']))
    source = ColumnDataSource(data=data)

    # Create figure
    p = figure(
        width=1000, height=1000,
        tools="tap,pan,wheel_zoom,box_zoom,reset,save",
        title=title
    )

    # Add hover tool
    p.add_tools(HoverTool(tooltips=[
        ("URL", "@id_text"),  # the actual link is handled by the tap tool below
        ("Topic, Probability", "@topic, @probability"),
        ("Designation", "@designation")
    ]))

    # Draw circles
    p.circle(x='x', y='y', color='color', fill_alpha='alpha', size='size', source=source)
    p.axis.visible = False

    # Two sliders select up to two topics; 0 means "no selection".
    slider1 = Slider(start=0, end=ntopics, value=0, step=1, title="Topic A")
    slider2 = Slider(start=0, end=ntopics, value=0, step=1, title="Topic B")

    callback = CustomJS(args=dict(source=source, topic1=slider1, topic2=slider2), code="""
        const data = source.data;
        const first = topic1.value;
        const second = topic2.value;
        const showAll = first === 0 && second === 0;

        for (let i = 0; i < data['topic'].length; i++) {
            // Topics are stored as strings; slider values are numbers.
            const current = Number(data['topic'][i]);
            const highlight = showAll ||
                              (first !== 0 && current === first) ||
                              (second !== 0 && current === second);
            data['alpha'][i] = highlight ? 0.5 : 0.1;
            data['color'][i] = highlight ? data['base_color'][i] : 'grey';
        }
        source.change.emit();
    """)
    slider1.js_on_change('value', callback)
    slider2.js_on_change('value', callback)

    # Clicking a data point opens the document edition.
    taptool = p.select(type=TapTool)
    taptool.callback = OpenURL(url="http://oracc.museum.upenn.edu/@id_text")

    # Add instructional text below the plot
    for line in instructions:
        p.add_layout(Title(text=line), 'below')

    # Build layout, save it to file, and show it
    layout = column(slider1, slider2, p)
    output_file(outputfile)
    save(layout)
    show(layout)

In [None]:
reset_output()
output_notebook()
title = "Projection with MDS. Size of the circle represents prevalence of the topic."
outputfile = 'vis/mds1.html'
drawviz(d_mds, title, outputfile)


## Alternative: plotting based on Document/Topic table
The following visualization uses the same approach, but takes the document/topic table as the basis for distance measurements. Documents that share approximately the same distribution of topics will be plotted in the same region. Since the sum of each row in the document/topic table is 1, the distance matrix is computed with Euclidean distance (not cosine).

In [None]:
dist_dt = squareform(pdist(d_t_df))
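
Called without a metric argument, `pdist` uses Euclidean distance. The sketch below computes that distance by hand for three hypothetical topic distributions (each sums to 1); the function name and the values are assumptions for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


row_a = [0.8, 0.1, 0.1]
row_b = [0.7, 0.2, 0.1]  # nearly the same topic mix as row_a
row_c = [0.1, 0.1, 0.8]  # dominated by a different topic
print(euclidean(row_a, row_b))
print(euclidean(row_a, row_c))
```

Documents with a similar mix of topics end up close together, whichever lemmas they actually contain.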

In [None]:
mds = MDS(
    n_components=2,
    max_iter=3000,
    random_state=seed,
    dissimilarity="precomputed",
    n_jobs=1,
    normalized_stress='auto')  # same settings as for the first MDS projection
pos = mds.fit_transform(dist_dt)

In [None]:
d_mds2 = d_mds.copy() # the data source is the same as for the previous visualization, except for the x and y coordinates.
d_mds2['x'] = [x for x, y in pos]
d_mds2['y'] = [y for x, y in pos]

In [None]:
drawviz(d_mds2, 'Visualize with MDS, version 2', 'vis/lda_mds2.html')

In [None]:
reset_output()
output_notebook()
title = "Projection with MDS, based on Document/Topic distribution. Size of the circle represents prevalence of the topic."
outputfile = 'vis/mds2.html'
drawviz(d_mds2, title, outputfile)

# Visualize the Documents 2: Using TSNE

## TSNE based on Document/Term Matrix (Cosine distance)

Cosine distances have been computed earlier; the matrix is stored in the variable `dist`.
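
With `metric="precomputed"` the input is read as a matrix of pairwise distances rather than as feature vectors, so it should be square, symmetric, and non-negative, with zeros on the diagonal. A sanity check along the following lines may help before a long computation is started; `check_distance_matrix` is a hypothetical helper, shown here on small nested lists.

```python
def check_distance_matrix(matrix, tol=1e-9):
    """Return a list of problems found in a pairwise distance matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["not square"]
    problems = []
    for i in range(n):
        if abs(matrix[i][i]) > tol:
            problems.append(f"non-zero diagonal at {i}")
        for j in range(i + 1, n):
            if matrix[i][j] < -tol or matrix[j][i] < -tol:
                problems.append(f"negative distance at ({i}, {j})")
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(f"asymmetric at ({i}, {j})")
    return problems


print(check_distance_matrix([[0, 1], [1, 0]]))
print(check_distance_matrix([[0, 1], [2, 0]]))
```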

In [None]:
X = dist
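# init="random" is required with a precomputed distance matrix (PCA initialization needs feature vectors).
tsne = TSNE(n_components=2, random_state=0, metric="precomputed", init="random")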
X_tsne = tsne.fit_transform(X)

In [None]:
d_tsne = d_mds.copy() # the data source is the same as for the previous visualization, except for the x and y coordinates.
d_tsne['x'] = [x for x, y in X_tsne]
d_tsne['y'] = [y for x, y in X_tsne]

In [None]:
title = "Projection with tSNE. Size of the circle represents prevalence of the topic."
outputfile = 'vis/tsne1.html'
drawviz(d_tsne, title, outputfile)

## TSNE based on Document/Topic Matrix

In [None]:
X = dist_dt
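# init="random" is required with a precomputed distance matrix (PCA initialization needs feature vectors).
tsne = TSNE(n_components=2, random_state=0, metric="precomputed", init="random")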
X_tsne = tsne.fit_transform(X)

In [None]:
d_tsne2 = d_mds.copy() # the data source is the same as for the previous visualization, except for the x and y coordinates.
d_tsne2['x'] = [x for x, y in X_tsne]
d_tsne2['y'] = [y for x, y in X_tsne]

In [None]:
title = "Projection with tSNE, based on Document/Topic distribution. Size of the circle represents prevalence of the topic."
outputfile = 'vis/tsne2.html'
drawviz(d_tsne2, title, outputfile)