# Visualizing a Topic Model

This is a work-in-progress notebook providing different types of visualization for a topic model. The first visualization (`pyLDAvis`) allows a researcher to interact with the *vocabulary* of the corpus and its distribution over topics. pyLDAvis is an out-of-the-box package, a Python port of the R package `LDAvis`.

The second visualization employs the `bokeh` library and allows a researcher to interact with (and inspect) *documents* and the distribution of topics over documents. 

The data used in this notebook are scraped from [ORACC](http://oracc.org). Most of the techniques used here, however, may be applied to any set of documents.

# Dependencies and Versions
pyLDAvis 2.0 is incompatible with Pandas 0.19 (check for pyLDAvis 2.1.0 or later). The MDS and tSNE computations necessary for plotting the distribution of documents in the topic model need scikit-learn 0.18. This notebook was written for Python 3.5.

In [1]:
import pandas as pd
import glob
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Read in Data
First read the directory with the relevant texts. These files contain lemmatization in ORACC style, e.g. `lugal[king]N`. The documents list lemmatizations per line.

In [2]:
path =r'data/' # use your path
allFiles = glob.glob(path +"saao" + "*.txt")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
data = pd.concat(list_)
data.head()

Unnamed: 0,id_text,text_name,version,l_no,text
0,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,o 1,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
1,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,o 2,šulmu[completeness]N ana[to]PRP libbu[interior...
2,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,o 3,ša[that]REL šapāru[send]V mā[saying]PRP māru[s...
3,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,o 4,Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N a...
4,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,o 5,Quwaya[from-Quwe]EN ša[that]REL Urik[1]PN ana[...
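
Each token in the `text` column follows the ORACC pattern `citation-form[guide-word]POS`. As a minimal sketch (the helper name `parse_lemma` and the regular expression are illustrative assumptions, not part of any ORACC tooling), a token can be split into its three parts like this:

```python
import re

# Hypothetical helper: split an ORACC-style lemma such as "lugal[king]N"
# into citation form, guide word, and part-of-speech tag.
LEMMA_RE = re.compile(r'^(?P<cf>[^\[\]]+)\[(?P<gw>.*)\](?P<pos>[^\[\]]*)$')

def parse_lemma(token):
    match = LEMMA_RE.match(token)
    if match is None:
        return None  # token does not follow the expected pattern
    return match.group('cf'), match.group('gw'), match.group('pos')

print(parse_lemma('šarru[king]N'))
print(parse_lemma('Urik[1]PN'))
```

The notebook itself never splits the lemmas; it treats each full string as one token, which keeps homonyms with different guide words apart.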


# Collapse lines to one row per document
In order to transform this DataFrame into a proper input for topic modeling we need to discard the columns `version` and `l_no` and concatenate all the text that belongs to one document in a single row. Some lines have no content in the `text` column - these lines need to be dropped.

First select the relevant columns and drop the rows that have no text content.

In [3]:
data = data[['id_text', 'text_name', 'text']]
data = data.dropna()
data.head()

Unnamed: 0,id_text,text_name,text
0,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
1,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,šulmu[completeness]N ana[to]PRP libbu[interior...
2,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,ša[that]REL šapāru[send]V mā[saying]PRP māru[s...
3,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N a...
4,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,Quwaya[from-Quwe]EN ša[that]REL Urik[1]PN ana[...


Group the rows by `id_text` and apply the `join` function to the `text` column. Transform the aggregated data into a new DataFrame.

In [4]:
docs = data['text'].groupby(data['id_text']).apply(' '.join)
df = pd.DataFrame(docs)
df.head()

Unnamed: 0_level_0,text
id_text,Unnamed: 1_level_1
saao/saa01/P224485,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
saao/saa01/P313416,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313417,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313425,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313427,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...


Create a DataFrame of `id_text` and `text_name` equivalencies, with `id_text` set as index (row names). Then merge these DataFrames using the indexes.

In [5]:
ids_names = data[['id_text','text_name']].drop_duplicates().set_index('id_text')
df = pd.merge(ids_names, df, right_index=True, left_index=True)
df.head()

Unnamed: 0_level_0,text_name,text
id_text,Unnamed: 1_level_1,Unnamed: 2_level_1
saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
saao/saa01/P313416,SAA 01 158. Gold and Silver Objects Sent to th...,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313417,SAA 01 233. More Land to Bel-duri (CT 53 002),ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313425,SAA 01 179. No Iron to the Arabs! (CT 53 010),ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313427,SAA 01 152. The Affair of Gidgidanu and His Br...,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...


In [6]:
from gensim import corpora, models, utils
import gensim

# Read in the texts
[For test purposes one may select only the first 100 documents. Remove the hashmark (#) from the first line of the following cell if you wish to do that]

In [7]:
#df = df[:100]
documents = df['text']

# Tokenize

Texts are tokenized by white space. The output variable `texts` is a list of lists, where each list represents a document.

In [8]:
texts = [[word for word in document.lower().split()] for document in documents]

# POS-filter
The variable `posfilter` holds the last two characters of lemmatized words with allowed Part of Speech tags. If, for instance, you wish to select Verbs, Adjectives, and Nouns (in Akkadian), posfilter will be `[']n', 'aj', ']v']`. Note that one-character pos-tags need the right bracket!
The POS labels are:
* "n", #Nouns
* "v", #Verbs
* "aj", #Adjectives
* "av", #Adverbs
* "an", #Agricultural Name
* "cn", #Celestial Name
* "dn", #Divine Name
* "en", #Ethnicity Name
* "fn", #Field Name
* "gn", #Geographical Name (lands, etc.)
* "ln", #Lineage Name (ancestral clan)
* "mn", #Month Name
* "on", #Object Name
* "pn", #Personal Name
* "qn", #Quarter (of a city) Name
* "rn", #Royal Name
* "sn", #Settlement Name
* "tn", #Temple Name
* "wn", #Watercourse Name
* "yn", #Year Name
* "nu", #Numeral


In [9]:
posfilter = [']n', ']v', 'aj']
#include nouns, verbs, and adjectives, not numerals, prepositions or proper nouns
texts = [[word for word in text if word[-2:] in posfilter] for text in texts]
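
Why the right bracket matters can be seen on a handful of toy tokens. The sketch below uses made-up examples in the same lowercased format; it compares the two-character test used above with a naive one-character test.

```python
# Toy tokens (lowercased, as in the notebook) with different POS tags.
tokens = ['šarru[king]n', 'aššur[1]dn', 'ṭābu[good]aj', 'alāku[go]v', 'ana[to]prp']

posfilter = [']n', ']v', 'aj']

# Comparing the last two characters keeps plain nouns (']n') but not
# divine names ('dn'), because the bracket is part of the match.
kept = [t for t in tokens if t[-2:] in posfilter]

# Checking only the final character would also let 'dn' through.
too_broad = [t for t in tokens if t[-1:] in ('n', 'v')]

print(kept)
print(too_broad)
```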

# Stop words

Stop words are very frequent words that do not help to distinguish between topics. These include, for instance, prepositions, although those are already removed by the POS filter. The following nouns, verbs, and adjectives are too frequent to contribute to the analysis.

In [10]:
stoplist = [
'šarru[king]n',
'bēlu[lord]n',
'libbu[interior]n',
'muhhu[skull]n',
'ardu[slave]n',
'šulmu[completeness]n',
'šapāru[send]v',
'alāku[go]v',
'qabû[say]v',
'pānu[front]n',
'māru[son]n',
'bītu[house]n',
'epēšu[do]v',
'wabālu[bring]v',
'šakānu[put]v',
'amāru[see]v',
'bašû[exist]v',
'našû[lift]v',
'izuzzu[stand]v',
'ūmu[day]n',
'ṭābu[good]aj',
'mādu[many]aj',
'nadānu[give]v',
'tadānu[give]v',
'ṣehru[small]aj',
'mimmû[all]n',
'gimru[totality]n',
'gabbu[totality]n',
'šâlu[ask]v',
'šemû[hear]v',
'awātu[word]n',
'erēbu[enter]v'
]
texts = [[word for word in text if word not in stoplist] for text in texts]


# Filter out texts that have too few words left
Identify texts that have fewer than 10 lemmas left and drop them from both the list `texts` and the DataFrame `df`, so that the two stay aligned.

In [11]:
bo = [len(text)>9 for text in texts]
df = df[bo]
texts = [texts[i] for i in range(0, len(texts)) if bo[i]]

How many documents did we start with, and how many do we have left?

In [12]:
len(bo), len(df)

(3230, 1754)
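
The same boolean mask is applied to both containers so that row *i* of `df` still describes document *i* of `texts`. A minimal sketch with made-up data (the names `ids` and `keep` are illustrative, not from the notebook):

```python
# Three toy documents of 12, 3, and 10 lemmas, with parallel identifiers.
texts = [['a'] * 12, ['b'] * 3, ['c'] * 10]
ids = ['doc1', 'doc2', 'doc3']

# One mask, computed once, drives both selections.
keep = [len(text) >= 10 for text in texts]
texts_kept = [text for text, k in zip(texts, keep) if k]
ids_kept = [doc_id for doc_id, k in zip(ids, keep) if k]

print(ids_kept, [len(t) for t in texts_kept])
```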

# Dictionary
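Create the gensim Dictionary and filter out words that are too common or too rare. `no_below` is the minimum number of documents a word must occur in, and `no_above` is the maximum fraction of documents it may occur in (`no_above` may be set too low here).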

In [13]:
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.8)
## CHECK - is this done correctly?
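
The effect of the two thresholds can be sketched in plain Python. This illustrates the idea behind `filter_extremes` and is not gensim's implementation; the function name `filter_by_doc_freq` and the toy data are assumptions. It also suggests that `no_below=1` removes nothing on the rare side, since every term in the dictionary occurs in at least one document.

```python
from collections import Counter

def filter_by_doc_freq(texts, no_below=1, no_above=0.8):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (illustrative only)."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text))  # count each token once per document
    max_docs = no_above * len(texts)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_texts = [['eqlu', 'ṣarpu'], ['eqlu', 'dīnu'], ['eqlu', 'ṣarpu', 'ilu'],
             ['eqlu'], ['eqlu', 'ilu']]

# 'eqlu' occurs in all five documents and is dropped as too common.
print(filter_by_doc_freq(toy_texts, no_below=1, no_above=0.8))
# Raising no_below to 2 additionally drops 'dīnu', which occurs only once.
print(filter_by_doc_freq(toy_texts, no_below=2, no_above=0.8))
```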

In [14]:
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Compute the Model

Set the seed, indicate the number of topics, and run the model.

In [15]:
seed = 42
np.random.seed(seed)
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
ntopics = 10
# Running and Training LDA model on the document term matrix.
ldamodel = Lda(corpus, num_topics=ntopics, id2word = dictionary, passes=50)

List the top words and their probabilities in all topics. Note: the topic numbers here are not the ones used below in the visualizations! (The topics are the same, but not their numbers).

In [16]:
ldamodel.show_topics(ntopics, formatted = False)

[(0,
  [('šību[witness]n', 0.20119665132839737),
   ('ṣarpu[silver]n', 0.086926239750857531),
   ('manû[unit]n', 0.045961577261693538),
   ('šiqlu[unit]n', 0.030092803557208763),
   ('līmu[eponym-(of-year)]n', 0.029939499915361174),
   ('rabû[to-be-big]v', 0.024426718907711871),
   ('eqlu[field]n', 0.01837725824917737),
   ('imēru[unit]n', 0.018058214717473601),
   ('qaqqadu[head]n', 0.015774471965697597),
   ('šattu[year]n', 0.014893728038126858)]),
 (1,
  [('mātu[land]n', 0.043232311673364318),
   ('karābu[pray]v', 0.023450848880261301),
   ('ilu[god]n', 0.020458059975362312),
   ('maṣṣartu[observation]n', 0.015742270461271211),
   ('pû[mouth]n', 0.014970078742146535),
   ('ahu[brother]n', 0.01376429806421765),
   ('qātu[hand]n', 0.013500165354680377),
   ('abu[father]n', 0.012072018796439137),
   ('dibbu[words]n', 0.011551023852423499),
   ('ṣabātu[seize]v', 0.011132044901703672)]),
 (2,
  [('dīnu[legal-decision]n', 0.081540564396077278),
   ('dabābu[speak]v', 0.078683685334104483),

# pyLDAvis
Use pyLDAvis to visualize the topic model. By default, pyLDAvis orders the topics by [prevalence](https://github.com/bmabey/pyLDAvis/issues/59) (topic 1 is the most prevalent topic), so the topic numbers in the visualization do not agree with the topic numbers in the LDA model. To prevent this behaviour one may use `sort_topics=False` in the `prepare` command, as is done in the cell below. The advantage of ordering the topics by prevalence, however, is that new instances of the LDA model are more comparable (that is, the same topic is more likely to receive the same number). Note that the original LDAvis library was written in R and JavaScript, and so the numbering in the visualization begins with 1 (not with 0). The topic numbers in the Document/Topic and Topic/Term matrices below will be adjusted to be compatible with the pyLDAvis visualization.

pyLDAvis needs a large output box. The `%%html` lines below create such a box (for the code see [here](http://stackoverflow.com/questions/18770504/resize-ipython-notebook-output-window)).

%%html
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height:1000px;  /* your desired max-height here */
}
.output_scroll {
    box-shadow:none !important;
    -webkit-box-shadow:none !important;
}
</style>


In [17]:
import os
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
if not os.path.exists('vis'):
    os.makedirs('vis')
pyLDAvis.save_html(vis, 'vis/lda_terms.html')
pyLDAvis.display(vis)



# Document/Topic Probability
The function `get_document_topics()` will list the probability of the topics in a single document. In order to get all the topics set the argument `minimum_probability` to zero. 

In [18]:
ldamodel.get_document_topics(corpus[1], minimum_probability=0)

[(0, 0.9742816965044887),
 (1, 0.0028574097938728802),
 (2, 0.0028576941610677252),
 (3, 0.0028573030336620734),
 (4, 0.0028577760924097572),
 (5, 0.0028575548249468816),
 (6, 0.0028574054692463635),
 (7, 0.002857387283911733),
 (8, 0.0028585016234512527),
 (9, 0.0028572712129425674)]

# Create Document/Topic Probability Table
In order to create a full Document/Topic probability table we iterate over the entire corpus with the `get_document_topics()` function. This creates a list of lists (`list_of_doctopics`) in which each inner list holds the topic distribution of one document as tuples of the form `(topic_number, probability)`. The `list_of_probabilities` preserves only the probabilities. This list of lists is transformed into a DataFrame that takes its index from the DataFrame `df`, which holds one row per document.

In [19]:
list_of_doctopics = [ldamodel.get_document_topics(corpus[i], minimum_probability=0) for i in range(len(corpus))]
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]
d_t_df = pd.DataFrame(list_of_probabilities)
d_t_df = d_t_df.set_index(df.index)
d_t_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9
id_text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
saao/saa01/P224485,0.001177,0.001177,0.001177,0.001177,0.001177,0.001177,0.98941,0.001177,0.001177,0.001177
saao/saa01/P313416,0.974282,0.002857,0.002858,0.002857,0.002858,0.002858,0.002857,0.002857,0.002858,0.002857
saao/saa01/P313417,0.002223,0.002223,0.002223,0.362884,0.002222,0.002222,0.342887,0.002223,0.002222,0.27867
saao/saa01/P313425,0.065904,0.001352,0.076895,0.083638,0.001352,0.001352,0.512959,0.253846,0.001351,0.001352
saao/saa01/P313427,0.004167,0.478395,0.004167,0.086689,0.004168,0.004168,0.405744,0.004167,0.004167,0.004167
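
The list comprehension above relies on the tuples arriving in topic order. A slightly more defensive variant, sketched here with made-up numbers and a hypothetical variable `rows`, places each probability by its topic id instead, so that a missing or reordered tuple cannot shift the columns.

```python
# Hypothetical per-document output in the (topic_number, probability) format;
# the second document is deliberately out of order.
list_of_doctopics = [
    [(0, 0.90), (1, 0.05), (2, 0.05)],
    [(2, 0.10), (0, 0.20), (1, 0.70)],
]
ntopics = 3

rows = []
for distribution in list_of_doctopics:
    row = [0.0] * ntopics            # one slot per topic
    for topic_id, prob in distribution:
        row[topic_id] = prob         # position is determined by the topic id
    rows.append(row)

print(rows)
```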


# Renumber Topics
Rename the topics `topic_1` to `topic_n` in accordance with the pyLDAvis visualization and add `text_name`.

In [20]:
topics = ['topic_' + str(i+1) for i in range(ntopics)]
d_t_df.columns = topics
d_t_df['text_name'] = df['text_name']
d_t_df

Unnamed: 0_level_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,text_name
id_text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
saao/saa01/P224485,0.001177,0.001177,0.001177,0.001177,0.001177,0.001177,0.989410,0.001177,0.001177,0.001177,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...
saao/saa01/P313416,0.974282,0.002857,0.002858,0.002857,0.002858,0.002858,0.002857,0.002857,0.002858,0.002857,SAA 01 158. Gold and Silver Objects Sent to th...
saao/saa01/P313417,0.002223,0.002223,0.002223,0.362884,0.002222,0.002222,0.342887,0.002223,0.002222,0.278670,SAA 01 233. More Land to Bel-duri (CT 53 002)
saao/saa01/P313425,0.065904,0.001352,0.076895,0.083638,0.001352,0.001352,0.512959,0.253846,0.001351,0.001352,SAA 01 179. No Iron to the Arabs! (CT 53 010)
saao/saa01/P313427,0.004167,0.478395,0.004167,0.086689,0.004168,0.004168,0.405744,0.004167,0.004167,0.004167,SAA 01 152. The Affair of Gidgidanu and His Br...
saao/saa01/P313435,0.007144,0.007144,0.007143,0.007144,0.007144,0.007145,0.367337,0.575511,0.007144,0.007144,SAA 01 192. Oil for the Governor of Dur-Šarruk...
saao/saa01/P313439,0.002703,0.002703,0.002704,0.493995,0.002703,0.002703,0.413925,0.073158,0.002703,0.002703,SAA 01 150. Finding Big Bull Colossi for the K...
saao/saa01/P313447,0.003031,0.141941,0.003031,0.833812,0.003031,0.003031,0.003031,0.003032,0.003030,0.003031,SAA 01 056. Transporting Stone Thresholds on B...
saao/saa01/P313458,0.001786,0.101211,0.001786,0.001786,0.001786,0.705561,0.001786,0.001786,0.180726,0.001786,SAA 01 134. Blessings and Rituals (CT 53 043)
saao/saa01/P313487,0.004348,0.004348,0.222106,0.004348,0.004348,0.004348,0.743109,0.004349,0.004348,0.004348,SAA 01 237. A Harmful Petition (CT 53 072)
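
Because the model counts topics from 0 and the visualization from 1, it can help to keep an explicit mapping in both directions. The sketch below uses three topics; the dictionary names `label_for` and `model_id_for` are illustrative assumptions.

```python
ntopics = 3
topics = ['topic_' + str(i + 1) for i in range(ntopics)]

# Model id (0-based) -> label used in the tables and plots (1-based).
label_for = dict(enumerate(topics))
# Label -> model id, for looking a topic up in the model again.
model_id_for = {label: i for i, label in label_for.items()}

print(label_for[0], model_id_for['topic_3'])
```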


# Create Topic / Term table
This is a table with N rows (the number of topics) and M columns (the number of individual terms in the Dictionary). The table indicates the probability of each term in each topic.

In [21]:
topic_term = ldamodel.show_topics(ntopics, formatted=False, num_words=len(dictionary))

The object `topic_term` is a list with one entry per topic. Each entry is a tuple of the topic number and a list of tuples in the form `(word, probability)`. The following code pulls out the probabilities for each word in each topic (`topic_term[i][1]`) and creates a list of DataFrames with the words as index (rows) and the probabilities as the only column. The DataFrames are concatenated into a single DataFrame.

In [22]:
topic_term_list = [pd.DataFrame(topic_term[i][1]).set_index(0) for i in range(0, ntopics)]
t_t_df_ = pd.concat(topic_term_list, axis=1, ignore_index=True)
t_t_df_.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
abati[(meaning-unknown)]n,3.4e-05,1.3e-05,1.4e-05,3.9e-05,8.4e-05,1.8e-05,1.2e-05,2.8e-05,5.2e-05,0.00013
abbūtu[fatherhood]n,3.4e-05,0.001584,1.4e-05,3.9e-05,8.4e-05,0.000304,0.00017,2.8e-05,5.2e-05,1.2e-05
abiktu[defeat]n,3.4e-05,1.3e-05,1.4e-05,3.9e-05,8.4e-05,1.8e-05,1.2e-05,0.000864,5.2e-05,1.2e-05
abku[captive]n,3.4e-05,1.3e-05,1.4e-05,3.9e-05,8.4e-05,1.8e-05,1.2e-05,0.000585,0.000569,1.2e-05
abnu[stone]n,0.001241,1.8e-05,1.4e-05,3.9e-05,0.0013,0.001378,1.2e-05,0.000432,5.2e-05,1.2e-05


Rename the columns (`0` becomes `topic_1`, etc.) and transpose to obtain the Topic/Term matrix.

In [23]:
t_t_df_.columns = topics
t_t_df = t_t_df_.T
t_t_df

Unnamed: 0,abati[(meaning-unknown)]n,abbūtu[fatherhood]n,abiktu[defeat]n,abku[captive]n,abnu[stone]n,abu[father]n,abullu[(city)-gate]n,abāku[lead-away]v,abālu[dry-(up)]v,abāru[(the-metal)-lead]n,...,ṭābtu[salt]n,ṭābtānu[doer-of-good]n,ṭābtūtu[goodwill]n,ṭātu[bribe]n,ṭēhu[immediate-vicinity]n,ṭēmu[(fore)thought]n,ṭēmūtu[of-order]n,ṭīdu[clay]n,ṭīmu[yarn]n,ṭūbu[goodness]n
topic_1,3.4e-05,3.4e-05,3.4e-05,3.4e-05,0.001241,4e-05,3.4e-05,3.4e-05,3.4e-05,0.000373,...,3.4e-05,0.000373,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05
topic_2,1.3e-05,0.001584,1.3e-05,1.3e-05,1.8e-05,0.012072,0.000212,0.005781,1.3e-05,1.3e-05,...,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,0.010497,0.000272,1.3e-05,1.3e-05,0.0018
topic_3,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,...,1.4e-05,1.4e-05,1.4e-05,1.4e-05,0.000151,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05
topic_4,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,0.001683,3.9e-05,3.9e-05,...,3.9e-05,3.9e-05,3.9e-05,3.9e-05,3.9e-05,0.001141,3.9e-05,3.9e-05,3.9e-05,7.4e-05
topic_5,8.4e-05,8.4e-05,8.4e-05,8.4e-05,0.0013,8.4e-05,0.010501,8.4e-05,8.4e-05,8.4e-05,...,8.4e-05,8.4e-05,8.4e-05,8.4e-05,8.4e-05,0.008225,8.4e-05,8.4e-05,0.000923,8.4e-05
topic_6,1.8e-05,0.000304,1.8e-05,1.8e-05,0.001378,0.009628,0.000708,0.000227,0.000381,1.8e-05,...,0.0002,1.8e-05,1.8e-05,2e-05,1.8e-05,0.007027,1.8e-05,1.8e-05,1.8e-05,0.0307
topic_7,1.2e-05,0.00017,1.2e-05,1.2e-05,1.2e-05,4e-05,0.001785,0.000133,1.2e-05,1.2e-05,...,1.2e-05,1.2e-05,1.2e-05,0.000371,1.2e-05,0.030129,1.2e-05,0.000259,1.2e-05,1.2e-05
topic_8,2.8e-05,2.8e-05,0.000864,0.000585,0.000432,0.026573,0.000347,3.2e-05,2.8e-05,2.8e-05,...,2.8e-05,2.8e-05,2.8e-05,0.000818,2.8e-05,0.002828,2.8e-05,2.8e-05,2.8e-05,2.8e-05
topic_9,5.2e-05,5.2e-05,5.2e-05,0.000569,5.2e-05,0.004856,0.00101,5.2e-05,5.2e-05,5.2e-05,...,5.2e-05,0.000569,5.2e-05,0.000696,5.2e-05,0.010271,5.2e-05,5.2e-05,5.2e-05,5.2e-05
topic_10,0.00013,1.2e-05,1.2e-05,1.2e-05,1.2e-05,1.2e-05,0.000231,1.2e-05,1.2e-05,1.2e-05,...,1.2e-05,1.2e-05,0.00013,1.2e-05,1.2e-05,0.000922,1.2e-05,1.2e-05,1.2e-05,1.2e-05


In [24]:
#just checking
t_t_df['ēkallu[palace]n']

topic_1     0.003609
topic_2     0.009399
topic_3     0.001938
topic_4     0.003693
topic_5     0.006287
topic_6     0.003365
topic_7     0.019851
topic_8     0.007383
topic_9     0.012962
topic_10    0.005468
Name: ēkallu[palace]n, dtype: float64

# Visualize the Documents 1: Using MDS
While pyLDAvis is an excellent tool for exploring the topic/term aspect of a topic model (the words and their probabilities in each topic) it does not provide access to the document/topic aspect (the probability distribution of topics in each document). The visualization below plots all the documents according to their (cosine) distances (using Multi-Dimensional Scaling) in the Document/Term DataFrame. Each document (data point in the visualization) is colored according to the most prevalent topic and the size of the dot represents the probability of the most prevalent topic in that document.

In [25]:
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

Compute the distances between each of the documents. Use either the Document/Topic Dataframe or the Document/Term Dataframe (constructed below) to measure distance.

In [26]:
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+', vocabulary=list(t_t_df.columns.values))
dtm = cv.fit_transform(df['text'])
dtm_df = pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names(), index = df.index.values)
dtm_df.head()

Unnamed: 0,abati[(meaning-unknown)]n,abbūtu[fatherhood]n,abiktu[defeat]n,abku[captive]n,abnu[stone]n,abu[father]n,abullu[(city)-gate]n,abāku[lead-away]v,abālu[dry-(up)]v,abāru[(the-metal)-lead]n,...,ṭābtu[salt]n,ṭābtānu[doer-of-good]n,ṭābtūtu[goodwill]n,ṭātu[bribe]n,ṭēhu[immediate-vicinity]n,ṭēmu[(fore)thought]n,ṭēmūtu[of-order]n,ṭīdu[clay]n,ṭīmu[yarn]n,ṭūbu[goodness]n
saao/saa01/P224485,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
saao/saa01/P313416,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
saao/saa01/P313417,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
saao/saa01/P313425,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
saao/saa01/P313427,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
dist = squareform(pdist(dtm_df, 'cosine'))
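
Cosine distance compares the direction of two count vectors and ignores their length, which is useful when documents differ widely in size. A plain-Python sketch of the measure on made-up vectors (the function name `cosine_distance` is an illustrative assumption):

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same proportions, ten times longer
other_doc = [0, 0, 3]    # shares no terms with the other two

print(cosine_distance(short_doc, long_doc))   # very close to 0
print(cosine_distance(short_doc, other_doc))  # no overlap
```

Note that the measure is undefined for a document whose vector contains only zeros.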

Compute the position of each document using Multi-Dimensional Scaling. The variable `pos` holds the `x` and `y`  coordinates.

In [28]:
mds = MDS(n_components=2, max_iter=3000,
       random_state=seed, dissimilarity="precomputed", n_jobs=1)
pos = mds.fit_transform(dist)

Create a list of x values (coordinates) and a list of y values.

In [29]:
dms_x = [x for x, y in pos]
dms_y = [y for x, y in pos]

Create lists of the most prevalent topic, the probability of the most prevalent topic, and the text name for each document. These lists are used in the Bokeh visualization.

In [30]:
prevalent_topic = d_t_df.drop('text_name', axis=1).idxmax(axis=1)
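# probability of the most prevalent topic = largest value in each row
# (replaces a lookup with the deprecated `.ix` indexer)
probability = d_t_df.drop('text_name', axis=1).max(axis=1).values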
text_name = [name for name in d_t_df['text_name']]
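
The logic of this step, together with the size scaling used for the plots below (the largest probability is mapped to a marker size of 25), can be sketched without pandas. The document names and numbers here are made up.

```python
doc_topic = {
    'doc_a': {'topic_1': 0.05, 'topic_2': 0.80, 'topic_3': 0.15},
    'doc_b': {'topic_1': 0.50, 'topic_2': 0.30, 'topic_3': 0.20},
}

# Label of the topic with the highest probability in each document.
prevalent_topic = {doc: max(row, key=row.get) for doc, row in doc_topic.items()}
# Probability of that topic.
probability = {doc: doc_topic[doc][prevalent_topic[doc]] for doc in doc_topic}

# Scale so that the most confident document gets marker size 25.
largest = max(probability.values())
size = {doc: p / largest * 25 for doc, p in probability.items()}

print(prevalent_topic)
print(size)
```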

In [31]:
len(prevalent_topic)

1754

# Define Colors

Create a palette of the length `ntopics` and link that to topic numbers in the dictionary `colormap`. The list `colors` indicates the proper color for each document (determined by the most prevalent topic). 

In [32]:
from bokeh import palettes
#color_pal = palettes.all_palettes['Viridis'][ntopics]
#try to use d3['Category20']
#color_pal = ["#023FA5","#7D87B9","#BEC1D4","#D6BCC0","#BB7784","#FFFFFF",
#             "#4A6FE3","#8595E1","#B5BBE3","#E6AFB9","#E07B91","#D33F6A", 
#             "#11C638","#8DD593","#C6DEC7","#EAD3C6","#F0B98D","#EF9708", 
#             "#0FCFC0","#9CDED6","#D5EAE7","#F3E1EB","#F6C4E1","#F79CD4"]
# color palette found in http://graphicdesign.stackexchange.com/questions/3682/where-can-i-find-a-large-palette-set-of-contrasting-colors-for-coloring-many-d
#topics = ['topic_' + str(i) for i in range(1, ntopics+1)]
#colormap = dict(zip(topics, color_pal))

colormap = {'topic_1': "orange", 'topic_2': "olive", 'topic_3': "firebrick", 
          'topic_4': "gold", 'topic_5': "red", 'topic_6': "fuchsia", 'topic_7': "green", 
          'topic_8': "blue", 'topic_9': "purple", 'topic_10': "aqua", 'topic_11': "yellow", 
          'topic_12': "indigo", 'topic_13': "blueviolet", 'topic_14': "beige", 'topic_15':"navy", 'topic_16': 'chocolate',
           'topic_17': 'azure', 'topic_18': 'coral', 'topic_19': 'crimson', 'topic_20': 'darkblue', 'topic_21': 'darkkhaki', 
            'topic_22': 'darkseagreen', 'topic_23': 'darkturquoise', 'topic_24': 'deeppink', 'topic_25': 'black'}
#colormap = OrderedDict(colormap)
colors = [colormap[n] for n in prevalent_topic]
color_list = [colormap['topic_' + str(i)] for i in range(1, ntopics+1)]

Import Bokeh and create the data source. `ColumnDataSource` is a Bokeh class that creates a `source`, a column-oriented table that holds information about each of the data points that are plotted. Bokeh can use the `source` to hold `x` and `y` coordinates; to select color, shape, or size; to include data or metadata in tooltips; to create a link to a web page; etc.

In [33]:
from bokeh.io import vform
from bokeh.models import ColumnDataSource, OpenURL, TapTool, HoverTool, CustomJS
from bokeh.models.widgets import Slider
from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.layouts import widgetbox, column

output_notebook()

In [34]:
source_mds = ColumnDataSource(data=dict(
        x=dms_x,
        y=dms_y,
        id_text=list(d_t_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

Draw the visualization. The visualization provides various tools for further exploration:
- tooltips (provides topic, probability, text name and URL)
- box zoom
- wheel zoom
- pan
- reset
- link to document edition
- save the visualization

In addition, the visualization has two sliders that allow the user to select two topics.

In [35]:
p = figure(
    plot_width=1000, plot_height=1000,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution MDS\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))


p.circle('x', 'y', color='color', fill_alpha=.5, size='size', source=source_mds)
p.axis.visible = False

callback = CustomJS(args=dict(source=source_mds), code = """
    var data = source.get('data');
    var current_topic_1 = topic1.value;
    var current_topic_2 = topic2.value;
    x = data['x']
    y = data['y']
    col = data['color']
    topic = data['topic']

    orig_color = data['orig_color']
    for (i = 0; i < x.length; i++) {
        if (topic[i].substring(6) == current_topic_1) {
            col[i] = orig_color[current_topic_1 - 1]
        } else {
            col[i] = 'grey'
        }
        if (topic[i].substring(6) == current_topic_2) {
            col[i] = orig_color[current_topic_2 - 1]
        }
    }
    source.trigger('change');
""")

topic_slider = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic1'] = topic_slider
topic_slider2 = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic2'] = topic_slider2

url = "http://oracc.museum.upenn.edu/@id_text"

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(column(topic_slider, topic_slider2, p))
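
The JavaScript callback is compact, so its logic is restated here as a Python sketch: documents whose most prevalent topic matches either slider keep their original color, and all others turn grey. The function name `recolor` and its arguments are illustrative assumptions; the code is not used by Bokeh.

```python
def recolor(topics, palette, selected_1, selected_2, muted='grey'):
    """Return one color per document, given its most prevalent topic label."""
    colors = []
    for label in topics:
        number = int(label[len('topic_'):])     # 'topic_7' -> 7
        if number in (selected_1, selected_2):
            colors.append(palette[number - 1])  # palette is 0-based
        else:
            colors.append(muted)
    return colors

palette = ['orange', 'olive', 'firebrick']
topics = ['topic_1', 'topic_2', 'topic_3', 'topic_2']
print(recolor(topics, palette, selected_1=2, selected_2=3))
```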

## Alternative: plotting based on Document/Topic table
The following visualization uses the same approach, but takes the document/topic table as the basis for distance measurements. Documents that share approximately the same distribution of topics will be plotted in the same region. Since the sum of each row in the document/topic table is 1, the distance matrix is computed with Euclidean distance (not cosine).

In [36]:
dist = squareform(pdist(d_t_df.drop('text_name', axis=1)))
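
With rows that each sum to 1, Euclidean distance is small for documents with similar topic mixtures and can never exceed the square root of 2. A plain-Python sketch on made-up distributions (the function name `euclidean` is an illustrative assumption):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two topic distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

doc_a = [0.98, 0.01, 0.01]
doc_b = [0.95, 0.03, 0.02]   # nearly the same mixture as doc_a
doc_c = [0.01, 0.01, 0.98]   # dominated by a different topic

print(euclidean(doc_a, doc_b))
print(euclidean(doc_a, doc_c))
```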

In [37]:
mds = MDS(n_components=2, max_iter=3000,
       random_state=seed, dissimilarity="precomputed", n_jobs=1)
pos = mds.fit_transform(dist)

In [38]:
mds2_x = [x for x, y in pos]
mds2_y = [y for x, y in pos]
len(mds2_x)

1754

In [39]:
source_mds2 = ColumnDataSource(data=dict(
        x=mds2_x,
        y=mds2_y,
        id_text=list(d_t_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

In [40]:
p = figure(
    plot_width=1000, plot_height=1000,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution MDS\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))

p.circle('x', 'y', color='color', fill_alpha=0.5, size='size', source=source_mds2)
p.axis.visible = False

callback = CustomJS(args=dict(source=source_mds2), code = """
    var data = source.get('data');
    var current_topic_1 = topic1.value;
    var current_topic_2 = topic2.value;
    x = data['x']
    y = data['y']
    col = data['color']
    topic = data['topic']

    orig_color = data['orig_color']
    for (i = 0; i < x.length; i++) {
        if (topic[i].substring(6) == current_topic_1) {
            col[i] = orig_color[current_topic_1 - 1]
        } else {
            col[i] = 'grey'
        }
        if (topic[i].substring(6) == current_topic_2) {
            col[i] = orig_color[current_topic_2 - 1]
        }
    }
    source.trigger('change');
""")

topic_slider = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic1'] = topic_slider
topic_slider2 = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic2'] = topic_slider2

url = "http://oracc.museum.upenn.edu/@id_text"

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(column(topic_slider, topic_slider2, p))

# Visualize the Documents 2: Using TSNE

In [41]:
from sklearn.manifold import TSNE

# TSNE based on Document/Term Matrix (Cosine distance)

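The variable `dist` was reassigned to the Euclidean Document/Topic distances in the previous section, so the cosine distances over the Document/Term matrix (`dtm_df`) are recomputed here.

In [42]:
X = squareform(pdist(dtm_df, 'cosine'))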
tsne = TSNE(n_components = 2, random_state=0, metric="precomputed")
X_tsne = tsne.fit_transform(X)

In [43]:
tsne_x = [x for x, y in X_tsne]
tsne_y = [y for x, y in X_tsne]

In [44]:
source_tsne = ColumnDataSource(data=dict(
        x=tsne_x,
        y=tsne_y,
        id_text=list(d_t_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

In [45]:
p = figure(
    plot_width=800, plot_height=800,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution tSNE.\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))


p.circle('x', 'y', color='color', fill_alpha=0.5, size='size', source=source_tsne)
p.axis.visible = False

callback = CustomJS(args=dict(source=source_tsne), code = """
    var data = source.get('data');
    var current_topic_1 = topic1.value;
    var current_topic_2 = topic2.value;
    x = data['x']
    y = data['y']
    col = data['color']
    topic = data['topic']

    orig_color = data['orig_color']
    for (i = 0; i < x.length; i++) {
        if (topic[i].substring(6) == current_topic_1) {
            col[i] = orig_color[current_topic_1 - 1]
        } else {
            col[i] = 'grey'
        }
        if (topic[i].substring(6) == current_topic_2) {
            col[i] = orig_color[current_topic_2 - 1]
        }
    }
    source.trigger('change');
""")

topic_slider = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic1'] = topic_slider
topic_slider2 = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic2'] = topic_slider2

url = "http://oracc.museum.upenn.edu/@id_text"

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(column(topic_slider, topic_slider2, p))

# TSNE based on Document/Topic Matrix

In [46]:
X = d_t_df.drop('text_name', axis=1).values  # `.values` replaces the deprecated `.as_matrix()`
tsne = TSNE(n_components = 2, init = 'pca', random_state=0)
X_tsne = tsne.fit_transform(X)

In [47]:
tsne2_x = [x for x, y in X_tsne]
tsne2_y = [y for x, y in X_tsne]

In [48]:
source_tsne2 = ColumnDataSource(data=dict(
        x=tsne2_x,
        y=tsne2_y,
        id_text=list(d_t_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

In [49]:
p = figure(
    plot_width=800, plot_height=800,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution tSNE.\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))


p.circle('x', 'y', color='color', fill_alpha=0.5, size='size', source=source_tsne2)
p.axis.visible = False

callback = CustomJS(args=dict(source=source_tsne2), code = """
    var data = source.get('data');
    var current_topic_1 = topic1.value;
    var current_topic_2 = topic2.value;
    x = data['x']
    y = data['y']
    col = data['color']
    topic = data['topic']

    orig_color = data['orig_color']
    for (i = 0; i < x.length; i++) {
        if (topic[i].substring(6) == current_topic_1) {
            col[i] = orig_color[current_topic_1 - 1]
        } else {
            col[i] = 'grey'
        }
        if (topic[i].substring(6) == current_topic_2) {
            col[i] = orig_color[current_topic_2 - 1]
        }
    }
    source.trigger('change');
""")

topic_slider = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic1'] = topic_slider
topic_slider2 = Slider(start=1, end = ntopics, value=1, step=1, title = 'Topic', callback=callback)
callback.args['topic2'] = topic_slider2

url = "http://oracc.museum.upenn.edu/@id_text"

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(column(topic_slider, topic_slider2, p))