# Topic Model of the State Archives of Assyria Letters

This is a work-in-progress notebook. The number of topics is set with the `ntopics` variable. The corpus contains almost 2500 letters, most of them brief or very brief. The topic model produced from the entire corpus (with 15 or 25 topics) is hard to interpret. One reason may be that the texts are too short: a brief or broken letter with just a few words that happen to be important in a certain topic may come out as one of the most important documents in that topic, because a high percentage of that letter resides in the topic. One solution may be to group (aggregate) letters by metadata, found in the file `SAAO.csv`. For instance, one could group the letters by chapter number, which refers to the chapters in the original book publications, on the assumption that those chapters group texts that are related in a meaningful way.
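The aggregation idea could look roughly like the sketch below. It uses toy data, and the column name `chapter` is an assumption about the layout of `SAAO.csv`; the actual column names may differ.

```python
import pandas as pd

# Toy stand-ins for the letters and for the metadata file.
# The column name 'chapter' is hypothetical.
letters = pd.DataFrame({
    'id_text': ['P1', 'P2', 'P3'],
    'text': ['šarru[king]N', 'bēlu[lord]N', 'šarru[king]N ēkallu[palace]N'],
})
metadata = pd.DataFrame({
    'id_text': ['P1', 'P2', 'P3'],
    'chapter': [1, 2, 1],
})

# Attach the chapter number to each letter, then join all letters of a chapter
# into a single, longer document.
merged = letters.merge(metadata, on='id_text', how='inner')
by_chapter = merged.groupby('chapter')['text'].apply(' '.join)
print(by_chapter)
```

Each chapter then serves as one document for the topic model, at the cost of losing the ability to inspect individual letters.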


In [1]:
import pandas as pd
import glob
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Read in Data
First read the directory with the State Archives of Assyria letters. These files contain the lemmatization from ORACC, with one row for each line of a text.

In [2]:
path =r'../Scrape-Oracc/Output/' # use your path
allFiles = glob.glob(path + "/saao" + "*.txt")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
saao_data = pd.concat(list_)
saao_data.head()

Unnamed: 0,id_text,l_no,text,text_name,version
0,saao/saa01/P224485,o 1,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,
1,saao/saa01/P224485,o 2,šulmu[completeness]N ana[to]PRP libbu[interior...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,
2,saao/saa01/P224485,o 3,ša[that]REL šapāru[send]V mā[saying]PRP māru[s...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,
3,saao/saa01/P224485,o 4,Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N a...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,
4,saao/saa01/P224485,o 5,Quwaya[from-Quwe]EN ša[that]REL Urik[1]PN ana[...,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,
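Each token in the `text` column has the shape `lemma[guide word]POS`, as in `awātu[word]N`. The following sketch splits a token into these three parts with a regular expression. The helper name `parse_token` is hypothetical and not part of the notebook, and the pattern is only inferred from the sample rows above.

```python
import re

# lemma, then the guide word in square brackets, then the POS tag
TOKEN_RE = re.compile(r'^(?P<lemma>[^\[\]]+)\[(?P<guide_word>[^\]]*)\](?P<pos>\w+)$')

def parse_token(token):
    """Return (lemma, guide_word, pos), or None if the token does not match."""
    match = TOKEN_RE.match(token)
    if match is None:
        return None
    return match.group('lemma'), match.group('guide_word'), match.group('pos')

line = 'awātu[word]N šarru[king]N ana[to]PRP'
parsed = [parse_token(token) for token in line.split()]
print(parsed)
```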


# Collapse lines to one row per document
In order to transform this DataFrame into a proper input for topic modeling we need to discard the column `l_no` and concatenate all the text that belongs to one letter in a single row. Some lines have no content in the `text` column - these lines need to be dropped.

First select the relevant columns and drop the rows that have no text content.

In [3]:
saao_data = saao_data[['id_text', 'text_name', 'text']]
saao_data = saao_data.dropna()
saao_data.head()

Unnamed: 0,id_text,text_name,text
0,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
1,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,šulmu[completeness]N ana[to]PRP libbu[interior...
2,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,ša[that]REL šapāru[send]V mā[saying]PRP māru[s...
3,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N a...
4,saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,Quwaya[from-Quwe]EN ša[that]REL Urik[1]PN ana[...


Group the rows by `id_text` and apply the `join` function to the `text` column. Transform the aggregated data into a new DataFrame.

In [4]:
saao_byletter = saao_data['text'].groupby(saao_data['id_text']).apply(' '.join)
saao_byletter_df = pd.DataFrame(saao_byletter)
saao_byletter_df.head()

id_text,text
saao/saa01/P224485,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
saao/saa01/P313416,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313417,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313425,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313427,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...


Create a DataFrame of `id_text` and `text_name` equivalencies, with `id_text` set as index (row names). Then merge this DataFrame with `saao_byletter_df` using the indexes.

In [5]:
saao_id_names = saao_data[['id_text', 'text_name']].drop_duplicates().set_index('id_text')
saao_data_df = pd.merge(saao_id_names, saao_byletter_df, right_index=True, left_index=True)
saao_data_df.head()

id_text,text_name,text
saao/saa01/P224485,SAA 01 001. Midas Of Phrygia Seeks Detente (NL...,awātu[word]N šarru[king]N ana[to]PRP Aššur-šar...
saao/saa01/P313416,SAA 01 158. Gold and Silver Objects Sent to th...,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313417,SAA 01 233. More Land to Bel-duri (CT 53 002),ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313425,SAA 01 179. No Iron to the Arabs! (CT 53 010),ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
saao/saa01/P313427,SAA 01 152. The Affair of Gidgidanu and His Br...,ana[to]PRP šarru[king]N bēlu[lord]N ardu[slave...
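As an alternative to grouping and merging in two steps, a single `groupby` with `agg` can produce both columns at once. This is a sketch on toy data, assuming that `text_name` is identical for all lines of a letter (so that taking the first value is safe).

```python
import pandas as pd

# Toy data: one row per line of text, two letters.
lines = pd.DataFrame({
    'id_text': ['a', 'a', 'b'],
    'text_name': ['Letter A', 'Letter A', 'Letter B'],
    'text': ['awātu[word]N šarru[king]N', 'šulmu[completeness]N', 'ana[to]PRP šarru[king]N'],
})

# Keep the first text_name of each letter and join its lines with a space.
by_letter = lines.groupby('id_text').agg({'text_name': 'first', 'text': ' '.join})
print(by_letter)
```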


In [6]:
from gensim import corpora, models, utils
import gensim

# Test corpus: first 100 texts
For test purposes the corpus can be restricted to the first 100 documents by uncommenting the first line below. As written, all documents are used.

In [7]:
#saao_data_df = saao_data_df[:100]
documents = saao_data_df['text']

# Tokenize and POS-filter
Tokenization is done by splitting on white space. The variable `posfilter` holds the last two characters of lemmatized words with allowed Part of Speech tags. Because the documents are lowercased before filtering, the tags are given in lower case. If, for instance, you wish to select Verbs, Adjectives, and Nouns (in Akkadian), `posfilter` will be `[']n', 'aj', ']v']`. Note that one-character POS tags need the right bracket, because the filter compares the last two characters of each word.
The POS labels are:
* "n", #Nouns
* "v", #Verbs
* "aj", #Adjectives
* "av", #Adverbs
* "an", #Agricultural Name
* "cn", #Celestial Name
* "dn", #Divine Name
* "en", #Ethnicity Name
* "fn", #Field Name
* "gn", #Geographical Name (lands, etc.)
* "ln", #Lineage Name (ancestral clan)
* "mn", #Month Name
* "on", #Object Name
* "pn", #Personal Name
* "qn", #Quarter (of a city) Name
* "rn", #Royal Name
* "sn", #Settlement Name
* "tn", #Temple Name
* "wn", #Watercourse Name
* "yn", #Year Name
* "nu", #Numeral


In [8]:
posfilter = [']n', ']v', 'en']
texts = [[word for word in document.lower().split() if word[-2:] in posfilter] for document in documents]
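The following toy example shows what the two-character comparison does to one line of text. The second filter is a hypothetical alternative (not used in this notebook) that reads the whole tag after the closing bracket, so that no bracket is needed in the list of allowed tags.

```python
posfilter = [']n', ']v', 'en']
document = 'Muskaya[Phrygian]EN ina[in]PRP muhhu[skull]N Urik[1]PN šapāru[send]V'

# Filter used in the notebook: compare the last two characters of each word.
kept = [word for word in document.lower().split() if word[-2:] in posfilter]
print(kept)

# Alternative sketch: take everything after the last ']' as the POS tag.
allowed_tags = {'n', 'v', 'en'}
kept_by_tag = [word for word in document.lower().split()
               if word.rsplit(']', 1)[-1] in allowed_tags]
print(kept_by_tag)
```

In both cases the preposition (`prp`) and the personal name (`pn`) are dropped, because their tags are not among the allowed ones.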

# Stop words

Stop words are very frequent words that do not help to distinguish between topics. This includes, for instance, prepositions, but those have already been removed by the POS filter. The following words are too frequent to contribute to the analysis. The adjectives in the list only take effect when `aj` is added to `posfilter`.

In [9]:
stoplist = [
'šarru[king]n',
'bēlu[lord]n',
'libbu[interior]n',
'muhhu[skull]n',
'ardu[slave]n',
'šulmu[completeness]n',
'šapāru[send]v',
'alāku[go]v',
'qabû[say]v',
'pānu[front]n',
'māru[son]n',
'bītu[house]n',
'epēšu[do]v',
'wabālu[bring]v',
'šakānu[put]v',
'amāru[see]v',
'bašû[exist]v',
'našû[lift]v',
'izuzzu[stand]v',
'ūmu[day]n',
'ṭābu[good]aj',
'mādu[many]aj',
'nadānu[give]v',
'tadānu[give]v',
'ṣehru[small]aj',
'mimmû[all]n',
'gimru[totality]n',
'gabbu[totality]n',
'šâlu[ask]v',
'šemû[hear]v',
'ūmu[day]n',
'awātu[word]n',
'erēbu[enter]v']
texts = [[word for word in text if word not in stoplist] for text in texts]
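Candidates for the stop list can be found by counting in how many documents each word occurs. The sketch below does this on toy data; the cutoff of 0.5 is an arbitrary value chosen for illustration, not one used in this notebook.

```python
from collections import Counter

# Toy tokenized documents.
texts = [
    ['šarru[king]n', 'ēkallu[palace]n', 'šarru[king]n'],
    ['šarru[king]n', 'bēlu[lord]n'],
    ['šarru[king]n', 'bēlu[lord]n', 'ēkallu[palace]n'],
    ['šapāru[send]v'],
]

# Document frequency: set() makes sure a word counts once per document.
doc_freq = Counter(word for text in texts for word in set(text))

threshold = 0.5  # hypothetical cutoff: share of documents
candidates = sorted(word for word, n in doc_freq.items()
                    if n / len(texts) > threshold)
print(candidates)
```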

# Dictionary
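Create the gensim Dictionary and filter out words that are too common or too rare. With `no_below=1` no rare words are removed; `no_above=0.8` removes words that occur in more than 80% of the documents (`no_above` may be set too low here).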

In [10]:
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.8)

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in texts]

In [None]:
seed = 42
np.random.seed(seed)
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
ntopics = 25
# Running and training the LDA model on the document term matrix.
ldamodel = Lda(corpus, num_topics=ntopics, id2word = dictionary, passes=50)

List the top words and their probabilities in all topics. Note: the topic numbers here start at 0, whereas the visualizations below number the topics from 1. The topics are the same, but LDA topic `k` appears below as topic `k+1`.

In [None]:
ldamodel.show_topics(ntopics, formatted = False)

# pyLDAvis
Use pyLDAvis to visualize the topic model. By default, pyLDAvis orders the topics by [prevalence](https://github.com/bmabey/pyLDAvis/issues/59) (topic 1 is the most prevalent topic), which means that the topic numbers in the visualization do not agree with the topic numbers in the LDA model. To prevent this behaviour, use `sort_topics=False` in the `prepare` command, as is done below. The advantage of ordering the topics by prevalence, however, is that new instances of the LDA model are easier to compare (similar topics tend to receive similar numbers). Note that pyLDAvis is a port of the R package LDAvis, and so the numbering in the visualization begins with 1 (not with 0). The topic numbers in the Document/Topic and Topic/Term matrices below will be adjusted to be compatible with the pyLDAvis visualization.

In [None]:
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'saao.html')
pyLDAvis.display(vis)
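The relation between the two numbering schemes can be sketched with toy data. With `sort_topics=False`, LDA topic `k` is simply shown as topic `k+1`. For the sorted case, the sketch assumes that prevalence is the share of each topic weighted by document length; this is an assumption about how pyLDAvis ranks topics, not something verified in this notebook.

```python
import numpy as np

# Toy Document/Topic matrix: rows are documents, columns are LDA topics 0..2.
doc_topic = np.array([
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
])
doc_lengths = np.array([10, 30, 20])  # number of tokens per document

# Unsorted: only the numbering shifts by one.
unsorted_label = {k: k + 1 for k in range(doc_topic.shape[1])}

# Sorted (assumed): rank topics by their length-weighted share of the corpus.
topic_freq = (doc_topic * doc_lengths[:, None]).sum(axis=0)
topic_share = topic_freq / topic_freq.sum()
order = np.argsort(-topic_share)  # LDA topic ids, most prevalent first
sorted_label = {int(lda_id): rank + 1 for rank, lda_id in enumerate(order)}

print(unsorted_label)
print(sorted_label)
```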

# Create Document/Topic Table
Create a DataFrame that gives the probability of each topic for each document.

In [None]:
list_of_doctopics = [ldamodel.get_document_topics(corpus[i], minimum_probability=0) for i in range(len(corpus))]
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]
df_list = [pd.DataFrame(list_of_probabilities[i]).transpose() for i in range(len(corpus))]
doc_topic_df = pd.concat(df_list)
doc_topic_df = doc_topic_df.set_index(saao_data_df.index)
doc_topic_df.head()
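The same table can also be built in one step from a list of dictionaries. This is a sketch on toy data, with made-up probabilities standing in for the output of `get_document_topics`; a side effect of this approach is that it tolerates a distribution that omits a topic.

```python
import pandas as pd

ntopics = 3
ids = ['saao/saa01/P224485', 'saao/saa01/P313416']

# Toy stand-in for the (topic, probability) lists returned per document.
# The second document has no entry for topic 1.
list_of_doctopics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.2), (2, 0.8)],
]

# dict() turns each list of pairs into {topic: probability};
# reindex guarantees one column per topic, fillna fills the missing ones.
doc_topic = (pd.DataFrame([dict(dist) for dist in list_of_doctopics], index=ids)
             .reindex(columns=range(ntopics))
             .fillna(0.0))
print(doc_topic)
```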

Rename the topics (columns) `topic 1` to `topic n` in accordance with the pyLDAvis visualization. Because `sort_topics=False` was used above, the columns are already in the order of the visualization and only the numbering shifts by one.

In [None]:
topics = ['topic ' + str(i+1) for i in range(ntopics)]

In [None]:
doc_topic_df.columns = topics
doc_topic_df['text_name'] = saao_data_df['text_name']
doc_topic_df.head()

# Create Topic / Term table
This is a table with N rows (the number of topics) and M columns (the number of individual terms in the Dictionary). The table indicates the probability of each term in each topic.

In [None]:
topic_term = ldamodel.show_topics(ntopics, formatted=False, num_words=len(dictionary))

The object `topic_term` is a list with one `(topic_id, word_list)` tuple per topic, where `word_list` is a list of tuples in the form `(word, probability)`. The following code pulls out the word list of each topic (`topic_term[i][1]`) and creates a list of DataFrames with the words as index (rows) and the probabilities as the only column. Each DataFrame is transposed and the DataFrames are concatenated into a single DataFrame.

In [None]:
topic_term_list = [pd.DataFrame(topic_term[i][1]).set_index(0).transpose() for i in range(0, ntopics)]
topic_term_df = pd.concat(topic_term_list, ignore_index=True)
topic_term_df['topics'] = topics
topic_term_df = topic_term_df.set_index('topics')
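A toy example may make the reshaping easier to follow. Each topic lists its words in a different order (sorted by probability), and `pd.concat` aligns them by column name. The probabilities below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for show_topics(..., formatted=False): (topic_id, [(word, prob), ...])
topic_term = [
    (0, [('ēkallu[palace]n', 0.5), ('šarru[king]n', 0.3), ('bēlu[lord]n', 0.2)]),
    (1, [('bēlu[lord]n', 0.6), ('ēkallu[palace]n', 0.3), ('šarru[king]n', 0.1)]),
]

# One single-row DataFrame per topic, with the words as column names.
frames = [pd.DataFrame(words).set_index(0).transpose() for _, words in topic_term]

# concat matches the columns by name, so the differing word order is harmless.
topic_term_df = pd.concat(frames, ignore_index=True)
topic_term_df.index = ['topic ' + str(i + 1) for i in range(len(topic_term))]
print(topic_term_df)
```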

The topics (rows) are numbered 0 to `ntopics - 1` in the (random) order of the LDA model. Since the visualization was prepared with `sort_topics=False`, no re-arranging by prevalence is needed: the rows are simply labeled `topic 1` to `topic n` in accordance with their numbering in the pyLDAvis visualization. The following cell shows the probability of a single term in each topic.

In [None]:
topic_term_df['ēkallu[palace]n']

The visualization below plots all the documents according to their distances (using Multi-Dimensional Scaling) in the Document/Topic DataFrame. Each document (dot) is colored according to the most prevalent topic and the size of the dot represents the probability of the most prevalent topic in that document.

In [None]:
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

Compute the distances between each of the documents. Use either the Document/Topic Dataframe or the Document/Term Dataframe (constructed below) to measure distance.

In [None]:
cv = CountVectorizer(analyzer='word', token_pattern=r'[^ ]+', vocabulary=list(topic_term_df.columns.values))
saao_dtm = cv.fit_transform(saao_data_df['text'])
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
saao_dtm_df = pd.DataFrame(saao_dtm.toarray(), columns = cv.get_feature_names_out(), index = saao_data_df.index.values)
saao_dtm_df.head()

In [None]:
#dist = squareform(pdist(saao_dtm_df))
dist = squareform(pdist(doc_topic_df.drop('text_name', axis=1)))
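The sketch below shows on toy data what `pdist` and `squareform` return. `pdist` uses Euclidean distance by default. Since the rows of the Document/Topic table are probability distributions, the Jensen-Shannon distance (available in SciPy as `metric='jensenshannon'`) is shown as a possible alternative; it is not what this notebook uses.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy Document/Topic matrix; documents 0 and 2 have the same distribution.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
])

# pdist returns the pairwise distances in condensed form;
# squareform expands them into a symmetric documents-by-documents matrix.
euclidean = squareform(pdist(doc_topic))
jensen_shannon = squareform(pdist(doc_topic, metric='jensenshannon'))

print(euclidean.round(3))
print(jensen_shannon.round(3))
```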

Compute the position of each document using Multi-Dimensional Scaling. The variable `pos` holds the `x` and `y`  coordinates.

In [None]:
mds = MDS(n_components=2, max_iter=3000,
      random_state=seed, dissimilarity="precomputed", n_jobs=1)

pos = mds.fit(dist).embedding_

Create a list of x values (coordinates) and a list of y values.

In [None]:
pos_x = [x for x, y in pos]
pos_y = [y for x, y in pos]

Create lists of the most prevalent topic, the probability of the most prevalent topic, and the text name for each document.

In [None]:
prevalent_topic = doc_topic_df.drop('text_name', axis=1).idxmax(axis=1)
# .ix was removed from pandas; the row-wise maximum is the probability of the most prevalent topic
probability = doc_topic_df.drop('text_name', axis=1).max(axis=1).values
text_name = [name for name in doc_topic_df['text_name']]
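On a toy table, the most prevalent topic, its probability, and the dot size used in the plots further down can be derived as follows. The factor 25 is the one used in the plotting code; the probabilities are made up.

```python
import pandas as pd

# Toy Document/Topic table with two documents and three topics.
doc_topic_df = pd.DataFrame(
    {'topic 1': [0.7, 0.1], 'topic 2': [0.2, 0.1], 'topic 3': [0.1, 0.8]},
    index=['doc_a', 'doc_b'])

prevalent_topic = doc_topic_df.idxmax(axis=1)  # column label of the row maximum
probability = doc_topic_df.max(axis=1)         # value of the row maximum

# Scale so that the document with the highest probability gets size 25.
size = probability / probability.max() * 25
print(pd.DataFrame({'topic': prevalent_topic, 'probability': probability, 'size': size}))
```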

Import bokeh and draw the visualization.

In [None]:
from bokeh.models import ColumnDataSource, OpenURL, TapTool, HoverTool
from bokeh.plotting import figure, output_file, output_notebook, show
from collections import OrderedDict
output_notebook()

p = figure(
    plot_width=800, plot_height=800,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution MDS\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))

colormap = [('topic 1', "orange"), ('topic 2', "olive"), ('topic 3', "firebrick"), 
          ('topic 4', "gold"), ('topic 5', "red"), ('topic 6', "black"), ('topic 7', "green"), 
          ('topic 8', "blue"), ('topic 9', "purple"), ('topic 10', "darkslategray"), ('topic 11', "yellow"), 
          ('topic 12', "indigo"), ('topic 13', "blueviolet"), ('topic 14', "saddlebrown"), ('topic 15',"navy"), ('topic 16', 'orange'),
           ('topic 17', 'olive'), ('topic 18', 'firebrick'), ('topic 19', 'gold'), ('topic 20', 'red'), ('topic 21', 'black'),
            ('topic 22', 'green'), ('topic 23', 'blue'), ('topic 24', 'purple'), ('topic 25', 'darkslategray')]
colormap = OrderedDict(colormap)
colors = [colormap[n] for n in prevalent_topic]
color_list = [colormap[topic] for topic in colormap]
color_list += ["black"] * (len(colors) - len(color_list))
source = ColumnDataSource(data=dict(
        x=pos_x,
        y=pos_y,
        id_text=list(doc_topic_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

p.circle('x', 'y', color='color', fill_alpha=0.5, size='size', source=source)

url = "http://oracc.museum.upenn.edu/@id_text"
    

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(p)
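The hard-coded colormap above covers exactly 25 topics. A sketch of how the same mapping could be generated for any value of `ntopics` is given below: it cycles through the 15 base colors used above, so colors repeat from topic 16 onward, just as in the hand-written list.

```python
from itertools import cycle, islice

ntopics = 25
palette = ['orange', 'olive', 'firebrick', 'gold', 'red', 'black', 'green',
           'blue', 'purple', 'darkslategray', 'yellow', 'indigo', 'blueviolet',
           'saddlebrown', 'navy']

# Repeat the palette as often as needed and cut it off after ntopics colors.
colormap = {'topic ' + str(i + 1): color
            for i, color in enumerate(islice(cycle(palette), ntopics))}
print(colormap['topic 1'], colormap['topic 16'], colormap['topic 25'])
```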

The following graph uses a slider to choose one topic. The idea is to color the dots (documents) that have high prevalence for that topic and leave the other dots gray (or: use a high fill_alpha for the chosen topic and a low one for the ones that are backgrounded). Since the dots link to the online editions of these documents, that would make it easier to explore a topic.

In [None]:
from bokeh.models import ColumnDataSource, OpenURL, TapTool, HoverTool, CustomJS
from bokeh.models.widgets import Slider
from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.layouts import column
from collections import OrderedDict
output_notebook()

p = figure(
    plot_width=800, plot_height=800,
    tools="tap,pan,wheel_zoom,box_zoom,reset,save", 
    title="Topic Distribution MDS\nSize of the circle represents prevalence of the topic")
p.add_tools(HoverTool(
        tooltips=[
            ("url", "http://oracc.org/" + "@id_text"),
            (("topic, probability"), ("@topic, @probability")),
            ("text name", "@text_name")
        ]
        ))


colormap = [('topic 1', "orange"), ('topic 2', "olive"), ('topic 3', "firebrick"), 
          ('topic 4', "gold"), ('topic 5', "red"), ('topic 6', "black"), ('topic 7', "green"), 
          ('topic 8', "blue"), ('topic 9', "purple"), ('topic 10', "darkslategray"), ('topic 11', "yellow"), 
          ('topic 12', "indigo"), ('topic 13', "blueviolet"), ('topic 14', "saddlebrown"), ('topic 15',"navy"), ('topic 16', 'orange'),
           ('topic 17', 'olive'), ('topic 18', 'firebrick'), ('topic 19', 'gold'), ('topic 20', 'red'), ('topic 21', 'black'),
            ('topic 22', 'green'), ('topic 23', 'blue'), ('topic 24', 'purple'), ('topic 25', 'darkslategray')]
colormap = OrderedDict(colormap)
colors = [colormap[n] for n in prevalent_topic]
color_list = [colormap[topic] for topic in colormap]
color_list += ["black"] * (len(colors) - len(color_list))

source = ColumnDataSource(data=dict(
        x=pos_x,
        y=pos_y,
        id_text=list(doc_topic_df.index.values),
        size = probability/max(probability)*25,
        probability = probability,
        topic = prevalent_topic,
        color = colors,
        orig_color = color_list,
        text_name = text_name
    ))

p.circle('x', 'y', color='color', fill_alpha=0.5, size='size', source=source)

topic_slider = Slider(start=1, end=ntopics, value=1, step=1, title='Topic')

# The slider is passed to the JavaScript code under the name `topic`.
callback = CustomJS(args=dict(source=source, topic=topic_slider), code="""
    var data = source.data;
    var current_topic = topic.value;
    var col = data['color'];
    var topics = data['topic'];
    var orig_color = data['orig_color'];
    for (var i = 0; i < col.length; i++) {
        // labels have the form 'topic N'; compare N with the slider value
        if (Number(topics[i].substring(6)) == current_topic) {
            col[i] = orig_color[current_topic - 1];
        } else {
            col[i] = 'grey';
        }
    }
    source.change.emit();
""")

# Run the callback whenever the slider value changes.
topic_slider.js_on_change('value', callback)

url = "http://oracc.museum.upenn.edu/@id_text"

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url=url)

show(column(topic_slider, p))