# Build Topic Model
The present notebook will use the data collected in section 5.1 to compute and explore a topic model. In essence, a topic model consists of two sets of probability distributions. First, the probability that a *word* belongs to a certain topic (Topic/Term probability). Second, the probability of a *topic* to appear in a document (Document/Topic probability). These probabilities are never 0, that is, each word has some probability (even if very small) to be part of each topic and each topic has some probability (even if very small) to appear in a document.

A standard way to inspect a topic model is to look at the top-ten words of a topic (the ten words with the highest probability in each particular topic). Similarly, we can pull the ten most important documents for each topic (the documents in which this topic has the highest probability).

For a more thorough analysis we may create full probability tables: a topic/term probability table and a document/topic probability table. These tables give a fuller account of the model and will be used for the visualizations (5.3). 
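
The two ideas above (a topic as a probability distribution over words, and the top words of a topic) can be illustrated with a very small sketch. The words and numbers below are invented, and the helper `top_words` is hypothetical; it is not part of gensim.

```python
# Toy topic/term table: each topic maps every word to a probability.
# Words and numbers are invented for illustration only.
topic_term = {
    0: {"beer": 0.50, "bread": 0.30, "horse": 0.15, "palace": 0.05},
    1: {"beer": 0.05, "bread": 0.05, "horse": 0.60, "palace": 0.30},
}


def top_words(distribution, n=10):
    """Return the n (word, probability) pairs with the highest probability."""
    return sorted(distribution.items(), key=lambda pair: pair[1], reverse=True)[:n]


for topic, distribution in topic_term.items():
    # Each topic is a probability distribution over the whole vocabulary.
    assert abs(sum(distribution.values()) - 1.0) < 1e-9
    print(topic, top_words(distribution, n=2))
```

Note that every word has a non-zero probability in every topic; only the ranking differs.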

In [1]:
import numpy as np
import pandas as pd
from gensim import corpora, models, utils
import gensim
import pickle



# Read in the texts
[For test purposes one may select only the first 100 documents. Remove the hashmark (#) from the first line of the following cell if you wish to do that]

In [2]:
pickled = 'output/data_for_topic_model.p'
df = pd.read_pickle(pickled)
texts = df['lemma']

# POS-filter
The variable `posfilter` holds the last two characters of lemmatized words with allowed Part of Speech tags. If, for instance, you wish to select verbs, adjectives, and nouns (in Akkadian), `posfilter` will be `[']n', 'aj', ']v']`. Note that one-character POS tags need the right bracket!
The POS labels are:
* "n", #Nouns
* "v", #Verbs
* "aj", #Adjectives
* "av", #Adverbs
* "an", #Agricultural Name
* "cn", #Celestial Name
* "dn", #Divine Name
* "en", #Ethnicity Name
* "fn", #Field Name
* "gn", #Geographical Name (lands, etc.)
* "ln", #Lineage Name (ancestral clan)
* "mn", #Month Name
* "on", #Object Name
* "pn", #Personal Name
* "qn", #Quarter (of a city) Name
* "rn", #Royal Name
* "sn", #Settlement Name
* "tn", #Temple Name
* "wn", #Watercourse Name
* "yn", #Year Name
* "nu", #Numeral


In [3]:
posfilter = [']n', ']v', 'aj']
#include nouns, verbs, and adjectives, not numerals, prepositions or proper nouns
texts = [[word for word in text if word[-2:] in posfilter] for text in texts]
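
Why one-character tags need the right bracket can be seen in a small sketch. The example document is made up, and the exact form of the proper-noun lemmas is an assumption; only the final characters matter here.

```python
# Made-up document; lemmas end in a POS tag as described above.
text = ["šarru[king]n", "alāku[go]v", "ṭābu[good]aj", "Aššur[1]dn", "Ninua[1]gn"]

posfilter = ["]n", "]v", "aj"]
# Compare the last two characters of each lemma with the allowed endings.
kept = [word for word in text if word[-2:] in posfilter]

# A naive test on the last character alone also lets proper nouns through,
# because tags such as "dn" and "gn" end in "n" as well.
naive = [word for word in text if word[-1:] in ["n", "v"]]

print(kept)
print(naive)
```

With the bracket included, `]n` matches common nouns only, while `dn` and `gn` are rejected.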

# Stop words

Stop words are very frequent words that cannot distinguish between topics. These include, for instance, prepositions, but those can also be removed by the POS filter. The following nouns and verbs are too frequent to contribute to the analysis. Note that this list of stop words was assembled for the SAAo corpus; another corpus may require a different list, or none at all. In a cell further below the dictionary is built, leaving out words that appear in more than 80 percent of the documents (or whatever the `no_above` parameter is set to), which makes a stop word list mostly unnecessary. The only advantage of an explicit list of stop words is that it makes it possible to filter out documents or text fragments that are left with too few words to be meaningful.

The 'stoplist' cell can be omitted entirely or adapted to your purposes.

In [4]:
stoplist = [
'šarru[king]n',
'bēlu[lord]n',
'libbu[interior]n',
'muhhu[skull]n',
'ardu[slave]n',
'šulmu[completeness]n',
'šapāru[send]v',
'alāku[go]v',
'qabû[say]v',
'pānu[front]n',
'māru[son]n',
'bītu[house]n',
'epēšu[do]v',
'wabālu[bring]v',
'šakānu[put]v',
'amāru[see]v',
'bašû[exist]v',
'našû[lift]v',
'izuzzu[stand]v',
'ūmu[day]n',
'ṭābu[good]aj',
'mādu[many]aj',
'nadānu[give]v',
'tadānu[give]v',
'ṣehru[small]aj',
'mimmû[all]n',
'gimru[totality]n',
'gabbu[totality]n',
'šâlu[ask]v',
'šemû[hear]v',
'awātu[word]n',
'erēbu[enter]v'
]
texts = [[word for word in text if word not in stoplist] for text in texts]


# Filter out texts that have too few words left
Identify texts that have at least 10 lemmas left and use that as a mask to filter the list `texts` as well as the DataFrame `df`.

In [5]:
bo = [len(text)>9 for text in texts]
df = df[bo]
texts = [text for text, keep in zip(texts, bo) if keep]
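
The interplay of stop words and the length filter can be sketched on a toy corpus. The documents, the identifiers in `ids`, and the threshold `min_length` are invented for illustration; a plain list of ids stands in for the DataFrame.

```python
# Invented mini-corpus: parallel lists of document ids and lemmatized texts.
ids = ["doc_a", "doc_b", "doc_c"]
texts = [
    ["šarru[king]n", "sisû[horse]n", "eqlu[field]n"],
    ["šarru[king]n", "bēlu[lord]n"],
    ["šikaru[beer]n", "kusāpu[bread]n", "šarru[king]n", "ilu[god]n"],
]
stoplist = {"šarru[king]n", "bēlu[lord]n"}  # a set gives fast membership tests
min_length = 2  # the notebook uses 10; lowered here for the toy data

# Remove the stop words, then mark the documents that are still long enough.
texts = [[word for word in text if word not in stoplist] for text in texts]
mask = [len(text) >= min_length for text in texts]

# Apply the same mask to both lists so that they stay aligned.
ids = [doc_id for doc_id, keep in zip(ids, mask) if keep]
texts = [text for text, keep in zip(texts, mask) if keep]
print(ids, texts)
```

The second document consists of stop words only, so it is empty after filtering and is dropped from both lists.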

How many documents did we start with, and how many do we have left?

In [6]:
len(bo), len(df)

(4976, 3006)

# Dictionary
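Create the gensim Dictionary and filter out words that are too common or too rare (`no_above` may be set too low here). Note that with `no_below=1` no rare words are removed, since every word in the dictionary occurs in at least one document.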

In [7]:
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.8)
## CHECK - is this done correctly?
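
To see what the two thresholds do, the following sketch re-implements a document-frequency filter in plain Python. It is an illustration under simplifying assumptions, not gensim's own code: the function name `filter_by_document_frequency` is hypothetical, and gensim may round the upper threshold differently and additionally limits the vocabulary size.

```python
from collections import Counter


def filter_by_document_frequency(texts, no_below=1, no_above=0.8):
    """Keep words found in at least no_below documents and in at most
    a fraction no_above of all documents (illustrative re-implementation)."""
    # Count each word once per document.
    doc_freq = Counter(word for text in texts for word in set(text))
    max_docs = no_above * len(texts)
    return {word for word, n in doc_freq.items() if no_below <= n <= max_docs}


# Invented toy corpus of five documents.
texts = [
    ["ilu[god]n", "šikaru[beer]n"],
    ["ilu[god]n", "sisû[horse]n"],
    ["ilu[god]n", "eqlu[field]n", "sisû[horse]n"],
    ["ilu[god]n", "eqlu[field]n"],
    ["ilu[god]n", "sisû[horse]n"],
]
vocabulary = filter_by_document_frequency(texts, no_below=2, no_above=0.8)
print(sorted(vocabulary))
```

Here the word that occurs in all five documents is dropped as too common, and the word that occurs in only one document is dropped as too rare. With `no_below=1` the rare word would be kept.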

In [8]:
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Compute the Model

Set the seed, indicate the number of topics (default set to 10) and run the model.

The visualization (section 5.3) will fail if the number of topics is higher than 25. 

In [9]:
ntopics = int(input("Number of topics: ") or 10)
if ntopics > 25:
    ntopics = 25

Number of topics:  10


In [10]:
seed = 15
np.random.seed(seed)
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix.
ldamodel = Lda(corpus, num_topics=ntopics, id2word = dictionary, passes=50)

List the top 10 words and their probabilities in all topics. Note: the topic numbers here are not the ones used in the visualizations in 5.3! (The topics are the same, but not their numbers).

In [11]:
ldamodel.show_topics(ntopics, formatted = False)

[(0,
  [('qû[unit]n', 0.043913845),
   ('šikaru[beer]n', 0.027053101),
   ('ilu[god]n', 0.02678105),
   ('karānu[vine]n', 0.02474789),
   ('šamnu[oil]n', 0.02157632),
   ('kusāpu[bread]n', 0.019909386),
   ('zamāru[sing]v', 0.017745739),
   ('naqû[pour-(a-libation)]v', 0.015519118),
   ('dišpu[honey]n', 0.015406738),
   ('qātu[hand]n', 0.014111808)]),
 (1,
  [('nišu[people]n', 0.06924614),
   ('imēru[unit]n', 0.06392229),
   ('eqlu[field]n', 0.04847227),
   ('immeru[sheep]n', 0.04338606),
   ('ikkaru[farmer]n', 0.042105567),
   ('alpu[ox]n', 0.034718562),
   ('kirû[garden]n', 0.028828202),
   ('sinništu[woman]n', 0.028570034),
   ('lawû[surround]v', 0.023197655),
   ('qabūtu[bowl]n', 0.021560801)]),
 (2,
  [('rabû[big-one]n', 0.07557273),
   ('sisû[horse]n', 0.068644285),
   ('pīhātu[responsibility]n', 0.03608168),
   ('ša-rēši[eunuch]n', 0.025469813),
   ('ša-qurbūti[close-follower]n', 0.023695476),
   ('ṣābu[people]n', 0.021298977),
   ('ēkallu[palace]n', 0.018446097),
   ('kiṣru[kno

# Document/Topic Probability
The function `get_document_topics()` will list the probability of the topics in a single document. In order to get all the topics set the argument `minimum_probability` to zero. 

In [12]:
ldamodel.get_document_topics(corpus[1], minimum_probability=0)

[(0, 0.04477357),
 (1, 0.0038464875),
 (2, 0.04697821),
 (3, 0.0038463266),
 (4, 0.0038469909),
 (5, 0.003846869),
 (6, 0.09362814),
 (7, 0.09908836),
 (8, 0.0038464055),
 (9, 0.69629866)]
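
As a quick sanity check, the pairs listed above should add up to (approximately) 1, and the dominant topic of the document is the pair with the highest probability. A minimal sketch, with the values copied from the output above:

```python
# The (topic, probability) pairs shown above for corpus[1].
doc_topics = [
    (0, 0.04477357), (1, 0.0038464875), (2, 0.04697821), (3, 0.0038463266),
    (4, 0.0038469909), (5, 0.003846869), (6, 0.09362814), (7, 0.09908836),
    (8, 0.0038464055), (9, 0.69629866),
]

# A probability distribution sums to 1 (up to rounding).
total = sum(probability for _, probability in doc_topics)

# The dominant topic is the pair with the largest probability.
dominant_topic, dominant_probability = max(doc_topics, key=lambda pair: pair[1])
print(round(total, 6), dominant_topic, dominant_probability)
```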

# Create Document/Topic Probability Table
A Document/Topic probability table is a table (DataFrame) in which each row represents a document and each column a topic. Each cell holds the probability of a particular topic in a particular document. Each row sums to 1 (it is a probability distribution).

In order to create a full Document/Topic probability table we iterate over the entire corpus with the `get_document_topics()` function. This creates a list of lists (`list_of_doctopics`) in which each inner list holds the topic probabilities of one document, each represented as a tuple `(topic_number, probability)`. The `list_of_probabilities` preserves only the probabilities. This list of lists is transformed into a DataFrame that takes its index from the original DataFrame with the tokenized data.

In [13]:
list_of_doctopics = [ldamodel.get_document_topics(corpus[i], minimum_probability=0) for i in range(len(corpus))]
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]
d_t_df = pd.DataFrame(list_of_probabilities)
d_t_df = d_t_df.set_index(df.index)
d_t_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
P224378,0.009091,0.009091,0.009092,0.009092,0.009095,0.009091,0.009091,0.009091,0.10082,0.826446
P224382,0.044774,0.003846,0.046978,0.003846,0.003847,0.003847,0.09363,0.099089,0.003846,0.696296
P224383,0.003333,0.003334,0.351182,0.003334,0.003333,0.003335,0.003334,0.003334,0.003333,0.622148
P224386,0.006667,0.006667,0.006671,0.006667,0.006668,0.006667,0.006667,0.006668,0.078078,0.86858
P224388,0.006667,0.006668,0.096926,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.849737
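
Keeping only the probabilities relies on every document listing all topics in order. If a `(topic_number, probability)` pair were ever missing (it is an assumption here that gensim may omit extremely small probabilities; this is worth checking against your version), the columns would shift. A more defensive sketch places each probability by its topic number; the helper `to_dense` and the data are hypothetical.

```python
def to_dense(distribution, ntopics):
    """Place each probability at the position of its topic number."""
    row = [0.0] * ntopics
    for topic, probability in distribution:
        row[topic] = probability
    return row


# Invented example: topic 1 is missing from the second document.
list_of_doctopics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.4), (2, 0.6)],
]
rows = [to_dense(distribution, ntopics=3) for distribution in list_of_doctopics]
print(rows)
```

The resulting list of equally long rows can be turned into a DataFrame in the same way as `list_of_probabilities`.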


We can use the above table to find the ten highest scoring documents per topic with the pandas function `nlargest`. First add the 'designation' as a separate column to the table.

In [14]:
d_t_df_w_desig = pd.merge(df['designation'], d_t_df, left_index=True, right_index=True)
d_t_df_w_desig

Unnamed: 0,designation,0,1,2,3,4,5,6,7,8,9
P224378,Take Over the Kingship!,0.009091,0.009091,0.009092,0.009092,0.009095,0.009091,0.009091,0.009091,0.100820,0.826446
P224382,Family Affairs,0.044774,0.003846,0.046978,0.003846,0.003847,0.003847,0.093630,0.099089,0.003846,0.696296
P224383,Šubrian King Protecting Deserters,0.003333,0.003334,0.351182,0.003334,0.003333,0.003335,0.003334,0.003334,0.003333,0.622148
P224386,Specialists Reviving the Land,0.006667,0.006667,0.006671,0.006667,0.006668,0.006667,0.006667,0.006668,0.078078,0.868580
P224388,Elamite King and the Men of Mukin-zeri,0.006667,0.006668,0.096926,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.849737
P224390,Boats and Water-Skin Rafts are Well Despite an...,0.005883,0.064696,0.340272,0.005883,0.005882,0.005888,0.005883,0.005883,0.005883,0.553849
P224391,Assigning Men and Donkeys,0.005556,0.205469,0.649511,0.005556,0.005556,0.005556,0.005556,0.005556,0.005556,0.106129
P224392,Mule Express not Available,0.004001,0.004000,0.755589,0.004000,0.004000,0.004000,0.004001,0.004000,0.004001,0.212408
P224393,Fragment Referring to Boats and River Transport,0.004167,0.094652,0.004169,0.055212,0.004167,0.004167,0.004167,0.004167,0.004167,0.820965
P224395,Arabs Attack a Column of Booty,0.004001,0.004002,0.102742,0.104347,0.004001,0.004000,0.004000,0.004001,0.004001,0.764905


The following code goes through the (numbered) columns of the table, which hold the probabilities of each of the topics (the columns) in each of the documents (the rows). The ten highest probabilities are selected, together with a brief description of the text (designation).

In [15]:
doctop = []
for i in range(ntopics):
    # Copy the selection so that adding a column does not raise a SettingWithCopyWarning
    t = d_t_df_w_desig.nlargest(10, i)[['designation', i]].copy()
    t['topic'] = i
    t = t.rename(columns = {i :'probability'})
    doctop.append(t)
doctop_df = pd.concat(doctop, axis=0)
doctop_df

Unnamed: 0,designation,probability,topic
P336282,Unplaced Fragment of the Text of No. 69,0.976923,0
P335833,"Aššur Temple Offerings, Day 10",0.974284,0
P425166,Fragment of a Ritual for Singer,0.952631,0
P335668,Amounts of Grain(?) from(?) Individuals,0.943748,0
P336650,Aššur Temple Offerings,0.935714,0
P336212,Similar to No. 143,0.925000,0
P335850,Aššur Temple Offerings,0.916040,0
P335622,Aššur Temple Offerings,0.912966,0
P335836,Aššur Temple Offerings,0.890462,0
P398228,Rituals on Shebat 18-22,0.882802,0
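
The selection logic itself does not depend on pandas. As a sketch, the same "highest n per topic" idea can be written with `heapq.nlargest` from the standard library; the document ids, the probabilities, and the helper `top_documents` are invented for illustration.

```python
import heapq

# Invented document/topic probabilities: document id -> one probability per topic.
doc_topic = {
    "doc_a": [0.90, 0.10],
    "doc_b": [0.20, 0.80],
    "doc_c": [0.60, 0.40],
    "doc_d": [0.05, 0.95],
}
ntopics = 2


def top_documents(doc_topic, topic, n=10):
    """Return the n (document, probability) pairs with the highest probability for a topic."""
    pairs = ((doc_id, row[topic]) for doc_id, row in doc_topic.items())
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


for topic in range(ntopics):
    print(topic, top_documents(doc_topic, topic, n=2))
```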


# Renumber Topics
Rename the topics (columns) to start with 1, in accordance with the pyLDAvis visualization.

In [16]:
topics = [i+1 for i in range(ntopics)]
d_t_df.columns = topics
d_t_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
P224378,0.009091,0.009091,0.009092,0.009092,0.009095,0.009091,0.009091,0.009091,0.100820,0.826446
P224382,0.044774,0.003846,0.046978,0.003846,0.003847,0.003847,0.093630,0.099089,0.003846,0.696296
P224383,0.003333,0.003334,0.351182,0.003334,0.003333,0.003335,0.003334,0.003334,0.003333,0.622148
P224386,0.006667,0.006667,0.006671,0.006667,0.006668,0.006667,0.006667,0.006668,0.078078,0.868580
P224388,0.006667,0.006668,0.096926,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.849737
P224390,0.005883,0.064696,0.340272,0.005883,0.005882,0.005888,0.005883,0.005883,0.005883,0.553849
P224391,0.005556,0.205469,0.649511,0.005556,0.005556,0.005556,0.005556,0.005556,0.005556,0.106129
P224392,0.004001,0.004000,0.755589,0.004000,0.004000,0.004000,0.004001,0.004000,0.004001,0.212408
P224393,0.004167,0.094652,0.004169,0.055212,0.004167,0.004167,0.004167,0.004167,0.004167,0.820965
P224395,0.004001,0.004002,0.102742,0.104347,0.004001,0.004000,0.004000,0.004001,0.004001,0.764905


# Create Topic / Term table
This is a table with N rows (the number of topics) and M columns (the number of individual terms in the Dictionary). The table indicates the probability of each term in each topic.

In [17]:
topic_term = ldamodel.show_topics(ntopics, formatted=False, num_words=len(dictionary))

The object `topic_term` is a list of tuples, one per topic, of the form `(topic_number, terms)`, where `terms` is a list of tuples of the form `(word, probability)`. The following code pulls out the word list of each topic (`topic_term[i][1]`) and creates a list of DataFrames with the words as index (rows) and the probabilities as the only column. The DataFrames are concatenated into a single DataFrame, aligned on the words.

In [18]:
topic_term_list = [pd.DataFrame(topic_term[i][1]).set_index(0) for i in range(ntopics)]
t_t_df_ = pd.concat(topic_term_list, axis=1, ignore_index=True, sort=True)
t_t_df_.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
aban-bāšti[(a-stone)]n,9e-06,1.4e-05,1.4e-05,6e-06,8e-06,9e-06,8e-06,1.3e-05,0.000384,4e-06
aban-lamassi[(a-precious-stone)]n,9e-06,1.4e-05,1.4e-05,6e-06,8e-06,9e-06,8e-06,1.3e-05,0.001108,4e-06
aban-râmi['love'-stone]n,9e-06,1.4e-05,1.4e-05,6e-06,8e-06,9e-06,8e-06,1.3e-05,0.000384,4e-06
abati[(meaning-unknown)]n,9e-06,1.4e-05,1.4e-05,6e-06,8e-06,9e-06,8e-06,1.3e-05,2.2e-05,2.9e-05
abašmû[(a-stone)]n,9e-06,1.4e-05,1.4e-05,6e-06,8e-06,9e-06,8e-06,1.3e-05,0.000384,4e-06
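
The words must be aligned by name rather than by position, because each topic lists its words in its own order of probability. A minimal sketch of that alignment in plain Python, using an invented miniature of the `topic_term` structure:

```python
# Invented miniature of the structure described above:
# one (topic_number, [(word, probability), ...]) tuple per topic,
# with the words of each topic sorted by descending probability.
topic_term = [
    (0, [("šikaru[beer]n", 0.6), ("ilu[god]n", 0.3), ("sisû[horse]n", 0.1)]),
    (1, [("sisû[horse]n", 0.7), ("šikaru[beer]n", 0.2), ("ilu[god]n", 0.1)]),
]

# Align by word rather than by position, because the word order differs per topic.
term_topic = {}
for topic, pairs in topic_term:
    for word, probability in pairs:
        term_topic.setdefault(word, {})[topic] = probability

# One row per word, one value per topic.
for word in sorted(term_topic):
    print(word, [term_topic[word][topic] for topic, _ in topic_term])
```

Setting the words as the index of each DataFrame before concatenating achieves the same alignment in pandas.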


Rename the columns to start with 1, and transpose to obtain a Topic/Term matrix.

In [19]:
t_t_df_.columns = topics
t_t_df = t_t_df_.T
t_t_df

Unnamed: 0,aban-bāšti[(a-stone)]n,aban-lamassi[(a-precious-stone)]n,aban-râmi['love'-stone]n,abati[(meaning-unknown)]n,abašmû[(a-stone)]n,abbušu[(meaning-unknown)]n,abbūtu[fatherhood]n,abiktu[defeat]n,abku[captive]n,ablu[brought]aj,...,ṭēmūtu[of-order]n,ṭīdu[clay]n,ṭīmu[yarn]n,ṭīpu[addition]n,ṭīru[impression]n,ṭūbtu[peace]n,ṭūbu[goodness]n,ṭūbātu[happiness]n,ṭūdu[way]n,ṭūru[opopanax]n
1,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,...,9e-06,0.004963,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06
2,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,0.000326,1.4e-05,1.4e-05,...,1.4e-05,1.4e-05,1.5e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05
3,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,0.000298,1.4e-05,...,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05,1.4e-05
4,6e-06,6e-06,6e-06,6e-06,6e-06,6e-06,6e-06,0.00012,6e-06,6e-06,...,6e-06,6e-06,6e-06,9.5e-05,6e-06,6e-06,0.000343,0.000555,6e-06,6e-06
5,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,0.000552,8e-06,8e-06,8e-06,...,8e-06,8e-06,8e-06,8e-06,8e-06,0.000145,8e-06,8e-06,8e-06,8e-06
6,9e-06,9e-06,9e-06,9e-06,9e-06,7.3e-05,9e-06,9e-06,9e-06,9e-06,...,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06
7,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,...,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06,8e-06
8,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,1.3e-05,0.000107,...,1.3e-05,1.3e-05,1.3e-05,1.3e-05,0.000107,1.3e-05,0.001459,1.3e-05,1.3e-05,1.3e-05
9,0.000384,0.001108,0.000384,2.2e-05,0.000384,2.3e-05,2.2e-05,2.2e-05,2.2e-05,2.2e-05,...,2.2e-05,2.2e-05,2.2e-05,2.2e-05,2.2e-05,2.2e-05,2.2e-05,2.2e-05,0.000746,0.000384
10,4e-06,4e-06,4e-06,2.9e-05,4e-06,4e-06,0.00064,0.000343,4e-06,4e-06,...,8.1e-05,4e-06,0.000149,4e-06,4e-06,4e-06,0.005886,4e-06,4e-06,4e-06


In [20]:
#just checking
t_t_df['ēkallu[palace]n']

1     0.003364
2     0.007302
3     0.018446
4     0.000515
5     0.000029
6     0.004453
7     0.003743
8     0.003263
9     0.007932
10    0.010247
Name: ēkallu[palace]n, dtype: float64

# Export Data

In [21]:
topic_model = {'dictionary': dictionary,
               'corpus': corpus,
               'ldamodel': ldamodel,
               't_t_df': t_t_df,
               'd_t_df': d_t_df,
               'df': df,
               'ntopics': ntopics,
               'texts': texts}

In [22]:
with open('output/topic_model.p', 'wb') as w:
    pickle.dump(topic_model, w)
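
A later notebook can read the file back with `pickle.load`. The following sketch shows the round trip with a small stand-in dictionary and a temporary folder; the contents of the stand-in are invented, and only the pattern of writing and reading is the point.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary of results; the real one holds the model and tables.
topic_model = {"ntopics": 10, "texts": [["ilu[god]n", "šikaru[beer]n"]]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "topic_model.p")
    with open(path, "wb") as w:
        pickle.dump(topic_model, w)
    # Reading the file back, as a later notebook would do.
    with open(path, "rb") as r:
        restored = pickle.load(r)

print(restored["ntopics"], list(restored.keys()))
```

Unpickling the real file requires that the libraries used to build the objects (gensim, pandas) are installed in the environment that loads it.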