# 5.1 Acquire Data for Topic Model: State Archives of Assyria

The data acquisition techniques discussed in section 2.1 are applied here to gather all the data from the State Archives of Assyria Online ([SAAo](http://oracc.org/saao)). The next notebook uses this data to create a topic model.

### 5.1.0 Preparation: Import modules

In [1]:
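import os
import sys
import json      # used below to parse catalogue.json
import zipfile   # used below to open the downloaded ZIP files
import pandas as pd
util_dir = os.path.abspath('../utils')  # directory that holds utils.py
sys.path.append(util_dir)
from utils import *  # provides get_data()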

### 5.1.1 Get data
The module `utils` in the `utils` directory of Compass includes the function `get_data()`, which essentially runs the same code as the Extended ORACC Parser (2.1.3; see there for an explanation of the code). Its only parameter is a string with [ORACC](http://oracc.org) project names, separated by commas. It returns a Pandas DataFrame in which each word is represented by a row.

If you wish to build a topic model with a different set of texts, you may replace the list of subprojects with any other comma-separated list of valid [ORACC](http://oracc.org) (sub)projects. Note, however, that the code below (and in the next notebook) uses field names that are specific to the [SAAo](http://oracc.org/saao) catalogs (in particular the field 'title'). [ORACC](http://oracc.org) data sets essentially all have the same structure, but catalogs vary widely in the fields they include (only the fields 'id_text' and 'designation' are obligatory and found in all of them).
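
The project string does not have to be typed by hand. As a minimal sketch, the 21 SAAo volume names can be generated with a comprehension. The variable name `volumes` is introduced here for illustration only, and the sketch assumes that `get_data()` tolerates whitespace around the names (the hand-written string in the next cell contains spaces and line breaks, which suggests that it does).

```python
# Build the comma-separated project string programmatically.
# `volumes` is a hypothetical helper name, not part of utils.
volumes = range(1, 22)  # SAA volumes 1 through 21
projects = ", ".join(f"saao/saa{n:02d}" for n in volumes)
print(projects)
```

To restrict the topic model to a few volumes, change `volumes` to, for instance, `[1, 5, 15]`.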

In [2]:
projects = """saao/saa01,
                saao/saa02,
                saao/saa03,
                saao/saa04,
                saao/saa05,
                saao/saa06,
                saao/saa07,
                saao/saa08,
                saao/saa09,
                saao/saa10,
                saao/saa11,
                saao/saa12,
                saao/saa13,
                saao/saa14,
                saao/saa15,
                saao/saa16,
                saao/saa17,
                saao/saa18,
                saao/saa19,
                saao/saa20,
                saao/saa21"""
words = get_data(projects)

Downloading JSON
Saving http://oracc.org/saao/saa10/json/saao-saa10.zip as jsonzip/saao-saa10.zip.
Saving http://oracc.org/saao/saa15/json/saao-saa15.zip as jsonzip/saao-saa15.zip.
Saving http://oracc.org/saao/saa17/json/saao-saa17.zip as jsonzip/saao-saa17.zip.
Saving http://oracc.org/saao/saa08/json/saao-saa08.zip as jsonzip/saao-saa08.zip.
Saving http://oracc.org/saao/saa01/json/saao-saa01.zip as jsonzip/saao-saa01.zip.
Saving http://oracc.org/saao/saa16/json/saao-saa16.zip as jsonzip/saao-saa16.zip.
Saving http://oracc.org/saao/saa19/json/saao-saa19.zip as jsonzip/saao-saa19.zip.
Saving http://oracc.org/saao/saa06/json/saao-saa06.zip as jsonzip/saao-saa06.zip.
Saving http://oracc.org/saao/saa05/json/saao-saa05.zip as jsonzip/saao-saa05.zip.
Saving http://oracc.org/saao/saa09/json/saao-saa09.zip as jsonzip/saao-saa09.zip.
Saving http://oracc.org/saao/saa11/json/saao-saa11.zip as jsonzip/saao-saa11.zip.
Saving http://oracc.org/saao/saa04/json/saao-saa04.zip as jsonzip/saao-saa04.zip.
Saving http://oracc.org/saao/saa03/json/saao-saa03.zip as jsonzip/saao-saa03.zip.
Saving http://oracc.org/saao/saa13/json/saao-saa13.zip as jsonzip/saao-saa13.zip.
Saving http://oracc.org/saao/saa14/json/saao-saa14.zip as jsonzip/saao-saa14.zip.
Saving http://oracc.org/saao/saa18/json/saao-saa18.zip as jsonzip/saao-saa18.zip.
Saving http://oracc.org/saao/saa21/json/saao-saa21.zip as jsonzip/saao-saa21.zip.
Saving http://oracc.org/saao/saa02/json/saao-saa02.zip as jsonzip/saao-saa02.zip.
Saving http://oracc.org/saao/saa20/json/saao-saa20.zip as jsonzip/saao-saa20.zip.
Saving http://oracc.org/saao/saa07/json/saao-saa07.zip as jsonzip/saao-saa07.zip.
Saving http://oracc.org/saao/saa12/json/saao-saa12.zip as jsonzip/saao-saa12.zip.

Parsing JSON
saao/saa15/P314095 is not available or not complete
saao/saa06/P335204 is not available or not complete
saao/saa06/P335372 is not available or not complete
saao/saa06/P335226 is not available or not complete
saao/saa06/P335202 is not available or not complete
saao/saa06/P335322 is not available or not complete
saao/saa06/P335176 is not available or not complete
saao/saa11/P336708 is not available or not complete
saao/saa11/P335756 is not available or not complete
saao/saa11/P336803 is not available or not complete
saao/saa11/P336687 is not available or not complete
saao/saa11/P335782 is not available or not complete
saao/saa11/P335685 is not available or not complete
saao/saa11/P335633 is not available or not complete
saao/saa11/P335697 is not available or not complete
saao/saa11/P335588 is not available or not complete
saao/saa11/P335808 is not available or not complete
saao/saa13/P334893 is not available or not complete
saao/saa14/P335107 is not available or not complete
saao/saa14/P335943 is not available or not complete
saao/saa14/P335080 is not available or not complete
saao/saa14/P224949 is not available or not complete
saao/saa14/P335180 is not available or not complete
saao/saa14/P335257 is not available or not complete
saao/saa14/P335154 is not available or not complete
saao/saa14/P334991 is not available or not complete
saao/saa14/P335415 is not available or not complete
saao/saa14/P335196 is not available or not complete
saao/saa14/P336196 is not available or not complete
saao/saa14/P335459 is not available or not complete
saao/saa14/P335079 is not available or not complete
saao/saa14/P335574 is not available or not complete
saao/saa14/P335305 is not available or not complete
saao/saa14/P335587 is not available or not complete
saao/saa14/P334977 is not available or not complete
saao/saa14/P335038 is not available or not complete
saao/saa14/P335197 is not available or not complete
saao/saa14/P
saao/saa07/P335888 is not available or not complete
saao/saa07/P335691 is not available or not complete
saao/saa07/P335783 is not available or not complete
saao/saa07/P335681 is not available or not complete
saao/saa07/P335792 is not available or not complete
saao/saa07/P335865 is not available or not complete
saao/saa07/P335884 is not available or not complete
saao/saa07/P335781 is not available or not complete
saao/saa07/P335875 is not available or not complete
saao/saa07/P335923 is not available or not complete
saao/saa07/P335898 is not available or not complete
saao/saa07/P335866 is not available or not complete

Create a lemma column of the form `cf[gw]pos` (citation form, guide word, part of speech) and collect all lemmas of each document in a list.

In [3]:
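words = words.fillna('')
words = words.loc[words.cf != ''].copy()  # drop words without a citation form; copy to avoid SettingWithCopyWarning
words['lemma'] = words['cf'] + '[' + words['gw'] + ']' + words['pos']  # e.g. šarru[king]N
words['lemma'] = words['lemma'].str.lower()
docs = words['lemma'].groupby(words['id_text']).apply(list)  # one list of lemmas per document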
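
To see what this cell does without downloading anything, the same steps can be run on a toy table. The three rows below are a hypothetical stand-in for the output of `get_data()`, reduced to the four columns the cell uses; the third row has no citation form and is therefore dropped.

```python
import pandas as pd

# Toy stand-in for the `words` DataFrame; the real one comes from get_data().
toy = pd.DataFrame({
    "id_text": ["saao/saa01/P224485", "saao/saa01/P224485", "saao/saa01/P313416"],
    "cf": ["awātu", "šarru", None],   # citation form (None = not lemmatized)
    "gw": ["word", "king", None],     # guide word
    "pos": ["N", "N", None],          # part of speech
})

toy = toy.fillna("")
toy = toy.loc[toy["cf"] != ""].copy()           # keep lemmatized words only
toy["lemma"] = (toy["cf"] + "[" + toy["gw"] + "]" + toy["pos"]).str.lower()
toy_docs = toy.groupby("id_text")["lemma"].apply(list)
print(toy_docs)
```

A document in which no word is lemmatized (here `P313416`) disappears from the result altogether, which is worth keeping in mind when document counts are compared with the catalog.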

In [4]:
docs_df = pd.DataFrame(docs).reset_index()
docs_df.index = [idt[-7:] for idt in docs_df.id_text]
docs_df

| | id_text | lemma |
|---|---|---|
| P224485 | saao/saa01/P224485 | [awātu[word]n, šarru[king]n, ana[to]prp, aššur... |
| P313416 | saao/saa01/P313416 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P313417 | saao/saa01/P313417 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P313425 | saao/saa01/P313425 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P313427 | saao/saa01/P313427 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| ... | ... | ... |
| P452805 | saao/saa21/P452805 | [awātu[word]n, šarru[king]n, ana[to]prp, nabu-... |
| P452858 | saao/saa21/P452858 | [maṣṣartu[observation]n, ša[of]det, šarru[king... |
| P452901 | saao/saa21/P452901 | [qabû[say]v, ša[of]det, nišu[people]n, šanû[do... |
| X210106 | saao/saa21/X210106 | [awātu[word]n, šarru[king]n, ana[to]prp, hunda... |
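
The index above is built by taking the last seven characters of `id_text`, which works because every text ID in this data set is one letter followed by six digits. As a hedged alternative, not used in the notebook, the ID can also be taken as everything after the last slash, which does not depend on the length of the ID. The two variable names below are illustrative.

```python
import pandas as pd

ids = pd.Series(["saao/saa01/P224485", "saao/saa21/X210106"])

# Approach used in the notebook: fixed-width slice.
by_slice = [idt[-7:] for idt in ids]

# Alternative sketch: split once from the right and keep the last part.
by_split = ids.str.rsplit("/", n=1).str[-1].tolist()

print(by_slice, by_split)
```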


### Get metadata from catalog file

In [10]:
df = pd.DataFrame() # create an empty dataframe
p = projects.split(',')
p = [pro.lower().strip() for pro in p]
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (FileNotFoundError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    try:
        # read and decode the catalogue.json file of one project; the result is a string
        st = z.read(project + '/catalogue.json').decode('utf-8')
    except KeyError:  # raised by ZipFile.read() when the archive has no such member
        print(project + '/catalogue.json' + ' is not available or not complete')
        continue
    cat = json.loads(st)
    cat = cat['members']  # select the 'members' node
    for item in cat.values():
        item["project"] = project # add project name as separate field
    cat_df = pd.DataFrame(cat).T  # transpose so that each text is a row
    df = pd.concat([df, cat_df], sort=True)  # sort=True is necessary in case catalogs have different sets of fields
df

| | accession_no | ancient_author | archive | astron_date | ch_name | ch_no | ch_num_name | credits | date | designation | ... | ruler | script | script_remarks | script_type | short_title | subgenre | title | trans | vol_title | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P224485 | | Sargon II | 099 - Miscellaneous | | Royal Letters | Ch. 1 | Ch. 1 (Royal Letters) | Adapted from Simo Parpola, The Correspondence ... | 721-705 | SAA 01 001 | ... | Sargon II | Neo-Assyrian | inscribed | Cuneiform | Sargon Letters 1 | | Midas Of Phrygia Seeks Detente | [en] | The Correspondence of Sargon II, Part I: Lette... | SAA 1 |
| P313416 | | Sin-ašared | 099 - Miscellaneous | | Miscellaneous Letters | Ch. 7 | Ch. 7 (Miscellaneous Letters) | Adapted from Simo Parpola, The Correspondence ... | 721-705 | SAA 01 158 | ... | Sargon II | Neo-Assyrian | inscribed | Cuneiform | Sargon Letters 1 | | Gold and Silver Objects Sent to the King | [en] | The Correspondence of Sargon II, Part I: Lette... | SAA 1 |
| P313417 | | Mannu-ki-Aššur-le’i | 099 - Miscellaneous | | Letters from Guzana and Naṣibina | Ch. 13 | Ch. 13 (Letters from Guzana and Naṣibina) | Adapted from Simo Parpola, The Correspondence ... | 721-705 | SAA 01 233 | ... | Sargon II | Neo-Assyrian | inscribed | Cuneiform | Sargon Letters 1 | | More Land to Bel-duri | [en] | The Correspondence of Sargon II, Part I: Lette... | SAA 1 |
| P313425 | | Bel-liqbi | 099 - Miscellaneous | | Letters from Western Provinces | Ch. 8 | Ch. 8 (Letters from Western Provinces) | Adapted from Simo Parpola, The Correspondence ... | 721-705 | SAA 01 179 | ... | Sargon II | Neo-Assyrian | inscribed | Cuneiform | Sargon Letters 1 | | No Iron to the Arabs! | [en] | The Correspondence of Sargon II, Part I: Lette... | SAA 1 |
| P313427 | | Nabu-zer-ketti-lešir | 099 - Miscellaneous | | Miscellaneous Letters | Ch. 7 | Ch. 7 (Miscellaneous Letters) | Adapted from Simo Parpola, The Correspondence ... | 721-705 | SAA 01 152 | ... | Sargon II | Neo-Assyrian | inscribed | Cuneiform | Sargon Letters 1 | | The Affair of Gidgidanu and His Brothers | [en] | The Correspondence of Sargon II, Part I: Lette... | SAA 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| P452805 | 1883-01-18, 0096 | Assurbanipal | 099 - Miscellaneous | | Letters to Uruk, Ur, and Kissik | Ch. 2 | Ch. 2 (Letters to Uruk, Ur, and Kissik) | Adapted from Simo Parpola, The Correspondence ... | 652-648 | SAA 21 026 | ... | Assurbanipal | Neo-Assyrian | inscribed | Cuneiform | Assurbanipal Letters 1 | | You Did Well With the Bit-Amukanians | [en] | The Correspondence of Assurbanipal, Part I: Le... | SAA 21 |
| P452858 | 1889-04-26 Bu, 0057 | Assurbanipal | 099 - Miscellaneous | | Letters to Elam | Ch. 5 | Ch. 5 (Letters to Elam) | Adapted from Simo Parpola, The Correspondence ... | 649-648 | SAA 21 068 | ... | Assurbanipal | Neo-Assyrian | inscribed | Cuneiform | Assurbanipal Letters 1 | | Why Did [You] Side with Indabibi? | [en] | The Correspondence of Assurbanipal, Part I: Le... | SAA 21 |
| P452901 | 1891-05-09 Bu, 0165 | Assurbanipal | 099 - Miscellaneous | | Letters to Gambulu and Raši | Ch. 4 | Ch. 4 (Letters to Gambulu and Raši) | Adapted from Simo Parpola, The Correspondence ... | 646*-XIII-27 | SAA 21 057 | ... | Assurbanipal | Neo-Assyrian | inscribed | Cuneiform | Assurbanipal Letters 1 | | Invading Elam (646-XII-27) | [en] | The Correspondence of Assurbanipal, Part I: Le... | SAA 21 |
| X210106 | | Assurbanipal | 099 - Miscellaneous | | Letters to Vassal Rulers, and Miscellany | Ch. 6 | Ch. 6 (Letters to Vassal Rulers, and Miscellany) | Adapted from Simo Parpola, The Correspondence ... | 647*-VI-13 | SAA 21 075 | ... | Assurbanipal | Neo-Assyrian | inscribed | Cuneiform | Assurbanipal Letters 1 | | Granting the Kingship of Dilmun (647-VI-13) | [en] | The Correspondence of Assurbanipal, Part I: Le... | SAA 21 |
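
The step from `catalogue.json` to a table with one row per text can be tried out on a toy archive held in memory. The catalog below is a hypothetical, heavily reduced stand-in (real catalogs have many more fields); only the path `saao/saa01/catalogue.json` and the `members` node follow the layout used in the cell above.

```python
import io
import json
import zipfile

import pandas as pd

# Hypothetical, heavily reduced catalog with two texts.
catalogue = {
    "members": {
        "P224485": {"designation": "SAA 01 001",
                    "title": "Midas Of Phrygia Seeks Detente"},
        "P313416": {"designation": "SAA 01 158"},  # no 'title' field
    }
}

# Write the toy catalog into an in-memory ZIP, mimicking jsonzip/saao-saa01.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("saao/saa01/catalogue.json", json.dumps(catalogue))

# Read it back the same way the notebook cell does.
with zipfile.ZipFile(buffer) as z:
    st = z.read("saao/saa01/catalogue.json").decode("utf-8")
members = json.loads(st)["members"]

# Keys of `members` become columns; .T turns them into rows.
cat_df = pd.DataFrame(members).T
print(cat_df)
```

Fields that are missing for a text (here the 'title' of `P313416`) show up as `NaN`, which is also what happens when catalogs with different field sets are concatenated.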


### For SAAo only: The 'title' field
[ORACC](http://oracc.org) catalogs may include a variety of fields; only the fields 'id_text' and 'designation' are obligatory. The content of the field 'designation' is usually an abbreviation for a text publication plus a text number (as in 'MVN 12 14'), indicating where the original cuneiform text may be found (in some cases 'designation' may also be a museum number). For analyzing a topic model, 'designation' is not a very helpful field.

The [SAAo](http://oracc.org/saao) catalogs include a field 'title' that provides a brief, somewhat impressionistic summary of the text in question, such as 'Transporting logs and hauling a threshold stone'. We copy this field to the field 'designation' so that it is available for the analysis of the topic model in the Bokeh visualization (section 5.3). This cell may be skipped if you include data from any other [ORACC](http://oracc.org) project.

In [11]:
if 'title' in df.columns:
    df['designation'] = df['title']
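
If SAAo data is mixed with data from projects that lack a 'title', copying the whole column would overwrite valid designations with empty values. The following sketch, which is not part of the notebook, replaces 'designation' only where a title is present. The toy catalog and the ID `P000001` are hypothetical.

```python
import pandas as pd

# Toy catalog: one text with a title, one without (hypothetical ID P000001).
toy_cat = pd.DataFrame(
    {"designation": ["SAA 01 001", "MVN 12 14"],
     "title": ["Midas Of Phrygia Seeks Detente", None]},
    index=["P224485", "P000001"],
)

# Use the title where there is one; otherwise keep the original designation.
has_title = toy_cat["title"].notna() & (toy_cat["title"] != "")
toy_cat["designation"] = toy_cat["title"].where(has_title, toy_cat["designation"])
print(toy_cat["designation"])
```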

### Merge Catalog and Text Data

In [12]:
df = df[['designation']]  # keep only the 'designation' column (as a DataFrame)
df = pd.merge(df, docs_df, right_index=True, left_index=True, how='inner')  # keep texts present in both tables
df.head()

| | designation | id_text | lemma |
|---|---|---|---|
| P224378 | Take Over the Kingship! | saao/saa19/P224378 | [ardu[slave]n, ana[to]prp, dinānu[substitution... |
| P224379 | Banquet | saao/saa19/P224379 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P224380 | Business with Kummeans | saao/saa19/P224380 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P224381 | I Shall Keep my Watch | saao/saa19/P224381 | [ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s... |
| P224382 | Family Affairs | saao/saa19/P224382 | [ṭuppu[tablet]n, data[1]pn, ana[to]prp, šumu-i... |
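
An inner merge silently drops every catalog entry that has no lemmatized text (for instance the texts reported above as "not available or not complete"). To see which entries are affected, one option is a left merge with `indicator=True`. This is a sketch on toy data; `toy_cat` and `toy_docs` are illustrative names, and the notebook itself uses the inner merge only.

```python
import pandas as pd

# Toy catalog with two texts, toy text data with only one of them.
toy_cat = pd.DataFrame({"designation": ["Banquet", "Family Affairs"]},
                       index=["P224379", "P224382"])
toy_docs = pd.DataFrame({"lemma": [["ana[to]prp", "šarru[king]n"]]},
                        index=["P224379"])

# The left merge keeps all catalog rows; the '_merge' column records
# whether a matching row was found in toy_docs.
merged = pd.merge(toy_cat, toy_docs, left_index=True, right_index=True,
                  how="left", indicator=True)
missing = merged.index[merged["_merge"] == "left_only"].tolist()
print(missing)

# The inner merge, as used in the notebook, keeps only the matched text.
inner = pd.merge(toy_cat, toy_docs, left_index=True, right_index=True,
                 how="inner")
print(len(inner))
```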


### Pickle

In [13]:
pickled = 'output/data_for_topic_model.p'
os.makedirs('output', exist_ok=True)  # make sure the output directory exists
df.to_pickle(pickled)
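
Pickling is used here because the 'lemma' column holds Python lists, which a pickle preserves as lists when the file is read back. The round trip can be checked with a small sketch that writes a toy DataFrame to a temporary directory; the real file is presumably read in the same way, with `pd.read_pickle()`, in the next notebook.

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame with the same columns as the merged data.
toy = pd.DataFrame(
    {"designation": ["Banquet"],
     "lemma": [["ana[to]prp", "šarru[king]n"]]},
    index=["P224379"],
)

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_for_topic_model.p")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The lemma cell is still a list, not a string.
print(type(restored.loc["P224379", "lemma"]))
```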