# 5.1 Acquire Data for Topic Model: State Archives of Assyria

The data acquisition techniques discussed in section 2.1 are applied here to gather all the data from the State Archives of Assyria Online ([SAAo](http://oracc.org/saao)). In the next notebook this data will be used to create a topic model.

### 5.1.0 Preparation: Import modules

In [1]:
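import os
import sys
import json      # used below to parse catalogue.json
import zipfile   # used below to open the downloaded ZIP files
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *
import pandas as pd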

### 5.1.1 Get data
The module `utils` in the `utils` directory of Compass includes the function `get_data()`, which essentially runs the same code as the Extended ORACC Parser (2.1.3; see there for an explanation of the code). Its only parameter is a string of [ORACC](http://oracc.org) project names, separated by commas. It returns a Pandas DataFrame in which each word is represented by a row.

If you wish to build a topic model with a different set of texts, you may replace the list of subprojects (separated by commas) with any other list of valid [ORACC](http://oracc.org) (sub)projects. Note, however, that the code below (and in the next notebook) uses field names that are specific to the [SAAo](http://oracc.org/saao) catalogs (in particular the field 'title'). [ORACC](http://oracc.org) data sets essentially all have the same structure, but catalogs vary widely in the fields they include (only the fields 'id_text' and 'designation' are obligatory and are found in all of them).
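Because `get_data()` takes a single comma-separated string, the project list can also be assembled programmatically rather than typed out. The snippet below is a minimal sketch that assumes the subprojects follow the pattern `saao/saaNN`, with two-digit volume numbers from 01 to 21, as in the list used in this notebook. The variable name `volumes` is illustrative only.

```python
# Build the comma-separated project string for saao/saa01 ... saao/saa21.
# Assumption: volumes are numbered 01-21 and always written with two digits.
volumes = [f"saao/saa{n:02d}" for n in range(1, 22)]
projects = ", ".join(volumes)
print(projects)
```

For a different corpus, replace the list comprehension with an explicit list of (sub)project names.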

In [2]:
projects = """saao/saa01,
                saao/saa02,
                saao/saa03,
                saao/saa04,
                saao/saa05,
                saao/saa06,
                saao/saa07,
                saao/saa08,
                saao/saa09,
                saao/saa10,
                saao/saa11,
                saao/saa12,
                saao/saa13,
                saao/saa14,
                saao/saa15,
                saao/saa16,
                saao/saa17,
                saao/saa18,
                saao/saa19,
                saao/saa20,
                saao/saa21"""
words = get_data(projects)

Downloading JSON
Saving https://oracc.museum.upenn.edu/json/saao-saa08.zip as jsonzip/saao-saa08.zip.




saao/saa08:   0%|          | 0.00/7.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa09.zip as jsonzip/saao-saa09.zip.




saao/saa09:   0%|          | 0.00/773k [00:00<?, ?B/s]



Saving https://oracc.museum.upenn.edu/json/saao-saa21.zip as jsonzip/saao-saa21.zip.


saao/saa21:   0%|          | 0.00/3.83M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa11.zip as jsonzip/saao-saa11.zip.




saao/saa11:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa18.zip as jsonzip/saao-saa18.zip.




saao/saa18:   0%|          | 0.00/4.71M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa04.zip as jsonzip/saao-saa04.zip.




saao/saa04:   0%|          | 0.00/8.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa13.zip as jsonzip/saao-saa13.zip.




saao/saa13:   0%|          | 0.00/3.91M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa05.zip as jsonzip/saao-saa05.zip.




saao/saa05:   0%|          | 0.00/4.88M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa20.zip as jsonzip/saao-saa20.zip.




saao/saa20:   0%|          | 0.00/4.57M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa14.zip as jsonzip/saao-saa14.zip.




saao/saa14:   0%|          | 0.00/6.45M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa06.zip as jsonzip/saao-saa06.zip.




saao/saa06:   0%|          | 0.00/7.08M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa15.zip as jsonzip/saao-saa15.zip.




saao/saa15:   0%|          | 0.00/5.78M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa03.zip as jsonzip/saao-saa03.zip.




saao/saa03:   0%|          | 0.00/4.29M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa17.zip as jsonzip/saao-saa17.zip.




saao/saa17:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa16.zip as jsonzip/saao-saa16.zip.




saao/saa16:   0%|          | 0.00/4.06M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa10.zip as jsonzip/saao-saa10.zip.




saao/saa10:   0%|          | 0.00/8.70M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa01.zip as jsonzip/saao-saa01.zip.




saao/saa01:   0%|          | 0.00/5.01M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa19.zip as jsonzip/saao-saa19.zip.




saao/saa19:   0%|          | 0.00/5.49M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa12.zip as jsonzip/saao-saa12.zip.




saao/saa12:   0%|          | 0.00/3.59M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa07.zip as jsonzip/saao-saa07.zip.




saao/saa07:   0%|          | 0.00/3.81M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa02.zip as jsonzip/saao-saa02.zip.




saao/saa02:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

Parsing JSON


saao/saa08:   0%|          | 0/568 [00:00<?, ?it/s]

saao/saa09:   0%|          | 0/11 [00:00<?, ?it/s]

saao/saa21:   0%|          | 0/161 [00:00<?, ?it/s]

saao/saa11:   0%|          | 0/234 [00:00<?, ?it/s]

saao/saa18:   0%|          | 0/205 [00:00<?, ?it/s]

saao/saa04:   0%|          | 0/353 [00:00<?, ?it/s]

saao/saa04/corpusjson/P237481.json
<class 'json.decoder.JSONDecodeError'>
Expecting value: line 1 column 1 (char 0)


saao/saa13:   0%|          | 0/210 [00:00<?, ?it/s]

saao/saa05:   0%|          | 0/300 [00:00<?, ?it/s]

saao/saa20:   0%|          | 0/55 [00:00<?, ?it/s]

saao/saa14:   0%|          | 0/479 [00:00<?, ?it/s]

saao/saa06:   0%|          | 0/350 [00:00<?, ?it/s]

saao/saa15:   0%|          | 0/390 [00:00<?, ?it/s]

saao/saa03:   0%|          | 0/52 [00:00<?, ?it/s]

saao/saa17:   0%|          | 0/207 [00:00<?, ?it/s]

saao/saa16:   0%|          | 0/246 [00:00<?, ?it/s]

saao/saa10:   0%|          | 0/389 [00:00<?, ?it/s]

saao/saa10/corpusjson/P314338.json
<class 'json.decoder.JSONDecodeError'>
Expecting value: line 1 column 1 (char 0)
saao/saa10/corpusjson/P313449.json
<class 'json.decoder.JSONDecodeError'>
Expecting value: line 1 column 1 (char 0)


saao/saa01:   0%|          | 0/265 [00:00<?, ?it/s]

saao/saa19:   0%|          | 0/229 [00:00<?, ?it/s]

saao/saa12:   0%|          | 0/98 [00:00<?, ?it/s]

saao/saa07:   0%|          | 0/219 [00:00<?, ?it/s]

saao/saa07/corpusjson/P335792.json
<class 'json.decoder.JSONDecodeError'>
Expecting value: line 1 column 1 (char 0)


saao/saa02:   0%|          | 0/15 [00:00<?, ?it/s]
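The `JSONDecodeError` messages in the output above name files that could not be parsed as JSON. This error is typically raised when the content is empty or does not begin with a valid JSON value. The following is a minimal sketch of how a parser can report such a file and move on. It is not the code of `get_data()`; the function name `parse_or_skip` and the file names are hypothetical.

```python
import json

def parse_or_skip(name, raw):
    """Return the parsed JSON object, or None if `raw` is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Report the file and the reason, then let the caller continue.
        print(name)
        print(type(err))
        print(err)
        return None

ok = parse_or_skip("example/ok.json", '{"type": "cdl"}')   # parsed normally
bad = parse_or_skip("example/empty.json", "")              # reported and skipped
```

Texts that fail in this way are simply absent from the resulting DataFrame.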

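Create a lemma column in the format `cf[gw]pos` (citation form, guide word, part of speech), in lower case, and collect all lemmas of a single document in a list. Words without a citation form (that is, unlemmatized words) are dropped.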

In [3]:
words = words.fillna('')
words = words.loc[words.cf != ''].copy()  # keep lemmatized words only; .copy() avoids a SettingWithCopyWarning
words['lemma'] = words['cf'] + '[' + words['gw'] + ']' + words['pos']
words['lemma'] = words['lemma'].str.lower()
docs = words['lemma'].groupby(words['id_text']).apply(list)
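To make the lemma step easier to follow, here is a small sketch that applies the same operations to a toy DataFrame. The column names (`id_text`, `cf`, `gw`, `pos`) are those used above; the text IDs and the four word rows are made up for illustration.

```python
import pandas as pd

# Toy word table: one row per word; the third word is unlemmatized (no cf/gw/pos).
toy = pd.DataFrame({
    "id_text": ["saao/saa01/P000001", "saao/saa01/P000001",
                "saao/saa01/P000002", "saao/saa01/P000002"],
    "cf":  ["ana", "šarru", None, "bēlu"],
    "gw":  ["to", "king", None, "lord"],
    "pos": ["PRP", "N", None, "N"],
})

toy = toy.fillna("")                       # missing values become empty strings
toy = toy.loc[toy["cf"] != ""].copy()      # drop words without a citation form
toy["lemma"] = (toy["cf"] + "[" + toy["gw"] + "]" + toy["pos"]).str.lower()

# One list of lemmas per document, in the original word order.
toy_docs = toy.groupby("id_text")["lemma"].apply(list)
print(toy_docs)
```

Each document is now a plain Python list of lemma strings, a form that many topic-modelling tools can take as tokenized input.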

In [4]:
docs_df = pd.DataFrame(docs).reset_index()
docs_df.index = [idt[-7:] for idt in docs_df.id_text]
docs_df

Unnamed: 0,id_text,lemma
P224395,saao/saa01/P224395,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224403,saao/saa01/P224403,"[awātu[word]n, šarru[king]n, ana[to]prp, šaknu..."
P224417,saao/saa01/P224417,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224431,saao/saa01/P224431,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224433,saao/saa01/P224433,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
...,...,...
P452858,saao/saa21/P452858,"[maṣṣartu[observation]n, ša[of]det, šarru[king..."
P452901,saao/saa21/P452901,"[qabû[say]v, ša[of]det, nišu[people]n, šanû[do..."
Q009252,saao/saa21/Q009252,"[awātu[word]n, šarru[king]n, ana[to]prp, šaddu..."
X210106,saao/saa21/X210106,"[awātu[word]n, šarru[king]n, ana[to]prp, hunda..."


# Get metadata from catalog file.

In [5]:
df = pd.DataFrame() # create an empty dataframe
p = projects.split(',')
p = [pro.lower().strip() for pro in p]
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (FileNotFoundError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    try:
        # read and decode the catalogue.json file of one project; the result is a string object
        st = z.read(project + '/catalogue.json').decode('utf-8')
    except (KeyError, zipfile.BadZipFile):  # KeyError: file is not in the archive
        print(project + '/catalogue.json' + ' is not available or not complete')
        continue
    cat = json.loads(st)
    cat = cat['members']  # select the 'members' node
    for item in cat.values():
        item["project"] = project # add project name as separate field
    cat_df = pd.DataFrame(cat).T  # transpose so that each text is a row
    df = pd.concat([df, cat_df], sort=True)  # sort=True is necessary in case catalogs have different sets of fields
df

Unnamed: 0,abl_no,accession_no,adb_no,add_no,ags_no,ancient_author,ancient_recipient,archive,astron_date,atae_lists,...,short_title,stt_no,subgenre,tim_11_no,title,title_in_date,trans,vol_title,volume,year
P224395,,,,,,Adda-hati,the king,"006 - Northwest Palace, Room ZT 4",,atae/saao:P224395,...,Sargon Letters 1,,,,Arabs Attack a Column of Booty,,[en],"The Correspondence of Sargon II, Part I: Lette...",SAA 1,
P224403,,,,,,the king,go[vernor] (of Calah),"006 - Northwest Palace, Room ZT 4",,atae/saao:P224403,...,Sargon Letters 1,,,,Straw and Reeds for Dur-Šarruken,,[en],"The Correspondence of Sargon II, Part I: Lette...",SAA 1,
P224417,,,,,,Adda-hati,the king,"006 - Northwest Palace, Room ZT 4",,atae/saao:P224417,...,Sargon Letters 1,,,,Turning in Taxes and Organizing the Province,,[en],"The Correspondence of Sargon II, Part I: Lette...",SAA 1,
P224431,,,,,,[Bel-duri],[the king],"006 - Northwest Palace, Room ZT 4",,atae/saao:P224431,...,Sargon Letters 1,,,,Raising Food and Fodder from Desert Towns,,[en],"The Correspondence of Sargon II, Part I: Lette...",SAA 1,
P224433,,,,,,S[ennacherib],the king,"006 - Northwest Palace, Room ZT 4",,atae/saao:P224433,...,Sargon Letters 1,,,,Urarṭu After the Cimmerian Rout,,[en],"The Correspondence of Sargon II, Part I: Lette...",SAA 1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
P452858,ABL 1167,"1889-04-26 Bu, 0057",,,,[the king],(unknown),099 - Miscellaneous,,,...,Ashurbanipal Letters 1,,,,Why Did [You] Side with Indabibi?,,[en],"The Correspondence of Ashurbanipal, Part I: Le...",SAA 21,
P452901,ABL 1262,"1891-05-09 Bu, 0165",,,,[the king],[... of Raši?],099 - Miscellaneous,,,...,Ashurbanipal Letters 1,,,,Invading Elam (646-XII-27),,[en],"The Correspondence of Ashurbanipal, Part I: Le...",SAA 21,
Q009252,,,,,,the king,Šadanu,099 - Miscellaneous,,,...,Ashurbanipal Letters 1,,,,Confiscating Scholarly Tablets in Borsippa,,[en],"The Correspondence of Ashurbanipal, Part I: Le...",SAA 21,
X210106,,,,,,the king,"Hunda[ru, king of Dilmun]",099 - Miscellaneous,,,...,Ashurbanipal Letters 1,,,,Granting the Kingship of Dilmun (647-VI-13),,[en],"The Correspondence of Ashurbanipal, Part I: Le...",SAA 21,
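The reshaping done in the cell above can be illustrated without downloading anything. The sketch below uses two hypothetical, heavily simplified `members` nodes (the IDs, designations, and field values are invented) to show why the DataFrame is transposed and what happens when catalogs have different fields.

```python
import pandas as pd

# Hypothetical 'members' nodes of two catalogues: text ID -> dictionary of fields.
members_a = {
    "P000001": {"designation": "SAA 01 001", "title": "A Letter"},
    "P000002": {"designation": "SAA 01 002", "title": "Another Letter"},
}
members_b = {
    "P000003": {"designation": "SAA 02 001", "period": "Neo-Assyrian"},
}

frames = []
for project, members in [("saao/saa01", members_a), ("saao/saa02", members_b)]:
    for item in members.values():
        item["project"] = project           # remember where each text came from
    # pd.DataFrame(members) has one column per text ID; .T turns texts into rows.
    frames.append(pd.DataFrame(members).T)

# The result has the union of all fields; fields missing from a catalog become NaN.
catalog = pd.concat(frames, sort=True)
print(catalog)
```

Collecting the frames in a list and concatenating once is an alternative to concatenating inside the loop; the result is the same.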


# For SAAo only: The 'title' field
[ORACC](http://oracc.org) projects may have a variety of fields; only the fields 'id_text' and 'designation' are obligatory. The content of the field 'designation' is usually an abbreviation for a text publication plus a text number (as in 'MVN 12 14'), indicating where the original cuneiform text may be found (in some cases 'designation' may also be a museum number). For analyzing a topic model, 'designation' is not a very helpful field.

The [SAAo](http://oracc.org/saao) catalogues include a field 'title' that provides a brief, somewhat impressionistic summary of the text in question, such as 'Transporting logs and hauling a threshold stone'. We will copy this field to the field 'designation' so that it is available for the analysis of the topic model in the Bokeh visualization (section 5.3). This cell may be skipped if you include data from any other [ORACC](http://oracc.org) project.

In [6]:
if 'title' in df.columns:
    df['designation'] = df['title']

# Merge Catalog and Text Data

In [7]:
df = df['designation']
df = pd.merge(df, docs_df, right_index=True, left_index=True, how='inner')
df.head()

Unnamed: 0,designation,id_text,lemma
P224395,Arabs Attack a Column of Booty,saao/saa01/P224395,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224403,Straw and Reeds for Dur-Šarruken,saao/saa01/P224403,"[awātu[word]n, šarru[king]n, ana[to]prp, šaknu..."
P224417,Turning in Taxes and Organizing the Province,saao/saa01/P224417,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224431,Raising Food and Fodder from Desert Towns,saao/saa01/P224431,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
P224433,Urarṭu After the Cimmerian Rout,saao/saa01/P224433,"[ana[to]prp, šarru[king]n, bēlu[lord]n, ardu[s..."
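An inner merge on the index keeps only those texts that appear in both the catalog and the lemmatized data, so catalog entries without any lemmatized words are dropped. The sketch below shows this on toy data; all IDs and values are invented for illustration.

```python
import pandas as pd

# Toy catalog with three texts, indexed by text ID.
catalog = pd.DataFrame(
    {"designation": ["Text A", "Text B", "Text C"]},
    index=["P000001", "P000002", "P000003"],
)
# Toy lemma data: P000002 has no lemmatized words and is therefore absent.
docs = pd.DataFrame(
    {"lemma": [["ana[to]prp"], ["šarru[king]n"]]},
    index=["P000001", "P000003"],
)

merged = pd.merge(catalog, docs, left_index=True, right_index=True, how="inner")

# IDs that are in the catalog but did not survive the merge.
dropped = catalog.index.difference(merged.index)
print(merged)
print(list(dropped))
```

Checking `dropped` on the real data is a quick way to see how many catalog entries lack lemmatized text.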


# Pickle

In [8]:
os.makedirs('output', exist_ok=True)  # make sure the output directory exists
pickled = 'output/data_for_topic_model.p'
df.to_pickle(pickled)
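A pickled DataFrame can be read back with `pd.read_pickle()`, which restores the Python lists in the `lemma` column as lists (a CSV file would store them as strings). The round trip is sketched below on toy data written to a temporary directory; the IDs and values are invented for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame(
    {
        "designation": ["Text A", "Text B"],
        "lemma": [["ana[to]prp", "šarru[king]n"], ["bēlu[lord]n"]],
    },
    index=["P000001", "P000002"],
)

# Write to a temporary directory so that no real output file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_for_topic_model.p")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(type(restored.loc["P000001", "lemma"]))
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.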