# 5.1 Acquire Data for Topic Model: State Archives of Assyria

The data acquisition techniques discussed in section 2.1 are applied here to gather all the data from the State Archives of Assyria Online ([SAAo](http://oracc.org/saao)). The next notebook uses this data to create a topic model.

### 5.1.0 Preparation: Import modules

In [None]:
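import json      # parse catalogue.json (used below)
import os
import sys
import zipfile   # open the downloaded project ZIP files (used below)
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)  # make the Compass utils module importable
from utils import *
import pandas as pd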

### 5.1.1 Get data
The module `utils` in the `utils` directory of Compass includes the function `get_data()`, which essentially runs the same code as the Extended ORACC Parser (2.1.3; see there for an explanation of the code). Its only parameter is a string of [ORACC](http://oracc.org) project names, separated by commas. It returns a Pandas DataFrame in which each row represents one word.

If you wish to build a topic model with a different set of texts, you may replace the list of subprojects with any other comma-separated list of valid [ORACC](http://oracc.org) (sub)projects. Note, however, that the code below (and in the next notebook) uses field names that are specific to the [SAAo](http://oracc.org/saao) catalogs (in particular the field 'title'). [ORACC](http://oracc.org) data sets essentially all share the same structure, but catalogs vary widely in the fields they include; only the fields 'id_text' and 'designation' are obligatory and found in all catalogs.

In [None]:
projects = """saao/saa01,
                saao/saa02,
                saao/saa03,
                saao/saa04,
                saao/saa05,
                saao/saa06,
                saao/saa07,
                saao/saa08,
                saao/saa09,
                saao/saa10,
                saao/saa11,
                saao/saa12,
                saao/saa13,
                saao/saa14,
                saao/saa15,
                saao/saa16,
                saao/saa17,
                saao/saa18,
                saao/saa19,
                saao/saa20,
                saao/saa21"""
words = get_data(projects)

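Create a `lemma` column in the format `cf[gw]pos` (citation form, guide word, part of speech) and collect all lemmas of a single document in a list. Words without a citation form, that is, unlemmatized words, are dropped first.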

In [None]:
words = words.fillna('')
words = words.loc[words.cf != ''].copy()  # keep lemmatized words only; copy avoids SettingWithCopyWarning
words['lemma'] = words['cf'] + '[' + words['gw'] + ']' + words['pos']
words['lemma'] = words['lemma'].str.lower()
docs = words['lemma'].groupby(words['id_text']).apply(list)  # one list of lemmas per document

In [None]:
docs_df = pd.DataFrame(docs).reset_index()
docs_df.index = [idt[-7:] for idt in docs_df.id_text]
docs_df
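
The lemma construction and grouping above can be tried out on a small scale. The following is a minimal sketch on hypothetical toy data; the text IDs, words, and the `toy_` variable names are invented for illustration, and it assumes that `id_text` consists of a project path followed by a seven-character text ID, as the slicing with `[-7:]` suggests.

```python
import pandas as pd

# Hypothetical toy data in the shape described above: one row per word.
toy_words = pd.DataFrame({
    "id_text": ["saao/saa01/P000001", "saao/saa01/P000001",
                "saao/saa02/P000002", "saao/saa02/P000002"],
    "cf": ["šarru", "", "Bēlu", "awātu"],   # the empty string marks an unlemmatized word
    "gw": ["king", "", "lord", "word"],
    "pos": ["N", "", "N", "N"],
})

# Keep only lemmatized words (non-empty citation form).
toy_words = toy_words.loc[toy_words["cf"] != ""].copy()

# Build the lemma as cf[gw]pos and lowercase it.
toy_words["lemma"] = (
    toy_words["cf"] + "[" + toy_words["gw"] + "]" + toy_words["pos"]
).str.lower()

# Collect the lemmas of each document in a list.
toy_docs = toy_words.groupby("id_text")["lemma"].apply(list)

# Use the last seven characters of id_text (the text ID) as index.
toy_docs_df = toy_docs.reset_index()
toy_docs_df.index = [idt[-7:] for idt in toy_docs_df["id_text"]]
print(toy_docs_df)
```

The resulting DataFrame has one row per document, indexed by text ID, which is the form needed for merging with the catalog data.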

### Get metadata from catalog file
If you are downloading files other than those from [SAAo](http://oracc.org/saao), adjust the list of fields in the penultimate line of the code below, depending on the fields available in the catalog(s) of the data set of your choice. Since all catalogs include the field 'designation', the safest choice is the following (the double brackets keep the result a DataFrame rather than a Series):
```python
df = df[['designation']]
```

In [None]:
df = pd.DataFrame() # create an empty dataframe
p = projects.split(',')
p = [pro.lower().strip() for pro in p]
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):  # file missing, unreadable, or not a ZIP archive
        print(file + " does not exist or is not a proper ZIP file")
        continue
    try:
        st = z.read(project + '/catalogue.json').decode('utf-8')  # read and decode the catalogue.json file of one project
                                                                # the result is a string object
    except (KeyError, zipfile.BadZipFile, UnicodeDecodeError):  # KeyError: file is not in the archive
        print(project + '/catalogue.json' + ' is not available or not complete')
        continue
    cat = json.loads(st)
    cat = cat['members']  # select the 'members' node
    for item in cat.values():
        item["project"] = project # add project name as separate field
    cat_df = pd.DataFrame(cat).T  # transpose: one row per text, one column per catalog field
    df = pd.concat([df, cat_df], sort=True)  # sort=True is necessary in case catalogs have a different set of fields
df = df[['designation', 'title', 'volume', 'ch_no']]
df
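
When catalogs with different sets of fields are combined, selecting a fixed list of columns raises a `KeyError` if one of them is missing. As a hedged illustration, the sketch below concatenates two hypothetical catalog `members` nodes (all IDs, values, and the project name `other/project` are invented) and keeps only those wanted fields that actually exist. The names `wanted` and `available` are illustrative and not part of the Compass code.

```python
import pandas as pd

# Hypothetical 'members' nodes of two catalogs with different sets of fields.
cat_a = {"P000001": {"designation": "SAA 01 001", "title": "Letter A",
                     "project": "saao/saa01"}}
cat_b = {"P000002": {"designation": "Text B 002", "period": "Neo-Assyrian",
                     "project": "other/project"}}

# One row per text; fields missing from a catalog are filled with NaN.
frames = [pd.DataFrame(cat).T for cat in (cat_a, cat_b)]
toy_cat = pd.concat(frames, sort=True)

# Keep only the wanted fields that are present in the combined catalog.
wanted = ["designation", "title", "volume", "ch_no"]
available = [field for field in wanted if field in toy_cat.columns]
toy_cat = toy_cat[available]
print(toy_cat)
```

Filtering the field list in this way avoids the error, at the cost of silently omitting fields that no catalog provides, so it is worth checking `available` before relying on a particular column.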

## TODO
edit this description. Replace title/text_name by designation to make code more universally applicable. Rescue title, but indicate that this only works for SAAo.

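Rename the `title` column to `text_name`. Then use `merge` to combine the catalog metadata with the lemmatized documents in `docs_df`, matching rows on the indexes (row names) of both DataFrames, which hold the text IDs. Because the merge is an inner join, only texts that appear in both DataFrames are kept.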

In [None]:
df = df.rename(columns={'title': 'text_name'})  # rename by name rather than by column position
df = pd.merge(df, docs_df, right_index=True, left_index=True, how='inner')  # keep only texts found in both
df.head()
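
The effect of an inner merge on the indexes can be seen in a small example. This is a sketch with invented text IDs and values, not the SAAo data: only the text ID present in both DataFrames survives.

```python
import pandas as pd

# Hypothetical metadata and lemma tables, both indexed by text ID.
meta = pd.DataFrame(
    {"designation": ["SAA 01 001", "SAA 01 002"],
     "text_name": ["Letter A", "Letter B"]},
    index=["P000001", "P000002"],
)
lemmas = pd.DataFrame(
    {"lemma": [["šarru[king]n"], ["bēlu[lord]n"]]},
    index=["P000002", "P000003"],
)

# Inner join on the indexes: P000001 has no lemmas and P000003 has no
# metadata, so only P000002 is kept.
merged = pd.merge(meta, lemmas, left_index=True, right_index=True, how="inner")
print(merged)
```

Comparing the number of rows before and after the merge is a quick way to see how many texts lack either metadata or lemmatization.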

### Pickle

In [None]:
pickled = 'output/data_for_topic_model.p'
os.makedirs('output', exist_ok=True)  # create the output directory if it does not exist yet
df.to_pickle(pickled)
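
A pickled DataFrame can be read back with `pd.read_pickle()`, which restores the index and the lists of lemmas unchanged. The sketch below shows such a round trip on invented toy data. It writes to a temporary directory so that it does not touch the real `output` folder.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row DataFrame in the same shape as the pickled data.
toy = pd.DataFrame(
    {"designation": ["SAA 01 001"], "lemma": [["šarru[king]n"]]},
    index=["P000001"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "output")
    os.makedirs(out_dir, exist_ok=True)   # make sure the directory exists
    path = os.path.join(out_dir, "data_for_topic_model.p")
    toy.to_pickle(path)                   # write the DataFrame to disk
    restored = pd.read_pickle(path)       # read it back

print(restored)
```

Unlike CSV, the pickle format preserves Python objects such as lists, so the `lemma` column does not need to be parsed again after loading.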