# Retrieve Catalog Data from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). The `zip` that contains all JSON files of a particular project can be found at `http://oracc.museum.upenn.edu/[PROJECT]/json/[PROJECT].zip`. In the URL replace [PROJECT] with your project name (e.g. `dcclt`). For sub-projects the URL pattern is `http://oracc.museum.upenn.edu/[PROJECT]/[SUBPROJECT]/json/[PROJECT]-[SUBPROJECT].zip`. 
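The URL patterns above can be expressed as a small helper. This is only a sketch: the function name `oracc_zip_url` is hypothetical (it is not part of the `utils` module), and it assumes that sub-projects are written as `[PROJECT]/[SUBPROJECT]`.

```python
def oracc_zip_url(project):
    """Build the ZIP URL for a project ('dcclt') or sub-project ('saao/saa01')."""
    project = project.strip().lower().strip('/')
    # in the file name the slash of a sub-project is replaced by a hyphen
    zip_name = project.replace('/', '-')
    return f"http://oracc.museum.upenn.edu/{project}/json/{zip_name}.zip"


print(oracc_zip_url('dcclt'))
print(oracc_zip_url('saao/saa01'))
```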

The main key in a `catalogue.json` file is called `members`. The value of this key contains the information of all the fields and all the entries in the project catalog. This information is put in a `pandas` DataFrame.

In [1]:
import pandas as pd
import zipfile
import json
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
import utils

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. 

In [2]:
os.makedirs('jsonzip', exist_ok = True)
os.makedirs('output', exist_ok = True)

## 1.1 Input Project Names
We can download and manipulate multiple [ORACC](http://oracc.org) `zip` files at the same time. Note, however, that different [ORACC](http://oracc.org) projects use different fields in their catalogs; not all catalogs are mutually compatible.

Provide project abbreviations, separated by commas. Note that subprojects must be processed separately; they are not included in the main project. A subproject is named `[PROJECT]/[SUBPROJECT]`, for instance `saao/saa01`.

Split the list of projects and create a list of project names, using the `format_project_list()` function from the `utils` module.

In [3]:
projects = input('Project(s): ').lower().strip() # lowercase user input and strip accidental spaces
project_list = utils.format_project_list(projects)

Project(s):  obel
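The code of `format_project_list()` is not shown in this notebook. As a rough idea of what such a step involves, the sketch below splits a comma-separated string into clean project names. The function `parse_project_list` is a hypothetical stand-in, not the actual `utils` implementation, which may behave differently.

```python
def parse_project_list(projects):
    """Split a comma-separated string into a list of project names."""
    names = [name.strip().lower() for name in projects.split(',')]
    # drop empty entries caused by stray or trailing commas
    return [name for name in names if name]


print(parse_project_list('obel, saao/saa01 ,, DCCLT'))
```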


## 1.2 Download the ZIP Files
Use the `oracc_download()` function from the `utils` module to download the requested projects. The code of this function is discussed in more detail in 2.1.0. Download ORACC JSON Files. The function returns a new version of the project list, with duplicates and non-existing projects removed.

In [7]:
project_list = utils.oracc_download(project_list)



Saving http://oracc.org/obel/json/obel.zip as jsonzip/obel.zip.



## 2 Extract Catalogue Data from `JSON` files
The code in this cell iterates through the list of projects entered above (1.1). For each project the `JSON` zip file, named `[PROJECT].zip`, has been downloaded into the directory `jsonzip` (1.2). Each of these `zip` files includes a file called `catalogue.json`. This file is read and parsed with `json.loads()`, which transforms a JSON string into a Python dictionary of names and values.

The dictionary is transformed into a `pandas` DataFrame. By default, when reading in a dictionary, the `DataFrame()` function takes the top-level keys (in this case the text IDs) as columns, so the result would have to be transposed (`.T`). The code below instead uses `DataFrame.from_dict()` with `orient="index"`, so that the P, Q, and X numbers become indexes or row names, and each column represents a field in the catalog. The individual dataframes (one for each project requested) are concatenated. Since individual [ORACC](http://oracc.org) project catalogs may have different fields, the dataframes may have different column names. By default `pandas` concatenation uses an `outer join`, so that all column names of all the catalogs are preserved; fields that a catalog does not have are filled with `NaN`.

In [8]:
frames = []  # collect one dataframe per project
for project in project_list:
    file = f"jsonzip/{project.replace('/', '-')}.zip"
    try:
        # the context manager closes the ZIP file, even if reading fails
        with zipfile.ZipFile(file) as zip_file:
            # read and decode the catalogue.json file of one project; the result is a string
            json_cat_string = zip_file.read(f"{project}/catalogue.json").decode('utf-8')
    except (OSError, zipfile.BadZipFile, KeyError) as error:
        # ZIP file missing or corrupt, or no catalogue.json in the archive
        print(file, type(error), error, sep='\n')
        continue
    cat = json.loads(json_cat_string)
    cat = cat['members']  # select the 'members' node
    for item in cat.values():
        item["project"] = project  # add project name as separate field
    frames.append(pd.DataFrame.from_dict(cat, orient="index"))
# sort=True orders the combined columns alphabetically, which helps when catalogs have different fields
df = pd.concat(frames, sort=True) if frames else pd.DataFrame()
df

| | Cohen_balag | Delnero_no | Delnero_remarks | Delnero_subgenre_no | accession_no | additional_P_numbers | author | collection | deity | designation | ... | period | pleiades_id | primary_publication | project | provenience | public | publication_date | publication_history | trans | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P223438 | | | | | | | van Dijk, Johannes J. A. | National Museum of Iraq, Baghdad, Iraq | | TIM 09, 005 | ... | Old Babylonian | | TIM 09, 005 | obel | unclear | no | 1976 | W.G. Lambert, JNES 33 (1974: 291ff.); M. Jaque... | | Ershahunga |
| P223457 | | 4720 | | Balag ID (167) | | | van Dijk, Johannes J. A. | National Museum of Iraq, Baghdad, Iraq | Inana-Dumuzi | TIM 09, 031 | ... | Old Babylonian | 821129014 | Sumer 13, pl. 7 = TIM 9, 31 | obel | Šaduppum | no | 1976 | | | Balag |
| P223475 | | 4730 | | Ershemma Utu (438) | | | van Dijk, Johannes J. A. | National Museum of Iraq, Baghdad, Iraq | Utu | TIM 09, 030 | ... | Old Babylonian | 821129014 | Sumer 13, pl. 8 = TIM 9, 30 | obel | Šaduppum | no | 1976 | B. Baragli & J. Peterson, OrAnt Series Nova 5 ... | | Ershemma |
| P223481 | | 4750 | | Balag ID (53) | | | van Dijk, Johannes J. A. | National Museum of Iraq, Baghdad, Iraq | Inana-Dumuzi | TIM 09, 015 | ... | Old Babylonian | | TIM 09, 015 | obel | unclear | no | 1976 | M. Fritz, AOAT 307 (2003: 136–38) | | Balag |
| P227377 | | 4930 | | Ershemma AD (395) | | | Speleers, Louis | Musées royaux d’Art et d’Histoire, Brussels, B... | Aruru-Dingirmah | RIAA 189 | ... | Old Babylonian | | RIAA 189 | obel | unclear | no | 1925 | H. Limet, Akkadica 117 (2000: 3-8); J. Black, ... | | Ershemma |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| X907112 | | | | | | | George, Andrew R. and Taniguchi, Junko | Birmingham Museums and Art Gallery, Birmingham... | Ninisina-Gula | MC 24, 35 | ... | Old Babylonian | | MC 24, 35 | obel | unclear | | 2019 | A. Cavigneaux, Iraq 85 (2023: 103-135). | | |
| X907113 | | | | | | | George, Andrew R. and Taniguchi, Junko | Birmingham Museums and Art Gallery, Birmingham... | | MC 24, 37 | ... | Old Babylonian | | MC 24, 37 | obel | unclear | | 2019 | | | |
| X907114 | | | | | | | Kramer, Samuel N., Çig, Muazzez & Kizilyay, Ha... | Arkeoloji Müzeleri, Istanbul, Turkey | | ISET 1, 209 (Pl. 151) Ni 13236 | ... | Old Babylonian | 912910 | ISET 1, 209 (Pl. 151) | obel | Nippur | | 1969 | M. Cohen, CLAM (1988) 633; S. Maul, CTMMA 2 (2... | | |
| X907115 | | | | | | | Walker, C.B.F. | Brotherton Library, University of Leeds, Leeds... | Enlil | ULCI 17 | ... | Old Babylonian | | JCS 30, 242 | obel | unclear | | 1978 | Matini, Lobpreis (2014: 38); U. Gabbay, HES 2 ... | | |
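To see how the conversion and the concatenation behave, here is a toy sketch with two invented catalogs that share only one field. The text IDs, field values, and project names (`proj_a`, `proj_b`) are made up for illustration and do not come from any ORACC project.

```python
import pandas as pd

# two tiny 'members' dictionaries with partly different fields (invented data)
cat_a = {'P000001': {'designation': 'Text A', 'period': 'Old Babylonian'}}
cat_b = {'Q000002': {'designation': 'Text B', 'genre': 'Lexical'}}

frames = []
for project, members in [('proj_a', cat_a), ('proj_b', cat_b)]:
    # orient='index' turns the text IDs into row labels
    frame = pd.DataFrame.from_dict(members, orient='index')
    frame['project'] = project
    frames.append(frame)

# outer join (the default): all columns are kept, gaps become NaN
combined = pd.concat(frames, sort=True)
print(combined)
```

In the result, `P000001` has no value for `genre` and `Q000002` has none for `period`; both gaps are represented as `NaN`.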


## 3. Clean the Dataframe
The method `fillna('')` replaces `NaN` with an empty string in all fields that have no entry.

In [9]:
df = df.fillna('')
df

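Before filling the gaps it can be useful to check how many values are missing per field. The following sketch does this on an invented two-row dataframe; the column names mirror catalog fields, but the data are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {'designation': ['Text A', 'Text B'], 'provenience': ['Nippur', np.nan]},
    index=['P000001', 'P000002'],
)

# count the missing values in each column before replacing them
missing_counts = toy.isna().sum()
print(missing_counts)

toy = toy.fillna('')  # NaN becomes an empty string
print(toy)
```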


## 4 Select Relevant Fields
[ORACC](http://oracc.org) catalogs may have custom fields. The only obligatory fields are `id_text` (the P, Q, or X number that identifies the text, for instance "P243546") and `designation` (the human-readable reference, for instance "VS 17, 012"). The example code below works with field names that are available in (almost) every catalog. Adjust the code to your data and your needs.

In [None]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no', 'project', 'id_text']]
df1
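Selecting a field that does not exist in the dataframe raises a `KeyError`. One way to guard against that, sketched here on an invented dataframe, is to keep only the requested fields that are actually present. The variable names `wanted`, `available`, and `missing` are illustrative.

```python
import pandas as pd

wanted = ['designation', 'period', 'provenience',
          'museum_no', 'project', 'id_text']

# invented catalog that lacks 'provenience' and 'museum_no'
toy = pd.DataFrame({
    'designation': ['Text A'],
    'period': ['Old Babylonian'],
    'project': ['proj_a'],
    'id_text': ['P000001'],
})

available = [field for field in wanted if field in toy.columns]
missing = [field for field in wanted if field not in toy.columns]
subset = toy[available]

print('missing fields:', missing)
print(subset)
```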

## 5.1 Save as CSV
Save the resulting data set as a `csv` file. `UTF-8` is the encoding with the widest support in text analysis (and also the encoding used by [ORACC](http://oracc.org)). If you intend to use the catalog file in Excel, however, it is better to use `utf-16` encoding.

In [None]:
filename = 'output/catalog.csv'
df1.to_csv(filename, index=False, encoding='utf-8')
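For the Excel case, only the `encoding` argument changes. The sketch below writes an invented one-row dataframe with `utf-16` to a temporary directory and reads it back to confirm that special characters survive; the file name `catalog_excel.csv` is an arbitrary choice.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'id_text': ['P000001'], 'provenience': ['Šaduppum']})

# write to a temporary directory so the example leaves 'output' untouched
excel_file = os.path.join(tempfile.mkdtemp(), 'catalog_excel.csv')
toy.to_csv(excel_file, index=False, encoding='utf-16')

# reading the file back requires the same encoding
check = pd.read_csv(excel_file, encoding='utf-16')
print(check)
```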

## 5.2 Save with Pickle
A dataframe can be pickled either with the `pickle` library or directly from within `pandas` with the `to_pickle()` function. A pickled file preserves the data structure of the dataframe (including the index and the data types), which is an advantage over saving as `csv`. A pickle file is binary: when using the `pickle` library directly, the file must be opened with the `wb` (write binary) option and no encoding can be given; `to_pickle()` takes care of this itself. To open the pickled file one may use the `read_pickle()` function from the `pandas` library, as in:

```python
import pandas as pd
df = pd.read_pickle('output/catalog.p')
```

In [None]:
filename = "output/catalog.p"
df1.to_pickle(filename)
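For comparison, the same result can be obtained with the `pickle` library itself. The sketch below pickles an invented dataframe in a temporary directory, opening the file in `wb` mode, and reads it back with `read_pickle()`.

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {'designation': ['Text A'], 'publication_date': [1976]},
    index=['P000001'],
)

pickle_file = os.path.join(tempfile.mkdtemp(), 'catalog.p')
with open(pickle_file, 'wb') as f:  # binary mode, no encoding argument
    pickle.dump(toy, f)

# index and data types come back unchanged
restored = pd.read_pickle(pickle_file)
print(restored.dtypes)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.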