# Retrieve Catalog Data from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). The file can be found at `http://oracc.org/[PROJECT]/catalogue.json`. In the URL, replace `[PROJECT]` with your project or sub-project name (e.g. `dcclt` or `cams/gkab`).

The file `catalogue.json` of each project is also included in the file `PROJECT.zip` in https://github.com/oracc/json/. 

The main item in a `catalogue.json` file is called `members`, which contains the information about all the items in the project catalog.
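
For readers who work from the ZIP archive rather than the URL, the following is a minimal sketch of how the `members` item could be reached. The function name `load_catalogue_from_zip` is hypothetical, and the location of `catalogue.json` inside the archive is an assumption, so the code simply searches for an entry with that name. The demonstration builds a tiny in-memory archive with made-up contents instead of downloading a real one.

```python
import io
import json
import zipfile


def load_catalogue_from_zip(zip_source):
    """Return the parsed catalogue.json found in a project ZIP.

    zip_source may be a file path or a file-like object. The path of
    catalogue.json inside the archive is an assumption, so any entry
    whose name ends in 'catalogue.json' is accepted.
    """
    with zipfile.ZipFile(zip_source) as z:
        names = [n for n in z.namelist() if n.endswith('catalogue.json')]
        if not names:
            raise FileNotFoundError('no catalogue.json in archive')
        # Prefer the shortest path: presumably the project's own
        # catalogue rather than that of a subproject.
        name = min(names, key=len)
        return json.loads(z.read(name).decode('utf-8'))


# Demonstration with an in-memory archive (contents are invented).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr(
        'rimanum/catalogue.json',
        json.dumps({'members': {'P295625': {'period': 'Old Babylonian'}}}),
    )

catalogue = load_catalogue_from_zip(buffer)
print(list(catalogue['members']))
```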

In [1]:
import requests
import pandas as pd

## Select Project
Select the project or subproject of interest. Subprojects are indicated as `[PROJECT]/[SUBPROJECT]` or `[PROJECT]/[SUBPROJECT]/[SUBPROJECT]`.

In [2]:
project = input('Project or subproject abbreviation: ')
project = project.strip().lower()

Project or subproject abbreviation: rimanum
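
Because the project name ends up inside a URL, it can help to check its shape before making a request. The sketch below is one possible way to do that. The helper `catalogue_url` is hypothetical, and the set of characters it accepts (lowercase letters, digits, hyphen, underscore) is an assumption rather than an ORACC rule; the limit of three path segments follows the patterns described above.

```python
import re

BASE_URL = 'http://oracc.org/'

# Assumed pattern: one to three segments separated by '/'.
PROJECT_PATTERN = re.compile(r'[a-z0-9_-]+(/[a-z0-9_-]+){0,2}')


def catalogue_url(project):
    """Build the catalogue.json URL for a project or subproject name."""
    name = project.strip().strip('/').lower()
    if not PROJECT_PATTERN.fullmatch(name):
        raise ValueError(f'not a valid project name: {project!r}')
    return BASE_URL + name + '/catalogue.json'


print(catalogue_url('rimanum'))
print(catalogue_url(' CAMS/gkab '))
```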


## Access File
The `requests` library retrieves the JSON file, and the `.json()` method of the response object turns it into a Python dictionary. The JSON catalogue file has all the P, Q, and X numbers of a project under the key `members`.

The resulting dictionary `d` has the text ID numbers (P, Q, and X numbers) as keys. Each value is another dictionary in which each key is a field (`provenience`, `primary_publication`, etc.) and each value is the content of that field.

In [3]:
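url = 'http://oracc.org/' + project + '/catalogue.json'
f = requests.get(url, timeout = 3)
f.raise_for_status()  # stop on an HTTP error (for instance, a misspelled project name)
d = f.json()          # parse the JSON response into a dictionary
d = d['members']      # keep only the catalog entries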
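
To make the nested structure concrete, here is a small sketch that works on a hand-made stand-in for the parsed file. The IDs and field values are invented, and the helpers `extract_members` and `field_coverage` are hypothetical names. The second helper counts how many entries supply each field, which gives a quick impression of how unevenly the catalog fields are filled.

```python
from collections import Counter

# Hypothetical miniature of a parsed catalogue.json (all values invented).
catalogue = {
    'members': {
        'P000001': {'period': 'Old Babylonian', 'provenience': 'Ur'},
        'Q000002': {'period': 'Old Babylonian'},  # no provenience field
    }
}


def extract_members(catalogue):
    """Return the 'members' dictionary, or raise a clear error if absent."""
    members = catalogue.get('members')
    if not isinstance(members, dict) or not members:
        raise ValueError("catalogue has no 'members' entries")
    return members


def field_coverage(members):
    """Count, for each field name, the number of entries that contain it."""
    counts = Counter()
    for fields in members.values():
        counts.update(fields.keys())
    return counts


members = extract_members(catalogue)
coverage = field_coverage(members)
for field, n in sorted(coverage.items()):
    print(f'{field}: {n} of {len(members)} entries')
```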

## Create Dataframe
After the dictionary is transformed into a Pandas Dataframe, it needs to be transposed so that the P, Q, and X numbers become the index (row names) rather than column names, and each column represents a field in the catalog.

Creating a Dataframe is not necessary; one may also manipulate the dictionary directly, but for demonstration purposes the Dataframe is a handy format. When manipulating the dictionary directly, keep in mind that not all catalog fields have data for all entries, so not every dictionary key is available for each P, Q, or X number.

Example code for slicing the dictionary to select all entries that have `provenience = 'Ur'`:
> `urcat = {key: value for key, value in d.items() if value.get('provenience') == 'Ur'}`
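
The same idea can be generalized to any field. The following sketch uses a hypothetical helper, `select_entries`, and invented sample data; it relies on `dict.get`, which returns `None` for a missing key, so entries that lack the field are skipped without raising a `KeyError` (assuming the value searched for is not itself `None`).

```python
def select_entries(members, field, value):
    """Return the entries whose `field` is present and equal to `value`."""
    return {
        text_id: fields
        for text_id, fields in members.items()
        if fields.get(field) == value
    }


# Invented sample data in the shape of the `members` dictionary.
sample = {
    'P000001': {'provenience': 'Ur', 'period': 'Ur III'},
    'P000002': {'period': 'Old Babylonian'},  # no provenience field
    'P000003': {'provenience': 'Uruk (mod. Warka)'},
}

urcat = select_entries(sample, 'provenience', 'Ur')
print(list(urcat))
```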

In [4]:
df = pd.DataFrame(d).T.fillna('')
df

Unnamed: 0,accession_no,ark_number,atf_source,atf_up,author,author_remarks,cdli_remarks_internal_only,citation,collection,collection_remarks_internal_only,...,publication_date,publication_history,seal_id,subgenre,subgenre_remarks,supergenre,thickness,uri,width,xproject
P295625,,21198/zz001wkcn7,,,"Simmons, Stephen D.",,,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,,Sx,bit asiri flour,,ELA,20,http://cdli.ucla.edu/P295625,35,CDLI
P296047,,21198/zz001wk2cr,,,"Simmons, Stephen D.",,MLC 01297 is the tablet ?,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,"Scheil, Vincent, RT 20 (1898) 064-065 (MLC 01297)",Sx,,bit asiri people,ELA,26,http://cdli.ucla.edu/P296047,54,CDLI
P296277,,21198/zz001wwhh4,,,"Simmons, Stephen D.",,,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,,Sx,bit asiri people,,ELA,23,http://cdli.ucla.edu/P296277,43,CDLI
P296278,,21198/zz001wwhjn,,,"Simmons, Stephen D.",,,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,,,bit asiri people,,ELA,24,http://cdli.ucla.edu/P296278,45,CDLI
P296414,,21198/zz001x50db,,,"Simmons, Stephen D.",,,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,,Sx,bit asiri people,,ELA,28,http://cdli.ucla.edu/P296414,50,CDLI
P297038,,21198/zz001wg725,,,"Simmons, Stephen D.",,,,"J. Pierpont Morgan Library Collection, Yale Ba...",,...,1978,,Sx,bit asiri people,,ELA,23,http://cdli.ucla.edu/P297038,38,CDLI
P311964,,21198/zz002084qg,,,"Simmons, Stephen D.",,,"CBCY 04, p. 231, YBC 11995","Yale Babylonian Collection, New Haven, Connect...",,...,1978,,Sx,bit asiri flour,,ELA,18,http://cdli.ucla.edu/P311964,33,CDLI
P368396,,21198/zz001zzv7j,"Seri, Andrea",20130125.atf,"Seri, Andrea",bit asari,,,"Kalamazoo Valley Museum, Kalamazoo, Michigan, USA",,...,2007,,,bit asiri people,,ELA,?,http://cdli.ucla.edu/P368396,85,CDLI
P368398,,21198/zz001zzv9k,"Seri, Andrea",20130125.atf,"Seri, Andrea",female slaves,,,"Kalamazoo Valley Museum, Kalamazoo, Michigan, USA",,...,2007,,,bit asiri people,,ELA,?,http://cdli.ucla.edu/P368398,38,CDLI
P372766,,21198/zz00222mwz,,,"Figulla, Hugo H.",,,,"Vorderasiatisches Museum, Berlin, Germany",,...,1914,,,bit asiri people,,ELA,,http://cdli.ucla.edu/P372766,,CDLI
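
The effect of the transposition is easiest to see on a tiny example. The sketch below uses invented IDs and values; it also shows `pd.DataFrame.from_dict` with `orient='index'`, which is an alternative (not the method used in this notebook) that produces the row-per-text orientation in one step.

```python
import pandas as pd

# Invented sample data in the shape of the `members` dictionary.
sample = {
    'P000001': {'period': 'Old Babylonian', 'provenience': 'Ur'},
    'P000002': {'period': 'Old Babylonian'},  # no provenience field
}

wide = pd.DataFrame(sample)      # text IDs are the column names here
table = wide.T.fillna('')        # after transposing, text IDs are the row index

# Alternative: build the row-per-text orientation directly.
direct = pd.DataFrame.from_dict(sample, orient='index').fillna('')

print(wide.columns.tolist())
print(table.index.tolist())
```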


## Select Relevant Fields
First display all available fields, then select the ones that are relevant for the task at hand. The `fillna('')` call applied when the Dataframe was created puts an empty string (instead of `NaN`) in all fields that have no entry.

In [5]:
df.columns

Index(['accession_no', 'ark_number', 'atf_source', 'atf_up', 'author',
       'author_remarks', 'cdli_remarks_internal_only', 'citation',
       'collection', 'collection_remarks_internal_only', 'date_entered',
       'date_of_origin', 'date_remarks', 'date_updated', 'dates_referenced',
       'db_source', 'designation', 'excavation_no', 'genre', 'height', 'id',
       'id_text', 'images', 'join_information', 'langs', 'language',
       'lineart_up', 'material', 'museum_no', 'object_type', 'period',
       'photo_up', 'primary_publication', 'project', 'provenience',
       'provenience_remarks', 'public', 'public_atf', 'public_images',
       'publication_date', 'publication_history', 'seal_id', 'subgenre',
       'subgenre_remarks', 'supergenre', 'thickness', 'uri', 'width',
       'xproject'],
      dtype='object')

In [6]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no']]
df1

Unnamed: 0,designation,period,provenience,museum_no
P295625,"YOS 14, 341",Old Babylonian,Uruk (mod. Warka),MLC 00837
P296047,"YOS 14, 338",Old Babylonian,Uruk (mod. Warka),MLC 01284 & MLC 01297
P296277,"YOS 14, 339",Old Babylonian,Uruk (mod. Warka),MLC 01588
P296278,"YOS 14, 342",Old Babylonian,Uruk (mod. Warka),MLC 01589
P296414,"YOS 14, 337",Old Babylonian,Uruk (mod. Warka),MLC 01741
P297038,"YOS 14, 340",Old Babylonian,Uruk (mod. Warka),MLC 02650
P311964,"YOS 14, 346",Old Babylonian,Uruk (mod. Warka),YBC 11995
P368396,CDLJ 2007/1 §3.45,Old Babylonian,Uruk (mod. Warka),KVM 32.1160
P368398,CDLJ 2007/1 §3.47,Old Babylonian,Uruk (mod. Warka),KVM 32.1186
P372766,"VS 13, 013",Old Babylonian,Uruk (mod. Warka),VAT 05589


## Manipulate
The Dataframe may now be manipulated with standard Pandas methods. The example code selects the texts from Ur.
> `ur = df1[df1.provenience == "Ur"]`
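
Note that `==` compares the whole string. In the table above the provenience values include the modern site name, as in `Uruk (mod. Warka)`, so an exact comparison needs the full value. The sketch below contrasts exact matching with prefix matching on a small invented Dataframe; whether a prefix match is appropriate depends on how the field is filled in a given project.

```python
import pandas as pd

# Invented rows; 'Uruk (mod. Warka)' follows the format seen in the table above.
df_demo = pd.DataFrame(
    {'provenience': ['Ur', 'Uruk (mod. Warka)', '']},
    index=['P000001', 'P000002', 'P000003'],
)

# Exact match: only rows whose value is precisely 'Ur'.
exact = df_demo[df_demo['provenience'] == 'Ur']

# Prefix match: rows whose value begins with 'Uruk', whatever follows.
prefix = df_demo[df_demo['provenience'].str.startswith('Uruk')]

print(exact.index.tolist())
print(prefix.index.tolist())
```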

## Save
Save the resulting data set as a `csv` file. `UTF-8` is the encoding with the widest support in text analysis (and also the encoding used by [ORACC](http://oracc.org)). If you intend to open the catalog file in Excel, however, it is better to use `utf-16` encoding.

In [8]:
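filename = project.replace('/', '-') + '_cat.csv'
# Pass the path to to_csv directly: the encoding argument has no effect
# on a file handle that was already opened in text mode.
# The directory 'output' must already exist.
df1.to_csv('output/' + filename, encoding='utf-8')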
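
For the Excel case mentioned above, a possible variant is sketched here. The helper name `save_for_excel` is hypothetical, and the tab separator is an assumption (tab-delimited UTF-16 is a combination Excel commonly opens without an import dialog); adjust it if your setup differs. The helper also creates the target directory when it is missing.

```python
import os

import pandas as pd


def save_for_excel(frame, path):
    """Write `frame` as a tab-separated UTF-16 file, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    # Python's 'utf-16' codec writes a byte order mark at the start of the file.
    frame.to_csv(path, encoding='utf-16', sep='\t')
```

With the variables of this notebook, a call could look like `save_for_excel(df1, 'output/' + filename)`.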