# Retrieve Catalog Data from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). The `zip` that contains all JSON files of a particular project can be found at `http://build-oracc.museum.upenn.edu/json/[PROJECT].zip`. In the URL, replace `[PROJECT]` with the name of your project (e.g. `dcclt`); for a sub-project, replace the slash in its name with a hyphen (e.g. `cams-gkab` for `cams/gkab`).

The main node in a `catalogue.json` file is called `members`. This node contains the information of all the fields and all the entries in the project catalog.
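As a rough illustration of that layout, the sketch below parses a tiny, made-up `catalogue.json` string. The two entries and their field values are invented for the example; only the `members` node and the field names `id_text`, `designation`, and `period` are taken from this notebook.

```python
import json

# A made-up miniature of catalogue.json (real files hold many more fields).
sample = '''
{
  "members": {
    "P000001": {"id_text": "P000001", "designation": "Example text A", "period": "Uruk III"},
    "Q000002": {"id_text": "Q000002", "designation": "Example text B"}
  }
}
'''

catalogue = json.loads(sample)        # parse the JSON string into a Python dict
members = catalogue["members"]        # keys are P, Q, or X numbers

for text_id, fields in members.items():
    # not every entry has every field, so use .get() with a default
    print(text_id, fields["designation"], fields.get("period", ""))
```

The second entry has no `period` field, which is why the loop uses `.get()` rather than direct indexing.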

In [1]:
import pandas as pd
import zipfile
import json
import requests
import errno
import os
import tqdm

## 0 Create Directories, if Necessary
This script needs two directories, `jsonzip` and `output`. They are created if they do not exist yet; otherwise nothing happens.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # ignore "already exists", re-raise anything else
            raise
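On Python 3.2 or later, the same result can probably be reached more compactly with the `exist_ok` argument of `os.makedirs()`. The sketch below runs inside a temporary directory (an assumption made only so that the example leaves no files behind); in the notebook the directories are created in the current working directory.

```python
import os
import tempfile

directories = ['jsonzip', 'output']

with tempfile.TemporaryDirectory() as base:     # throwaway location for the demo
    for d in directories:
        path = os.path.join(base, d)
        os.makedirs(path, exist_ok=True)        # no error if the directory already exists
        os.makedirs(path, exist_ok=True)        # calling it a second time is harmless
    created = sorted(os.listdir(base))

print(created)
```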

## 1.1 Input Project Name
We will download and manipulate one catalog file at a time. Different [ORACC](http://oracc.org) projects use different fields in their catalogs, so the catalogs are not mutually compatible.

Provide one or more project names, separated by commas. Note that subprojects must be processed separately; they are not included in the main project. A subproject is named `[PROJECT]/[SUBPROJECT]`, for instance `saao/saa01`.

In [3]:
projects = input('Project(s): ').lower().strip() # lowercase user input and strip accidental spaces

Project(s): dcclt


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [4]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
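To make the effect of these two steps concrete, here is a small sketch with a simulated input string (the string is an assumption standing in for what a user might type). The last line shows an optional variant, not part of the original cell, that also drops empty entries left by a trailing comma.

```python
# Simulated user input with uneven spacing, mixed case, and a trailing comma.
projects = ' DCCLT, cams/gkab ,saao/saa01, '.lower().strip()

p = projects.split(',')                 # split at each comma
p = [x.strip() for x in p]              # strip spaces around each entry
print(p)

p_clean = [x for x in p if x]           # optional: drop empty strings
print(p_clean)
```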

## 1.3 Download the ZIP Files
Download all the `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `[PROJECT].zip` (for instance: `dcclt.zip`). For subprojects the file is called `[PROJECT-SUBPROJECT].zip` (for instance `cams-gkab.zip`). 
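The naming rule can be sketched as a small helper. The function name `zip_location` and its `directory` parameter are hypothetical and introduced only for illustration; the base URL and the `jsonzip` directory are the ones used in this notebook.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"

def zip_location(project, directory="jsonzip"):
    """Return (url, local_path) for a project or sub-project name."""
    name = project.replace("/", "-")          # 'cams/gkab' -> 'cams-gkab'
    return BASE_URL + name + ".zip", directory + "/" + name + ".zip"

for project in ["dcclt", "cams/gkab"]:
    url, path = zip_location(project)
    print(url, "->", path)
```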

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time, and it may be necessary to download the file in chunks. The `iter_content()` method of the `requests` response object takes care of that, provided the request is made with `stream=True`. For the chunking code see [this page](https://www.smallsurething.com/how-to-read-a-file-properly-in-python/).

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section ...

In [5]:
non_existent = []
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    name = project.replace('/', '-')     # sub-project zips use '-' instead of '/'
    url = "http://build-oracc.museum.upenn.edu/json/" + name + ".zip"
    file = 'jsonzip/' + name + '.zip'
    r = requests.get(url, stream=True)   # stream=True: fetch the body in chunks, not all at once
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")
        non_existent.append(project)     # store the original name so the filter below matches
p = [i for i in p if i not in non_existent] # remove non-existing project names from list

  0%|                                                    | 0/1 [00:00<?, ?it/s]

Downloading http://build-oracc.museum.upenn.edu/json/dcclt.zip saving as jsonzip/dcclt.zip


100%|████████████████████████████████████████████| 1/1 [00:07<00:00,  7.02s/it]


## 2 Extract Catalogue Data from `JSON` files
The code in this cell iterates through the list of projects entered above (1.1). For each project the `JSON` zip file, named `[PROJECT].zip`, is located in the directory `jsonzip`. Each of these `zip` files includes a file called `catalogue.json`. This file is extracted, decoded to a string, and parsed with `json.loads()`, which transforms the JSON text into a Python dictionary of names and values.
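The read, decode, and parse sequence can be tried without downloading anything. The sketch below builds a small ZIP archive in memory and reads it back; the project name `demo` and the single catalog entry are made up for the example.

```python
import io
import json
import zipfile

project = "demo"    # hypothetical project name

# Build a small ZIP archive in memory that mimics the layout [PROJECT]/catalogue.json.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    payload = {"members": {"P000001": {"id_text": "P000001", "designation": "Example text"}}}
    z.writestr(project + "/catalogue.json", json.dumps(payload))

# Read it back in the same way as the notebook cell does.
with zipfile.ZipFile(buffer) as z:
    st = z.read(project + "/catalogue.json").decode("utf-8")   # bytes -> str
cat = json.loads(st)["members"]                                # str -> dict

print(type(st).__name__, type(cat).__name__, list(cat))
```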

The resulting dictionary is transformed into a pandas DataFrame. The DataFrame needs to be transposed (`.T`), so that the P, Q, and X numbers become indexes or row names (rather than column names), and each column represents a field in the catalog. The individual DataFrames (one for each project requested) are concatenated. Since individual [ORACC](http://oracc.org) project catalogs may have different fields, the DataFrames may have different column names. By default pandas concatenation uses an `outer join`, so that all column names of all the catalogs are preserved.

In [6]:
df = pd.DataFrame() # create an empty dataframe
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (OSError, zipfile.BadZipFile):   # file missing, unreadable, or not a ZIP archive
        print(file + " does not exist or is not a proper ZIP file")
        continue
    try:
        # read and decode the catalogue.json file of one project; the result is a string
        st = z.read(project + '/catalogue.json').decode('utf-8')
    except (KeyError, zipfile.BadZipFile):  # member not in the archive, or archive damaged
        print(project + '/catalogue.json' + ' is not available or not complete')
        continue
    cat = json.loads(st)
    cat = cat['members']  # select the 'members' node 
    for item in cat.values():
        item["project"] = project # add project name as separate field
    cat_df = pd.DataFrame(cat).T
    df = pd.concat([df, cat_df])
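The effect of the transposition and of the outer join is easier to see on a toy example. The two catalogs below are invented and deliberately have different fields; the project names `proj_a` and `proj_b` are placeholders.

```python
import pandas as pd

# Two made-up catalogs with partly different fields.
cat_a = {"P000001": {"id_text": "P000001", "designation": "Text A", "period": "Uruk III"}}
cat_b = {"Q000002": {"id_text": "Q000002", "designation": "Text B", "genre": "Lexical"}}

frames = []
for name, cat in [("proj_a", cat_a), ("proj_b", cat_b)]:
    for item in cat.values():
        item["project"] = name            # same bookkeeping as in the cell above
    frames.append(pd.DataFrame(cat).T)    # transpose: text IDs become the row index

df_demo = pd.concat(frames)               # default join='outer' keeps every column
print(df_demo.fillna(''))
```

Collecting the per-project DataFrames in a list and concatenating once is a common alternative to growing a DataFrame inside the loop; the result should be the same.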

## Create Dataframe
Creating a DataFrame is not necessary; one may also manipulate the dictionary directly, but for demonstration purposes the DataFrame is a handy format. When manipulating the dictionary directly, keep in mind that not all catalog fields have data for all entries, which means that not all dictionary keys are available for each P, Q, or X number.

The function `fillna('')` will put a blank (instead of `NaN`) in all fields that have no entry.

Example code for slicing the dictionary to select all entries that have `provenience = 'Ur'` (here `cat` is the `members` dictionary from the previous cell):
> `urcat = {key: value for key, value in cat.items() if 'provenience' in value and value['provenience'] == 'Ur'}`
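A runnable version of this kind of dictionary slicing is sketched below on three made-up entries, one of which lacks the `provenience` key. Using `.get()` is one way to guard against missing keys; it returns `None` instead of raising a `KeyError`.

```python
# Made-up entries; the second one has no 'provenience' key at all.
cat = {
    "P000001": {"designation": "Text A", "provenience": "Ur"},
    "P000002": {"designation": "Text B"},
    "P000003": {"designation": "Text C", "provenience": "Uruk"},
}

# keep only the entries whose provenience is exactly 'Ur'
urcat = {key: value for key, value in cat.items()
         if value.get("provenience") == "Ur"}

print(list(urcat))
```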

In [7]:
df = df.fillna('')
df

| | accession_no | acquisition_history | archive | ark | atf_source | atf_up | author | author_remarks | cdli_collation | cdli_comments | ... | subgenre | subgenre_remarks | supergenre | surface_preservation | text_remarks | thickness | trans | uri | width | xproject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P000001 | | | | 21198/zz001q0dtm | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 31x61x18; Lú A 14-16.30-32.48-50; M XVIII, auf... | | | ... | Archaic Lu A | | LEX | | | 18 | | http://cdli.ucla.edu/P000001 | 61 | CDLI |
| P000002 | | | | 21198/zz001q0dv4 | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 30x48x13; Lú A 13-15.23-25.?; Fundstelle wie W... | | | ... | Archaic Lu A | | LEX | | | 13 | | http://cdli.ucla.edu/P000002 | 48 | CDLI |
| P000003 | | | | 21198/zz001q0dwn | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 42x53x19; Vocabulary 9; Qa XVI,2, unter der Ab... | | | ... | Archaic Vocabulary | | LEX | | | 19 | | http://cdli.ucla.edu/P000003 | 53 | CDLI |
| P000004 | | | | 21198/zz001q0dx5 | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 26x23x23; Lú A 9-10.?.?; Fundstelle wie W 9123... | | | ... | Archaic Lu A | | LEX | | | 23 | | http://cdli.ucla.edu/P000004 | 23 | CDLI |
| P000005 | | | | 21198/zz001q0dzp | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 29x36x20; Lú A Vorläufer; Qa XVI,2, unter der ... | | | ... | Archaic Lu A | | LEX | | | 20 | | http://cdli.ucla.edu/P000005 | 36 | CDLI |
| P000006 | | | | 21198/zz001q0f0p | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 82x62x19; Lú A Vorläufer; Fundstelle wie W 912... | | | ... | Archaic Lu A | | LEX | | | 19 | | http://cdli.ucla.edu/P000006 | 62 | CDLI |
| P000007 | | | | 21198/zz001q0f16 | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 56x36x29; Lú A Vorläufer; Fundstelle wie W 912... | | | ... | Archaic Lu A | | LEX | | | 29 | | http://cdli.ucla.edu/P000007 | 36 | CDLI |
| P000008 | | | | 21198/zz001q0f2q | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 39x26x9; Unidentified 1; Pb XVII,1, +19.50 m, ... | | | ... | Archaic Unidentified | 1 | LEX | | | 9 | | http://cdli.ucla.edu/P000008 | 26 | CDLI |
| P000009 | | | | 21198/zz001q0f37 | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 54x46x?; Lú A 95-98.111-113; Fundstelle wie W ... | | | ... | Archaic Lu A | | LEX | | | | | http://cdli.ucla.edu/P000009 | 46 | CDLI |
| P000010 | | | | 21198/zz001q0f4r | Englund, Robert K. | 20130125.atf | Englund, Robert K. & Nissen, Hans J. | 23x25x19; Officials 16-18.66-68.?-?; Fundstell... | | | ... | Archaic Officials | | LEX | | | 19 | | http://cdli.ucla.edu/P000010 | 25 | CDLI |


## Select Relevant Fields
[ORACC](http://oracc.org) catalogs may have custom fields. The only obligatory fields are `id_text` (the P, Q, or X number that identifies the text, for instance "P243546") and `designation` (the human-readable reference, for instance "VS 17, 012"). To find out which fields are available, list the column names of the DataFrame.

First display all available fields, then select the ones that are relevant for the task at hand. 

In [8]:
df.columns

Index(['accession_no', 'acquisition_history', 'archive', 'ark', 'atf_source',
       'atf_up', 'author', 'author_remarks', 'cdli_collation', 'cdli_comments',
       'citation', 'collection', 'collection_copyright',
       'condition_description', 'created_by', 'created_on', 'credits',
       'date_entered', 'date_of_origin', 'date_remarks', 'date_updated',
       'dates_referenced', 'db_source', 'designation',
       'electronic_publication', 'elevation', 'excavation_no', 'external_id',
       'findspot_remarks', 'findspot_square', 'genre',
       'google_earth_collection', 'google_earth_provenience', 'height',
       'id_composite', 'id_text', 'images', 'join_information', 'keywords',
       'langs', 'language', 'last_modified_by', 'last_modified_on',
       'lineart_up', 'material', 'museum_no', 'notes', 'object_preservation',
       'object_remarks', 'object_type', 'other_names', 'period',
       'period_remarks', 'photo_up', 'place', 'primary_edition',
       'primary_publication',

In [11]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no', 'project', 'id_text']]
df1

| | designation | period | provenience | museum_no | project | id_text |
|---|---|---|---|---|---|---|
| P000001 | W 06435,a | Uruk III | Uruk | VAT 01533 | dcclt | P000001 |
| P000002 | W 06435,b | Uruk III | Uruk | VAT 15263 | dcclt | P000002 |
| P000003 | W 09123,d | Uruk IV | Uruk | VAT 15253 | dcclt | P000003 |
| P000004 | W 09169,d | Uruk IV | Uruk | VAT 15168 | dcclt | P000004 |
| P000005 | W 09206,k | Uruk IV | Uruk | VAT 15153 | dcclt | P000005 |
| P000006 | W 09656,h1 | Uruk IV | Uruk | VAT 15003 | dcclt | P000006 |
| P000007 | W 09656,x | Uruk IV | Uruk | VAT 15111 | dcclt | P000007 |
| P000008 | W 11985,e | Uruk III | Uruk | VAT 17684 | dcclt | P000008 |
| P000009 | W 11985,f | Uruk III | Uruk | VAT 17702 | dcclt | P000009 |
| P000010 | W 11985,g | Uruk III | Uruk | VAT 17709 | dcclt | P000010 |
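Because catalogs differ between projects, a field requested in the selection may be absent, in which case selecting with a list of column names raises a `KeyError`. One possible safeguard, sketched here with a made-up stand-in for `list(df.columns)`, is to check the wanted fields against the available ones first.

```python
# Stand-in for list(df.columns); an assumption for the sake of the example.
columns = ['designation', 'period', 'museum_no', 'id_text']
wanted = ['designation', 'period', 'provenience', 'museum_no', 'project', 'id_text']

available = [f for f in wanted if f in columns]      # fields that can be selected
missing = [f for f in wanted if f not in columns]    # fields this catalog lacks

print("available:", available)
print("missing:  ", missing)
```

In the notebook one could then select with `df[available]` instead of the hard-coded list.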


## Manipulate
The Dataframe may now be manipulated with standard Pandas methods. The example code selects the texts from Ur.
> `ur = df1[df1.provenience == "Ur"]`

## Save
Save the resulting data set as a `csv` file. `UTF-8` is the encoding with the widest support in text analysis (and it is also the encoding used by [ORACC](http://oracc.org)). If you intend to open the catalog file in Excel, however, `utf-16` encoding is the better choice.

In [12]:
filename = 'catalog.csv'
with open('output/' + filename, 'w', encoding='utf-8', newline='') as w:  # newline='' avoids blank rows on Windows
    df1.to_csv(w, index=False)
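For the Excel case, a possible variant is sketched below. The file name `catalog_excel.csv`, the tab separator, and the one-row DataFrame are assumptions made for the example; tab-separated `utf-16` is a combination that Excel tends to open without garbling special characters, but this is worth checking with your own version. The example writes to a temporary directory and reads the file back to confirm that the round trip preserves the text.

```python
import os
import tempfile
import pandas as pd

df_demo = pd.DataFrame({"id_text": ["P000001"], "designation": ["Lú A example"]})

with tempfile.TemporaryDirectory() as tmp:               # throwaway folder for the demo
    path = os.path.join(tmp, "catalog_excel.csv")        # hypothetical file name
    # write tab-separated UTF-16, letting pandas open the file itself
    df_demo.to_csv(path, sep="\t", index=False, encoding="utf-16")
    # read the file back with the same settings
    back = pd.read_csv(path, sep="\t", encoding="utf-16")

print(back)
```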