# Retrieve Catalog Data from `catalogue.json`

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project (for general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page). The `zip` file that contains all JSON files of a particular project can be found at `http://oracc.museum.upenn.edu/[PROJECT]/json/[PROJECT].zip`. In the URL, replace `[PROJECT]` with your project name (e.g. `dcclt`). For subprojects the URL pattern is `http://oracc.museum.upenn.edu/[PROJECT]/[SUBPROJECT]/json/[PROJECT]-[SUBPROJECT].zip`.
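
The two URL patterns can be captured in a small helper. This is only a sketch: the function name `oracc_zip_url` is hypothetical and is not part of the `utils` module used below.

```python
def oracc_zip_url(project):
    """Build the zip URL for a project ('dcclt') or a subproject ('saao/saa01')."""
    project = project.strip().lower()
    zip_name = project.replace('/', '-')  # saao/saa01 -> saao-saa01
    return f"http://oracc.museum.upenn.edu/{project}/json/{zip_name}.zip"

print(oracc_zip_url('dcclt'))
print(oracc_zip_url('saao/saa01'))
```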

The main key in a `catalogue.json` file is called `members`. The value of this key holds every entry in the project catalog, with all of its fields. This notebook loads that information into a `pandas` DataFrame.
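
To make the role of `members` concrete, the sketch below parses a toy JSON string. The IDs and field values are invented for illustration, and a real `catalogue.json` has many more fields per entry.

```python
import json

# Toy stand-in for the contents of catalogue.json (invented values, two fields only)
json_string = '''
{
  "members": {
    "P000001": {"designation": "Text A", "period": "Old Babylonian"},
    "Q000002": {"designation": "Text B"}
  }
}
'''

catalogue = json.loads(json_string)   # parse the JSON string into a Python dictionary
members = catalogue['members']        # dictionary keyed by text ID

for id_text, fields in members.items():
    # entries need not share the same fields; fall back to '' when one is absent
    print(id_text, fields['designation'], fields.get('period', ''))
```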

In [5]:
import pandas as pd
import zipfile
import json
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
import utils

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. 

In [6]:
os.makedirs('jsonzip', exist_ok = True)
os.makedirs('output', exist_ok = True)

## 1.1 Input Project Names
We can download and manipulate multiple [ORACC](http://oracc.org) `zip` files at the same time. Note, however, that different [ORACC](http://oracc.org) projects use different fields in their catalogs; not all catalogs are mutually compatible.

Provide project abbreviations, separated by commas. Note that subprojects must be processed separately because they are not included in the main project. A subproject is named `[PROJECT]/[SUBPROJECT]`, for instance `saao/saa01`.

Split the list of projects and create a list of project names, using the `format_project_list()` function from the `utils` module.
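
The code of `format_project_list()` is not shown in this notebook. As a rough idea of what such a step might involve, the sketch below splits a comma-separated string into clean project names. The function `split_projects` is a hypothetical stand-in, not the actual `utils` implementation, which may behave differently.

```python
def split_projects(projects):
    """Split a comma-separated string into a list of clean project names."""
    names = [name.strip().lower() for name in projects.split(',')]
    return [name for name in names if name]  # drop empty entries, e.g. after a trailing comma

print(split_projects('dcclt, saao/saa01 ,obmc,'))
```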

In [7]:
projects = input('Project(s): ').lower().strip() # lowercase user input and strip accidental spaces
project_list = utils.format_project_list(projects)

Project(s):  obmc


## 1.2 Download the ZIP Files
Use the `oracc_download()` function from the `utils` module to download the requested projects. The code of this function is discussed in more detail in section 2.1.0, Download ORACC JSON Files. The function returns a new version of the project list, with duplicates and nonexistent projects removed.
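
The behavior described above (download each project once and drop the ones that cannot be fetched) might be sketched as follows. This is not the code of `oracc_download()`; the names `fetch_zip` and `download_projects` and the `fetch` parameter are hypothetical, and the URL follows the pattern given at the top of this notebook.

```python
import os
import shutil
import urllib.error
import urllib.request

def fetch_zip(project, directory='jsonzip'):
    """Download the zip of one project; raises URLError if that fails."""
    name = project.replace('/', '-')
    url = f"http://oracc.museum.upenn.edu/{project}/json/{name}.zip"
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, name + '.zip')
    with urllib.request.urlopen(url) as response, open(target, 'wb') as out:
        shutil.copyfileobj(response, out)  # stream the response to disk
    return target

def download_projects(project_list, fetch=fetch_zip):
    """Return the projects that could be downloaded, without duplicates."""
    downloaded = []
    for project in dict.fromkeys(project_list):  # drops duplicates, keeps order
        try:
            fetch(project)
        except (urllib.error.URLError, OSError) as error:
            print(f"{project}: not downloaded ({error})")
            continue
        downloaded.append(project)
    return downloaded
```

Passing the download step in as the `fetch` argument is a design choice for this sketch: it lets the filtering logic be tried out with a stand-in function, without network access.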

In [8]:
project_list = utils.oracc_download(project_list)

Saving http://oracc.org/obmc/json/obmc.zip as jsonzip/obmc.zip.



## 2 Extract Catalogue Data from `JSON` files
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file, named `[PROJECT].zip` (or `[PROJECT]-[SUBPROJECT].zip` for a subproject), has been downloaded to the directory `jsonzip` (1.2). Each of these `zip` files includes a file called `catalogue.json`. This file is read in and loaded with the command `json.loads()`, which parses a JSON string into a Python object, in this case a dictionary of names and values.

The `members` dictionary is transformed into a `pandas` DataFrame, after the project name has been added to each entry as a separate field. By default, when reading in a dictionary, the `DataFrame()` function will take the top-level keys (in this case the text IDs) as columns. The function `DataFrame.from_dict()` with `orient="index"` uses those keys as row labels instead (the same result as transposing with `.T`), so that the P, Q, and X numbers become indexes or row names, and each column represents a field in the catalog. The individual dataframes (one for each project requested) are concatenated. Since individual [ORACC](http://oracc.org) project catalogs may have different fields, the dataframes may have different column names. By default `pandas` concatenation uses an `outer join` so that all column names of all the catalogs are preserved; fields that a catalog lacks are filled with `NaN`.
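
Both steps, turning a dictionary keyed by text ID into rows and concatenating catalogs with different fields, can be tried out on toy data before running the real cell. The IDs and values below are invented for illustration.

```python
import pandas as pd

# Toy catalogs with invented values; the two projects share only one field
cat_a = {'P000001': {'designation': 'Text A', 'period': 'Old Babylonian'}}
cat_b = {'Q000002': {'designation': 'Text B', 'genre': 'Lexical'}}

# orient='index' uses the text IDs as row labels ...
df_a = pd.DataFrame.from_dict(cat_a, orient='index')
# ... which holds the same data as building column-wise and transposing
df_a_transposed = pd.DataFrame(cat_a).T

df_b = pd.DataFrame.from_dict(cat_b, orient='index')

# outer join (the default): the result has the union of all columns,
# and fields missing from one catalog are filled with NaN
combined = pd.concat([df_a, df_b], sort=True)
print(combined)
```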

In [9]:
frames = []  # collects one dataframe per project
for project in project_list:
    file = f"jsonzip/{project.replace('/', '-')}.zip"
    try:
        with zipfile.ZipFile(file) as zip_file:  # the zip file is closed automatically
            # read and decode the catalogue.json file of one project; the result is a string
            json_cat_string = zip_file.read(f"{project}/catalogue.json").decode('utf-8')
    except (OSError, zipfile.BadZipFile, KeyError) as error:
        # missing or damaged zip file, or no catalogue.json in the archive
        print(file, type(error), error)
        continue
    cat = json.loads(json_cat_string)
    cat = cat['members']  # select the 'members' node
    for item in cat.values():
        item["project"] = project  # add project name as separate field
    frames.append(pd.DataFrame.from_dict(cat, orient="index"))
# sort=True orders the columns alphabetically; fields missing from a catalog become NaN
df = pd.concat(frames, sort=True) if frames else pd.DataFrame()
df

Unnamed: 0,accession_no,acquisition_history,author,citation,collection,designation,electronic_publication,excavation_no,findspot_square,genre,...,publication_date,publication_history,subgenre,subgenre_remarks,supergenre,surface_preservation,text_remarks,trans,uri,xproject
P200931,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 015 O.118,,,,School,...,2000,"Speleers, Louis, RIAA (1925) 047",Type II Tablet,Obv: model contract; rev. unreadable,LIT,,,[en],http://cdli.ucla.edu/P200931,CDLI
P200932,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 015 O.119,,,,School,...,2000,"Speleers, Louis, RIAA (1925) 049",Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],http://cdli.ucla.edu/P200932,CDLI
P200933,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 016 O.120,,,,School,...,2000,"Speleers, Louis, RIAA 045",Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],http://cdli.ucla.edu/P200933,CDLI
P227953,,,"Chiera, Edward",,University of Pennsylvania Museum of Archaeolo...,"OIP 011, 105 + 109 + 148",,,,School,...,1929,"MSL 12, 029 L + 031 Y'; MSL 05, 169 (on OIP 01...",Type II Tablet,Obv: model contract; rev: OB Nippur Lu,LIT,,Sale of a field,,http://cdli.ucla.edu/P227953,CDLI
P227955,,,"Chiera, Edward",,University of Pennsylvania Museum of Archaeolo...,"OIP 011, 030",,,,School,...,1929,"MSL 13, 092 B",Type II Tablet,Obv: model contract; rev: Nigga,LIT,,Silver loans,,http://cdli.ucla.edu/P227955,CDLI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X000028,,ex OIM A33584,,,"National Museum of Iraq, Baghdad, Iraq",IM –,,3N-T0922hh,,School,...,,,Type I Tablet,Model contracts,LIT,,Barley loans,[en],,
P255081,,,,,University of Pennsylvania Museum of Archaeolo...,UM 29-15-617,,,,School,...,,,Type I or II Tablet,Obv: model contracts; rev: not preserved,LIT,,"Manumission of a female slave; cf. TMH 11, 01,...",[en],http://cdli.ucla.edu/P255081,CDLI
P276764,,,,,University of Pennsylvania Museum of Archaeolo...,N 1640,,,,School,...,,,,Model contracts,LIT,,Sales of real estates,,http://cdli.ucla.edu/P276764,CDLI
X000007,,,"Isma'el, Khalid Salim",,"National Museum of Iraq, Baghdad",Edubba 9 29,,,,School,...,2007,,Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],,


## 3 Clean the Dataframe
The method `fillna('')` replaces every missing value (`NaN`) with an empty string, so that all fields without an entry are blank.
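
One reason to do this, shown here on toy data with invented values, is that missing values propagate through string operations such as joining two fields, whereas empty strings do not.

```python
import pandas as pd

# Toy data with invented values; None stands for a field without an entry
toy = pd.DataFrame({'designation': ['Text A', 'Text B'],
                    'museum_no': ['CBS 00001', None]},
                   index=['P000001', 'P000002'])

# joining fields before cleaning: the missing value makes the whole result missing
label_before = toy['designation'] + ' = ' + toy['museum_no']

toy = toy.fillna('')
# after cleaning, every result is a string
label_after = toy['designation'] + ' = ' + toy['museum_no']

print(label_before.isna().tolist())
print(label_after.tolist())
```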

In [10]:
df = df.fillna('')
df

Unnamed: 0,accession_no,acquisition_history,author,citation,collection,designation,electronic_publication,excavation_no,findspot_square,genre,...,publication_date,publication_history,subgenre,subgenre_remarks,supergenre,surface_preservation,text_remarks,trans,uri,xproject
P200931,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 015 O.118,,,,School,...,2000,"Speleers, Louis, RIAA (1925) 047",Type II Tablet,Obv: model contract; rev. unreadable,LIT,,,[en],http://cdli.ucla.edu/P200931,CDLI
P200932,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 015 O.119,,,,School,...,2000,"Speleers, Louis, RIAA (1925) 049",Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],http://cdli.ucla.edu/P200932,CDLI
P200933,,,"Limet, Henri",,"Musées royaux d'Art et d'Histoire, Brussels, B...",Akkadica 117 016 O.120,,,,School,...,2000,"Speleers, Louis, RIAA 045",Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],http://cdli.ucla.edu/P200933,CDLI
P227953,,,"Chiera, Edward",,University of Pennsylvania Museum of Archaeolo...,"OIP 011, 105 + 109 + 148",,,,School,...,1929,"MSL 12, 029 L + 031 Y'; MSL 05, 169 (on OIP 01...",Type II Tablet,Obv: model contract; rev: OB Nippur Lu,LIT,,Sale of a field,,http://cdli.ucla.edu/P227953,CDLI
P227955,,,"Chiera, Edward",,University of Pennsylvania Museum of Archaeolo...,"OIP 011, 030",,,,School,...,1929,"MSL 13, 092 B",Type II Tablet,Obv: model contract; rev: Nigga,LIT,,Silver loans,,http://cdli.ucla.edu/P227955,CDLI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X000028,,ex OIM A33584,,,"National Museum of Iraq, Baghdad, Iraq",IM –,,3N-T0922hh,,School,...,,,Type I Tablet,Model contracts,LIT,,Barley loans,[en],,
P255081,,,,,University of Pennsylvania Museum of Archaeolo...,UM 29-15-617,,,,School,...,,,Type I or II Tablet,Obv: model contracts; rev: not preserved,LIT,,"Manumission of a female slave; cf. TMH 11, 01,...",[en],http://cdli.ucla.edu/P255081,CDLI
P276764,,,,,University of Pennsylvania Museum of Archaeolo...,N 1640,,,,School,...,,,,Model contracts,LIT,,Sales of real estates,,http://cdli.ucla.edu/P276764,CDLI
X000007,,,"Isma'el, Khalid Salim",,"National Museum of Iraq, Baghdad",Edubba 9 29,,,,School,...,2007,,Type III Tablet,Model contract,LIT,,Manumission of a male slave; cf. Akkadica 117 ...,[en],,


## 4 Select Relevant Fields
[ORACC](http://oracc.org) catalogs may have custom fields; the only obligatory fields are `id_text` (the P, Q, or X number that identifies the text, for instance "P243546") and `designation` (the human-readable reference, for instance "VS 17, 012"). The example code below works with field names that are available in (almost) every catalog. Adjust the code to your data and your needs.

In [11]:
df1 = df[['designation', 'period', 'provenience',
        'museum_no', 'project', 'id_text']]
df1

Unnamed: 0,designation,period,provenience,museum_no,project,id_text
P200931,Akkadica 117 015 O.118,Old Babylonian,uncertain,MRAH O.0118,obmc,P200931
P200932,Akkadica 117 015 O.119,Old Babylonian,uncertain,MRAH O.0119,obmc,P200932
P200933,Akkadica 117 016 O.120,Old Babylonian,uncertain,MRAH O.120,obmc,P200933
P227953,"OIP 011, 105 + 109 + 148",Old Babylonian,Nippur,CBS 04803 + CBS 06591 + CBS 05962,obmc,P227953
P227955,"OIP 011, 030",Old Babylonian,Nippur,CBS 04805,obmc,P227955
...,...,...,...,...,...,...
X000028,IM –,Old Babylonian,Nippur,IM –,obmc,X000028
P255081,UM 29-15-617,Old Babylonian,Nippur,UM 29-15-617,obmc,P255081
P276764,N 1640,Old Babylonian,Nippur,N 1640,obmc,P276764
X000007,Edubba 9 29,Old Babylonian,Tulul Khattab,IM 092207,obmc,X000007
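
Selecting a field that a catalog does not have raises a `KeyError`. One way to guard against that, sketched here on toy data with invented values, is to check which fields are present and to use `reindex()`, which creates the absent columns instead of failing.

```python
import pandas as pd

# Toy catalog with invented values; 'provenience' and 'museum_no' are deliberately absent
toy = pd.DataFrame({'designation': ['Text A'], 'period': ['Old Babylonian'],
                    'id_text': ['P000001']}, index=['P000001'])

wanted = ['designation', 'period', 'provenience', 'museum_no', 'id_text']
missing = [field for field in wanted if field not in toy.columns]
print('Not in this catalog:', missing)

# reindex() keeps the requested column order and fills absent fields with ''
selection = toy.reindex(columns=wanted, fill_value='')
print(selection)
```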


## 5.1 Save as CSV
Save the resulting data set as a `csv` file. `UTF-8` encoding is the encoding with the widest support in text analysis (and also the encoding used by [ORACC](http://oracc.org)). If you intend to use the catalog file in Excel, however, it is better to use `utf-16` encoding.

In [None]:
filename = 'output/catalog.csv'
df1.to_csv(filename, index=False, encoding='utf-8')
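
When the `csv` file is read back in, `pandas` by default turns empty fields into `NaN` again. The sketch below, using toy data with invented values and a temporary directory, shows a round trip that keeps the blanks by passing `keep_default_na=False`.

```python
import os
import tempfile
import pandas as pd

# Toy data with invented values, including a non-ASCII character and an empty field
toy = pd.DataFrame({'designation': ['Text A', 'Text B'],
                    'collection': ['Musées royaux', '']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'catalog.csv')
    toy.to_csv(path, index=False, encoding='utf-8')
    # keep_default_na=False reads empty fields back as '' rather than NaN
    restored = pd.read_csv(path, encoding='utf-8', keep_default_na=False)

print(restored['collection'].tolist())
```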

## 5.2 Save with Pickle
One may pickle a dataframe either with the `pickle` library or directly from within `pandas` with the `to_pickle()` function. A pickled file preserves the data structure of the dataframe, which is an advantage over saving as `csv`. The pickle file is a binary file: when using the `pickle` library we must open the file with the `wb` (write binary) option and we cannot give an encoding, whereas `to_pickle()` takes care of opening the file itself. To open the pickled file one may use the `read_pickle()` function from the `pandas` library, as in:

```python
import pandas as pd
df = pd.read_pickle('output/catalog.p')
```

In [None]:
filename = "output/catalog.p"
df1.to_pickle(filename)
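
For comparison, the same round trip with the `pickle` library looks roughly like this. The sketch uses toy data with invented values and a temporary directory, so it does not touch the `output` directory.

```python
import os
import pickle
import tempfile
import pandas as pd

toy = pd.DataFrame({'designation': ['Text A']}, index=['P000001'])  # toy data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'catalog.p')
    with open(path, 'wb') as f:   # write binary, no encoding argument
        pickle.dump(toy, f)
    with open(path, 'rb') as f:   # read binary
        restored = pickle.load(f)

# the index (the text IDs) survives the round trip
print(restored.index.tolist())
```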