# 2.1.2 Retrieve ORACC Catalog Data

In this section we will download one or more [ORACC](http://oracc.org) projects, select the catalog data and display the catalog in a table. Each [ORACC](http://oracc.org) JSON `zip` file includes a file named `catalogue.json`. 

:::{margin}
For general information, see the [Oracc Open Data](http://oracc.org/doc/opendata) page.
:::

The file `catalogue.json` contains all the catalog data for an [ORACC](http://oracc.org) project. We will transform the JSON into a `pandas` dataframe. 

:::{note}
A dataframe is, essentially, a table in which each row represents an observation (in our case: a document) and each column represents an attribute (publication, museum number, etc.). 
:::

## 2.1.2.0 Load Packages
* pandas: data analysis and manipulation; dataframes
* ipywidgets: user interface (enter project names)
* zipfile: read data from a zipped file
* json: read a json object
* os: basic Operating System tasks (such as creating a directory)
* sys: change system parameters
* utils: compass-specific utilities (download files from ORACC, etc.)

In [1]:
import pandas as pd
import ipywidgets as widgets
import zipfile
import json
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
import utils

## 2.1.2.1 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. 

In [2]:
os.makedirs('jsonzip', exist_ok = True)
os.makedirs('output', exist_ok = True)

## 2.1.2.2 Input Project Names
We can download and manipulate multiple [ORACC](http://oracc.org) `zip` files at the same time. The `Textarea` widget provides a space for typing project abbreviations, separated by commas. The widget is assigned to the variable `projects`. The text entered in the `Textarea` widget can be retrieved as `projects.value`.

:::{warning}
Subprojects must be listed separately; they are not included in the main project. A subproject is named `[PROJECT]/[SUBPROJECT]`, for instance `saao/saa01`.
:::

In [3]:
projects = widgets.Textarea(
    value="obmc",
    placeholder='Type project names, separated by commas',
    description='Projects:',
)
projects

Textarea(value='obmc', description='Projects:', placeholder='Type project names, separated by commas')

## 2.1.2.3 Split the List of Projects and Download the ZIP Files
Use the `format_project_list()` and `oracc_download()` functions from the `utils` module to download the requested projects. The code of these functions is discussed in more detail in 2.1.0. Download ORACC JSON Files. The `oracc_download()` function returns a new version of the project list, with duplicates and non-existing projects removed.

In [4]:
project_list = utils.format_project_list(projects.value)
project_list = utils.oracc_download(project_list)

Saving http://build-oracc.museum.upenn.edu/json/obmc.zip as jsonzip/obmc.zip.
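
The splitting step can be sketched in a few lines. This is only an illustration under stated assumptions: `split_project_list()` is a hypothetical helper, not the actual `utils.format_project_list()`, and lowercasing the names is an assumption made here for the example.

```python
def split_project_list(text):
    """Hypothetical helper: turn a comma-separated string into a clean list."""
    # strip surrounding white space; lowercasing is an assumption of this sketch
    names = [name.strip().lower() for name in text.split(",")]
    # drop empty entries and duplicates while preserving the original order
    return list(dict.fromkeys(name for name in names if name))


example = split_project_list("obmc, SAAO/saa01, ,obmc")
print(example)

# the zip file of a subproject uses a hyphen instead of a slash
zip_names = [f"jsonzip/{name.replace('/', '-')}.zip" for name in example]
print(zip_names)
```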



## 2.1.2.4 Extract Catalogue Data from `JSON` files
The process begins by turning a `zip` file (for instance `obmc.zip`) into a `ZipFile` object that may be manipulated with the functions available in the `zipfile` library. This is done with the `zipfile.ZipFile()` function:

```python
import zipfile
file = "jsonzip/obmc.zip"    
# or: file = "jsonzip/dcclt-nineveh.zip"
zipfile_object = zipfile.ZipFile(file)
```

The `read()` method of the `ZipFile` object reads one particular file from the `zip` as bytes; `decode("utf-8")` turns those bytes into a string:

```python
string_object = zipfile_object.read("obmc/catalogue.json").decode("utf-8") 
# or: string_object = zipfile_object.read("dcclt/nineveh/catalogue.json").decode("utf-8")
```
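
If you are unsure where `catalogue.json` sits inside a `zip`, the `namelist()` method lists the archive's contents. The sketch below builds a small in-memory archive so that it runs without a download. The archive contents (including `other.json`) are made up for illustration and only imitate the subproject layout mentioned above.

```python
import io
import zipfile

# Build a tiny in-memory zip that imitates the layout of a subproject archive.
# The file contents are invented for this example.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("dcclt/nineveh/catalogue.json", '{"members": {}}')
    z.writestr("dcclt/nineveh/other.json", "{}")

# Re-open the archive for reading and look for the catalogue file
with zipfile.ZipFile(buffer) as z:
    catalogue_paths = [n for n in z.namelist() if n.endswith("catalogue.json")]
    string_object = z.read(catalogue_paths[0]).decode("utf-8")

print(catalogue_paths)
```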

The `json` library provides functions for reading (loading) or producing (dumping) JSON. Reading is done with a load function, which comes in two versions. Regular `json.load()` takes an open file object as its argument and parses the JSON it contains. In this case, however, we already have a string (extracted from `obmc.zip` with `read()` and `decode()`), and therefore we need `json.loads()`, which takes a string as its argument:

```python
import json
json_object = json.loads(string_object)
```
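
The difference between the two functions can be seen in a short, self-contained sketch. The JSON text and the text ID `P000001` are invented for illustration, and `io.StringIO` stands in for an open file.

```python
import io
import json

# A made-up miniature catalogue, for illustration only
text = '{"members": {"P000001": {"designation": "Example 1"}}}'

from_string = json.loads(text)            # loads: parse a string
from_file = json.load(io.StringIO(text))  # load: parse an open file-like object

# both calls produce the same Python dictionary
print(from_string == from_file)
```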

The variable `json_object` will now contain all the data in the `catalogue.json` file from the [OBMC](http://oracc.org/obmc) (Old Babylonian Model Contracts) project by Gabriella Spada. We may treat the variable `json_object` as a Python dictionary. The `catalogue.json` has various keys, including `type`, `project`, `source`, `license`, `license-url`, `more-info`, `UTC-timestamp`, `members`, and `summaries`. The key `members` is the only one that concerns us here, since it contains the actual catalog information. The value of the key `members` is itself a dictionary of dictionaries. Each of the keys in the top-level dictionary is a P, Q, or X-number (a text ID). The value of each of these keys is still another dictionary; each key in that dictionary is a field in the original catalog (`primary_publication`, `provenience`, `genre`, etc.). The dictionary of dictionaries under the key `members` may be transformed into a Pandas dataframe for ease of viewing and manipulation.

```python
import pandas as pd
cat = json_object["members"]
df = pd.DataFrame.from_dict(cat)
df
```

By default, the `DataFrame.from_dict()` function in the `pandas` library takes each key as a column. In this case the keys of `cat` are the P, Q, or X numbers (text IDs), so the catalog fields would become rows. To address that issue, we need to tell `DataFrame.from_dict()` explicitly that each key should be a row (`orient="index"`):

```python
df = pd.DataFrame.from_dict(cat, orient="index")
df
```
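
The effect of `orient` is easiest to see on a toy dictionary with the same shape as `members`. The text IDs and field values below are invented for illustration.

```python
import pandas as pd

# A made-up dictionary of dictionaries, shaped like the "members" node
toy = {
    "P000001": {"designation": "Example 1", "period": "Old Babylonian"},
    "P000002": {"designation": "Example 2", "period": "Ur III"},
}

by_column = pd.DataFrame.from_dict(toy)               # text IDs become columns
by_row = pd.DataFrame.from_dict(toy, orient="index")  # text IDs become rows

print(by_column.columns.tolist())
print(by_row.columns.tolist())
```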

We can put the code discussed above in a loop that iterates through the list of projects entered in 2.1.2.2. For each project the `JSON` zip file, named `[PROJECT].zip`, has been downloaded to the directory `jsonzip`. For a subproject the slash in the name is replaced by a hyphen, as in `dcclt-nineveh.zip`.

In the last step of the loop, the individual dataframes (one for each project requested) are concatenated. Since individual [ORACC](http://oracc.org) project catalogs may have different fields, the dataframes may have different column names. By default `pandas` concatenation uses an `outer join` so that all column names of all the catalogs are preserved.

:::{warning}
[ORACC](http://oracc.org) catalogs have two obligatory fields: `id_text` (the P, Q, or X number that identifies the text, for instance "P243546") and `designation` (the human-readable reference, for instance "MEE 04, 020"). Many projects use catalog fields that are derived from [CDLI](http://cdli.ucla.edu), such as `museum_no`, `primary_publication`, etc., but there is no uniformity. If you build a catalog from multiple projects you may need to manipulate the resulting dataframe to align the catalogs.
:::
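
A minimal sketch of what concatenation does when two catalogs have different fields. The two one-row catalogs are invented for illustration; `join="inner"` is shown only as a contrast to the default outer join.

```python
import pandas as pd

# Two made-up catalogs that share only the field "designation"
cat_a = pd.DataFrame.from_dict(
    {"P000001": {"designation": "Example 1", "museum_no": "X 1"}}, orient="index")
cat_b = pd.DataFrame.from_dict(
    {"Q000001": {"designation": "Example 2", "provenience": "Nippur"}}, orient="index")

combined = pd.concat([cat_a, cat_b], sort=True)   # default outer join: all fields kept
shared = pd.concat([cat_a, cat_b], join="inner")  # only fields present in both catalogs

print(combined.columns.tolist())
print(shared.columns.tolist())
```

In the outer join, a field that is absent from one catalog is filled with `NaN` for the rows of that catalog.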

In [5]:
df = pd.DataFrame() # create an empty dataframe
for project in project_list:
    file = f"jsonzip/{project.replace('/', '-')}.zip"
    try:
        zip_file = zipfile.ZipFile(file)       # create a ZipFile object
    except Exception as error:  # for instance: file missing or not a valid zip
        print(file, type(error), error, sep='\n')  # print error information
        continue
    try:
        with zip_file:  # closes the zip file, even if reading fails
            # read and decode the catalogue.json file of one project
            # the result is a string object
            json_cat_string = zip_file.read(f"{project}/catalogue.json").decode('utf-8')
    except Exception as error:  # for instance: no catalogue.json in the zip
        print(project, type(error), error, sep='\n')  # print error information
        continue
    cat = json.loads(json_cat_string)
    cat = cat['members']  # select the 'members' node 
    cat_df = pd.DataFrame.from_dict(cat, orient="index")
    cat_df["project"] = project  # add project name as separate field
    df = pd.concat([df, cat_df], sort=True)  # sort=True sorts the columns; useful when catalogs have different sets of fields
df

| | accession_no | author | citation | collection | designation | electronic_publication | excavation_no | findspot_square | genre | id_text | ... | publication_date | publication_history | subgenre | subgenre_remarks | supergenre | surface_preservation | text_remarks | trans | uri | xproject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P200931 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 015 O.118 | | | | School | P200931 | ... | 2000 | Speleers, Louis, RIAA (1925) 047 | Type II Tablet | Obv: model contract; rev. unreadable | LIT | | | | http://cdli.ucla.edu/P200931 | CDLI |
| P200932 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 015 O.119 | | | | School | P200932 | ... | 2000 | Speleers, Louis, RIAA (1925) 049 | Type III Tablet | Model contract | LIT | | Manumission of a male slave; cf. Akkadica 117 ... | [en] | http://cdli.ucla.edu/P200932 | CDLI |
| P200933 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 016 O.120 | | | | School | P200933 | ... | 2000 | Speleers, Louis, RIAA 045 | Type III Tablet | Model contract | LIT | | Manumission of a male slave; cf. Akkadica 117 ... | [en] | http://cdli.ucla.edu/P200933 | CDLI |
| P227953 | | Chiera, Edward | | University of Pennsylvania Museum of Archaeolo... | OIP 011, 105 + 109 + 148 | | | | School | P227953 | ... | 1929 | MSL 12, 029 L + 031 Y'; MSL 05, 169 (on OIP 01... | Type II Tablet | Obv: model contract; rev: OB Nippur Lu | LIT | | Sale of a field | | http://cdli.ucla.edu/P227953 | CDLI |
| P227955 | | Chiera, Edward | | University of Pennsylvania Museum of Archaeolo... | OIP 011, 030 | | | | School | P227955 | ... | 1929 | MSL 13, 092 B | Type II Tablet | Obv: model contract; rev: Nigga | LIT | | Silver loans | | http://cdli.ucla.edu/P227955 | CDLI |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| X000013 | | Bodine, Walter R. | | Yale Babylonian Collection, New Haven | YBC 12074 | | | | School | X000013 | ... | 2015 | | Type III Tablet | Model court case | LIT | | Public announcement of a lost seal (cf. NBC 07... | [en] | | |
| X000014 | | | | Oriental Institute, University of Chicago, Chi... | OIM A33295 | | 3N-T0910f | | School | X000014 | ... | | | | model contract | LIT | | | | | |
| P255081 | | | | University of Pennsylvania Museum of Archaeolo... | UM 29-15-617 | | | | School | P255081 | ... | | | Type I or II Tablet | Obv: model contracts; rev: not preserved | LIT | | Manumission of a female slave; cf. TMH 11, 01,... | [en] | http://cdli.ucla.edu/P255081 | CDLI |
| P276764 | | | | University of Pennsylvania Museum of Archaeolo... | N 1640 | | | | School | P276764 | ... | | | | Model contracts | LIT | | Sales of real estates | | http://cdli.ucla.edu/P276764 | CDLI |


## 2.1.2.5 Clean the Dataframe
The function `fillna('')` puts an empty string (instead of `NaN`) in all fields that have no entry.

:::{note}
NaN means "Not a Number" and is used for missing values. NaN is a special floating-point value (it is not equivalent to the string "NaN"!) and may cause a number of issues in manipulating the dataframe.
:::
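
A short sketch of why `NaN` can be awkward and what `fillna('')` does about it. The two-row dataframe is invented for illustration.

```python
import pandas as pd

nan = float("nan")
print(type(nan), nan == nan)  # NaN is a float and does not compare equal to itself

# A made-up catalog in which one provenience is missing
toy = pd.DataFrame({"designation": ["Example 1", "Example 2"],
                    "provenience": ["Nippur", None]})

n_missing = int(toy["provenience"].isna().sum())  # count the missing values
cleaned = toy.fillna("")                          # replace them by empty strings

print(n_missing)
print(cleaned)
```

After `fillna('')` every cell in a text column is a string, so string operations no longer need to treat missing values as a special case.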

In [6]:
df = df.fillna('')
df

| | accession_no | author | citation | collection | designation | electronic_publication | excavation_no | findspot_square | genre | id_text | ... | publication_date | publication_history | subgenre | subgenre_remarks | supergenre | surface_preservation | text_remarks | trans | uri | xproject |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P200931 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 015 O.118 | | | | School | P200931 | ... | 2000 | Speleers, Louis, RIAA (1925) 047 | Type II Tablet | Obv: model contract; rev. unreadable | LIT | | | | http://cdli.ucla.edu/P200931 | CDLI |
| P200932 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 015 O.119 | | | | School | P200932 | ... | 2000 | Speleers, Louis, RIAA (1925) 049 | Type III Tablet | Model contract | LIT | | Manumission of a male slave; cf. Akkadica 117 ... | [en] | http://cdli.ucla.edu/P200932 | CDLI |
| P200933 | | Limet, Henri | | Musées royaux d'Art et d'Histoire, Brussels, B... | Akkadica 117 016 O.120 | | | | School | P200933 | ... | 2000 | Speleers, Louis, RIAA 045 | Type III Tablet | Model contract | LIT | | Manumission of a male slave; cf. Akkadica 117 ... | [en] | http://cdli.ucla.edu/P200933 | CDLI |
| P227953 | | Chiera, Edward | | University of Pennsylvania Museum of Archaeolo... | OIP 011, 105 + 109 + 148 | | | | School | P227953 | ... | 1929 | MSL 12, 029 L + 031 Y'; MSL 05, 169 (on OIP 01... | Type II Tablet | Obv: model contract; rev: OB Nippur Lu | LIT | | Sale of a field | | http://cdli.ucla.edu/P227953 | CDLI |
| P227955 | | Chiera, Edward | | University of Pennsylvania Museum of Archaeolo... | OIP 011, 030 | | | | School | P227955 | ... | 1929 | MSL 13, 092 B | Type II Tablet | Obv: model contract; rev: Nigga | LIT | | Silver loans | | http://cdli.ucla.edu/P227955 | CDLI |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| X000013 | | Bodine, Walter R. | | Yale Babylonian Collection, New Haven | YBC 12074 | | | | School | X000013 | ... | 2015 | | Type III Tablet | Model court case | LIT | | Public announcement of a lost seal (cf. NBC 07... | [en] | | |
| X000014 | | | | Oriental Institute, University of Chicago, Chi... | OIM A33295 | | 3N-T0910f | | School | X000014 | ... | | | | model contract | LIT | | | | | |
| P255081 | | | | University of Pennsylvania Museum of Archaeolo... | UM 29-15-617 | | | | School | P255081 | ... | | | Type I or II Tablet | Obv: model contracts; rev: not preserved | LIT | | Manumission of a female slave; cf. TMH 11, 01,... | [en] | http://cdli.ucla.edu/P255081 | CDLI |
| P276764 | | | | University of Pennsylvania Museum of Archaeolo... | N 1640 | | | | School | P276764 | ... | | | | Model contracts | LIT | | Sales of real estates | | http://cdli.ucla.edu/P276764 | CDLI |


## 2.1.2.6 Select Relevant Fields
:::{margin}
Various introductions to Pandas may be found on the web or in [VanderPlas 2016](https://github.com/jakevdp/PythonDataScienceHandbook) and similar overviews.
:::

The Pandas library allows one to manipulate and slice a dataframe in many different ways. The example code below assigns to the variable `keep` a list of the most relevant fields (these are field names that are available in (almost) every [ORACC](http://oracc.org) catalog). The list `keep` is used to create a new dataframe, with only the relevant fields. Adjust the code to your data and your needs.
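
Selecting with a list of column names raises a `KeyError` when one of the names is not a column of the dataframe, which can happen with catalogs that lack a field. The sketch below shows two possible ways around that. The one-row dataframe is invented for illustration, and neither option is presented as the method used in this notebook.

```python
import pandas as pd

# A made-up catalog that lacks 'period', 'provenience' and 'museum_no'
toy = pd.DataFrame({"designation": ["Example 1"],
                    "id_text": ["P000001"],
                    "project": ["demo"]})
keep = ["designation", "period", "provenience",
        "museum_no", "project", "id_text"]

# Option 1: keep only the requested fields that actually exist
present = [field for field in keep if field in toy.columns]
subset = toy[present]

# Option 2: keep all requested fields; fields that do not exist are left blank
padded = toy.reindex(columns=keep, fill_value="")

print(subset.columns.tolist())
print(padded.columns.tolist())
```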

In [7]:
keep = ['designation', 'period', 'provenience',
        'museum_no', 'project', 'id_text']
df1 = df[keep]
df1

| | designation | period | provenience | museum_no | project | id_text |
|---|---|---|---|---|---|---|
| P200931 | Akkadica 117 015 O.118 | Old Babylonian | uncertain | MRAH O.0118 | obmc | P200931 |
| P200932 | Akkadica 117 015 O.119 | Old Babylonian | uncertain | MRAH O.0119 | obmc | P200932 |
| P200933 | Akkadica 117 016 O.120 | Old Babylonian | uncertain | MRAH O.120 | obmc | P200933 |
| P227953 | OIP 011, 105 + 109 + 148 | Old Babylonian | Nippur | CBS 04803 + CBS 06591 + CBS 05962 | obmc | P227953 |
| P227955 | OIP 011, 030 | Old Babylonian | Nippur | CBS 04805 | obmc | P227955 |
| ... | ... | ... | ... | ... | ... | ... |
| X000013 | YBC 12074 | Old Babylonian | uncertain | YBC 12074 | obmc | X000013 |
| X000014 | OIM A33295 | Old Babylonian | unclear | OIM A33295 | obmc | X000014 |
| P255081 | UM 29-15-617 | Old Babylonian | Nippur | UM 29-15-617 | obmc | P255081 |
| P276764 | N 1640 | Old Babylonian | Nippur | N 1640 | obmc | P276764 |


## 2.1.2.7 Save as CSV
:::{margin}
Character encoding is primarily relevant when reading from or writing to disk. See section 1.4.4.
:::
Save the resulting data set as a `csv` file. `UTF-8` encoding is the encoding with the widest support in text analysis and the standard encoding in Python. It is also the encoding used by [ORACC](http://oracc.org). 

:::{note}
If you intend to use the catalog file in Excel, it is better to use `utf-16` encoding.
:::

In [8]:
filename = 'output/catalog.csv'
df1.to_csv(filename, index=False, encoding='utf-8')
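
The `encoding` argument works the same way for other encodings. The sketch below writes a small invented dataframe as `utf-16` to a temporary directory and reads it back. The file name `catalog_utf16.csv` is hypothetical, and the same encoding must be passed when reading.

```python
import os
import tempfile

import pandas as pd

# A made-up one-row catalog with a non-ASCII character
toy = pd.DataFrame({"designation": ["Akkadica 117 015 O.118"],
                    "collection": ["Musées royaux d'Art et d'Histoire"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog_utf16.csv")  # hypothetical file name
    toy.to_csv(path, index=False, encoding="utf-16")
    # reading requires the same encoding that was used for writing
    restored = pd.read_csv(path, encoding="utf-16")

print(restored)
```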

## 2.1.2.8 Save with Pickle
One may pickle a dataframe either with the `pickle` library or directly from within the `pandas` library with the `to_pickle()` function. A pickled file preserves the data structure of the dataframe, which is an advantage over saving as `csv`. The pickle file is a binary file: when using the `pickle` library we must open the file with the `wb` (write binary) option and we cannot give an encoding. The `to_pickle()` function takes care of this for us.

:::{note}
To open the pickled file again:
```python
import pandas as pd
df = pd.read_pickle("output/catalog.p")
```
:::

In [9]:
filename = "output/catalog.p"
df1.to_pickle(filename)
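
For comparison, a minimal sketch of the same round trip with the `pickle` library, where the file has to be opened in binary mode explicitly. The dataframe and the field `n_lines` are invented for illustration, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

import pandas as pd

# A made-up catalog with one text column and one numeric column
toy = pd.DataFrame({"id_text": ["P000001"], "n_lines": [12]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "catalog.p")
    with open(path, "wb") as f:   # write binary: no encoding argument
        pickle.dump(toy, f)
    with open(path, "rb") as f:   # read binary
        restored = pickle.load(f)

# the numeric column is still numeric after the round trip
print(restored.dtypes)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.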