# Download ORACC JSON Files
This script downloads open data from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)) in `json` format. The JSON files are made available in a ZIP file. For a description of the various JSON files included in the ZIP see the [open data](http://oracc.museum.upenn.edu/doc/opendata) page on [ORACC](http://oracc.org). 

# 0. Import Packages

In [None]:
import requests
import tqdm
import os
import errno

# 1. Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [None]:
try:
    os.mkdir('jsonzip')            # create the download directory
except OSError as exc:
    if exc.errno != errno.EEXIST:  # ignore "already exists", re-raise anything else
        raise
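
On Python 3.2 and later, the same "create it unless it is already there" behaviour is available through `os.makedirs` with `exist_ok=True`. The following is a minimal sketch of that alternative; the helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
import os

def ensure_dir(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    # exist_ok=True suppresses the error raised for an existing directory
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_dir('jsonzip')` would then replace the `try`/`except` block above. Other errors, such as missing write permission, are still raised.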

# 2.1 Input a List of Project Abbreviations
Enter one or more project abbreviations, separated by commas, to download their JSON ZIP files. Subprojects must be given explicitly; they are not automatically included with the main project. For instance:
* saao/saa01, aemw/amarna, rimanum

In [None]:
projects = input('Project(s): ').lower().strip() # lowercase user input and strip accidental spaces

# 2.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [None]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
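
A trailing or doubled comma in the input leaves an empty string in the list, which would later be turned into a request for a file named `.zip`. The sketch below shows one possible way to guard against that; the function name `parse_projects` is an illustrative assumption, not part of the original notebook.

```python
def parse_projects(text):
    """Turn a comma-separated string into a clean list of project names."""
    # split at each comma, then strip spaces and lowercase every entry
    names = [name.strip().lower() for name in text.split(',')]
    # drop entries that are empty after stripping
    return [name for name in names if name]

print(parse_projects('saao/saa01, aemw/amarna, , Rimanum,'))
# → ['saao/saa01', 'aemw/amarna', 'rimanum']
```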

# 3. Download the ZIP Files
For each requested project, download the entire project (all the JSON files) from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance: `cams-gkab.zip`).
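
The naming rule described above can be sketched as two small functions. The names `BASE_URL`, `zip_name` and `zip_url` are hypothetical helpers introduced only for illustration.

```python
BASE_URL = 'http://build-oracc.museum.upenn.edu/json/'

def zip_name(project):
    """Map a project abbreviation to its ZIP file name."""
    # the slash between project and subproject becomes a hyphen
    return project.replace('/', '-') + '.zip'

def zip_url(project):
    """Build the download URL for a project or subproject."""
    return BASE_URL + zip_name(project)

print(zip_name('dcclt'))       # → dcclt.zip
print(zip_url('cams/gkab'))    # → http://build-oracc.museum.upenn.edu/json/cams-gkab.zip
```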

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time, and it is best to write the file to disk in chunks. The `iter_content()` method of the response object in the `requests` library yields the body in chunks; when the request is made with `stream=True`, the body is not loaded into memory all at once.

Although downloading the entire zip file is time-consuming, it makes processing the individual files much more efficient, and the code is less likely to break because of an interruption in connectivity.


In [None]:
CHUNK = 16 * 1024                          # write to disk in 16 KB chunks
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')    # subproject saao/saa01 becomes saao-saa01
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, stream=True)     # stream=True defers the body so it is read chunk by chunk
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in r.iter_content(chunk_size=CHUNK):
                f.write(c)
    else:
        print(url + " does not exist.")
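
After the loop finishes, it can be useful to confirm that each saved file really is a readable ZIP archive, since an interrupted download may leave a truncated file behind. The following is a minimal sketch using the standard-library `zipfile` module; the function name `json_members` is an assumption made for this example, and nothing is assumed about the file names inside the ORACC archives.

```python
import zipfile

def json_members(zip_path):
    """Return the names of the .json files in `zip_path`, or None if it is not a valid ZIP."""
    if not zipfile.is_zipfile(zip_path):
        return None
    with zipfile.ZipFile(zip_path) as z:
        # keep only the JSON files listed in the archive
        return [name for name in z.namelist() if name.endswith('.json')]
```

For instance, `json_members('jsonzip/dcclt.zip')` would return the list of JSON files in the DCCLT download, or `None` if the file is damaged.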