# 2.1.0 Download ORACC JSON Files

Each public [ORACC](http://oracc.org) project has a `zip` file that contains a collection of JSON files, which provide data on lemmatizations, transliterations, catalog data, indexes, etc. The `zip` file can be found at `https://oracc.museum.upenn.edu/json/[PROJECT].zip`, where `[PROJECT]` is replaced by the project abbreviation. For sub-projects the address is `https://oracc.museum.upenn.edu/json/[PROJECT]-[SUBPROJECT].zip`.

:::{note}
For instance, https://oracc.museum.upenn.edu/json/etcsri.zip or, for a sub-project, https://oracc.museum.upenn.edu/json/cams-gkab.zip.
::: 
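
The address pattern above can be sketched in a few lines of Python. The helper name `oracc_zip_url` is hypothetical (it is not part of the `utils` module), and the sketch assumes that sub-projects are written as `project/subproject`, the notation used later in this notebook.

```python
PENN_JSON = "https://oracc.museum.upenn.edu/json"  # base address given above

def oracc_zip_url(project):
    """Return the Penn address of the JSON zip for a project or sub-project.

    Hypothetical helper for illustration: "cams/gkab" -> ".../json/cams-gkab.zip".
    """
    # The slash between project and sub-project becomes a hyphen in the file name.
    name = project.strip().lower().replace("/", "-")
    return f"{PENN_JSON}/{name}.zip"

print(oracc_zip_url("etcsri"))
print(oracc_zip_url("cams/gkab"))
```

The two calls reproduce the example addresses given in the note above.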

One may download these files by hand (simply type the address in your browser), or use the code in the current notebook. The notebook will create a directory `jsonzip` and save the file in that directory; all further scripts will expect the `zip` files to reside in `jsonzip`.

:::{note}
One may also use the function `oracc_download()` in the `utils` module. See below (2.1.0.5) for instructions on how to use the `utils` module.
:::

```{figure} ../images/mocci_banner.jpg
:scale: 50%
:figclass: margin
```

Some [ORACC](http://oracc.org) projects are maintained by the Munich Open-access Cuneiform Corpus Initiative ([MOCCI](https://www.en.ag.geschichte.uni-muenchen.de/research/mocci/index.html)). These include, for example, Official Inscriptions of the Middle East in Antiquity ([OIMEA](http://oracc.org/oimea)) and various other projects and sub-projects. In theory, project data are copied from the Munich server to the Philadelphia ORACC server (and vice versa), but to get the most recent data set it is sometimes advisable to request the `zip` files directly from the Munich server. The address is `http://oracc.ub.uni-muenchen.de/[PROJECT]/[SUBPROJECT]/json/[PROJECT]-[SUBPROJECT].zip`.

:::{note}
The function `oracc_download()` in the `utils` module will try the various servers to find the project(s) of your choice.
:::
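
The idea of trying one server first and the other second can be sketched as follows. This is not the implementation of `oracc_download()`; the function name `candidate_urls` is hypothetical, the `server` values mirror those described in section 2.1.0.5, and the Munich address for a project without a sub-project is an assumption extrapolated from the pattern above. The project name is used only to illustrate the pattern.

```python
def candidate_urls(project, server="penn"):
    """List the addresses to try for one project, preferred server first.

    Hypothetical helper; only builds the addresses and does not contact any server.
    """
    name = project.replace("/", "-")          # e.g. saao/saa01 -> saao-saa01
    penn = f"https://oracc.museum.upenn.edu/json/{name}.zip"
    # Munich pattern from the text: [PROJECT]/[SUBPROJECT]/json/[PROJECT]-[SUBPROJECT].zip
    lmu = f"http://oracc.ub.uni-muenchen.de/{project}/json/{name}.zip"
    return [lmu, penn] if server == "lmu" else [penn, lmu]

for url in candidate_urls("saao/saa01", server="lmu"):
    print(url)
```

A downloader could then request each address in turn and stop at the first one that answers with status code 200.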

After downloading the JSON `zip` file you may unzip it to inspect its contents, but this is not necessary. For larger projects unzipping may result in hundreds or even thousands of files; the scripts will always read the data directly from the `zip` file.
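
Reading directly from a `zip` file can be done with the `zipfile` module of the Python standard library. The following is a minimal sketch, not the code used in later chapters: the helper name `read_json_member` is hypothetical, and the small archive, its member name, and its content are made up so that the example runs without a download.

```python
import json
import os
import tempfile
import zipfile

def read_json_member(zip_path, member):
    """Parse one JSON file inside a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as z:
        with z.open(member) as f:
            return json.load(f)

# Build a tiny stand-in archive; member name and content are invented for the demo.
demo_zip = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("demo/catalogue.json", json.dumps({"members": {"P000001": {}}}))

with zipfile.ZipFile(demo_zip) as z:
    names = z.namelist()          # list the archive contents without unzipping

data = read_json_member(demo_zip, "demo/catalogue.json")
print(names, list(data["members"]))
```

For a real project one would pass a path such as `jsonzip/saao-saa01.zip` and one of the names returned by `namelist()`.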

## 2.1.0.0 Import Packages

* requests: for communicating with a server over the internet
* tqdm: for creating progress bars
* os: for basic Operating System operations (such as creating a directory)
* ipywidgets: for user interface (input project names to be downloaded)

In [1]:
import requests
from tqdm.auto import tqdm
import os
import ipywidgets as widgets

## 2.1.0.1 Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

In [2]:
os.makedirs("jsonzip", exist_ok = True)

## 2.1.0.2 Input a List of Project Abbreviations
Enter one or more project abbreviations to download their JSON zip files. The project names are separated by commas. Note that sub-projects must be given explicitly; they are not automatically included in the main project. For instance:
* saao/saa01, aemw/alalakh/idrimi, rimanum

In [3]:
projects = widgets.Textarea(
    placeholder='Type project names, separated by commas',
    description='Projects:',
)
projects

Textarea(value='', description='Projects:', placeholder='Type project names, separated by commas')


## 2.1.0.3 Split the List of Projects
Convert the input to lower case and split it at the commas to create a list of project names.

In [4]:
project_list = projects.value.lower().split(',')   # split at each comma and make a list called `project_list`
project_list = [project.strip() for project in project_list if project.strip()]  # strip spaces left and right; drop empty entries
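
To see what this step does with untidy input, the same logic can be wrapped in a small function. The name `parse_projects` is hypothetical; the sketch also discards empty entries, which would otherwise arise from a trailing comma or an empty input field.

```python
def parse_projects(text):
    """Split a comma-separated string into clean, lower-case project names."""
    names = [name.strip() for name in text.lower().split(",")]
    return [name for name in names if name]   # drop empty entries

print(parse_projects("SAAo/saa01, aemw/alalakh/idrimi,rimanum, "))
# ['saao/saa01', 'aemw/alalakh/idrimi', 'rimanum']
```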

## 2.1.0.4 Download the ZIP files

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to download the file in chunks. The `iter_content()` method in the `requests` library takes care of that.

In order to show a progress bar (with `tqdm`) we need to know how large the file to be downloaded is (this value is then fed to the `total` parameter). The HTTP protocol provides a key `content-length` in the headers (a dictionary) that indicates file length. Not all servers provide this field; if `content-length` is not available, the total is set to 0. With a `total` value of 0, `tqdm` will show a bar and will count the number of bytes received, but it will not indicate the degree of progress.

In [5]:
CHUNK = 1024   # number of bytes read from the server per iteration
for project in project_list:
    proj = project.replace('/', '-')   # e.g. saao/saa01 becomes saao-saa01
    url = f"https://oracc.museum.upenn.edu/json/{proj}.zip"
    file = f'jsonzip/{proj}.zip'
    # note: verify=False skips TLS certificate verification
    with requests.get(url, stream=True, verify=False) as request:
        if request.status_code == 200:   # meaning that the file exists
            total_size = int(request.headers.get('content-length', 0))
            tqdm.write(f'Saving {url} as {file}')
            # the `with` statement closes the progress bar and the file when done
            with tqdm(total=total_size, unit='B', unit_scale=True, desc=project) as t, open(file, 'wb') as f:
                for c in request.iter_content(chunk_size=CHUNK):
                    f.write(c)
                    t.update(len(c))
        else:
            tqdm.write(f"WARNING: {url} could not be downloaded (status code {request.status_code}).")

Saving https://oracc.museum.upenn.edu/json/saao-saa01.zip as jsonzip/saao-saa01.zip
saao/saa01:   0%|          | 0.00/5.01M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa02.zip as jsonzip/saao-saa02.zip
saao/saa02:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa03.zip as jsonzip/saao-saa03.zip
saao/saa03:   0%|          | 0.00/4.29M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa04.zip as jsonzip/saao-saa04.zip
saao/saa04:   0%|          | 0.00/8.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa05.zip as jsonzip/saao-saa05.zip
saao/saa05:   0%|          | 0.00/4.88M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa06.zip as jsonzip/saao-saa06.zip
saao/saa06:   0%|          | 0.00/7.08M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa07.zip as jsonzip/saao-saa07.zip
saao/saa07:   0%|          | 0.00/3.81M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa08.zip as jsonzip/saao-saa08.zip
saao/saa08:   0%|          | 0.00/7.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa09.zip as jsonzip/saao-saa09.zip
saao/saa09:   0%|          | 0.00/773k [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa10.zip as jsonzip/saao-saa10.zip
saao/saa10:   0%|          | 0.00/8.70M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa11.zip as jsonzip/saao-saa11.zip
saao/saa11:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa12.zip as jsonzip/saao-saa12.zip
saao/saa12:   0%|          | 0.00/3.59M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa13.zip as jsonzip/saao-saa13.zip
saao/saa13:   0%|          | 0.00/3.91M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa14.zip as jsonzip/saao-saa14.zip
saao/saa14:   0%|          | 0.00/6.45M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa15.zip as jsonzip/saao-saa15.zip
saao/saa15:   0%|          | 0.00/5.78M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa16.zip as jsonzip/saao-saa16.zip
saao/saa16:   0%|          | 0.00/4.06M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa17.zip as jsonzip/saao-saa17.zip
saao/saa17:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa18.zip as jsonzip/saao-saa18.zip
saao/saa18:   0%|          | 0.00/4.71M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa19.zip as jsonzip/saao-saa19.zip
saao/saa19:   0%|          | 0.00/5.49M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa20.zip as jsonzip/saao-saa20.zip
saao/saa20:   0%|          | 0.00/4.57M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa21.zip as jsonzip/saao-saa21.zip
saao/saa21:   0%|          | 0.00/3.83M [00:00<?, ?B/s]
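
The chunking logic of the download cell can be tried out without a network connection. The sketch below is an illustration using only the standard library, not a replacement for `iter_content()` or `tqdm`: the function names `copy_in_chunks` and `percent_done` and the headers dictionary are made up for this example, and an in-memory byte stream stands in for the server response.

```python
import io

CHUNK = 1024  # same chunk size as in the download cell

def copy_in_chunks(source, target, chunk_size=CHUNK):
    """Copy `source` to `target` piece by piece; return (number of chunks, bytes written)."""
    chunks = written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:              # an empty read means the stream is exhausted
            break
        target.write(chunk)
        chunks += 1
        written += len(chunk)
    return chunks, written

def percent_done(written, total_size):
    """Return progress in percent, or None when the total size is unknown (0)."""
    return None if total_size == 0 else round(100 * written / total_size, 1)

headers = {"content-length": "2500"}              # hypothetical response headers
total = int(headers.get("content-length", 0))     # 0 if the key is missing
chunks, written = copy_in_chunks(io.BytesIO(b"x" * 2500), io.BytesIO())
print(chunks, written, percent_done(written, total))   # 3 2500 100.0
print(percent_done(written, 0))                        # None
```

With a known total, the fraction received can be computed at any point; with a total of 0, only the running byte count is available, which is why the progress bar cannot show a percentage in that case.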

## 2.1.0.5 Downloading with the utils Module
In Chapters 3-6, [ORACC](http://oracc.org) data are downloaded with the `oracc_download()` function of the `utils` module, which can be found in the `utils` directory. The following code illustrates how to use that function.

The function `oracc_download()` takes a list of project names as its first argument. Replace the line
```python
projects = ["dcclt", "saao/saa01"]
```
with the list of projects (and sub-projects) of your choice. 

The second (optional) argument is `server`; possible values are `"penn"` (default; try the Penn server first) and `"lmu"` (try the Munich server first). The `oracc_download()` function returns a cleaned list of projects with duplicates and non-existing projects removed. As the output below shows, the order of the returned list may differ from the order of the input.

In [6]:
import os
import sys
util_dir = os.path.abspath('../utils') # When necessary, adapt the path to the utils directory.
sys.path.append(util_dir)
import utils
directories = ["jsonzip"]
os.makedirs("jsonzip", exist_ok = True)
#projects = ["dcclt", "saao/saa01"] # or any list of ORACC projects
projects = [
    "saao/saa01", "saao/saa02", "saao/saa03", "saao/saa04", "saao/saa05",
    "saao/saa06", "saao/saa07", "saao/saa08", "saao/saa09", "saao/saa10",
    "saao/saa11", "saao/saa12", "saao/saa13", "saao/saa14", "saao/saa15",
    "saao/saa16", "saao/saa17", "saao/saa18", "saao/saa19", "saao/saa20",
    "saao/saa21",
]
utils.oracc_download(projects, server="penn")

Saving https://oracc.museum.upenn.edu/json/saao-saa07.zip as jsonzip/saao-saa07.zip.
saao/saa07:   0%|          | 0.00/3.81M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa20.zip as jsonzip/saao-saa20.zip.
saao/saa20:   0%|          | 0.00/4.57M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa21.zip as jsonzip/saao-saa21.zip.
saao/saa21:   0%|          | 0.00/3.83M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa04.zip as jsonzip/saao-saa04.zip.
saao/saa04:   0%|          | 0.00/8.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa16.zip as jsonzip/saao-saa16.zip.
saao/saa16:   0%|          | 0.00/4.06M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa19.zip as jsonzip/saao-saa19.zip.
saao/saa19:   0%|          | 0.00/5.49M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa08.zip as jsonzip/saao-saa08.zip.
saao/saa08:   0%|          | 0.00/7.21M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa17.zip as jsonzip/saao-saa17.zip.
saao/saa17:   0%|          | 0.00/4.52M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa05.zip as jsonzip/saao-saa05.zip.
saao/saa05:   0%|          | 0.00/4.88M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa06.zip as jsonzip/saao-saa06.zip.
saao/saa06:   0%|          | 0.00/7.08M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa10.zip as jsonzip/saao-saa10.zip.
saao/saa10:   0%|          | 0.00/8.70M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa09.zip as jsonzip/saao-saa09.zip.
saao/saa09:   0%|          | 0.00/773k [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa18.zip as jsonzip/saao-saa18.zip.
saao/saa18:   0%|          | 0.00/4.71M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa02.zip as jsonzip/saao-saa02.zip.
saao/saa02:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa12.zip as jsonzip/saao-saa12.zip.
saao/saa12:   0%|          | 0.00/3.59M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa13.zip as jsonzip/saao-saa13.zip.
saao/saa13:   0%|          | 0.00/3.91M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa11.zip as jsonzip/saao-saa11.zip.
saao/saa11:   0%|          | 0.00/2.34M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa01.zip as jsonzip/saao-saa01.zip.
saao/saa01:   0%|          | 0.00/5.01M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa14.zip as jsonzip/saao-saa14.zip.
saao/saa14:   0%|          | 0.00/6.45M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa15.zip as jsonzip/saao-saa15.zip.
saao/saa15:   0%|          | 0.00/5.78M [00:00<?, ?B/s]

Saving https://oracc.museum.upenn.edu/json/saao-saa03.zip as jsonzip/saao-saa03.zip.
saao/saa03:   0%|          | 0.00/4.29M [00:00<?, ?B/s]

['saao/saa07',
 'saao/saa20',
 'saao/saa21',
 'saao/saa04',
 'saao/saa16',
 'saao/saa19',
 'saao/saa08',
 'saao/saa17',
 'saao/saa05',
 'saao/saa06',
 'saao/saa10',
 'saao/saa09',
 'saao/saa18',
 'saao/saa02',
 'saao/saa12',
 'saao/saa13',
 'saao/saa11',
 'saao/saa01',
 'saao/saa14',
 'saao/saa15',
 'saao/saa03']
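
Because `oracc_download()` returns a cleaned list, it can be useful to check which requested names were dropped. The following sketch assumes only what is stated above, namely that the return value is a list of project names; the function name `dropped_projects` and the example lists are made up for illustration.

```python
def dropped_projects(requested, returned):
    """Return the requested project names that are absent from the cleaned list."""
    kept = set(returned)
    seen = set()
    dropped = []
    for name in requested:
        if name not in kept and name not in seen:   # report each missing name once
            dropped.append(name)
            seen.add(name)
    return dropped

requested = ["saao/saa01", "saao/saa01", "saao/nonexistent"]   # made-up request
returned = ["saao/saa01"]                                        # made-up return value
print(dropped_projects(requested, returned))   # ['saao/nonexistent']
```

In the notebook one would store the result of `utils.oracc_download(projects, server="penn")` in a variable and pass it as the second argument.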