In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os
import objectpath

# TODO
There may be an easier way to grab `ref` and `f` in a single go and zip them into a single dictionary.
```python
result = tree.execute("[$..cdl[@.node is 'l'].ref, $..'f'.cf, $..'f'.gw, $..'f'.pos, $..'f'.lang]")
re = [list(r) for r in result]
re_df = pd.DataFrame(re)
```
No, this does not work: some `f` nodes have no `cf` or `gw`, so the resulting lists differ in length and cannot be aligned.
Document why `f` needs quotation marks in objectpath.




## 0 Create Directories, if Necessary
This script needs two directories, `jsonzip` and `output`. Each one is created if it does not yet exist; otherwise nothing happens.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # ignore "already exists", re-raise anything else
            raise
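
On Python 3.2 and later the same effect can probably be had more briefly with `os.makedirs` and its `exist_ok` flag. The sketch below is only an illustration: it works inside a temporary directory (an assumption made so that the example does not touch the working directory), whereas the notebook creates the directories in place.

```python
import os
import tempfile

# Throwaway base directory, used only for this example.
base = tempfile.mkdtemp()

directories = ['jsonzip', 'output']
for d in directories:
    # exist_ok=True turns the call into a no-op if the directory already exists
    os.makedirs(os.path.join(base, d), exist_ok=True)

# Running the loop a second time raises no error.
for d in directories:
    os.makedirs(os.path.join(base, d), exist_ok=True)

created = sorted(os.listdir(base))
```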

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Subprojects must be listed separately, because they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [3]:
projects = input('Project(s): ').lower()

Project(s): saao/saa01, hbtin


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [4]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
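
The two steps above can also be wrapped in a small helper. This is a sketch only; the name `parse_projects` is hypothetical, and dropping empty entries (for instance after a trailing comma) is an addition that the cell above does not make.

```python
def parse_projects(raw):
    """Split a comma-separated string into a list of clean project names."""
    names = [name.strip().lower() for name in raw.split(',')]
    # Drop empty entries, e.g. those produced by a trailing comma
    return [name for name in names if name]

p = parse_projects('saao/saa01, hbtin,')
```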

## 1.3 Download the ZIP files
For each project in the list, download the `zip` file that holds all its `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`).
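
The naming rule can be sketched as a small function. The helper name `zip_locations` is hypothetical; the base URL and the `jsonzip` directory are the ones used in this notebook.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"

def zip_locations(project):
    """Return (download URL, local file path) for a project or subproject."""
    # A subproject such as 'cams/gkab' is stored as 'cams-gkab.zip'
    name = project.replace('/', '-')
    return BASE_URL + name + '.zip', 'jsonzip/' + name + '.zip'

url, path = zip_locations('cams/gkab')
```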

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time, and it may be necessary to download the file in chunks. The `iter_content()` function in the `requests` library takes care of that. For the chunking code see [this page](https://www.smallsurething.com/how-to-read-a-file-properly-in-python/).

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section ...

In [5]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    with requests.get(url, stream=True) as r:   # stream=True: fetch the body in chunks, not all at once
        if r.status_code == 200:
            print("Downloading " + url + " saving as " + file)
            with open(file, 'wb') as f:
                for c in r.iter_content(chunk_size=CHUNK):
                    f.write(c)
        else:
            print(url + " does not exist.")

  0%|                                                    | 0/2 [00:00<?, ?it/s]

Downloading http://build-oracc.museum.upenn.edu/json/saao-saa01.zip saving as jsonzip/saao-saa01.zip


 50%|██████████████████████                      | 1/2 [00:02<00:02,  2.93s/it]

Downloading http://build-oracc.museum.upenn.edu/json/hbtin.zip saving as jsonzip/hbtin.zip


100%|████████████████████████████████████████████| 2/2 [00:07<00:00,  3.41s/it]


## 2.2 Extract Data from `JSON` files
The code in this cell iterates through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip` and is named `PROJECT.zip`. The function `namelist()` (from the `zipfile` package) returns all file names in a `zip` object. The files we need are in the directory `corpusjson`. The list of names is reduced to the names in that directory that end in `.json` (this way, the directory itself is omitted). The resulting list is used to parse each individual text file.

Each of these files is read from the `zip` file and parsed with `json.loads()`, which turns the JSON text into a Python object: nested dictionaries (keys and values) and lists.
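
The filtering and parsing steps can be tried out on a tiny archive built in memory. This is a minimal sketch: the file names and contents are invented for illustration and real archives hold many more files.

```python
import io
import json
import zipfile

# Build a small ZIP archive in memory (hypothetical names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('demo/corpusjson/', '')                       # the directory entry itself
    z.writestr('demo/corpusjson/P000001.json', json.dumps({'cdl': []}))
    z.writestr('demo/metadata.json', json.dumps({}))         # not in corpusjson

with zipfile.ZipFile(buffer) as z:
    # Keep only the .json files inside corpusjson; the directory entry drops out
    names = [n for n in z.namelist()
             if 'corpusjson' in n and n.endswith('.json')]
    # Read, decode, and parse each remaining file
    texts = {n: json.loads(z.read(n).decode('utf-8')) for n in names}
```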

This JSON object is parsed with the `objectpath` package. Three different nodes are extracted:
- 'f' This node contains all the lemmatization information, plus the grapheme data. Currently the code extracts only the fields Citation Form, Guide Word, Part of Speech, and Language, but other fields may be extracted in the same way.

- 'ref' This node contains the reference of a single word (the word lemmatized in 'f'). Word references have the format P######.[line].[word], e.g. P245342.12.1 (the first word of the twelfth line). The project name is added to each word reference.

- 'label' This node contains line numbers in the format 'o ii 6', plus a line reference. The line references have the format P######.[line], for instance P245342.12. Note that the line number is an abstract internal line number that does not correspond to the line numbers on the tablet (breaks, columns, and horizontal rulings also receive line numbers).

These data are collected in the lists `lemm_l`, `ref_l`, and `label_l`.
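
To make the three nodes more concrete, here is a plain-Python sketch that walks a miniature text and collects the same information. It is a stand-in for the `objectpath` queries, not the method used in this notebook. The sample data are hypothetical, and their shape (including the value `'d'` for the `node` key of the line-start entry) is an assumption based on the queries used below. Because missing fields are filled with the empty string, each reference stays aligned with its lemmatization data.

```python
def walk(node):
    """Yield every dictionary nested anywhere inside `node`."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

# Hypothetical miniature of one text file; real files are much larger.
sample = {
    'cdl': [
        {'node': 'd', 'type': 'line-start', 'ref': 'P245342.12', 'label': 'o ii 6'},
        {'node': 'l', 'ref': 'P245342.12.1',
         'f': {'lang': 'akk', 'form': 'is-qu', 'cf': 'isqu', 'gw': 'share', 'pos': 'N'}},
        {'node': 'l', 'ref': 'P245342.12.2',
         'f': {'lang': 'akk', 'form': 'x'}},   # not lemmatized: no cf, gw, or pos
    ]
}

FIELDS = ('cf', 'gw', 'pos', 'lang', 'form')
words = []
labels = []
for d in walk(sample):
    if d.get('node') == 'l' and 'f' in d:
        # One row per word: the reference plus the lemmatization fields
        row = {'ref': d['ref']}
        row.update({k: d['f'].get(k, '') for k in FIELDS})
        words.append(row)
    elif d.get('type') == 'line-start':
        labels.append({'ref': d['ref'], 'label': d['label']})
```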

### Note
In `objectpath` expressions quotation marks around a name are usually not necessary: the expressions `$..cdl` and `$..'cdl'` are equivalent (all `cdl` nodes that descend from the root). In some cases, however, a name has a special function. This is the case for `f`, which (without quotation marks) refers to boolean `False`. In the `JSON` files parsed here the `f` nodes hold all lemmatization data, so this `f` needs to be in quotation marks, as in
```python
result = tree.execute("$..'f'.(cf, gw, pos, lang, form)")
```


In [6]:
lemm_l = []
ref_l = []
label_l = []
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (FileNotFoundError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files in `corpusjson`, the directory that holds all the P, Q, and X numbers
    files = [name for name in files if "corpusjson" in name and name.endswith('.json')]
    for filename in files:                            # iterate over the file names
        try:
            text = z.read(filename).decode('utf-8')         # read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
        except (KeyError, ValueError, zipfile.BadZipFile):
            print(filename + ' is not available or not complete')
            continue                                    # skip this file instead of parsing stale data
        tree = objectpath.Tree(data_json)
        refs = list(tree.execute("$..cdl[@.node is 'l'].ref"))
        refs = [[ref , project] for ref in refs]    # add project name to each entry in the refs list
        ref_l.extend(refs)
        lemms = list(tree.execute("$..'f'.(cf, gw, pos, lang, form)"))
        lemm_l.extend(lemms)
        labels = list(tree.execute("$..cdl[@.type is 'line-start'].(ref, label)"))
        label_l.extend(labels)

Turn the three lists into DataFrames.

In [7]:
ref_df = pd.DataFrame(ref_l)
lemm_df = pd.DataFrame(lemm_l)
labels_df = pd.DataFrame(label_l)

Give `ref_df` proper column names.

In [8]:
ref_df.columns = ["ref", "project"]

In `ref_df` and `labels_df`, extract P number (text ID) and line number into separate fields. These are used for matching later on.

In [9]:
dataframes = [ref_df, labels_df]
for dataframe in dataframes: 
    dataframe['line'] = [int(ref.split('.')[1]) for ref in dataframe['ref']]
    dataframe["text_id"] = [ref[:7] for ref in dataframe["ref"]]
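
The extraction logic for a single reference can be sketched as a plain function. The name `split_ref` is hypothetical. It takes the part before the first dot as the text ID, which gives the same result as the slice `ref[:7]` for IDs of the form P######.

```python
def split_ref(ref):
    """Return (text_id, line) for a word or line reference."""
    parts = ref.split('.')
    # parts[0] is the text ID, parts[1] the internal line number
    return parts[0], int(parts[1])

word_ref = split_ref('P245342.12.1')   # word reference
line_ref = split_ref('P245342.12')     # line reference
```

Both kinds of reference yield the same pair, which is what makes the later match between words and labels possible.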

Join `ref_df` and `lemm_df`. Since both collect data on the word level, they are of the same length. Replace `NaN` ("Not a Number", that is, a missing value) with the empty string.

In [10]:
df = ref_df.join(lemm_df)
df = df.fillna('')

Create a `lemma` field by concatenating Citation Form, Guide Word, and Part of Speech. If there is no Citation Form (the word has not been lemmatized, for instance because it is broken), use Form instead, with `[NA]NA` as Guide Word and Part of Speech.

In [11]:
df["lemma"] = df.apply(lambda x: (x["cf"] + '[' + x["gw"] + ']' + x["pos"]) 
                            if x["cf"] != '' else x['form'] + '[NA]NA', axis=1)
#df['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in df['lemma'] ]
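
The rule applied in the lambda above can be written out as a standalone function for a single word. This is a sketch; the name `make_lemma` is hypothetical, and the sample values are taken from the output table at the end of this notebook.

```python
def make_lemma(cf, gw, pos, form):
    """Build a lemma string such as 'isqu[share]N' for one word."""
    if cf != '':
        # Lemmatized word: Citation Form, Guide Word, Part of Speech
        return cf + '[' + gw + ']' + pos
    # Unlemmatized word: fall back on the Form
    return form + '[NA]NA'

lemmatized = make_lemma('isqu', 'share', 'N', '')
unlemmatized = make_lemma('', '', '', 'x')
```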

Group data with the `groupby` function (from Pandas) by `project`, `text_id`, and `line`, so that all lemmas of a single line are in one row.

In [12]:
lines = df.groupby(['project', 'text_id', 'line']).agg({
        'lemma': ' '.join
    }).reset_index()

Add the labels by merging `labels_df` with `lines`.

In [13]:
new = pd.merge(lines, labels_df, how='left', on=['text_id', 'line'])

In [14]:
new

| | project | text_id | line | lemma | label | ref |
|---|---|---|---|---|---|---|
| 0 | hbtin | P235192 | 3 | isqu[share]N x[NA]NA marāqu[rub]V adi[until]PR... | o 1 | P235192.3 |
| 1 | hbtin | P235192 | 4 | šuāti[him]IP Anu-ah-ittannu[00]PN māru[son]N š... | o 2 | P235192.4 |
| 2 | hbtin | P235192 | 5 | mišlu[half]NU ša[of]DET ina[in]PRP ištēn[one]N... | o 3 | P235192.5 |
| 3 | hbtin | P235192 | 6 | 27-KAM₂[NA]NA ūmu[day]N 28-KAM₂[NA]NA ūmu[day]... | o 4 | P235192.6 |
| 4 | hbtin | P235192 | 7 | ūmu[day]N šunūti[them]IP isqu[share]N sīrāšûtu... | o 5 | P235192.7 |
| 5 | hbtin | P235192 | 8 | Nanaya[1]DN Beltu-ša-Reš[1]DN u[and]CNJ ilu[go... | o 6 | P235192.8 |
| 6 | hbtin | P235192 | 9 | šattu[year]N guqqû[(a monthly offering)]N eššē... | o 7 | P235192.9 |
| 7 | hbtin | P235192 | 10 | isqu[share]N šuāti[him]IP kašādu[reach]V ša[th... | o 8 | P235192.10 |
| 8 | hbtin | P235192 | 11 | māhirānu[receiver]N ūmu[day]N 30-KAM₂[NA]NA is... | o 9 | P235192.11 |
| 9 | hbtin | P235192 | 12 | isqu[share]N šuāti[him]IP māru[son]N ša[of]DET... | o 10 | P235192.12 |
