# Text Acquisition
CTAWG Febr 2019

Introduction to basic aspects of data acquisition and data transformation in Python 3, using:
- Requests
- ZipFile
- JSON
- pickle

The data to be downloaded come from the Open Richly Annotated Cuneiform Corpus ([ORACC](http://oracc.org)). [ORACC](http://oracc.org) data are made available in [JSON](https://www.json.org/) format; all [JSON](https://www.json.org/) files that belong to one [ORACC](http://oracc.org) project are collected in a ZIP file. The data are made available under a [Creative Commons Attribution Non-Commercial Sharealike](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode) license, so there is no need to worry about playing around with them!

We will work with the Old Babylonian Model Contracts ([OBMC](http://oracc.org/obmc)) data set, because this is relatively small and good for demonstration purposes. Old Babylonian model contracts are school texts that teach the proper format and formulary of contracts (house or slave sales, loans, etc.) in Sumerian, dated to approximately 1800 BCE.

## Text Acquisition 1: Download a ZIP

In [1]:
import pandas as pd   
import requests
import zipfile
import tqdm
import json
import os
import errno
import pickle
import re

# Create Download Directory
Create a directory called `jsonzip`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
try:
    os.mkdir('jsonzip')
except OSError as exc:
    # ignore "directory already exists"; re-raise any other error
    if exc.errno != errno.EEXIST:
        raise
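In Python 3.2 and later, the same "create if missing" behaviour is available through the `exist_ok` argument of `os.makedirs()`. The sketch below is a minimal illustration; the helper name `ensure_dir` and the use of a temporary directory are assumptions made for the demonstration, not part of the original notebook.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` if needed; do nothing if the directory already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated inside a temporary directory so the working directory stays clean
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, "jsonzip"))
ensure_dir(target)  # second call: no error, the existing directory is kept
```

In the notebook itself this would reduce to `os.makedirs('jsonzip', exist_ok=True)`.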

# Downloading, Chunking

The [requests](http://docs.python-requests.org/en/master/user/quickstart/) library is used for communicating with a server. The `get()` command in `requests` takes a URL as argument and returns a **Response** object.

Since the file we download is a ZIP file, it must be saved as a **binary** file. This is done with the argument `"wb"` in the `open()` command. 

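In case your file is very large, it may make sense to chunk it (if it is not, you can skip that part). The `iter_content()` method of the Response object takes care of that. For chunking to save memory, the request must be made with `stream=True`; otherwise `requests` loads the entire file into memory before the first chunk is written.

In [3]:
url = "http://build-oracc.museum.upenn.edu/json/obmc.zip"
file = "jsonzip/obmc.zip"

CHUNK = 16 * 1024

r = requests.get(url, stream=True)  # stream=True: fetch the body chunk by chunk
if r.status_code == 200:  # if the request is successful
    print("Downloading " + url + " saving as " + file)
    with open(file, 'wb') as f:
        for c in r.iter_content(chunk_size=CHUNK):
            f.write(c)
else:
    print(url + " could not be downloaded (status code " + str(r.status_code) + ").")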

Downloading http://build-oracc.museum.upenn.edu/json/obmc.zip saving as jsonzip/obmc.zip
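The download step can also be wrapped in a reusable function. The following is a sketch under stated assumptions: the names `write_chunks` and `download` and the 30-second timeout are illustrative choices, not part of the original notebook. The writing loop is kept separate so that it can be tried offline with made-up chunks.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=16 * 1024):
    """Stream `url` to `path` without holding the whole body in memory."""
    import requests  # third-party; imported here so write_chunks() works without it

    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        return write_chunks(r.iter_content(chunk_size=chunk_size), path)


# Offline demonstration: made-up chunks stand in for r.iter_content()
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = write_chunks([b"abc", b"", b"defg"], demo_path)
```

With a network connection, `download(url, file)` would play the role of the cell above.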


## Text Acquisition 2: Scrape
better ask Jaren ...

Page from the Ebla Digital Archives site [ebda 18](http://ebda.cnr.it/tablet/view/18).

In [4]:
from bs4 import BeautifulSoup

In [5]:
url1 = "http://ebda.cnr.it/tablet/view/18"
r1 = requests.get(url1)

In [6]:
r1.text



In [7]:
soup = BeautifulSoup(r1.text, "html.parser")
data = soup.find_all('a', attrs={'id' : re.compile(r".*")})
rawdata = [item.contents for item in data]

In [8]:
l = []
for i in rawdata:
    s = ""
    for item in i:
        s = s+ str(item).strip()
    l.append(s)
l

['Home',
 'Database',
 'Search',
 'Bibliography',
 'Corrections',
 'Words frequency',
 'Signs frequency',
 'r.1,1',
 '1',
 'aktum<sup>t</sup><sup>u</sup><sup>g</sup><sup>₂</sup><sup></sup><sup></sup>',
 '1',
 'ib₂-Ⅲ',
 'gun₃',
 '<sup>t</sup><sup>u</sup><sup>g</sup><sup>₂</sup><sup></sup><sup></sup>',
 'r.1,2',
 '2',
 '<i>gu</i>₂-<i>li</i>-<i>lum</i>',
 'a-gar₅-gar₅',
 'ku₃-gi',
 'r.1,3',
 '<i>gu</i>₂-<i>za</i>-<i>zi</i>',
 'r.1,4',
 'TUŠ.LU₂xTIL',
 'r.1,5',
 '2',
 'gu-dul₃<sup>t</sup><sup>u</sup><sup>g</sup><sup>₂</sup><sup></sup><sup></sup>',
 '2',
 'ib₂-Ⅲ',
 'gun₃',
 '<sup>t</sup><sup>u</sup><sup>g</sup><sup>₂</sup><sup></sup><sup></sup>',
 '4',
 '<i>gu</i>₂-<i>li</i>-<i>lum</i>',
 'a-gar₅-gar₅',
 'ku₃-gi-Ⅱ',
 'r.1,6',
 '<i>a</i>-<i>šur</i>ᵪ-<i>il</i>',
 'r.1,7',
 '<i>wa</i>',
 'r.1,8',
 '<i>i</i>-<i>da</i>-<i>ni</i>-<i>ki</i>-<i>mu</i>',
 'r.1,9',
 'ugula',
 'NI-<i>a</i>-NE-<i>in</i><sup>k</sup><sup>i</sup><sup></sup><sup></sup><sup></sup><sup></sup>',
 'r.1,10',
 '2',
 'aktum<sup>t
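The strings in this list still contain inline markup such as `<i>` and `<sup>`. If only the plain transliteration is wanted, one possible approach is to strip the tags with a regular expression, as sketched below (the helper name `strip_tags` is an assumption). Note that this throws away whatever the markup encoded, so it is only appropriate when that information is not needed; BeautifulSoup's own `get_text()` method is the more robust route when working directly on the parsed page.

```python
import re

TAG = re.compile(r"<[^>]+>")  # any <...> tag, opening or closing


def strip_tags(s):
    """Remove inline HTML tags, keeping the text between them."""
    return TAG.sub("", s)


# Three strings copied from the output above
samples = [
    "aktum<sup>t</sup><sup>u</sup><sup>g</sup><sup>₂</sup><sup></sup><sup></sup>",
    "<i>gu</i>₂-<i>li</i>-<i>lum</i>",
    "ku₃-gi",
]
cleaned = [strip_tags(s) for s in samples]
```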

## Text Acquisition 3: Click a Button

Some web sites provide an export option. Go to [BDTNS](http://bdtns.filol.csic.es/), open the drop-down menu "Catalogue and Transliterations", and choose "Search". In "Word and Grapheme Strings", search for "gu2-bir5" (including the double quotation marks). The word gu₂-bir₅ means "locust neck" and is (rarely) used as a qualification of bronze (presumably meaning: greenish). You will get a list of two texts, and clicking "Export" shows a list of export options.

# Back to ORACC: Inspect the ZIP File

The `ZipFile()` constructor in the [zipfile](https://docs.python.org/3/library/zipfile.html) library takes the name of a ZIP file as its argument and returns a ZipFile object, here called `z`. The method `namelist()` returns a list with the names of all the files in the ZIP.

To extract all the files from the ZIP file use `.extractall()` as in:
```python3
z = zipfile.ZipFile(file)
z.extractall()
```

In this demonstration we will not do so - we will read and manipulate the data from the files that we need directly from the ZIP file without saving the extracted files - thus saving some disk space.

In [9]:
z = zipfile.ZipFile(file)
z.namelist()

['obmc/',
 'obmc/gloss-qpn-x-months.json',
 'obmc/index-tra.json',
 'obmc/gloss-qpn-x-places.json',
 'obmc/index-qpn-x-temple.json',
 'obmc/index-qpn-x-months.json',
 'obmc/index-sux.json',
 'obmc/gloss-qpn-x-divine.json',
 'obmc/index-qpn-x-divine.json',
 'obmc/obmc-portal.json',
 'obmc/gloss-qpn-x-temple.json',
 'obmc/corpus.json',
 'obmc/index-qpn-x-places.json',
 'obmc/index-lem.json',
 'obmc/gloss-qpn-x-waters.json',
 'obmc/sortcodes.json',
 'obmc/corpusjson/',
 'obmc/corpusjson/P411563.json',
 'obmc/corpusjson/P230698.json',
 'obmc/corpusjson/P230726.json',
 'obmc/corpusjson/P230723.json',
 'obmc/corpusjson/P230664.json',
 'obmc/corpusjson/P251564.json',
 'obmc/corpusjson/P230801.json',
 'obmc/corpusjson/P229699.json',
 'obmc/corpusjson/P230144.json',
 'obmc/corpusjson/P230668.json',
 'obmc/corpusjson/P230753.json',
 'obmc/corpusjson/X000008.json',
 'obmc/corpusjson/P273797.json',
 'obmc/corpusjson/P230746.json',
 'obmc/corpusjson/P230734.json',
 'obmc/corpusjson/P230751.json',
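Because `namelist()` returns an ordinary Python list, it can be filtered like any other list, for instance to keep only the text editions. The sketch below builds a tiny in-memory ZIP with made-up member names (the `demo/` paths are assumptions that merely mimic the layout shown above), so it runs without the downloaded file.

```python
import io
import zipfile

# Build a small in-memory ZIP that mimics the directory layout of an ORACC project
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo/catalogue.json", "{}")
    zf.writestr("demo/corpusjson/P000001.json", "{}")
    zf.writestr("demo/corpusjson/P000002.json", "{}")

buffer.seek(0)  # rewind before reading the archive back
with zipfile.ZipFile(buffer) as zf:
    editions = [name for name in zf.namelist()
                if "/corpusjson/" in name and name.endswith(".json")]
```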
 

# The Files

The text data is contained in the files in the directory `obmc/corpusjson`. We will look at that at a later stage. The catalogue data is contained in the file `obmc/catalogue.json`. We will inspect this file.

The `ZipFile` method `read()` takes as its argument the name of a file in the ZIP archive and returns the contents of that file as `bytes`.

The `decode()` method turns those bytes into a string. Naming the encoding explicitly (here `utf8`) is a good habit, in particular on Windows machines: Windows has its own default encoding scheme for text files, which may wreck your data when a file is opened without an explicit encoding.
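A minimal sketch of the decoding step, using a made-up in-memory archive (the member name `demo/sample.json` is an assumption): the value returned by `read()` is kept both before and after `decode()` so that the two types can be compared.

```python
import io
import zipfile

content = '{"form": "šeg₁₂"}'

# Store the text as UTF-8 bytes in a throwaway in-memory ZIP
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo/sample.json", content.encode("utf-8"))

buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    raw = zf.read("demo/sample.json")  # raw bytes, exactly as stored
    text = raw.decode("utf-8")         # a Python string, non-ASCII signs restored
```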

In [10]:
cat_j = "obmc/catalogue.json"
st = z.read(cat_j).decode("utf8")
st

'{\n  "type": "catalogue",\n  "project": "obmc",\n  "source": "http://oracc.org/obmc",\n  "license": "This data is released under the CC0 license",\n  "license-url": "https://creativecommons.org/publicdomain/zero/1.0/",\n  "more-info": "http://oracc.org/doc/opendata/",\n  "UTC-timestamp": "2019-02-05T21:57:40",\n  "members": {\n    "P200931": {\n      "langs": "0x02000000",\n      "project": "obmc",\n      "author": "Limet, Henri",\n      "collection": "Musées royaux d\'Art et d\'Histoire, Brussels, Belgium",\n      "genre": "School",\n      "id_text": "P200931",\n      "language": "Sumerian",\n      "museum_no": "MRAH O.0118",\n      "object_preservation": "--",\n      "object_type": "tablet",\n      "period": "Old Babylonian",\n      "primary_publication": "Akkadica 117 015 O.118",\n      "provenience": "uncertain",\n      "publication_date": "2000",\n      "publication_history": "Speleers, Louis, RIAA (1925) 047",\n      "subgenre": "Type II Tablet",\n      "subgenre_remarks": "Obv:

# JSON

This is JSON data, and it becomes easier to read and manipulate if we transform it into a Python object with the `json` library. The function `load()` takes a file as argument; its sister function `loads()` is used for strings. Since decoding the output of the `read()` command in `zipfile` gave us a string, we use `loads()`.

In [11]:
cat = json.loads(st)
cat

{'type': 'catalogue',
 'project': 'obmc',
 'source': 'http://oracc.org/obmc',
 'license': 'This data is released under the CC0 license',
 'license-url': 'https://creativecommons.org/publicdomain/zero/1.0/',
 'more-info': 'http://oracc.org/doc/opendata/',
 'UTC-timestamp': '2019-02-05T21:57:40',
 'members': {'P200931': {'langs': '0x02000000',
   'project': 'obmc',
   'author': 'Limet, Henri',
   'collection': "Musées royaux d'Art et d'Histoire, Brussels, Belgium",
   'genre': 'School',
   'id_text': 'P200931',
   'language': 'Sumerian',
   'museum_no': 'MRAH O.0118',
   'object_preservation': '--',
   'object_type': 'tablet',
   'period': 'Old Babylonian',
   'primary_publication': 'Akkadica 117 015 O.118',
   'provenience': 'uncertain',
   'publication_date': '2000',
   'publication_history': 'Speleers, Louis, RIAA (1925) 047',
   'subgenre': 'Type II Tablet',
   'subgenre_remarks': 'Obv: model contract; rev. unreadable',
   'designation': 'Akkadica 117 015 O.118',
   'supergenre': 'LI
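The difference between `json.loads()` and `json.load()` can be seen in a small, self-contained sketch. The JSON text below is a made-up miniature of the catalogue, and `io.StringIO` stands in for an opened file.

```python
import io
import json

text = '{"type": "catalogue", "members": {"P200931": {"genre": "School"}}}'

from_string = json.loads(text)            # loads(): parse a string
from_file = json.load(io.StringIO(text))  # load(): parse a file-like object

genre = from_string["members"]["P200931"]["genre"]
```

Both calls yield the same Python dictionary; which one to use depends only on whether the data arrive as a string or as a file.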

# Structure of catalogue.json

Once parsed, a JSON file is, essentially, a Python dictionary with keys and values (in JSON-speak they are called names and values, but we will use the Python vocabulary).

It turns out that `catalogue.json` has a couple of top-level keys (`type`, `project`, `source`, `license`, etc.). The key `members` is the main top-level key: its value is a new dictionary in which each key is a unique text ID number (starting with P). Each of these keys, in turn, has as its value still another dictionary with all the catalogue information such as museum number, object_type, genre, etc.

```
{'type' : 'catalogue',
 'project': 'obmc',
 'members' : {
            'P200931' : {'museum_no' : 'MRAH O.0118', 'object_type' : 'tablet'},
            'P200932' : {'museum_no' : 'MRAH O.0119', 'object_type' : 'tablet'}
            }
}
```
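The nesting sketched above can be explored with ordinary dictionary operations. The following minimal example uses the same two-text miniature (the variable names are illustrative assumptions) to list the keys at each level and to pull a single field out of every catalogue entry.

```python
cat_demo = {
    "type": "catalogue",
    "project": "obmc",
    "members": {
        "P200931": {"museum_no": "MRAH O.0118", "object_type": "tablet"},
        "P200932": {"museum_no": "MRAH O.0119", "object_type": "tablet"},
    },
}

top_level_keys = list(cat_demo)       # keys of the outer dictionary
text_ids = list(cat_demo["members"])  # one key per text
# Map each text ID to its museum number
museum_no = {tid: entry["museum_no"] for tid, entry in cat_demo["members"].items()}
```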

Treating the JSON as a dictionary, we can simply select the key `members` and turn that into a DataFrame.

In [12]:
cat_m = cat['members']
cat_df = pd.DataFrame(cat_m)
cat_df

Unnamed: 0,P200931,P200932,P200933,P227953,P227955,P227962,P227972,P227988,P228140,P228267,...,X000004,X000005,X000006,X000007,X000008,X000009,X000010,X000011,X000012,X000013
accession_no,,,,,,,,,,,...,,,,,,,,,,
author,"Limet, Henri","Limet, Henri","Limet, Henri","Chiera, Edward","Chiera, Edward",,"Chiera, Edward","Chiera, Edward",,,...,,,,"Isma'el, Khalid Salim","Isma'el, Khalid Salim","Bodine, Walter R.",,"Spada, Gabriella","Bodine, Walter R.","Bodine, Walter R."
citation,,,,,,,,,,,...,Veldhuis 2014: 193,,,,,,,,,
collection,"Musées royaux d'Art et d'Histoire, Brussels, B...","Musées royaux d'Art et d'Histoire, Brussels, B...","Musées royaux d'Art et d'Histoire, Brussels, B...",University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,University of Pennsylvania Museum of Archaeolo...,...,"British Museum, London, UK","British Museum, London, UK","Arkeoloji Müzeleri, Istanbul, Turkey","National Museum of Iraq, Baghdad","National Museum of Iraq, Baghdad","Yale Babylonian Collection, New Haven","British Museum, London, UK","private: anonymous, unlocated","Yale Babylonian Collection, New Haven","Yale Babylonian Collection, New Haven"
designation,Akkadica 117 015 O.118,Akkadica 117 015 O.119,Akkadica 117 016 O.120,"OIP 011, 105 + 109 + 148","OIP 011, 030",CBS 04813 + CBS 06853,"OIP 011, p. 014, CBS 04826 (unpub. dup.)","OIP 011, 096",N 2469,N 4541,...,BM 079062,BM 054697,Ist Ni 10200,Edubba 9 29,Edubba 9 30,YBC 11121,BM 061370,"ZA 101, 204-245",YBC 00263,YBC 12074
electronic_publication,,,,,,,,,,,...,,,,,,,,,,
excavation_no,,,,,,,,,,,...,,,,,,,,,,
findspot_square,,,,,,,,,,,...,,,,,,,,,,
genre,School,School,School,School,School,School,School,School,School,School,...,School,School,School,School,School,School,School,School,School,School
id_text,P200931,P200932,P200933,P227953,P227955,P227962,P227972,P227988,P228140,P228267,...,X000004,X000005,X000006,X000007,X000008,X000009,X000010,X000011,X000012,X000013


# Improve

This is a little odd: each text ID is a column (we expected a row). All the NaNs are uninteresting and we have far too many fields.

In [13]:
cat_df = cat_df.T.fillna('')
cat_df = cat_df[['designation', 'period', 'provenience',
        'museum_no', 'id_text']]
cat_df[:10]

Unnamed: 0,designation,period,provenience,museum_no,id_text
P200931,Akkadica 117 015 O.118,Old Babylonian,uncertain,MRAH O.0118,P200931
P200932,Akkadica 117 015 O.119,Old Babylonian,uncertain,MRAH O.0119,P200932
P200933,Akkadica 117 016 O.120,Old Babylonian,uncertain,MRAH O.120,P200933
P227953,"OIP 011, 105 + 109 + 148",Old Babylonian,Nippur,CBS 04803 + CBS 06591 + CBS 05962,P227953
P227955,"OIP 011, 030",Old Babylonian,Nippur,CBS 04805,P227955
P227962,CBS 04813 + CBS 06853,Old Babylonian,uncertain,CBS 04813 + CBS 06853,P227962
P227972,"OIP 011, p. 014, CBS 04826 (unpub. dup.)",Old Babylonian,Nippur,CBS 04826,P227972
P227988,"OIP 011, 096",Old Babylonian,Nippur,CBS 04846,P227988
P228140,N 2469,Old Babylonian,Nippur,N 2469,P228140
P228267,N 4541,Old Babylonian,Nippur,N 4541,P228267


# Save or Pickle for later use
Save to a `csv` file with the `Pandas` command `to_csv()`. Use `encoding = "utf8"` if you want to read the file in another Python or R project. Use `encoding = 'utf16'` if you plan to open the file in Excel.

If the file is only going to be used in Python projects, you may also pickle it, so that you can essentially continue where you left off.

In [14]:
filename = 'obmc_cat.csv'
p_filename = "obmc_cat.p"
with open(filename, 'w', encoding='utf-8') as w:
    cat_df.to_csv(w, index=False)

with open(p_filename, 'wb') as p:
    pickle.dump(cat_df, p)

# Open your Pickle file

In theory you can open your pickle file with the `pickle` command `load()`, and that is probably going to work right now. However, it is often better to load the file directly into Pandas with the `read_pickle()` command (you can also pickle your dataframe with `pd.to_pickle()`).

In [15]:
df = pd.read_pickle("obmc_cat.p")
df

Unnamed: 0,designation,period,provenience,museum_no,id_text
P200931,Akkadica 117 015 O.118,Old Babylonian,uncertain,MRAH O.0118,P200931
P200932,Akkadica 117 015 O.119,Old Babylonian,uncertain,MRAH O.0119,P200932
P200933,Akkadica 117 016 O.120,Old Babylonian,uncertain,MRAH O.120,P200933
P227953,"OIP 011, 105 + 109 + 148",Old Babylonian,Nippur,CBS 04803 + CBS 06591 + CBS 05962,P227953
P227955,"OIP 011, 030",Old Babylonian,Nippur,CBS 04805,P227955
P227962,CBS 04813 + CBS 06853,Old Babylonian,uncertain,CBS 04813 + CBS 06853,P227962
P227972,"OIP 011, p. 014, CBS 04826 (unpub. dup.)",Old Babylonian,Nippur,CBS 04826,P227972
P227988,"OIP 011, 096",Old Babylonian,Nippur,CBS 04846,P227988
P228140,N 2469,Old Babylonian,Nippur,N 2469,P228140
P228267,N 4541,Old Babylonian,Nippur,N 4541,P228267
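The pickle round trip can be tried in isolation with the standard library alone. In the sketch below a short list of dictionaries (two entries taken from the table above) stands in for the DataFrame, and the file name `demo_cat.p` in a temporary directory is an assumption made for the demonstration.

```python
import os
import pickle
import tempfile

# A small stand-in for the catalogue data
records = [
    {"id_text": "P200931", "museum_no": "MRAH O.0118"},
    {"id_text": "P200932", "museum_no": "MRAH O.0119"},
]

path = os.path.join(tempfile.mkdtemp(), "demo_cat.p")

with open(path, "wb") as p:  # pickles are binary, hence "wb"
    pickle.dump(records, p)

with open(path, "rb") as p:  # and "rb" for reading them back
    restored = pickle.load(p)
```

A pickle file can execute arbitrary code when it is loaded, so only unpickle files that you created yourself or that come from a source you trust.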


# More Complex JSON
The `catalogue.json` that we parsed above has a fairly simple structure. Text editions in [ORACC](http://oracc.org) JSON are much more complex because they preserve the inner structure of the document, divided into:
- obverse
- reverse
- column 1, column 2, etc.
- lines

It is impossible to predict how deep the structure is: some texts have columns (tablet > side > column > line) others do not (tablet > side > line). Let's look at a simple example.

We still have the ZipFile object `z`.

In [22]:
P = "obmc/corpusjson/P230754.json"
String = z.read(P).decode("utf8")
P230754 = json.loads(String)
P230754

{'type': 'cdl',
 'project': 'obmc',
 'source': 'http://oracc.org/obmc',
 'license': 'This data is released under the CC0 license',
 'license-url': 'https://creativecommons.org/publicdomain/zero/1.0/',
 'more-info': 'http://oracc.org/doc/opendata/',
 'UTC-timestamp': '2019-02-05T21:57:50',
 'textid': 'P230754',
 'cdl': [{'node': 'c',
   'type': 'text',
   'id': 'P230754.U0',
   'cdl': [{'node': 'd',
     'subtype': 'tablet',
     'type': 'tablet',
     'ref': 'P230754.x59.1',
     'label': 'x59'},
    {'node': 'd', 'type': 'surface', 'ref': ''},
    {'node': 'd',
     'subtype': "column 1'",
     'type': 'column',
     'ref': 'P230754.i_.2',
     'n': '1',
     'label': "i'"},
    {'node': 'c',
     'type': 'discourse',
     'subtype': 'body',
     'id': 'P230754.U1',
     'cdl': [{'node': 'c',
       'type': 'sentence',
       'implicit': 'yes',
       'id': 'P230754.U2',
       'label': "i' 1' - i' 4'",
       'cdl': [{'node': 'd',
         'type': 'line-start',
         'ref': 'P2307

The JSON is structured around `cdl` nodes. A `cdl` node is a list that may contain `c` (Chunk), `d` (Discontinuity), and `l` (Lemma) nodes, and/or a new `cdl` node. The hierarchy of `cdl` nodes represents the textual hierarchy. We therefore dig into the JSON until we find no more new `cdl` nodes.

In [17]:
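def parsejson(text):
    """Walk the nested cdl nodes and collect all lemma ("f") dictionaries.

    Relies on the global names id_text and lemm_l, defined in the next cell.
    """
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject:  # a deeper level: descend recursively
            parsejson(JSONobject)
        if "f" in JSONobject:  # lemma data found: keep it
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text  # record which text the word comes from
            lemm_l.append(lemm)
    return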

In [18]:
id_text = "P230754"
lemm_l = []
parsejson(P230754)

In [19]:
lemm_l

[{'lang': 'sux',
  'form': '{iti}šeg₁₂-a-ka',
  'gdl': [{'det': 'semantic',
    'pos': 'pre',
    'seq': [{'v': 'iti', 'id': 'P230754.3.1.0'}]},
   {'v': 'šeg₁₂',
    'id': 'P230754.3.1.1',
    'break': 'damaged',
    'ho': '1',
    'delim': '-',
    'hc': '1'},
   {'v': 'a',
    'id': 'P230754.3.1.2',
    'breakStart': '1',
    'o': '[',
    'break': 'missing',
    'delim': '-'},
   {'v': 'ka',
    'id': 'P230754.3.1.3',
    'break': 'missing',
    'o': ']',
    'breakEnd': 'P230754.3.1.2'}],
  'cf': 'Šega',
  'gw': '1',
  'sense': '1',
  'norm0': 'Šega,ak.a',
  'pos': 'MN',
  'epos': 'MN',
  'base': '{iti}šeg₁₂-a',
  'morph': '~,ak.a',
  'id_text': 'P230754'},
 {'lang': 'sux',
  'form': 'kug-bi',
  'delim': '',
  'gdl': [{'v': 'kug', 'id': 'P230754.4.1.0', 'delim': '-'},
   {'v': 'bi', 'id': 'P230754.4.1.1'}],
  'cf': 'kug',
  'gw': 'metal',
  'sense': 'metal, silver',
  'norm0': 'kug,bi',
  'pos': 'N',
  'epos': 'N',
  'base': 'kug',
  'morph': '~,bi',
  'id_text': 'P230754'},
 {'la

In [21]:
word_df = pd.DataFrame(lemm_l)
word_df = word_df.fillna('')      # replace NaN (Not a Number) with empty string
word_df[["form", "cf", "gw", "pos", "id_text"]]

Unnamed: 0,form,cf,gw,pos,id_text
0,{iti}šeg₁₂-a-ka,Šega,1,MN,P230754
1,kug-bi,kug,metal,N,P230754
2,i₃-la₂-e,la,hang,V/t,P230754
3,tukum-bi,tukumbi,if,CNJ,P230754
4,{iti}šeg₁₂-a-ka,Šega,1,MN,P230754
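The function `parsejson()` above relies on the global names `id_text` and `lemm_l`. As a possible variant (a sketch, not the method used in this notebook), the same recursive walk can return its results instead. The function name `collect_lemmas`, the toy data, and the text ID `X999999` are all made up for illustration.

```python
def collect_lemmas(node, id_text):
    """Recursively gather the "f" dictionaries found below a cdl node."""
    lemmas = []
    for child in node.get("cdl", []):
        if "cdl" in child:  # a deeper level: descend and merge the results
            lemmas.extend(collect_lemmas(child, id_text))
        if "f" in child:  # lemma data found
            lemma = dict(child["f"])  # copy, so the parsed JSON stays untouched
            lemma["id_text"] = id_text
            lemmas.append(lemma)
    return lemmas


# Toy structure with two levels of nesting
toy = {"cdl": [{"node": "c", "cdl": [
    {"node": "d", "type": "line-start"},
    {"node": "l", "f": {"form": "kug-bi", "cf": "kug"}},
    {"node": "c", "cdl": [
        {"node": "l", "f": {"form": "tukum-bi", "cf": "tukumbi"}},
    ]},
]}]}

words = collect_lemmas(toy, "X999999")
forms = [w["form"] for w in words]
```

Returning a list makes it straightforward to process many texts in a loop and concatenate the results before building a DataFrame.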
