# Data Acquisition from CDLI
The downloadable CDLI files are found on the download page http://cdli.ucla.edu/bulk_data. The data available are a set of transliterations and a catalog file with metadata. Because of its size the catalog file is currently split in two; in the future there may be more or fewer such files. The script identifies the file names and downloads the files to a directory `cdlidata`. Once downloaded, the catalog is reconstituted as a single file and loaded into a `pandas` DataFrame. The DataFrame is used, by way of example, to select the transliterations from the Early Dynastic IIIa period.


# 0 Import Packages

In [1]:
import requests
import tqdm
import os
import errno
import pandas as pd
import csv
from bs4 import BeautifulSoup

# 1. Create Download Directory
Create a directory called `cdlidata`. If the directory already exists, do nothing.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
try:
    os.mkdir('cdlidata')
except OSError as exc:
    # Ignore "directory already exists"; re-raise any other error
    if exc.errno != errno.EEXIST:
        raise
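
On Python 3.2 and later the same "create if missing" behavior is available in a single call. The following is a minimal sketch, not the notebook's own code; the helper name `ensure_dir` is hypothetical.

```python
import os

def ensure_dir(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    # exist_ok=True suppresses the error normally raised for an existing directory
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_dir('cdlidata')` repeatedly is harmless, which is convenient when the notebook is re-run.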

# 2. Retrieve File Names
We first need to retrieve the names of the files that are offered for download on the [CDLI](http://cdli.ucla.edu) download page. The script requests the HTML of the download page and uses BeautifulSoup (a package for web scraping) to retrieve all the links from the page. These include the file names, but also links used to sort the files in different ways (by name, by size, or by date). The sorting links all start with "?" and are filtered out.

In [3]:
download_page = "http://cdli.ucla.edu/bulk_data/"
r = requests.get(download_page)
html = r.text
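soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
links = soup.find_all('a')
files = set()
for link in links:
    f = link.get('href')
    if f:  # skip anchors without an href
        files.add(f)
# Drop the sorting links, which start with "?"
files = {f for f in files if not f.startswith("?")}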
files
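
The filtering step can be tried out without a network connection. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup, so it is a stand-in for the cell above rather than a copy of it. The names `LinkCollector` and `file_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors without an href
                self.hrefs.add(href)

def file_links(html):
    """Return all link targets except the sorting links, which start with '?'."""
    collector = LinkCollector()
    collector.feed(html)
    return {h for h in collector.hrefs if not h.startswith("?")}

# Made-up fragment of a directory listing, for illustration only
sample = ('<a href="?C=N;O=D">Name</a> <a href="?C=S;O=A">Size</a> '
          '<a href="README.md">README.md</a>')
print(file_links(sample))
```

Only `README.md` survives the filter in this sample; the two sorting links are dropped.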

# 3. Download
The download code in this cell is essentially identical to the code in notebook 2_1_0_download_ORACC-JSON.ipynb. Because the files are large, the download can take some time, depending on the speed of your computer and internet connection.

In [4]:
CHUNK = 16 * 1024
for f in files:
    url = download_page + f
    target = 'cdlidata/' + f
    with requests.get(url, stream=True) as r:
        if r.status_code == 200:
            print("Downloading " + url + " saving as " + target)
            with open(target, 'wb') as t:
                for c in tqdm.tqdm(r.iter_content(chunk_size=CHUNK)):
                    t.write(c)
        else:
            print(url + " does not exist.")

Downloading http://cdli.ucla.edu/bulk_data/README.md saving as cdlidata/README.md


1it [00:00, 335.73it/s]


Downloading http://cdli.ucla.edu/bulk_data/cdli_catalogue_1of2.csv saving as cdlidata/cdli_catalogue_1of2.csv


5101it [04:25, 19.21it/s]


Downloading http://cdli.ucla.edu/bulk_data/cdli_catalogue_2of2.csv saving as cdlidata/cdli_catalogue_2of2.csv


3719it [03:13, 19.21it/s]


Downloading http://cdli.ucla.edu/bulk_data/cdliatf_unblocked.atf saving as cdlidata/cdliatf_unblocked.atf


4519it [03:56, 19.13it/s]


# 4. Concatenate the Catalogue Files
The catalogue files are concatenated by reading them line by line into the file `catalogue.csv` which is placed in the directory `cdlidata`.

In [5]:
filenames = [f for f in files if "cdli_catalogue" in f]
filenames.sort()
with open('cdlidata/catalogue.csv', 'w', encoding="utf8") as outfile:
    for fname in filenames:
        with open('cdlidata/' + fname, encoding="utf8") as infile:
            for line in infile:
                outfile.write(line)
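
Whether the second catalogue part repeats the header row is not stated here. If it does, a plain concatenation would leave a stray header line in the middle of `catalogue.csv`. The following sketch guards against that case; the function name `concatenate_parts` is hypothetical and the header check is an assumption about the data, not something the notebook relies on.

```python
def concatenate_parts(part_paths, target_path):
    """Join text files in order, dropping any repeated copy of the first header line."""
    header = None
    with open(target_path, "w", encoding="utf8") as outfile:
        for path in part_paths:
            with open(path, encoding="utf8") as infile:
                for i, line in enumerate(infile):
                    if i == 0:
                        if header is None:
                            header = line  # remember the header of the first part
                        elif line == header:
                            continue  # skip a repeated header in later parts
                    outfile.write(line)
```

If the later parts have no header, the function behaves like the line-by-line copy in the cell above.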

# 5 Load in Pandas DataFrame
## 5.1 Adjust Shape of the File
At the time of writing this notebook, the file `cdli_catalogue_1of2.csv` had at least one instance of a double record in a single line, resulting in more than the standard 63 fields. This prevents `pandas` from loading the file directly with the `read_csv()` function. For that reason the file is read with the `csv` library; each line is placed in the list `catalogue`, but all fields beyond the 63rd are discarded.

In [6]:
catalogue = []
with open('cdlidata/catalogue.csv', 'r', encoding="utf8") as f:
    csv_reader = csv.reader(f, delimiter=',', quotechar='"')
    for row in csv_reader:
        catalogue.append(row[:63])
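
Before truncating rows it can be useful to see how many rows actually deviate from the expected width. The sketch below counts rows per number of fields; the function name `field_count_histogram` and the sample rows are made up for illustration.

```python
import csv
from collections import Counter

def field_count_histogram(lines):
    """Return a Counter mapping number of fields to number of rows."""
    reader = csv.reader(lines, delimiter=",", quotechar='"')
    return Counter(len(row) for row in reader)

# Made-up sample: three rows with 3 fields, one "double record" with 6 fields
sample = ['a,b,c', '1,2,3', '4,"5,5",6', '7,8,9,10,11,12']
print(field_count_histogram(sample))
```

The function accepts any iterable of lines, so an open file handle for `cdlidata/catalogue.csv` could be passed in place of `sample`. Any key other than 63 in the result would point to a malformed row.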

## 5.2 DataFrame
The list `catalogue` is now loaded into `pandas`. The first row (row 0) holds the column names. This row is used to name the columns and is then discarded.

In [76]:
cat = pd.DataFrame(catalogue)
cat.columns = cat.iloc[0]
cat = cat[1:]
cat

Unnamed: 0,accession_no,accounting_period,acquisition_history,alternative_years,ark_number,atf_source,atf_up,author,author_remarks,cdli_collation,...,seal_id,seal_information,stratigraphic_level,subgenre,subgenre_remarks,surface_preservation,text_remarks,thickness,translation_source,width
1,,,,,21198/zz001q0dtm,"Englund, Robert K.",,CDLI,"31x61x18; Lú A 14-16.30-32.48-50; M XVIII, auf...",,...,,,,Archaic Lu2 A (witness),,,,18,no translation,61
2,,,,,21198/zz001q0dv4,"Englund, Robert K.",,CDLI,30x48x13; Lú A 13-15.23-25.?; Fundstelle wie W...,,...,,,,Archaic Lu2 A (witness),,,,13,no translation,48
3,,,,,21198/zz001q0dwn,"Englund, Robert K.",,"Englund, Robert K. & Nissen, Hans J.","42x53x19; Vocabulary 9; Qa XVI,2, unter der Ab...",,...,,,,witness Archaic Vocabulary,Text category: 15-09; Foreign ID: LVO 9,,,19,no translation,53
4,,,,,21198/zz001q0dx5,"Englund, Robert K.",,CDLI,26x23x23; Lú A 9-10.?.?; Fundstelle wie W 9123...,,...,,,,Archaic Lu2 A (witness),,,,23,no translation,23
5,,,,,21198/zz001q0dzp,"Englund, Robert K.",,CDLI,"29x36x20; Lú A Vorläufer; Qa XVI,2, unter der ...",,...,,,,Archaic Lu2 A (witness),,,,20,no translation,36
6,,,,,21198/zz001q0f0p,"Englund, Robert K.",,CDLI,82x62x19; Lú A Vorläufer; Fundstelle wie W 912...,,...,,,,Archaic Lu2 A (witness),,,,19,no translation,62
7,,,,,21198/zz001q0f16,"Englund, Robert K.",,CDLI,56x36x29; Lú A Vorläufer; Fundstelle wie W 912...,,...,,,,Archaic Lu2 A (witness),,,,29,no translation,36
8,,,,,21198/zz001q0f2q,"Englund, Robert K.",,CDLI,"39x26x9; Unidentified 1; Pb XVII,1, +19.50 m, ...",,...,,,,Archaic Unidentified (witness),,,,9,no translation,26
9,,,,,21198/zz001q0f37,"Englund, Robert K.",,CDLI,54x46x?; Lú A 95-98.111-113; Fundstelle wie W ...,,...,,,,Archaic Lu2 A (witness),,,,0,no translation,46
10,,,,,21198/zz001q0f4r,"Englund, Robert K.",,CDLI,23x25x19; Officials 16-18.66-68.?-?; Fundstell...,,...,,,,Archaic Officials (witness),,,,19,no translation,25


# 6 Use Catalog to Select Transliterations
In the example code in the following cell the catalog is used to select from the transliteration file all texts from the Early Dynastic IIIa period.

In [50]:
ed3a = cat.loc[cat["period"] == "ED IIIa (ca. 2600-2500 BC)"]
# Six-digit, zero-padded P-numbers; a set makes the membership test fast
pnos = set(ed3a["id_text"].str.zfill(6))
with open("cdlidata/cdliatf_unblocked.atf", encoding="utf8") as c:
    lines = c.readlines()
keep = False
ed3a_atf = []
for line in lines:
    if line[0] == "&":  # header line of a new text, e.g. "&P005984 = ..."
        keep = line[2:8] in pnos
    if keep:
        ed3a_atf.append(line)
ed3a_atf

['&P005984 = RIME 1.08.03.02, ex. 01 \n',
 '#atf: lang akk \n',
 '@tablet \n',
 '@obverse \n',
 '@column 1 \n',
 '$ beginning broken \n',
 "1'. [n] 4(bur3@c) _GAN2_ \n",
 ">>Q003640 001' \n",
 "2'. E2#? HA? GU4? x \n",
 ">>Q003640 002' \n",
 "3'. in ur-sa6{ki} \n",
 ">>Q003640 003' \n",
 "4'. 6(bur3@c) _GAN2_ \n",
 ">>Q003640 004' \n",
 "5'. x x x \n",
 ">>Q003640 005' \n",
 "6'. x _GAN2#_ [...] \n",
 ">>Q003640 006' \n",
 "7'. [...] \n",
 ">>Q003640 007' \n",
 "8'. [...] \n",
 ">>Q003640 008' \n",
 "9'. 2(bur3@c) _GAN2_ \n",
 ">>Q003640 009' \n",
 "10'. _GAN2 sa10_ \n",
 ">>Q003640 010' \n",
 "11'. asz2-te4 \n",
 ">>Q003640 011' \n",
 "12'. inim-ma-ni#-zi# \n",
 ">>Q003640 012' \n",
 '$ rest broken \n',
 '@column 2 \n',
 '1. en-na-il \n',
 ">>Q003640 013' \n",
 '2. _lugal_ kisz \n',
 ">>Q003640 014' \n",
 '3. _alan#?_-[su?] \n',
 ">>Q003640 015' \n",
 '4. [...] \n',
 ">>Q003640 016' \n",
 '5. [...] \n',
 ">>Q003640 017' \n",
 '6. [...] \n',
 ">>Q003640 018' \n",
 '7. _igi_ inanna \n',
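
The selection logic can also be written as a generator, so that the transliteration file does not have to be held in memory as a list. This is a sketch of that alternative, not the notebook's own approach; the function name `select_texts` and the sample lines are hypothetical.

```python
def select_texts(lines, wanted_ids):
    """Yield the lines of every text whose six-digit P-number is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False
    for line in lines:
        if line.startswith("&"):
            # Header lines look like "&P005984 = ..."; slice 2:8 holds the six digits
            keep = line[2:8] in wanted
        if keep:
            yield line

# Made-up sample with two short texts
sample = ["&P000001 = A\n", "1. a\n", "&P000002 = B\n", "1. b\n"]
print(list(select_texts(sample, ["000002"])))
```

Because the function takes any iterable of lines, an open file handle can be passed directly and the file is read one line at a time.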

# 7 Place in DataFrame
Place the ED IIIa texts in a DataFrame, where each row represents one document.

In [75]:
docs = []
d = ""
id_text = None
for line in ed3a_atf:
    if line[0] == "&":
        # A new text starts: store the previous one, if any
        if id_text:
            docs.append([id_text, d.strip()])
            d = ""
        id_text = line[1:8]
    elif line[0] in ["#", "$", "<", ">", "@"]:
        continue
    else:
        # Drop the line number (everything up to the first ". ")
        start = line.find(". ")
        line = line[start + 1:].rstrip()
        d = d + line
# Store the last document, which no later "&" line closes
if id_text:
    docs.append([id_text, d.strip()])
ed3a_df = pd.DataFrame(docs, columns=["id_text", "transliteration"])
ed3a_df

Unnamed: 0,id_text,transliteration
0,P005984,[n] 4(bur3@c) _GAN2_ E2#? HA? GU4? x in ur-sa6...
1,P010008,4(iku@c) GAN2 DUR2-HAR sar 1(u@c) uruda <a>-ru...
2,P010009,[6(asz@c)] uruda ma-na sa10 GAN2 4(iku@c) GAN2...
3,P010011,6(asz@c) uruda ma-na sa10 GAN2 4(iku@c) GAN2-b...
4,P010012,1(esze3@c) GAN2 ur-{d}nin-PA-ke4 lugal-USZ-MUS...
5,P010014,[x] 1/2(asz@c)? ku3 [gin2] 2(asz@c) lid2#-ga# ...
6,P010015,[x siki?] ma#?-[na] 2(u@c) ninda 2(u@c) gug2 3...
7,P010016,1(gesz2@c) 3(u@c) 2(asz@c) i3-nun [x] ga#? szu...
8,P010017,2(u@c) i3-nun 4(ban2@c) ga! szu-tag nin-unken-...
9,P010018,4(u@c) 6(asz@c) i3-nun 2(u@c) la2 2(asz@c) gam...
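
The cell above removes the line number by searching for the first `". "` in each line. A regular expression anchored at the start of the line is a slightly stricter alternative, since it only strips a prefix that looks like a line number. The sketch below handles simple forms such as `1.` and `10'.`; the names `LINE_NO` and `strip_line_number` are hypothetical.

```python
import re

# A run of characters without whitespace or periods, then a period and whitespace
LINE_NO = re.compile(r"^[^\s.]+\.\s+")

def strip_line_number(line):
    """Remove a leading line number such as "1. " or "10'. " and trim whitespace."""
    return LINE_NO.sub("", line).strip()

print(strip_line_number("1'. [n] 4(bur3@c) _GAN2_ \n"))
print(strip_line_number("10'. _GAN2 sa10_ \n"))
```

Lines that do not start with such a prefix are returned unchanged apart from the trimmed whitespace.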
