# Data Acquisition from CDLI
The downloadable [CDLI](http://cdli.ucla.edu) files are found on the download page http://cdli.ucla.edu/bulk_data. The data available are a set of transliterations and a catalog file with metadata. Because of its size the catalog file is currently split in two; in the future there may be more or fewer such files. The script identifies the file names and downloads the files to a directory `cdlidata`. Once downloaded, the catalog is reconstituted as a single file and loaded into a `pandas` DataFrame. The DataFrame is used, by way of example, to select the transliterations from the Early Dynastic IIIa period.


# 0 Import Packages

In [None]:
import requests
from tqdm import tqdm_notebook
import errno
import pandas as pd
import csv
from bs4 import BeautifulSoup
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

# 1. Create Download Directory
Create a directory called `cdlidata`. If the directory already exists, do nothing. The directory is created with the function `make_dirs()` from the `utils` module. For the code see section 2.1.0.

In [None]:
directories = ['cdlidata']
make_dirs(directories)
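
The `make_dirs()` helper lives in the `utils` module and is not reproduced in this notebook. As a rough sketch of the behaviour described above (create each directory, do nothing if it already exists), something like the following would work. The name `make_dirs_sketch` is hypothetical and the actual implementation in `utils` may differ.

```python
import os

def make_dirs_sketch(directories):
    """Create each directory in the list; leave existing directories untouched."""
    for d in directories:
        # exist_ok=True suppresses the error raised for an existing directory
        os.makedirs(d, exist_ok=True)
```

Calling the function twice with the same list is harmless, which is the property the notebook relies on when it is re-run.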

# 2. Retrieve File Names
We first need to retrieve the names of the files that are offered for download on the CDLI [download](http://cdli.ucla.edu/bulk_data/) page. The script requests the HTML of the download page and uses BeautifulSoup (a package for web scraping) to retrieve all the links from the page. These include the file names, but also links used to order the files in different ways (by name, by size, or by date). The latter links, which all start with "?", are filtered out.

In [None]:
download_page = "http://cdli.ucla.edu/bulk_data/"
r = requests.get(download_page)
html = r.text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
links = soup.find_all('a')       # retrieve all html anchors, which define links
files = set()
for link in links:
    f = link.get('href')        # from the anchors, retrieve the URLs
    files.add(f)
files = {f for f in files if f and not f.startswith("?")}  # drop missing hrefs and URLs that start with "?"
files
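
To see what the filter does without contacting the server, the same logic can be tried on a small inline listing. This is only a sketch: the HTML below is a made-up sample (the real page will look different), and it uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages. The names `SAMPLE_HTML`, `HrefCollector` and `sample_files` are hypothetical.

```python
from html.parser import HTMLParser

# Made-up directory listing; the sorting links start with "?"
SAMPLE_HTML = """
<a href="?C=N;O=D">Name</a>
<a href="?C=S;O=A">Size</a>
<a href="cdli_catalogue_1of2.csv">cdli_catalogue_1of2.csv</a>
<a href="cdli_catalogue_2of2.csv">cdli_catalogue_2of2.csv</a>
<a href="cdliatf_unblocked.atf">cdliatf_unblocked.atf</a>
<a>anchor without href</a>
"""

class HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor that has one."""
    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:                      # skip anchors without an href
                self.hrefs.add(href)

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
# keep only the links that do not start with "?"
sample_files = {h for h in collector.hrefs if not h.startswith("?")}
print(sorted(sample_files))
```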

# 3. Download
The download code in this cell is essentially identical to the code in notebook 2_1_0_download_ORACC-JSON.ipynb. Because of the size of the files, downloading can take some time, depending on the speed of your computer and internet connection.

In [None]:
CHUNK = 16 * 1024
for f in files:
    url = download_page + f
    target = f'cdlidata/{f}'
    with requests.get(url, stream=True) as r:
        if r.status_code == 200:
            print(f"Downloading {url} saving as {target}")
            with open(target, 'wb') as t:
                for c in tqdm_notebook(r.iter_content(chunk_size=CHUNK)):
                    t.write(c)
        else:
            print(f"{url} does not exist.")

# 4. Concatenate the Catalogue Files
The catalogue files are concatenated by reading them line by line into the file `catalogue.csv` which is placed in the directory `cdlidata`.

In [None]:
filenames = [f for f in files if "cdli_catalogue" in f]
filenames.sort()  # to make sure we read cdli_catalogue_1of2.csv first.
with open('cdlidata/catalogue.csv', 'w', encoding="utf8") as outfile:
    for fname in tqdm_notebook(filenames):
        with open(f'cdlidata/{fname}', encoding="utf8") as infile:
            for line in infile:
                outfile.write(line)
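
The importance of the sort order can be illustrated with toy files. The sketch below assumes, as the comment in the cell above suggests, that only the first part carries the header row. The function `concatenate_parts` and the `demo_*` file names are hypothetical and only serve the illustration.

```python
import os
import tempfile

def concatenate_parts(directory, part_names, out_name):
    """Write the parts, in sorted order, into one file and return its path."""
    out_path = os.path.join(directory, out_name)
    with open(out_path, "w", encoding="utf8") as outfile:
        for name in sorted(part_names):          # 1of2 sorts before 2of2
            with open(os.path.join(directory, name), encoding="utf8") as infile:
                for line in infile:
                    outfile.write(line)
    return out_path

# Toy data: the header row is only present in the first part
with tempfile.TemporaryDirectory() as tmp:
    parts = {"demo_2of2.csv": "2,b\n", "demo_1of2.csv": "id,value\n1,a\n"}
    for name, content in parts.items():
        with open(os.path.join(tmp, name), "w", encoding="utf8") as fh:
            fh.write(content)
    combined = concatenate_parts(tmp, parts.keys(), "combined.csv")
    with open(combined, encoding="utf8") as fh:
        result = fh.read()
print(result)
```

Note that this approach glues the parts together as they are: if a part did not end with a newline, the first line of the next part would be appended to its last line.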

# 5 Load in Pandas DataFrame
## 5.1 Adjust Shape of the File
At the time of writing this notebook, the file cdli_catalogue_1of2.csv had at least one instance of a double record in a single line, resulting in more than the standard 63 fields. This prevents `pandas` from loading the file directly with the `read_csv()` function. For that reason the file is read with the `csv` library; each line is placed in the list `catalogue`, but all fields beyond the 63rd are discarded.

In [None]:
catalogue = []
with open('cdlidata/catalogue.csv', 'r', encoding="utf8") as f:
    csv_reader = csv.reader(f, delimiter=',', quotechar='"')
    for row in csv_reader:
        catalogue.append(row[:63])
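
The effect of the slice can be seen on a miniature example. In this sketch `WIDTH = 3` stands in for the 63 fields of the real catalogue, and the sample rows (including the column names) are invented for illustration.

```python
import csv
import io

WIDTH = 3  # stands in for the 63 fields of the real catalogue

# Invented sample; the last line holds two records fused into one
sample = io.StringIO(
    'id,period,"provenience"\n'
    '1,ED IIIa,"Fara, Iraq"\n'
    '2,Ur III,Umma,3,Ur III,Umma\n'
)
# csv.reader respects the quotes, so "Fara, Iraq" remains a single field;
# the slice then cuts every row down to WIDTH fields
rows = [row[:WIDTH] for row in csv.reader(sample, delimiter=",", quotechar='"')]
for row in rows:
    print(row)
```

The second record on the fused line is dropped, not recovered, so this approach trades a small loss of data for a table of uniform width.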

## 5.2 DataFrame
The list `catalogue` is now loaded into `pandas`. The first row (row 0) holds the column names. This row is used to name the columns and is then discarded.

In [None]:
cat = pd.DataFrame(catalogue)
cat.columns = cat.iloc[0]
cat = cat[1:]
cat

# 6 Use Catalog to Select Transliterations
In the example code in the following cell the catalog is used to select from the transliteration file all texts from the Early Dynastic IIIa period. The field "period" is used to select those catalog entries that have "ED IIIa" in that field. P numbers are stored in the catalog without the initial 'P' and without leading zeros (that is, '1183' corresponds to 'P001183'). The function `zfill()` is used to create a 6-digit number, padded with leading zeros if necessary. The P numbers of our catalog selection are stored in the variable `pnos` (but note that the numbers do not have the initial 'P'!).

The code then iterates through the lines of the transliteration file. The flag `keep` (which initially is set to `False`) is set to `True` if the code encounters a P number that is present in `pnos`. As long as `keep` is `True`, subsequent lines are added to the list `ed3a_atf`. When the script encounters a P number that is not in `pnos`, the flag `keep` is set to `False`.

The result is a list of lines with all the transliteration data of the Early Dynastic IIIa texts in [CDLI](http://cdli.ucla.edu).

In [None]:
ed3a = cat.loc[cat["period"].str[:7] == "ED IIIa"]
pnos = set(ed3a["id_text"].str.zfill(6))  # a set keeps the membership test in the loop below fast
with open("cdlidata/cdliatf_unblocked.atf", encoding="utf8") as c: 
    lines = c.readlines()
keep = False
ed3a_atf = []
for line in tqdm_notebook(lines):
    if line[0] == "&": 
        if line[2:8] in pnos: 
            keep = True
        else: 
            keep = False
    if keep: 
        ed3a_atf.append(line)
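
The interplay of `zfill()` and the `keep` flag can be tried out on a handful of lines. The sketch below uses invented ATF lines and placeholder text; only the P numbers '1183' and '212416' are taken from the examples in this notebook, and the names `pnos_demo`, `atf_lines` and `selected` are hypothetical.

```python
catalogue_ids = ["1183", "212416"]                  # as stored in the catalog
pnos_demo = {i.zfill(6) for i in catalogue_ids}     # padded to six digits

# Invented lines; a line starting with "&" opens a new text
atf_lines = [
    "&P001183 = demo text A\n",
    "1. x x\n",
    "&P000002 = demo text B\n",
    "1. y y\n",
    "&P212416 = demo text C\n",
    "1. z z\n",
]
keep = False
selected = []
for line in atf_lines:
    if line.startswith("&"):
        # characters 2-7 hold the six digits after "&P"
        keep = line[2:8] in pnos_demo
    if keep:
        selected.append(line)
print("".join(selected))
```

Only the lines of texts A and C end up in `selected`; the header of text B switches the flag off until the next matching header is found.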

# 7 Place in DataFrame
Place the ED IIIa texts in a DataFrame, where each row represents one document (line numbers are omitted). This is, of course, just one example of how the data may be selected and formatted.

The lines are read in reverse order, so that when the script encounters an '&P' line (as in '&P212416 = AAICAB 1/1, pl. 008, 19282-439'), this signals that all the lines of a text have been read and that the document can be added to the list `docs`. (When reading the lines in regular order - taking the '&P' line as signaling the end of the previous document - one needs to separately save the last document, because there is no '&P' line anymore to indicate that the text is complete).

In [None]:
docs = []
d = ''
id_text = ''
ed3a_atf = [line for line in ed3a_atf if line.strip()]  # remove empty lines, which cause trouble
for line in tqdm_notebook(reversed(ed3a_atf)):
    if line[0] == "&":  # line beginning with & marks the beginning of a document
        id_text = line[1:8] # retrieve the P number
        docs.append([id_text, d])
        d = ''   # after appending the data to docs, reset d for a new document.
        continue
    elif line[0] in ["#", "$", "<", ">", "@"]:  # skip all non-transliteration lines
        continue
    else:
        try:
            line = line.split(' ', 1)[1].strip() # split line at first space (after the line number)
            d = f'{line} {d}' # add the new line in front
        except IndexError:
            continue   # malformed lines (no space between line number and text) are skipped
ed3a_df = pd.DataFrame(docs)
ed3a_df.columns = ["id_text", "transliteration"]

In [None]:
ed3a_df
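
For comparison, the forward-reading alternative mentioned in the parenthesis of section 7 might look as follows. This is a sketch, not the notebook's method: the function `group_forward` and the demo lines are hypothetical. It shows the extra step needed after the loop to save the last document.

```python
def group_forward(lines):
    """Group ATF lines into [id_text, transliteration] pairs, reading top to bottom."""
    docs = []
    id_text, words = None, []
    for line in lines:
        if line.startswith("&"):                  # a new document begins
            if id_text is not None:               # close the previous document
                docs.append([id_text, " ".join(words)])
            id_text, words = line[1:8], []
        elif line.startswith(("#", "$", "<", ">", "@")):
            continue                              # skip non-transliteration lines
        else:
            parts = line.split(" ", 1)            # separate line number from text
            if len(parts) == 2:                   # malformed lines are skipped
                words.append(parts[1].strip())
    if id_text is not None:                       # the last document has no following "&" line
        docs.append([id_text, " ".join(words)])
    return docs

demo = [
    "&P000001 = demo A\n", "@obverse\n", "1. a b\n", "2. c\n",
    "&P000002 = demo B\n", "$ broken\n", "1. d\n",
]
print(group_forward(demo))
```

Unlike the reversed loop, this version returns the documents in file order and joins the lines without a trailing space; the resulting list can be passed to `pd.DataFrame()` in the same way as `docs`.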