# TODO
The scraper output needs a field 'version' to make it compatible with ETCSL output. This field may be used for pseudo-composites such as [dcclt/Q000043](http://oracc.org/dcclt/Q000043) to indicate the designation of the tablet from which the text is taken.

# ORACC Scraper

Generalized scraper for [ORACC](http://oracc.org) files. The scraper needs an input file that lists the P, Q, or X numbers to be scraped. It creates an output file for each of these P, Q, or X numbers, containing line numbers and lemmatized words.

## Set up the environment
Import the packages:
- re: Regular Expressions (for string manipulation)
- BeautifulSoup (for parsing HTML pages)
- tqdm (for the progress bar)

This scraper is written for Python 3.

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#! pip install tqdm
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import sys
import os
from time import sleep
from tqdm import tqdm  # import only the progress bar instead of the whole namespace

#this program should use python unit test
PY3 = sys.version_info.major == 3

if not PY3:
    input = raw_input

print("Running under Python version:", sys.version_info[:3])

Running under Python version: (3, 5, 1)


Initialize output variables.

In [2]:
textiderror = 'Not available:\n'

# Input File

The input file should be located in a directory called `Input`, which must sit in the same directory as this Python Notebook. The file should have a .txt extension and must be created with a plain text editor such as TextEdit, Notepad, or Emacs. The file contains a simple list of P, Q, or X numbers, each preceded by the abbreviation of the ORACC project where the text is edited. For instance:

    rinap/rinap1/Q003421
    dcclt/Q000039
    cams/gkab/P348623

Before running this scraper, use the same input file to download the text material (in HTML format) with the ORACC Downloader. The code below reads these files from the directory `HTML`, with each `/` in the text ID replaced by `_`.
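
As a rough illustration of how an entry in the input list relates to a file on disk, the sketch below mirrors the naming scheme used by `scrape()` further down (slashes replaced by underscores, `.html` appended, files read from `HTML/`). The helper names `split_text_id` and `html_path` are hypothetical and are not part of the scraper.

```python
def split_text_id(entry):
    """Split an input-list entry into (project, text number)."""
    project, _, number = entry.strip().rpartition('/')
    return project, number


def html_path(entry, directory='HTML'):
    """Build the file name that scrape() opens for this entry."""
    return directory + '/' + entry.strip().replace('/', '_') + '.html'


for entry in ["rinap/rinap1/Q003421", "dcclt/Q000039", "cams/gkab/P348623"]:
    print(split_text_id(entry), html_path(entry))
```

An entry such as `rinap/rinap1/Q003421` thus belongs to the project `rinap/rinap1` and is expected at `HTML/rinap_rinap1_Q003421.html`.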

In [3]:
inputfile = input("Name of Input List: ")

Name of Input List: test.txt


# Format Output

The function outputformat() defines what the output of the lemmatized forms will look like. This function may be adapted in various ways to produce different types of output. The function takes a dictionary as input with the following keys: 

1. `lang`: Language
2. `translit`: Transliteration
3. `citform`: Citation Form
4. `guideword`: Guide Word
5. `sense`: Sense
6. `pos`: Part of Speech
7. `epos`: Effective Part of Speech
8. `norm`: Normalization
9. `base`: Base (Sumerian only)
10. `morph`: Morphology (Sumerian only)

In the standard format the output looks like `sux:lugal[king]N`. One may adapt the output, for instance by omitting the element `lang` (`lugal[king]N`), or by selecting only certain parts of speech or certain language codes. For instance:
```python
    if output['pos'] == 'N':
        output_formatted = output['citform'] + "\t" + output['guideword']
```
This will create output in the form lugal   king (lugal and king separated by TAB), selecting only Nouns (excluding Proper Nouns).

```python
    if output['lang'] == 'sux-x-emesal':
        output_formatted = output['citform'] + "[" + output['guideword'] + "]" + output['pos']
```
This will create output in the form umun[lord]N, selecting only Emesal words.
In order to select both Sumerian (sux) and Emesal (sux-x-emesal) forms one could use:
```python
    if output['lang'][0:3] == 'sux':
```
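
The fragments above can be combined into a complete replacement for `outputformat()`. The following is a minimal sketch, assuming one wants tab-separated citation form and guide word for Sumerian and Emesal nouns only; the function name `outputformat_sux_nouns` and the sample dictionary are illustrative and not part of the notebook.

```python
def outputformat_sux_nouns(output):
    """Return 'citform<TAB>guideword' for Sumerian/Emesal nouns, otherwise None."""
    # startswith('sux') covers both 'sux' and 'sux-x-emesal'
    if output['lang'].startswith('sux') and output['pos'] == 'N':
        return output['citform'] + "\t" + output['guideword']
    return None  # every other word is filtered out


# Illustrative input, limited to the keys this variant actually reads
sample = {'lang': 'sux', 'citform': 'lugal', 'guideword': 'king', 'pos': 'N'}
print(outputformat_sux_nouns(sample))
```

Returning `None` for unwanted words matches the way `getline()` skips entries for which the formatter yields nothing.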


In [4]:
def outputformat(output):
    #output_formatted = ''
    if True:
        output_formatted = output['lang'] + ":" + output['citform'] + "[" + output['guideword'] + "]" + output['pos']
    else:
        return None
    return output_formatted

# Parse an ORACC lemma

The function `parselemma()` parses a so-called ORACC 'signature.' It is called by the `getline()` function where these signatures are extracted from the html files. A signature, as extracted by the `getline()` function, looks as follows:

> javascript:pop1sig('dcclt','','@dcclt%sux:am-si=amsi[elephant//elephant]N´N$amsi/am-si#~').

Akkadian signatures look slightly different, lacking the last two elements (after the /). The `parselemma()` function removes all the superfluous elements and breaks the string up into its parts. It returns a dictionary that lists all of these parts. The function `getline()` forwards this dictionary to the function `outputformat()`, which uses its keys and values to build a usable data structure and/or to filter the data.
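
To make the anatomy of a signature more concrete, the sketch below pulls the same fields out of the example with a single regular expression. This is an alternative illustration, not the method `parselemma()` uses: the pattern `SIGNATURE` and the function name `parse_signature` are assumptions, and the pattern has only been checked against the two signature shapes quoted in this notebook.

```python
import re

# Group names follow the dictionary keys used in this notebook;
# the pattern itself is an assumption based on the sample signatures.
SIGNATURE = re.compile(
    r"%(?P<lang>[^:]+):(?P<translit>[^=]*)=(?P<citform>[^\[]+)"
    r"\[(?P<guideword>.*?)//(?P<sense>.*?)\]"
    r"(?P<pos>[^´]+)´(?P<epos>[^$]+)"
    r"\$(?P<norm>[^/#']*)"
    r"(?:/(?P<base>[^#']*)#(?P<morph>[^']*))?"  # base and morphology: Sumerian only
)


def parse_signature(signature):
    """Return the parts of a signature as a dictionary, or None if it does not match."""
    match = SIGNATURE.search(signature)
    if match is None:
        return None
    # Drop the optional groups that did not take part in the match
    return {key: value for key, value in match.groupdict().items() if value is not None}


example = "javascript:pop1sig('dcclt','','@dcclt%sux:am-si=amsi[elephant//elephant]N´N$amsi/am-si#~')"
for key, value in parse_signature(example).items():
    print(key, value)
```

For a signature without the `/base#morph` tail, as in Akkadian, the keys `base` and `morph` are simply absent from the result.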

In [5]:
def parselemma(signature):
    output = {}
    signature = signature.replace(' ', '-')
    signature = signature.replace(',', ';') #remove spaces and commas from GuideWord and Sense
    oracc_words = re.sub(r"'\)$", "", signature) #remove ') from the end of the signature
    oracc_words = re.sub('^.*%', '', oracc_words) #remove everything before % in the signature

    
    sep_char = [":", "=", "$", "#", "[", "]", "//"] # these are the character sequences that separate
                                                    # the various elements of the signature
    for eachchar in sep_char:
        oracc_words = oracc_words.replace(eachchar, " ", 1) # ':' may appear in guideword/sense, so replace only once
    oracc_words_l = oracc_words.split() #split signature into a list
    output['lang'] = oracc_words_l[0]
    output['translit'] = oracc_words_l[1]
    output['citform'] = oracc_words_l[2]
    output['guideword'] = oracc_words_l[3]
    output['sense'] = oracc_words_l[4]

    oracc_words_l[5] = oracc_words_l[5].replace("´", " ") # this separates pos from epos
    oracc_words_l[5] = oracc_words_l[5].split() #split into sub-list
    output['pos'] = oracc_words_l[5][0]
    output['epos'] = oracc_words_l[5][1]
    
    if output['lang'][0:3] == 'sux': # Sumerian or Emesal signature ([0:2] would yield only two characters)
        oracc_words_l[6] = oracc_words_l[6].replace("/", " ") # this separates norm from base in sux
        oracc_words_l[6] = oracc_words_l[6].split() #split into sub-list
        output['norm'] = oracc_words_l[6][0]
        output['base'] = oracc_words_l[6][1]
        output['morph'] = oracc_words_l[7]
    
    else:
        output['norm'] = oracc_words_l[6]
    
    return output

## Compound Orthographic Forms

Compound Orthographic Forms are combinations of two or more words that are written with a single graphemic complex. Examples are *im-ma-ti* for *ina mati* (when?) or {lu₂}EN.NAM for *bēl pīhati* (governor). In ORACC HTML, the words in a COF are combined in a single signature, separated by '&&':

> javascript:pop1sig('saao/saa10','','@saao/saa10%akk-x-neoass:im-ma-ti=ina[in//in]PRP´PRP\$ina&&@saao/saa10%akk-x-neoass:=mati[when?//when?]QP´QP\$mati')

The function `cof()` is called, when necessary, by `getline()`. It splits the signature at the '&&' separator and returns a list of signatures. The transliteration (in this case *im-ma-ti*), which is included only in the first signature, is isolated and assigned to the variable `translit`. This transliteration is inserted in the remaining signatures at the proper place.

Occasionally, in some COF signatures, the second and further words do not have their own normalization (introduced by $); this is, presumably, an irregularity in ORACC. In that case the function `supplynorm()` supplies the normalization, assuming that it is equal to the citation form.
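
The two steps described above (copying the transliteration and supplying a missing normalization) can be traced on the sample signature. The sketch below is a simplified stand-in under stated assumptions: the function name `split_cof` is hypothetical, the `javascript:pop1sig(...)` wrapper has been left out, and the `$mati` normalization of the second word has been removed by hand to show the fallback.

```python
import re


def split_cof(signature):
    """Split a COF signature at '&&' and complete each part."""
    parts = signature.split('&&')
    # The transliteration is only present in the first part
    translit = re.search(r':(.*?)=', parts[0]).group(1)
    completed = []
    for part in parts:
        # Fill the empty transliteration slot (':=') of the later parts
        part = part.replace(':=', ':' + translit + '=', 1)
        if '$' not in part:
            # No normalization given: fall back to the citation form
            part += '$' + re.search(r'=(.*?)\[', part).group(1)
        completed.append(part)
    return completed


signature = ("@saao/saa10%akk-x-neoass:im-ma-ti=ina[in//in]PRP´PRP$ina"
             "&&@saao/saa10%akk-x-neoass:=mati[when?//when?]QP´QP")
for part in split_cof(signature):
    print(part)
```

Both resulting parts carry the transliteration *im-ma-ti*, and the second one ends in `$mati`, taken over from its citation form.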


In [6]:
def cof(signature):
    signature = signature.replace("javascript:pop1sig('", "")
    signature = signature.replace("')", "")
    translit = re.search(':(.*?)=', signature).group(1) #TODO replace the expression with positive look back
                                                        # and positive look ahead expression
    cofsignatures = signature.split('&&@')
    cofsignatures = [cofsignature.replace(':=', ':' + translit + '=') for cofsignature in cofsignatures]

    def supplynorm(cofsignature):
        if '$' not in cofsignature:
            citform = re.search(r'=(.*?)\[', cofsignature).group(1)
            cofsignature = cofsignature + '$' + citform
        return cofsignature
    
    cofsignatures = [supplynorm(cofsignature) for cofsignature in cofsignatures]
        
    return cofsignatures

# Process a Line

The function `getline()` receives a line with metadata from the function `scrape()`. It returns a single variable (`line`) that contains the line's metadata and the formatted lemmata in a single string.

The function needs two arguments. The first, `line_label`, includes all the metadata of the line: `id_text`, `text_name`, and `l_no` in a single string (separated by commas). The second argument, `line`, is a `BeautifulSoup` object that holds the HTML representation of a single line.

Each line contains a series of words, represented as `signatures` in ORACC HTML. The function collects the signatures in the list `lemmas` and iterates over these. If a signature represents a Compound Orthographic Form (a combination of multiple lemmas, separated by '&&') it is sent to the function `cof()` in order to split the signature in its component forms.

Each signature is sent to `parselemma()`, where it is analyzed. The function `parselemma()` returns a dictionary (`output`) that contains all the elements of the signature (transliteration, citation form, guide word, etc.). This dictionary is sent to the function `outputformat()` which returns a string, combining elements of the signature in the desired format (the default is language:citform[GuideWord]POS, as in sux:lugal[king]N). This string is added to the list `wordsinline`. Finally, once all lemmas (signatures) have been processed, the function builds a single string out of the `line_label` (metadata) and the formatted lemmas in `wordsinline`. This string is returned in the variable `line`.
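
The flow just described (collect the signatures of a line, format each one, join them behind the label) can be tried out without any downloaded files. The sketch below uses the standard library's `html.parser` as a dependency-free stand-in for `BeautifulSoup`; the class `SignatureCollector`, the helper `short_form`, the table row, and the label values are all hypothetical and much simpler than real ORACC pages.

```python
import re
from html.parser import HTMLParser


class SignatureCollector(HTMLParser):
    """Collect the href of every <a class="cbd" href="..."> element."""

    def __init__(self):
        super().__init__()
        self.signatures = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if tag == 'a' and 'cbd' in classes and attributes.get('href'):
            self.signatures.append(attributes['href'])


def short_form(signature):
    """Reduce a signature to lang:citform[guideword]pos."""
    parts = re.search(r"%([^:]+):[^=]*=([^\[]+)\[(.*?)//.*?\]([^´]+)´", signature)
    return "{}:{}[{}]{}".format(*parts.groups())


# Hypothetical, heavily simplified table row
row = (
    '<tr class="l"><td><span class="xlabel">1</span></td><td>'
    '<a class="cbd" href="javascript:pop1sig(\'dcclt\',\'\','
    '\'@dcclt%sux:am-si=amsi[elephant//elephant]N´N$amsi/am-si#~\')">am-si</a> '
    '<a href="#note">a link that is not a lemma</a>'
    '</td></tr>'
)

collector = SignatureCollector()
collector.feed(row)

line_label = 'project/P000000,Sample text,1,'  # placeholder metadata
line = line_label + ' '.join(short_form(s) for s in collector.signatures)
print(line)
```

Only the anchor with `class="cbd"` contributes a word; the second link is ignored, just as links without that class are not collected as lemmas.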

In [7]:
def getline(line_label, line):
    wordsinline = [] #initialize list for the words in this line
    lemmas = line.findAll('a', {'class':'cbd'}, href=True)
    for eachlemma in lemmas:
        signature = eachlemma['href']
        if '&&@' in signature:  #Compound Orthographic Form, which results in multiple lemmas
            cofsignatures = cof(signature)
            for cofsignature in cofsignatures:
                output = parselemma(cofsignature)
                output_formatted = outputformat(output)
                if not output_formatted == None:
                    wordsinline.append(output_formatted)
        else:
            output = parselemma(signature)
            output_formatted = outputformat(output)
            if not output_formatted == None:
                wordsinline.append(output_formatted)
    line = line_label + ' '.join(wordsinline)
    return line

# Scrape a Single Text

The function `scrape()` takes a single text file and uses the `BeautifulSoup` package to analyze the HTML and return the data in a csv (Comma Separated Values) format. The function is called by the main process.

The function `scrape()` first identifies the name (or designation) of the text - if it cannot find the name, the text is not further processed and an error message is returned.

The function then identifies a line and sends this line to the function `getline()` for further processing. The line number is the string content of the `span` element with the attribute `class = 'xlabel'`. Text ID, text name, and line number are combined (in that order, separated by commas) into a single string, `line_label`, which is sent to `getline()` as its first argument (the second is the line itself).

If there is no `class = 'xlabel'` attribute, this means that the line belongs with the previous line as a single unit. This happens in interlinear bilinguals and, occasionally, in the representation of explanatory glosses (see, e.g. SAA 10, 044). In such cases the variable `line` from the previous iteration (which is a single string, concatenating `line_label` and the formatted lemmas) is sent, in its entirety, as the first argument to `getline()` so that the new lemma or lemmas will be added to the end of it.
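
The joining rule can be isolated from the HTML handling. The sketch below works on plain tuples instead of `BeautifulSoup` rows; the function name `join_continuations` and the sample words are hypothetical. Note one deliberate difference: this version attaches any number of consecutive unnumbered rows to the preceding numbered row, whereas `scrape()` only looks one row ahead.

```python
def join_continuations(rows):
    """Merge rows without a line number into the preceding numbered row.

    rows is a list of (l_no, words) pairs; l_no is None for an unnumbered row.
    """
    joined = []
    for l_no, words in rows:
        if not words:
            continue  # rows without lemmatized words are skipped
        if l_no is not None:
            joined.append([l_no, list(words)])  # a new numbered line
        elif joined:
            joined[-1][1].extend(words)  # continuation of the previous line
    return [(l_no, ' '.join(words)) for l_no, words in joined]


# Hypothetical rows: line 1 is followed by an unnumbered (e.g. interlinear) row
rows = [
    ('1', ['sux:lugal[king]N']),
    (None, ['akk:šarru[king]N']),
    ('2', []),
    ('3', ['sux:e[house]N']),
]
for l_no, text in join_continuations(rows):
    print(l_no, text)
```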

Finally all lines are assembled in the variable `csvformat`, which is returned to the main process.

In [8]:
def scrape(text_id):
    print('Parsing ' + text_id)
    line = ''
    csvformat ='id_text,text_name,l_no,text\n' #initialize output variable
    with open('HTML/' + text_id.replace('/', '_') + '.html') as f:
        raw_html = f.read()
    soup = BeautifulSoup(raw_html, "html.parser")
    name = soup.find('h1', {'class':'p3h2'}).string
    #if there is no text name, there are errors in atf and page was not built correctly
    if name == None:
        print(text_id + " is not built correctly.")
    else:
        #if name includes comma, replace comma with nothing
        name = name.replace(',','')
        print(text_id + ":" + name)
        lines = soup.findAll('tr', {'class':'l'})
        for index, eachline in enumerate(lines):
            if eachline.find('a', {'class':'cbd'}, href=True) == None: # if the line has no words
                continue                                               # go to next line
            if eachline.find('span', {'class': 'xlabel'}) != None:
                l_no = eachline.find('span', {'class': 'xlabel'}).string.replace(',', ';')
                line_label = text_id + ',' + name + ',' + l_no + ','
                line = getline(line_label, eachline)
                # Look ahead: a following line without a line number belongs to this line.
                # (The bounds check keeps the last line of a text from being dropped.)
                if index + 1 < len(lines):
                    nextline = lines[index + 1]
                    if nextline.find('span', {'class': 'xlabel'}) == None: #if next line has no line number
                        if nextline.find('a', {'class':'cbd'}, href=True) != None: #and has lemmatized words
                            line_label = line + ' '                         # join output with previous line
                            line = getline(line_label, nextline)
            else:
                continue
            csvformat = csvformat + line + '\n'
    return csvformat
        

# Main Process

The main process opens the list of text IDs (in the directory `Input`) to be processed and iterates over that list. The process calls the function `scrape()` which, in turn, calls the other functions defined above.

The main process creates a separate file for each of the scraped documents in the directory `Output`. A list of texts that could not be scraped is saved in the directory `Error`.
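
The file handling can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the cell below: the function `save_outputs` is hypothetical, it writes into a temporary directory rather than the notebook's working directory, it uses `os.makedirs(..., exist_ok=True)` in place of the explicit existence check, and the text ID in the error list is a placeholder.

```python
import os
import tempfile


def save_outputs(results, failed, base_dir):
    """Write one file per scraped text plus an error log; return the output file names."""
    out_dir = os.path.join(base_dir, 'Output')
    err_dir = os.path.join(base_dir, 'Error')
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    os.makedirs(err_dir, exist_ok=True)
    for text_id, csvformat in results.items():
        # Same naming scheme as the scraper: slashes become underscores
        path = os.path.join(out_dir, text_id.replace('/', '_') + '.txt')
        with open(path, mode='w', encoding='utf-8') as write_file:
            write_file.write(csvformat)
    with open(os.path.join(err_dir, 'oraccerror.txt'), mode='w', encoding='utf-8') as write_file:
        write_file.write('Not available:\n' + ''.join(text_id + '\n' for text_id in failed))
    return sorted(os.listdir(out_dir))


base = tempfile.mkdtemp()
written = save_outputs(
    {'dcclt/Q000039': 'id_text,text_name,l_no,text\n'},  # header only, no scraped lines
    ['project/P000000'],                                 # placeholder for a missing text
    base,
)
print(written)
```

Passing `encoding='utf-8'` explicitly is a design choice of this sketch; it avoids platform-dependent defaults when the lemmas contain characters such as `š` or `₂`.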

In [9]:
with open('Input/' + inputfile, mode = 'r') as f:
    textlist = f.read().splitlines()
for eachtextid in tqdm(textlist):
    sleep(0.01)
    eachtextid = eachtextid.rstrip()
    file = 'HTML/' + eachtextid.replace('/', '_') + '.html'
    try:
        csvformat = scrape(eachtextid)
        outputfile = 'Output/' + eachtextid.replace('/', '_') + '.txt'
        print("Saving " + outputfile + '\n')
    
        if not os.path.exists('Output'):
            os.mkdir('Output')
        
        with open(outputfile, mode = 'w') as writeFile:
            writeFile.write(csvformat)  

    except IOError:
        print(file + ' not available')
        textiderror = textiderror + eachtextid + '\n'

#Create error log
if not os.path.exists('Error'):
    os.mkdir('Error')
with open('Error/oraccerror.txt', mode='w') as writeFile:
    writeFile.write(textiderror)

Parsing rinap/rinap1/Q003421
rinap/rinap1/Q003421:Tiglath-pileser III 08
Saving Output/rinap_rinap1_Q003421.txt

Parsing dcclt/Q000039
dcclt/Q000039:OB Nippur Ura 01
Saving Output/dcclt_Q000039.txt

Parsing cams/gkab/P348623
cams/gkab/P348623:SpTU 2 018 [Namburbu]
Saving Output/cams_gkab_P348623.txt

Parsing saao/saa10/P334751
saao/saa10/P334751:SAA 10 044. Timing a Journey of the King (ABL 1141+) [from astrologers]
Saving Output/saao_saa10_P334751.txt

Parsing dcclt/Q000043
dcclt/Q000043:OB Nippur Ura 06
Saving Output/dcclt_Q000043.txt

Parsing blms/P274259
blms/P274259:CT 58 63 (BM 054636+) [Exam at the Scribal School]
Saving Output/blms_P274259.txt




