# TODO
The scraper output needs a field 'version' to make it compatible with ETCSL output. This field may be used for pseudo-composites such as [dcclt/Q000043](http://oracc.org/dcclt/Q000043) to indicate the designation of the tablet from which the text is taken.

# ORACC Scraper

Generalized scraper for [ORACC](http://oracc.org) files. The scraper needs an input file that lists the P, Q, or X numbers to be scraped. It creates an output file for each of these P, Q, or X numbers, with line numbers and lemmatized words.

## Set up the environment
Import the packages:
- re: regular expressions (for string manipulation)
- BeautifulSoup (for parsing HTML pages)
- tqdm (for the progress bar)
- ipywidgets (for checkboxes and text input boxes)
- IPython.display (to display widgets)

This scraper is written for Python 3.

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#! pip install tqdm
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import sys
import os
import ipywidgets as widgets
from time import sleep
from tqdm import tqdm

# Additional imports required for interactive outputter.
# 1. ipywidgets to use checkboxes and text box widgets
# 2. IPython.display to display these widgets
from ipywidgets import Checkbox, interactive
from IPython.display import display

#this program should use python unit test
PY3 = sys.version_info.major == 3

if not PY3:
    input = raw_input

print("Running under Python version:", sys.version_info[:3])

## Format Output Options
- Select output variables to display by checking the box next to the data element. 
- To provide optional values to filter by for the selected elements, type the desired value in the text box. If the text box is left blank, the output will not be filtered on that field.
- When multiple text boxes are filled with values, only output matching all the specifications is displayed.
- To output text matching any one of several values, type a comma-separated list of the acceptable values in the appropriate input field. For example, to output text that is either Akkadian or Sumerian, type akk,sux (or sux,akk) in the 'lang' text box, without spaces between the options.
- If you want to exclude certain values for an attribute (e.g. Akkadian for 'lang' or N for 'pos'), type in -akk or -N in the input box.
- The default output is "lang:citform[guide]pos", so those 4 options are initially checked. The text fields are initially empty, so there are no restrictions (nothing is filtered out). 
- You can check and uncheck any of the options, as well as provide values to filter the output by.

### Examples
- Example 1: If we only want to display nouns, make sure pos is checked, and type in 'N' in the text field next to pos.
- Example 2: If we only want to display nouns, and only if the language is Sumerian, make sure lang and pos are checked, and type in 'sux' for lang and 'N' for pos. This will only output text that is both in Sumerian and a noun.
- Example 3: If we want to keep only nouns, but display only 'lang', make sure lang is checked (and pos is unchecked), and type in 'N' for pos. The output will show only the language, restricted to nouns, even though pos isn't displayed.
- Special Edge Case: If you want to output all proper nouns, you can type QPN in the pos text box.

### Summary:
- Checkboxes are for choosing which options you want to display.
- Input boxes are for filtering text whose values match the values in the boxes.
- You can check boxes without filtering, and you can also type in values to filter by without checking the box.
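
The filter syntax described above can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: the helper names `parse_filter` and `passes` are hypothetical, and the sketch assumes that values prefixed with "-" are exclusions and all other values are inclusions.

```python
def parse_filter(box_value):
    """Split the content of one filter text box into (keep, ban) sets.

    Hypothetical helper for illustration only: values prefixed with "-"
    are exclusions, all other values are inclusions.
    """
    keep, ban = set(), set()
    for item in box_value.split(","):
        if not item:
            continue  # an empty box means "no restriction"
        if item.startswith("-"):
            ban.add(item[1:])
        else:
            keep.add(item)
    return keep, ban


def passes(value, box_value):
    """Return True if `value` survives the filter typed into the box."""
    keep, ban = parse_filter(box_value)
    if value in ban:
        return False
    # With no inclusions listed, everything that is not banned passes.
    return not keep or value in keep


print(passes("sux", "akk,sux"))  # True
print(passes("N", "-N"))         # False
print(passes("V", ""))           # True
```

An empty text box yields two empty sets, so nothing is filtered on that field.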

In [None]:
# Step 1: Create a list of all possible output options.
l = ["lang", "translit", "citform", "guide", "sense", "pos", "epos", "norm", "base", "morph"]

# Step 2: Create a checkbox widget for each element in the option list using a list comprehension. 
#         Set the description argument of checkbox equal to the variable "a", the element name.
checks = [widgets.Checkbox(description=a) for a in l]

# Step 3: Create a text box input widget for each element in the option list using a list comprehension. 
#         Set the description argument of the text box equal to the empty string, since we don't want
#         the option name to be repeated.
inputs = [widgets.Text(description="") for a in l]

# Step 4: Sets the 'lang', 'citform', 'guide', and 'pos' checkboxes as checked from the start.
for c in checks:
    if c.description in ['lang', 'citform', 'guide', 'pos']:
        c.value = True

# Step 5: Combine the list of 10 checkboxes into one vertical column widget using VBox.
checkboxes = widgets.VBox(checks)

# Step 6: Combine the list of 10 input text boxes into one vertical column widget using VBox.
inputboxes = widgets.VBox(inputs)

# Step 7: Combine the two column widgets, checkboxes and inputboxes, to be side-by-side using HBox.
combined_check_input = widgets.HBox([checkboxes, inputboxes])

# Step 8: Display the combined widget.
display(combined_check_input)

In [None]:
textiderror = 'Not available:\n'

# Input File

The input file should be located in a directory called /Input, inside the directory in which this Python Notebook is found. The file should have a .txt extension and must be created with a plain text editor such as TextEdit, Notepad, or Emacs. The file contains a simple list of P, Q, or X numbers, each preceded by the abbreviation of the ORACC project where the text is edited. For instance:

    rinap/rinap1/Q003421
    dcclt/Q000039
    cams/gkab/P348623

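Before running this scraper, use the same input file to download the text material (in HTML format) with the ORACC Downloader. Note that the code below reads these files from the directory `HTML` (for example `HTML/dcclt_Q000039.html`), so the downloaded files must be available there.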
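
Each entry in the list can be split into a project path and a text number, and mapped onto the file name that `scrape()` opens (slashes replaced by underscores, inside the `HTML` directory). A small sketch of that convention; the helper names `split_text_id` and `html_path` are hypothetical and not part of this notebook.

```python
def split_text_id(text_id):
    """Split 'cams/gkab/P348623' into ('cams/gkab', 'P348623')."""
    project, _, number = text_id.strip().rpartition("/")
    return project, number


def html_path(text_id):
    """File name convention used by scrape(): slashes become underscores."""
    return "HTML/" + text_id.strip().replace("/", "_") + ".html"


for entry in ["rinap/rinap1/Q003421", "dcclt/Q000039", "cams/gkab/P348623"]:
    print(split_text_id(entry), html_path(entry))
```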

Run everything above this cell, select the output format options you wish to display with the checkboxes, and type in anything you want to include (e.g. V for verbs in the 'pos' field, akk for Akkadian in the 'lang' field), or exclude (put a "-" minus sign in front of what you want to exclude, e.g. -N in 'pos' excludes all nouns, -sux in 'lang' means no Sumerian) in the text boxes. Then run everything below this cell and type in the Input .txt file you want to scrape.

In [None]:
inputfile = input("Name of Input List: ")

# Format Output

The function outputformat() defines what the output of the lemmatized forms will look like. This function may be adapted in various ways to produce different types of output. The function takes a dictionary as input with the following keys: 

- 1     lang: Language
- 2     translit: Transliteration
- 3     citform: Citation Form
- 4     guide: Guide Word
- 5     sense: Sense
- 6     pos: Part of Speech
- 7     epos: Effective Part of Speech
- 8     norm: Normalization
- 9     base: Base (Sumerian only)
- 10    morph: Morphology (Sumerian only)

In the standard format the output will look like: sux:lugal[king]N. One may adapt the output, for instance, by omitting the element lang (lugal[king]N), or by selecting only certain parts of speech or certain language codes. For instance:
```python
    if output['pos'] == 'N':
        output_formatted = output['citform'] + "\t" + output['guide']
```
This will create output in the form lugal   king (lugal and king separated by TAB), selecting only Nouns (excluding Proper Nouns).

```python
    if output['lang'] == 'sux-x-emesal':
        output_formatted = output['citform'] + "[" + output['guide'] + "]" + output['pos']
```
This will create output in the form umun[lord]N, selecting only Emesal words.
In order to select both Sumerian (sux) and Emesal (sux-x-emesal) forms one could use:
```python
    if output['lang'][0:3] == 'sux':
```


This code may be used (instead of the next cell) to only return proper nouns in the format CF[GW]POS:
```python
def outputformat(output):
    #output_formatted = ''
    QPN = ('AN', 'CN', 'DN', 'EN', 'FN', 'GN', 'LN', 'MN', 'ON', 'PN', 'QN', 'RN', 'SN', 'TN', 'WN', 'YN')
    if output['pos'] in QPN:
        output_formatted = output['citform'] + "[" + output['guide'] + "]" + output['pos']
    else:
        return None
    return output_formatted
```
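
To see what such an adapted function does, it can be applied to hand-made dictionaries. This is a sketch only: the function name `sux_nouns_only` is hypothetical, and the sample values were typed in by hand to mimic the keys listed above; they do not come from an actual ORACC file.

```python
def sux_nouns_only(output):
    """Keep only Sumerian and Emesal nouns, as citform<TAB>guide word."""
    if output['lang'][0:3] == 'sux' and output['pos'] == 'N':
        return output['citform'] + "\t" + output['guide']
    return None


# Hand-made sample dictionaries (illustrative values, not scraped data).
samples = [
    {'lang': 'sux', 'citform': 'lugal', 'guide': 'king', 'pos': 'N'},
    {'lang': 'sux-x-emesal', 'citform': 'umun', 'guide': 'lord', 'pos': 'N'},
    {'lang': 'akk', 'citform': 'šarru', 'guide': 'king', 'pos': 'N'},
]

for sample in samples:
    print(repr(sux_nouns_only(sample)))
```

The first two samples are returned as tab-separated strings; the Akkadian one yields `None` and would therefore be left out of the output.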

In [None]:
# List of proper nouns for the special QPN edge case.
QPN = ('AN', 'CN', 'DN', 'EN', 'FN', 'GN', 'LN', 'MN', 'ON', 'PN', 'QN', 'RN', 'SN', 'TN', 'WN', 'YN')

# Step 1: Use a list comprehension to store all checkboxes that are true. "for c in checks" iterates
#         through all the checkboxes, and "if c.value" only keeps the ones that are marked true.
#         We store the checkboxes marked true in a tuple, (option name, corresponding value in the input box).
#         Option name is just the description argument in checkbox. We get the value of the input box
#         by finding the equivalent index of c (checks.index(c)) in the inputs list, and then getting the value
#         of this input box element. This relies on the fact that the index of the checkbox corresponds 
#         to the index of the input box. For example, index 0 for the checkbox list is 'lang', and
#         index 0 for the inputs list contains the value of 'lang' if we want to filter it.

options_selected = [(c.description, inputs[checks.index(c)].value) for c in checks if c.value]

# Step 2: Use a list comprehension to store all restrictions, i.e. text fields that are filled. 
#         "for i in inputs" iterates through all the input boxes, and "i.value != '' " only keeps 
#         the ones that are filled with some text. We store the restrictions in a tuple, 
#         (option name in corresponding checkbox, value in the input box). Option name is the description 
#         argument of the corresponding checkbox element. We can get the option name using the 
#         same corresponding index strategy as in step 1. i.value is the value in the input box.
#         i.value != '' and i.value[0] != "-" tells us which values should be the only ones included.
#         i.value != '' and i.value[0] == "-" means the value starts with a negative, so we exclude it.

keep = [(checks[inputs.index(i)].description, i.value) for i in inputs if i.value != '' and i.value[0] != "-"]
ban = [(checks[inputs.index(i)].description, i.value) for i in inputs if i.value != '' and i.value[0] == "-"]

# Step 3: Write a function that takes in the output and a list of restrictions.
#         Returns True if the output meets every restriction.
#         Returns False if it fails to meet at least one restriction.

def conforms(output, keep, ban):
    for k in keep:
        # Special edge case for QPN proper nouns.
        # The continue statement stops the current iteration, and
        # forces the for loop to move on to the next restriction.
        if k[0] == 'pos' and k[1] == 'QPN' and output['pos'] in QPN:
            continue
        
        # If the output's value (output[k[0]]) for an element, k[0],
        # is not in the list of permitted values (k[1]) entered in the text box
        # for that element, immediately return False.
        # .get() avoids a KeyError for fields that are absent (base, morph).
        if output.get(k[0]) not in k[1].split(','):
            return False
        
    for b in ban:
        # Special edge case for QPN proper nouns.
        # If text is -QPN, exclude all QPN words.
        if b[0] == 'pos' and b[1][1:]  == 'QPN' and output['pos'] in QPN:
            return False
        
        # If the output's value (output[b[0]]) for an element, b[0],
        # is on the excluded list, immediately return False.
        # .get() avoids a KeyError for fields that are absent (base, morph).
        if "-" + output.get(b[0], '') in b[1].split(','):
            return False
    return True

# Step 4: Given the output, the data element, and the list of all options checked, add the 
#         appropriate separators before and after the output's value for this data element.
#         Return the string of the output's value for this data element with punctuation added.

def addSeparators(output, field, all_options):
    # This if statement deals with the case when 'morph' and/or 'base' are checked
    # and the output is not Sumerian. Only Sumerian and Emesal words have morph and base elements. 
    # If the language of the text does not start with 'sux' and we are adding
    # separators for 'base' or 'morph', this function returns the empty string
    # to avoid a KeyError.
    
    if output['lang'][0:3] != 'sux' and (field == 'base' or field == 'morph'):
        return ""
    
    # The text variable holds the value of the element in the text.
    text = output[field]
    
    # Depending on which data element we are working with, we have to prepend
    # or append certain punctuation marks to complete the ORACC signature.
    # If the data element we are considering does not need any additional
    # punctuation, we simply return the text variable at the end.
    
    if field == 'lang':
        return text + ":"
    if field == 'translit':
        return text + "="
    
    # For the guide word and sense case, if only one is checked, surround it with [].
    # If both are checked, put the "[" before the guide word, the "//" before the
    # sense, and "]" after the sense.
    
    if field == 'guide':
        if 'sense' not in [a[0] for a in all_options]:
            return "[" + text + "]"
        return "[" + text
    if field == 'sense':
        if 'guide' not in [a[0] for a in all_options]:
            return "[" + text + "]"
        return "//" + text + "]"
    
    if field == 'epos':
        return "'" + text
    if field == 'norm':
        return "$" + text
    if field == 'base':
        return "/" + text
    if field == 'morph':
        return "#" + text
    return text
    
# Step 5: Write the function that builds the actual output
#         given a dictionary of the values for the data elements
#         in the text.

def outputformat(output):
    # Only output text that meets the restrictions.
    if conforms(output, keep, ban):
        
        # Start with an empty string.
        output_formatted = ''
        
        # For all data elements checked
        for options in options_selected:
            # Concatenate the value of each data element checked 
            # with its punctuation to the existing formatted output.
            # options[0] gives the name of the data element.
            
            output_formatted += addSeparators(output, options[0], options_selected)
            
        return output_formatted
    
    # Texts that do not meet the restrictions return None.
    else:
        return None

# Parse an ORACC lemma

The function `parselemma()` parses a so-called ORACC 'signature.' It is called by the `getline()` function where these signatures are extracted from the html files. A signature, as extracted by the `getline()` function, looks as follows:

> javascript:pop1sig('dcclt','','@dcclt%sux:am-si=amsi[elephant//elephant]N´N$amsi/am-si#~').

Akkadian signatures look slightly different, lacking the last two elements (after the /). The `parselemma()` function removes all the superfluous elements and breaks the string up into its parts. It returns a dictionary that lists all of these parts. The function `getline()` forwards this dictionary to the function `outputformat()`, which uses its keys and values to build a usable data structure and/or to filter the data.
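
The structure of a signature can also be made explicit with a single regular expression. The following sketch is an alternative illustration, not the `parselemma()` function used in this notebook. The pattern `SIGNATURE` and the function `parse_signature` are assumptions based only on the example signature shown above, and real signatures may contain cases they do not cover.

```python
import re

# Hypothetical pattern derived from the example signature above.
SIGNATURE = re.compile(
    r"%(?P<lang>[^:]+):(?P<translit>[^=]*)=(?P<citform>[^\[]+)"
    r"\[(?P<guide>.*?)//(?P<sense>.*?)\]"
    r"(?P<pos>[^´]+)´(?P<epos>[^$]+)\$(?P<norm>[^/#]+)"
    r"(?:/(?P<base>[^#]+)#(?P<morph>.*))?$"
)


def parse_signature(href):
    """Return the parts of a signature as a dictionary, or None."""
    if href.endswith("')"):
        href = href[:-2]  # drop the closing ') of the javascript call
    match = SIGNATURE.search(href)
    if match is None:
        return None
    # base and morph are absent in non-Sumerian signatures; leave them out.
    return {key: value for key, value in match.groupdict().items()
            if value is not None}


href = ("javascript:pop1sig('dcclt','','"
        "@dcclt%sux:am-si=amsi[elephant//elephant]N´N$amsi/am-si#~')")
print(parse_signature(href))
```

Unlike `parselemma()`, this sketch leaves spaces and commas in guide word and sense untouched.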

In [None]:
def parselemma(signature):
    output = {}
    signature = signature.replace(' ', '-')
    signature = signature.replace(',', ';') #remove spaces and commas from GuideWord and Sense
    oracc_words = re.sub(r"'\)$", "", signature) #remove ') from the end of the signature
    oracc_words = re.sub('^[^%]*%', '', oracc_words) #remove everything before the first % in the signature
    
    sep_char = [":", "=", "$", "#", "[", "]", "//"] # these are the character sequences that separate
                                                    # the various elements of the signature
    for eachchar in sep_char:
        oracc_words = oracc_words.replace(eachchar, " ", 1) # ':' may appear in guideword/sense, so replace only once
    
    oracc_words_l = oracc_words.split() #split signature into a list
    output['lang'] = oracc_words_l[0]
    output['translit'] = oracc_words_l[1]
    output['citform'] = oracc_words_l[2]
    output['guide'] = oracc_words_l[3]
    output['sense'] = oracc_words_l[4]

    oracc_words_l[5] = oracc_words_l[5].replace("´", " ") # this separates pos from epos
    oracc_words_l[5] = oracc_words_l[5].split() #split into sub-list
    output['pos'] = oracc_words_l[5][0]
    output['epos'] = oracc_words_l[5][1]
    
    if output['lang'][0:3] == 'sux': # Sumerian or Emesal signature
        oracc_words_l[6] = oracc_words_l[6].replace("/", " ") # this separates norm from base in sux
        oracc_words_l[6] = oracc_words_l[6].split() #split into sub-list
        output['norm'] = oracc_words_l[6][0]
        output['base'] = oracc_words_l[6][1]
        output['morph'] = oracc_words_l[7]
    
    else:
        output['norm'] = oracc_words_l[6]
    
    return output

## Compound Orthographic Forms

Compound Orthographic Forms are combinations of two or more words that are written with a single graphemic complex. Examples are *im-ma-ti* for *ina mati* (when?) or {lu₂}EN.NAM for *bēl pīhati* (governor). In ORACC HTML, the words in a COF are combined in a single signature, separated by '&&':

> javascript:pop1sig('saao/saa10','','@saao/saa10%akk-x-neoass:im-ma-ti=ina[in//in]PRP´PRP\$ina&&@saao/saa10%akk-x-neoass:=mati[when?//when?]QP´QP\$mati')

The function `cof()` is called, when necessary, by `getline()`. It splits the signature at the '&&' separator and returns a list of signatures. The transliteration (in this case *im-ma-ti*), which is included only in the first signature, is isolated and assigned to the variable `translit`. This transliteration is inserted in the remaining signatures at the proper place.

Occasionally, in some COF signatures, the second and further words do not have their own normalization (introduced by $) - this is, presumably, an irregularity in ORACC. If this is the case, normalization is supplied by assuming that it is equal to the citation form in the function `supplynorm()`.
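
Applied to the example above (without the `javascript:pop1sig(...)` wrapper), the splitting and transliteration steps give the following. This is a stand-alone sketch that mirrors the description; the variable names are chosen for illustration.

```python
import re

# The example COF signature, without the javascript wrapper.
signature = (
    "@saao/saa10%akk-x-neoass:im-ma-ti=ina[in//in]PRP´PRP$ina"
    "&&@saao/saa10%akk-x-neoass:=mati[when?//when?]QP´QP$mati"
)

# The transliteration stands between the first ':' and the first '='.
translit = re.search(r":(.*?)=", signature).group(1)

# Split at the COF separator and fill in the missing transliteration.
parts = [part.replace(":=", ":" + translit + "=")
         for part in signature.split("&&@")]

print(translit)
for part in parts:
    print(part)
```

Both resulting signatures now carry the transliteration *im-ma-ti* and can be parsed like any other signature.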


In [None]:
def cof(signature):
    signature = signature.replace("javascript:pop1sig('", "")
    signature = signature.replace("')", "")
    translit = re.search(':(.*?)=', signature).group(1) #TODO replace the expression with positive look back
                                                        # and positive look ahead expression
    cofsignatures = signature.split('&&@')
    cofsignatures = [cofsignature.replace(':=', ':' + translit + '=') for cofsignature in cofsignatures]

    def supplynorm(cofsignature):
        # If the normalization (introduced by $) is missing, use the citation form.
        if '$' not in cofsignature:
            citform = re.search(r'=(.*?)\[', cofsignature).group(1)
            cofsignature = cofsignature + '$' + citform
        return cofsignature
    
    cofsignatures = [supplynorm(cofsignature) for cofsignature in cofsignatures]
        
    return cofsignatures

# Process a Line

The function `getline()` receives a line with metadata from the function `scrape()`. It returns a single variable (`line`) that contains the line's metadata and the formatted lemmata in a single string. 

The function needs two arguments. The first, `line_label`, includes all the metadata of the line: `id_text`, `text_name`, and `l_no` in a single string (separated by commas). The second argument, `line`, is a `BeautifulSoup` object that holds the HTML representation of a single line.

Each line contains a series of words, represented as `signatures` in ORACC HTML. The function collects the signatures in the list `lemmas` and iterates over these. If a signature represents a Compound Orthographic Form (a combination of multiple lemmas, separated by '&&') it is sent to the function `cof()` in order to split the signature in its component forms.

Each signature is sent to `parselemma()`, where it is analyzed. The function `parselemma()` returns a dictionary (`output`) that contains all the elements of the signature (transliteration, citation form, guide word, etc.). This dictionary is sent to the function `outputformat()` which returns a string, combining elements of the signature in the desired format (the default is language:citform[guide]pos, as in sux:lugal[king]N). This string is added to the list `wordsinline`. Finally, once all lemmas (signatures) have been processed, the function builds a single string out of the `line_label` (metadata) and the formatted lemmas in `wordsinline`. This string is returned in the variable `line`.
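
The first step, collecting the signatures from the links with `class="cbd"`, can be tried out without any downloaded files. The sketch below deliberately uses the standard-library `html.parser` instead of `BeautifulSoup`, so that it runs without extra packages; the class `SignatureCollector` is hypothetical and the HTML row is hand-made for illustration, not copied from ORACC.

```python
from html.parser import HTMLParser


class SignatureCollector(HTMLParser):
    """Collect the href of every <a class="cbd" href="..."> element."""

    def __init__(self):
        super().__init__()
        self.signatures = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "cbd" in classes and attributes.get("href"):
            self.signatures.append(attributes["href"])


# Hand-made HTML for one line (illustrative only).
row = (
    '<tr class="l"><td><span class="xlabel">o 1</span></td><td>'
    '<a class="cbd" href="javascript:pop1sig(\'dcclt\',\'\','
    '\'@dcclt%sux:lugal=lugal[king//king]N´N$lugal/lugal#~\')">lugal</a> '
    '<a href="/dcclt/help">help</a>'
    '</td></tr>'
)

collector = SignatureCollector()
collector.feed(row)
print(collector.signatures)
```

Only the first link is collected; the second has no `cbd` class and is ignored, just as unlemmatized material is skipped by `getline()`.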

In [None]:
def getline(line_label, line):
    wordsinline = [] #initialize list for the words in this line
    lemmas = line.find_all('a', {'class':'cbd'}, href=True)
    for eachlemma in lemmas:
        signature = eachlemma['href']
        if '&&@' in signature:  #Compound Orthographic Form, which results in multiple lemmas
            signatures = cof(signature)
        else:
            signatures = [signature]
        for eachsignature in signatures:
            output = parselemma(eachsignature)
            output_formatted = outputformat(output)
            if output_formatted is not None:  #None means the word was filtered out
                wordsinline.append(output_formatted)
    line = line_label + ' '.join(wordsinline)
    return line

# Scrape a Single Text

The function `scrape()` takes the ID of a single text, opens the corresponding HTML file, and uses the `BeautifulSoup` package to analyze the HTML and return the data in csv (Comma Separated Values) format. The function is called by the main process.

The function `scrape()` first identifies the name (or designation) of the text - if it cannot find the name, the text is not further processed and an error message is returned.

The function then identifies a line and sends this line to the function `getline()` for further processing. The line number is a string that belongs to the attribute `class = 'xlabel'`. Text name, text id and line number are combined into a single string as `line_label` - which is sent to `getline()` as its first argument (the second is the line itself).

If there is no `class = 'xlabel'` attribute, this means that the line belongs with the previous line as a single unit. This happens in interlinear bilinguals and, occasionally, in the representation of explanatory glosses (see, e.g. SAA 10, 044). In such cases the variable `line` from the previous iteration (which is a single string, concatenating `line_label` and the formatted lemmas) is sent, in its entirety, as the first argument to `getline()` so that the new lemma or lemmas will be added to the end of it.

Finally all lines are assembled in the variable `csvformat`, which is returned to the main process.
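
The way a row of `csvformat` is put together, and why commas are removed from names and line numbers, can be seen in a small stand-alone sketch. All values below are hand-made for illustration (they are not scraped data); the standard-library `csv` module is used only to check that the row reads back with the expected columns.

```python
import csv
import io

# Hand-made values for illustration (not scraped data).
text_id = "dcclt/Q000039"
name = "Sample Text, Part 1".replace(",", "")  # commas are removed from names
version = ""                                   # empty unless the text has versions
l_no = "o i 1".replace(",", ";")               # commas in line numbers become ";"
wordsinline = ["sux:lugal[king]N", "sux:e[house]N"]

# Header and one row, built in the same way as in scrape() and getline().
csvformat = "id_text,text_name,version,l_no,text\n"
line_label = text_id + "," + name + "," + version + "," + l_no + ","
csvformat += line_label + " ".join(wordsinline) + "\n"

rows = list(csv.DictReader(io.StringIO(csvformat)))
print(rows[0]["text_name"])
print(rows[0]["text"])
```

Had the comma in the name been left in place, the row would contain six fields instead of five and the columns would no longer line up with the header.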

In [None]:
def scrape(text_id):
    print('Parsing ' + text_id)
    line = ''
    csvformat ='id_text,text_name,version,l_no,text\n' #initialize output variable
    with open('HTML/' + text_id.replace('/', '_') + '.html', encoding='utf8') as f:
        raw_html = f.read()
    soup = BeautifulSoup(raw_html, "html.parser")
    title = soup.find('h1', {'class':'p3h2'})
    name = title.string if title is not None else None
    #if there is no text name, there are errors in atf and page was not built correctly
    if name is None:
        print(text_id + " is not built correctly.")
    else:
        #if name includes a comma, remove it
        name = name.replace(',','')
        print(text_id + ":" + name)
        #all line HTML tags and also tags of the form
        #<div dc:title="..."> which contains the versions
        #as their text. We will find all lines containing
        #signatures plus the version headers if applicable
        lines = soup.find_all(lambda tag: (tag.name == 'tr' and 'class' in tag.attrs and tag.attrs['class'][0] == 'l') 
                                    or (tag.name == 'div' and 'dc:title' in tag.attrs))
        version = '' # set default version to empty string at the very beginning
        for index, eachline in enumerate(lines):
            #all versions are included as text inside <a target="_blank"> tags, 
            #so we look for these tags for each line
            #if text never has any versions, BeautifulSoup will never
            #find these 'a' tags so version is the default value ""
            subtitle = eachline.find('a', {'target': '_blank'})
            #check if the line contains this type of tag
            if subtitle:
                version = subtitle.string.replace(',', '') # assign the tag's text to the version variable
            if eachline.find('a', {'class':'cbd'}, href=True) == None: # if the line has no words
                continue                                               # go to next line
            if eachline.find('span', {'class': 'xlabel'}) != None and eachline.find('span', {'class': 'xlabel'}).string != None:
                l_no = eachline.find('span', {'class': 'xlabel'}).string.replace(',', ';')
                line_label = text_id + ',' + name + ',' + version + ',' + l_no + ','
                line = getline(line_label, eachline)
                # Look ahead: a following line without a line number but with
                # lemmatized words belongs to the current line. Checking the index
                # (instead of try/except with continue) ensures that the current
                # line is still written when there is no such following line.
                if index + 1 < len(lines):
                    nextline = lines[index + 1]
                    if (nextline.find('span', {'class': 'xlabel'}) is None and
                            nextline.find('a', {'class':'cbd'}, href=True) is not None):
                        line_label = line + ' '              # join output with previous line
                        line = getline(line_label, nextline)
            else:
                continue
            csvformat = csvformat + line + '\n'
    return csvformat
        

# Main Process

The main process opens the list of text IDs (in the directory `Input`) to be processed and iterates over that list. The process calls the function `scrape()` which, in turn, calls the other functions defined above.

The main process creates a separate file for each of the scraped documents in the directory `Output`. A list of texts that could not be scraped is saved in the directory `Error`.

In [None]:
with open('Input/' + inputfile, mode = 'r', encoding='utf8') as f:
    textlist = f.read().splitlines()
for eachtextid in tqdm(textlist):
    sleep(0.01)
    eachtextid = eachtextid.rstrip()
    file = 'HTML/' + eachtextid.replace('/', '_') + '.html'
    try:
        csvformat = scrape(eachtextid)
        outputfile = 'Output/' + eachtextid.replace('/', '_') + '.txt'
        print("Saving " + outputfile + '\n')
    
        if not os.path.exists('Output'):
            os.mkdir('Output')
        
        with open(outputfile, mode = 'w', encoding='utf8') as writeFile:
            writeFile.write(csvformat)  

    except IOError:
        print(file + ' not available')
        textiderror = textiderror + eachtextid + '\n'

#Create error log
if not os.path.exists('Error'):
    os.mkdir('Error')
with open('Error/oraccerror.txt', mode='w', encoding='utf8') as writeFile:
    writeFile.write(textiderror)