# Tutorial - Using XML Tools on ETCBC Hebrew Data

None of what follows would have been possible without:
1. The work of ETCBC whose data you can find at https://github.com/ETCBC/bhsa/ and
2. The example set by Jonathan Robie with the NT at https://github.com/biblicalhumanities/greek-new-testament/blob/master/labnotes/lxml-tutorial.ipynb

To do something analogous with the ETCBC's data, I first had to convert it to an XML tree structure. This XML is available at https://github.com/jcuenod/tf_to_xml/tree/master/output in separate files, but `root.xml` contains the whole OT. To see how I generated this data, see the other notebook: https://github.com/jcuenod/tf_to_xml/blob/master/convert_tf_bhs_to_xml.ipynb

## Importing XML data

We're going to use `etree` which is part of the `lxml` package to import and parse our data:

In [1]:
from lxml import etree
tree = etree.parse("./output/root.xml")

Now that we have imported it, we can check to see whether all the books are there.

In [2]:
books = tree.xpath('/base/book')
len(books)

39

Or, maybe we should list the books just to be sure.

In [3]:
for book in books:
    print(book.get("id"))

Gen
Exod
Lev
Num
Deut
Josh
Judg
1Sam
2Sam
1Kgs
2Kgs
Isa
Jer
Ezek
Hos
Joel
Amos
Obad
Jonah
Mic
Nah
Hab
Zeph
Hag
Zech
Mal
Ps
Job
Prov
Ruth
Song
Eccl
Lam
Dan
Ezra
Neh
Esth
1Chr
2Chr


Success!

# Sentences, Clauses and Phrases

Now let's find textual units like sentences. To do so, let's first take note of the structure of the XML:

```xml
<book id="Gen">
    <sentence n="1172209">
      <milestone unit="verse" id="Gen.1.1">Gen.1.1</milestone>
      <p>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ </p>
      <wg n="427553" level="clause" class="x-qatal-X clause">
        <wg n="651503" level="phrase" class="Prepositional phrase">
          <w n="1" lemma="בְּ" partOfSpeech="prep" gloss="in">בְּ</w>
          <w n="2" lemma="רֵאשִׁית" partOfSpeech="subs" number="sg" gender="f" state="a" gloss="beginning">רֵאשִׁ֖ית</w>
        ...
```

A few features worth noting include:
- A milestone occurs at the beginning of every sentence. This means that sometimes there is more than one milestone that refers to a single verse because some verses contain more than one sentence.
- A milestone occurs at the beginning of every verse. This means that sometimes there is more than one verse milestone within a single sentence because some sentences span more than one verse.
- `<wg>` elements mark both clauses and phrases, following the structure I saw in Jonathan Robie's sample data. Clauses always contain at least one phrase, just as sentences always contain at least one clause.

Let's search for all sentences in our tree using the "`//`" find-anywhere syntax.

In [4]:
sentences = tree.xpath('//sentence')
len(sentences)

63711

Looks good. Now what about clauses and phrases? They both use `<wg>` elements, and although their `class` values could distinguish them, we can also use their position in the hierarchy or the `level` attribute to find them.

In [5]:
clauses = tree.xpath('//sentence/wg') # same as '//wg[@level="clause"]'
len(clauses)

88101

In [6]:
phrases = tree.xpath('//sentence/wg/wg') # same as '//wg[@level="phrase"]'
len(phrases)

253187
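If you want to convince yourself that the hierarchy-based and `level`-based queries really select the same elements, you can try them on a small in-memory sample. This is only a sketch: the fragment below is made up to mirror the structure shown earlier (the `n` values and transliterated words are placeholders, not real ETCBC data), and `ids` is a throwaway helper of my own.

```python
from lxml import etree

# Made-up fragment mirroring the structure above; not real ETCBC data.
SAMPLE = """
<base>
  <book id="Gen">
    <sentence n="1">
      <milestone unit="verse" id="Gen.1.1">Gen.1.1</milestone>
      <wg n="10" level="clause" class="x-qatal-X clause">
        <wg n="100" level="phrase" class="Prepositional phrase">
          <w n="1" partOfSpeech="prep">b</w>
          <w n="2" partOfSpeech="subs">reshit</w>
        </wg>
        <wg n="101" level="phrase" class="Verbal phrase">
          <w n="3" partOfSpeech="verb">bara</w>
        </wg>
      </wg>
    </sentence>
  </book>
</base>
"""
root = etree.fromstring(SAMPLE.strip())


def ids(elements):
    """Hypothetical helper: reduce elements to their n attributes."""
    return [e.get("n") for e in elements]


clauses_by_path = ids(root.xpath('//sentence/wg'))
clauses_by_level = ids(root.xpath('//wg[@level="clause"]'))
phrases_by_path = ids(root.xpath('//sentence/wg/wg'))
phrases_by_level = ids(root.xpath('//wg[@level="phrase"]'))

# Both routes should pick out the same elements in the same order.
print(clauses_by_path == clauses_by_level)
print(phrases_by_path == phrases_by_level)
```

The same comparison can be run against the full `tree`, where it should hold as long as every clause `<wg>` sits directly under a `<sentence>`.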

Now that we have the hierarchy, we can traverse it. But what about finding verses? The problem is that they don't map perfectly to sentences or clauses. They are marked by `<milestone>` elements.

In [7]:
verses = tree.xpath('//milestone')
len(verses)

65191

Hang on... sometimes these milestones have the same value. Look at the sentences that make up Deut 5:1:

In [8]:
for sentence in tree.xpath('//sentence[ milestone[@id="Deut.5.1"] ]/p'):
    print(sentence.text)

וַיִּקְרָ֣א מֹשֶׁה֮ אֶל־כָּל־יִשְׂרָאֵל֒ 
וַיֹּ֣אמֶר אֲלֵהֶ֗ם 
שְׁמַ֤ע יִשְׂרָאֵל֙ אֶת־הַחֻקִּ֣ים וְאֶת־הַמִּשְׁפָּטִ֔ים אֲשֶׁ֧ר אָנֹכִ֛י דֹּבֵ֥ר בְּאָזְנֵיכֶ֖ם הַיֹּ֑ום 
וּלְמַדְתֶּ֣ם אֹתָ֔ם 
וּשְׁמַרְתֶּ֖ם לַעֲשֹׂתָֽם׃ 


It's worth paying attention to that query. We are selecting the `<p>` child of any `<sentence>` element that has a `<milestone>` whose `id` attribute matches "Deut.5.1".

As you can see, Deut 5:1 contains 5 sentences according to the ETCBC data.

So to loop back to our question:
**How many verses do we *actually* have?**

To answer that, let's get the text from each of our so-called verses (note, the "text" is the verse reference; cf. the text of `<milestone>` elements above):

In [9]:
verse_text = [v.text for v in verses]

Now let's convert our `list` of verse references to a `set` because sets only contain unique values.

In [10]:
unique_verse_list = set(verse_text)
len(unique_verse_list)

23208

Well that's a lot lower than 65191... (but it's accurate)
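The two mismatches between verses and sentences (one verse spread over several sentences, one sentence spanning several verses) can be picked out separately. Here is a minimal sketch on a made-up fragment; the book id `X` and the references are placeholders, and I am assuming that extra verse milestones sit somewhere inside the `<sentence>` element, which is why the count uses `.//milestone`.

```python
from lxml import etree

# Made-up fragment: verse X.1.1 is split over two sentences,
# and sentence 3 spans verses X.1.2 and X.1.3.
SAMPLE = """
<base><book id="X">
  <sentence n="1"><milestone unit="verse" id="X.1.1">X.1.1</milestone></sentence>
  <sentence n="2"><milestone unit="verse" id="X.1.1">X.1.1</milestone></sentence>
  <sentence n="3">
    <milestone unit="verse" id="X.1.2">X.1.2</milestone>
    <milestone unit="verse" id="X.1.3">X.1.3</milestone>
  </sentence>
</book></base>
"""
root = etree.fromstring(SAMPLE.strip())

milestones = root.xpath('//milestone')
unique_refs = {m.get("id") for m in milestones}

# Sentences that span more than one verse carry several milestones.
multi_verse = [s.get("n") for s in root.xpath('//sentence[count(.//milestone) > 1]')]

# Verses split across sentences: their reference shows up more than once.
seen, split_verses = set(), set()
for m in milestones:
    ref = m.get("id")
    if ref in seen:
        split_verses.add(ref)
    seen.add(ref)

print(len(milestones), len(unique_refs))
print(multi_verse, sorted(split_verses))
```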


# Finding Textual Units by Features

Now that we know how to find verses, sentences, etc., let's try to find a specific word.

Let's begin by looking for all the piel verbs. To find verbs we could just use:

```python
verbs = tree.xpath('//w[@partOfSpeech="verb"]')
```

but we know that only verbs occur in the piel so we don't need to worry about the part of speech.

In [11]:
piels = tree.xpath('//w[@stem="piel"]')
len(piels)

6811

What other stems are contained in this data set?

In [12]:
# We'll start by finding the verbs
verbs = tree.xpath('//w[@partOfSpeech="verb"]')
# Then we'll loop through them adding each stem to a set (which only contains unique elements)
stem_set = set()
for w in verbs:
    stem_set.add(w.get("stem"))

stem_set

{'afel',
 'etpa',
 'etpe',
 'haf',
 'hif',
 'hit',
 'hof',
 'hotp',
 'hsht',
 'htpa',
 'htpe',
 'htpo',
 'nif',
 'nit',
 'pael',
 'pasq',
 'peal',
 'peil',
 'piel',
 'poal',
 'poel',
 'pual',
 'qal',
 'shaf',
 'tif'}

There are a bunch of stems that might not look very familiar to you. That's probably because they're Aramaic stems (this data includes the Aramaic portions of the OT). You may also notice that some stems seem to be missing, like "Pilpel" and "Polal". ETCBC data parses these words with the "piel" stem, so those stems will not appear here.
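As an alternative to looping over the elements, XPath can select the attribute values directly by ending the path in `/@stem`; lxml then returns a list of strings. A small sketch on made-up words (the sample is mine, not ETCBC data):

```python
from lxml import etree

# Made-up words; only the attributes matter here.
SAMPLE = """
<base>
  <w partOfSpeech="verb" stem="qal">a</w>
  <w partOfSpeech="verb" stem="piel">b</w>
  <w partOfSpeech="verb" stem="qal">c</w>
  <w partOfSpeech="subs">d</w>
</base>
"""
root = etree.fromstring(SAMPLE.strip())

# Ending the path in /@stem yields the attribute values themselves.
stems = root.xpath('//w[@partOfSpeech="verb"]/@stem')
stem_list = sorted(set(stems))
print(stem_list)
```

One difference from the loop: a verb without a `stem` attribute is simply skipped here, whereas `w.get("stem")` would add `None` to the set.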

What if we want to find a word by more than one feature? Easy!

In [13]:
imperative_piels = tree.xpath('//w[@stem="piel" and @tense="impv"]')
len(imperative_piels)

458
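If you find yourself combining features often, the predicate can be assembled from keyword arguments. The `word_query` function below is a hypothetical helper of my own (it is not part of lxml), and it only handles simple equality tests joined with `and`.

```python
def word_query(**features):
    """Build an XPath for <w> elements that match every given attribute.

    Hypothetical helper, not part of lxml. Values are placed inside a
    double-quoted XPath literal, so they may not contain a double quote.
    """
    for value in features.values():
        if '"' in value:
            raise ValueError("feature values may not contain double quotes")
    predicate = " and ".join(
        '@{}="{}"'.format(name, value) for name, value in features.items()
    )
    return "//w[{}]".format(predicate) if predicate else "//w"


query = word_query(stem="piel", tense="impv")
print(query)  # //w[@stem="piel" and @tense="impv"]
```

The resulting string can be passed straight to `tree.xpath(...)`.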

Now let's try a more typical search for a word by its lemma. Let's say we're looking for Moses - in Hebrew, "מֹשֶׁה".

Unicode data presents us with a problem here. A "שׁ" may be represented in two ways: as a `ש + a shin dot` or as a single pre-composed character. Some of the difficulties that arise may be seen at https://unicode.org/reports/tr15/. Suffice it to say that the XML we are using has been normalized with "NFC". To do the same in our search we need the standard-library module `unicodedata`, and then we simply use its `normalize` function.

In [14]:
import unicodedata
moseses = tree.xpath('//w[@lemma="{}"]'.format(unicodedata.normalize("NFC", "מֹשֶׁה")))
len(moseses)

767

Yup, it's annoying, but if you consistently use NFC-normalized Unicode, it won't make much difference to your life. And hey, at least this way your comparisons won't fail just because, in two words that look identical, the vowel is encoded before the dagesh in one and after it in the other (a problem I have faced before).
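To see the ordering problem in isolation, here is a minimal sketch that uses only `unicodedata`. The strings are built from explicit code points so the order of the marks is unambiguous; `nfc` is just a local shorthand of mine.

```python
import unicodedata


def nfc(text):
    """Local shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", text)


# Bet + dagesh + patah, and the same letter with the two marks swapped.
dagesh_first = "\u05D1\u05BC\u05B7"
vowel_first = "\u05D1\u05B7\u05BC"

raw_equal = dagesh_first == vowel_first
normalized_equal = nfc(dagesh_first) == nfc(vowel_first)
print(raw_equal, normalized_equal)

# The pre-composed shin with shin dot (U+FB2A) versus letter + dot.
precomposed = "\uFB2A"
letter_plus_dot = "\u05E9\u05C1"
print(nfc(precomposed) == letter_plus_dot)
```

Both strings render identically but only compare equal after normalization. Note that for Hebrew, NFC leaves the letter and its points as separate code points (the pre-composed forms are excluded from composition), so what normalization mainly buys us is a consistent order of the marks.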

So if you know your unicode is NFC normalized you can of course just do this (and to figure out the lexemes, it's worth checking the data):

In [15]:
moseses = tree.xpath('//w[@lemma="מֹשֶׁה"]')
len(moseses)

767

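Now how about finding a sentence in which Moses appears together with the verb "say"? Note that the query returns `<sentence>` elements, and that co-occurrence alone does not guarantee that Moses is the one speaking.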

In [16]:
moses_amr = tree.xpath('//sentence[ .//w[@lemma="מֹשֶׁה"] and .//w[@lemma="אמֶר"] ]')
len(moses_amr)

134
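Because that query works at the sentence level, the two words may sit in different clauses. If you want them in the same clause, the same predicate can be attached to the clause `<wg>` instead. A sketch on a made-up fragment (the structure and `n` values are placeholders; the lemmas are kept in variables so the data and the queries use identical strings):

```python
from lxml import etree

MOSES, SAY = "מֹשֶׁה", "אמֶר"

# Made-up fragment: sentence 1 has both words in one clause,
# sentence 2 has them in two different clauses.
SAMPLE = """
<base>
  <sentence n="1">
    <wg n="10" level="clause"><w lemma="{say}">x</w><w lemma="{moses}">y</w></wg>
  </sentence>
  <sentence n="2">
    <wg n="20" level="clause"><w lemma="{say}">x</w></wg>
    <wg n="21" level="clause"><w lemma="{moses}">y</w></wg>
  </sentence>
</base>
""".format(moses=MOSES, say=SAY)
root = etree.fromstring(SAMPLE.strip())

both = './/w[@lemma="{}"] and .//w[@lemma="{}"]'.format(MOSES, SAY)

# Same predicate, applied at two different levels of the hierarchy.
in_sentence = [s.get("n") for s in root.xpath('//sentence[{}]'.format(both))]
in_clause = [c.get("n") for c in root.xpath('//wg[@level="clause"][{}]'.format(both))]

print(in_sentence)
print(in_clause)
```

Even the clause-level version only checks co-occurrence; it says nothing about whether Moses is the subject of the verb.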

Before we do something more useful with this data than list a number of occurrences, let's look at one more interesting feature in the data set: the clause and phrase types stored in the `class` attribute.

In [17]:
prepositional_phrases = tree.xpath('//wg[@class="Prepositional phrase"]')
print("Number of prepositional phrases:", len(prepositional_phrases))

wayqx_clauses = tree.xpath('//wg[@class="Wayyiqtol-X clause"]')
print("Number of wayyiqtol-x clauses:", len(wayqx_clauses))

Number of prepositional phrases: 57464
Number of wayyiqtol-x clauses: 5895


To actually print out some of this data we can use something like:

In [18]:
print("First sentence with מֹשֶׁה and אמֶר:")
print(etree.tostring(moses_amr[0], pretty_print=True, encoding="utf-8").decode("utf-8"))

print("First Wayyiqtol-X clause:")
print(etree.tostring(wayqx_clauses[0], pretty_print=True, encoding="utf-8").decode("utf-8"))

First sentence with מֹשֶׁה and אמֶר:
<sentence n="1176997">
      <milestone unit="verse" id="Exod.3.3">Exod.3.3</milestone>
      <p>וַיֹּ֣אמֶר מֹשֶׁ֔ה </p>
      <wg n="433741" level="clause" class="Wayyiqtol-X clause">
        <wg n="670067" level="phrase" class="Conjunctive phrase">
          <w n="29676" lemma="וַ" partOfSpeech="conj" gloss="and">וַ</w>
        </wg>
        <wg n="670068" level="phrase" class="Verbal phrase">
          <w n="29677" lemma="אמֶר" partOfSpeech="verb" person="3" number="sg" gender="m" tense="wayq" stem="qal" gloss="say">יֹּ֣אמֶר</w>
        </wg>
        <wg n="670069" level="phrase" class="Proper-noun phrase">
          <w n="29678" lemma="מֹשֶׁה" partOfSpeech="nmpr" number="sg" gender="m" state="a" gloss="Moses">מֹשֶׁ֔ה</w>
        </wg>
      </wg>
    </sentence>
    

First Wayyiqtol-X clause:
<wg n="427557" level="clause" class="Wayyiqtol-X clause">
        <wg n="651518" level="phrase" class="Conjunctive phrase">
          <w n="32" lemma="וַ"

Note that in our searches the last element in the xpath chain is what was returned: `<sentence>` in the first case and `<wg>` in the second. So if we wanted to find out the verse references for these searches, we would need to find the associated milestone:

In [19]:
moses_amr[0].xpath("./milestone")[0].text

'Exod.3.3'

To find it for the `<wg>` element we could edit our previous search to find the parent `<sentence>`. Or, as in the next cell, we can use xpath to step up to the parent (this works because our clause `<wg>` only needs to navigate up one level to reach its sentence):

In [20]:
wayqx_clauses[0].xpath("../milestone")[0].text

'Gen.1.3'
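The `../milestone` trick depends on knowing how many levels to climb. For phrases or words, the `ancestor-or-self` axis finds the enclosing sentence from any depth. The `verse_ref` function below is a hypothetical helper of my own, tried on a made-up fragment; like the cells above, it assumes the milestone is a direct child of `<sentence>`.

```python
from lxml import etree


def verse_ref(element):
    """Return the first verse reference of the sentence containing element.

    Hypothetical helper. Returns None when the element is not inside
    (and is not itself) a <sentence>.
    """
    refs = element.xpath('ancestor-or-self::sentence[1]/milestone/text()')
    return str(refs[0]) if refs else None


# Made-up fragment; n values and the word are placeholders.
SAMPLE = """
<base>
  <sentence n="1">
    <milestone unit="verse" id="Gen.1.3">Gen.1.3</milestone>
    <wg n="10" level="clause">
      <wg n="100" level="phrase"><w n="32">wa</w></wg>
    </wg>
  </sentence>
</base>
"""
root = etree.fromstring(SAMPLE.strip())
word = root.xpath('//w')[0]
phrase = root.xpath('//wg[@level="phrase"]')[0]
sentence = root.xpath('//sentence')[0]

print(verse_ref(word), verse_ref(phrase), verse_ref(sentence), verse_ref(root))
```

Bear in mind that this returns the first milestone of the sentence, so for a sentence that spans several verses it may not be the exact verse the word falls in.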

# Do Something Cool

Well, now that we can find verse references based on our searches, let's try to do something interesting.

I glanced at the Moses + Speaking data and noticed that of the 134 values returned, a lot of them seemed to be "Wayyiqtol-X" clauses. So let's see where that is *not* the case (it turns out there are only two, so we'll just print them out):

In [21]:
moses_amr_not_wayqx = tree.xpath('//sentence[ .//wg[ .//w[@lemma="מֹשֶׁה"] and .//w[@lemma="אמֶר"] and not(@class="Wayyiqtol-X clause") ] ]')

for sentence in moses_amr_not_wayqx:
    print('{:15}'.format(sentence.xpath("./milestone")[0].text), sentence.xpath("./p")[0].text)

Exod.18.6       וַיֹּ֨אמֶר֙ אֶל־מֹשֶׁ֔ה 
Exod.32.17      וַיֹּ֨אמֶר֙ אֶל־מֹשֶׁ֔ה 
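Rather than eyeballing which clause types dominate, we can tally the `class` attribute of the matching clauses with `collections.Counter`. This is a sketch on a made-up fragment: the `n` values are placeholders and "Other clause" is a stand-in label, not a class name taken from the ETCBC data.

```python
from collections import Counter

from lxml import etree

MOSES, SAY = "מֹשֶׁה", "אמֶר"

# Made-up fragment: three clauses contain both lemmas, one contains only SAY.
CLAUSE = '<wg n="{n}" level="clause" class="{cls}">{words}</wg>'
BOTH = '<w lemma="{}">x</w><w lemma="{}">y</w>'.format(SAY, MOSES)
ONLY_SAY = '<w lemma="{}">x</w>'.format(SAY)

SAMPLE = "<base>{}</base>".format("".join([
    CLAUSE.format(n=1, cls="Wayyiqtol-X clause", words=BOTH),
    CLAUSE.format(n=2, cls="Wayyiqtol-X clause", words=BOTH),
    CLAUSE.format(n=3, cls="Other clause", words=BOTH),
    CLAUSE.format(n=4, cls="Other clause", words=ONLY_SAY),
]))
root = etree.fromstring(SAMPLE)

query = '//wg[@level="clause"][ .//w[@lemma="{}"] and .//w[@lemma="{}"] ]'.format(MOSES, SAY)
tally = Counter(clause.get("class") for clause in root.xpath(query))

for cls, count in tally.most_common():
    print('{:25}'.format(cls), count)
```

Run against the full `tree`, the same tally would show how the matching clauses are distributed across clause types.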
