# What this is all about

Laypeople often search the Bible for a single word to see where else it occurs. This can be helpful, especially with rare words, but far more often the better search is one that involves multiple words.

On <https://parabible.com>, every word has a phrase, clause, and sentence id. This allows searches at varying levels of specificity so that, for example, words that are likely being used together can be found (I find this particularly useful in Hebrew searches for verb + preposition). It also means that Parabible does not depend on "<within 5 words>" kinds of searches, because it has syntax trees to lean on for meaningful associations between words.
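
To make the contrast concrete, here is a minimal sketch on invented toy tokens. The words, clause ids, and both helper names are hypothetical and are not Parabible data or its API. It only illustrates how a proximity window can pair words across a clause boundary while a shared clause id does not.

```python
# Toy tokens: (position, word, clause_id). All values are invented for illustration.
tokens = [
    (0, "he", 0), (1, "went", 0), (2, "up", 0),
    (3, "and", 1), (4, "she", 1), (5, "came", 1), (6, "to", 1), (7, "him", 1),
]

def within_n_words(tokens, a, b, n):
    """Proximity search: position pairs where `a` and `b` are at most n apart."""
    pos_a = [p for p, w, _ in tokens if w == a]
    pos_b = [p for p, w, _ in tokens if w == b]
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= n]

def same_clause(tokens, a, b):
    """Syntactic search: clause ids that contain both `a` and `b`."""
    clauses_a = {c for _, w, c in tokens if w == a}
    clauses_b = {c for _, w, c in tokens if w == b}
    return sorted(clauses_a & clauses_b)

print(within_n_words(tokens, "went", "to", 5))  # [(1, 6)]: matches across the clause boundary
print(same_clause(tokens, "went", "to"))        # []: the two words never share a clause
print(same_clause(tokens, "came", "to"))        # [1]
```

The window search pairs "went" with "to" although they belong to different clauses; the clause-id search only reports "came" with "to".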

The missing piece on Parabible, however, is equivalent phrase/clause/sentence ids for the NT. This has frustrated me for some time because multiple openly licensed treebanks exist for the NT; they are just far more complicated than the syntax trees provided by the ETCBC for the Hebrew data. So I'm trying to wrangle one such treebank into something usable in Parabible's database.

Feedback is welcome!

In [1]:
from lxml import etree
tree = etree.parse('../greek-new-testament/syntax-trees/nestle1904-lowfat/xml/gnt.xml')
# honour xinclude statements:
tree.xinclude()
root = tree.getroot()

# Where the Magic Happens

If this cell doesn't make sense, don't worry about it. The key is to look at the sample data printed out in the cell after it. How the ids are generated matters, but even if it doesn't make sense for now, the purpose should become clear from the examples below.

Okay, this Jupyter cell is where we start assigning `clause-node-ids` to words. The trick is that every time we encounter a node whose class is `cl` (a clause), we give every word beneath it that clause's new, unique `clause-node-id`. When clauses are nested, the deepest clause node's id overrides the shallower ones.

Assigning `phrase-node-ids` is slightly more complicated. We want the shallowest possible phrases to be phrase nodes (otherwise there are loads of phrases with very few words), so a `phrase-node-id` is passed in and handed down unchanged through word groups that are not clauses. When we discover a clause, however, we assign new phrase ids to the clause's children, starting at the clause id multiplied by 1000. If the clause has only `<w>` elements as children, there is only one phrase, but if it has any `<wg>` children, each child receives its own `phrase-node-id`.

I have likely made logical errors here, because this was pretty experimental for most of the afternoon.
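
The rules above can be restated on a toy tree. This is a simplified sketch, not the function in the next cell: the XML is invented, `assign_ids` is a hypothetical name, it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, and it numbers clauses from 1 so that the ids are easier to read.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sentence: <wg class="cl"> marks a clause, <w> a word.
TOY = """
<sentence>
  <wg class="cl">
    <wg><w>A</w><w>B</w></wg>
    <wg>
      <w>C</w>
      <wg class="cl"><w>D</w><w>E</w></wg>
    </wg>
  </wg>
</sentence>
"""

def assign_ids(node, clause_id, phrase_id, counter, out):
    """Walk `node` depth-first, appending (text, clause_id, phrase_id) per word."""
    if node.tag == "w":
        out.append((node.text, clause_id, phrase_id))
        return
    if node.get("class") == "cl":
        counter[0] += 1               # a new clause takes the next clause id
        clause_id = counter[0]
        phrase_id = clause_id * 1000  # phrase ids restart inside each clause
        if node.find("wg") is not None:
            # the clause has word-group children: each child is its own phrase
            for offset, child in enumerate(node):
                assign_ids(child, clause_id, phrase_id + offset, counter, out)
            return
    # non-clause groups (and clauses holding only words) pass the ids down
    for child in node:
        assign_ids(child, clause_id, phrase_id, counter, out)

words = []
assign_ids(ET.fromstring(TOY), None, None, [0], words)
for w in words:
    print(w)
# ('A', 1, 1000)
# ('B', 1, 1000)
# ('C', 1, 1001)
# ('D', 2, 2000)
# ('E', 2, 2000)
```

`D` and `E` sit inside both clauses but keep only the id of the deeper one, while `C` stays in the shallow phrase `1001` of the outer clause.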

In [2]:
sentence_id = -1  # incremented once per sentence in the loop below
max_clause = -1   # highest clause id handed out so far

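def explore_word_group(wg, c, phrase_id):
    # wg: element whose children are walked
    # c: clause id inherited from the caller
    # phrase_id: phrase id inherited from the caller
    # Returns a flat list of word dicts.
    global max_clause, sentence_id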
    to_return = []
    for node in wg:
        current_node_is_clause = False
        if node.tag == "pc": continue
            
        clause_id = c
        if node.get("class") == "cl":
            max_clause += 1
            clause_id = max_clause
            current_node_is_clause = True
        
        if node.tag == "wg":
            if current_node_is_clause:
                new_phrase_id = clause_id * 1000
                if len(node.xpath("(wg)")) > 0:
                    for phrase in node:
                        to_return.extend(explore_word_group(phrase, clause_id, new_phrase_id))
                        new_phrase_id += 1
                else:
                    to_return.extend(explore_word_group(node, clause_id, new_phrase_id))
            else:
                to_return.extend(explore_word_group(node, clause_id, phrase_id))
        else:
            to_return.append({
                "wid": node.get("osisId"),
                "sentence_node": sentence_id,
                "clause_node": clause_id,
                "phrase_node": phrase_id,
                "case": node.get("case"), # We'll do a search based on case to experiment
                "lemma": node.get("lemma"),
                "text": node.text
            })
    return to_return

gnt = []
for node in root:
    if node.tag != "book":
        continue
    print(node.get("name"))
    for sentence in node:
        sentence_id += 1
        for child in sentence:
            if child.tag == "wg":
                gnt.extend(explore_word_group(child, max_clause, 0))

Matthew
Mark
Luke
John
Acts
Romans
1 Corinthians
2 Corinthians
Galatians
Ephesians
Philippians
Colossians
1 Thessalonians
2 Thessalonians
1 Timothy
2 Timothy
Titus
Philemon
Hebrews
James
1 Peter
2 Peter
1 John
2 John
3 John
Jude
Revelation


In [3]:
# Here's what some of the data looks like:
for w in gnt[9995:10020]:
    print(w)

{'wid': 'Matt.21.13!13', 'sentence_node': 761, 'clause_node': 2330, 'phrase_node': 2330003, 'case': 'accusative', 'lemma': 'αὐτός', 'text': 'αὐτὸν'}
{'wid': 'Matt.21.13!14', 'sentence_node': 761, 'clause_node': 2330, 'phrase_node': 2330003, 'case': None, 'lemma': 'ποιέω', 'text': 'ποιεῖτε'}
{'wid': 'Matt.21.13!15', 'sentence_node': 761, 'clause_node': 2330, 'phrase_node': 2330003, 'case': 'accusative', 'lemma': 'σπήλαιον', 'text': 'σπήλαιον'}
{'wid': 'Matt.21.13!16', 'sentence_node': 761, 'clause_node': 2330, 'phrase_node': 2330003, 'case': 'genitive', 'lemma': 'λῃστής', 'text': 'λῃστῶν'}
{'wid': 'Matt.21.14!4', 'sentence_node': 762, 'clause_node': 2332, 'phrase_node': 2332002, 'case': 'nominative', 'lemma': 'τυφλός', 'text': 'τυφλοὶ'}
{'wid': 'Matt.21.14!5', 'sentence_node': 762, 'clause_node': 2332, 'phrase_node': 2332002, 'case': None, 'lemma': 'καί', 'text': 'καὶ'}
{'wid': 'Matt.21.14!6', 'sentence_node': 762, 'clause_node': 2332, 'phrase_node': 2332002, 'case': 'nominative', 'lemm

In [4]:
# Let's create a quick helper function to print out the text of an array of words
def print_ws(ws):
    print(" ".join(w["text"] for w in ws))

In [5]:
# now let's create a dictionary that groups words by their clause_node
words_by_clause_node = {}
for w in gnt:
    c = w["clause_node"]
    if c not in words_by_clause_node:
        words_by_clause_node[c] = []
    words_by_clause_node[c].append(w)

In [6]:
# A two-word search starts with the first word: we just filter our list of words in the gnt by lemma
first_word = list(filter(lambda w: w["lemma"]=="ὑπακούω", gnt))
print(list(map(lambda w: w["wid"],first_word)))

['Mark.1.27!22', 'Luke.8.25!27', 'Luke.17.6!23', 'Acts.6.7!21', 'Acts.12.13!10', 'Rom.6.16!13', 'Rom.6.17!10', 'Rom.10.16!4', 'Eph.6.1!3', 'Eph.6.5!3', 'Col.3.20!3', 'Col.3.22!3', 'Heb.5.9!6', 'Heb.11.8!4']


In [7]:
# Let's try a "followed by" kind of search by hacking the osisId a bit.
# This prints only the first word whenever the next id is not in gnt,
# for example when the first word is the last word of its verse.
def get_next_osisId(o): # o = osisId, e.g. "Mark.1.27!22"
    prefix, _, word_number = o.partition("!")
    return prefix + "!" + str(int(word_number) + 1)

for w in first_word:
    next_wid = get_next_osisId(w["wid"])
    next_w = [x for x in gnt if x["wid"] == next_wid]
    print_ws([w] + next_w)

ὑπακούουσιν αὐτῷ
ὑπακούουσιν αὐτῷ
ὑπήκουσεν
ὑπήκουον τῇ
ὑπακοῦσαι ὀνόματι
ὑπακούετε
ὑπηκούσατε
ὑπήκουσαν τῷ
ὑπακούετε τοῖς
ὑπακούετε τοῖς
ὑπακούετε τοῖς
ὑπακούετε κατὰ
ὑπακούουσιν αὐτῷ
ὑπήκουσεν
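
One possible alternative to rewriting ids is to look up the following word by its position in the list. This is only a sketch on a hypothetical three-word list (`toy_gnt`, `position_of`, and `following_word` are invented names), and it assumes the list is in text order. Unlike the id hack it carries on into the next verse, but it can still only return words that were collected in the first place.

```python
# Hypothetical miniature of the `gnt` list: dicts with "wid" and "text" only.
toy_gnt = [
    {"wid": "X.1.1!1", "text": "alpha"},
    {"wid": "X.1.1!2", "text": "beta"},
    {"wid": "X.1.2!1", "text": "gamma"},
]

# Map each word id to its position once, instead of scanning the list per lookup.
position_of = {w["wid"]: i for i, w in enumerate(toy_gnt)}

def following_word(words, position_of, wid):
    """Return the word after `wid` in list order, or None at the end of the list."""
    i = position_of[wid] + 1
    return words[i] if i < len(words) else None

print(following_word(toy_gnt, position_of, "X.1.1!2")["text"])  # gamma (crosses the verse boundary)
print(following_word(toy_gnt, position_of, "X.1.2!1"))          # None
```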


In [8]:
# Based on the first word, find clauses that have the second word
matching_clauses = []
for w in first_word:
    c = w["clause_node"]
    ws = words_by_clause_node[c]
    matches = list(filter(lambda w: w["lemma"]=="ἐκ", ws))
    if len(matches) > 0:
        matching_clauses.append(c)

print("RESULTS:", len(matching_clauses))
for c in matching_clauses:
    print("Clause Nodes:", c, "(", words_by_clause_node[c][0]["wid"], ")")
    print_ws(words_by_clause_node[c])

RESULTS: 1
Clause Nodes: 16489 ( Rom.6.17!6 )
ἦτε δοῦλοι τῆς ἁμαρτίας ὑπηκούσατε ἐκ καρδίας τύπον διδαχῆς ἐδουλώθητε τῇ δικαιοσύνῃ


In [9]:
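# Same clause search, but this time the second word is matched on its case (dative) rather than its lemma
first_word = list(filter(lambda w: w["lemma"]=="ἀνάγω", gnt))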
matching_clauses = []
for w in first_word:
    c = w["clause_node"]
    ws = words_by_clause_node[c]
    matches = list(filter(lambda w: w["case"]=="dative", ws))
    if len(matches) > 0:
        matching_clauses.append(c)

print("RESULTS:", len(matching_clauses))
for c in matching_clauses:
    print("Clause Node:", c, "(", words_by_clause_node[c][0]["wid"], ")")
    print_ws(words_by_clause_node[c])

RESULTS: 4
Clause Node: 13545 ( Acts.7.41!2 )
ἐμοσχοποίησαν ἐν ταῖς ἡμέραις ἐκείναις ἀνήγαγον θυσίαν τῷ εἰδώλῳ
Clause Node: 14151 ( Acts.12.4!14 )
μετὰ τὸ πάσχα ἀναγαγεῖν αὐτὸν τῷ λαῷ
Clause Node: 15823 ( Acts.27.2!13 )
ἀνήχθημεν ὄντος σὺν ἡμῖν Ἀριστάρχου Μακεδόνος Θεσσαλονικέως
Clause Node: 16004 ( Acts.28.11!1 )
Μετὰ τρεῖς μῆνας ἀνήχθημεν ἐν πλοίῳ Ἀλεξανδρινῷ


In [10]:
# Now let's try a phrase search:
words_by_phrase_node = {}
for w in gnt:
    p = w["phrase_node"]
    if p not in words_by_phrase_node:
        words_by_phrase_node[p] = []
    words_by_phrase_node[p].append(w)

In [11]:
first_word = list(filter(lambda w: w["lemma"]=="Ἰησοῦς", gnt))
matching_phrases = []
for w in first_word:
    p = w["phrase_node"]
    ws = words_by_phrase_node[p]
    matches = list(filter(lambda w: w["lemma"]=="Χριστός", ws))
    if len(matches) > 0:
        matching_phrases.append(p)

print("RESULTS:", len(matching_phrases))
for p in matching_phrases:
    print("Phrase Node:", p, "(", words_by_phrase_node[p][0]["wid"], ")")
    print_ws(words_by_phrase_node[p])

RESULTS: 232
Phrase Node: 0 ( Matt.1.1!1 )
Βίβλος γενέσεως Ἰησοῦ Χριστοῦ υἱοῦ Δαυεὶδ υἱοῦ Ἀβραάμ
Phrase Node: 36001 ( Matt.1.18!1 )
Τοῦ Ἰησοῦ Χριστοῦ ἡ γένεσις οὕτως ἦν
Phrase Node: 1798002 ( Matt.16.21!4 )
Ἰησοῦς Χριστὸς
Phrase Node: 3480000 ( Mark.1.1!1 )
Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ)
Phrase Node: 9672001 ( John.1.17!12 )
διὰ Ἰησοῦ Χριστοῦ
Phrase Node: 12171001 ( John.17.3!9 )
σὲ τὸν μόνον ἀληθινὸν Θεὸν καὶ Ἰησοῦν Χριστόν
Phrase Node: 12691001 ( John.20.31!7 )
Ἰησοῦς ἐστιν ὁ Χριστὸς ὁ Υἱὸς τοῦ Θεοῦ
Phrase Node: 12995002 ( Acts.2.38!7 )
βαπτισθήτω ἕκαστος ὑμῶν ἐπὶ τῷ ὀνόματι Ἰησοῦ Χριστοῦ εἰς ἄφεσιν τῶν ἁμαρτιῶν ὑμῶν
Phrase Node: 13033003 ( Acts.3.6!16 )
ἐν τῷ ὀνόματι Ἰησοῦ Χριστοῦ τοῦ Ναζωραίου περιπάτει
Phrase Node: 13084001 ( Acts.3.20!12 )
τὸν Χριστὸν Ἰησοῦν
Phrase Node: 13123000 ( Acts.4.10!11 )
ἐν τῷ ὀνόματι Ἰησοῦ Χριστοῦ τοῦ Ναζωραίου
Phrase Node: 13631002 ( Acts.8.12!6 )
εὐαγγελιζομένῳ περὶ τῆς βασιλείας τοῦ Θεοῦ καὶ τοῦ ὀνόματος Ἰησοῦ Χριστοῦ
Phrase Node: 1385

In [12]:
# Finally, let's try a sentence search:
words_by_sentence_node = {}
for w in gnt:
    s = w["sentence_node"]
    if s not in words_by_sentence_node:
        words_by_sentence_node[s] = []
    words_by_sentence_node[s].append(w)

In [13]:
first_word = list(filter(lambda w: w["lemma"]=="αἷμα", gnt))
matching_sentences = []
for w in first_word:
    s = w["sentence_node"]
    ws = words_by_sentence_node[s]
    matches = list(filter(lambda w: w["lemma"]=="βασιλεία", ws))
    if len(matches) > 0:
        matching_sentences.append(s)

print("RESULTS:", len(matching_sentences))
for s in matching_sentences:
    print("Sentence Node:", s, "(", words_by_sentence_node[s][0]["wid"], ")")
    print_ws(words_by_sentence_node[s])

RESULTS: 3
Sentence Node: 5889 ( 1Cor.15.50!1 )
Τοῦτο φημι ἀδελφοί ὅτι σὰρξ καὶ αἷμα βασιλείαν Θεοῦ οὐ δύναται ἡ φθορὰ τὴν ἀφθαρσίαν κληρονομεῖ
Sentence Node: 7549 ( Rev.1.5!20 )
Τῷ ἀγαπῶντι ἡμᾶς λύσαντι ἡμᾶς ἐκ τῶν ἁμαρτιῶν ἡμῶν ἐν τῷ αἵματι αὐτοῦ καὶ βασιλείαν ἱερεῖς τῷ Θεῷ καὶ Πατρὶ αὐτοῦ αὐτῷ ἡ δόξα καὶ τὸ κράτος εἰς τοὺς αἰῶνας τῶν αἰώνων ἀμήν
Sentence Node: 7653 ( Rev.5.9!2 )
ᾄδουσιν ᾠδὴν καινὴν λέγοντες Ἄξιος εἶ λαβεῖν τὸ βιβλίον ἀνοῖξαι τὰς σφραγῖδας αὐτοῦ ὅτι ἐσφάγης ἠγόρασας τῷ Θεῷ ἐν τῷ αἵματί σου ἐκ πάσης φυλῆς καὶ γλώσσης καὶ λαοῦ καὶ ἔθνους ἐποίησας αὐτοὺς τῷ Θεῷ ἡμῶν βασιλείαν καὶ ἱερεῖς βασιλεύσουσιν ἐπὶ τῆς γῆς
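
The clause, phrase, and sentence searches above all follow one pattern: group the words by a node id, then keep the groups that contain both kinds of word. A minimal sketch of that pattern as a single helper follows. The names `group_by` and `cooccurrences` and the toy records are hypothetical; only the field names are taken from `gnt`.

```python
from collections import defaultdict

def group_by(words, key):
    """Group word dicts by the value stored under `key` (e.g. "clause_node")."""
    groups = defaultdict(list)
    for w in words:
        groups[w[key]].append(w)
    return groups

def cooccurrences(words, key, first, second):
    """Ids of groups holding a word that satisfies `first` and one that satisfies `second`."""
    hits = []
    for node_id, ws in group_by(words, key).items():
        if any(first(w) for w in ws) and any(second(w) for w in ws):
            hits.append(node_id)
    return hits

# Hypothetical toy records using the same field names as `gnt`.
toy = [
    {"lemma": "a", "case": None, "clause_node": 1, "sentence_node": 1},
    {"lemma": "b", "case": "dative", "clause_node": 1, "sentence_node": 1},
    {"lemma": "a", "case": None, "clause_node": 2, "sentence_node": 2},
    {"lemma": "c", "case": "dative", "clause_node": 3, "sentence_node": 2},
]

def is_a(w):
    return w["lemma"] == "a"

def is_dative(w):
    return w["case"] == "dative"

print(cooccurrences(toy, "clause_node", is_a, is_dative))    # [1]
print(cooccurrences(toy, "sentence_node", is_a, is_dative))  # [1, 2]
```

Widening the key from `clause_node` to `sentence_node` picks up the second sentence, where the two words sit in different clauses. Unlike the loops above, this sketch lists each group once even when it contains the first word several times.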
