# Using a Neural Network to Build a Phrase Atom Parser for Biblical Hebrew

by Mark Klooster | October 2, 2020 | ETCBC

Original can be found here: http://etcbc.nl/bible/using-a-neural-network-to-build-a-phrase-atom-parser-for-biblical-hebrew/

This notebook publishes the results of my internship the ETCBC (Eep Talstra Centre for Bible and Computer). The internship project involved creating a phrase atom parser for Hebrew text by building a machine learning model. The phrase atom parser contributes to the joint project between the ETCBC and the Theological Seminary at Andrews University, which is called the Creating Annotated Corpora of Classical Hebrew Texts project (CACCHT). The CACCHT project is currently broadening their scope by adding more text corpora to their database. A few years ago, the research group created a new Text-Fabric module containing the Dead Sea Scrolls (DSS) with morphological encoding. The digitized DSS and their morphological annotations were provided by Martin Abegg. However, Abegg's encoding system is very different from the other modules encoded by the ETCBC (such as the BHSA package or the extra-biblical package). Therefore, the CACCHT project has been working on converting all morphological features, thereby using a bottom-up approach (i.e. converting word features first, then phrase features, clause features, and so on). 

The encoding of word features is well underway and the project is about to move on to encoding phrase atom features. To start this, it should first be known which words constitute a phrase atom. However, Abegg's encoding does not have information about phrase atom boundaries in the dataset. Therefore, the phrase atom boundaries have to be constructed first. The construction or prediction of phrase atom boundaries is the project of this notebook. 

The reason this project predicts *phrase atom* boundaries instead of *phrase* boundaries is that whereas phrases might be separated by words of another phrase, phrase atoms consist of continuous words. Take for example this English sentence:

'A clearer example has never been given'.

The *adverb* and *adverbial phrase* 'never' split the *verbal phrase* 'has been given' in two smaller *phrase atoms*, namely 'has' and 'been given'. As phrases that are interrupted by other phrases are harder to detect, it is more logical to try and find phrase **atom** boundaries first. This agrees with the bottom-up approach that is used in the CACCHT project. (In another project, phrase atoms found here could be used to find complete phrases).

Determining things like phrase atom boundaries used to be done manually. However, this notebook uses a different approach. Phrase atom boundaries will be deducted and predicted based on information on word level. The data set of the BHSA already has information on *all* levels, including phrase atom boundaries, while the DSS data set has only information on word level. The data of the BHSA will be used as training data for a neural network. A neural network is an example of a machine learning algorithm which has a pattern-based approach. This means that rather than feeding rules to an algorithm to predict phrase atom boundaries, the networks will find patterns between input (on word level) and the output (phrase atom boundaries), to come up with these rules itself. These rules, in turn, will be applied on word-level input of the DSS to predict phrase atom boundaries. 

The neural network will be trained to find statistical patterns between *part of speech* - a word-level feature that has been encoded for the DSS already - and phrase atom boundaries. The deep learning model is trained on 90 per cent of the chapters of the BHSA, of which the phrase atom boundaries (the output) are known. As input, the model takes part of speech (e.g. noun, verb, adjective, etc.). The output consists of a 'p' or an 'x', indicating, respectively, whether the word is the end of a phrase atom, or not. 

The trained model is then tested on the remaining 10 per cent of the BHSA, which is called the test set. The mistakes are evaluated in detail, to get insight into specific cases in which the model is incorrect. The evaluation has led to several alterations in the input data, which in turn have improved the accuracy of the model. The model below only shows the final script of the most accurate model. The following alterations were made based on the evaluation of simpler models:

1. The scope of the training set was limited to Hebrew words only. This means that Aramaic parts were left out. As Aramaic has different grammatical conventions, it would not help the prediction of phrase atom boundaries in Hebrew. For example, in Aramaic, the part of speech 'article' comes after the noun, while it precedes the noun in Hebrew. Therefore, in Aramaic, the article would often be the last word of a phrase atom, while this would be theoretically impossible in Hebrew. Moreover, the target scroll of the DSS for this experiment, the Qumran Community Scroll (1QS), is a Hebrew text.
2. According to the Abegg encoding and the lastest ETCBC convention (applied in the extra-biblical package but not in the BHSA package), pronominal suffixes are to be treated as separate, individual words. Therefore, in the pre-processing phase of the data of the BHS, all suffixes are separated from their base words. In some cases, the suffix formed a separate, individual phrase atom. Suppose, for example, a verb with an object suffix. The verb and the suffix are in reality separate phrase atoms (the verb is the predicate, while the suffix is the object of the broader clause). For those and similar cases the output (phrase atom boundary) is re-evaluated and adjusted accordingly. 
3. Previous models were especially inaccurate when the part of speech was a noun, adverb, verb, proper noun, or an adjective. Interestingly, these five parts of speech can have a construct state in Hebrew, in which case it is closely connected to the following word. As a result of this, these cases are rarely the end of a phrase atom. Adding extra information about the state to these parts of speeches helps the model to predict more accurate in these cases.
These alterations combined made the accuracy of the model on the test set of the BHSA jump from 89 per cent to 97 per cent. The accuracy of the most efficient model on 1QS reached almost 95 per cent.

Moreover, whether a word is the end of a phrase atom or not, cannot be deducted from its part of speech alone. When dealing with language, context is crucial. Therefore, as is common in the practice of natural language processing, the model works with input and output *sequences* instead of single input and output. This is called a sequence to sequence model (seq2seq). After testing sequence lengths ranging from 5 to 20, the most ideal and efficient sequence length was 9. Therefore, the model works with sequences of length 9. This means that the input consists of 9 consecutive parts of speech and the output of 9 phrase atom boundary indicators (x’s or p’s).

In the script below, the following steps are taken:
1.	The input and output data are collected and pre-processed for the network.
2.	The network is defined, compiled, and fit to the training data.
3.	The model’s performance on the test set is calculated and evaluated extensively.
4.	The input and output data for the DSS scroll (1QS) is collected and pre-processed.
5.	The model is run on 1QS and the results are evaluated.


Each step is explained in more detail below.

First, the necessary libraries and modules are imported. This includes the Tensorflow package to build neural networks and the Text-Fabric package containing the BHSA database. 

It is recommended to run the model on a GPU instead on a CPU because that is much faster (depending on the specifications of the GPU of course). In order to do this, a virtual environment needs to be created. This might be a bit complicated but there are various good explanations and tutorials available online. See, for example, this tutorial on how to install a Tensorflow-GPU: https://www.youtube.com/watch?v=tPq6NIboLSc

In [37]:
# imports the necessary libraries and modules
import collections
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# imports the ETCBC database of the BHSA
from tf.app import use
A = use('bhsa:hot', hoist=globals())

rate limit is 5000 requests per hour, with 5000 left for this hour
	connecting to online GitHub repo annotation/app-bhsa ... connected


First, it is important to collect all words of the BHSA that are suitable for this project's purposes. As the ideal sequence length is 9, it is useful to collect these words in ranges longer than the sequence length. Moreover, there has to be a certain amount of ranges so that a random 10 per cent (for the test set) is representative of all genres throughout the Hebrew Bible. Therefore, the entire Hebrew Bible is split up into 929 smaller blocks containing consecutive words from exactly one chapter. The next step is to delete all words that are not Hebrew but Aramaic (to get a more homogeneous dataset). This is done in two steps:
1. Each block is checked for the presence of Aramaic words, which are then deleted.
2. At the position of the gap left by the deleted Aramaic words, the block is split into two new blocks (or more if the block has multiple gaps). 

This way, the resulting 927 blocks consist of only consecutive words.

Moreover, in Hebrew writing, when a word has an article that has and a prefixed preposition, the article is elided. Therefore, it is no longer visible, except in vocalised texts (such as the 10th-century Masoretic Text). As BHSA is based on an edition of the Masoretic Text (the BHS), it includes the information about 'hidden' articles. As the goal of this research is to predict phrase atom boundaries for the Dead Sea Scrolls - which are unvocalised texts - this added information is ignored and deleted. In the dataset of the BHSA, these words have an empty string ('') as the value for the feature g_cons, the transliterated consonantal presentation of words. 

Also, when a word has a pronominal suffix, in the pre-processing phase, this suffix will be separated from that word and considered as a word on its own. This way, the pronominal suffix becomes similar to the 'normal' personal pronouns. More importantly, regarding the pronominal suffix as a separate, individual word matches the encoding of the DSS, the target text set.

In [3]:
def create_hebrew_blocks():
    
    hebrew_blocks = collections.defaultdict(list)
    chapters = [chap for chap in F.otype.s("chapter")]
    
    block_index = 0
    # iterates over all chapters
    for chap in chapters:
        chap_words = []
        
        # iterates over and collects all words except the elided-he
        # adds an extra word if there is a pronominal suffix
        for word in L.d(chap, "word"):
            if F.g_cons.v(word) != '':
                chap_words.append(word)
                if F.prs.v(word) not in ['absent', 'n/a']:
                    chap_words.append(word)
        
        # splits chapter into blocks when it encounters non-Hebrew words 
        for node in range(len(chap_words)):
            if F.language.v(chap_words[node]) == 'Hebrew':
                hebrew_blocks[block_index].append(chap_words[node])
            elif F.language.v(chap_words[node]) != 'Hebrew':
                if F.language.v(chap_words[node - 1]) == 'Hebrew':
                    block_index += 1
                    continue
                else:
                    continue
        block_index += 1
    
    
    # shuffles the blocks randomly
    indexes = shuffle(list(hebrew_blocks.keys()))
    hebrew_blocks = {k: hebrew_blocks[k] for k in indexes}

    return hebrew_blocks

To give an example of what the 'hebrew blocks' look like, here are the first ten:

In [4]:
hebrew_blocks = create_hebrew_blocks()
[" ".join([str(i) for i in T.sectionFromNode(words[0])]).replace("_", " ") + "-" + str(T.sectionFromNode(words[-1])[2]) for words in hebrew_blocks.values()][:10]
    

['Daniel 10 1-21',
 'Judges 5 1-31',
 'Nahum 2 1-14',
 'Judges 4 1-24',
 'Psalms 110 1-7',
 '2 Samuel 22 1-51',
 'Judges 6 1-40',
 '1 Kings 18 1-46',
 'Psalms 136 1-26',
 'Micah 1 1-16']

Now that the dataset is defined, the next step is to collect the input and output data. For these purposes, the following three functions are used. The first function, *get_pos*, returns the part of speech when the input is a word and extends the part of speech - if needed - by the word's state. The second function returns a 'p' when the word is the end of a phrase atom, and an 'x' when it is not. The third function iterates through each block and all words and adds the input and output to each word. The resulting blocks, that now also contain input and output data, are split into training blocks and test blocks according to a predefined ratio of 9:1.

In [5]:
def get_pos(w):
    # customises the part of speech for a word and returns it
    
    # when a word has a suffix and a defined state, 
    # its part of speech is extended by '_c' indicating a construct state.
    if F.prs.v(w) not in ['absent', 'n/a'] and F.st.v(w) != "NA":
        pos = str(F.sp.v(w)) + "_c"
        
    # in all other cases, when a word has a state and no suffix, the 
    # part of speech is extended by the state
    elif F.st.v(w) != "NA":
        pos = str(F.sp.v(w)) + "_" + str(F.st.v(w))
        
    # when the word has neither state nor suffix, its part of speech remains unchanged
    else:
        pos = str(F.sp.v(w))

    return pos

In [6]:
def position_in_phrase_atom(w):
    # returns an 'p" when a word is the end of a phrase atom and an 'x' if it is not.

    ph_atom = L.u(w, 'phrase_atom')[0]
    words_in_ph_atom = L.d(ph_atom, "word")
    
    # when the word is the end of the phrase atom
    if w == words_in_ph_atom[-1]:
        ph_atom_end = 'p'
    
    # when it is not
    else:
        ph_atom_end = "x"

    return ph_atom_end

In [7]:
def collect_data(hebrew_blocks, ratio=0.9):

    data = {}

    # iterates through all blocks
    for block_idx, block_words in hebrew_blocks.items():
        block_data = []
        done = False

        # iterates through all words
        for w in block_words:
            
            # looks up the phrase to find the phrase function later
            phrase = L.u(w, "phrase")[0]

            # checks whether a word has a suffix
            if done == True:
                done = False
                continue

            # when a word appears twice in a block, the second one represents the suffix
            # the following lines make sure that the suffix gets a fitting part of speech
            # (prps) and phrase atom position

            elif block_words.count(w) == 2:
                # if the word has a suffix the data collection will happen for both the
                # word and the suffix. The second time the word passes the loop, it is ignored
                # by setting the bolean 'done' to true.
                done = True

                # if the phrase function of the word indicates a SUBJECT or OBJECT suffix
                if F.function.v(phrase)[-1] in "SO":
                    
                    # if it is the end of a phrase atom, the suffix becomes a separte phrase atom
                    if position_in_phrase_atom(w) == 'p':
                        block_data.append(['p', get_pos(w), w])
                        block_data.append(['p', 'prps', w])
                    
                    # if it is not, both original word and suffix remain 'x' for the same phrase atom
                    else:
                        block_data.append(['x', get_pos(w), w])
                        block_data.append(['x', 'prps', w])

                # if the phrase function does not indicate a subject or object suffix
                # the suffix takes over the phrase atom position form its base word.
                # If it becomes the end of phrase atom because of this, the base word gets an 'x'
                else:
                    if position_in_phrase_atom(w) == 'p':
                        block_data.append(['x', get_pos(w), w])
                        block_data.append(['p', 'prps', w])
                    else:
                        block_data.append(['x', get_pos(w), w])
                        block_data.append(['x', 'prps', w])

            # in all other cases, without suffixes involved, the phrase atom position and part of speech
            # are determined in the regular way
            else:
                block_data.append([position_in_phrase_atom(w), get_pos(w), w])
        data[block_idx] = block_data
    
    # shuffles the data randomly by block index
    data = {k: data[k] for k in shuffle(list(data.keys()))}
    
    # splits the shuffled data into train blocks and test blocks according to the preset ratio
    keys = list(data.keys())
    train_blocks = {k: data[k] for k in keys[:int(len(keys) * ratio)]}
    test_blocks = {k: data[k] for k in keys[int(len(keys) * ratio):]}

    return train_blocks, test_blocks

This is what the data of the first ten words of the test blocks looks like:

In [8]:
train_blocks, test_blocks = collect_data(hebrew_blocks)
[words for words in train_blocks.values()][0][:10]

[['p', 'verb', 226065],
 ['x', 'prep', 226066],
 ['p', 'prps', 226066],
 ['p', 'subs_a', 226067],
 ['p', 'conj', 226068],
 ['p', 'subs_a', 226069],
 ['p', 'verb', 226070],
 ['p', 'subs_a', 226071],
 ['p', 'verb', 226072],
 ['p', 'advb', 226073]]

The resulting train and test blocks consist of blocks containing the following three features for each word:
1. The part of speech of that word
2. The corresponding value for its position in a phrase atom (an 'x' or a 'p')
3. The word's node, which is an integer that is unique for each word and connects the word to the structure of the database. This helps to evaluate the results on a word level. 

The following two functions create the input and output sequences for the train and test set. In addition to this, each unique input and output value for every single word is collected in the input and output vocabularies. Lastly, the maximum length of the input and output sequences is calculated. These parameters are useful for choosing the dimensions of the neural network.

In [9]:
def prep_train_data(train_blocks):
    ip_pos_seq = []
    op_ph_seq = []
    ip_pos_voc = set()
    op_ph_voc = set()
    
    # iterates over all training blocks
    for train_word_nodes in train_blocks.values():
        
        # iterates over all words except the last 8, 
        # this way the last sequence won't run out of words
        # and have exactly 9 words 
        for w in range(len(train_word_nodes[:-8])):
            
            # the following lines collect the training data 
            # for 9 consecutive words in a list
            
            # input data: part of speech
            pos = [train_word_nodes[w][1] for w in range(w, w + 9)]
            
            # output data: position in phrase atom
            ph_atom = [
                train_word_nodes[w][0] for w in range(w, w + 9)
            ]
            
            # adds the start and stop symbol
            ph_atom = ['\t'] + ph_atom + ['\n']
            
            # collects the input and output for this word (w)
            # in a list
            ip_pos_seq.append(pos)
            op_ph_seq.append(ph_atom)
            
            # collects all unique input and output values in vocabularies
            for p in pos:
                ip_pos_voc.add(p)
            for ph in ph_atom:
                op_ph_voc.add(ph)
                
    # sorts the vocabuluries and converts them into lists
    ip_pos_voc = sorted(list(ip_pos_voc))
    op_ph_voc = sorted(list(op_ph_voc))
    
    # calculated the the maximum lenght of input and output sequences
    max_len_ip = max([len(pos) for pos in ip_pos_seq])
    max_len_op = max([len(ph) for ph in op_ph_seq])
    
    # shuffles all sequences randomly
    ip_pos_seq, op_ph_seq = shuffle(ip_pos_seq, op_ph_seq)

    return ip_pos_seq, op_ph_seq, ip_pos_voc, op_ph_voc, max_len_ip, max_len_op

This is what the first four input and output sequences look like. 

In [10]:
ip_pos_seq, op_ph_seq = prep_train_data(train_blocks)[:2]
for i in range(0, 4):
    print(ip_pos_seq[i])    
    print(op_ph_seq[i])

['nmpr_a', 'conj', 'verb', 'art', 'adjv_a', 'conj', 'art', 'adjv_a', 'conj']
['\t', 'p', 'p', 'p', 'x', 'x', 'x', 'x', 'x', 'x', '\n']
['subs_c', 'prps', 'prde', 'subs_c', 'subs_c', 'nmpr_a', 'prep', 'subs_c', 'prps']
['\t', 'x', 'p', 'p', 'x', 'x', 'p', 'x', 'x', 'p', '\n']
['prps', 'prep', 'nmpr_a', 'subs_c', 'prps', 'nega', 'verb', 'subs_a', 'prep']
['\t', 'p', 'x', 'p', 'x', 'p', 'p', 'p', 'p', 'x', '\n']
['prep', 'prps', 'prep', 'nmpr_a', 'subs_c', 'prps', 'conj', 'verb', 'prps']
['\t', 'x', 'p', 'x', 'p', 'x', 'p', 'p', 'p', 'p', '\n']


In [11]:
def prep_test_data(test_blocks):
    
    ip_pos_test = {}
    op_ph_test = {}
    
    for block_idx, test_word_nodes in test_blocks.items():
        
        ip_pos_test_block = []
        op_ph_test_block = []
        
        for w in range(len(test_word_nodes[:-8])):
            
            # collects test data
            pos = [test_word_nodes[w][1] for w in range(w, w + 9)]
            ph_atom = [test_word_nodes[w][0] for w in range(w, w + 9)]
            
            ip_pos_test_block.append(pos)
            op_ph_test_block.append(ph_atom)
            
        ip_pos_test[block_idx] = ip_pos_test_block
        op_ph_test[block_idx] = op_ph_test_block

    return ip_pos_test, op_ph_test

The data is then transformed because the neural network can only handle numerical data. To convert the numeric data back to the original data, the following dictionaries are created to map the input and output vocabularies to integers.

In [12]:
def create_dicts(ip_pos_voc, op_ph_voc):
    
    # maps the input vocabulary of part of speech to indeces
    ip_idx2pos = {}
    ip_pos2idx = {}

    for k, v in enumerate(ip_pos_voc):
        ip_idx2pos[k] = v
        ip_pos2idx[v] = k
    
    # maps the output vocabulary of phrase atom position to indeces
    op_idx2ph = {}
    op_ph2idx = {}

    for k, v in enumerate(op_ph_voc):
        op_idx2ph[k] = v
        op_ph2idx[v] = k

    return ip_idx2pos, ip_pos2idx, op_idx2ph, op_ph2idx

Because the input and output data are categorical, the data is being one-hot encoded. This means that each input value is represented by an array containing as many values as there are values in the input variable. The arrays contain zero's, with a 1 on the place of the integer value of the input.
An input for a single word might look like 
          
            [1, 0, 0, ... , 0, 0] 
which corresponds with the integer value 1 of the input vocabulary, which is 'adjv_a', an adjective with an absolute state.

In [13]:
def one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
                   op_ph2idx, ip_pos_test, op_ph_seq):
    
    # creates three-dimensional numpy arrays
    one_hot_ip = np.zeros(shape=(len(ip_pos_test), max_len_ip, len(ip_pos_voc)),
                      dtype='float32')
    one_hot_op = np.zeros(shape=(len(ip_pos_test), max_len_op, len(op_ph_voc)),
                      dtype='float32')
    target_data = np.zeros((len(ip_pos_test), max_len_op, len(op_ph_voc)),
                           dtype='float32')

    for i in range(len(ip_pos_test)):
        for k, ps in enumerate(ip_pos_test[i]):
            one_hot_ip[i, k, ip_pos2idx[ps]] = 1

        for k, ph in enumerate(op_ph_seq[i]):
            one_hot_op[i, k, op_ph2idx[ph]] = 1
            
            # the decoder target data is ahead one timestep and does 
            # not include the start symbol
            if k > 0:
                target_data[i, k - 1, op_ph2idx[ph]] = 1

    return one_hot_ip, one_hot_op, target_data

The following function creates the structure of the neural network, which has an encoder-decoder architecture. The encoder consists of an *input layer* that has as many cells as the size of the input vocabulary of parts of speech and two *LSTM layers* which both have 250 cells. The *input layer* of the decoder has as many cells as the size of the output vocabulary of x's and p's. The decoder also has a *LSTM layer* of 250 cells, and a *dense layer* of exactly as many cells as the output vocabulary. The dense layer uses the softmax activation to normalise the outputs into a probability distribution.

In [14]:
def define_LSTM_model(ip_pos_voc, op_ph_voc):
    
    # encoder model
    encoder_input = Input(shape=(None, len(ip_pos_voc)))
    encoder_LSTM = LSTM(250,
                        activation='relu',
                        return_state=True,
                        return_sequences=True)(encoder_input)
    encoder_LSTM = LSTM(250, return_state=True)(encoder_LSTM)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM
    encoder_states = [encoder_h, encoder_c]
    
    # decoder model
    decoder_input = Input(shape=(None, len(op_ph_voc)))
    decoder_LSTM = LSTM(250, return_sequences=True, return_state=True)
    decoder_out, _, _ = decoder_LSTM(decoder_input,
                                     initial_state=encoder_states)
    decoder_dense = Dense(len(op_ph_voc), activation='softmax')
    decoder_out = decoder_dense(decoder_out)

    model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])

    model.summary()

    return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model

The model's architecture is defined and the next step is to feed the data into the model. First, stopping conditions are defined. When these are met, the model is finished and stops running. Then, the optimiser and loss function are set. Finally, the model is fed the training data and begins fitting itself to the training data.

In [15]:
def compile_and_train(model, one_hot_ip, one_hot_op, target_data, batch_size,
                      epochs, val_split):
    # defines stop conditions
    callback = EarlyStopping(monitor='val_loss',
                             patience=patience,
                             verbose=0,
                             mode='auto')
    
    # defines optimizer
    adam = Adam(lr=0.0008, beta_1=0.99, beta_2=0.999, epsilon=0.00000001)
    
    # compiles the model
    model.compile(optimizer=adam,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    # fits the model to the training data
    model.fit(x=[one_hot_ip, one_hot_op],
              y=target_data,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=val_split,
              callbacks=[callback])

    return model

The following script sets all parameters and then runs all functions mentioned above. The data is collected, created, pre-processed and the network is defined and compiled. In the end, the model is fit to the training data.

In [16]:
batch_size = 1024
epochs = 150
val_split = 0.05
patience = 3
ratio = 0.9

# collects the relevant parts of the Hebrew Bible
hebrew_blocks = create_hebrew_blocks()

# collects input and output data and creates training and test sets
train_blocks, test_blocks = collect_data(hebrew_blocks, ratio)

# creates training sequences
ip_pos_seq, op_ph_seq, ip_pos_voc, op_ph_voc, max_len_ip, max_len_op = prep_train_data(
    train_blocks)

# creates test sequences
ip_pos_test, op_ph_test = prep_test_data(test_blocks)

# converts data to numerical data
ip_idx2pos, ip_pos2idx, op_idx2ph, op_ph2idx = create_dicts(
    ip_pos_voc, op_ph_voc)

# one-hot encodes the data
one_hot_ip, one_hot_op, target_data = one_hot_encode(max_len_ip, max_len_op,
                                                     ip_pos_voc, op_ph_voc,
                                                     ip_pos2idx, op_ph2idx,
                                                     ip_pos_seq, op_ph_seq)

one_hot_test_data = {
    block:
    one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
                   op_ph2idx, ip_pos_test[block], op_ph_seq)[0]
    for block in test_blocks
}

# defines the model
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(
    ip_pos_voc, op_ph_voc)

# fits the model to the training data
model = compile_and_train(model, one_hot_ip, one_hot_op, target_data,
                          batch_size, epochs, val_split)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 18)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, None, 250),  269000      input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 4)      0                                            
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, 250), (None, 501000      lstm_1[0][0]                     
                                                                 lstm_1[0][1]               

After 22 epochs, the stopping conditions were met and the model stopped training. It reached an accuracy of 98.67% on the validation set (5% of the training data that was set aside for self-evaluation). Although this is a decent result, it is more important to find out how accurate the model is on completely new data. This is where the test set, the 10% that was set apart at the beginning, comes in.

First, a few more functions are needed to be able to convert input data into predicted outcomes. The function *prediction_dict* converts the predicted sequences of 9 words into phrase atom boundary predictions for each individual word.

In [17]:
def encoder_decoder_model(encoder_input, encoder_states, decoder_LSTM, decoder_dense):
    # encoder inference model
    encoder_model_inf = Model(encoder_input, encoder_states)

    # decoder inference model
    decoder_state_input_h = Input(shape=(250, ))
    decoder_state_input_c = Input(shape=(250, ))
    decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

    decoder_out, decoder_h, decoder_c = decoder_LSTM(
        decoder_input, initial_state=decoder_input_states)

    decoder_states = [decoder_h, decoder_c]

    decoder_out = decoder_dense(decoder_out)

    decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                              outputs=[decoder_out] + decoder_states)

    return encoder_model_inf, decoder_model_inf

The function *decode_seq()* uses the trained model to predict output sequences. It takes one-hot encoded sequences of words as input.

In [18]:
def decode_seq(ip_seq, encoder_model_inf, decoder_model_inf, op_ph_voc,
               op_ph2idx, op_idx2ph):

    states_val = encoder_model_inf.predict(ip_seq)

    target_seq = np.zeros((1, 1, len(op_ph_voc)))
    target_seq[0, 0, op_ph2idx['\t']] = 1

    pred_ph = []
    stop_condition = False

    while not stop_condition:

        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(
            x=[target_seq] + states_val)

        max_val_index = np.argmax(decoder_out[0, -1, :])
        sampled_out_char = op_idx2ph[max_val_index]
        pred_ph.append(sampled_out_char)

        if (sampled_out_char == '\n'):
            stop_condition = True

        target_seq = np.zeros((1, 1, len(op_ph_voc)))
        target_seq[0, 0, max_val_index] = 1

        states_val = [decoder_h, decoder_c]

    return pred_ph

The function *prediction_dict()* converts the predicted outputs for sequences into predicted outputs for single words. 

In [19]:
def prediction_dict(test_blocks, one_hot_test_data, op_ph_test):
    decision_dict = {}
    for block, block_seqs in test_blocks.items():
        
        decision_dict_block = collections.defaultdict(list)
        
        for seq_index in range(len(one_hot_test_data[block])):
            ip_seq = one_hot_test_data[block][seq_index:seq_index+1]
            
            pred_ph = decode_seq(ip_seq, encoder_model_inf, decoder_model_inf, op_ph_voc,
               op_ph2idx, op_idx2ph)
            if len(pred_ph[:-1]) == len(op_ph_test[block][seq_index]):
                for pred_index in range(len(pred_ph[:-1])):
                    decision_dict_block[seq_index + pred_index].append(pred_ph[:-1][pred_index])
        decision_dict[block] = decision_dict_block
    
    return decision_dict

The function *safe_div()* divides two numbers and returns the result. If the denominator is zero, it returns zero. This function comes in handy when calculating percentages in the evaluation later. 

In [20]:
def safe_div(numerator, denominator):
    if denominator == 0:
        return 0
    
    else:
        return numerator / denominator

The following function runs all words in the test set through the model and counts the correct and false predictions. Of the latter, it also registers the corresponding part of speech to get insight into the performance of the model per input value.

In [21]:
def test_evaluation(test_blocks, decision_dict):
    correct_test = 0
    wrong_test = 0
    bible_section = []
    pos_dict = collections.defaultdict(lambda: collections.defaultdict(int))
    cross_dict = collections.defaultdict(lambda: collections.defaultdict(int))
    
    # iterates through all blocks
    for block in test_blocks:
        
        # iterates through all words
        for key in range(len(test_blocks[block])):
            w = test_blocks[block][key][2]
            
            # collects all predictions for the word (up to 9)
            data = collections.Counter(decision_dict[block][key])
            
            # determines the most common prediction
            pred = data.most_common(1)[0][0]
            
            # counts each possible combination of true and predicted output
            cross_dict[test_blocks[block][key][0]][pred] += 1
            
            # if the prediction is correct
            if test_blocks[block][key][0] == pred:
                correct_test += 1
            
            # if the prediciton is false
            else:
                wrong_test += 1
                
                # registers the exact location in the BHSA of the misprediction
                # along with information about the output
                bible_section.append(
                    str(w) + " " + T.sectionFromNode(w)[0].replace("_", " ") +
                    " " + str(T.sectionFromNode(w)[1]) + ":" +
                    str(T.sectionFromNode(w)[2]) + " " +
                    test_blocks[block][key][1] + " " +
                    test_blocks[block][key][0] + " " +
                    data.most_common(1)[0][0])
                
                # registers input corresponding with the misprediciton
                pos = test_blocks[block][key][1]
                pos_dict[pos][pred] = pos_dict[pos].get(pred, 0) + 1
                
    # creates an extensive evaluation of errors by part of speech
    eval_by_pos = {}
    for k in pos_dict.keys():
        total_pos = len([
            test_blocks[block][key][2] for block in test_blocks
            for key in range(len(test_blocks[block]))
            if test_blocks[block][key][1] == k
        ])
        total_pos_ph_x = len([
            test_blocks[block][key][2] for block in test_blocks
            for key in range(len(test_blocks[block]))
            if test_blocks[block][key][1] == k
            and test_blocks[block][key][0] == 'x'
        ])
        total_pos_ph_p = len([
            test_blocks[block][key][2] for block in test_blocks
            for key in range(len(test_blocks[block]))
            if test_blocks[block][key][1] == k
            and test_blocks[block][key][0] == 'p'
        ])
        total_wrong = pos_dict[k]['x'] + pos_dict[k]['p']

        pct_x = 100 * safe_div(pos_dict[k]['p'], total_pos_ph_x)
        pct_p = 100 * safe_div(pos_dict[k]['x'], total_pos_ph_p)
        pct_tot = 100 * \
            safe_div(total_wrong, total_pos)

        eval_by_pos[k] = {
            "Total in Test Set": total_pos,
            "Total Mistakes": total_wrong,
            "Mistakes Percentage": pct_tot,
            "Total 'x' in Test Set": total_pos_ph_x,
            "Mistaken for '" + 'p' + "'": pos_dict[k]['p'],
            "Percentage 'x'": pct_x,
            "Total '" + 'p' + "' in Test Set": total_pos_ph_p,
            "Mistaken for 'x'": pos_dict[k]['x'],
            "Percentage '" + 'p' + "'": pct_p
        }

    eval_by_pos = {
        item[0]: item[1]
        for item in sorted(eval_by_pos.items(),
                           key=lambda x: (x[1]["Total Mistakes"]),
                           reverse=True)
    }

    df_eval_by_pos = pd.DataFrame.from_dict(eval_by_pos).T
    int_cols = [
        "Total in Test Set", "Total Mistakes", "Total 'x' in Test Set", "Mistaken for 'x'",
        "Total '" + 'p' + "' in Test Set", "Mistaken for '" + 'p' + "'"
    ]
    float_cols = [
        "Mistakes Percentage", "Percentage 'x'", "Percentage '" + 'p' + "'"
    ]
    
    # creates a data frame containing the evaluation per the part of speech 
    df_eval_by_pos[int_cols] = df_eval_by_pos[int_cols].applymap(np.int64)
    df_eval_by_pos[float_cols] = df_eval_by_pos[float_cols].round(2)

    # creates a cross evaluation
    cross_eval = [[
        cross_dict[key][key2] if key2 in cross_dict[key] else 0
        for key2 in list(cross_dict.keys())
    ] for key in list(cross_dict.keys())]
    df_cross_eval = pd.DataFrame(
        cross_eval,
        columns=["End of Phrase Atom", "Not " + "End of Phrase Atom"],
        index=["Predicted as End", "Predicted as Not End"])

    eval_summary = {
        "Correct Classifications":
        correct_test,
        "Misclassifications":
        wrong_test,
        "Accuracy":
        round(100 * safe_div(correct_test, (correct_test + wrong_test)), 2)
    }
    print("Accuracy:",
          round(100 * safe_div(correct_test, (correct_test + wrong_test)), 2))
    
    # creates a dataframe of the cross evaluation
    df_eval_summary = pd.DataFrame(eval_summary, index=["Value"])

    return df_eval_by_pos, df_cross_eval, df_eval_summary, bible_section

The following script runs the previous functions to predict the outcomes for the test set, does some evaluations, and displays the results in tables.

In [22]:
# creates the encoder and decoder inference model
encoder_model_inf, decoder_model_inf = encoder_decoder_model(
    encoder_input, encoder_states, decoder_LSTM, decoder_dense)

# creates the decision dictionary containing up to predicted outcomes for each word
decision_dict = prediction_dict(test_blocks, one_hot_test_data, op_ph_test)

# evaluates the results and publishes the results in tables
df_eval_by_pos, df_cross_eval, df_eval_summary, bible_section = test_evaluation(
    test_blocks, decision_dict)

Accuracy: 96.46


In [23]:
df_eval_summary

Unnamed: 0,Correct Classifications,Misclassifications,Accuracy
Value,44139,1620,96.46


In [24]:
df_cross_eval

Unnamed: 0,End of Phrase Atom,Not End of Phrase Atom
Predicted as End,26725,499
Predicted as Not End,1121,17414


The model was able to predict the phrase atom boundaries for the test set correctly for 96.46% of the words. It is important to analyse the model's performance further. Therefore, the results are evaluated more specifically. The following table shows the errors per part of speech:

In [25]:
df_eval_by_pos

Unnamed: 0,Total in Test Set,Total Mistakes,Mistakes Percentage,Total 'x' in Test Set,Mistaken for 'p',Percentage 'x',Total 'p' in Test Set,Mistaken for 'x',Percentage 'p'
subs_a,6083,493,8.1,1251,349,27.9,4832,144,2.98
conj,6051,416,6.87,1074,293,27.28,4977,123,2.47
nmpr_a,2803,146,5.21,367,102,27.79,2436,44,1.81
verb_c,535,132,24.67,211,94,44.55,324,38,11.73
prps,5097,110,2.16,200,96,48.0,4897,14,0.29
subs_c,6084,98,1.61,5897,19,0.32,187,79,42.25
advb,451,71,15.74,97,60,61.86,354,11,3.11
art,2460,52,2.11,2314,24,1.04,146,28,19.18
adjv_a,965,52,5.39,103,48,46.6,862,4,0.46
verb_a,1149,23,2.0,25,23,92.0,1124,0,0.0


Most of the mistakes occurred in predicting the phrase atom position for words that were conjunctions or substantives with an absolute state (416 and 493 errors). Relatively, most errors occurred for verbs in the construct state and adverbs (24.67 and 15.74%). 

For a complete list of incorrect predictions, see the end of this notebook.

The final goal of this notebook was to predict phrase atom boundaries for the DSS package. For that reason, the model is tested on one scroll, namely, the Community Scroll (1QS). 
First, the extra-biblical package that contains this scroll is imported:

In [27]:
from tf.fabric import Fabric

TF = Fabric(locations='C:/Users/Mark/text-fabric-data/etcbc/extrabiblical/tf/0.2')

api = TF.load('''
    otype mother lex st typ code function rela det txt prs kind vs vt sp book chapter verse label language
''')

api.makeAvailableIn(globals())

This is Text-Fabric 8.3.0
Api reference : https://annotation.github.io/text-fabric/cheatsheet.html

72 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  0.35s All features loaded/computed - for details use loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

The following steps of collecting and pre-processing the data are similar to the steps taken earlier when the model was trained on the BHSA. The main difference is that - this time - only one data set is created, which is the test set. As the model has already been trained, a training set is no longer needed. 

Moreover, there are some important differences between the structure of the data of the BHSA and the extra-biblical package:
1. Most scrolls of the extra-biblical package are fragmentary, which means that some words are unreadable or missing from the scroll. Therefore, the data set does not always contain consecutive continuous words. To deal with this problem, the chapters are split into smaller blocks when and where ommissions occur (this process is similar to the splitting of the chapters of the BHSA when non-Hebrew words occurred).
2. Contrary to the BHSA, in the extra-biblical package, the pronominal suffix is considered as an individual word. This is no longer a problem, as the data of the BHSA has been pre-processed earlier in such a manner that it is similar to the extra-biblical package in this respect. 

Most functions that were used before on the BHSA can be used again without alterations. Because of the differences mentioned above, only the bundling of usable segments of consecutive words, the collection of input and output data, and the preparation of the test set need to be programmed differently.

In [28]:
def create_dss_blocks(test_book=['B_1QS']):
    
    dss_blocks = collections.defaultdict(list)
    chapters = [
        chap for chap in F.otype.s("chapter") if F.book.v(chap) in test_book
    ]
    
    block_index = 0
    # iterates over all chapters and collects all words except the elided-he
    for chap in chapters:
        chap_words = [w for w in L.d(chap, "word") if F.g_cons.v(w) != '']
        block = []
        
        # detects and removes omissions and splits blocks when they occur
        for word in range(len(chap_words)):
            if F.lex.v(chap_words[word]) == '=':
                if block != []:
                    dss_blocks[block_index] = block
            elif F.lex.v(chap_words[word]) != '=':
                block.append(chap_words[word])
                block_index += 1
        if block != []:
            dss_blocks[block_index] = block
            block_index += 1
    
    # filters out blocks that are shorter than the sequence length (9)
    dss_blocks = {block: words for block, words in dss_blocks.items() if len(words) >= 9}
    
    # shuffles the blocks randomly
    indexes = shuffle(list(dss_blocks.keys()))
    dss_blocks = {k: dss_blocks[k] for k in indexes}
    
    return dss_blocks
    

In [29]:
def collect_dss_data(dss_blocks, ratio=0.9):

    dss_data = {}

    # iterates through all blocks
    for block_idx, block_words in dss_blocks.items():
        block_data = []

        # iterates through all words
        for w in block_words:
            block_data.append([position_in_phrase_atom(w), get_pos(w), w])
        dss_data[block_idx] = block_data

    return dss_data

In [30]:
def prep_test_data(dss_data):

    ip_pos_dss = {}
    op_ph_dss = {}
    
    # iterates through dss blocks
    for block in dss_data:
        ip_pos_dss_block = []
        op_ph_dss_block = []
        dss_words = dss_data[block]

        for w in range(len(dss_words[:-8])):
            
            # collects dss data
            pos = [dss_words[w][1] for w in range(w, w + 9)]
            ph_atom = [dss_words[w][0] for w in range(w, w + 9)]
            
            ip_pos_dss_block.append(pos)
            op_ph_dss_block.append(ph_atom)

        ip_pos_dss[block] = ip_pos_dss_block
        op_ph_dss[block] = op_ph_dss_block

    return ip_pos_dss, op_ph_dss

The following script runs all necessary functions to create the input data the DSS and to run it through the model. The resulting outcomes are shown in tables similar to those of the test set of the BHSA.

In [31]:
test_book = ['B_1QS']

# creates test data
dss_blocks = create_dss_blocks(test_book)
dss_data = collect_dss_data(dss_blocks)

# prepares test data
ip_pos_dss, op_ph_dss = prep_test_data(dss_data)

# one-hot encodes test data
one_hot_dss_data = {
    block: one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
                   op_ph2idx, ip_pos_dss[block], op_ph_seq)[0]
    for block in dss_data
}

# creates prediction dictionary
decision_dict_dss = prediction_dict(dss_data, one_hot_dss_data, op_ph_dss)

df_eval_by_pos_dss, df_cross_eval_dss, df_eval_summary_dss, bible_section_dss = test_evaluation(
    dss_data, decision_dict_dss)

Accuracy: 94.47


In [32]:
df_eval_summary_dss

Unnamed: 0,Correct Classifications,Misclassifications,Accuracy
Value,9731,570,94.47


In [33]:
df_cross_eval_dss

Unnamed: 0,End of Phrase Atom,Not End of Phrase Atom
Predicted as End,5052,162
Predicted as Not End,408,4679


In the end, the model trained on the BHSA is able to predict the phrase atom boundaries for the Qumran Community Scroll with a 94.47% accuracy.

In [34]:
df_eval_by_pos_dss

Unnamed: 0,Total in Test Set,Total Mistakes,Mistakes Percentage,Total 'x' in Test Set,Mistaken for 'p',Percentage 'x',Total 'p' in Test Set,Mistaken for 'x',Percentage 'p'
subs_a,1606,224,13.95,339,193,56.93,1267,31,2.45
conj,1252,144,11.5,238,117,49.16,1014,27,2.66
prps,966,49,5.07,66,41,62.12,900,8,0.89
subs_c,1979,47,2.37,1886,8,0.42,93,39,41.94
verb_c,104,47,45.19,19,12,63.16,85,35,41.18
art,630,16,2.54,546,8,1.47,84,8,9.52
adjv_a,185,12,6.49,13,10,76.92,172,2,1.16
advb,47,12,25.53,6,6,100.0,41,6,14.63
verb_a,551,6,1.09,7,6,85.71,544,0,0.0
prep,1940,5,0.26,1933,0,0.0,7,5,71.43


Most of the mistakes occurred in predicting the phrase atom position for words that were conjunctions or substantives with an absolute state (144 and 224 errors). Relatively, the most errors occurred for verbs in the construct state and adverbs (45.19 and 25.53%). The high error rate of interjections is not meaningful as interjections only occur 6 times in 1QS of which 3 are wrongly predicted. The results are strikingly similar to the evaluation of the test set of the BHSA. This could mean that the Hebrew of 1QS is not that different from the Hebrew of the BHSA. 

In conclusion, a sequence to sequence Neural Network with a LSTM encoder-decoder is quite capable of finding relations between parts of speech and phrase atom end. For further research, one could build upon this model to predict phrase functions, for instance. In fact, these kind of models could be used in the field of ancient languages for many more applications, such as manuscript clustering, feature parsing, or to address questions of authorship, dating, and much more.

For the sake of completeness, here follows the complete list of wrong predictions both on the test set of the BHSA and on 1QS. Each line shows the node, verse, part of speech, correct and predicted position in the phrase atom:

In [35]:
for error in bible_section:
    print(error)

223011 Isaiah 33:6 conj x p
223074 Isaiah 33:12 subs_a x p
223093 Isaiah 33:14 subs_a p x
223164 Isaiah 33:19 subs_a x p
223173 Isaiah 33:19 verb_c x p
223175 Isaiah 33:19 subs_c p x
223199 Isaiah 33:21 conj x p
223207 Isaiah 33:21 subs_a x p
223242 Isaiah 33:23 subs_a p x
223256 Isaiah 33:24 verb_c x p
158210 1 Samuel 27:1 subs_a x p
158244 1 Samuel 27:2 prps x p
158245 1 Samuel 27:2 conj x p
158267 1 Samuel 27:3 subs_a x p
158268 1 Samuel 27:3 conj x p
158270 1 Samuel 27:3 nmpr_a x p
158271 1 Samuel 27:3 conj x p
97458 Deuteronomy 7:1 subs_a p x
97514 Deuteronomy 7:5 conj x p
97618 Deuteronomy 7:9 subs_a p x
97620 Deuteronomy 7:9 subs_a x p
97621 Deuteronomy 7:9 art x p
97625 Deuteronomy 7:9 subs_a x p
97626 Deuteronomy 7:9 conj x p
97630 Deuteronomy 7:9 verb_c x p
97630 Deuteronomy 7:9 prps x p
97631 Deuteronomy 7:9 conj x p
97633 Deuteronomy 7:9 verb_c x p
97641 Deuteronomy 7:10 verb_c x p
97649 Deuteronomy 7:10 verb_c x p
97761 Deuteronomy 7:15 nmpr_a x p
97823 Deuteronomy 7:18 su

In [36]:
for error in bible_section_dss:
    print(error)

17252 1QS 2:2 subs_a x p
17276 1QS 2:3 intj x p
17308 1QS 2:5 prps p x
17309 1QS 2:5 conj p x
17331 1QS 2:6 prps x p
17332 1QS 2:6 conj x p
17450 1QS 2:12 subs_c p x
17451 1QS 2:12 verb_c p x
17460 1QS 2:12 subs_c p x
17461 1QS 2:12 verb_c p x
17465 1QS 2:13 prep p x
17466 1QS 2:13 subs_c p x
17508 1QS 2:15 subs_c p x
17509 1QS 2:15 verb_c p x
17525 1QS 2:16 intj x p
17594 1QS 2:18 adjv_a x p
17598 1QS 2:18 prep p x
17599 1QS 2:18 subs_c p x
17602 1QS 2:19 subs_a x p
17603 1QS 2:19 conj x p
17643 1QS 2:20 prps x p
17644 1QS 2:20 conj x p
17656 1QS 2:22 subs_c p x
17657 1QS 2:22 verb_c p x
17665 1QS 2:22 intj x p
17670 1QS 2:23 subs_a x p
17690 1QS 2:24 prde x p
17710 1QS 2:25 prde x p
17757 1QS 2:27 subs_a x p
17758 1QS 2:27 conj x p
17891 1QS 2:36 subs_c p x
17940 1QS 2:38 subs_a x p
17941 1QS 2:38 conj x p
18918 1QS 4:1 verb_c p x
18957 1QS 4:2 subs_a p x
18958 1QS 4:2 conj p x
18993 1QS 4:3 conj x p
19047 1QS 4:3 subs_c p x
19048 1QS 4:3 verb_c p x
19118 1QS 4:6 verb_c p x
19122 1QS