# Ælfric to Albert Demo Notebook

## Setup

In [None]:
# This might take a few minutes
!pip install texttable > /dev/null
!pip install contractions > /dev/null
!pip install git+https://github.com/huggingface/transformers.git@master > /dev/null
!pip install git+https://github.com/huggingface/datasets.git@master > /dev/null
!pip install sentencepiece > /dev/null
!pip install cltk==0.1.121 > /dev/null
!pip install nltk==3.5 > /dev/null

## A quick look at the data

A major limitation of our project was corpus size. The Bible corpus contains only around 30k verses, split between the Old and New Testaments.

In [None]:
from summarize_data import *
print_testament_table()

Different books of the Bible can vary stylistically. They may have been written in vastly different time periods, from different perspectives, and in different genres.

In [None]:
print_genre_table()

For accurate testing, it was important for us to get a roughly even spread of the different Bible genres in our test dataset.

In [None]:
print_genre_data_split_table()
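
One way to check for an even genre spread is to compare each genre's share of the held-out verses with its share of the whole corpus. The sketch below is only an illustration: the book and genre labels, the `test_books` set, and the `genre_shares` helper are all hypothetical and are not taken from `summarize_data`.

```python
from collections import Counter

# Hypothetical (book, genre) labels for a handful of verses; not the project's real data.
verses = [
    ("Genesis", "law"), ("Genesis", "law"), ("Exodus", "law"),
    ("Psalms", "poetry"), ("Psalms", "poetry"), ("Job", "poetry"),
    ("Mark", "gospel"), ("Mark", "gospel"), ("John", "gospel"),
]
# Hypothetical held-out books, one per genre.
test_books = {"Exodus", "Job", "John"}

def genre_shares(rows):
    """Return each genre's fraction of the given (book, genre) rows."""
    counts = Counter(genre for _, genre in rows)
    total = sum(counts.values())
    return {genre: count / total for genre, count in counts.items()}

test_rows = [row for row in verses if row[0] in test_books]
overall = genre_shares(verses)
held_out = genre_shares(test_rows)

# Print the two shares side by side so large gaps are easy to spot.
for genre in sorted(overall):
    print(f"{genre:8s} overall={overall[genre]:.2f} test={held_out.get(genre, 0.0):.2f}")
```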

Different Modern and Middle English Bible versions may differ slightly in the exact verses provided, but are largely similar. On the other hand, the two Old English Bible versions we used, Ælfric's Old Testament and the West-Saxon Gospels, contain only a small subset of the books of the Bible, let alone its verses.

In [None]:
print_testament_table('t_alf')
print()
print_testament_table('t_wsg')

Although these two versions differ drastically, we combined their verses in order to use as much data as possible.

In [None]:
print_genre_table('t_alf_wsg')

## Preprocessing

As shown above, Bible versions can differ in which verses and books they contain. In order to train any sequence-to-sequence model, we first need to pair together all the Bible verses shared by the relevant Bible versions. To do this, we use our `create_datasets` function, which is our Swiss Army knife for data preprocessing. It does the following:

 - Pairs all Bible verses shared between the given Bible versions
 - Runs any number of specified text pre-processing operations
 - Sets aside the verses from pre-defined test books into a test set
 - Splits the remaining verses into training and validation sets depending on the requested training split
 - Saves the datasets to files if requested
 - Shuffles the datasets if requested
 - Returns the datasets in an easy-to-use dictionary format
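
The pairing and splitting steps above can be sketched roughly as follows. This is a simplified illustration, not the actual `create_datasets` implementation: the `(book, chapter, verse)` keys, the placeholder verse text, and the `make_version` and `pair_and_split` helpers are assumptions made for the example.

```python
def make_version(name, verse_keys):
    """Build a placeholder verse table keyed by (book, chapter, verse)."""
    return {key: f"{name} text for {key[0]} {key[1]}:{key[2]}" for key in verse_keys}

def pair_and_split(versions, test_books, training_fraction):
    """Keep verses present in every version, then split them by book and fraction."""
    shared = sorted(set.intersection(*(set(table) for table in versions.values())))
    test_keys = [key for key in shared if key[0] in test_books]
    rest = [key for key in shared if key[0] not in test_books]
    cut = int(len(rest) * training_fraction)
    split_keys = {"train": rest[:cut], "val": rest[cut:], "test": test_keys}
    # Every version gets the same keys in the same order, so verses stay aligned.
    return {
        split: {name: [table[key] for key in keys] for name, table in versions.items()}
        for split, keys in split_keys.items()
    }

# Hypothetical toy data: one version is missing Genesis 1:5 and all of Ruth.
genesis = [("Genesis", 1, v) for v in range(1, 6)]
mark = [("Mark", 1, v) for v in range(1, 3)]
versions = {
    "t_kjv": make_version("t_kjv", genesis + mark + [("Ruth", 1, 1)]),
    "t_bbe": make_version("t_bbe", genesis[:4] + mark),
}

datasets = pair_and_split(versions, test_books={"Mark"}, training_fraction=0.75)
print({split: len(data["t_kjv"]) for split, data in datasets.items()})
```

Holding out whole books, rather than random verses, keeps the test set free of verses that sit next to training verses in the same chapter.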

In [None]:
from src.data_manager import *
versions = get_bible_versions_by_file_name(['t_kjv', 't_bbe'])
datasets = create_datasets(
    bible_versions = versions,
    training_fraction = 0.85,
    preprocess_operations = [
        preprocess_expand_contractions(),
        preprocess_filter_num_words(max_num_words = 35, min_num_words = 4),
        preprocess_filter_num_sentences(max_num_sentences = 1),
        preprocess_remove_punctuation(preserve_periods = True),
        preprocess_lowercase()
    ],
    write_files = True,
    shuffle = False
)

In [None]:
datasets['test']['t_kjv'][:3]
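
To give a feel for how a list of preprocessing operations might be chained, here is a minimal sketch. It assumes each operation is a function that returns either the transformed verse or `None` to drop it; the helper names below are hypothetical, and the real `preprocess_*` functions in `src.data_manager` may work differently (for example, by operating on verse pairs).

```python
import re

def lowercase():
    """Operation that lowercases a verse."""
    return lambda text: text.lower()

def remove_punctuation(preserve_periods=False):
    """Operation that strips punctuation, optionally keeping periods."""
    keep = re.escape(".") if preserve_periods else ""
    pattern = re.compile(rf"[^\w\s{keep}]")
    return lambda text: pattern.sub("", text)

def filter_num_words(min_num_words, max_num_words):
    """Operation that drops verses outside the given word-count range."""
    def operation(text):
        return text if min_num_words <= len(text.split()) <= max_num_words else None
    return operation

def apply_operations(verses, operations):
    """Run every operation in order; a verse is dropped once any operation returns None."""
    kept = []
    for verse in verses:
        for operation in operations:
            verse = operation(verse)
            if verse is None:
                break
        else:
            kept.append(verse)
    return kept

verses = ["Jesus wept.", "And God said, Let there be light: and there was light."]
operations = [
    filter_num_words(min_num_words=4, max_num_words=35),
    remove_punctuation(preserve_periods=True),
    lowercase(),
]
cleaned = apply_operations(verses, operations)
print(cleaned)
```

The two-word verse falls below the minimum word count and is dropped, while the longer verse is kept with its commas and colon removed.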

Without any preprocessing operations, the results contain much more content:

In [None]:
datasets = create_datasets(
    bible_versions = versions,
    training_fraction = 0.85,
    write_files = True,
    shuffle = False
)

In [None]:
datasets['test']['t_kjv'][:4]

Saving the split datasets to files lets the same data be reused for consistent, reproducible results. The saved data can also be loaded quickly.

In [None]:
!wc -l data/split/*
print()
datasets = load_datasets()
datasets['test']['t_kjv'][:4]
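
As a rough illustration of why saved splits help reproducibility, the sketch below writes each split to a plain-text file with one verse per line and reads it back unchanged. The file naming scheme (`<split>.<version>`) and the helpers `save_datasets` and `load_datasets_from` are assumptions for this example; they are not the project's `load_datasets` and may not match the layout of `data/split/`.

```python
import tempfile
from pathlib import Path

def save_datasets(datasets, directory):
    """Write each split/version pair to its own file, one verse per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for split, versions in datasets.items():
        for name, verses in versions.items():
            text = "".join(verse + "\n" for verse in verses)
            (directory / f"{split}.{name}").write_text(text, encoding="utf-8")

def load_datasets_from(directory):
    """Rebuild the nested dictionary from the files written by save_datasets."""
    datasets = {}
    for path in sorted(Path(directory).iterdir()):
        split, name = path.name.split(".", 1)
        datasets.setdefault(split, {})[name] = path.read_text(encoding="utf-8").splitlines()
    return datasets

# Placeholder verses standing in for real data.
original = {
    "train": {"t_kjv": ["verse a", "verse b"], "t_bbe": ["verse a2", "verse b2"]},
    "test": {"t_kjv": ["verse c"], "t_bbe": ["verse c2"]},
}

with tempfile.TemporaryDirectory() as directory:
    save_datasets(original, directory)
    restored = load_datasets_from(directory)

print(restored == original)
```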

## LSTM Model

## Transformer Model

In [None]:
# Import dependencies
from transformers import (
    BartForConditionalGeneration, BartTokenizer
)

# Model path
from os.path import join
model_path = join('models', 'bart-bbe-to-kjv')

First, let's load one of our fine-tuned sequence-to-sequence transformer models along with a pre-trained tokenizer (this might take a minute):

In [None]:
model = BartForConditionalGeneration.from_pretrained(model_path, max_length = 100)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

Next, let's define a transformer pipeline for translating text:

In [None]:
from transformers import pipeline

translator = pipeline('translation_bbe_to_kjv', model = model, tokenizer = tokenizer)

Finally, we can translate!

In [None]:
num_verses = 4
source_verses, target_verses = datasets['test']['t_bbe'][:num_verses], datasets['test']['t_kjv'][:num_verses]
predicted_verses = [translation['translation_text'] for translation in translator(source_verses, return_text = True)]

for (source_verse, target_verse, predicted_verse) in zip(source_verses, target_verses, predicted_verses):
    print(f'SOURCE:    {source_verse}')
    print(f'TARGET:    {target_verse}')
    print(f'PREDICTED: {predicted_verse}')
    print()

Best of all, you can make your own predictions (feel free to pass whatever you want to the translate function below!):

In [None]:
def translate(text: str) -> str:
    # translates a single string
    return translator(text, return_text = True)[0]['translation_text']

translate('And the fearless instructors gave us a good grade in the class. For just they were. And full of kindness in their soul.')

### Notable findings (Transformer)

Feel free to skip this section if you're not interested.

The transformer seems to have learned parallelism:

In [None]:
print(translate('For they were just.'))
print(translate('For they were just. And full of kindness in their soul.'))

The structure of the second sentence carried over into the first: the first sentence was translated differently when it was followed by the second.

The model learned that 'lord' is often capitalized in the King James Version:

In [None]:
translate('What did the lord say to you?')

The model may translate a sentence differently depending on the ending punctuation (compare the word order with the output above):

In [None]:
translate('What did the lord say to you.')

The model still learned to be derogatory towards homosexuals:

In [None]:
translate('He was a homosexual man.')

It was, after all, trained from verses such as:
`There shall be no prostitute of the daughters of Israel, neither shall there be a sodomite of the sons of Israel.`

However, compared with our previous models, these later models, which had more training and slightly different methods, seem to be less biased against homosexuals and less prone to complete failure:

In [None]:
translate('He was a gay man.')

Previously: `He was a man of the offspring of the evil spirits;`

In [None]:
translate('The black man')

Previously: `The black man, the king of the army, the captain of the army, the captains of the army, the captains of the captains of the captains...`

However, there was still gender bias: the model assumes that pretty much any profession is held by men, except those associated with women:

In [None]:
print(translate('The person had a marriage.'))
print()
print(translate('The carpenter had a marriage.'))
print(translate('The tailor had a marriage.'))
print(translate('The butcher had a marriage.'))
print(translate('The blacksmith had a marriage.'))
print(translate('The real estate agent had a marriage.'))
print(translate('The journalist had a marriage.'))
print(translate('The artist had a marriage.'))
# etc., there are many more
print()
print(translate('The nurse had a marriage.'))
print(translate('The babysitter had a marriage.'))
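
A probe like this can also be tallied automatically rather than read by eye. The sketch below counts gendered pronouns in the output for each profession. It accepts any function with the same shape as `translate`; the `fake_translate` stub, the pronoun lists, and the `tally_gendered_words` helper are hypothetical, and the stub's output is invented purely so the example runs without the model.

```python
import re

# Small, non-exhaustive pronoun lists chosen for this illustration.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}

def tally_gendered_words(translate_fn, professions, template="The {} had a marriage."):
    """Count male and female pronouns in the translation of each templated sentence."""
    results = {}
    for profession in professions:
        output = translate_fn(template.format(profession))
        words = re.findall(r"[a-z]+", output.lower())
        results[profession] = {
            "male": sum(word in MALE for word in words),
            "female": sum(word in FEMALE for word in words),
        }
    return results

def fake_translate(text):
    """Stand-in for the real model; its output is made up, not a model prediction."""
    pronoun = "she" if "nurse" in text else "he"
    return f"{pronoun} {text}"

results = tally_gendered_words(fake_translate, ["carpenter", "nurse"])
for profession, counts in results.items():
    print(profession, counts)
```

Passing the notebook's own `translate` function in place of the stub would tally the model's actual outputs.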

Our results were inconclusive when we tried to determine whether the gender bias was inherent to the models or learned. More humorously, however:

In [None]:
translate('The avocado had a marriage.')

The model understands context, and translates the same word differently even within the same sentence (loving -> loving and loving -> loveth):

In [None]:
translate("Now I'm saving all my loving for someone who's loving me")

Finally, some quotes by Yoda:

In [None]:
translate('Once you start down the dark path, forever will it dominate your destiny. Consume you, it will.')

In [None]:
translate('Death is a natural part of life. Rejoice for those around you who transform into the Force. Mourn them do not. Miss them do not. Attachment leads to jealously.')

In [None]:
translate('On many long journeys have I gone. And waited, too, for others to return from journeys of their own. Some return; some are broken; some come back so different only their names remain.')

In [None]:
translate('No longer certain, that one ever does win a war, I am. For in fighting the battles, the bloodshed, already lost we have. Yet, open to us a path remains. That unknown to the Sith is. Through this path, victory we may yet find. Not victory in the Clone Wars, but victory for all time.')

In [None]:
translate('I can’t believe it, said Luke Skywalker. And Yoda replied, That is why you fail.')