<a href="https://colab.research.google.com/github/mayaschwarz/cs175--lfric-to-Albert/blob/main/EncDecRNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenNMT-py Encoder Decoder LSTM

Sequence-to-Sequence Encoder-Decoder Models for translating Middle, Modern, and Old English.

We implemented an Encoder Decoder architecture with an Attention Mechanism, multiple layers, and bidirectional encoding.

We ran into difficulties implementing beam search decoding for our custom model, so we switched to the OpenNMT-py framework. The framework provides scripts that build the vocabulary, train a model, and translate with it, given a configuration file and data.

It also allowed our smallest dataset (Old English) to grow from ~3k to ~5k sentence pairs.

## Google Colab Set Up

Steps to set up the Google Colab environment. If you're running this locally, feel free to ignore this section.

The only requirement is that you install the required packages from `requirements.txt`. If there are any dependency errors, please raise an issue with the repository.

In [1]:
from google.colab import drive

# default location for the drive
ROOT = "/content/gdrive"

drive.mount(ROOT)

Mounted at /content/gdrive


In [2]:
# Clone github repository setup
# import join used to join ROOT path and MY_GOOGLE_DRIVE_PATH
from os.path import join  

# path to your project on Google Drive
MY_GOOGLE_DRIVE_PATH = 'My Drive/cs175-Aelfric-to-Albert' 
GIT_USERNAME = "mayaschwarz" 

PROJECT_PATH = join(ROOT, MY_GOOGLE_DRIVE_PATH)

# It's good to print out the value if you are not sure 
print("PROJECT_PATH: ", PROJECT_PATH)   

GIT_PATH = "https://github.com/mayaschwarz/cs175--lfric-to-Albert.git"
print("GIT_PATH: ", GIT_PATH)

PROJECT_PATH:  /content/gdrive/My Drive/cs175-Aelfric-to-Albert
GIT_PATH:  https://github.com/mayaschwarz/cs175--lfric-to-Albert.git


In [None]:
# Answer input query for downloading git repository
while True:
    response = input("Are you sure you want to download the repo? Doing so will delete all unpushed work. [y|N] ").lower().strip()
    if not response or response[0] == 'n':
        break
    elif response[0] == "y":
        !rm -rv "{PROJECT_PATH}"
        !mkdir -p "{PROJECT_PATH}" 
        !git clone "{GIT_PATH}" "{PROJECT_PATH}"
        break

# cd into the repository
%cd "{PROJECT_PATH}"

In [None]:
# Check that repository is up to date
!git pull 

In [None]:
# Check which branch you're on
!git branch

In [None]:
# Error warnings should be safe to ignore; the dataset and opennmt-py
# versions run fine.
# If you encounter "package does not exist" errors, restart the runtime.
!pip install -r requirements.txt

## Setting up the Python Environment


In [None]:
# load notebook environment variables
%load_ext tensorboard

In [None]:
# standard library
import math
from os import listdir
import re
import random

# additional libraries (pip install ..)
import cltk
import nltk
import onmt
from onmt.utils.misc import set_random_seed
import pyonmttok
import torch
import torch.nn as nn
from torchtext.data import Dataset
import yaml

# local libraries
from src.data_manager import *
from src.paths import *

In [None]:
def set_deterministic(seed: int = 1234):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    set_random_seed(seed, torch.cuda.is_available())

set_deterministic()

## Preprocessing and Tokenization


## Data Retrieval


In [None]:
from pandas import read_csv

class HomiliesDataset(Dataset):
    '''
    Processes the Homilies Dataset

    Arguments:
      path{str|Path} -- path to the file containing the homilies dataset
      reverse{bool} -- if True, swap source and target
    '''
    def __init__(self, path, reverse=False):
        df = read_csv(path)
        self.src = list(df['text'])
        self.src_key = 't_old'
        self.tgt = list(df['translation'])
        self.tgt_key = 't_mod'

        if reverse:
            # swap the sentences and their keys together
            self.src, self.tgt = self.tgt, self.src
            self.src_key, self.tgt_key = self.tgt_key, self.src_key

    def __getitem__(self, index):
        return self.src[index], self.tgt[index]

    def __len__(self):
        return len(self.src)

    def bible_format(self, training: float = 0.7, valid: float = 0.0) -> {str : {str : [str]}}:
        '''
        Returns the dataset formatted similar to the bible datasets to allow
        common operations.

        Keys are accessed as t_old and t_mod to access the versions

        Arguments:
            training{float} -- percentage of training data
            valid{float} -- percentage of data set aside for validation, rest is test data
        '''
        dataset = { 'training': { self.src_key : [], self.tgt_key: []}, 
                    'validation': { self.src_key : [], self.tgt_key: []}, 
                    'test': { self.src_key : [], self.tgt_key: []} 
                  }

        n = len(self)
        train_size = int(training * n)
        valid_size = int(valid * n)
        test_size  = n - train_size - valid_size
        train, valid, test = torch.utils.data.random_split(
                                                          self, 
                                                          [
                                                           train_size, 
                                                           valid_size, 
                                                           test_size
                                                           ])
        src_train, tgt_train = zip(*train)
        src_valid, tgt_valid = zip(*valid)
        src_test, tgt_test = zip(*test)

        dataset['training'][self.src_key] = list(src_train)
        dataset['training'][self.tgt_key] = list(tgt_train)

        dataset['validation'][self.src_key] = list(src_valid)
        dataset['validation'][self.tgt_key] = list(tgt_valid)

        dataset['test'][self.src_key] = list(src_test)
        dataset['test'][self.tgt_key] = list(tgt_test)

        return dataset
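
The split arithmetic in `bible_format` can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper `split_sizes` (not part of the notebook) that mirrors the three size computations above. Because the training and validation sizes are floored, any leftover examples end up in the test split.

```python
def split_sizes(n: int, training: float = 0.7, valid: float = 0.0) -> tuple:
    """Mirror the size arithmetic used by bible_format (hypothetical helper)."""
    train_size = int(training * n)            # floor of the training share
    valid_size = int(valid * n)               # floor of the validation share
    test_size = n - train_size - valid_size   # remainder goes to the test split
    return train_size, valid_size, test_size


# The three sizes always add up to n, which random_split requires
sizes = split_sizes(1000, training=0.7, valid=0.15)
print(sizes)
```

With the default `valid=0.0`, the validation split is empty and everything outside the training share becomes test data.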

## Tokenization

In [None]:
import cltk
from cltk.corpus.middle_english.alphabet import normalize_middle_english
from cltk.phonology.old_english.phonology import Word
from typing import Union

def _normalize(text: str, language_code: str):
    if language_code == 'ang':
        # old english
        DONT_NORMALIZE = '!?.&,:;"'
        normalized_words = list()
        for word in text.split():
            if len(word) == 0:
                continue

            if word[-1] in DONT_NORMALIZE:
                normalized_words.append(Word(word[:-1]).ascii_encoding() + word[-1])
            else:
                normalized_words.append(Word(word).ascii_encoding())

        return ' '.join(normalized_words)
    elif language_code == 'enm':
        # middle english
        return normalize_middle_english(text, to_lower=False, alpha_conv=True, punct=False)
    return text

def tokenizer(text: str, language_code: str, **kwargs: bool) -> [str]:
    tok = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, **kwargs)
    tokens, _ = tok.tokenize(_normalize(text, language_code))
    return tokens

def write_tokenized_dataset(dataset: {str: {str: [str]}}, source: str, source_language_code: str, target: str, target_language_code: str, file_paths: {str, Union[str, Path], Union[str, Path]}, token_kwargs: {str: bool} = {}) -> None:
    """
    Given a dataset, tokenizes and writes the contents according to its file path

    Arguments:
      dataset {{str: [str]}} -- dataset returned from create_datasets
      file_paths - dictionary with key as the dataset-type (training, validation, test), item as (path to source, path to target)
      token_kwargs {{str: bool}} -- kwargs for the tokenizer (case_markup, etc.)
    """
    for dataset_t in file_paths.keys():
        src_path, tgt_path = file_paths[dataset_t]
        with open(src_path, mode='w+', encoding='utf-8') as src, open(tgt_path, mode='w+', encoding='utf-8') as tgt:
            src.write('\n'.join([" ".join(tokenizer(l, source_language_code, **token_kwargs)) for l in dataset[dataset_t][source]]))
            tgt.write('\n'.join([" ".join(tokenizer(l, target_language_code, **token_kwargs)) for l in dataset[dataset_t][target]]))

# Training

In [None]:
# Check if GPU is active
# If not, go to "Runtime" menu > "Change runtime type" > "GPU"

!nvidia-smi -L

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-dec3bf4a-ba0e-c1e2-2141-c60db31ee2fb)


In [None]:
# Make sure the GPU is visible to PyTorch
import torch

gpu_id = torch.cuda.current_device()
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(gpu_id))

True
Tesla V100-SXM2-16GB


In [None]:
def build_and_train(config_path):
    # build and store vocab in run folder
    !onmt_build_vocab -config "{config_path}" -n_sample -1
    # begin training
    !onmt_train -config "{config_path}"

# Translation and Evaluation

See [here](https://opennmt.net/OpenNMT-py/options/translate.html) for more info on translation parameters

Evaluation using BLEU and METEOR
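
To give a feel for what BLEU rewards, here is a minimal sketch of clipped n-gram precision, the quantity BLEU is built on. It is an illustration only: the helper names `ngrams` and `clipped_precision` are hypothetical, and the sketch omits the brevity penalty, the combination across n-gram orders, and the tokenization that sacrebleu applies.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Repeating "the" does not help: only one occurrence is credited
p1 = clipped_precision("the the the cat", "the cat sat", n=1)
p2 = clipped_precision("the the the cat", "the cat sat", n=2)
print(p1, p2)
```

Clipping is what stops a candidate from scoring well by repeating a common reference word.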

In [None]:
from datasets import list_metrics, load_metric
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate import meteor

def compute_score(candidate_verses: [str], reference_verses: [str], metric_name: str = 'sacrebleu') -> dict:
    metric = load_metric(metric_name)
    # sacrebleu expects a list of references for each prediction
    if metric_name == 'sacrebleu':
        reference_verses = [[r] for r in reference_verses]
    
    if len(candidate_verses) < len(reference_verses):
        print("fewer candidate verses than reference verses, trimming references to fit")
        reference_verses = reference_verses[:len(candidate_verses)]

    metric.add_batch(predictions = candidate_verses, references = reference_verses)
    
    # returns the metric's result dictionary (see get_score for the relevant key)
    return metric.compute()

def get_detokenized_file(filename: Union[str, Path], tokenize: pyonmttok.Tokenizer) -> [str]:
    with open(filename, encoding='utf-8') as f:
        # Add line to strip empty lines
        lines = [tokenize.detokenize(line.rstrip('\n').split(' ')) for line in f]
        if lines[-1] == '':
            lines = lines[:-1]

        return lines

def get_score(metric_name: str, value: {str: float}) -> float:
    if metric_name == 'sacrebleu':
        return value['score']
    elif metric_name == 'meteor':
        return value['meteor']

def evaluate(models: [Union[str, Path]], source: Union[str, Path], target: Union[str, Path], eval_metrics: [str], token_kwargs: {str:bool},  max_length: int, beam_size: int = 5, save_folder='./predictions', verbose=True) -> {str: {str: {}}}:
    tokenize = pyonmttok.Tokenizer("aggressive", **token_kwargs)
    # detokenize the reference file that has been tokenized
    # (this ensures that any normalization techniques used do not affect the scoring)
    references = get_detokenized_file(target, tokenize)
  
    scores = dict() 
    for m in models:
        # get the model name (file name without the '.pt' extension)
        model_name = Path(m).name[:-3]

        filename = f"{save_folder}/{model_name}_pred.txt"
        
        # Call the translate script to generate token predictions
        !onmt_translate -model "{m}" -src "{source}" -output "{filename}" -min_length 1 -max_length "{max_length}" -beam_size "{beam_size}" -gpu 0 
        
        # Retrieve candidate sentences
        candidates = get_detokenized_file(filename, tokenize)
        
        print(f'{model_name} SCORE:')

        metrics = dict()
        for eval_name in eval_metrics:
            eval_score = compute_score(candidates, references, eval_name)
            if verbose:
                print(f'\t{eval_name} = {get_score(eval_name, eval_score):.4f}')
            metrics[eval_name] = eval_score
        scores[model_name] = metrics

    return scores
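
The dictionary returned by `evaluate` maps each model name to its per-metric results, so picking the best checkpoint is a small lookup. The following is a minimal sketch, assuming a hypothetical helper `best_checkpoint` and the result keys that `get_score` reads (`'score'` for sacrebleu). The numbers in the example are made up for illustration.

```python
def best_checkpoint(scores: dict, metric: str = 'sacrebleu', key: str = 'score'):
    """Return (model_name, value) for the highest scores[name][metric][key]."""
    name = max(scores, key=lambda m: scores[m][metric][key])
    return name, scores[name][metric][key]


# Made-up scores in the same shape that evaluate() returns
example_scores = {
    'demo_step_1000': {'sacrebleu': {'score': 10.0}, 'meteor': {'meteor': 0.20}},
    'demo_step_2000': {'sacrebleu': {'score': 12.5}, 'meteor': {'meteor': 0.25}},
}

best = best_checkpoint(example_scores)
print(best)
```

To rank by METEOR instead, pass `metric='meteor', key='meteor'`.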

# Configuring the Data, Model, and Training Parameters
Generate a YAML file that contains all the hyperparameters and system variables necessary to build the vocab and to build and train the model.

See [here](https://opennmt.net/OpenNMT-py/options/build_vocab.html) for more info on building vocab

See [here](https://opennmt.net/OpenNMT-py/options/train.html) for more info about building the model and training parameters
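
The configurations below repeat the same data section with different paths. As a minimal sketch, a hypothetical helper `render_data_section` (not part of the notebook) could render that part from a list of training pairs. It only emits keys that already appear in the configurations in this notebook and leaves out `transforms` and all training options.

```python
from pathlib import Path


def render_data_section(run_path: Path, train_pairs: list, valid_pair: tuple) -> str:
    """Render save_data, vocab paths, and the data block (hypothetical helper)."""
    lines = [
        f"save_data: {run_path}",
        f"src_vocab: {run_path / 'vocab.src'}",
        f"tgt_vocab: {run_path / 'vocab.tgt'}",
        "data:",
    ]
    # one corpus_N entry per (source, target) training pair
    for i, (src, tgt) in enumerate(train_pairs, start=1):
        lines += [
            f"    corpus_{i}:",
            f"        path_src: {src}",
            f"        path_tgt: {tgt}",
            "        weight: 1",
        ]
    valid_src, valid_tgt = valid_pair
    lines += [
        "    valid:",
        f"        path_src: {valid_src}",
        f"        path_tgt: {valid_tgt}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder paths, for illustration only
section = render_data_section(
    Path('demo/run'),
    [(Path('a-train.src'), Path('a-train.tgt')), (Path('b-train.src'), Path('b-train.tgt'))],
    (Path('valid.src'), Path('valid.tgt')),
)
print(section)
```

Keeping the shared part in one function would mean a path change only has to be made once, at the cost of a little indirection.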

In [None]:
# declare the config folder to store all the yaml files
CONFIG_NAME = 'openmt-config'
!mkdir -p "{CONFIG_NAME}"
CONFIG_PATH = Path(CONFIG_NAME)

## Middle and Modern English

### Middle to Modern


In [None]:
from pathlib import Path

ENM2MOD_TRANSLATE_NAME = 'enm2mod'
!mkdir -p '{ENM2MOD_TRANSLATE_NAME}'

# PATH VARIABLES
ENM2MOD_TRANSLATE_PATH = Path(ENM2MOD_TRANSLATE_NAME)
ENM2MOD_RUN_PATH = ENM2MOD_TRANSLATE_PATH / 'run'
!mkdir -p "{ENM2MOD_RUN_PATH}"

# Dataset Variables
ENM2MOD_SOURCE_VER = 't_wyc'
ENM2MOD_SRC_LANG_CODE = 'enm'
ENM2MOD_TARGET_VER = 't_kjv'
ENM2MOD_TGT_LANG_CODE = 'eng'

MAX_SENTENCE_LENGTH = 60

# Dataset Paths
DATA_PATH = Path('data/preprocessed')
!mkdir -p "{DATA_PATH}"

set_deterministic()

In [None]:
# Generate splits and write to files
versions = get_bible_versions_by_file_name([ENM2MOD_SOURCE_VER, ENM2MOD_TARGET_VER])

datasets = create_datasets(versions, .82, 
                preprocess_operations = [preprocess_filter_num_words(MAX_SENTENCE_LENGTH),
                                         preprocess_expand_contractions(),
                                         preprocess_filter_num_sentences(),
                ]);

In [None]:
ENM2MOD_SRC_EXT = ENM2MOD_SOURCE_VER[2:]
ENM2MOD_TGT_EXT = ENM2MOD_TARGET_VER[2:]


enm2mod_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{ENM2MOD_SRC_EXT}', DATA_PATH / f'bible-train.{ENM2MOD_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{ENM2MOD_SRC_EXT}', DATA_PATH / f'bible-valid.{ENM2MOD_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{ENM2MOD_SRC_EXT}', DATA_PATH / f'bible-test.{ENM2MOD_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }

In [None]:
write_tokenized_dataset(datasets, ENM2MOD_SOURCE_VER, ENM2MOD_SRC_LANG_CODE, ENM2MOD_TARGET_VER, ENM2MOD_TGT_LANG_CODE, enm2mod_file_paths, token_kwargs)

In [None]:
ENM2MOD_SRC_VOCAB_PATH = ENM2MOD_RUN_PATH / 'vocab.src'
ENM2MOD_TGT_VOCAB_PATH = ENM2MOD_RUN_PATH / 'vocab.tgt'

enm2mod_yaml = 'enm2mod.yaml'

ENM2MOD_MODEL_PATH = ENM2MOD_RUN_PATH / 'models'
ENM2MOD_MODEL_PREFIX = 'enm2mod'

In [None]:
config =  f'''# {enm2mod_yaml}
save_data: {ENM2MOD_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {ENM2MOD_SRC_VOCAB_PATH}
tgt_vocab: {ENM2MOD_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {enm2mod_file_paths['training'][0]}
        path_tgt: {enm2mod_file_paths['training'][1]}
        transforms: []
        weight: 1
    valid:
        path_src: {enm2mod_file_paths['validation'][0]}
        path_tgt: {enm2mod_file_paths['validation'][1]}
        transforms: []

## silently ignore empty lines in data
skip_empty_level: silent

### TRAINING ###
## Where the model will be saved
save_model: {ENM2MOD_MODEL_PATH / ENM2MOD_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
# early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {ENM2MOD_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 64
valid_batch_size: 64
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 1024
word_vec_size: 256
dropout: 0.5
attn_dropout: 0.3
'''

with open(CONFIG_PATH / enm2mod_yaml, "w+") as config_yaml:
  config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / enm2mod_yaml)

In [None]:
# retrieve the models
enm2mod_models = [ ENM2MOD_MODEL_PATH / f for f in listdir(ENM2MOD_MODEL_PATH) if f.startswith(ENM2MOD_MODEL_PREFIX)]

ENM2MOD_PREDICTIONS_PATH = ENM2MOD_RUN_PATH / 'predictions'
!mkdir -p "{ENM2MOD_PREDICTIONS_PATH}"

eval_metrics = ['sacrebleu', 'meteor']

In [None]:
enm2mod_scores = evaluate(enm2mod_models, 
                          enm2mod_file_paths['test'][0], 
                          enm2mod_file_paths['test'][1], 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          str(ENM2MOD_PREDICTIONS_PATH))

The best-performing model was the checkpoint after 2900 training iterations, with early stopping and beam size 5:

    BLEU   = 26.5572
    METEOR = 0.4481

Interesting note: adding a capitalization token and keeping punctuation increased both BLEU and METEOR scores by 2-4% compared to lowercased text without punctuation.
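
The idea behind a capitalization token can be shown with a simplified stand-in. This sketch is not pyonmttok's implementation: the marker `<cap>` and the helpers `encode_case` and `decode_case` are hypothetical, and pyonmttok uses its own marker tokens. It only shows how case can be moved into a separate token so that "And" and "and" share one vocabulary entry while the casing can still be restored.

```python
CAP = "<cap>"  # hypothetical marker, chosen for this illustration only


def encode_case(tokens):
    """Lowercase the first letter of capitalized tokens and emit a marker before them."""
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.append(CAP)
            out.append(tok[:1].lower() + tok[1:])
        else:
            out.append(tok)
    return out


def decode_case(tokens):
    """Undo encode_case by capitalizing the token that follows each marker."""
    out, capitalize_next = [], False
    for tok in tokens:
        if tok == CAP:
            capitalize_next = True
            continue
        out.append(tok[:1].upper() + tok[1:] if capitalize_next else tok)
        capitalize_next = False
    return out


words = "And the Lord God clepide Adam".split()
encoded = encode_case(words)
restored = decode_case(encoded)
print(encoded)
print(restored)
```

The encoded sequence is longer, but the vocabulary no longer needs separate entries for the capitalized and lowercase forms of a word.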

#### User Studies Predictions

Generate predictions for the user studies


In [None]:
queries = ['In the bigynnyng God made of nouyt heuene and erthe.', 
           'And Adam clepide the name of his wijf Eue, for sche was the moder of alle men lyuynge.', 
           'And the Lord God clepide Adam, and seide to hym, Where art thou?', 
           'Forsothe the Lord hadde mynde of Noe, and of alle lyuynge beestis, and of alle werk beestis, that weren with hym in the schip; and brouyte a wynd on the erthe.', 
           'And whanne God seiy, that the erthe was corrupt, for ech fleisch ether man hadde corrupt his weie on erthe,', 
           'bi tweyne and bi tweyne, male and female entriden to Noe in to the schip, as the Lord comaundide to Noe.', 
           'And sotheli the watrys yeden and decresiden til to the tenthe monethe, for in the tenthe monethe, in the firste dai of the monethe, the coppis of hillis apperiden.', 
           'And God fillide in the seuenthe dai his werk which he made; and he restide in the seuenthe dai fro al his werk which he hadde maad;', 
           'And the erthe brouyte forth greene erbe and makynge seed bi his kynde, and a tre makynge fruyt, and ech hauynge seed by his kynde. And God seiy that it was good.', 
           'And the Lord dide in that nyyt, as Gedeon axide; and drynesse was in the flees aloone, and deew was in al the erthe.', 
           'Also `the trees spaken to the vyne, Come thou, and comaunde to vs.', 
           'Therfor not Y do synne ayens thee, but thou doist yuel ayens me, and bryngist in batels not iust to me; the Lord, iuge of this dai, deme bitwixe the sones of Israel and bitwixe the sones of Amon.', 
           'The trauel of foolis shal turment hem, that kunnen not go in to the citee.', 
           'And if seuene sithis in the dai he do synne ayens thee, and seuene sithis in the dai he be conuertid to thee, and seie, It forthenkith me, foryyue thou hym.', 
           'and seide, Oneli Y knewe you of alle the kynredis of erthe; therfor Y schal visite on you alle youre wickidnessis.', 
           'And he is heed of the bodi of the chirche; which is the bigynnyng and the firste bigetun of deede men, that he holde the firste dignyte in alle thingis.', 
           'And the foure beestis seiden, Amen. And the foure and twenti eldre men fellen doun on her faces, and worschipiden hym that lyueth in to worldis of worldis.', 
           'He brak at noumbre my teeth; he fedde me with aische.', 
           'And if Sathanas be departid ayens hym silf, hou schal his rewme stonde? For ye seien, that Y caste out feendis in Belsabub.', 
           'coueitouse, hiy of bering, proude, blasfemeris, not obedient to fadir and modir, vnkynde,', 
           'Forsothe God seide, Liytis be maad in the firmament of heuene, and departe tho the dai and niyt; and be tho in to signes, and tymes, and daies, and yeeris;', 
           'Isaac dredde bi a greet astonying; and he wondride more, than it mai be bileued, and seide, Who therfor is he which a while ago brouyte to me huntyng takun, and Y eet of alle thingis bifor that thou camest; and Y blesside him? and he schal be blessid.', 
           'And lo! an aungel of the Lord criede fro heuene, and seide, Abraham! Abraham!', 
           'Sotheli Abraham plauntide a wode in Bersabee, and inwardli clepide there the name of euerlastinge God; and he was an erthetiliere ether a comelynge of the lond of Palestynes in many dayes.',
           'And he helde forth his hond, and took the swerd to sacrifice his sone.', 
           'Abraham turnede ayen to hise children, and thei yeden to Bersabee to gidere, and he dwellide there.', 
           'And whanne ye weren deed in giltis, and in the prepucie of youre fleisch, he quikenyde togidere you with hym;', 
           'Wymmen, be ye sugetis to youre hosebondis, as it bihoueth in the Lord.', 
           'For he that doith iniurie, schal resseyue that that he dide yuele; and acceptacioun of persoones is not anentis God.', 
           'Aristark, prisoner with me, gretith you wel, and Mark, the cosyn of Barnabas, of whom ye han take maundementis; if he come to you, resseyue ye hym;', 
           'But to God and oure fadir be glorie in to worldis of worldis.'
           ]

In [None]:
# Best Performing Model
model = ENM2MOD_MODEL_PATH / f'{ENM2MOD_MODEL_PREFIX}_step_2900.pt'
MIDDLE_TEXT_TOK = DATA_PATH / 'user-studies.enm'
MIDDLE_TEST_PRED = ENM2MOD_PREDICTIONS_PATH / 'middle-text-pred.txt'

# Tokenize the user-study queries and write them one per line
with open(MIDDLE_TEXT_TOK, mode='w+', encoding='utf-8') as f:
    f.write('\n'.join([" ".join(tokenizer(l, 'enm', **token_kwargs)) for l in queries]))

!onmt_translate -model "{model}" -src "{MIDDLE_TEXT_TOK}" -output "{MIDDLE_TEST_PRED}" -min_length 1 -max_length 60 -beam_size 5 -gpu 0 

In [None]:
tokenize = pyonmttok.Tokenizer("aggressive", **token_kwargs)
hypotheses = get_detokenized_file(MIDDLE_TEST_PRED, tokenize)

for hyp in hypotheses:
    print(hyp)

### Modern to Middle

We can reuse the preprocessing files saved from the previous model
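
Reusing the files works because the reverse direction only swaps which file of each pair is the source. The following is a minimal sketch, assuming a hypothetical helper `reverse_direction`. The example paths follow the `bible-<split>.<version>` naming used in this notebook.

```python
from pathlib import Path


def reverse_direction(file_paths: dict) -> dict:
    """Swap (source, target) in every split so the target becomes the source."""
    return {split: (tgt, src) for split, (src, tgt) in file_paths.items()}


data_path = Path('data/preprocessed')
forward_paths = {
    'training': (data_path / 'bible-train.wyc', data_path / 'bible-train.kjv'),
    'validation': (data_path / 'bible-valid.wyc', data_path / 'bible-valid.kjv'),
    'test': (data_path / 'bible-test.wyc', data_path / 'bible-test.kjv'),
}

reverse_paths = reverse_direction(forward_paths)
print(reverse_paths['training'])
```

The cells below build the same mapping explicitly from the swapped version names, which is equivalent.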

In [None]:
MOD2ENM_TRANSLATE_NAME = 'mod2enm'
!mkdir -p '{MOD2ENM_TRANSLATE_NAME}'

# PATH VARIABLES
MOD2ENM_TRANSLATE_PATH = Path(MOD2ENM_TRANSLATE_NAME)
MOD2ENM_RUN_PATH = MOD2ENM_TRANSLATE_PATH / 'run'
!mkdir -p "{MOD2ENM_RUN_PATH}"

# Dataset Variables (swap previous run)
MOD2ENM_SOURCE_VER = 't_kjv'
MOD2ENM_TARGET_VER = 't_wyc'

In [None]:
MOD2ENM_SRC_EXT = MOD2ENM_SOURCE_VER[2:]
MOD2ENM_TGT_EXT = MOD2ENM_TARGET_VER[2:]

mod2enm_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{MOD2ENM_SRC_EXT}', DATA_PATH / f'bible-train.{MOD2ENM_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{MOD2ENM_SRC_EXT}', DATA_PATH / f'bible-valid.{MOD2ENM_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{MOD2ENM_SRC_EXT}', DATA_PATH / f'bible-test.{MOD2ENM_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }

In [None]:
# datasets are already tokenized by the first run, no need to do again
# write_tokenized_dataset(datasets, MOD2ENM_SOURCE_VER, ENM2MOD_TGT_LANG_CODE, MOD2ENM_TARGET_VER, ENM2MOD_SRC_LANG_CODE, mod2enm_file_paths, token_kwargs)

In [None]:
MOD2ENM_SRC_VOCAB_PATH = MOD2ENM_RUN_PATH / 'vocab.src'
MOD2ENM_TGT_VOCAB_PATH = MOD2ENM_RUN_PATH / 'vocab.tgt'

mod2enm_yaml = 'mod2enm.yaml'

MOD2ENM_MODEL_PATH = MOD2ENM_RUN_PATH / 'models'
MOD2ENM_MODEL_PREFIX = 'mod2enm'

In [None]:
config =  f'''# {mod2enm_yaml}
save_data: {MOD2ENM_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {MOD2ENM_SRC_VOCAB_PATH}
tgt_vocab: {MOD2ENM_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {mod2enm_file_paths['training'][0]}
        path_tgt: {mod2enm_file_paths['training'][1]}
        transforms: []
        weight: 1
    valid:
        path_src: {mod2enm_file_paths['validation'][0]}
        path_tgt: {mod2enm_file_paths['validation'][1]}
        transforms: []

## silently ignore empty lines in data
skip_empty_level: silent

### TRAINING ###
## Where the model will be saved
save_model: {MOD2ENM_MODEL_PATH / MOD2ENM_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {MOD2ENM_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 64
valid_batch_size: 64
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 1024
word_vec_size: 256
dropout: 0.5
attn_dropout: 0.3
'''

with open(CONFIG_PATH / mod2enm_yaml, "w+") as config_yaml:
  config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / mod2enm_yaml)

In [None]:
# retrieve the models
mod2enm_model_paths = [ MOD2ENM_MODEL_PATH / f for f in listdir(MOD2ENM_MODEL_PATH) if f.startswith(MOD2ENM_MODEL_PREFIX)]

MOD2ENM_PREDICTIONS_PATH = MOD2ENM_RUN_PATH / 'predictions'
!mkdir -p "{MOD2ENM_PREDICTIONS_PATH}"

# NOTE: METEOR relies on English-language resources, so treat its score with caution for non-English (Middle English) output
eval_metrics = ['sacrebleu', 'meteor']

In [None]:
mod2enm_scores = evaluate(mod2enm_model_paths, 
                          mod2enm_file_paths['test'][0], 
                          mod2enm_file_paths['test'][1], 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          MOD2ENM_PREDICTIONS_PATH)

By BLEU score, the best-performing model was the checkpoint after 3000 training iterations at beam size 5:

    BLEU   = 28.4447
    METEOR = 0.4577

## Old and Modern

### Old to Modern

In [None]:
from pathlib import Path

ANG2MOD_TRANSLATE_NAME = 'ang2mod'
!mkdir -p '{ANG2MOD_TRANSLATE_NAME}'

# PATH VARIABLES
ANG2MOD_TRANSLATE_PATH = Path(ANG2MOD_TRANSLATE_NAME)
ANG2MOD_RUN_PATH = ANG2MOD_TRANSLATE_PATH / 'run'
!mkdir -p "{ANG2MOD_RUN_PATH}"

## Dataset Variables
# For the Homilies Dataset
ANG2MOD_HOM_SOURCE_VER = 't_old'
ANG2MOD_HOM_SRC_LANG_CODE = 'ang'
ANG2MOD_HOM_TARGET_VER = 't_mod'
ANG2MOD_HOM_TGT_LANG_CODE = 'eng'

# For the Bible Dataset
ANG2MOD_SOURCE_VER = 't_alf_wsg'
ANG2MOD_SRC_LANG_CODE = 'ang'
ANG2MOD_TARGET_VER = 't_kjv'
ANG2MOD_TGT_LANG_CODE = 'eng'

MAX_SENTENCE_LENGTH = 60

# Dataset Paths
DATA_PATH = Path('data/preprocessed')
!mkdir -p "{DATA_PATH}"

set_deterministic()

In [None]:
homilies_raw = HomiliesDataset(MISC_TEXTS_PATH / 't_hom.csv')
hom_dataset = homilies_raw.bible_format(training=0.7, valid=0.15)

In [None]:
print("# training verses: \t", len(hom_dataset['training']['t_old']))
print("# validation verses: \t", len(hom_dataset['validation']['t_old']))
print("# test verses: \t", len(hom_dataset['test']['t_old']))

In [None]:
# Generate splits and write to files
versions = get_bible_versions_by_file_name([ANG2MOD_SOURCE_VER, ANG2MOD_TARGET_VER])

datasets = create_datasets(versions, 0.8, 
                preprocess_operations = [preprocess_filter_num_words(MAX_SENTENCE_LENGTH),
                                         preprocess_expand_contractions(),
                ], write_files=True);

In [None]:
ANG2MOD_HOM_SRC_EXT = ANG2MOD_HOM_SOURCE_VER[2:]
ANG2MOD_HOM_TGT_EXT = ANG2MOD_HOM_TARGET_VER[2:]

ang2mod_hom_file_paths = {
    'training' : (DATA_PATH / f'hom-train.{ANG2MOD_HOM_SRC_EXT}', DATA_PATH / f'hom-train.{ANG2MOD_HOM_TGT_EXT}'),
    'validation' : (DATA_PATH / f'hom-valid.{ANG2MOD_HOM_SRC_EXT}', DATA_PATH / f'hom-valid.{ANG2MOD_HOM_TGT_EXT}'),
    'test' : (DATA_PATH / f'hom-test.{ANG2MOD_HOM_SRC_EXT}', DATA_PATH / f'hom-test.{ANG2MOD_HOM_TGT_EXT}')
    }

ANG2MOD_SRC_EXT = ANG2MOD_SOURCE_VER[2:]
ANG2MOD_TGT_EXT = ANG2MOD_TARGET_VER[2:]

ang2mod_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{ANG2MOD_SRC_EXT}', DATA_PATH / f'bible-train.{ANG2MOD_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{ANG2MOD_SRC_EXT}', DATA_PATH / f'bible-valid.{ANG2MOD_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{ANG2MOD_SRC_EXT}', DATA_PATH / f'bible-test.{ANG2MOD_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }

In [None]:
write_tokenized_dataset(hom_dataset, ANG2MOD_HOM_SOURCE_VER, ANG2MOD_HOM_SRC_LANG_CODE, ANG2MOD_HOM_TARGET_VER, ANG2MOD_HOM_TGT_LANG_CODE, ang2mod_hom_file_paths, token_kwargs)
write_tokenized_dataset(datasets, ANG2MOD_SOURCE_VER, ANG2MOD_SRC_LANG_CODE, ANG2MOD_TARGET_VER, ANG2MOD_TGT_LANG_CODE, ang2mod_file_paths, token_kwargs)

In [None]:
# Need to combine the validation sets for training
COMBINED_VALID_SRC = DATA_PATH / f'combined-valid.{ANG2MOD_HOM_SRC_EXT}.{ANG2MOD_SRC_EXT}'
COMBINED_VALID_TGT = DATA_PATH / f'combined-valid.{ANG2MOD_HOM_TGT_EXT}.{ANG2MOD_TGT_EXT}'
COMBINED_TEST_SRC = DATA_PATH / f'combined-test.{ANG2MOD_HOM_SRC_EXT}.{ANG2MOD_SRC_EXT}'
COMBINED_TEST_TGT = DATA_PATH / f'combined-test.{ANG2MOD_HOM_TGT_EXT}.{ANG2MOD_TGT_EXT}'

In [None]:
with open(COMBINED_VALID_SRC, mode='w+', encoding='utf-8') as f:
    with open(ang2mod_hom_file_paths['validation'][0], encoding='utf-8') as hom:
        f.write('\n'.join([l.rstrip('\n') for l in hom if l != '\n']))
    # separate the corpora so the last homilies line and the first bible line do not merge
    f.write('\n')
    with open(ang2mod_file_paths['validation'][0], encoding='utf-8') as bible:
        f.write('\n'.join([l.rstrip('\n') for l in bible if l != '\n']))

with open(COMBINED_VALID_TGT, mode='w+', encoding='utf-8') as f:
    with open(ang2mod_hom_file_paths['validation'][1], encoding='utf-8') as hom:
        f.write('\n'.join([l.rstrip('\n') for l in hom if l != '\n']))
    # same separator on the target side keeps source and target lines aligned
    f.write('\n')
    with open(ang2mod_file_paths['validation'][1], encoding='utf-8') as bible:
        f.write('\n'.join([l.rstrip('\n') for l in bible if l != '\n']))

In [None]:
with open(COMBINED_TEST_SRC, mode='w+', encoding='utf-8') as f:
    with open(ang2mod_hom_file_paths['test'][0], encoding='utf-8') as hom:
        f.write('\n'.join([l.rstrip('\n') for l in hom if l != '\n']))
    # separate the corpora so the last homilies line and the first bible line do not merge
    f.write('\n')
    with open(ang2mod_file_paths['test'][0], encoding='utf-8') as bible:
        f.write('\n'.join([l.rstrip('\n') for l in bible if l != '\n']))

with open(COMBINED_TEST_TGT, mode='w+', encoding='utf-8') as f:
    with open(ang2mod_hom_file_paths['test'][1], encoding='utf-8') as hom:
        f.write('\n'.join([l.rstrip('\n') for l in hom if l != '\n']))
    # same separator on the target side keeps source and target lines aligned
    f.write('\n')
    with open(ang2mod_file_paths['test'][1], encoding='utf-8') as bible:
        f.write('\n'.join([l.rstrip('\n') for l in bible if l != '\n']))
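
The four concatenations above follow one pattern, which could be factored into a helper. The following is a minimal sketch, assuming a hypothetical function `concat_nonempty_lines`. Unlike the cells above, it also drops whitespace-only lines, and it joins all lines in one pass so that lines from different files can never run together.

```python
import tempfile
from pathlib import Path


def concat_nonempty_lines(sources, destination) -> int:
    """Write the non-empty lines of every source file to destination, one per line."""
    lines = []
    for source in sources:
        with open(source, encoding='utf-8') as f:
            lines.extend(l.rstrip('\n') for l in f if l.strip())
    with open(destination, mode='w', encoding='utf-8') as out:
        out.write('\n'.join(lines))
    return len(lines)


# Small self-contained demonstration with temporary files
with tempfile.TemporaryDirectory() as tmp:
    a, b, merged = (Path(tmp) / name for name in ('a.txt', 'b.txt', 'merged.txt'))
    a.write_text('one\ntwo', encoding='utf-8')        # no trailing newline
    b.write_text('three\n\nfour\n', encoding='utf-8')  # contains an empty line
    count = concat_nonempty_lines([a, b], merged)
    merged_lines = merged.read_text(encoding='utf-8').split('\n')

print(count, merged_lines)
```

Source and target files must be combined in the same order so that line `i` of the combined source still pairs with line `i` of the combined target.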

In [None]:
ANG2MOD_SRC_VOCAB_PATH = ANG2MOD_RUN_PATH / 'vocab.src'
ANG2MOD_TGT_VOCAB_PATH = ANG2MOD_RUN_PATH / 'vocab.tgt'

ang2mod_yaml = 'ang2mod.yaml'

ANG2MOD_MODEL_PATH = ANG2MOD_RUN_PATH / 'models'
ANG2MOD_MODEL_PREFIX = 'ang2mod'

In [None]:
config =  f'''# {ang2mod_yaml}
save_data: {ANG2MOD_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {ANG2MOD_SRC_VOCAB_PATH}
tgt_vocab: {ANG2MOD_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {ang2mod_hom_file_paths['training'][0]}
        path_tgt: {ang2mod_hom_file_paths['training'][1]}
        transforms: [filtertoolong]
        weight: 1
    corpus_2:
       path_src: {ang2mod_file_paths['training'][0]}
       path_tgt: {ang2mod_file_paths['training'][1]}
       transforms: [filtertoolong]
       weight: 1
    valid:
        path_src: {COMBINED_VALID_SRC}
        path_tgt: {COMBINED_VALID_TGT}
        transforms: [filtertoolong]

## silently ignore empty lines in data
skip_empty_level: silent

# Data Transformations
### Filter
src_seq_length: {MAX_SENTENCE_LENGTH}
tgt_seq_length: {MAX_SENTENCE_LENGTH}

### TRAINING ###
## Where the model will be saved
save_model: {ANG2MOD_MODEL_PATH / ANG2MOD_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
# early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {ANG2MOD_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 40
valid_batch_size: 40
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 512
word_vec_size: 128
dropout: 0.6
attn_dropout: 0.4
# global_attention: dot
'''

with open(CONFIG_PATH / ang2mod_yaml, "w+") as config_yaml:
    config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / ang2mod_yaml)
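
`build_and_train` is a helper defined earlier in this notebook. As a rough illustration only, and assuming it wraps the standard OpenNMT-py command-line entry points, the two stages it presumably runs (vocabulary building, then training) can be sketched as follows. The function name `onmt_commands` and the `n_sample` default are hypothetical and not taken from the notebook.

```python
from pathlib import Path


def onmt_commands(config_path, n_sample=-1):
    """Return the two OpenNMT-py CLI invocations for a YAML config.

    Hypothetical helper for illustration; it only builds the argument
    lists and does not run anything.
    """
    config = str(Path(config_path))
    # Stage 1: build the source/target vocabularies named in the config.
    # n_sample=-1 asks onmt_build_vocab to read the full corpus.
    build_vocab = ["onmt_build_vocab", "-config", config, "-n_sample", str(n_sample)]
    # Stage 2: train the model described by the same config.
    train = ["onmt_train", "-config", config]
    return build_vocab, train


build_cmd, train_cmd = onmt_commands(Path("config") / "ang2mod.yaml")
print(" ".join(build_cmd))
print(" ".join(train_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the corresponding stage.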

In [None]:
# retrieve the models
ang2mod_models = [ ANG2MOD_MODEL_PATH / f for f in listdir(ANG2MOD_MODEL_PATH) if f.startswith(ANG2MOD_MODEL_PREFIX)]

ANG2MOD_PREDICTIONS_PATH = ANG2MOD_RUN_PATH / 'predictions'
!mkdir -p "{ANG2MOD_PREDICTIONS_PATH}"

eval_metrics = ['sacrebleu', 'meteor']
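
`listdir` returns file names in arbitrary order, so the checkpoint list above is not guaranteed to be sorted by training step. A minimal sketch of ordering checkpoints by step, assuming the `<prefix>_step_<N>.pt` naming seen in the checkpoint names used in this notebook (the helper name `checkpoint_step` is hypothetical):

```python
import re
from pathlib import Path


def checkpoint_step(path):
    """Extract the training step from a name like 'ang2mod_step_3000.pt'."""
    match = re.search(r"_step_(\d+)\.pt$", Path(path).name)
    if match is None:
        raise ValueError(f"not a checkpoint file name: {path}")
    return int(match.group(1))


# Example file names only; real names come from the models directory.
example_models = [
    Path("models/ang2mod_step_10000.pt"),
    Path("models/ang2mod_step_2000.pt"),
    Path("models/ang2mod_step_1000.pt"),
]
sorted_models = sorted(example_models, key=checkpoint_step)
print([p.name for p in sorted_models])
```

Sorting on the parsed integer avoids the lexicographic trap where `step_10000` would sort before `step_2000`.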

In [None]:
ang2mod_scores = evaluate(ang2mod_models, 
                          COMBINED_TEST_SRC, 
                          COMBINED_TEST_TGT, 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          str(ANG2MOD_PREDICTIONS_PATH))

The best-performing model was the checkpoint after 3000 training steps (with early stopping), decoded with beam size 5:

    BLEU   = 16.9989
    METEOR = 0.3338
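
The configuration sets `valid_steps: 100` and `early_stopping: 10`, so training halts once ten consecutive validation runs (1000 training steps) pass without improving on the best value so far. The sketch below illustrates that patience rule on a single metric. It is a simplification for illustration, not OpenNMT-py's implementation, and `stopping_index` and the sample scores are made up.

```python
def stopping_index(scores, patience):
    """Return the index of the validation run at which training stops.

    `scores` holds one validation score per run (higher is better).
    Training stops after `patience` consecutive runs that fail to beat
    the best score seen so far; None means it never stops early.
    """
    best = float("-inf")
    runs_without_improvement = 0
    for index, score in enumerate(scores):
        if score > best:
            best = score
            runs_without_improvement = 0
        else:
            runs_without_improvement += 1
            if runs_without_improvement >= patience:
                return index
    return None


# Made-up validation accuracies, one per validation run.
toy_scores = [0.20, 0.35, 0.41, 0.40, 0.39, 0.41, 0.38]
print(stopping_index(toy_scores, patience=3))
```

With a patience of 3, the toy run stops at the third validation after the peak, because matching the best score does not count as an improvement.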

#### User Studies Predictions

Generate predictions for the sentences used in the user studies.

In [None]:
queries = ['ON angynne gesceop God heofonan & eorðan.', 
           'ða gesceop Adam naman his wife, Eua, ðæt is lif, for ðanðe heo is ealra libbendra modor.', 
           'God clypode ða Adam, & cwæð: Adam, hwær eart ðu.', 
           '& GOD ða gemunde Noes fare & ðæra nytena ðe him midwæron, & asende wind ofer eorðan, & ða wæteruwurdon gewanode.', 
           'ða geseah God ðæt seo eorðe wæs gewemmed, for ðan ðeælc flæsc gewemde his weg ofer eorðan.', 
           'comon to Noe in to ðam arce, swa swa God bebead.', 
           '& ða wæteru toeodan & wanodon of ðone teoðan monð, & onðam teoðan monðe æteowedon ðæra muntacnollas.', 
           '& God ða gefylde on ðone seofoðan dæg his weorc ðe heworhte. & he gereste hine on ðone seofoðan dæg fram eallumðam weorcum ðe he gefremode.', 
           '& seo eorðe forðteah growende wyrta & sæd berende be hyrecynne & treow wæstm wyrcende & gehwilcsæd hæbbende æfter his hiwe; God geseah ða ðæt hit godwæs.', 
           'God cwæð ða soðlice: Beo nu leoht on ðære heofenanfæstnysse, & todælan dæg & nihte, & beon to tacnum& to tidum & to dagum & to gearum.', 
           'Mid ðam ðe he wolde þæt weorc begynnan, ða clypode Godesengel ardlice of heofonum, Abraham; Heandwyrde sona.', 
           '& hys swurd ateah þæt he hyne geoffrode on þa ealdanwisan.'
           ]

In [None]:
model = ANG2MOD_MODEL_PATH / f'{ANG2MOD_MODEL_PREFIX}_step_3000.pt'
OLD_TEXT_TOK = DATA_PATH / 'user-studies.ang'
OLD_TEST_PRED = ANG2MOD_PREDICTIONS_PATH / 'old-text-pred.txt'

# Tokenize the user-study queries and write them one per line
with open(OLD_TEXT_TOK, mode='w', encoding='utf-8') as f:
    f.write('\n'.join([" ".join(tokenizer(l, 'enm', **token_kwargs)) for l in queries]))

!onmt_translate -model "{model}" -src "{OLD_TEXT_TOK}" -output "{OLD_TEST_PRED}" -min_length 1 -max_length "{MAX_SENTENCE_LENGTH}" -beam_size 5 -gpu 0 

In [None]:
tokenize = pyonmttok.Tokenizer("aggressive", **token_kwargs)
hypotheses = get_detokenized_file(OLD_TEST_PRED, tokenize)

for hyp in hypotheses:
    print(hyp)

### Modern to Old

In [None]:
from pathlib import Path

MOD2ANG_TRANSLATE_NAME = 'mod2ang'
!mkdir -p '{MOD2ANG_TRANSLATE_NAME}'

# PATH VARIABLES
MOD2ANG_TRANSLATE_PATH = Path(MOD2ANG_TRANSLATE_NAME)
MOD2ANG_RUN_PATH = MOD2ANG_TRANSLATE_PATH / 'run'
!mkdir -p "{MOD2ANG_RUN_PATH}"

## Dataset Variables
# For the Homilies Dataset
MOD2ANG_HOM_SOURCE_VER = 't_mod'
MOD2ANG_HOM_SRC_LANG_CODE = 'eng'
MOD2ANG_HOM_TARGET_VER = 't_old'
MOD2ANG_HOM_TGT_LANG_CODE = 'ang'

# For the Bible Dataset
MOD2ANG_SOURCE_VER = 't_kjv'
MOD2ANG_SRC_LANG_CODE = 'eng'
MOD2ANG_TARGET_VER = 't_alf_wsg'
MOD2ANG_TGT_LANG_CODE = 'ang'

MAX_SENTENCE_LENGTH = 60

# Dataset Paths
DATA_PATH = Path('data/preprocessed')
!mkdir -p "{DATA_PATH}"

In [None]:
MOD2ANG_HOM_SRC_EXT = MOD2ANG_HOM_SOURCE_VER[2:]
MOD2ANG_HOM_TGT_EXT = MOD2ANG_HOM_TARGET_VER[2:]

mod2ang_hom_file_paths = {
    'training' : (DATA_PATH / f'hom-train.{MOD2ANG_HOM_SRC_EXT}', DATA_PATH / f'hom-train.{MOD2ANG_HOM_TGT_EXT}'),
    'validation' : (DATA_PATH / f'hom-valid.{MOD2ANG_HOM_SRC_EXT}', DATA_PATH / f'hom-valid.{MOD2ANG_HOM_TGT_EXT}'),
    'test' : (DATA_PATH / f'hom-test.{MOD2ANG_HOM_SRC_EXT}', DATA_PATH / f'hom-test.{MOD2ANG_HOM_TGT_EXT}')
    }

MOD2ANG_SRC_EXT = MOD2ANG_SOURCE_VER[2:]
MOD2ANG_TGT_EXT = MOD2ANG_TARGET_VER[2:]

mod2ang_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{MOD2ANG_SRC_EXT}', DATA_PATH / f'bible-train.{MOD2ANG_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{MOD2ANG_SRC_EXT}', DATA_PATH / f'bible-valid.{MOD2ANG_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{MOD2ANG_SRC_EXT}', DATA_PATH / f'bible-test.{MOD2ANG_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }
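
The file extensions above are derived by slicing off the two-character `t_` prefix of each version name. A small sketch of the resulting names, using the version strings defined earlier in this section (`strip_version_prefix` is a hypothetical helper, shown only to make the slicing explicit):

```python
from pathlib import Path


def strip_version_prefix(version):
    """Drop the leading 't_' from a version name such as 't_kjv'."""
    if not version.startswith("t_"):
        raise ValueError(f"unexpected version name: {version}")
    return version[2:]


data_path = Path("data/preprocessed")
src_ext = strip_version_prefix("t_kjv")
tgt_ext = strip_version_prefix("t_alf_wsg")
train_src = data_path / f"bible-train.{src_ext}"
train_tgt = data_path / f"bible-train.{tgt_ext}"
print(train_src.name, train_tgt.name)
```

Unlike a bare `[2:]` slice, the explicit check fails loudly if a version name ever lacks the expected prefix.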

In [None]:
# Paths for the combined (Homilies + Bible) validation and test sets
COMBINED_VALID_SRC = DATA_PATH / f'combined-valid.{MOD2ANG_HOM_SRC_EXT}.{MOD2ANG_SRC_EXT}'
COMBINED_VALID_TGT = DATA_PATH / f'combined-valid.{MOD2ANG_HOM_TGT_EXT}.{MOD2ANG_TGT_EXT}'
COMBINED_TEST_SRC = DATA_PATH / f'combined-test.{MOD2ANG_HOM_SRC_EXT}.{MOD2ANG_SRC_EXT}'
COMBINED_TEST_TGT = DATA_PATH / f'combined-test.{MOD2ANG_HOM_TGT_EXT}.{MOD2ANG_TGT_EXT}'
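
The combined files are the Homilies and Bible splits concatenated with blank lines dropped, as done for the `ang2mod` files earlier in the notebook. A minimal sketch of such a step, assuming one sentence per line; `concat_non_empty` is a hypothetical helper, and the demo writes to a temporary directory rather than to `DATA_PATH`:

```python
import tempfile
from pathlib import Path


def concat_non_empty(sources, destination):
    """Write the non-blank lines of every source file to destination."""
    lines = []
    for source in sources:
        with open(source, encoding="utf-8") as handle:
            # Whitespace-only lines are treated as blank and skipped.
            lines.extend(l.rstrip("\n") for l in handle if l.strip())
    with open(destination, mode="w", encoding="utf-8") as out:
        out.write("\n".join(lines))
    return len(lines)


with tempfile.TemporaryDirectory() as tmp:
    hom = Path(tmp) / "hom-valid.txt"
    bible = Path(tmp) / "bible-valid.txt"
    combined = Path(tmp) / "combined-valid.txt"
    hom.write_text("first homily line\n\nsecond homily line\n", encoding="utf-8")
    bible.write_text("first bible line\n", encoding="utf-8")
    count = concat_non_empty([hom, bible], combined)
    combined_text = combined.read_text(encoding="utf-8")

print(count)
print(combined_text)
```

Source and target files need to be combined in the same order so that sentence pairs stay aligned line by line.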

In [None]:
MOD2ANG_SRC_VOCAB_PATH = MOD2ANG_RUN_PATH / 'vocab.src'
MOD2ANG_TGT_VOCAB_PATH = MOD2ANG_RUN_PATH / 'vocab.tgt'

mod2ang_yaml = 'mod2ang.yaml'

MOD2ANG_MODEL_PATH = MOD2ANG_RUN_PATH / 'models'
MOD2ANG_MODEL_PREFIX = 'mod2ang'

In [None]:
config =  f'''# {mod2ang_yaml}
save_data: {MOD2ANG_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {MOD2ANG_SRC_VOCAB_PATH}
tgt_vocab: {MOD2ANG_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {mod2ang_hom_file_paths['training'][0]}
        path_tgt: {mod2ang_hom_file_paths['training'][1]}
        transforms: [filtertoolong]
        weight: 1
    corpus_2:
        path_src: {mod2ang_file_paths['training'][0]}
        path_tgt: {mod2ang_file_paths['training'][1]}
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: {COMBINED_VALID_SRC}
        path_tgt: {COMBINED_VALID_TGT}
        transforms: [filtertoolong]

## silently ignore empty lines in data
skip_empty_level: silent

# Data Transformations
### Filter
src_seq_length: {MAX_SENTENCE_LENGTH}
tgt_seq_length: {MAX_SENTENCE_LENGTH}

### TRAINING ###
## Where the model will be saved
save_model: {MOD2ANG_MODEL_PATH / MOD2ANG_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
# early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {MOD2ANG_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 40
valid_batch_size: 40
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 512
word_vec_size: 128
dropout: 0.6
attn_dropout: 0.4
'''

with open(CONFIG_PATH / mod2ang_yaml, "w+") as config_yaml:
    config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / mod2ang_yaml)

In [None]:
# retrieve the models
mod2ang_models = [ MOD2ANG_MODEL_PATH / f for f in listdir(MOD2ANG_MODEL_PATH) if f.startswith(MOD2ANG_MODEL_PREFIX)]

MOD2ANG_PREDICTIONS_PATH = MOD2ANG_RUN_PATH / 'predictions'
!mkdir -p "{MOD2ANG_PREDICTIONS_PATH}"

eval_metrics = ['sacrebleu', 'meteor']

In [None]:
mod2ang_scores = evaluate(mod2ang_models, 
                          COMBINED_TEST_SRC, 
                          COMBINED_TEST_TGT, 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          str(MOD2ANG_PREDICTIONS_PATH))
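
The return format of `evaluate` is defined with the helper earlier in the notebook and is not shown here. Assuming, purely for illustration, that per-checkpoint results are arranged as a mapping from checkpoint name to a dictionary of metric values, the best checkpoint by BLEU could be picked like this (the mapping layout, `best_checkpoint`, and the numbers are all made up):

```python
def best_checkpoint(scores, metric="sacrebleu"):
    """Return (name, metrics) of the entry with the highest metric value."""
    if not scores:
        raise ValueError("no scores to compare")
    name = max(scores, key=lambda key: scores[key][metric])
    return name, scores[name]


# Placeholder values, not results from this notebook.
toy_scores = {
    "model_step_1000.pt": {"sacrebleu": 5.0, "meteor": 0.10},
    "model_step_2000.pt": {"sacrebleu": 9.5, "meteor": 0.20},
    "model_step_3000.pt": {"sacrebleu": 8.0, "meteor": 0.25},
}
best_name, best_metrics = best_checkpoint(toy_scores)
print(best_name, best_metrics["sacrebleu"])
```

Note that the two metrics can disagree on the winner, as they do in the placeholder data, so the selection metric should be stated alongside any reported result.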

The best-performing model was the checkpoint after 2900 training steps (with early stopping), decoded with beam size 5:

    BLEU   = 10.9438
    METEOR = 0.2551

## Modern to Modern

### KJV to BBE


In [None]:
from pathlib import Path

KJV2BBE_TRANSLATE_NAME = 'kjv2bbe'
!mkdir -p '{KJV2BBE_TRANSLATE_NAME}'

# PATH VARIABLES
KJV2BBE_TRANSLATE_PATH = Path(KJV2BBE_TRANSLATE_NAME)
KJV2BBE_RUN_PATH = KJV2BBE_TRANSLATE_PATH / 'run'
!mkdir -p "{KJV2BBE_RUN_PATH}"

# Dataset Variables
KJV2BBE_SOURCE_VER = 't_kjv'
KJV2BBE_SRC_LANG_CODE = 'eng'
KJV2BBE_TARGET_VER = 't_bbe'
KJV2BBE_TGT_LANG_CODE = 'eng'

MAX_SENTENCE_LENGTH = 60

# Dataset Paths
DATA_PATH = Path('data/preprocessed')
!mkdir -p "{DATA_PATH}"

set_deterministic()

In [None]:
# Generate splits and write to files
versions = get_bible_versions_by_file_name([KJV2BBE_SOURCE_VER, KJV2BBE_TARGET_VER])

datasets = create_datasets(versions, .82, 
                preprocess_operations = [preprocess_filter_num_words(MAX_SENTENCE_LENGTH),
                                         preprocess_expand_contractions(),
                                         preprocess_filter_num_sentences(),
                ]);

In [None]:
KJV2BBE_SRC_EXT = KJV2BBE_SOURCE_VER[2:]
KJV2BBE_TGT_EXT = KJV2BBE_TARGET_VER[2:]


kjv2bbe_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{KJV2BBE_SRC_EXT}', DATA_PATH / f'bible-train.{KJV2BBE_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{KJV2BBE_SRC_EXT}', DATA_PATH / f'bible-valid.{KJV2BBE_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{KJV2BBE_SRC_EXT}', DATA_PATH / f'bible-test.{KJV2BBE_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }

In [None]:
write_tokenized_dataset(datasets, KJV2BBE_SOURCE_VER, KJV2BBE_SRC_LANG_CODE, KJV2BBE_TARGET_VER, KJV2BBE_TGT_LANG_CODE, kjv2bbe_file_paths, token_kwargs)

In [None]:
KJV2BBE_SRC_VOCAB_PATH = KJV2BBE_RUN_PATH / 'vocab.src'
KJV2BBE_TGT_VOCAB_PATH = KJV2BBE_RUN_PATH / 'vocab.tgt'

kjv2bbe_yaml = 'kjv2bbe.yaml'

KJV2BBE_MODEL_PATH = KJV2BBE_RUN_PATH / 'models'
KJV2BBE_MODEL_PREFIX = 'kjv2bbe'

In [None]:
config =  f'''# {kjv2bbe_yaml}
save_data: {KJV2BBE_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {KJV2BBE_SRC_VOCAB_PATH}
tgt_vocab: {KJV2BBE_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {kjv2bbe_file_paths['training'][0]}
        path_tgt: {kjv2bbe_file_paths['training'][1]}
        transforms: []
        weight: 1
    valid:
        path_src: {kjv2bbe_file_paths['validation'][0]}
        path_tgt: {kjv2bbe_file_paths['validation'][1]}
        transforms: []

## silently ignore empty lines in data
skip_empty_level: silent

### TRAINING ###
## Where the model will be saved
save_model: {KJV2BBE_MODEL_PATH / KJV2BBE_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
# early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {KJV2BBE_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 64
valid_batch_size: 64
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 512
word_vec_size: 256
dropout: 0.5
attn_dropout: 0.3
'''

with open(CONFIG_PATH / kjv2bbe_yaml, "w+") as config_yaml:
    config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / kjv2bbe_yaml)

In [None]:
# retrieve the models
kjv2bbe_models = [ KJV2BBE_MODEL_PATH / f for f in listdir(KJV2BBE_MODEL_PATH) if f.startswith(KJV2BBE_MODEL_PREFIX)]

KJV2BBE_PREDICTIONS_PATH = KJV2BBE_RUN_PATH / 'predictions'
!mkdir -p "{KJV2BBE_PREDICTIONS_PATH}"

eval_metrics = ['sacrebleu', 'meteor']

kjv2bbe_scores = evaluate(kjv2bbe_models, 
                          kjv2bbe_file_paths['test'][0], 
                          kjv2bbe_file_paths['test'][1], 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          str(KJV2BBE_PREDICTIONS_PATH))

The best-performing model was the checkpoint after 4000 training steps (with early stopping), decoded with beam size 5:

    BLEU   = 36.048
    METEOR = 0.5451

#### User Studies Predictions

Generate predictions for the sentences used in the user studies.

In [None]:
queries = ['In the beginning God created the heaven and the earth.', 
           "And Adam called his wife's name Eve; because she was the mother of all living.", 
           'And the LORD God called unto Adam, and said unto him, Where art thou?', 
           'And God remembered Noah, and every living thing, and all the cattle that was with him in the ark: and God made a wind to pass over the earth, and the waters assuaged;', 
           'And God looked upon the earth, and, behold, it was corrupt; for all flesh had corrupted his way upon the earth.', 
           'There went in two and two unto Noah into the ark, the male and the female, as God had commanded Noah.', 
           'And the waters decreased continually until the tenth month: in the tenth month, on the first day of the month, were the tops of the mountains seen.', 
           'And on the seventh day God ended his work which he had made; and he rested on the seventh day from all his work which he had made.', 
           'And the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good.', 
           'And God did so that night: for it was dry upon the fleece only, and there was dew on all the ground.', 
           'Then said the trees unto the vine, Come thou, and reign over us.', 
           'Wherefore I have not sinned against thee, but thou doest me wrong to war against me: the LORD the Judge be judge this day between the children of Israel and the children of Ammon.', 
           'The labour of the foolish wearieth every one of them, because he knoweth not how to go to the city.', 
           'And if he trespass against thee seven times in a day, and seven times in a day turn again to thee, saying, I repent; thou shalt forgive him.',
           'You only have I known of all the families of the earth: therefore I will punish you for all your iniquities.', 
           'And he is the head of the body, the church: who is the beginning, the firstborn from the dead; that in all things he might have the preeminence.', 
           'And the four beasts said, Amen. And the four and twenty elders fell down and worshipped him that liveth for ever and ever.', 
           'He hath also broken my teeth with gravel stones, he hath covered me with ashes.', 
           'If Satan also be divided against himself, how shall his kingdom stand? because ye say that I cast out devils through Beelzebub.', 
           'For men shall be lovers of their own selves, covetous, boasters, proud, blasphemers, disobedient to parents, unthankful, unholy,', 
           'And God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years:', 
           'And Isaac trembled very exceedingly, and said, Who? where is he that hath taken venison, and brought it me, and I have eaten of all before thou camest, and have blessed him? yea, and he shall be blessed.', 
           'And the angel of the LORD called unto him out of heaven, and said, Abraham, Abraham: and he said, Here am I.', 
           'Thus they made a covenant at Beersheba: then Abimelech rose up, and Phichol the chief captain of his host, and they returned into the land of the Philistines.', 
           'And Abraham stretched forth his hand, and took the knife to slay his son.', 
           'So Abraham returned unto his young men, and they rose up and went together to Beersheba; and Abraham dwelt at Beersheba.', 
           'And you, being dead in your sins and the uncircumcision of your flesh, hath he quickened together with him, having forgiven you all trespasses;', 
           'Wives, submit yourselves unto your own husbands, as it is fit in the Lord.', 
           'But he that doeth wrong shall receive for the wrong which he hath done: and there is no respect of persons.', 
           "Aristarchus my fellowprisoner saluteth you, and Marcus, sister's son to Barnabas, (touching whom ye received commandments: if he come unto you, receive him;)", 
           'Now unto God and our Father be glory for ever and ever. Amen.'
           ]

In [None]:
model = KJV2BBE_MODEL_PATH / f'{KJV2BBE_MODEL_PREFIX}_step_4600.pt'
BBE_TEST_TOK = DATA_PATH / 'user-studies.eng'
BBE_TEST_PRED = KJV2BBE_PREDICTIONS_PATH / 'bbe-text-pred.txt'

# Tokenize the user-study queries and write them one per line
with open(BBE_TEST_TOK, mode='w', encoding='utf-8') as f:
    f.write('\n'.join([" ".join(tokenizer(l, 'enm', **token_kwargs)) for l in queries]))

!onmt_translate -model "{model}" -src "{BBE_TEST_TOK}" -output "{BBE_TEST_PRED}" -min_length 1 -max_length "{MAX_SENTENCE_LENGTH}" -beam_size 5 -gpu 0 

In [None]:
tokenize = pyonmttok.Tokenizer("aggressive", **token_kwargs)
hypotheses = get_detokenized_file(BBE_TEST_PRED, tokenize)

for hyp in hypotheses:
    print(hyp)

### BBE to KJV

In [None]:
from pathlib import Path

BBE2KJV_TRANSLATE_NAME = 'bbe2kjv'
!mkdir -p '{BBE2KJV_TRANSLATE_NAME}'

# PATH VARIABLES
BBE2KJV_TRANSLATE_PATH = Path(BBE2KJV_TRANSLATE_NAME)
BBE2KJV_RUN_PATH = BBE2KJV_TRANSLATE_PATH / 'run'
!mkdir -p "{BBE2KJV_RUN_PATH}"

# Dataset Variables
BBE2KJV_SOURCE_VER = 't_bbe'
BBE2KJV_SRC_LANG_CODE = 'eng'
BBE2KJV_TARGET_VER = 't_kjv'
BBE2KJV_TGT_LANG_CODE = 'eng'

MAX_SENTENCE_LENGTH = 60

# Dataset Paths
DATA_PATH = Path('data/preprocessed')
!mkdir -p "{DATA_PATH}"

In [None]:
BBE2KJV_SRC_EXT = BBE2KJV_SOURCE_VER[2:]
BBE2KJV_TGT_EXT = BBE2KJV_TARGET_VER[2:]

bbe2kjv_file_paths = {
    'training' : (DATA_PATH / f'bible-train.{BBE2KJV_SRC_EXT}', DATA_PATH / f'bible-train.{BBE2KJV_TGT_EXT}'),
    'validation' : (DATA_PATH / f'bible-valid.{BBE2KJV_SRC_EXT}', DATA_PATH / f'bible-valid.{BBE2KJV_TGT_EXT}'),
    'test' : (DATA_PATH / f'bible-test.{BBE2KJV_SRC_EXT}', DATA_PATH / f'bible-test.{BBE2KJV_TGT_EXT}')
    }

token_kwargs = {
    'case_markup': True
    }

In [None]:
BBE2KJV_SRC_VOCAB_PATH = BBE2KJV_RUN_PATH / 'vocab.src'
BBE2KJV_TGT_VOCAB_PATH = BBE2KJV_RUN_PATH / 'vocab.tgt'

bbe2kjv_yaml = 'bbe2kjv.yaml'

BBE2KJV_MODEL_PATH = BBE2KJV_RUN_PATH / 'models'
BBE2KJV_MODEL_PREFIX = 'bbe2kjv'

In [None]:
config =  f'''# {bbe2kjv_yaml}
save_data: {BBE2KJV_RUN_PATH}

### DATA PREPROCESSING ###
## Where the vocab(s) will be written
src_vocab: {BBE2KJV_SRC_VOCAB_PATH}
tgt_vocab: {BBE2KJV_TGT_VOCAB_PATH}

# Corpus opts:
data:
    corpus_1:
        path_src: {bbe2kjv_file_paths['training'][0]}
        path_tgt: {bbe2kjv_file_paths['training'][1]}
        transforms: []
        weight: 1
    valid:
        path_src: {bbe2kjv_file_paths['validation'][0]}
        path_tgt: {bbe2kjv_file_paths['validation'][1]}
        transforms: []

## silently ignore empty lines in data
skip_empty_level: silent

### TRAINING ###
## Where the model will be saved
save_model: {BBE2KJV_MODEL_PATH / BBE2KJV_MODEL_PREFIX}
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 100
early_stopping: 10
# early_stopping_criteria: accuracy
tensorboard: True
tensorboard_log_dir: {BBE2KJV_RUN_PATH / 'logs'}

# Batching
world_size: 1
gpu_ranks: [0]
batch_size: 64
valid_batch_size: 64
batch_size_multiple: 1

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
bidir_edges: True
enc_layers: 2
dec_layers: 2
rnn_size: 512
word_vec_size: 256
dropout: 0.5
attn_dropout: 0.3
'''

with open(CONFIG_PATH / bbe2kjv_yaml, "w+") as config_yaml:
    config_yaml.write(config)

In [None]:
build_and_train(CONFIG_PATH / bbe2kjv_yaml)

In [None]:
# retrieve the models
bbe2kjv_models = [ BBE2KJV_MODEL_PATH / f for f in listdir(BBE2KJV_MODEL_PATH) if f.startswith(BBE2KJV_MODEL_PREFIX)]

BBE2KJV_PREDICTIONS_PATH = BBE2KJV_RUN_PATH / 'predictions'
!mkdir -p "{BBE2KJV_PREDICTIONS_PATH}"

eval_metrics = ['sacrebleu', 'meteor']

bbe2kjv_scores = evaluate(bbe2kjv_models, 
                          bbe2kjv_file_paths['test'][0], 
                          bbe2kjv_file_paths['test'][1], 
                          eval_metrics,
                          token_kwargs,
                          MAX_SENTENCE_LENGTH, 
                          5, 
                          str(BBE2KJV_PREDICTIONS_PATH))

The best-performing model was the checkpoint after 4000 training steps (with early stopping), decoded with beam size 5:

    BLEU   = 31.2598
    METEOR = 0.4973

#### User Studies Predictions

Generate predictions for the sentences used in the user studies.

In [None]:
queries = ['At the first God made the heaven and the earth', 
           'And the man gave his wife the name of Eve because she was the mother of all who have life.', 
           'And the voice of the Lord God came to the man, saying, Where are you?', 
           'And God kept Noah in mind, and all the living things and the cattle which were with him in the ark: and God sent a wind over the earth, and the waters went down.', 
           'And God, looking on the earth, saw that it was evil: for the way of all flesh had become evil on the earth.', 
           'In twos, male and female, they went into the ark with Noah, as God had said.', 
           'And still the waters went on falling, till on the first day of the tenth month the tops of the mountains were seen.', 
           'And on the seventh day God came to the end of all his work; and on the seventh day he took his rest from all the work which he had done.', 
           'And grass came up on the earth, and every plant producing seed of its sort, and every tree producing fruit, in which is its seed, of its sort: and God saw that it was good.', 
           'And that night God did so; for the wool was dry, and there was dew on all the earth round it.', 
           'Then the trees said to the vine, You come and be king over us.', 
           'So I have done no wrong against you, but you are doing wrong to me in fighting against me: may the Lord, who is Judge this day, be judge between the children of Israel and the children of Ammon.', 
           'The work of the foolish will be a weariness to him, because he has no knowledge of the way to the town.', 
           'And if he does you wrong seven times in a day, and seven times comes to you and says, I have regret for what I have done; let him have forgiveness.', 
           'You only of all the families of the earth have I taken care of: for this reason I will send punishment on you for all your sins.', 
           'And he is the head of the body, the church: the starting point of all things, the first to come again from the dead; so that in all things he might have the chief place.', 
           'And the four beasts said, So be it. And the rulers went down on their faces and gave worship.', 
           'By him my teeth have been broken with crushed stones, and I am bent low in the dust.', 
           'If, then, Satan is at war with himself, how will he keep his kingdom? because you say that I send evil spirits out of men by the help of Beelzebul.', 
           'For men will be lovers of self, lovers of money, uplifted in pride, given to bitter words, going against the authority of their fathers, never giving praise, having no religion,', 
           'And God said, Let there be lights in the arch of heaven, for a division between the day and the night, and let them be for signs, and for marking the changes of the year, and for days and for years:', 
           'And in great fear Isaac said, Who then is he who got meat and put it before me, and I took it all before you came, and gave him a blessing, and his it will be?', 
           'But the voice of the angel of the Lord came from heaven, saying, Abraham, Abraham: and he said, Here am I.', 
           'So they made an agreement at Beer-sheba, and Abimelech and Phicol, the captain of his army, went back to the land of the Philistines.', 
           'And stretching out his hand, Abraham took the knife to put his son to death.', 
           'Then Abraham went back to his young men and they went together to Beer-sheba, the place where Abraham was living.', 
           'And you, being dead through your sins and the evil condition of your flesh, to you, I say, he gave life together with him, and forgiveness of all our sins;', 
           'Wives, be under the authority of your husbands, as is right in the Lord.', 
           "For the wrongdoer will have punishment for the wrong he has done, without respect for any man's position.", 
           'Aristarchus, my brother-prisoner, sends his love to you, and Mark, a relation of Barnabas (about whom you have been given orders: if he comes to you, be kind to him),', 
           'Now to God our Father be glory for ever and ever. So be it.'
           ]

In [None]:
model = BBE2KJV_MODEL_PATH / f'{BBE2KJV_MODEL_PREFIX}_step_4600.pt'
KJV_TEST_TOK = DATA_PATH / 'user-studies.kjv'
KJV_TEST_PRED = BBE2KJV_PREDICTIONS_PATH / 'kjv-text-pred.txt'

# Tokenize the user-study queries and write them one per line
with open(KJV_TEST_TOK, mode='w', encoding='utf-8') as f:
    f.write('\n'.join([" ".join(tokenizer(l, 'enm', **token_kwargs)) for l in queries]))

!onmt_translate -model "{model}" -src "{KJV_TEST_TOK}" -output "{KJV_TEST_PRED}" -min_length 1 -max_length "{MAX_SENTENCE_LENGTH}" -beam_size 5 -gpu 0 

In [None]:
tokenize = pyonmttok.Tokenizer("aggressive", **token_kwargs)
hypotheses = get_detokenized_file(KJV_TEST_PRED, tokenize)

for hyp in hypotheses:
    print(hyp)