<a href="https://colab.research.google.com/github/mayaschwarz/cs175--lfric-to-Albert/blob/main/HuggingfaceBartTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Training and Evaluating Notebook

This notebook was designed to be run directly from Google Colab; your mileage may vary if you run it elsewhere.

## Setup

Skip this step if running locally.


This code will use a model and some helper files that are stored in a GitHub repository. The repository will be cloned temporarily.

In [1]:
!git clone https://github.com/mayaschwarz/cs175--lfric-to-Albert.git git/
%cd git/

Cloning into 'git'...
remote: Enumerating objects: 180, done.
remote: Counting objects: 100% (180/180), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 425 (delta 91), reused 128 (delta 55), pack-reused 245
Receiving objects: 100% (425/425), 58.03 MiB | 27.92 MiB/s, done.
Resolving deltas: 100% (176/176), done.
/content/git


In [2]:
# Install once, then restart the runtime
!pip install -r requirements.txt > /dev/null 2> /dev/null

#### IMPORTANT

Be sure to restart (**not** factory reset) the runtime after running the above cell. Then continue running the code below. To do this, click on Runtime -> Restart runtime. You do not need to do the `pip install`s again.

In [1]:
%cd git/

/content/git


Import the necessary modules.

In [2]:
from datasets import load_dataset
import datasets as metric_datasets
from transformers import (
    BartForConditionalGeneration, BartTokenizer,
    Seq2SeqTrainingArguments, Seq2SeqTrainer
)

import torch
meteor_metric = metric_datasets.load_metric('meteor')

from src.data_manager import *


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Preparing data

First, pick the source and target Bible versions; the model will learn to translate verses from the source version into the target version. Any of the 7 Modern English versions in our corpora will work: `t_asv`, `t_bbe`, `t_dby`, `t_kjv`, `t_wbt`, `t_web`, and `t_ylt`.

Then, decide how you wish to normalize your text.

Finally, decide what your maximum number of words should be per verse.
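
As a rough illustration of what these three choices mean, the sketch below applies them to a single verse. The helper names `normalize` and `within_word_limit` are hypothetical; they are not the repository's `preprocess_*` functions, whose exact behavior may differ (for example in how words are counted or which characters count as punctuation).

```python
import string

def normalize(text: str, lowercase: bool, no_punctuation: bool) -> str:
    """Optionally lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if no_punctuation:
        text = text.translate(str.maketrans('', '', string.punctuation))
    return text

def within_word_limit(text: str, max_num_words: int) -> bool:
    """True if the verse has at most max_num_words whitespace-separated words."""
    return len(text.split()) <= max_num_words

verse = 'Wives, submit yourselves unto your own husbands, as it is fit in the Lord.'
print(normalize(verse, lowercase=True, no_punctuation=True))
print(within_word_limit(verse, 40))
```

With both flags set to `False`, as in the cell below, the text passes through unchanged and only the word-count filter has an effect.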

In [3]:
source_version, target_version = 't_kjv', 't_bbe'
lowercase_text = False
no_punctuation = False
MAX_NUM_WORDS = 40

In [4]:
# Set up the datasets
versions = get_bible_versions_by_file_name([source_version, target_version])

preprocess_operations = [preprocess_filter_num_words(MAX_NUM_WORDS)]

if lowercase_text:
    preprocess_operations.append(preprocess_lowercase())

if no_punctuation:
    preprocess_operations.append(preprocess_remove_punctuation(preserve_periods = False))

datasets = create_datasets(versions, 0.85, preprocess_operations = preprocess_operations, write_files = False)

Finding shared verses between 2 versions...        done in 0.363 seconds
Run preprocess operations...                       done in 0.097 seconds
Separate test verses...                            done in 0.014 seconds
Separate validation verses...                      done in 0.029 seconds
Zip together verses (shuffle = True)...            done in 0.045 seconds

# verses before preprocessing:  31,048
# verses after  preprocessing:  26,468 (85%)


# training verses:    19,071 (72%)
# validation verses:   3,366 (13%)
# test verses:         4,031 (15%)
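
The percentages in this summary can be re-derived from the counts. The short check below copies the numbers from the output above; the observation that validation is about 15% of the non-test verses is an inference from those numbers, not something confirmed from the `create_datasets` source.

```python
# Counts copied from the dataset summary printed above.
before, after = 31_048, 26_468
splits = {'training': 19_071, 'validation': 3_366, 'test': 4_031}

# The three splits should partition the verses that survived preprocessing.
assert sum(splits.values()) == after

print(f'kept after preprocessing: {after / before:.1%}')
for name, count in splits.items():
    print(f'{name:<10} {count / after:.1%}')

# Validation as a share of the verses that are not held out for testing.
non_test = splits['training'] + splits['validation']
print(f"validation / (training + validation): {splits['validation'] / non_test:.1%}")
```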


In [5]:
def zip_data(dataset: dict) -> [dict]:
    """
    Returns a zipped list containing both the source and target versions for each verse in the dataset.

    Arguments:
        dataset: {dict} -- a single dataset returned by create_datasets or load_datasets

    Returns:
        [
            {
                't_bbe': 'and pilate gave his decision for their desire to be put into effect',
                't_kjv': 'and pilate gave sentence that it should be as they required'
            },
            {
                't_bbe': 'for these are the days of punishment in which all the things in the writings will be put into effect',
                't_kjv': 'for these be the days of vengeance that all things which are written may be fulfilled'
            },
            ...
        ]
    """
    zipped_data = list()
    for source_line, target_line in zip(dataset[source_version], dataset[target_version]):
        zipped_data.append({
            source_version: source_line,
            target_version: target_line,
        })

    return zipped_data

In [6]:
training_data = zip_data(datasets['training'])
validation_data = zip_data(datasets['validation'])
test_data = zip_data(datasets['test'])

In [7]:
# Take a look at some of the data to see if it looks as expected
test_data[:3]

[{'t_bbe': 'Now the Midianites and the Amalekites and all the people of the east were covering the valley like locusts; and their camels were like the sand by the seaside, without number.',
  't_kjv': 'And the Midianites and the Amalekites and all the children of the east lay along in the valley like grasshoppers for multitude; and their camels were without number, as the sand by the sea side for multitude.'},
 {'t_bbe': "And Eliphaz, the son of Esau, had connection with a woman named Timna, who gave birth to Amalek: all these were the children of Esau's wife Adah.",
  't_kjv': "And Timna was concubine to Eliphaz Esau's son; and she bare to Eliphaz Amalek: these were the sons of Adah Esau's wife."},
 {'t_bbe': 'Wives, be under the authority of your husbands, as is right in the Lord.',
  't_kjv': 'Wives, submit yourselves unto your own husbands, as it is fit in the Lord.'}]

In [8]:
def data_collator(features: list) -> 'batch':
    """
    Creates a sequence-to-sequence training batch, as used by HuggingFace's Seq2SeqTrainer.

    Relies on the global `tokenizer`, which must be loaded before training starts.
    """
    labels = [f[target_version] for f in features]
    inputs = [f[source_version] for f in features]

    batch = tokenizer.prepare_seq2seq_batch(
        src_texts = inputs,
        src_lang = 'en_XX',
        tgt_lang = 'en_XX',
        tgt_texts = labels,
        max_length = MAX_NUM_WORDS,
        max_target_length = MAX_NUM_WORDS + 5
    )

    for k in batch:
        batch[k] = torch.tensor(batch[k])

    return batch
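
To make the shape of such a batch concrete, the sketch below truncates and pads already-tokenized sequences to a common length and builds the matching attention mask. It is a plain-Python illustration, not what `prepare_seq2seq_batch` does internally; the function name `pad_batch`, the token ids, and `pad_id = 1` are made-up values for the example.

```python
def pad_batch(sequences: list, pad_id: int, max_length: int) -> dict:
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[0, 713, 16, 2], [0, 713, 2]], pad_id=1, max_length=40)
print(batch['input_ids'])
print(batch['attention_mask'])
```

Rectangular lists like these are what allow each entry to be converted with `torch.tensor`. Note also that a tokenizer's `max_length` counts tokens rather than words, so a verse that passes the `MAX_NUM_WORDS` filter may still be truncated if it splits into more subword tokens than that.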

In [9]:
# Load a pre-trained sequence-to-sequence model. This model will be fine-tuned.
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')


In [10]:
# Load a pre-trained sequence-to-sequence tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')


## Initiating model and trainer for training

In [11]:
# Define the training arguments
args = Seq2SeqTrainingArguments(output_dir = 'bible-bart',
    do_train = True,
    do_eval = True,
    load_best_model_at_end = True,
    evaluation_strategy = 'epoch',
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    learning_rate = 1e-5,
    num_train_epochs = 3,
    logging_dir = '/logs',
)

In [12]:
# Define the trainer using 🤗 Transformers
trainer = Seq2SeqTrainer(model = model,
    args = args,
    data_collator = data_collator,
    train_dataset = training_data,
    eval_dataset = validation_data,
)

## Training time

In [13]:
# Fine-tuning can take a long time; the run recorded below took about 18 minutes.
trainer.train()

| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 0.9665 | 0.825653 | 11.8273 | 284.595 |
| 2 | 0.8179 | 0.771257 | 11.788 | 285.544 |
| 3 | 0.744 | 0.758731 | 11.8659 | 283.671 |


TrainOutput(global_step=3576, training_loss=1.0442394644888722, metrics={'train_runtime': 1092.0608, 'train_samples_per_second': 3.275, 'total_flos': 5566316459802624, 'epoch': 3.0})
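
The `global_step` value can be checked by hand. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in the arguments above) and copies the verse count and runtime from the outputs in this notebook.

```python
import math

training_verses = 19_071   # from the dataset summary
batch_size = 16            # per_device_train_batch_size
epochs = 3                 # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(training_verses / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

train_runtime = 1092.0608  # seconds, from TrainOutput
print(f'{train_runtime / 60:.1f} minutes')
print(f'{total_steps / train_runtime:.3f} steps per second')
print(f'{training_verses * epochs / train_runtime:.1f} verses per second')
```

The total matches `global_step=3576`. The reported `train_samples_per_second` of 3.275 numerically matches the steps-per-second figure rather than the verses-per-second figure, which suggests it should be read with some care.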

In [14]:
model_path = f'bart-{source_version[2:]}-to-{target_version[2:]}'

In [15]:
# Saved with timestamp to avoid overwriting previous save
import time
model_path = f'{model_path}-{int(time.time())}'
trainer.save_model(model_path)
print(f'Saved model at: {model_path}')

Saved model at: bart-kjv-to-bbe-1615779545


#### IMPORTANT

The model was saved to the Colab runtime's temporary storage and will be deleted when the runtime is reset. To keep your trained model, either point `model_path` at a folder in your mounted Google Drive or download the model directory through the Colab file browser.
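
One way to keep a copy is to duplicate the saved directory into a persistent location. The sketch below is a minimal example; the helper name `backup_model` and the Drive folder in the comments are hypothetical, and the Drive mount itself is shown only as comments because it works only inside Colab.

```python
import shutil
from pathlib import Path

def backup_model(model_dir: str, backup_root: str) -> Path:
    """Copy a saved model directory into backup_root and return the new path."""
    source = Path(model_dir)
    destination = Path(backup_root) / source.name
    # copytree creates missing parent folders and raises FileExistsError
    # if the destination already exists, so earlier backups are not overwritten.
    shutil.copytree(source, destination)
    return destination

# In Colab, Google Drive can be mounted first (not run here):
#   from google.colab import drive
#   drive.mount('/content/drive')
#   backup_model(model_path, '/content/drive/MyDrive/models')  # hypothetical folder
```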

In [16]:
# Reload the fine-tuned model from the saved directory; max_length = 100 sets the default cap on generated sequence length
model = BartForConditionalGeneration.from_pretrained(model_path, max_length = 100)

## Inference time
Let's use the fine-tuned model we just saved for inference with a 🤗 pipeline.

In [17]:
from transformers import pipeline

translator = pipeline(f'translation_{source_version[2:]}_to_{target_version[2:]}', model = model, tokenizer = tokenizer)

def translate(text: str) -> str:
    # translates a single string
    return translator(text, return_text = True)[0]['translation_text']

In [18]:
# And let us see how our model performeth.
translate("And the LORD God called unto Adam, and said unto him, Where art thou?")

'And the Lord God sent for Adam and said to him, Where are you?'

## Evaluation

Now, let's evaluate our model's performance using METEOR, computed here on the first 100 verses of the test set.

In [19]:
def translate_all(translator, verses: [str], num_verses: int = 100) -> [str]:
    """Translates the first num_verses verses and returns the translated strings."""
    return [verse['translation_text'] for verse in translator(verses[:num_verses], return_text = True)]

def compute_meteor_metric(predictions: [str], references: [str]) -> float:
    """Returns the METEOR score of the predictions against the references."""
    meteor_metric.add_batch(predictions = predictions, references = references)
    return meteor_metric.compute()['meteor']

def compute_meteor_metric_easy(translator, num_verses: int) -> float:
    """Translates the first num_verses test verses and scores them against the target version."""
    predictions = translate_all(translator, datasets['test'][source_version], num_verses)
    references = datasets['test'][target_version][:num_verses]
    return compute_meteor_metric(predictions, references)

In [20]:
print(f'METEOR score = ~{compute_meteor_metric_easy(translator, 100):.4f}')

METEOR score = ~0.5951
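
For intuition about what the score rewards, the sketch below computes a recall-weighted harmonic mean of unigram precision and recall on exact word matches. It is a deliberate simplification and not the METEOR implementation used above: full METEOR also matches stems and synonyms and applies a word-order penalty, so the numbers are not comparable with the score reported here. The function name `unigram_f_mean` and the `alpha = 0.9` default are assumptions made for this illustration.

```python
from collections import Counter

def unigram_f_mean(prediction: str, reference: str, alpha: float = 0.9) -> float:
    """Recall-weighted harmonic mean of exact-match unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference word can be matched at most once.
    matches = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(pred_tokens)
    recall = matches / len(ref_tokens)
    # A larger alpha puts more weight on recall.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

score = unigram_f_mean('the cat sat on the mat', 'the cat is on the mat')
print(f'{score:.3f}')
```

Because recall is weighted more heavily than precision, a translation that drops words from the reference is penalized more than one that adds a few extra words.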
