This notebook trains and runs a bidirectional MarianMT model that translates between Early Modern English and Modern English in both directions.


We import the libraries and modules needed throughout the notebook: for loading and tokenizing our data, training and evaluating a MarianMT model, saving checkpoints, and running the model.

In [None]:
import torch
from pathlib import Path
from datasets import load_dataset
from transformers import MarianTokenizer, MarianMTModel, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from google.colab import files

We upload our data to the Colab session as a file called `data.tsv`. This TSV file has Shakespearean Early Modern English in the first column and its paired Modern English translation in the second column.


In [None]:
from google.colab import files
uploaded = files.upload()

# The uploaded file must be named data.tsv
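Before loading, it can help to confirm that every row of the file really has two tab-separated columns. The sketch below is a minimal check using only the standard library. The sample rows are invented for illustration, and `check_two_columns` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

def check_two_columns(handle):
    """Return the 1-based numbers of rows that do not have exactly two fields."""
    # QUOTE_NONE treats quote characters in the text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [n for n, row in enumerate(reader, start=1) if len(row) != 2]

# Invented sample standing in for data.tsv
# (Early Modern English first, Modern English second).
sample = io.StringIO(
    "Thou art wise.\tYou are wise.\n"
    "Whither goest thou?\tWhere are you going?\n"
    "A line with no translation\n"
)
bad_rows = check_two_columns(sample)
print(bad_rows)  # [3]
```

For the real file, the same function could be given `open("data.tsv", newline="", encoding="utf-8")` in place of the in-memory sample.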

Here, we load the Early Modern English and Modern English dataset as a `DatasetDict` with two columns. We then split it into training (90%) and validation (10%) sets so that we can track validation loss during training.


In [None]:
from pathlib import Path
import re
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"full": str(Path("./data.tsv"))},
    delimiter="\t",
    column_names=["shakespeare", "modern"]
)

split_dataset = dataset["full"].train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
valid_dataset = split_dataset["test"]
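To make the idea of a seeded 90/10 split concrete, here is a minimal pure-Python sketch. It only illustrates the principle (shuffle with a fixed seed, then cut). It is not the algorithm `datasets` uses internally, so it will not assign the same rows to each side as `train_test_split`. The helper name `split_90_10` and the placeholder rows are assumptions.

```python
import random

def split_90_10(rows, test_size=0.1, seed=42):
    """Shuffle a copy of rows with a fixed seed; the first share becomes validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, round(len(shuffled) * test_size))
    # Return (training rows, validation rows).
    return shuffled[n_valid:], shuffled[:n_valid]

rows = [f"pair {i}" for i in range(20)]  # placeholder rows
train_rows, valid_rows = split_90_10(rows)
print(len(train_rows), len(valid_rows))  # 18 2
```

Because the seed is fixed, calling the function again on the same rows gives the same split, which is what makes later loss comparisons repeatable.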

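We then prepare for training: we load the base model and its tokenizer, add the two special tokens that mark the translation direction (`<to_modern>` and `<to_shakespeare>`), resize the model's embeddings to match, and tokenize every sentence pair in both directions.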


In [None]:
# Load tokenizer and base model
model_name = "Helsinki-NLP/opus-mt-en-fr"  # base MarianMT model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

special_tokens = {"additional_special_tokens": ["<to_modern>", "<to_shakespeare>"]}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print("Using device:", device)

# One-directional preprocessing (Modern → Shakespeare only); not used below, kept for reference.
# No padding here; the data collator pads dynamically.
def preprocess(batch):
    src = [str(s) for s in batch["modern"]]
    tgt = [str(s) for s in batch["shakespeare"]]
    return tokenizer(src, text_target=tgt, max_length=128, truncation=True)

def preprocess_bidirectional(batch):
    modern_list = [str(s) for s in batch["modern"]]
    shakes_list = [str(s) for s in batch["shakespeare"]]

    # Build bidirectional sources/targets
    src_texts = []
    tgt_texts = []

    for modern, shakes in zip(modern_list, shakes_list):
        # Modern → Shakespeare
        src_texts.append("<to_shakespeare> " + modern)
        tgt_texts.append(shakes)

        # Shakespeare → Modern
        src_texts.append("<to_modern> " + shakes)
        tgt_texts.append(modern)

    # Tokenize both directions
    return tokenizer(
        src_texts,
        text_target=tgt_texts,
        max_length=128,
        truncation=True,
    )

tokenized_train = train_dataset.map(preprocess_bidirectional, batched=True, remove_columns=train_dataset.column_names)
tokenized_valid = valid_dataset.map(preprocess_bidirectional, batched=True, remove_columns=valid_dataset.column_names)
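The effect of the bidirectional preprocessing can be seen without a tokenizer. The sketch below reproduces only the pairing step on plain strings: each sentence pair yields two training examples, one per direction, with the direction token prepended to the source. The function name `build_bidirectional_pairs` and the sample pair are illustrative assumptions.

```python
TO_SHAKESPEARE = "<to_shakespeare>"
TO_MODERN = "<to_modern>"

def build_bidirectional_pairs(pairs):
    """Turn (shakespeare, modern) pairs into tagged sources and their targets."""
    sources, targets = [], []
    for shakes, modern in pairs:
        # Modern → Shakespeare
        sources.append(f"{TO_SHAKESPEARE} {modern}")
        targets.append(shakes)
        # Shakespeare → Modern
        sources.append(f"{TO_MODERN} {shakes}")
        targets.append(modern)
    return sources, targets

pairs = [("Thou art wise.", "You are wise.")]  # invented example pair
sources, targets = build_bidirectional_pairs(pairs)
for src, tgt in zip(sources, targets):
    print(src, "->", tgt)
```

One input pair becomes two examples, so the tokenized training and validation sets each hold twice as many rows as the splits they came from.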

We then set the training hyperparameters, including the learning rate, batch size, number of epochs, and the logging, checkpoint-saving, and evaluation schedules. These are collected in a `TrainingArguments` object, which is passed to the `Trainer`. The `Trainer` is set up with a `DataCollatorForSeq2Seq` for dynamic padding; no `compute_metrics` function is passed, so evaluation at the end of each epoch reports only the validation loss. We are then ready to train our bidirectional MarianMT model.



In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./mt_shakespeare",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    logging_steps=500,
    save_steps=1000,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),  # Mixed precision for faster GPU training (off on CPU)
    eval_strategy="epoch"   # Evaluate on the validation set at the end of every epoch
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    data_collator=data_collator
)

# Start training
trainer.train()
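To get a feel for how `logging_steps=500` and `save_steps=1000` relate to the length of a run, the rough arithmetic can be sketched as below. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The figure of 1000 training pairs is made up and is not the size of the real dataset, and `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(n_train_pairs, batch_size=16, epochs=50, save_steps=1000):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    n_examples = 2 * n_train_pairs                        # each pair yields two examples
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    periodic_checkpoints = total_steps // save_steps      # saves triggered by save_steps alone
    return steps_per_epoch, total_steps, periodic_checkpoints

# 1000 training pairs is a made-up figure for illustration.
steps_per_epoch, total_steps, periodic_checkpoints = training_schedule(1000)
print(steps_per_epoch, total_steps, periodic_checkpoints)  # 125 6250 6
```

With `save_total_limit=2`, only the two most recent checkpoints would remain on disk at any time, however many are written over the run.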

With the model trained, we move on to generating output that we can evaluate in various ways. We begin with a function that takes an input string already prefixed with a direction token (`<to_modern>` or `<to_shakespeare>`), generates output token IDs, and decodes them back into text.


In [None]:
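def translate(text):
    # text must already start with a direction token, e.g. "<to_modern> ..."
    encoded = tokenizer(text, return_tensors="pt").to(device)
    # Generate at most 128 output tokens, matching the training max_length
    out_tokens = model.generate(**encoded, max_length=128)
    # Decode the first (and only) sequence, dropping special tokens
    return tokenizer.decode(out_tokens[0], skip_special_tokens=True)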
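Since `translate` relies on the caller to prepend the right direction token, a small helper can make that step less error-prone. The sketch below is one possible way to do it; `make_prompt` and the `DIRECTION_TOKENS` mapping are hypothetical names, not part of the notebook.

```python
DIRECTION_TOKENS = {
    "modern": "<to_modern>",
    "shakespeare": "<to_shakespeare>",
}

def make_prompt(text, target):
    """Prefix text with the token for the requested target variety."""
    if target not in DIRECTION_TOKENS:
        raise ValueError(f"target must be one of {sorted(DIRECTION_TOKENS)}")
    return f"{DIRECTION_TOKENS[target]} {text.strip()}"

prompt = make_prompt("Thou art wise.", "modern")
print(prompt)  # <to_modern> Thou art wise.
```

The resulting string could then be passed straight to `translate`, for example `translate(make_prompt(line, "modern"))`.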


We upload our test data to the Colab session as a file called `test.tsv`.


In [None]:
from google.colab import files
uploaded = files.upload()

# The uploaded file must be named test.tsv

Here, we load in the Early Modern English and Modern English test set as a `DatasetDict` with two columns.


In [None]:
from pathlib import Path
import re
from datasets import load_dataset

testset = load_dataset(
    "csv",
    data_files={"full": str(Path("./test.tsv"))},
    delimiter="\t",
    column_names=["shakespeare", "modern"]
)


The test data contains line numbers, so we strip all digits from both columns to keep them from interfering with our evaluation.


In [None]:
import re

def remove_numbers(row):
    row["shakespeare"] = re.sub(r"\d+", "", str(row["shakespeare"]))
    row["modern"] = re.sub(r"\d+", "", str(row["modern"]))
    return row

testset["full"] = testset["full"].map(remove_numbers)
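To see exactly what this substitution does to a row, the sketch below applies the same regular expression to a single invented line. It also shows a hypothetical variant, `strip_digits_and_tidy`, that collapses the whitespace left behind; that variant is an assumption for illustration and is not what the notebook runs.

```python
import re

def strip_digits(text):
    """Remove every run of digits, as remove_numbers does for each column."""
    return re.sub(r"\d+", "", str(text))

def strip_digits_and_tidy(text):
    """Hypothetical variant: also collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", strip_digits(text)).strip()

line = "12 To be, or not to be"  # invented example of a numbered line
print(repr(strip_digits(line)))           # ' To be, or not to be'
print(repr(strip_digits_and_tidy(line)))  # 'To be, or not to be'
```

Note that the pattern removes every digit, so any number that belongs to the text itself is removed along with the line numbers.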


Our model was trained with inputs truncated to `max_length=128` tokens, and the `translate` function also generates at most `max_length=128` tokens, so the model isn't built to perform well on very long inputs. Here we keep only the test pairs in which both the Early Modern English side and the Modern English side have at most 25 words.


In [None]:
def short_enough(row):
    # Keep a pair only if both sides have at most 25 whitespace-separated words
    sh_words = len(str(row["shakespeare"]).split())
    mod_words = len(str(row["modern"]).split())
    return (sh_words <= 25) and (mod_words <= 25)

testset["full"] = testset["full"].filter(short_enough)
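The filter keeps a row when its predicate returns `True`, so the function above describes the rows to keep rather than the rows to drop. The same word-count rule can be tried on plain dictionaries, as in the sketch below; `within_word_limit` and the sample rows are invented for illustration.

```python
MAX_WORDS = 25

def within_word_limit(row, max_words=MAX_WORDS):
    """True when both columns have at most max_words whitespace-separated words."""
    return all(
        len(str(row[col]).split()) <= max_words
        for col in ("shakespeare", "modern")
    )

rows = [
    {"shakespeare": "Thou art wise.", "modern": "You are wise."},
    {"shakespeare": "word " * 30, "modern": "short"},  # 30 words: dropped
]
kept = [row for row in rows if within_word_limit(row)]
print(len(kept))  # 1
```

Words are only a proxy for the tokenizer's subword tokens, so a 25-word cap keeps inputs comfortably short rather than matching the 128-token limit exactly.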


We create a file `modern_test.tsv` that contains two columns: one for the original Modern English transcriptions and one for the Early Modern English translated into Modern English by our bidirectional MarianMT model. We will use this for testing evaluation.


In [None]:
import csv

with open("modern_test.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["original modern", "translated from early modern"])

    for i in range(len(testset["full"])):
        original = testset["full"][i]["shakespeare"]
        prompt = "<to_modern> " + str(original)

        translated = translate(prompt)
        print(i)
        print(original)
        print(translated)

        writer.writerow([str(testset["full"][i]["modern"]), translated])


We create a file `shakespeare_test.tsv` that contains two columns: one for the original Early Modern English transcriptions and one for the Modern English translated into Early Modern English by our bidirectional MarianMT model. We will use this for testing evaluation.


In [None]:
with open("shakespeare_test.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["original early modern", "translated from modern"])

    for i in range(len(testset["full"])):
        original = testset["full"][i]["modern"]
        prompt = "<to_shakespeare> " + str(original)

        translated = translate(prompt)

        writer.writerow([str(testset["full"][i]["shakespeare"]), translated])
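Once the output files exist, a quick sanity check is to read one back and compare the two columns. The sketch below computes a case-insensitive exact-match rate on an in-memory sample. This is only an assumed, very strict check chosen for illustration, not the evaluation metric used for this model; `exact_match_rate` and the sample rows are hypothetical.

```python
import csv
import io

def exact_match_rate(handle):
    """Fraction of rows whose reference and translation match, ignoring case."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    rows = [row for row in reader if len(row) == 2]
    if not rows:
        return 0.0
    matches = sum(ref.strip().lower() == hyp.strip().lower() for ref, hyp in rows)
    return matches / len(rows)

# Invented sample in the same layout as shakespeare_test.tsv.
sample = io.StringIO(
    "original early modern\ttranslated from modern\n"
    "Thou art wise.\tThou art wise.\n"
    "Whither goest thou?\tWhere goest thou?\n"
)
rate = exact_match_rate(sample)
print(rate)  # 0.5
```

For a real file, the function could be given `open("shakespeare_test.tsv", newline="", encoding="utf-8")` instead of the sample.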
