This notebook trains a DistilRoBERTa-based classifier to evaluate the outputs of the translation models. The classifier labels each translation as either Shakespearean or Modern English, which lets us check whether an output matches the style its translator model was meant to produce.



We install (or upgrade, if already installed) the transformers library from Hugging Face, which lets us load the pretrained model, tokenize text, and train.

In [None]:
!pip install -U transformers

We upload the file shakespeare_modern.tsv, which holds the training data for the classifier. It has one column of Shakespearean English and one column of the corresponding Modern English translations.

In [None]:
from google.colab import files
uploaded = files.upload()

We load the TSV into a dataframe and name its two columns "shakespeare" and "modern".

In [None]:
import pandas as pd

df = pd.read_csv("shakespeare_modern.tsv", sep="\t", header=None, names=["shakespeare", "modern"])
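
Literary text often contains stray quotation marks, which the default CSV quoting rules may treat as field delimiters. The sketch below is a hedged illustration on a made-up two-line sample (not the real file): it passes `quoting=csv.QUOTE_NONE` so quote characters are kept as ordinary text, then checks the shape and counts missing values. Whether the real shakespeare_modern.tsv needs this option is an assumption to verify against the data.

```python
import csv
import io

import pandas as pd

# Toy two-line TSV (made up for illustration); the first field
# contains an unbalanced double quote.
sample_tsv = (
    '"Hark, who goes there?\tWho is there?\n'
    "To be, or not to be\tTo live or to die\n"
)

toy_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["shakespeare", "modern"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)

print(toy_df.shape)         # number of (rows, columns) parsed
print(toy_df.isna().sum())  # missing values per column
```

Running the same two checks on `df` is a quick way to confirm that every line of the real file was parsed into exactly two populated columns.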

We build a Shakespearean dataset labeled 1 and a Modern dataset labeled 0, then concatenate them into a single labeled dataset so the classifier can learn to tell the two styles apart.

In [None]:
shakes = pd.DataFrame({
    "text": df["shakespeare"],
    "label": 1
})

modern = pd.DataFrame({
    "text": df["modern"],
    "label": 0
})

full_df = pd.concat([shakes, modern], ignore_index=True)


This splits the dataset so that 80% is used for training and 20% is used for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    full_df["text"],
    full_df["label"],
    test_size=0.2,
    random_state=42
)

This cleans the data: missing values are replaced with empty strings, every text is cast to a string, and the texts and labels are converted to plain Python lists.

In [None]:
X_train_clean = X_train.fillna("").astype(str).tolist()
X_test_clean  = X_test.fillna("").astype(str).tolist()
y_train_clean = y_train.tolist()
y_test_clean  = y_test.tolist()

This loads the tokenizer for the pre-trained model, distilroberta-base. The model itself is loaded later, alongside the training arguments.

In [None]:
from transformers import AutoTokenizer

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)


This defines a PyTorch `Dataset` with three methods. `__init__` tokenizes the texts of the dataset passed in (truncating and padding the encodings) and stores their labels (0 for Modern English, 1 for Shakespearean English). `__len__` returns the length of the dataset, and `__getitem__` retrieves the ith item from the dataset as tensors.



In [None]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
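
To make the `truncation=True, padding=True, max_length=...` arguments concrete, here is a simplified pure-Python sketch of what they do to a batch of token IDs. The helper `pad_and_truncate` and the toy IDs are hypothetical, and `pad_id=1` is an assumption about the padding ID. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=1):
    """Cut each sequence to max_length, then pad to the longest remaining one.

    Hypothetical helper for illustration only; pad_id=1 is an assumed padding ID.
    """
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]]  # made-up token IDs
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because every sequence ends up the same length, `__getitem__` can turn each row into a tensor of the same shape.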


This creates the training and test datasets by using `TextDataset` to tokenize the texts.

In [None]:
train_dataset = TextDataset(X_train_clean, y_train_clean, tokenizer)
test_dataset  = TextDataset(X_test_clean, y_test_clean, tokenizer)


This loads the model and sets the hyperparameters for training. We train for 2 epochs, logging training loss and evaluating validation loss (computed on the held-out test split) at the end of each epoch.

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
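
As a rough guide to what these settings imply, the sketch below estimates the number of optimizer steps for a given training-set size. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The helper name and the training-set size of 10,000 are hypothetical; the real size depends on shakespeare_modern.tsv.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps (single device, no gradient accumulation)."""
    # The last, smaller batch of an epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


hypothetical_train_size = 10_000  # placeholder, not the real dataset size
steps = total_optimizer_steps(hypothetical_train_size, batch_size=16, num_epochs=2)
print("Estimated optimizer steps:", steps)
```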


This trains the model and then prints the final evaluation metrics, such as validation loss.

In [None]:
trainer.train()
results = trainer.evaluate()
print("Evaluation results:", results)

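This evaluates the model on the training dataset to get the training loss. The metrics are still reported under the `eval_` prefix, so the training loss appears as `eval_loss`.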

In [None]:
train_metrics = trainer.evaluate(train_dataset)
print("Training metrics:", train_metrics)

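This plots training loss against validation loss per epoch to check for overfitting.

In [None]:
import matplotlib.pyplot as plt

logs = trainer.state.log_history

# Training loss is logged once per epoch; pair each value with its own epoch.
train_epochs = [x["epoch"] for x in logs if "loss" in x]
train_loss   = [x["loss"] for x in logs if "loss" in x]

# Keep only the first eval entry per epoch: the trainer.evaluate() calls above
# append extra "eval_loss" entries at the final epoch.
eval_by_epoch = {}
for x in logs:
    if "eval_loss" in x:
        eval_by_epoch.setdefault(x["epoch"], x["eval_loss"])

plt.plot(list(eval_by_epoch), list(eval_by_epoch.values()), label="Validation Loss")
plt.plot(train_epochs, train_loss, label="Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.show()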


This function takes in a sentence and uses the classifier to categorize it as either "Shakespeare" or "Modern".

In [None]:
import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def classify_sentence(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)

    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    pred_class = torch.argmax(logits, dim=1).item()

    return "Shakespeare" if pred_class == 1 else "Modern"
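
The function above returns only the top label. If a confidence score is also useful, the logits can be passed through a softmax first. The following is a minimal pure-Python sketch of that step; the helper names and the example logits are made up for illustration and are not outputs of the trained model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, labels=("Modern", "Shakespeare")):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits in the order (Modern, Shakespeare).
label, confidence = label_with_confidence([-1.2, 2.3])
print(label, round(confidence, 3))
```

With PyTorch tensors, `torch.softmax(logits, dim=1)` plays the same role as the `softmax` helper here.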

These are some tests to see how the classifier categorizes sample sentences. It was mostly accurate.

In [None]:
print(classify_sentence("He walks among shadows, carrying memories not his own."))
print(classify_sentence("The morning sun whispers secrets the night forgot"))
print(classify_sentence("The wind bends the trees as if sharing some quiet counsel."))
print(classify_sentence("She lingers by the window, pondering what tomorrow might bring."))
print(classify_sentence("Methinks the night doth hide more than the eye can see."))
print(classify_sentence("I shall attend the meeting tomorrow, though I might be late."))
print(classify_sentence("The river murmurs softly, carrying tales no one recalls."))

This compares training accuracy with test accuracy. We want both to be high and close to each other, since a large gap between them would suggest that the classifier does not generalize well.

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

train_preds_output = trainer.predict(train_dataset)
train_preds = np.argmax(train_preds_output.predictions, axis=1)
print("Train accuracy:", accuracy_score(y_train, train_preds))

test_preds_output = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_output.predictions, axis=1)
print("Test accuracy:", accuracy_score(y_test, test_preds))
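
The comparison can also be expressed as a single number, the gap between the two accuracies. The sketch below uses small made-up label lists rather than the notebook's real predictions, and the helper `accuracy` is a hypothetical stand-in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions for illustration only.
toy_train_acc = accuracy([1, 0, 1, 0, 1], [1, 0, 1, 0, 1])
toy_test_acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])

# A large positive gap hints at overfitting.
gap = toy_train_acc - toy_test_acc
print(f"train={toy_train_acc:.2f} test={toy_test_acc:.2f} gap={gap:.2f}")
```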

The next two cells are for uploading our data:


*   `modern_test.tsv`, which has one column of original Modern English from our test set and a second column of Modern English translations of Early Modern English generated by one of our models
*   `shakespeare_test.tsv`, which has one column of original Early Modern English from our test set and a second column of Early Modern English translations of Modern English generated by one of our models


In [None]:
# Upload modern_test.tsv and shakespeare_test.tsv
from google.colab import files
uploaded = files.upload()


In [None]:
# Upload modern_test.tsv and shakespeare_test.tsv
from google.colab import files
uploaded = files.upload()

We then load the data into `DatasetDict` objects, each with two columns.

In [None]:
from pathlib import Path
from datasets import load_dataset

modern_testset = load_dataset(
    "csv",
    data_files={"full": str(Path("modern_test.tsv"))},
    delimiter="\t",
    column_names=["original_modern", "translated_from_earlymodern"]
)

shakes_testset = load_dataset(
    "csv",
    data_files={"full": str(Path("shakespeare_test.tsv"))},
    delimiter="\t",
    column_names=["original_earlymodern", "translated_from_modern"]
)


This goes through the modern data, skips empty entries, and converts each entry to a string before classifying it with `classify_sentence`. Run in order, this cell uses the version of `classify_sentence` defined earlier; the next cell redefines the function to return numeric labels.

In [None]:
for row in modern_testset["full"]:
    translated = row["translated_from_earlymodern"]

    if translated is None:
        continue

    translated = str(translated)
    pred = classify_sentence(translated)


This redefines `classify_sentence` so that it takes an input sentence and uses the classifier to categorize it as Modern (0) or Shakespeare (1). It returns `None` for missing or empty input.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def classify_sentence(sentence):
    if sentence is None or sentence == "":
        return None  # or a default class

    inputs = tokenizer(text=str(sentence), return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    pred_class = torch.argmax(logits, dim=1).item()
    return pred_class  # 0=Modern, 1=Shakespeare


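These cells measure how often a translator model's output matches its intended style. Each cell loops through a dataset, reads the translator model's output, classifies it, and counts the output as correct when the classifier's label matches the target style. The first cell covers translation from Shakespearean English to Modern English and the second covers translation from Modern English to Shakespearean English. Empty outputs, for which `classify_sentence` returns `None`, count as incorrect.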

In [None]:
correct = 0
total = 0

for row in modern_testset["full"]:
    translated = row["translated_from_earlymodern"]
    pred = classify_sentence(translated)

    if pred == 0:  # expected: Modern
        correct += 1
    total += 1

print("Modern Test Accuracy:", correct / total)

In [None]:
correct = 0
total = 0

for row in shakes_testset["full"]:
    translated = row["translated_from_modern"]
    pred = classify_sentence(translated)

    if pred == 1:  # expected: Shakespeare
        correct += 1
    total += 1

print("Shakespeare Test Accuracy:", correct / total)
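
The two loops above share the same logic, so they could be folded into one helper. The sketch below is one possible variation, not the notebook's method. Unlike the loops above, it excludes outputs the classifier cannot score (a `None` prediction) from the denominator and reports how many were skipped. `style_match_rate` and `toy_classify` are hypothetical names, and `toy_classify` is a keyword stub standing in for the trained `classify_sentence`.

```python
def style_match_rate(sentences, classify, expected_label):
    """Return (share of scored sentences matching expected_label, skipped count)."""
    matches = 0
    scored = 0
    skipped = 0
    for sentence in sentences:
        pred = classify(sentence)
        if pred is None:  # empty or missing translation: leave it out
            skipped += 1
            continue
        scored += 1
        matches += int(pred == expected_label)
    rate = matches / scored if scored else float("nan")
    return rate, skipped


def toy_classify(sentence):
    """Keyword stub used only for this example (1=Shakespeare, 0=Modern)."""
    if not sentence:
        return None
    return 1 if "thou" in sentence.lower() else 0


rate, skipped = style_match_rate(
    ["Thou art kind.", "You are kind.", None, ""],
    toy_classify,
    expected_label=1,
)
print("Match rate:", rate, "| skipped:", skipped)
```

In the notebook, `classify` would be `classify_sentence` and `sentences` would be a column such as `shakes_testset["full"]["translated_from_modern"]`.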