# Loading Data And Preprocessing

After manually creating and labeling the poetry dataset, let's see whether it's possible to train a GPT-2 model on the data and get a meaningful result. The dataset consists of 30 entries: half written by me and the other half written by other authors on r/OCPoetry. The label meanings are:
- 1 = Written by me
- 0 = Not written by me

Before we train, I'll need to convert the literal '\n' characters in the dataset into actual line breaks and split the dataset into training and test sets, then convert them into Hugging Face datasets.
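
As a quick illustration of the first step, here is a minimal sketch using a made-up two-line poem (the `raw` string is hypothetical, not an entry from the dataset). It shows the difference between the two-character placeholder `\n` as stored in the CSV and a real line break.

```python
# A hypothetical CSV cell: the backslash and the "n" are two separate characters
raw = "Roses are red\\nViolets are blue"

# Replace the two-character placeholder with a real newline character
converted = raw.replace("\\n", "\n")

print(raw)        # prints on a single line with the placeholder visible
print(converted)  # prints on two lines
```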

In [1]:
import pandas as pd
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import Dataset
import matplotlib.pyplot as plt
import numpy as np


ValueError: Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

In [None]:
# Load in the data
df = pd.read_csv("data.csv")
df.head()

In [None]:
# Function that preprocesses the CSV file and prepares the sets for training
def preprocess(df, shuffle=True):
    # Shuffle the entries in case they weren't mixed up when the dataset was made (mine weren't)
    if shuffle:
        df = df.sample(frac=1).reset_index(drop=True)

    # Convert the placeholder \n characters into actual line breaks
    print("Before conversion:")
    print(df['text'][0])

    df['text'] = df['text'].apply(lambda x: x.replace("\\n", "\n"))

    # Check to see if conversion worked
    print("\nAfter conversion:")
    print(df['text'][0])

    # Split the data into training and test sets
    test = df.sample(frac=0.2, random_state=42)
    train = df.drop(test.index)

    # Check to see if the shapes are correct
    print("\nDataframe sizes:", test.shape, train.shape)

    # Convert the pandas dataframes into Hugging Face Datasets for training
    train_dataset = Dataset.from_pandas(train).remove_columns(["__index_level_0__"])
    test_dataset = Dataset.from_pandas(test).remove_columns(["__index_level_0__"])

    print("\n")
    print(train_dataset)
    print(test_dataset)

    return df, train_dataset, test_dataset

In [None]:
df, train_dataset, test_dataset = preprocess(df, shuffle=True)
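
With only 30 entries, a purely random 20% split isn't guaranteed to hold both labels in equal proportion. Below is a minimal sketch of a stratified split in plain Python. It assumes each row is a dict with a `label` key; that key name, the toy rows, and the `stratified_split` helper are all hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_frac=0.2, seed=42):
    """Split rows into (train, test), sampling test_frac of each label group."""
    rng = random.Random(seed)

    # Group the rows by label so each class is sampled separately
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is left untouched
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy stand-in for the dataset: 15 rows per label
rows = [{"text": f"poem {i}", "label": i % 2} for i in range(30)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

On the toy rows this puts 3 poems of each label in the test set, so the validation loss is never computed on one class only.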

In [None]:
# Ensure the dataframe was shuffled
df.head()

In [None]:
# Function that tokenizes the plain text datasets for training
def tokenize_text(train_dataset, test_dataset):
    # Load in tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenize the data
    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

    # Apply the tokenization
    train_dataset = train_dataset.map(tokenize_function, batched=True)
    test_dataset = test_dataset.map(tokenize_function, batched=True)

    # Format datasets for PyTorch
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    return train_dataset, test_dataset

In [None]:
train_dataset, test_dataset = tokenize_text(train_dataset, test_dataset)
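
To make the `padding="max_length"` and `truncation=True` settings concrete, here is a small plain-Python sketch of what they do to one sequence of token ids. The `pad_or_truncate` helper and the input ids are hypothetical; the real work is done by the tokenizer. The sketch assumes right-side padding and reuses GPT-2's end-of-text id (50256) as the pad id, mirroring `tokenizer.pad_token = tokenizer.eos_token`.

```python
PAD_ID = 50256  # GPT-2's end-of-text token id, reused here as the pad token

def pad_or_truncate(token_ids, max_length=128, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad             # right padding
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids, with a small max_length so the result is easy to read
input_ids, attention_mask = pad_or_truncate([11, 22, 33], max_length=5)
print(input_ids)       # [11, 22, 33, 50256, 50256]
print(attention_mask)  # [1, 1, 1, 0, 0]
```

Any poem longer than 128 tokens loses its tail under this scheme, which is worth keeping in mind for longer entries.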

# Training GPT-2

Given that we're working with such a small dataset, overfitting is a huge risk when training. For now, I'll freeze all of the layers except for the classification head. I'll also add early stopping after 3 epochs without validation loss improvement, along with weight decay, to further reduce overfitting.
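
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's implementation, and the `early_stop_epoch` helper and the loss values are made up for the example.

```python
def early_stop_epoch(eval_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            bad_epochs = 0
        else:
            # No improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value is at epoch 2
print(early_stop_epoch([0.90, 0.80, 0.85, 0.82, 0.81, 0.70]))  # 5
```

Note that the run stops at epoch 5 and never sees the lower loss at epoch 6, which is the usual trade-off with a short patience window.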

In [None]:
# Load in pre-trained GPT-2 model with a classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# Set the padding token (GPT-2 has none by default, so reuse EOS, matching the tokenizer setup above)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.config.pad_token_id = tokenizer.eos_token_id

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the classification head
for param in model.score.parameters():
    param.requires_grad = True
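
To get a feel for how little of the model is actually being trained, here is a back-of-the-envelope parameter count in plain Python. It assumes the standard GPT-2 small configuration (vocabulary of 50257, 1024 positions, hidden size 768, 12 layers) and a bias-free linear classification head; the helper names are made up for this sketch.

```python
# Assumed GPT-2 small configuration values
VOCAB, N_POS, N_EMBD, N_LAYER, NUM_LABELS = 50257, 1024, 768, 12, 2

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * N_EMBD  # weight + bias

block = (
    layer_norm                      # ln_1
    + linear(N_EMBD, 3 * N_EMBD)    # attention: fused query/key/value projection
    + linear(N_EMBD, N_EMBD)        # attention: output projection
    + layer_norm                    # ln_2
    + linear(N_EMBD, 4 * N_EMBD)    # MLP: expansion
    + linear(4 * N_EMBD, N_EMBD)    # MLP: projection
)

# Token embeddings + position embeddings + transformer blocks + final layer norm
frozen = VOCAB * N_EMBD + N_POS * N_EMBD + N_LAYER * block + layer_norm
trainable = linear(N_EMBD, NUM_LABELS, bias=False)  # classification head

print(f"frozen: {frozen:,}")        # frozen: 124,439,808
print(f"trainable: {trainable:,}")  # trainable: 1,536
print(f"trainable share: {trainable / (frozen + trainable):.5%}")
```

So under these assumptions only about a thousandth of a percent of the weights can change, which limits overfitting but also limits how much the model can adapt.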

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()
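
For reference, the settings above imply the following step counts. This is a rough sketch that assumes a single device, no gradient accumulation, and that early stopping never triggers.

```python
import math

n_rows = 30          # dataset size stated at the top
test_frac = 0.2      # fraction held out in preprocess()
batch_size = 2       # per_device_train_batch_size
num_epochs = 10      # num_train_epochs
logging_steps = 10   # training loss is logged every 10 optimizer steps

n_test = round(n_rows * test_frac)
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
max_steps = steps_per_epoch * num_epochs

train_loss_points = max_steps // logging_steps  # one point per logging interval
eval_loss_points = num_epochs                   # one point per epoch

print(n_train, n_test)                      # 24 6
print(steps_per_epoch, max_steps)           # 12 120
print(train_loss_points, eval_loss_points)  # 12 10
```

Because the training loss is logged every 10 steps while evaluation runs every 12, the two curves don't share the same x positions unless they are plotted against the logged epoch or step.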

# Evaluation

Looking at the graph, the GPT-2 model is learning to generalize, but at a very slow rate. The training loss is also very unstable. This is likely due to the small training set and small batch sizes, which make the loss computed on each batch a noisy estimate. This could be improved with more training data, larger batch sizes, or unfreezing some layers. The training is feasible, but will likely require some major changes to get done properly.
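
To illustrate why small batches give a noisier loss curve, here is a tiny simulation in plain Python. The per-example losses are synthetic (drawn from a normal distribution chosen only for illustration), not taken from the actual run, and `batch_mean_std` is a made-up helper.

```python
import random
import statistics

rng = random.Random(0)
# Synthetic per-example losses, not taken from the actual training run
losses = [rng.gauss(0.7, 0.3) for _ in range(10_000)]

def batch_mean_std(values, batch_size):
    """Standard deviation of the mean loss across consecutive batches."""
    means = [
        statistics.fmean(values[i:i + batch_size])
        for i in range(0, len(values) - batch_size + 1, batch_size)
    ]
    return statistics.stdev(means)

std_2 = batch_mean_std(losses, 2)
std_8 = batch_mean_std(losses, 8)
print(f"batch size 2: {std_2:.3f}")
print(f"batch size 8: {std_8:.3f}")
```

The spread of the batch-mean loss shrinks roughly with the square root of the batch size, so going from 2 to 8 examples per batch should about halve the jitter in the logged training loss.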

In [None]:
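# Extract the log entries for each loss, keeping the epoch at which each was recorded
# (training loss is logged every 10 steps, validation loss once per epoch)
train_logs = [log for log in trainer.state.log_history if "loss" in log]
eval_logs  = [log for log in trainer.state.log_history if "eval_loss" in log]

# Plot the training and validation loss against a shared epoch axis
plt.plot([log["epoch"] for log in train_logs], [log["loss"] for log in train_logs], label="Training Loss")
plt.plot([log["epoch"] for log in eval_logs], [log["eval_loss"] for log in eval_logs], label="Validation Loss")
plt.xlabel("Epochs")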
plt.ylabel("Loss")
plt.title("Training & Validation Loss")
plt.legend()
plt.show()