<a href="https://colab.research.google.com/github/honicky/character-extraction/blob/main/Character_Extractor_T5_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character Extractor - T5 LoRA

This notebook is part of an exploration of how to extract the names of characters from stories cheaply and easily. I am doing this as part of a little project to generate and co-author illustrated childrens stories.  One of the challenges for this problem is to generate consistent characters, so I am using it as an excuse to learn about different approaches including using off-the-shelf models with manual and automated (DSPy) prompting, fine-tuning small models and just using god models like GTP4 or Claude Opus.

In this notebook I fine tune a T5-based encoder-decoder model to extract the character names from short stories.  I have already generated a bunch of story-character pairs using GTP4.  I also used the [loubnabnl/stories_oh_children](https://huggingface.co/datasets/loubnabnl/stories_oh_children) and extracted the story character-names using GPT 3.5-turbo.

If I'm being completely honest, GPT 3.5-turbo is quite fast and low cost for this application, so fine-tuning even a small model like this is overkill, but most of the point is to learn about fine-tuning, so lets give it a try.

## Getting started

I decided to use the `transformers.Trainer` and `peft` infrastructure to do the fine-tune because I want to understand in detail how things are working, and this looks (at first glance) like a mature but somewhat low-level set of libraries to get started with.

In [1]:
!pip install datasets peft evaluate wandb
# !pip install datasets peft evaluate

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting peft
  Downloading peft-0.10.0-py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m13.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wandb
  Downloading wandb-0.17.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m30.3 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━

## Weights and Biases

I also want to learn how to use Weights and Biases at least to track what I'm doing, if not to control the training process, so lets set up `wandb`

In [2]:
import wandb
from google.colab import userdata

wandb.login(key=userdata.get("WANDB_API_KEY"))

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## Load the datasets and preprocess them

Lets load the datasets I created, combine them, and do a train-validate-test split

In [3]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

honicky_dataset = load_dataset('honicky/short_childrens_stories_with_labeled_character_names')
loubnabnl_dataset = load_dataset('honicky/stories_oh_children_with_character_names')

Downloading readme:   0%|          | 0.00/815 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.07M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2588 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/408 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.81M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [4]:
honicky_dataset

DatasetDict({
    train: Dataset({
        features: ['story', 'characters'],
        num_rows: 2588
    })
})

In [5]:
loubnabnl_dataset["train"]

Dataset({
    features: ['train', 'characters'],
    num_rows: 5000
})

In [6]:
loubnabnl_dataset["train"]["train"][0].keys()

dict_keys(['category', 'completion', 'prompt_young_children_story', 'token_length'])

In [7]:
loubnabnl_dataset = loubnabnl_dataset.map(lambda example: {'story': example['train']['completion']})

# Flatten the nested structure
loubnabnl_dataset = loubnabnl_dataset.remove_columns('train').map(lambda example: {'story': example['story'], 'characters': example['characters']})

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [8]:
loubnabnl_dataset

DatasetDict({
    train: Dataset({
        features: ['characters', 'story'],
        num_rows: 5000
    })
})

In [9]:
combined_dataset = concatenate_datasets([honicky_dataset["train"], loubnabnl_dataset["train"]])

In [10]:
combined_dataset

Dataset({
    features: ['story', 'characters'],
    num_rows: 7588
})

In [11]:
# Split into training and test + validation first (95% train, 5% test+val)
train_test_split = combined_dataset.train_test_split(test_size=0.05, seed=42)

# Split the test+validation set into test and validation (50% test, 50% validation)
test_val_split = train_test_split['test'].train_test_split(test_size=0.5, seed=42)

# Now assemble the final splits
final_splits = DatasetDict({
    'train': train_test_split['train'],
    'test': test_val_split['test'],
    'validation': test_val_split['train']  # Since we split test into two halves
})

In [12]:
final_splits

DatasetDict({
    train: Dataset({
        features: ['story', 'characters'],
        num_rows: 7208
    })
    test: Dataset({
        features: ['story', 'characters'],
        num_rows: 190
    })
    validation: Dataset({
        features: ['story', 'characters'],
        num_rows: 190
    })
})

## Use the PEFT quickstart to get started

I'm completely new to LoRAs, so lets just start by following along and cutting-and-pasting from the `peft` library tour.

https://huggingface.co/docs/peft/en/quicktour


In [13]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)

In [14]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")



config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [15]:
from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 2,359,296 || all params: 785,509,376 || trainable%: 0.30035236651331837


## Batch size and eval accumulation

I itereated a lot on the batch-size to get the model to fit in memory.  I decided to use a A100 because the larger, faster GPU is easier to iterate with.  I presume that if I knew what I was doing, I could get this done with a much smaller GPU.

I went through a lot of iteration on the batch size, balancing out of memory errors and speed.  It seems like for a large training run this would matter because you can get a higher utilization rate on the GPU, but the gains seem to be in the 10-30% range between batch size 2 and 8, so for a small run like this, maybe it doesn't matter so much?

### Make sure you use the right Trainer and Collator!

I cut-and-pasted code that used a generic Trainer and Collator and that caused the `Trainer` to make incorrect assuptions about the evaluation data structures.  This in turn led to a big waste of time on figuring out why I was running out of memory during the eval step. I had to crank down the `eval_accumulation_steps` to 50, which allowed the trainer to compute whatever weird, misshapen evals it was doing, but eventually it crashed during another part of the eval step.

These crashes were particularly confusing because they just result in the notebook kernel resetting, and when I spun up a RunPod node, I still ended up with a `Resetting` message and nothing else. I looks like when `python` runs out of system memory, it fails pretty gracelessly in this case.  

This was why I wanted to do things by hand: I learned a bunch about Seq2Seq models, the different Trainers, sources of memory errors, etc. Annoying though. Beware copy-pasta!


In [16]:
# start a new wandb run to track this script

training_setup_hyperparameters = dict(
    learning_rate=1e-3,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=250,
    save_steps=250,
    # metric_for_best_model = 'accuracy',
    # load_best_model_at_end=True,
    # eval_accumulation_steps=50,
    include_inputs_for_metrics=False,
    fp16=False,
    predict_with_generate=True
)

wandb.init(
    # set the wandb project where this run will be logged
    project="t5_target_finetune_for_character_extraction",

    # track hyperparameters and run metadata
    config=training_setup_hyperparameters
)

[34m[1mwandb[0m: Currently logged in as: [33mhonicky[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [17]:
import os

# save your trained model checkpoint to wandb
os.environ["WANDB_LOG_MODEL"]="true"

# turn off watch to log faster
os.environ["WANDB_WATCH"]="false"

In [18]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    report_to="wandb",
    output_dir="/content/models",
    **training_setup_hyperparameters
)

In [19]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-large')




spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [20]:
def tokenize_function(examples):
    # T5 expects a certain input format that has a task description in the begining.
    # I've actually read that this is not necessary and might be confusing, but
    # other places suggested it, so I'm adding "extract characters: " to the beginning
    # of the story to extract the characters from
    input_texts = ["extract characters: " + story for story in examples["story"]]

    # Tokenize the inputs and labels
    model_inputs = tokenizer(input_texts, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(examples["characters"], max_length=512, truncation=True, padding="max_length")

    # The labels need to be what the model expects
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs


In [21]:
tokenized_datasets = final_splits.map(tokenize_function, batched=True).remove_columns(["story", "characters"])


Map:   0%|          | 0/7208 [00:00<?, ? examples/s]

Map:   0%|          | 0/190 [00:00<?, ? examples/s]

Map:   0%|          | 0/190 [00:00<?, ? examples/s]

In [22]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7208
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 190
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 190
    })
})

In [23]:
!nvidia-smi

Wed May 15 20:59:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   36C    P8              16W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Define the evaluation metrics: precision, recall, f1

We will use mean precision, recall and f1 scores over all of the stories as my evaluation metrics. This means that we are insensitive to the order of characters in the output. I also strip punctuation and whitespace from the begining and end of each character name to be somewhat insensitive to the output formatting.  

In [24]:
import string

def strip_punctuation_whitespace(text):
  # Define a set of characters to strip: all punctuation and whitespace characters
  strip_chars = set(string.punctuation + string.whitespace)

  # Strip from the beginning
  start = 0
  while start < len(text) and text[start] in strip_chars:
    start += 1

  # Strip from the end
  end = len(text)
  while end > 0 and text[end-1] in strip_chars:
    end -= 1

  # Return the stripped string
  return text[start:end]

def metrics_from_strings(true_labels_str, predicted_labels_str):
    # Parse the strings to remove whitespace and split by commas
    true_labels = [strip_punctuation_whitespace(label) for label in true_labels_str.split(',')]
    predicted_labels = [strip_punctuation_whitespace(label) for label in predicted_labels_str.split(',')]

    # Calculate the intersection of true and predicted labels for correctly predicted labels
    correct_predictions = set(true_labels).intersection(predicted_labels)

    # Precision: correctly predicted positive / all predicted positive
    if len(predicted_labels) == 0:
        precision = 0
    else:
        precision = len(correct_predictions) / len(predicted_labels)

    # Recall: correctly predicted positive / all actual positive
    if len(true_labels) == 0:
        recall = 0
    else:
        recall = len(correct_predictions) / len(true_labels)

    # F1 Score: 2 * (precision * recall) / (precision + recall)
    if precision + recall == 0:
        return 0, 0, 0
    f1 = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f1


In [25]:

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # decode preds and labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # print(f"decoded_preds: {decoded_preds}")
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # print(f"decoded_labels: {decoded_labels}")

    metrics = [
      metrics_from_strings(pred, label)
      for pred, label in zip(decoded_preds, decoded_labels)
    ]

    precisions, recalls, f1s = zip(*metrics)

    return {
        "precision": np.mean(precisions),
        "recall": np.mean(recalls),
        "f1": np.mean(f1s),
    }

## Ok go!

In [26]:
from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq
import logging

logger = logging.getLogger("transformers")
logger.setLevel(logging.DEBUG)  # Set transformers logging to DEBUG

# os.environ["WANDB_DISABLED"] = "true"

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer),
    compute_metrics=compute_metrics,
)



Setting `WANDB_LOG_MODEL` from true to `end` instead


In [None]:
trainer.train()

## Woo hoo!

It looks like in this case, the 1500-step checkpoint had the best F1 score, and both the training and test loss plateaued at that point, so we probably can use that checkpoint.  I saved all of the checkpoints to Google Drive for now.  I don't know what the standard way to do this is: should I use WandB? Upload all of the checkpoints to HuggingFace? S3?

Lets load 1500-step checkpoint and run it on the test data set. I'm using a L4 GPU to run do evals, so the inference will be slower than with a A100.


In [27]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

model_path = "/content/drive/MyDrive/Learning/story-time/t5-large-lora/models/checkpoint-1500"
config = PeftConfig.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, model_path)
# tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer = T5Tokenizer.from_pretrained('t5-large')

model = model.to("cuda")

# This is really verbose, but interesting to look at, so uncomment if you're
# interested in the model structure
# model.eval()

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--google--flan-t5-large/snapshots/0613663d0d48ea86ba8cb3d7a44f0f65dc596a2a/config.json
Model config T5Config {
  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 32128
}

loading weight

In [28]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer),
    compute_metrics=compute_metrics,
)

Setting `WANDB_LOG_MODEL` from true to `end` instead


For some reason, `trainer.evaluate()` gives me crap results, but if I use `trainer.predict()` then the results look good.  It may be related to the `eval_dataset` parameter... I'm to lazy to figure out why.

In [31]:
import torch


with torch.no_grad():
  encoded_predictions = trainer.predict(tokenized_datasets["test"])
  test_predictions = tokenizer.batch_decode(encoded_predictions.predictions, skip_special_tokens=True)


***** Running Prediction *****
  Num examples = 190
  Batch size = 6


In [35]:
encoded_predictions.metrics

{'test_loss': 0.0036612835247069597,
 'test_precision': 0.8756599832915622,
 'test_recall': 0.886578947368421,
 'test_f1': 0.8749092428039797,
 'test_runtime': 75.6676,
 'test_samples_per_second': 2.511,
 'test_steps_per_second': 0.423}

In [40]:
for i in range(5):
  print(f"predicted: {test_predictions[i]} --- true labels: {final_splits['test']['characters'][i]}")


predicted: Anna,Elias --- true labels: Anna,Elias
predicted: Lola,Bolt --- true labels: Lola,Bolt
predicted: Max,Mrs. Johnson,Mr. Peters --- true labels: Max,Yu Qiuyu,Mrs. Johnson,Mr. Peters
predicted: Jeff --- true labels: Jeff,Friend
predicted: Zara,Avicenna --- true labels: Zara,Avicenna


The precision, recall and f1 scores are very close to the validation set, so we are doing well on metrics.  Inspeciting the first five values in the test set shows that we are doing well.  We missed a character in the third story, and it looks like GPT had a character called "Friend" in the fourth, so it is a bit unclear which one is better in that case.