# Fine-Tuning a Pre-Trained Model

This notebook fine-tunes a pre-trained T5 model. Please refer to the `README.md` in the parent `forge/` directory for more details.

Potential models for use:
- sysresearch101/t5-large-finetuned-xsum

## Step 1: Read, Tokenize and Encode Data

The block below reads the training data into memory and tokenizes every input/output (IO) pair for the fine-tuning process.

- TRAINING_FILE (`training_set.jsonl`): The name of the JSONL file, from the `data/training` directory, that will be used for training.
- MAXIMUM_SIZE (`0`): The maximum amount of data to read into memory, where a value of `0` reads all data.
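
The `MAXIMUM_SIZE` convention can be sketched as follows. This is a minimal illustration, not the code in `scripts/`: the helper name `read_jsonl_limited` is hypothetical, and the sketch assumes that the size counts records (lines) rather than bytes, which the description above does not specify.

```python
import json
from itertools import islice


def read_jsonl_limited(lines, maximum_size=0):
    """Parse JSONL lines into dicts, keeping at most `maximum_size` records.

    A `maximum_size` of 0 keeps every record. (Hypothetical helper; it assumes
    the limit counts records, not bytes.)
    """
    # Skip blank lines so a trailing newline does not raise a JSON error
    records = (json.loads(line) for line in lines if line.strip())
    # islice(..., None) yields everything; islice(..., n) stops after n items
    limit = maximum_size if maximum_size > 0 else None
    return list(islice(records, limit))


# Placeholder records for illustration only
sample_lines = ['{"input": "a", "output": "b"}', '{"input": "c", "output": "d"}', ""]
all_records = read_jsonl_limited(sample_lines, maximum_size=0)
first_record = read_jsonl_limited(sample_lines, maximum_size=1)
```

Any iterable of strings works, so an open file handle can be passed in place of the list.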

In [1]:
import asyncio
import nest_asyncio
from transformers import T5Tokenizer
from scripts.tokenize_encode import tokenize_and_encode


# Load the T5 tokenizer used to preprocess the training data
tokenizer = T5Tokenizer.from_pretrained("t5-large", model_max_length=1024)


# Async entry point: read the training file and tokenize its contents
async def main():
    # Read in the training data JSONL file
    TRAINING_FILE = "training_set.jsonl"
    MAXIMUM_SIZE = 0

    # Tokenize and encode each IO pair
    return await tokenize_and_encode(TRAINING_FILE, int(MAXIMUM_SIZE), tokenizer)


# Enable nested event loops
nest_asyncio.apply()
prepared_data = asyncio.run(main())

2023-07-10 22:55:59.935 | INFO     | scripts.tokenize_and_encode:tokenize_and_encode:26 - Sample of the tokenized and encoded data: {'input_ids': tensor([282,   8, 262,  ...,   0,   0,   0]), 'labels': tensor([  3,  18, 262,  ...,   0,   0,   0])}
2023-07-10 22:55:59.936 | INFO     | scripts.tokenize_and_encode:tokenize_and_encode:27 - Total count of tokenized and encoded data: 50
2023-07-10 22:55:59.936 | SUCCESS  | scripts.tokenize_and_encode:tokenize_and_encode:28 - The data has been tokenized and encoded into memory!


## Step 2: Setup Training Parameters

The parameters below are listed with their recommended default values in parentheses. Changing a value produces a differently trained model. Feel free to experiment with each parameter based on the locally available datasets, compute and storage limitations, and other factors.

- EPOCHS (`3`): The number of times the training loop will iterate over the entire dataset. Increasing the number of epochs can potentially improve the model's performance by allowing it to see the data multiple times, but too many epochs may lead to over-fitting.
- LEARNING_RATE (`2e-5`): The step size at which the optimizer adjusts the model's parameters during training. A higher learning rate can result in faster convergence but may cause instability or overshooting. A lower learning rate can lead to slower convergence but more stable training.
- TOTAL_STEPS (`len(prepared_data) * EPOCHS`): The total number of steps given to the learning rate scheduler, calculated by multiplying the number of training examples by the number of epochs. This value sets the length of the learning rate schedule; it does not change how many iterations the training loop runs, which is determined by EPOCHS and the batch size. Because the scheduler in Step 3 advances once per batch, this value matches the number of steps actually taken only when the batch size is 1; with a larger batch size, training ends before the schedule completes.
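
A quick check of the arithmetic, using the dataset size reported in the Step 1 log (50 examples) and the batch size of 8 from Step 3. The scheduler in Step 3 is stepped once per batch, so the number of scheduler steps actually taken can be smaller than `TOTAL_STEPS`. The variable names below are illustrative.

```python
import math

# Values taken from this notebook: 50 tokenized examples (Step 1 log),
# 3 epochs, and the batch size of 8 used in Step 3
num_examples = 50
epochs = 3
batch_size = 8

# TOTAL_STEPS as defined above: one step per example per epoch
total_steps = num_examples * epochs

# The DataLoader yields ceil(num_examples / batch_size) batches per epoch
# (assuming the default drop_last=False), and the scheduler steps once per batch
batches_per_epoch = math.ceil(num_examples / batch_size)
optimizer_steps = batches_per_epoch * epochs

print(total_steps)      # 150
print(optimizer_steps)  # 21
```

If the schedule should finish exactly when training ends, one option is to pass `len(data_loader) * EPOCHS` as `total_steps` once the data loader exists.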


In [None]:
# Configure the training parameters
EPOCHS = 3
LEARNING_RATE = 2e-5
TOTAL_STEPS = len(prepared_data) * EPOCHS

## Step 3: Train the Model

Using the previous parameters, the model will be trained with the prepared data. For more details, please see the `scripts/` directory.

Learning rate scheduling strategy notes:

- max_lr: The peak learning rate reached during training - sets the upper bound of the learning rate range
- total_steps: The total number of steps in the schedule - determines how the learning rate and momentum change over the course of training
- div_factor: The factor by which max_lr is divided to get the initial learning rate (`initial_lr = max_lr / div_factor`) - sets the learning rate at the start of training
- final_div_factor: The factor by which the initial learning rate is divided to get the minimum learning rate (`min_lr = initial_lr / final_div_factor`) - sets the learning rate at the end of the training process
- pct_start: The fraction of the total number of steps (a value between 0 and 1) used for the warm-up phase - determines the portion of the training where the learning rate gradually increases
- anneal_strategy: The strategy used for annealing the learning rate and momentum during training - set to "cos" for cosine annealing
- cycle_momentum: Whether to cycle the momentum between base_momentum and max_momentum during training, inversely to the learning rate
- base_momentum: The lower momentum boundary during training
- max_momentum: The upper momentum boundary during training
- epochs: The number of epochs to train the model - used together with steps_per_epoch to infer the total number of steps when total_steps is not given
- steps_per_epoch: The number of steps per epoch - used together with epochs to infer the total number of steps when total_steps is not given
- warmup_steps: The number of warm-up steps where the learning rate gradually increases - helps the model to stabilize at the beginning of training; it is not passed to the scheduler in the cell below, where the warm-up length follows from pct_start

In [3]:
import torch
from transformers import T5ForConditionalGeneration
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from tqdm import tqdm
from loguru import logger


class PreparedDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_ids = self.data[idx]["input_ids"]
        labels = self.data[idx]["labels"]
        input_ids_tensor = input_ids.clone().detach()
        labels_tensor = labels.clone().detach()
        return {
            "input_ids": input_ids_tensor,
            "labels": labels_tensor,
        }


def train_t5_model(model, data_loader, optimizer, scheduler, device, epochs):
    model.to(device)
    model.train()
    for epoch in range(epochs):
        total_loss = 0

        progress_bar = tqdm(data_loader, desc=f"Epoch {epoch + 1}", unit="batch")
        for batch in progress_bar:
            try:
                batch_input_ids = batch["input_ids"].to(device)
                batch_labels = batch["labels"].to(device)

                # Forward pass
                outputs = model(input_ids=batch_input_ids, labels=batch_labels)
                loss = outputs.loss
                total_loss += loss.item()

                # Backward pass and optimization
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                scheduler.step()

                progress_bar.set_postfix({"Loss": loss.item()})

            except Exception as e:
                logger.exception(f"An error occurred during training: {e}")

        average_loss = total_loss / len(data_loader)
        logger.info(f"Epoch {epoch + 1} - Average Loss: {average_loss}")


# Convert the dataset into a PyTorch DataLoader
batch_size = 8
dataset = PreparedDataset(prepared_data)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Initialize the T5 model
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Set up the optimizer and scheduler
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=LEARNING_RATE,
    total_steps=TOTAL_STEPS,
    div_factor=2.0,
    final_div_factor=1e4,
    pct_start=0.1,
    anneal_strategy="cos",
    cycle_momentum=True,
    base_momentum=0.85,
    max_momentum=0.95,
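    # total_steps takes precedence in OneCycleLR, so epochs and steps_per_epoch
    # are not used to size the schedule here
    epochs=EPOCHS,
    steps_per_epoch=len(data_loader),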
)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_t5_model(model, data_loader, optimizer, scheduler, device, EPOCHS)

# Save the trained model
output_model = "../models/t5/trained"
model.save_pretrained(output_model)
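
One detail worth checking in the prepared data: the sample in the Step 1 log shows `labels` padded with `0`, which is T5's pad token ID. Hugging Face's `T5ForConditionalGeneration` skips label positions set to `-100` when computing the loss, so padding labeled `0` still contributes to it. A common adjustment is to replace padding in the labels with `-100`. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper and the example label IDs are made up for illustration.

```python
PAD_TOKEN_ID = 0     # T5's pad token ID
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding(labels, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Return a copy of `labels` with padding positions set to `ignore_index`."""
    return [ignore_index if token == pad_token_id else token for token in labels]


# Made-up label IDs: four real tokens followed by three padding positions
example_labels = [3, 18, 262, 1, 0, 0, 0]
masked = mask_padding(example_labels)
print(masked)  # [3, 18, 262, 1, -100, -100, -100]
```

With tensors, the same effect can be had by assigning `-100` through a boolean mask, for example `labels[labels == 0] = -100` on a cloned label tensor.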


## Step 4: Test Fine-Tuned T5 Model

The block below validates the fine-tuned T5 model against a set of inputs and expected outputs, printing each generated output so that a user can inspect it visually. The defaults can be changed to match the models and datasets available in the local development environment.

- FINE_TUNED_MODEL (`../models/t5/trained`): The T5 model to be used for the input and output generation task.
- INPUT_TEXT_ARRAY (`../data/training/validation_set.jsonl`): A validated set of summary-evaluation pairs for visual inspection. The default file has 20 pairs for testing.

In [None]:
from transformers import T5ForConditionalGeneration
import torch
from scripts.utils.file_utils import jsonl_read

# Load the saved fine-tuned model (the path written by Step 3)
FINE_TUNED_MODEL = "../models/t5/trained"
fine_tuned_model = T5ForConditionalGeneration.from_pretrained(FINE_TUNED_MODEL)

# Set the device for inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fine_tuned_model.to(device)

# Use the model on the validation data set (`tokenizer` is the one loaded in Step 1)
INPUT_TEXT_ARRAY = "../data/training/validation_set.jsonl"
input_pairs_array = await jsonl_read(INPUT_TEXT_ARRAY)
for key, input_pairs in enumerate(input_pairs_array):
    input_text = input_pairs["summary"]
    input_ids = tokenizer.encode(
        input_text, padding="max_length", truncation=True, return_tensors="pt"
    ).to(device)
    outputs = fine_tuned_model.generate(input_ids, max_length=512)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    expected_text = input_pairs["evaluation"]
    print(
        f"({key + 1}) Input Text: {input_text}\n\tExpected Text: {expected_text}\n\tGenerated Text: {generated_text}"
    )