# The Forge Model Fine Tuning: T5x Variants

This notebook is geared specifically towards fine tuning T5x variant models using PyTorch and Hugging Face Transformers. It contains the fine tuning and monitoring scripts used to generate the Forge's training data set and to produce the production Forge model.

This notebook includes options, through user input, to direct the fine tuning towards training data generation or bullet generation models.

## Step 1: Instantiate Global Parameters

This step must be completed before any of the steps that follow it. These parameters will be used for instantiating, fine tuning, and prompting the model.

When selecting a model and tokenizer, ensure the tokenizer has the same signature and base model family as the selected model. For example, if using google/flan-t5-xl as the model, the tokenizer should also be a FLAN-T5 tokenizer, such as google/flan-t5-xl or google/flan-t5-base.
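
One quick sanity check for a model and tokenizer pairing is to confirm that the tokenizer does not produce ids beyond the model's embedding table. The sketch below is only an illustration: `tokenizer_fits_model` is a hypothetical helper that is not part of this repository, and the sizes passed to it are example values. Passing this check is necessary but not sufficient, since two unrelated tokenizers can share a vocabulary size.

```python
def tokenizer_fits_model(tokenizer_vocab_size: int, model_vocab_size: int) -> bool:
    """Return True when every tokenizer id has a row in the model's embedding table."""
    return tokenizer_vocab_size <= model_vocab_size


# With real objects this would be called as (not executed here):
#   tokenizer_fits_model(len(tokenizer), model.config.vocab_size)
# Illustrative sizes only: T5 tokenizers commonly expose 32,100 ids while
# the checkpoints pad their embedding table to 32,128 rows.
print(tokenizer_fits_model(32100, 32128))
# A tokenizer with a larger vocabulary than the model would fail the check
print(tokenizer_fits_model(50257, 32128))
```

If the check fails, the tokenizer can emit ids the model has no embedding for, which typically surfaces as an index error during training or generation.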

Some T5x variant checkpoints to try out for fine tuning or inference include, but are not limited to:
- [MBZUAI/LaMini-T5-738M](https://huggingface.co/MBZUAI/LaMini-T5-738M)
- [t5-base](https://huggingface.co/t5-base)
- [google/t5-efficient-tiny](https://huggingface.co/google/t5-efficient-tiny)

In [None]:
# Path of the pre-trained model that will be used
model_path = input("Input a checkpoint model's Hugging Face repository or a relative path")
# Path of the pre-trained model tokenizer that will be used
# Must match the model checkpoint's signature
tokenizer_path = input("Input a tokenizer's Hugging Face repository or a relative path")
# Max length of tokens a user may enter for summarization
# Increasing this beyond 512 may increase compute time significantly
max_input_token_length = 512
# Max length of tokens the model should output for the summary
# Approximately the number of tokens it may take to generate a bullet
max_output_token_length = 512
# Beams to use for beam search algorithm
# Increased beams means increased quality, but increased compute time
number_of_beams = 6
# Number of examples per batch during training
# Larger batch sizes require more memory, but can speed up training
train_batch_size = 1
# Number of full passes through the entire training dataset
# More epochs can lead to better performance, but risk over-fitting
train_epochs = 6
# Number of examples per batch during validation
# Larger batch sizes require more memory, but can speed up the validation process
valid_batch_size = 1
# Number of full passes through the entire validation dataset
# Typically kept to a single epoch as the validation set does not need to be repeatedly passed
val_epochs = 1
# Affects how quickly or slowly a model learns
# Too high can cause instability, too low can cause slow learning
learning_rate = 1e-4
# Random seed to ensure reproducibility
# Using the same seed will yield the same model given the same data and training process
seed = 8
# Multiplier to penalize tokens that have already been generated
# Higher values discourage repetition in the generated text; 1 applies no penalty
repetition_penalty = 1
# Exponent applied to sequence length when scoring beams during beam search
# Higher values encourage longer sequences; 0 applies no length normalization
length_penalty = 0
# The number of steps to take before the gradient is averaged and applied
# Helps in stabilizing training and requires less memory
gradient_accumulation_steps = 1
# Weight decay introduced to the optimizer to prevent over-fitting
# Regularization strategy by adding a small penalty, typically the L2 norm of the weights
weight_decay = 0.1
# Small constant to prevent any division by zero in the implementation (Adam)
adam_epsilon = 1e-8
# Number of steps for the warmup phase
# Helps in avoiding very high and undesirable values of gradients at the start of training
warmup_steps = 4
# The split between the training and validation data
training_validation_split = 0.85
# Scales logits before soft-max to control randomness
# Lower values (~0) make output more deterministic
temperature = 0.5
# Limits generated tokens to top K probabilities
# Reduces chances of rare word predictions
top_k = 50
# Applies nucleus sampling, limiting token selection to a cumulative probability
# Creates a balance between randomness and determinism
top_p = 0.90
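
To make the last three parameters more concrete, the following is a simplified, pure-Python sketch of how temperature, top-k, and top-p narrow a token distribution before a token is sampled. `filter_distribution` and the toy logits are assumptions made for illustration; the generation code in Hugging Face Transformers operates on tensors and may differ in detail.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=50, top_p=0.90):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, then renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    ranked_total = sum(probs[i] for i in ranked)
    ranked_probs = {i: probs[i] / ranked_total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += ranked_probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    kept_total = sum(ranked_probs[i] for i in kept)
    return {i: ranked_probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(toy_logits, temperature=0.5, top_k=3, top_p=0.90))
```

With these toy logits, top-k drops the least likely token and top-p then stops after the two most likely ones, so only token indices 0 and 1 remain as sampling candidates.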

## Step 2: Instantiate the T5 Variant Model

The model and tokenizer must come from Hugging Face or your local models directory.

In [None]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from loguru import logger

# Load the pre-trained model and tokenizer
logger.info(f"Instantiating tokenizer from {tokenizer_path}, and model from {model_path}")
tokenizer = T5Tokenizer.from_pretrained(f"{tokenizer_path}", model_max_length=max_input_token_length, add_special_tokens = False)
input_model = T5ForConditionalGeneration.from_pretrained(f"{model_path}")
logger.info(f"Loading {model_path}...")
# Set device to be used based on GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Model is sent to device for use
model = input_model.to(device) # type: ignore
logger.success("Instantiated target tokenizer and model")

The block below optionally saves the pre-trained and/or raw model checkpoint to this repository's local _forge/models/_ directory. This step is optional, but it allows the fine tuning in [Step 3](#step-3-fine-tune-t5-checkpoint-model) to start from a model in your local directory, thus providing offline usage and fine tuning later on. E.g., if you first save google/flan-t5-xl with this block, you can later input '../models/google_flan-t5-xl' as the model path for fine tuning. The script below works on any model and tokenizer, but the fine tuning script in [Step 3](#step-3-fine-tune-t5-checkpoint-model) depends on the usage of a T5x variant.

In [None]:
# Save the model and tokenizer to your specified directory
logger.info(f"Downloading tokenizer from {tokenizer_path}, and model from {model_path}")
tokenizer.save_pretrained(f"../models/{tokenizer_path.replace('/', '_')}")
input_model.save_pretrained(f"../models/{model_path.replace('/', '_')}") # type: ignore
logger.success("Successfully downloaded target tokenizer and model")
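
The save cell derives the local directory name from the repository id by replacing each `/` with `_`. The small helper below reproduces that naming so the resulting path can be predicted before entering it on a later run; `local_checkpoint_dir` is a hypothetical name used only for this illustration.

```python
def local_checkpoint_dir(repo_id: str, models_root: str = "../models") -> str:
    """Mirror the save cell's naming: '/' in the repository id becomes '_'."""
    return f"{models_root}/{repo_id.replace('/', '_')}"


print(local_checkpoint_dir("google/flan-t5-xl"))  # ../models/google_flan-t5-xl
print(local_checkpoint_dir("t5-base"))            # ../models/t5-base
```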

## Step 3: Fine Tune T5 Checkpoint Model

If the target model is already fine-tuned or ready for manual testing, skip to [Step 4](#step-4-fine-tuned-model-manual-testing).
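
The fine tuning cell in this step counts a penalty for every decoded output that fails `re.fullmatch` against `bullet_pattern` from `scripts.constants`. That pattern is not shown in this notebook, so the sketch below uses a made-up stand-in, `example_bullet_pattern`, and hypothetical decoded strings purely to show how the full-match check behaves.

```python
import re

# Hypothetical stand-in for bullet_pattern: a leading "- ", then any text
# that does not end in a period. The real pattern in scripts.constants may differ.
example_bullet_pattern = r"- .+[^.]"


def count_format_violations(texts, pattern):
    """Count texts that do not match the pattern from start to end."""
    return sum(1 for text in texts if re.fullmatch(pattern, text) is None)


decoded = [
    "- Reduced build time 30%; saved 12 hrs per release",
    "Reduced build time.",
    "- Trained 40 analysts.",
]
print(count_format_violations(decoded, example_bullet_pattern))  # 2
```

Because `re.fullmatch` must consume the entire string, an output that is merely close to the format (a missing prefix or a trailing period in this example) counts as a violation just like an unrelated string would.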

In [None]:
# Fine tuning scripts
import signal
import re
import traceback
import numpy as np
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp.grad_scaler import GradScaler
from rouge_score import rouge_scorer
from transformers import get_linear_schedule_with_warmup

from scripts.file_utils import load_jsonl_data
from scripts.constants import *
from scripts.rich_logger import training_table as table, live_refresher

model_output_directory = "../models/" + input(
    "What name would you like to give the fine-tuned model?"
)

prompt_prefix_option = input(
    "Type the number to choose a prompt prefix type: (1) Bullet Prompt Training or (2) Data Creation Training"
)
prompt_prefix = (
    bullet_data_creation_prefix if prompt_prefix_option == "2" else bullet_prompt_prefix
)
data_set = (
    "../data/training/data_creation_set.jsonl"
    if prompt_prefix_option == "2"
    else "../data/training/training_validation_set.jsonl"
)

data = load_jsonl_data(
    data_set,
    prompt_prefix,
    isDataFrame=True,
)


# Creating a custom dataset for reading the dataset and loading it into the dataloader
# to pass it to the neural network for fine tuning the model
class CustomDataset(Dataset):
    def __init__(
        self, dataframe, tokenizer, source_len, target_len, source_text, target_text
    ):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = target_len
        self.target_text = self.data[target_text]
        self.source_text = self.data[source_text]

    def __len__(self):
        return len(self.target_text)

    def __getitem__(self, index):
        source_text = str(self.source_text[index])
        target_text = str(self.target_text[index])

        # Normalize whitespace so each text is a single space-separated string
        source_text = " ".join(source_text.split())
        target_text = " ".join(target_text.split())

        # padding="max_length" already pads every example to max_length,
        # so the deprecated pad_to_max_length argument is not needed
        source = self.tokenizer.batch_encode_plus(
            [source_text],
            max_length=self.source_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer.batch_encode_plus(
            [target_text],
            max_length=self.summ_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        source_ids = source["input_ids"].squeeze()
        source_mask = source["attention_mask"].squeeze()
        target_ids = target["input_ids"].squeeze()

        return {
            "source_ids": source_ids.to(dtype=torch.long),
            "source_mask": source_mask.to(dtype=torch.long),
            "target_ids": target_ids.to(dtype=torch.long),
            "target_ids_y": target_ids.to(dtype=torch.long),
        }


# Generates a penalty for not complying to bullet formatting
def format_penalty(outputs, tokenizer, format_pattern):
    total_penalty = 0.0
    logits = outputs.logits
    # Converting the logits to token ids
    token_ids = torch.argmax(logits, dim=-1)
    # Decoding the token ids to text
    decoded_outputs = [
        tokenizer.decode(token_ids[i], skip_special_tokens=True)
        for i in range(token_ids.shape[0])
    ]

    for text in decoded_outputs:
        match = re.fullmatch(format_pattern, text)
        # If the output does not match the desired format exactly, add a penalty
        if not match:
            total_penalty += 1.0

    return torch.tensor(total_penalty, device=logits.device)


# Function to be called for training with the parameters passed from the main function
def train(epoch, tokenizer, model, device, loader, optimizer, scheduler):
    # Create a GradScaler object for mixed precision training
    # Optionally define the GradScaler based on CUDA availability
    use_cuda = torch.cuda.is_available()
    scaler = GradScaler(enabled=use_cuda) if use_cuda else None

    # Training logger refresh flag
    table.switch_epoch_refresh()
    # Stepping through training batches
    for step, data in enumerate(loader, 0):
        y = data["target_ids"].to(device, dtype=torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data["source_ids"].to(device, dtype=torch.long)
        mask = data["source_mask"].to(device, dtype=torch.long)

        outputs = model(
            input_ids=ids,
            attention_mask=mask,
            decoder_input_ids=y_ids,
            labels=lm_labels,
        )
        loss = outputs[0]

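        # Add a penalty to the loss for outputs that don't match the format
        # The penalty is a constant with respect to the model weights, so it raises
        # the total loss value but contributes no gradient of its own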
        format_loss = format_penalty(outputs, tokenizer, bullet_pattern)
        total_loss = loss + format_loss

        if table.get_epoch_refresh():
            # Refresh table once per epoch
            table.refresh_table(epoch, loss)

        if scaler:  # If CUDA is available and scaler is defined
            scaler.scale(total_loss).backward() # type: ignore
        else:  # If no CUDA, follow standard backward pass
            total_loss.backward()

        # Apply an optimizer update once every gradient_accumulation_steps batches
        if (step + 1) % gradient_accumulation_steps == 0:
            if scaler:
                scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # type: ignore
            if scaler:
                scaler.step(optimizer)
                scaler.update()
            else:
                optimizer.step()

            # Adjust the learning rate once per optimizer update so the schedule
            # spans the num_training_steps passed to the scheduler
            scheduler.step()
            # Clear gradients after the optimizer step
            optimizer.zero_grad()


# Function to evaluate model for predictions and compute ROUGE scores
def validate(tokenizer, model, device, loader):
    predictions = []
    actuals = []

    # Initialize the rouge scorer
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = []

    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data["target_ids"].to(device, dtype=torch.long)
            ids = data["source_ids"].to(device, dtype=torch.long)
            mask = data["source_mask"].to(device, dtype=torch.long)

            generated_ids = model.generate(
                input_ids=ids,
                attention_mask=mask,
                max_length=max_output_token_length,
                num_beams=number_of_beams,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                early_stopping=True,
            )
            preds = [
                tokenizer.decode(
                    g, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                for g in generated_ids
            ]
            targets = [
                tokenizer.decode(
                    t, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                for t in y
            ]

            # Calculate rouge scores for each prediction and corresponding target
            for pred, target in zip(preds, targets):
                score = scorer.score(target, pred)
                scores.append(score)

            predictions.extend(preds)
            actuals.extend(targets)

    # Compute the average ROUGE scores for the entire validation set
    avg_scores = {
        "rouge1": np.mean([score["rouge1"].fmeasure for score in scores]),
        "rouge2": np.mean([score["rouge2"].fmeasure for score in scores]),
        "rougeL": np.mean([score["rougeL"].fmeasure for score in scores]),
    }

    logger.info(f"Average ROUGE scores: {avg_scores}")

    return predictions, actuals


# T5 training main function
def T5Trainer(dataframe, source_text, target_text):
    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.backends.cudnn.deterministic = True # type: ignore

    logger.info("Reading data...")
    # Importing the raw dataset
    dataframe = dataframe[[source_text, target_text]]

    # Creation of Dataset and Dataloader
    # The fraction set by training_validation_split is used for training and the rest for validation
    train_size = training_validation_split
    train_dataset = dataframe.sample(frac=train_size, random_state=seed)
    val_dataset = dataframe.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    logger.info(f"FULL Dataset: {dataframe.shape}")
    logger.info(f"TRAIN Dataset: {train_dataset.shape}")
    logger.info(f"VALIDATION Dataset: {val_dataset.shape}")

    # Creating the Training and Validation dataset for further creation of data loader
    training_set = CustomDataset(
        train_dataset,
        tokenizer,
        max_input_token_length,
        max_output_token_length,
        source_text,
        target_text,
    )
    val_set = CustomDataset(
        val_dataset,
        tokenizer,
        max_input_token_length,
        max_output_token_length,
        source_text,
        target_text,
    )

    # Defining the parameters for creation of data loaders
    train_params = {
        "batch_size": train_batch_size,
        "shuffle": True,
        "num_workers": 0,
    }
    val_params = {
        "batch_size": valid_batch_size,
        "shuffle": False,
        "num_workers": 0,
    }

    # Creation of data loaders for training and validation, used in the training and validation stages below
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)

    # Defining the optimizer that will be used to tune the weights of the network in the training session
    optimizer = torch.optim.AdamW(
        params=[p for p in model.parameters() if p.requires_grad],
        lr=learning_rate,
        eps=adam_epsilon,
        weight_decay=weight_decay,
    )

    # Define the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=train_epochs
        * len(training_loader)
        // gradient_accumulation_steps,
    )

    # Training loop
    logger.info(f"Initiating fine tuning of {model_path}...")
    # Table logger for training statistics
    with live_refresher:
        for epoch in range(train_epochs):
            train(
                epoch, tokenizer, model, device, training_loader, optimizer, scheduler
            )
    logger.info(f"Saving fine-tuned model to {model_output_directory} ...")
    # Saving the model after training
    save_model()

    # Evaluating validation dataset
    logger.info("Initiating validation...")
    for _ in range(val_epochs):
        validate(tokenizer, model, device, val_loader)
    logger.success("Model fine tuning, saving, and validation steps completed!")


# Saves the model
def save_model():
    model.save_pretrained(model_output_directory)
    tokenizer.save_pretrained(model_output_directory)
    logger.info(f"Fine-tuned model successfully saved to: {model_output_directory}")
    logger.success("Model saved.")


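# In case of interrupt, save model and exit
# The first parameter is named signum so it does not shadow the signal module
def save_and_exit(signum, _):
    logger.warning(
        f"Received interrupt signal {signum}, stopping script and saving model..."
    )
    save_model()
    # Re-raise the interrupt so the running cell actually stops after saving
    raise KeyboardInterrupt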


# Attach the SIGINT signal (generated by Ctrl+C) to the handler
signal.signal(signal.SIGINT, save_and_exit)

try:
    # Run training function on the T5 model using data set and training parameters
    T5Trainer(dataframe=data, source_text="input", target_text="output")
except Exception as e:
    # Handle other unexpected errors
    logger.error(f"An unexpected error occurred during fine tuning: {str(e)}")
    logger.error(f"Stack trace: \n{traceback.format_exc()}")
    # Save the model and any relevant data before exiting gracefully
    save_model()
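
The learning rate schedule used above can be hard to picture from the parameters alone. The sketch below recomputes, in plain Python, the number of optimizer updates from the same expression the notebook passes as `num_training_steps`, and the learning rate that a linear warmup followed by a linear decay yields at a given update. It assumes the scheduler advances once per optimizer update; `optimizer_steps`, `linear_warmup_lr`, and the batch count of 100 are hypothetical names and values chosen for illustration.

```python
def optimizer_steps(num_batches: int, epochs: int, accumulation_steps: int) -> int:
    """Optimizer updates over the whole run, using the notebook's expression."""
    return epochs * num_batches // accumulation_steps


def linear_warmup_lr(step: int, base_lr: float, warmup: int, total: int) -> float:
    """Learning rate at an optimizer step for linear warmup then linear decay."""
    if step < warmup:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup)
    # Decay linearly from base_lr to 0 over the remaining steps
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


# Assumed example: 100 training batches per epoch with this notebook's defaults
total_steps = optimizer_steps(num_batches=100, epochs=6, accumulation_steps=1)
print(f"total optimizer steps: {total_steps}")
for step in (0, 2, 4, 302, total_steps):
    lr = linear_warmup_lr(step, base_lr=1e-4, warmup=4, total=total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With `warmup_steps = 4` the warmup is over almost immediately, so nearly the whole run is spent on the linear decay. Raising `gradient_accumulation_steps` shrinks the total number of updates proportionally, which shortens the schedule.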

## Step 4: Fine Tuned Model Manual Testing

In [None]:
from scripts.constants import bullet_prompt_prefix, bullet_data_creation_prefix

# Choose the prompt prefix to prepend to the manual test input
prompt_prefix_option = input(
    "Type the number to choose a prompt prefix type: (1) Bullet Prompt or (2) Data Creation"
)
prompt_prefix = (
    bullet_data_creation_prefix if prompt_prefix_option == "2" else bullet_prompt_prefix
)

# Preprocess input
input_text = prompt_prefix + input("Provide your input below, sans prompt prefix.")

# Tokenize the input and move the tensors to the same device as the model
encoded_input_text = tokenizer.encode_plus(
    input_text, return_tensors="pt", truncation=True, max_length=max_input_token_length
).to(device)

# Generate summary
summary_ids = model.generate(
    encoded_input_text["input_ids"],
    attention_mask=encoded_input_text["attention_mask"],
    max_length=max_output_token_length,
    num_beams=number_of_beams,
    # Sampling must be enabled for temperature, top_k, and top_p to take effect
    do_sample=True,
    temperature=temperature,
    top_k=top_k,
    top_p=top_p,
    early_stopping=True,
)

output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print results
logger.info(f"INPUT: {input_text}")
logger.success(f"GENERATED OUTPUT: {output_text}\n")