# The Forge Model Fine Tuning: T5x Variants

This notebook is geared towards fine tuning T5x variant models using PyTorch and Hugging Face Transformers. It contains the fine tuning and monitoring scripts used to generate the Forge's training data set and to produce the production Forge model.

Through user input, the notebook lets you choose a specific checkpoint model and tokenizer, and direct the fine tuning or inference towards either bullet interpretation or bullet generation.

## Step 1: Import Dependencies

Run this block before any of the following steps.

It imports the scripts and dependencies used by all of the later blocks.

In [None]:
import signal
import traceback
import torch
from loguru import logger

from scripts.constants import *
from scripts.model_instantiation import *
from scripts.file_utils import load_jsonl_as_dataframe
from scripts.prompt import prompt, select_prompt_type
from scripts.t5_trainer import T5Trainer
from scripts.training import save_trained_model

## Step 2: Instantiate the Model and Tokenizer

This block instantiates the selected model and tokenizer and loads the model onto the target device for later use (e.g., fine tuning or inference).

The block also lets the user optionally save the pre-trained and/or raw model checkpoint to this repository's local _forge/models/_ directory, so that it can be used and fine-tuned offline later on.

When selecting a model and tokenizer, ensure the tokenizer has the same signature and base model as the selected model. For example, if using google/flan-t5-xl as the model, then the tokenizer should also be google/flan-t5-xl.
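
One cheap sanity check on a model and tokenizer pair is to confirm that every token id the tokenizer can emit has a row in the model's embedding table. The sketch below assumes both objects follow the usual Hugging Face conventions (`len(tokenizer)` for the vocabulary size including added tokens, `model.config.vocab_size` for the number of embedding rows). The function name `tokenizer_fits_model` is hypothetical and is not part of this repository's scripts. Passing the check is necessary but not sufficient: two unrelated tokenizers can share a vocabulary size.

```python
def tokenizer_fits_model(tokenizer, model):
    """Return True when every tokenizer id has a row in the model's embeddings.

    Assumes Hugging Face style objects: len(tokenizer) counts all tokens
    (including added ones) and model.config.vocab_size is the embedding size.
    """
    # The model may have more embedding rows than the tokenizer has tokens
    # (padding for efficiency), but it must never have fewer.
    return len(tokenizer) <= model.config.vocab_size
```

With the objects from this step, `tokenizer_fits_model(tokenizer, model)` returning `False` would be a strong hint that the two paths do not belong together.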

Some T5x variant checkpoints to try out for fine tuning or inferencing include, but are not limited to:
- [MBZUAI/LaMini-T5-738M](https://huggingface.co/MBZUAI/LaMini-T5-738M)
- [t5-base](https://huggingface.co/t5-base)
- [google/t5-efficient-tiny](https://huggingface.co/google/t5-efficient-tiny)

This block requires the following user input:

- model_path: The model's directory or Hugging Face repository, e.g., google/flan-t5-xl, ../models/opera-bullet-interpreter
- tokenizer_path: The tokenizer's directory or Hugging Face repository, e.g., google/flan-t5-xl, ../models/opera-bullet-interpreter
- save_model_decision: Whether the user wants to save a copy of the model to the local directory, "yes" or "no"
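
To make the three inputs concrete, here is a minimal sketch of how such a prompt sequence could be collected and validated. It is not the repository's `select_model` implementation: the function name `ask_model_choices`, the prompt wording, the re-prompt loop, and the "blank tokenizer path reuses the model path" shortcut are all assumptions made for illustration.

```python
def ask_model_choices(ask=input):
    """Collect model path, tokenizer path, and the save decision.

    `ask` defaults to the built-in input() and can be swapped out for testing.
    Hypothetical helper; the notebook itself uses scripts.model_instantiation.
    """
    model_path = ask("Model directory or Hugging Face repository: ").strip()

    # Assumption: an empty answer means "use the same path as the model"
    tokenizer_path = ask(
        "Tokenizer directory or repository (blank reuses the model path): "
    ).strip()
    if not tokenizer_path:
        tokenizer_path = model_path

    # Keep asking until the answer is exactly "yes" or "no" (case-insensitive)
    save_model_decision = ask('Save a local copy of the model? ("yes" or "no"): ')
    save_model_decision = save_model_decision.strip().lower()
    while save_model_decision not in ("yes", "no"):
        save_model_decision = ask('Please answer "yes" or "no": ').strip().lower()

    return tokenizer_path, model_path, save_model_decision == "yes"
```

Reusing the model path by default is one way to reduce the chance of pairing a model with a mismatched tokenizer.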

In [None]:
tokenizer_path, model_path = select_model()
tokenizer, model = load_model(model_path, tokenizer_path)

## Step 3: Fine Tune T5 Checkpoint Model

If the target model is already fine-tuned or ready for manual testing, skip to [Step 4](#step-4-fine-tuned-model-manual-testing).

This block requires the following input:

- model_output_directory: The fine-tuned model's destination directory, e.g., ../models/opera-bullet-interpreter
- prompt_prefix_option: The number corresponding to the type of model you want to train, either a model for creating long-form sentences from an existing bullet or a model that creates new bullets from long-form sentences
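
The effect of `prompt_prefix_option` on training direction can be summarized as a column swap: the same data set is used either way, and only the roles of its two keys change. The sketch below mirrors the branch in this step's code cell. The helper name `select_text_columns` is hypothetical, and the sample record is invented purely for illustration.

```python
def select_text_columns(prompt_prefix_option):
    """Return (source_text, target_text) column names for the chosen option.

    "2" trains the bullet interpreter (bullet -> long-form sentence);
    anything else trains the bullet creator (long-form sentence -> bullet).
    """
    if prompt_prefix_option == "2":
        return "input", "output"
    return "output", "input"


# Invented record: "input" holds the bullet, "output" the long-form sentence
record = {
    "input": "- Led team of 4; cut build time",
    "output": "This person led a team of four and reduced build time.",
}

source_text, target_text = select_text_columns("2")
model_reads = record[source_text]    # the bullet
model_writes = record[target_text]   # the long-form sentence
```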

In [None]:
# Set the device for loading the model to
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_output_directory = "../models/" + input(
    "What name would you like to give the fine-tuned model?"
)

data_set = "../data/training/bullet_training_set.jsonl"

prompt_prefix, prompt_prefix_option = select_prompt_type()

data = load_jsonl_as_dataframe(data_set, prompt_prefix)


# In case of interrupt, save model and exit
def save_and_exit(signum, _frame):
    # The first parameter is named signum so it does not shadow the signal module
    logger.warning(
        f"Received interrupt (code {signum}), stopping script and saving progress..."
    )
    save_trained_model()


# Attach the SIGINT signal (generated by Ctrl+C) to the handler
signal.signal(signal.SIGINT, save_and_exit)

# Run training function on the T5 model using data set and training parameters
# Prompt prefix of (1) is Bullet Creation Training and (2) is Bullet Interpretation Training
# Input key stores the bullet, Output key stores the long-form sentence(s)
if prompt_prefix_option == "2":
    # Train the bullet interpreter model
    T5Trainer(
        model,
        tokenizer,
        model_output_directory,
        dataframe=data,
        source_text="input",
        target_text="output",
    )
else:
    # Train the bullet creator model
    T5Trainer(
        model,
        tokenizer,
        model_output_directory,
        dataframe=data,
        source_text="output",
        target_text="input",
    )
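
The interrupt handling in the cell above can also be written as a wrapper that installs the handler only for the duration of training and restores the previous one afterwards. This is a sketch of one possible pattern rather than the notebook's implementation: `run_with_save_on_interrupt`, `train_fn`, and `save_fn` are hypothetical names, and the sketch assumes it runs in the main thread, which is where Python allows signal handlers to be installed.

```python
import signal


def run_with_save_on_interrupt(train_fn, save_fn):
    """Run train_fn(); on Ctrl+C, call save_fn() and stop.

    Returns True if training finished, False if it was interrupted.
    """

    def handler(signum, _frame):
        # Save progress first, then unwind the training call
        save_fn()
        raise KeyboardInterrupt

    # signal.signal returns the handler that was installed before
    previous = signal.signal(signal.SIGINT, handler)
    try:
        train_fn()
        return True
    except KeyboardInterrupt:
        return False
    finally:
        # Restore the earlier handler so later cells behave normally
        if previous is not None:
            signal.signal(signal.SIGINT, previous)
```

Raising `KeyboardInterrupt` from the handler is what actually stops the training call; a handler that only saves would let training resume once it returns.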

## Step 4: Fine Tuned Model Manual Testing

This block requires the following input:

- prompt_prefix_option: The number corresponding to the type of model you want to run inference with, either a model for creating long-form sentences from an existing bullet or a model that creates new bullets from long-form sentences
- input_text: The input you are providing to the model to respond to

In [None]:
prompt_prefix_option = input(
    "Type the number to choose a prompt prefix type: (1) Bullet Prompt or (2) Data Creation"
)
prompt_prefix = (
    bullet_data_creation_prefix if prompt_prefix_option == "2" else bullet_prompt_prefix
)

# Preprocess input
input_text = prompt_prefix + input("Provide your input below, sans prompt prefix.")

output_text = prompt(model, tokenizer, input_text)

# Print results
logger.info(f"INPUT: {input_text}")
logger.success(f"GENERATED OUTPUT: {output_text}\n")
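
For readers wondering what a helper like `prompt` typically does, the following is a minimal sketch of text generation with a Hugging Face seq2seq model. It is not the code in `scripts.prompt`: the function name `generate_text` and the `max_new_tokens=128` default are assumptions, while the tokenizer call, `model.generate`, and `tokenizer.decode` are standard Transformers API.

```python
def generate_text(model, tokenizer, input_text, max_new_tokens=128):
    """Tokenize a prefixed prompt, generate output ids, and decode them."""
    # Encode the prompt as PyTorch tensors and move them to the model's device
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate token ids; the cap on new tokens is an arbitrary illustrative value
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the first (and only) sequence, dropping tokens such as <pad> and </s>
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects loaded in Step 2, a call would look like `generate_text(model, tokenizer, input_text)`.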