# T5 Model Fine-tuning

This notebook fine-tunes the T5 base model. See the `README.md` in the parent `forge/` directory for more details.

## Step 1: Read and Tokenize Data

The block below reads the training data into memory and tokenizes all inputs and outputs for the fine-tuning process. It prompts for two values:

- `TRAINING_FILE`: The name of the JSONL file in the `data/training` directory
- `MAXIMUM_SIZE`: The maximum amount of data to read into memory; `0` reads all data
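
To illustrate how a limit of `0` can mean "read everything", here is a minimal sketch of a JSONL reader. It is not the project's `jsonl_read`: the name `read_jsonl_limited` is hypothetical, the function is synchronous for simplicity, and it assumes the limit counts records, since the unit is not stated here.

```python
import json
from itertools import islice
from pathlib import Path


def read_jsonl_limited(path, maximum_size=0):
    """Hypothetical helper: read up to `maximum_size` records from a JSONL file.

    Assumption: the limit counts records, and 0 means "no limit".
    """
    # islice treats a stop value of None as "read to the end"
    limit = maximum_size if maximum_size > 0 else None
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        # Drop blank lines first so they do not count toward the limit
        non_blank = (line for line in handle if line.strip())
        for line in islice(non_blank, limit):
            records.append(json.loads(line))
    return records
```

Under these assumptions, `read_jsonl_limited(path, 0)` returns every record, while `read_jsonl_limited(path, 100)` stops after the first 100.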

In [2]:
import asyncio

import nest_asyncio
from loguru import logger

from scripts.prepare_training import read_and_encode
from scripts.utils.file_utils import jsonl_read


async def main():
    # Prompt for the training file name and the maximum read size (0 reads all data)
    TRAINING_FILE = input("Enter the JSONL filename: ")
    MAXIMUM_SIZE = int(input("Enter a maximum data read size: "))
    training_file_path = f"../data/training/{TRAINING_FILE}"

    # Read the training data JSONL file into memory
    data = await jsonl_read(training_file_path, MAXIMUM_SIZE)
    if not data:
        logger.error(f"An error occurred during the reading of JSONL file: {training_file_path}")
        # Stop the cell without calling exit(), which behaves differently inside Jupyter
        raise SystemExit(1)

    # Tokenize the inputs and labels for fine-tuning
    return await read_and_encode(data)


def run_asyncio_loop():
    # Reuse the notebook's event loop to run the coroutine to completion
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(main())


# Enable nested event loops in Jupyter Notebook
nest_asyncio.apply()
prepared_data = run_asyncio_loop()

[{'input_ids': tensor([[  27, 5153,    8,  ...,    0,    0,    0]]), 'labels': tensor([[ 3, 18, 71,  ...,  0,  0,  0]])},
 {'input_ids': tensor([[ 282,    8, 4564,  ...,    0,    0,    0]]), 'labels': tensor([[ 3, 18,  3,  ...,  0,  0,  0]])},
 {'input_ids': tensor([[   27, 11197,  1361,  ...,     0,     0,     0]]), 'labels': tensor([[   3,   18, 9506,  ...,    0,    0,    0]])},
 {'input_ids': tensor([[  27, 4468,    3,  ...,    0,    0,    0]]), 'labels': tensor([[   3,   18, 1193,  ...,    0,    0,    0]])},
 {'input_ids': tensor([[  282,     8, 14640,  ...,     0,     0,     0]]), 'labels': tensor([[  3,  18, 180,  ...,   0,   0,   0]])},
 {'input_ids': tensor([[  27, 4252,  192,  ...,    0,    0,    0]]), 'labels': tensor([[  3,  18, 332,  ...,   0,   0,   0]])},
 {'input_ids': tensor([[282,   3,   9,  ...,   0,   0,   0]]), 'labels': tensor([[    3,    18, 16736,  ...,     0,     0,     0]])},
 {'input_ids': tensor([[  86,   82, 1075,  ...,    0,    0,    0]]), 'labels': tensor([[  3, 

In [None]:
print(prepared_data)
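
Printing the full list can produce a very long output. As an alternative, here is a minimal sketch that reports only the shape of each field. The helper names `describe_shape` and `summarize_examples` are hypothetical, and the sketch assumes each entry is a dict whose values either expose a `.shape` attribute (as PyTorch tensors do) or are nested lists.

```python
def describe_shape(value):
    """Return the shape of a tensor-like object or a nested list as a tuple."""
    if hasattr(value, "shape"):
        return tuple(value.shape)
    shape = []
    # Walk down the first element of each nesting level of a plain list
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def summarize_examples(examples, limit=3):
    """Map each field name to its shape for the first `limit` examples."""
    return [
        {key: describe_shape(value) for key, value in example.items()}
        for example in examples[:limit]
    ]


# Stand-in data: nested lists in place of tensors (illustrative values only)
sample = [
    {"input_ids": [[27, 5153, 8, 0, 0, 0]], "labels": [[3, 18, 71, 0]]},
]
print(summarize_examples(sample))
```

With the real data, the same call would be `summarize_examples(prepared_data)`, which should make it easy to confirm that every `input_ids` and `labels` tensor has the expected padded length.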