# T5 Model Fine-tuning

This notebook fine-tunes the T5 base model. See the `README.md` in the parent `forge/` directory for more details.

## Step 1: Read, Tokenize and Encode Data

The cell below reads the training data into memory and tokenizes and encodes every input/output (IO) pair for the fine-tuning process. It prompts for two values:

- TRAINING_FILE: The name of the JSONL file in the `data/training` directory
- MAXIMUM_SIZE: The maximum amount of data to read into memory, where `0` reads all data
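
The meaning of `MAXIMUM_SIZE` can be illustrated with a minimal sketch. The function `read_jsonl_sketch` below is hypothetical and is not the project's `jsonl_read` (which is asynchronous and whose implementation is not shown here). It assumes that the maximum size counts JSONL records and that `0` means "no limit".

```python
import json
from pathlib import Path


def read_jsonl_sketch(path, maximum_size=0):
    """Hypothetical stand-in for jsonl_read: return up to maximum_size records.

    Assumption: maximum_size counts JSONL records, and 0 means "read everything".
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            records.append(json.loads(line))
            # Stop early once the requested number of records has been read
            if maximum_size and len(records) >= maximum_size:
                break
    return records
```

Under these assumptions, calling the function with a maximum size of `50` returns at most 50 records, while a maximum size of `0` returns every record in the file.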

In [1]:
import asyncio
import nest_asyncio
from loguru import logger
from scripts.utils.file_utils import jsonl_read
from scripts.prepare_training import tokenize_and_encode

async def main():
    # Read in the training data JSONL file
    TRAINING_FILE = input("Enter the JSONL filename: ")
    MAXIMUM_SIZE = input("Enter a maximum data read size: ")
    training_file_path = f"../data/training/{TRAINING_FILE}"
    data = await jsonl_read(training_file_path, int(MAXIMUM_SIZE))
    # Stop early if nothing was read (empty or unreadable file)
    if not data:
        logger.error(f"An error occurred during the reading of JSONL file: {training_file_path}")
        raise SystemExit(1)
    
    # Tokenize and encode each IO pair
    return await tokenize_and_encode(data)

# Run the async events
def run_asyncio_loop():
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(main())


# Enable nested event loops
nest_asyncio.apply()
prepared_data = run_asyncio_loop()

logger.info(f"Sample of the tokenized and encoded data: {prepared_data[0]}")
logger.info(f"Total count of tokenized and encoded data: {len(prepared_data)}")
logger.success("The data has been tokenized and encoded into memory!")
logger.warning("This tokenized and encoded data is only temporarily stored in the Jupyter Notebook instance.")
logger.warning("Failing to save the data to file will result in loss during restart or clearing of outputs.")


2023-07-09 20:48:33.141 | INFO     | __main__:<module>:30 - Sample of encoded data: {'input_ids': tensor([[282,   8, 262,  ...,   0,   0,   0]]), 'labels': tensor([[  3,  18, 262,  ...,   0,   0,   0]])}
2023-07-09 20:48:33.142 | INFO     | __main__:<module>:31 - Count of encoded data: 50
2023-07-09 20:48:33.142 | SUCCESS  | __main__:<module>:32 - The data has been tokenized and encoded into memory!
