# Training Helper Utilities
To ease the process of training and evaluating multiple models, we have implemented two helper classes, `TrainingHelperM` and `TrainingHelperD`, alongside a few helper functions such as `get_finetuned_model_name`. These tools help us keep the code tidy and safe.

## `TrainingHelperM`
This class stores metadata regarding the model we are training. Its attributes are the following:
- `base_model_name`: This is only the name of the model without prefixes.
- `batch_size`: Training batch size. This is auto-inferred for known models (see `__init__()` below).
- `dataset_name`: Name of the training dataset.
- `epochs`: Number of training epochs.
- `eval_split_name`: Name of the evaluation split of the dataset. Auto generated based on sequence length.
- `gradient_accumulation_steps`: Gradient accumulation steps.
- `huggingface_model_name`: This is the Hugging Face identifier of the base model we are using.
- `huggingface_prefix`: The prefix string. Name of the developer team.
- `learning_rate`: Learning rate used to train the model.
- `seq_len`: Max length of the training sequences.
- `separator`: Small string used to separate fields in the finetuned name.
- `task`: Training task name.
- `test_split_name`: Name of the testing split of the dataset. Auto generated too.
- `train_split_name`: Name of the training split of the dataset. Auto generated too.

### Methods:
`TrainingHelperM` has a few methods to further help the programmer:
- `from_json(path)`: Loads the helper from a JSON file (more below).
- `get_default_model()`: This method returns a Hugging Face AutoModelForSequenceClassification instantiated from the name stored in the helper.
- `get_finetuned_model_name()`: Returns the finetuned model name string.
- `get_tokenizer()`: Returns the matching tokenizer to the model.
- `get_tokenizer_function()`: Returns a tokenizer function which can be used by `dataset.map()` for pre-tokenization and other tasks.
- `get_training_params()`: Returns a dictionary with the following keys: `batch_size`, `gradient_accumulation_steps`, `seq_len`, `learning_rate`, `epochs`.
- `initialize_from_environment()`: Instantiates the model helper from environment variables (more below).
- `initialize_from_finetuned_name(name)`: Instantiates the model helper from a finetuned name (more below).
- `to_json(path)`: The helper can save its metadata to a JSON file.

### Initialization of the class
There are four ways to initialize a `TrainingHelperM` object. First, we can call its `__init__()` method directly. Second, we can use the `initialize_from_environment` class method to pull the metadata from environment variables. This is extremely helpful for SLURM array jobs, as one can start a wide range of model trainings from a single launch script. Third, we can load the helper from a JSON file created by the `to_json` method. Lastly, we can use the `parse_model_helper_from_finetuned_name` factory function. It takes a finetuned model name created by the `get_finetuned_model_name(TrainingHelperD, TrainingHelperM)` function and returns a `TrainingHelperM` object.

- The `__init__(**kwargs)` method takes the following arguments:
  - `huggingface_model_name` (str): Required.
  - `epochs` (int): Optional. Defaults to 1.
  - `learning_rate` (float): Optional. Defaults to 0.001.
  - `seq_len` (int): Optional. Defaults to 512.
  - `batch_size` (int): Optional. If not given, it is inferred for known models; otherwise an error is raised.
  - `gradient_accumulation_steps` (int): Optional. If not given, it is inferred for known models; otherwise an error is raised.

  Since many of the parameters are defaulted or auto-inferred, their values might not be correct in every case. It is better practice to pass all known information to `__init__`.

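- `initialize_from_environment()`: This is a class method that acts as a factory function. It looks for the following environment variables:
  - `MODEL_NAME`: The full Hugging Face name of the model.
  - `LEARNING_RATE`: The learning rate.
  - `LS`: The maximal sequence length for the given training. (The model itself might be able to handle more!)
  - `NUM_TRAIN_EPOCHS`: Number of training epochs.
  - The example cells in this notebook additionally set `TASK`, `DATASET_NAME`, `BATCH_SIZE` and `GRADIENT_ACCUMULATION_STEPS`. The last two are needed when the model is not a known one.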
- `from_json(path)`: This class method loads the class from a JSON file.
- `parse_model_helper_from_finetuned_name(name)`: Not recommended, as it is unsafe. It only works for known models: batch size and gradient accumulation steps are not part of the finetuned name, so they have to be auto-inferred.
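
The SLURM array use case mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the grid values and the `env_for_task` helper are hypothetical and not part of the library. Only the environment variable names come from the list above, and `SLURM_ARRAY_TASK_ID` is the index SLURM sets for each task of an array job.

```python
import itertools
import os

# Hypothetical search grid; the values are placeholders for illustration only.
MODEL_NAMES = ["neuralbioinfo/prokbert-mini-long"]
LEARNING_RATES = ["0.001", "0.0001"]
SEQ_LENS = ["256", "512"]


def env_for_task(task_id: int) -> dict:
    """Map an array task index to one combination of the grid (hypothetical helper)."""
    grid = list(itertools.product(MODEL_NAMES, LEARNING_RATES, SEQ_LENS))
    model_name, learning_rate, seq_len = grid[task_id]  # IndexError if the array is too large
    return {
        "MODEL_NAME": model_name,
        "LEARNING_RATE": learning_rate,
        "LS": seq_len,
        "NUM_TRAIN_EPOCHS": "1",
    }


# SLURM sets SLURM_ARRAY_TASK_ID inside array jobs; fall back to 0 when run elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
selected = env_for_task(task_id)
os.environ.update(selected)  # environment values must be strings
print(selected)
```

With the variables in place, a call to `TrainingHelperM.initialize_from_environment()` in the same process would pick them up, so one launch script can cover the whole grid.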


In [None]:
# This install method is guaranteed to work in Google Colab, so it is preferred for this example. For more details, please check the README.
#!git clone --single-branch --branch TrainHelper https://github.com/nbrg-ppcu/prokbert.git
#%pip install ./prokbert -q;

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# If you are testing from the repo itself
import sys
sys.path.append('../')
from src.prokbert.traininghelper_utils import TrainingHelperM

In [None]:
#from prokbert.traininghelper_utils import TrainingHelperM

In [None]:
# Let's start by initializing the helpers through their __init__ methods

model_helper = TrainingHelperM(
    huggingface_model_name='neuralbioinfo/prokbert-mini-long')
# Notice that neither batch size nor gradient accumulation steps are given. They are auto-inferred since prokbert-mini-long is a known model
# These batch sizes are calculated assuming 40GB NVIDIA A100s
del model_helper

# To fully control parameters we can pass in everything through the arguments to init
model_helper = TrainingHelperM(
    huggingface_model_name='neuralbioinfo/prokbert-mini-long',
    dataset_name='TEST',
    epochs=1,
    learning_rate=0.001,
    seq_len=512,
    batch_size=64,
    gradient_accumulation_steps=2,
    separator='___',
    task='testingtask'
)


In [None]:
# To initialize the helpers from environment variables we need to set them up first
import os
os.environ['MODEL_NAME'] = 'neuralbioinfo/prokbert-mini-long'
os.environ['LEARNING_RATE'] = '0.001'
os.environ['LS'] = '256'
os.environ['NUM_TRAIN_EPOCHS'] = '1'
os.environ['TASK'] = 'phage-lifestyle'
os.environ['DATASET_NAME'] = 'testdataset'


In [None]:
# This will work because prokbert-mini-long is a known model, so batch size (BS) and gradient accumulation steps (GAC) are auto-inferred
del model_helper
model_helper = TrainingHelperM.initialize_from_environment()

In [None]:
del model_helper
os.environ['MODEL_NAME'] = 'some_developer/some_model'
# This fails because it is not a known model
model_helper = TrainingHelperM.initialize_from_environment()

In [None]:
# However, if we add BS and GAC to the environment variables, the initialization will work
os.environ['BATCH_SIZE'] = '64'
os.environ['GRADIENT_ACCUMULATION_STEPS'] = '4'
model_helper = TrainingHelperM.initialize_from_environment()

In [None]:
# Let's save the helper so we can try loading it back
model_helper.to_json('model_helper.json')

In [None]:
# Load the helper from JSON
helper_two = TrainingHelperM.from_json('model_helper.json')
print("The helper loaded back from JSON is equal to the previous", helper_two == model_helper)
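
For readers curious how such a JSON round trip can work for a dataclass, here is a minimal sketch. `ToyJsonHelper` is a hypothetical stand-in with only three fields; the actual `to_json` and `from_json` implementations of `TrainingHelperM` may differ.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ToyJsonHelper:
    """Hypothetical stand-in for a helper class, used only for illustration."""

    huggingface_model_name: str
    learning_rate: float = 0.001
    seq_len: int = 512

    def to_json(self, path: str) -> None:
        # asdict() turns the dataclass fields into a plain dictionary.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(asdict(self), handle, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "ToyJsonHelper":
        # The stored keys match the field names, so they can be passed as kwargs.
        with open(path, encoding="utf-8") as handle:
            return cls(**json.load(handle))


original = ToyJsonHelper("neuralbioinfo/prokbert-mini-long", seq_len=256)
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "toy_helper.json")
    original.to_json(json_path)
    restored = ToyJsonHelper.from_json(json_path)

print(restored == original)  # True: dataclass equality compares field by field
```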

## Now that we know how to initialize and save the helpers, let's look at their usage
Both classes are decorated with `@dataclass`, so they support equality and ordering operators (==, <, >, <=, >=). Since they are dataclasses, they can also be printed directly and converted to dictionaries with the `asdict()` function from the `dataclasses` module.
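
The dataclass behaviour described here can be tried on a toy class first. `ToyParams` is hypothetical; the sketch assumes `order=True`, which is what the standard library requires for the ordering operators to be generated.

```python
from dataclasses import asdict, dataclass


@dataclass(order=True)
class ToyParams:
    """Hypothetical two-field dataclass, used only for illustration."""

    epochs: int = 1
    learning_rate: float = 0.001


a = ToyParams(epochs=1, learning_rate=0.001)
b = ToyParams(epochs=2, learning_rate=0.001)

print(a == ToyParams())  # True: equality compares the fields one by one
print(a < b)             # True: ordering compares the fields as a tuple, in declaration order
print(asdict(b))         # {'epochs': 2, 'learning_rate': 0.001}
```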

In [None]:
# Let's try printing
print(model_helper)

### Something looks off with those prints?
As you might notice, when printed like this the fields have a `_` prefix and differ from the names described before. This is because these are the hidden internal values, which are not supposed to be accessed or changed directly. The public properties of the helpers are all accessible through getter and setter methods defined with the `@property` decorator.
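
The pattern can be reproduced with a small toy class. This is a sketch, not the library's code: `ToyPropertyHelper` and its validation rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyPropertyHelper:
    """Hypothetical class with one internal field and a public property."""

    _seq_len: int = 512  # internal storage; this is the name the repr shows

    @property
    def seq_len(self) -> int:
        return self._seq_len

    @seq_len.setter
    def seq_len(self, value: int) -> None:
        # The setter is a natural place for validation (assumed rule).
        if value <= 0:
            raise ValueError("seq_len must be positive")
        self._seq_len = value


helper = ToyPropertyHelper()
print(helper)          # ToyPropertyHelper(_seq_len=512)
helper.seq_len = 256   # goes through the setter
print(helper.seq_len)  # 256
```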

### So what can we access? These public properties
Again the public properties of the model helper are the following:
- `base_model_name`: This is only the name of the model without prefixes.
- `batch_size`: Training batch size. This is auto-inferred for known models (see `__init__()` above).
- `dataset_name`: Name of the training dataset.
- `epochs`: Number of training epochs.
- `eval_split_name`: Name of the evaluation split of the dataset. Auto generated based on sequence length.
- `gradient_accumulation_steps`: Gradient accumulation steps.
- `huggingface_model_name`: This is the Hugging Face identifier of the base model we are using.
- `huggingface_prefix`: The prefix string. Name of the developer team.
- `learning_rate`: Learning rate used to train the model.
- `seq_len`: Max length of the training sequences.
- `separator`: Small string used to separate fields in the finetuned name.
- `task`: Training task name.
- `test_split_name`: Name of the testing split of the dataset. Auto generated too.
- `train_split_name`: Name of the training split of the dataset. Auto generated too.

A few things to note here. The names of the dataset splits are generated from the sequence length used, so they cannot be modified; trying to do so raises an error. Consequently, the dataset paths are non-modifiable too!
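
A read-only, derived property of this kind can be sketched as below. The class `ToySplits` and the `train_<seq_len>` naming scheme are hypothetical. In this sketch the error is Python's `AttributeError` for a property without a setter; the helper itself may raise a different exception.

```python
class ToySplits:
    """Hypothetical class whose split name is derived from the sequence length."""

    def __init__(self, seq_len: int) -> None:
        self._seq_len = seq_len

    @property
    def train_split_name(self) -> str:
        # Assumed naming scheme, for illustration only.
        return f"train_{self._seq_len}"


splits = ToySplits(256)
print(splits.train_split_name)  # train_256

try:
    splits.train_split_name = "custom"  # no setter is defined
except AttributeError as error:
    print("read-only property:", type(error).__name__)
```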


In [None]:
# Here we will access all the fields to see what's up
# Model name and parameters
print("Name of the Hugging Face model: ", model_helper.huggingface_model_name)
# We can access the prefix and the basename separately
print("Hugging Face prefix: ", model_helper.huggingface_prefix)
print("Hugging Face base model name: ", model_helper.base_model_name)

# Training parameters
print("Number of training epochs: ", model_helper.epochs)
print("Training batch size: ", model_helper.batch_size)
print("Gradient accumulation steps: ", model_helper.gradient_accumulation_steps)
print("Learning rate: ", model_helper.learning_rate)
print("Sequence length: ", model_helper.seq_len)

# Dataset parameters
print("Name of the training dataset: ", model_helper.dataset_name)
print("Name of the training split: ", model_helper.train_split_name)
print("Name of the testing split: ", model_helper.test_split_name)
print("Name of the evaluation split: ", model_helper.eval_split_name)

# Other task specific parameters
print("Name of the training task: ", model_helper.task)
print("Substring to separate fields in the finetuned name: ", model_helper.separator)



## Off to the more intriguing functionalities
Here we will showcase the helper methods of the `TrainingHelperM` class, namely `TrainingHelperM.get_default_model()`, `TrainingHelperM.get_tokenizer()`, `TrainingHelperM.get_tokenizer_function()` and `TrainingHelperM.get_finetuned_model_name()`.

In [None]:
# These are more heavyweight operations, so it is recommended not to try them without a GPU

model_helper = TrainingHelperM(huggingface_model_name='neuralbioinfo/prokbert-mini-long') # Full default helper for prokbert

model = model_helper.get_default_model() # This will return a Hugging Face AutoModelForSequenceClassification
tokenizer = model_helper.get_tokenizer() # This returns the corresponding tokenizer
tokenize_fn = model_helper.get_tokenizer_function() # Tokenizer function to use with dataset.map() for example

In [None]:
# Let's say we successfully trained our model and would like to save it
finetuned_name = model_helper.get_finetuned_model_name()
print(finetuned_name)
#model.save_pretrained(finetuned_name)

In [None]:
# And later on one can initialize the helper from the finetuned name
helper_two = TrainingHelperM.initialize_from_finetuned_name(finetuned_name, separator='___') # Separator can be set to something else
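
To see why parsing a finetuned name cannot recover every parameter, consider the toy encoding below. The field order and the `toy_build_name` and `toy_parse_name` functions are assumptions made for illustration; the real name layout is whatever `get_finetuned_model_name()` prints above.

```python
# Hypothetical field order; the real finetuned name may be laid out differently.
FIELDS = ("base_model_name", "task", "seq_len", "epochs", "learning_rate")


def toy_build_name(values: dict, separator: str = "___") -> str:
    """Join the selected fields into one string (illustrative only)."""
    return separator.join(str(values[field]) for field in FIELDS)


def toy_parse_name(name: str, separator: str = "___") -> dict:
    """Split the name back into fields; all values come back as strings."""
    parts = name.split(separator)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


values = {
    "base_model_name": "prokbert-mini-long",
    "task": "phage-lifestyle",
    "seq_len": 256,
    "epochs": 1,
    "learning_rate": 0.001,
    "batch_size": 64,  # never written into the name
}
name = toy_build_name(values)
parsed = toy_parse_name(name)
print(name)                    # prokbert-mini-long___phage-lifestyle___256___1___0.001
print("batch_size" in parsed)  # False: it has to be inferred or supplied separately
```

The separator has to be a string that never occurs inside a field, which is why a distinctive one such as `___` is a reasonable choice.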

## Lastly, sometimes we only need the training parameters
These can be accessed by calling the `get_training_params` method like this:

In [None]:
model_helper.get_training_params()
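
One common use of such a dictionary is computing the effective batch size per optimizer step. The sketch below uses a hand-written dictionary with the keys documented for `get_training_params()`; the values are placeholders, and the helper function is hypothetical.

```python
# Placeholder values; the keys follow the ones listed for get_training_params().
params = {
    "batch_size": 64,
    "gradient_accumulation_steps": 4,
    "seq_len": 512,
    "learning_rate": 0.001,
    "epochs": 1,
}


def effective_batch_size(training_params: dict) -> int:
    """Samples contributing to one optimizer step on a single device."""
    return training_params["batch_size"] * training_params["gradient_accumulation_steps"]


samples_per_step = effective_batch_size(params)
max_tokens_per_step = samples_per_step * params["seq_len"]  # upper bound, assuming full-length sequences
print(samples_per_step)     # 256
print(max_tokens_per_step)  # 131072
```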