# How to use the RoBERTa model inside the NNTile framework

In [1]:
# Preliminary setup of experimental environment
import os
from pathlib import Path
import subprocess

nntile_dir = Path.cwd() / ".."

# Set environment variables
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Limit CUDA visibility
os.environ["OMP_NUM_THREADS"] = "1" # Disable BLAS parallelism
os.environ["PYTHONPATH"] = str(nntile_dir / "build" / "wrappers" / "python") # Path to a binary dir of NNTile Python wrappers

# All StarPU environment variables are available at https://files.inria.fr/starpu/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html
os.environ["STARPU_NCPU"] = "2" # Use only 2 CPU cores
os.environ["STARPU_NCUDA"] = "1" # Use only 1 CUDA device
os.environ["STARPU_SILENT"] = "1" # Do not show lots of StarPU outputs
os.environ["STARPU_SCHED"] = "dmdasd" # Name of the StarPU scheduler to use
os.environ["STARPU_FXT_TRACE"] = "0" # Do not generate FXT traces
os.environ["STARPU_WORKERS_NOBIND"] = "1" # Do not bind workers (it helps if several instances of StarPU run in parallel)
os.environ["STARPU_PROFILING"] = "1" # This enables logging performance of workers and bandwidth of memory nodes
os.environ["STARPU_HOME"] = str(Path.cwd() / "starpu") # Main directory in which StarPU stores its configuration files
os.environ["STARPU_PERF_MODEL_DIR"] = str(Path(os.environ["STARPU_HOME"]) / "sampling") # Main directory in which StarPU stores its performance model files
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CPU"] = "1" # Assume all CPU cores are equal
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CUDA"] = "1" # Assume all CUDA devices are equal
os.environ["STARPU_HOSTNAME"] = "Roberta_example" # Force the hostname to be used when managing performance model files
os.environ["STARPU_FXT_PREFIX"] = str(Path(os.environ["STARPU_HOME"]) / "fxt") # Directory to store FXT traces if enabled

## Prepare a dataset for masked language modeling with the RoBERTa model

- ```hf-dataset``` (str, default="roneneldan/TinyStories"): name of the dataset, as used by the ```datasets``` library to download it.
- ```dataset-path``` (str, default=".data"): path to the directory where prepared datasets are saved.
- ```dataset-select``` (int, default=100): number of texts, counted from the start of the dataset, that are used for training the model.
- ```hf-tokenizer``` (str, default="bert-base-uncased"): name of the tokenizer used to train the masked language model.
- ```tokenizer-path``` (str, default=".model"): path to the directory where the tokenizer data is stored.
- ```seq-len``` (int, default=1024): length of the input token sequence for training.
- ```batch-size``` (int, default=1): batch size for training, i.e. the number of sequences of ```seq-len``` tokens processed between optimizer steps.
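
To get a feel for how ```seq-len``` and ```batch-size``` interact, the following minimal sketch counts how many full sequences and full batches a tokenized dataset yields. It is not taken from ```mlm_data_preparation.py```: the helper name ```count_batches```, the token count, and the rule that incomplete sequences and batches are dropped are assumptions made for illustration.

```python
def count_batches(n_tokens, seq_len, batch_size):
    """Return (number of full sequences, number of full batches).

    Assumption: tokens are packed back to back, and an incomplete
    sequence or batch at the end of the data is dropped.
    """
    n_sequences = n_tokens // seq_len
    n_batches = n_sequences // batch_size
    return n_sequences, n_batches


# Hypothetical token count, combined with the values used in the cell below
n_sequences, n_batches = count_batches(n_tokens=50_000, seq_len=512, batch_size=8)
print(n_sequences, n_batches)
```

With these assumed numbers, each optimizer step consumes 8 sequences of 512 tokens.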

In [2]:
# Prepare TinyStories dataset into train.bin file
# Describe parameters and arguments
!python ../wrappers/python/examples/mlm_data_preparation.py --seq-len=512 \
                                                            --batch-size=8 \
                                                            --dataset-select=100 \
                                                            --hf-tokenizer="FacebookAI/roberta-base"

tokenizer_config.json: 100%|█████████████████| 25.0/25.0 [00:00<00:00, 78.1kB/s]
config.json: 100%|█████████████████████████████| 481/481 [00:00<00:00, 1.35MB/s]
vocab.json: 100%|████████████████████████████| 899k/899k [00:00<00:00, 4.76MB/s]
merges.txt: 100%|████████████████████████████| 456k/456k [00:00<00:00, 4.65MB/s]
tokenizer.json: 100%|██████████████████████| 1.36M/1.36M [00:00<00:00, 11.6MB/s]


## Arguments of the ```roberta_training.py``` script, which is used to run all the scenarios below

- ```remote-model-name``` (str, default="FacebookAI/roberta-base"): name of a BERT-architecture model hosted on Hugging Face that is used to initialize the configuration and the initial state of the NNTile model.
- ```pretrained``` (choices=["local", "remote"], default="local"): source of the pre-trained model. The ```remote``` option loads the model ```remote-model-name``` from Hugging Face. The ```local``` option requires a path to a configuration file (```config-path```) and starts training from a randomly initialized state, or continues training if a path to a checkpoint file (```checkpoint-path```) is also provided.
- ```checkpoint-path``` (str, default=""): path to a saved state of the pre-trained model weights. If the file is available, training continues from this state.
- ```config-path``` (str, default=""): path to a .json configuration file. In the current version it must be provided if ```pretrained``` is set to ```local```.
- ```save-checkpoint-path``` (str, default=".model"): path where the state of the model is saved at the end of the current training cycle.
- ```optimizer``` (choices=["sgd", "adam", "adamw"], default="adam"): type of optimizer used during training. The current version of NNTile supports these three optimization methods.
- ```model-path``` (str, default=".model"): path where models previously downloaded from Hugging Face are stored, so that they can be reused in later runs.
- ```seq-len``` (int, default=1024): length of the input token sequence for training.
- ```batch-size``` (int, default=1): batch size for training, i.e. the number of sequences of ```seq-len``` tokens processed between optimizer steps.
- ```minibatch-size``` (int, default=-1): batch size for which memory is allocated during training. The whole batch is split into whole minibatches. All minibatches of one batch are passed through the model one after another to accumulate the gradients of the parameters.
- ```minibatch-size-tile``` (type=int, default=-1): size of the pieces (tiles) into which the minibatch dimension is divided. Only "tiled" tensors with the size ```minibatch-size-tile``` along the corresponding axis are processed on the CPU and GPU.
- ```hidden-size-tile``` (type=int, default=-1): size of the pieces (tiles) into which the ```hidden size``` dimension is divided. This dimension, also known as ```embedding size```, is the size of the space into which incoming tokens are embedded. Only "tiled" tensors with the size ```hidden-size-tile``` along the corresponding axis are processed on the CPU and GPU.
- ```intermediate-size-tile``` (type=int, default=-1): size of the pieces (tiles) into which the ```intermediate size``` dimension is divided. Only "tiled" tensors with the size ```intermediate-size-tile``` along the corresponding axis are processed on the CPU and GPU.
- ```n-head-tile``` (type=int, default=-1): size of the pieces (tiles) into which the number of heads of the Transformer layer is divided. Only "tiled" tensors with the size ```n-head-tile``` along the corresponding axis are processed on the CPU and GPU.
- ```dtype``` (choices=["fp32", "fp64", "fp32_fast_tf32", "bf16", "fp32_fast_fp16", "fp32_fast_bf16"], default="fp32"): data type, chosen from those currently supported by the NNTile framework.
- ```restrict``` (choices=["cpu", "cuda", None], default=None): restricts the computing resources used during training. ```cpu``` limits training to CPU cores only, ```cuda``` limits training to GPU devices only, and ```None``` allows all available computing resources to be used.
- ```flash-attention``` (action="store_true"): a flag that enables the current implementation of the FlashAttention algorithm in the attention mechanism of Transformer-type neural networks (low-level FlashAttention kernels are currently not available).
- ```use-redux``` (action="store_true"): a flag that allows dependent tasks to be executed simultaneously, with their results then reduced into a single tensor.
- ```dataset-path``` (default=".data"): path to the directory where prepared datasets are saved.
- ```dataset-file``` (default=""): path (relative to ```dataset-path```) to the .bin file created by the data preparation script.
- ```lr``` (type=float, default=1e-4): step size of the optimization algorithm.
- ```nepochs``` (type=int, default=1): number of complete passes through the training set.
- ```label-mask-token``` (type=int, default=3): index of the token used to mask elements of the sequence. It must be consistent with the tokenizer in use, so that the index of the mask token does not coincide with the index of an ordinary token.
- ```n-masked-tokens-per-seq``` (type=int, default=1): number of tokens in each sequence that are randomly masked.
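
The last two parameters control how training targets are built for masked language modeling. The sketch below shows one plausible reading of them in plain Python. It is not the code of ```roberta_training.py```: the function ```mask_tokens```, the ```ignore_label``` value of -100 and the toy token ids are hypothetical and only illustrate the idea of replacing a few randomly chosen tokens with the mask token and predicting the originals.

```python
import random


def mask_tokens(token_ids, mask_token_id, n_masked, ignore_label=-100, seed=None):
    """Mask n_masked random positions of one sequence.

    Returns (inputs, labels): inputs hold mask_token_id at the chosen
    positions, labels hold the original token there and ignore_label
    elsewhere (ignore_label is an assumed convention, not an NNTile one).
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(token_ids)), n_masked)
    inputs = list(token_ids)
    labels = [ignore_label] * len(token_ids)
    for pos in positions:
        labels[pos] = token_ids[pos]
        inputs[pos] = mask_token_id
    return inputs, labels


# Toy sequence; the ids are made up and must not contain the mask id itself
tokens = [11, 12, 13, 14, 15, 16]
inputs, labels = mask_tokens(tokens, mask_token_id=3, n_masked=2, seed=0)
print(inputs)
print(labels)
```

The sketch also shows why the mask index must not coincide with an ordinary token: if 3 were a regular token in ```tokens```, masked and unmasked positions could no longer be told apart.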

## 1. Training from a random initial state and saving the weights of the trained model
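
The run in this section uses ```batch-size=8``` and ```minibatch-size=4```, so each batch is processed as two minibatches whose gradients are accumulated before one optimizer step. The following minimal plain-Python sketch illustrates that idea on a toy one-parameter model. It is not NNTile code: the helper names, the toy loss and the treatment of -1 as "use the whole batch" are assumptions made for illustration.

```python
def split_into_minibatches(batch, minibatch_size):
    """Split a batch into whole minibatches (assumed: -1 means one minibatch)."""
    if minibatch_size == -1:
        minibatch_size = len(batch)
    if len(batch) % minibatch_size != 0:
        raise ValueError("batch size must be a multiple of the minibatch size")
    return [batch[i:i + minibatch_size]
            for i in range(0, len(batch), minibatch_size)]


def accumulated_gradient(w, batch, minibatch_size):
    """Gradient of mean((w * x - y) ** 2) over the batch, summed per minibatch."""
    grad = 0.0
    for minibatch in split_into_minibatches(batch, minibatch_size):
        # Each minibatch adds its share of the full-batch gradient
        grad += sum(2.0 * (w * x - y) * x for x, y in minibatch) / len(batch)
    return grad


# Toy data with 8 samples, matching batch-size=8
batch = [(float(i), 2.0 * i) for i in range(8)]
full = accumulated_gradient(0.5, batch, minibatch_size=-1)
accumulated = accumulated_gradient(0.5, batch, minibatch_size=4)
print(len(split_into_minibatches(batch, 4)), abs(full - accumulated) < 1e-12)
```

Accumulating over minibatches gives the same gradient as processing the whole batch at once, while only one minibatch has to fit in memory at a time.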



In [3]:
!python ../wrappers/python/examples/roberta_training.py --pretrained=local \
                                                        --config-path="../wrappers/python/examples/roberta_config.json" \
                                                        --save-checkpoint-path=".model/nntile_checkpoint.pt" \
                                                        --optimizer="adam" --lr=1e-5 --dtype=fp32_fast_fp16 \
                                                        --nepochs=3  --batch-size=8 --minibatch-size=4 --seq-len=512 \
                                                        --dataset-file="tinystories/train.bin" --restrict="cuda"

2024-11-13 15:19:36.743737: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-13 15:19:36.773377: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='FacebookAI/roberta-base', pretrained='local', checkpoint_path='', config_path='../wrappers/python/examples/roberta_config.json', save_checkpoint_path='.model/nntile_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=8, minibatch_size=4, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_

## 2. Load the model weights from the checkpoint and continue training with a different data type.

As in the previous scenario, the ```pretrained``` parameter must be set to ```local``` and the ```config-path``` parameter must point to the ```.json``` configuration file. In addition, ```checkpoint-path``` must point to an existing PyTorch checkpoint file.
Training can be continued using a different data type and on a different set of compute nodes.
For example, here we switch to the ```fp32_fast_tf32``` data type.
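
As a small illustration of which arguments matter when resuming, the sketch below assembles the resume-specific part of the command line in Python. The helper ```resume_args``` is hypothetical; the flags and paths are the ones used in this notebook, and the remaining training options (optimizer, learning rate, batch sizes, dataset) would be appended in the same way.

```python
def resume_args(checkpoint_path, config_path, save_path, dtype):
    """Build the arguments that select 'continue from a local checkpoint'."""
    return [
        "--pretrained=local",
        f"--checkpoint-path={checkpoint_path}",
        f"--config-path={config_path}",
        f"--save-checkpoint-path={save_path}",
        f"--dtype={dtype}",
    ]


args = resume_args(
    checkpoint_path=".model/nntile_checkpoint.pt",
    config_path="../wrappers/python/examples/roberta_config.json",
    save_path=".model/nntile_further_checkpoint.pt",
    dtype="fp32_fast_tf32",
)
command = ["python", "../wrappers/python/examples/roberta_training.py", *args]
print(" ".join(command))
```

A list built this way can be passed to ```subprocess.run(command, check=True)``` instead of using the ```!python``` cell syntax.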

In [4]:
!python ../wrappers/python/examples/roberta_training.py --pretrained=local --checkpoint-path=".model/nntile_checkpoint.pt" \
                                                        --config-path="../wrappers/python/examples/roberta_config.json" \
                                                        --save-checkpoint-path=".model/nntile_further_checkpoint.pt" \
                                                        --optimizer="adam" --lr=1e-5 --dtype=fp32_fast_tf32 \
                                                        --nepochs=3 --batch-size=8 --minibatch-size=4 \
                                                        --dataset-file="tinystories/train.bin" --seq-len=512

2024-11-13 15:19:53.204296: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-13 15:19:53.233867: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='FacebookAI/roberta-base', pretrained='local', checkpoint_path='.model/nntile_checkpoint.pt', config_path='../wrappers/python/examples/roberta_config.json', save_checkpoint_path='.model/nntile_further_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=8, minibatch_size=4, minibatch_size_tile=-1, hidd

## 3. Continue training of a model loaded from the Hugging Face framework.

The NNTile framework currently supports continuing the training of a model loaded from a remote source, which in this example is Hugging Face.
The weights of the loaded model are passed to the model implemented in NNTile.
To run such a scenario, the ```pretrained``` parameter must be set to ```remote```.
The ```config-path``` and ```checkpoint-path``` parameters are no longer required, as the model configuration and layer weights will be obtained from the loaded model.
Training can be continued using any data type and on any compute nodes that support the selected data type.
In the example below, we switch to the ```bf16``` type.
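
The three scenarios of this notebook differ only in how ```pretrained```, ```config-path``` and ```checkpoint-path``` are combined. The sketch below restates those rules as a small Python function. It mirrors the description given in this notebook, not the actual argument checking of ```roberta_training.py```; the function name and the returned labels are made up for illustration.

```python
def training_mode(pretrained, config_path="", checkpoint_path=""):
    """Classify a combination of arguments into one of the three scenarios."""
    if pretrained == "remote":
        # Configuration and weights come from the downloaded model
        return "remote"
    if pretrained == "local":
        if not config_path:
            raise ValueError("pretrained=local requires config-path")
        # A checkpoint means 'continue', otherwise start from random weights
        return "resume" if checkpoint_path else "random-init"
    raise ValueError(f"unknown value for pretrained: {pretrained!r}")


print(training_mode("local", config_path="roberta_config.json"))
print(training_mode("local", "roberta_config.json", ".model/nntile_checkpoint.pt"))
print(training_mode("remote"))
```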

In [5]:
!python ../wrappers/python/examples/roberta_training.py --restrict="cuda" --pretrained=remote \
                                                        --save-checkpoint-path=".model/nntile_remote_checkpoint.pt" \
                                                        --optimizer="adam" --lr=1e-13 --dtype=bf16 --nepochs=3 \
                                                        --batch-size=8 --minibatch-size=4 --seq-len=512  \
                                                        --dataset-file="tinystories/train.bin"

2024-11-13 15:20:52.454531: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-13 15:20:52.484440: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='FacebookAI/roberta-base', pretrained='remote', checkpoint_path='', config_path='', save_checkpoint_path='.model/nntile_remote_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=8, minibatch_size=4, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='bf16', 