### 1. Environment variable setting block:

The following cell sets the environment variables that are read while the example scripts run.

Users can change these environment variables between runs.

In [2]:
# Preliminary setup of experimental environment
import os
from pathlib import Path
import subprocess

nntile_dir = Path.cwd() / ".."

# Set environment variables
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Limit CUDA visibility
os.environ["OMP_NUM_THREADS"] = "1" # Disable BLAS parallelism
os.environ["PYTHONPATH"] = str(nntile_dir / "build" / "wrappers" / "python") # Path to a binary dir of NNTile Python wrappers

# All StarPU environment variables are available at https://files.inria.fr/starpu/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html
os.environ["STARPU_NCPU"] = "1" # Use only 1 CPU core
os.environ["STARPU_NCUDA"] = "1" # Use only 1 CUDA device
os.environ["STARPU_SILENT"] = "1" # Do not show lots of StarPU outputs
os.environ["STARPU_SCHED"] = "dmdasd" # Name of the StarPU scheduler to use
os.environ["STARPU_FXT_TRACE"] = "0" # Do not generate FXT traces
os.environ["STARPU_WORKERS_NOBIND"] = "1" # Do not bind workers (it helps if several instances of StarPU run in parallel)
os.environ["STARPU_PROFILING"] = "0" # Set to "1" to log performance of workers and bandwidth of memory nodes (disabled here)
os.environ["STARPU_BUS_STATS"] = "1" # This enables logging of bus usage, printed at the end of execution
os.environ["STARPU_HOME"] = str(Path.cwd() / "starpu") # Main directory in which StarPU stores its configuration files
os.environ["STARPU_PERF_MODEL_DIR"] = str(Path(os.environ["STARPU_HOME"]) / "sampling") # Main directory in which StarPU stores its performance model files
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CPU"] = "1" # Assume all CPU cores are equal
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CUDA"] = "1" # Assume all CUDA devices are equal
os.environ["STARPU_HOSTNAME"] = "Llama_LMHead_example" # Force the hostname to be used when managing performance model files
os.environ["STARPU_FXT_PREFIX"] = str(Path(os.environ["STARPU_HOME"]) / "fxt") # Directory to store FXT traces if enabled
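
The `!python` cells below run as child processes of the notebook kernel and inherit `os.environ`, so it can help to check which StarPU settings they will see. A minimal sketch; the helper name `starpu_settings` is hypothetical and is not part of NNTile or StarPU:

```python
import os

def starpu_settings(environ=None):
    """Return the STARPU_* variables from a mapping, sorted by name."""
    if environ is None:
        environ = os.environ
    return {name: environ[name] for name in sorted(environ) if name.startswith("STARPU_")}

# Print the settings that a child process started from this kernel would inherit.
for name, value in starpu_settings().items():
    print(f"{name}={value}")
```

Passing an explicit mapping instead of `os.environ` makes the helper easy to check without touching the real environment.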

### 2. Data Preparation Block: 

This block runs the script "causal_lm_data_preparation.py". The script accepts the following arguments:
- hf-dataset, (default=`"roneneldan/TinyStories"`): name of the dataset to process and prepare for training. By default, the "TinyStories" dataset from Hugging Face is used,
- dataset-path, (default=`".data"`): path to the directory where datasets downloaded from remote sources are saved for reuse,
- dataset-select, (`int`, default=`100`): number of records from the original dataset that are included in the training set,
- hf-tokenizer, (`str`, default=`"kimihailv/llama-1.3b"`): Hugging Face repository that provides the tokenizer,
- tokenizer-path, (`str`, default=`".model"`): path to the directory where previously downloaded tokenizers are saved,
- seq-len, (`int`, default=`1024`): length of the input token sequence used for training,
- batch-size, (`int`, default=`1`): batch size for training, that is, the number of input sequences processed between consecutive optimizer steps.
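
The argument list above can be mirrored as a command-line parser. This is only a sketch: it assumes the script uses `argparse` with exactly the defaults listed, the function name `build_data_prep_parser` is hypothetical, and the real parser may differ.

```python
import argparse

def build_data_prep_parser():
    """Sketch of a parser with the documented data preparation arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the data preparation CLI")
    parser.add_argument("--hf-dataset", default="roneneldan/TinyStories")
    parser.add_argument("--dataset-path", default=".data")
    parser.add_argument("--dataset-select", type=int, default=100)
    parser.add_argument("--hf-tokenizer", type=str, default="kimihailv/llama-1.3b")
    parser.add_argument("--tokenizer-path", type=str, default=".model")
    parser.add_argument("--seq-len", type=int, default=1024)
    parser.add_argument("--batch-size", type=int, default=1)
    return parser

# Parse the same options that the data preparation cell passes.
args = build_data_prep_parser().parse_args(
    ["--seq-len=512", "--batch-size=256", "--dataset-select=5000"]
)
print(args.seq_len, args.batch_size, args.dataset_select, args.hf_tokenizer)
```

Options that are not passed keep their defaults, which is why the tokenizer in the run below is still `kimihailv/llama-1.3b`.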

In [3]:
# Prepare TinyStories dataset into train.bin file
!python ../wrappers/python/examples/causal_lm_data_preparation.py --seq-len=512 --batch-size=256 --dataset-select=5000

Generating train split: 100%|█| 2119719/2119719 [00:07<00:00, 287866.41 examples
Generating validation split: 100%|█| 21990/21990 [00:00<00:00, 300242.66 example
tokenizer_config.json: 1.60kB [00:00, 5.64MB/s]
tokenizer.model: 100%|████████████████████████| 500k/500k [00:01<00:00, 477kB/s]
tokenizer.json: 1.84MB [00:00, 19.0MB/s]
added_tokens.json: 100%|██████████████████████| 51.0/51.0 [00:00<00:00, 145kB/s]
special_tokens_map.json: 100%|█████████████████| 547/547 [00:00<00:00, 1.89MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama to

### 3. Example Scenarios

Below we show an example of utilizing the Llama model, implemented using the NNTile framework. We explore the following scenarios:

- **Training the model from a random initial state and saving it to a checkpoint.**
- **Loading the model weights from a checkpoint and continuing training with a different data type.**
- **Loading a pretrained model from a remote source and continuing training.**

The training and continued-training scenarios use the script "llama_training.py". The script accepts the following arguments:

- remote_model_name, (`str`, default=`"kimihailv/llama-1.3b"`): name of the Llama-based model hosted on Hugging Face that is used to initialize the configuration and the initial state of the NNTile model.

- pretrained, (choices=`["local", "remote"]`, default=`"local"`): location of the pretrained model. The `local` option requires a configuration path (`config-path`); training then starts from a randomly initialized state unless a checkpoint (`checkpoint-path`) is provided, in which case it continues from the saved checkpoint state.

- checkpoint-path, (`str`, default=`""`): path to a saved checkpoint file from which training is resumed, if provided.

- config-path, (`str`, default=`""`): path to the configuration .json file; in the current version it must be provided if `pretrained` is set to `"local"`.

- save-checkpoint-path, (`str`, default=`".model"`): path where intermediate checkpoints are saved during training.

- optimizer, (choices=`["sgd", "adam", "adamw"]`, default=`"adam"`): optimizer used for training; the current version of NNTile supports these three optimization methods.

- model-path, (`str`, default=`".model"`): path to the directory where previously downloaded remote models are stored for reuse.

- seq-len, (`int`, default=`1024`): length of the input token sequence for training.

- seq-len-tile, (`int`, default=`-1`): size of the tiles into which the sequence length is split.

- batch-size, (`int`, default=`1`): batch size for training, that is, the number of sequences of seq-len tokens processed between consecutive optimizer steps.

- minibatch-size, (`int`, default=`-1`): batch size for which memory is allocated during training. Each batch is divided into a whole number of minibatches, which are fed through the model one by one to accumulate parameter gradients.

- minibatch-size-tile, (`int`, default=`-1`): batch size that is sent to the CPU or GPU for computation. Each minibatch must split into a whole number of minibatch tiles.

- hidden-size-tile, (`int`, default=`-1`): the size of the pieces (tiles) into which the "hidden size" dimension is divided (also known as "embedding size") – the size of the multidimensional space into which incoming tokens are mapped. Only "piecewise" tensors of size hidden-size-tile along the corresponding axis are processed on the CPU and GPU.

- intermediate-size-tile, (`int`, default=`-1`): the size of the pieces (tiles) into which the "intermediate size" dimension is divided. Only "piecewise" tensors of size intermediate-size-tile along the corresponding axis are processed on the CPU and GPU.

- n-head-tile, (`int`, default=`-1`): the size of the pieces (tiles) into which the number of heads of the transformer layer is divided. Only "piecewise" tensors of size n-head-tile along the corresponding axis are processed on the CPU and GPU.

- dtype, (choices=`["fp32", "fp64", "tf32", "bf16", "fp32_fast_fp16", "fp32_fast_bf16"]`, default=`"fp32"`): data type used for training; the choices listed are the data types supported by NNTile.

- restrict, (choices=`["cpu", "cuda", None]`, default=`None`): restricts the computational resources used during training; `"cpu"` limits training to CPU cores only, `"cuda"` limits it to GPU cores only, and `None` allows training on all available cores.

- use-redux, (action=`"store_true"`): boolean flag that, when present on the command line, allows dependent tasks to be computed simultaneously, with their results subsequently reduced into a single tensor.

- dataset-path, (default=`".data"`): path to the directory where previously prepared datasets are saved.

- dataset-file, (default=`""`): path (relative to dataset-path) to the .bin file that is created in the block of data preparation for training.

- lr, (`float`, default=`1e-4`): learning rate (step size) of the optimization algorithm.

- nepochs, (`int`, default=`1`): number of full passes through the training set.
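
The divisibility rules for `batch-size`, `minibatch-size` and `minibatch-size-tile` described above can be checked with a few lines of arithmetic. This is a sketch under one assumption that the text does not state: a value of `-1` is treated as "do not split". The function names are hypothetical.

```python
def count_parts(total, part, name):
    """Number of equal parts of size `part` in `total`; -1 means 'do not split' (assumption)."""
    if part == -1:
        return 1
    if part <= 0 or total % part != 0:
        raise ValueError(f"{name}: {total} is not divisible by {part}")
    return total // part

def batch_layout(batch_size, minibatch_size=-1, minibatch_size_tile=-1):
    """Return (minibatches per batch, tiles per minibatch)."""
    n_minibatches = count_parts(batch_size, minibatch_size, "minibatch-size")
    effective_minibatch = batch_size if minibatch_size == -1 else minibatch_size
    n_tiles = count_parts(effective_minibatch, minibatch_size_tile, "minibatch-size-tile")
    return n_minibatches, n_tiles

print(batch_layout(256, 8))     # (32, 1): 32 minibatches per batch, one tile each
print(batch_layout(256, 8, 4))  # (32, 2): each minibatch is split into two tiles
```

With the values used in the training cells below (`--batch-size=256 --minibatch-size=8`), gradients are accumulated over 32 minibatches before each optimizer step.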

#### 3.1. Training from a random initial state and saving to a checkpoint.

This requires option `pretrained` to be set to `local` and `config-path` to point to a previously created `.json` configuration file.

In this example, we start training in the fp32 type.

In [4]:
# Launch an external Python process to train a LLaMA model from a random initial state on TinyStories
!python ../wrappers/python/examples/llama_training.py \
    --restrict="cuda" --pretrained=local --config-path="../wrappers/python/examples/llama_1.3b_config.json" \
    --save-checkpoint-path=".model/nntile_checkpoint.pt" --optimizer="adam" --seq-len=512 --lr=1e-4 --dtype=fp32 --nepochs=1 \
    --batch-size=256 --minibatch-size=8 --dataset-file="tinystories/train.bin"

2025-11-15 15:55:55.632791: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='kimihailv/llama-1.3b', pretrained='local', checkpoint_path='', config_path='../wrappers/python/examples/llama_1.3b_config.json', save_checkpoint_path='.model/nntile_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=256, minibatch_size=8, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='fp32', restrict='cuda', flash_attention=False, use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
LlamaConfig {
  "_attn_implementation_autoset": true,
  "activa

#### 3.2. Resume training from the local checkpoint.

This again requires option `pretrained` to be set to `local` and `config-path` to point to the previously created `.json` configuration file; in addition, `checkpoint-path` must point to an existing checkpoint file in the PyTorch format.

Training can be resumed with a different data type and on a different set of compute nodes. For example, here we switch to the bf16 type while still restricting computation to GPUs.

In [5]:
# Launch an external Python process to continue training an NNTile Llama model from a local checkpoint on TinyStories
!python ../wrappers/python/examples/llama_training.py \
    --restrict="cuda" --pretrained=local --checkpoint-path=".model/nntile_checkpoint.pt" \
    --config-path="../wrappers/python/examples/llama_1.3b_config.json" \
    --save-checkpoint-path=".model/nntile_further_checkpoint.pt" --optimizer="adam" --seq-len=512 --lr=1e-4 --dtype=bf16 \
    --nepochs=1 --batch-size=256 --minibatch-size=8 --dataset-file="tinystories/train.bin"

2025-11-15 16:46:03.845496: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='kimihailv/llama-1.3b', pretrained='local', checkpoint_path='.model/nntile_checkpoint.pt', config_path='../wrappers/python/examples/llama_1.3b_config.json', save_checkpoint_path='.model/nntile_further_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=256, minibatch_size=8, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='bf16', restrict='cuda', flash_attention=False, use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
LlamaConfig {
  "_attn_imple

#### 3.3. Training from a remote model and saving to a checkpoint.

Our framework currently supports continuing the training of a model obtained from a remote source, as we show here with the Hugging Face library. The weights of the loaded model are transferred into the model implemented in NNTile. Consequently, training can continue with any data type and on any set of compute nodes that support the selected data type.

This requires option `pretrained` to be set to `remote`. Options `config-path` and `checkpoint-path` are no longer needed, since both the model config and the layer weights are obtained from the remote model.
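
The option rules of the three scenarios (random start, resume from checkpoint, remote model) can be summarized in a small helper that builds the scenario-specific part of the command line. This is an illustrative sketch only: `training_args` is a hypothetical function, not part of NNTile, and it encodes just the rules stated in this notebook.

```python
def training_args(pretrained, config_path="", checkpoint_path="", options=None):
    """Build the scenario-specific part of a llama_training.py command line."""
    if pretrained not in ("local", "remote"):
        raise ValueError("pretrained must be 'local' or 'remote'")
    args = [f"--pretrained={pretrained}"]
    if pretrained == "local":
        # A local run always needs the model configuration file.
        if not config_path:
            raise ValueError("config-path is required when pretrained is 'local'")
        args.append(f"--config-path={config_path}")
        # With a checkpoint, training resumes; without one, weights start random.
        if checkpoint_path:
            args.append(f"--checkpoint-path={checkpoint_path}")
    # For 'remote', config and weights come from the remote model,
    # so neither path is passed.
    for name, value in (options or {}).items():
        args.append(f"--{name}={value}")
    return args

print(training_args("local", config_path="llama_1.3b_config.json", options={"dtype": "fp32"}))
print(training_args("remote", options={"dtype": "bf16"}))
```

The resulting list can be appended to `["python", "../wrappers/python/examples/llama_training.py"]` and passed to `subprocess.run` instead of using the `!python` cell syntax.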

In [6]:
# Launch an external Python process to finetune a pretrained Llama model downloaded from a remote source on TinyStories
!python ../wrappers/python/examples/llama_training.py \
    --restrict="cuda" --pretrained=remote --save-checkpoint-path=".model/nntile_remote_checkpoint.pt"\
    --optimizer="adam" --seq-len=512 --lr=1e-4 --dtype=bf16 --nepochs=1 --batch-size=256 --minibatch-size=8 \
    --dataset-file="tinystories/train.bin"

2025-11-15 17:31:26.153856: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Namespace(remote_model_name='kimihailv/llama-1.3b', pretrained='remote', checkpoint_path='', config_path='', save_checkpoint_path='.model/nntile_remote_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=256, minibatch_size=8, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='bf16', restrict='cuda', flash_attention=False, use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
config.json: 100%|█████████████████████████████| 673/673 [00:00<00:00, 2.67MB/s]
model.safetensors.index.

### 4. Inference process.

In the current version of the Llama scenario, the NNTile model is created from a pre-trained Llama model loaded from the Hugging Face library. The layer weights are passed to the corresponding NNTile model layers, and inference is then performed solely by NNTile, without any involvement of third-party models or mechanisms. To run inference, we use another script, "llama_generate.py". The script accepts the following arguments:

- cache_dir, (`str`, default=`"cache_hf"`): path to the directory where previously downloaded remote models are saved,
- max-seq-len, (`int`, default=`1024`): maximum length of the input token sequence,
- remote-model-name, (`str`, default=`"kimihailv/llama-1.3b"`): name of the Llama-based model hosted on Hugging Face that is used to initialize the configuration and the initial state of the NNTile model,
- restrict, (choices=`["cpu", "cuda", None]`, default=`None`): limit on the computing resources used during inference; `"cpu"` restricts inference to CPU cores only, `"cuda"` to GPU cores only, while `None` allows using all available cores,
- prompt, (`str`, default=`"What do you think about dogs?"`): input query, a string fed to the model as the starting point for generation,
- generation-mode, (choices=`["Greedy", "TopK", "TopP"]`, default=`"Greedy"`): token generation mode, a value of the GenerationMode class (described in the "llm_params.py" file),
- parallel-sampling-mode, (choices=`["BeamSearch", "Parallel"]`, default=`"BeamSearch"`): mode for generating multiple responses to a single query in parallel, a value of the ParallelSamplingMode class (described in the "llm_params.py" file),
- max-tokens, (`int`, default=`100`): maximum number of generated tokens, including the tokens of the user prompt,
- use-cache, (action=`"store_true"`): boolean flag that, when present on the command line, enables KV caches so that previously computed values are reused,
- top-k, (`int`, default=`None`): probabilistic selection among the top-k most probable tokens,
- top-p-thr, (`float`, default=`None`): probabilistic selection among tokens whose probability is not lower than the top-p-thr threshold,
- temperature, (`float`, default=`1.0`): "temperature" parameter for token generation,
- num-beams, (`int`, default=`1`): number of beams for the parallel generation mode.

#### 4.1. Examples with different types of generation strategies
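
As background for the `num-beams` option, the sketch below shows textbook beam search on a toy next-token table. It is a generic illustration, not NNTile's implementation; `TOY_MODEL` and `beam_search` are hypothetical names, and the probabilities are made up.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"b": 0.9, "a": 0.1},
}

def beam_search(model, start, num_beams, steps):
    """Keep the `num_beams` highest-scoring sequences after every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

print(beam_search(TOY_MODEL, "<s>", num_beams=1, steps=2)[0][0])
print(beam_search(TOY_MODEL, "<s>", num_beams=2, steps=2)[0][0])
```

With a single beam the procedure reduces to greedy decoding, which always takes the locally most probable token. With two beams, the toy example finds a sequence with a higher overall probability that greedy decoding misses.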

`BeamSearch` generation strategy and number of beams set to the default value of `1`.

In [7]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:08:58.006593: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 19.78it/s]
<s> Why does the Sun shine?
The S

`Parallel` generation strategy and number of beams set to `3`.

In [8]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --num-beams=3 --parallel-sampling-mode=Parallel \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:09:42.987985: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 24.53it/s]
["<s> Why does the Sun shine? Wha

`BeamSearch` generation strategy and number of beams set to `3`.

In [17]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --num-beams=3 --parallel-sampling-mode=BeamSearch \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:15:42.467011: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 17.43it/s]
['<s> Why does the Sun shine?\nTh

#### 4.2. Examples with different token generation modes and temperatures
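
To make the role of `temperature` and `top-k` concrete, here is a generic sketch of temperature-scaled softmax restricted to the k highest-scoring tokens. It illustrates the standard technique and is not taken from NNTile's code; the function names and the logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k, temperature=1.0):
    """Keep the k largest logits, renormalize them, and zero out the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = softmax([logits[i] for i in keep], temperature)
    probs = [0.0] * len(logits)
    for index, prob in zip(keep, kept):
        probs[index] = prob
    return probs

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
for temperature in (0.01, 1.0, 100.0):
    print(temperature, [round(p, 3) for p in top_k_probs(logits, k=3, temperature=temperature)])
```

A very low temperature concentrates almost all probability on the best token, so sampling behaves like greedy decoding; a very high temperature makes the kept tokens nearly equally likely.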

`TopK` token generation strategy with the default temperature value of `1.0`.

In [1]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:35:38.807796: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 17.64it/s]
<s> Why does the Sun shine? Becau

`TopK` token generation strategy with the temperature value of `100.0`.

In [12]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 --temperature=100.0 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:12:41.238679: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 21.25it/s]
<s> Why does the Sun shine? And o

`TopK` token generation strategy with the temperature value of `0.01`.

In [13]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 --temperature=0.01 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:13:09.647777: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 20.64it/s]
<s> Why does the Sun shine?
The S

`TopP` token generation strategy with the default temperature value of `1.0`.

In [14]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:13:38.006329: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 19.08it/s]
<s> Why does the Sun shine?
The S

`TopP` token generation strategy with the temperature value of `100.0`.

In [15]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 --temperature=100.0 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:14:05.669373: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 16.97it/s]
<s> Why does the Sun shine? also 

`TopP` token generation strategy with the temperature value of `0.01`.

In [16]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=512 \
    --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 --temperature=0.01 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=40

2025-11-15 18:14:32.349593: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Notice:
 None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 18.97it/s]
<s> Why does the Sun shine?
The S

---
Next, we demonstrate the same workflow on another LLaMA model with a larger number of parameters. To do this, we redo the data preparation and adjust the training parameters to make the more complex model a little easier to train. Therefore, in the following sections, we repeat blocks 2 through 4, but now for a different model.

The model is hosted on Hugging Face: [LLaMA 2 7B Model](https://huggingface.co/unsloth/llama-2-7b)

---

### 5. Repeat the Data Preparation Block: change the model

This block again runs the script "causal_lm_data_preparation.py". Everything remains the same, except that we change the seq-len and dataset-select values and replace the default tokenizer with the one from our new model of choice.
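
The effect of the shorter sequence length on the amount of data per optimizer step can be estimated directly. The sketch assumes, as the argument descriptions state, that each step consumes batch-size sequences of seq-len tokens; the function name is hypothetical.

```python
def tokens_per_step(seq_len, batch_size):
    """Tokens consumed by one optimizer step: batch_size sequences of seq_len tokens."""
    return seq_len * batch_size

first_model = tokens_per_step(seq_len=512, batch_size=256)   # settings of block 2
second_model = tokens_per_step(seq_len=128, batch_size=256)  # settings of this block
print(first_model, second_model, first_model // second_model)
```

With an unchanged batch size, reducing seq-len from 512 to 128 cuts the tokens processed per optimizer step by a factor of four.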

In [45]:
# Prepare TinyStories dataset into train.bin file
!python ../wrappers/python/examples/causal_lm_data_preparation.py --hf-tokenizer="unsloth/llama-2-7b" \
    --seq-len=128 --batch-size=256 --dataset-select=3000

### 6. Example scenarios

Below we show an example of utilizing the Llama model, implemented using the NNTile framework. We explore the following scenarios:

- **Training the model from a random initial state and saving it to a checkpoint.**
- **Loading pretrained model from remote source and continuing training**

#### 6.1. Training from a random initial state and saving to a checkpoint.

This requires option `pretrained` to be set to `local` and `config-path` to point to a previously created `.json` configuration file.

In [None]:
# Launch an external Python process to train a LLaMA model from a random initial state on TinyStories
!python ../wrappers/python/examples/llama_training.py \
    --remote_model_name="unsloth/llama-2-7b" --restrict="cuda" --pretrained=local --config-path="../wrappers/python/examples/llama_2.7b_config.json" \
    --save-checkpoint-path=".model/nntile_checkpoint_llama_2-7.pt" --optimizer="adam" --seq-len=128 --lr=1e-4 --dtype=bf16 --nepochs=1 \
    --batch-size=256 --minibatch-size=8 --dataset-file="tinystories/train.bin"

Namespace(remote_model_name='unsloth/llama-2-7b', pretrained='local', checkpoint_path='', config_path='../wrappers/python/examples/llama_2.7b_config.json', save_checkpoint_path='.model/nntile_checkpoint_llama_2-7.pt', optimizer='adam', model_path='.model', seq_len=128, seq_len_tile=-1, batch_size=256, minibatch_size=8, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='bf16', restrict='cuda', flash_attention=False, use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
LlamaConfig {
  "_attn_implementation_autoset": true,
  "activation_function": "silu",
  "architectures": [
    "LlamaCasualForLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bf16",
  "eos_token_id": 2,
  "flashattention": false,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range":

#### 6.2. Training from a remote model and saving to a checkpoint.

Our framework currently supports continuing the training of a model obtained from a remote source, shown here with the Hugging Face library. The weights of the loaded model are transferred into the model implemented in NNTile. Training can then continue using any data type and on any set of compute nodes that support the selected data type.

This requires the `pretrained` option to be set to `remote`. The `config-path` and `checkpoint-path` options are no longer needed, since both the model configuration and the layer weights are obtained from the remote model.

In [6]:
# Launch an external Python process to fine-tune a pretrained LLaMA model, downloaded from a remote source, on TinyStories
!python ../wrappers/python/examples/llama_training.py \
    --remote_model_name="unsloth/llama-2-7b" --restrict="cuda" --pretrained=remote --save-checkpoint-path=".model/nntile_remote_checkpoint_llama_2-7b.pt" \
    --optimizer="adam" --seq-len=128 --lr=1e-4 --dtype=bf16 --nepochs=1 --batch-size=256 --minibatch-size=8 \
    --dataset-file="tinystories/train.bin"

Namespace(remote_model_name='unsloth/llama-2-7b', pretrained='remote', checkpoint_path='', config_path='', save_checkpoint_path='.model/nntile_remote_checkpoint_llama_2-7b.pt', optimizer='adam', model_path='.model', seq_len=128, seq_len_tile=-1, batch_size=256, minibatch_size=8, minibatch_size_tile=-1, hidden_size_tile=-1, intermediate_size_tile=-1, n_head_tile=-1, dtype='bf16', restrict='cuda', flash_attention=False, use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
Loading checkpoint shards: 100%|██████████████████| 3/3 [01:43<00:00, 34.63s/it]
LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "unsloth/llama-2-7b",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range

### 7. Inference process.

This is the analog of **Scenario 4** above, applied to the [LLaMA 2 7B Model](https://huggingface.co/unsloth/llama-2-7b). The model layer weights are passed to the corresponding NNTile model layers, and inference is then performed solely by NNTile, without involving any third-party models or mechanisms. To perform inference, we use another script, `llama_generate.py`.

At least one argument must be changed: `remote-model-name`. Here we use `--remote-model-name="unsloth/llama-2-7b"`.

#### 7.1. Examples with different types of generation strategies

`BeamSearch` generation strategy and number of beams set to the default value of `1`.

In [14]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.05s/it]
<s> Why does the Sun shine?
Why does the Sun shine? The Sun is a star, and stars shine because they are hot. The Sun is hot because it is a big ball of gas. The Sun is a big ball

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	540.1003 GB	2930.3453 MB/s	(transfers : 5346 - avg 103.4536 MB)
	CUDA 0 -> NUMA 0	13.0563 GB	70.8378 MB/s	(transfers : 323 - avg 41.3922 MB)
Total transfers: 553.1566 GB
#---------------------
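
The `Data transfer stats` block at the end of each run is printed by StarPU when `STARPU_BUS_STATS` is set to `1`, as in the environment block of this notebook. To compare runs programmatically, the lines can be parsed with a small helper. The sketch below assumes the line format shown above; the regular expression and the function name `parse_bus_stats` are illustrative and not part of StarPU or NNTile.

```python
import re

# Pattern for lines such as:
#   NUMA 0 -> CUDA 0  540.1003 GB  2930.3453 MB/s  (transfers : 5346 - avg 103.4536 MB)
STATS_LINE = re.compile(
    r"(?P<src>\w+ \d+) -> (?P<dst>\w+ \d+)\s+(?P<total_gb>[\d.]+) GB\s+"
    r"(?P<rate>[\d.]+) MB/s\s+\(transfers : (?P<count>\d+) - avg (?P<avg_mb>[\d.]+) MB\)"
)

def parse_bus_stats(text):
    """Extract one record per transfer direction from a bus statistics block."""
    records = []
    for match in STATS_LINE.finditer(text):
        records.append({
            "src": match["src"],
            "dst": match["dst"],
            "total_gb": float(match["total_gb"]),
            "rate_mb_s": float(match["rate"]),
            "count": int(match["count"]),
            "avg_mb": float(match["avg_mb"]),
        })
    return records

# Lines copied from the run above.
sample = (
    "\tNUMA 0 -> CUDA 0\t540.1003 GB\t2930.3453 MB/s\t(transfers : 5346 - avg 103.4536 MB)\n"
    "\tCUDA 0 -> NUMA 0\t13.0563 GB\t70.8378 MB/s\t(transfers : 323 - avg 41.3922 MB)\n"
)
records = parse_bus_stats(sample)
for rec in records:
    # Cross-check: count * average size should reproduce the reported total,
    # assuming 1 GB is counted as 1024 MB.
    recomputed_gb = rec["count"] * rec["avg_mb"] / 1024
    print(rec["src"], "->", rec["dst"], rec["total_gb"], round(recomputed_gb, 2))
```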


`Parallel` generation strategy and number of beams set to `3`.

In [34]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --num-beams=3 --parallel-sampling-mode=Parallel \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.18s/it]
['<s> Why does the Sun shine? Why does the Moon have phases? Why does the sky appear blue? Why does the sky appear black at night? Why does the sky appear red at sunrise and sunset? Why does the sky appear', '<s> Why does the Sun shine? What is the difference between a star and a planet? How do we know that the Earth is round? How do we know that the Earth is not flat? How do we know that the Earth is not the', '<s> Why does the Sun shine?\nWhy does the Sun shine? The Sun is a star, and stars shine because they are hot. The Sun is hot because it is a big ball of gas. The Sun is a big ball']

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	1536.8503 GB	2203.1329 MB/s	(transfers : 14733 - avg 106.8170 MB)
	CUDA 0 -> NUMA 0	13.0661 GB	18.7308 MB/s	(transfers : 420 - avg 31.8564 MB)
Total transfers: 1549.9164 GB
#---------------------


`BeamSearch` generation strategy and number of beams set to `3`.

In [35]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --num-beams=3 --parallel-sampling-mode=BeamSearch \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:19<00:00,  6.34s/it]
['<s> Why does the Sun shine?\nWhy does the Sun shine? Why does the Sun shine?\nWhy does the Sun shine? Why does the Sun shine? Why does the Sun shine?\nWhy does the Sun', '<s> Why does the Sun shine?\nWhy does the Sun shine? Why does the Sun shine?\nWhy does the Sun shine? Why does the Sun shine? Why does the Sun shine? Why does the Sun sh', '<s> Why does the Sun shine?\nWhy does the Sun shine? Why does the Sun shine? Why does the Sun shine? Why does the Sun shine? Why does the Sun shine? Why does the Sun shine']

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	1240.8964 GB	2757.2342 MB/s	(transfers : 11862 - avg 107.1217 MB)
	CUDA 0 -> NUMA 0	10.6082 GB	23.5712 MB/s	(transfers : 360 - avg 30.1745 MB)
Total transfers: 1251.5046 GB
#---------------------
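
The role of the number of beams can be illustrated on a toy model. The sketch below is not NNTile's implementation: it assumes a hypothetical `next_token_logprobs` function over a three-token vocabulary and shows a textbook beam search, which keeps the `num_beams` highest-scoring prefixes at every step and reduces to greedy decoding when `num_beams=1`.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
TRANSITIONS = {
    "<s>": {"a": 0.5, "b": 0.4, "c": 0.1},
    "a": {"a": 0.4, "b": 0.3, "c": 0.3},
    "b": {"a": 0.05, "b": 0.9, "c": 0.05},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def next_token_logprobs(sequence):
    """Return {token: log-probability} for the token following `sequence`."""
    return {tok: math.log(p) for tok, p in TRANSITIONS[sequence[-1]].items()}

def beam_search(num_beams, max_tokens):
    """Keep the `num_beams` best prefixes by cumulative log-probability."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_tokens):
        candidates = []
        for score, seq in beams:
            for tok, logp in next_token_logprobs(seq).items():
                candidates.append((score + logp, seq + [tok]))
        # Sort all one-token extensions and keep only the best ones.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(num_beams=1, max_tokens=2)[0]
best_of_three = beam_search(num_beams=3, max_tokens=2)[0]
print(greedy[1], round(math.exp(greedy[0]), 3))
print(best_of_three[1], round(math.exp(best_of_three[0]), 3))
```

In this toy case a single beam commits to `a` first and ends with the sequence `a a` (probability 0.2), whereas three beams keep `b` alive and recover `b b` (probability 0.36).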


#### 7.2. Examples with different token generation modes and temperatures

`TopK` token generation strategy with the default temperature value of `1.0`.

In [36]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.14s/it]
<s> Why does the Sun shine?
What makes the Moon appear in the morning?
How long does it take for a meteor to burn-up?
Why do stars shine at night?
How do you get a star named after

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2646.5842 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	64.2196 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------


`TopK` token generation strategy with the temperature value of `100.0`.

In [37]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 --temperature=100.0 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.05s/it]
<s> Why does the Sun shine? What happens inside the core that makes energy? Why are some elements hot, others are dense (heated iron)? These simple but fundamental question have occupied our minds forever! Now the Sun shiners bright than

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2530.2673 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	61.3972 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------


`TopK` token generation strategy with the temperature value of `0.01`.

In [38]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopK --top-k=10 --temperature=0.01 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.04s/it]
<s> Why does the Sun shine?
Why does the Sun shine? The Sun is a star, and stars shine because they are hot. The Sun is hot because it is a big ball of gas. The Sun is a big ball

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2579.6674 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	62.5959 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------
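
The effect of temperature in the three `TopK` runs above can be reproduced with a small sketch. It assumes the common formulation in which logits are divided by the temperature before the softmax and sampling is then restricted to the `k` most likely tokens. Whether `llama_generate.py` applies these steps in exactly this order is not shown in this notebook, and the function names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_distribution(logits, k, temperature=1.0):
    """Renormalized probabilities of the k most likely tokens; all others get 0."""
    probs = softmax_with_temperature(logits, temperature)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical logits for a 5-token vocabulary
for t in (0.01, 1.0, 100.0):
    print(t, [round(p, 3) for p in top_k_distribution(logits, k=3, temperature=t)])
```

Under these assumptions, a temperature of `0.01` puts practically all probability on the most likely token, which would be consistent with that run reproducing the greedy output. A temperature of `100.0` makes the `k` kept tokens almost equally likely, but the candidate set never grows beyond `k` tokens.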


`TopP` token generation strategy with the default temperature value of `1.0`.

In [39]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.18s/it]
<s> Why does the Sun shine?
Author Topic: Why does the Sun shine? (Read 3216 times)
Are you a science teacher or programmer that would be interested in writing something for this project? We have

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2600.1033 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	63.0917 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------


`TopP` token generation strategy with the temperature value of `100.0`.

In [40]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 --temperature=100.0 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.18s/it]
<s> Why does the Sun shine? which not fewer designons suspesoG (OP Analitaine Ho - look level tan she stir usually ganghesuto por class what pok care habit sweet effortinca select clreton gun + target base cu

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2637.4449 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	63.9979 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------


`TopP` token generation strategy with the temperature value of `0.01`.

In [41]:
!python ../wrappers/python/examples/llama_generate.py --cache_dir=.model --max-seq-len=256 \
    --remote-model-name="unsloth/llama-2-7b" --restrict=cuda --use-cache \
    --generation-mode=TopP --top-p=0.1 --temperature=0.01 \
    --prompt="Why does the Sun shine?" \
    --max-tokens=50

Notice:
 None
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:18<00:00,  6.15s/it]
<s> Why does the Sun shine?
Why does the Sun shine? The Sun is a star, and stars shine because they are hot. The Sun is hot because it is a big ball of gas. The Sun is a big ball

#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	430.1776 GB	2553.6082 MB/s	(transfers : 4277 - avg 102.9932 MB)
	CUDA 0 -> NUMA 0	10.4383 GB	61.9635 MB/s	(transfers : 279 - avg 38.3112 MB)
Total transfers: 440.6159 GB
#---------------------
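
Unlike `TopK`, the number of candidate tokens under `TopP` depends on the shape of the distribution, and therefore on the temperature. The sketch below assumes the usual nucleus-sampling formulation (temperature applied first, then the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept) and uses hypothetical logits. It is not taken from `llama_generate.py`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(logits, top_p, temperature=1.0):
    """Indices of the smallest set of most likely tokens whose mass reaches top_p."""
    probs = softmax_with_temperature(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

# Hypothetical logits: a 1000-token vocabulary with three clearly preferred tokens.
logits = [10.0, 9.0, 8.0] + [0.0] * 997
sizes = {t: len(nucleus(logits, top_p=0.1, temperature=t)) for t in (0.01, 1.0, 100.0)}
print(sizes)
```

With these toy logits the nucleus holds a single token at temperatures `0.01` and `1.0` but grows to 100 tokens at temperature `100.0`, because the flattened distribution spreads the 0.1 probability mass over many unlikely tokens. This is one plausible reading of the incoherent text produced with `--temperature=100.0` above.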
