In [1]:
import numpy as np
from datasets import Dataset
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
from transformers import TrainingArguments, Trainer, logging, AutoModelForSequenceClassification, AutoTokenizer
import torch

In [2]:
seq_len, dataset_size = 512, 512
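dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # NOTE: randint's upper bound is exclusive, so every label here is 0;
    # use np.random.randint(0, 2, dataset_size) to sample two classes.
    "labels": np.random.randint(0, 1, (dataset_size)),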
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

In [3]:
print_gpu_utilization()

GPU memory occupied: 223 MB.


That looks good: only a small amount of GPU memory is occupied, as we would expect before we load any models. If that’s not the case on your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by the user. When a model is loaded to the GPU, the kernels are also loaded, which can take up 1-2GB of memory. To see how much that is, we load a tiny tensor into the GPU, which triggers the kernels to be loaded as well.

In [4]:
import torch
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

GPU memory occupied: 322 MB.


## Load Model
First, we load the Felladrin/TinyMistral-248M-SFT-v4 model with a sequence classification head. We load the model weights directly to the GPU so that we can check how much space just the weights use.

In [5]:
# MODEL_ID = "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ"
MODEL_ID = "Felladrin/TinyMistral-248M-SFT-v4"
# model = AutoModelForSequenceClassification.from_pretrained(
#     MODEL_ID, 
#     torch_dtype=torch.float16,
#     use_flash_attention_2=True).to("cuda")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to("cuda")
print_gpu_utilization()

Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at Felladrin/TinyMistral-248M-SFT-v4 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPU memory occupied: 1240 MB.
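As a rough cross-check on this reading, the footprint of the weights can be estimated from the parameter count. The sketch below assumes a nominal 248 million parameters (read off the model name, not measured, so the classification variant will differ somewhat) stored in fp32 at 4 bytes each, and compares that with the increase between the 322 MB and 1240 MB readings. The helper name weight_memory_mb is ours.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Approximate size of the weights in MB (1 MB = 1024**2 bytes, as in print_gpu_utilization)."""
    return num_params * bytes_per_param / 1024**2


# Assumption: nominal parameter count implied by the model name, not a measured value.
NOMINAL_PARAMS = 248_000_000

estimated_mb = weight_memory_mb(NOMINAL_PARAMS)  # fp32 weights, 4 bytes per parameter
measured_mb = 1240 - 322  # reading after loading the model minus the reading before

print(f"Estimated fp32 weights: {estimated_mb:.0f} MB")
print(f"Measured increase:      {measured_mb} MB")
```

The two numbers land in the same range. An exact match is not expected, since the parameter count is nominal and the NVML reading covers everything allocated on the device.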


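The model weights take up roughly 0.9 GB on top of the kernel overhead (1240 MB compared with 322 MB before loading). Running nvidia-smi on the command line should report the same total and also shows the GPU in use, here a V100 with 16GB of memory. Now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments: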

In [6]:
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model.config.pad_token_id = model.config.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Memory utilization at vanilla training
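Let’s use the Trainer to train the model with an effective batch size of 4. The plain run with a per-device batch size of 4 is left commented out in the next cell; the run we actually execute reaches the same effective batch size with a per-device batch size of 1 and 4 gradient accumulation steps, and uses no other GPU performance optimization techniques: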

In [8]:
# from transformers import TrainingArguments, Trainer, logging
# logging.set_verbosity_error()
# training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
# trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
# result = trainer.train()
# print_summary(result)

In [9]:
logging.set_verbosity_error()
max_seq_length = 2048

In [10]:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
result = trainer.train()
print_summary(result)

{'train_runtime': 80.8524, 'train_samples_per_second': 6.333, 'train_steps_per_second': 1.583, 'train_loss': 0.006253509316593409, 'epoch': 1.0}
Time: 80.85
Samples/second: 6.33
GPU memory occupied: 4986 MB.
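The run above combines per_device_train_batch_size=1 with gradient_accumulation_steps=4, so gradients from four forward/backward passes are summed before each optimizer update. The following sketch works out the resulting effective batch size and the number of optimizer steps, and compares the latter with what the logged metrics imply. The helper names are ours, and the step count is a simplification that assumes a single device.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def optimizer_steps_per_epoch(dataset_size, per_device_batch_size,
                              gradient_accumulation_steps, num_devices=1):
    """Simplified step count; exact when the dataset divides evenly, as it does here."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(dataset_size / batch)


steps = optimizer_steps_per_epoch(
    dataset_size=512, per_device_batch_size=1, gradient_accumulation_steps=4
)

# Cross-check against the logged metrics: steps per second times runtime.
logged_steps = 1.583 * 80.8524

print(f"Effective batch size: {effective_batch_size(1, 4)}")
print(f"Optimizer steps per epoch: {steps} (logged metrics imply about {logged_steps:.1f})")
```

Only one sample's activations are held in memory at a time, which is why this setup can fit where a true batch of 4 might not.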


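By default, all activations from the forward pass are saved so that the gradients can be computed during the backward pass, which can cost a lot of memory. The alternative, discarding the activations and recomputing them when they are needed, saves memory but adds significant computational overhead. Gradient checkpointing offers a compromise between these two approaches: it saves strategically selected activations throughout the computational graph, so only a fraction of the activations need to be recomputed for the gradients. For an in-depth explanation of gradient checkpointing, refer to this great article.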
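To make the trade-off concrete, here is a toy sketch with a chain of scalar tanh layers. It is an illustration only, not how PyTorch or the Trainer implement checkpointing, and all function names are ours. One version stores every layer input; the other stores one checkpoint per segment and recomputes the rest of the segment during the backward pass.

```python
import math


def layer(w, x):
    """One toy 'layer': a scalar tanh unit."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x0):
    """Backward pass that keeps the input of every layer."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, segment):
    """Backward pass that keeps one checkpoint per segment and recomputes the rest."""
    checkpoints, x = [], x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)  # keep only the input of each segment
        x = layer(w, x)
    grad, peak = 1.0, len(checkpoints)
    for s in reversed(range(len(checkpoints))):
        seg_weights = weights[s * segment:(s + 1) * segment]
        seg_inputs, x = [], checkpoints[s]
        for w in seg_weights:  # recompute this segment's inputs
            seg_inputs.append(x)
            x = layer(w, x)
        # Upper bound on stored values: all checkpoints plus one recomputed segment.
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= layer_grad(w, x_in)
    return grad, peak


weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.2, 0.6, -1.4, 1.0, 0.5, -0.9, 1.1]
g_full, stored_full = grad_store_all(weights, x0=0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, x0=0.5, segment=4)

print(f"store all:    grad={g_full:.6e}, stored values={stored_full}")
print(f"checkpointed: grad={g_ckpt:.6e}, peak stored values={stored_ckpt}")
```

Both versions return the same gradient; the checkpointed one holds fewer values at any time, at the price of running the forward computation of each segment a second time.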

In [11]:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
result = trainer.train()
print_summary(result)



{'train_runtime': 110.6139, 'train_samples_per_second': 4.629, 'train_steps_per_second': 1.157, 'train_loss': 2.607703031287656e-08, 'epoch': 1.0}
Time: 110.61
Samples/second: 4.63
GPU memory occupied: 4612 MB.


### fp16
The main advantage of mixed precision training comes from saving the activations in half precision (fp16). Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved there. While mixed precision training results in faster computations, it can also lead to more GPU memory being used, especially for small batch sizes, because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU).
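The 1.5x figure follows from simple bookkeeping: each parameter is held once in fp32 (4 bytes) and once in fp16 (2 bytes). A minimal sketch of that arithmetic, using a nominal 248 million parameters as an assumption read off the model name (the helper names are ours):

```python
def fp32_weight_bytes(num_params, bytes_per_param=4):
    """Bytes needed for a single fp32 copy of the weights."""
    return num_params * bytes_per_param


def mixed_precision_weight_bytes(num_params, master_bytes=4, half_bytes=2):
    """Bytes needed when an fp32 copy and an fp16 copy are both kept on the GPU."""
    return num_params * (master_bytes + half_bytes)


# Assumption: nominal parameter count implied by the model name, not a measured value.
NOMINAL_PARAMS = 248_000_000

ratio = mixed_precision_weight_bytes(NOMINAL_PARAMS) / fp32_weight_bytes(NOMINAL_PARAMS)
extra_mb = (mixed_precision_weight_bytes(NOMINAL_PARAMS)
            - fp32_weight_bytes(NOMINAL_PARAMS)) / 1024**2

print(f"Weights in mixed precision: {ratio:.1f}x the fp32 footprint")
print(f"Extra memory for the fp16 copy: about {extra_mb:.0f} MB")
```

This counts weights only; activations, gradients and optimizer state come on top and depend on the batch size and optimizer.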

In [12]:
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
result = trainer.train()
print_summary(result)

{'train_runtime': 52.2573, 'train_samples_per_second': 9.798, 'train_steps_per_second': 2.449, 'train_loss': 6.984919309616089e-10, 'epoch': 1.0}
Time: 52.26
Samples/second: 9.80
GPU memory occupied: 4990 MB.


In [13]:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, fp16=True, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
result = trainer.train()
print_summary(result)

{'train_runtime': 64.6782, 'train_samples_per_second': 7.916, 'train_steps_per_second': 1.979, 'train_loss': 2.3283064365386963e-10, 'epoch': 1.0}
Time: 64.68
Samples/second: 7.92
GPU memory occupied: 4600 MB.


In [14]:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, fp16=True, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds, tokenizer=tokenizer)
result = trainer.train()
print_summary(result)

{'train_runtime': 65.3438, 'train_samples_per_second': 7.835, 'train_steps_per_second': 1.959, 'train_loss': 0.0, 'epoch': 1.0}
Time: 65.34
Samples/second: 7.83
GPU memory occupied: 4600 MB.
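The numbers logged by the runs above can be put side by side. The values in this sketch are copied from the cell outputs; the run labels are ours, and throughput is shown relative to the first run (batch size 1 with 4 gradient accumulation steps).

```python
# (label, samples per second, GPU memory in MB), copied from the outputs above.
runs = [
    ("bs=1, grad_accum=4", 6.333, 4986),
    ("bs=1, grad_accum=4, checkpointing", 4.629, 4612),
    ("bs=4, fp16", 9.798, 4990),
    ("bs=1, grad_accum=4, fp16", 7.916, 4600),
    ("bs=1, grad_accum=4, checkpointing, fp16", 7.835, 4600),
]

baseline_rate = runs[0][1]
speedups = {label: rate / baseline_rate for label, rate, _ in runs}

for label, rate, mem in runs:
    print(f"{label:<42} {rate:6.3f} samples/s  x{speedups[label]:.2f}  {mem} MB")
```

Keep in mind that the fp16 run with batch size 4 differs from the others in its batch configuration, and that the same model object is reused from run to run, so these readings are indicative rather than a controlled benchmark.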


### FlashAttention-2
FlashAttention-2 is a faster and more efficient implementation of the standard attention mechanism that can significantly speed up inference by:
1. additionally parallelizing the attention computation over sequence length
2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them
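The real speedups come from a fused GPU kernel, which cannot be reproduced in a few lines. What can be shown is the block-wise idea that FlashAttention-style kernels build on: attention for one query can be computed by walking over the keys in blocks while keeping a running maximum and a running softmax sum, without ever materializing the full score vector. The pure-Python sketch below is a toy illustration under that assumption, not the library's implementation, and all names are ours.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax attention for a single query vector."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_blockwise(q, keys, values, block_size):
    """Same result, computed one block of keys at a time with running statistics."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(37)]
values = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(37)]

ref = attention_reference(q, keys, values)
blk = attention_blockwise(q, keys, values, block_size=8)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(ref, blk)))
```

Because each block only needs the running statistics from the blocks before it, the work can be split along the sequence dimension, which is the property the first point in the list relies on.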