**LLM Inference and Training Memory Requirements**

**Model Inference Memory Requirements:**

1. In full precision (float32), every parameter of the model is stored in 32 bits, or 4 bytes. Hence, for a 7B-parameter model, 4 bytes per parameter * 7 billion parameters = 28 billion bytes = 28 GB of GPU memory required, for inference only.

2. In half precision, each parameter would be stored in 16 bits, or 2 bytes. Hence you would need 14 GB for inference.

3. There are now also 8-bit and 4-bit quantization algorithms. With 8 bits (1 byte) per parameter you would need 7 GB, and with 4 bits (half a byte) per parameter you would need 3.5 GB of memory for inference.
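
The arithmetic above can be sketched in a few lines of Python. This is a rough estimate that counts the weights only (no activations or framework overhead) and uses 1 GB = 10^9 bytes, as the text does. The function name `inference_memory_gb` is illustrative and not part of any library.

```python
def inference_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Estimate the memory needed to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Reproduce the 7B figures from the list above for each precision.
for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: {inference_memory_gb(7e9, bits):g} GB")
```

The same function covers other model sizes, for example `inference_memory_gb(13e9, 16)` for a 13B model in half precision.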

**For training, the extra memory depends on the optimizer you use.**

1. With regular AdamW, the optimizer state takes 8 bytes per parameter, because it keeps two float32 values for every parameter: running averages of the gradient and of the squared gradient. Hence, for a 7B model you would need 8 bytes per parameter * 7 billion parameters = 56 GB of GPU memory.

2. With AdaFactor, the optimizer state takes about 4 bytes per parameter, or 28 GB of GPU memory.

3. With the optimizers of bitsandbytes (like 8-bit AdamW), it takes 2 bytes per parameter, or 14 GB of GPU memory.

These figures cover the optimizer state only; the weights, gradients, and activations need memory on top of that.
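
The per-optimizer figures above can be tabulated with a small helper. This is a minimal sketch: the dictionary keys and the function name `optimizer_memory_gb` are illustrative labels, not library identifiers, and the byte counts are simply the ones quoted in the list. Weights, gradients, and activations are not counted here.

```python
# Bytes per parameter for each optimizer, as quoted in the list above.
# The keys are illustrative labels, not names from any library.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 8,
    "adafactor": 4,
    "adamw_8bit": 2,
}


def optimizer_memory_gb(num_params: float, optimizer: str) -> float:
    """Estimate optimizer-related memory in GB (1 GB = 1e9 bytes)."""
    return num_params * OPTIMIZER_BYTES_PER_PARAM[optimizer] / 1e9


# Print the estimate for a 7B model with each optimizer.
for name in OPTIMIZER_BYTES_PER_PARAM:
    print(f"{name}: {optimizer_memory_gb(7e9, name):g} GB")
```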

| Model Size | Precision Mode                     | Inference VRAM Required | Training VRAM Required |
|------------|------------------------------------|-------------------------|------------------------|
| 7B         | 32 Bit - Full Precision (Original)| 28 GB                   | 56 GB                  |
| 7B         | 16 Bit - Half Precision           | 14 GB                   | 28 GB                  |
| 7B         | 8 Bit Precision                   | 7 GB                    | 14 GB                  |
| 7B         | 4 Bit Precision                   | 3.5 GB                  | 7 GB                   |
| 13B        | 32 Bit - Full Precision (Original)| 52 GB                   | 104 GB                 |
| 13B        | 16 Bit - Half Precision           | 26 GB                   | 52 GB                  |
| 13B        | 8 Bit Precision                   | 13 GB                   | 26 GB                  |
| 13B        | 4 Bit Precision                   | 6.5 GB                  | 13 GB                  |
| 70B        | 32 Bit - Full Precision (Original)| 280 GB                  | 560 GB                 |
| 70B        | 16 Bit - Half Precision           | 140 GB                  | 280 GB                 |
| 70B        | 8 Bit Precision                   | 70 GB                   | 140 GB                 |
| 70B        | 4 Bit Precision                   | 35 GB                   | 70 GB                  |


I highly recommend this guide: https://huggingface.co/docs/transformers/perf_train_gpu_one#anatomy-of-models-memory

In [None]:
!pip install transformers datasets accelerate nvidia-ml-py3

In [None]:
import numpy as np
from datasets import Dataset


seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # randint's upper bound is exclusive, so every label is 0; that is enough for a memory benchmark.
    "labels": np.random.randint(0, 1, (dataset_size)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

In [None]:
ds

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 512
})

In [None]:
ds['input_ids'][0:7]

tensor([[27199,  7146,  5046,  ..., 18648,  3825, 29741],
        [ 1751, 15496,  7903,  ...,  8480, 25346,  6916],
        [  437, 23762,  6821,  ..., 14453, 12699,  6274],
        ...,
        [16616, 22093, 18415,  ..., 11199, 28125,  2436],
        [19925, 25133, 11904,  ..., 23160, 13572, 21581],
        [ 5146,  7686,  4829,  ...,  4952,   373,  6188]])

In [None]:
ds['labels'][0:7]

tensor([0, 0, 0, 0, 0, 0, 0])

In [None]:
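from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit


def print_gpu_utilization():
    # Ask the NVIDIA driver how much memory is in use on GPU 0 (reported in bytes).
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    # `result` is the object returned by trainer.train(); its metrics dict holds the timings.
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()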

In [None]:
print_gpu_utilization()

GPU memory occupied: 258 MB.


In [None]:
import torch

# Moving a tiny tensor to the GPU initializes the CUDA context, which itself occupies memory.
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

GPU memory occupied: 363 MB.


In [None]:
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
print_gpu_utilization()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPU memory occupied: 1651 MB.


In [None]:
!nvidia-smi

Mon Aug 21 09:41:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    29W /  70W |   1393MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [None]:
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()


training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)



{'train_runtime': 174.4257, 'train_samples_per_second': 2.935, 'train_steps_per_second': 0.734, 'train_loss': 0.028667159378528595, 'epoch': 1.0}
Time: 174.43
Samples/second: 2.94
GPU memory occupied: 11539 MB.
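
To see how much each step added, the readings printed by the cells above can be differenced. This is a small sketch that uses only the numbers from this run; the step labels and the helper name `memory_deltas` are illustrative. Note that the driver-reported figure after training includes memory that PyTorch has cached, not only memory in active use.

```python
# GPU memory readings (MB) copied from the cell outputs above.
readings = [
    ("baseline", 258),
    ("after CUDA init", 363),
    ("after model load", 1651),
    ("after training", 11539),
]


def memory_deltas(readings):
    """Return (label, increase in MB) for each step relative to the previous reading."""
    return [
        (label, used - prev_used)
        for (_, prev_used), (label, used) in zip(readings, readings[1:])
    ]


for label, delta in memory_deltas(readings):
    print(f"{label}: +{delta} MB")
```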
