# Performance and Scalability: How To Fit a Bigger Model and Train It Faster
[https://huggingface.co/docs/transformers/performance](https://huggingface.co/docs/transformers/performance)

In this section we look at a few tricks to reduce the memory footprint and speed up training for large models, and at how they are integrated in the Trainer and 🤗 Accelerate. Before we start, make sure you have installed the following libraries:

In [None]:
%load_ext autoreload
%autoreload 2

In [5]:
!pip install transformers datasets accelerate nvidia-ml-py3



The nvidia-ml-py3 library allows us to monitor the memory usage of the models from within Python.
You might be familiar with the nvidia-smi command in the terminal; this library gives access to the same information directly in Python.

Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier.
In total we get 512 sequences each with length 512 and store them in a Dataset with PyTorch format.

In [6]:
import numpy as np
from datasets import Dataset


seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # The upper bound of randint is exclusive, so use 2 to get labels in {0, 1}.
    "labels": np.random.randint(0, 2, (dataset_size,)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")
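
As a quick sanity check of the dummy data, the sketch below regenerates the arrays and verifies their shape and value ranges. It relies on the fact that `np.random.randint` excludes its upper bound, so binary labels need `high=2`. The seed and the variable names `input_ids` and `labels` are illustrative choices, not part of the notebook itself.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, only to make the check repeatable

seq_len, dataset_size = 512, 512
input_ids = np.random.randint(100, 30000, (dataset_size, seq_len))
labels = np.random.randint(0, 2, (dataset_size,))  # high is exclusive

# One row per sequence, one column per token position.
assert input_ids.shape == (dataset_size, seq_len)
# Token IDs fall in the half-open range [100, 30000).
assert input_ids.min() >= 100 and input_ids.max() < 30000
# Labels contain only the values 0 and 1.
assert set(np.unique(labels).tolist()) <= {0, 1}
# With high=1 every draw is 0, which would give a single-class dataset.
assert np.random.randint(0, 1, 10).max() == 0

# Show the shape and how many examples fall into each class.
print(input_ids.shape, np.bincount(labels, minlength=2))
```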

We want to print some summary statistics for the GPU utilization and the training run with the Trainer.
We set up two helper functions to do just that:

In [7]:
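from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit


def print_gpu_utilization():
    # Query NVML for the memory in use on GPU 0. This is a device-wide
    # figure, so it includes memory held by other processes as well.
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    # `result` is the output returned by trainer.train().
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()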

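Let’s verify that we start with free GPU memory. A reading well above zero, like the one below, means that some memory is already occupied, for example by another process or by an earlier cell in the same session: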

In [8]:
print_gpu_utilization()

GPU memory occupied: 2642 MB.


When a model is loaded onto the GPU, the kernels are loaded as well, which can take up 1-2GB of memory.
To see how much that is, we load a tiny tensor onto the GPU, which triggers the loading of the kernels.

In [9]:
import torch
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

GPU memory occupied: 2642 MB.


# Load Model

First, we load the bert-large-uncased model. We load the model weights directly to the GPU so that we can check how much space the weights alone use.

In [10]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
print_gpu_utilization()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

GPU memory occupied: 3843 MB.
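
The jump between the two readings can be compared against a back-of-the-envelope estimate: in full precision (fp32) each parameter takes 4 bytes. The sketch below does that arithmetic. The two readings are copied from the outputs above; the parameter count of roughly 335 million for bert-large-uncased is an assumption that is not stated in this notebook (`sum(p.numel() for p in model.parameters())` gives the exact number), and `weights_mib` is a hypothetical helper name.

```python
def weights_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bytes_per_param / 1024**2


# Readings printed by print_gpu_utilization() before and after loading the model.
before_mb = 2642
after_mb = 3843
measured_delta_mb = after_mb - before_mb

# ASSUMPTION: approximate parameter count, not taken from this notebook.
assumed_num_params = 335_000_000
estimated_mb = weights_mib(assumed_num_params)

print(f"Measured increase: {measured_delta_mb} MB")
print(f"Estimated fp32 weights: {estimated_mb:.0f} MB")
```

The two numbers are of the same order but need not match exactly, since NVML reports device-wide usage and PyTorch's caching allocator can reuse memory it has already reserved.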


# Setup Training Arguments

In [11]:
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

# Vanilla Training

As a first experiment, we use the Trainer to train the model without any further modifications, with a batch size of 4:

In [10]:
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
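
To put the reported runtime and throughput in context, it helps to know how many optimizer steps one epoch contains. A minimal sketch of that arithmetic, assuming a single GPU and no gradient accumulation (the variable names are illustrative):

```python
import math

dataset_size, seq_len = 512, 512
per_device_train_batch_size = 4
num_train_epochs = 1

# The last batch may be smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(dataset_size / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Number of token positions processed in one forward/backward pass.
tokens_per_step = per_device_train_batch_size * seq_len

print(f"{total_steps} steps, {tokens_per_step} tokens per step")
# prints: 128 steps, 2048 tokens per step
```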

# ... TODO