# Hugging Face Accelerate

 > Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

`Accelerate` is a Hugging Face library that allows running the same PyTorch code in any distributed setup by adding only four lines of code.

## Installation

To install `accelerate` with `pip`, simply run:
``` bash
pip install accelerate
```

And with `conda`:
``` bash
conda install -c conda-forge accelerate
```

## Configuration

In each environment where `accelerate` is installed, the first thing to do is to configure it. To do this, run in a terminal:
``` bash
accelerate config
```

In [1]:
!accelerate config

--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/

In my case, the answers were:
* In which compute environment are you running?
  - [x] "This machine"
  - [_] "AWS (Amazon SageMaker)"

  > I want to set it up on my computer
* What type of machine are you using?
  - [_] multi-CPU
  - [_] multi-XPU
  - [x] multi-GPU
  - [_] multi-NPU
  - [_] TPU

  > Since I have 2 GPUs and want to run distributed code on them, I choose `multi-GPU`
* How many different machines will you use (use more than 1 for multi-node training)? [1]:
  - 1

  > I choose `1` because I'm only going to run it on my computer
* Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
  - no

  > With this option, `accelerate` can check for errors during execution, but that makes it slower, so I choose `no`; if any errors appear, I will change it to `yes`.
* Do you wish to optimize your script with torch dynamo? [yes/NO]:
  - no
* Do you want to use DeepSpeed? [yes/NO]:
  - no
* Do you want to use FullyShardedDataParallel? [yes/NO]:
  - no
* Do you want to use Megatron-LM? [yes/NO]:
  - no
* How many GPU(s) should be used for distributed training? [1]:
  - 2

  > I choose `2` because I have 2 GPUs
* What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:
  - 0,1

  > I choose `0,1` because I want to use both GPUs
* Do you wish to use FP16 or BF16 (mixed precision)?
  - [x] no
  - [_] fp16
  - [_] bf16
  - [_] fp8

  > For now I choose `no` because, to keep the code simple when not using `accelerate`, we are going to train in fp32, although ideally we would use fp16

The configuration will be saved in `~/.cache/huggingface/accelerate/default_config.yaml` and can be modified at any time. Let's see what's inside.

In [100]:
!cat ~/.cache/huggingface/accelerate/default_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
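
As a quick consistency check on the file above, the number of processes should match the number of GPU ids. The sketch below pastes a few of the lines shown above into a string and parses them with a deliberately naive `key: value` split. It is only an illustration under that assumption; a real script would read the file with a YAML library.

``` python
# A few lines copied from the default_config.yaml shown above.
config_text = """\
distributed_type: MULTI_GPU
gpu_ids: 0,1
mixed_precision: fp16
num_machines: 1
num_processes: 2
"""

# Naive parser (assumption: flat "key: value" lines only, no nesting).
config = dict(line.split(": ", 1) for line in config_text.splitlines())

gpu_ids = config["gpu_ids"].split(",")
num_processes = int(config["num_processes"])

print(f"GPUs: {gpu_ids}, processes: {num_processes}")
print("one process per GPU:", len(gpu_ids) == num_processes)
```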


Another way to view the configuration we have is by running in a terminal:
``` bash
accelerate env
```

In [100]:
!accelerate env


Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: fp16
	- use_cpu: False
	- debug: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: 0,1
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []


Once we have configured `accelerate` we can test if we have done it correctly by running in a terminal:
``` bash
accelerate test
```

In [100]:
!accelerate test


Running:  accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test process execution**
stdout: 
stdout: **Test split between processes as a list**
stdout: 
stdout: **Test split between processes as a dict**
stdout: 
stdout: **Test split between processes as a tensor**
stdout: 
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout: 
stdout: **Da

We see that it ends by saying `Test is a success! You are ready for your distributed training!` so everything is correct.

## Training

### Optimization of Training

#### Base code

Let's start by creating a basic training code and then we'll optimize it to see how it's done and how it improves.

First, let's find a dataset. In my case, I will use the [tweet_eval](https://huggingface.co/datasets/tweet_eval) dataset, which is a tweet classification dataset. Specifically, I will download the `emoji` subset that classifies tweets with emojis.

In [100]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", "emoji")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [100]:
dataset["train"].info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35

Let's see the classes

In [100]:
print(dataset["train"].info.features["label"].names)

['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']


And the number of classes

In [100]:
num_classes = len(dataset["train"].info.features["label"].names)
num_classes

20

We see that the dataset has 20 classes

Let's look at the maximum text length, in characters, of each split

In [100]:
max_len_train = 0
max_len_val = 0
max_len_test = 0

split = "train"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_train:
        max_len_train = len_i
split = "validation"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_val:
        max_len_val = len_i
split = "test"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_test:
        max_len_test = len_i

max_len_train, max_len_val, max_len_test

(142, 139, 167)
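
The three loops above can be written more compactly with `max` over a generator. This is a minimal sketch on toy data: the hypothetical `splits` dictionary stands in for the `dataset` object, and with the real dataset you would iterate over `dataset[split]["text"]` instead.

``` python
# Toy stand-in for the dataset: split name -> list of texts (hypothetical data).
splits = {
    "train": ["good morning", "what a day!"],
    "validation": ["hello"],
    "test": ["see you soon", "bye"],
}

# Longest text, in characters, for each split.
max_lens = {name: max(len(text) for text in texts) for name, texts in splits.items()}
print(max_lens)
```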

These lengths are measured in characters, not tokens. We set the maximum sequence length to 130 tokens for tokenization; anything longer will be truncated

In [100]:
max_len = 130

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

In [100]:
from transformers import AutoTokenizer

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

We create a tokenization function

In [100]:
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
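
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a rough sketch in plain Python. The helper `pad_or_truncate` and the pad id of `1` are assumptions for illustration only. The real tokenizer also handles special tokens when truncating, which this sketch ignores.

``` python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Hypothetical helper: cut to max_len, then pad on the right up to max_len."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_len - len(ids)   # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# A short sequence is padded, a long one is truncated.
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with exactly `max_len` positions, which is why the batches later have shape `[64, 130]`.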

And now we tokenize the dataset

In [100]:
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

Map:   0%|          | 0/45000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Now we have the tokens (`input_ids`) and the attention masks (`attention_mask`). Let's check what type of data they are, because the model needs PyTorch tensors.

In [100]:
type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])

(list, list, int)

In [100]:
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])

(torch.Tensor, torch.Tensor, torch.Tensor)

We create a DataLoader

In [100]:
import torch
from torch.utils.data import DataLoader
BS = 64

dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
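
With `BS = 64` and the split sizes shown earlier, we can work out how many batches each `DataLoader` yields per epoch. This is a small arithmetic sketch; it relies on the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`).

``` python
import math

BS = 64
# Split sizes taken from the dataset summary above.
split_sizes = {"train": 45000, "validation": 5000, "test": 50000}

# Number of batches per epoch, counting the final partial batch.
batches = {name: math.ceil(n / BS) for name, n in split_sizes.items()}
# Size of the final batch (BS when the split divides evenly).
last_batch = {name: n % BS or BS for name, n in split_sizes.items()}

print(batches)
print(last_batch)
```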

We load the model

In [100]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)

Let's take a look at the model.

In [100]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

Let's take a look at its last layer

In [100]:
model.classifier.out_proj

Linear(in_features=768, out_features=2, bias=True)

In [100]:
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features

(768, 2)

We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we need to modify the last layer.

In [100]:
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj

Linear(in_features=768, out_features=20, bias=True)

Now the output size matches the number of classes.
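
Keep in mind that the new `torch.nn.Linear` layer starts from random weights, so it has to be learned from scratch. A quick arithmetic sketch of how many parameters the old and new heads have, using the sizes from the printouts above (`linear_params` is a hypothetical helper):

``` python
in_features = 768  # hidden size shown in the model printout

def linear_params(n_in, n_out):
    # Weight matrix (n_out x n_in) plus one bias per output.
    return n_in * n_out + n_out

old_head = linear_params(in_features, 2)    # original 2-class layer
new_head = linear_params(in_features, 20)   # replacement 20-class layer
print(old_head, new_head)
```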

Now we create a loss function

In [100]:
loss_function = torch.nn.CrossEntropyLoss()

An optimizer

In [100]:
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-4)

And lastly, a metric

In [100]:
import evaluate

metric = evaluate.load("accuracy")

Let's check that everything is fine with a sample.

In [100]:
sample = next(iter(dataloader["train"]))

In [100]:
sample["input_ids"].shape, sample["attention_mask"].shape

(torch.Size([64, 130]), torch.Size([64, 130]))

Now we feed that sample to the model

In [100]:
model.to("cuda")
outputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
outputs.logits.shape

torch.Size([64, 20])

The model returns 64 outputs, one per sample in the batch, which is correct because we set `BS = 64`, and each output has 20 values, which is also correct because we modified the model to output 20 classes.

We take the class with the highest value

In [100]:
predictions = torch.argmax(outputs.logits, axis=-1)
predictions.shape

torch.Size([64])

We compute the loss

In [100]:
loss = loss_function(outputs.logits, sample["label"].to("cuda"))
loss.item()

2.9990389347076416

And the accuracy

In [100]:
accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy

0.015625
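
These numbers are what we would expect from an untrained head. A classifier that spreads its probability roughly evenly over 20 classes has a cross-entropy close to ln(20) and an accuracy around 1/20. A quick check of those reference values:

``` python
import math

num_classes = 20

# Cross-entropy of a uniform prediction over num_classes classes.
expected_loss = math.log(num_classes)
# Accuracy of guessing uniformly at random.
chance_accuracy = 1 / num_classes

print(f"expected loss: {expected_loss:.4f}")
print(f"chance accuracy: {chance_accuracy}")
```

The measured loss of about 2.999 is very close to this value. The accuracy of 0.015625 is exactly 1/64, that is, one correct prediction in the batch, which is within the noise you would expect from a single batch of 64 samples.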

We can now create a small training loop

In [100]:
from fastprogress.fastprogress import master_bar, progress_bar

epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'

        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
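
The validation loop follows an accumulate-then-compute pattern: `add_batch` stores the results of each batch and `compute` produces the final value once all batches have been seen. The class below is a hypothetical plain-Python stand-in that sketches the idea; it is not how the `evaluate` metric is implemented, and resetting the counters in `compute` is my assumption.

``` python
class RunningAccuracy:
    """Hypothetical stand-in for the add_batch / compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate counts; nothing is returned at this stage.
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        result = {"accuracy": self.correct / self.total}
        # Reset so the object can be reused in the next epoch (assumption).
        self.correct = self.total = 0
        return result


toy_metric = RunningAccuracy()
toy_metric.add_batch(predictions=[1, 0, 2], references=[1, 1, 2])  # 2 of 3 correct
toy_metric.add_batch(predictions=[0, 0], references=[0, 1])        # 1 of 2 correct
result = toy_metric.compute()
print(result)
```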

#### Script with the base code

Most of the `accelerate` documentation explains how to use it with scripts, so for now we will do the same, and at the end we will explain how to do it from a notebook.

First, we are going to create a folder where we will save the scripts

In [1]:
!mkdir accelerate_scripts

Now we write the base code in a script

In [40]:
%%writefile accelerate_scripts/01_code_base.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'

        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
print(f"Accuracy = {accuracy['accuracy']}")

Overwriting accelerate_scripts/01_code_base.py


And now we run it

In [41]:
%%time

!python accelerate_scripts/01_code_base.py

Accuracy = 0.2112                                                               
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s


We see that on my computer it took about 3 and a half minutes

#### Code with accelerate

Now we change a few things
* First, we import `Accelerator` and initialize it
``` python
from accelerate import Accelerator
accelerator = Accelerator()
```

* We no longer select the device in the usual way
``` python 
torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

* Instead, we let `accelerate` choose the device
``` python
device = accelerator.device
```

* We pass the objects involved in training through the `prepare` method and no longer call `model.to(device)`
``` python
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
```

* We no longer send the data and the model to the GPU with `.to(device)`, since `accelerate` takes care of it through the `prepare` method.
* Instead of running backpropagation with `loss.backward()`, we let `accelerate` handle it with
``` python
accelerator.backward(loss)
```

* When computing the metric in the validation loop, we need to gather the values from all processes, which matters when training is distributed. For this we do
``` python
predictions = accelerator.gather_for_metrics(predictions)
```
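
To see why the gathering step matters, here is a plain-Python sketch with toy data and no GPUs involved. Each simulated process only sees its own shard of the validation data, so a per-process accuracy can differ from the accuracy over the whole set. The interleaved sharding used here is an assumption for illustration, not a description of how `accelerate` splits batches.

``` python
labels      = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = [0, 1, 0, 0, 1, 1, 1, 1]
num_processes = 2

def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

# Each simulated process gets every num_processes-th sample.
shards = [
    (predictions[rank::num_processes], labels[rank::num_processes])
    for rank in range(num_processes)
]

# Accuracy as seen by each process on its own shard.
per_process = [accuracy(preds, refs) for preds, refs in shards]

# "Gathering": concatenate the shards from all processes before computing.
gathered_preds = [p for preds, _ in shards for p in preds]
gathered_labels = [r for _, refs in shards for r in refs]
global_accuracy = accuracy(gathered_preds, gathered_labels)

print(per_process, global_accuracy)
```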

In [19]:
%%writefile accelerate_scripts/02_accelerate_base_code.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")

Overwriting accelerate_scripts/02_accelerate_base_code.py


Notice that I've added two lines, `print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")` and `print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")`. I added them on purpose because they will reveal something very important.

Now let's run it. Scripts that use `accelerate` are launched with the `accelerate launch` command
``` bash
accelerate launch script.py
```

In [29]:
%%time

!accelerate launch accelerate_scripts/02_accelerate_base_code.py

End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s


Before it took about 3 and a half minutes, and now it takes about 2 and a half. Quite an improvement. Also, if we look at the `print`s, we can see that each one has been printed twice.
How can this be? Because `accelerate` has parallelized the training across my two GPUs, running one process per GPU, which is why it was faster.
Moreover, when I ran the first script (the one without `accelerate`), the GPU was almost full, whereas when I ran the second one (the one with `accelerate`), both GPUs were barely used, so we can increase the batch size to try to fill both. Let's do it!
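
A bit of arithmetic helps explain the numbers above. The sketch assumes the default behavior in which each process receives its own batch of `BS` samples, so the effective batch size is `BS` times the number of processes.

``` python
import math

BS = 64
num_processes = 2                          # one process per GPU
train_samples, val_samples = 45000, 5000   # from the dataset summary

# Samples consumed per optimization step across all processes.
global_bs = BS * num_processes

# Training steps per epoch on one GPU versus per process on two GPUs.
single_gpu_steps = math.ceil(train_samples / BS)
multi_gpu_steps = math.ceil(train_samples / global_bs)

# Validation samples left over for the last global step.
val_remainder = val_samples % global_bs

print(global_bs, single_gpu_steps, multi_gpu_steps, val_remainder)
```

Each process runs about half as many steps, which is where the speedup comes from. The remainder of 8 matches the `labels.shape: torch.Size([8])` printed at the end of validation, which would be consistent with `gather_for_metrics` dropping the samples that were duplicated to fill the last batch.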

In [27]:
%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")

Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py


I have removed the extra prints, as we have already seen that the code is running on both GPUs, and I increased the batch size from 64 to 128. Let's run it and see.

In [28]:
%%time

!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py

Accuracy = 0.1052                                                               
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s


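Increasing the batch size has reduced the execution time by about 15 seconds (from 2 min 37 s to 2 min 22 s), although the validation accuracy after this single epoch dropped from 0.206 to 0.1052.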
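
To put the three wall times side by side, this small sketch converts them to seconds and computes the speedup relative to the single-GPU script. The helper `to_seconds` is hypothetical, and the times are the ones reported by `%%time` above.

``` python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

baseline = to_seconds(3, 36)          # 01_code_base.py
accelerate_bs64 = to_seconds(2, 37)   # 02_accelerate_base_code.py
accelerate_bs128 = to_seconds(2, 22)  # 03_accelerate_base_code_more_bs.py

speedup_bs64 = round(baseline / accelerate_bs64, 2)
speedup_bs128 = round(baseline / accelerate_bs128, 2)
print(speedup_bs64, speedup_bs128)
```

The speedup is well below 2x on two GPUs. A likely reason is that the wall time also includes loading the dataset, tokenizing it, and loading the model, which every process repeats.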

### Execution of processes

#### Execution of code in a single process

We saw earlier that the `print`s were printed twice. This is because `accelerate` creates as many processes as there are devices on which the code runs; in my case it creates two processes because I have two GPUs.
However, not all code should run in every process. For example, `print`s slow the code down considerably, especially when they are executed several times, and any checkpoints would be saved twice.
To run part of the code in a single process (the main process of each machine), it has to be wrapped in a function decorated with `accelerator.on_local_main_process`. For example, in the script below I have created the following function:
``` python
@accelerator.on_local_main_process
def print_something(something):
    print(something)
```

Another option is to include the code within an `if accelerator.is_local_main_process` as in the following code
``` python
if accelerator.is_local_main_process:
    print("Something")
```
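
To make the mechanism more concrete, here is a minimal sketch of how a decorator of this kind can work. `ToyAccelerator` and `run_in_process` are hypothetical names written for this illustration; this is not the real `Accelerator` class, and the real implementation may differ. The only point is that the wrapped function becomes a no-op in every process except the local main one, which has the same effect as the `if` guard.

```python
import functools


class ToyAccelerator:
    """Hypothetical stand-in for illustration only (not the real Accelerator)."""

    def __init__(self, local_process_index):
        self.local_process_index = local_process_index

    @property
    def is_local_main_process(self):
        # The local main process is the one with local index 0
        return self.local_process_index == 0

    def on_local_main_process(self, function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            # Run the function only on the local main process; do nothing elsewhere
            if self.is_local_main_process:
                return function(*args, **kwargs)
            return None
        return wrapper


def run_in_process(local_process_index):
    """Simulate what one process would do and collect what it would print."""
    accelerator = ToyAccelerator(local_process_index)
    printed = []

    @accelerator.on_local_main_process
    def print_something(something):
        printed.append(something)

    # Option 1: decorated function
    print_something("decorated call")
    # Option 2: explicit guard
    if accelerator.is_local_main_process:
        printed.append("guarded call")
    return printed


# Simulate two processes, as with two GPUs
outputs = {index: run_in_process(index) for index in range(2)}
print(outputs)
```

In this simulation only the process with local index 0 records anything; the other one runs the same lines but skips both calls.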

In [61]:
%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_something(something):
    print(something)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py


Let's run it and see

In [62]:
%%time

!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

Accuracy = 0.2098                                                               
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s


Now the print has only been executed once

However, although it is barely noticeable, the progress bars still run in every process.
I haven't found a way to avoid this with the `fastprogress` progress bars, but I have with the `tqdm` ones. So I am going to replace the `fastprogress` progress bars with `tqdm` ones and, to show them in a single process only, add the argument `disable=not accelerator.is_local_main_process`.

In [4]:
%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_something(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py


In [1]:
%%time

!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py

100%|█████████████████████████████████████████| 176/176 [02:01<00:00,  1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s


We have seen how to print from a single process, as one example of running code in a single process. But if all you want is to print from a single process, you can use the `print` method of the `Accelerator` object (`accelerator.print`). Let's see the same example as before with this method.

In [1]:
%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py


We run it

In [2]:
%%time

!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py

Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s
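
The 176 and 20 steps shown in the progress bars are consistent with the batches being shared between the two processes. The following sketch reproduces that arithmetic from the dataset sizes shown in the `Map` lines (45000 training and 5000 validation examples). It assumes that each process receives whole batches of size `BS` and that the batches of the original `DataLoader` are split evenly between the processes, which I believe is the default behavior of `accelerator.prepare`. `steps_per_process` is a helper name made up for this example.

```python
import math


def steps_per_process(num_samples, batch_size, num_processes):
    # Number of batches in the original (single-process) DataLoader
    total_batches = math.ceil(num_samples / batch_size)
    # Assumption: those batches are shared evenly between the processes
    return math.ceil(total_batches / num_processes)


BS = 128
NUM_PROCESSES = 2  # two GPUs, so two processes

train_steps = steps_per_process(45000, BS, NUM_PROCESSES)
validation_steps = steps_per_process(5000, BS, NUM_PROCESSES)
print(train_steps, validation_steps)
```

Under the same assumption, each optimizer step processes `BS * NUM_PROCESSES = 256` samples in total, 128 on each GPU.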


#### Code Execution Once Across All Processes

The previous options run the code once per machine. However, some code must run only once across all processes and machines, for example uploading the checkpoints to the hub. Here again we have two options: wrap the code in a function and decorate it with `accelerator.on_main_process`
``` python
@accelerator.on_main_process
def do_my_thing():
    "Something done only once, across all servers"
    do_thing_once()
```

or put the code inside an `if accelerator.is_main_process`
``` python
if accelerator.is_main_process:
    repo.push_to_hub()
```
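
The difference between the main process and the local main process only shows up with more than one machine. The sketch below enumerates a hypothetical setup with 2 machines and 2 GPUs each and marks which processes would be the main process and which the local main process. It does not use `accelerate`, and the numbering `process_index = machine * GPUS_PER_MACHINE + local_process_index` is an assumption made for the illustration.

```python
NUM_MACHINES = 2       # hypothetical setup, not the one used in this post
GPUS_PER_MACHINE = 2

layout = []
for machine in range(NUM_MACHINES):
    for local_process_index in range(GPUS_PER_MACHINE):
        # Assumed global numbering of the processes across machines
        process_index = machine * GPUS_PER_MACHINE + local_process_index
        layout.append({
            "machine": machine,
            "process_index": process_index,
            "local_process_index": local_process_index,
            # Global main process: exactly one in the whole run
            "is_main_process": process_index == 0,
            # Local main process: one per machine
            "is_local_main_process": local_process_index == 0,
        })

main = [p["process_index"] for p in layout if p["is_main_process"]]
local_main = [p["process_index"] for p in layout if p["is_local_main_process"]]
print("main process:", main)
print("local main processes:", local_main)
```

On a single machine with two GPUs, as in this post, both lists would contain only process 0, so both variants print each message once.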

Since we are only training to showcase the `accelerate` library and the model we are training is not good, it doesn't make sense to upload the checkpoints to the hub right now, so I will do an example with `print`s instead.

In [12]:
%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")

if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py


Let's run it to see

In [9]:
%%time

!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py

Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s


#### Code Execution in Process X

Finally, we can specify the process in which we want to run the code. To do this, we create a function and decorate it with `@accelerator.on_process(process_index=0)`
``` python
@accelerator.on_process(process_index=0)
def do_my_thing():
    "Something done on process index 0"
    do_thing_on_index_zero()
```

or decorate it with `@accelerator.on_local_process(local_process_index=0)`
``` python
@accelerator.on_local_process(local_process_index=0)
def do_my_thing():
    "Something done on process index 0 on each server"
    do_thing_on_index_zero_on_each_server()
```

Here I have used process 0, but any valid process index can be used.

In [18]:
%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)

@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")

if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

print_in_process_0("End of process 0")
print_in_process_1("End of process 1")

Overwriting accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py


We run it

In [15]:
%%time

!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py

Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02<00:00,  1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s


#### Synchronizing Processes

If some code runs in several processes, it is often useful to wait until it has finished in all of them before moving on to another task. For this we use `accelerator.wait_for_everyone()`.
To see it in action, we are going to introduce a delay in one of the print functions, so that one process is held back.
I have also added a `break` in the training loop so that it doesn't train for long, since training is not what we are interested in right now.

In [22]:
%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    time.sleep(2)
    print("Process 0: " + something)

@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        break

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")

if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
accelerator.wait_for_everyone()

print_in_one_process("End of script")

Overwriting accelerate_scripts/09_accelerate_base_code_sync_all_process.py


We run it

In [23]:
!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py

Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]
  0%|                                                   | 0/176 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script


As can be seen, `Process 1: End of process 1` is printed first and then the rest. This is because all the remaining prints are made from the main process (process 0), which also has to wait for the 2-second delay we set, whereas process 1 skips those calls and prints its message right away.
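
To make the behaviour of these decorators more concrete, here is a minimal pure-Python sketch of a rank-gated decorator. It is not Accelerate's implementation: `on_process`, `current_index` and `run_as` are hypothetical names used only for illustration, and the "processes" are simulated by calling the same code with different ranks. It shows that a gated function simply does nothing on the processes it is not meant for.

```python
import functools


def on_process(current_index, process_index):
    """Hypothetical rank-gated decorator (illustration only, not Accelerate's code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Run the function only on the selected process; elsewhere it is a no-op
            if current_index == process_index:
                return func(*args, **kwargs)
            return None
        return wrapper
    return decorator


def run_as(rank):
    """Simulate the end of the script as seen by the process with the given rank."""
    printed = []

    @on_process(rank, process_index=0)
    def print_in_process_0(something):
        printed.append("Process 0: " + something)

    @on_process(rank, process_index=1)
    def print_in_process_1(something):
        printed.append("Process 1: " + something)

    # Every process executes both calls, but each call only has an effect on one rank
    print_in_process_0("End of process 0")
    print_in_process_1("End of process 1")
    return printed


print(run_as(0))
print(run_as(1))
```

Each simulated process calls both functions, yet only the message that belongs to its own rank is produced.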

### Save and load the state dict

When training, we sometimes want to save the current state so that we can resume later.
To do so, we use the `save_state()` method to write a checkpoint and the `load_state()` method to restore it.

In [66]:
%%writefile accelerate_scripts/10_accelerate_save_and_load_checkpoints.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    # Save the training state
    accelerator.save_state("accelerate_scripts/checkpoints")

print_something(f"Accuracy = {accuracy['accuracy']}")

# Load the saved state
accelerator.load_state("accelerate_scripts/checkpoints")

Overwriting accelerate_scripts/10_accelerate_save_and_load_checkpoints.py


We run it

In [67]:
!accelerate launch accelerate_scripts/10_accelerate_save_and_load_checkpoints.py

100%|█████████████████████████████████████████| 176/176 [01:58<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.40it/s]
Accuracy = 0.2142


### Save the model

When we called the `prepare` method, the model was wrapped so that it could be distributed across the necessary devices. Therefore, to save it we use the `save_model` method, which first unwraps the model and then saves it. Additionally, if we pass `safe_serialization=True`, the weights are saved in the `safetensors` format.
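
As a rough illustration of why unwrapping matters, the following toy sketch mimics a distributed wrapper in plain Python. `TinyModel`, `DistributedWrapper` and `unwrap` are hypothetical stand-ins (they are not PyTorch or Accelerate classes); the sketch only assumes that the wrapper keeps the real model in a `.module` attribute, as PyTorch's distributed wrappers do.

```python
class TinyModel:
    """Toy model that exposes its weights through a state_dict-like method."""

    def __init__(self):
        self.weights = {"classifier.weight": [0.1, 0.2], "classifier.bias": [0.0]}

    def state_dict(self):
        return dict(self.weights)


class DistributedWrapper:
    """Toy stand-in for a distributed wrapper: the real model lives in `.module`."""

    def __init__(self, module):
        self.module = module

    def state_dict(self):
        # The wrapper adds a prefix to every key, so the names no longer match the bare model
        return {f"module.{key}": value for key, value in self.module.state_dict().items()}


def unwrap(model):
    """Peel off wrappers until the original model is reached."""
    while hasattr(model, "module"):
        model = model.module
    return model


model = TinyModel()
wrapped = DistributedWrapper(model)

print(list(wrapped.state_dict()))          # keys carry the wrapper prefix
print(list(unwrap(wrapped).state_dict()))  # original keys, safe to save and reload
```

Saving the unwrapped model keeps the original parameter names, so the checkpoint can later be loaded into a model that was never wrapped.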

In [1]:
%%writefile accelerate_scripts/11_accelerate_save_model.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    # Save the model
    accelerator.wait_for_everyone()
    accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)

print_something(f"Accuracy = {accuracy['accuracy']}")

Writing accelerate_scripts/11_accelerate_save_model.py


We run it

In [78]:
!accelerate launch accelerate_scripts/11_accelerate_save_model.py

100%|█████████████████████████████████████████| 176/176 [01:58<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.35it/s]
Accuracy = 0.214


### Save the `pretrained` model

For models from the `transformers` library, we must save the model with the `save_pretrained` method so that it can later be loaded with the `from_pretrained` method. Before saving it, we need to unwrap it with the `unwrap_model` method.

In [79]:
%%writefile accelerate_scripts/12_accelerate_save_pretrained.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    # Save the pretrained model
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        "accelerate_scripts/model_pretrained",
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )

print_something(f"Accuracy = {accuracy['accuracy']}")

Writing accelerate_scripts/12_accelerate_save_pretrained.py


We run it

In [80]:
!accelerate launch accelerate_scripts/12_accelerate_save_pretrained.py

Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15152.47 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15119.13 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12724.70 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12397.49 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15247.21 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15138.03 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:59<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.37it/s]
Accuracy = 0.21


Now we can load it

In [82]:
from transformers import AutoModel

checkpoints = "accelerate_scripts/model_pretrained"
model = AutoModel.from_pretrained(checkpoints)

Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training in notebooks

So far we have seen how to run scripts, but if we want to run the code in a notebook, we can write the same code as before, encapsulated in a function.

First we import the libraries

In [1]:
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator

Now we create the function

In [2]:
def train_code(batch_size: int = 64):
    from accelerate import Accelerator
    accelerator = Accelerator()

    dataset = load_dataset("tweet_eval", "emoji")
    num_classes = len(dataset["train"].info.features["label"].names)
    max_len = 130

    checkpoints = "cardiffnlp/twitter-roberta-base-irony"
    tokenizer = AutoTokenizer.from_pretrained(checkpoints)

    def tokenize_function(dataset):
        return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    tokenized_dataset = {
        "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
    }
    tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
    tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
    tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

    BS = batch_size
    dataloader = {
        "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
        "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
        "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
    }

    model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
    model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=5e-4)
    metric = evaluate.load("accuracy")

    EPOCHS = 1
    # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = accelerator.device

    # model.to(device)
    model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

    for i in range(EPOCHS):
        model.train()
        progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_train:
            optimizer.zero_grad()

            input_ids = batch["input_ids"]#.to(device)
            attention_mask = batch["attention_mask"]#.to(device)
            labels = batch["label"]#.to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_function(outputs['logits'], labels)

            # loss.backward()
            accelerator.backward(loss)
            optimizer.step()

        model.eval()
        progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_validation:
            input_ids = batch["input_ids"]#.to(device)
            attention_mask = batch["attention_mask"]#.to(device)
            labels = batch["label"]#.to(device)

            with torch.no_grad():
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = torch.argmax(outputs['logits'], axis=-1)
            # Gather the predictions from all devices
            predictions = accelerator.gather_for_metrics(predictions)
            labels = accelerator.gather_for_metrics(labels)

            accuracy = metric.add_batch(predictions=predictions, references=labels)
        accuracy = metric.compute()
        
    accelerator.print(f"Accuracy = {accuracy['accuracy']}")

To run the training in the notebook, we use the `notebook_launcher` function, to which we pass the function we want to execute, a tuple with that function's arguments, and the number of GPUs to train on through the `num_processes` argument.

In [10]:
from accelerate import notebook_launcher

args = (128,)
notebook_launcher(train_code, args, num_processes=2)

Launching training on 2 GPUs.


100%|██████████| 176/176 [02:01<00:00,  1.45it/s]
100%|██████████| 20/20 [00:06<00:00,  3.31it/s]


Accuracy = 0.2112


### Training in FP16

When we first configured `accelerate`, it asked us `Do you wish to use FP16 or BF16 (mixed precision)?` and we answered `no`. Now we are going to answer `fp16`.

So far we have trained in FP32, which means that every number is stored as a 32-bit floating-point value. With FP16 mixed precision, most operations are carried out with 16-bit floating-point numbers, which take up half the memory. As a result, two things happen: we can use a larger batch size, and training is faster.

First we run `accelerate config` again and tell it that we want FP16:

In [4]:
!accelerate config

--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at 

Now we create a script to train, with the same batch size as before, to see if it takes less time to train

In [2]:
%%writefile accelerate_scripts/13_accelerate_base_code_fp16_bs128.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

Overwriting accelerate_scripts/13_accelerate_base_code_fp16_bs128.py


Let's run it to see how long it takes

In [3]:
%%time

!accelerate launch accelerate_scripts/13_accelerate_base_code_fp16_bs128.py

Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14983.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14315.47 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:01<00:00,  2.88it/s]
100%|███████████████████████████████████████████| 20/20 [00:02<00:00,  6.84it/s]
Accuracy = 0.2094
CPU times: user 812 ms, sys: 163 ms, total: 976 ms
Wall time: 1min 27s


When we ran this training in FP32 it took about two and a half minutes, and now it takes roughly one and a half. Let's see what happens if, instead of a batch size of 128, we train with one of 256.

In [6]:
%%writefile accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 256
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

Overwriting accelerate_scripts/14_accelerate_base_code_fp16_bs256.py


We run it

In [7]:
%%time

!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15390.30 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14990.92 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:54<00:00,  1.62it/s]
100%|███████████████████████████████████████████| 10/10 [00:02<00:00,  3.45it/s]
Accuracy = 0.2236
CPU times: user 670 ms, sys: 91.6 ms, total: 761 ms
Wall time: 1min 12s


The run time has only dropped by about 15 seconds.

### Training in BF16

We previously trained in FP16; now we are going to train in BF16. What is the difference?
![FP32_FP16_BF16](https://pub-fb664c455eca46a2ba762a065ac900f7.r2.dev/FP32_FP16_BF16.webp)
As the image shows, FP16 has fewer bits than FP32 in both the exponent and the mantissa, so its range is much smaller. BF16 keeps the same number of exponent bits as FP32 but has fewer mantissa bits, so it covers the same range of numbers as FP32 with less precision.
This is beneficial because some calculations can produce very large numbers that FP16 cannot represent. Additionally, some hardware devices are optimized for BF16.
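
To make the comparison concrete, the range and precision of each format can be derived from its bit layout alone. The following is a minimal sketch in plain Python, assuming the usual IEEE-style layout (sign bit, exponent bits, mantissa bits); the helper names `max_finite` and `machine_epsilon` are made up for this illustration.

```python
# Bit layout of each format: (exponent bits, mantissa bits). One extra bit is the sign.
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias itself.
    bias = 2 ** (exponent_bits - 1) - 1
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias

def machine_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable number."""
    return 2.0 ** -mantissa_bits

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name}: max = {max_finite(e_bits, m_bits):.4e}, epsilon = {machine_epsilon(m_bits):.2e}")
```

FP16 tops out at 65504, whereas BF16 reaches roughly the same maximum as FP32, at the cost of a much larger epsilon (that is, lower precision).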

Just like before, we run `accelerate config` and indicate that we want BF16

In [8]:
!accelerate config

--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
accelerate configuration saved at 

Now we run the last script we created, that is, with a batch size of 256

In [9]:
%%time

!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14814.95 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14506.83 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:51<00:00,  1.70it/s]
100%|███████████████████████████████████████████| 10/10 [00:03<00:00,  3.21it/s]
Accuracy = 0.2112
CPU times: user 688 ms, sys: 144 ms, total: 832 ms
Wall time: 1min 17s


It took a similar amount of time as before, which is expected, since we are again training with a 16-bit format.

### Training in FP8

Now we are going to train in FP8, which, as the name suggests, is a floating-point format that uses 8 bits per value. To do so, we run `accelerate config` and tell it that we want FP8.

In [10]:
!accelerate config

--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp8
accelerate configuration saved at ~

Now we run the last script, the one with a batch size of 256

In [11]:
%%time

!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

Traceback (most recent call last):
  File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in <module>
    accelerator = Accelerator()
                  ^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__
    raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.

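As the error message says, FP8 precision requires `transformer_engine` or `MS-AMP` to be installed, so the launch fails without one of them.

Since the weights are now 8-bit and take up half the memory, we will increase the batch size to 512.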
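
Before launching with `fp8`, it can help to check whether one of the backends named in the error above is importable. This is a minimal sketch, assuming the import names are `transformer_engine` and `msamp`; those module names and the helper `available_fp8_backends` are assumptions for illustration, not something `accelerate` provides.

```python
import importlib.util

# Assumed import names for the two FP8 backends named in the error message.
FP8_BACKENDS = {
    "transformer_engine": "transformer_engine",
    "MS-AMP": "msamp",
}

def available_fp8_backends():
    """Return the names of the FP8 backends that can be imported in this environment."""
    return [
        name
        for name, module in FP8_BACKENDS.items()
        if importlib.util.find_spec(module) is not None
    ]

backends = available_fp8_backends()
if backends:
    print(f"FP8 backends found: {backends}")
else:
    print("No FP8 backend found; launching with fp8 would raise the ValueError shown above")
```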

In [2]:
%%writefile accelerate_scripts/15_accelerate_base_code_fp8_bs512.py

import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 512
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()

        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)

        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

Writing accelerate_scripts/15_accelerate_base_code_fp8_bs512.py


We run it

In [100]:
%%time

!accelerate launch accelerate_scripts/15_accelerate_base_code_fp8_bs512.py

## Model Inference

### Usage of the Hugging Face Ecosystem

Let's see how to perform inference with large models using the `transformers` library from Hugging Face.

#### Inference with `pipeline`

If we use the Hugging Face ecosystem, it's very simple, as everything happens under the hood without us having to do much. In the case of using `pipeline`, which is the easiest way to perform inference with the `transformers` library, we just need to specify the model we want to use and, importantly, pass `device_map="auto"`. This will make `accelerate` distribute the model across different GPUs, CPU RAM, or hard drive if necessary.

There are more possible values for `device_map`, which we will see later, but for now stick with `"auto"`.

We are going to use the `Llama3 8B` model, which, as its name suggests, has around 8 billion parameters. By default each parameter is stored in FP32, which takes 4 bytes (32 bits), so 8 billion parameters times 4 bytes means the weights alone need around 32 GB of VRAM.
In my case, I have 2 GPUs with 24 GB of VRAM each, so the model would not fit on a single GPU. But thanks to setting `device_map="auto"`, `accelerate` will distribute the model across both GPUs and I will be able to perform inference.
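
The arithmetic above can be written as a small helper. This is only a rough sketch: it counts the weights alone (activations and other buffers are ignored), and the function name `weights_size_gb` is made up for this example.

```python
def weights_size_gb(num_parameters, bytes_per_parameter):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

NUM_PARAMETERS = 8e9   # Llama3 8B: around 8 billion parameters
GPU_VRAM_GB = 24       # VRAM of each of the two GPUs mentioned in the text

fp32_size = weights_size_gb(NUM_PARAMETERS, 4)   # FP32 = 4 bytes per parameter
print(f"FP32 weights: {fp32_size:.0f} GB")
print(f"Fits in one GPU: {fp32_size <= GPU_VRAM_GB}")
print(f"Fits across two GPUs: {fp32_size <= 2 * GPU_VRAM_GB}")
```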

In [100]:
%%writefile accelerate_scripts/16_inference_with_pipeline.py

from transformers import pipeline

checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")

prompt = "Conoces accelerate de hugging face?"
output = generator(prompt)
print(output)

Overwriting accelerate_scripts/09_inference_with_pipeline.py


Now we run it. Since `pipeline` uses `accelerate` under the hood, we don't need to run it with `accelerate launch script.py`; `python script.py` is enough.

In [100]:
!python accelerate_scripts/16_inference_with_pipeline.py

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00,  2.27s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[{'generated_text': 'Conoces accelerate de hugging face? ¿Qué es el modelo de lenguaje de transformers y cómo se utiliza en el marco de hugging face? ¿Cómo puedo utilizar modelos de lenguaje de transformers en mi aplicación? ¿Qué son los tokenizers y cómo se utilizan en el marco de hugging face? ¿Cómo puedo crear un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los datasets y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar datasets para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo 

As can be seen, it has not answered, but has kept asking questions. This is because Llama3 is a language model that predicts the next token, so given my prompt it considered the most likely next tokens to be more questions. This makes sense, because people who are unsure about a topic often ask many questions in a row. To get it to answer the question, we need to condition it a bit.

In [100]:
%%writefile accelerate_scripts/17_inference_with_pipeline_condition.py

from transformers import pipeline

checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")

prompt = "Conoces accelerate de hugging face?"
messages = [
    {
        "role": "system",
        "content": "Eres un chatbot amigable que siempre intenta solucionar las dudas",
    },
    {"role": "user", "content": f"{prompt}"},
]
output = generator(messages)
print(output[0]['generated_text'][-1])

Overwriting accelerate_scripts/10_inference_with_pipeline_condition.py


As you can see, we have built a list of messages with roles: a system message that conditions the model and a user message that contains the prompt.

In [100]:
!python accelerate_scripts/17_inference_with_pipeline_condition.py

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00,  2.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '¡Hola!\n\nSí, conozco Accelerate de Hugging Face. Accelerate es una biblioteca de Python desarrollada por Hugging Face que se enfoca en simplificar y acelerar el entrenamiento y la evaluación de modelos de lenguaje en diferentes dispositivos y entornos.\n\nCon Accelerate, puedes entrenar modelos de lenguaje en diferentes plataformas y dispositivos, como GPUs, TPUs, CPUs y servidores, sin necesidad de cambiar el código de tu modelo. Esto te permite aprovechar al máximo la potencia de cálculo de tus dispositivos y reducir el tiempo de entrenamiento.\n\nAccelerate también ofrece varias características adicionales, como:\n\n* Soporte para diferentes frameworks de machine learning, como Ten

Now the response does answer our prompt

#### Inference with `AutoClass`

Lastly, we will see how to perform inference using only `AutoClass`.

In [100]:
%%writefile accelerate_scripts/18_inference_with_autoclass.py

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoints)  # The tokenizer holds no model weights, so it does not need device_map
model = AutoModelForCausalLM.from_pretrained(checkpoints, device_map="auto")
streamer = TextStreamer(tokenizer)

prompt = "Conoces accelerate de hugging face?"
tokens_input = tokenizer([prompt], return_tensors="pt").to(model.device)

_ = model.generate(**tokens_input, streamer=streamer, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

Overwriting accelerate_scripts/11_inference_with_autoclass.py


As can be seen, the `streamer` object has been created and is then passed to the model's `generate` method. This is useful for printing each word as it is generated, rather than waiting for the entire output to be generated before printing it.

In [100]:
!python accelerate_scripts/18_inference_with_autoclass.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09<00:00,  2.28s/it]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
<|begin_of_text|>Conoces accelerate de hugging face? Si es así, puedes utilizar la biblioteca `transformers` de Hugging Face para crear un modelo de lenguaje que pueda predecir la siguiente palabra en una secuencia de texto.

Aquí te muestro un ejemplo de cómo hacerlo:
```
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Cargar el modelo y el tokenizador
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Cargar el conjunto de datos
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Preprocesar los datos
train_texts = train_df["t

### Use PyTorch

Normally, the way to run inference with PyTorch is to create a model with randomly initialized weights and then load a `state dict` with the pretrained model's weights. To get that `state dict`, we will first take a small shortcut: download a pretrained model and save its weights.

In [100]:
import torch
import torchvision.models as models

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')

Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/maximo.fernandez/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [02:48<00:00, 1.43MB/s] 


Now that we have the `state dict`, we are going to perform inference as it is typically done in PyTorch.

In [100]:
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"     # Set device

resnet152 = models.resnet152().to(device) # Create model with random weights and move to device
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) # Load pretrained weights into device memory
resnet152.load_state_dict(state_dict) # Load these weights into the model

input = torch.rand(1, 3, 224, 224).to(device)  # Random image with batch size 1
output = resnet152(input)
output.shape

torch.Size([1, 1000])

Let's explain what has happened
* When we did `resnet152 = models.resnet152().to(device)`, a ResNet152 with random weights was loaded into the GPU memory.
* When we did `state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)`, a dictionary with the trained weights was loaded into the GPU memory.
* When we did `resnet152.load_state_dict(state_dict)`, those pretrained weights were assigned to the model.

That is, the model has been loaded twice into the GPU memory.

You might be wondering why we first did
``` python
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')
```

and later
``` python
resnet152 = models.resnet152().to(device)
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)
resnet152.load_state_dict(state_dict)
```

instead of directly using
``` python
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
```

and skipping the step of saving the `state dict` to load it later. The reason is that PyTorch does the same thing under the hood, so to show the whole process we have broken down into several lines what PyTorch does in one.

This way of working has been fine so far, while models were small enough for users' GPUs. But since the arrival of LLMs, this approach no longer makes sense.
For example, a 6B parameter model in FP32 would occupy 24 GB of memory, and since it is loaded twice with this way of working, you would need a 48 GB GPU.

So to fix this, the way to load a pretrained model with `accelerate` is:
* Create an empty model with `init_empty_weights`, which will not occupy RAM.
* Then load the weights with `load_checkpoint_and_dispatch`, which loads a checkpoint into the empty model and distributes the weights of each layer across all available devices (GPU, CPU RAM, and hard drive), thanks to setting `device_map="auto"`.

In [100]:
import torch
import torchvision.models as models
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    resnet152 = models.resnet152()

resnet152 = load_checkpoint_and_dispatch(resnet152, checkpoint='accelerate_scripts/resnet152_pretrained.pth', device_map="auto")

device = "cuda" if torch.cuda.is_available() else "cpu"

input = torch.rand(1, 3, 224, 224).to(device)  # Random image with batch size 1
output = resnet152(input)
output.shape

torch.Size([1, 1000])

### How accelerate works under the hood

In this video you can see graphically how accelerate works under the hood

<iframe width="1280" height="720" src="https://www.youtube.com/embed/MWCSGj9jEAo" title="Accelerate Big Model Inference: How Does it Work?" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

#### Initialization of an empty model

`Accelerate` creates the skeleton of an empty model using `init_empty_weights` so that it occupies as little memory as possible.

For example, let's see how much RAM I have available on my computer now.

In [100]:
import psutil

def get_ram_info():
    ram = dict(psutil.virtual_memory()._asdict())
    print(f"Total RAM: {(ram['total']/1024/1024/1024):.2f} GB, Available RAM: {(ram['available']/1024/1024/1024):.2f} GB, Used RAM: {(ram['used']/1024/1024/1024):.2f} GB")

get_ram_info()

Total RAM: 31.24 GB, Available RAM: 22.62 GB, Used RAM: 7.82 GB


I have about 22 GB of RAM available.
Now let's try to create a model with 5000x1000x1000 parameters, that is, 5B parameters. If each parameter is in FP32, it would require 20 GB of RAM.

In [100]:
import torch
from torch import nn

model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])

If we look at the RAM again

In [100]:
get_ram_info()

Total RAM: 31.24 GB, Available RAM: 3.77 GB, Used RAM: 26.70 GB


As we can see, we now only have about 3.8 GB of RAM available.

Now we are going to delete the model to free up RAM

In [100]:
del model
get_ram_info()

Total RAM: 31.24 GB, Available RAM: 22.44 GB, Used RAM: 8.03 GB


We have about 22 GB of RAM available again.

Let's now use `init_empty_weights` from `accelerate` and then check the RAM

In [100]:
from accelerate import init_empty_weights

with init_empty_weights():
    model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])

get_ram_info()

Total RAM: 31.24 GB, Available RAM: 22.32 GB, Used RAM: 8.16 GB


We previously had exactly 22.44 GB free, and after creating the model with `init_empty_weights` we have 22.32 GB. The RAM savings are huge! Almost no RAM was used to create the model.
This relies on the meta device introduced in PyTorch 1.9, so to use this feature of `accelerate` we need PyTorch 1.9 or later.
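
The idea behind the meta device can be seen with plain PyTorch. This is a rough sketch of the underlying mechanism, not of how `init_empty_weights` is implemented internally, and it assumes a PyTorch version whose `nn.Linear` accepts `device="meta"`. The variable names are chosen for this example only.

```python
from torch import nn

# Creating the layer directly on the "meta" device: tensors keep their shape
# and dtype but have no storage, so no memory is allocated for the weights.
layer = nn.Linear(5000, 1000, device="meta")

params_per_layer = sum(p.numel() for p in layer.parameters())  # weight + bias

print(layer.weight.device)        # device the parameter lives on
print(layer.weight.shape)         # the shape is still known
print(params_per_layer * 1000)    # parameter count of the 1000-layer model above
```

Because only the metadata exists, the full 1000-layer model can be described without ever allocating the roughly 20 GB that its FP32 weights would need.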

#### Loading the weights

Once the model is initialized, we need to load its weights. We do this with `load_checkpoint_and_dispatch`, which, as its name suggests, loads the weights and dispatches them to the necessary device or devices.