# Data Parallel

DP splits each batch across the GPUs, runs the model on every GPU in parallel, then gathers and merges the results on one device.
This approach only works when the whole model fits on each device, because every device receives a full copy of the model.

The steps for DP:

  1. load the model and the batch data to GPU0
  2. divide the batch data on GPU0 among the other devices
  3. copy the model on GPU0 to the other devices
  4. forward pass on all devices
  5. gather all outputs on GPU0, compute the loss
  6. copy the loss to all devices and do the backward pass to compute gradients
  7. gather the gradients on GPU0
  8. update the model on GPU0
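
The steps above can be sketched without any GPU. The following is a minimal pure-Python simulation, not the `DataParallel` implementation: the "devices" are plain list entries, the model is a hypothetical one-parameter linear model `y = w * x` with a squared-error loss, and the helper names (`split_batch`, `shard_gradient`, `dp_step`, `single_device_step`) are made up for illustration. It suggests why the result of a DP step should match a single-device step on the full batch.

```python
def split_batch(batch, n_devices):
    """Step 2: divide the batch into contiguous shards, one per device."""
    size = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def shard_gradient(w, shard):
    """Steps 4-6 on one device: sum of d/dw (w*x - y)^2 over the shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def dp_step(w, batch, n_devices, lr=0.01):
    shards = split_batch(batch, n_devices)      # step 2: scatter the data
    replicas = [w for _ in shards]              # step 3: copy the model
    grads = [shard_gradient(r, s) for r, s in zip(replicas, shards)]  # steps 4-6
    total_grad = sum(grads) / len(batch)        # step 7: gather on device 0
    return w - lr * total_grad                  # step 8: update on device 0


def single_device_step(w, batch, lr=0.01):
    """Reference: the same update computed on the whole batch at once."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_dp = dp_step(0.0, batch, n_devices=2)
w_single = single_device_step(0.0, batch)
print(w_dp, w_single)
```

Both calls produce the same updated weight, because the per-shard gradients add up to the full-batch gradient.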

In practice, we use PyTorch's `torch.nn.DataParallel`.


Warning: I had a lot of problems using this approach:
 - the wrapped model seems to have a hard time releasing memory after running.
 - how the data is copied between GPUs is opaque.

This approach is not recommended. It is better to use the distributed data parallel (DDP) approach.

## I. Example on Classification Using PyTorch

We follow 7 steps for training:

 1) load dataset
 2) data loader
 3) load model
 4) define optimizer
 5) define eval function
 6) training loop
 7) train

Compared to training without DP, there are 3 modifications:

    A) Wrap model: in step 3, after loading the model, wrap it for parallel computation.
    B) Scatter data: in step 6, inside the training loop, remove the op that sends the data to GPU0.
    C) Loss mean: with DP, the loss is a tensor holding one loss per GPU. Before the backward pass, we have to reduce it to a scalar by taking the mean of all losses.
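
Modification C can be illustrated with plain numbers. This is a sketch under assumptions, not the library's code: the batch is assumed to be split into contiguous chunks of `ceil(len / n_devices)` items, each device is assumed to report the mean loss of its own chunk, and `chunk` and `per_device_losses` are hypothetical helpers. It shows that the loss has one entry per device, and that the mean of those entries equals the full-batch mean only when the chunks have the same size.

```python
from statistics import fmean


def chunk(values, n_devices):
    """Split into contiguous chunks of ceil(len / n_devices) items."""
    size = -(-len(values) // n_devices)
    return [values[i:i + size] for i in range(0, len(values), size)]


def per_device_losses(sample_losses, n_devices):
    """Each device reports the mean loss of its own chunk."""
    return [fmean(c) for c in chunk(sample_losses, n_devices)]


# Four per-sample losses on two devices: two chunks of 2.
even = [0.2, 0.4, 0.6, 0.8]
loss_vector = per_device_losses(even, n_devices=2)  # one value per device
print(loss_vector, fmean(loss_vector), fmean(even))

# Five per-sample losses on two devices: chunks of 3 and 2.
uneven = [0.2, 0.4, 0.6, 0.8, 1.0]
loss_vector_uneven = per_device_losses(uneven, n_devices=2)
print(loss_vector_uneven, fmean(loss_vector_uneven), fmean(uneven))
```

With uneven chunks, the mean of the per-device losses gives the samples in the smaller chunk a larger weight, so it drifts slightly from the full-batch mean.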

In [1]:
# my gpus are not connected, so I have to disable this option

import os

os.environ["NCCL_P2P_DISABLE"] = "1"

In [2]:
# imports

import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [3]:
ckp_data = "davidberg/sentiment-reviews"
ckp = "google-bert/bert-base-uncased"
# ckp = "michellejieli/emotion_text_classifier"

### 1. load dataset

In [4]:
# load data
data = load_dataset(ckp_data)
data

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review', 'polarity', 'division'],
        num_rows: 4084
    })
})

In [5]:
# dataset

from torch.utils.data import Dataset

class dataset(Dataset):

    label2id = {"positive": 0, "negative": 1}

    def __init__(self, _data):

        super().__init__()

        self.data = {"review":[], "division":[]}
        for i in range(len(_data["train"]["review"])):
            
            if _data["train"][i]["division"] in dataset.label2id.keys():
                self.data["review"].append(_data["train"][i]["review"])
                self.data["division"].append(_data["train"][i]["division"])
            # else:
            #     print(_data["train"][i]["review"], _data["train"][i]["division"])

    
    def __getitem__(self, index):

        return self.data["review"][index], dataset.label2id.get(self.data["division"][index])
    
    def __len__(self):

        return len(self.data["review"])


In [6]:
# construct dataset

ds = dataset(data)
print(len(ds))

3548


In [7]:
# split dataset

from torch.utils.data import random_split

trainset, validset = random_split(ds, lengths=[0.9, 0.1])


### 2. data loader

In [8]:
# load tokenizer

tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [9]:
# function collate

def collate_fct(batch):

    texts, labels = [], []

    for item in batch:

        texts.append(item[0])
        labels.append(item[1])

    toks = tokenizer(texts, max_length=512, truncation=True, padding="max_length", return_tensors="pt", add_special_tokens=True)

    toks["labels"] = torch.tensor(labels)

    return toks

In [10]:
# dataloader

from torch.utils.data import DataLoader

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=collate_fct)
validloader = DataLoader(validset, batch_size=32, shuffle=False, collate_fn=collate_fct)

### 3. load Model

In [11]:
# load model

model = AutoModelForSequenceClassification.from_pretrained(ckp)

if torch.cuda.is_available():
    model = model.cuda()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
##################
#  A. wrap model #
##################

# wrap the model for data parallel computation

# device_ids = None to use all available GPUs

model = torch.nn.DataParallel(model, device_ids=None)
# model = torch.nn.parallel.DataParallel(model, device_ids=None)
model

DataParallel(
  (module): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=7

In [13]:
# show devices

model.device_ids

[0, 1]

In [14]:
# this shows that the BERT model was wrapped as a "DataParallel" object
# to retrieve the BERT model, use: model = model.module

model.module

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e
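
Because the wrapper stores the original model under `.module`, the parameter names in its `state_dict()` carry a `module.` prefix. A minimal sketch, using a small stand-in `torch.nn.Linear` layer instead of BERT (an assumption made to keep the example fast; it should also run on a CPU-only machine, where `DataParallel` simply calls the wrapped module):

```python
import torch

# Stand-in model (hypothetical); the notebook wraps a BERT classifier instead.
base = torch.nn.Linear(4, 2)
wrapped = torch.nn.DataParallel(base)

# The wrapper keeps a reference to the original model, not a copy.
same_object = wrapped.module is base

wrapped_keys = list(wrapped.state_dict().keys())
plain_keys = list(wrapped.module.state_dict().keys())
print(same_object)
print(wrapped_keys)
print(plain_keys)
```

Saving `model.module.state_dict()` instead of `model.state_dict()` is one way to keep a checkpoint loadable into an unwrapped model.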

### 4. define optimizer

In [13]:
# define optimizer

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

### 5. define eval function

In [14]:
def evaluate():

    model.eval()

    acc_num = 0
    count = 0

    with torch.inference_mode():
        
        for _batch in validloader:
            
            # if torch.cuda.is_available():
            #     _batch = {k:v.cuda() for k, v in _batch.items()}

            output = model(**_batch)

            pred = torch.argmax(output.logits, dim=-1)

            count += len(_batch["labels"])

            acc_num += (_batch["labels"].int() == pred.int().cpu()).sum()

    return acc_num /count

### 6. training loop

In [15]:
# train

def train(epoch=1, log_step=200):

    gStep = 0

    for e in range(epoch):

        model.train()

        for batch in trainloader:
            
            ###################
            # B. Scatter data #
            ###################

            # on multiple GPU training, don't send the data to gpu
            # The DataParallel module will divide the data and copy them from cpu to each gpu
            # If we send the original data to gpu first, the module will copy garbage to gpus

            # if torch.cuda.is_available():
            #     batch = {k:v.to("cuda:0") for k, v in batch.items()}
            if batch["labels"].size()[0] != 32:
                continue

            optimizer.zero_grad()

            output = model(**batch)

            ################
            # C. Loss mean #
            ################

            # The DP steps above say that the loss is computed on GPU0.
            # However, here the loss is computed on each GPU and then copied to GPU0.
            # Therefore, the loss on GPU0 is now a vector instead of a scalar.
            # So we have to convert the vector loss to a scalar loss for back-prop to work.
            # So instead of doing the line (1)
            # output.loss.backward() # (1) 
            # we do the following line:

            output.loss.mean().backward()

            # if uncomment the following line, we will see the loss is a vector of n values
            # where n is the number of devices (use model.device_ids)
            # print("old loss: ", batch["labels"].size(), output.loss)

            optimizer.step()

            if gStep % log_step == 0:

                print(f"epoch {e+1} / {epoch}: global step: {gStep}, loss: {output.loss.mean().item()}")

            gStep += 1

        acc = evaluate()

        print(f"epoch: {e+1} : acc: {acc}")


### 7. train

In [17]:
train(log_step=20)

epoch 1 / 1: global step: 0, loss: 0.12998534739017487
epoch 1 / 1: global step: 20, loss: 0.13787509500980377
epoch 1 / 1: global step: 40, loss: 0.11812254041433334
epoch 1 / 1: global step: 60, loss: 0.0784941017627716
epoch 1 / 1: global step: 80, loss: 0.06834685802459717
epoch: 1 : acc: 0.8757061958312988


- 1 GPU - batch 32 - 4.4G - 41.1s
- 2 GPUs - batch 32 - 45.3s

In this case DP does not reduce the training time: two GPUs are slightly slower than one. With a larger batch size, we could see a bigger difference.
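
Wall-clock figures like the ones above can be collected with a small helper. This is a minimal sketch using only the standard library; `timed` and its `sync` parameter are hypothetical names. CUDA kernels run asynchronously, so when timing GPU work one would typically pass `torch.cuda.synchronize` as `sync`, so that pending kernels finish before the clock is read.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store the elapsed wall-clock seconds of the with-block in results[label].

    sync: optional zero-argument callable run before each clock read,
    e.g. torch.cuda.synchronize when timing GPU work.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = time.perf_counter() - start


results = {}
with timed("sleep", results):
    time.sleep(0.05)  # stand-in for train() or an inference loop
print(f"sleep: {results['sleep']:.3f}s")
```

In the notebook, the body of the `with` block would be the call to `train(...)`, once with the wrapped model and once without it.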

TODO: PyTorch parallel is not working.
For this to work, we have to change the source code, which is not recommended.

## II. Example on Classification Using Trainer

In [1]:
import os

os.environ["NCCL_P2P_DISABLE"] = "1"

In [2]:
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckp_data = "davidberg/sentiment-reviews"
ckp = "google-bert/bert-base-uncased"

# load data
data = load_dataset(ckp_data)

split_data = data["train"].train_test_split(test_size=0.2)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckp)

label2id = {"positive": 0, "negative": 1}

# process data
def process(samples):

    _data = {"review":[], "division":[]}
    for i in range(len(samples["review"])):
        if samples["division"][i] in label2id.keys():
            _data["review"].append(samples["review"][i])
            _data["division"].append(samples["division"][i])

    toks = tokenizer(_data["review"], max_length=128, truncation=True, padding="max_length", return_tensors="pt")

    toks["labels"] = [label2id.get(d) for d in _data["division"]]

    return toks

tokenized_data = split_data.map(process, batched=True, remove_columns=split_data["train"].column_names)

# load model
model = AutoModelForSequenceClassification.from_pretrained(ckp)

# metric
acc_fct = evaluate.load("accuracy")
f1_fct = evaluate.load("f1")

def metric(pred):

    preds, refs = pred

    preds = preds.argmax(axis=-1)

    acc = acc_fct.compute(predictions=preds, references=refs)
    f1 = f1_fct.compute(predictions=preds, references=refs)
    acc.update(f1)

    return acc




Map:   0%|          | 0/3267 [00:00<?, ? examples/s]

Map:   0%|          | 0/817 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
# training

from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="../tmp/checkpoints",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=32,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    metric_for_best_model="f1",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model, 
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding=True),
    compute_metrics=metric
)
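
With these arguments, the number of samples that contribute to one optimizer update is much larger than `per_device_train_batch_size` suggests. A small arithmetic sketch, assuming the two GPUs of this notebook and assuming the `Trainer` multiplies the per-device batch size by the number of GPUs when it runs in DP mode; `effective_batch_size` is a hypothetical helper, not a `transformers` function:

```python
def effective_batch_size(per_device, n_gpu, grad_accum_steps):
    """Samples that contribute to a single optimizer update."""
    return per_device * max(1, n_gpu) * grad_accum_steps


per_step = effective_batch_size(32, 2, 1)      # samples per forward/backward pass
per_update = effective_batch_size(32, 2, 32)   # samples per optimizer update
print(per_step, per_update)
```

The training split holds at most 3,267 examples (the size shown by the `Map` progress bar before filtering), so 2,048 samples per update leaves at most two optimizer updates per epoch. A smaller `gradient_accumulation_steps` may be worth trying.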


In [19]:
# here we see: _n_gpu=2
# when this argument is greater than 1, the trainer will use DP automatically.
# See source code for details

args

TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=None,
eval_strategy=epoch,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp

In [None]:
# we got error due to pytorch DP bug

trainer.train()

## Comments

 * We see that DP doesn't really increase the training speed, and can even decrease it.
 * This is a synchronous process, whose processing time depends on the slowest device.
 * The main GPU needs more memory for synchronization.
 * It is limited to a single machine with multiple GPUs.

This approach is not recommended. It is better to use the distributed data parallel (DDP) approach.

However, it can be used for inference.

# Inference

In [1]:
# my gpus are not connected, so I have to disable this option

import os

os.environ["NCCL_P2P_DISABLE"] = "1"

In [5]:
# load model and data

# imports
import torch
from datasets import load_dataset
from torch.utils.data import Dataset, random_split, DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckp_data = "davidberg/sentiment-reviews"
ckp = "google-bert/bert-base-uncased"

# load data
data = load_dataset(ckp_data)

# dataset
class dataset(Dataset):

    label2id = {"positive": 0, "negative": 1}

    def __init__(self, _data):

        super().__init__()

        self.data = {"review":[], "division":[]}
        for i in range(len(_data["train"]["review"])):
            
            if _data["train"][i]["division"] in dataset.label2id.keys():
                self.data["review"].append(_data["train"][i]["review"])
                self.data["division"].append(_data["train"][i]["division"])
            # else:
            #     print(_data["train"][i]["review"], _data["train"][i]["division"])

    
    def __getitem__(self, index):

        return self.data["review"][index], dataset.label2id.get(self.data["division"][index])
    
    def __len__(self):

        return len(self.data["review"])

ds = dataset(data)

# split dataset
trainset, validset = random_split(ds, lengths=[0.9, 0.1])

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckp)

# function collate
def collate_fct(batch):

    texts, labels = [], []

    for item in batch:

        texts.append(item[0])
        labels.append(item[1])

    toks = tokenizer(texts, max_length=512, truncation=True, padding="max_length", return_tensors="pt", add_special_tokens=True)

    toks["labels"] = torch.tensor(labels)

    return toks


# load model
model = AutoModelForSequenceClassification.from_pretrained(ckp)

if torch.cuda.is_available():
    model = model.cuda()

# wrap model
parallel_model = torch.nn.DataParallel(model, device_ids=None)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# loaders
trainloader = DataLoader(trainset, batch_size=32, shuffle=False, collate_fn=collate_fct)
validloader = DataLoader(validset, batch_size=32, shuffle=False, collate_fn=collate_fct)

In [7]:
%%time

mode = "single"

with torch.inference_mode():

    for batch in trainloader: # we use trainloader which has more data

        if mode == "single":  
            if torch.cuda.is_available():
                batch = {k:v.cuda() for k, v in batch.items()} 
            output = parallel_model.module(**batch)
        
        # here we copy the model to gpus for each batch
        if mode == "multiple":
            output = parallel_model(**batch)

CPU times: user 26.2 s, sys: 1.74 s, total: 28 s
Wall time: 18.3 s


In [7]:
# TODO

# another way to optimize

# however, we can't use model.generate to get results.
# We can only use parallel_model.module(**input) to get logits and eventually the label

# we copy the model only once 
replicas = parallel_model.replicate(parallel_model.module, parallel_model.device_ids)

In [None]:
%%time

with torch.inference_mode():

    for batch in trainloader:
        if torch.cuda.is_available():
            batch = {k:v.cuda() for k, v in batch.items()} 

            # split the batch across the devices (one kwargs dict per device)
            inputs, module_kwargs = parallel_model.scatter(inputs=None, kwargs=batch, device_ids=parallel_model.device_ids)

            # debug: print the index, size and device of each shard
            for i, kwargs in enumerate(module_kwargs):
                print(i, len(kwargs["labels"]), kwargs["labels"].get_device())

            # run the replicas in parallel, then gather the outputs on the output device
            outputs = parallel_model.parallel_apply(replicas, inputs, module_kwargs)
            outputs = parallel_model.gather(outputs, parallel_model.output_device)

We compared single-GPU and multi-GPU inference. This is not a rigorous timing method, but it illustrates the effect.
The larger the batch size, the more time multi-GPU inference saves.

| Method        | Batch size 32 | Batch size 64 | Batch size 128 |
| ------------- | ------------- | ------------- | -------------- |
| Single GPU    | 18.3 s        | 17.9 s        | 17.4 s         |
| Multiple GPUs | 16.2 s        | 16.4 s        | 10.5 s         |
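
The speed-up implied by the table can be computed directly. The sketch below only reuses the timings listed above (assumed to be wall-clock seconds, as in the `%%time` output); the variable names are illustrative.

```python
# Timings from the table, keyed by batch size.
single_gpu = {32: 18.3, 64: 17.9, 128: 17.4}
multi_gpu = {32: 16.2, 64: 16.4, 128: 10.5}

# Ratio above 1 means the multi-GPU run was faster.
speedup = {bs: single_gpu[bs] / multi_gpu[bs] for bs in single_gpu}

for bs, ratio in speedup.items():
    print(f"batch size {bs}: {ratio:.2f}x")
```

Only the run with batch size 128 shows a clear gain (about 1.66x); at batch sizes 32 and 64 the gain is roughly 10%.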

Extra:

 * explanation of NCCL: https://medium.com/polo-club-of-data-science/how-to-measure-inter-gpu-connection-speed-single-node-3d122acb93f8

 * git repo for NCCL: https://github.com/nvidia/nccl