int8 quantization doesn't work with accelerate on multi-GPUs #875

Closed · giulio98 opened this issue Nov 19, 2022 · 10 comments

giulio98 commented Nov 19, 2022

System Info

python 3.8
pytorch 1.12
openmpi 4.1.0
cuda 11.3
cudnn8
ubuntu 20.04
accelerate==0.14.0
transformers==4.24.0
bitsandbytes==0.35.4

1 node with 4xT4 GPUs

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
model, dataloader = accelerator.prepare(model, dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

Expected behavior

accelerator.unwrap_model(model).generate(...) should work fine, but instead it fails with the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Full backtrace:

/bin/bash: /azureml-envs/pytorch-1.12/lib/libtinfo.so.6: no version information available (required by /bin/bash)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Plugin Path : /usr/local/nccl-rdma-sharp-plugins/lib/libnccl-net.so
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO P2P plugin IBext
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/IB : No device found.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Setting affinity for GPU 0 to ffff
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all rings
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all trees
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO comm 0x7fd570002fb0 rank 0 nranks 4 cudaDev 0 busId 402400000 - Init COMPLETE
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Launch mode Parallel

0it [00:00, ?it/s]
0it [00:02, ?it/s]
Traceback (most recent call last):
File "test_8bit.py", line 49, in
output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 1543, in generate
return self.sample(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 2482, in sample
outputs = self(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 929, in forward
outputs = self.model.decoder(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 693, in forward
layer_outputs = decoder_layer(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 321, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

giulio98 (Author)

The script works fine with a single T4 GPU; the error appears only with multiple GPUs.

sgugger (Collaborator) commented Nov 21, 2022

The problem is that you are passing your model to Accelerator.prepare, which puts it on GPU 0 and destroys the placement done by device_map="auto". Do not pass the model to this method and it will work fine (you will also be able to remove the unwrap).
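
A minimal sketch of this suggestion, assuming the transformers imports that the snippets above omit. The shortened prompt list, the first_device lookup via hf_device_map, and running this as a single plain python process (no accelerate launch, since device_map="auto" already spreads the model over the four GPUs) are illustrative choices, not something prescribed in this thread:

# Minimal sketch: keep device_map="auto", do not pass the model to
# accelerator.prepare, and move the inputs to the device that holds the first
# model shard (assumed here to be the embedding layer) before calling generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

# hf_device_map records where each submodule was placed; its first entry is
# assumed here to correspond to the embedding layer's device.
first_device = next(iter(model.hf_device_map.values()))

prompts = ["Hello world", "Hello my name is"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(first_device)
    with torch.no_grad():
        output = model.generate(**inputs, min_length=30, max_length=30, do_sample=True)
    print(tokenizer.decode(output[0].tolist()))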

giulio98 (Author) commented Nov 22, 2022

Thanks for the response. I would like to use Accelerator.prepare to split the dataset across all the available GPUs. What you suggest works only for 4 sentences on 4 GPUs; with more sentences, the execution hangs.

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = model.generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

sgugger (Collaborator) commented Nov 22, 2022

You can't use data parallelism with device_map="auto": the model expects its inputs on GPU 0, and each forward pass then runs one part of the model on GPU 0, then GPU 1, 2, and 3.
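
If the int8 checkpoint fits on a single T4, one possible workaround (not proposed in this thread, so treat it as an assumption) is to skip model sharding entirely and give each process its own full copy, using Accelerate only to split the prompts. The split_between_processes helper used below comes from Accelerate releases newer than the 0.14.0 pinned above:

# Hypothetical data-parallel sketch: one full int8 copy of the model per GPU,
# launched with `accelerate launch --num_processes 4 script.py`.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"
accelerator = Accelerator()

# Place the whole model on this process's GPU instead of sharding it across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map={"": accelerator.process_index},
    load_in_8bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

prompts = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you",
]

# Each process receives its own slice of the prompt list.
with accelerator.split_between_processes(prompts) as local_prompts:
    for prompt in local_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        with torch.no_grad():
            output = model.generate(**inputs, min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))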

giulio98 (Author)

Hello, this behavior is indeed quite strange: if the script above works for the first batch, I don't see why it shouldn't work for the second. In any case, is there a list somewhere of the libraries supported by Accelerate and those that are not? At the moment it is not very clear how to use this library with int8 quantization and deepspeed_for_inference.

pacman100 (Contributor)

Hello @giulio98, https://github.com/huggingface/accelerate#supported-integrations has the list of all the integrations supported by Accelerate. For more details and guidance on how to use them, please refer to the How-To Guides in the docs: https://huggingface.co/docs/accelerate/index

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Pacific-wide

Is there any solution for using data parallelism with an int8-quantized model?

giulio98 (Author) commented Feb 5, 2024

Hi, I'm reopening this issue to ask whether it is currently feasible to perform inference across multiple GPUs (with the weights distributed across them) while also employing data parallelism. Specifically, is it viable to use PyTorch's Fully Sharded Data Parallel (FSDP) for this?

muellerzr (Collaborator) commented Feb 5, 2024

@giulio98 we're working on that, stay tuned :) #2345

(via fsdp, the answer is still the same)
