int8 quantization doesn't work with accelerate on multi-GPUs #875

Closed · giulio98 opened this issue Nov 19, 2022 · 10 comments

giulio98 commented Nov 19, 2022

System Info

python 3.8
pytorch 1.12
openmpi 4.1.0
cuda 11.3
cudnn8
ubuntu 20.04
accelerate==0.14.0
transformers==4.24.0
bitsandbytes==0.35.4

1 node with 4xT4 GPUs

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
model, dataloader = accelerator.prepare(model, dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

Expected behavior

accelerator.unwrap_model(model).generate(...) should work fine, but instead it fails with the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Full backtrace:

/bin/bash: /azureml-envs/pytorch-1.12/lib/libtinfo.so.6: no version information available (required by /bin/bash)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Plugin Path : /usr/local/nccl-rdma-sharp-plugins/lib/libnccl-net.so
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO P2P plugin IBext
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/IB : No device found.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Setting affinity for GPU 0 to ffff
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all rings
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all trees
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO comm 0x7fd570002fb0 rank 0 nranks 4 cudaDev 0 busId 402400000 - Init COMPLETE
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Launch mode Parallel

0it [00:00, ?it/s]
0it [00:02, ?it/s]
Traceback (most recent call last):
File "test_8bit.py", line 49, in
output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 1543, in generate
return self.sample(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 2482, in sample
outputs = self(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 929, in forward
outputs = self.model.decoder(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 693, in forward
layer_outputs = decoder_layer(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 321, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

giulio98 (Author)

The script works fine with a single T4 GPU; the error appears only with multiple GPUs.

sgugger (Collaborator) commented Nov 21, 2022

The problem is that you are passing your model to Accelerator.prepare, which puts it on GPU 0 and destroys the placement done by device_map="auto". Do not pass the model to this method and it will work fine (you will also be able to remove the unwrap).
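
A minimal sketch of this suggestion, assuming the transformers imports that the snippets above omit. The shortened prompt list, the first_device lookup via hf_device_map, and running this as a single plain python process (no accelerate launch, since device_map="auto" already spreads the model over the four GPUs) are illustrative choices, not something prescribed in this thread:

# Minimal sketch: keep device_map="auto", do not pass the model to
# accelerator.prepare, and move the inputs to the device that holds the first
# model shard (assumed here to be the embedding layer) before calling generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

# hf_device_map records where each submodule was placed; its first entry is
# assumed here to correspond to the embedding layer's device.
first_device = next(iter(model.hf_device_map.values()))

prompts = ["Hello world", "Hello my name is"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(first_device)
    with torch.no_grad():
        output = model.generate(**inputs, min_length=30, max_length=30, do_sample=True)
    print(tokenizer.decode(output[0].tolist()))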

giulio98 (Author) commented Nov 22, 2022

Thanks for the response. I would like to use Accelerator.prepare to split the dataset across all the available GPUs. What you suggest works only for 4 sentences on 4 GPUs; with more sentences, the execution hangs.

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = model.generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

sgugger (Collaborator) commented Nov 22, 2022

You can't use data parallelism with device_map="auto": the model expects its inputs on GPU 0, and each forward pass then runs one part of the model on GPU 0, then GPU 1, 2, and 3.
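
If the int8 checkpoint fits on a single T4, one possible workaround (not proposed in this thread, so treat it as an assumption) is to skip model sharding entirely and give each process its own full copy, using Accelerate only to split the prompts. The split_between_processes helper used below comes from Accelerate releases newer than the 0.14.0 pinned above:

# Hypothetical data-parallel sketch: one full int8 copy of the model per GPU,
# launched with `accelerate launch --num_processes 4 script.py`.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"
accelerator = Accelerator()

# Place the whole model on this process's GPU instead of sharding it across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map={"": accelerator.process_index},
    load_in_8bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

prompts = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you",
]

# Each process receives its own slice of the prompt list.
with accelerator.split_between_processes(prompts) as local_prompts:
    for prompt in local_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        with torch.no_grad():
            output = model.generate(**inputs, min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))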

giulio98 (Author)

Hello, this behavior is indeed quite strange: if the script above works for the first batch, I don't see why it shouldn't work for the second. In any case, is there a list somewhere of the libraries supported by Accelerate and those that are not? At the moment it is not very clear how to use this library with int8 quantization and deepspeed_for_inference.

pacman100 (Contributor)

Hello @giulio98, https://github.com/huggingface/accelerate#supported-integrations has the list of all the integrations supported by Accelerate. For more details and guidance on how to use them, please refer to the How-To Guides in the docs: https://huggingface.co/docs/accelerate/index

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Pacific-wide

Is there any solution for using data parallelism with an int8-quantized model?

giulio98 (Author) commented Feb 5, 2024

Hi, I'm reopening this issue to ask whether it is currently feasible to perform inference across multiple GPUs (with the weights distributed across them) while also employing data parallelism. Specifically, is it viable to use PyTorch's Fully Sharded Data Parallel (FSDP) for this?

muellerzr (Collaborator) commented Feb 5, 2024

@giulio98 we're working on that, stay tuned :) #2345

(via fsdp, the answer is still the same)
