device_map='auto' gives bad results #20896 (Closed)

youngwoo-yoon opened this issue Dec 26, 2022 · 18 comments

@youngwoo-yoon commented Dec 26, 2022

System Info

  • transformers version: 4.25.1

  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17

  • Python version: 3.8.15

  • Huggingface_hub version: 0.11.1

  • PyTorch version (GPU?): 1.11.0 (True)

  • Tensorflow version (GPU?): not installed (NA)

  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

  • Jax version: not installed

  • JaxLib version: not installed

  • Using GPU in script?: yes

  • Using distributed or parallel set-up in script?: no

  • GPUs: two A100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Minimal test example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)

Results:

Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy

The above result is not the expected behavior.
Without device_map='auto' in the from_pretrained call, it works correctly; that line then becomes model = AutoModelForCausalLM.from_pretrained(model_name)

Results:

Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I

My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded across both GPUs when I use device_map='auto'.

Expected behavior

Explained above

@younesbelkada (Contributor) commented Dec 26, 2022

Hi @youngwoo-yoon

Thanks for the issue!
What is your version of accelerate? With the latest version (0.15.0) and the same PyTorch version, running the minimal test example shared above (with device_map='auto') on an NVIDIA T4, I get:

Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I
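
For reference, a quick way to check the installed accelerate version from inside a script (a small snippet added for illustration, not from the original comment):

from importlib.metadata import version
print(version('accelerate'))  # expect 0.15.0 or later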

@youngwoo-yoon (Author) commented Dec 26, 2022

Hello @younesbelkada,
I'm using the same version of accelerate (0.15.0).
I also get the correct result when I run with export CUDA_VISIBLE_DEVICES=0, but I still get wrong results with two GPUs (export CUDA_VISIBLE_DEVICES=0,1).

@younesbelkada (Contributor) commented Dec 26, 2022

Thanks for the details! I still have not managed to reproduce this; can you try this snippet instead:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map={"transformer.wte":0, "transformer.wpe":0, "transformer.h":1, "transformer.ln_f":1, "lm_head":1})
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)

and let me know if the problem still persists?
We're using the same PyTorch, transformers, and accelerate versions; the only difference is the hardware (I am using 2x NVIDIA T4).
Can you also try your script with export CUDA_VISIBLE_DEVICES=1 instead of export CUDA_VISIBLE_DEVICES=0?

@youngwoo-yoon (Author)

Thanks for the quick replies.
This is the result, and it still doesn't look good:

Hello, nice to meet you. How are!!!!!!!!!!!!!!!!!!!!!!!

My original test code with export CUDA_VISIBLE_DEVICES=1 gives the same correct result as with export CUDA_VISIBLE_DEVICES=0:

Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I

@younesbelkada (Contributor) commented Dec 26, 2022

I am slightly unsure about what could be causing the issue, but I suspect it's highly correlated with the fact that you're running your script on two A100s (not sure, though).
@sgugger do you think the problem could be related to accelerate and the fact that the script is running on two A100s rather than other hardware (i.e., have you seen similar discrepancy errors in the past)?
@youngwoo-yoon could you also try the script with the latest PyTorch version (1.13.1)?

@youngwoo-yoon (Author)

@younesbelkada, I got the same wrong result with PyTorch 1.13.1.

Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy

@sgugger (Collaborator) commented Dec 27, 2022

Mmmm, there is no reason for the script to give different results on different GPUs, especially since removing device_map="auto" gives the same (correct) results on each.

I also can't reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g., can you print the versions of Accelerate/Transformers/PyTorch in the same script?

@youngwoo-yoon (Author)

I put the test cases using the CPU, GPU 0, GPU 1, and device_map='auto' in a single Python file to be sure.

from importlib.metadata import version
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('torch', version('torch'))
print('transformers', version('transformers'))
print('accelerate', version('accelerate'))
print('# of gpus: ', torch.cuda.device_count())

# cpu
model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentence = 'Hello, nice to meet you. How are'
with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)
print('-------------------------------------------')

# on the gpu 0
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:0')

with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:0')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)
print('-------------------------------------------')

# on the gpu 1
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to('cuda:1')

with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    tensor_input = tensor_input.to('cuda:1')
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)
print('-------------------------------------------')

# with device_map=auto
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')

with torch.no_grad():
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    gen_tokens = model.generate(tensor_input, max_length=32)
    generated = tokenizer.batch_decode(gen_tokens)[0]

print(generated)

And this is the result:

torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus:  2
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are you?

I’m a bit of a newbie to the world of web development, but I
-------------------------------------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Hello, nice to meet you. How are noise retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy retaliateousy

And this is the nvidia-smi output:

Tue Dec 27 16:57:48 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100 80GB PCIe      Off  | 00000000:4F:00.0 Off |                    0 |
| N/A   36C    P0    47W / 300W |      9MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100 80GB PCIe      Off  | 00000000:52:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      9MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2915      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    119486      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2915      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    119486      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

@sorgfresser (Contributor)

There is a warning

/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/transformers/generation/utils.py:1470: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.

You did move the inputs when running on a single GPU; it might be necessary here too. Could you print the model's hf_device_map attribute and try moving the inputs to cuda:0 and cuda:1?
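
For example, something along these lines (a sketch reusing model and tensor_input from the snippets above; hf_device_map maps module names to device indices):

print(model.hf_device_map)
# Send the inputs to the device that holds the model's first modules
# (the embedding layer), e.g. {'transformer.wte': 0, ...} -> cuda:0
first_device = next(iter(model.hf_device_map.values()))
tensor_input = tensor_input.to(f'cuda:{first_device}')
gen_tokens = model.generate(tensor_input, max_length=32)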

@youngwoo-yoon (Author)

I moved the inputs to cuda:0 and cuda:1, but both gave the same wrong result.
Below is the output when I moved the inputs to cuda:0.

torch 1.13.1
transformers 4.25.1
accelerate 0.15.0
# of gpus: 2
hf_device_map output: {'transformer.wte': 0, 'lm_head': 0, 'transformer.wpe': 0, 'transformer.drop': 0, 'transformer.h.0': 0, 'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 1, 'transformer.h.7': 1, 'transformer.h.8': 1, 'transformer.h.9': 1, 'transformer.h.10': 1, 'transformer.h.11': 1, 'transformer.ln_f': 1}
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Hello, nice to meet you. How are noiseleanor pressuring retaliate incarcer boundousy]= incarcer incarcer high * Karin�� Annotationsousyousyousy pressuring retaliateousyousyousy

I will try to reproduce this issue on another machine with two GPUs.

@youngwoo-yoon (Author)

It works well on another machine with two Quadro 6000 GPUs.
I've tried the other device_map strategies, 'sequential' and 'balanced_low_0', but it still fails when the two A100 GPUs are used.

I ran the accelerate test command, which tests the accelerate library, and it also failed. It seems to be a problem in the accelerate library.
I found that some other people also had problems with A100 GPUs.
Related issue: huggingface/accelerate#934

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Feb 3, 2023
@yuchguo1007

Hi @younesbelkada, I got the same error with two V100s, with accelerate version 0.18.0.
prompt = 'Q: What is the largest animal?\nA:'
output:

A: The blue whale.
Q: What is the largest animal?
A: The blue whale. It is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q: What is the largest animal?
A: The blue whale is the largest animal on Earth. It is also the largest mammal. It is the largest creature that has ever lived.
Q

code:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = 'openlm-research/open_llama_3b'

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto'
)

prompt = 'Q: What is the largest animal?\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to('cuda')

generation_output = model.generate(
    input_ids=input_ids, max_length=400
)
print(tokenizer.decode(generation_output[0]))

Have you found a solution?

@nhungntaime commented Aug 29, 2023

I think you should use the same prompt format as the one used during training. Also, pay attention to the special tokens that you add.
Example:
During training, I tokenized:

`f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: <s>{input}</s>. \n### Response: <s>{output}</s>"`

Afterward, I used the model:

text = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n ### Input: {input}. \n### Response: "
batch = tokenizer(text, return_tensors='pt', padding=True, return_token_type_ids=False)
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=500)
decode = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
decode_text = decode[len(text):]
print(decode_text)

Hope this helps!

@ZaVang commented Sep 1, 2023

> It works well on another machine with two Quadro 6000 GPUs. I've tried the other device_map strategies, 'sequential' and 'balanced_low_0', but it still fails when the two A100 GPUs are used.
>
> I ran the accelerate test command, which tests the accelerate library, and it also failed. It seems to be a problem in the accelerate library. I found that some other people also had problems with A100 GPUs. Related issue: huggingface/accelerate#934

Hi @youngwoo-yoon, have you solved this problem? I have the same problem on A100s.

@tsengalb99

I'm also running into a similar issue, except with A6000s. With one A6000 and the rest of the weights on the CPU, I get coherent text; with multiple A6000s, I get garbage outputs.

@youngwoo-yoon (Author)

I solved this problem by disabling ACS in the BIOS.
This document might be helpful to some of you:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
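
For anyone who lands here later: the failure mode is silently corrupted GPU-to-GPU (peer-to-peer) transfers when PCIe ACS is enabled. A quick sanity check you can run before touching the BIOS (a sketch added for illustration, not from the original comment):

import torch

# Round-trip a tensor across the two GPUs; on a machine with broken P2P
# (e.g. ACS enabled), the copied data can come back corrupted.
a = torch.randn(1000, device='cuda:0')
b = a.to('cuda:1')
print(torch.allclose(a.cpu(), b.cpu()))  # True on a healthy setup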

@yuge-byte

> I solved this problem by disabling ACS in the BIOS. This document might be helpful to some of you: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html

Amazing!!! It works for me.
