device_map='auto' gives bad results #20896
Comments
Thanks for the issue!
Hello, @younesbelkada
Thanks for the details! I still have not managed to reproduce. Can you try this snippet instead:
and let me know if the problem still persists?
Thanks for the quick replies.
My original test code with
I am slightly unsure about what could be causing the issue, but I suspect it is highly correlated with the fact that you're running your script on two RTX A6000s. I'm not sure, though.
@younesbelkada, I got the same wrong result with PyTorch 1.13.1.
Hmm, there is no reason for the script to give different results on different GPUs, especially since removing device_map="auto" gives the same results. I also can't reproduce on my side. Are you absolutely certain your script is launched in the same Python environment you are reporting? E.g., can you print the versions of Accelerate/Transformers/PyTorch in the same script?
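The version check requested above can be done from inside the script itself. A minimal sketch using only the standard library (`importlib.metadata`), so it reports "not installed" instead of crashing when a package is missing; the function name is illustrative:

```python
import importlib.metadata as md


def report_versions(packages=("transformers", "accelerate", "torch")):
    """Return a {package: version} dict, flagging packages that are missing."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions


if __name__ == "__main__":
    for pkg, ver in report_versions().items():
        print(f"{pkg}: {ver}")
```

Running this at the top of the failing script rules out a mixed-environment problem.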
I put the test scripts using cpu, gpu0, gpu1, and device_map='auto' in a single Python file to be sure.
And this is the result:
And this is
There is a warning
You did move the inputs when processing on one of the two GPUs; it might be necessary here too. Could you print the
I moved the inputs to cuda:0 and cuda:1, but both gave the same wrong result.
I will try to reproduce this issue on another machine with two GPUs.
It works well on another machine with two Quadro 6000 GPUs. I ran
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @younesbelkada, I got the same error with two V100s, with
code:
Have you found a solution?
I think you should add the same prompt that was used during training. Also, please note the special tokens that you add.
Afterward, I used the model:
Hope this helps!
@youngwoo-yoon hi, have you solved this problem? I have the same problem on A100 |
I'm also running into a similar issue, except with A6000s. With 1 A6000 and the rest of the weights on cpu, I get coherent text. With multiple A6000s, I get garbage outputs. |
I solved this problem by disabling ACS in BIOS. |
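For readers hitting the same symptom: PCIe ACS (Access Control Services) can interfere with GPU peer-to-peer transfers, which fits garbage output appearing only in the multi-GPU case. A hedged sketch for inspecting ACS state from Linux before rebooting into the BIOS (requires the pciutils package and usually root; the exact BIOS toggle name varies by vendor):

```shell
# List any PCIe devices with ACS control entries; ACSCtl flags shown
# with '+' mean the feature is active and may block GPU peer-to-peer
# traffic. No output at all means no ACS entries were visible.
lspci -vvv 2>/dev/null | grep -i "ACSCtl" \
  || echo "no ACS entries found (or lspci needs root / is not installed)"
```

If ACSCtl shows active flags on the bridges above your GPUs, disabling ACS in the BIOS, as the comment above reports, is worth trying.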
Amazing!!! It works for me. |
System Info
transformers version: 4.25.1
Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
Python version: 3.8.15
Huggingface_hub version: 0.11.1
PyTorch version (GPU?): 1.11.0 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no
GPUs: two A100
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Minimal test example:
Results:
The above result is not expected behavior.
Without device_map='auto' at line 5, it works correctly. Line 5 becomes:
model = AutoModelForCausalLM.from_pretrained(model_name)
Results:
My machine has two A100 (80 GB) GPUs, and I confirmed that the model is loaded on both GPUs when I use device_map='auto'.
Expected behavior
Explained above