NLLB-MoE 54B multi-GPU inference throws "Expected all tensors to be on the same device" error #23385
Comments
Hi @liyier90, is

```python
# Demonstrate that only "model.encoder.layer_norm" and "model.encoder.embed_tokens"
# need to be on the same device as the input
for module, device in device_map.items():
    if module in {"model.encoder.layer_norm", "model.encoder.embed_tokens"}:
        if device != 0:
            device_map[module] = 0
    elif device == 0:
        device_map[module] = 1
```

necessary? I think ...
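For reference, here is a minimal sketch (an illustration under assumptions, not the thread's exact script) of how such a manual `device_map` can be produced and edited before loading; `infer_auto_device_map` and `init_empty_weights` come from `accelerate`, and keys in the inferred map may be coarser module groups rather than full paths:

```python
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-moe-54b"
config = AutoConfig.from_pretrained(model_name)

# Instantiate on the "meta" device so no weights are materialized.
with init_empty_weights():
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=empty_model._no_split_modules,
    dtype=torch.float16,
)

# Pin the modules that must share a device with the inputs to GPU 0.
# (Guard with "in", since the inferred map may group modules together.)
for module in ("model.encoder.embed_tokens", "model.encoder.layer_norm"):
    if module in device_map:
        device_map[module] = 0

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=device_map,
)
```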
If I comment out that block, I get the following error:

(error traceback elided)

It could be because I'm moving all inputs to device 0, but if I were to remove that `.to(0)` call, I get:

(error traceback elided)
I don't see where the error in Accelerate lies. No layer that is not supposed to be split has been split, so the issue is definitely a Transformers one.
Yeah, I think it is definitely something to do with the no-split modules not being correctly set. Having a look now.
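For context, `device_map="auto"` keeps every module class listed in the model's `_no_split_modules` attribute on a single device, so a layer type missing from that list can end up split across GPUs. A quick way to inspect it (a sketch, using the model class added in transformers 4.28):

```python
# Print which module classes the auto device mapper is told not to split.
from transformers.models.nllb_moe.modeling_nllb_moe import NllbMoeForConditionalGeneration

print(NllbMoeForConditionalGeneration._no_split_modules)
```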
@liyier90

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-moe-54b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_8bit=True,
)

# Note: the string literals without trailing commas are implicitly
# concatenated, so this list contains four inputs, not six.
batched_input = [
    'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
    "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
    "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
    "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
    'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
    "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
]

inputs = tokenizer(batched_input, return_tensors="pt", padding=True).to(1)
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
)
outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
print(outputs)
```

I had to move the inputs to device 1 because, in my case, the LM head was on device 1. You can retrieve that device with `lm_head_device = model.hf_device_map["lm_head"]`. The result I get is:

```
['Nous avons maintenant des souris de 4 mois qui ne sont pas diabétiques mais qui l\'étaient", a-t-il ajouté.', "Le Dr Ehud Ur, professeur de médecine à l'Université Dalhousie à Halifax, en Nouvelle-Écosse, et président de la division clinique et scientifique de l'Association canadienne du diabète, a averti que la recherche en était encore à ses débuts. Comme d'autres experts, il est sceptique quant à la possibilité de guérir le diabète, notant que ces résultats n'ont aucune pertinence pour les personnes atteintes de diabète de type 1.", 'Danius a déclaré: "Pour le moment, nous ne faisons rien. J\'ai appelé et envoyé des courriels à son plus proche collaborateur et j\'ai reçu des réponses très amicales. Pour l\'instant, c\'est certainement suffisant".', "Auparavant, le PDG de Ring, Jamie Siminoff, a déclaré que la société avait commencé lorsque sa sonnette n'était pas audible depuis son magasin dans son garage."]
```
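Concretely, the device-agnostic form of that `.to(1)` line would be a sketch like:

```python
# Read where the LM head was placed for this particular sharding and move
# the tokenized batch there, instead of hard-coding .to(1).
lm_head_device = model.hf_device_map["lm_head"]
inputs = tokenizer(batched_input, return_tensors="pt", padding=True).to(lm_head_device)
```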
Unfortunately, I don't think the changes in the PR were sufficient to resolve the error. I updated `transformers` to the latest commit on the PR branch and ran the code snippet you provided, but on 4 x A100 40GB, as I do not have access to 80 GB cards. I made the modification to move the inputs to the same device as `lm_head`:

```python
import os

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-moe-54b"
cache_dir = <path>

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_8bit=True,
    cache_dir=cache_dir,
)

batched_input = [
    'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.',
    "Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days."
    "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes."
    "On Monday, Sara Danius, permanent secretary of the Nobel Committee for Literature at the Swedish Academy, publicly announced during a radio program on Sveriges Radio in Sweden the committee, unable to reach Bob Dylan directly about winning the 2016 Nobel Prize in Literature, had abandoned its efforts to reach him.",
    'Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."',
    "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage.",
]

inputs = tokenizer(batched_input, return_tensors="pt", padding=True).to(
    model.hf_device_map["lm_head"]
)
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
)
outputs = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
print(outputs)
```

But I am still getting an "Expected all tensors to be on the same device" error.
I notice that one of the layers I moved in my earlier snippets (see the workaround above) was still not on the same device as the inputs. The code ran successfully after I moved it as well. Please let me know if I made any mistakes in trying out your solution or if I should be raising this in the Accelerate repo instead. Thanks!
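For anyone debugging similar placement mismatches, a small helper (hypothetical, not from the thread) can dump the resolved device of every mapped module:

```python
def report_device_map(model):
    # model.hf_device_map maps module names to a GPU index, "cpu", or "disk";
    # sorting makes it easy to eyeball modules that differ from their neighbors.
    for name, device in sorted(model.hf_device_map.items()):
        print(f"{name}: {device}")

report_device_map(model)
```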
I am having the same issue. I installed `transformers` after the fix and I still get the error. Unfortunately, I only have 3 A100 40GB GPUs that I can use.

Expected result: "test" translated into French.
Actual result: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!`

Am I doing anything wrong?
The same issue here.
cc @SunMarc 😉
Hi, I found the issue. In the meantime, the hack is to have the input on the same device as ...
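The module reference was cut off above; assuming it is the encoder's embedding layer flagged in the `device_map` workaround earlier in the thread, the hack would look something like:

```python
# Assumed target: the encoder's embedding layer. hf_device_map keys can be
# coarser than full module paths, so fall back to the parent entries.
dm = model.hf_device_map
embed_device = dm.get(
    "model.encoder.embed_tokens", dm.get("model.encoder", dm.get("model", 0))
)
inputs = tokenizer(batched_input, return_tensors="pt", padding=True).to(embed_device)
```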
@SunMarc Could you please check my problem?
Hi @molokanov50. Please open a new issue, as this is not linked to this one, which was about encoder-decoder models in general, not specific to the NLLB model. Also, provide a minimal reproducible script so that I can try to reproduce the error on my side. For now, the following script works as expected:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-1.3B", device_map="auto"
)

inputs = 'We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.'
inputs = tokenizer(inputs, return_tensors="pt")
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"]
)
print(tokenizer.decode(translated_tokens[0], skip_special_tokens=True))
```
### System Info

`transformers` version: 4.28.1

### Who can help?

@ArthurZucker

### Information

### Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction
Note: there is a workaround/fix with manual device mapping attached below, but I'm wondering if there could be an official fix for the bug.

Code sample: `infer.py` (mostly from the HF Hub sample, with some modifications to load with multi-GPU and quantization)

Steps:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python infer.py
```

### Expected behavior

A list of translated text.

The following code contains a workaround that prevents certain module splits and moves certain modules to the same device as the input, in order to run the inference without errors.

Code:

(workaround script elided)

Output:

(output elided)