
Complete gibberish produced by any and all models only when device_map="auto". #2692

FanaticPythoner opened this issue Apr 20, 2024 · 12 comments

Comments

@FanaticPythoner

FanaticPythoner commented Apr 20, 2024

System Info

- `Accelerate` version: 0.29.3
- Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/miniconda3/envs/codevalet_ai/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.56 GB
- GPU type: NVIDIA L40S
- `Accelerate` default config:
  Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

System specs:

  • 8 times L40S GPUs
  • Intel Xeon Gold 6558H

This:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|> to. Air1 (\nь\nInfo plit An che a\n the weьь Share Share\n aremobatar\n…We brain be S jj jj'..., … …: no J\n,…AL more of… y they code lifefl\n -- B moreand.. L\nplitahph a after\n Ishare, E I I is L\n unel not Mid' I'’ …\n\n …" you a a South strength I I S said "no\n\n\n E E11\n EASC not Sh English. of of E |isse\n as that said said of said reg of The The– n a… Open. The The for | A after After\n was M open open over in been\n\n into,onAR down :-)mad cos I you to E,( not "a001 that vis m44\n\n\n of3\n re1 T by so itack in inententancy of is int Library to U U.. a a = ==Compression Itdata66 as111110 S'}]
While this:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|>\nThere are multiple ways to sort a list in Python. One of the most common ways is to use the sort() method. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nmy_list.sort()\nprint(my_list)\n```\n\nThis will sort the list in place and print the sorted list.\n\nAnother way to sort a list is to use the sorted() function. This function returns a new sorted list and does not modify the original list. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nsorted_list = sorted(my_list)\nprint(sorted_list)\n```\n\nIn this example, the sorted_list variable will contain the sorted list and the my_list variable will remain unchanged.\n\nThere are also other sorting algorithms available in the built-in sort module, such as quicksort, heapsort, and merge sort. You'}]

Expected behavior

Both should print coherent text. This happens regardless of the model chosen. In the above reproduction steps, the model used is HuggingFaceH4/starchat-beta. The exact same thing happens with mistralai/Mixtral-8x7B-Instruct-v0.1, whether run in bfloat16, float16, or float32, and whether quantized or not. The issue also occurs regardless of the prompt.

The issue, however, does NOT occur when device_map="sequential" is set (tested with HuggingFaceH4/starchat-beta only).
Furthermore, the issue does NOT occur with device_map="auto" on my home 3x RTX 3090 / Threadripper 3960x setup.

However, I cannot use sequential in our current production environment without making significant changes.

@SunMarc
Member

SunMarc commented Apr 23, 2024

Hi @FanaticPythoner, thanks for the detailed report! It is indeed strange that "sequential" works while "auto" (which uses the "balanced" strategy) fails. Could you check what the output of model.hf_device_map is? As a workaround, you could also try allocating each layer to a specific device with a custom device_map, or setting the max_memory arg when using sequential.
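For reference, here is a minimal sketch of both options (not from this thread; the memory caps and split points below are only illustrative):

import torch
from transformers import pipeline

# Option 1: keep device_map="sequential" but cap per-GPU memory so the weights
# spread across several devices instead of all landing on the first one.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="sequential",
    model_kwargs={"max_memory": {0: "20GiB", 1: "20GiB", 2: "20GiB"}},
)

# Option 2: pin every module to a device explicitly with a custom device_map dict
# (keys follow the module names reported by model.hf_device_map).
custom_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": (0 if i < 20 else 1) for i in range(40)},
    "transformer.ln_f": 1,
    "lm_head": 0,
}
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map=custom_map,
)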

@FanaticPythoner
Author

FanaticPythoner commented Apr 23, 2024

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "transformer.wte": 0,
    "lm_head": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 0,
    "transformer.h.2": 0,
    "transformer.h.3": 0,
    "transformer.h.4": 0,
    "transformer.h.5": 0,
    "transformer.h.6": 0,
    "transformer.h.7": 0,
    "transformer.h.8": 0,
    "transformer.h.9": 0,
    "transformer.h.10": 0,
    "transformer.h.11": 0,
    "transformer.h.12": 0,
    "transformer.h.13": 1,
    "transformer.h.14": 1,
    "transformer.h.15": 1,
    "transformer.h.16": 1,
    "transformer.h.17": 1,
    "transformer.h.18": 1,
    "transformer.h.19": 1,
    "transformer.h.20": 1,
    "transformer.h.21": 1,
    "transformer.h.22": 1,
    "transformer.h.23": 1,
    "transformer.h.24": 1,
    "transformer.h.25": 1,
    "transformer.h.26": 1,
    "transformer.h.27": 1,
    "transformer.h.28": 2,
    "transformer.h.29": 2,
    "transformer.h.30": 2,
    "transformer.h.31": 2,
    "transformer.h.32": 2,
    "transformer.h.33": 2,
    "transformer.h.34": 2,
    "transformer.h.35": 2,
    "transformer.h.36": 2,
    "transformer.h.37": 2,
    "transformer.h.38": 2,
    "transformer.h.39": 2,
    "transformer.ln_f": 2
}

Furthermore, in our current codebase we have several different mechanisms that handle model balancing. Changing device_map="auto" to device_map="sequential" would be far more time-consuming for us than it would be in a small-scale project. My team and I would highly appreciate it if this issue could be treated as high priority, given that it breaks our entire system, and I'm sure we won't be the only ones to run into it.

@SunMarc
Member

SunMarc commented Apr 23, 2024

And what is the model.hf_device_map with device_map="sequential"?

@FanaticPythoner
Author

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "": 0
}

@FanaticPythoner
Author

Let me send you the same code for comparison, but using Mixtral instead of StarChat, since it is larger.

@SunMarc
Member

SunMarc commented Apr 23, 2024

Ah, that explains why it works: in the case of StarChat, the whole model fits on a single GPU with "sequential", so no cross-device communication is involved. Yes, let's check with Mixtral.
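(A quick way to confirm this, as a minimal sketch outside this thread and assuming pipe is the pipeline from the snippets above, is to compare the model's memory footprint to a single GPU's capacity.)

import torch

# Rough check: does the loaded model fit on a single GPU?
model_gb = pipe.model.get_memory_footprint() / 1e9   # parameters + buffers
gpu_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"model: {model_gb:.1f} GB, GPU 0: {gpu_gb:.1f} GB")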

@FanaticPythoner
Author

FanaticPythoner commented Apr 23, 2024

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 1,
    "model.layers.11": 1,
    "model.layers.12": 1,
    "model.layers.13": 1,
    "model.layers.14": 1,
    "model.layers.15": 1,
    "model.layers.16": 1,
    "model.layers.17": 1,
    "model.layers.18": 1,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 2,
    "model.layers.22": 2,
    "model.layers.23": 2,
    "model.layers.24": 2,
    "model.layers.25": 2,
    "model.layers.26": 2,
    "model.layers.27": 2,
    "model.layers.28": 2,
    "model.layers.29": 2,
    "model.layers.30": 2,
    "model.layers.31": 2,
    "model.norm": 2,
    "lm_head": 2
}

Now, for the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 1,
    "model.layers.17": 1,
    "model.layers.18": 1,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 1,
    "model.layers.22": 1,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.norm": 1,
    "lm_head": 1
}

I also looked at the resulting outputs. With "auto", generation doesn't even finish; it throws a nan/inf error. With "sequential", it behaves as expected, i.e., it answers correctly.

@FanaticPythoner
Author

Yes, the above code uses the StarChat prompt template... it still works.

@SunMarc
Member

SunMarc commented Apr 24, 2024

It is probably a communication issue between your GPUs. I see that with "sequential", only two GPUs are used. One quick way to work around this would be to run the model on only the first two GPUs by setting CUDA_VISIBLE_DEVICES=0,1 (see the sketch below). You could also try to check at which layer the generation starts to output gibberish.
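Something like this (a minimal sketch; the prompt is just a placeholder):

import os
# Restrict the process to the first two GPUs; safest to set this before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Check how the layers are split across the two visible GPUs, then generate.
print(pipe.model.hf_device_map)
print(pipe("How do I sort a list in Python?", max_new_tokens=64))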

@FanaticPythoner
Author

FanaticPythoner commented Apr 25, 2024

The hardware and drivers have been triple-checked by the bare-metal provider. On my 3x RTX 3090 setup I don't use NVLink; maybe that's the key, or maybe it's something else.

As an update, I tested both "sequential" and "auto" with Llama 3 70B in bfloat16. Both are unable to run inference and throw:

Exception has occurred: RuntimeError
probability tensor contains either `inf`, `nan` or element < 0
  File "/root/hwsrc/project_name/main.py", line 16, in <module>
    output = pipe("Hey how are you doing today?")
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Here are the device maps and the code that was used.

1. With device_map="sequential", the code:
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"

pipe = transformers.pipeline("text-generation",
                                 model=model_id,
                                 model_kwargs={
                                     "torch_dtype": torch.bfloat16,
                                     "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                                 },
                                 device_map="sequential")

print(json.dumps(pipe.model.hf_device_map, indent=4))

output = pipe("Hey how are you doing today?")

print(output)

outputs:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 0,
    "model.layers.17": 0,
    "model.layers.18": 0,
    "model.layers.19": 0,
    "model.layers.20": 0,
    "model.layers.21": 0,
    "model.layers.22": 0,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.layers.32": 1,
    "model.layers.33": 1,
    "model.layers.34": 1,
    "model.layers.35": 1,
    "model.layers.36": 1,
    "model.layers.37": 1,
    "model.layers.38": 1,
    "model.layers.39": 1,
    "model.layers.40": 1,
    "model.layers.41": 1,
    "model.layers.42": 1,
    "model.layers.43": 1,
    "model.layers.44": 1,
    "model.layers.45": 1,
    "model.layers.46": 1,
    "model.layers.47": 1,
    "model.layers.48": 1,
    "model.layers.49": 2,
    "model.layers.50": 2,
    "model.layers.51": 2,
    "model.layers.52": 2,
    "model.layers.53": 2,
    "model.layers.54": 2,
    "model.layers.55": 2,
    "model.layers.56": 2,
    "model.layers.57": 2,
    "model.layers.58": 2,
    "model.layers.59": 2,
    "model.layers.60": 2,
    "model.layers.61": 2,
    "model.layers.62": 2,
    "model.layers.63": 2,
    "model.layers.64": 2,
    "model.layers.65": 2,
    "model.layers.66": 2,
    "model.layers.67": 2,
    "model.layers.68": 2,
    "model.layers.69": 2,
    "model.layers.70": 2,
    "model.layers.71": 2,
    "model.layers.72": 2,
    "model.layers.73": 2,
    "model.layers.74": 2,
    "model.layers.75": 3,
    "model.layers.76": 3,
    "model.layers.77": 3,
    "model.layers.78": 3,
    "model.layers.79": 3,
    "model.norm": 3,
    "lm_head": 3
}
2. And with device_map="auto", the code:
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"

pipe = transformers.pipeline("text-generation",
                                 model=model_id,
                                 model_kwargs={
                                     "torch_dtype": torch.bfloat16,
                                     "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                                 },
                                 device_map="auto")

print(json.dumps(pipe.model.hf_device_map, indent=4))

output = pipe("Hey how are you doing today?")

print(output)

outputs:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 0,
    "model.layers.17": 0,
    "model.layers.18": 0,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 1,
    "model.layers.22": 1,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.layers.32": 1,
    "model.layers.33": 1,
    "model.layers.34": 1,
    "model.layers.35": 1,
    "model.layers.36": 1,
    "model.layers.37": 1,
    "model.layers.38": 1,
    "model.layers.39": 1,
    "model.layers.40": 2,
    "model.layers.41": 2,
    "model.layers.42": 2,
    "model.layers.43": 2,
    "model.layers.44": 2,
    "model.layers.45": 2,
    "model.layers.46": 2,
    "model.layers.47": 2,
    "model.layers.48": 2,
    "model.layers.49": 2,
    "model.layers.50": 2,
    "model.layers.51": 2,
    "model.layers.52": 2,
    "model.layers.53": 2,
    "model.layers.54": 2,
    "model.layers.55": 2,
    "model.layers.56": 2,
    "model.layers.57": 2,
    "model.layers.58": 2,
    "model.layers.59": 2,
    "model.layers.60": 2,
    "model.layers.61": 3,
    "model.layers.62": 3,
    "model.layers.63": 3,
    "model.layers.64": 3,
    "model.layers.65": 3,
    "model.layers.66": 3,
    "model.layers.67": 3,
    "model.layers.68": 3,
    "model.layers.69": 3,
    "model.layers.70": 3,
    "model.layers.71": 3,
    "model.layers.72": 3,
    "model.layers.73": 3,
    "model.layers.74": 3,
    "model.layers.75": 3,
    "model.layers.76": 3,
    "model.layers.77": 3,
    "model.layers.78": 3,
    "model.layers.79": 3,
    "model.norm": 3,
    "lm_head": 3
}

@SunMarc

@FanaticPythoner
Author

Would anyone care to take a look at this, please? Whether it's @SunMarc or someone else? I strongly suspect it's an HF compatibility issue with NVLink, but I can't say with 100% certainty.
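In case it helps whoever picks this up, here is a minimal sketch (my own, not an official Accelerate check) that copies a tensor between every pair of GPUs and verifies the data survives; nvidia-smi topo -m also shows whether the GPUs are linked via NVLink or PCIe:

import torch

# For every ordered pair of GPUs, report whether peer access is available and
# whether a device-to-device copy preserves the data.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        p2p = torch.cuda.can_device_access_peer(src, dst)
        x = torch.randn(1024, 1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        ok = torch.allclose(x.cpu(), y.cpu())
        print(f"GPU {src} -> GPU {dst}: peer_access={p2p}, copy_ok={ok}")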

@sjsmith88

We would really appreciate any help on this roadblock. Many thanks!
