
Complete gibberish produced by any and all models only when device_map="auto". #2692

FanaticPythoner opened this issue Apr 20, 2024 · 12 comments

Comments

@FanaticPythoner

FanaticPythoner commented Apr 20, 2024

System Info

- `Accelerate` version: 0.29.3
- Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/miniconda3/envs/codevalet_ai/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.56 GB
- GPU type: NVIDIA L40S
- `Accelerate` default config:
  Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

System specs:

  • 8 times L40S GPUs
  • Intel Xeon Gold 6558H

This:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|> to. Air1 (\nь\nInfo plit An che a\n the weьь Share Share\n aremobatar\n…We brain be S jj jj'..., … …: no J\n,…AL more of… y they code lifefl\n -- B moreand.. L\nplitahph a after\n Ishare, E I I is L\n unel not Mid' I'’ …\n\n …" you a a South strength I I S said "no\n\n\n E E11\n EASC not Sh English. of of E |isse\n as that said said of said reg of The The– n a… Open. The The for | A after After\n was M open open over in been\n\n into,onAR down :-)mad cos I you to E,( not "a001 that vis m44\n\n\n of3\n re1 T by so itack in inententancy of is int Library to U U.. a a = ==Compression Itdata66 as111110 S'}]
While this:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Prints:
[{'generated_text': '<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|>\nThere are multiple ways to sort a list in Python. One of the most common ways is to use the sort() method. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nmy_list.sort()\nprint(my_list)\n```\n\nThis will sort the list in place and print the sorted list.\n\nAnother way to sort a list is to use the sorted() function. This function returns a new sorted list and does not modify the original list. Here is an example:\n\n```\nmy_list = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]\nsorted_list = sorted(my_list)\nprint(sorted_list)\n```\n\nIn this example, the sorted_list variable will contain the sorted list and the my_list variable will remain unchanged.\n\nThere are also other sorting algorithms available in the built-in sort module, such as quicksort, heapsort, and merge sort. You'}]

Expected behavior

Both should print coherent text. This happens regardless of the model chosen. In the above reproduction steps, the model used is HuggingFaceH4/starchat-beta. The exact same thing happens with mistralai/Mixtral-8x7B-Instruct-v0.1, whether run in bfloat16, float16, or float32, and whether quantized or not. The issue also occurs regardless of the prompt.

The issue, however, does NOT occur when device_map="sequential" is set (tested with HuggingFaceH4/starchat-beta only).
Furthermore, the issue does NOT occur with device_map="auto" on my home 3x RTX 3090 / Threadripper 3960x setup.

However, I cannot use sequential in our current production environment without making significant changes.

@SunMarc
Member

SunMarc commented Apr 23, 2024

Hi @FanaticPythoner, thanks for the detailed report! It is indeed strange that "sequential" works while "auto" (which uses the "balanced" strategy) fails. Could you check what the output of model.hf_device_map is? As a workaround, you could also try allocating each layer to a specific device with a custom device_map, or setting the max_memory arg when using sequential.
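For reference, here is a minimal sketch of both options (not from this thread; the memory caps and split points below are only illustrative):

import torch
from transformers import pipeline

# Option 1: keep device_map="sequential" but cap per-GPU memory so the weights
# spread across several devices instead of all landing on the first one.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="sequential",
    model_kwargs={"max_memory": {0: "20GiB", 1: "20GiB", 2: "20GiB"}},
)

# Option 2: pin every module to a device explicitly with a custom device_map dict
# (keys follow the module names reported by model.hf_device_map).
custom_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": (0 if i < 20 else 1) for i in range(40)},
    "transformer.ln_f": 1,
    "lm_head": 0,
}
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map=custom_map,
)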

@FanaticPythoner
Author

FanaticPythoner commented Apr 23, 2024

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "transformer.wte": 0,
    "lm_head": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 0,
    "transformer.h.2": 0,
    "transformer.h.3": 0,
    "transformer.h.4": 0,
    "transformer.h.5": 0,
    "transformer.h.6": 0,
    "transformer.h.7": 0,
    "transformer.h.8": 0,
    "transformer.h.9": 0,
    "transformer.h.10": 0,
    "transformer.h.11": 0,
    "transformer.h.12": 0,
    "transformer.h.13": 1,
    "transformer.h.14": 1,
    "transformer.h.15": 1,
    "transformer.h.16": 1,
    "transformer.h.17": 1,
    "transformer.h.18": 1,
    "transformer.h.19": 1,
    "transformer.h.20": 1,
    "transformer.h.21": 1,
    "transformer.h.22": 1,
    "transformer.h.23": 1,
    "transformer.h.24": 1,
    "transformer.h.25": 1,
    "transformer.h.26": 1,
    "transformer.h.27": 1,
    "transformer.h.28": 2,
    "transformer.h.29": 2,
    "transformer.h.30": 2,
    "transformer.h.31": 2,
    "transformer.h.32": 2,
    "transformer.h.33": 2,
    "transformer.h.34": 2,
    "transformer.h.35": 2,
    "transformer.h.36": 2,
    "transformer.h.37": 2,
    "transformer.h.38": 2,
    "transformer.h.39": 2,
    "transformer.ln_f": 2
}

Furthermore, in our current codebase we have several different mechanisms that handle model balancing. Changing device_map="auto" to device_map="sequential" would be far more time-consuming for us than it would be in a small-scale project. My team and I would highly appreciate it if this issue could be treated as high priority, given that it breaks our entire system, and I'm sure we won't be the only ones to run into it.

@SunMarc
Member

SunMarc commented Apr 23, 2024

And what is the model.hf_device_map with device_map="sequential"?

@FanaticPythoner
Author

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "": 0
}

@FanaticPythoner
Author

Let me send you the same code for comparison, but using Mixtral instead of StarChat, since it is larger.

@SunMarc
Member

SunMarc commented Apr 23, 2024

Ah, that explains why it works: in the case of StarChat, the whole model fits on a single GPU with "sequential", so no cross-device communication is involved. Yes, let's check with Mixtral.
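(A quick way to confirm this, as a minimal sketch outside this thread and assuming pipe is the pipeline from the snippets above, is to compare the model's memory footprint to a single GPU's capacity.)

import torch

# Rough check: does the loaded model fit on a single GPU?
model_gb = pipe.model.get_memory_footprint() / 1e9   # parameters + buffers
gpu_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"model: {model_gb:.1f} GB, GPU 0: {gpu_gb:.1f} GB")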

@FanaticPythoner
Author

FanaticPythoner commented Apr 23, 2024

@SunMarc

For the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='auto')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 1,
    "model.layers.11": 1,
    "model.layers.12": 1,
    "model.layers.13": 1,
    "model.layers.14": 1,
    "model.layers.15": 1,
    "model.layers.16": 1,
    "model.layers.17": 1,
    "model.layers.18": 1,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 2,
    "model.layers.22": 2,
    "model.layers.23": 2,
    "model.layers.24": 2,
    "model.layers.25": 2,
    "model.layers.26": 2,
    "model.layers.27": 2,
    "model.layers.28": 2,
    "model.layers.29": 2,
    "model.layers.30": 2,
    "model.layers.31": 2,
    "model.norm": 2,
    "lm_head": 2
}

Now, for the code:

import torch
from transformers import pipeline
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
pipe = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map='sequential')
# We use a variant of ChatML to format each message
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query="How do I sort a list in Python?")
# We use a special <|end|> token with ID 49155 to denote ends of a turn
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
print(outputs)

Doing:

import json
print(json.dumps(pipe.model.hf_device_map, indent=4))

Prints:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 1,
    "model.layers.17": 1,
    "model.layers.18": 1,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 1,
    "model.layers.22": 1,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.norm": 1,
    "lm_head": 1
}

I also looked at the resulting outputs. With "auto", generation doesn't even finish; it throws a nan/inf error. With "sequential", it behaves as expected, i.e., it answers correctly.

@FanaticPythoner
Author

Yes, the above code uses the StarChat prompt template... it still works.

@SunMarc
Member

SunMarc commented Apr 24, 2024

It is probably a communication issue between your GPUs. I see that with "sequential", only two GPUs are used. One quick way to work around this would be to run the model on only the first two GPUs by setting CUDA_VISIBLE_DEVICES=0,1 (see the sketch below). You could also try to check at which layer the generation starts to output gibberish.
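Something like this (a minimal sketch; the prompt is just a placeholder):

import os
# Restrict the process to the first two GPUs; safest to set this before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Check how the layers are split across the two visible GPUs, then generate.
print(pipe.model.hf_device_map)
print(pipe("How do I sort a list in Python?", max_new_tokens=64))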

@FanaticPythoner
Author

FanaticPythoner commented Apr 25, 2024

The hardware and drivers have been triple-checked by the bare-metal provider. On my 3x RTX 3090 setup I don't use NVLink; maybe that's the key, or maybe it's something else.

As an update, I tested both "sequential" and "auto" with Llama 3 70B in bfloat16. Both are unable to run inference and throw:

Exception has occurred: RuntimeError
probability tensor contains either `inf`, `nan` or element < 0
  File "/root/hwsrc/project_name/main.py", line 16, in <module>
    output = pipe("Hey how are you doing today?")
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Here are the device maps and the code that was used.

1. With device_map="sequential", the code:
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"

pipe = transformers.pipeline("text-generation",
                                 model=model_id,
                                 model_kwargs={
                                     "torch_dtype": torch.bfloat16,
                                     "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                                 },
                                 device_map="sequential")

print(json.dumps(pipe.model.hf_device_map, indent=4))

output = pipe("Hey how are you doing today?")

print(output)

outputs:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 0,
    "model.layers.17": 0,
    "model.layers.18": 0,
    "model.layers.19": 0,
    "model.layers.20": 0,
    "model.layers.21": 0,
    "model.layers.22": 0,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.layers.32": 1,
    "model.layers.33": 1,
    "model.layers.34": 1,
    "model.layers.35": 1,
    "model.layers.36": 1,
    "model.layers.37": 1,
    "model.layers.38": 1,
    "model.layers.39": 1,
    "model.layers.40": 1,
    "model.layers.41": 1,
    "model.layers.42": 1,
    "model.layers.43": 1,
    "model.layers.44": 1,
    "model.layers.45": 1,
    "model.layers.46": 1,
    "model.layers.47": 1,
    "model.layers.48": 1,
    "model.layers.49": 2,
    "model.layers.50": 2,
    "model.layers.51": 2,
    "model.layers.52": 2,
    "model.layers.53": 2,
    "model.layers.54": 2,
    "model.layers.55": 2,
    "model.layers.56": 2,
    "model.layers.57": 2,
    "model.layers.58": 2,
    "model.layers.59": 2,
    "model.layers.60": 2,
    "model.layers.61": 2,
    "model.layers.62": 2,
    "model.layers.63": 2,
    "model.layers.64": 2,
    "model.layers.65": 2,
    "model.layers.66": 2,
    "model.layers.67": 2,
    "model.layers.68": 2,
    "model.layers.69": 2,
    "model.layers.70": 2,
    "model.layers.71": 2,
    "model.layers.72": 2,
    "model.layers.73": 2,
    "model.layers.74": 2,
    "model.layers.75": 3,
    "model.layers.76": 3,
    "model.layers.77": 3,
    "model.layers.78": 3,
    "model.layers.79": 3,
    "model.norm": 3,
    "lm_head": 3
}
2. And with device_map="auto", the code:
import transformers
import torch
import json
model_id = "meta-llama/Meta-Llama-3-70B"

pipe = transformers.pipeline("text-generation",
                                 model=model_id,
                                 model_kwargs={
                                     "torch_dtype": torch.bfloat16,
                                     "max_memory": {0: "42GiB", 1: "42GiB", 2: "42GiB", 3: "42GiB"}
                                 },
                                 device_map="auto")

print(json.dumps(pipe.model.hf_device_map, indent=4))

output = pipe("Hey how are you doing today?")

print(output)

outputs:

{
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 0,
    "model.layers.9": 0,
    "model.layers.10": 0,
    "model.layers.11": 0,
    "model.layers.12": 0,
    "model.layers.13": 0,
    "model.layers.14": 0,
    "model.layers.15": 0,
    "model.layers.16": 0,
    "model.layers.17": 0,
    "model.layers.18": 0,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 1,
    "model.layers.22": 1,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 1,
    "model.layers.27": 1,
    "model.layers.28": 1,
    "model.layers.29": 1,
    "model.layers.30": 1,
    "model.layers.31": 1,
    "model.layers.32": 1,
    "model.layers.33": 1,
    "model.layers.34": 1,
    "model.layers.35": 1,
    "model.layers.36": 1,
    "model.layers.37": 1,
    "model.layers.38": 1,
    "model.layers.39": 1,
    "model.layers.40": 2,
    "model.layers.41": 2,
    "model.layers.42": 2,
    "model.layers.43": 2,
    "model.layers.44": 2,
    "model.layers.45": 2,
    "model.layers.46": 2,
    "model.layers.47": 2,
    "model.layers.48": 2,
    "model.layers.49": 2,
    "model.layers.50": 2,
    "model.layers.51": 2,
    "model.layers.52": 2,
    "model.layers.53": 2,
    "model.layers.54": 2,
    "model.layers.55": 2,
    "model.layers.56": 2,
    "model.layers.57": 2,
    "model.layers.58": 2,
    "model.layers.59": 2,
    "model.layers.60": 2,
    "model.layers.61": 3,
    "model.layers.62": 3,
    "model.layers.63": 3,
    "model.layers.64": 3,
    "model.layers.65": 3,
    "model.layers.66": 3,
    "model.layers.67": 3,
    "model.layers.68": 3,
    "model.layers.69": 3,
    "model.layers.70": 3,
    "model.layers.71": 3,
    "model.layers.72": 3,
    "model.layers.73": 3,
    "model.layers.74": 3,
    "model.layers.75": 3,
    "model.layers.76": 3,
    "model.layers.77": 3,
    "model.layers.78": 3,
    "model.layers.79": 3,
    "model.norm": 3,
    "lm_head": 3
}

@SunMarc

@FanaticPythoner
Author

Would anyone care to take a look at this, please? Whether it's @SunMarc or someone else? I strongly suspect it's an HF compatibility issue with NVLink, but I can't say with 100% certainty.
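In case it helps whoever picks this up, here is a minimal sketch (my own, not an official Accelerate check) that copies a tensor between every pair of GPUs and verifies the data survives; nvidia-smi topo -m also shows whether the GPUs are linked via NVLink or PCIe:

import torch

# For every ordered pair of GPUs, report whether peer access is available and
# whether a device-to-device copy preserves the data.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        p2p = torch.cuda.can_device_access_peer(src, dst)
        x = torch.randn(1024, 1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        ok = torch.allclose(x.cpu(), y.cpu())
        print(f"GPU {src} -> GPU {dst}: peer_access={p2p}, copy_ok={ok}")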

@sjsmith88

We would really appreciate any help on this roadblock. Many thanks!
