llava-next demo / tutorial code does not work #30294
Comments
Can this be related to the device "mps"? I am not able to reproduce on "cuda"
happens on CPU too, and other people on the Community page of the 34b model have reported the same issue
@bghira I'm unable to reproduce on CPU. Can you show the MRE you have for CPU? I.e. a script/gist that I can run top to bottom that reproduces what you're seeing. On MPS I'm able to reproduce, and I've verified there's a bug (in PyTorch/MPS). I have a way to mitigate the issue, but it's suboptimal to merge this approach in transformers since that specific bug only affects MPS. Please let me know if you want the transformers workaround while I work on getting this fixed in PyTorch.
@hvaara Thanks for looking into this! If you have a workaround, it would be great if you could share it. As you say, it doesn't make sense to merge it on the transformers side, but it's potentially very useful for the community whilst the fix is still to be merged into PyTorch.
hmm, i can't reproduce it on CPU, but it also never really returns anything now 😓 i might have seen this error on a different version when I saw it there. maybe now it's simply working, computing the result.
is there a link to the upstream bug?
Sorry, I completely forgot about this. I'll get both the patch to upstream and a workaround for transformers out ASAP (aiming for within a few hours).
The code changes for a workaround in transformers can be seen in hvaara@ed2f0df. I've created a gist to show how to run the model with these changes patched in. Note that

```python
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```

is needed because the model hits ops that don't have MPS implementations yet and have to fall back to CPU.

To answer your original question regarding a link to a bug in upstream, there isn't one yet (AFAIK). I think I'll skip the ceremony of creating a bug and jump straight to a pull request, since I already have the code fix and regression test ready. I'll provide an update with a link to the PR when I've done that. Please feel free to nudge me again if there isn't an update in a timely manner ;)
@pcuenca That's great! Thanks for letting us know 😄 pytorch/pytorch#96614 describes the problem in MPS/PyTorch seen in this issue.
PyTorch PR: pytorch/pytorch#125318
pytorch/pytorch#125318 has been merged. I don't know exactly when a new release will be cut, but I'd expect it to be available in the nightlies in <24h. Assuming it's not reverted of course ;)
Nice! 🙌
I guess this issue can be closed? :)
yes, this error is fixed but the model still just doesn't work on MPS... big sad
For the 34B version? Does the 7b version work as expected? What's the failure mode? How do you provoke it? What do you expect to see?
yes, the 34b on 128G M3 Max. it just gets stuck and never returns. i left it for >12 hours overnight. how long is it expected to take? i do not have the network bandwidth to download the 7B as well yet, i never focused on it because it performed very poorly in my group's initial tests. it is very hamstrung by the 7B LLM.
What's the code/prompt you used? Can't guarantee I know how to fix it, but I want to try to replicate what you're seeing. I don't know about speed, but I'd expect it to be measured in iterations per second, not iterations per day 😅
used the same demo code as initially posted.
I tested on an M3 Max 128 GB. The problem you're seeing seems to be due to a resource utilization issue (see https://gist.github.com/hvaara/f8f911c6e271cae8aa0e4811b24d4eb5). I suspect another issue with MPS in PyTorch (a memory leak in the MPS cache). I'll try to give another update with a workaround in a couple of hours. Nudge me if I forget.
oh 🤔 can we disable the mps cache? i assume leaked objects means we can't even sync and force gc to clear them
Yup. The proposed workaround was doing

```diff
 for chunk in tqdm(streamer, total=max_new_tokens+2):
     text.append(chunk)
     print(chunk)
+    torch.mps.empty_cache()
     print(f"{torch.mps.driver_allocated_memory()/1024**3 = }")
     print(f"{torch.mps.current_allocated_memory()/1024**3 = }")
     print(f"{len(text)}")
```

I'll know if this worked very soon 😄
It worked: https://gist.github.com/hvaara/9a7c361478ad8aa429d8172b3119a657. Are you able to reproduce on your end?
i can't really run the current pytorch nightly, so i'm unable to test this anymore. the error I receive is now inside pytorch, some hardcoded cuda codepath. current pytorch 2.3 is back to the:
current pytorch 2.4:
when looking through the pytorch codebase for this error, it's in some rng function that seems to not be called from anywhere :D
ok, by updating that method and adding a helper:

```python
@run_and_save_rng_state.py_impl(DispatchKey.MPS)
def impl_mps(op, *args, **kwargs):
    return torch.mps.get_rng_state(), op(*args, **kwargs)


@run_and_save_rng_state.py_impl(DispatchKey.BackendSelect)
def impl_backend_select(op, *args, **kwargs):
    impl_map = {"cuda": impl_cuda, "cpu": impl_cpu, "mps": impl_mps}
    device = get_device(args, kwargs)
    assert device in impl_map, f"Backend not supported for {device}"
    impl = impl_map[device]
    return impl(op, *args, **kwargs)
```

i can get past the errors on pytorch nightly:
seems marginal but it's hovering around that value
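As a side note, here is a tiny sanity check (not from the thread) of the property the `impl_mps` helper above relies on: that the MPS RNG state can be captured and restored to reproduce the same random draws.

```python
import torch

assert torch.backends.mps.is_available()

state = torch.mps.get_rng_state()     # snapshot the MPS generator state
a = torch.randn(4, device="mps")      # first draw

torch.mps.set_rng_state(state)        # rewind to the snapshot
b = torch.randn(4, device="mps")      # should reproduce the first draw

print(torch.equal(a.cpu(), b.cpu()))  # expected: True
```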
The reason it works on PyTorch nightly is probably because of pytorch/pytorch#125318. If you can't use the nightlies, does it work if you apply hvaara@ed2f0df as a workaround in transformers (with PyTorch 2.3)? From your output it looks like it's working?
it appears equivalent to using pytorch 2.4 with my additional fix for the rng impl
Great! Is everything working as you'd expect on your end now? (Although it's currently held together with string and duct tape 😅)
it appears this is the best we can do on mps for now, though it's so slow i've never seen the example finish :( i'll have to try downloading the smaller model at some point.
Unfortunately the 34B model is quite slow on this hardware. Quantization and efficient alternative attention mechanisms with MPS support are still limited. This will eventually change. The 7B model is much faster. FWIW I've allocated resources to find and hopefully mitigate the second batch of issues we saw (the MPS cache).
yes, MPS unfortunately seems to receive less love than MLX does?
Short update: @skotapati (et al?) has found a solution to a memory leak in pooling (see pytorch/pytorch#125217) 🎉 The fix will be in an upcoming macOS release. I'm hopeful this will resolve the bulk of the memory congestion we saw in LLaVA-NeXT 34B. I had nothing to do with the fix, just wanted to share the good news 😃 /cc @bghira
System Info
transformers version: 4.40.0.dev0
- distributed_type: NO
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@amyeroberts @arth
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
There is apparently some confusion on the other open issue about this error message, but it seems to be a distinctly different problem.
I can't use the Git version of Transformers or the latest release of Tokenizers, because this library (for SOME reason) needs an older release of Tokenizers.
Expected behavior
The demo should work.