Gibberish output with Llama-2-7b-chat-hf-q4f32_1
#356
Thanks for reporting the issue. This looks like an out-of-memory issue (f32 KV cache, and 13b params): llama-2-7b-q4f32_1 requires roughly 9 GB, while 13b-q4f16_1 requires roughly 10 GB. How much RAM does the Intel(R) Graphics (ADL GT2) have? Is it 16 GB? It might be a bit hard to catch the OOM error, as we've seen earlier in #209.
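For a rough sanity check on that 9 GB figure, here is a back-of-the-envelope estimate (my own arithmetic, assuming Llama-2-7b's 32 layers, 4096 hidden size, and a full 4096-token f32 KV cache; not WebLLM's internal accounting):

```ts
// Rough VRAM estimate for Llama-2-7b q4f32_1 (assumptions noted above).
const weightGiB = (7e9 * 0.5) / 2 ** 30;          // 4-bit weights ≈ 3.3 GiB
const kvPerTokenBytes = 2 * 4096 * 4 * 32;        // K+V, f32, 4096 dims, 32 layers ≈ 1 MiB
const kvGiB = (kvPerTokenBytes * 4096) / 2 ** 30; // full 4096-token cache ≈ 4 GiB
console.log(`~${(weightGiB + kvGiB).toFixed(1)} GiB + activations/workspace ≈ 9 GB total`);
```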
A similar VK_ERROR_OUT_OF_DEVICE_MEMORY issue was reported in mlc-llm: mlc-ai/mlc-llm#974
I think we should catch GPU out-of-memory errors as we tried previously in #209 (comment). FYI, I was not able to catch them with https://chromewebstore.google.com/detail/webgpu-dev-extension/gkeaijopdhfempknmaggbjbedicopjgm either, @greggman. EDIT: the reason is that the extension doesn't support workers.
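For reference, the standard WebGPU way to surface an allocation OOM explicitly (not WebLLM-specific; `allocateChecked` below is just an illustrative helper) is an out-of-memory error scope:

```ts
async function allocateChecked(device: GPUDevice, size: number): Promise<GPUBuffer | null> {
  device.pushErrorScope("out-of-memory");
  const buffer = device.createBuffer({ size, usage: GPUBufferUsage.STORAGE });
  // popErrorScope() resolves with the captured GPUError, or null on success.
  const error = await device.popErrorScope();
  if (error instanceof GPUOutOfMemoryError) {
    console.warn("Buffer allocation failed:", error.message);
    return null;
  }
  return buffer;
}
```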
FYI, https://webgpureport.org says the integrated GPU `memoryHeaps` is: [screenshot omitted]
@CharlieFRuan Are out-of-memory errors captured somewhere? In WebLLM or Apache TVM?
I know TVM can capture OOM for other backends (e.g. for Vulkan here). I'm not too sure what the case would be for WebGPU. I'll make another attempt this week; thanks for the pointers!
https://github.com/search?q=repo%3Aapache%2Ftvm+%22out-of-memory%22&type=code returns no results for me ;(
I think webllm would need its own mechanism; there are a few things to consider.
@CharlieFRuan Did you have a chance to have a look at this?
@beaufortfrancois I tried to catch the error but wasn't able to. So I instead added a 1024-context-length version of the model for llama-2-7b-q4f32_1, and made them the default choices in the demo page. This lowers the VRAM by ~3 GB for llama-2 q4f32. I also added a note about the [link truncated].
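(For what it's worth, that ~3 GB figure matches the back-of-the-envelope estimate above: at roughly 1 MiB of f32 KV cache per token, going from a 4096-token to a 1024-token context shrinks the cache from about 4 GiB to about 1 GiB.)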
According to the logs in #356 (comment), it looks like errors happen when validating entries in createBindGroup(), not after createBuffer(). Does that help? Did you try `uncapturederror` as well?

```js
device.onuncapturederror = ({ error }) => {
  console.log(error);
};
```
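(Worth noting: `uncapturederror` only fires for errors that no error scope captured, so it complements an explicit `pushErrorScope("out-of-memory")` like the sketch above rather than replacing it.)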
That's useful. Thanks!
(gentle ping)
@CharlieFRuan Did you have a chance to look at this?
Sorry for the delay, will take a look tonight.
Quick update: it does seem that the error can be caught! Not sure if I did something wrong earlier or whether there have been updates on the WebGPU side.

Since my laptop does not run into OOM for most models, I reproduced the error by setting `maxTotalSeqLen` to an arbitrarily large number. Upon finishing loading the model, the engine allocates the KVCache, and I see: [error log omitted]. This log corresponds to the [reference truncated]. Then, upon ignoring the error and starting to chat, we hit the uncaptured error you suggested: [error log omitted].

I will refine the handling and upstream the changes after verifying the errors can indeed be caught reliably. Should have another update by the end of this week. Thank you so much for the help!
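For anyone wanting to reproduce this, here is a hypothetical sketch using the current web-llm API (the `CreateMLCEngine` signature and the `context_window_size` override are as I understand them and may differ from the version discussed here; the mapping onto the internal `maxTotalSeqLen` is an assumption):

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Hypothetical repro: oversize the context window so the f32 KV cache
// allocation exceeds available VRAM.
const engine = await CreateMLCEngine(
  "Llama-2-7b-chat-hf-q4f32_1",
  undefined,                          // engine config: defaults
  { context_window_size: 1_000_000 }, // arbitrarily large, as in the comment above
);
```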
Prior to this PR, when users call `createEngine()` or `reload()` with a model that is too large for the device, the device would likely keep generating, ignoring the OOM issue and producing incorrect output. See #356 and #209. This PR catches such errors with `device.lost.then()`, relying on tvmjs to call `device.destroy()` upon detecting an error in `createBuffer()` via apache/tvm#17005. We have only observed `createBuffer()` errors and hence only process that kind of error for now. Besides, since most OOM errors occur in `reload()`, we make the error handling effectively synchronous despite using `.then()`, by throwing the error at the end of `reload()` if there is one.
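A minimal sketch of that pattern, with `deviceLostError` and `loadModelOnDevice` as illustrative placeholders rather than the actual WebLLM internals:

```ts
// Placeholder for whatever actually loads weights and allocates the KV cache.
declare function loadModelOnDevice(modelId: string): Promise<void>;

let deviceLostError: Error | undefined;

function watchDevice(device: GPUDevice): void {
  // tvmjs calls device.destroy() when createBuffer() fails, so the lost
  // promise resolving mid-reload most likely signals an OOM.
  device.lost.then((info) => {
    deviceLostError = new Error(`WebGPU device lost (${info.reason}): ${info.message}`);
  });
}

async function reload(modelId: string): Promise<void> {
  deviceLostError = undefined;
  await loadModelOnDevice(modelId);
  // Surface the asynchronous device-lost signal synchronously to the caller.
  if (deviceLostError) throw deviceLostError;
}
```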
OOM errors in `reload()` are now surfaced. Redeployed https://webllm.mlc.ai/ as well.
Closing this issue as completed; feel free to open new ones if problems persist!
Chrome Version: 125.0.6283.3
OS: ChromeOS
GPU: Intel(R) Graphics (ADL GT2) - Intel open-source Mesa driver: Mesa 23.3.0 (git-5cb3f1e4fa)
Dawn Backend: Vulkan
What steps will reproduce the problem?
1. Select the `Llama-2-7b-chat-hf-q4f32_1` model
2. Ask "What color is the dress?"
What is the expected result?
Some text that at least makes sense.
What happens instead?
Some gibberish text appears.
The DevTools JavaScript console contains the following logs:
[logs omitted]
Then I enter "What color is the dress?"
[logs omitted]
Note: it does work properly with the following f16 variants:
- `Llama-2-7b-chat-hf-q4f16_1`
- `Llama-2-7b-chat-hf-q4f16_1-1k`
I can reproduce with `Llama-2-13b-chat-hf-q4f16_1` as well.