GPU memory usage differs from local. #426

Closed · 137591 opened this issue May 26, 2024 · 6 comments

@137591 commented May 26, 2024

I tried to compare a specific model (such as Llama 3B) between Web-LLM and local (MLC-LLM) environments, and found that under the same parameters, i.e. without making any changes, the GPU memory usage differs. Please explain the reason. Additionally, is there a way to obtain or modify the KV-Cache settings of Web-LLM?

@CharlieFRuan (Contributor)

How much does the usage differ?

Additionally, is there a way to obtain or modify the KV-Cache settings of Web-LLM?

Good question; there isn't a way as of now, but it should be a TODO for us. Currently, we usually provide two context lengths for each model, 4k and 1k. And only Mistral uses sliding windows as of now.
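
A minimal sketch of how one might pick between those two prebuilt context-length variants from the web-llm npm package (the model IDs and the "-1k" suffix here are assumptions; check prebuiltAppConfig.model_list for what your web-llm version actually ships):

```typescript
// Sketch only: the model IDs below are assumptions. List the prebuilt IDs
// first, then pick the 4k or 1k (smaller KV cache) variant that exists.
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  // Print the prebuilt model IDs shipped with this web-llm version.
  console.log(webllm.prebuiltAppConfig.model_list.map((m) => m.model_id));

  // Hypothetical "-1k" variant: 1k context window, hence a smaller KV cache.
  const engine = await webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-1k");
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "what is the meaning of life?" }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```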

@137591 (Author) commented May 30, 2024

How much does the usage differ?

For example, Llama3-8B-q4f32-1 uses around 7800MB of VRAM natively and around 5600MB on the web, without changing any of the original example configurations. My input prompt is "what is the meaning of life?" I suspect the KV Cache settings are different, but I can't view VRAM usage details on the web. How can I check the KV Cache size on the web?
This is the data from launching natively (mlc-llm):
[screenshot: native mlc-llm launch output]
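
For scale, a back-of-envelope estimate (an assumption-laden sketch, not from this thread): with Llama-3-8B's 32 layers, 8 KV heads, head dimension 128, and an f32 KV cache as implied by the q4f32 quantization, each cached token costs about 0.25 MB, so a gap of a couple of GB is consistent with the native engine reserving a larger KV pool.

```typescript
// Rough KV-cache sizing (assumptions: Llama-3-8B, 32 layers, 8 KV heads,
// head dim 128, f32 KV cache as implied by the q4f32 quantization).
const bytesPerToken = 2 /* K and V */ * 32 * 8 * 128 * 4; // 262144 B ≈ 0.25 MB
const cacheGiB = (tokens: number) => (tokens * bytesPerToken) / 1024 ** 3;
console.log(cacheGiB(4096)); // 1  -> ~1 GiB for a 4k-token cache
console.log(cacheGiB(8192)); // 2  -> ~2 GiB, the order of the ~2.2 GB gap above
```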

@CharlieFRuan (Contributor)

I see; I'm guessing this is probably due to the KV cache size. For WebLLM, if you are using the web app, you can set Log Level to Debug in Settings, and the KV cache size will appear in the console log; here we have 2048 for TinyLlama:
[screenshot: browser console showing the KV cache size]
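
If you are embedding web-llm yourself instead of using the web app, a rough programmatic equivalent might look like the sketch below; the logLevel engine option and the model ID are assumptions to verify against your web-llm version.

```typescript
// Sketch only: assumes the engine config accepts a logLevel option that
// mirrors the web app's "Log Level" setting; verify against your version.
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const engine = await webllm.CreateMLCEngine(
    "Llama-3-8B-Instruct-q4f32_1", // hypothetical model ID
    { logLevel: "DEBUG" }          // debug logs should include KV cache sizing
  );
  // While the model loads, the browser console should print the KV cache /
  // context-window configuration the engine picked.
  void engine;
}

main();
```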

@137591 (Author) commented May 31, 2024

Got it! Thank you!

@tqchen (Contributor) commented May 31, 2024

If you use MLC LLM, note that it defaults to "local" mode, which sets a bigger KV cache for concurrent access; you can change that via --mode interactive, which will map to batch 1.
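
Concretely (the exact CLI shape depends on your MLC-LLM version, so treat this as an example), that means adding the flag to the native launch command, e.g. something like `mlc_llm serve <model> --mode interactive`; with the KV cache sized for a single sequence, the native VRAM number should come much closer to the WebLLM one.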

@137591 (Author) commented Jun 2, 2024

Thank you for your answer!

@tqchen closed this as completed Jun 3, 2024