GPU memory usage differs from local. #426

Closed · 137591 opened this issue May 26, 2024 · 6 comments

@137591 commented May 26, 2024

I tried to compare a specific model (such as Llama 3B) between Web-LLM and local (MLC-LLM) environments, and found that under the same parameters, i.e. without making any changes, the GPU memory usage differs. Please explain the reason. Additionally, is there a way to obtain or modify the KV-Cache settings of Web-LLM?

@CharlieFRuan (Contributor)

How much does the usage differ?

Additionally, is there a way to obtain or modify the KV-Cache settings of Web-LLM?

Good question; there isn't a way as of now, but it should be a TODO for us. Currently, we usually provide two context lengths for each model, 4k and 1k. And only Mistral uses sliding windows as of now.
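
A minimal sketch of how one might pick between those two prebuilt context-length variants from the web-llm npm package (the model IDs and the "-1k" suffix here are assumptions; check prebuiltAppConfig.model_list for what your web-llm version actually ships):

```typescript
// Sketch only: the model IDs below are assumptions. List the prebuilt IDs
// first, then pick the 4k or 1k (smaller KV cache) variant that exists.
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  // Print the prebuilt model IDs shipped with this web-llm version.
  console.log(webllm.prebuiltAppConfig.model_list.map((m) => m.model_id));

  // Hypothetical "-1k" variant: 1k context window, hence a smaller KV cache.
  const engine = await webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-1k");
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "what is the meaning of life?" }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```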

@137591 (Author) commented May 30, 2024

How much does the usage differ?

For example, Llama3-8B-q4f32-1 uses around 7800MB of VRAM natively and around 5600MB on the web, without changing any of the original example configurations. My input prompt is "what is the meaning of life?" I suspect the KV Cache settings are different, but I can't view VRAM usage details on the web. How can I check the KV Cache size on the web?
This is the data from launching natively (mlc-llm):
[screenshot: native mlc-llm launch output]
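
For scale, a back-of-envelope estimate (an assumption-laden sketch, not from this thread): with Llama-3-8B's 32 layers, 8 KV heads, head dimension 128, and an f32 KV cache as implied by the q4f32 quantization, each cached token costs about 0.25 MB, so a gap of a couple of GB is consistent with the native engine reserving a larger KV pool.

```typescript
// Rough KV-cache sizing (assumptions: Llama-3-8B, 32 layers, 8 KV heads,
// head dim 128, f32 KV cache as implied by the q4f32 quantization).
const bytesPerToken = 2 /* K and V */ * 32 * 8 * 128 * 4; // 262144 B ≈ 0.25 MB
const cacheGiB = (tokens: number) => (tokens * bytesPerToken) / 1024 ** 3;
console.log(cacheGiB(4096)); // 1  -> ~1 GiB for a 4k-token cache
console.log(cacheGiB(8192)); // 2  -> ~2 GiB, the order of the ~2.2 GB gap above
```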

@CharlieFRuan (Contributor)

I see; I'm guessing this is probably due to the KV cache size. For WebLLM, if you are using the web app, you can set Log Level to Debug in Settings, and the KV cache size will appear in the console log; here we have 2048 for TinyLlama:
[screenshot: browser console showing the KV cache size]
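
If you are embedding web-llm yourself instead of using the web app, a rough programmatic equivalent might look like the sketch below; the logLevel engine option and the model ID are assumptions to verify against your web-llm version.

```typescript
// Sketch only: assumes the engine config accepts a logLevel option that
// mirrors the web app's "Log Level" setting; verify against your version.
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const engine = await webllm.CreateMLCEngine(
    "Llama-3-8B-Instruct-q4f32_1", // hypothetical model ID
    { logLevel: "DEBUG" }          // debug logs should include KV cache sizing
  );
  // While the model loads, the browser console should print the KV cache /
  // context-window configuration the engine picked.
  void engine;
}

main();
```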

@137591 (Author) commented May 31, 2024

Got it! Thank you!

@tqchen (Contributor) commented May 31, 2024

If you use MLC LLM, note that it defaults to "local" mode, which sets a bigger KV cache for concurrent access; you can change that via --mode interactive, which will map to batch 1.
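
Concretely (the exact CLI shape depends on your MLC-LLM version, so treat this as an example), that means adding the flag to the native launch command, e.g. something like `mlc_llm serve <model> --mode interactive`; with the KV cache sized for a single sequence, the native VRAM number should come much closer to the WebLLM one.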

@137591 (Author) commented Jun 2, 2024

Thank you for your answer!

@tqchen closed this as completed Jun 3, 2024