GPU memory usage differs from local. #426
Comments
How much does the usage differ?
Good question; there isn't a way as of now, but it should be a TODO for us. Currently we usually provide two context lengths for each model, 4k and 1k, and only Mistral uses sliding-window attention at the moment.
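A minimal sketch of picking the smaller prebuilt context-length variant on the Web-LLM side, assuming the current `@mlc-ai/web-llm` API (`CreateMLCEngine`) and that a "-1k"-suffixed model ID is published for the model you are using; the specific model ID below is illustrative, so check `prebuiltAppConfig` for the exact IDs available in your version:

```ts
import * as webllm from "@mlc-ai/web-llm";

// Sketch: load the 1k-context prebuilt variant, which allocates a smaller KV cache
// than the default 4k variant. The exact model ID is an assumption; verify it
// against the prebuilt model list shipped with your web-llm version.
async function main() {
  const modelId = "Llama-3-8B-Instruct-q4f16_1-MLC-1k"; // 1k context => smaller KV cache
  const engine = await webllm.CreateMLCEngine(modelId, {
    initProgressCallback: (p) => console.log(p.text),
  });
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```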
If you use MLC LLM, note that it defaults to "local" mode, which allocates a larger KV cache for concurrent access; you can change that via `--mode interactive`, which maps to batch size 1.
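To make the effect of batch size and context length on KV-cache memory concrete, here is a rough back-of-the-envelope estimate. The formula (2 tensors per layer, K and V) is standard; the model dimensions below are illustrative for a small Llama-style model and are assumptions, not the exact numbers MLC uses:

```ts
// Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
// * context_length * bytes_per_element * batch_size.
// The dimensions below are illustrative (assumed, not MLC's actual configs).
function kvCacheBytes(
  numLayers: number,
  numKvHeads: number,
  headDim: number,
  contextLen: number,
  bytesPerElem: number,
  batchSize: number,
): number {
  return 2 * numLayers * numKvHeads * headDim * contextLen * bytesPerElem * batchSize;
}

const layers = 26, kvHeads = 8, headDim = 128, fp16 = 2;

// Interactive-style setting: batch size 1, 4k context.
console.log(kvCacheBytes(layers, kvHeads, headDim, 4096, fp16, 1) / 2 ** 20, "MB");

// "Local"/server-style setting: room for, say, 4 concurrent sequences needs
// roughly 4x the KV-cache memory for the same model and context length.
console.log(kvCacheBytes(layers, kvHeads, headDim, 4096, fp16, 4) / 2 ** 20, "MB");
```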
Thank you for your answer!
I tried to compare a specific model (such as Llama 3B) between the Web-LLM and local (MLC-LLM) environments and found that, with the same parameters, i.e., without making any changes, the GPU memory usage differs. Could you explain the reason? Additionally, is there a way to obtain or modify the KV-cache settings of Web-LLM?
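One way to at least see the expected memory footprint of each prebuilt model on the Web-LLM side is to inspect the prebuilt app config. This is a sketch assuming `@mlc-ai/web-llm` exports `prebuiltAppConfig` and that its `model_list` entries expose `model_id` and `vram_required_MB`; verify the field names against your web-llm version:

```ts
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Sketch: list the prebuilt model variants and their expected VRAM footprint.
// Field names (model_id, vram_required_MB) are assumptions; check your version.
for (const m of prebuiltAppConfig.model_list) {
  console.log(`${m.model_id}: ~${m.vram_required_MB ?? "?"} MB VRAM`);
}
```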