mixtral:8x7b-instruct-v0.1-fp16 served on Ollama performs worse than the same model served on vllm with same configuration #3349
Comments
hi @yilei-ding, which OS are you running Ollama on, and with how much RAM? Can you please share a prompt or a script to run several prompts, so we can replicate the issue?
Hi @yilei-ding, have you tried version 0.1.31? Could you please share your RAM, CPU, OS and a script so we can try to reproduce the issue? With no further news and no other users reporting the same problem, this issue may be closed.
Could you share your vLLM configuration and command line?
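For reference, a minimal sketch of the kind of vLLM setup being asked about, using vLLM's offline Python API; the model name, `tensor_parallel_size`, prompt, and token limit here are illustrative assumptions, not values taken from this thread:

```python
# Sketch only: an unquantized Mixtral served with vLLM, greedy decoding.
# Model name, tensor_parallel_size and max_tokens are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # unquantized fp16/bf16 weights
    tensor_parallel_size=4,                        # the reporter used 4 GPUs with vLLM
)
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["[INST] Explain mixture-of-experts briefly. [/INST]"], params)
print(outputs[0].outputs[0].text)
```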
@yilei-ding the template for Ollama doesn't do any quantization on the fly, but there was a change a month or so ago to the convert scripts which changed how the MoE models get converted (specifically, it lumped the experts together in a different way with the up/down/gate attention layers). I'll try that out and see if there is a performance difference.
OK, I have re-converted the fp16 version and I get comparable performance for both. On the new conversion I get:

On the old conversion I get:

So there is effectively no difference between the two conversions. What I think may be happening is that something is getting offloaded to the CPU. Can you update your Ollama version and try the new conversion?
@yilei-ding if you're still seeing performance problems, please share more information about your setup and I'll reopen the issue. Share the
hi, I've been comparing the inference speeds of serving the unquantized mixtral:8x7b-instruct-v0.1-fp16 model between the Ollama and vLLM platforms. I set the temperature to 0 and used the same number of generated tokens, and the Mixtral model served on Ollama performs much worse. I also checked that [INST] and [/INST] were added to the prompt on Ollama, the same as on vLLM, but the model still performs very badly. Notably, Ollama manages to run the model on just 2 A6000 GPUs (each with 48 GB of memory), whereas both vLLM and Hugging Face require 4 GPUs to handle the unquantized Mixtral 8x7B model. This has led me to wonder whether Ollama applies any form of on-the-fly quantization.
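To make the comparison concrete, here is a minimal reproduction sketch along the lines described above, assuming Ollama's `/api/generate` endpoint on its default port 11434 and a vLLM OpenAI-compatible server on port 8000; the prompt, token limit, and served model names are illustrative assumptions rather than values taken from the report:

```python
# Sketch: send the same [INST]-wrapped prompt with temperature 0 and the
# same token limit to both servers, then compare the generated text.
import requests

PROMPT = "[INST] Summarize the plot of Hamlet in three sentences. [/INST]"
MAX_TOKENS = 256

# Ollama: raw=True bypasses Ollama's own prompt template so both servers
# see exactly the same text.
ollama = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-fp16",
        "prompt": PROMPT,
        "raw": True,
        "stream": False,
        "options": {"temperature": 0, "num_predict": MAX_TOKENS},
    },
    timeout=600,
).json()

# vLLM via its OpenAI-compatible completions endpoint, same settings.
vllm = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": PROMPT,
        "temperature": 0,
        "max_tokens": MAX_TOKENS,
    },
    timeout=600,
).json()

print("--- Ollama ---\n", ollama.get("response"))
print("--- vLLM ---\n", vllm["choices"][0]["text"])
```

With greedy decoding on both sides, any large quality gap between the two outputs should then be attributable to the serving stack (conversion, templating, or offloading) rather than to sampling noise.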