
mixtral:8x7b-instruct-v0.1-fp16 served on Ollama performs worse than the same model served on vllm with same configuration #3349


Closed
yilei-ding opened this issue Mar 25, 2024 · 6 comments
Labels: bug, gpu, nvidia

Comments

@yilei-ding
Copy link

Hi, I've been comparing inference between the unquantized mixtral:8x7b-instruct-v0.1-fp16 served on Ollama and the same model served on vLLM. I set the temperature to 0 and the same number of generated tokens on both, yet the model served on Ollama performs much worse. I also checked that [INST] and [/INST] are added to the prompt on Ollama, the same as with vLLM, but the model still performs poorly. Notably, Ollama manages to run the model on just 2 A6000 GPUs (48 GB each), whereas both vLLM and Hugging Face require 4 GPUs to handle the unquantized Mixtral 8x7B model. This has led me to wonder whether Ollama applies any form of on-the-fly quantization.
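For reference, here is a minimal sketch of the kind of side-by-side comparison I mean, assuming an Ollama server on localhost:11434 and a vLLM OpenAI-compatible server on localhost:8000; the prompt, ports, and vLLM model name are illustrative, not my exact script:

```python
import requests

PROMPT = "[INST] Explain the difference between a mutex and a semaphore. [/INST]"
MAX_TOKENS = 512

# Ollama: native /api/generate endpoint (default port 11434 assumed).
ollama_resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-fp16",
        "prompt": PROMPT,
        "raw": True,   # bypass the Modelfile template, since [INST] tags are added by hand here
        "stream": False,
        "options": {"temperature": 0, "num_predict": MAX_TOKENS},
    },
).json()

# vLLM: OpenAI-compatible /v1/completions endpoint (default port 8000 assumed).
vllm_resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": PROMPT,
        "temperature": 0,
        "max_tokens": MAX_TOKENS,
    },
).json()

print("--- Ollama ---")
print(ollama_resp["response"])
print("--- vLLM ---")
print(vllm_resp["choices"][0]["text"])
```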

@igorschlum

Hi @yilei-ding, which OS are you running Ollama on, and with how much RAM? Can you please share a prompt, or a script that runs several prompts, so we can replicate the issue?

@igorschlum

Hi @yilei-ding, did you try with version 0.1.31? Could you please share your RAM, CPU, OS, and a script so we can try to reproduce the issue? With no further news and no other users reporting the same problem, this issue could be closed.

@flefevre

Could you share your vLLM configuration and command line?

@pdevine
Contributor

pdevine commented May 16, 2024

@yilei-ding the template for mixtral:8x7b-instruct-v0.1-fp16 was slightly off (there was an additional space at the beginning of the template), which may have been causing the poor results. I've just pushed an update to the template, so you may want to try again.
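If you want to verify what template your local copy is actually using, a small sketch along these lines should work; it pulls the template via Ollama's /api/show endpoint and flags leading whitespace (the `name` request field and `template` response field are my reading of the REST API, so treat them as an assumption):

```python
import requests

# Ask the local Ollama server for the model's metadata, including its prompt template.
resp = requests.post(
    "http://localhost:11434/api/show",
    json={"name": "mixtral:8x7b-instruct-v0.1-fp16"},
)
template = resp.json().get("template", "")

print(repr(template))  # repr() makes a stray leading space or newline visible
if template != template.lstrip():
    print("Warning: template starts with whitespace, which can degrade instruct output.")
```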

Ollama doesn't do any quantization on the fly, but there was a change a month or so ago to the convert scripts that changed how the MoEs get converted (specifically, it lumped the experts together in a different way with the up/down/gate attention layers). I'll try that out and see if there is a performance difference.

@pdevine
Contributor

pdevine commented May 17, 2024

OK, I have re-converted the fp16 version and I get comparable performance for both.

On the new version I get:

total duration:       1m28.047026667s
load duration:        2.070959ms
prompt eval count:    13 token(s)
prompt eval duration: 3.371297s
prompt eval rate:     3.86 tokens/s
eval count:           1132 token(s)
eval duration:        1m24.670792s
eval rate:            13.37 tokens/s

On mixtral:8x7b-instruct-v0.1-fp16 I get:

total duration:       1m20.200884042s
load duration:        4.080167ms
prompt eval count:    13 token(s)
prompt eval duration: 3.398857s
prompt eval rate:     3.82 tokens/s
eval count:           1031 token(s)
eval duration:        1m16.795729s
eval rate:            13.43 tokens/s

So there is effectively no difference between the two conversions. What I think may be happening is that part of the model is getting offloaded to the CPU. Can you update your Ollama version and try the new ollama ps command while the model is loaded? It should say 100% GPU if the model was loaded entirely onto the GPUs.
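If scripting the check is more convenient, something like the following should be roughly equivalent via the /api/ps endpoint (the `size` and `size_vram` field names are my understanding of the API response and worth double-checking against the docs):

```python
import requests

# /api/ps lists currently loaded models; size_vram is the portion resident on the GPU(s).
models = requests.get("http://localhost:11434/api/ps").json().get("models", [])

for m in models:
    total = m.get("size", 0)
    vram = m.get("size_vram", 0)
    pct_gpu = 100 * vram / total if total else 0
    print(f"{m['name']}: {pct_gpu:.0f}% GPU")  # anything below 100% means partial CPU offload
```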

@pdevine added the bug, nvidia, and gpu labels on May 17, 2024
@dhiltgen
Collaborator

@yilei-ding if you're still seeing performance problems, please share more information about your setup and I'll reopen the issue. Share the ollama ps output so we can rule out partial offload as the cause of the performance problem.
