mixtral:8x7b-instruct-v0.1-fp16 served on Ollama performs worse than the same model served on vllm with same configuration #3349
Comments
hi @yilei-ding, which OS are you running Ollama on, and with how much RAM? Can you please share a prompt or a script to run several prompts, so we can replicate the issue?
Hi @yilei-ding, have you tried version 0.1.31? Could you please share your RAM, CPU, OS and a script so we can try to reproduce the issue? With no further news and no other users reporting the same problem, this issue may be closed.
Could you share your vLLM configuration and command line?
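For reference, a minimal sketch of the kind of vLLM setup being asked about, using vLLM's offline Python API; the model name, `tensor_parallel_size`, prompt, and token limit here are illustrative assumptions, not values taken from this thread:

```python
# Sketch only: an unquantized Mixtral served with vLLM, greedy decoding.
# Model name, tensor_parallel_size and max_tokens are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # unquantized fp16/bf16 weights
    tensor_parallel_size=4,                        # the reporter used 4 GPUs with vLLM
)
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["[INST] Explain mixture-of-experts briefly. [/INST]"], params)
print(outputs[0].outputs[0].text)
```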
@yilei-ding the template for Ollama doesn't do any quantization on the fly, but there was a change a month or so ago to the convert scripts which changed how the MoE models get converted (specifically, it lumped the experts together in a different way with the up/down/gate attention layers). I'll try that out and see if there is a performance difference.
OK, I have re-converted the fp16 version and I get comparable performance for both. On the new conversion I get:

On the old conversion I get:

So there is effectively no difference between the two conversions. What I think may be happening is that something is getting offloaded to the CPU. Can you update your Ollama version and try the new conversion?
@yilei-ding if you're still seeing performance problems, please share more information about your setup and I'll reopen the issue. Share the
hi, I've been comparing the inference speeds of serving the unquantized mixtral:8x7b-instruct-v0.1-fp16 model between the Ollama and vLLM platforms. I set the temperature to 0 and used the same number of generated tokens, and the Mixtral model served on Ollama performs much worse. I also checked that [INST] and [/INST] were added to the prompt on Ollama, the same as on vLLM, but the model still performs very badly. Notably, Ollama manages to run the model on just 2 A6000 GPUs (each with 48 GB of memory), whereas both vLLM and Hugging Face require 4 GPUs to handle the unquantized Mixtral 8x7B model. This has led me to wonder whether Ollama applies any form of on-the-fly quantization.
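To make the comparison concrete, here is a minimal reproduction sketch along the lines described above, assuming Ollama's `/api/generate` endpoint on its default port 11434 and a vLLM OpenAI-compatible server on port 8000; the prompt, token limit, and served model names are illustrative assumptions rather than values taken from the report:

```python
# Sketch: send the same [INST]-wrapped prompt with temperature 0 and the
# same token limit to both servers, then compare the generated text.
import requests

PROMPT = "[INST] Summarize the plot of Hamlet in three sentences. [/INST]"
MAX_TOKENS = 256

# Ollama: raw=True bypasses Ollama's own prompt template so both servers
# see exactly the same text.
ollama = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-fp16",
        "prompt": PROMPT,
        "raw": True,
        "stream": False,
        "options": {"temperature": 0, "num_predict": MAX_TOKENS},
    },
    timeout=600,
).json()

# vLLM via its OpenAI-compatible completions endpoint, same settings.
vllm = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": PROMPT,
        "temperature": 0,
        "max_tokens": MAX_TOKENS,
    },
    timeout=600,
).json()

print("--- Ollama ---\n", ollama.get("response"))
print("--- vLLM ---\n", vllm["choices"][0]["text"])
```

With greedy decoding on both sides, any large quality gap between the two outputs should then be attributable to the serving stack (conversion, templating, or offloading) rather than to sampling noise.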