
Swift generates tokens substantially slower than python for Phi-3 #93

Closed
neilmehta24 opened this issue Jul 10, 2024 · 6 comments

@neilmehta24

The Python mlx_lm implementation generates at ~101 tokens per second for mlx-community/Phi-3-mini-4k-instruct-4bit, whereas the Swift code here generates at ~60 tokens per second.

Here is my Python implementation:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit")
response = generate(model, tokenizer, max_tokens=100000, prompt="""<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>""", verbose=True)

Here is my Swift command:

➜  mlx-swift-examples git:(main) ./mlx-run llm-tool eval --model mlx-community/Phi-3-mini-4k-instruct-4bit -m 100000 -p "<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>"

Any ideas on how I can achieve similar speed in Swift?

@davidkoski
Collaborator

I will investigate this

@davidkoski davidkoski self-assigned this Jul 10, 2024
@davidkoski
Collaborator

I can see two things that look like they are contributing here:

  • JIT startup costs -- the first time I ran the Swift version I saw ~60 tokens per second, but subsequent runs were at ~100 tokens per second
  • StreamingDetokenizer -- if I generate 100 tokens, the Python and Swift versions are nearly the same speed (around 100 tokens per second on my laptop)
    • but if I generate 1000 tokens (and modify the various parameters so that it doesn't stop short), I get ~60 tokens per second from Swift and still ~100 tokens per second from Python

In Swift we are calling the Tokenizer with the entire list of output tokens and printing any additions to the resulting string. This is $O(n^2)$ performance, as we need to scan $n$ tokens $n$ times. The Python version has a StreamingDetokenizer that gets $O(n)$ performance because it only generates the tail end of the output string as it runs -- we need a version of this in Swift.
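For illustration only, here is a rough Swift sketch of that quadratic pattern (the names `decode`, `tokens`, and `printedText` are hypothetical stand-ins, not this repo's actual API):

    // Hypothetical sketch (not the repo's actual API): every new token forces
    // a decode of the entire history, so emitting n tokens does O(n^2) work.
    var tokens = [Int]()
    var printedText = ""

    func emit(_ token: Int, decode: ([Int]) -> String) {
        tokens.append(token)
        let fullText = decode(tokens)                  // scans all n tokens so far
        print(fullText.dropFirst(printedText.count), terminator: "")
        printedText = fullText
    }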

I suspect you were seeing the latter effect (you probably ran this a few times and the first effect was negligible).

@davidkoski
Collaborator

Tasks:

  • port StreamingDetokenizer
  • use mx.async_eval(y) to pipeline the generation
  • look at KVCache from the python side as well

We can do these in that order, since the first one probably gives the biggest benefit.

@awni
Member

awni commented Jul 11, 2024

port StreamingDetokenizer

In Python we have a naive detokenizer that chops the history on every line break to avoid needing to re-decode the full sequence. That actually gets you pretty far and is quite simple to implement. The full streaming detokenizers add some speed after that, but they are more involved to implement since there are a few cases for different models.
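As a rough illustration of that "chop on line break" idea, a minimal Swift sketch might look like this (assuming a generic `decode([Int]) -> String` tokenizer hook; the type and names are hypothetical, not the actual mlx_lm or swift-transformers API):

    // Minimal sketch of a naive streaming detokenizer: only the tokens since
    // the last line break are re-decoded, so the re-decoded window stays short.
    struct NaiveStreamingDetokenizer {
        let decode: ([Int]) -> String      // hypothetical tokenizer hook
        private var segmentTokens: [Int] = []
        private var segmentText = ""       // text already emitted for this segment

        // Returns only the newly produced text for `token`.
        mutating func append(_ token: Int) -> String {
            segmentTokens.append(token)
            let text = decode(segmentTokens)
            let newText = String(text.dropFirst(segmentText.count))
            if text.hasSuffix("\n") {
                // Segment ended on a line break: start a fresh, short segment.
                segmentTokens.removeAll()
                segmentText = ""
            } else {
                segmentText = text
            }
            return newText
        }
    }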

use mx.async_eval(y) to pipeline the generation

That should be fairly simple to add. It's like a four-line change in Python.
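In Swift the same pipelining pattern might look roughly like this, assuming mlx-swift exposes an `asyncEval` analogous to `mx.async_eval` (the helpers `sampleNextToken`, `emit`, and `done` are placeholders, not real APIs):

    import MLX

    // Pipelined generation sketch: queue evaluation of the next token's graph
    // before synchronizing on the previous one, so compute and output overlap.
    var y = sampleNextToken(model, promptTokens)   // placeholder helper
    asyncEval(y)                                   // start evaluating without blocking
    while !done {
        let nextY = sampleNextToken(model, y)      // builds the next step's graph
        asyncEval(nextY)
        emit(y.item(Int.self))                     // .item() waits only on y
        y = nextY
    }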

look at KVCache from the python side as well

That actually makes a noticeable difference for longer generations.
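The core idea of the KV cache is to keep each layer's attention keys and values around so every generation step only computes projections for the new token instead of re-running the whole sequence. A conceptual Swift sketch, assuming MLX's `concatenated(_:axis:)` op (this is illustrative, not the implementation that later landed in this repo):

    import MLX

    // Conceptual per-layer KV cache: keys/values from earlier steps are kept
    // so attention only needs the new token's projections each step.
    final class KVCacheLayer {
        private(set) var keys: MLXArray?
        private(set) var values: MLXArray?

        // Append this step's keys/values (shape [B, heads, newLen, headDim])
        // and return the full history to attend over.
        func update(keys newK: MLXArray, values newV: MLXArray) -> (MLXArray, MLXArray) {
            keys = keys.map { concatenated([$0, newK], axis: 2) } ?? newK
            values = values.map { concatenated([$0, newV], axis: 2) } ?? newV
            return (keys!, values!)
        }
    }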

@awni
Member

awni commented Aug 16, 2024

Another optimization in Python which is really useful for long prompts/generations: ml-explore/mlx-examples#931

There are two things there:

  1. Prompt splitting
  2. Rotating buffer for the cache

The prompt splitting is an easy win / no-brainer. It's basically four lines for faster, lower-memory prompt processing:

  # Process the prompt in chunks of prefill_step_size tokens, evaluating the
  # cache state after each chunk instead of the whole prompt in one graph.
  while y.size > prefill_step_size:
      model(y[:prefill_step_size][None], cache=cache)
      mx.eval([c.state for c in cache])
      y = y[prefill_step_size:]

The rotating buffer is more involved but useful for memory-constrained situations (at the cost of some accuracy). We can look at adding that after the other items above.
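The rotating idea reduces to a fixed-size window: keep the first few entries verbatim and, once the window is full, overwrite the oldest of the remaining slots. A purely conceptual Swift sketch of just that mechanism (the real RotatingKVCache in mlx_lm operates on key/value arrays and also handles ordering for attention):

    // Conceptual fixed-memory rotating buffer: `keep` leading entries are
    // preserved, later entries overwrite the oldest remaining slot in a ring.
    struct RotatingBuffer<Element> {
        let keep: Int
        let maxSize: Int
        private(set) var storage: [Element] = []
        private var nextSlot = 0

        mutating func append(_ element: Element) {
            if storage.count < maxSize {
                storage.append(element)
            } else {
                storage[keep + nextSlot] = element
                nextSlot = (nextSlot + 1) % (maxSize - keep)
            }
        }
    }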

davidkoski added a commit that referenced this issue Aug 23, 2024
- sampling code compiled
- KVCache
- async_eval
- NaiveStreamingDetokenizer
davidkoski added a commit that referenced this issue Aug 28, 2024
- sampling code compiled
- KVCache
- async_eval
- NaiveStreamingDetokenizer
- use mlx-swift 16.0.1
davidkoski added a commit that referenced this issue Aug 29, 2024
* add kvcache, async eval, etc for #93

- sampling code compiled
- KVCache
- async_eval
- NaiveStreamingDetokenizer
- use mlx-swift 16.0.1
@davidkoski
Collaborator

The performance should be roughly the same as Python now, though I found both measurements to be a little noisy. See #109
