Swift generates tokens substantially slower than python for Phi-3 #93
I will investigate this.
I can see two things that look like they are contributing here:
In Swift we are calling the Tokenizer with the entire list of output tokens and taking any additions to the resulting string to print out, so each new token re-decodes the whole history. I suspect you were seeing the latter effect (you probably ran this a few times, so the first effect was negligible).
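In rough Python terms, the pattern looks like this (a hypothetical sketch, with `token_stream` and `tokenizer` standing in for the real generation loop): each step re-decodes all n tokens so far, so a full generation does O(n²) decoding work.

```python
# Hypothetical sketch of the costly pattern: decode the full history
# on every step and print only the newly added suffix.
printed = ""
tokens = []
for tok in token_stream:              # stand-in for the generation loop
    tokens.append(tok)
    text = tokenizer.decode(tokens)   # re-decodes the entire history
    print(text[len(printed):], end="", flush=True)
    printed = text
```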
Tasks:

- compile the sampling code
- add a KVCache
- use async_eval
- add a NaiveStreamingDetokenizer

We can do these in that order, as the first one is probably the biggest benefit.
In Python we have a naive detokenizer that chops the history on every line break to avoid needing to re-decode the full sequence. That actually gets you pretty far and is quite simple to implement. The full streaming detokenizers add some speed after that, but they are more involved to implement since there are a few cases for different models.
That should be fairly simple to add; it's like a four-line change in Python.
That actually makes a noticeable difference for longer generations.
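As a rough illustration of the idea (a sketch, not mlx_lm's actual class; the names are made up and a HuggingFace-style `tokenizer.decode` is assumed):

```python
class NaiveStreamingDetokenizer:
    """Chops the history at every line break so each step only re-decodes
    the current line's tokens instead of the full sequence."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.tokens = []   # tokens since the last line break
        self.text = ""     # text flushed at earlier line breaks
        self._last = ""    # previous decode of the pending tokens

    def add_token(self, token):
        self.tokens.append(token)

    def last_segment(self):
        # Per-step cost is bounded by line length, not sequence length.
        current = self.tokenizer.decode(self.tokens)
        segment = current[len(self._last):]
        if current.endswith("\n"):
            # A line break is a safe split point: flush and reset.
            self.text += current
            self.tokens = []
            self._last = ""
        else:
            self._last = current
        return segment
```

The line break makes a convenient split point because the text decoded up to it will not change as more tokens arrive, so the flushed prefix never needs to be revisited.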
Another optimization in Python which is really useful for long prompts/generations is ml-explore/mlx-examples#931. There are two things there.
The prompt splitting is an easy win / no-brainer. Basically four lines for faster / lower-memory prompt processing:

```python
import mlx.core as mx  # assumed import; the other names come from the caller

# Process the prompt in prefill_step_size chunks so peak memory stays bounded.
while y.size > prefill_step_size:
    model(y[:prefill_step_size][None], cache=cache)
    mx.eval([c.state for c in cache])
    y = y[prefill_step_size:]
```

The rotating buffer is more involved but useful for memory-constrained situations (at the cost of accuracy). We can look at adding that after the other items above.
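For reference, a minimal sketch of the rotating-buffer idea, assuming an MLX-style cache interface where each attention layer calls `update_and_fetch` with its new key/value tensors. This simplified version just trims the oldest positions; the implementation referenced above is more involved (e.g. it preserves some initial tokens and rotates in place):

```python
import mlx.core as mx

class SlidingWindowKVCache:
    """Keeps at most max_size key/value positions, dropping the oldest.

    Bounds memory for long generations at the cost of the model losing
    access to older context (the accuracy trade-off noted above).
    """

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.keys = None
        self.values = None

    def update_and_fetch(self, keys, values):
        # keys/values shape: (batch, n_heads, n_new_tokens, head_dim)
        if self.keys is None:
            self.keys, self.values = keys, values
        else:
            self.keys = mx.concatenate([self.keys, keys], axis=2)
            self.values = mx.concatenate([self.values, values], axis=2)
        # Trim to the most recent max_size positions.
        if self.keys.shape[2] > self.max_size:
            self.keys = self.keys[:, :, -self.max_size:, :]
            self.values = self.values[:, :, -self.max_size:, :]
        return self.keys, self.values
```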
Done:

- sampling code compiled
- KVCache
- async_eval
- NaiveStreamingDetokenizer
- use mlx-swift 16.0.1
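In MLX's Python API, the compiled-sampling and async_eval items look roughly like this (a sketch of the technique, not the Swift code that landed; the temperature value is illustrative). Note that the PRNG state must be threaded through `mx.compile`, otherwise the random key is baked into the compiled graph and every call returns the same sample:

```python
import mlx.core as mx

temp = 0.7  # illustrative temperature

# Compile the sampling step, threading the random state through so each
# call draws a fresh sample.
sample = mx.compile(
    lambda logits: mx.random.categorical(logits * (1 / temp)),
    inputs=[mx.random.state],
    outputs=[mx.random.state],
)

# In the generation loop, async_eval starts evaluating the next token
# while the previous one is still being detokenized/printed:
#   y = sample(logits)   # logits from the model's forward pass
#   mx.async_eval(y)
```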
The performance should be roughly the same as python now, though I found both of them to be a little noisy in the measurement. See #109.
The python `mlx_lm` implementation generates at ~101 tokens per second for `mlx-community/Phi-3-mini-4k-instruct-4bit`, whereas the swift code here generates at ~60 tokens per second.

Here is my python implementation:
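(For illustration only, since the script itself was not captured here: a minimal mlx_lm generation call has this shape, with the prompt as a placeholder.)

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Phi-3-mini-4k-instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a story about Einstein",  # placeholder prompt
    verbose=True,  # streams tokens and reports tokens-per-second
)
```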
Here is my swift command:
Any ideas on how I can achieve similar speed in swift?