
Faster prompt eval w/ early exit after last layer's kv cache write #253

Closed
ochafik wants to merge 1 commit

Conversation

ochafik commented Aug 7, 2023

Prompt evaluation's only side effect is to build the kv cache for all the layers, so we can exit early at the last layer once we've cached its k & v vectors (and also skip the final rmsnorm & logits computation).
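For concreteness, a rough sketch of where the early exit would sit in run.c's forward pass (illustrative only, not this PR's literal diff: the `logits_needed` flag and the stub typedefs are assumptions, and the unchanged math is elided into comments):

```c
// Stubs standing in for run.c's real structs (fields elided).
typedef struct { int n_layers; /* dim, n_heads, vocab_size, ... */ } Config;
typedef struct { int unused; /* x, q, k, v, key_cache, value_cache, logits, ... */ } RunState;
typedef struct { int unused; /* token embedding table, per-layer weights, wcls, ... */ } TransformerWeights;

void transformer(int token, int pos, Config* p, RunState* s,
                 TransformerWeights* w, int logits_needed) {
    // ... copy the embedding for `token` into the residual stream s->x ...

    for (int l = 0; l < p->n_layers; l++) {
        // ... attention rmsnorm; q/k/v matmuls; RoPE rotation of q and k ...
        // ... write this position's k & v into s->key_cache / s->value_cache ...

        // Prompt eval only needs the kv cache side effect, so once the last
        // layer's k & v are cached we can return: the rest of this layer,
        // the final rmsnorm, and the vocab-sized classifier matmul only
        // matter when we actually sample a token from this position.
        if (!logits_needed && l == p->n_layers - 1) {
            return;
        }

        // ... multihead attention over the cache, output matmul, residual
        //     add; ffn rmsnorm, SwiGLU FFN, residual add ...
    }

    // ... final rmsnorm, then classifier matmul into s->logits ...
}
```

On the caller side (sketch of run.c's main loop), logits are only needed from the last prompt token onward, since that's when sampling starts:

```c
int logits_needed = (pos >= num_prompt_tokens - 1);
transformer(token, pos, &config, &state, &weights, logits_needed);
```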

Tested: `make runfast` w/ hyperfine on an M2 Max Mac, with `-n` adjusted so that only prompt eval is timed, using prompts generated w/ lorem:

```
hyperfine -L bin run,run_baseline -L model llama2.c.stories110M.bin,llama-2-7b-chat.llama2.c.bin --warmup=1 './{bin} "{model}" -n 100 -i "$( lorem -n 100 )"'
```
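(The two `-L` lists make hyperfine benchmark the cross product of binaries and models, i.e. four timed commands; e.g. one expansion is `./run "llama2.c.stories110M.bin" -n 100 -i "$( lorem -n 100 )"`, with `run` presumably built from this PR and `run_baseline` from master.)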
| model | -n | master | this PR | speedup wrt/ master | fast tok (#251) | this PR + fast tok | speedup wrt/ fast tok |
|---|---|---|---|---|---|---|---|
| stories110M | 100 | 7.8s | 7.5s | -3.8% | 1.02s (103 tok/s) | 0.73s (146 tok/s) | -28% (??) |
| stories110M | 336 (lorem -n 128) | 14.8s | 13.6s | -8% | 3.5s (96 tok/s) | 2.5s (134 tok/s) | -28% (??) |
| llama2-7b | 336 | - | - | - | 211s (1.76 tok/s) | 187s (1.84 tok/s) | -11% |
| stories110M | 1997 (lorem -n 750) | - | - | - | 38.6s | 32.2s | -16% |


karpathy (Owner) commented Aug 7, 2023

not willing to bloat things just yet for this speedup

karpathy closed this Aug 7, 2023
ochafik (Author) commented Aug 7, 2023

@karpathy thanks for taking a look (and for the cool repo!). Note that after #251 it would be an 11-16% speedup (or 28% in some cases) for 10 extra LOC (updated figures), but fair enough ✌️
