
Faster prompt eval w/ early exit after last layer's kv cache write #253

Closed
ochafik wants to merge 1 commit

Conversation

ochafik commented Aug 7, 2023

Prompt evaluation's only side effect is to build the kv cache for all the layers, so we can exit early at the last layer once we've cached its k & v vectors (and also skip the final rmsnorm & logits computation).
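For concreteness, a rough sketch of where the early exit would sit in run.c's forward pass (illustrative only, not this PR's literal diff: the `logits_needed` flag and the stub typedefs are assumptions, and the unchanged math is elided into comments):

```c
// Stubs standing in for run.c's real structs (fields elided).
typedef struct { int n_layers; /* dim, n_heads, vocab_size, ... */ } Config;
typedef struct { int unused; /* x, q, k, v, key_cache, value_cache, logits, ... */ } RunState;
typedef struct { int unused; /* token embedding table, per-layer weights, wcls, ... */ } TransformerWeights;

void transformer(int token, int pos, Config* p, RunState* s,
                 TransformerWeights* w, int logits_needed) {
    // ... copy the embedding for `token` into the residual stream s->x ...

    for (int l = 0; l < p->n_layers; l++) {
        // ... attention rmsnorm; q/k/v matmuls; RoPE rotation of q and k ...
        // ... write this position's k & v into s->key_cache / s->value_cache ...

        // Prompt eval only needs the kv cache side effect, so once the last
        // layer's k & v are cached we can return: the rest of this layer,
        // the final rmsnorm, and the vocab-sized classifier matmul only
        // matter when we actually sample a token from this position.
        if (!logits_needed && l == p->n_layers - 1) {
            return;
        }

        // ... multihead attention over the cache, output matmul, residual
        //     add; ffn rmsnorm, SwiGLU FFN, residual add ...
    }

    // ... final rmsnorm, then classifier matmul into s->logits ...
}
```

On the caller side (sketch of run.c's main loop), logits are only needed from the last prompt token onward, since that's when sampling starts:

```c
int logits_needed = (pos >= num_prompt_tokens - 1);
transformer(token, pos, &config, &state, &weights, logits_needed);
```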

Tested: `make runfast` w/ hyperfine on an M2 Max Mac, with `-n` adjusted so that only prompt eval is timed, using prompts generated w/ lorem:

```
hyperfine -L bin run,run_baseline -L model llama2.c.stories110M.bin,llama-2-7b-chat.llama2.c.bin --warmup=1 './{bin} "{model}" -n 100 -i "$( lorem -n 100 )"'
```
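(The two `-L` lists make hyperfine benchmark the cross product of binaries and models, i.e. four timed commands; e.g. one expansion is `./run "llama2.c.stories110M.bin" -n 100 -i "$( lorem -n 100 )"`, with `run` presumably built from this PR and `run_baseline` from master.)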
| model | -n | master | this PR | speedup wrt/ master | fast tok (#251) | this PR + fast tok | speedup wrt/ fast tok |
|---|---|---|---|---|---|---|---|
| stories110M | 100 | 7.8s | 7.5s | -3.8% | 1.02s (103 tok/s) | 0.73s (146 tok/s) | -28% (??) |
| stories110M | 336 (lorem -n 128) | 14.8s | 13.6s | -8% | 3.5s (96 tok/s) | 2.5s (134 tok/s) | -28% (??) |
| llama2-7b | 336 | - | - | - | 211s (1.76 tok/s) | 187s (1.84 tok/s) | -11% |
| stories110M | 1997 (lorem -n 750) | - | - | - | 38.6s | 32.2s | -16% |


karpathy (Owner) commented Aug 7, 2023

not willing to bloat things just yet for this speedup

karpathy closed this Aug 7, 2023
ochafik (Author) commented Aug 7, 2023

@karpathy thanks for taking a look (and for the cool repo!). Note that after #251 it would be an 11-16% speedup (or 28% in some cases) for 10 extra LOC (updated figures), but fair enough ✌️
