diff --git a/kv-cache.md b/kv-cache.md
index e8876ffbca..35cae4b91d 100644
--- a/kv-cache.md
+++ b/kv-cache.md
@@ -13,7 +13,7 @@ authors:
 
 ## TL;DR
 
-We have implemented KV Caching from scratch in our [nanoVLM](https://github.com/huggingface/nanoVLM) repository (a small code base to train your own Vision Language Model with pure PyTorch). This gave us a **38%** of speedup in generation. In this blog post we cover KV Caching and all our experiences while implementing it. The lessons learnt are general and can be applied to all autoregressive language model generations. Implementing from scratch on a small codebase is a great learning experience, come along for the ride!
+We have implemented KV Caching from scratch in our [nanoVLM](https://github.com/huggingface/nanoVLM) repository (a small codebase to train your own Vision Language Model with pure PyTorch). This gave us a **38%** speedup in generation. In this blog post we cover KV Caching and all our experiences while implementing it. The lessons learnt are general and can be applied to all autoregressive language model generations. Implementing from scratch on a small codebase is a great learning experience, come along for the ride!
 
 ![bar plot showcasing improvement in generation speed](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/kv-cache/speed_improved.png)
 
@@ -25,7 +25,7 @@ Autoregressive language models generate text by sampling *one token at a time*.
 
 This step-by-step generation is inherently sequential:
 
-- To generate token \\( t_{i+1} \\), the model must consider the entire sequence from \\( t_0 \\) to \\( t_i \\). From the first instance in the above example \\( t_{i+1} \\) would be `the` , while all the previous tokens \\( t_0 \\) to \\( t_i \\) would be `[What, is, in,]`.
+- To generate token \\( t_{i+1} \\), the model must consider the entire sequence from \\( t_0 \\) to \\( t_i \\). From the first instance in the above example, \\( t_{i+1} \\) would be `the`, while all the previous tokens \\( t_0 \\) to \\( t_i \\) would be `[What, is, in]`.
 - Although transformers are internally parallel, each new prediction requires a full forward pass through all transformer layers, which incurs a quadratic memory/compute in terms of the sequence length.
 
 This repetition also leads to computational **redundancy**. In this post, we explore **KV Caching**, an optimisation technique that mitigates this inefficiency.
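
The redundancy described in the second hunk is easy to see in code. Below is a minimal sketch, not the nanoVLM implementation from this diff, contrasting a naive greedy generation loop (which re-encodes the whole sequence at every step) with a KV-cached loop that feeds only the newest token plus the cached keys/values. It assumes the Hugging Face `transformers` causal-LM API (`use_cache`, `past_key_values`) and `gpt2`, purely for illustration.

```python
# Minimal sketch: naive vs. KV-cached greedy decoding.
# Assumes the Hugging Face transformers causal-LM API and gpt2 for illustration;
# this is not the nanoVLM code touched by the diff above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
prompt_ids = tok("What is in", return_tensors="pt").input_ids

@torch.no_grad()
def generate_naive(input_ids, steps=20):
    # Redundant: every step re-runs the full forward pass over all tokens so far,
    # recomputing keys/values for tokens that were already processed.
    for _ in range(steps):
        logits = model(input_ids).logits                     # [1, seq_len, vocab]
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids

@torch.no_grad()
def generate_cached(input_ids, steps=20):
    # Prefill once over the prompt, then feed only the newest token together
    # with the cached keys/values of everything processed so far.
    out = model(input_ids, use_cache=True)
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    pieces = [input_ids, next_id]
    for _ in range(steps - 1):
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        pieces.append(next_id)
    return torch.cat(pieces, dim=-1)

# Both loops produce the same greedy continuation; only the amount of
# recomputation differs.
print(tok.decode(generate_naive(prompt_ids)[0]))
print(tok.decode(generate_cached(prompt_ids)[0]))
```

The cache holds, per transformer layer, the key and value projections of every token processed so far; reusing them instead of recomputing them at each step is exactly the saving behind the speedup reported in the TL;DR.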