
SlidingWindowCache: reduce differences to other Cache classes #30970

Merged · 4 commits merged into huggingface:main on Jun 3, 2024

Conversation

gante (Member) commented May 22, 2024

What does this PR do?

Follow-up to #30642: this PR aims to reduce the differences between SlidingWindowCache and StaticCache, so that long-term maintenance becomes easier. Fewer attributes/functions = less cognitive overload and fewer bugs 🤗

More specifically:
👉 no need for attributes regarding the sliding window (it is a form of maximum cache size, for which there was an attribute)
👉 list of 4D tensors holding the cache, as opposed to 5D tensors (to keep the same data format as in other caches)
👉 inherits from StaticCache, as most of the __init__ and other boilerplate functions are identical
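
A minimal sketch of the intended layout (hypothetical, simplified names; not the actual transformers implementation): the sliding window is just the maximum cache length, and the sliding-window cache reuses the static cache's per-layer list of 4D tensors instead of carrying its own attributes and a 5D buffer.

import torch


class StaticCacheSketch:
    def __init__(self, max_batch_size, num_heads, max_cache_len, head_dim, num_layers, dtype=torch.float16):
        self.max_batch_size = max_batch_size
        self.max_cache_len = max_cache_len
        shape = (max_batch_size, num_heads, max_cache_len, head_dim)  # one 4D tensor per layer
        self.key_cache = [torch.zeros(shape, dtype=dtype) for _ in range(num_layers)]
        self.value_cache = [torch.zeros(shape, dtype=dtype) for _ in range(num_layers)]


class SlidingWindowCacheSketch(StaticCacheSketch):
    def __init__(self, max_batch_size, num_heads, sliding_window, head_dim, num_layers, dtype=torch.float16):
        # no dedicated sliding-window attributes: the window is simply the max cache length
        super().__init__(max_batch_size, num_heads, sliding_window, head_dim, num_layers, dtype)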

Slow Mistral tests were run locally, all green ✅

cc @zhenglongjiepheonix I meant to request these changes in the PR linked above, but I was slow to review 😛

gante requested a review from ArthurZucker on May 22, 2024 at 14:28
"""
Sliding Window Cache class to be used with `torch.compile` for models like Mistral that support sliding window attention.
Every time when we try to update the cache, we compute the `indices` based on `cache_position >= self.config.sliding_window_size - 1`,
Every time when we try to update the cache, we compute the `indices` based on `cache_position >= self.config.sliding_window - 1`,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sliding_window is the config attribute name, not sliding_window_size
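
For context, a simplified sketch of the rolling behaviour the docstring describes (hypothetical helper with a simplified boundary condition; not the actual update method): while the cache still has free slots, new states are written at `cache_position`; once the window is full, the cache is rolled so the oldest entry is dropped and the newest key/value states land in the last slot.

import torch

def sliding_window_write(cache, new_states, cache_position, sliding_window):
    # cache: (batch, num_heads, sliding_window, head_dim); new_states: one decoding step
    if cache_position < sliding_window:
        cache[:, :, cache_position] = new_states           # still room: write in place
    else:
        cache.copy_(torch.roll(cache, shifts=-1, dims=2))   # drop the oldest slot
        cache[:, :, -1] = new_states                        # newest states go last
    return cache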

Review comment on the cache re-use check:

    need_new_cache = (
        not hasattr(self, "_cache")
        or (not isinstance(self._cache, cache_cls))
-       or self._cache.max_batch_size < max_batch_size
+       or self._cache.max_batch_size != max_batch_size

gante (Member, author): (unrelated to the sliding window cache) this was incorrect; we need a new cache object whenever the batch size is different.

A contributor replied: that's a nice catch!
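
A standalone rendering of the corrected condition (hypothetical function; the real check lives inside the model's cache-setup code): a new cache is needed whenever the existing one is missing, of the wrong class, or built for a different batch size, not only when the requested batch size is larger.

def needs_new_cache(existing_cache, cache_cls, max_batch_size):
    # assumed, simplified version of the check shown in the diff above
    return (
        existing_cache is None
        or not isinstance(existing_cache, cache_cls)
        or existing_cache.max_batch_size != max_batch_size  # `!=`, not `<`
    )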


ArthurZucker (Collaborator) left a review: Overall good for me, but wondering where these graph breaks come from?

Review thread on the new cache-update code:

    # assume this will be called only in the first generation step
    # `cache_position` will be used in other cases
    return 0

    # `.zero_()` followed by `+=` is equivalent to `=`, but compile-friendly (without graph breaks due to assignment)
ArthurZucker (Collaborator): Where are these graph breaks from? (Did this not work before?) Because it's equivalent but slower, no?

zhenglongjiepheonix (Contributor): There is an extra zero involved here to make cudagraphs happy. I believe we should not change the address of the tensor during compilation, and direct assignment violates that. In StaticCache there is no problem because `k_out[:, :, cache_position] = key_states` does not change the address of `k_out`; but if we want a 4D instead of a 5D cache, direct assignment would substitute the original tensor in the layers list, causing an address change.
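
A small, self-contained illustration of the point above (an assumed example, not the PR's actual code): re-binding the list entry gives the layer a tensor at a new storage address, while `zero_()` followed by `+=` writes into the same storage, which is what CUDA graphs captured under `reduce-overhead` expect.

import torch

layers = [torch.zeros(2, 4)]
new_states = torch.ones(2, 4)

# direct assignment: the list slot now points at a different tensor / storage
old_ptr = layers[0].data_ptr()
layers[0] = new_states.clone()
assert layers[0].data_ptr() != old_ptr

# in-place `zero_()` + `+=`: same tensor, same storage address (cudagraph-friendly)
layers[0] = torch.zeros(2, 4)
old_ptr = layers[0].data_ptr()
layers[0].zero_()
layers[0] += new_states
assert layers[0].data_ptr() == old_ptr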

ArthurZucker (Collaborator): Ahhh yeah, which is why you did not have this.

ArthurZucker (Collaborator): If there is no tradeoff to using this (make bench + test on A100 as well), fine; otherwise not fine, but add a comment to say why.

gante (Member, author) commented May 24, 2024

@ArthurZucker @zhenglongjiepheonix the implementation from this PR is also faster 🙌

Setup:

  • A100 80GB
  • input length=502
  • max_new_tokens=128
  • compiling forward but calling generate (i.e. there is some overhead from calling the uncompiled generate)
  • model: mistralai/Mistral-7B-v0.1
Code:
from transformers import AutoTokenizer, MistralForCausalLM
import torch
import time

prompts = ["My favourite condiment is " * 100]
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", torch_dtype=torch.float16
)
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
print(inputs.input_ids.shape)

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

for i in range(5):
    start = time.time()
    generated_ids = model.generate(
        **inputs, max_new_tokens=128, do_sample=False, cache_implementation="sliding_window"
    )
    assert generated_ids.shape[1] == 128 + inputs.input_ids.shape[1]
    print(f"Time: {time.time() - start:.2f}s")

👉 static cache: 76.2 tok/s
👉 original sliding window: 70.7 tok/s
👉 this PR's sliding window: 74.9 tok/s

Could it be because there are fewer slicing OPs? (before, we had to slice the 5D cache into a 4D tensor at every layer)

zhenglongjiepheonix (Contributor) commented May 27, 2024

> @ArthurZucker @zhenglongjiepheonix the implementation from this PR is also faster 🙌 [...]
> 👉 static cache: 76.2 tok/s 👉 original sliding window: 70.7 tok/s 👉 this PR's sliding window: 74.9 tok/s
> Could it be because there are fewer slicing OPs? (before, we had to slice the 5D cache into a 4D tensor at every layer)

Yes, slicing can be time-consuming. I have tested on my side, and in your setting your implementation indeed saves about 1 ms per token. I think it's good if we don't have to slice every time by using the zero-and-add update.
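
Roughly, the difference being discussed (illustrative toy shapes only, not the real cache sizes or code): with a single 5D buffer, every layer has to slice out its own 4D view on each forward call, whereas a list of 4D tensors is just indexed.

import torch

num_layers, batch, heads, window, head_dim = 4, 1, 4, 256, 64

# old layout: one 5D tensor, sliced into a 4D view per layer on every call
cache_5d = torch.zeros(num_layers, batch, heads, window, head_dim)
k_layer = cache_5d[2]                 # extra slicing op for layer 2

# new layout: a Python list of per-layer 4D tensors, plain list indexing
cache_4d = [torch.zeros(batch, heads, window, head_dim) for _ in range(num_layers)]
k_layer = cache_4d[2]                 # no tensor slicing needed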

gante requested a review from ArthurZucker on May 29, 2024 at 15:19
ArthurZucker (Collaborator) left a review: LGTM! Let's merge, @gante is not here.

ArthurZucker merged commit d475f76 into huggingface:main on Jun 3, 2024 · 26 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request on Jun 11, 2024 (…gface#30970):

* tmp commit
* sliding window with fewer differences
* make fixup + rebase
* missing overwrite
gante deleted the sliding_window branch on Jun 13, 2024 at 16:19.
gante mentioned this pull request on Jun 13, 2024.