Attention sliding window implementation #29623

caiom · 2024-03-13T04:03:33Z

System Info

transformers version: 4.27.4
Platform: Linux-5.15.0-94-generic-x86_64-with-debian-bullseye-sid
Python version: 3.7.16
Huggingface_hub version: 0.16.4
PyTorch version (GPU?): 1.11.0 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @youne

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The implementation of the sliding window seems to differ between attention backends in the case of the Mistral model.

With the "eager" backend, a window size of 2048 attends to at most 2048 tokens:

import torch
from transformers.modeling_attn_mask_utils import AttentionMaskConverter

attn_mask_converter = AttentionMaskConverter(is_causal=True, sliding_window=2048)
mask = attn_mask_converter.to_causal_4d(1, 4096, 4096, torch.float32)
print(torch.sum(mask[0, 0, ...] == 0, dim=1))

Output:
tensor([ 1, 2, 3, ..., 2048, 2048, 2048])

With the Flash Attention 2 backend, a window size of 2048 attends to at most 2049 tokens, as suggested by the flash-attn documentation and tests:

import torch
from einops import rearrange

def construct_local_mask(
    seqlen_q,
    seqlen_k,
    window_size=(-1, -1),  # -1 means infinite window size
    query_padding_mask=None,
    key_padding_mask=None,
    device=None,
):
    row_idx = rearrange(torch.arange(seqlen_q, device=device, dtype=torch.long), "s -> s 1")
    col_idx = torch.arange(seqlen_k, device=device, dtype=torch.long)
    sk = seqlen_k if key_padding_mask is None else rearrange(key_padding_mask.sum(-1), "b -> b 1 1 1")
    sq = seqlen_q if query_padding_mask is None else rearrange(query_padding_mask.sum(-1), "b -> b 1 1 1")
    if window_size[0] < 0:
        return col_idx > row_idx + sk - sq + window_size[1]
    else:
        sk = torch.full_like(col_idx, seqlen_k) if key_padding_mask is None else sk
        return torch.logical_or(
            col_idx > torch.minimum(row_idx + sk - sq + window_size[1], sk),
            col_idx < row_idx + sk - sq - window_size[0],
        )


mask = construct_local_mask(4096, 4096, (2048, 0))
print(torch.sum(~mask, dim=1))

Output:
tensor([ 1, 2, 3, ..., 2049, 2049, 2049])
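
The two outputs above are consistent with an off-by-one difference in how the window bound is interpreted: one convention counts the current token as part of the window, the other counts the window as past tokens in addition to the current one. The following is a minimal sketch in plain Python (no torch, so it runs anywhere) that reproduces both patterns on a small sequence. The function name `window_counts` and the `inclusive` flag are hypothetical names chosen for illustration, not part of transformers or flash-attn.

```python
def window_counts(seq_len, window, inclusive):
    """Count the keys each query can see under a causal sliding window.

    inclusive=False: key j is visible to query i when 0 <= i - j < window
                     (at most `window` keys, the pattern in the eager output).
    inclusive=True:  key j is visible to query i when 0 <= i - j <= window
                     (at most `window + 1` keys, the pattern in the FA2 output).
    """
    max_distance = window if inclusive else window - 1
    counts = []
    for i in range(seq_len):
        # Visible keys are the positions j <= i within max_distance of i.
        visible = [j for j in range(seq_len) if 0 <= i - j <= max_distance]
        counts.append(len(visible))
    return counts


exclusive_counts = window_counts(8, 3, inclusive=False)
inclusive_counts = window_counts(8, 3, inclusive=True)
print(exclusive_counts)  # [1, 2, 3, 3, 3, 3, 3, 3]
print(inclusive_counts)  # [1, 2, 3, 4, 4, 4, 4, 4]
```

With `window=3` the counts saturate at 3 and 4 respectively, mirroring the 2048 versus 2049 saturation seen with `window=2048`.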

Expected behavior

All implementations of sliding window should use the same number of tokens given a window size.
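
One way to reason about aligning the two conventions is to start from the number of tokens each query should see and derive the window argument from it. The sketch below assumes a `(left, right)` window in which `left` counts past keys in addition to the current token, as `construct_local_mask` above does. The helpers `flash_style_window` and `visible_keys` are hypothetical and are not necessarily how the library resolved the discrepancy.

```python
from typing import Tuple


def flash_style_window(max_tokens: int) -> Tuple[int, int]:
    """Hypothetical helper: (left, right) window that exposes at most
    `max_tokens` keys per query when `left` excludes the current token."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    # The current token is always visible, so only max_tokens - 1 past keys remain.
    return (max_tokens - 1, 0)


def visible_keys(query_pos: int, left: int) -> int:
    """Keys visible to a query at `query_pos` under an inclusive causal window:
    positions query_pos - left .. query_pos, clipped at position 0."""
    return min(query_pos, left) + 1


left, right = flash_style_window(2048)
print(left, right)               # 2047 0
print(visible_keys(4095, left))  # 2048
```

Under this assumption, a window of `(2047, 0)` would give the same saturation value of 2048 that the eager mask produces for `sliding_window=2048`.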


amyeroberts · 2024-03-13T16:19:56Z

Hi @caiom, thanks for raising this issue!

There have been quite a few changes to the attention mask creation in recent releases, and 4.27 is relatively old (~10 months or so). Do these observations still hold on the most recent release or on main?

Recent stable: pip install -U transformers
Main: pip install git+https://github.com/huggingface/transformers

cc @younesbelkada

caiom · 2024-03-13T19:43:18Z

This was fixed 5 days ago: 608fa54

caiom closed this as completed Mar 13, 2024

caiom mentioned this issue Mar 13, 2024

Attention sliding window vllm-project/vllm#3385
