There have been quite a few changes to the attention mask creation in recent releases, and 4.27 is relatively old (~10 months or so). Do these observations still hold on the most recent release or on main?
System Info
transformers version: 4.27.4
Who can help?
@ArthurZucker @youne
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The implementation of the sliding window seems to differ between attention backends in the case of the Mistral model.
With the "eager" backend, a window size of 2048 means each position attends to at most 2048 tokens (including the current token):
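A minimal sketch of one way to measure this (not the original repro script), assuming the eager sliding-window mask allows query position `i` to attend to key positions `j` with `i - window < j <= i`:

```python
import torch

# Minimal sketch: count how many tokens each query position can attend to,
# assuming the eager sliding-window mask lets position i attend to
# positions j with i - window < j <= i.
seq_len, window = 4096, 2048

i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
visible = ((j <= i) & ((i - j) < window)).sum(dim=1)

print(visible)  # number of attended tokens per position
```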
Output:
tensor([ 1, 2, 3, ..., 2048, 2048, 2048])
In the case of the Flash Attention 2 backend, a window size of 2048 results in a maximum of 2049 tokens, as suggested by its documentation and its tests:
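A matching sketch under the Flash Attention 2 convention, where (per the flash-attn documentation) `window_size=(W, 0)` means position `i` attends to keys in `[i - W, i]` inclusive, i.e. `W` past tokens plus the current one:

```python
import torch

# Sketch of the Flash Attention 2 convention: window_size=(W, 0) lets
# position i attend to keys in [i - W, i] inclusive, which is W + 1
# tokens once the sequence is long enough.
seq_len, window = 4096, 2048

i = torch.arange(seq_len).unsqueeze(1)
j = torch.arange(seq_len).unsqueeze(0)
visible = ((j <= i) & ((i - j) <= window)).sum(dim=1)

print(visible)  # number of attended tokens per position
```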
Output:
tensor([ 1, 2, 3, ..., 2049, 2049, 2049])
Expected behavior
All sliding-window implementations should attend to the same number of tokens for a given window size.