Hi,
I was wondering if it would be possible to implement a sliding window decoder for the transformer?
When increasing the max sequence length, the training time goes up dramatically, and I think that using a sliding window decoder would greatly help with both training and inference speed.
I've tried using LocalAttention, but I'm not sure how to properly wire it in, since it takes q, k and v as inputs.
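Here is a rough sketch of what I imagine the wrapper would look like: project x to q, k and v, split into heads, and hand those to LocalAttention. The LocalAttention constructor arguments and expected tensor shapes here are taken from the local-attention README as I understand it and may not match the installed version, so please correct me if the kwargs or shapes are off.

```python
import torch
from torch import nn
from local_attention import LocalAttention

class SlidingWindowSelfAttention(nn.Module):
    """Causal self-attention restricted to a local window (rough sketch).

    Assumes LocalAttention accepts q, k, v shaped (batch, heads, seq_len, dim_head)
    and a boolean key padding mask shaped (batch, seq_len).
    """
    def __init__(self, dim, heads = 8, dim_head = 64, window_size = 256, dropout = 0.):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim)
        self.attn = LocalAttention(
            dim = dim_head,            # per-head dimension
            window_size = window_size, # size of the local attention window
            causal = True,             # decoder-style (autoregressive) masking
            look_backward = 1,         # each window also attends to the previous window
            look_forward = 0,
            dropout = dropout
        )

    def forward(self, x, mask = None):
        b, n, _ = x.shape
        h = self.heads
        # project to queries, keys, values and split into heads
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: t.reshape(b, n, h, -1).transpose(1, 2), (q, k, v))
        out = self.attn(q, k, v, mask = mask)    # (b, h, n, dim_head)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

# quick shape check
layer = SlidingWindowSelfAttention(dim = 512, window_size = 256)
x = torch.randn(1, 1024, 512)
out = layer(x)  # expected: (1, 1024, 512)
```

I'm also assuming the sequence length needs to be a multiple of window_size (or padded to one), but I'm not certain about that either.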
I know @lucidrains has already spent all their allotted time and more on this project, so if I could get some tips I could try to implement it myself.