Transformer - High VRAM, context length #39

Open
MarcusLoppe opened this issue Dec 27, 2023 · 4 comments

MarcusLoppe commented Dec 27, 2023

Hello again, this issue is for next year 😃

When training the transformer, I used the following config:

transformer = MeshTransformer(
    autoencoder,
    dim = 128,
    attn_depth = 4,
    attn_dim_head = 8,
    attn_heads = 4,
    coarse_pre_gateloop_depth = 1,  # 6
    fine_pre_gateloop_depth = 0,    # 4
    max_seq_len = max_seq,
    gateloop_use_heinsen = False,
    condition_on_text = True
)

This resulted in a transformer with 22M parameters.
I then tried to train it on a mesh with 6,206 faces, which is 37,236 tokens (6,206 * 6).
When I fed it the face codes (1, 6206, 128), it used about 11 GB of VRAM, and by the end of the forward pass it was using about 20 GB.
With a 188M-parameter transformer (dim = 256), it used 50 GB of VRAM.

My suggestion is to implement sliding-window / local attention, since most long-context LLMs use it and it seems to work well.
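Something like this, just as a rough sketch in plain PyTorch (the window size is made up, and this isn't the actual meshgpt-pytorch attention code):

import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window_size = 512):
    # q, k, v: (batch, heads, seq_len, dim_head)
    seq_len = q.shape[-2]
    idx = torch.arange(seq_len, device = q.device)
    # causal + local: token i only attends to the previous `window_size` tokens
    dist = idx[:, None] - idx[None, :]
    mask = (dist >= 0) & (dist < window_size)
    return F.scaled_dot_product_attention(q, k, v, attn_mask = mask)

(This only illustrates the masking pattern; a real implementation would chunk the sequence so the full seq x seq mask is never materialized.)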

Alternatively, you could create an embedding of the generated tokens and concatenate it with the text conditioner embedding, so the cross attention can be aware of previous tokens as well.
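Very roughly, something like this (the pooling and the names are just placeholders, not the library's API):

import torch
import torch.nn.functional as F

def build_cross_attn_context(text_embed, token_embed, chunk_size = 256):
    # text_embed:  (batch, text_len, dim) - text conditioner embeddings
    # token_embed: (batch, seq_len, dim)  - embeddings of previously generated tokens
    batch, seq_len, dim = token_embed.shape
    pad = (-seq_len) % chunk_size
    token_embed = F.pad(token_embed, (0, 0, 0, pad))
    # mean-pool fixed-size chunks so the extra context stays short
    pooled = token_embed.reshape(batch, -1, chunk_size, dim).mean(dim = 2)
    # cross attention would then attend over text + pooled token history
    return torch.cat((text_embed, pooled), dim = 1)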

It would also be worth checking whether Grouped-Query Attention is beneficial :)
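For reference, a bare-bones version of what grouped-query attention does (plain PyTorch, not the x-transformers implementation):

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, q_heads, seq, dim_head); k, v: (batch, kv_heads, seq, dim_head)
    # each group of query heads shares one key/value head, shrinking the kv cache
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim = 1)
    v = v.repeat_interleave(groups, dim = 1)
    return F.scaled_dot_product_attention(q, k, v, is_causal = True)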

For reference, here is the decoder call where the cross-attention context kwargs are passed in:

attended_face_codes, coarse_cache = self.decoder(
    face_codes,
    cache = coarse_cache,
    return_hiddens = True,
    **attn_context_kwargs
)


lucidrains commented Dec 27, 2023

@MarcusLoppe yes indeed, you are correct on both counts. local attention is tricky to handle with a kv cache

grouped query attention is also already available in x-transformers, which this lib is using

ok, no more AI stuff until after the new years 😆

MarcusLoppe commented:


@lucidrains

Have you given any thought to implementing Flash Attention 2? 😄 It seems like it would be a great way to speed up the transformer's training and inference times.
https://github.com/kyegomez/FlashAttention20/tree/main



lucidrains commented Jan 4, 2024

@MarcusLoppe flash attention 2 will make it into the next release of pytorch, so no need!
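for reference, pytorch's F.scaled_dot_product_attention already dispatches to a flash kernel when the inputs allow it (rough sketch, shapes made up):

import torch
import torch.nn.functional as F

# fp16 tensors on cuda with a supported head dim can take the flash path
q = torch.randn(1, 8, 4096, 64, device = 'cuda', dtype = torch.float16)
k = torch.randn(1, 8, 4096, 64, device = 'cuda', dtype = torch.float16)
v = torch.randn(1, 8, 4096, 64, device = 'cuda', dtype = torch.float16)

# restricting the backends raises an error if the flash kernel cannot be used
with torch.backends.cuda.sdp_kernel(enable_flash = True, enable_math = False, enable_mem_efficient = False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal = True)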

claudiomartella commented:
Will it need a code change?
