
Causal linear attention benchmark #64

Closed
caffeinetoomuch opened this issue Apr 12, 2021 · 13 comments

Comments

@caffeinetoomuch

caffeinetoomuch commented Apr 12, 2021

First, thanks for this awesome repo!!

Based on the T5 model classes from Hugging Face's transformers, I was trying to use performer attention in place of the original T5 attention. We fine-tuned t5-large on a summarization task, profiled both time and memory usage, and compared the performer attention with the original attention. I have only benchmarked with an input size of 1024.

The results clearly showed that performer attention uses a lot less memory than the original transformer. I know from the paper that the performer outperforms the original transformer when the input size is larger than 1024. However, fine-tuning and generation with the performer actually took longer, so I profiled the forward call of both the original T5 attention and the performer attention. The forward pass of the T5 performer took twice as long, and the main bottleneck was causal_dot_product_kernel from fast-transformers.

Is this normal performance for the performer / causal attention calculation, or will performer attention become faster at bigger input sizes?
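A minimal sketch of how such a forward-call timing comparison can be set up (assuming a CUDA device; `attn_module` and `inputs` are placeholders for either attention implementation and its arguments):

```python
import time
import torch

def time_forward(attn_module, inputs, n_iters=50, warmup=5):
    # Warm-up iterations so one-time CUDA initialization does not skew the timing.
    with torch.no_grad():
        for _ in range(warmup):
            attn_module(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            attn_module(*inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```

Peak memory can be compared the same way by calling torch.cuda.reset_peak_memory_stats() before the loop and torch.cuda.max_memory_allocated() after it.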

@lucidrains
Owner

@ice-americano ohh yes, so actually, for generation, I am not doing any caching. It should be way faster than T5

@lucidrains
Owner

is that what you meant by it being slow?

@caffeinetoomuch
Author

Yes, that's what I was trying to ask. I just replaced the softmax attention of T5Attention from huggingface transformers with FastAttention from this repo. However, both fine-tuning and generation were slower with FastAttention, even though it was clearly using less memory. Any idea what I might be doing wrong?

Thanks!
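For context, the kind of drop-in swap described above usually means calling FastAttention on the already-projected per-head queries, keys, and values. A minimal sketch along the lines of the repo README (the head count, head dimension, and feature count here are illustrative, not the values used in the benchmark):

```python
import torch
from performer_pytorch import FastAttention

# queries / keys / values with heads already split out: (batch, heads, seq, dim_head)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# causal = True mirrors the decoder self-attention case discussed in this thread
attn_fn = FastAttention(dim_heads = 64, nb_features = 256, causal = True)

out = attn_fn(q, k, v)  # (1, 8, 1024, 64), replacing the softmax(QK^T)V step
```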

@lucidrains
Owner

lucidrains commented Apr 14, 2021

If you are working at context lengths of less than 2048, training will be slower. The benefits of performers come at 4096 and beyond

As for generation, it's because I never built the caching portion. It should be a lot faster

@caffeinetoomuch
Author

If we were to build caching, what would be cached? The projection matrices?

@lucidrains
Owner

in linear attention, there are two tensors that are accumulated over the sequence, so you would just need to cache those https://github.com/lucidrains/performer-pytorch/blob/main/performer_pytorch/performer_pytorch.py#L168-L169
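Concretely, those two tensors are the running sum of the (feature-mapped) key/value outer products and the running sum of the keys. A hedged sketch of what an incremental-decoding cache over them might look like (names and shapes are illustrative, and q and k are assumed to have already been passed through the Performer feature map):

```python
import torch

class LinearAttentionCache:
    # Illustrative cache for one decoding step of causal linear attention.

    def __init__(self):
        self.kv_state = None  # running sum of k_j v_j^T  -> (batch, heads, d_k, d_v)
        self.k_state = None   # running sum of k_j        -> (batch, heads, d_k)

    def step(self, q, k, v, eps=1e-6):
        # q, k, v are the feature-mapped projections for the newest token: (batch, heads, dim)
        kv = torch.einsum('bhd,bhe->bhde', k, v)
        if self.kv_state is None:
            self.kv_state, self.k_state = kv, k
        else:
            # accumulate instead of recomputing the whole prefix every step
            self.kv_state = self.kv_state + kv
            self.k_state = self.k_state + k
        # numerator: q_t . sum_j k_j v_j^T, denominator: q_t . sum_j k_j
        num = torch.einsum('bhd,bhde->bhe', q, self.kv_state)
        den = torch.einsum('bhd,bhd->bh', q, self.k_state).unsqueeze(-1)
        return num / (den + eps)
```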

@lucidrains
Owner

i'll get around to it this week!

@lucidrains
Owner

lucidrains commented Apr 14, 2021

@ice-americano the big problem with linear attention in pytorch is the fact that everyone relies on this CUDA kernel written by EPFL. i need to write my own in numba so i can have more control over the changes
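For reference, a naive PyTorch sketch of what that kernel (CausalDotProduct from fast_transformers.causal_product) computes, useful only as a small-scale correctness check since it materializes the per-position prefix sums:

```python
import torch

def causal_dot_product_reference(q, k, v):
    # out_i = q_i . sum_{j <= i} (k_j v_j^T), i.e. the unnormalized causal linear attention product
    # q, k: (batch, heads, seq, d_k), v: (batch, heads, seq, d_v)
    kv = torch.einsum('bhnd,bhne->bhnde', k, v)  # per-position outer products
    kv = kv.cumsum(dim=2)                        # causal prefix sums over the sequence
    return torch.einsum('bhnd,bhnde->bhne', q, kv)
```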

@caffeinetoomuch
Author

What is EPFL? Also did you mean you are planning to rewrite causal_linear_attention in numba instead of using CausalDotProduct from fast_transformers.causal_product?
What are the advantages of using code in numba? Will it be faster?

Thanks for all the responses!

@lucidrains
Owner

@ice-americano it's just so we can experiment more with linear attention (https://developer.nvidia.com/cuda-python). i doubt it can get any faster than what EPFL already wrote. the code is just too much to build upon

can you confirm the slow down is when you try to generate from an autoregressive performer? i can fix it if so

@caffeinetoomuch
Author

Actually, installing from pip or building from source took a while, and that was probably due to the EPFL kernel compilation (I have only a shallow knowledge of CUDA kernels and libraries 😅).

We have fixed our code to use SelfAttention instead of FastAttention, and we may have been setting the wrong parameters before, since the performance and speed of the performer now look similar to what the paper reports. So I think you can close this issue for now, and thanks for the responsive feedback!

@lucidrains
Owner

ok! i'll work on the other issue (fast generation) - glad to hear the original issue is resolved!

@lh-gt

lh-gt commented Dec 6, 2022


@ice-americano Hi, I have run into the same problem. I used SelfAttention from performer to replace BERT's self-attention, and eval is slower. Could you share your config?
