Any tips for speeding up generation? #21

Because of the autoregressive nature of Transformers, I know that they are fairly slow when generating new sequences from scratch, but I was wondering if you have any tips or tricks for faster inference, or whether you plan to add some of the tricks that avoid full computation, like the ones used by Hugging Face: https://huggingface.co/blog/accelerated-inference

Thank you very much for your amazing work!

Comments
@pabloppp Oh hey Pablo! Are you using the repository in production? Yeah, I can make the inference fast (by adding caching of key / values, standard practice).
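For reference, key/value caching means that at each decoding step only the newly generated token's query, key, and value are computed; the keys and values from earlier steps are kept around and reused, so the prefix is never recomputed. A minimal sketch of the idea in plain PyTorch (the `kv_cache` structure and the function name are illustrative, not the repository's actual API):

```python
import torch
import torch.nn.functional as F

def attend_with_cache(x_new, w_q, w_k, w_v, kv_cache):
    """One causal self-attention step for a single new token.

    x_new:    (batch, 1, dim) embedding of the newly generated token
    kv_cache: dict holding keys / values accumulated from previous steps
    """
    q = x_new @ w_q  # (batch, 1, dim)
    k = x_new @ w_k
    v = x_new @ w_v

    # append the new key / value instead of recomputing the whole prefix
    kv_cache['k'] = torch.cat([kv_cache['k'], k], dim=1) if 'k' in kv_cache else k
    kv_cache['v'] = torch.cat([kv_cache['v'], v], dim=1) if 'v' in kv_cache else v

    # the single new query attends over all cached keys / values;
    # no causal mask is needed since only the last position is queried
    scores = q @ kv_cache['k'].transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)
    return attn @ kv_cache['v']  # (batch, 1, dim)

# usage: during generation, pass only the newest token embedding each step
dim = 64
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
cache = {}
for _ in range(10):
    x_new = torch.randn(1, 1, dim)
    out = attend_with_cache(x_new, w_q, w_k, w_v, cache)
```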
@pabloppp The fastest speedup you'll get is to train a vanilla transformer and then fine-tune it with Performer linear attention (https://github.com/lucidrains/performer-pytorch); that's probably the ultimate trick.
What do you mean by 'fine-tune'? Training a vanilla transformer, then replacing the attention layers with Performer attention layers and doing some more training?
Yes, exactly!
I will try that, thanks! Any idea what the expected speedup could be?
In short, it will be as fast as if you had an RNN.
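As a rough illustration of the "train vanilla, then fine-tune with Performer attention" idea, one can swap the attention modules of an already-trained model for performer-pytorch's `SelfAttention` and resume training. The layer names below (`model.layers`, `layer.attn`) are hypothetical placeholders for whatever your own model uses; check the performer-pytorch README for the exact constructor arguments.

```python
from performer_pytorch import SelfAttention  # https://github.com/lucidrains/performer-pytorch

def swap_in_performer_attention(model, dim=512, heads=8):
    """Replace each vanilla self-attention module with a Performer one.

    `model.layers` and `layer.attn` are placeholders -- adapt them to
    however your own Transformer names its submodules.
    """
    for layer in model.layers:
        layer.attn = SelfAttention(
            dim = dim,      # model dimension, must match the original attention
            heads = heads,
            causal = True,  # keep the autoregressive mask for a decoder
        )
    return model

# after swapping, fine-tune for a while so the rest of the network
# adapts to the (approximate) linear attention:
# model = swap_in_performer_attention(pretrained_model)
# ... continue the usual training loop on model ...
```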
@lucidrains Thanks for your awesome work! Can you explain a bit why you recommend training a vanilla transformer and then fine-tuning, rather than training a Performer from scratch?
@stas-sl Performers scale very efficiently at longer sequence lengths (roughly 1500+), but they lose that advantage for short sequences. This is especially true for the softmax Performer, which is the version that's directly compatible with vanilla Transformers. For the softmax Performer, the constant costs of calculating the attention can cause it to be even slower than a Transformer during training. Hope that helps!
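To see the crossover described above, one can time ordinary softmax attention against performer-pytorch's `FastAttention` at a short and a long sequence length. This is only a sketch of how one might measure it (the `FastAttention` arguments follow the performer-pytorch README; double-check them against the current version), not a benchmark result from this thread.

```python
import time
import torch
from performer_pytorch import FastAttention

def softmax_attention(q, k, v):
    # vanilla attention: the (seq x seq) score matrix makes this quadratic in sequence length
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Performer attention: random-feature approximation, linear in sequence length
performer_attn = FastAttention(dim_heads = 64, nb_features = 256, causal = False)

for seq_len in (256, 4096):
    # (batch, heads, seq, dim_head), heads already split out
    q = torch.randn(1, 8, seq_len, 64)
    k = torch.randn(1, 8, seq_len, 64)
    v = torch.randn(1, 8, seq_len, 64)

    t0 = time.time(); softmax_attention(q, k, v); t_soft = time.time() - t0
    t0 = time.time(); performer_attn(q, k, v);    t_perf = time.time() - t0
    print(f'seq_len={seq_len}: softmax {t_soft:.3f}s vs performer {t_perf:.3f}s')
```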
@stas-sl What Tom said :) @pabloppp Relevant to your interests: https://arxiv.org/abs/2103.13076
Awesome, thanks!
@lucidrains

> I can make the inference fast (by adding caching of key / values, standard practice)

Can you please explain how to make inference fast by adding caching of keys / values?
I am curious why the keys and values can be cached. Aren't the keys and values changed globally (except in the first decoder layer) after a new token id is produced?