Refactor PyTorch model.generate method to work on TPU #18661
Comments
Hey @mikcnt, this sounds like a very cool project and I think we should focus on it sooner or later. I currently won't have the time to take a closer look here, but my advice would be:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, any updates on this? When can we expect the generate function to work on TPUs? Also, will it be part of transformers or optimum, as mentioned by @patrickvonplaten above?
Sadly, I won't have time to look into this anytime soon. @gante maybe?
Added to my @divyanshuaggarwal it would be part of
Thanks @gante!
Hi @gante, just noticed this has been marked WIP. Any ETA on when we can expect this feature?
This is not a prioritized feature, as you can already use TPUs for generation in Flax and TensorFlow. Since you can easily convert a model from one framework to the other, there is an easy workaround :-)
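For anyone landing here, a minimal sketch of that workaround could look like the following: load the PyTorch checkpoint into the TF model class and run XLA-compiled generation there. The model name, prompt, and lengths are just placeholders; this follows the documented TF/XLA generation pattern rather than anything TPU-specific in PT.

```python
# Sketch of the workaround: load the PT checkpoint into the TF class
# (from_pt=True) and run XLA-compiled generation there.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = TFAutoModelForCausalLM.from_pretrained("gpt2", from_pt=True)

# XLA wants static shapes, so pad every batch to a fixed length.
inputs = tokenizer(
    ["TPU generation is"], padding="max_length", max_length=32, return_tensors="tf"
)

# Wrap generate in a jit-compiled tf.function: the first call traces and
# compiles, later calls with the same input shapes reuse the compiled graph.
xla_generate = tf.function(model.generate, jit_compile=True)
outputs = xla_generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```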
Is there any update on this PR?
@deveworld we are atm exploring PT-level optimizations, which include the static shapes needed for XLA (TPU). A significant upgrade in this direction is likely in the next releases (keep an eye out :) )
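For anyone wondering what "static shapes" means in practice: XLA recompiles whenever tensor shapes change, so a common pattern is to pad inputs to a small set of fixed bucket lengths so compiled graphs can be reused. A rough, purely illustrative sketch; pad_to_bucket and the bucket sizes are made up for this example:

```python
# Illustrative sketch of the "static shapes" idea: pad prompts to the nearest
# fixed bucket length so a compiled graph is reused instead of being
# recompiled for every new prompt length.
from transformers import AutoTokenizer

BUCKETS = (32, 64, 128)  # made-up bucket sizes

def pad_to_bucket(texts, tokenizer):
    longest = max(len(tokenizer(t)["input_ids"]) for t in texts)
    target = next(b for b in BUCKETS if b >= longest)  # assumes prompts fit the largest bucket
    return tokenizer(texts, padding="max_length", max_length=target, return_tensors="pt")

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
batch = pad_to_bucket(["Static shapes keep XLA happy"], tokenizer)
print(batch["input_ids"].shape)  # (1, 32) -- reused for any prompt up to 32 tokens
```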
@gante folks from Meta were able to do Llama inference on TPU using PyTorch/XLA. Might be helpful for this issue.
Has there been any update on this? When is the next release likely?
We have some code ready, which makes the generation loop friendly with compiled forward passes. However, there are performance regressions on some devices, and the PyTorch team is having a look. We will include these changes when the performance bump is consistent across devices. Meanwhile, feel free to adapt code from this repo/PR.
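I'm not sure what the in-progress code looks like exactly, but the general pattern being described (a generation loop driving a compiled forward pass) can already be approximated today along these lines. The model choice and compile mode are assumptions, not the actual upcoming implementation:

```python
# Rough sketch, not the actual in-progress code: compile the model's forward
# pass and let generate() drive it. Without a static KV cache the growing
# sequence length can still trigger retracing, which is exactly the kind of
# issue the static-shape work is meant to solve.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Compile the forward pass; generate() then calls the compiled version at
# every decoding step.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Compiled decoding", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```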
I see. Will this work on TPU then, or are TPUs one of the devices experiencing performance regressions? I also looked into the Optimum Neuron greedy decode implementation. While it no longer requires moving computations to CPU, running inference on TPU with it seems significantly slower than on GPU.
@verityw I can't confirm. We are aiming to have models that are fully compatible and efficient to use with
Any update on this? I'm trying to work with I looked into it a bit and it seems that both mostly wrap the
Not far from seeing the light, actually! Our current major endeavor in
Any updates on this, @gante?
Yes: #27931 (it is a prerequisite :) )
Feature request
Refactor the PT version of the model.generate method for text-generating models to make it compatible with XLA and speed up inference on TPU.
Motivation
Right now, model.generate on PT is extremely slow on TPU compared to CPU and GPU. This is probably because some operations in the PT version of model.generate are not XLA compatible, so the generation process falls back on CPU, making inference on TPU infeasible. A major refactoring has already been done on its TF counterpart, so it would be nice to have the PT version working as well.
A more in-depth discussion with @gante took place in #12322 and in this huggingface discussion.
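One way to see the CPU fallbacks described above is torch_xla's metrics report: after running generation on an XLA device, counters whose names start with aten:: correspond to ops that could not be lowered to XLA and were executed on the host instead. A rough sketch, with the model choice arbitrary:

```python
# Sketch: run generate on an XLA (TPU) device and inspect torch_xla's metrics.
# Counters named aten::<op> are operations that fell back to the host CPU.
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
from transformers import AutoModelForCausalLM, AutoTokenizer

device = xm.xla_device()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("TPU generation is", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True))

print(met.metrics_report())  # look for aten:: counters -- those ops ran on CPU
```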
Your contribution
If there is some interest from the HF team, I can definitely assist with the work.