
Refactor Pytorch model.generate method to work on TPU #18661

Open
mikcnt opened this issue Aug 17, 2022 · 20 comments

mikcnt (Contributor) commented Aug 17, 2022

Feature request

Refactor the PyTorch version of model.generate for text-generating models so that it is XLA-compatible, speeding up inference on TPU.

Motivation

Right now, the PyTorch model.generate is extremely slow on TPU compared to CPU and GPU. This is probably because some operations in the PyTorch generation loop are not XLA-compatible, so execution falls back to the CPU, which makes inference on TPU infeasible. A major refactoring has already been done on the TF counterpart, so it would be nice to have the PyTorch version working as well.

A more in-depth discussion with @gante took place in #12322 and on this huggingface discussion.
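
As a minimal sketch of how to check for this (assuming a TPU VM with torch_xla installed; gpt2 is used purely as an example checkpoint), PyTorch/XLA's debug counters can reveal which operations fell back to CPU during a generate call:

```python
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
from transformers import AutoModelForCausalLM, AutoTokenizer

device = xm.xla_device()  # TPU device exposed by PyTorch/XLA
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Counters named "aten::<op>" mark operations that were not lowered to XLA
# and therefore ran on the CPU.
fallbacks = [name for name in met.counter_names() if name.startswith("aten::")]
print("Ops that fell back to CPU:", fallbacks)
```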

Your contribution

If there is interest from the HF team, I can definitely assist with the work.

gante (Member) commented Aug 17, 2022

cc @patrickvonplaten

patrickvonplaten (Contributor) commented:

Hey @mikcnt,

This sounds like a very cool project, and I think we should focus on it sooner or later. I currently don't have time to take a closer look, but my advice would be:

  • I think you're totally right that PyTorch/XLA often falls back to CPU, which is why it is so slow. We're luckier with Jax and TF here, because when things fall back to CPU there, the code simply fails.
  • It'll take some time to get this fully working, so we should start with the easiest example: see what code changes are necessary to make PyTorch/XLA work with greedy(...) (a sketch of what that could look like follows below).
  • To set expectations: PyTorch's generate method is one of Transformers' most used functions. It's extremely important, and we're trying very hard to keep the code readable and easy to understand. If making it XLA-compatible requires too many changes or makes the code too unreadable, we might conclude that it's just not worth it and instead add it as an "experimental" additional function rather than in the main generate. Also, @michaelbenayoun @mfuntowicz, is that maybe something we want to have only in optimum but not in Transformers?
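
To make the greedy(...) suggestion concrete, here is a hypothetical sketch (not the transformers implementation) of the general shape an XLA-friendly greedy loop takes: the output buffer is pre-allocated at a fixed max_length, and each step writes into it in place, so every iteration sees identical tensor shapes and no torch.cat is needed.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, pad_token_id, max_length=64):
    """Greedy decoding with static shapes: the sequence buffer never grows."""
    batch, prompt_len = input_ids.shape
    # Pre-allocate a fixed-size buffer and copy the prompt into it.
    sequences = torch.full(
        (batch, max_length), pad_token_id, dtype=torch.long, device=input_ids.device
    )
    sequences[:, :prompt_len] = input_ids
    positions = torch.arange(max_length, device=input_ids.device)

    for step in range(prompt_len, max_length):
        # Mask out the positions that have not been written yet.
        attention_mask = (positions < step).long().unsqueeze(0).expand(batch, -1)
        logits = model(input_ids=sequences, attention_mask=attention_mask).logits
        # The logits at the last real position predict the token for `step`.
        next_token = logits[:, step - 1, :].argmax(dim=-1)
        # In-place index write instead of torch.cat keeps the shape static.
        sequences[:, step] = next_token
    return sequences
```

A real implementation would also need a fixed-size key/value cache (discussed later in this thread) so that each step does not re-run the full sequence through the model.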

github-actions (bot) commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

divyanshuaggarwal commented Sep 24, 2022

Hi,

Any updates on this? When can we expect the generate function to work on TPUs? Also, will it be part of transformers or optimum, as @patrickvonplaten mentioned above?

patrickvonplaten (Contributor) commented Sep 27, 2022

Sadly, I won't have time to look into this anytime soon. @gante, maybe?

gante (Member) commented Sep 28, 2022

Added to my generate task queue 👍

@divyanshuaggarwal it would be part of transformers!

divyanshuaggarwal commented:

Thanks @gante!

huggingface deleted a comment from github-actions bot Oct 24, 2022
huggingface deleted a comment from github-actions bot Nov 17, 2022
gante self-assigned this Nov 17, 2022
gante added the WIP label Nov 17, 2022
divyanshuaggarwal commented:

Hi @gante, I just noticed this has been marked WIP. Any ETA on when we can expect this feature?

sgugger (Collaborator) commented Dec 12, 2022

This is not a priority, as you can already use TPUs for generation in Flax and TensorFlow. Since you can easily convert a model from one framework to another, there is an easy workaround :-)
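
For anyone looking for that workaround, a rough sketch of what it can look like (gpt2 and the generation settings are placeholders): load the PyTorch weights into the TensorFlow class with from_pt=True, wrap generate in an XLA-compiled tf.function, and pad to a fixed length so XLA compiles only once.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Left padding is the usual choice for decoder-only generation.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# from_pt=True converts the PyTorch checkpoint on the fly.
model = TFAutoModelForCausalLM.from_pretrained("gpt2", from_pt=True)

# jit_compile=True runs generation through XLA (TPU, GPU, or CPU).
xla_generate = tf.function(model.generate, jit_compile=True)

# Padding to a fixed length keeps input shapes constant and avoids retracing.
inputs = tokenizer(
    ["The quick brown fox"], return_tensors="tf", padding="max_length", max_length=32
)
outputs = xla_generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```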

deveworld commented:

Is there any update on this PR?

gante (Member) commented Jun 12, 2023

@deveworld we are currently exploring PyTorch-level optimizations, which include the static shapes needed for XLA (TPU). A significant upgrade in this direction is likely in the next few releases (keep an eye out :) )

divyanshuaggarwal commented:

@gante folks from Meta were able to run Llama inference on TPU using PyTorch/XLA. It might be helpful for this issue.

https://pytorch.org/blog/path-achieve-low-inference-latency/?utm_content=254892693&utm_medium=social&utm_source=linkedin&hss_channel=lcp-78618366

verityw commented Aug 11, 2023

Has there been any update on this? When is the next release likely to come out?

gante (Member) commented Aug 16, 2023

We have some code ready that makes the generation loop compatible with compiled forward passes (e.g. with torch.compile). It is pretty much the same algorithm we use with TF/Flax + XLA.

However, there are performance regressions on some devices, and the PyTorch team is having a look. We will include these changes when the performance bump is consistent across devices.

Meanwhile, feel free to adapt code from this repo/PR.
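
As a rough illustration of that direction (this is not the code referenced above, and depending on the torch/transformers versions the loop may still trigger recompilations), compiling the forward pass and then generating looks roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Compile only the forward pass; generate() will then call the compiled version.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```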

verityw commented Aug 16, 2023

I see. Will this work on TPU, then? Are TPUs one of the devices experiencing performance regressions?

I also looked into the Optimum Neuron greedy decode implementation. While it no longer requires moving computations to CPU, running inference on TPU with it seems significantly slower than on GPU.

gante (Member) commented Aug 17, 2023

@verityw I can't confirm. We are aiming to have models that are fully compatible and efficient with torch.compile(); there may be additional issues when selecting the XLA backend :)

paulbricman commented:

Any update on this? I'm trying to work with trl and peft on a TPU slice (to run tests on yet another HF-aspiring lib), but these newer parts of the ecosystem currently seem to support only torch, which is not supported in an XLA-friendly way in the underlying transformers.

I looked into it a bit, and it seems that both mostly wrap the transformers generate(), so maybe an XLA-friendly version of that would help throughout? I also expect to encounter other XLA-awkwardness in the backward step of trl, but I don't have a good intuition about that. I would love any pointers on what it takes to make them XLA-friendly and how far the stack is from that.

gante (Member) commented Nov 7, 2023

Not far from seeing the light, actually!

Our current major endeavor in generate is enabling different types of caches. By default, caches grow with the input length, but XLA needs a fixed-size cache -- we will be adding one as part of this task. In turn, this should make the forward pass of most models XLA-compatible (or close to it).
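
To illustrate the fixed-size cache idea in isolation (a hypothetical sketch, not the transformers API): the key/value tensors are allocated once at the maximum length, and each decoding step writes new entries into their slots instead of concatenating, so the shapes the compiler sees never change.

```python
import torch

class FixedSizeKVCache:
    """Pre-allocated per-layer key/value storage; shapes never change after init."""

    def __init__(self, batch, num_heads, max_length, head_dim, device, dtype):
        shape = (batch, num_heads, max_length, head_dim)
        self.keys = torch.zeros(shape, device=device, dtype=dtype)
        self.values = torch.zeros(shape, device=device, dtype=dtype)

    def update(self, new_keys, new_values, position):
        # In-place writes instead of torch.cat keep the cache shape static.
        num_new = new_keys.shape[2]
        self.keys[:, :, position:position + num_new] = new_keys
        self.values[:, :, position:position + num_new] = new_values
        return self.keys, self.values
```

Attention then runs over the full max_length buffer with a mask that hides the slots that have not been written yet, which is what keeps every decoding step the same shape.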

mmcclean-aws commented:

Any updates on this @gante ?

gante (Member) commented Jan 12, 2024

Yes: #27931 (it is a prerequisite :) )
