
DeepSpeed + TPU support via transformer's Trainer #97

Open
minimaxir opened this issue Mar 2, 2021 · 4 comments

Comments

@minimaxir
Owner

minimaxir commented Mar 2, 2021

Currently, training via pytorch-lightning's implementation of DeepSpeed/TPUs is not working, and it's impossible to debug where the issues lie (i.e. within aitextgen, transformers, pytorch-lightning, or pytorch-xla) since the entire ecosystem is very fragile and error messages are unhelpful.

A short-term workaround is to use transformers' native Trainer for DeepSpeed + TPUs (and only those specific use cases for now), as it limits potential breakage and also serves as a baseline for pytorch-lightning's approach once that has stabilized.

The downside is that Trainer is not as good as pytorch-lightning UX-wise, but given that DeepSpeed + TPUs are a more niche use case for power users, that's acceptable.
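
For reference, here is a minimal sketch (not aitextgen code) of what that workaround could look like with a stock GPT-2 model. The toy dataset and the `ds_config.json` path are placeholders, and an actual run would be launched with the `deepspeed` launcher rather than plain `python` (TPU runs would instead go through an XLA launch script, without DeepSpeed):

```python
# Hypothetical sketch of the proposed workaround: drive DeepSpeed through
# transformers' own Trainer instead of pytorch-lightning. The dataset and
# "ds_config.json" are placeholders for illustration only.
from torch.utils.data import Dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, Trainer, TrainingArguments

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")


class ToyDataset(Dataset):
    """Tiny causal-LM dataset: labels are a copy of input_ids."""

    def __init__(self, texts):
        self.examples = [tokenizer(t, return_tensors="pt")["input_ids"][0] for t in texts]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        ids = self.examples[i]
        return {"input_ids": ids, "labels": ids.clone()}


args = TrainingArguments(
    output_dir="trained_model",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    deepspeed="ds_config.json",  # Trainer loads the DeepSpeed config from this path
)

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(["Hello world."] * 8))
trainer.train()
```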

@minimaxir
Owner Author

Now ZeRO-3 Offload is available, which in theory should be easier to implement (once it works with base Transformers).
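
If the Trainer path above lands, ZeRO-3 Offload should mostly be a config change. A rough, untested sketch of what that might look like, following DeepSpeed's documented `zero_optimization` schema (the specific values are illustrative assumptions, not a recommendation):

```python
# Illustrative ZeRO-3 Offload config, passed to the same Trainer setup as above.
# Values are assumptions for the sketch, not a tuned/tested configuration.
from transformers import TrainingArguments

ds_config = {
    # keep this consistent with per_device_train_batch_size below
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3: partition parameters as well as optimizer state
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

args = TrainingArguments(
    output_dir="trained_model",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,  # recent transformers versions accept a dict as well as a file path
)
```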

@minimaxir minimaxir pinned this issue Mar 14, 2021
@minimaxir minimaxir mentioned this issue Mar 14, 2021
@williamFalcon

thanks for highlighting this!

@SeanNaren can help get this solved on the PL side.

@SeanNaren
Contributor

Hey @minimaxir, what issues are you running into? If you're able to point to specific issues, I can help escalate/resolve them for PL!

ZeRO-3 Offload has its own quirks that both HuggingFace Transformers and we will need to figure out, so it may take a bit longer to integrate; however, we're working together on this where we can. We do have experimental support in place, and can give some pointers if you're keen to try :)

@tchaton

tchaton commented Mar 15, 2021

Dear @minimaxir,

Would you mind joining the PyTorch Lightning Slack? I sent you an invitation. Sean and I can coordinate efforts with you there to resolve your issues.

Best,
T.C
