Implement tensor parallelism #17
Turns out…
Progress: disabling gradient clipping and fused AdamW actually makes it work (even with torch.compile!) 🎉🎉🎉
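Roughly, that change amounts to something like the sketch below (placeholder model and hyperparameters for illustration, not the exact code from train_gpt2_tp.py): swap the fused AdamW kernel for the plain one and drop the clip_grad_norm_ call.

```python
import torch
import torch.nn as nn

# Placeholder model; in the repo this would be the GPT model wrapped with tensor parallelism.
model = nn.Linear(128, 128)

# Fall back to the non-fused AdamW — the fused kernel was one of the two
# things that had to be disabled to get tensor parallelism working.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, fused=False)

x = torch.randn(4, 128)
loss = model(x).sum()
loss.backward()

# Note: no torch.nn.utils.clip_grad_norm_ call here — gradient clipping was
# the other thing that had to be disabled.
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```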
Almost there, but I've discovered some weird behaviour. On the bright side, I'm getting just shy of 150k tok/sec on a 2x 3090 configuration and can now work with a model that wouldn't fit on a single 3090. Bad news for 8x A100 users - you have too many GPUs to shard the 12 attention heads 🤣

#protip: I didn't know it was possible to use the VS Code debugger with distributed workloads. Turns out you can, and all the [conditional] breakpoints, stepping into libraries etc. work like a dream! Here's my setup:
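(The actual launch config isn't shown here; one way to get per-rank breakpoints working — an assumed setup, not necessarily the one used — is to have each rank listen with debugpy and attach from a VS Code "Python: Remote Attach" configuration.)

```python
import os
import debugpy

# Each rank listens on its own port; attach from VS Code with a
# "Remote Attach" configuration pointing at localhost:<5678 + rank>.
rank = int(os.environ.get("LOCAL_RANK", "0"))
debugpy.listen(("localhost", 5678 + rank))

# Optionally block until the debugger attaches, e.g. only on rank 0:
if rank == 0:
    debugpy.wait_for_client()
```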
Ok, all done, except it's disappointingly slow! 😮 It's quite possible I've messed something up, so if anybody notices anything, please do let me know. The repo is available at https://github.com/marib00/build-nanogpt and the file of interest is train_gpt2_tp.py - I didn't touch any of the other files. I've added some benchmarks to README.md.
I thought tensor parallelism would be an interesting thing to try. There's a tutorial for it and even some code examples, but so far no joy.
I started simple, trying to shard the MLP like this:
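(The original snippet isn't preserved here; a minimal sketch of what sharding the nanoGPT-style MLP with PyTorch's tensor-parallel API might look like — module names and mesh size are assumptions, not the actual code:)

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    parallelize_module,
    ColwiseParallel,
    RowwiseParallel,
)

class MLP(nn.Module):
    """GPT-2 style MLP, as in build-nanogpt."""
    def __init__(self, n_embd: int = 768):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

# Run under torchrun, e.g.: torchrun --nproc_per_node=2 this_script.py
mesh = init_device_mesh("cuda", (2,))

mlp = MLP().cuda()
# Shard c_fc column-wise and c_proj row-wise, so the intermediate activation
# stays sharded and only one all-reduce is needed at the output.
parallelize_module(
    mlp,
    mesh,
    {"c_fc": ColwiseParallel(), "c_proj": RowwiseParallel()},
)
```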
But PyTorch (nightly) gives me grief:
As a quick fix I tried converting what I thought were DTensors to local tensors:
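(Again the exact snippet is missing; the workaround described presumably looked something like a helper that calls to_local() on anything that turns out to be a DTensor — a guess at the shape of the fix, not the actual code:)

```python
# Public path on recent PyTorch; on older versions DTensor lived under
# torch.distributed._tensor.
from torch.distributed.tensor import DTensor

def to_local_if_dtensor(t):
    """Convert a DTensor to its local shard; leave regular tensors untouched."""
    return t.to_local() if isinstance(t, DTensor) else t

# e.g. applied to gradients before clipping / the optimizer step:
# grads = [to_local_if_dtensor(p.grad) for p in model.parameters() if p.grad is not None]
```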
but then I get even more grief 🤦♂️:
Any ideas? 🙏