Spurious all gather performance drop. #384

Open
etiennemlb opened this issue Apr 29, 2024 · 0 comments

I am benchmarking a machine with the 175B GPT-3 case, running on 32 nodes equipped with 4 MI250X each. The interconnect is Slingshot (the one on Frontier, ORNL's machine).

I get spurious all-gather slowdowns that always occur during the optimizer step, never during the forward or backward pass. Any given iteration may or may not be slow, even when the same nodes are used across runs.

The slowdown is roughly 50x: the all-gather takes about 30 s instead of ~0.6 s. It happens in
DeepSpeedZeroOptimizer_Stage3::step() -> _post_step -> persistent_parameters[0].all_gather.
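
To make "roughly 50x" concrete, here is a minimal sketch of how slow iterations could be flagged from a list of per-iteration all-gather timings. The helper name `flag_slow_steps`, the `factor` threshold, and the sample timings are all hypothetical (only the 30 s and ~0.6 s figures come from the runs described above); it is an illustration, not the measurement code used for this report.

```python
from statistics import median


def flag_slow_steps(durations_s, factor=10.0):
    """Return (iteration, duration, slowdown) for steps slower than factor x the median."""
    if not durations_s:
        return []
    baseline = median(durations_s)
    return [
        (i, d, d / baseline)
        for i, d in enumerate(durations_s)
        if d > factor * baseline
    ]


# Hypothetical per-iteration all-gather timings in seconds (illustrative, not measured data).
timings = [0.61, 0.59, 30.0, 0.60, 0.62]
for iteration, duration, slowdown in flag_slow_steps(timings):
    print(f"iteration {iteration}: {duration:.1f} s, about {slowdown:.0f}x the median")
```

Using the median as the baseline keeps a single 30 s outlier from distorting the reference value, which a plain mean would not.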

I use ZeRO-3 and activation checkpointing, RCCL (NCCL), Adam, and no pipeline or tensor parallelism.
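
For readers unfamiliar with this setup, the sketch below shows what a DeepSpeed configuration of this general shape might look like. Every value is a placeholder and an assumption, not the configuration used in these runs; in particular, activation checkpointing may be enabled through the training script rather than this file. `stage3_param_persistence_threshold` is included because it controls which parameters are treated as persistent, which is the group the slow all-gather operates on.

```python
import json

# Hypothetical DeepSpeed config sketch: all values are placeholders,
# not the settings from the runs described in this issue.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3: partition parameters, gradients and optimizer state
        # Parameters with fewer elements than this are not partitioned (kept persistent).
        "stage3_param_persistence_threshold": 100000,
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},  # placeholder learning rate
    "activation_checkpointing": {"partition_activations": False},
}

# Serialize to the JSON form DeepSpeed expects in a config file.
print(json.dumps(ds_config, indent=2))
```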

Did I overlook a setting? Has anyone experienced something similar on Cray Slingshot/AMD hardware?
