Training fails on Vertex AI (GCP) due to NCCL error on A100 GPUs #998

Open
arturnn opened this issue Jun 22, 2023 · 2 comments

@arturnn
Contributor

arturnn commented Jun 22, 2023

Bug description

Training does not start on Vertex AI when using more than one A100 GPU with NCCL; it aborts with an unhandled system error. So far the problem occurs only on A100 GPUs, probably because of GPU partitioning in the K8s cluster. Building against the most recent NCCL master (https://github.com/NVIDIA/nccl) seems to fix the issue. Could we update the Marian NCCL fork (https://github.com/marian-nmt/nccl), or could that break something else? @snukky, what are your thoughts?

Sample log:

[2023-04-21 19:17:31] Error: NCCL error 2 'unhandled system error' - /marian/src/training/communicator_nccl.h:43: ncclGroupEnd()
[2023-04-21 19:17:31] Error: Aborted from void marian::NCCLCommunicator::groupEnd() const in /marian/src/training/communicator_nccl.h:43
[CALL STACK]
[0x561c348a40d2] marian::NCCLCommunicator:: groupEnd () const + 0x222
[0x561c348ab90a] marian::NCCLCommunicator:: NCCLCommunicator (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, marian::ShardingMode, std::shared_ptr<marian::IMPIWrapper>) + 0x2c4a
[0x561c3489203b] marian:: createCommunicator (std::vector<std::shared_ptr<marian::ExpressionGraph>,std::allocator<std::shared_ptr<marian::ExpressionGraph>>> const&, bool, marian::ShardingMode, std::shared_ptr<marian::IMPIWrapper>) + 0x44b
[0x561c3482438d] marian::GraphGroup:: GraphGroup (std::shared_ptr<marian::Options>, std::shared_ptr<marian::IMPIWrapper>) + 0x4fd
[0x561c347fd4e4] marian::SyncGraphGroup:: SyncGraphGroup (std::shared_ptr<marian::Options>, std::shared_ptr<marian::IMPIWrapper>) + 0x74
[0x561c3433af33] marian::Train<marian::SyncGraphGroup>:: run () + 0x333
[0x561c34260fe7] mainTrainer (int, char**) + 0x157
[0x561c3421931c] main + 0x3c
[0x7fc4a2cce083] __libc_start_main + 0xf3
[0x561c3425fc6e] _start + 0x2e 
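
When comparing logs from several runs, the NCCL error line can be picked apart mechanically. The following is a minimal sketch that assumes the line format shown in the sample log above; `parse_nccl_error` is a hypothetical helper and not part of Marian.

```python
import re

# Matches lines of the shape seen in the sample log, e.g.
# "... Error: NCCL error 2 'unhandled system error' - /path/file.h:43: ncclGroupEnd()"
NCCL_ERROR_RE = re.compile(
    r"NCCL error (?P<code>\d+) '(?P<message>[^']*)' - "
    r"(?P<file>\S+):(?P<line>\d+): (?P<call>.+)$"
)


def parse_nccl_error(log_line):
    """Return the fields of an NCCL error line as a dict, or None if no match."""
    match = NCCL_ERROR_RE.search(log_line.strip())
    if match is None:
        return None
    return {
        "code": int(match.group("code")),
        "message": match.group("message"),
        "file": match.group("file"),
        "line": int(match.group("line")),
        "call": match.group("call"),
    }
```

Applied to the first line of the sample log, this yields the error code, its message, and the source location of the failing NCCL call; lines without an NCCL error, such as the "Aborted from" line, return `None`.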

With NCCL debug logging enabled, the following warning appears on A100 GPUs; it does not appear with, e.g., V100 GPUs:
graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
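
The warning suggests that a sysfs path built from a PCI bus ID does not exist inside the container. The sketch below checks this for a given bus ID. It assumes the path shape printed in the warning (`<domain:bus>/../../<bus id>` under `/sys/class/pci_bus`) and a bus ID in the short `domain:bus:device.function` form; the helper names are hypothetical, and this is not NCCL's own code.

```python
import os


def pci_sysfs_path(bus_id, sysfs_root="/sys/class/pci_bus"):
    """Build a path of the shape shown in the warning for a PCI bus ID.

    Assumes bus_id looks like "0000:00:04.0" (domain:bus:device.function).
    """
    bus_id = bus_id.lower()
    domain_bus = bus_id.rsplit(":", 1)[0]  # "0000:00:04.0" -> "0000:00"
    return os.path.join(sysfs_root, domain_bus, "..", "..", bus_id)


def pci_path_resolves(bus_id, sysfs_root="/sys/class/pci_bus"):
    """Return True if the sysfs path for bus_id exists on this machine."""
    return os.path.exists(pci_sysfs_path(bus_id, sysfs_root))
```

Calling `pci_path_resolves` with the bus ID of each visible GPU (for example, one taken from `nvidia-smi`, shortened to a four-digit domain) on both an A100 node and a V100 node would show whether the sysfs entries differ between the two environments.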

@arturnn arturnn added the bug label Jun 22, 2023
@snukky
Member

snukky commented Jun 22, 2023

I don't see any problems with updating NCCL if it helps. It seems we use vanilla NCCL (NVIDIA/nccl@master...marian-nmt:nccl:master) so updating shouldn't be problematic. Would you like to open a PR?

@TommyJonathanSinaga TommyJonathanSinaga mentioned this issue Jun 22, 2023
4 tasks
@arturnn
Contributor Author

arturnn commented Jun 22, 2023

Sure, here it is @snukky: marian-nmt/nccl#1 (after that we need to update the submodule in marian-dev). Alternatively, I could open a PR that points the submodule in marian-dev at the main NVIDIA repo instead of the fork.
