What does "NCCL WARN Cannot get incoming CPU" mean? #4

Open
nvlcambier opened this issue Feb 14, 2023 · 0 comments

I am trying out the fastsocket NCCL plugin on GCP (specifically, a GCE SLURM cluster built from 2x(8xA100) nodes with gVNICs). I see these warnings in the logs: "NCCL WARN Cannot get incoming CPU." and "NCCL WARN Maximum retry reached for accept 3." Do they mean something specific, or can they be safely ignored?

The code runs despite the warnings, although performance with and without the plugin looks very similar.

full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.

full-debug2-test-0:4300:4325 [0] net_fastsocket.cc:785 NCCL WARN Maximum retry reached for accept 3.

full-debug2-test-1:4024:4055 [0] net_fastsocket.cc:674 NCCL WARN Maximum retry reached for connect 3.
full-debug2-test-0:4300:4325 [0] NCCL INFO accept qid: 3, rqid: 3
full-debug2-test-0:4300:4325 [0] NCCL INFO accept incoming cpu: 0
full-debug2-test-0:4300:4325 [0] NCCL INFO NET/FastSocket : Connected after 1000 retries.
full-debug2-test-0:4300:4325 [0] NCCL INFO NET/FastSocket : Accepted data socket 3

full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU.
full-debug2-test-1:4024:4055 [0] NCCL INFO connect incoming cpu: 0
full-debug2-test-1:4024:4055 [0] NCCL INFO connect qid: 3, rqid: 3
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Connected after 1000 retries.
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Connected data socket 3

full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.
full-debug2-test-1:4024:4055 [0] NCCL INFO NET/FastSocket : Async connect done

full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU.

full-debug2-test-1:4024:4048 [0] net_fastsocket.cc:765 NCCL WARN Cannot get incoming CPU.

full-debug2-test-0:4300:4348 [0] net_fastsocket.cc:652 NCCL WARN Cannot get incoming CPU
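
In case it helps with triage, here is a minimal sketch for tallying the WARN lines per host and source location, to see whether each warning fires once per connection or repeatedly. It assumes the line layout shown in the log above (`host:pid:tid [dev] file:line NCCL WARN message`); the function name `tally_warnings` and the regular expression are my own and not part of NCCL or the fastsocket plugin.

```python
import re
from collections import Counter

# Assumed layout, taken from the log above:
#   host:pid:tid [dev] file:line NCCL WARN message
WARN_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.+?)\.?\s*$"
)

def tally_warnings(log_text):
    """Count NCCL WARN lines, keyed by (host, source location, message).

    A trailing period on the message is dropped so that truncated and
    complete copies of the same warning are counted together.
    """
    counts = Counter()
    for line in log_text.splitlines():
        m = WARN_RE.match(line.strip())
        if m:
            counts[(m["host"], m["src"], m["msg"])] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python tally_warnings.py < nccl_debug.log
    for (host, src, msg), n in tally_warnings(sys.stdin.read()).most_common():
        print(f"{n:5d}  {host}  {src}  {msg}")
```

INFO lines and blank lines are skipped, so the full debug output can be piped in unchanged.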