Syndrom 0x51 #173

Open
farhanma opened this issue Oct 26, 2022 · 3 comments

Comments

@farhanma

When I run the perftest bidirectional or unidirectional bandwidth test between two GPUs communicating over a PCIe link, I get the following error after a number of iterations:

Completion with error at client
  Failed status 4: wr_id 0 syndrom 0x51
scnt=828, ccnt=700
  Failed to complete run_iter_bw function successfully

After a bit of Googling, I found that this error is somehow related to ibv_post_send and ibv_poll_cq, which are functions in the RDMA verbs library (libibverbs) rather than Linux system calls. Has anyone encountered this error before? Thank you.
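For anyone reading the log above, here is a minimal sketch that decodes the numeric status into its libibverbs name. It assumes the three fields in the "Failed status" line are the `status`, `wr_id`, and `vendor_err` members of the work completion returned by ibv_poll_cq; that mapping is my reading of the message, not something confirmed in this thread. The table is a subset of `enum ibv_wc_status` from `<infiniband/verbs.h>`, and `decode_perftest_error` is a hypothetical helper name. Under that assumption, status 4 is `IBV_WC_LOC_PROT_ERR` (a local protection error), while the syndrome value 0x51 is vendor specific and is not interpreted here.

```python
import re

# Subset of enum ibv_wc_status from <infiniband/verbs.h>, indexed by value.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
    "IBV_WC_RETRY_EXC_ERR",      # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",  # 13
]

# Matches the error line exactly as it is spelled in the log ("syndrom").
_PATTERN = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def decode_perftest_error(line):
    """Hypothetical helper: parse one 'Failed status' line from the log.

    Returns a dict with the numeric status, its enum name (or "UNKNOWN"
    if the value is outside the subset above), the wr_id, and the
    vendor-specific syndrome as an integer. Returns None on no match.
    """
    m = _PATTERN.search(line)
    if m is None:
        return None
    status = int(m.group(1))
    name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "UNKNOWN"
    return {
        "status": status,
        "status_name": name,
        "wr_id": int(m.group(2)),
        "vendor_err": int(m.group(3), 16),
    }


if __name__ == "__main__":
    print(decode_perftest_error("  Failed status 4: wr_id 0 syndrom 0x51"))
```

A local protection error generally points at the memory registration or keys used by the posted work request on the reporting side, which may be worth checking for the GPU buffers, though the thread does not establish the root cause.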

@sshaulnv
Contributor

sshaulnv commented Nov 2, 2022

We also encountered this problem.

  1. Which GPU are you using?
  2. What exact commands did you use on both sides?
  3. Have you tried using 'use_cuda' on only one side?

Thanks

@farhanma
Author

farhanma commented Nov 9, 2022

  1. NVIDIA A100-SXM4-80GB W HS
  2. Commands:
### master
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id>

### slave
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id> <master_ip_address>
  3. No, I have not. I can try that and update the GitHub issue.

@sshaulnv
Contributor

sshaulnv commented Dec 7, 2022
