prov/gni: fix a problem with progress #1347
Is this for manual or our pseudo-auto progress?
Are you saying that we used to progress before the vc refactor even in manual progress mode? Or is this auto progress? What happens to the SOS app? Does it fail before it starts calling cq read?
I've attached the SOS app. Note that it uses extensions to OpenSHMEM, but you can see what it's trying to do: post a whole bunch of non-blocking AMOs, then go into a shmem quiet.
Sung: this applies to either progress model. If one tries to post many millions of atomic ops, OOMs are observed.
Normally the progress happens in a separate thread, true? Is it ok to call gnix_nic_progress in the sender's thread?
No. Progress for small atomics doesn't happen from the progress thread unless you turn on zti's irq-for-every-op option, and then you get high latency and a nose-dive in message throughput rates.
So, I'm hearing this won't affect latency.
Should we have a similar addition to _gnix_sendv as well?
I'll add one for _gnix_sendv too. Good catch.
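For readers following the thread, here is a minimal toy model of the behavior being discussed. It is not the GNI provider code: `toy_nic`, `toy_post`, `toy_progress`, and `toy_run` are hypothetical names, and the sketch assumes only that a progress call retires queued requests. It shows why a backlog grows without bound when nothing progresses until the final quiet, and how progressing from the posting thread caps it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for per-NIC transmit state (hypothetical, not gnix structures). */
struct toy_nic {
    size_t pending;      /* requests queued but not yet completed */
    size_t max_pending;  /* high-water mark of the backlog */
    size_t completed;    /* requests retired so far */
};

/* Retire everything queued, as a progress call draining completions would. */
static void toy_progress(struct toy_nic *nic)
{
    nic->completed += nic->pending;
    nic->pending = 0;
}

/* Queue one request. If progress_on_post is nonzero, progress from the
 * posting thread once the backlog reaches the threshold. */
static void toy_post(struct toy_nic *nic, int progress_on_post, size_t threshold)
{
    nic->pending++;
    if (nic->pending > nic->max_pending)
        nic->max_pending = nic->pending;
    if (progress_on_post && nic->pending >= threshold)
        toy_progress(nic);
}

/* Post n_ops requests, then do a final drain (standing in for the quiet or
 * completion-queue read). Returns the backlog high-water mark. */
static size_t toy_run(size_t n_ops, int progress_on_post, size_t threshold)
{
    struct toy_nic nic = {0, 0, 0};
    for (size_t i = 0; i < n_ops; i++)
        toy_post(&nic, progress_on_post, threshold);
    toy_progress(&nic);
    assert(nic.completed == n_ops);
    return nic.max_pending;
}
```

In this model, `toy_run(1000000, 0, 64)` returns a high-water mark of 1000000, while `toy_run(1000000, 1, 64)` returns 64. The threshold of 64 is an arbitrary illustrative value, not something taken from the patch.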
Turns out that the VC refactor had another impact: it causes all data progress to be delayed until the app starts calling fi_cq_read, etc. This brings out all kinds of issues with one-sided programming models.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
cc7545f to 83b9c8f
Added similar code in the sendv path.
Turns out that the VC refactor had another impact: it causes all data progress to be delayed until the app starts calling fi_cq_read, etc. This brings out all kinds of issues with one-sided programming models.
upstream merge of ofi-cray#1347
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit ofi-cray/libfabric-cray@83b9c8f)
Turns out that the VC refactor had another impact:
it causes all data progress to be delayed until
the app starts calling fi_cq_read, etc. This
brings out all kinds of issues with one-sided
programming models.
Fix verified by SOS developers.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>