[WIP] support for shared memory provider of libfabric in SST + simple support for manual data progress via threading #3964

Open · wants to merge 34 commits into master

Conversation

franzpoeschel (Contributor) commented Dec 12, 2023

Background: for some of our intended SST workflows, we exchange data only within nodes, and going through the system network is an unnecessary detour in that case. Since libfabric provides a shared-memory provider, supporting it is relatively low-hanging fruit for enabling truly in-memory staging workflows with SST.

Necessary changes:

  • The most important change is that the shm provider requires manual data progress. This is hence also a follow-up to Adapt libfabric dataplane of SST to Cray CXI provider #3672: the CXI provider supported by that PR also technically requires manual data progress, but in practice works fine without it.
  • The FI_MR_BASIC registration mode prints an error, although interestingly it still works. This PR nevertheless replaces FI_MR_BASIC with the equivalent combination FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL (see the sketch after this list).
  • Some subtleties in address handling.
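For illustration, here is a minimal sketch (not the actual SST dataplane code) of what these two changes amount to when querying the shm provider through fi_getinfo(); the function name, API version, and capability bits are assumptions made for the example:

```cpp
#include <string.h>

#include <rdma/fabric.h>

// Illustrative only: query the shm provider with explicit MR mode bits
// (instead of the deprecated FI_MR_BASIC) and manual data progress.
static struct fi_info *query_shm_provider()
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = nullptr;

    hints->caps = FI_MSG | FI_RMA; // assumption: RMA-style staging
    hints->ep_attr->type = FI_EP_RDM;
    // Equivalent of FI_MR_BASIC, spelled out as individual mode bits:
    hints->domain_attr->mr_mode =
        FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY | FI_MR_LOCAL;
    // The shm provider requires the application to drive data progress:
    hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
    hints->fabric_attr->prov_name = strdup("shm");

    if (fi_getinfo(FI_VERSION(1, 18), nullptr, nullptr, 0, hints, &info) != 0)
        info = nullptr;
    fi_freeinfo(hints);
    return info;
}
```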

The manual data progress has turned out to be somewhat annoying. My idea was to spawn a thread that regularly pokes the provider (roughly the shape sketched below), but this approach does not work well with anything less than a busy loop.
Every call to fi_read() by the Reader requires an accompanying call to fi_cq_read() by the Writer, and fi_read() fails with EAGAIN until the writer has acknowledged the load request. At least with my current approach, this amounts to a ping-pong protocol: I tried decreasing latencies by processing fi_read() and fi_cq_read() in batches, but it made no difference; the provider only processes one load request at a time. As a consequence, the current implementation has extreme latencies, since it puts the progress thread to sleep before poking the provider again.
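For reference, the polling approach described above looks roughly like the following; the function name and loop structure are hypothetical simplifications, not this PR's actual code:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

// Hypothetical sketch of the polling idea: a background thread
// periodically calls fi_cq_read() so the shm provider makes progress on
// outstanding RMA operations.
static void progress_poll_loop(struct fid_cq *cq, std::atomic<bool> &stop)
{
    while (!stop.load())
    {
        struct fi_cq_data_entry entry;
        ssize_t rc = fi_cq_read(cq, &entry, 1); // non-blocking poke
        (void)rc; // -FI_EAGAIN just means "nothing completed yet"
        // Sleeping between pokes avoids a busy loop, but every fi_read()
        // on the reader side only completes after the writer's next poke,
        // so any non-trivial sleep turns directly into latency.
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}

// Usage: std::thread progress(progress_poll_loop, cq, std::ref(stopFlag));
```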

@eisenhauer mentioned in a video call that the control plane of SST implements a potential alternative approach based on file-descriptor polling.

A further benefit of implementing manual progress:
One of the most common issues with a badly configured installation of the libfabric dataplane is a hang. Having support in SST for manual progress might fix this in some instances.
I have observed that this PR also "unlocks" the tcp provider, which previously did not work.

Future potential / ideas:

Both of the following points probably require an immense amount of work. They are just some ideas that I wanted to put out here on how SST might be used to implement zero-overhead staging workflows:

  • This might be used to introduce a notion of memory hierarchy into SST: local data can be loaded via shared memory, while remote data is sent via the network. This might greatly decrease the load on the network for large-scale application runs.
    I imagine that this is probably not an easy change to implement, since the control plane would need to deal with two data planes at once.
  • Not sure if this is possible with libfabric's shared-memory provider, but it might be possible to use the zero-copy Engine::Get call void Engine::Get<T>(Variable<T>, T**) const; (currently used by the Inline engine) for data from the same node; see the sketch after this list.
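For context, here is a small usage sketch of that zero-copy Get overload as it exists today (variable name and type are placeholders); whether SST's libfabric dataplane could serve such a pointer out of a shared-memory segment is exactly the open question above:

```cpp
#include <adios2.h>

// Usage sketch of the existing zero-copy Get overload (placeholder
// variable name "data"); currently this path is backed by the Inline engine.
void ReadStep(adios2::IO &io, adios2::Engine &reader)
{
    reader.BeginStep();
    adios2::Variable<double> var = io.InquireVariable<double>("data");
    double *ptr = nullptr;
    // The engine hands back a pointer into its own buffer instead of
    // copying into user-provided memory.
    reader.Get(var, &ptr);
    // ... use ptr while the step is open ...
    reader.EndStep();
}
```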

TODO:

  • Lazy connecting of endpoints (endpoints might not be reachable in shm settings)
  • Parameterization: batch reading, threaded reading in UCX
  • Threaded reading in libfabric: make it depend on the PROGRESS_MANUAL parameter
  • No threading on the reader end

eisenhauer (Member) commented:

I can take a look at enabling a manual progress thread, or possibly using EVPath tools to tie progress to FDs if supported by CXI, but realistically I have one week before I disappear for two weeks, and given the other things on my plate the odds of this happening before January are unfortunately small.

WRT the future work notes: yes, supporting different data planes between different ranks is probably impractical given how SST is architected. It would have to be a single data plane that supports both transport mechanisms, which is still a lot of work but fits the way dataplanes integrate into SST. I've also long had in mind an extension to the data access mechanisms that might reduce the copy overheads for RDMA and shared-memory data access, but it involves several changes from the BP5Deserializer, through the engine, and down to the data plane, so it has remained on the to-do list for a long time. But it's something to re-examine at some point.

franzpoeschel (Contributor, Author) commented:

> I can take a look at enabling a manual progress thread, or possibly using EVPath tools to tie progress to FDs if supported by CXI, but realistically I have one week before I disappear for two weeks, and given the other things on my plate the odds of this happening before January are unfortunately small.

No problem, this is not urgent.
It turns out that the solution was simpler than I had expected. Instead of running fi_cq_read() (non-blocking) on the thread every five seconds, I now run fi_cq_sread() (blocking) on the thread with a timeout of five seconds, as sketched below. The shm provider becomes much more responsive with this change, and I expect that the control-plane-based solution may no longer be needed.
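A simplified sketch of the blocking variant (names hypothetical, not the exact code in this PR):

```cpp
#include <atomic>

#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

// Sketch of the revised progress loop: block in fi_cq_sread() with a
// timeout instead of polling and sleeping. The thread wakes as soon as a
// completion arrives, so the shm provider stays responsive; the 5 s
// timeout only bounds how long the loop takes to notice the stop flag.
static void progress_block_loop(struct fid_cq *cq, std::atomic<bool> &stop)
{
    while (!stop.load())
    {
        struct fi_cq_data_entry entry;
        ssize_t rc = fi_cq_sread(cq, &entry, 1, /* cond */ nullptr,
                                 /* timeout in ms */ 5000);
        (void)rc; // -FI_EAGAIN on timeout is expected and harmless
    }
}
```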

> WRT the future work notes: yes, supporting different data planes between different ranks is probably impractical given how SST is architected. It would have to be a single data plane that supports both transport mechanisms, which is still a lot of work but fits the way dataplanes integrate into SST. I've also long had in mind an extension to the data access mechanisms that might reduce the copy overheads for RDMA and shared-memory data access, but it involves several changes from the BP5Deserializer, through the engine, and down to the data plane, so it has remained on the to-do list for a long time. But it's something to re-examine at some point.

Thank you for the info. My main motivation in posting these ideas was to get a rough estimate of how viable they are to implement. It does not surprise me much that a lot of work would be required.

franzpoeschel (Contributor, Author) commented Dec 15, 2023

It seems that the current thread-based implementation runs into a libfabric bug, fixed by ofiwg/libfabric#9644. The bug means that a call to fi_cq_sread() made by the progress thread at the end of the simulation may never return, so finalizing the dataplane does not work.
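One conceivable mitigation, purely as a hedged sketch that has not been verified against this bug, would be to wake the blocked thread explicitly during finalization via fi_cq_signal() before joining it:

```cpp
#include <atomic>
#include <thread>

#include <rdma/fi_eq.h>

// Hypothetical shutdown sequence: set the stop flag, wake a thread that
// is parked inside fi_cq_sread() via fi_cq_signal(), then join. Whether
// this sidesteps the shm provider bug (rather than picking up the fix
// from ofiwg/libfabric#9644) has not been verified.
static void stop_progress_thread(struct fid_cq *cq, std::atomic<bool> &stop,
                                 std::thread &worker)
{
    stop.store(true);
    fi_cq_signal(cq); // unblocks a pending fi_cq_sread()
    worker.join();
}
```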

TODO: rather than doing this, initialize endpoints on demand only.