-
Notifications
You must be signed in to change notification settings - Fork 1
prov/off_s - SHARP provider scaffolding #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f3c43a1 to
a4e57c7
Compare
930766a to
92eb124
Compare
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 0 of 30 files reviewed, all discussions resolved
3a9fb96 to
a2f7709
Compare
The FI_PEER flag is sufficient for allocating peer objects. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
This is carry over from the tcp provider but unused. Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Bounce buffer copy overhead is high for ZE device memory. The rendezvous protocol takes advantage of GPU RDMA and performs better even for small messages. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 8 files at r1, 6 of 24 files at r3, 1 of 14 files at r5, 2 of 4 files at r6, all commit messages.
Reviewable status: 11 of 34 files reviewed, 2 unresolved discussions (waiting on @grom72)
include/ofi_sharp.h line 43 at r6 (raw file):
#include <ofi_atom.h> #include <ofi_proto.h>
It seems they all are not needed here ...
Code quote:
#include "config.h"
#include <stdint.h>
#include <stddef.h>
#include <sys/un.h>
#include <ofi_atom.h>
#include <ofi_proto.h>
#include <ofi_mem.h>
#include <ofi_rbuf.h>
#include <ofi_tree.h>
#include <ofi_hmem.h>
#include <rdma/providers/fi_prov.h>
#include "ofi_coll.h"include/rdma/fi_domain.h line 252 at r6 (raw file):
FI_SCATTER, FI_GATHER, FI_COLLECTIVE_OP_MAX, //must be always the last value of this enum
MAX /* must be always the last value of this enum */
Code quote:
MAX, //must be always the last value of this enumcoll_cq implementation can be reused by other collective providers. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…er is available Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…initialization It is rxm provider responsability to initialize collective offload provider's fabric. Otherwise collective offload functionality will not be available Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…IDER FI_OFFLOAD_PROVIDER environment variable shall be set to offload provider name to instruct libcabric to setup and use particular provider. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Peer provider must create peer_eq for offload provider, to allow offload provider reporting events to peer provider. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Offload provider may execute collective operations via util_coll provider. It must call fi_join() operation to get struct mc required for collective operations. It can only call fi_join() on it's peer provider (e.g. rxm). FI_PEER flag is used to inform peer provider to coll fi_join() operation for util_coll_ep Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
offload_coll_mask value is calculated based on the actual offload capabilities confirmed by fi_query_collective(). Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 36 files at r7, 1 of 1 files at r8, 15 of 30 files at r9, all commit messages.
Reviewable status: 22 of 37 files reviewed, 6 unresolved discussions (waiting on @grom72)
prov/rxm/src/rxm_ep.c line 394 at r9 (raw file):
//FI_PEER flag is used to force util_coll context //where fi_join() is called from offload provider
- C++ style comment
- space at the end
Code quote:
//FI_PEER flag is used to force util_coll context
//where fi_join() is called from offload providerprov/rxm/src/rxm_ep.c line 402 at r9 (raw file):
if (ret) goto err_util_coll; // It is collective offload provider responsibility to store util_coll provider mc
.
Code quote:
// It is collective offload provider responsibility to store util_coll provider mcprov/rxm/src/rxm_msg.c line 818 at r9 (raw file):
rxm_ep_sar_calc_segs_cnt(rxm_ep, data_len)); } else { rndv_send:
Empty line before label, not after
Code quote:
} else {
rndv_send:
ret = rxm_alloc_rndv_buf(rxm_ep, rxm_conn, context,prov/sharp/configure.m4 line 1 at r9 (raw file):
dnl Configury specific to the libfabric sharp provider
?
Code quote:
ConfigurySigned-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 22 of 37 files reviewed, 7 unresolved discussions (waiting on @grom72)
prov/sharp/src/sharp_domain.c line 124 at r10 (raw file):
/// mapped to int sharp_coll_finalize(struct sharp_coll_context *context); ret = ofi_domain_close(&domain->util_domain); if (ret)
Memory leak - domain is not freed.
Code quote:
if (ret)
return ret;
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 22 of 37 files reviewed, 6 unresolved discussions (waiting on @grom72)
prov/sharp/src/sharp_domain.c line 124 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
Memory leak - domain is not freed.
Maybe not a memory leak - if .close checks for return code and retries - I hope it does so ;-)
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
|
see #8 |
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer. Program received signal SIGABRT, Aborted. 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #0 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #1 0x0000155552d37db5 in abort () from /lib64/libc.so.6 #2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6 #3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6 #4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347 #5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354 #6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153 #7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188 #8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445 #9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558 ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91 ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212 Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Tomasz Gromadzki tomasz.gromadzki@intel.com
Requires #2
This change is