Skip to content

Conversation

@grom72
Copy link

@grom72 grom72 commented Nov 25, 2022

Signed-off-by: Tomasz Gromadzki tomasz.gromadzki@intel.com

Requires #2


This change is Reviewable

@grom72 grom72 force-pushed the off_s branch 8 times, most recently from f3c43a1 to a4e57c7 Compare November 28, 2022 21:29
@grom72 grom72 force-pushed the off_s branch 6 times, most recently from 930766a to 92eb124 Compare November 29, 2022 10:42
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 30 files reviewed, all discussions resolved

@grom72 grom72 force-pushed the off_s branch 3 times, most recently from 3a9fb96 to a2f7709 Compare December 2, 2022 14:42
Sean Hefty and others added 3 commits December 3, 2022 13:04
The FI_PEER flag is sufficient for allocating peer objects.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
This is carry over from the tcp provider but unused.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Bounce buffer copy overhead is high for ZE device memory. The rendezvous
protocol takes advantage of GPU RDMA and performs better even for small
messages.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 8 files at r1, 6 of 24 files at r3, 1 of 14 files at r5, 2 of 4 files at r6, all commit messages.
Reviewable status: 11 of 34 files reviewed, 2 unresolved discussions (waiting on @grom72)


include/ofi_sharp.h line 43 at r6 (raw file):

#include <ofi_atom.h>
#include <ofi_proto.h>

It seems they all are not needed here ...

Code quote:

#include "config.h"

#include <stdint.h>
#include <stddef.h>
#include <sys/un.h>

#include <ofi_atom.h>
#include <ofi_proto.h>
#include <ofi_mem.h>
#include <ofi_rbuf.h>
#include <ofi_tree.h>
#include <ofi_hmem.h>

#include <rdma/providers/fi_prov.h>

#include "ofi_coll.h"

include/rdma/fi_domain.h line 252 at r6 (raw file):

	FI_SCATTER,
	FI_GATHER,
	FI_COLLECTIVE_OP_MAX, //must be always the last value of this enum

MAX /* must be always the last value of this enum */

Code quote:

MAX, //must be always the last value of this enum

coll_cq implementation can be reused by other collective providers.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…er is available

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…initialization

It is rxm provider responsability to initialize collective offload provider's fabric.
Otherwise collective offload functionality will not be available

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
…IDER

FI_OFFLOAD_PROVIDER environment variable shall be set to offload provider name
to instruct libcabric to setup and use particular provider.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Peer provider must create peer_eq for offload provider, to allow offload provider
reporting events to peer provider.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Offload provider may execute collective operations via util_coll provider.
It must call fi_join() operation to get struct mc required for collective operations.
It can only call fi_join() on it's peer provider (e.g. rxm). FI_PEER flag is used
to inform peer provider to coll fi_join() operation for util_coll_ep

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
offload_coll_mask value is calculated based on the actual offload capabilities
confirmed by fi_query_collective().

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 36 files at r7, 1 of 1 files at r8, 15 of 30 files at r9, all commit messages.
Reviewable status: 22 of 37 files reviewed, 6 unresolved discussions (waiting on @grom72)


prov/rxm/src/rxm_ep.c line 394 at r9 (raw file):

	//FI_PEER flag is used to force util_coll context 
	//where fi_join() is called from offload provider
  1. C++ style comment
  2. space at the end

Code quote:

	//FI_PEER flag is used to force util_coll context
	//where fi_join() is called from offload provider

prov/rxm/src/rxm_ep.c line 402 at r9 (raw file):

		if (ret)
			goto err_util_coll;
		// It is collective offload provider responsibility to store util_coll provider mc

.

Code quote:

// It is collective offload provider responsibility to store util_coll provider mc

prov/rxm/src/rxm_msg.c line 818 at r9 (raw file):

				   rxm_ep_sar_calc_segs_cnt(rxm_ep, data_len));
	} else {
rndv_send:

Empty line before label, not after

Code quote:

	} else {
rndv_send:

		ret = rxm_alloc_rndv_buf(rxm_ep, rxm_conn, context,

prov/sharp/configure.m4 line 1 at r9 (raw file):

dnl Configury specific to the libfabric sharp provider

?

Code quote:

Configury

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 37 files reviewed, 7 unresolved discussions (waiting on @grom72)


prov/sharp/src/sharp_domain.c line 124 at r10 (raw file):

	/// mapped to int sharp_coll_finalize(struct sharp_coll_context *context);
	ret = ofi_domain_close(&domain->util_domain);
	if (ret)

Memory leak - domain is not freed.

Code quote:

	if (ret)
		return ret;

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 37 files reviewed, 6 unresolved discussions (waiting on @grom72)


prov/sharp/src/sharp_domain.c line 124 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Memory leak - domain is not freed.

Maybe not a memory leak - if .close checks for return code and retries - I hope it does so ;-)

@ldorau ldorau mentioned this pull request Dec 6, 2022
1 task
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 closed this Dec 13, 2022
@grom72
Copy link
Author

grom72 commented Dec 13, 2022

see #8

grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants