Add ibv_query_qp_data_in_order flags #1312

mrgolin · 2023-02-19T07:32:59Z

Add definitions for ibv_query_qp_data_in_order flags to allow querying for specific capabilities related to data polling support. For now this adds a 128-byte in-order flag (LL128).
Adapt existing code and add support in the EFA provider.

mrgolin · 2023-02-19T08:34:55Z

Related kernel patch:
https://lore.kernel.org/linux-rdma/20230219081328.10419-1-mrgolin@amazon.com/T/#u

mrgolin · 2023-02-19T13:40:58Z

@rleon I would really appreciate if this can be reviewed for the upcoming release.

rleon · 2023-02-19T14:06:15Z

You posted the kernel patch a day before the merge window starts. I'm afraid that it is too late for UAPI changes.

mrgolin · 2023-02-19T14:29:00Z

Thanks Leon, I understand the timeline and I wish we could have submitted this earlier. Still, the kernel patch is a really simple one, and it changes the UAPI only by defining another bit in an existing field.

rleon · 2023-02-19T14:33:17Z

I'm aware. However, I don't want to take any chances and get an angry response for sending a PR to Linus with patches that were not in linux-next.

Thanks

rleon · 2023-02-19T14:33:48Z

@jgunthorpe , you are sending PRs to Linus, so it is your call.

Thanks

jgunthorpe · 2023-02-21T15:55:24Z

This doesn't make a lot of sense to me.. "in order" reflects the entire message stream, so what does "within a block" even mean? It seems to be some completely orthogonal concept related to cache tearing, not message ordering?

mrgolin · 2023-02-21T17:46:36Z

From documentation:

> **ibv_query_qp_data_in_order()** Checks whether WQE data is guaranteed to be written in-order, and thus reader may poll for data instead of poll for completion.
> This function indicates data is written in-order within each WQE, but cannot be used to determine ordering between separate WQEs.

This function is used to determine whether data can be polled, i.e. whether it's guaranteed that if some byte was written then all previous bytes of this message are ready to be used.
Some devices (e.g. EFA) may not support this for the full message but can guarantee in-order writes inside each data block of a particular size.
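
As a rough illustration of the distinction drawn above, the following self-contained C sketch shows how a consumer might pick a polling strategy from a capability bitmask. The `SKETCH_*` names and both helper functions are hypothetical stand-ins, not the libibverbs definitions, and the sketch assumes the message buffer starts on a 128-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, standing in for what a query might report. */
enum sketch_in_order_caps {
	SKETCH_IN_ORDER_WHOLE_MSG = 1 << 0,         /* whole message is written in order */
	SKETCH_IN_ORDER_ALIGNED_128_BYTES = 1 << 1, /* order holds only inside each 128-byte block */
};

enum sketch_poll_mode {
	SKETCH_POLL_COMPLETION_ONLY, /* no ordering guarantee: wait for the completion */
	SKETCH_POLL_PER_128_BYTES,   /* poll each 128-byte block separately */
	SKETCH_POLL_WHOLE_MSG,       /* poll the last byte of the message */
};

/* Prefer the strongest guarantee that the capability mask advertises. */
static enum sketch_poll_mode sketch_pick_poll_mode(uint32_t caps)
{
	if (caps & SKETCH_IN_ORDER_WHOLE_MSG)
		return SKETCH_POLL_WHOLE_MSG;
	if (caps & SKETCH_IN_ORDER_ALIGNED_128_BYTES)
		return SKETCH_POLL_PER_128_BYTES;
	return SKETCH_POLL_COMPLETION_ONLY;
}

/*
 * Given that the byte at `offset` was observed as written, return the lowest
 * offset that is guaranteed ready, so that [result, offset] may be consumed.
 */
static size_t sketch_ready_from(enum sketch_poll_mode mode, size_t offset)
{
	/* Without an ordering guarantee nothing can be inferred from one byte. */
	assert(mode != SKETCH_POLL_COMPLETION_ONLY);

	if (mode == SKETCH_POLL_WHOLE_MSG)
		return 0;
	/* Per-block mode: only the enclosing 128-byte block is covered. */
	return offset - (offset % 128);
}
```

In this model, observing byte 300 in per-block mode only vouches for bytes 256 through 300, while whole-message mode vouches for bytes 0 through 300.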

mrgolin · 2023-03-01T16:32:37Z

@jgunthorpe Does it make sense now?

YonatanNachum · 2023-03-16T07:54:29Z

We have a kernel patch: link
Thanks.

gal-pressman · 2023-03-16T11:21:33Z

libibverbs/driver.h

@@ -420,8 +420,7 @@ struct verbs_context_ops {
 			  struct ibv_port_attr *port_attr);
 	int (*query_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr,
 			int attr_mask, struct ibv_qp_init_attr *init_attr);
-	int (*query_qp_data_in_order)(struct ibv_qp *qp, enum ibv_wr_opcode op,
-				      uint32_t flags);


I would keep the flags here, the provider can choose to ignore them.

The idea of this interface change is that providers will now return all their supported capabilities (related to in-order data) regardless of the request. I think flags here may be confusing; when would you use them in provider code?

I don't know, depends on the flags.
Anyway, it's not a big deal, I guess the flags can be reintroduced if/when they're needed.

gal-pressman · 2023-03-16T11:24:52Z

libibverbs/verbs.c

@@ -701,7 +701,16 @@ int ibv_query_qp_data_in_order(struct ibv_qp *qp, enum ibv_wr_opcode op,
 	 */
 	return 0;
 #else
-	return get_ops(qp->context)->query_qp_data_in_order(qp, op, flags);
+	uint32_t query_mask;
+	uint32_t comp_mask;


Not sure comp_mask is the best name for this variable.

Changed to supported_flags.

gal-pressman · 2023-03-16T11:32:42Z

IIUC, it means every 128 bytes are written "atomically". It doesn't really replace the completion though, as you have no idea when the entire message is completed (unless you check each 128-byte chunk separately?).

What is this used for?

mrgolin · 2023-03-16T14:51:45Z

It doesn't replace the completion, but it allows interested consumers to consume received data before getting the receive completion, by reading each 128-byte chunk separately. Specifically, it's intended to enable LL128 in NCCL.

gal-pressman · 2023-03-16T15:04:28Z

Makes sense.
LL128 is where you have a flags header for each 120 bytes of data, right?

mrgolin · 2023-03-16T15:09:24Z

Right, 120 bytes of data and an 8-byte flag.
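
The layout confirmed above can be sketched in C roughly as follows. This is a minimal illustration, not NCCL's actual data structure: the `ll128_line` type, the helper names, and the convention that a flag value of 0 means "not yet written" are all assumptions. The reader relies on the 128-byte block being written in order, so that a visible flag implies the 120 payload bytes before it are valid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LL128_LINE_BYTES 128
#define LL128_DATA_BYTES 120

/* Hypothetical LL128-style line: 120 payload bytes followed by an 8-byte flag. */
struct ll128_line {
	uint8_t data[LL128_DATA_BYTES];
	uint64_t flag; /* last 8 bytes of the 128-byte block */
};

_Static_assert(sizeof(struct ll128_line) == LL128_LINE_BYTES,
	       "an LL128 line must occupy exactly 128 bytes");

/* Stand-in for the remote writer: payload first, flag last. */
static void ll128_fill(volatile struct ll128_line *line,
		       const uint8_t payload[LL128_DATA_BYTES], uint64_t flag)
{
	assert(flag != 0); /* 0 is reserved for "empty" in this sketch */
	for (size_t i = 0; i < LL128_DATA_BYTES; i++)
		line->data[i] = payload[i];
	atomic_thread_fence(memory_order_release);
	line->flag = flag;
}

/*
 * Poll one line. Returns true and copies the payload once the expected flag
 * is visible; returns false if the line has not arrived yet.
 */
static bool ll128_try_read(const volatile struct ll128_line *line,
			   uint64_t expected_flag,
			   uint8_t out[LL128_DATA_BYTES])
{
	if (line->flag != expected_flag)
		return false;
	/* Keep the payload reads from being hoisted above the flag check. */
	atomic_thread_fence(memory_order_acquire);
	for (size_t i = 0; i < LL128_DATA_BYTES; i++)
		out[i] = line->data[i];
	return true;
}
```

A receiver would call `ll128_try_read()` on each 128-byte line in turn, which is why the guarantee only needs to hold inside each block rather than across the whole message.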

jgunthorpe · 2023-03-22T15:24:54Z

libibverbs/verbs.h

@@ -3205,11 +3210,12 @@ ibv_modify_qp_rate_limit(struct ibv_qp *qp,
 *   written in-order.
 * @qp: The QP to query.
 * @op: Operation type.
- * @flags: Extra field for future input. For now must be 0.
+ * @flags: A bit-mask used to select specific capabilities to query. If 0,
+ * will query for IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG support.


This flags field was not intended to be used like this. The new flag should be more like IBV_QUERY_QP_DATA_RETURN_FLAGS

And there is not really a good reason to change all the ops around either.

Do you suggest that passing IBV_QUERY_QP_DATA_RETURN_FLAGS as flags to that function will make it return the supported capabilities instead of 0/1?

That actually was one of the options we considered, but it didn't feel like a natural API extension; it felt more like combining two APIs into a single one. Is there any future use case for this function you can think of that won't be satisfied by querying for a capability bit?

@jgunthorpe Changed according to your request.

If it looks ok now, can we get this merged?

wenduwan · 2023-03-28T15:41:02Z

kernel-headers/rdma/efa-abi.h

@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
 /*
- * Copyright 2018-2022 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2023 Amazon.com, Inc. or its affiliates. All rights reserved.


Same as below, the latest open source policy advises dateless notices, e.g.

Copyright Amazon.com, Inc. or its affiliates. All rights reserved.

@wenduwan Thanks.
This is a kernel header, can't change it here so will leave dated copyrights for now.

jgunthorpe

The rest looks OK

jgunthorpe · 2023-03-30T23:24:05Z

libibverbs/verbs.c

+	result = get_ops(qp->context)->query_qp_data_in_order(qp, op, query_flags);
+	if (query_flags == IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS) {
+		if (result & IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG ||
+		    get_ops(qp->context)->query_qp_data_in_order(qp, op, 0))


There is only one other implementation, just fix it to always return the new style

jgunthorpe · 2023-04-05T18:24:32Z

providers/efa/verbs.c

+	struct efa_context *ctx = to_efa_context(ibvqp->context);
+	int caps = 0;
+
+	if (flags != IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS)


When I said to just make it use the new API, I meant always return the flags style and have the core code convert the result to 0/1 if RETURN_CAPS is not set

So to make sure we are aligned, you would remove flags param from provider ops as I did in the first place or just move this condition to common code and ignore flags in providers?

yes just ignore the flag
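
The split agreed on here can be sketched in self-contained C as follows. This is an illustration under stated assumptions rather than the merged libibverbs code: every `sketch_*` and `SKETCH_*` name is hypothetical, the two provider callbacks are toy examples, and both rejecting unknown request flags and treating whole-message ordering as implying per-block ordering are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request flag and capability bits (separate bit spaces). */
#define SKETCH_RETURN_CAPS            (1u << 0)
#define SKETCH_CAP_WHOLE_MSG          (1u << 0)
#define SKETCH_CAP_ALIGNED_128_BYTES  (1u << 1)

/* Provider op: always reports a capability bitmask and ignores `flags`. */
typedef int (*sketch_provider_op)(uint32_t flags);

static int sketch_provider_whole_msg(uint32_t flags)
{
	(void)flags;
	return SKETCH_CAP_WHOLE_MSG;
}

static int sketch_provider_128_only(uint32_t flags)
{
	(void)flags;
	return SKETCH_CAP_ALIGNED_128_BYTES;
}

/* Common code: the only place that interprets the caller's flags. */
static int sketch_query_data_in_order(sketch_provider_op op, uint32_t flags)
{
	int caps;

	assert(op);

	/* Assumption: unknown request flags mean "not supported". */
	if (flags & ~SKETCH_RETURN_CAPS)
		return 0;

	caps = op(flags);

	/* Assumption: whole-message ordering implies per-block ordering. */
	if (caps & SKETCH_CAP_WHOLE_MSG)
		caps |= SKETCH_CAP_ALIGNED_128_BYTES;

	/* New style: hand back the bitmask. Old style: collapse to 0/1. */
	if (flags & SKETCH_RETURN_CAPS)
		return caps;
	return !!(caps & SKETCH_CAP_WHOLE_MSG);
}
```

Keeping the conversion in one place means callers that pass 0 still see the original 0/1 answer, while providers only ever deal with the bitmask.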

Add ibv_query_qp_data_in_order_flags to enable querying for qp data polling various capabilities support. Add ibv_query_qp_data_in_order_caps to define capabilities that can be returned by ibv_query_qp_data_in_order when RETURN_CAPS flag is used. Handle common flags logic in libibverbs code. Move existing mlx5 implementation to the new style. Signed-off-by: Michael Margolin <mrgolin@amazon.com>

To commit 6dddd93938b3 ("RDMA/efa: Add data polling capability feature bit"). Signed-off-by: Michael Margolin <mrgolin@amazon.com>

Add implementation of query_qp_data_in_order in EFA provider. EFA currently has support for 128 bytes data polling. Signed-off-by: Michael Margolin <mrgolin@amazon.com>

mrgolin · 2023-04-17T11:35:38Z

@jgunthorpe any additional comments?

jgunthorpe · 2023-04-17T18:45:19Z

I'm still a bit nervous about this - AFAICT the only way to achieve this in a modern system is if the device promises to generate 128 byte PCIe MemWr TLPs so that the platform can execute single TLPs "in order". But the max TLP size is controlled by PCI config space, so how can verbs know it is 128 bytes at this point?

The existing flag is basically saying "the device does not use relaxed ordering and writes all bytes in any TLPs in order" and further that "the platform does not re-order non-relaxed ordering TLPs", which doesn't have this problem.

And further, semantically, according to verbs MRs created should be of the non-relaxed ordering type anyhow, so how do you get into a situation where otherwise in-order TLPs are re-ordered? Does this only work with relaxed ordering or is something in EFA wrongly forcing relaxed ordering in the MRs?

mrgolin · 2023-04-19T20:29:48Z

> I'm still a bit nervous about this - AFAICT the only way to achieve this in a modern system is if the device promises to generate 128 byte PCIe MemWr TLPs so that the platform can execute single TLPs "in order". But the max TLP size is controlled by PCI config space, so how can verbs know it is 128 bytes at this point?
>
> The existing flag is basically saying "the device does not use relaxed ordering and writes all bytes in any TLPs in order" and further that "the platform does not re-order non-relaxed ordering TLPs", which doesn't have this problem.

Whenever 128-byte in-order delivery is promised, it's up to the device to make sure that no such block gets split and reordered, neither at the device/communication level nor by PCIe TLPs. It can achieve the PCIe part as you suggested, by not using PCIe relaxed ordering (the same way it is done for whole-message in-order support), or, if it is familiar with the platform, by ensuring that write TLPs don't split data anywhere in between 128-byte boundaries.
Either way, it is less strict than whole-message in-order, so I believe that other providers may also correctly support this.

> And further, semantically, according to verbs MRs created should be of the non-relaxed ordering type anyhow, so how do you get into a situation where otherwise in-order TLPs are re-ordered? Does this only work with relaxed ordering or is something in EFA wrongly forcing relaxed ordering in the MRs?

I think it is mostly valuable for relaxed ordering, to gain both the relaxed-ordering performance improvement and the low latency of LL128 data polling. But if 128-byte in-order is assured for relaxed ordering, I don't see a reason why it wouldn't be for non-relaxed ordering.
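
To make the boundary condition in this exchange concrete, here is a small C sketch under a deliberately simplified model: a DMA write is cut into consecutive TLPs of at most `max_payload` bytes, starting at the write's first address. Real devices and platforms may segment writes differently, and the function name is hypothetical. The check reports whether any TLP boundary falls strictly inside a 128-byte-aligned block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLOCK_BYTES 128u

/*
 * Simplified model: the write [addr, addr + len) is sent as back-to-back
 * TLPs of `max_payload` bytes each (the last one may be shorter).
 * Returns true if every TLP boundary lands on a 128-byte-aligned address,
 * i.e. no 128-byte block is carried by more than one TLP.
 */
static bool sketch_blocks_stay_whole(uint64_t addr, uint64_t len,
				     uint32_t max_payload)
{
	assert(max_payload > 0);

	for (uint64_t off = max_payload; off < len; off += max_payload) {
		if ((addr + off) % SKETCH_BLOCK_BYTES != 0)
			return false; /* this boundary splits a block */
	}
	return true;
}
```

Under this model a 128-byte or 256-byte payload size keeps aligned blocks intact, while a 64-byte payload size, or a start address that is not 128-byte aligned, splits them. That is the dependency on PCI configuration raised in the question above.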

jgunthorpe · 2023-04-28T13:41:56Z

Having just learned about this NCCL LL128 stuff, this patch is somewhat wrong :(

The query_qp_data_in_order() is all about the CPU's perception of ordering. This new flag you've added is really about the device's ability to generate 128 byte PCIe TLPs.

mrgolin · 2023-05-03T17:04:16Z

I believe this verb's documentation can be improved (hopefully not too late for some users) to state more explicitly where its responsibility ends. But looking at the discussion on the original PR that introduced it, there wasn't any intention to cover destination memory behavior. For example, your comment from that discussion:

> Sean, I covered that above, this API is only about the behavior up until verbs delivers PCIe TLPs out of the HCA, it doesn't try very hard to cover platform behavior. MR specific configuration including relaxed ordering and the behavior of the target memory is out of scope because we can't get that kind of information from verbs.

So I think what we did here perfectly aligns with this, extending it to support additional cases.

wzamazon mentioned this pull request Feb 21, 2023

prov/efa: query 128-bytes aligned writing capacity ofiwg/libfabric#8525

Closed

rleon added the needs-kernel-patch label Mar 13, 2023

YonatanNachum mentioned this pull request Mar 16, 2023

Add RDMA write support to EFA provider #1317

Merged

gal-pressman reviewed Mar 16, 2023

mrgolin force-pushed the query-data-in-order branch from ffef59a to fc6e94c Compare March 16, 2023 23:15

jgunthorpe requested changes Mar 22, 2023

mrgolin force-pushed the query-data-in-order branch from fc6e94c to 335c19c Compare March 27, 2023 14:11

wenduwan reviewed Mar 28, 2023

wzamazon mentioned this pull request Mar 28, 2023

prov/efa: adopt to rdma-core API adjustment ofiwg/libfabric#8722

Merged

mrgolin force-pushed the query-data-in-order branch from 335c19c to cfe630a Compare March 29, 2023 08:03

jgunthorpe removed the needs-kernel-patch label Mar 30, 2023

jgunthorpe requested changes Mar 30, 2023

mrgolin force-pushed the query-data-in-order branch from cfe630a to fa7d88e Compare April 2, 2023 13:30

jgunthorpe reviewed Apr 5, 2023

mrgolin added 2 commits April 11, 2023 07:33

Update kernel headers

29ccd49

To commit 6dddd93938b3 ("RDMA/efa: Add data polling capability feature bit"). Signed-off-by: Michael Margolin <mrgolin@amazon.com>

providers/efa: Add query_qp_data_in_order implementation

9ab9118

Add implementation of query_qp_data_in_order in EFA provider. EFA currently has support for 128 bytes data polling. Signed-off-by: Michael Margolin <mrgolin@amazon.com>

mrgolin force-pushed the query-data-in-order branch from fa7d88e to 9ab9118 Compare April 11, 2023 07:33

jgunthorpe merged commit d2dbc88 into linux-rdma:master Apr 21, 2023
14 checks passed

shijin-aws mentioned this pull request Apr 24, 2023

prov/efa: fix a bug when calling ibv_query_qp_data_in_order ofiwg/libfabric#8842

Merged
