New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ibv_query_qp_data_in_order flags #1312
Conversation
Related kernel patch: |
@rleon I would really appreciate if this can be reviewed for the upcoming release. |
You posted kernel patch day before merge window starts. I afraid that it is too late for UAPI changes. |
Thanks Leon I understand the timeline and I wish we could submit this earlier. Yet, the kernel patch is really simple one and it changes UAPI only by a definition of another bit in an existing field. |
I'm aware. however I don't want to take any chances and get angry about sending PR to Linus with patches not being in linux-next. Thanks |
@jgunthorpe , you are sending PRs to Linus, so it is your call. Thanks |
This doesn't make alot of sense to me.. "in order" reflects the entire message stream, so what does "within a block" even mean? It seems to be some completely orthogonal concept related to cache tearing not message ordering? |
From documentation:
This function is used to determine whether data can be polled, i.e. it's guaranteed that if some byte was written then all previous bytes of this message are ready to be used. |
@jgunthorpe Does it make sense now? |
We have a kernel patch: link |
libibverbs/driver.h
Outdated
@@ -420,8 +420,7 @@ struct verbs_context_ops { | |||
struct ibv_port_attr *port_attr); | |||
int (*query_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr, | |||
int attr_mask, struct ibv_qp_init_attr *init_attr); | |||
int (*query_qp_data_in_order)(struct ibv_qp *qp, enum ibv_wr_opcode op, | |||
uint32_t flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the flags here, the provider can choose to ignore them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of this interface change is that providers will now return all their supported capabilities (related to data inorder) without depending on the request. I think flags here may be confusing, when would you use it in provider code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, depends on the flags.
Anyway, it's not a big deal, I guess the flags can be reintroduced if/when they're needed.
libibverbs/verbs.c
Outdated
@@ -701,7 +701,16 @@ int ibv_query_qp_data_in_order(struct ibv_qp *qp, enum ibv_wr_opcode op, | |||
*/ | |||
return 0; | |||
#else | |||
return get_ops(qp->context)->query_qp_data_in_order(qp, op, flags); | |||
uint32_t query_mask; | |||
uint32_t comp_mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure comp_mask is the best name for this variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to supported_flags.
IIUC, it means every 128 bytes are written "atomically", it doesn't really replace the completion though, as you have no idea when the entire message is completed (unless you check each 128 chunk separately?). What is this used for? |
It doesn't replace the completion but allows consumers that are interested in to consume received data prior to getting receive completion, by reading each 128 bytes chunk separately. Specifically it's intended to enable LL128 in NCCL. |
Makes sense. |
Right, 120 bytes of data and 8 bytes flag. |
ffef59a
to
fc6e94c
Compare
libibverbs/verbs.h
Outdated
@@ -3205,11 +3210,12 @@ ibv_modify_qp_rate_limit(struct ibv_qp *qp, | |||
* written in-order. | |||
* @qp: The QP to query. | |||
* @op: Operation type. | |||
* @flags: Extra field for future input. For now must be 0. | |||
* @flags: A bit-mask used to select specific capabilities to query. If 0, | |||
* will query for IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flags was not intended to be used like this. The new flag should be more like IBV_QUERY_QP_DATA_RETURN_FLAGS
And there should be not really a good reason to change all the ops around either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest that passing IBV_QUERY_QP_DATA_RETURN_FLAGS as flags to that function will make it return supported capabilities instead of 0/1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That actually was one of the options we considered but it didn't feel as a natural API extension but more like combining two into a single one. Is there any future use case for this function you can think of that won't be satisfied by querying for a capability bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgunthorpe Changed according to your request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it looks ok now, can we get this merged?
fc6e94c
to
335c19c
Compare
@@ -1,6 +1,6 @@ | |||
/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */ | |||
/* | |||
* Copyright 2018-2022 Amazon.com, Inc. or its affiliates. All rights reserved. | |||
* Copyright 2018-2023 Amazon.com, Inc. or its affiliates. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as below, the latest open source policy advises dateless notices, e.g.
Copyright Amazon.com, Inc. or its affiliates. All rights reserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenduwan Thanks.
This is a kernel header, can't change it here so will leave dated copyrights for now.
335c19c
to
cfe630a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest looks OK
libibverbs/verbs.c
Outdated
result = get_ops(qp->context)->query_qp_data_in_order(qp, op, query_flags); | ||
if (query_flags == IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS) { | ||
if (result & IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG || | ||
get_ops(qp->context)->query_qp_data_in_order(qp, op, 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only one other implementation just fix it to return the new style always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done.
cfe630a
to
fa7d88e
Compare
providers/efa/verbs.c
Outdated
struct efa_context *ctx = to_efa_context(ibvqp->context); | ||
int caps = 0; | ||
|
||
if (flags != IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I said to just make it use the new API, I ment always return the flags style and have the core code covert the result to 0/1 if RETURN_CAPS is not set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to make sure we are aligned, you would remove flags param from provider ops as I did in the first place or just move this condition to common code and ignore flags in providers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes just ignore the flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Add ibv_query_qp_data_in_order_flags to enable querying for qp data polling various capabilities support. Add ibv_query_qp_data_in_order_caps to define capabilities that can be returned by ibv_query_qp_data_in_order when RETURN_CAPS flag is used. Handle common flags logic in libibverbs code. Move existing mlx5 implementation to the new style. Signed-off-by: Michael Margolin <mrgolin@amazon.com>
To commit 6dddd93938b3 ("RDMA/efa: Add data polling capability feature bit"). Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Add implementation of query_qp_data_in_order in EFA provider. EFA currently has support for 128 bytes data polling. Signed-off-by: Michael Margolin <mrgolin@amazon.com>
fa7d88e
to
9ab9118
Compare
@jgunthorpe any additional comments? |
I'm still a bit nervous about this - AFAICT the only way to achieve this in a modern system is if the device promises to generate 128 byte PCIe MemWr TLPs so that the platform can execute single TLPs "in order". But the max TLP size is controlled by PCI config space, so how can verbs know it is 128 bytes at this point? The existing flag is basically saying 'the device does not use relaxed ordering and writes all bytes in any TLPs in order" and further that "the platform does not re-order non-relaxed ordering TLPs", which doesn't have this problem. And further, semantically, according to verbs MRs created should be of the non-relaxed ordering type anyhow, so how do you get into a situation where otherwise in-order TLPs are re-ordered? Does this only work with relaxed ordering or is something in EFA wrongly forcing relaxed ordering in the MRs? |
Whenever 128 bytes in-order is promised it's up to the device to make sure each such block isn't getting splitted and reordered nether on the device/communication level nor by PCIe TLPs. It can achieve the PCIe part as you suggested by not using PCIe relaxed ordering (same way it is done for whole message in-order support) or if it is familiar with the platform, ensure that write TLPs don't split data anywhere in-between 128 bytes boundaries.
I think it is mostly valuable for relaxed ordering to gain both from relaxed order performance improvement and from LL128 data polling low latency but if 128 bytes in-order is assured for relaxed ordering I don't see a reason why it won't be for non-relaxed ordering. |
Having just learned about this NCCL LL128 stuff, this patch is somewhat wrong :( The query_qp_data_in_order() is all about the CPU's perception of ordering. This new flag you've added is really about the device's ability to generate 128 byte PCIe TLPs. |
I believe this verb documentation can, hopefully not too late for some users, be improved to more explicitly state where its responsibility ends but looking at the discussion on original PR that introduced this there wasn't any intension to cover destination memory behavior. For example your comment from that discussion:
So I think what we did here perfectly aligns with this, extending it to support additional cases. |
Add definition for ibv_query_qp_data_in_order flags to allow querying for specific capabilities related to data polling support, currently adding 128 bytes in order flag (LL128).
Adopt existing code and add support in EFA provider.