Add ibv_query_qp_data_in_order flags #1312

Merged: 3 commits, Apr 21, 2023

mrgolin
Contributor

@mrgolin mrgolin commented Feb 19, 2023

Add definitions for ibv_query_qp_data_in_order flags to allow querying for specific capabilities related to data polling support; for now this adds a 128-byte in-order flag (LL128).
Adapt existing code and add support in the EFA provider.
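
A minimal sketch of how a receiver might act on the queried capabilities. The `SKETCH_` names, the bit values, and the three polling modes are assumptions made for illustration; they are defined locally so the block compiles without libibverbs and are not the library's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins modeled on this discussion; bit values are assumed. */
enum sketch_caps {
	SKETCH_CAP_WHOLE_MSG = 1 << 0,         /* whole message is written in order */
	SKETCH_CAP_ALIGNED_128_BYTES = 1 << 1, /* each aligned 128-byte block is written in order */
};

static_assert((SKETCH_CAP_WHOLE_MSG & SKETCH_CAP_ALIGNED_128_BYTES) == 0,
	      "capability bits must not overlap");

/* Hypothetical receive strategies a consumer could choose between. */
enum sketch_poll_mode {
	SKETCH_POLL_COMPLETION, /* no data polling, wait for the completion */
	SKETCH_POLL_PER_BLOCK,  /* poll a flag inside every 128-byte block */
	SKETCH_POLL_LAST_BYTE,  /* poll the tail of the message */
};

/* Pick the least restrictive polling mode the capability mask allows. */
static enum sketch_poll_mode sketch_pick_poll_mode(uint32_t caps)
{
	if (caps & SKETCH_CAP_WHOLE_MSG)
		return SKETCH_POLL_LAST_BYTE;
	if (caps & SKETCH_CAP_ALIGNED_128_BYTES)
		return SKETCH_POLL_PER_BLOCK;
	return SKETCH_POLL_COMPLETION;
}
```

In real code the mask would come from the verbs query rather than a hand-built value.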

@mrgolin
Contributor Author

mrgolin commented Feb 19, 2023

@mrgolin
Contributor Author

mrgolin commented Feb 19, 2023

@rleon I would really appreciate if this can be reviewed for the upcoming release.

@rleon
Member

rleon commented Feb 19, 2023

> @rleon I would really appreciate if this can be reviewed for the upcoming release.

You posted the kernel patch a day before the merge window starts. I'm afraid it is too late for UAPI changes.

@mrgolin
Contributor Author

mrgolin commented Feb 19, 2023

> > @rleon I would really appreciate if this can be reviewed for the upcoming release.
>
> You posted the kernel patch a day before the merge window starts. I'm afraid it is too late for UAPI changes.

Thanks Leon, I understand the timeline and I wish we could have submitted this earlier. Still, the kernel patch is a really simple one, and it changes the UAPI only by defining another bit in an existing field.

@rleon
Member

rleon commented Feb 19, 2023

> > > @rleon I would really appreciate if this can be reviewed for the upcoming release.
> >
> > You posted the kernel patch a day before the merge window starts. I'm afraid it is too late for UAPI changes.
>
> Thanks Leon, I understand the timeline and I wish we could have submitted this earlier. Still, the kernel patch is a really simple one, and it changes the UAPI only by defining another bit in an existing field.

I'm aware. However, I don't want to take any chances and get angry about sending a PR to Linus with patches that are not in linux-next.

Thanks

@rleon
Member

rleon commented Feb 19, 2023

@jgunthorpe , you are sending PRs to Linus, so it is your call.

Thanks

@jgunthorpe
Member

This doesn't make a lot of sense to me. "In order" reflects the entire message stream, so what does "within a block" even mean? It seems to be a completely orthogonal concept, related to cache tearing rather than message ordering?

@mrgolin
Contributor Author

mrgolin commented Feb 21, 2023

> This doesn't make a lot of sense to me. "In order" reflects the entire message stream, so what does "within a block" even mean? It seems to be a completely orthogonal concept, related to cache tearing rather than message ordering?

From documentation:

**ibv_query_qp_data_in_order()** Checks whether WQE data is guaranteed to be
written in-order, and thus reader may poll for data instead of poll for completion.
This function indicates data is written in-order within each WQE, but cannot be used to determine ordering between separate WQEs.

This function is used to determine whether data can be polled, i.e. it's guaranteed that if some byte was written then all previous bytes of this message are ready to be used.
Some devices (e.g. EFA) may not support this for the full message but can guarantee write in order inside each data block of a particular size.
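
The difference between the two guarantees can be sketched with a toy model in which `written[i]` records whether byte `i` of a message has landed. The helper names are hypothetical and this is only an illustration of the semantics described above, not library code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the written bytes form a prefix: no written byte follows an unwritten one. */
static bool sketch_prefix_ordered(const bool *written, size_t len)
{
	bool seen_gap = false;

	for (size_t i = 0; i < len; i++) {
		if (!written[i])
			seen_gap = true;
		else if (seen_gap)
			return false;
	}
	return true;
}

/* True if the prefix property holds inside every block of `block` bytes,
 * with no ordering assumed between different blocks. */
static bool sketch_blockwise_ordered(const bool *written, size_t len, size_t block)
{
	assert(block > 0);

	for (size_t off = 0; off < len; off += block) {
		size_t n = len - off < block ? len - off : block;

		if (!sketch_prefix_ordered(written + off, n))
			return false;
	}
	return true;
}
```

A snapshot where the second 128-byte block is complete while the first is only half written satisfies the block-wise property but not the whole-message one, which is why a reader relying on the weaker guarantee has to check every block.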

@mrgolin
Contributor Author

mrgolin commented Mar 1, 2023

> > This doesn't make a lot of sense to me. "In order" reflects the entire message stream, so what does "within a block" even mean? It seems to be a completely orthogonal concept, related to cache tearing rather than message ordering?
>
> From documentation:
>
> **ibv_query_qp_data_in_order()** Checks whether WQE data is guaranteed to be
> written in-order, and thus reader may poll for data instead of poll for completion.
> This function indicates data is written in-order within each WQE, but cannot be used to determine ordering between separate WQEs.
>
> This function is used to determine whether data can be polled, i.e. it's guaranteed that if some byte was written then all previous bytes of this message are ready to be used. Some devices (e.g. EFA) may not support this for the full message but can guarantee write in order inside each data block of a particular size.

@jgunthorpe Does it make sense now?

@YonatanNachum
Contributor

We have a kernel patch: link
Thanks.

@@ -420,8 +420,7 @@ struct verbs_context_ops {
 			  struct ibv_port_attr *port_attr);
 	int (*query_qp)(struct ibv_qp *qp, struct ibv_qp_attr *attr,
 			int attr_mask, struct ibv_qp_init_attr *init_attr);
-	int (*query_qp_data_in_order)(struct ibv_qp *qp, enum ibv_wr_opcode op,
-				      uint32_t flags);
Contributor

I would keep the flags here, the provider can choose to ignore them.

Contributor Author

The idea of this interface change is that providers will now return all their supported capabilities (related to data inorder) without depending on the request. I think flags here may be confusing, when would you use it in provider code?

Contributor

I don't know, depends on the flags.
Anyway, it's not a big deal, I guess the flags can be reintroduced if/when they're needed.

libibverbs/verbs.c (outdated)
@@ -701,7 +701,16 @@ int ibv_query_qp_data_in_order(struct ibv_qp *qp, enum ibv_wr_opcode op,
 	 */
 	return 0;
 #else
-	return get_ops(qp->context)->query_qp_data_in_order(qp, op, flags);
+	uint32_t query_mask;
+	uint32_t comp_mask;
Contributor

Not sure comp_mask is the best name for this variable.

Contributor Author

Changed to supported_flags.

@gal-pressman
Contributor

IIUC, it means every 128 bytes are written "atomically", it doesn't really replace the completion though, as you have no idea when the entire message is completed (unless you check each 128 chunk separately?).

What is this used for?

@mrgolin
Contributor Author

mrgolin commented Mar 16, 2023

> IIUC, it means every 128 bytes are written "atomically", it doesn't really replace the completion though, as you have no idea when the entire message is completed (unless you check each 128 chunk separately?).
>
> What is this used for?

It doesn't replace the completion, but it allows interested consumers to consume received data before getting the receive completion, by reading each 128-byte chunk separately. Specifically, it's intended to enable LL128 in NCCL.

@gal-pressman
Contributor

Makes sense.
LL128 is where you have a flags header for each 120 bytes of data, right?

@mrgolin
Contributor Author

mrgolin commented Mar 16, 2023

> Makes sense. LL128 is where you have a flags header for each 120 bytes of data, right?

Right, 120 bytes of data and 8 bytes flag.
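
A rough sketch of the layout just described: one 128-byte line holding 120 bytes of payload and an 8-byte flag. Placing the flag at the end of the line, the `sketch_` names, and the use of a C11 acquire load are assumptions for illustration; a real receiver polling memory written by a device needs whatever ordering primitives its platform requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LL128_DATA_BYTES 120

/* One LL128-style line: payload followed by a flag (flag position assumed). */
struct sketch_ll128_line {
	uint8_t data[SKETCH_LL128_DATA_BYTES];
	_Atomic uint64_t flag;
};

static_assert(sizeof(struct sketch_ll128_line) == 128,
	      "line must be exactly 128 bytes");

/* Copy the payload out only if the flag already carries the expected value.
 * This relies on the whole 128-byte line being written in order, so a
 * visible flag implies the 120 data bytes before it are valid. */
static bool sketch_ll128_try_read(struct sketch_ll128_line *line,
				  uint64_t expected,
				  uint8_t out[SKETCH_LL128_DATA_BYTES])
{
	if (atomic_load_explicit(&line->flag, memory_order_acquire) != expected)
		return false;

	memcpy(out, line->data, SKETCH_LL128_DATA_BYTES);
	return true;
}
```

A consumer would call this once per line and retry the lines that are not ready yet, which is the "check each 128 chunk separately" pattern mentioned earlier in the thread.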

@@ -3205,11 +3210,12 @@ ibv_modify_qp_rate_limit(struct ibv_qp *qp,
  * written in-order.
  * @qp: The QP to query.
  * @op: Operation type.
- * @flags: Extra field for future input. For now must be 0.
+ * @flags: A bit-mask used to select specific capabilities to query. If 0,
+ * will query for IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG support.
Member

This flag was not intended to be used like this. The new flag should be more like IBV_QUERY_QP_DATA_RETURN_FLAGS.

And there is not really a good reason to change all the ops around either.

Contributor Author

Do you suggest that passing IBV_QUERY_QP_DATA_RETURN_FLAGS as flags to that function will make it return supported capabilities instead of 0/1?

Member

yes

Contributor Author

That actually was one of the options we considered, but it didn't feel like a natural API extension; it felt more like combining two APIs into a single one. Is there any future use case for this function you can think of that won't be satisfied by querying for a capability bit?

Contributor Author

@jgunthorpe Changed according to your request.

Contributor Author

If it looks ok now, can we get this merged?

@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
 /*
- * Copyright 2018-2022 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2023 Amazon.com, Inc. or its affiliates. All rights reserved.

Same as below, the latest open source policy advises dateless notices, e.g.

Copyright Amazon.com, Inc. or its affiliates. All rights reserved.

Contributor Author

@wenduwan Thanks.
This is a kernel header, can't change it here so will leave dated copyrights for now.

Member

@jgunthorpe jgunthorpe left a comment

The rest looks OK

result = get_ops(qp->context)->query_qp_data_in_order(qp, op, query_flags);
if (query_flags == IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS) {
if (result & IBV_QUERY_QP_DATA_IN_ORDER_WHOLE_MSG ||
get_ops(qp->context)->query_qp_data_in_order(qp, op, 0))
Member

There is only one other implementation; just fix it to always return the new style.

Contributor Author

Ok, done.

struct efa_context *ctx = to_efa_context(ibvqp->context);
int caps = 0;

if (flags != IBV_QUERY_QP_DATA_IN_ORDER_RETURN_CAPS)
Member

When I said to just make it use the new API, I meant always return the flags style and have the core code convert the result to 0/1 if RETURN_CAPS is not set.

Contributor Author

So, to make sure we are aligned: would you remove the flags param from the provider ops, as I did in the first place, or just move this condition to common code and ignore flags in the providers?

Member

yes just ignore the flag

Contributor Author

Done
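
The split agreed on in this thread can be sketched roughly as follows: the provider always reports a capability mask, and common code reduces it to the legacy 0/1 answer unless the caller asked for capabilities. All `sketch_` names and bit values are hypothetical stand-ins, the provider callback is simplified to take no arguments, and the assumption that the legacy answer means whole-message ordering is taken from the doc comment quoted earlier in the review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; names modeled on the discussion, bit values assumed. */
enum { SKETCH_RETURN_CAPS = 1 << 0 };
enum {
	SKETCH_CAP_WHOLE_MSG = 1 << 0,
	SKETCH_CAP_ALIGNED_128_BYTES = 1 << 1,
};

/* Simplified provider op: always returns a capability bitmask. */
typedef int (*sketch_provider_query_fn)(void);

/* Common-code wrapper: caller flags are handled here, not in the provider. */
static int sketch_core_query(sketch_provider_query_fn provider_query, uint32_t flags)
{
	assert(provider_query != NULL);

	int caps = provider_query();

	if (flags & SKETCH_RETURN_CAPS)
		return caps;

	/* Legacy callers get 0/1, meaning whole-message in-order support. */
	return !!(caps & SKETCH_CAP_WHOLE_MSG);
}

/* Two example providers for illustration only. */
static int sketch_provider_whole_msg(void)
{
	return SKETCH_CAP_WHOLE_MSG | SKETCH_CAP_ALIGNED_128_BYTES;
}

static int sketch_provider_128_only(void)
{
	return SKETCH_CAP_ALIGNED_128_BYTES;
}
```

With this shape, an application written against the old 0/1 behavior keeps seeing 0 from a device that only guarantees 128-byte blocks.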

Add ibv_query_qp_data_in_order_flags to enable querying for qp data
polling various capabilities support.
Add ibv_query_qp_data_in_order_caps to define capabilities that can be
returned by ibv_query_qp_data_in_order when RETURN_CAPS flag is used.
Handle common flags logic in libibverbs code.
Move existing mlx5 implementation to the new style.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>

To commit 6dddd93938b3 ("RDMA/efa: Add data polling capability feature
bit").

Signed-off-by: Michael Margolin <mrgolin@amazon.com>

Add implementation of query_qp_data_in_order in EFA provider. EFA
currently has support for 128 bytes data polling.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>

@mrgolin
Contributor Author

mrgolin commented Apr 17, 2023

@jgunthorpe any additional comments?

@jgunthorpe
Member

I'm still a bit nervous about this - AFAICT the only way to achieve this in a modern system is if the device promises to generate 128 byte PCIe MemWr TLPs so that the platform can execute single TLPs "in order". But the max TLP size is controlled by PCI config space, so how can verbs know it is 128 bytes at this point?

The existing flag is basically saying "the device does not use relaxed ordering and writes all bytes in any TLPs in order" and further that "the platform does not re-order non-relaxed ordering TLPs", which doesn't have this problem.

And further, semantically, according to verbs MRs created should be of the non-relaxed ordering type anyhow, so how do you get into a situation where otherwise in-order TLPs are re-ordered? Does this only work with relaxed ordering or is something in EFA wrongly forcing relaxed ordering in the MRs?

@mrgolin
Contributor Author

mrgolin commented Apr 19, 2023

> I'm still a bit nervous about this - AFAICT the only way to achieve this in a modern system is if the device promises to generate 128 byte PCIe MemWr TLPs so that the platform can execute single TLPs "in order". But the max TLP size is controlled by PCI config space, so how can verbs know it is 128 bytes at this point?
>
> The existing flag is basically saying "the device does not use relaxed ordering and writes all bytes in any TLPs in order" and further that "the platform does not re-order non-relaxed ordering TLPs", which doesn't have this problem.

Whenever 128 bytes in-order is promised, it's up to the device to make sure each such block isn't split and reordered, neither at the device/communication level nor by PCIe TLPs. It can achieve the PCIe part, as you suggested, by not using PCIe relaxed ordering (the same way it is done for whole-message in-order support) or, if it is familiar with the platform, by ensuring that write TLPs don't split data anywhere in between 128-byte boundaries.
Either way, it is less strict than whole-message in-order, so I believe that other providers may also correctly support this.

> And further, semantically, according to verbs MRs created should be of the non-relaxed ordering type anyhow, so how do you get into a situation where otherwise in-order TLPs are re-ordered? Does this only work with relaxed ordering or is something in EFA wrongly forcing relaxed ordering in the MRs?

I think it is mostly valuable for relaxed ordering, to gain both the relaxed-ordering performance improvement and the low latency of LL128 data polling, but if 128 bytes in-order is assured for relaxed ordering I don't see a reason why it wouldn't be for non-relaxed ordering.
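
The point about TLPs not splitting data between 128-byte boundaries can be illustrated with a toy model. It assumes, purely for illustration, that a device cuts a write into consecutive chunks of at most `max_payload` bytes starting at the write's first byte; real devices and platforms may split differently, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if any chunk boundary of the modeled write falls strictly
 * inside an aligned 128-byte block, i.e. the block would travel in two TLPs. */
static bool sketch_tlp_splits_128_block(uint64_t addr, uint64_t len,
					uint32_t max_payload)
{
	assert(max_payload > 0);

	for (uint64_t b = addr + max_payload; b < addr + len; b += max_payload) {
		if (b % 128 != 0)
			return true;
	}
	return false;
}
```

In this model a 128-byte-aligned write with a payload size that is a multiple of 128 never splits a block, while the same write starting at offset 64 does, so the guarantee depends on how the device aligns its writes as well as on the configured payload size.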

@jgunthorpe jgunthorpe merged commit d2dbc88 into linux-rdma:master Apr 21, 2023
14 checks passed
@jgunthorpe
Member

Having just learned about this NCCL LL128 stuff, this patch is somewhat wrong :(

The query_qp_data_in_order() is all about the CPU's perception of ordering. This new flag you've added is really about the device's ability to generate 128 byte PCIe TLPs.

@mrgolin
Contributor Author

mrgolin commented May 3, 2023

I believe this verb's documentation can, hopefully not too late for some users, be improved to state more explicitly where its responsibility ends. But looking at the discussion on the original PR that introduced it, there wasn't any intention to cover destination memory behavior. For example, your comment from that discussion:

> Sean, I covered that above, this API is only about the behavior up until verbs delivers PCIe TLPs out of the HCA, it doesn't try very hard to cover platform behavior. MR specific configuration including relaxed ordering and the behavior of the target memory is out of scope because we can't get that kind of information from verbs.

So I think what we did here perfectly aligns with this, extending it to support additional cases.
