Merge pull request #4108 from shefty/master
fabric: Introduce new mode bit FI_BUFFERED_RECV
shefty committed May 22, 2018
2 parents aca08e5 + 996d965 commit a75f3f1
Showing 8 changed files with 163 additions and 88 deletions.
12 changes: 12 additions & 0 deletions include/rdma/fabric.h
@@ -158,6 +158,7 @@ typedef struct fid *fid_t;
#define FI_AFFINITY (1ULL << 29)
#define FI_COMMIT_COMPLETE (1ULL << 30)

#define FI_VARIABLE_MSG (1ULL << 48)
#define FI_RMA_PMEM (1ULL << 49)
#define FI_SOURCE_ERR (1ULL << 50)
#define FI_LOCAL_COMM (1ULL << 51)
@@ -171,6 +172,11 @@ typedef struct fid *fid_t;
#define FI_DIRECTED_RECV (1ULL << 59)


/* Tagged messages, buffered receives, CQ flags */
#define FI_CLAIM (1ULL << 59)
#define FI_DISCARD (1ULL << 58)


struct fi_ioc {
void *addr;
size_t count;
@@ -301,6 +307,7 @@ enum {
#define FI_NOTIFY_FLAGS_ONLY (1ULL << 54)
#define FI_RESTRICTED_COMP (1ULL << 53)
#define FI_CONTEXT2 (1ULL << 52)
#define FI_BUFFERED_RECV (1ULL << 51)

struct fi_tx_attr {
uint64_t caps;
@@ -600,6 +607,11 @@ struct fi_context2 {
};
#endif

struct fi_recv_context {
struct fid_ep *ep;
void *context;
};

#ifdef __cplusplus
}
#endif
1 change: 1 addition & 0 deletions include/rdma/fi_endpoint.h
@@ -60,6 +60,7 @@ enum {
enum {
FI_OPT_MIN_MULTI_RECV, /* size_t */
FI_OPT_CM_DATA_SIZE, /* size_t */
FI_OPT_BUFFERED_LIMIT, /* size_t */
};

struct fi_ops_ep {
2 changes: 0 additions & 2 deletions include/rdma/fi_tagged.h
@@ -41,8 +41,6 @@
extern "C" {
#endif

#define FI_CLAIM (1ULL << 59)
#define FI_DISCARD (1ULL << 58)

struct fi_msg_tagged {
const struct iovec *msg_iov;
14 changes: 14 additions & 0 deletions man/fi_cq.3.md
@@ -594,6 +594,20 @@ operation. The following completion flags are defined.
buffer has been released, and the completion entry is not associated
with a received message.

*FI_MORE*
: See the 'Buffered Receives' section in `fi_msg`(3) for more details.
This flag is associated with receive completions on endpoints that
have FI_BUFFERED_RECV mode enabled. When set to one, it indicates that
the buffer referenced by the completion is limited by the
FI_OPT_BUFFERED_LIMIT threshold, and additional message data must be
retrieved by the application using an FI_CLAIM operation.

*FI_CLAIM*
: See the 'Buffered Receives' section in `fi_msg`(3) for more details.
This flag is set on completions associated with receive operations
that claim buffered receive data. Note that this flag only applies
to endpoints configured with the FI_BUFFERED_RECV mode bit.

# NOTES

A completion queue must be bound to at least one enabled endpoint before any
11 changes: 5 additions & 6 deletions man/fi_endpoint.3.md
@@ -513,12 +513,11 @@ The following option levels and option names and parameters are defined.
the maximum size of the data that may be present as part of a connection
request event. This option is read only.

- *FI_OPT_VARIABLE_THRESHOLD - size_t*
: Defines the minimum size for variable length messages. Transfers
equal to FI_OPT_VARIABLE_THRESHOLD size or smaller are handled as
standard message transfers. Message transfers larger than the
threshold are handled by the provider as variable length transfers.

- *FI_OPT_BUFFERED_LIMIT - size_t*
: Defines the maximum size of a buffered message that will be reported
to users as part of a receive completion when the FI_BUFFERED_RECV mode
is enabled on an endpoint.

fi_getopt() will return the currently configured threshold, or the
provider's default threshold if one has not been set by the application.
fi_setopt() allows an application to configure the threshold. If the
11 changes: 11 additions & 0 deletions man/fi_getinfo.3.md
@@ -555,6 +555,17 @@ supported set of modes will be returned in the info structure(s).
and counters among endpoints, transmit contexts, and receive contexts that
have the same set of capability flags.

*FI_BUFFERED_RECV*
: The buffered receive mode bit indicates that the provider owns the
data buffer(s) that are accessed by the networking layer for received
messages. Typically, this implies that data must be copied from the
provider buffer into the application buffer. Applications that can
handle message processing from network allocated data buffers can set
this mode bit to avoid copies. For full details on application
requirements to support this mode, see the 'Buffered Receives' section
in `fi_msg`(3). This mode bit applies to FI_MSG and FI_TAGGED receive
operations.

# ADDRESSING FORMATS

Multiple fabric interfaces take as input either a source or
154 changes: 92 additions & 62 deletions man/fi_msg.3.md
@@ -288,82 +288,112 @@ fi_sendmsg.
be used in all multicast transfers, in conjunction with a multicast
fi_addr_t.

# Variable Length Messages

Variable length messages, or simply variable messages, are transfers
where the size of the message is unknown to the receiver prior to the
message being sent. It indicates that the recipient of a message does
not know the amount of data to expect prior to the message arriving.
It is most commonly used when the size of message transfers varies
greatly, with very large messages interspersed with much smaller
messages. Variable messages are not subject to max message length
restrictions (i.e. struct fi_ep_attr::max_msg_size limits), and may
be up to the maximum value of size_t (e.g. SIZE_MAX) in length.

Variable messages are associated with a variable message threshold.
The variable threshold indicates the size above which a transfer
becomes a variable message. The completion mechanism of variable
messages differs from standard receive completions; however,
completions at the sender remain unchanged. Messages smaller than the
threshold are treated as standard messages (or tagged messages if
using the fi_tagged.3 operations). That is, they consume posted
application receive buffers and generate standard completions, including
generating any possible errors that may arise. The variable message
threshold is configurable per endpoint, subject to provider limitations.
Under most conditions, the threshold limit must be the same at both the
sending and receiving endpoints, and must be configured prior to
enabling the endpoint.

When a variable message is ready to be received, a notification is
generated on the associated receive completion queue. Such
completions will have the FI_VARIABLE_MSG flag set as part of the CQ
entry. The entry will report the length of the message to the receiver.
Since variable message notifications are not directly associated with
an application's posted receive operation, the CQ entry's op_context
field will point to a struct fi_var_context.
# Buffered Receives

Buffered receives indicate that the networking layer allocates and
manages the data buffers used to receive network data transfers. As
a result, received messages must be copied from the network buffers
into application buffers for processing. However, applications can
avoid this copy if they are able to process the message in place
(directly from the networking buffers).

Handling buffered receives differs based on the size of the message
being sent. In general, smaller messages are passed directly to the
application for processing. However, for large messages, an application
will only receive the start of the message and must claim the rest.
The details for how small messages are reported and large messages may
be claimed are described below.

When a provider receives a message, it will write an entry to the completion
queue associated with the receiving endpoint. For discussion purposes,
the completion queue is assumed to be configured for FI_CQ_FORMAT_DATA.
Since buffered receives are not associated with application posted buffers,
the CQ entry op_context will point to a struct fi_recv_context.

{% highlight c %}
struct fi_var_context {
void *op_context;
struct fi_recv_context {
struct fid_ep *ep;
void *context;
};
{% endhighlight %}

After being notified that a variable message is ready to be received,
applications should either claim or discard the message. To claim a
message, an application must post a receive operation with the
FI_CLAIM flag set. The struct fi_var_context returned as part of the
The 'ep' field will point to the receiving endpoint or Rx context, and
'context' will be NULL. The CQ entry's 'buf' will point to a provider
managed buffer where the start of the received message is located, and
'len' will be set to the total size of the message.

The maximum sized message that a provider can buffer is limited by
the FI_OPT_BUFFERED_LIMIT threshold. This threshold can be obtained and
adjusted by the application using the fi_getopt and fi_setopt calls,
respectively. Any adjustments must be made prior to enabling the endpoint.
The CQ entry 'buf' will point to a buffer whose size is the _minimum_ of
'len' and the FI_OPT_BUFFERED_LIMIT value. If the sent message is larger
than the buffered limit, the CQ entry 'flags' will have the FI_MORE bit set.
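
As an illustration only, the relationship between the message length, the
buffered limit, and the FI_MORE flag can be sketched in plain C. Everything
prefixed with `sketch_` or `SKETCH_` is a hypothetical stand-in so that the
fragment compiles without libfabric; the flag value in particular is a
placeholder, and real code should take FI_MORE from `<rdma/fabric.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value for illustration; use FI_MORE from <rdma/fabric.h>
 * in real code. */
#define SKETCH_FI_MORE (1ULL << 18)

/* Hypothetical summary of what a buffered receive completion exposes. */
struct sketch_buffered_view {
    size_t   buffered_len; /* bytes readable through the CQ entry 'buf' */
    uint64_t flags;        /* SKETCH_FI_MORE if data remains to be claimed */
};

/* Models the rule described above: 'buf' holds min(len, limit) bytes, and
 * the FI_MORE bit is set only when the message exceeds the limit. */
static struct sketch_buffered_view
sketch_view_of(size_t msg_len, size_t buffered_limit)
{
    struct sketch_buffered_view v;

    assert(buffered_limit > 0); /* assumption of this sketch only */
    v.buffered_len = msg_len < buffered_limit ? msg_len : buffered_limit;
    v.flags = msg_len > buffered_limit ? SKETCH_FI_MORE : 0;
    return v;
}
```

Under these assumptions, a 100-byte message with a 4096-byte limit is fully
readable from 'buf' with no FI_MORE bit, while an 8192-byte message exposes
only its first 4096 bytes and carries the bit.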

After being notified that a buffered receive has arrived,
applications must either claim or discard the message. Typically,
small messages are processed and discarded, while large messages
are claimed. However, an application is free to claim or discard any
message regardless of message size.

To claim a message, an application must post a receive operation with the
FI_CLAIM flag set. The struct fi_recv_context returned as part of the
notification must be provided as the receive operation's context. The
struct fi_var_context contains an op_context field. Applications may
struct fi_recv_context contains a 'context' field. Applications may
modify this field prior to claiming the message. When the claim
operation completes, a standard receive completion entry will be
generated on the completion queue. The op_context of the associated
CQ entry will be set to the op_context value passed in through
the fi_var_context structure.

Applications that do not wish to receive a variable message that they
were notified of may discard it. To discard a message, an application
must post a receive operation with the FI_DISCARD flag set. The
receive context should be the struct fi_var_context from the
notification. When the FI_DISCARD flag is set, the receive input
buffer(s) and length parameters are ignored.
generated on the completion queue. The 'context' of the associated
CQ entry will be set to the 'context' value passed in through
the fi_recv_context structure, and the CQ entry flags will have the
FI_CLAIM bit set.

Buffered receives that are not claimed must be discarded by the application
when it is done processing the CQ entry data. To discard a message, an
application must post a receive operation with the FI_DISCARD flag set.
The struct fi_recv_context returned as part of the notification must be
provided as the receive operation's context. When the FI_DISCARD flag is set
for a receive operation, the receive input buffer(s) and length parameters
are ignored.

IMPORTANT: Buffered receives must be claimed or discarded in a timely manner.
Failure to do so may result in increased memory usage for network buffering
or communication stalls. Once a buffered receive has been claimed or
discarded, the original CQ entry 'buf' or struct fi_recv_context data may no
longer be accessed by the application.
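
The claim-or-discard decision described above might be sketched as follows.
This is one possible policy (claim when FI_MORE is set, otherwise process in
place and discard), not a requirement of the interface. The `sketch_` types
are hypothetical stand-ins for struct fi_recv_context and a CQ data entry so
that the fragment compiles on its own; the FI_CLAIM and FI_DISCARD values
mirror the header change in this commit, and the FI_MORE value is a
placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FI_MORE    (1ULL << 18) /* placeholder value */
#define SKETCH_FI_CLAIM   (1ULL << 59) /* as defined in fabric.h above */
#define SKETCH_FI_DISCARD (1ULL << 58) /* as defined in fabric.h above */

/* Stand-in for struct fi_recv_context. */
struct sketch_recv_context {
    void *ep;      /* receiving endpoint or Rx context */
    void *context; /* NULL on notification; the application may set it */
};

/* Stand-in for the fields of a CQ data entry used here. */
struct sketch_cq_entry {
    void    *op_context; /* points to a struct sketch_recv_context */
    uint64_t flags;
    size_t   len;        /* total size of the message */
    void    *buf;        /* provider-managed buffer */
};

/* Returns the flag to pass to the follow-up receive operation, whose
 * context must be the recv context taken from the notification. */
static uint64_t
sketch_followup_flags(const struct sketch_cq_entry *entry, void *app_context)
{
    struct sketch_recv_context *ctx = entry->op_context;

    assert(ctx != NULL);
    if (entry->flags & SKETCH_FI_MORE) {
        /* Large message: store the application's context before
         * claiming, so the claim completion reports it. */
        ctx->context = app_context;
        return SKETCH_FI_CLAIM;
    }
    /* Small message: the caller processes entry->buf in place first,
     * then releases the provider buffer with a discard. */
    return SKETCH_FI_DISCARD;
}
```

In a real application the returned flag would be set on a receive operation
whose context is the struct fi_recv_context from the notification; after
that call, neither 'buf' nor the recv context may be touched again.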

The use of the FI_CLAIM and FI_DISCARD operation flags is also
described with respect to tagged message transfers in fi_tagged.3.
Variable length tagged messages will include the message tag as part
of the message notification.

Support for variable messages is indicated through the FI_VARIABLE_MSG
capability bit. Additionally, the variable length message threshold
may be obtained and/or adjusted using an endpoint's
fi_getopt/fi_setopt operations.
Buffered receives of tagged messages will include the message tag as part
of the CQ entry, if available.

The handling of variable message headers follows all message ordering
The handling of buffered receives follows all message ordering
restrictions assigned to an endpoint. For example, completions
may indicate the order in which variable messages arrived at the
receiver. However, the transfer of variable message data should be
treated as conceptually occurring out of band. No ordering within or
between the data of variable messages is implied.
may indicate the order in which received messages arrived at the
receiver based on the endpoint attributes.

# Variable Length Messages

Variable length messages, or simply variable messages, are transfers
where the size of the message is unknown to the receiver prior to the
message being sent. It indicates that the recipient of a message does
not know the amount of data to expect prior to the message arriving.
It is most commonly used when the size of message transfers varies
greatly, with very large messages interspersed with much smaller
messages, making receive side message buffering difficult to manage.
Variable messages are not subject to max message length
restrictions (i.e. struct fi_ep_attr::max_msg_size limits), and may
be up to the maximum value of size_t (e.g. SIZE_MAX) in length.

Variable length messages support requests that the provider allocate and
manage the network message buffers. As a result, the application
requirements and provider behavior are identical to those defined
for supporting the FI_BUFFERED_RECV mode bit. See the Buffered
Receives section above for details. The main difference is that buffered
receives are limited by the fi_ep_attr::max_msg_size threshold, whereas
variable length messages are not.

Support for variable messages is indicated through the FI_VARIABLE_MSG
capability bit.

# NOTES
