-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core + prov/efa: Request page alignment for buffer pools #7431
Conversation
Could you modify the commit/PR title to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use ofi_get_page_size
to get page size.
Add OFI_BUFPOOL_PAGE_ALIGNED flag that requests page alignment for the entire buffer pool rather than individual entries. This is required to safely use MADV_DONTFORK when registering the buffers with ibv_reg_mr. Signed-off-by: Greg Inozemtsev <ginozemt@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split the PR into 2:
1st one make change to prov/utils to introduce the flag OFI_BUFPOOL_PAGE_ALIGNED
2nd one change the efa provider to use the the flag.
@@ -1399,7 +1399,7 @@ static int rxr_create_pkt_pool(struct rxr_ep *ep, size_t size, | |||
int rxr_ep_init(struct rxr_ep *ep) | |||
{ | |||
size_t entry_sz, sendv_pool_size; | |||
int hp_pool_flag; | |||
int flags; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to pkt_pool_flags
is better because it is only apply to packet pools.
@@ -1409,19 +1409,21 @@ int rxr_ep_init(struct rxr_ep *ep) | |||
#endif | |||
|
|||
if (efa_fork_status == EFA_FORK_SUPPORT_ON) | |||
hp_pool_flag = 0; | |||
// Buffer pool needs to be page aligned in order to set MADV_DONTFORK | |||
// on the pages making up the pool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use /*
instead of //
. Also please elaborate
@@ -1452,7 +1454,8 @@ int rxr_ep_init(struct rxr_ep *ep) | |||
*/ | |||
ret = rxr_create_pkt_pool(ep, entry_sz, | |||
rxr_env.readcopy_pool_size, | |||
0, &ep->rx_readcopy_pkt_pool); | |||
flags & ~OFI_BUFPOOL_HUGEPAGES, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use pkt_pool_flags
to be consistent. I understand it changed behavior to use huge pages when fork support is off, which is ok.
We have identified this PR to fix a segfault when application running with rdma-core's fork support turned on, and so want it to be included in upcoming libfabric releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment I'm about to post to the conversation.
@@ -289,6 +289,7 @@ enum { | |||
OFI_BUFPOOL_INDEXED = 1 << 1, | |||
OFI_BUFPOOL_NO_TRACK = 1 << 2, | |||
OFI_BUFPOOL_HUGEPAGES = 1 << 3, | |||
OFI_BUFPOOL_PAGE_ALIGNED = 1 << 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used for page alignment, but to specify that memory allocations need to be a multiple of page sizes. Please separate the changes to the core files from the EFA provider.
retry: | ||
ret = ofi_memalign((void **) &buf_region->alloc_region, | ||
roundup_power_of_two(pool->attr.alignment), | ||
roundup_power_of_two(alignment), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above code is confused and mixes data structure alignment with memory allocation sizes. The two are separate.
@@ -193,6 +204,9 @@ int ofi_bufpool_create_attr(struct ofi_bufpool_attr *attr, | |||
if (pool->attr.flags & OFI_BUFPOOL_HUGEPAGES) { | |||
pool->alloc_size = ofi_get_aligned_size(pool->alloc_size, | |||
hp_size); | |||
} else if (pool->attr.flags & OFI_BUFPOOL_PAGE_ALIGNED) { | |||
pool->alloc_size = ofi_get_aligned_size(pool->alloc_size, | |||
page_sizes[OFI_PAGE_SIZE]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in previous reviews, if the allocation size changes, we need to adjust other pool attributes.
The bufpool has these attributes
Note that entry_size can be larger, smaller, or equal to the object's alignment. The chunk_cnt is currently used to calculate alloc_size. What's needed is:
Combined, the alloc_size and alloc_alignment can guarantee that allocations are in whole pages, which is the desired feature. These fields are related to each other. If we change the alloc_size, the chunk_cnt should reflect that change. And introducing alloc_alignment is essentially doing that. alloc_size needs to include alloc_alignment when being set, the same way entry_size includes the object alignment. The attributes should be set in this order:
The calculation for the chunk_cnt needs to take into consideration any alloc_alignment. For example, assuming no chunk_cnt specified by user, if alloc_alignment is 4k and entry_size is 32, then the chunk_cnt should be 128. The alloc_size would also 4k (aligned allocation) in this case. However, if alloc_alignment is 4k and entry_size is 8k, then chunk_cnt might be 16 with alloc_size = 128k (aligned). The alloc_alignment likely changes based on whether we're allocating huge pages or not. The code needs to handle this or add an assert that alloc_alignment is unspecified if huge pages are allowed. At the very least alloc_alignment should match object alignment. That is, we should have:
Instead of "roundup_power_of_two(pool->attr.alignment)" filling in where alloc_alignment should be. |
After discussing this with the team internally, we decided to approach this differently. Rather than changing the bufpool alignment we decided to use mmap to satisfy the allocation. I split this PR into two:
|
This lets us safely use MADV_DONTFORK without worrying about the buffer sharing a page with an unrelated allocation.
We also have to make the allocation size aligned to avoid the same page sharing issue at the end of the buffer.