Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core + prov/efa: Request page alignment for buffer pools #7431

Closed
wants to merge 1 commit into from

Conversation

nzmsv
Copy link

@nzmsv nzmsv commented Feb 8, 2022

This lets us safely use MADV_DONTFORK without worrying about the buffer sharing a page with an unrelated allocation.

We also have to make the allocation size aligned to avoid the same page sharing issue at the end of the buffer.

@shijin-aws
Copy link
Contributor

Could you modify the commit/PR title to be prov/efa: ... to be consistent with other PRs/commits. This will help us filter commits. Thanks.

prov/efa/src/rxr/rxr.h Outdated Show resolved Hide resolved
Copy link
Contributor

@wzamazon wzamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use ofi_get_page_size to get page size.

prov/efa/src/rxr/rxr.h Outdated Show resolved Hide resolved
prov/util/src/util_buf.c Outdated Show resolved Hide resolved
@shefty shefty changed the title efa: Request page alignment for RXR buffer pool core + prov/efa: Request page alignment for buffer pools Mar 3, 2022
@nzmsv nzmsv changed the title core + prov/efa: Request page alignment for buffer pools prov/efa: Request page alignment for buffer pools Mar 15, 2022
prov/efa/src/rxr/rxr_ep.c Outdated Show resolved Hide resolved
prov/efa/src/rxr/rxr_ep.c Outdated Show resolved Hide resolved
@nzmsv nzmsv changed the title prov/efa: Request page alignment for buffer pools core + prov/efa: Request page alignment for buffer pools Mar 16, 2022
@nzmsv nzmsv requested a review from wzamazon March 16, 2022 05:28
Add OFI_BUFPOOL_PAGE_ALIGNED flag that requests page alignment
for the entire buffer pool rather than individual entries.
This is required to safely use MADV_DONTFORK when registering the
buffers with ibv_reg_mr.

Signed-off-by: Greg Inozemtsev <ginozemt@amazon.com>
Copy link
Contributor

@wzamazon wzamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split the PR into 2:

1st one make change to prov/utils to introduce the flag OFI_BUFPOOL_PAGE_ALIGNED

2nd one change the efa provider to use the the flag.

@@ -1399,7 +1399,7 @@ static int rxr_create_pkt_pool(struct rxr_ep *ep, size_t size,
int rxr_ep_init(struct rxr_ep *ep)
{
size_t entry_sz, sendv_pool_size;
int hp_pool_flag;
int flags;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to pkt_pool_flags is better because it is only apply to packet pools.

@@ -1409,19 +1409,21 @@ int rxr_ep_init(struct rxr_ep *ep)
#endif

if (efa_fork_status == EFA_FORK_SUPPORT_ON)
hp_pool_flag = 0;
// Buffer pool needs to be page aligned in order to set MADV_DONTFORK
// on the pages making up the pool.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use /* instead of //. Also please elaborate

@@ -1452,7 +1454,8 @@ int rxr_ep_init(struct rxr_ep *ep)
*/
ret = rxr_create_pkt_pool(ep, entry_sz,
rxr_env.readcopy_pool_size,
0, &ep->rx_readcopy_pkt_pool);
flags & ~OFI_BUFPOOL_HUGEPAGES,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use pkt_pool_flags to be consistent. I understand it changed behavior to use huge pages when fork support is off, which is ok.

@wzamazon
Copy link
Contributor

@shefty

We have identified this PR to fix a segfault when application running with rdma-core's fork support turned on, and so want it to be included in upcoming libfabric releases.

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment I'm about to post to the conversation.

@@ -289,6 +289,7 @@ enum {
OFI_BUFPOOL_INDEXED = 1 << 1,
OFI_BUFPOOL_NO_TRACK = 1 << 2,
OFI_BUFPOOL_HUGEPAGES = 1 << 3,
OFI_BUFPOOL_PAGE_ALIGNED = 1 << 4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used for page alignment, but to specify that memory allocations need to be a multiple of page sizes. Please separate the changes to the core files from the EFA provider.

retry:
ret = ofi_memalign((void **) &buf_region->alloc_region,
roundup_power_of_two(pool->attr.alignment),
roundup_power_of_two(alignment),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above code is confused and mixes data structure alignment with memory allocation sizes. The two are separate.

@@ -193,6 +204,9 @@ int ofi_bufpool_create_attr(struct ofi_bufpool_attr *attr,
if (pool->attr.flags & OFI_BUFPOOL_HUGEPAGES) {
pool->alloc_size = ofi_get_aligned_size(pool->alloc_size,
hp_size);
} else if (pool->attr.flags & OFI_BUFPOOL_PAGE_ALIGNED) {
pool->alloc_size = ofi_get_aligned_size(pool->alloc_size,
page_sizes[OFI_PAGE_SIZE]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in previous reviews, if the allocation size changes, we need to adjust other pool attributes.

@shefty
Copy link
Member

shefty commented Mar 16, 2022

The bufpool has these attributes

entry_size /* size of user's object size plus metadata plus alignment */
alignment /* stores alignment needed for each object */
chunk_cnt /* number of entries in each allocation */
alloc_size /* size of the memory allocation */

Note that entry_size can be larger, smaller, or equal to the object's alignment. The chunk_cnt is currently used to calculate alloc_size. What's needed is:

alloc_alignment /* aligned size for each allocation */

Combined, the alloc_size and alloc_alignment can guarantee that allocations are in whole pages, which is the desired feature.

These fields are related to each other. If we change the alloc_size, the chunk_cnt should reflect that change. And introducing alloc_alignment is essentially doing that. alloc_size needs to include alloc_alignment when being set, the same way entry_size includes the object alignment. The attributes should be set in this order:

object size <- size given by user
alignment <- based on object size or specified by user
entry_size <- based on the object size & alignment
alloc_alignment <- based on entry_size or specified by user
chunk_cnt <- specified by user or defaults & adjusted by alloc_alignment & entry_size
alloc_size <- based on entry_size/chunk_cnt & alloc_alignment

The calculation for the chunk_cnt needs to take into consideration any alloc_alignment. For example, assuming no chunk_cnt specified by user, if alloc_alignment is 4k and entry_size is 32, then the chunk_cnt should be 128. The alloc_size would also 4k (aligned allocation) in this case. However, if alloc_alignment is 4k and entry_size is 8k, then chunk_cnt might be 16 with alloc_size = 128k (aligned).

The alloc_alignment likely changes based on whether we're allocating huge pages or not. The code needs to handle this or add an assert that alloc_alignment is unspecified if huge pages are allowed. At the very least alloc_alignment should match object alignment. That is, we should have:

    ret = ofi_memalign(&buf_region->alloc_region, pool->alloc_alignment, pool->alloc_size)

Instead of "roundup_power_of_two(pool->attr.alignment)" filling in where alloc_alignment should be.

@nzmsv
Copy link
Author

nzmsv commented Mar 17, 2022

After discussing this with the team internally, we decided to approach this differently. Rather than changing the bufpool alignment we decided to use mmap to satisfy the allocation. I split this PR into two:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants