-
Notifications
You must be signed in to change notification settings - Fork 911
Description
Background information
I have been working on a new osc/ofi component in an internship at HPE. To get started I tried to understand the existing osc/sm component. While jumping through the code I confirmed what @hkuno and @jlbyrne-hpe already told me. Everything in the sm component gets shifted over by 4 bytes. This means that nothing is cache-aligned, even though the code tries to make everything cache-aligned.
What version of Open MPI are you using?
I only looked at the code in master
Details of the problem
The problem is that each opal/shmem component introduces a header to the shared memory segment:
ompi/opal/mca/shmem/shmem_types.h
Lines 101 to 106 in c4d0752
struct opal_shmem_seg_hdr_t { | |
/* segment lock */ | |
opal_atomic_lock_t lock; | |
/* pid of the segment creator */ | |
pid_t cpid; | |
}; |
This causes everything else in the segment to get shifted over by the size of the header:
ompi/opal/mca/shmem/mmap/shmem_mmap_module.c
Lines 544 to 545 in c4d0752
/* update returned base pointer with an offset that hides our stuff */ | |
return (ds_buf->seg_base_addr + sizeof(opal_shmem_seg_hdr_t)); |
This causes two problems:
- The returned adress is no longer cache- and page- aligned.
- The allocated memory is larger than requested, potentially mapping a page more than intended.
The first problem is potentially an issue because e.g. osc/sm is not aware of how much the allocated memory shifted away from a cache line boundary:
ompi/ompi/mca/osc/sm/osc_sm_component.c
Lines 281 to 284 in c4d0752
state_size = sizeof(ompi_osc_sm_global_state_t) + sizeof(ompi_osc_sm_node_state_t) * comm_size; | |
state_size += OPAL_ALIGN_PAD_AMOUNT(state_size, 64); | |
posts_size = comm_size * post_size * sizeof (module->posts[0][0]); | |
posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64); |
ompi/ompi/mca/osc/sm/osc_sm_component.c
Lines 337 to 349 in c4d0752
for (i = 0, total = state_size + posts_size ; i < comm_size ; ++i) { | |
if (i > 0) { | |
module->posts[i] = module->posts[i - 1] + post_size; | |
} | |
module->sizes[i] = rbuf[i]; | |
if (module->sizes[i]) { | |
module->bases[i] = ((char *) module->segment_base) + total; | |
total += rbuf[i]; | |
} else { | |
module->bases[i] = NULL; | |
} | |
} |
Possible solutions
Remove the segment header
I couldn't find any evidence the segment base header is used apart from the initialisation of the lock in it. So if I am not mistaken the header could easily be removed and all problems would be solved.
Move the segment header to the end of the segment
This would solve the alignment issue but not the size issue.
Pad the header to the size of a cache line
This would be the method that probably requires the least amount of code change. Similar to the previous solution, this would only solve one of the "problems" described.
Add a method to the framework to get the header size
Now we get into more invasive idea but I want to name it anyway.
This would shift the issue over to the user of the opal/shmem components. It would allow the user to request exactly the amount of memory he wants, taking the header size into account. This especially means that if the user is padding to a page anyway he can take the header size into account and pad with less.
On the other hand this would be a lot of overhead on the user side and could lead to a potential misuse of the component if the user is not aware of the fact that his segment base will not be at a page boundary.