-
Notifications
You must be signed in to change notification settings - Fork 928
Description
Is your feature request related to a problem? Please describe.
During the work on fixing multi-threaded OBJ_RELEASE issues in MPI_Remove_error_code (commits 44dee9a and 6d3ee24), we identified an efficiency issue with how error codes and classes are allocated in the ompi_mpi_errcodes pointer array.
The current implementation in ompi/errhandler/errcode.c always increments ompi_mpi_errcode_lastused and allocates at the end of the array, even when there are NULL slots available from previously removed error codes/classes. When ompi_mpi_errcode_remove() or ompi_mpi_errclass_remove() is called, they set the pointer array entry to NULL, but these NULL slots are never reused, causing the array to grow indefinitely even if the actual number of active error codes/classes remains constant or decreases.
Impact:
- Memory waste: The pointer array continues to grow even when error codes/classes are removed
- Cache inefficiency: Sparse arrays lead to poor cache locality
- Handle exhaustion: In long-running applications with many add/remove cycles, the array could grow unnecessarily large
Current implementation details:
ompi_mpi_errcode_add() (errcode.c:376-390):
int ompi_mpi_errcode_add(int errclass)
{
ompi_mpi_errcode_t *newerrcode;
newerrcode = OBJ_NEW(ompi_mpi_errcode_t);
opal_mutex_lock(&errcode_lock);
newerrcode->code = (ompi_mpi_errcode_lastused+1);
newerrcode->cls = errclass;
opal_pointer_array_set_item(&ompi_mpi_errcodes, newerrcode->code, newerrcode
);
ompi_mpi_errcode_lastused++;
opal_mutex_unlock(&errcode_lock);
return newerrcode->code;
}ompi_mpi_errclass_add() (errcode.c:392-405):
int ompi_mpi_errclass_add(void)
{
ompi_mpi_errcode_t *newerrcode;
newerrcode = OBJ_NEW(ompi_mpi_errcode_t);
opal_mutex_lock(&errcode_lock);
newerrcode->cls = (ompi_mpi_errcode_lastused+1);
opal_pointer_array_set_item(&ompi_mpi_errcodes, newerrcode->cls, newerrcode)
;
ompi_mpi_errcode_lastused++;
opal_mutex_unlock(&errcode_lock);
return newerrcode->cls;
}Describe the solution you'd like
Implement a mechanism to track and reuse NULL slots in the ompi_mpi_errcodes pointer array before allocating new entries at the end. The preferred approach is to scan for free slots:
Before incrementing ompi_mpi_errcode_lastused, scan the array from ompi_mpi_errcode_lastpredefined + 1 to ompi_mpi_errcode_lastused for NULL entries:
int ompi_mpi_errcode_add(int errclass)
{
ompi_mpi_errcode_t *newerrcode;
int new_code = -1;
newerrcode = OBJ_NEW(ompi_mpi_errcode_t);
opal_mutex_lock(&errcode_lock);
// Try to find a free slot first
for (int i = ompi_mpi_errcode_lastpredefined + 1; i <= ompi_mpi_errcode_lastused; i++) {
if (NULL == opal_pointer_array_get_item(&ompi_mpi_errcodes, i)) {
new_code = i;
break;
}
}
// If no free slot found, allocate at the end
if (new_code < 0) {
new_code = ompi_mpi_errcode_lastused + 1;
ompi_mpi_errcode_lastused++;
}
newerrcode->code = new_code;
newerrcode->cls = errclass;
opal_pointer_array_set_item(&ompi_mpi_errcodes, new_code, newerrcode);
opal_mutex_unlock(&errcode_lock);
return new_code;
}Benefits:
- Simple implementation with no additional data structures
- Minimal performance overhead for typical usage patterns
- Efficient memory utilization
- Better cache locality with denser arrays
Implementation files to modify:
ompi/errhandler/errcode.c: Modifyompi_mpi_errcode_add()andompi_mpi_errclass_add()
Testing requirements:
- Add error code/class
- Remove it
- Add another error code/class
- Verify that the second add reuses the first slot (same error code value)
- Test with multiple add/remove cycles
- Thread safety: Ensure concurrent add/remove operations work correctly
The scan-for-free-slots approach is preferred initially because it's simple and straightforward. It can possibly be optimized later to a free list approach if profiling shows it's a bottleneck.
Additional context
This efficiency improvement builds upon recent bug fixes:
- Commit 44dee9a: "properly release errcode object and skip NULL on finalize"
- Commit 6d3ee24: "avoid data race setting LASTUSED attribute"
- Commit da50804: "Support MPI 4.1 removal of error class/code/string"
These commits fixed the multi-threading issues with OBJ_RELEASE and proper NULL handling, which are prerequisites for safely implementing efficient slot reuse.