
Stabilize MPI test timing #780

Merged: allnes merged 1 commit into learning-process:master from aobolensk:mpi-timing-fixes on May 28, 2026

@aobolensk (Member)

Synchronize ranks before timed sections so scheduler skew and barrier waits are not counted as task runtime, preventing rare timeout flakes like these:

```
[ RUN      ] PicMatrixTests/NesterovARunFuncTestsProcesses3.MatmulFromPic/nesterov_a_test_task_processes_3_mpi_enabled_3_3
unknown file: error: C++ exception with description "
Task execute time need to be: time < 1 secs.
Original time in secs: 1.21769
" thrown in the test body.

[       OK ] PicMatrixTests/NesterovARunFuncTestsProcesses3.MatmulFromPic/nesterov_a_test_task_processes_3_mpi_enabled_3_3 (1224 ms)
[  FAILED  ] PicMatrixTests/NesterovARunFuncTestsProcesses3.MatmulFromPic/nesterov_a_test_task_processes_3_mpi_enabled_3_3, where GetParam() = (64-byte object <20-AA 75-60 F6-7F 00-00 C0-6C 6E-60 F6-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 88-77 B8-48 FD-01 00-00>, "nesterov_a_test_task_processes_3_mpi_enabled", (3, "3")) (1225 ms)
[ RUN      ] PicMatrixTests/NesterovARunFuncTestsProcesses3.MatmulFromPic/nesterov_a_test_task_processes_3_mpi_enabled_7_7

job aborted:
[ranks] message

[0] terminated

[1] application aborted
aborting MPI_COMM_WORLD (comm=0x44000000), error 1, comm rank 1

[2] terminated

---- error analysis -----

[1] on runnervmqq1k9
D:\a\parallel_programming_course\parallel_programming_course\install\bin\ppc_func_tests aborted the job. abort code 1

---- error analysis -----
 [  PROCESS 1  ]  [  PROCESS 1  ] Traceback (most recent call last):
  File "D:\a\parallel_programming_course\parallel_programming_course\scripts\run_tests.py", line 308, in <module>
    _execute(args_dict, env_copy)
  File "D:\a\parallel_programming_course\parallel_programming_course\scripts\run_tests.py", line 283, in _execute
    runner.run_processes(args_dict["additional_mpi_args"])
  File "D:\a\parallel_programming_course\parallel_programming_course\scripts\run_tests.py", line 247, in run_processes
    self.__run_exec(
  File "D:\a\parallel_programming_course\parallel_programming_course\scripts\run_tests.py", line 122, in __run_exec
    raise Exception(f"Subprocess return {result.returncode}.")
Exception: Subprocess return 1.
Error: Process completed with exit code 1.
```

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.13%. Comparing base (626e4be) to head (9bc57a2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modules/util/src/util.cpp | 55.55% | 1 Missing and 3 partials ⚠️ |

❌ Your patch check has failed because the patch coverage (55.55%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   93.82%   93.13%   -0.70%     
==========================================
  Files          15       15              
  Lines         486      495       +9     
  Branches      182      180       -2     
==========================================
+ Hits          456      461       +5     
- Misses          0        1       +1     
- Partials       30       33       +3     

@allnes merged commit a3fb7bb into learning-process:master on May 28, 2026
36 of 37 checks passed