Adds retry logic for timeout tests#5448
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR adds retry logic for tests that hit their hard timeout, mirroring the existing startup hang retry mechanism. It also increases timeouts for several tests that were presumably timing out inconsistently. The implementation is correct and follows the existing pattern, though there's one minor inconsistency in the retry reporting.
Architecture Impact
Self-contained. Changes are isolated to the test harness (conftest.py) and test configuration (test_settings.py). No impact on the core framework, task environments, or any production code paths.
Implementation Verdict
Minor fixes needed
Test Coverage
This is CI infrastructure code. The retry logic itself is not unit-tested, but this is consistent with the existing approach for the startup hang retry mechanism. Manual verification through CI runs would be the expected validation path.
CI Status
No CI checks available yet. This PR should be validated by observing whether the flaky timeout issues are reduced in subsequent CI runs.
Findings
🔵 Improvement: tools/conftest.py:377-378 — Inconsistent attempt numbering in log message
The startup hang retry message shows attempt {startup_hang_attempts}/{STARTUP_HANG_RETRIES + 1} (1/3, 2/3) while the timeout retry message at line 391-392 shows attempt {timeout_attempts}/{TIMEOUT_RETRIES + 1} (1/3, 2/3). However, the final failure message at line 446 says retried {timeout_attempts} time(s) which would be "retried 2 time(s)" after using all retries.
This is semantically correct but the phrasing is slightly confusing — "attempt 2/3" vs "retried 2 time(s)" could be clearer. Consider using consistent language like "retry 1/2" and "retry 2/2" for the retry attempts, then "after 2 retries" in the final message.
🔵 Improvement: tools/conftest.py:383-388 — Duplicated diagnostic printing code
The diagnostic printing logic (lines 383-388 for startup hang, lines 397-402 for timeout) is duplicated. Both blocks do:
diag = pre_kill_diag or _capture_system_diagnostics()
if len(diag) > 10000:
diag = diag[:10000] + "\n... (truncated)"
print(diag)This is identical to _get_diagnostics() which already exists (lines 224-228). Consider using print(_get_diagnostics(pre_kill_diag)) instead for consistency and to reduce duplication.
🟡 Warning: tools/conftest.py:385 — Timeout retry prints STDOUT but startup hang retry doesn't
The startup hang retry (lines 379-381) only prints STDERR, while the timeout retry (lines 393-398) prints both STDOUT and STDERR. This inconsistency may cause confusion when debugging. If the rationale is that startup hangs never produce meaningful stdout, that's reasonable, but the inconsistency should be intentional and documented.
🔵 Improvement: tools/test_settings.py:20 — Timeout increases should have comments explaining why
The timeout for test_articulation.py was doubled from 1000 to 2000, and test_manager_based_rl_env_obs_spaces.py, test_visuotactile_sensor.py, and test_visuotactile_render.py were also doubled. It would be helpful to add brief comments explaining why these specific tests need longer timeouts (e.g., "# Frequently times out on slower CI runners" or "# Contains expensive render operations"), similar to the existing comments on other entries.
🔵 Improvement: tools/conftest.py:51 — TIMEOUT_RETRIES value consistency
TIMEOUT_RETRIES = 2 means up to 3 total attempts (1 initial + 2 retries), which is consistent with STARTUP_HANG_RETRIES = 2. This is correct, but combined with doubled timeouts for some tests, a single test could now take up to 3× the timeout value before failing (e.g., test_environments.py could take up to 30,000 seconds ≈ 8.3 hours). Ensure CI job timeouts accommodate this worst-case scenario, or consider whether retry logic should only apply to tests under a certain timeout threshold.
Greptile SummaryThis PR adds retry logic for tests that time out in CI without producing a JUnit report, converting the existing startup-hang for-loop into a unified Confidence Score: 4/5Safe to merge; retry logic is correct and well-guarded, with only minor reporting and style issues. No P0/P1 issues found. The loop termination is correct, retry counters are properly bounded, and the has_report guard prevents retrying tests that already produced results. Two P2 findings: wall_time underreporting across retries and inline duplication of _get_diagnostics() logic. tools/conftest.py — specifically the wall_time accumulation across retry attempts Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Start: run test attempt] --> B[capture_test_output_with_timeout]
B --> C{kill_reason?}
C -->|startup_hang| D{startup_hang_attempts\n< STARTUP_HANG_RETRIES?}
D -->|Yes| E[Increment startup_hang_attempts\nPrint diagnostics]
E --> A
D -->|No| BREAK
C -->|timeout| F{has_report?}
F -->|Yes - test completed| BREAK
F -->|No| G{timeout_attempts\n< TIMEOUT_RETRIES?}
G -->|Yes| H[Increment timeout_attempts\nPrint stdout/stderr/diag]
H --> A
G -->|No| BREAK
C -->|empty or other| BREAK
BREAK[Break loop] --> I{Resolve result}
I -->|startup_hang exhausted| J[Write startup_hang error report]
I -->|timeout, no report| K[Write timeout error report]
I -->|no report, other| L[Write crash error report]
I -->|has_report| M[Parse JUnit XML results]
|
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
The new commit addresses several previous concerns: timeouts were reverted to more reasonable values (test_articulation.py at 1500s instead of 2000s, and the other tests are back to 1000s). The Anymal_D expected load time was bumped slightly from 55s to 60s in the performance test.
Implementation Verdict
Ship it
Findings
Previous concerns about duplicated diagnostic printing code and inconsistent STDOUT/STDERR printing in retry paths remain but are minor code quality issues that don't affect correctness. The timeout increase concern is largely mitigated by the reverted values.
The implementation is correct and ready to merge. The retry logic follows the established pattern for startup hang retries, and the more conservative timeout adjustments reduce the risk of excessively long CI runs.
Description
Some tests arbitrarily times out due to inconsistent CI runs. This change adds a logic to rerun tests that have timed out in an attempt to reduce flaky timeout issues.
Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there