Skip to content

Adds retry logic for timeout tests#5448

Merged
kellyguo11 merged 4 commits into
isaac-sim:developfrom
kellyguo11:retry-timeouts
May 1, 2026
Merged

Adds retry logic for timeout tests#5448
kellyguo11 merged 4 commits into
isaac-sim:developfrom
kellyguo11:retry-timeouts

Conversation

@kellyguo11
Copy link
Copy Markdown
Contributor

Description

Some tests arbitrarily times out due to inconsistent CI runs. This change adds a logic to rerun tests that have timed out in an attempt to reduce flaky timeout issues.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds retry logic for tests that hit their hard timeout, mirroring the existing startup hang retry mechanism. It also increases timeouts for several tests that were presumably timing out inconsistently. The implementation is correct and follows the existing pattern, though there's one minor inconsistency in the retry reporting.

Architecture Impact

Self-contained. Changes are isolated to the test harness (conftest.py) and test configuration (test_settings.py). No impact on the core framework, task environments, or any production code paths.

Implementation Verdict

Minor fixes needed

Test Coverage

This is CI infrastructure code. The retry logic itself is not unit-tested, but this is consistent with the existing approach for the startup hang retry mechanism. Manual verification through CI runs would be the expected validation path.

CI Status

No CI checks available yet. This PR should be validated by observing whether the flaky timeout issues are reduced in subsequent CI runs.

Findings

🔵 Improvement: tools/conftest.py:377-378 — Inconsistent attempt numbering in log message

The startup hang retry message shows attempt {startup_hang_attempts}/{STARTUP_HANG_RETRIES + 1} (1/3, 2/3) while the timeout retry message at line 391-392 shows attempt {timeout_attempts}/{TIMEOUT_RETRIES + 1} (1/3, 2/3). However, the final failure message at line 446 says retried {timeout_attempts} time(s) which would be "retried 2 time(s)" after using all retries.

This is semantically correct but the phrasing is slightly confusing — "attempt 2/3" vs "retried 2 time(s)" could be clearer. Consider using consistent language like "retry 1/2" and "retry 2/2" for the retry attempts, then "after 2 retries" in the final message.

🔵 Improvement: tools/conftest.py:383-388 — Duplicated diagnostic printing code

The diagnostic printing logic (lines 383-388 for startup hang, lines 397-402 for timeout) is duplicated. Both blocks do:

diag = pre_kill_diag or _capture_system_diagnostics()
if len(diag) > 10000:
    diag = diag[:10000] + "\n... (truncated)"
print(diag)

This is identical to _get_diagnostics() which already exists (lines 224-228). Consider using print(_get_diagnostics(pre_kill_diag)) instead for consistency and to reduce duplication.

🟡 Warning: tools/conftest.py:385 — Timeout retry prints STDOUT but startup hang retry doesn't

The startup hang retry (lines 379-381) only prints STDERR, while the timeout retry (lines 393-398) prints both STDOUT and STDERR. This inconsistency may cause confusion when debugging. If the rationale is that startup hangs never produce meaningful stdout, that's reasonable, but the inconsistency should be intentional and documented.

🔵 Improvement: tools/test_settings.py:20 — Timeout increases should have comments explaining why

The timeout for test_articulation.py was doubled from 1000 to 2000, and test_manager_based_rl_env_obs_spaces.py, test_visuotactile_sensor.py, and test_visuotactile_render.py were also doubled. It would be helpful to add brief comments explaining why these specific tests need longer timeouts (e.g., "# Frequently times out on slower CI runners" or "# Contains expensive render operations"), similar to the existing comments on other entries.

🔵 Improvement: tools/conftest.py:51 — TIMEOUT_RETRIES value consistency

TIMEOUT_RETRIES = 2 means up to 3 total attempts (1 initial + 2 retries), which is consistent with STARTUP_HANG_RETRIES = 2. This is correct, but combined with doubled timeouts for some tests, a single test could now take up to 3× the timeout value before failing (e.g., test_environments.py could take up to 30,000 seconds ≈ 8.3 hours). Ensure CI job timeouts accommodate this worst-case scenario, or consider whether retry logic should only apply to tests under a certain timeout threshold.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds retry logic for tests that time out in CI without producing a JUnit report, converting the existing startup-hang for-loop into a unified while True loop with separate counters for startup_hang_attempts and timeout_attempts (each capped at 2 retries). It also doubles the hard timeouts for four frequently-flaky tests in test_settings.py.

Confidence Score: 4/5

Safe to merge; retry logic is correct and well-guarded, with only minor reporting and style issues.

No P0/P1 issues found. The loop termination is correct, retry counters are properly bounded, and the has_report guard prevents retrying tests that already produced results. Two P2 findings: wall_time underreporting across retries and inline duplication of _get_diagnostics() logic.

tools/conftest.py — specifically the wall_time accumulation across retry attempts

Important Files Changed

Filename Overview
tools/conftest.py Converts the startup-hang retry for-loop to a while-True loop and adds a parallel timeout-retry path (up to TIMEOUT_RETRIES=2 extra runs when no report file exists); logic is correct but wall_time in test_status only captures the last attempt, understating cumulative CI cost.
tools/test_settings.py Doubles the per-test hard timeouts for four frequently-flaky tests (test_articulation, test_manager_based_rl_env_obs_spaces, test_visuotactile_sensor, test_visuotactile_render) from 1000 s to 2000 s; straightforward configuration change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start: run test attempt] --> B[capture_test_output_with_timeout]
    B --> C{kill_reason?}

    C -->|startup_hang| D{startup_hang_attempts\n< STARTUP_HANG_RETRIES?}
    D -->|Yes| E[Increment startup_hang_attempts\nPrint diagnostics]
    E --> A
    D -->|No| BREAK

    C -->|timeout| F{has_report?}
    F -->|Yes - test completed| BREAK
    F -->|No| G{timeout_attempts\n< TIMEOUT_RETRIES?}
    G -->|Yes| H[Increment timeout_attempts\nPrint stdout/stderr/diag]
    H --> A
    G -->|No| BREAK

    C -->|empty or other| BREAK

    BREAK[Break loop] --> I{Resolve result}
    I -->|startup_hang exhausted| J[Write startup_hang error report]
    I -->|timeout, no report| K[Write timeout error report]
    I -->|no report, other| L[Write crash error report]
    I -->|has_report| M[Parse JUnit XML results]
Loading

Comments Outside Diff (2)

  1. tools/conftest.py, line 459-467 (link)

    P2 wall_time undercounts CI time after retries

    After the retry loop, test_status[test_file]["wall_time"] is set to the wall time from only the last attempt. If a test exhausts all TIMEOUT_RETRIES (e.g., 3 × 2 000 s = 100 min of GPU time), the summary table and the Total Wall Time rollup at the end will only reflect ~2 000 s, making CI time analysis misleading. Consider accumulating wall time across attempts:

    # Before the while loop:
    total_wall_time = 0.0
    
    # Inside the loop, after capture_test_output_with_timeout returns:
    total_wall_time += wall_time
    
    # Then store total_wall_time in test_status instead of wall_time
  2. tools/conftest.py, line 402-405 (link)

    P2 Diagnostic truncation duplicates _get_diagnostics()

    The timeout-retry block manually replicates the truncation logic that the existing _get_diagnostics() helper (line 238) already encapsulates. Use the helper to stay consistent with how the startup-hang section and the final result section handle diagnostics:

Reviews (1): Last reviewed commit: "Adds retry logic for timeout tests" | Re-trigger Greptile

@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label Apr 30, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

The new commit addresses several previous concerns: timeouts were reverted to more reasonable values (test_articulation.py at 1500s instead of 2000s, and the other tests are back to 1000s). The Anymal_D expected load time was bumped slightly from 55s to 60s in the performance test.

Implementation Verdict

Ship it

Findings

Previous concerns about duplicated diagnostic printing code and inconsistent STDOUT/STDERR printing in retry paths remain but are minor code quality issues that don't affect correctness. The timeout increase concern is largely mitigated by the reverted values.

The implementation is correct and ready to merge. The retry logic follows the established pattern for startup hang retries, and the more conservative timeout adjustments reduce the risk of excessively long CI runs.

@kellyguo11 kellyguo11 merged commit 8bb6eff into isaac-sim:develop May 1, 2026
77 of 81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant