Debug CI crash#5621
Conversation
There was a problem hiding this comment.
CI Debugging Changes Review
This PR adds diagnostic tooling to help debug CI crashes. The changes are straightforward and low-risk.
Summary of Changes
.github/actions/run-tests/action.yml
- Adds
PYTHONFAULTHANDLER=1environment variable to the Docker test container
tools/conftest.py
- Adds
-sflag (disables output capture, shows print statements) - Adds
-vvflag (extra verbose output)
Analysis
✅ What Works Well
-
PYTHONFAULTHANDLER - Excellent choice for debugging segfaults and crashes. When enabled, Python will dump a traceback on fatal signals (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), which is invaluable for diagnosing low-level crashes in CI.
-
Pytest verbosity flags - The
-sand-vvflags will provide much more insight into test execution, particularly useful for identifying where crashes occur.
⚠️ Considerations
-
Temporary vs Permanent: These flags significantly increase log verbosity. Consider whether this is intended as a temporary debugging measure or a permanent change. If temporary, consider adding a comment or creating a follow-up issue to revert once the crash is diagnosed.
-
CI Log Size: The
-s -vvflags will substantially increase CI log output. This is fine for debugging but may impact log readability and storage for routine CI runs if left permanently enabled. -
Performance: Minimal impact expected. PYTHONFAULTHANDLER has negligible overhead, and pytest verbosity flags do not affect test execution time.
Verdict
The changes are safe to merge for debugging purposes. The implementation is correct and follows the existing patterns in the codebase.
Suggestion: If these changes are intended to be temporary, consider adding a TODO comment or creating a tracking issue to revert them once the CI crash is resolved.
Update (39a7a1e): New commit adds include-files: "test_sensor_base.py" to narrow test-isaaclab-core-2 to a single test file for SIGSEGV reproduction. This is a reasonable debugging approach to isolate the crash. Previous concerns about debug flags still apply once the issue is diagnosed.
Greptile SummaryThis PR adds two diagnostic aids to help investigate CI crashes:
Confidence Score: 4/5Safe to merge for its stated debugging purpose; no functional regressions introduced, though log verbosity increases substantially on every CI run going forward. Both changes are additive diagnostic flags with no effect on test pass/fail outcomes. The main risk is tools/conftest.py — the -s flag interacts with the in-memory pipe accumulation in capture_test_output_with_timeout Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Docker Container
PYTHONFAULTHANDLER=1
new] --> B[conftest.py
run_individual_tests]
B --> C[subprocess.Popen
stdout=PIPE, stderr=PIPE]
C --> D[pytest subprocess
-s -vv new
--no-header
--tb=short]
D -->|stdout/stderr ALL output unbuffered| E[capture_test_output_with_timeout
accumulates stdout_data in memory]
E -->|crash / segfault| F[PYTHONFAULTHANDLER traceback]
E -->|normal exit| G[JUnit XML + log output]
E -->|timeout / hang| H[kill + system diagnostics]
Reviews (1): Last reviewed commit: "Add PYTHONFAULTHANDLER" | Re-trigger Greptile |
| "-s", | ||
| "-vv", |
There was a problem hiding this comment.
Unbounded memory growth with
-s flag
-s disables pytest's internal output capture so all test stdout flows directly into the subprocess pipe. capture_test_output_with_timeout accumulates the entire pipe content in memory (stdout_data += chunk). Isaac Sim tests are extremely verbose — simulation logs, Kit messages, shader compilation output — and the -s flag means none of that is buffered or discarded by pytest. On a long-running test, stdout_data can grow to hundreds of MB inside the conftest runner process, which could trigger an OOM kill of the runner itself rather than the test under investigation.
| "-s", | ||
| "-vv", |
There was a problem hiding this comment.
Debug flags appear intended as temporary but have no revert marker
-s and -vv are diagnostic flags that significantly increase log volume and are typically added temporarily while investigating a specific crash. There is no comment in the code or the PR description linking these to a follow-up issue or indicating they are intended to be permanent. If these land on develop without a revert plan, every subsequent CI run will produce substantially larger logs, which can make it harder to find signal in the output once the crash is resolved.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.
Fixes # (issue)
Type of change
Screenshots
Please attach before and after screenshots of the change if applicable.
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there