Skip to content

Conversation

@RobotSail
Copy link
Member

@RobotSail RobotSail commented Dec 18, 2025

This PR solves a race condition in the current codebase where process.poll() is called
before the child process has fully completed which causes the check of process.poll() != 0
to return True.

Summary by CodeRabbit

  • Bug Fixes
    • Improved training subprocess exit handling with enhanced error reporting that includes process exit status
    • Strengthened cleanup logic with graceful process termination followed by forced termination on timeout

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

Walkthrough

Improved subprocess exit handling in the training module by implementing wait-for-exit logic to capture actual process exit codes, adding enhanced error logging with exit status details, and strengthening cleanup procedures with graceful termination followed by force-kill fallback if timeout occurs.

Changes

Cohort / File(s) Summary
Subprocess exit handling enhancement
src/instructlab/training/main_ds.py
Reworked subprocess exit handling to wait for training process completion and read exit codes; enhanced error logging to include process exit status; strengthened cleanup with 60-second graceful termination wait followed by forceful kill; added sys import

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Focus areas: The subprocess termination logic and timeout handling—verify that the 60-second grace period and subsequent kill mechanism work correctly across different scenarios (normal exit, timeout, interruption).
  • Secondary concern: Ensure the exit code capture and logging don't mask or obscure underlying training failures or provide incomplete error context.

Poem

🐰 A process once wild, now tamed with care,
Exit codes captured through the air,
Graceful shutdown, then the final blow,
Cleanup's dance, both swift and slow!
Better logs now tell the tale—
Error handling won't ever fail! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main issue: a race condition where process status is checked before completion, matching the core fix described in the objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci-failure label Dec 18, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)

9-9: Unused import sys.

The sys module is imported but does not appear to be used anywhere in this file. Consider removing it to keep imports clean.

🔎 Proposed fix
-import sys
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c495035 and bca48d4.

📒 Files selected for processing (1)
  • src/instructlab/training/main_ds.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: Summary

Comment on lines +649 to +652
# wait for the process to exit so we can properly read the exit code
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing exception handling for subprocess.TimeoutExpired.

process.wait(timeout=60) raises subprocess.TimeoutExpired if the process doesn't exit within the timeout. When this exception is raised, lines 651-652 are skipped, leaving failure = False. This could mask a real failure since the function might complete without raising RuntimeError at line 674.

🔎 Proposed fix
         # wait for the process to exit so we can properly read the exit code
-        process.wait(timeout=60)
-        process_code = process.poll()
-        failure = process_code != 0
+        try:
+            process.wait(timeout=60)
+            process_code = process.poll()
+            failure = process_code != 0
+        except subprocess.TimeoutExpired:
+            logger.error("Training subprocess did not exit within 60 seconds.")
+            process_code = None
+            failure = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# wait for the process to exit so we can properly read the exit code
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
# wait for the process to exit so we can properly read the exit code
try:
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
except subprocess.TimeoutExpired:
logger.error("Training subprocess did not exit within 60 seconds.")
process_code = None
failure = True
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 649 to 652, the call to
process.wait(timeout=60) can raise subprocess.TimeoutExpired which is currently
unhandled and will skip setting failure; catch subprocess.TimeoutExpired, handle
it by terminating/killing the process (process.terminate() then process.kill()
if needed), call process.wait() again to reap it, set failure = True (and
capture the final exit code via process.poll()), and ensure any necessary
stdout/stderr are read before proceeding so the function does not silently treat
a timed-out process as a success.

Comment on lines +657 to +659
logger.error(
f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Misleading error message.

This branch is taken when process_code != 0, meaning the process has already exited with a non-zero exit code. The message "Training subprocess has not exited yet" is incorrect and confusing. It should indicate the process exited with an error.

🔎 Proposed fix
         else:
             logger.error(
-                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+                f"Training subprocess exited with non-zero code: {process_code}"
             )

Note: After fixing this, you may also want to reconsider whether calling process.terminate() on line 661 is appropriate when the process has already exited (it's harmless but unnecessary).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logger.error(
f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
)
logger.error(
f"Training subprocess exited with non-zero code: {process_code}"
)
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 657 to 659, the log message
incorrectly says "Training subprocess has not exited yet" even though this
branch is executed when process_code != 0 (the subprocess already exited with a
non-zero/error code); change the log to state that the subprocess exited with an
error and include the actual exit code and any available stderr/exception info
(e.g., "Training subprocess exited with non-zero code: {process_code}"); remove
or guard the subsequent process.terminate() call since the process is already
exited (or make it conditional only when the process is still alive) to avoid
unnecessary termination attempts.

@RobotSail RobotSail merged commit 638a753 into instructlab:main Dec 18, 2025
13 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant