fixes bug where process isn't completed by the time the process gets read #675

RobotSail · 2025-12-18T21:26:09Z

This PR solves a race condition in the current codebase where process.poll() is called
before the child process has fully completed which causes the check of process.poll() != 0
to return True.

Summary by CodeRabbit

Bug Fixes
- Improved training subprocess exit handling with enhanced error reporting that includes process exit status
- Strengthened cleanup logic with graceful process termination followed by forced termination on timeout

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…read

coderabbitai · 2025-12-18T21:26:20Z

Walkthrough

Improved subprocess exit handling in the training module by implementing wait-for-exit logic to capture actual process exit codes, adding enhanced error logging with exit status details, and strengthening cleanup procedures with graceful termination followed by force-kill fallback if timeout occurs.

Changes

Cohort / File(s)	Summary
Subprocess exit handling enhancement `src/instructlab/training/main_ds.py`	Reworked subprocess exit handling to wait for training process completion and read exit codes; enhanced error logging to include process exit status; strengthened cleanup with 60-second graceful termination wait followed by forceful kill; added sys import

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Focus areas: The subprocess termination logic and timeout handling—verify that the 60-second grace period and subsequent kill mechanism work correctly across different scenarios (normal exit, timeout, interruption).
Secondary concern: Ensure the exit code capture and logging don't mask or obscure underlying training failures or provide incomplete error context.

Poem

🐰 A process once wild, now tamed with care,
Exit codes captured through the air,
Graceful shutdown, then the final blow,
Cleanup's dance, both swift and slow!
Better logs now tell the tale—
Error handling won't ever fail! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main issue: a race condition where process status is checked before completion, matching the core fix described in the objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/instructlab/training/main_ds.py (1)
9-9: Unused import sys.

The sys module is imported but does not appear to be used anywhere in this file. Consider removing it to keep imports clean.
🔎 Proposed fix
-import sys

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c495035 and bca48d4.

📒 Files selected for processing (1)

src/instructlab/training/main_ds.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: unit: 3.12 on ubuntu-latest
GitHub Check: pylint
GitHub Check: unit: 3.13 on ubuntu-latest
GitHub Check: unit: 3.11 on ubuntu-latest
GitHub Check: Summary

coderabbitai · 2025-12-18T21:28:56Z

src/instructlab/training/main_ds.py

+        # wait for the process to exit so we can properly read the exit code
+        process.wait(timeout=60)
+        process_code = process.poll()
+        failure = process_code != 0


⚠️ Potential issue | 🟠 Major

Missing exception handling for subprocess.TimeoutExpired.

process.wait(timeout=60) raises subprocess.TimeoutExpired if the process doesn't exit within the timeout. When this exception is raised, lines 651-652 are skipped, leaving failure = False. This could mask a real failure since the function might complete without raising RuntimeError at line 674.

🔎 Proposed fix

# wait for the process to exit so we can properly read the exit code - process.wait(timeout=60) - process_code = process.poll() - failure = process_code != 0 + try: + process.wait(timeout=60) + process_code = process.poll() + failure = process_code != 0 + except subprocess.TimeoutExpired: + logger.error("Training subprocess did not exit within 60 seconds.") + process_code = None + failure = True

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# wait for the process to exit so we can properly read the exit code

process.wait(timeout=60)

process_code = process.poll()

failure = process_code != 0

# wait for the process to exit so we can properly read the exit code

try:

process.wait(timeout=60)

process_code = process.poll()

failure = process_code != 0

except subprocess.TimeoutExpired:

logger.error("Training subprocess did not exit within 60 seconds.")

process_code = None

failure = True

🤖 Prompt for AI Agents

In src/instructlab/training/main_ds.py around lines 649 to 652, the call to process.wait(timeout=60) can raise subprocess.TimeoutExpired which is currently unhandled and will skip setting failure; catch subprocess.TimeoutExpired, handle it by terminating/killing the process (process.terminate() then process.kill() if needed), call process.wait() again to reap it, set failure = True (and capture the final exit code via process.poll()), and ensure any necessary stdout/stderr are read before proceeding so the function does not silently treat a timed-out process as a success.

coderabbitai · 2025-12-18T21:28:57Z

src/instructlab/training/main_ds.py

+            logger.error(
+                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+            )


⚠️ Potential issue | 🟡 Minor

Misleading error message.

This branch is taken when process_code != 0, meaning the process has already exited with a non-zero exit code. The message "Training subprocess has not exited yet" is incorrect and confusing. It should indicate the process exited with an error.

🔎 Proposed fix

else: logger.error( - f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}" + f"Training subprocess exited with non-zero code: {process_code}" )

Note: After fixing this, you may also want to reconsider whether calling process.terminate() on line 661 is appropriate when the process has already exited (it's harmless but unnecessary).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

logger.error(

f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"

)

logger.error(

f"Training subprocess exited with non-zero code: {process_code}"

)

🤖 Prompt for AI Agents

In src/instructlab/training/main_ds.py around lines 657 to 659, the log message incorrectly says "Training subprocess has not exited yet" even though this branch is executed when process_code != 0 (the subprocess already exited with a non-zero/error code); change the log to state that the subprocess exited with an error and include the actual exit code and any available stderr/exception info (e.g., "Training subprocess exited with non-zero code: {process_code}"); remove or guard the subsequent process.terminate() call since the process is already exited (or make it conditional only when the process is still alive) to avoid unnecessary termination attempts.

fixes bug where process isn't completed by the time the process gets …

bca48d4

…read

mergify bot added the ci-failure label Dec 18, 2025

coderabbitai bot reviewed Dec 18, 2025

View reviewed changes

RobotSail merged commit 638a753 into instructlab:main Dec 18, 2025
13 of 18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fixes bug where process isn't completed by the time the process gets read #675

fixes bug where process isn't completed by the time the process gets read #675

Uh oh!

RobotSail commented Dec 18, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 18, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 18, 2025

Uh oh!

coderabbitai bot Dec 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fixes bug where process isn't completed by the time the process gets read #675

fixes bug where process isn't completed by the time the process gets read #675

Uh oh!

Conversation

RobotSail commented Dec 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

RobotSail commented Dec 18, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 18, 2025 •

edited

Loading