fixes bug where process isn't completed by the time the process gets read #675
Walkthrough
Improved subprocess exit handling in the training module by implementing wait-for-exit logic to capture actual process exit codes, adding enhanced error logging with exit status details, and strengthening cleanup procedures with graceful termination followed by force-kill fallback if timeout occurs.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)
9-9: Unused import sys.
The sys module is imported but does not appear to be used anywhere in this file. Consider removing it to keep imports clean.
🔎 Proposed fix
-import sys
📒 Files selected for processing (1)
src/instructlab/training/main_ds.py (2 hunks)
# wait for the process to exit so we can properly read the exit code
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
Missing exception handling for subprocess.TimeoutExpired.
process.wait(timeout=60) raises subprocess.TimeoutExpired if the process doesn't exit within the timeout. When this exception is raised, lines 651-652 are skipped, leaving failure = False. This could mask a real failure since the function might complete without raising RuntimeError at line 674.
🔎 Proposed fix
  # wait for the process to exit so we can properly read the exit code
- process.wait(timeout=60)
- process_code = process.poll()
- failure = process_code != 0
+ try:
+     process.wait(timeout=60)
+     process_code = process.poll()
+     failure = process_code != 0
+ except subprocess.TimeoutExpired:
+     logger.error("Training subprocess did not exit within 60 seconds.")
+     process_code = None
+     failure = True
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 649 to 652, the call to
process.wait(timeout=60) can raise subprocess.TimeoutExpired which is currently
unhandled and will skip setting failure; catch subprocess.TimeoutExpired, handle
it by terminating/killing the process (process.terminate() then process.kill()
if needed), call process.wait() again to reap it, set failure = True (and
capture the final exit code via process.poll()), and ensure any necessary
stdout/stderr are read before proceeding so the function does not silently treat
a timed-out process as a success.
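The handling described in this prompt could look roughly like the sketch below. It is a hedged illustration, not the code in main_ds.py: the helper name wait_for_exit and the kill_grace parameter are hypothetical, and only standard subprocess.Popen methods are used.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def wait_for_exit(process, timeout=60.0, kill_grace=10.0):
    """Wait for a subprocess.Popen and return (exit_code, failure).

    Hypothetical helper: on timeout it escalates from SIGTERM to SIGKILL
    and always reaps the child, so a timeout is never reported as success.
    """
    try:
        exit_code = process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        logger.error("Subprocess did not exit within %s seconds; terminating.", timeout)
        process.terminate()
        try:
            process.wait(timeout=kill_grace)
        except subprocess.TimeoutExpired:
            logger.error("Subprocess ignored terminate(); killing it.")
            process.kill()
            process.wait()  # reap the killed child
        # The child has been reaped here, so poll() returns its final code.
        return process.poll(), True
    return exit_code, exit_code != 0


if __name__ == "__main__":
    # Demo: a child that sleeps longer than the timeout is terminated.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(wait_for_exit(child, timeout=0.5))
```

Returning the failure flag together with the exit code keeps the timeout path from falling through with failure left unset.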
logger.error(
    f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
)
Misleading error message.
This branch is taken when process_code != 0, meaning the process has already exited with a non-zero exit code. The message "Training subprocess has not exited yet" is incorrect and confusing. It should indicate the process exited with an error.
🔎 Proposed fix
  else:
      logger.error(
-         f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+         f"Training subprocess exited with non-zero code: {process_code}"
      )
Note: After fixing this, you may also want to reconsider whether calling process.terminate() on line 661 is appropriate when the process has already exited (it's harmless but unnecessary).
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 657 to 659, the log message
incorrectly says "Training subprocess has not exited yet" even though this
branch is executed when process_code != 0 (the subprocess already exited with a
non-zero/error code); change the log to state that the subprocess exited with an
error and include the actual exit code and any available stderr/exception info
(e.g., "Training subprocess exited with non-zero code: {process_code}"); remove
or guard the subsequent process.terminate() call since the process is already
exited (or make it conditional only when the process is still alive) to avoid
unnecessary termination attempts.
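One way to guard the terminate() call is to branch on whether poll() returns None. The sketch below assumes a hypothetical helper named report_and_cleanup and a hypothetical success message; it is not taken from the PR, and the caller is still expected to wait on the process after a terminate.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def report_and_cleanup(process):
    """Log a message matching the child's real state and return that message.

    Hypothetical helper: terminate() is only sent when poll() returns None,
    which means the child is still running.
    """
    exit_code = process.poll()
    if exit_code is None:
        message = "Training subprocess has not exited yet. Sending SIGTERM."
        logger.error(message)
        process.terminate()  # caller should follow up with process.wait()
    elif exit_code != 0:
        message = f"Training subprocess exited with non-zero code: {exit_code}"
        logger.error(message)
    else:
        message = "Training subprocess exited successfully."
        logger.info(message)
    return message


if __name__ == "__main__":
    # Demo: a child that exits with status 3 is reported, not terminated.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    child.wait()
    print(report_and_cleanup(child))
```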
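This PR fixes a race condition in the current codebase where process.poll() is called before the child process has fully completed. While the child is still running, poll() returns None, so the check process.poll() != 0 evaluates to True and a run that would have succeeded is reported as a failure.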
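The race can be reproduced in isolation with a short sketch. The function name show_poll_race and the sleeping child command are illustrative assumptions, not part of the training code.

```python
import subprocess
import sys


def show_poll_race():
    """Compare poll() taken too early with the exit code after wait()."""
    # Child that runs for a couple of seconds and then exits cleanly.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(2)"])

    early_code = child.poll()          # child is still running, so this is None
    early_failure = early_code != 0    # None != 0 is True: a false failure

    final_code = child.wait(timeout=30)  # block until the child really exits
    late_failure = final_code != 0

    return early_code, early_failure, final_code, late_failure


if __name__ == "__main__":
    print(show_poll_race())
```

Reading the exit code only after wait() returns is what removes the false failure.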