-
Notifications
You must be signed in to change notification settings - Fork 4
Add orchestrator health monitoring and recovery with security improvements #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
newhook
merged 20 commits into
main
from
feat/work-mode-the-zellij-pane-with-the-orchestration-d
Jan 18, 2026
Merged
Add orchestrator health monitoring and recovery with security improvements #124
newhook
merged 20 commits into
main
from
feat/work-mode-the-zellij-pane-with-the-orchestration-d
Jan 18, 2026
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…king to orchestrator - Enhanced EnsureWorkOrchestrator to check if process is actually running (not just tab exists) - Use pgrep to detect running orchestrator processes - Restart dead orchestrators by closing and recreating tabs - Added last_activity column to tasks table for health monitoring - Update activity every 30 seconds in orchestrator main loop - Update activity when starting task execution - This provides automatic recovery and visibility into hung orchestrators Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
- Added 'O' keyboard shortcut to restart orchestrator - Kills running orchestrator process with pkill - Calls EnsureWorkOrchestrator to spawn new instance - Added to help screen for discoverability Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
…loop - Added task_timeout_minutes configuration to ClaudeConfig - Default timeout of 60 minutes for task execution - Uses context.WithTimeout for proper cancellation - Marks task as failed on timeout with descriptive error - Configurable via config.toml [claude] section Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
- Added checkOrchestratorHealth function using pgrep - Shows green/red dot indicator in work panel header - Added explicit health status line showing 'Orchestrator running' or 'Orchestrator dead' - Only displays for works with processing status or active tasks Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
- Updated all StartTask calls to include the new worktreePath parameter - Replaced IsTaskCompleted method calls with CountTaskBeadStatuses logic - All database tests now pass successfully
- Moved activity ticker to a separate goroutine - Removed select statement with default case from main loop - Eliminates CPU-consuming busy loop that occurred after task execution
- Created internal/process package with cross-platform process detection - Implemented Unix version using ps command for process listing - Implemented Windows version using wmic/PowerShell for process listing - Replaced pgrep usage in cmd/tui_work.go (checkOrchestratorHealth and restart) - Replaced pgrep usage in internal/claude/runner.go (EnsureWorkOrchestrator) - Added unit tests for the process detection utilities
…revent future commits - Remove db.test (12MB) from git tracking - Add *.test pattern to .gitignore to prevent test binaries from being committed
…rocess killer - Added escapeForPowerShell function to properly escape single quotes - Applied escaping to pattern parameter before PowerShell interpolation - Prevents command injection through malicious pattern strings
- Deleted process_windows.go with all Windows-specific code - Renamed process_unix.go to process_impl.go as the sole implementation - Removed build constraints since Unix implementation is now universal - All tests pass with the simplified single-platform approach
…Process function - Added escapePattern function to safely escape shell patterns using single quotes - Pattern is now properly escaped before being passed to pkill to prevent command injection - Handles embedded single quotes using standard shell escaping technique - Maintains full functionality while preventing security vulnerability
- Added comprehensive tests for IsProcessRunning with edge cases - Added tests for KillProcess including actual process killing - Added tests for escapePattern function - Added tests for getProcessList function - Added tests for context cancellation scenarios - Added tests for error handling and command injection prevention - Achieved 95.6% test coverage (up from minimal coverage)
- Added [o]rchestrator button to the zoomed view status bar - Implemented full mouse support with hover highlighting and click handling - Changed keyboard shortcut from uppercase O to lowercase o for consistency - Updated help text to reflect the new lowercase command - Button appears between [c]laude and [v]review for easy access when orchestrator dies Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
Resolved merge conflicts in cmd/tui_work.go: - Kept main branch's reorganized help text structure - Added orchestrator restart command [o] in proper position - Maintained lowercase key convention from main branch Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
This PR implements comprehensive health monitoring and recovery capabilities for the orchestrator, along with critical security fixes and cross-platform improvements. The main goal was to address the issue where a dead Zellij pane left the orchestration in an unrecoverable state.
Key Features
1. Orchestrator Health Monitoring & Recovery
ac-k5ho.1)rkey command in TUI work mode to manually restart stuck orchestrators (ac-k5ho.2)ac-k5ho.3)ac-k5ho.4)ac-k5ho.5)2. Security Improvements
ac-5gl6.1,ac-6ppj.2)ac-5gl6.2)ac-6ppj.1)3. Cross-Platform Process Management
ac-mly3.3)ac-j003.2)ac-6ppj.3)4. Code Quality & Testing
ac-5gl6.3)ac-mly3.1)ac-mly3.2)ac-j003.1,ac-6ppj.4,ac-mly3.4)ac-mly3.5)Technical Details
Database Schema Changes
last_activitycolumn to tasks table for health monitoring019_add_last_activity.sqlNew Process Management Package
internal/process/package with:API Changes
StartTasknow accepts context for cancellation supportUpdateTaskActivityadded for health monitoringproj.Config.Claude.GetTaskTimeout()TUI Enhancements
rkeyTesting Performed
Breaking Changes
None. All changes are backward compatible.
Security Considerations
This PR addresses critical security vulnerabilities:
Issues Resolved
Review Checklist