-
Notifications
You must be signed in to change notification settings - Fork 7
Closed
Labels
Description
Problem
The current temp file management in CommandExecutor has several issues:
- Race Conditions: Monitor tasks skip cleanup to avoid deadlocks, potentially leaving orphaned temp files
- Incomplete Cleanup: Failed temp file deletions are logged but not retried
- No Orphan Detection: No mechanism to detect and clean up orphaned temp files from crashed processes
- Limited Error Recovery: Temp file creation failures don't have robust fallback mechanisms
Current Issues
Race Condition in Cleanup
# Line ~250: Monitor tasks skip cleanup to avoid deadlocks
if from_monitor:
_log_with_context(
logging.DEBUG,
f"Monitor skipping cleanup for PID {pid}",
{"pid": pid, "from_monitor": from_monitor},
)
returnFailed Cleanup Handling
# Lines ~270-290: Errors are logged but not retried
except Exception as e:
_log_with_context(
logging.WARNING,
f"Error removing stdout temp file",
{"pid": pid, "path": stdout_path, "error": str(e)},
)Proposed Solutions
1. Improved Cleanup Coordination
- Replace simple lock with more sophisticated coordination
- Implement cleanup retry mechanism with exponential backoff
- Add cleanup task queue for failed operations
2. Orphaned File Detection
- Periodic scan for temp files without associated processes
- Age-based cleanup for old temp files
- Startup cleanup to handle files from previous crashes
3. Enhanced Error Recovery
- Retry temp file creation on failure
- Fallback temp directories if primary fails
- Better error reporting and recovery strategies
4. Monitoring and Metrics
- Track temp file creation/deletion success rates
- Monitor temp directory disk usage
- Alert on cleanup failures or orphaned files
Implementation Plan
Phase 1: Fix Race Conditions
class TempFileManager:
def __init__(self):
self.cleanup_queue = asyncio.Queue()
self.cleanup_task = None
async def schedule_cleanup(self, pid: int, files: Tuple[str, str]):
\"\"\"Schedule cleanup without blocking\"\"\"
await self.cleanup_queue.put((pid, files))
async def _cleanup_worker(self):
\"\"\"Background worker for cleanup operations\"\"\"
while True:
pid, (stdout_path, stderr_path) = await self.cleanup_queue.get()
await self._safe_cleanup(pid, stdout_path, stderr_path)Phase 2: Orphan Detection
async def cleanup_orphaned_files(self):
\"\"\"Find and clean up orphaned temp files\"\"\"
for file_path in self.temp_dir.glob("cmd_*"):
if self._is_orphaned(file_path):
await self._safe_remove(file_path)
def _is_orphaned(self, file_path: Path) -> bool:
\"\"\"Check if temp file is orphaned\"\"\"
# Check file age, associated process existence, etc.
passPhase 3: Enhanced Error Recovery
async def _create_temp_files_with_retry(self, prefix: str, max_retries: int = 3):
\"\"\"Create temp files with retry logic\"\"\"
for attempt in range(max_retries):
try:
return self._create_temp_files(prefix)
except OSError as e:
if attempt == max_retries - 1:
raise
await asyncio.sleep(2 ** attempt) # Exponential backoffNew Configuration Options
# Temp file management settings
self.temp_cleanup_retry_attempts = 3
self.temp_cleanup_retry_delay = 1.0
self.orphan_cleanup_interval = 3600 # 1 hour
self.orphan_file_max_age = 7200 # 2 hours
self.temp_dir_max_size_mb = 1024 # 1GB limitBenefits
- Reliability: Reduced temp file leaks and orphaned files
- Performance: Better resource management and cleanup
- Monitoring: Improved observability of temp file operations
- Recovery: Better handling of edge cases and failures
Acceptance Criteria
- Race conditions in cleanup eliminated
- Retry mechanism for failed cleanup operations
- Orphaned file detection and cleanup
- Configurable cleanup policies
- Comprehensive error handling and logging
- Metrics for temp file operations
- No regression in performance
- Backward compatibility maintained
Priority
P1 (High) - Important for system reliability and resource management
Files to Modify
mcp_tools/command_executor/executor.py- New:
mcp_tools/command_executor/temp_file_manager.py - Configuration files
- Tests for all cleanup scenarios
Testing Requirements
- Test cleanup under various failure conditions
- Test orphaned file detection and cleanup
- Test retry mechanisms
- Test concurrent cleanup operations
- Test disk space handling
- Performance tests for cleanup operations