Skip to content

[HIGH] Improve Temp File Management and Race Condition Handling #142

@plusplusoneplusplus

Description

@plusplusoneplusplus

Problem

The current temp file management in CommandExecutor has several issues:

  1. Race Conditions: Monitor tasks skip cleanup to avoid deadlocks, potentially leaving orphaned temp files
  2. Incomplete Cleanup: Failed temp file deletions are logged but not retried
  3. No Orphan Detection: No mechanism to detect and clean up orphaned temp files from crashed processes
  4. Limited Error Recovery: Temp file creation failures don't have robust fallback mechanisms

Current Issues

Race Condition in Cleanup

# Line ~250: Monitor tasks skip cleanup to avoid deadlocks
if from_monitor:
    _log_with_context(
        logging.DEBUG,
        f"Monitor skipping cleanup for PID {pid}",
        {"pid": pid, "from_monitor": from_monitor},
    )
    return

Failed Cleanup Handling

# Lines ~270-290: Errors are logged but not retried
except Exception as e:
    _log_with_context(
        logging.WARNING,
        f"Error removing stdout temp file",
        {"pid": pid, "path": stdout_path, "error": str(e)},
    )

Proposed Solutions

1. Improved Cleanup Coordination

  • Replace simple lock with more sophisticated coordination
  • Implement cleanup retry mechanism with exponential backoff
  • Add cleanup task queue for failed operations

2. Orphaned File Detection

  • Periodic scan for temp files without associated processes
  • Age-based cleanup for old temp files
  • Startup cleanup to handle files from previous crashes

3. Enhanced Error Recovery

  • Retry temp file creation on failure
  • Fallback temp directories if primary fails
  • Better error reporting and recovery strategies

4. Monitoring and Metrics

  • Track temp file creation/deletion success rates
  • Monitor temp directory disk usage
  • Alert on cleanup failures or orphaned files

Implementation Plan

Phase 1: Fix Race Conditions

class TempFileManager:
    def __init__(self):
        self.cleanup_queue = asyncio.Queue()
        self.cleanup_task = None
        
    async def schedule_cleanup(self, pid: int, files: Tuple[str, str]):
        \"\"\"Schedule cleanup without blocking\"\"\"
        await self.cleanup_queue.put((pid, files))
        
    async def _cleanup_worker(self):
        \"\"\"Background worker for cleanup operations\"\"\"
        while True:
            pid, (stdout_path, stderr_path) = await self.cleanup_queue.get()
            await self._safe_cleanup(pid, stdout_path, stderr_path)

Phase 2: Orphan Detection

async def cleanup_orphaned_files(self):
    \"\"\"Find and clean up orphaned temp files\"\"\"
    for file_path in self.temp_dir.glob("cmd_*"):
        if self._is_orphaned(file_path):
            await self._safe_remove(file_path)
            
def _is_orphaned(self, file_path: Path) -> bool:
    \"\"\"Check if temp file is orphaned\"\"\"
    # Check file age, associated process existence, etc.
    pass

Phase 3: Enhanced Error Recovery

async def _create_temp_files_with_retry(self, prefix: str, max_retries: int = 3):
    \"\"\"Create temp files with retry logic\"\"\"
    for attempt in range(max_retries):
        try:
            return self._create_temp_files(prefix)
        except OSError as e:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

New Configuration Options

# Temp file management settings
self.temp_cleanup_retry_attempts = 3
self.temp_cleanup_retry_delay = 1.0
self.orphan_cleanup_interval = 3600  # 1 hour
self.orphan_file_max_age = 7200      # 2 hours
self.temp_dir_max_size_mb = 1024     # 1GB limit

Benefits

  • Reliability: Reduced temp file leaks and orphaned files
  • Performance: Better resource management and cleanup
  • Monitoring: Improved observability of temp file operations
  • Recovery: Better handling of edge cases and failures

Acceptance Criteria

  • Race conditions in cleanup eliminated
  • Retry mechanism for failed cleanup operations
  • Orphaned file detection and cleanup
  • Configurable cleanup policies
  • Comprehensive error handling and logging
  • Metrics for temp file operations
  • No regression in performance
  • Backward compatibility maintained

Priority

P1 (High) - Important for system reliability and resource management

Files to Modify

  • mcp_tools/command_executor/executor.py
  • New: mcp_tools/command_executor/temp_file_manager.py
  • Configuration files
  • Tests for all cleanup scenarios

Testing Requirements

  • Test cleanup under various failure conditions
  • Test orphaned file detection and cleanup
  • Test retry mechanisms
  • Test concurrent cleanup operations
  • Test disk space handling
  • Performance tests for cleanup operations

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions