Add resilient filesystem error handling with retry logic #29

@gregpriday

Summary

Implement automatic retry with exponential backoff for transient filesystem errors (EBUSY, EPERM, EMFILE, EAGAIN) to improve reliability on Windows, network drives, and under high file descriptor pressure.

Current Behavior

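Filesystem operations give up on the first transient error. Directories that cannot be read are skipped without any log entry, and files that fail to load are logged once and then skipped: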

In ignoreWalker.js:

try {
  entries = await fs.readdir(dir, { withFileTypes: true });
} catch (error) {
  // Can't read directory - skip it
  return;
}

In fileLoader.js:

} catch (error) {
  logger.error('Failed to load file', {
    path: relativePath,
    error: error.message,
  });
  return null;  // File silently skipped
}

Common failure scenarios:

  • Windows: Antivirus holds file locks → EBUSY, EPERM
  • Network drives: Temporary disconnects → ETIMEDOUT, EIO
  • High concurrency: File descriptor limits → EMFILE, ENFILE
  • Race conditions: Files being written → ENOENT (transient)

Expected Behavior

Transient filesystem errors should trigger automatic retry with exponential backoff:

  1. First attempt fails with EBUSY
  2. Wait 100ms, retry (attempt 2)
  3. Wait 200ms, retry (attempt 3)
  4. Wait 400ms, retry (attempt 4)
  5. If all retries fail, log a detailed error and continue (or exit with a failure when --fail-on-fs-errors is set)
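The schedule above can be sketched as a small helper. This is only an illustration: `backoffDelays` is a hypothetical name, and the 2000ms cap is taken from the proposed `maxDelay` default further down.

```javascript
// Hypothetical helper (not part of the codebase): lists the wait in
// milliseconds before each retry, doubling every time up to maxDelay.
function backoffDelays(retries = 3, initialDelay = 100, maxDelay = 2000) {
  const delays = [];
  for (let retry = 0; retry < retries; retry++) {
    delays.push(Math.min(initialDelay * 2 ** retry, maxDelay));
  }
  return delays;
}

console.log(backoffDelays());  // [ 100, 200, 400 ]
console.log(backoffDelays(6)); // [ 100, 200, 400, 800, 1600, 2000 ]
```

With three retries the worst-case added latency per operation is 700ms, which is where the cap starts to matter only if the retry count is raised.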

Environment

This issue affects all platforms but is most critical on:

  • Windows (antivirus, indexing services, file locks)
  • macOS with aggressive Spotlight indexing
  • Network-mounted filesystems (NFS, SMB, SSHFS)
  • CI environments with high concurrency

Evidence

Existing retry infrastructure (AI services only):

Current codes (network-focused):

export const RETRYABLE_ERROR_CODES = [
  'RATE_LIMIT', 'TIMEOUT', 'SERVICE_UNAVAILABLE',
  'NETWORK_ERROR', 'TEMPORARY_FAILURE',
  'ENOTFOUND', 'ETIMEDOUT', 'ECONNRESET', 'ECONNABORTED'
];

Missing filesystem codes:

  • EBUSY - Resource busy (file locked)
  • EPERM - Operation not permitted (transient on Windows)
  • EMFILE - Too many open files
  • ENFILE - File table overflow
  • EAGAIN - Resource temporarily unavailable
  • EIO - I/O error (network drives)

Affected files:

Root Cause

  • No retry logic for filesystem operations (only for AI service calls)
  • Silent failures instead of structured error aggregation
  • Binary fallback (lines 91-116) masks underlying transient errors

Proposed Solution

1. Add Filesystem Error Codes

Update src/utils/errors.js:

export const RETRYABLE_ERROR_CODES = [
  // Network (existing, unchanged)
  'RATE_LIMIT', 'TIMEOUT', 'SERVICE_UNAVAILABLE',
  'NETWORK_ERROR', 'TEMPORARY_FAILURE',
  'ENOTFOUND', 'ETIMEDOUT', 'ECONNRESET', 'ECONNABORTED',

  // Filesystem (new)
  'EBUSY',    // Resource busy
  'EPERM',    // Operation not permitted (Windows antivirus locks)
  'EMFILE',   // Too many open files
  'ENFILE',   // File table overflow
  'EAGAIN',   // Resource temporarily unavailable
  'EIO',      // I/O error (network drives)
];
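For illustration, a check against the new codes might look like the sketch below. It relies only on the fact that Node.js filesystem errors carry a string `code` property; `RETRYABLE_FS_CODES` and `isRetryableFsError` are hypothetical names, and the project's real `isRetryableError` may be implemented differently.

```javascript
// Hypothetical names; only the filesystem codes proposed above are listed.
const RETRYABLE_FS_CODES = new Set([
  'EBUSY', 'EPERM', 'EMFILE', 'ENFILE', 'EAGAIN', 'EIO',
]);

// Returns true when the error carries one of the transient filesystem codes.
function isRetryableFsError(error) {
  return Boolean(error)
    && typeof error.code === 'string'
    && RETRYABLE_FS_CODES.has(error.code);
}

const busy = Object.assign(new Error('resource busy'), { code: 'EBUSY' });
const missing = Object.assign(new Error('no such file'), { code: 'ENOENT' });
console.log(isRetryableFsError(busy));    // true
console.log(isRetryableFsError(missing)); // false
```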

2. Create Retry Utility

New file: src/utils/retryableFs.js

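import { isRetryableError } from './errors.js';

// Runs `operation`, retrying transient failures with exponential backoff.
export async function withRetry(operation, options = {}) {
  // 1 initial attempt + 3 retries, matching the 100ms/200ms/400ms schedule.
  // `??` keeps explicit zero values instead of replacing them with defaults.
  const maxAttempts = options.maxAttempts ?? 4;
  const initialDelay = options.initialDelay ?? 100;
  const maxDelay = options.maxDelay ?? 2000;

  // The operation always runs at least once, even if maxAttempts < 1.
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on a non-transient error.
      if (attempt >= maxAttempts || !isRetryableError(error)) {
        throw error;
      }

      const delay = Math.min(initialDelay * Math.pow(2, attempt - 1), maxDelay);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}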
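A rough way to exercise the utility without touching the real filesystem is a mock operation that fails a fixed number of times. The snippet below is a sketch under assumptions: it carries its own condensed stand-in for `withRetry()` (retry check hard-coded, options passed explicitly) so it runs standalone, and `flakyOperation` is a hypothetical test helper.

```javascript
// Condensed stand-in for withRetry() so this snippet runs on its own.
const RETRYABLE = new Set(['EBUSY', 'EPERM', 'EMFILE', 'ENFILE', 'EAGAIN', 'EIO']);

async function withRetry(operation, { maxAttempts, initialDelay, maxDelay }) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxAttempts || !RETRYABLE.has(error.code)) throw error;
      const delay = Math.min(initialDelay * 2 ** (attempt - 1), maxDelay);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Hypothetical mock: rejects `failures` times with `code`, then resolves.
function flakyOperation(failures, code) {
  let calls = 0;
  const op = async () => {
    calls += 1;
    if (calls <= failures) {
      throw Object.assign(new Error(`${code}: simulated`), { code });
    }
    return 'ok';
  };
  op.callCount = () => calls;
  return op;
}

async function demo() {
  // Tiny delays keep the demo fast; real defaults would be 100ms and up.
  const opts = { maxAttempts: 4, initialDelay: 1, maxDelay: 10 };

  const busy = flakyOperation(2, 'EBUSY');
  console.log(await withRetry(busy, opts), busy.callCount()); // ok 3

  const missing = flakyOperation(1, 'ENOENT');
  try {
    await withRetry(missing, opts);
  } catch (error) {
    console.log(error.code, missing.callCount()); // ENOENT 1
  }
}

demo();
```

The same mock could back the "mock filesystem errors" tests listed under Tasks.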

3. Apply to Filesystem Operations

Wrap critical operations in withRetry():

// In ignoreWalker.js
entries = await withRetry(() => fs.readdir(dir, { withFileTypes: true }));

// In fileLoader.js
const stats = await withRetry(() => fs.stat(fullPath));
const content = await withRetry(() => fs.readFile(fullPath, 'utf8'));

4. Error Aggregation

Replace silent failures with structured error collection:

// Track errors for final report
const errors = {
  retried: [],    // Succeeded after retry
  failed: [],     // Failed after all retries
  permanent: []   // Non-retryable errors
};
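One possible way to fill these buckets is to let the retry loop itself record the outcome. The sketch below is an assumption-heavy illustration, not the planned implementation: `createErrorReport`, `runTracked`, and the entry shape (`path`, `code`, `attempts`) are all hypothetical.

```javascript
const RETRYABLE = new Set(['EBUSY', 'EPERM', 'EMFILE', 'ENFILE', 'EAGAIN', 'EIO']);

function createErrorReport() {
  return { retried: [], failed: [], permanent: [] };
}

// Hypothetical wrapper: retries `operation` and files the outcome in `report`.
// Returns the operation's value, or null when the path had to be skipped.
async function runTracked(path, operation, report, { maxAttempts = 4, delayMs = 100 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      const value = await operation();
      if (attempt > 1) report.retried.push({ path, attempts: attempt });
      return value;
    } catch (error) {
      if (!RETRYABLE.has(error.code)) {
        report.permanent.push({ path, code: error.code });
        return null;
      }
      if (attempt >= maxAttempts) {
        report.failed.push({ path, code: error.code, attempts: attempt });
        return null;
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs * 2 ** (attempt - 1)));
    }
  }
}

async function demo() {
  const report = createErrorReport();
  const fail = (code) => async () => {
    throw Object.assign(new Error(code), { code });
  };
  const opts = { maxAttempts: 3, delayMs: 1 };

  await runTracked('locked.txt', fail('EBUSY'), report, opts);
  await runTracked('some-dir', fail('EISDIR'), report, opts);
  console.log(report.failed.length, report.permanent.length); // 1 1
}

demo();
```

Returning null keeps the current "skip and continue" behavior while making every skip visible in the final report.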

Tasks

  • Add filesystem error codes to RETRYABLE_ERROR_CODES
  • Create src/utils/retryableFs.js with withRetry() utility
  • Add retry wrapper to readdir() in ignoreWalker.js:131
  • Add retry wrapper to stat() in ignoreWalker.js:149
  • Add retry wrapper to stat() in fileLoader.js:59
  • Add retry wrapper to readFile() in fileLoader.js:72
  • Implement error aggregation and reporting
  • Add --fail-on-fs-errors CLI flag for strict mode
  • Add configuration option: fs.retryAttempts, fs.retryDelay
  • Write tests for retry logic (mock filesystem errors)
  • Write tests for error aggregation
  • Update troubleshooting docs with retry behavior

Acceptance Criteria

  • Transient errors (EBUSY, EPERM, EMFILE) trigger up to 3 retry attempts
  • Exponential backoff delays: 100ms, 200ms, 400ms (configurable)
  • Maximum delay capped at 2000ms to prevent excessive waits
  • Non-retryable errors (ENOENT permanent, EISDIR) fail immediately
  • Error summary includes: total retries, successes after retry, permanent failures
  • --fail-on-fs-errors flag causes exit code 1 if any files fail
  • Retry behavior is configurable via copytree.fs.retryAttempts and copytree.fs.retryDelay
  • Tests cover Windows-specific scenarios (EBUSY, EPERM)
  • Tests cover file descriptor exhaustion (EMFILE, ENFILE)
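The summary and exit-code criteria could be checked with something like the following sketch. It assumes a report with `retried`, `failed`, and `permanent` arrays whose entries carry an `attempts` count where relevant; `summarize`, `exitCodeFor`, and the sample data are hypothetical.

```javascript
// Hypothetical: condenses the error buckets into the summary fields
// named in the acceptance criteria.
function summarize(report) {
  const retriesIn = (entries) =>
    entries.reduce((sum, entry) => sum + (entry.attempts - 1), 0);
  return {
    totalRetries: retriesIn(report.retried) + retriesIn(report.failed),
    succeededAfterRetry: report.retried.length,
    failedAfterRetries: report.failed.length,
    permanentFailures: report.permanent.length,
  };
}

// Hypothetical: exit code 1 only in strict mode and only if any file failed.
function exitCodeFor(report, { failOnFsErrors = false } = {}) {
  const anyFailed = report.failed.length + report.permanent.length > 0;
  return failOnFsErrors && anyFailed ? 1 : 0;
}

// Made-up sample data, for illustration only.
const sample = {
  retried: [{ path: 'a.js', attempts: 3 }],
  failed: [{ path: 'b.js', code: 'EBUSY', attempts: 4 }],
  permanent: [{ path: 'c', code: 'EISDIR' }],
};

console.log(summarize(sample).totalRetries);                // 5
console.log(exitCodeFor(sample));                           // 0
console.log(exitCodeFor(sample, { failOnFsErrors: true })); // 1
```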

Additional Context

Windows antivirus impact:
Windows Defender and third-party antivirus can hold file locks for 50-500ms while scanning, causing EBUSY/EPERM errors that resolve after brief delays.

File descriptor limits:

  • macOS default: 256 per process
  • Linux default: 1024 per process
  • CI environments: Often lower limits

Related tools with retry:

  • rsync: Built-in retry for transient errors
  • aws-cli: Automatic retry with exponential backoff
  • rclone: Configurable retry logic

Metadata

  • Assignees: No one assigned
  • Labels: priority-2 ⚙️ (Planned/normal priority. Product work or quality improvements to schedule this cycle.)
  • Projects: No projects
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests