Summary
Implement automatic retry with exponential backoff for transient filesystem errors (EBUSY, EPERM, EMFILE, EAGAIN) to improve reliability on Windows, network drives, and under high file descriptor pressure.
Current Behavior
Filesystem operations silently fail or return null when encountering transient errors:
In ignoreWalker.js:
try {
  entries = await fs.readdir(dir, { withFileTypes: true });
} catch (error) {
  // Can't read directory - skip it
  return;
}
In fileLoader.js:
} catch (error) {
  logger.error('Failed to load file', {
    path: relativePath,
    error: error.message,
  });
  return null; // File silently skipped
}
Common failure scenarios:
- Windows: Antivirus holds file locks → EBUSY, EPERM
- Network drives: Temporary disconnects → ETIMEDOUT, EIO
- High concurrency: File descriptor limits → EMFILE, ENFILE
- Race conditions: Files being written → ENOENT (transient)
Expected Behavior
Transient filesystem errors should trigger automatic retry with exponential backoff:
- First attempt fails with EBUSY
- Wait 100ms, retry (attempt 2)
- Wait 200ms, retry (attempt 3)
- Wait 400ms, retry (attempt 4)
- If all retries fail, log detailed error and continue (or fail based on --fail-on-errors policy)
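The schedule above follows a simple doubling rule. As a minimal sketch of that calculation, assuming a hypothetical helper named `backoffDelays` (not part of the codebase), the 100 ms starting delay from the list, and the 2000 ms cap used by the retry utility proposed later in this issue:

```javascript
// Hypothetical helper: lists the wait (in ms) before each retry.
// `retries` counts retries only, not the initial attempt.
function backoffDelays(retries, initialDelay = 100, maxDelay = 2000) {
  const delays = [];
  for (let retry = 0; retry < retries; retry++) {
    // Double the delay on every retry, but never exceed the cap.
    delays.push(Math.min(initialDelay * 2 ** retry, maxDelay));
  }
  return delays;
}

const threeRetries = backoffDelays(3); // 100, 200, 400
const sixRetries = backoffDelays(6);   // 100, 200, 400, 800, 1600, 2000 (capped)
console.log(threeRetries, sixRetries);
```

With three retries the worst-case added latency per operation is 700 ms, which is worth keeping in mind when many files fail at once.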
Environment
This issue affects all platforms but is most critical on:
- Windows (antivirus, indexing services, file locks)
- macOS with aggressive Spotlight indexing
- Network-mounted filesystems (NFS, SMB, SSHFS)
- CI environments with high concurrency
Evidence
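Existing retry infrastructure (AI services only):
- src/utils/errors.js - Defines RETRYABLE_ERROR_CODES
- src/utils/errors.js - isRetryableError() function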
Current codes (network-focused):
export const RETRYABLE_ERROR_CODES = [
  'RATE_LIMIT', 'TIMEOUT', 'SERVICE_UNAVAILABLE',
  'NETWORK_ERROR', 'TEMPORARY_FAILURE',
  'ENOTFOUND', 'ETIMEDOUT', 'ECONNRESET', 'ECONNABORTED'
];
Missing filesystem codes:
- EBUSY - Resource busy (file locked)
- EPERM - Operation not permitted (transient on Windows)
- EMFILE - Too many open files
- ENFILE - File table overflow
- EAGAIN - Resource temporarily unavailable
- EIO - I/O error (network drives)
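Affected files:
- src/utils/ignoreWalker.js - readdir() failures
- src/utils/ignoreWalker.js - stat() failures
- src/utils/fileLoader.js - stat() and readFile() failures
- src/utils/fileLoader.js - Binary file fallback handling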
Root Cause
- No retry logic for filesystem operations (only for AI service calls)
- Silent failures instead of structured error aggregation
- Binary fallback in fileLoader.js (lines 91-116) masks underlying transient errors
Proposed Solution
1. Add Filesystem Error Codes
Update src/utils/errors.js:
export const RETRYABLE_ERROR_CODES = [
  // Network (existing)
  'RATE_LIMIT', 'TIMEOUT', 'SERVICE_UNAVAILABLE',
  'NETWORK_ERROR', 'TEMPORARY_FAILURE',
  'ENOTFOUND', 'ETIMEDOUT', 'ECONNRESET', 'ECONNABORTED',
  // Filesystem (new)
  'EBUSY',  // Resource busy (file locked)
  'EPERM',  // Operation not permitted (transient on Windows, e.g. antivirus)
  'EMFILE', // Too many open files
  'ENFILE', // File table overflow
  'EAGAIN', // Resource temporarily unavailable
  'EIO',    // I/O error (network drives)
];
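This issue does not show how isRetryableError() consumes the list. One plausible shape is sketched below. It assumes errors expose a string `code` property, as Node.js system errors do, and the function body is a stand-in written for illustration, not the actual implementation in src/utils/errors.js. The list is abbreviated to keep the sketch short.

```javascript
// Abbreviated copy of the proposed list; a Set gives constant-time lookups.
const RETRYABLE_ERROR_CODES = new Set([
  'ETIMEDOUT', 'ECONNRESET',                              // network (existing)
  'EBUSY', 'EPERM', 'EMFILE', 'ENFILE', 'EAGAIN', 'EIO',  // filesystem (new)
]);

// Assumed implementation: retry only when the error carries a listed code.
function isRetryableError(error) {
  return Boolean(error && RETRYABLE_ERROR_CODES.has(error.code));
}

// Simulated errors shaped like Node.js system errors.
const busy = Object.assign(new Error('resource busy or locked'), { code: 'EBUSY' });
const missing = Object.assign(new Error('no such file or directory'), { code: 'ENOENT' });

console.log(isRetryableError(busy));    // true
console.log(isRetryableError(missing)); // false
```

Because ENOENT is deliberately absent from the list, a missing file fails immediately instead of being retried.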
2. Create Retry Utility
New file: src/utils/retryableFs.js
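import { isRetryableError } from './errors.js';

// Runs an async operation, retrying transient failures with exponential backoff.
export async function withRetry(operation, options = {}) {
  // maxAttempts includes the first try: 4 attempts = up to 3 retries,
  // matching the 100/200/400 ms schedule under Expected Behavior.
  const maxAttempts = options.maxAttempts || 4;
  const initialDelay = options.initialDelay || 100;
  const maxDelay = options.maxDelay || 2000;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on a non-retryable error.
      if (attempt === maxAttempts || !isRetryableError(error)) {
        throw error;
      }
      // Delay doubles after each failure and is capped at maxDelay.
      const delay = Math.min(initialDelay * Math.pow(2, attempt - 1), maxDelay);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}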
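To see the retry loop behave as intended without touching the disk, the sketch below runs a compact copy of the loop against a simulated operation that fails twice with EBUSY and then succeeds. `flakyRead` and the inline `isRetryable` check are hypothetical stand-ins, and the options are passed explicitly with short delays so the demo finishes quickly.

```javascript
// Stand-in for isRetryableError(); the real check lives in errors.js.
const isRetryable = (error) => ['EBUSY', 'EPERM', 'EMFILE'].includes(error.code);

// Compact copy of the retry loop so the demo is self-contained.
async function withRetry(operation, { maxAttempts, initialDelay, maxDelay }) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      const delay = Math.min(initialDelay * 2 ** (attempt - 1), maxDelay);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Simulated operation: throws EBUSY on the first two calls, then succeeds.
let calls = 0;
async function flakyRead() {
  calls += 1;
  if (calls <= 2) {
    throw Object.assign(new Error('file is locked'), { code: 'EBUSY' });
  }
  return 'file contents';
}

const demo = withRetry(flakyRead, { maxAttempts: 3, initialDelay: 5, maxDelay: 50 })
  .then((result) => {
    console.log(result, `after ${calls} calls`); // file contents after 3 calls
    return result;
  });
```

A non-retryable error such as ENOENT would be rethrown on the first failure, so callers still need their own handling for permanent errors.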
3. Apply to Filesystem Operations
Wrap critical operations in withRetry():
// In ignoreWalker.js
entries = await withRetry(() => fs.readdir(dir, { withFileTypes: true }));
// In fileLoader.js
const stats = await withRetry(() => fs.stat(fullPath));
const content = await withRetry(() => fs.readFile(fullPath, 'utf8'));
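Wrapping a call in withRetry() changes what happens on the final failure: the error is now thrown rather than swallowed, so each call site still needs a catch that decides whether to skip. A minimal sketch of that pattern for the directory walker follows. The name `readDirOrSkip` and the `onSkip` callback are hypothetical, and `readdir` and `retry` are injected so the sketch runs standalone; in the project they would be fs.readdir and withRetry().

```javascript
// Hypothetical wrapper: retry the read, then report and skip if it still fails.
async function readDirOrSkip(dir, { readdir, retry, onSkip }) {
  try {
    return await retry(() => readdir(dir, { withFileTypes: true }));
  } catch (error) {
    // Retries are exhausted or the error is permanent: record it, then skip.
    onSkip({ path: dir, code: error.code, message: error.message });
    return [];
  }
}

// Demo with a fake readdir that always fails and a retry that runs once.
const skipped = [];
const walk = readDirOrSkip('/locked/dir', {
  readdir: async () => {
    throw Object.assign(new Error('operation not permitted'), { code: 'EPERM' });
  },
  retry: (operation) => operation(),
  onSkip: (entry) => skipped.push(entry),
}).then((entries) => {
  console.log(entries.length, skipped[0].code); // 0 EPERM
  return entries;
});
```

Returning an empty array keeps the walker's control flow unchanged while making the skipped directory visible to whatever collects errors.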
4. Error Aggregation
Replace silent failures with structured error collection:
// Track errors for final report
const errors = {
  retried: [],   // Succeeded after retry
  failed: [],    // Failed after all retries
  permanent: []  // Non-retryable errors
};
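How the three buckets get filled, and how they feed a strict-mode exit code, could look roughly like the sketch below. Everything here is an assumption for illustration: `createFsErrorReport`, its method names, and the `strict` parameter (standing in for a fail-on-errors style CLI flag) are hypothetical.

```javascript
// Hypothetical collector built around the three buckets shown above.
function createFsErrorReport() {
  const errors = { retried: [], failed: [], permanent: [] };
  return {
    errors,
    // The operation eventually succeeded; `attempts` is how many tries it took.
    recordRetried(path, code, attempts) {
      errors.retried.push({ path, code, attempts });
    },
    // The operation gave up: retryable errors go to `failed`, others to `permanent`.
    recordFailure(path, error, retryable) {
      const bucket = retryable ? errors.failed : errors.permanent;
      bucket.push({ path, code: error.code, message: error.message });
    },
    // Assumed strict-mode policy: any unrecovered file yields exit code 1.
    exitCode(strict) {
      const unrecovered = errors.failed.length + errors.permanent.length;
      return strict && unrecovered > 0 ? 1 : 0;
    },
  };
}

const report = createFsErrorReport();
report.recordRetried('src/a.js', 'EBUSY', 2);
report.recordFailure('src/b.js', Object.assign(new Error('locked'), { code: 'EPERM' }), true);

console.log(report.exitCode(false)); // 0
console.log(report.exitCode(true));  // 1
```

Files that recovered after a retry do not affect the exit code in this sketch; they are kept only so the final report can show how often retries were needed.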
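Tasks
- Add filesystem error codes to RETRYABLE_ERROR_CODES
- Create src/utils/retryableFs.js with withRetry() utility
- Wrap readdir() in ignoreWalker.js:131
- Wrap stat() in ignoreWalker.js:149
- Wrap stat() in fileLoader.js:59
- Wrap readFile() in fileLoader.js:72
- Add --fail-on-fs-errors CLI flag for strict mode
- Add configuration options fs.retryAttempts, fs.retryDelay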
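Acceptance Criteria
- Transient errors (EBUSY, EPERM, EMFILE) trigger up to 3 retry attempts
- Non-retryable errors (permanent ENOENT, EISDIR) fail immediately
- --fail-on-fs-errors flag causes exit code 1 if any files fail
- Retry behavior is configurable via copytree.fs.retryAttempts and copytree.fs.retryDelay
- [...] (EBUSY, EPERM)
- [...] (EMFILE, ENFILE)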
Additional Context
Windows antivirus impact:
Windows Defender and third-party antivirus can hold file locks for 50-500ms while scanning, causing EBUSY/EPERM errors that resolve after brief delays.
File descriptor limits:
- macOS default soft limit: 256 per process
- Linux default soft limit: 1024 per process
- CI environments: Often lower limits
Related tools with retry:
rsync: Built-in retry for transient errors
aws-cli: Automatic retry with exponential backoff
rclone: Configurable retry logic