-
Notifications
You must be signed in to change notification settings - Fork 54
Improve bulkhead resilience with retry and larger queue #1463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve bulkhead resilience with retry and larger queue #1463
Conversation
Based on telemetry analysis showing 74K bulkhead errors from 32 power users with sites containing 5,000-15,000 files: Changes: - Increase MAX_CONCURRENT_REQUEST_QUEUE_COUNT from 1000 to 6000 (covers P90) - Add retry policy with exponential backoff (2 attempts, 2-8s delay) - Add telemetry for retry events (WEB_EXTENSION_REQUEST_RETRY) - Add graceful 404 handling for webfiles (WEB_EXTENSION_WEBFILE_NOT_FOUND) This hybrid approach: - Queue increase covers 90% of affected sessions without retry - Retry handles remaining 10% (largest sites) + network failures - No impact on small site users (queue is lazy, retry only on rejection) - 404 handling prevents errors for deleted webfiles (49K errors) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves resilience for handling high-volume Power Pages sites that have been experiencing bulkhead queue overflow errors. Based on telemetry showing 74K errors from 32 users with large sites (5,000-15,000 files), the changes increase queue capacity and add retry logic to handle transient failures.
Changes:
- Increased bulkhead queue size from 1,000 to 6,000 to accommodate P90 request volumes
- Added retry policy with exponential backoff (2 attempts, 2-8s delay) using cockatiel library
- Added graceful 404 handling for deleted/moved webfiles with telemetry tracking
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/web/client/common/constants.ts | Increased MAX_CONCURRENT_REQUEST_QUEUE_COUNT to 6000 and added retry policy constants |
| src/web/client/dal/concurrencyHandler.ts | Implemented retry policy wrapped with bulkhead and added retry telemetry |
| src/web/client/dal/remoteFetchProvider.ts | Added 404 error handling for webfiles to gracefully handle deleted/moved files |
| src/common/OneDSLoggerTelemetry/web/client/webExtensionTelemetryEvents.ts | Added WEB_EXTENSION_REQUEST_RETRY and WEB_EXTENSION_WEBFILE_NOT_FOUND telemetry events |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Exclude BulkheadRejectedError from retry policy to prevent wasteful retries when the queue is full (the queue won't drain during backoff) - Add JSDoc comment to clarify RETRY_MAX_ATTEMPTS means total attempts - Add integration tests for ConcurrencyHandler retry behavior - Add integration test for webfile 404 handling with telemetry Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix concurrencyHandler tests: correct error message assertion, remove tests that trigger retry delays - Fix remoteFetchProvider tests: stub fetch before auth calls to avoid retry delays during authentication - Fix WebExtensionContext test: stub concurrencyHandler.handleRequest instead of fetch to bypass retry logic The retry policy (2-8 second backoff) was causing tests to exceed the 2000ms timeout when fetch calls failed and were retried. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…es and improve error handling in the web client. - Translated error message for switching environments in Turkish, Simplified Chinese, Traditional Chinese, Czech, German, Spanish, French, Italian, Japanese, Korean, Brazilian Portuguese, and Russian. - Updated error handling in WebExtensionContext, remoteFetchProvider, remoteSaveProvider, etagHandlerService, graphClientService, and added a new utility for structured HTTP error handling. - Introduced createHttpResponseError and isHttpResponseError functions to standardize error responses and improve telemetry reporting. - Enhanced test cases to validate new error handling logic and ensure proper telemetry is sent for HTTP errors.
Previously SERVERLOGICS silently ignored all non-ok responses without telemetry, while WEBFILES only handled 404s with logging. Consolidate into a single map-driven block that gracefully handles 404s for both entity types with appropriate telemetry events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Based on telemetry analysis showing 74K bulkhead errors from 32 power users with sites containing 5,000-15,000 files, this PR improves the concurrency handler resilience.
Telemetry findings (30 days, sessions hitting bulkhead limits):
Changes
MAX_CONCURRENT_REQUEST_QUEUE_COUNTfrom 1000 to 6000 (covers P90)WEB_EXTENSION_REQUEST_RETRY)WEB_EXTENSION_WEBFILE_NOT_FOUND) - addresses 49K errorsBenefits
Test plan
🤖 Generated with Claude Code