Fix hub crash during Cloud Logging metadata outage #84
Closed
scion-gteam[bot] wants to merge 2 commits into
…etadata outages

When the GCP metadata service is unavailable, Cloud Logging retries could exhaust resources (goroutines, connections, memory) and crash the hub. This change adds three layers of protection:

1. ResilientCloudHandler: A circuit breaker wrapper around CloudHandler that monitors Cloud Logging health via periodic flush checks. After consecutive failures exceed the threshold (default: 3), the circuit opens and log entries are silently dropped from the Cloud Logging path. Local logging continues unaffected via the multiHandler. The circuit automatically probes for recovery and resumes Cloud Logging when the service returns.
2. BufferedByteLimit: Caps the Cloud Logging client's internal buffer at 8 MiB to prevent unbounded memory growth during transient failures.
3. Client creation timeout: Adds a 15-second timeout to Cloud Logging client initialization so the hub doesn't hang at startup when the metadata service is unreachable.

Fixes #70
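The circuit breaker described in item 1 can be illustrated with a small state machine. This is only a minimal sketch under stated assumptions, not the code from this pull request: the names `breaker`, `allow`, `recordFlush`, `probe`, and `errUnavailable` are hypothetical, and the choice to let entries through while half-open is an assumption made for illustration. A periodic flush check would call `recordFlush` with its result, and a recovery timer would call `probe`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable is a hypothetical stand-in for a failed flush.
var errUnavailable = errors.New("metadata service unavailable")

type state int

const (
	closed   state = iota // healthy: entries are forwarded
	open                  // unhealthy: entries are dropped
	halfOpen              // probing: the next flush result decides
)

// breaker is a hypothetical three-state circuit breaker driven by
// the results of periodic flush checks.
type breaker struct {
	mu        sync.Mutex
	state     state
	failures  int // consecutive flush failures
	threshold int // failures needed to open the circuit
}

func newBreaker(threshold int) *breaker {
	return &breaker{threshold: threshold}
}

// allow reports whether an entry should be forwarded to the remote
// sink. Assumption: half-open lets entries through as trial traffic.
func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state != open
}

// recordFlush feeds the result of one flush check into the breaker.
func (b *breaker) recordFlush(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		// Any success resets the consecutive-failure count and closes the circuit.
		b.failures = 0
		b.state = closed
		return
	}
	b.failures++
	// A failed probe reopens immediately; otherwise open at the threshold.
	if b.state == halfOpen || b.failures >= b.threshold {
		b.state = open
	}
}

// probe moves an open circuit to half-open so the next flush can
// test whether the service has recovered.
func (b *breaker) probe() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == open {
		b.state = halfOpen
	}
}

func main() {
	b := newBreaker(3)
	for i := 0; i < 3; i++ {
		b.recordFlush(errUnavailable)
	}
	fmt.Println("after 3 failures, allow:", b.allow())

	b.probe()
	b.recordFlush(nil)
	fmt.Println("after successful probe, allow:", b.allow())
}
```

In a handler wrapper, the logging path would check `allow()` before forwarding an entry and drop the entry when it returns false, so the local handler keeps working regardless of the breaker state.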
Owner: /gemini
Owner: This pull request has been recreated on the target repository as GoogleCloudPlatform#270.
Summary
Fixes #70 — Hub crashes when Cloud Logging retries exhaust resources during metadata outage.
- Adds `ResilientCloudHandler`, which wraps `CloudHandler` with a three-state circuit breaker (closed → open → half-open). After 3 consecutive flush failures, the circuit opens and Cloud Logging entries are silently dropped. Local logging continues unaffected via the `multiHandler`. The circuit automatically probes for recovery and resumes Cloud Logging when the service returns.
- Adds `BufferedByteLimit` (8 MiB default) to the Cloud Logging client to prevent unbounded memory growth during transient failures.
- Adds a 15-second timeout to `gcplog.NewClient` so the hub doesn't hang at startup when the metadata service is unreachable.

Acceptance Criteria
Test plan
- `go vet` clean