Skip to content

Conversation

@turip
Copy link
Member

@turip turip commented Dec 16, 2025

Overview

During scaling CH might return this error. Let's make sure we are retrying on the backend in such cases.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced resilience for ClickHouse connection failures. The system now properly handles and automatically retries temporary connection errors, especially during database scaling and maintenance operations, ensuring more stable and reliable service availability.

✏️ Tip: You can customize this high-level summary in your review settings.

@turip turip requested a review from a team as a code owner December 16, 2025 08:22
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

The retry logic in the streaming module is extended to detect ClickHouse connection failure exceptions (ErrAllConnectionTriesFailed) via the lo.ErrorsAs utility. When this specific exception is encountered, it's now treated as retriable, enabling automatic retry attempts instead of immediate failure.

Changes

Cohort / File(s) Summary
Retry logic enhancement
openmeter/streaming/retry/retry.go
Added imports for ch-go/proto and samber/lo; extended isErrorRetirable() to detect *clickhouse.Exception and treat ErrAllConnectionTriesFailed as retriable

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Area of attention: Verify that the specific ClickHouse error code (ErrAllConnectionTriesFailed) is the correct one to handle as retriable, and confirm this aligns with the intended behavior during cluster scaling scenarios

Possibly related PRs

Suggested labels

release-note/feature

Suggested reviewers

  • tothandras

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding retry logic for ClickHouse queries during scaling operations, which is exactly what the changeset implements.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chore/retry-on-scaling

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 409466a and 9eb5cca.

📒 Files selected for processing (1)
  • openmeter/streaming/retry/retry.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

⚙️ CodeRabbit configuration file

**/*.go: In general when reviewing the Golang code make readability and maintainability a priority, even potentially suggest restructuring the code to improve them.

Performance should be a priority in critical code paths. Anything related to event ingestion, message processing, database operations (regardless of database) should be vetted for potential performance bottlenecks.

Files:

  • openmeter/streaming/retry/retry.go
🧠 Learnings (1)
📚 Learning: 2025-03-07T12:17:43.129Z
Learnt from: GAlexIHU
Repo: openmeterio/openmeter PR: 2383
File: openmeter/entitlement/metered/lateevents_test.go:37-45
Timestamp: 2025-03-07T12:17:43.129Z
Learning: In the OpenMeter codebase, test files like `openmeter/entitlement/metered/lateevents_test.go` may use variables like `meterSlug` and `namespace` without explicit declarations visible in the same file. This appears to be an accepted pattern in their test structure.

Applied to files:

  • openmeter/streaming/retry/retry.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Migration Checks
  • GitHub Check: Artifacts / Benthos Collector Container image
  • GitHub Check: Artifacts / Container image
  • GitHub Check: Build
  • GitHub Check: Test
  • GitHub Check: Lint
  • GitHub Check: Code Generators
  • GitHub Check: Repository Scan
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
openmeter/streaming/retry/retry.go (2)

11-13: Nice, clean imports!

The new imports look good - chproto is a clear alias for accessing the error constants, and lo gives you that type-safe ErrorsAs helper. Both are used appropriately in the retry logic below.


161-167: The error code looks solid for scaling scenarios.

This is handling exactly what it should—when all connection retry attempts fail, which is the typical behavior during cluster upscale/downscale. The implementation correctly distinguishes between three types of transient errors: network disruptions (EOF), resource exhaustion (connection pool timeout), and cluster reconfiguration (all connections tried and failed). The approach is sound and follows good patterns for a critical database path.


Comment @coderabbitai help to get the list of available commands and usage tips.

@turip turip added the release-note/misc Miscellaneous changes label Dec 16, 2025
@turip turip enabled auto-merge (squash) December 16, 2025 08:24
@turip turip merged commit bf910c6 into main Dec 16, 2025
30 of 31 checks passed
@turip turip deleted the chore/retry-on-scaling branch December 16, 2025 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note/misc Miscellaneous changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants