
fix: context timeout handling#123

Merged
beer-1 merged 8 commits into main from fix/error-handling
Feb 23, 2026

Conversation


@beer-1 beer-1 commented Feb 23, 2026

Summary by CodeRabbit

  • New Features

    • Configurable RPC query timeout (default 30s) applied consistently across nodes and RPC client to prevent indefinite queries.
  • Bug Fixes

    • Improved handling of transient RPC/query errors with retry/ignore signaling to reduce spurious failures and improve stability.
  • Chores

    • Bumped core dependencies, updated e2e test container images, and raised the default test container version for improved compatibility.

@beer-1 beer-1 requested a review from a team as a code owner February 23, 2026 06:24

coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a configurable QueryTimeout across node configs and the RPC client, wires the timeout into RPC calls and call sites, improves transient query error handling, and updates several dependencies and e2e Docker image tags.

Changes

  • Config additions (challenger/types/config.go, executor/types/config.go, node/types/config.go): Added a QueryTimeout field to NodeConfig (int64 or time.Duration), validation rejecting negative values, and default initialization to 30s.
  • RPC client timeout implementation (node/rpcclient/client.go): Introduced DefaultQueryTimeout, a per-client queryTimeout field, SetQueryTimeout/QueryTimeout methods, and timeout-aware helpers (GetQueryContextWithTimeout, withQueryTimeout); updated RPC methods to use them.
  • Node wiring (node/node.go, cmd/opinitd/db.go): Set the RPC client query timeout during client initialization and in the migration path(s) via SetQueryTimeout(...).
  • Call-site conversions to timeout contexts (cmd/opinitd/tx.go, provider/child/query.go, provider/host/query.go): Replaced GetQueryContext usages with GetQueryContextWithTimeout(..., rpcClient.QueryTimeout()) to apply per-call timeouts.
  • Error handling improvements (executor/child/withdraw.go): Added an isRetryableQueryErr helper and adjusted error wrapping to return ErrIgnoreAndTryLater for transient errors, with a special case for missing collection responses.
  • Dependency & e2e updates (go.mod, e2e/go.mod, e2e/opbot.go, e2e/*_test.go): Bumped OPinit and various modules, removed an OPinit replace line, and updated the default container version and multiple e2e Docker image tags (initiation → v1.3.1, minimove → v1.1.10, OPBot → v1.0.16).

Sequence Diagram

sequenceDiagram
    actor App as Application
    participant Cfg as Config
    participant Node as Node / RPCClient
    participant Provider as Provider (host/child/tx)
    participant RPC as Remote RPC Endpoint

    App->>Cfg: load config (includes QueryTimeout)
    Cfg->>Node: initialize node (pass QueryTimeout)
    Node->>Node: SetQueryTimeout(duration)
    Provider->>Node: request query (uses RPC client)
    Node->>Node: GetQueryContextWithTimeout(ctx, height, QueryTimeout)
    Node->>RPC: execute query with timeout-aware ctx
    RPC-->>Node: response or error
    Node-->>Provider: return result or timed error
    Provider-->>App: deliver processed result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nibble timeouts, snug and spry,

Deadlines set where queries fly.
RPCs blink, then hustle or fail,
Transient hiccups now gently bail.
Hop—logs tidy, systems sigh less.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 19.05%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately describes the primary change across all modified files: implementing timeout handling for RPC queries and context operations throughout the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
provider/host/query.go (1)

320-342: ⚠️ Potential issue | 🟡 Minor

Missing context cancellation check in pagination loop.

The QueryValidators function creates a timeout context at line 322 but the pagination loop (lines 329-338) does not check ctx.Done(). If the timeout fires mid-pagination, the loop will continue making requests until it naturally completes or hits an error.

Compare with QueryBatchInfos (line 158-162) and QueryDepositTxHeight (line 202-206) which properly check ctx.Done() in their loops.

🛡️ Proposed fix to add context check
 func (b BaseHost) QueryValidators(ctx context.Context, height int64) ([]*cmtypes.Validator, error) {
 	ctx, cancel := rpcclient.GetQueryContextWithTimeout(ctx, 0, b.node.GetRPCClient().QueryTimeout())
 	defer cancel()

 	page := 1
 	perPage := 100

 	validators := make([]*cmtypes.Validator, 0)
 	for {
+		select {
+		case <-ctx.Done():
+			return nil, ctx.Err()
+		default:
+		}
+
 		result, err := b.node.GetRPCClient().Validators(ctx, &height, &page, &perPage)
 		if err != nil {
 			return nil, err
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@provider/host/query.go` around lines 320 - 342, The pagination loop in
QueryValidators does not check for context cancellation; inside the for loop
(where you call b.node.GetRPCClient().Validators and append to validators) add a
non-blocking check for ctx.Done() and if closed return nil with ctx.Err() (or
propagate the context error) to stop further RPC calls when the
timeout/cancellation fires; keep existing pagination logic (page, perPage,
result.Total) but bail out early on context cancellation.
e2e/go.mod (1)

31-31: ⚠️ Potential issue | 🟠 Major

Upgrade github.com/gofiber/fiber/v2 from v2.52.5 to v2.52.11 to fix critical security vulnerabilities.

The current version v2.52.5 is vulnerable to three CVEs:

  • CVE-2025-48075 (fixed in v2.52.7): BodyParser panic on negative range index
  • CVE-2025-54801 / GHSA-qx2q-88mx-vhg7 (fixed in v2.52.9): BodyParser crash via unvalidated large slice index (DoS)
  • CVE-2025-66630 / GHSA-68rr-p4fp-j59v (fixed in v2.52.11): Insecure UUID fallback to predictable all-zero UUID on Go < 1.24 (affects session/token security)

All are patched in v2.52.11 (latest, published Jan 31, 2026). While marked as indirect in go.mod, the module directly requires this dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/go.mod` at line 31, Update the indirect dependency declaration for
github.com/gofiber/fiber/v2 in e2e/go.mod from v2.52.5 to v2.52.11 to pick up
security patches (CVE-2025-48075, CVE-2025-54801, CVE-2025-66630), then run go
get github.com/gofiber/fiber/v2@v2.52.11 (or go mod tidy) to update go.sum and
vendor as needed and verify the project builds/tests (look for the dependency
line "github.com/gofiber/fiber/v2" in the file and ensure it shows v2.52.11).
🧹 Nitpick comments (2)
provider/child/query.go (1)

83-120: Potential timeout issue with paginated query.

The timeout context is created once at line 90 before the pagination loop, meaning the configured timeout applies to the entire paginated operation. If the query has many pages or the node is slow, the operation may time out before completing all pages.

This may be intentional (cap total operation time), but consider whether you need a per-page timeout or a longer timeout for paginated queries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@provider/child/query.go` around lines 83 - 120, The QueryGranteeGrants
function creates a single timeout context (rpcclient.GetQueryContextWithTimeout)
before the pagination loop causing the whole multi-page fetch to be subject to
that single timeout; change it to create a per-request context inside the loop
(call rpcclient.GetQueryContextWithTimeout at the start of each iteration, use
the returned ctx when calling authzClient.GranteeGrants, and call the
corresponding cancel after each request) so each page request gets its own
timeout rather than the entire paginated operation being constrained by one
shared ctx; keep the ticker and pagination logic (req.Pagination.Key =
res.Pagination.NextKey, break when NextKey == nil) unchanged.
executor/child/withdraw.go (1)

142-154: Consider making the "eof" match more specific.

The substring match for "eof" is quite short and could potentially match unintended content in error messages (though unlikely in practice). Consider using "eof" at word boundaries or checking for more specific patterns like " eof", "eof:", or the full "unexpected eof".

♻️ Suggested refinement
 	msg := strings.ToLower(err.Error())
 	return strings.Contains(msg, "context deadline exceeded") ||
 		strings.Contains(msg, "post failed") ||
 		strings.Contains(msg, "i/o timeout") ||
 		strings.Contains(msg, "connection refused") ||
 		strings.Contains(msg, "connection reset by peer") ||
-		strings.Contains(msg, "eof")
+		strings.Contains(msg, "unexpected eof") ||
+		strings.HasSuffix(msg, "eof")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@executor/child/withdraw.go` around lines 142 - 154, The current
isRetryableQueryErr function uses a very short substring "eof" which can match
unintended text; update isRetryableQueryErr to detect EOF more specifically by
matching patterns like " unexpected eof", " eof:", "eof:" or the full token
"unexpected eof" (or use a word-boundary regex) instead of a bare "eof" so only
true EOF errors are considered retryable; adjust the check alongside the other
conditions (keep the other substrings intact) and use strings.Contains or a
compiled regexp inside isRetryableQueryErr to perform the refined match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@node/types/config.go`:
- Around line 31-34: Add a guard that ensures the QueryTimeout field is
non‑negative: locate the config struct that declares QueryTimeout and add
validation in its existing Validate() method (or create one if missing) to
return an error when QueryTimeout < 0; if you have a constructor like NewConfig
or an Unmarshal/Load path, enforce the same check there and return a clear error
message referencing QueryTimeout so callers cannot set a negative duration.


@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
provider/host/query.go (2)

148-162: ⚠️ Potential issue | 🟠 Major

Timeout context may expire during pagination loop.

The timeout context is created once before the loop, but the loop can iterate multiple times with PollingInterval() delays between iterations. If QueryTimeout is shorter than (pages × PollingInterval), the context will cancel mid-pagination, potentially returning incomplete results.

Consider either:

  1. Moving the timeout context creation inside the loop for per-request timeouts, or
  2. Using the parent botCtx for loop control and creating a fresh timeout context per RPC call.
Proposed fix: per-request timeout
 func (b BaseHost) QueryBatchInfos(botCtx types.Context, bridgeId uint64) ([]ophosttypes.BatchInfoWithOutput, error) {
-	ctx, cancel := rpcclient.GetQueryContextWithTimeout(botCtx, 0, b.node.GetRPCClient().QueryTimeout())
-	defer cancel()
-
 	ticker := time.NewTicker(botCtx.PollingInterval())
 	defer ticker.Stop()

 	var batchInfos []ophosttypes.BatchInfoWithOutput
 	var nextKey []byte
 	for {
 		select {
-		case <-ctx.Done():
-			return nil, ctx.Err()
+		case <-botCtx.Done():
+			return nil, botCtx.Err()
 		case <-ticker.C:
 		}

 		req := &ophosttypes.QueryBatchInfosRequest{
 			BridgeId: bridgeId,
 			Pagination: &query.PageRequest{
 				Limit: 100,
 				Key:   nextKey,
 			},
 		}
+		ctx, cancel := rpcclient.GetQueryContextWithTimeout(botCtx, 0, b.node.GetRPCClient().QueryTimeout())
 		res, err := b.ophostQueryClient.BatchInfos(ctx, req)
+		cancel()
 		if err != nil {
 			return nil, err
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@provider/host/query.go` around lines 148 - 162, The current
BaseHost.QueryBatchInfos creates a single timeout context via
rpcclient.GetQueryContextWithTimeout before the pagination loop, causing ctx to
potentially expire mid-pagination; change to create a fresh per-request timeout
context inside the loop (i.e., call rpcclient.GetQueryContextWithTimeout at the
start of each iteration before the RPC call that uses nextKey) and ensure you
call cancel() for each created context to avoid leaks; alternatively, if you
prefer global control, drive the loop with the parent botCtx and create a
short-lived timeout context around each RPC call — update uses of ctx in
QueryBatchInfos accordingly so each page fetch has its own timeout tied to
QueryTimeout and respects botCtx cancellation.

184-206: ⚠️ Potential issue | 🟠 Major

Same timeout scoping issue as QueryBatchInfos.

The timeout context created at line 192 applies to the entire pagination loop. With ticker delays consuming timeout budget, the context may cancel before all pages are fetched.

Apply the same per-request timeout pattern suggested for QueryBatchInfos.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@provider/host/query.go` around lines 184 - 206, The timeout context used in
QueryDepositTxHeight (created via rpcclient.GetQueryContextWithTimeout and
assigned to ctx/cancel before the pagination loop) currently covers the entire
pagination loop and can expire due to ticker delays; change this to create a
fresh per-request context inside the for page loop (i.e., call
rpcclient.GetQueryContextWithTimeout(...) at the start of each page iteration
and defer/ensure cancel() immediately after the single request completes) so
each page query has its own timeout budget while keeping the outer loop logic,
ticker handling, and use of ophosttypes.EventTypeInitiateTokenDeposit /
ophosttypes.AttributeKeyL1Sequence intact.
ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 32f4478 and 9719548.

📒 Files selected for processing (1)
  • provider/host/query.go

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
provider/host/query.go (1)

149-175: ⚠️ Potential issue | 🟠 Major

Timeout scope should be per-RPC call, not the entire pagination loop.

Creating the timeout context once means it can expire mid-pagination (or during ticker waits), causing later requests to fail immediately even though botCtx is still active. Prefer a fresh timeout context inside each iteration.

🛠️ Suggested fix (per-call timeout context inside the loop)
- ctx, cancel := rpcclient.GetQueryContextWithTimeout(botCtx, 0, b.node.GetRPCClient().QueryTimeout())
- defer cancel()
-
  ticker := time.NewTicker(botCtx.PollingInterval())
  defer ticker.Stop()
@@
-    case <-ctx.Done():
-      return nil, ctx.Err()
+    case <-botCtx.Done():
+      return nil, botCtx.Err()
     case <-ticker.C:
     }
@@
-    res, err := b.ophostQueryClient.BatchInfos(ctx, req)
+    requestCtx, cancel := rpcclient.GetQueryContextWithTimeout(botCtx, 0, b.node.GetRPCClient().QueryTimeout())
+    res, err := b.ophostQueryClient.BatchInfos(requestCtx, req)
+    cancel()
     if err != nil {
       return nil, err
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@provider/host/query.go` around lines 149 - 175, The timeout context is
created once via rpcclient.GetQueryContextWithTimeout (ctx, cancel) before the
pagination loop, so it can expire during ticker waits and cause later BatchInfos
calls to fail; move creation of the per-RPC timeout context inside the loop
immediately before calling b.ophostQueryClient.BatchInfos (create a fresh
ctx/cancel from botCtx each iteration using GetQueryContextWithTimeout with the
same QueryTimeout), use that ctx for the RPC, then call cancel() after the RPC
completes (and keep the outer ticker logic and nextKey/pagination handling
intact).

📥 Commits

Reviewing files that changed from the base of the PR and between 9719548 and 72a3191.

📒 Files selected for processing (1)
  • provider/host/query.go

@beer-1 beer-1 requested a review from traviolus February 23, 2026 09:38
@traviolus traviolus left a comment

LGTM

@beer-1 beer-1 merged commit 6b721fb into main Feb 23, 2026
9 checks passed
@beer-1 beer-1 deleted the fix/error-handling branch February 23, 2026 09:47