Skip to content

fix(eth): RPC failover on broadcast + pin networks on JsonRpcProvider#59

Merged
BitHighlander merged 3 commits intodevelopfrom
fix/eth-broadcast-failover
Apr 30, 2026
Merged

fix(eth): RPC failover on broadcast + pin networks on JsonRpcProvider#59
BitHighlander merged 3 commits intodevelopfrom
fix/eth-broadcast-failover

Conversation

@BitHighlander
Copy link
Copy Markdown
Collaborator

Summary

Optimism broadcast hit rate-limit (Tenderly public, HTTP 429 / -32005). getProvider() did failover for its pre-flight getBlockNumber but every subsequent call hit the same single URL — signing succeeded, broadcast failed.

This PR makes broadcast iterate the full RPC list and adds a hardcoded last-resort table that runs after Pioneer's list (so it can never preempt a live node).

What changed

Broadcast failover (broadcastTransaction in ethereumHandler.ts)

  • Iterates all candidate RPCs in order.
  • New classifyBroadcastError distinguishes:
    • definitiveinsufficient funds, nonce too low, replacement transaction underpriced, gas required exceeds, intrinsic gas too low. Same outcome on any RPC; throw immediately.
    • already-known — tx is in mempool somewhere; treat as success and return the locally-recovered hash from the signed bytes.
    • transient — rate limit, 5xx, network, ethers SERVER_ERROR. Mark URL failed for 60s, try next.
  • The pre-flight signer-recovery / signer-mismatch fail-closed check is preserved.
  • The 8s + 45s drop checks still fire on success.

Last-resort RPC list (chains/lastResortRpcs.ts, new)

  • Hardcoded canonical public RPCs for the 8 main chains (ETH/Optimism/BSC/Polygon/zkSync/Base/Arbitrum/Avalanche), 1–2 entries each.
  • Always appended after Pioneer's list in the failover sequence — Pioneer URLs always win priority. A stale entry in this table can fail; it can never preempt a working Pioneer-discovered URL. Addresses the historical concern from feedback_no_hardcoded_rpcs.md.

Pinned-network JsonRpcProvider (makeStaticProvider in chains/registry.ts)

  • ethers v6 default behavior calls eth_chainId to auto-detect the network on the first RPC call, and retries every 1s indefinitely if the URL is slow / dead / rate-limited. Promise.race(getBalance, timeout) rejects the awaiting promise but leaves the abandoned provider retrying — which is the source of the JsonRpcProvider failed to detect network and cannot start up; retry in 1s spam (50+ lines per balance refresh).
  • New helper passes staticNetwork: true and the chainId at construction time so ethers skips detection. Bad URLs fail fast on the actual call.
  • Applied at all 6 construction sites: index.ts:485 (fetchBalances), :891 (GET_GAS_ESTIMATE), :1352 (GET_ASSET_BALANCE), :1494 (GET_EVM_BALANCE), :1701 (VALIDATE_ERC20_TOKEN), ethereumHandler.ts:150 (getProvider).

Out of scope

signTransaction pre-flight calls (getTransactionCount, estimateGas, getFeeData) still hit a single URL via getProvider(). If those rate-limit you'll see different failure modes — fee data unavailable, nonce 0, etc. Worth following up but a separate PR.

Files

  • chrome-extension/src/background/chains/lastResortRpcs.ts (new)
  • chrome-extension/src/background/chains/registry.tsmakeStaticProvider export
  • chrome-extension/src/background/chains/ethereumHandler.ts — broadcast loop rewrite, getProvider uses pinned helper
  • chrome-extension/src/background/index.ts — 5 JsonRpcProvider sites use pinned helper

Test plan

  • Optimism tx broadcast: trigger the rate-limit case (e.g. burst many txs or hit Tenderly's rough threshold). First URL 429s; failover catches it; tx lands on a subsequent URL. Console shows [HANDOFF] BEX → RPC (broadcast attempt) url=... for each attempt.
  • BSC, Ethereum, Base broadcasts each work via Pioneer's list (no last-resort needed).
  • Send a bad-nonce tx (e.g. with stale nonce: 0) — should fail definitively after the first URL, not walk the whole list.
  • fetchBalances with multiple chains in storage: console no longer shows the 50× JsonRpcProvider failed to detect network spam after a chain switch. Bad URLs fail within 5s instead of looping.
  • After a URL rate-limits, repeat the same broadcast within 60s — that URL is skipped (failedRpcs cooldown).
  • After 60s without retries, the same URL is retried.
  • already known path: difficult to trigger naturally; skip if not reproducible.
  • Existing one-step add+switch (BSC) from feat: Pioneer-sourced EVM chain registry + Solana sign-message UX #57 still works end-to-end.

🤖 Generated with Claude Code

BitHighlander and others added 3 commits April 30, 2026 16:40
Tenderly's public Optimism endpoint started rate-limiting broadcasts (HTTP 429 / -32005). getProvider() did failover for the pre-flight getBlockNumber but once a working URL was selected, every subsequent call (broadcast, getFeeData, etc.) hit that single URL forever. Result: signing succeeded, broadcast failed, dApp got SERVER_ERROR.

- broadcastTransaction now iterates the full RPC list with a transient/definitive error classifier. Rate limits, 5xx, and network errors fail over to the next URL; nonce / insufficient-funds / gas errors stop immediately so we don't waste candidates on a doomed tx. 'already known' is treated as success (the tx is in mempool somewhere) using the locally-recovered hash from the signed bytes.
- Last-resort RPC list (chains/lastResortRpcs.ts) appended AFTER Pioneer's URLs in the failover sequence. Pioneer always wins priority; the hardcoded list only fires when Pioneer's full set is exhausted, so a stale entry can never preempt a working live node — addresses the historical concern about hardcoded RPC tables blocking releases.
- New makeStaticProvider helper (registry.ts) constructs JsonRpcProvider with a pinned network. ethers v6's auto-detect retried eth_chainId every 1s indefinitely on slow/dead/rate-limited URLs, generating background spam from any timed-out caller (Promise.race rejected the awaiting promise but left zombie retries running). All 6 sites in index.ts + ethereumHandler.ts now use the helper. Bad URLs fail fast on the actual call.
- failedRpcs cooldown is honored across the broadcast path so a URL that just rate-limited isn't re-tried for 60s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h helper, preserve ProviderRpcError

- P1: makeStaticProvider gains optional { timeoutMs } that builds a FetchRequest with a per-HTTP-attempt timeout AND sets ThrottleParams.maxAttempts=1. Without this, ethers' default FetchRequest retries 429/5xx with exponential backoff for ~30s before surfacing the error — the exact stall the failover loop was meant to eliminate. Broadcast attempts now use timeoutMs:4000 so a hung URL fails over within ~4s.
- P2: handleEthSendRawTransaction now calls broadcastTransaction(params[0]) instead of inlining getProvider().broadcastTransaction. Raw signed tx submissions (dApps that sign externally) get the same Pioneer + last-resort iteration, transient/definitive classification, and already-known handling as eth_sendTransaction.
- P2: sendTransaction's catch passes ProviderRpcErrors through unchanged. Was wrapping every error as 'Error sending transaction', swallowing the classifier's output (insufficient funds, all-RPC-failed, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses two of the audit findings tagged P1:

- P1#3 EVM signing preflight now uses a candidate-loop helper. New `withRpcFailover` shares the same Pioneer + last-resort priority and failedRpcs cooldown as broadcastTransaction (factored into `getCandidateRpcs`). signTransaction's nonce / estimateGas / getFeeData calls each fail over independently — a rate-limited URL on one read no longer breaks signing before broadcast can rescue. Each attempt gets a 5s transport timeout so a hung URL doesn't stall.
- P1#4 getProvider's static provider now passes timeoutMs:5000 so ethers' transport itself aborts at 5s. Previously the awaited `Promise.race(getBlockNumber, timeout)` raced-out at the application layer while abandoned ethers requests kept retrying internally. The outer Promise.race becomes redundant once the transport-level timeout exists; dropped it.
- broadcastTransaction's candidate-list/cooldown duplication is collapsed into `getCandidateRpcs` so future failover sites have one place to extend.

Out of scope (separate PRs needed):
- UTXO build-then-approval race + UTXO broadcast resilience (5 chains)
- Solana send-time failover beyond health check
- Tron raw-broadcast injected path
- Background health polling overlap guard
- Pioneer portfolio/custom-token endpoint timeouts
- EVM read paths (balance/token validation) for non-broadcast resilience

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BitHighlander BitHighlander merged commit c79fc54 into develop Apr 30, 2026
3 of 4 checks passed
@BitHighlander BitHighlander deleted the fix/eth-broadcast-failover branch April 30, 2026 22:08
BitHighlander added a commit that referenced this pull request Apr 30, 2026
… failover (#60)

* fix(audit): UTXO build race + cross-chain broadcast resilience + read failover

Addresses the eight remaining audit findings from PR #59's review.

P1 — UTXO build-then-approval race (#1) and broadcast resilience (#2):
- bitcoin/dogecoin/litecoin/bitcoincash/dash all awaited buildTx() fire-and-forget then created the approval event afterwards. Race produced phantom approvals (response.unsignedTx undefined) or null storedEvent depending on timing. Now buildTx is awaited BEFORE addEvent, with unsignedTx attached to the event from the start. Also applied to thorchain/cosmos/maya/osmosis/ripple — they already awaited the build but used raw fetch() without timeout/retry.
- Both build and broadcast now go through fetchJsonWithTimeout with 15s budget + 1 retry on 5xx.

P2 — cross-cutting resilience:
- New chrome-extension/src/background/fetchUtils.ts with fetchJsonWithTimeout + fetchWithTimeout. Applied to all six Pioneer endpoints in index.ts (portfolio, insight, nodes, tokens/metadata, tokens/custom GET/POST/DELETE, tokens/balances) and the UTXO/Tendermint chain handlers above.
- New chrome-extension/src/background/chains/rpcFailover.ts with withRpcFailoverByNetworkId. Iterates user-override → Pioneer → last-resort with transient/definitive classification and per-attempt transport timeout. Applied to GET_ASSET_BALANCE, GET_EVM_BALANCE, VALIDATE_ERC20_TOKEN.
- Drop check binds to the URL that accepted the broadcast — querying a different RPC could produce false-positive drop warnings, especially after a last-resort fallback succeeded. dropCheckUrlByHash maps hash → success URL; performDropCheck queries that exact URL with a 4s timeout, falling back to getProvider only on service-worker restart.
- Solana broadcastTransaction iterates SOLANA_RPC_URLS on transient failures (rate limit, 5xx, timeout). Definitive errors (insufficient funds, blockhash expired, signature verification, account-in-use) skip the loop. Cached health-checked URL is invalidated on failure so the next caller re-tests.
- Tron tronGridPost (injected bundle, can't import fetchUtils) gets inline timeout + one retry on 5xx. Broadcast paths get a longer 12s budget; reads stay at 8s.
- Health poll (checkKeepKey) gets AbortSignal.timeout(3000) + a singleflight guard so a stalled localhost:1646 can't stack overlapping probes every 5s.

Out of scope: fetchBalances per-chain RPC enrichment (line 500) still uses a single makeStaticProvider per chain since it's already inside a chain-iteration loop; further failover there would be over-engineering. GET_GAS_ESTIMATE (line 909) still uses the active provider directly — could be migrated to withRpcFailover later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): Solana classifier + Bitcoin fall-through

Two findings from PR #60's review.

P1: Solana broadcast classifier was over-eager about definitive errors.
- 'already processed' is treated as success — the tx is in mempool / processed somewhere; the dApp still needs the signature. Recover it from the signed bytes (first 64-byte sig in the signed-tx layout IS the canonical Solana transaction signature). Inline base58 encoder added; reuses the existing BASE58_ALPHABET const that base58Decode already declared.
- 'blockhash not found' demoted from definitive to transient — can be RPC freshness for dApp-supplied transactions; trying the next URL is worth it. 'block height exceeded' stays definitive (the blockhash window has truly closed).
- 'account in use' demoted to transient — usually a parallel-write race that can resolve on the next URL.

P2: bitcoinHandler's broadcast catch logged + sent transaction_error but didn't re-throw, so the case 'transfer' fell through to default and returned 'Method transfer not supported' to the dApp instead of the real broadcast error. Other UTXO/Tendermint handlers don't have this catch — only bitcoin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant