
fix(ratelimiter:PLA-1025): fail open when redis ratelimiter panics #56

Merged
0x666c6f merged 6 commits into morpho-main from
feature/pla-1025-erpc-fail-open-when-redis-ratelimiter-becomes-unavailable
Mar 31, 2026

Conversation

Collaborator

@0x666c6f 0x666c6f commented Mar 31, 2026

Summary

  • fail open when the Redis-backed rate limiter panics on runtime DoLimit calls
  • keep eRPC alive when Redis is unavailable and hand reconnection back to the initializer retry loop
  • add regression coverage for the runtime panic path

Changes

  • wrap rate limiter DoLimit execution with panic recovery and record panic-specific fail-open telemetry
  • clear the broken Redis cache and mark the Redis reconnect task failed so background retries resume
  • add a regression test covering Redis panic fail-open behavior
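The panic-recovery wrapper described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `limiter` interface, `panickyLimiter`, `safeDoLimit`, and the `onPanic` hook are hypothetical names, and the real `DoLimit` signature is likely different.

```go
package main

import (
	"errors"
	"fmt"
)

// limiter is a hypothetical stand-in for the Redis-backed rate limiter cache.
type limiter interface {
	DoLimit(key string) (overLimit bool)
}

// panickyLimiter simulates a client that panics when its connection is gone.
type panickyLimiter struct{}

func (panickyLimiter) DoLimit(string) bool {
	panic(errors.New("redis: connection unavailable"))
}

// safeDoLimit calls DoLimit and converts a panic into a fail-open result.
// onPanic is a hypothetical hook where telemetry could be recorded and the
// broken cache cleared so that a background retry loop can reconnect.
func safeDoLimit(l limiter, key string, onPanic func(recovered any)) (overLimit, failedOpen bool) {
	defer func() {
		if r := recover(); r != nil {
			onPanic(r)
			// Fail open: let the request through instead of crashing.
			overLimit, failedOpen = false, true
		}
	}()
	return l.DoLimit(key), false
}

func main() {
	over, failedOpen := safeDoLimit(panickyLimiter{}, "project/network", func(r any) {
		fmt.Println("recovered:", r)
	})
	fmt.Println("overLimit:", over, "failedOpen:", failedOpen)
}
```

The named return values are what allow the deferred function to overwrite the result after `recover()` has caught the panic.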


@0x666c6f 0x666c6f self-assigned this Mar 31, 2026
linear Bot commented Mar 31, 2026

devin-ai-integration[bot]

This comment was marked as resolved.

@0x666c6f
Collaborator Author

@codex review

chatgpt-codex-connector[bot]

This comment was marked as resolved.


@0x666c6f 0x666c6f left a comment


Automated PR Review (Claude)

Reviewed commit: c477a77
Language: Go

Severity Count
Critical 0
High 1
Medium 3
Low 3

Reviewed by 6 specialized agents. This review will re-run when new commits are pushed.

Comment thread upstream/ratelimiter_budget.go Outdated
Comment thread upstream/ratelimiter_budget.go Outdated
Comment thread upstream/ratelimiter_budget.go
Comment thread upstream/ratelimiter_budget.go Outdated
Comment thread upstream/ratelimiter_budget.go Outdated
Comment thread upstream/ratelimiter_registry.go
Comment thread upstream/ratelimiter_budget_metrics_test.go

@0x666c6f 0x666c6f left a comment


Automated PR Review (Claude) — Re-review

Reviewed commit: 7432d4c
Language: Go
Previous findings: All 7 findings from review #1 (c477a77) have been addressed ✓

Severity Count
Critical 0
High 0
Medium 2
Low 3

Notable improvements in this commit:

  • onRedisCacheFailure → onCacheFailure: now clears cache for all drivers, prevents hot-loop panics
  • Timeout MetricRateLimiterFailopenTotal moved to evaluateRule where projectId is available — consistent metric cardinality
  • Updated doc comment on doLimitWithTimeout — now accurate
  • Added PanicBeforeTimeoutFailOpen test — covers the channel-based panic propagation path
  • Added MetricUnexpectedPanicTotal assertion — closes the observability gap
  • Removed fragile MetricNetworkAttemptReasonTotal.Reset() — uses delta pattern instead

Remaining findings are suggestions for further refinement, not blockers.

Reviewed by 6 specialized agents. This review will re-run when new commits are pushed.

Comment thread upstream/ratelimiter_budget.go
Comment thread upstream/ratelimiter_budget.go
Comment thread upstream/ratelimiter_registry.go Outdated
Comment thread upstream/ratelimiter_budget.go Outdated
Comment thread upstream/ratelimiter_budget.go

@0x666c6f 0x666c6f left a comment


Automated PR Review (Claude) — Re-review #3

Reviewed commit: 69068b4
Language: Go
Previous findings: All 5 findings from review #2 (7432d4c) have been addressed ✓

Severity Count
Critical 0
High 0
Medium 1
Low 0

Excellent iteration — this commit addressed every finding from review #2:

  • ✅ Extracted recordFailOpen helper (was MEDIUM: duplicate fail-open pattern)
  • ✅ Threaded agentName through evaluateRule (was LOW: empty string gap)
  • ✅ Simplified waitDuration — computed once before if/else (was LOW: boilerplate)
  • ✅ Added doc comment about detached goroutine timeout-panic behavior (was MEDIUM)
  • ✅ Updated non-redis log to "(rate limiting disabled until restart)" (was LOW)
  • ✅ Tests now use real requests with agent names for label verification

The single remaining finding is a suggestion to reduce the 13-parameter recordFailOpen signature with a struct. Not a blocker.

Reviewed by 6 specialized agents. This review will re-run when new commits are pushed.

Comment thread upstream/ratelimiter_budget.go

@0x666c6f 0x666c6f left a comment


Automated PR Review (Claude) — Re-review #4

Reviewed commit: 46db19d
Language: Go
Previous findings: The single MEDIUM finding from review #3 (69068b4) has been addressed ✓

Severity Count
Critical 0
High 0
Medium 0
Low 0

Clean refactoring — introduced rateLimitEvalContext struct to replace the 13-parameter recordFailOpen signature with 6 parameters. Call sites are now self-documenting and resistant to silent string transposition. No behavioral changes, no new issues found.
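To illustrate why a context struct resists silent string transposition, here is a small sketch. Only the names `rateLimitEvalContext`, `recordFailOpen`, `projectId`, and `agentName` appear in this conversation; every other field, the function signature, and the returned label slice are assumptions made for the example.

```go
package main

import "fmt"

// rateLimitEvalContext groups the string labels that would otherwise be
// passed positionally. Field names other than projectId and agentName are
// illustrative guesses, not the struct from this PR.
type rateLimitEvalContext struct {
	projectId string
	networkId string
	method    string
	agentName string
	budgetId  string
}

// recordFailOpen stands in for metric emission: it returns the label values
// in a fixed order so the effect of named fields is easy to observe.
func recordFailOpen(ec rateLimitEvalContext, reason string) []string {
	return []string{ec.projectId, ec.networkId, ec.method, ec.agentName, ec.budgetId, reason}
}

func main() {
	ec := rateLimitEvalContext{
		projectId: "main",
		networkId: "evm:1",
		method:    "eth_call",
		agentName: "curl",
		budgetId:  "default",
	}
	fmt.Println(recordFailOpen(ec, "panic"))
}
```

With positional string parameters, swapping two arguments still compiles; with named fields, each value is tied to its label at the call site.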

All findings from all 3 prior reviews have been addressed across 4 iterations. This PR is in excellent shape. 🎯

Reviewed by 6 specialized agents. This review will re-run when new commits are pushed.


@0x666c6f 0x666c6f left a comment


Automated PR Review (Claude) — Re-review #5

Reviewed commit: dae5d30
Language: Go

Severity Count
Critical 0
High 0
Medium 0
Low 0

Incremental change since last review (46db19d): only erpc/networks_multiplexer_test.go changed. It adds a Delay(50ms) to the mock and a start-channel barrier (start := make(chan struct{}), <-start, close(start)) so that the goroutines launch simultaneously, fixing a race condition in the multiplexer test. Clean test fix, no issues found.
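A start-channel barrier of the kind described here generally looks like the sketch below. The helper name `runConcurrently` and its callback are hypothetical and are not taken from networks_multiplexer_test.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runConcurrently parks n goroutines on a shared start channel, releases
// them together by closing it, and returns how many completed their work.
func runConcurrently(n int, work func(i int)) int64 {
	start := make(chan struct{})
	var wg sync.WaitGroup
	var ran int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-start // block until the barrier is released
			work(i)
			atomic.AddInt64(&ran, 1)
		}(i)
	}
	// Closing the channel unblocks every receiver at once, so the
	// goroutines begin their work as close together as the scheduler allows.
	close(start)
	wg.Wait()
	return atomic.LoadInt64(&ran)
}

func main() {
	done := runConcurrently(8, func(i int) {
		// A real test would issue its concurrent requests here.
	})
	fmt.Println("goroutines completed:", done)
}
```

Closing a channel is used rather than sending on it because a close is observed by all waiting receivers, whereas a send wakes only one.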

All ratelimiter code unchanged from review #4 (clean). All findings from reviews #1-3 remain addressed.

Reviewed by 6 specialized agents. This review will re-run when new commits are pushed.

@0x666c6f 0x666c6f merged commit 26cabd9 into morpho-main Mar 31, 2026
6 of 8 checks passed
