[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231
[SDCICD-1830] Add retry mechanism with fallback model for LLM analysis engines#3231varunraokadaparthi wants to merge 4 commits into
Conversation
…s engines Add exponential backoff retry logic (1m -> 2m -> 4m, 3 retries) for Gemini API calls in both analysis engines. On primary model exhaustion, automatically switches to gemini-2.5-pro as fallback with the same retry strategy. Non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediately without retry or fallback. Retryable error classification uses a status code allowlist (429, 500, 502, 503) via genai.APIError inspection. Both fallback and primary clients are pre-initialized at engine construction to avoid repeated allocation and auth overhead on the retry path. Co-Authored-By: Claude Code <noreply@anthropic.com>
|
There are test jobs defined for this repository which are not configured to run automatically. Comment |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThe PR implements a resilient LLM analysis pipeline with a primary/fallback model strategy. It introduces sentinel errors to classify Gemini failures, adds exponential-backoff retry orchestration that promotes to a fallback model on exhaustion, integrates both patterns into internal and public engines, and validates behavior across success paths, transient failures, API errors, and edge cases. ChangesLLM Retry and Fallback Pattern
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: varunraokadaparthi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/llm/gemini.go (1)
109-117:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReturn
ErrMaxIterationsinstead of success on the final tool-call iteration.Line 109-114 returns
nilerror on the last iteration, so the max-iteration failure path at Line 117 is bypassed.Suggested fix
func (g *GeminiClient) handleConversationWithTools(ctx context.Context, contents []*genai.Content, genConfig *genai.GenerateContentConfig, toolRegistry *tools.Registry) (*AnalysisResult, error) { const maxIterations = 5 var toolCalls []*genai.FunctionCall + var lastTextContent string for i := range maxIterations { resp, err := g.client.Models.GenerateContent(ctx, g.model, contents, genConfig) if err != nil { return nil, fmt.Errorf("gemini API error: %w", err) @@ textContent, functionCalls := g.processCandidateParts(candidate) + lastTextContent = textContent @@ - // Return partial result if we hit max iterations - if i == maxIterations-1 { - return &AnalysisResult{ - Content: textContent, - ToolCalls: toolCalls, - }, nil - } + _ = i } - return &AnalysisResult{ToolCalls: toolCalls}, ErrMaxIterations + return &AnalysisResult{ + Content: lastTextContent, + ToolCalls: toolCalls, + }, ErrMaxIterations }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/llm/gemini.go` around lines 109 - 117, The final iteration branch currently returns (&AnalysisResult{Content: textContent, ToolCalls: toolCalls}, nil) which bypasses the ErrMaxIterations path; update the branch in the loop that checks if i == maxIterations-1 to return the AnalysisResult along with ErrMaxIterations instead of nil so the caller sees the max-iteration failure (referencing AnalysisResult, ErrMaxIterations, i, maxIterations, textContent, and toolCalls).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/llm/retry_test.go`:
- Line 15: The file internal/llm/retry_test.go is not gofumpt-formatted (issue
at line 15); run the gofumpt formatter on that file to rewrite it to gofumpt
style, stage the resulting changes, and re-run the linter so the test file
(retry_test.go) conforms to the project's gofumpt formatting rules.
- Around line 317-320: The test's fallbackFn currently returns an invalid (nil,
nil) tuple after calling t.Fatal; change fallbackFn in
internal/llm/retry_test.go so it returns a non-nil error instead of nil for the
error value (e.g., return nil, errors.New("fallback called unexpectedly") or
fmt.Errorf(...)) to avoid nilnil; ensure you add the required import (errors or
fmt) if not already present.
In `@internal/llm/retry.go`:
- Around line 37-52: Before switching to the fallback, save the original primary
error (the err returned from retryLoop with primaryFn) into a distinct variable
(e.g., primaryErr) so it isn't lost; then when the fallback retryLoop (with
fallbackFn) also fails, return a combined error that preserves both failures
(use errors.Join(primaryErr, err) or fmt.Errorf with both messages) and include
both in the logger.Error context so diagnostics show both primary and fallback
failures.
- Around line 101-104: The errors.As check uses a value-typed target
(genai.APIError) which won't match the SDK's pointer error type; change the
target to a pointer (e.g., declare apiErr as *genai.APIError) and call
errors.As(err, &apiErr) so the match succeeds and retryableStatusCodes is
consulted for apiErr.Code; update the logic around the variable name `apiErr` in
retry.go to use the pointer type and keep the existing return of
retryableStatusCodes[apiErr.Code].
---
Outside diff comments:
In `@internal/llm/gemini.go`:
- Around line 109-117: The final iteration branch currently returns
(&AnalysisResult{Content: textContent, ToolCalls: toolCalls}, nil) which
bypasses the ErrMaxIterations path; update the branch in the loop that checks if
i == maxIterations-1 to return the AnalysisResult along with ErrMaxIterations
instead of nil so the caller sees the max-iteration failure (referencing
AnalysisResult, ErrMaxIterations, i, maxIterations, textContent, and toolCalls).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2bc7f5b6-0d3e-4dd5-86d0-245f9f6591a7
📒 Files selected for processing (6)
internal/analysisengine/engine.gointernal/llm/errors.gointernal/llm/gemini.gointernal/llm/retry.gointernal/llm/retry_test.gopkg/krknai/analysisengine/engine.go
…xhaustion Rename err to primaryErr/fallbackErr in AnalyzeWithRetry for clarity. Use errors.Join to preserve both primary and fallback errors when both models are exhausted, so callers and logs see the full failure picture. Co-Authored-By: Claude Code <noreply@anthropic.com>
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
Could |
| return &AnalysisResult{ | ||
| Content: textContent, | ||
| ToolCalls: toolCalls, | ||
| }, nil |
There was a problem hiding this comment.
| }, ErrMaxIterations |
5d3d505 to
f4b0b75
Compare
…, integration tests - 2 retries on primary model, 1 attempt on fallback, graceful error message - Flat 30s retry delay (no exponential backoff) - Integration tests with mock Gemini server for all error code scenarios - Store failed analysis result so LLM errors appear in Slack notifications Co-Authored-By: Claude Code <noreply@anthropic.com>
f4b0b75 to
7dbc4df
Compare
|
@varunraokadaparthi: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
internal/analysisengine,pkg/krknai/analysisengine)gemini-3.1-pro-preview) exhaustion, automatically switches to fallback model (gemini-2.5-pro) with the same retry strategygenai.APIErrorinspection; non-retryable errors (4xx, context cancellation, tool failures) short-circuit immediatelyTest plan
internal/llm/retry_test.gocovering: success paths, retryable/non-retryable status codes, primary exhaustion with fallback, both models exhausted, context cancellation, network timeouts, wrapped errorspkg/krknai/analysisenginetests pass unchangedmake buildsucceeds🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes