Fix RetryOptimizer: offline mode, disjoint coverage, eval_dirs#3
Conversation
Three findings from code review: 1. (High) optimize task always injected real adapter via build_context, forcing online mode even when evals have sample_response. Now only injects adapter when LIVE=1 or PROVIDER= is set. Default = offline (zero API calls, uses sample_response). 2. (High) build_chain used count-based coverage tracking which missed candidates covering different evals (A passes e1, B passes e2 → old code produced [A] instead of [A, B]). Switched to Set-based tracking of which evals are covered. Chain is empty only when covered_evals < total after exhausting all candidates. 3. (Medium) optimize task called load_evals! without eval_dirs, breaking non-Rails setups. Now supports EVAL_DIRS= env var, matching the existing RakeTask behavior. Added spec for disjoint coverage case. 1289 specs passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes issues in the RetryOptimizer workflow and its optimize rake task: ensuring offline-by-default evals work as intended, correcting chain construction when eval coverage is split across candidates, and allowing eval loading from explicit directories in non-Rails setups.
Changes:
- Update the optimize rake task to support
EVAL_DIRSand default to offline mode unlessLIVE=1orPROVIDERis set. - Fix RetryOptimizer chain construction to track covered evals by set-union rather than a simple count.
- Add a spec case covering disjoint eval coverage across candidates.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Adds a new test scenario for disjoint eval coverage across candidates. |
| lib/ruby_llm/contract/rake_task.rb | Adds EVAL_DIRS support and changes adapter injection logic to preserve offline sample_response behavior by default. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Reworks build_chain to track coverage with a set and returns no chain if full coverage can’t be achieved. |
| Gemfile.lock | Bumps local gem version from 0.6.0 to 0.6.1. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| covered_evals = Set.new | ||
|
|
||
| labels.each do |label| | ||
| passes = evals.count { |e| (matrix.dig(e, label) || 0) >= @min_score } | ||
| next if passes <= covered | ||
| newly_covered = evals.select { |e| (matrix.dig(e, label) || 0) >= @min_score } | ||
| new_additions = newly_covered.to_set - covered_evals | ||
| next if new_additions.empty? |
There was a problem hiding this comment.
Set/to_set are used here, but this file does not require the stdlib set library. In a non-Rails environment this can raise NameError: uninitialized constant Set and/or NoMethodError for to_set. Add an explicit require "set" (either in this file or in a central require like ruby_llm/contract).
| it "returns empty chain when candidates cover disjoint evals (no single viable chain)" do | ||
| stub_compare_models_scores(step, { | ||
| "easy" => { "nano" => 1.0, "mini" => 0.5 }, | ||
| "hard" => { "nano" => 0.5, "mini" => 1.0 } | ||
| }) | ||
|
|
||
| result = described_class.new( | ||
| step: step, | ||
| candidates: [{ model: "nano" }, { model: "mini" }] | ||
| ).call | ||
|
|
||
| # nano passes easy, mini passes hard — but neither passes both, | ||
| # and the chain nano→mini still doesn't cover easy+hard because | ||
| # mini fails easy. This is not a viable escalation chain. | ||
| expect(result.chain.length).to eq(2) | ||
| expect(result.chain).to eq([{ model: "nano" }, { model: "mini" }]) |
There was a problem hiding this comment.
This example description and inline comment say the chain should be empty / not viable, but the assertions expect a 2-step chain [nano, mini]. Disjoint coverage across evals is a valid escalation chain (easy passes on nano; hard escalates to mini), so the test text should be updated to match the expected behavior (or, if the intent really is “no viable chain”, then the expectations should assert an empty chain).
| ctx = {} | ||
| provider = ENV["PROVIDER"].to_s.strip | ||
| ctx[:provider] = provider.downcase.to_sym unless provider.empty? | ||
| # Only inject real adapter when LIVE=1 or PROVIDER is set — otherwise | ||
| # evals use sample_response (offline mode, zero API calls). | ||
| if ENV["LIVE"] == "1" || provider.present? | ||
| ctx[:adapter] = RubyLLM::Contract::Adapters::RubyLLM.new | ||
| ctx[:provider] = provider.downcase.to_sym unless provider.empty? | ||
| end |
There was a problem hiding this comment.
provider.present? relies on ActiveSupport’s Object#present?, but this gem’s dependencies (and this file’s requires) don’t include ActiveSupport. This will raise NoMethodError when running the rake task outside Rails. Use plain Ruby checks (e.g., !provider.empty?) and consider also avoiding .presence above for the same reason.
…description - Add `require "set"` to retry_optimizer.rb (Set/to_set fail outside Rails) - Replace `provider.present?` with `!provider.empty?` (no ActiveSupport dep) - Fix disjoint coverage test: description now correctly explains that disjoint coverage IS viable via retry escalation, not "no viable chain" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 2 of review fixes: 1. (High) build_chain: disjoint eval coverage is NOT a viable retry chain. Retry fires on validation_failed/parse_error, not on low eval score — a model returning :ok with wrong output won't escalate. Now the chain's last model MUST pass ALL evals (safe fallback). Cheaper models are prepended as first-try optimization only. If no model passes all evals → empty chain. 2. (High) Replace .presence with plain Ruby (.empty?) in rake task. .presence requires ActiveSupport, which is not a gem dependency. 3. (Medium) Document LIVE=1 and EVAL_DIRS= in the optimization guide. Without these, non-Rails users and live-mode users can't use the optimize task correctly. Disjoint coverage test updated: now expects empty chain (correct per retry_executor semantics). 1289 specs passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 3 — tests requested by review: OptimizeRakeTask specs (7): - build_context: offline by default (no adapter), LIVE=1 injects adapter, PROVIDER= injects adapter+provider, empty PROVIDER doesn't trigger adapter (no ActiveSupport .presence) - parse_candidates: comma-separated, @reasoning_effort, JSON array RetryOptimizer specs (2): - Offline mode: real step with sample_response, no stubs, no API calls — verifies scores and chain from canned responses - Integration: optimizer chain's last model passes all evals when actually run through step.run with test adapter Total: 1298 specs, 0 failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests added: - EVAL_DIRS: parsing, empty when unset (2 cases) - LIVE=1: end-to-end offline vs online adapter injection (1 case) - Known limitation documented in build_chain: intermediate models assumed retryable; future version could inspect step_status. 1301 specs passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| covered_evals.merge(new_additions) | ||
| chain << parse_label_to_config(label) | ||
| details << { label: label, passes: covered_evals.size, cost: label } |
There was a problem hiding this comment.
build_chain stores passes: covered_evals.size for each intermediate model, but Result#print_chain prints this as " — passes X/Y evals". Because covered_evals is cumulative across prior models, this number can overstate what the current model actually passes (it may only contribute a small subset). Either track per-model pass count for details, or change the printed wording/field name to reflect cumulative "covered so far" semantics.
| details << { label: label, passes: covered_evals.size, cost: label } | |
| details << { label: label, passes: newly_covered.size, cost: label } |
| it "returns empty chain when coverage is disjoint (no model passes all evals)" do | ||
| stub_compare_models_scores(step, { | ||
| "easy" => { "nano" => 1.0, "mini" => 0.5 }, | ||
| "hard" => { "nano" => 0.5, "mini" => 1.0 } | ||
| }) | ||
|
|
||
| result = described_class.new( | ||
| step: step, | ||
| candidates: [{ model: "nano" }, { model: "mini" }] | ||
| ).call | ||
|
|
||
| # nano passes easy but not hard, mini passes hard but not easy. | ||
| # Retry fires on validation_failed/parse_error — NOT on low eval | ||
| # score. A model returning :ok with wrong output won't escalate. | ||
| # No single model covers all evals → no viable chain. | ||
| expect(result.chain).to be_empty | ||
| end |
There was a problem hiding this comment.
The PR description/test plan says the disjoint-coverage case should yield a chain like [A, B], but the new spec asserts the chain is empty in that scenario. Please reconcile the PR description with the implemented behavior (or adjust the spec/algorithm if the intended behavior is actually to include both models).
| describe "EVAL_DIRS support" do | ||
| it "passes EVAL_DIRS to load_evals!" do | ||
| with_env("EVAL_DIRS" => "lib/evals,extra/evals") do | ||
| dirs = ENV["EVAL_DIRS"].to_s.split(",").map(&:strip).reject(&:empty?) | ||
| expect(dirs).to eq(["lib/evals", "extra/evals"]) | ||
| end |
There was a problem hiding this comment.
The "passes EVAL_DIRS to load_evals!" example only asserts how ENV["EVAL_DIRS"] is split, but never exercises the rake task nor verifies that RubyLLM::Contract.load_evals! is actually called with those dirs. Consider stubbing RubyLLM::Contract.load_evals! and invoking the rake task (or extracting a small helper method) so the test covers the real behavior added in lib/ruby_llm/contract/rake_task.rb.
| it "does not use ActiveSupport String#presence" do | ||
| # Verify that .presence is NOT called — plain Ruby .empty? is used instead. | ||
| # This ensures the task works outside Rails/ActiveSupport. | ||
| with_env("LIVE" => nil, "PROVIDER" => "") do | ||
| ctx = task.send(:build_context) | ||
| expect(ctx).not_to have_key(:adapter) | ||
| end |
There was a problem hiding this comment.
This example's description claims it verifies ActiveSupport String#presence isn't used, but the expectations only check adapter injection behavior. If the intent is to ensure non-Rails compatibility, it would be clearer to either (a) exercise the STEP/CANDIDATES parsing paths (where .presence was removed) or (b) rename the spec to match what it actually asserts.
…IRS stub 1. build_chain details: passes now shows per-model new coverage, not cumulative. print_chain says "covers X eval(s)" for intermediates, "passes all N evals" for the safe fallback. 2. EVAL_DIRS test: now stubs RubyLLM::Contract.load_evals! and verifies it receives the parsed dirs (or no args when unset). 3. Renamed misleading "does not use ActiveSupport" test to "treats empty PROVIDER same as nil". 4. PR description/disjoint semantic mismatch: the current spec correctly expects empty chain for disjoint coverage (retry won't bridge semantic gaps). PR description updated separately. 1301 specs passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.6.1: multi-provider tooling, recommend task, ollama support
0.6.2: optimize_retry_policy, offline-by-default, chain semantics
fix, ActiveSupport removal, EVAL_DIRS, guide
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ding + changelog coverage
Four inline findings from two Copilot review rounds; all addressed:
1. docs/guide/getting_started.md:3 — "OpenAI account" was provider-specific
in a guide that inherits the README's "any ruby_llm provider" promise.
Reworded to "LLM provider budget".
2. CHANGELOG.md — 0.7.2 entry listed only the optimizing_retry_policy
rewrite. Added a Documentation subsection enumerating all five guide
rewrites plus the output_schema DSL bug fix and the best_practices /
pipeline / migration sanity pass, so release notes match what shipped.
3. docs/guide/testing.md:38 — the pipeline.test example used `decisions:
[...]` and `analyses: [...]`. `...` is not valid Ruby inside an array
literal; copy-paste would raise SyntaxError. Replaced with minimal
realistic array entries matching each step's schema from pipeline.md.
4. docs/guide/testing.md:104 — stub_steps example had `response: { ... }`.
Same issue as #3. Replaced with minimal response hashes matching the
SummarizeArticle schema and a plausible RelatedArticles schema.
Full spec suite: 1341 examples, 0 failures.
Batch 2 (partial) of the post-0.9.0 quality refactor. B2-T1 — `with_retry_disabled` removed (Codex finding #3) ======================================================== The old implementation in `eval/retry_optimizer.rb` mutated the step class's singleton `retry_policy` method around `compare_models`, then restored it in an `ensure`. This was a concurrency hazard: two concurrent `optimize_retry_policy` calls on the same step would race, and the `ensure` does not protect against the race. Replaced with `context.merge(retry_policy_override: nil)` — the step already honours this key in `Step::Base#runtime_settings` via `context.key?(:retry_policy_override)`. Same effect, no mutation, no race. Characterization tests written FIRST (4 tests pinning original `with_retry_disabled` behaviour: nil-inside, restore-after, restore- on-raise, no-retry-policy-guard). After refactor, replaced with 3 tests pinning the new contract (retry_policy_override propagation, class-level retry_policy untouched, obsolete private method gone). B2-T3 — CostCalculator.find_model exposed (finding #2) ====================================================== `step/base.rb` was calling `CostCalculator.send(:find_model)` and `CostCalculator.send(:compute_cost)` to bypass `private_class_method`. Visibility bypass via `send` is invisible to grep-for-callers and fragile under refactor. Changes: - Removed `find_model` from `private_class_method` list in `cost_calculator.rb` with a comment explaining the public contract. - Replaced `CostCalculator.send(:find_model, name)` with the direct call. - Dropped `Base.estimated_cost_for` helper entirely; routed through the existing public `CostCalculator.calculate(model_name:, usage:)` which already does the work. Removes the second `send` call. 5 characterization tests added in `spec/ruby_llm/contract/cost_calculator_spec.rb` pinning `find_model` (nil-on-unknown, nil-on-nil, custom registry, custom overrides built-in, calculate-uses-find-model). Suite: 1344 examples / 0 failures (+8 net new vs Batch 1 baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the post-0.9.0 quality refactor pull (Batch 1 + Batch 2 partial from TODO.md). Patch-version bump because every change is internal — no DSL macro added/removed, no Step.run signature change, no adapter contract change. Adopters bumping to `~> 0.9.0` automatically pick up the safer code path on next install. Bundled fixes (Codex findings #2, #3, #7, #8): - Drop singleton-method mutation in with_retry_disabled (concurrency hazard) — replaced with retry_policy_override: nil via context. - Expose CostCalculator.find_model publicly; remove two CostCalculator.send workarounds; route estimate_cost through the existing public CostCalculator.calculate. - Unify stub_step on a single thread-local storage path; remove duplicate allow/receive branch. - Drop dead ObjectSpace.each_object fallback (unreachable on Ruby >= 3.2; gemspec requires 3.2.0+ where Class#subclasses ships). Test discipline: every refactor wrote characterization tests BEFORE touching production code, then replaced with new contract tests after. Suite: 1346 examples / 0 failures (+12 net new vs 0.9.0 baseline). Remaining refactor backlog (Codex findings #1, #4, #5, #6) documented in TODO.md and deferred to 0.10.0+ — each requires more invasive surgery on shared surfaces (DSL inheritance macros, Runner config, RakeTask SuiteGate) and benefits from dedicated focus rather than tail-end inclusion in a patch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Codex finding #5 (RakeTask#define_task god method, 52 LOC, 6+ responsibilities). Last remaining finding from the post-0.9.0 quality audit — TODO.md now marked COMPLETE across all 8 findings. EXTRACTED ========= New `lib/ruby_llm/contract/rake_task/suite_gate.rb` — testable in isolation, returns a `Verdict` Data struct: Verdict.new(passed:, abort_reason:, passed_reports:, suite_cost:) The gate enforces a strict ordering preserved from the prior god method: 1. Cost gate runs FIRST (if maximum_cost set and exceeded → abort before any score check; passed_reports is empty). 2. Score gate runs per-report (uses minimum_score threshold; falls back to report.passed? when no threshold set). 3. Regression gate runs per-report (when fail_on_regression true, prints REGRESSIONS DETECTED to stdout and marks the report failed). 4. Overall passed = ALL reports passed AND cost gate not tripped. REFACTORED ========== `RakeTask#define_task` reduced from ~52 LOC to ~25 LOC. Extracted `collect_host_reports(context)` helper handles eval running, empty case, per-host printing. The gate logic, regression detection, cost comparison, and per-report scoring all delegated to `SuiteGate`. TEST DISCIPLINE =============== 5 characterization tests written FIRST in `spec/ruby_llm/contract/rake_task_gate_spec.rb` pinning all four dimensions: - gate passes → "All evals passed" output, no abort - gate fails on score → SystemExit with "Eval suite FAILED" - cost gate priority → cost message even when score would pass (registered a custom pricey model to force non-zero cost) - baselines NOT saved on score failure - baselines DO save when all reports pass All 5 passed BEFORE refactor (current behaviour) AND after (new behaviour). Existing rake_task tests in f3_track_history_spec.rb and v02_findings_spec.rb continue to pass unchanged. Suite: 1369 examples / 0 failures / 8 pending. CODEX BACKLOG STATUS ==================== All 8 findings from the post-0.9.0 code-quality audit now resolved: #1 DRY inheritance walk → Batch 3 (B3-T2) #2 CostCalculator.send → Batch 2 (B2-T3) #3 with_retry_disabled mutation → Batch 2 (B2-T1) #4 _explicitly_unset shadows → Batch 3 (B3-T3, UNSET sentinel) #5 RakeTask god method → Batch 4 (B4-T2, THIS commit) #6 Runner.new 17 kwargs → Batch 2 (B2-T4, RunnerConfig.build) #7 ObjectSpace dead code → Batch 1 (B1-T1) #8 stub_step dual storage → Batch 2 (B2-T2) Ready for 0.10.0 minor-version release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from a Codex DRY audit of lib/ (104 .rb files, ~7240 LOC). HIGH-impact #1 already had silent drift in production code. #1 [HIGH] Shared stub-helper concern (already drifting!) ======================================================== Before: `stub_step`, `stub_steps`, `stub_all_steps` plus their `build_test_adapter` helper existed in parallel in Minitest helpers (minitest.rb:50-157) and RSpec helpers (rspec/helpers.rb:24-121). The RSpec version had already drifted — it ran every test response through a `normalize_test_response` hook; Minitest did not. If an adopter using Minitest registered custom normalization, their tests would diverge from RSpec users running the same fixtures. After: extracted `RubyLLM::Contract::Concerns::StubHelpers` containing the three public methods and the `build_test_adapter` / `normalize_test_response` private helpers. Both `MinitestHelpers` and `RSpec::Helpers` now include it. Their host files shrink to a few lines documenting the cleanup contract specific to each framework (Minitest's `teardown` vs RSpec's `around(:each)`). LOC: minitest.rb -107 / rspec/helpers.rb -116 / concerns/stub_helpers.rb +95. Net: -128 LOC and a single source of truth for stub semantics. #2 [HIGH] RetryExecutor now uses Concerns::UsageAggregator ========================================================== Before: `RetryExecutor#aggregate_retry_usage` (retry_executor.rb:83-94) duplicated the input/output token sum that `Concerns::UsageAggregator #aggregate_usage` already implemented. Both produced `{ input_tokens:, output_tokens: }`; both extracted usage via `respond_to?(:usage)`; the only difference was the wrapper that unpacked attempts versus raw traces. Drift scenario this prevents: adding a third token field (e.g. `:cache_read_tokens` for prompt-caching pricing) would have required edits in both methods; missing one would silently undercount retry costs. After: `RetryExecutor` includes `Concerns::UsageAggregator` and delegates: `aggregate_usage(all_attempts.map { |a| a[:result].trace })`. The private duplicate is gone (~12 LOC). #3 [MEDIUM] `validate_positive!` helper in Dsl ============================================== Before: four DSL macros (`max_input`, `max_output`, `max_cost`, `attachment_token_estimate`) each opened with the identical guard: unless value.is_a?(Numeric) && value.positive? raise ArgumentError, "X must be positive, got Y" end After: extracted `validate_positive!(name, value)` private method (mirrors the existing `CostCalculator.validate_price!` pattern). Four call sites reduced from three lines to one. NOT FIXED (Codex called these out as deferable) ================================================ #4 [MEDIUM]: `ReportStats` / `AggregatedReport` share a duck-type contract (score, pass_rate, pass_rate_ratio, total_cost, avg_latency_ms, passed?). Codex flagged this as MEDIUM impact but the duck-type is intentionally load-bearing — making it explicit via a shared interface module would be a bigger interface design choice than this DRY pass warrants. Deferred. LOW items (cost-format strings in Minitest vs RSpec messages, inconsistent ArgumentError text in eval_definition) — left as-is per the audit's recommendation. Test suite: 1369 examples / 0 failures / 8 pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes findings from code review of the RetryOptimizer added in 0.6.1.
Fixes
High: forced online mode —
optimizetask always injected a real adapter, bypassingsample_responseoffline path. Now defaults to offline; only goes live withLIVE=1orPROVIDER=.High: chain semantics aligned with retry_executor — retry fires on
validation_failed/parse_error, NOT on low eval score. A model returning:okwith semantically wrong output won't escalate. Therefore the chain's last model must pass ALL evals (safe fallback). Disjoint coverage (A passes e1, B passes e2, neither passes both) produces an empty chain — not a viable retry escalation. Known limitation documented: intermediate models assumed retryable; future version could inspectstep_status.High: ActiveSupport dependency removed —
.presencereplaced with.empty?throughout rake task. Works in plain Ruby without Rails.Medium: missing eval_dirs support —
optimizetask now supportsEVAL_DIRS=env var for non-Rails setups.Medium: docs updated —
LIVE=1andEVAL_DIRS=documented in optimization guide.Tests added (12 new specs)
Test plan
🤖 Generated with Claude Code