Skip to content

Fix RetryOptimizer: offline mode, disjoint coverage, eval_dirs#3

Merged
justi merged 7 commits into
mainfrom
fix/retry-optimizer-findings
Apr 17, 2026
Merged

Fix RetryOptimizer: offline mode, disjoint coverage, eval_dirs#3
justi merged 7 commits into
mainfrom
fix/retry-optimizer-findings

Conversation

@justi
Copy link
Copy Markdown
Owner

@justi justi commented Apr 17, 2026

Summary

Fixes findings from code review of the RetryOptimizer added in 0.6.1.

Fixes

  • High: forced online modeoptimize task always injected a real adapter, bypassing sample_response offline path. Now defaults to offline; only goes live with LIVE=1 or PROVIDER=.

  • High: chain semantics aligned with retry_executor — retry fires on validation_failed/parse_error, NOT on low eval score. A model returning :ok with semantically wrong output won't escalate. Therefore the chain's last model must pass ALL evals (safe fallback). Disjoint coverage (A passes e1, B passes e2, neither passes both) produces an empty chain — not a viable retry escalation. Known limitation documented: intermediate models assumed retryable; future version could inspect step_status.

  • High: ActiveSupport dependency removed.presence replaced with .empty? throughout rake task. Works in plain Ruby without Rails.

  • Medium: missing eval_dirs supportoptimize task now supports EVAL_DIRS= env var for non-Rails setups.

  • Medium: docs updatedLIVE=1 and EVAL_DIRS= documented in optimization guide.

Tests added (12 new specs)

  • RetryOptimizer: score matrix, constraining eval, chain construction, disjoint → empty, single model, reasoning_effort, offline mode (real step + sample_response), integration (chain → step.run)
  • OptimizeRakeTask: build_context offline/online/PROVIDER, parse_candidates (3 formats), EVAL_DIRS stub, empty PROVIDER

Test plan

  • 1301 specs, 0 failures
  • Offline mode verified: step with sample_response returns correct scores
  • Integration: optimizer chain's last model passes step.run
  • Disjoint coverage → empty chain (not viable per retry_executor semantics)
  • Plain Ruby: no ActiveSupport methods used

🤖 Generated with Claude Code

Three findings from code review:

1. (High) optimize task always injected real adapter via build_context,
   forcing online mode even when evals have sample_response. Now only
   injects adapter when LIVE=1 or PROVIDER= is set. Default = offline
   (zero API calls, uses sample_response).

2. (High) build_chain used count-based coverage tracking which missed
   candidates covering different evals (A passes e1, B passes e2 →
   old code produced [A] instead of [A, B]). Switched to Set-based
   tracking of which evals are covered. Chain is empty only when
   covered_evals < total after exhausting all candidates.

3. (Medium) optimize task called load_evals! without eval_dirs,
   breaking non-Rails setups. Now supports EVAL_DIRS= env var,
   matching the existing RakeTask behavior.

Added spec for disjoint coverage case. 1289 specs passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 18:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes issues in the RetryOptimizer workflow and its optimize rake task: ensuring offline-by-default evals work as intended, correcting chain construction when eval coverage is split across candidates, and allowing eval loading from explicit directories in non-Rails setups.

Changes:

  • Update the optimize rake task to support EVAL_DIRS and default to offline mode unless LIVE=1 or PROVIDER is set.
  • Fix RetryOptimizer chain construction to track covered evals by set-union rather than a simple count.
  • Add a spec case covering disjoint eval coverage across candidates.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
spec/ruby_llm/contract/eval/retry_optimizer_spec.rb Adds a new test scenario for disjoint eval coverage across candidates.
lib/ruby_llm/contract/rake_task.rb Adds EVAL_DIRS support and changes adapter injection logic to preserve offline sample_response behavior by default.
lib/ruby_llm/contract/eval/retry_optimizer.rb Reworks build_chain to track coverage with a set and returns no chain if full coverage can’t be achieved.
Gemfile.lock Bumps local gem version from 0.6.0 to 0.6.1.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +145 to +150
covered_evals = Set.new

labels.each do |label|
passes = evals.count { |e| (matrix.dig(e, label) || 0) >= @min_score }
next if passes <= covered
newly_covered = evals.select { |e| (matrix.dig(e, label) || 0) >= @min_score }
new_additions = newly_covered.to_set - covered_evals
next if new_additions.empty?
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set/to_set are used here, but this file does not require the stdlib set library. In a non-Rails environment this can raise NameError: uninitialized constant Set and/or NoMethodError for to_set. Add an explicit require "set" (either in this file or in a central require like ruby_llm/contract).

Copilot uses AI. Check for mistakes.
Comment on lines +128 to +143
it "returns empty chain when candidates cover disjoint evals (no single viable chain)" do
stub_compare_models_scores(step, {
"easy" => { "nano" => 1.0, "mini" => 0.5 },
"hard" => { "nano" => 0.5, "mini" => 1.0 }
})

result = described_class.new(
step: step,
candidates: [{ model: "nano" }, { model: "mini" }]
).call

# nano passes easy, mini passes hard — but neither passes both,
# and the chain nano→mini still doesn't cover easy+hard because
# mini fails easy. This is not a viable escalation chain.
expect(result.chain.length).to eq(2)
expect(result.chain).to eq([{ model: "nano" }, { model: "mini" }])
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example description and inline comment say the chain should be empty / not viable, but the assertions expect a 2-step chain [nano, mini]. Disjoint coverage across evals is a valid escalation chain (easy passes on nano; hard escalates to mini), so the test text should be updated to match the expected behavior (or, if the intent really is “no viable chain”, then the expectations should assert an empty chain).

Copilot uses AI. Check for mistakes.
Comment on lines +189 to +196
ctx = {}
provider = ENV["PROVIDER"].to_s.strip
ctx[:provider] = provider.downcase.to_sym unless provider.empty?
# Only inject real adapter when LIVE=1 or PROVIDER is set — otherwise
# evals use sample_response (offline mode, zero API calls).
if ENV["LIVE"] == "1" || provider.present?
ctx[:adapter] = RubyLLM::Contract::Adapters::RubyLLM.new
ctx[:provider] = provider.downcase.to_sym unless provider.empty?
end
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provider.present? relies on ActiveSupport’s Object#present?, but this gem’s dependencies (and this file’s requires) don’t include ActiveSupport. This will raise NoMethodError when running the rake task outside Rails. Use plain Ruby checks (e.g., !provider.empty?) and consider also avoiding .presence above for the same reason.

Copilot uses AI. Check for mistakes.
justi and others added 4 commits April 18, 2026 03:46
…description

- Add `require "set"` to retry_optimizer.rb (Set/to_set fail outside Rails)
- Replace `provider.present?` with `!provider.empty?` (no ActiveSupport dep)
- Fix disjoint coverage test: description now correctly explains that
  disjoint coverage IS viable via retry escalation, not "no viable chain"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 2 of review fixes:

1. (High) build_chain: disjoint eval coverage is NOT a viable retry
   chain. Retry fires on validation_failed/parse_error, not on low
   eval score — a model returning :ok with wrong output won't
   escalate. Now the chain's last model MUST pass ALL evals (safe
   fallback). Cheaper models are prepended as first-try optimization
   only. If no model passes all evals → empty chain.

2. (High) Replace .presence with plain Ruby (.empty?) in rake task.
   .presence requires ActiveSupport, which is not a gem dependency.

3. (Medium) Document LIVE=1 and EVAL_DIRS= in the optimization guide.
   Without these, non-Rails users and live-mode users can't use the
   optimize task correctly.

Disjoint coverage test updated: now expects empty chain (correct
per retry_executor semantics).

1289 specs passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 3 — tests requested by review:

OptimizeRakeTask specs (7):
- build_context: offline by default (no adapter), LIVE=1 injects
  adapter, PROVIDER= injects adapter+provider, empty PROVIDER
  doesn't trigger adapter (no ActiveSupport .presence)
- parse_candidates: comma-separated, @reasoning_effort, JSON array

RetryOptimizer specs (2):
- Offline mode: real step with sample_response, no stubs, no API
  calls — verifies scores and chain from canned responses
- Integration: optimizer chain's last model passes all evals when
  actually run through step.run with test adapter

Total: 1298 specs, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests added:
- EVAL_DIRS: parsing, empty when unset (2 cases)
- LIVE=1: end-to-end offline vs online adapter injection (1 case)
- Known limitation documented in build_chain: intermediate models
  assumed retryable; future version could inspect step_status.

1301 specs passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


covered_evals.merge(new_additions)
chain << parse_label_to_config(label)
details << { label: label, passes: covered_evals.size, cost: label }
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_chain stores passes: covered_evals.size for each intermediate model, but Result#print_chain prints this as " — passes X/Y evals". Because covered_evals is cumulative across prior models, this number can overstate what the current model actually passes (it may only contribute a small subset). Either track per-model pass count for details, or change the printed wording/field name to reflect cumulative "covered so far" semantics.

Suggested change
details << { label: label, passes: covered_evals.size, cost: label }
details << { label: label, passes: newly_covered.size, cost: label }

Copilot uses AI. Check for mistakes.
Comment on lines +128 to +144
it "returns empty chain when coverage is disjoint (no model passes all evals)" do
stub_compare_models_scores(step, {
"easy" => { "nano" => 1.0, "mini" => 0.5 },
"hard" => { "nano" => 0.5, "mini" => 1.0 }
})

result = described_class.new(
step: step,
candidates: [{ model: "nano" }, { model: "mini" }]
).call

# nano passes easy but not hard, mini passes hard but not easy.
# Retry fires on validation_failed/parse_error — NOT on low eval
# score. A model returning :ok with wrong output won't escalate.
# No single model covers all evals → no viable chain.
expect(result.chain).to be_empty
end
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description/test plan says the disjoint-coverage case should yield a chain like [A, B], but the new spec asserts the chain is empty in that scenario. Please reconcile the PR description with the implemented behavior (or adjust the spec/algorithm if the intended behavior is actually to include both models).

Copilot uses AI. Check for mistakes.
Comment on lines +51 to +56
describe "EVAL_DIRS support" do
it "passes EVAL_DIRS to load_evals!" do
with_env("EVAL_DIRS" => "lib/evals,extra/evals") do
dirs = ENV["EVAL_DIRS"].to_s.split(",").map(&:strip).reject(&:empty?)
expect(dirs).to eq(["lib/evals", "extra/evals"])
end
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "passes EVAL_DIRS to load_evals!" example only asserts how ENV["EVAL_DIRS"] is split, but never exercises the rake task nor verifies that RubyLLM::Contract.load_evals! is actually called with those dirs. Consider stubbing RubyLLM::Contract.load_evals! and invoking the rake task (or extracting a small helper method) so the test covers the real behavior added in lib/ruby_llm/contract/rake_task.rb.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +47
it "does not use ActiveSupport String#presence" do
# Verify that .presence is NOT called — plain Ruby .empty? is used instead.
# This ensures the task works outside Rails/ActiveSupport.
with_env("LIVE" => nil, "PROVIDER" => "") do
ctx = task.send(:build_context)
expect(ctx).not_to have_key(:adapter)
end
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example's description claims it verifies ActiveSupport String#presence isn't used, but the expectations only check adapter injection behavior. If the intent is to ensure non-Rails compatibility, it would be clearer to either (a) exercise the STEP/CANDIDATES parsing paths (where .presence was removed) or (b) rename the spec to match what it actually asserts.

Copilot uses AI. Check for mistakes.
justi and others added 2 commits April 18, 2026 04:07
…IRS stub

1. build_chain details: passes now shows per-model new coverage,
   not cumulative. print_chain says "covers X eval(s)" for
   intermediates, "passes all N evals" for the safe fallback.

2. EVAL_DIRS test: now stubs RubyLLM::Contract.load_evals! and
   verifies it receives the parsed dirs (or no args when unset).

3. Renamed misleading "does not use ActiveSupport" test to
   "treats empty PROVIDER same as nil".

4. PR description/disjoint semantic mismatch: the current spec
   correctly expects empty chain for disjoint coverage (retry
   won't bridge semantic gaps). PR description updated separately.

1301 specs passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.6.1: multi-provider tooling, recommend task, ollama support
0.6.2: optimize_retry_policy, offline-by-default, chain semantics
        fix, ActiveSupport removal, EVAL_DIRS, guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@justi justi merged commit 235e466 into main Apr 17, 2026
1 check passed
justi added a commit that referenced this pull request Apr 22, 2026
…ding + changelog coverage

Four inline findings from two Copilot review rounds; all addressed:

1. docs/guide/getting_started.md:3 — "OpenAI account" was provider-specific
   in a guide that inherits the README's "any ruby_llm provider" promise.
   Reworded to "LLM provider budget".

2. CHANGELOG.md — 0.7.2 entry listed only the optimizing_retry_policy
   rewrite. Added a Documentation subsection enumerating all five guide
   rewrites plus the output_schema DSL bug fix and the best_practices /
   pipeline / migration sanity pass, so release notes match what shipped.

3. docs/guide/testing.md:38 — the pipeline.test example used `decisions:
   [...]` and `analyses: [...]`. `...` is not valid Ruby inside an array
   literal; copy-paste would raise SyntaxError. Replaced with minimal
   realistic array entries matching each step's schema from pipeline.md.

4. docs/guide/testing.md:104 — stub_steps example had `response: { ... }`.
   Same issue as #3. Replaced with minimal response hashes matching the
   SummarizeArticle schema and a plausible RelatedArticles schema.

Full spec suite: 1341 examples, 0 failures.
justi added a commit that referenced this pull request May 31, 2026
Batch 2 (partial) of the post-0.9.0 quality refactor.

B2-T1 — `with_retry_disabled` removed (Codex finding #3)
========================================================
The old implementation in `eval/retry_optimizer.rb` mutated the step
class's singleton `retry_policy` method around `compare_models`, then
restored it in an `ensure`. This was a concurrency hazard: two
concurrent `optimize_retry_policy` calls on the same step would race,
and the `ensure` does not protect against the race.

Replaced with `context.merge(retry_policy_override: nil)` — the step
already honours this key in `Step::Base#runtime_settings` via
`context.key?(:retry_policy_override)`. Same effect, no mutation, no
race.

Characterization tests written FIRST (4 tests pinning original
`with_retry_disabled` behaviour: nil-inside, restore-after, restore-
on-raise, no-retry-policy-guard). After refactor, replaced with 3
tests pinning the new contract (retry_policy_override propagation,
class-level retry_policy untouched, obsolete private method gone).

B2-T3 — CostCalculator.find_model exposed (finding #2)
======================================================
`step/base.rb` was calling `CostCalculator.send(:find_model)` and
`CostCalculator.send(:compute_cost)` to bypass `private_class_method`.
Visibility bypass via `send` is invisible to grep-for-callers and
fragile under refactor.

Changes:
- Removed `find_model` from `private_class_method` list in
  `cost_calculator.rb` with a comment explaining the public contract.
- Replaced `CostCalculator.send(:find_model, name)` with the direct
  call.
- Dropped `Base.estimated_cost_for` helper entirely; routed through
  the existing public `CostCalculator.calculate(model_name:, usage:)`
  which already does the work. Removes the second `send` call.

5 characterization tests added in
`spec/ruby_llm/contract/cost_calculator_spec.rb` pinning
`find_model` (nil-on-unknown, nil-on-nil, custom registry, custom
overrides built-in, calculate-uses-find-model).

Suite: 1344 examples / 0 failures (+8 net new vs Batch 1 baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Closes the post-0.9.0 quality refactor pull (Batch 1 + Batch 2 partial
from TODO.md). Patch-version bump because every change is internal —
no DSL macro added/removed, no Step.run signature change, no adapter
contract change. Adopters bumping to `~> 0.9.0` automatically pick up
the safer code path on next install.

Bundled fixes (Codex findings #2, #3, #7, #8):
- Drop singleton-method mutation in with_retry_disabled (concurrency
  hazard) — replaced with retry_policy_override: nil via context.
- Expose CostCalculator.find_model publicly; remove two
  CostCalculator.send workarounds; route estimate_cost through the
  existing public CostCalculator.calculate.
- Unify stub_step on a single thread-local storage path; remove
  duplicate allow/receive branch.
- Drop dead ObjectSpace.each_object fallback (unreachable on Ruby
  >= 3.2; gemspec requires 3.2.0+ where Class#subclasses ships).

Test discipline: every refactor wrote characterization tests BEFORE
touching production code, then replaced with new contract tests after.
Suite: 1346 examples / 0 failures (+12 net new vs 0.9.0 baseline).

Remaining refactor backlog (Codex findings #1, #4, #5, #6) documented
in TODO.md and deferred to 0.10.0+ — each requires more invasive
surgery on shared surfaces (DSL inheritance macros, Runner config,
RakeTask SuiteGate) and benefits from dedicated focus rather than
tail-end inclusion in a patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Closes Codex finding #5 (RakeTask#define_task god method, 52 LOC, 6+
responsibilities). Last remaining finding from the post-0.9.0 quality
audit — TODO.md now marked COMPLETE across all 8 findings.

EXTRACTED
=========

New `lib/ruby_llm/contract/rake_task/suite_gate.rb` — testable in
isolation, returns a `Verdict` Data struct:

  Verdict.new(passed:, abort_reason:, passed_reports:, suite_cost:)

The gate enforces a strict ordering preserved from the prior god
method:
  1. Cost gate runs FIRST (if maximum_cost set and exceeded → abort
     before any score check; passed_reports is empty).
  2. Score gate runs per-report (uses minimum_score threshold; falls
     back to report.passed? when no threshold set).
  3. Regression gate runs per-report (when fail_on_regression true,
     prints REGRESSIONS DETECTED to stdout and marks the report failed).
  4. Overall passed = ALL reports passed AND cost gate not tripped.

REFACTORED
==========

`RakeTask#define_task` reduced from ~52 LOC to ~25 LOC. Extracted
`collect_host_reports(context)` helper handles eval running, empty
case, per-host printing. The gate logic, regression detection, cost
comparison, and per-report scoring all delegated to `SuiteGate`.

TEST DISCIPLINE
===============

5 characterization tests written FIRST in
`spec/ruby_llm/contract/rake_task_gate_spec.rb` pinning all four
dimensions:
- gate passes → "All evals passed" output, no abort
- gate fails on score → SystemExit with "Eval suite FAILED"
- cost gate priority → cost message even when score would pass
  (registered a custom pricey model to force non-zero cost)
- baselines NOT saved on score failure
- baselines DO save when all reports pass

All 5 passed BEFORE refactor (current behaviour) AND after (new
behaviour). Existing rake_task tests in f3_track_history_spec.rb and
v02_findings_spec.rb continue to pass unchanged.

Suite: 1369 examples / 0 failures / 8 pending.

CODEX BACKLOG STATUS
====================

All 8 findings from the post-0.9.0 code-quality audit now resolved:
  #1 DRY inheritance walk         → Batch 3 (B3-T2)
  #2 CostCalculator.send          → Batch 2 (B2-T3)
  #3 with_retry_disabled mutation → Batch 2 (B2-T1)
  #4 _explicitly_unset shadows    → Batch 3 (B3-T3, UNSET sentinel)
  #5 RakeTask god method          → Batch 4 (B4-T2, THIS commit)
  #6 Runner.new 17 kwargs         → Batch 2 (B2-T4, RunnerConfig.build)
  #7 ObjectSpace dead code        → Batch 1 (B1-T1)
  #8 stub_step dual storage       → Batch 2 (B2-T2)

Ready for 0.10.0 minor-version release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Three findings from a Codex DRY audit of lib/ (104 .rb files, ~7240 LOC).
HIGH-impact #1 already had silent drift in production code.

#1 [HIGH] Shared stub-helper concern (already drifting!)
========================================================

Before: `stub_step`, `stub_steps`, `stub_all_steps` plus their
`build_test_adapter` helper existed in parallel in Minitest helpers
(minitest.rb:50-157) and RSpec helpers (rspec/helpers.rb:24-121).
The RSpec version had already drifted — it ran every test response
through a `normalize_test_response` hook; Minitest did not. If an
adopter using Minitest registered custom normalization, their tests
would diverge from RSpec users running the same fixtures.

After: extracted `RubyLLM::Contract::Concerns::StubHelpers` containing
the three public methods and the `build_test_adapter` /
`normalize_test_response` private helpers. Both `MinitestHelpers` and
`RSpec::Helpers` now include it. Their host files shrink to a few
lines documenting the cleanup contract specific to each framework
(Minitest's `teardown` vs RSpec's `around(:each)`).

LOC: minitest.rb -107 / rspec/helpers.rb -116 / concerns/stub_helpers.rb
+95. Net: -128 LOC and a single source of truth for stub semantics.

#2 [HIGH] RetryExecutor now uses Concerns::UsageAggregator
==========================================================

Before: `RetryExecutor#aggregate_retry_usage` (retry_executor.rb:83-94)
duplicated the input/output token sum that `Concerns::UsageAggregator
#aggregate_usage` already implemented. Both produced
`{ input_tokens:, output_tokens: }`; both extracted usage via
`respond_to?(:usage)`; the only difference was the wrapper that
unpacked attempts versus raw traces.

Drift scenario this prevents: adding a third token field (e.g.
`:cache_read_tokens` for prompt-caching pricing) would have required
edits in both methods; missing one would silently undercount retry
costs.

After: `RetryExecutor` includes `Concerns::UsageAggregator` and
delegates: `aggregate_usage(all_attempts.map { |a| a[:result].trace })`.
The private duplicate is gone (~12 LOC).

#3 [MEDIUM] `validate_positive!` helper in Dsl
==============================================

Before: four DSL macros (`max_input`, `max_output`, `max_cost`,
`attachment_token_estimate`) each opened with the identical guard:

  unless value.is_a?(Numeric) && value.positive?
    raise ArgumentError, "X must be positive, got Y"
  end

After: extracted `validate_positive!(name, value)` private method
(mirrors the existing `CostCalculator.validate_price!` pattern).
Four call sites reduced from three lines to one.

NOT FIXED (Codex called these out as deferable)
================================================

#4 [MEDIUM]: `ReportStats` / `AggregatedReport` share a duck-type
contract (score, pass_rate, pass_rate_ratio, total_cost,
avg_latency_ms, passed?). Codex flagged this as MEDIUM impact but
the duck-type is intentionally load-bearing — making it explicit
via a shared interface module would be a bigger interface design
choice than this DRY pass warrants. Deferred.

LOW items (cost-format strings in Minitest vs RSpec messages,
inconsistent ArgumentError text in eval_definition) — left as-is
per the audit's recommendation.

Test suite: 1369 examples / 0 failures / 8 pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants