v0.6.0 — Model recommendation with reasoning_effort support#2
Conversation
Step.recommend("eval", candidates: [...]) returns actionable advice:
optimal model, retry chain, savings, and to_dsl code output.
Key changes:
- Candidates are configurations (model + reasoning_effort), not just names
- RetryPolicy supports per-attempt config hashes via escalate()
- RetryExecutor forwards per-attempt reasoning_effort to provider
- compare_models extended with candidates: parameter
- New Recommender algorithm + Recommendation value object
- pass_rate_ratio (float) on Report/ReportStats
- History entries store model, reasoning_effort, pass_rate_ratio
1246 tests, 0 failures. RubyCritic score: 88.07.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR bumps the gem to v0.6.0 and introduces a model/config recommendation workflow driven by eval results, including support for per-attempt reasoning_effort in retry policies and richer eval/report metadata.
Changes:
- Add
Step.recommend(eval, candidates: [...])plus newEval::Recommender/Eval::Recommendationto return a best candidate, retry chain, rationale, and DSL output. - Extend
compare_modelsto acceptcandidates:(config hashes) and enrichModelComparison/history entries withreasoning_effort+pass_rate_ratio. - Upgrade retry policies/execution to use config-based escalation (
{ model:, reasoning_effort: }) and forward per-attempt options to providers.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| spec/ruby_llm/contract/step/retry_policy_spec.rb | Adds specs for config-based escalation normalization and attempt selection. |
| spec/ruby_llm/contract/step/retry_executor_spec.rb | Adds specs ensuring per-attempt reasoning_effort forwarding and trace behavior. |
| spec/ruby_llm/contract/step/base_spec.rb | Adds specs for .recommend and determining current model config. |
| spec/ruby_llm/contract/eval/report_stats_spec.rb | Adds tests for numeric pass_rate_ratio. |
| spec/ruby_llm/contract/eval/recommender_spec.rb | Adds tests for recommendation selection, retry chain, savings, warnings. |
| spec/ruby_llm/contract/eval/recommendation_spec.rb | Adds tests for immutability and DSL output generation. |
| spec/ruby_llm/contract/eval/model_comparison_spec.rb | Adds tests for candidate labeling/config mapping and pass_rate_ratio in to_h. |
| lib/ruby_llm/contract/version.rb | Bumps gem version to 0.6.0. |
| lib/ruby_llm/contract/step/retry_policy.rb | Switches from model lists to config lists (hash/string normalization). |
| lib/ruby_llm/contract/step/retry_executor.rb | Forwards per-attempt config extras (e.g., reasoning_effort) and adds config to trace. |
| lib/ruby_llm/contract/step/base.rb | Adds .recommend and a helper to determine current model config. |
| lib/ruby_llm/contract/eval/report.rb | Delegates pass_rate_ratio from stats. |
| lib/ruby_llm/contract/eval/report_storage.rb | Adds reasoning_effort to paths/entries and persists pass_rate_ratio. |
| lib/ruby_llm/contract/eval/report_stats.rb | Implements numeric pass_rate_ratio. |
| lib/ruby_llm/contract/eval/recommender.rb | New recommendation algorithm selecting best/first-try chain and computing savings/warnings. |
| lib/ruby_llm/contract/eval/recommendation.rb | New Recommendation value object + to_dsl output. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Adds candidate labeling and config mapping support (vs model-only keys). |
| lib/ruby_llm/contract/eval.rb | Requires new recommender/recommendation files. |
| lib/ruby_llm/contract/concerns/eval_host.rb | Extends compare_models to accept candidates: and carry reasoning_effort into eval runs. |
| lib/ruby_llm/contract.rb | Adds normalize_candidate_config helper used across retry + eval flows. |
| README.md | Documents recommendation API and updates roadmap. |
| docs/guide/getting_started.md | Updates reasoning_effort docs language. |
| docs/architecture.md | Adds new recommender/recommendation components. |
| CHANGELOG.md | Adds v0.6.0 entry describing new features. |
| Gemfile.lock | Bumps locked gem version to 0.6.0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| when String then { model: entry } | ||
| when Hash then entry | ||
| else raise ArgumentError, "Expected String or Hash, got #{entry.class}" |
There was a problem hiding this comment.
normalize_candidate_config returns the input Hash as-is, which means downstream callers that freeze configs (e.g., RetryPolicy/escalate and Recommendation) will end up freezing user-owned hashes. It also doesn’t validate that a Hash candidate actually contains a usable :model (or handle string keys), which can lead to model=nil being passed into adapters.
Consider returning a normalized copy (e.g., symbolize keys, require a non-empty :model String, and drop nil optional keys like reasoning_effort) so internal code can safely freeze without mutating caller objects and so invalid candidates fail fast with a clear ArgumentError.
| when String then { model: entry } | |
| when Hash then entry | |
| else raise ArgumentError, "Expected String or Hash, got #{entry.class}" | |
| when String | |
| model = entry.strip | |
| raise ArgumentError, "Candidate config model must be a non-empty String" if model.empty? | |
| { model: model } | |
| when Hash | |
| candidate = entry.each_with_object({}) do |(key, value), normalized| | |
| normalized[key.respond_to?(:to_sym) ? key.to_sym : key] = value | |
| end | |
| model = candidate[:model] | |
| unless model.is_a?(String) && !model.strip.empty? | |
| raise ArgumentError, "Candidate config must include a non-empty String :model" | |
| end | |
| normalized = { model: model.strip } | |
| normalized[:reasoning_effort] = candidate[:reasoning_effort] unless candidate[:reasoning_effort].nil? | |
| normalized | |
| else | |
| raise ArgumentError, "Expected String or Hash, got #{entry.class}" |
| def escalate(*config_list) | ||
| @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze | ||
| @max_attempts = @configs.length if @max_attempts < @configs.length |
There was a problem hiding this comment.
escalate freezes each config hash. If the caller passes a Hash (not a String), this currently freezes the same object they provided (because normalize_candidate_config returns the Hash unchanged). That’s a surprising side effect and can break callers that reuse or mutate their config hashes.
Prefer duplicating/normalizing configs before freezing (e.g., dup + symbolize/validate in the normalizer, or deep-dup-freeze here) so RetryPolicy never mutates caller-owned objects.
| attr_reader :best, :retry_chain, :score, :cost_per_call, | ||
| :rationale, :current_config, :savings, :warnings | ||
|
|
||
| def initialize(best:, retry_chain:, score:, cost_per_call:, | ||
| rationale:, current_config:, savings:, warnings:) | ||
| @best = best&.freeze | ||
| @retry_chain = retry_chain.freeze | ||
| @score = score | ||
| @cost_per_call = cost_per_call | ||
| @rationale = rationale.freeze | ||
| @current_config = current_config&.freeze | ||
| @savings = savings.freeze | ||
| @warnings = warnings.freeze |
There was a problem hiding this comment.
Recommendation is intended to be an immutable value object (it freezes itself), but retry_chain, savings, and warnings are only shallow-frozen; the nested Hashes/Arrays remain mutable, so callers can still mutate internal state through rec.retry_chain.first[:model] = ..., etc.
Consider using the existing Concerns::DeepFreeze helper (deep-dup-freeze) for these attributes so the object is deeply immutable without freezing caller-owned objects.
| attr_reader :best, :retry_chain, :score, :cost_per_call, | |
| :rationale, :current_config, :savings, :warnings | |
| def initialize(best:, retry_chain:, score:, cost_per_call:, | |
| rationale:, current_config:, savings:, warnings:) | |
| @best = best&.freeze | |
| @retry_chain = retry_chain.freeze | |
| @score = score | |
| @cost_per_call = cost_per_call | |
| @rationale = rationale.freeze | |
| @current_config = current_config&.freeze | |
| @savings = savings.freeze | |
| @warnings = warnings.freeze | |
| include Concerns::DeepFreeze | |
| attr_reader :best, :retry_chain, :score, :cost_per_call, | |
| :rationale, :current_config, :savings, :warnings | |
| def initialize(best:, retry_chain:, score:, cost_per_call:, | |
| rationale:, current_config:, savings:, warnings:) | |
| @best = best&.freeze | |
| @retry_chain = deep_dup_freeze(retry_chain) | |
| @score = score | |
| @cost_per_call = cost_per_call | |
| @rationale = rationale.freeze | |
| @current_config = current_config&.freeze | |
| @savings = deep_dup_freeze(savings) | |
| @warnings = deep_dup_freeze(warnings) |
| def build_rationale(scored, best) | ||
| sorted = scored.sort_by { |s| [s[:cost_per_call], s[:latency], s[:label]] } | ||
| sorted.map { |s| rationale_line(s, best) } | ||
| end | ||
|
|
||
| def rationale_line(candidate, best) | ||
| header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call" | ||
| notes = rationale_notes(candidate, best) |
There was a problem hiding this comment.
build_rationale sorts candidates purely by cost_per_call, but candidates with unknown pricing are represented with cost_per_call = 0.0, so they will sort to the top and appear “cheapest” in the rationale despite being excluded from selection and flagged as unknown pricing.
Consider sorting unknown-pricing candidates last (e.g., sort key [cost_known? ? cost_per_call : Float::INFINITY, latency, label]) and/or formatting unknown costs as “n/a” to avoid misleading rationale ordering/output.
| warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \ | ||
| "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`." |
There was a problem hiding this comment.
The warning message still says “no models”, but the API is now config-based (@configs, config_list). Updating the wording (and the suggested DSL snippet) to reflect configs would make the warning clearer for users reading logs.
| warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \ | |
| "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`." | |
| warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \ | |
| "This means no actual retry will happen. Add `attempts 2` or " \ | |
| "`escalate [{ model: \"model1\" }, { model: \"model2\" }]`." |
… fix - normalize_candidate_config validates :model, symbolizes keys, returns a new hash (never freezes caller's object) - Recommendation uses deep_dup_freeze for nested immutability - Rationale sorts unknown-pricing candidates last - Warning message updated for config-based API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each section leads with the outcome and the money, not the API. Move minor features to docs. Install near the top. Shorter code blocks. Before/after framing on every killer feature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 24 out of 25 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| model = policy.model_for_attempt(attempt_index, default_model) | ||
| config = policy.config_for_attempt(attempt_index, default_config) | ||
| model = config[:model] | ||
| attempt_extra = extra_options.merge(config.except(:model)) |
There was a problem hiding this comment.
Hash#except is not part of Ruby core (Ruby >=3.2 has Hash#slice but not except), so config.except(:model) will raise NoMethodError in production. Replace this with a core-Ruby approach (e.g., config.reject { |k,_| k == :model } or dup+delete) before merging into extra_options.
| attempt_extra = extra_options.merge(config.except(:model)) | |
| attempt_extra = extra_options.merge(config.reject { |k, _| k == :model }) |
| config: config, | ||
| score: report.score, | ||
| cost_per_call: cost_per_call, | ||
| latency: report.avg_latency_ms || 0, |
There was a problem hiding this comment.
Treating missing latency as 0 (report.avg_latency_ms || 0) will bias tiebreaking toward candidates with unknown latency (they’ll look fastest). Consider using a sentinel like Float::INFINITY (or keeping nil and sorting with nil last) so “unknown” doesn’t win latency comparisons.
| latency: report.avg_latency_ms || 0, | |
| latency: report.avg_latency_ms || Float::INFINITY, |
| header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call" | ||
| notes = rationale_notes(candidate, best) | ||
| notes.any? ? "#{header} — #{notes.join(", ")}" : header | ||
| end | ||
|
|
There was a problem hiding this comment.
When pricing is unknown, cost_per_call becomes 0.0, so rationale lines currently render as at $0.0000/call even though the same candidate is flagged as “unknown pricing”. Consider formatting cost as n/a (or similar) when cost_known? is false to avoid misleading output.
| header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call" | |
| notes = rationale_notes(candidate, best) | |
| notes.any? ? "#{header} — #{notes.join(", ")}" : header | |
| end | |
| header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{format_cost_per_call(candidate)}/call" | |
| notes = rationale_notes(candidate, best) | |
| notes.any? ? "#{header} — #{notes.join(", ")}" : header | |
| end | |
| def format_cost_per_call(candidate) | |
| return "n/a" unless cost_known?(candidate) | |
| "$#{format("%.4f", candidate[:cost_per_call])}" | |
| end |
| @rationale = rationale.freeze | ||
| @current_config = deep_dup_freeze(current_config) | ||
| @savings = deep_dup_freeze(savings) | ||
| @warnings = warnings.freeze |
There was a problem hiding this comment.
Recommendation includes Concerns::DeepFreeze, but @rationale = rationale.freeze and @warnings = warnings.freeze are only shallow freezes and keep the original object references. To ensure the object is truly immutable (and consistent with other DeepFreeze users), consider using deep_dup_freeze for rationale and warnings as well.
| @rationale = rationale.freeze | |
| @current_config = deep_dup_freeze(current_config) | |
| @savings = deep_dup_freeze(savings) | |
| @warnings = warnings.freeze | |
| @rationale = deep_dup_freeze(rationale) | |
| @current_config = deep_dup_freeze(current_config) | |
| @savings = deep_dup_freeze(savings) | |
| @warnings = deep_dup_freeze(warnings) |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…number Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uality link - Tagline: "missing link between cost and quality" instead of generic pitch - validate row: clarify it needs retry_policy for auto-retry - Remove specific dollar amounts that depend on prompt length/pricing - "first model that passed" not "cheapest" (depends on user ordering) - savings: "calculated automatically" not hardcoded $17 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Visual: BEFORE (pick one model, hope) → CONFIGURE (4 lines) → GEM HANDLES (escalation flow with boxes) → YOU GET (outcomes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, correct trace output - output_schema: "send to provider + validate client-side" not "force" - trace[:cost] returns float, not string with $ - Remove hardcoded 90/9/1 percentages from diagram - Remove unsubstantiated savings example from recommend section - "most requests stay on nano" instead of "90%" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Config box: "YOU DEFINE A CONTRACT" Flow boxes: "contract" instead of "validate" — schema + business rules together are the contract that gates every LLM response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diagram says "contract" — now all prose sections use the same word: "contract failed/passed" instead of "validation_failed", "fails the contract" instead of "fails validation". Consistent mental model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g, deep freeze - Unknown latency → Float::INFINITY (not 0, which biased tiebreaking) - Unknown pricing → "unknown pricing" in rationale (not $0.0000) - Recommendation: deep_dup_freeze on rationale and warnings - Hash#except is fine — Ruby >= 3.0, gem requires >= 3.2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…deep freeze - Tiebreaker: equal cost → lower latency wins (was only testing lexicographic) - Unknown latency does not win tiebreak (Float::INFINITY, not 0) - Rationale shows "unknown pricing" not "$0.0000" for zero-cost candidates - Deep immutability: nested hashes in retry_chain/savings raise FrozenError Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ReportStats: add tests for score, total_cost, avg_latency_ms, passed?, pass_rate (was only testing pass_rate_ratio — 1 of 11 methods) - normalize_candidate_config: new spec covering string keys, empty input, missing model, whitespace, no-caller-mutation - Recommender savings: assert exact per_call and monthly_at values - ModelComparison to_h: verify full hash, not just one key - Recommendation to_dsl: assert reasoning_effort: "high" with quotes - Step.recommend: verify score, rationale count, unknown pricing warning - Remove rand() from all test helpers — use deterministic names Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 25 out of 26 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ffort override
Bug Copilot caught on the second review pass:
class MyStep < Step::Base
thinking effort: :low, budget: 2048
end
MyStep.run(input, context: { reasoning_effort: :high })
Previously runtime_settings only injected `extra[:thinking]` when
context did NOT carry `:reasoning_effort`. With a per-call override
present, `:thinking` was dropped entirely — budget silently lost,
even though the adapter's `resolve_thinking_config` is capable of
merging effort over `thinking[:effort]` while preserving budget.
Fix: always pass class-level `:thinking` to extra_options when set.
Seed `:reasoning_effort` from `thinking[:effort]` only when context
did not already supply one. The adapter then resolves the precedence
correctly: per-call effort wins, class-level budget survives.
Two specs added in `runtime_settings integration` group:
- "context :reasoning_effort wins for effort but :thinking still travels"
- "context :reasoning_effort preserves class-level :budget through runtime_settings"
Suite: 1311 examples, 0 failures, 8 pending.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batch 2 (partial) of the post-0.9.0 quality refactor. B2-T1 — `with_retry_disabled` removed (Codex finding #3) ======================================================== The old implementation in `eval/retry_optimizer.rb` mutated the step class's singleton `retry_policy` method around `compare_models`, then restored it in an `ensure`. This was a concurrency hazard: two concurrent `optimize_retry_policy` calls on the same step would race, and the `ensure` does not protect against the race. Replaced with `context.merge(retry_policy_override: nil)` — the step already honours this key in `Step::Base#runtime_settings` via `context.key?(:retry_policy_override)`. Same effect, no mutation, no race. Characterization tests written FIRST (4 tests pinning original `with_retry_disabled` behaviour: nil-inside, restore-after, restore- on-raise, no-retry-policy-guard). After refactor, replaced with 3 tests pinning the new contract (retry_policy_override propagation, class-level retry_policy untouched, obsolete private method gone). B2-T3 — CostCalculator.find_model exposed (finding #2) ====================================================== `step/base.rb` was calling `CostCalculator.send(:find_model)` and `CostCalculator.send(:compute_cost)` to bypass `private_class_method`. Visibility bypass via `send` is invisible to grep-for-callers and fragile under refactor. Changes: - Removed `find_model` from `private_class_method` list in `cost_calculator.rb` with a comment explaining the public contract. - Replaced `CostCalculator.send(:find_model, name)` with the direct call. - Dropped `Base.estimated_cost_for` helper entirely; routed through the existing public `CostCalculator.calculate(model_name:, usage:)` which already does the work. Removes the second `send` call. 5 characterization tests added in `spec/ruby_llm/contract/cost_calculator_spec.rb` pinning `find_model` (nil-on-unknown, nil-on-nil, custom registry, custom overrides built-in, calculate-uses-find-model). Suite: 1344 examples / 0 failures (+8 net new vs Batch 1 baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the post-0.9.0 quality refactor pull (Batch 1 + Batch 2 partial from TODO.md). Patch-version bump because every change is internal — no DSL macro added/removed, no Step.run signature change, no adapter contract change. Adopters bumping to `~> 0.9.0` automatically pick up the safer code path on next install. Bundled fixes (Codex findings #2, #3, #7, #8): - Drop singleton-method mutation in with_retry_disabled (concurrency hazard) — replaced with retry_policy_override: nil via context. - Expose CostCalculator.find_model publicly; remove two CostCalculator.send workarounds; route estimate_cost through the existing public CostCalculator.calculate. - Unify stub_step on a single thread-local storage path; remove duplicate allow/receive branch. - Drop dead ObjectSpace.each_object fallback (unreachable on Ruby >= 3.2; gemspec requires 3.2.0+ where Class#subclasses ships). Test discipline: every refactor wrote characterization tests BEFORE touching production code, then replaced with new contract tests after. Suite: 1346 examples / 0 failures (+12 net new vs 0.9.0 baseline). Remaining refactor backlog (Codex findings #1, #4, #5, #6) documented in TODO.md and deferred to 0.10.0+ — each requires more invasive surgery on shared surfaces (DSL inheritance macros, Runner config, RakeTask SuiteGate) and benefits from dedicated focus rather than tail-end inclusion in a patch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Codex finding #5 (RakeTask#define_task god method, 52 LOC, 6+ responsibilities). Last remaining finding from the post-0.9.0 quality audit — TODO.md now marked COMPLETE across all 8 findings. EXTRACTED ========= New `lib/ruby_llm/contract/rake_task/suite_gate.rb` — testable in isolation, returns a `Verdict` Data struct: Verdict.new(passed:, abort_reason:, passed_reports:, suite_cost:) The gate enforces a strict ordering preserved from the prior god method: 1. Cost gate runs FIRST (if maximum_cost set and exceeded → abort before any score check; passed_reports is empty). 2. Score gate runs per-report (uses minimum_score threshold; falls back to report.passed? when no threshold set). 3. Regression gate runs per-report (when fail_on_regression true, prints REGRESSIONS DETECTED to stdout and marks the report failed). 4. Overall passed = ALL reports passed AND cost gate not tripped. REFACTORED ========== `RakeTask#define_task` reduced from ~52 LOC to ~25 LOC. Extracted `collect_host_reports(context)` helper handles eval running, empty case, per-host printing. The gate logic, regression detection, cost comparison, and per-report scoring all delegated to `SuiteGate`. TEST DISCIPLINE =============== 5 characterization tests written FIRST in `spec/ruby_llm/contract/rake_task_gate_spec.rb` pinning all four dimensions: - gate passes → "All evals passed" output, no abort - gate fails on score → SystemExit with "Eval suite FAILED" - cost gate priority → cost message even when score would pass (registered a custom pricey model to force non-zero cost) - baselines NOT saved on score failure - baselines DO save when all reports pass All 5 passed BEFORE refactor (current behaviour) AND after (new behaviour). Existing rake_task tests in f3_track_history_spec.rb and v02_findings_spec.rb continue to pass unchanged. Suite: 1369 examples / 0 failures / 8 pending. CODEX BACKLOG STATUS ==================== All 8 findings from the post-0.9.0 code-quality audit now resolved: #1 DRY inheritance walk → Batch 3 (B3-T2) #2 CostCalculator.send → Batch 2 (B2-T3) #3 with_retry_disabled mutation → Batch 2 (B2-T1) #4 _explicitly_unset shadows → Batch 3 (B3-T3, UNSET sentinel) #5 RakeTask god method → Batch 4 (B4-T2, THIS commit) #6 Runner.new 17 kwargs → Batch 2 (B2-T4, RunnerConfig.build) #7 ObjectSpace dead code → Batch 1 (B1-T1) #8 stub_step dual storage → Batch 2 (B2-T2) Ready for 0.10.0 minor-version release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from a Codex DRY audit of lib/ (104 .rb files, ~7240 LOC). HIGH-impact #1 already had silent drift in production code. #1 [HIGH] Shared stub-helper concern (already drifting!) ======================================================== Before: `stub_step`, `stub_steps`, `stub_all_steps` plus their `build_test_adapter` helper existed in parallel in Minitest helpers (minitest.rb:50-157) and RSpec helpers (rspec/helpers.rb:24-121). The RSpec version had already drifted — it ran every test response through a `normalize_test_response` hook; Minitest did not. If an adopter using Minitest registered custom normalization, their tests would diverge from RSpec users running the same fixtures. After: extracted `RubyLLM::Contract::Concerns::StubHelpers` containing the three public methods and the `build_test_adapter` / `normalize_test_response` private helpers. Both `MinitestHelpers` and `RSpec::Helpers` now include it. Their host files shrink to a few lines documenting the cleanup contract specific to each framework (Minitest's `teardown` vs RSpec's `around(:each)`). LOC: minitest.rb -107 / rspec/helpers.rb -116 / concerns/stub_helpers.rb +95. Net: -128 LOC and a single source of truth for stub semantics. #2 [HIGH] RetryExecutor now uses Concerns::UsageAggregator ========================================================== Before: `RetryExecutor#aggregate_retry_usage` (retry_executor.rb:83-94) duplicated the input/output token sum that `Concerns::UsageAggregator #aggregate_usage` already implemented. Both produced `{ input_tokens:, output_tokens: }`; both extracted usage via `respond_to?(:usage)`; the only difference was the wrapper that unpacked attempts versus raw traces. Drift scenario this prevents: adding a third token field (e.g. `:cache_read_tokens` for prompt-caching pricing) would have required edits in both methods; missing one would silently undercount retry costs. After: `RetryExecutor` includes `Concerns::UsageAggregator` and delegates: `aggregate_usage(all_attempts.map { |a| a[:result].trace })`. The private duplicate is gone (~12 LOC). #3 [MEDIUM] `validate_positive!` helper in Dsl ============================================== Before: four DSL macros (`max_input`, `max_output`, `max_cost`, `attachment_token_estimate`) each opened with the identical guard: unless value.is_a?(Numeric) && value.positive? raise ArgumentError, "X must be positive, got Y" end After: extracted `validate_positive!(name, value)` private method (mirrors the existing `CostCalculator.validate_price!` pattern). Four call sites reduced from three lines to one. NOT FIXED (Codex called these out as deferable) ================================================ #4 [MEDIUM]: `ReportStats` / `AggregatedReport` share a duck-type contract (score, pass_rate, pass_rate_ratio, total_cost, avg_latency_ms, passed?). Codex flagged this as MEDIUM impact but the duck-type is intentionally load-bearing — making it explicit via a shared interface module would be a bigger interface design choice than this DRY pass warrants. Deferred. LOW items (cost-format strings in Minitest vs RSpec messages, inconsistent ArgumentError text in eval_definition) — left as-is per the audit's recommendation. Test suite: 1369 examples / 0 failures / 8 pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Step.recommend("eval", candidates: [...])— runs eval on all candidates and returnsRecommendationwith optimal model, retry chain, rationale, savings vs current config, andto_dslcode output{ model:, reasoning_effort: }hashes, not just model name stringsreasoning_effortin retry policies —escalateaccepts config hashes, each attempt gets its own reasoning_effort forwarded to the providercompare_modelsextended withcandidates:parameter (backward compatible,models:still works)pass_rate_ratio(float 0.0–1.0) on Report/ReportStatsTest plan
models: %w[...]API unchangedretry_policy models: %w[a b]still works identically🤖 Generated with Claude Code