v0.6.0 — Model recommendation with reasoning_effort support by justi · Pull Request #2 · justi/ruby_llm-contract

justi · 2026-04-12T13:42:09Z

Summary

Step.recommend("eval", candidates: [...]) — runs eval on all candidates and returns Recommendation with optimal model, retry chain, rationale, savings vs current config, and to_dsl code output
Candidates as configurations — { model:, reasoning_effort: } hashes, not just model name strings
Per-attempt reasoning_effort in retry policies — escalate accepts config hashes, each attempt gets its own reasoning_effort forwarded to the provider
compare_models extended with candidates: parameter (backward compatible, models: still works)
pass_rate_ratio (float 0.0–1.0) on Report/ReportStats
History entries enriched with model, reasoning_effort, pass_rate_ratio

Test plan

1279 tests pass, 0 failures
RubyCritic score 88.06
Two code review rounds (panel review + deep audit), all findings addressed
Two Copilot review rounds, all 9 comments addressed
Weak test audit: 13 findings fixed — exact value assertions, full ReportStats coverage, normalize_candidate_config spec, no rand() in helpers
Backward compatibility: existing models: %w[...] API unchanged
retry_policy models: %w[a b] still works identically
No mentions of o3 or non-GPT models in examples
README claims verified against code — no unsubstantiated numbers, honest wording

🤖 Generated with Claude Code

Step.recommend("eval", candidates: [...]) returns actionable advice: optimal model, retry chain, savings, and to_dsl code output. Key changes: - Candidates are configurations (model + reasoning_effort), not just names - RetryPolicy supports per-attempt config hashes via escalate() - RetryExecutor forwards per-attempt reasoning_effort to provider - compare_models extended with candidates: parameter - New Recommender algorithm + Recommendation value object - pass_rate_ratio (float) on Report/ReportStats - History entries store model, reasoning_effort, pass_rate_ratio 1246 tests, 0 failures. RubyCritic score: 88.07. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR bumps the gem to v0.6.0 and introduces a model/config recommendation workflow driven by eval results, including support for per-attempt reasoning_effort in retry policies and richer eval/report metadata.

Changes:

Add Step.recommend(eval, candidates: [...]) plus new Eval::Recommender/Eval::Recommendation to return a best candidate, retry chain, rationale, and DSL output.
Extend compare_models to accept candidates: (config hashes) and enrich ModelComparison/history entries with reasoning_effort + pass_rate_ratio.
Upgrade retry policies/execution to use config-based escalation ({ model:, reasoning_effort: }) and forward per-attempt options to providers.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
spec/ruby_llm/contract/step/retry_policy_spec.rb	Adds specs for config-based escalation normalization and attempt selection.
spec/ruby_llm/contract/step/retry_executor_spec.rb	Adds specs ensuring per-attempt `reasoning_effort` forwarding and trace behavior.
spec/ruby_llm/contract/step/base_spec.rb	Adds specs for `.recommend` and determining current model config.
spec/ruby_llm/contract/eval/report_stats_spec.rb	Adds tests for numeric `pass_rate_ratio`.
spec/ruby_llm/contract/eval/recommender_spec.rb	Adds tests for recommendation selection, retry chain, savings, warnings.
spec/ruby_llm/contract/eval/recommendation_spec.rb	Adds tests for immutability and DSL output generation.
spec/ruby_llm/contract/eval/model_comparison_spec.rb	Adds tests for candidate labeling/config mapping and `pass_rate_ratio` in `to_h`.
lib/ruby_llm/contract/version.rb	Bumps gem version to 0.6.0.
lib/ruby_llm/contract/step/retry_policy.rb	Switches from model lists to config lists (hash/string normalization).
lib/ruby_llm/contract/step/retry_executor.rb	Forwards per-attempt config extras (e.g., `reasoning_effort`) and adds config to trace.
lib/ruby_llm/contract/step/base.rb	Adds `.recommend` and a helper to determine current model config.
lib/ruby_llm/contract/eval/report.rb	Delegates `pass_rate_ratio` from stats.
lib/ruby_llm/contract/eval/report_storage.rb	Adds `reasoning_effort` to paths/entries and persists `pass_rate_ratio`.
lib/ruby_llm/contract/eval/report_stats.rb	Implements numeric `pass_rate_ratio`.
lib/ruby_llm/contract/eval/recommender.rb	New recommendation algorithm selecting best/first-try chain and computing savings/warnings.
lib/ruby_llm/contract/eval/recommendation.rb	New Recommendation value object + `to_dsl` output.
lib/ruby_llm/contract/eval/model_comparison.rb	Adds candidate labeling and config mapping support (vs model-only keys).
lib/ruby_llm/contract/eval.rb	Requires new recommender/recommendation files.
lib/ruby_llm/contract/concerns/eval_host.rb	Extends `compare_models` to accept `candidates:` and carry reasoning_effort into eval runs.
lib/ruby_llm/contract.rb	Adds `normalize_candidate_config` helper used across retry + eval flows.
README.md	Documents recommendation API and updates roadmap.
docs/guide/getting_started.md	Updates reasoning_effort docs language.
docs/architecture.md	Adds new recommender/recommendation components.
CHANGELOG.md	Adds v0.6.0 entry describing new features.
Gemfile.lock	Bumps locked gem version to 0.6.0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-12T13:48:09Z

+        when String then { model: entry }
+        when Hash then entry
+        else raise ArgumentError, "Expected String or Hash, got #{entry.class}"


normalize_candidate_config returns the input Hash as-is, which means downstream callers that freeze configs (e.g., RetryPolicy/escalate and Recommendation) will end up freezing user-owned hashes. It also doesn’t validate that a Hash candidate actually contains a usable :model (or handle string keys), which can lead to model=nil being passed into adapters.

Consider returning a normalized copy (e.g., symbolize keys, require a non-empty :model String, and drop nil optional keys like reasoning_effort) so internal code can safely freeze without mutating caller objects and so invalid candidates fail fast with a clear ArgumentError.

Suggested change

when String then { model: entry }

when Hash then entry

else raise ArgumentError, "Expected String or Hash, got #{entry.class}"

when String

model = entry.strip

raise ArgumentError, "Candidate config model must be a non-empty String" if model.empty?

{ model: model }

when Hash

candidate = entry.each_with_object({}) do |(key, value), normalized|

normalized[key.respond_to?(:to_sym) ? key.to_sym : key] = value

end

model = candidate[:model]

unless model.is_a?(String) && !model.strip.empty?

raise ArgumentError, "Candidate config must include a non-empty String :model"

end

normalized = { model: model.strip }

normalized[:reasoning_effort] = candidate[:reasoning_effort] unless candidate[:reasoning_effort].nil?

normalized

else

raise ArgumentError, "Expected String or Hash, got #{entry.class}"

Copilot · 2026-04-12T13:48:09Z

+        def escalate(*config_list)
+          @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+          @max_attempts = @configs.length if @max_attempts < @configs.length


escalate freezes each config hash. If the caller passes a Hash (not a String), this currently freezes the same object they provided (because normalize_candidate_config returns the Hash unchanged). That’s a surprising side effect and can break callers that reuse or mutate their config hashes.

Prefer duplicating/normalizing configs before freezing (e.g., dup + symbolize/validate in the normalizer, or deep-dup-freeze here) so RetryPolicy never mutates caller-owned objects.

Copilot · 2026-04-12T13:48:10Z

+        attr_reader :best, :retry_chain, :score, :cost_per_call,
+                    :rationale, :current_config, :savings, :warnings
+
+        def initialize(best:, retry_chain:, score:, cost_per_call:,
+                       rationale:, current_config:, savings:, warnings:)
+          @best = best&.freeze
+          @retry_chain = retry_chain.freeze
+          @score = score
+          @cost_per_call = cost_per_call
+          @rationale = rationale.freeze
+          @current_config = current_config&.freeze
+          @savings = savings.freeze
+          @warnings = warnings.freeze


Recommendation is intended to be an immutable value object (it freezes itself), but retry_chain, savings, and warnings are only shallow-frozen; the nested Hashes/Arrays remain mutable, so callers can still mutate internal state through rec.retry_chain.first[:model] = ..., etc.

Consider using the existing Concerns::DeepFreeze helper (deep-dup-freeze) for these attributes so the object is deeply immutable without freezing caller-owned objects.

Suggested change

attr_reader :best, :retry_chain, :score, :cost_per_call,

:rationale, :current_config, :savings, :warnings

def initialize(best:, retry_chain:, score:, cost_per_call:,

rationale:, current_config:, savings:, warnings:)

@best = best&.freeze

@retry_chain = retry_chain.freeze

@score = score

@cost_per_call = cost_per_call

@rationale = rationale.freeze

@current_config = current_config&.freeze

@savings = savings.freeze

@warnings = warnings.freeze

include Concerns::DeepFreeze

attr_reader :best, :retry_chain, :score, :cost_per_call,

:rationale, :current_config, :savings, :warnings

def initialize(best:, retry_chain:, score:, cost_per_call:,

rationale:, current_config:, savings:, warnings:)

@best = best&.freeze

@retry_chain = deep_dup_freeze(retry_chain)

@score = score

@cost_per_call = cost_per_call

@rationale = rationale.freeze

@current_config = current_config&.freeze

@savings = deep_dup_freeze(savings)

@warnings = deep_dup_freeze(warnings)

Copilot · 2026-04-12T13:48:10Z

+        def build_rationale(scored, best)
+          sorted = scored.sort_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
+          sorted.map { |s| rationale_line(s, best) }
+        end
+
+        def rationale_line(candidate, best)
+          header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"
+          notes = rationale_notes(candidate, best)


build_rationale sorts candidates purely by cost_per_call, but candidates with unknown pricing are represented with cost_per_call = 0.0, so they will sort to the top and appear “cheapest” in the rationale despite being excluded from selection and flagged as unknown pricing.

Consider sorting unknown-pricing candidates last (e.g., sort key [cost_known? ? cost_per_call : Float::INFINITY, latency, label]) and/or formatting unknown costs as “n/a” to avoid misleading rationale ordering/output.

Copilot · 2026-04-12T13:48:10Z

          warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
               "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."


The warning message still says “no models”, but the API is now config-based (@configs, config_list). Updating the wording (and the suggested DSL snippet) to reflect configs would make the warning clearer for users reading logs.

Suggested change

warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \

"This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."

warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \

"This means no actual retry will happen. Add `attempts 2` or " \

"`escalate [{ model: \"model1\" }, { model: \"model2\" }]`."

… fix - normalize_candidate_config validates :model, symbolizes keys, returns a new hash (never freezes caller's object) - Recommendation uses deep_dup_freeze for nested immutability - Rationale sorts unknown-pricing candidates last - Warning message updated for config-based API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each section leads with the outcome and the money, not the API. Move minor features to docs. Install near the top. Shorter code blocks. Before/after framing on every killer feature. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-12T13:56:32Z

-            model = policy.model_for_attempt(attempt_index, default_model)
+            config = policy.config_for_attempt(attempt_index, default_config)
+            model = config[:model]
+            attempt_extra = extra_options.merge(config.except(:model))


Hash#except is not part of Ruby core (Ruby >=3.2 has Hash#slice but not except), so config.except(:model) will raise NoMethodError in production. Replace this with a core-Ruby approach (e.g., config.reject { |k,_| k == :model } or dup+delete) before merging into extra_options.

Suggested change

attempt_extra = extra_options.merge(config.except(:model))

attempt_extra = extra_options.merge(config.reject { |k, _| k == :model })

Copilot · 2026-04-12T13:56:33Z

+              config: config,
+              score: report.score,
+              cost_per_call: cost_per_call,
+              latency: report.avg_latency_ms || 0,


Treating missing latency as 0 (report.avg_latency_ms || 0) will bias tiebreaking toward candidates with unknown latency (they’ll look fastest). Consider using a sentinel like Float::INFINITY (or keeping nil and sorting with nil last) so “unknown” doesn’t win latency comparisons.

Suggested change

latency: report.avg_latency_ms || 0,

latency: report.avg_latency_ms || Float::INFINITY,

Copilot · 2026-04-12T13:56:33Z

+          header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"
+          notes = rationale_notes(candidate, best)
+          notes.any? ? "#{header} — #{notes.join(", ")}" : header
+        end
+


When pricing is unknown, cost_per_call becomes 0.0, so rationale lines currently render as at $0.0000/call even though the same candidate is flagged as “unknown pricing”. Consider formatting cost as n/a (or similar) when cost_known? is false to avoid misleading output.

Suggested change

header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"

notes = rationale_notes(candidate, best)

notes.any? ? "#{header} — #{notes.join(", ")}" : header

end

header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{format_cost_per_call(candidate)}/call"

notes = rationale_notes(candidate, best)

notes.any? ? "#{header} — #{notes.join(", ")}" : header

end

def format_cost_per_call(candidate)

return "n/a" unless cost_known?(candidate)

"$#{format("%.4f", candidate[:cost_per_call])}"

end

Copilot · 2026-04-12T13:56:33Z

+          @rationale = rationale.freeze
+          @current_config = deep_dup_freeze(current_config)
+          @savings = deep_dup_freeze(savings)
+          @warnings = warnings.freeze


Recommendation includes Concerns::DeepFreeze, but @rationale = rationale.freeze and @warnings = warnings.freeze are only shallow freezes and keep the original object references. To ensure the object is truly immutable (and consistent with other DeepFreeze users), consider using deep_dup_freeze for rationale and warnings as well.

Suggested change

@rationale = rationale.freeze

@current_config = deep_dup_freeze(current_config)

@savings = deep_dup_freeze(savings)

@warnings = warnings.freeze

@rationale = deep_dup_freeze(rationale)

@current_config = deep_dup_freeze(current_config)

@savings = deep_dup_freeze(savings)

@warnings = deep_dup_freeze(warnings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…number Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…uality link - Tagline: "missing link between cost and quality" instead of generic pitch - validate row: clarify it needs retry_policy for auto-retry - Remove specific dollar amounts that depend on prompt length/pricing - "first model that passed" not "cheapest" (depends on user ordering) - savings: "calculated automatically" not hardcoded $17 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Visual: BEFORE (pick one model, hope) → CONFIGURE (4 lines) → GEM HANDLES (escalation flow with boxes) → YOU GET (outcomes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…, correct trace output - output_schema: "send to provider + validate client-side" not "force" - trace[:cost] returns float, not string with $ - Remove hardcoded 90/9/1 percentages from diagram - Remove unsubstantiated savings example from recommend section - "most requests stay on nano" instead of "90%" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Config box: "YOU DEFINE A CONTRACT" Flow boxes: "contract" instead of "validate" — schema + business rules together are the contract that gates every LLM response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diagram says "contract" — now all prose sections use the same word: "contract failed/passed" instead of "validation_failed", "fails the contract" instead of "fails validation". Consistent mental model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…g, deep freeze - Unknown latency → Float::INFINITY (not 0, which biased tiebreaking) - Unknown pricing → "unknown pricing" in rationale (not $0.0000) - Recommendation: deep_dup_freeze on rationale and warnings - Hash#except is fine — Ruby >= 3.0, gem requires >= 3.2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…deep freeze - Tiebreaker: equal cost → lower latency wins (was only testing lexicographic) - Unknown latency does not win tiebreak (Float::INFINITY, not 0) - Rationale shows "unknown pricing" not "$0.0000" for zero-cost candidates - Deep immutability: nested hashes in retry_chain/savings raise FrozenError Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- ReportStats: add tests for score, total_cost, avg_latency_ms, passed?, pass_rate (was only testing pass_rate_ratio — 1 of 11 methods) - normalize_candidate_config: new spec covering string keys, empty input, missing model, whitespace, no-caller-mutation - Recommender savings: assert exact per_call and monthly_at values - ModelComparison to_h: verify full hash, not just one key - Recommendation to_dsl: assert reasoning_effort: "high" with quotes - Step.recommend: verify score, rationale count, unknown pricing warning - Remove rand() from all test helpers — use deterministic names Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 25 out of 26 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ffort override Bug Copilot caught on the second review pass: class MyStep < Step::Base thinking effort: :low, budget: 2048 end MyStep.run(input, context: { reasoning_effort: :high }) Previously runtime_settings only injected `extra[:thinking]` when context did NOT carry `:reasoning_effort`. With a per-call override present, `:thinking` was dropped entirely — budget silently lost, even though the adapter's `resolve_thinking_config` is capable of merging effort over `thinking[:effort]` while preserving budget. Fix: always pass class-level `:thinking` to extra_options when set. Seed `:reasoning_effort` from `thinking[:effort]` only when context did not already supply one. The adapter then resolves the precedence correctly: per-call effort wins, class-level budget survives. Two specs added in `runtime_settings integration` group: - "context :reasoning_effort wins for effort but :thinking still travels" - "context :reasoning_effort preserves class-level :budget through runtime_settings" Suite: 1311 examples, 0 failures, 8 pending. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Batch 2 (partial) of the post-0.9.0 quality refactor. B2-T1 — `with_retry_disabled` removed (Codex finding #3) ======================================================== The old implementation in `eval/retry_optimizer.rb` mutated the step class's singleton `retry_policy` method around `compare_models`, then restored it in an `ensure`. This was a concurrency hazard: two concurrent `optimize_retry_policy` calls on the same step would race, and the `ensure` does not protect against the race. Replaced with `context.merge(retry_policy_override: nil)` — the step already honours this key in `Step::Base#runtime_settings` via `context.key?(:retry_policy_override)`. Same effect, no mutation, no race. Characterization tests written FIRST (4 tests pinning original `with_retry_disabled` behaviour: nil-inside, restore-after, restore- on-raise, no-retry-policy-guard). After refactor, replaced with 3 tests pinning the new contract (retry_policy_override propagation, class-level retry_policy untouched, obsolete private method gone). B2-T3 — CostCalculator.find_model exposed (finding #2) ====================================================== `step/base.rb` was calling `CostCalculator.send(:find_model)` and `CostCalculator.send(:compute_cost)` to bypass `private_class_method`. Visibility bypass via `send` is invisible to grep-for-callers and fragile under refactor. Changes: - Removed `find_model` from `private_class_method` list in `cost_calculator.rb` with a comment explaining the public contract. - Replaced `CostCalculator.send(:find_model, name)` with the direct call. - Dropped `Base.estimated_cost_for` helper entirely; routed through the existing public `CostCalculator.calculate(model_name:, usage:)` which already does the work. Removes the second `send` call. 5 characterization tests added in `spec/ruby_llm/contract/cost_calculator_spec.rb` pinning `find_model` (nil-on-unknown, nil-on-nil, custom registry, custom overrides built-in, calculate-uses-find-model). Suite: 1344 examples / 0 failures (+8 net new vs Batch 1 baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the post-0.9.0 quality refactor pull (Batch 1 + Batch 2 partial from TODO.md). Patch-version bump because every change is internal — no DSL macro added/removed, no Step.run signature change, no adapter contract change. Adopters bumping to `~> 0.9.0` automatically pick up the safer code path on next install. Bundled fixes (Codex findings #2, #3, #7, #8): - Drop singleton-method mutation in with_retry_disabled (concurrency hazard) — replaced with retry_policy_override: nil via context. - Expose CostCalculator.find_model publicly; remove two CostCalculator.send workarounds; route estimate_cost through the existing public CostCalculator.calculate. - Unify stub_step on a single thread-local storage path; remove duplicate allow/receive branch. - Drop dead ObjectSpace.each_object fallback (unreachable on Ruby >= 3.2; gemspec requires 3.2.0+ where Class#subclasses ships). Test discipline: every refactor wrote characterization tests BEFORE touching production code, then replaced with new contract tests after. Suite: 1346 examples / 0 failures (+12 net new vs 0.9.0 baseline). Remaining refactor backlog (Codex findings #1, #4, #5, #6) documented in TODO.md and deferred to 0.10.0+ — each requires more invasive surgery on shared surfaces (DSL inheritance macros, Runner config, RakeTask SuiteGate) and benefits from dedicated focus rather than tail-end inclusion in a patch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes Codex finding #5 (RakeTask#define_task god method, 52 LOC, 6+ responsibilities). Last remaining finding from the post-0.9.0 quality audit — TODO.md now marked COMPLETE across all 8 findings. EXTRACTED ========= New `lib/ruby_llm/contract/rake_task/suite_gate.rb` — testable in isolation, returns a `Verdict` Data struct: Verdict.new(passed:, abort_reason:, passed_reports:, suite_cost:) The gate enforces a strict ordering preserved from the prior god method: 1. Cost gate runs FIRST (if maximum_cost set and exceeded → abort before any score check; passed_reports is empty). 2. Score gate runs per-report (uses minimum_score threshold; falls back to report.passed? when no threshold set). 3. Regression gate runs per-report (when fail_on_regression true, prints REGRESSIONS DETECTED to stdout and marks the report failed). 4. Overall passed = ALL reports passed AND cost gate not tripped. REFACTORED ========== `RakeTask#define_task` reduced from ~52 LOC to ~25 LOC. Extracted `collect_host_reports(context)` helper handles eval running, empty case, per-host printing. The gate logic, regression detection, cost comparison, and per-report scoring all delegated to `SuiteGate`. TEST DISCIPLINE =============== 5 characterization tests written FIRST in `spec/ruby_llm/contract/rake_task_gate_spec.rb` pinning all four dimensions: - gate passes → "All evals passed" output, no abort - gate fails on score → SystemExit with "Eval suite FAILED" - cost gate priority → cost message even when score would pass (registered a custom pricey model to force non-zero cost) - baselines NOT saved on score failure - baselines DO save when all reports pass All 5 passed BEFORE refactor (current behaviour) AND after (new behaviour). Existing rake_task tests in f3_track_history_spec.rb and v02_findings_spec.rb continue to pass unchanged. Suite: 1369 examples / 0 failures / 8 pending. CODEX BACKLOG STATUS ==================== All 8 findings from the post-0.9.0 code-quality audit now resolved: #1 DRY inheritance walk → Batch 3 (B3-T2) #2 CostCalculator.send → Batch 2 (B2-T3) #3 with_retry_disabled mutation → Batch 2 (B2-T1) #4 _explicitly_unset shadows → Batch 3 (B3-T3, UNSET sentinel) #5 RakeTask god method → Batch 4 (B4-T2, THIS commit) #6 Runner.new 17 kwargs → Batch 2 (B2-T4, RunnerConfig.build) #7 ObjectSpace dead code → Batch 1 (B1-T1) #8 stub_step dual storage → Batch 2 (B2-T2) Ready for 0.10.0 minor-version release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings from a Codex DRY audit of lib/ (104 .rb files, ~7240 LOC). HIGH-impact #1 already had silent drift in production code. #1 [HIGH] Shared stub-helper concern (already drifting!) ======================================================== Before: `stub_step`, `stub_steps`, `stub_all_steps` plus their `build_test_adapter` helper existed in parallel in Minitest helpers (minitest.rb:50-157) and RSpec helpers (rspec/helpers.rb:24-121). The RSpec version had already drifted — it ran every test response through a `normalize_test_response` hook; Minitest did not. If an adopter using Minitest registered custom normalization, their tests would diverge from RSpec users running the same fixtures. After: extracted `RubyLLM::Contract::Concerns::StubHelpers` containing the three public methods and the `build_test_adapter` / `normalize_test_response` private helpers. Both `MinitestHelpers` and `RSpec::Helpers` now include it. Their host files shrink to a few lines documenting the cleanup contract specific to each framework (Minitest's `teardown` vs RSpec's `around(:each)`). LOC: minitest.rb -107 / rspec/helpers.rb -116 / concerns/stub_helpers.rb +95. Net: -128 LOC and a single source of truth for stub semantics. #2 [HIGH] RetryExecutor now uses Concerns::UsageAggregator ========================================================== Before: `RetryExecutor#aggregate_retry_usage` (retry_executor.rb:83-94) duplicated the input/output token sum that `Concerns::UsageAggregator #aggregate_usage` already implemented. Both produced `{ input_tokens:, output_tokens: }`; both extracted usage via `respond_to?(:usage)`; the only difference was the wrapper that unpacked attempts versus raw traces. Drift scenario this prevents: adding a third token field (e.g. `:cache_read_tokens` for prompt-caching pricing) would have required edits in both methods; missing one would silently undercount retry costs. After: `RetryExecutor` includes `Concerns::UsageAggregator` and delegates: `aggregate_usage(all_attempts.map { |a| a[:result].trace })`. The private duplicate is gone (~12 LOC). #3 [MEDIUM] `validate_positive!` helper in Dsl ============================================== Before: four DSL macros (`max_input`, `max_output`, `max_cost`, `attachment_token_estimate`) each opened with the identical guard: unless value.is_a?(Numeric) && value.positive? raise ArgumentError, "X must be positive, got Y" end After: extracted `validate_positive!(name, value)` private method (mirrors the existing `CostCalculator.validate_price!` pattern). Four call sites reduced from three lines to one. NOT FIXED (Codex called these out as deferable) ================================================ #4 [MEDIUM]: `ReportStats` / `AggregatedReport` share a duck-type contract (score, pass_rate, pass_rate_ratio, total_cost, avg_latency_ms, passed?). Codex flagged this as MEDIUM impact but the duck-type is intentionally load-bearing — making it explicit via a shared interface module would be a bigger interface design choice than this DRY pass warrants. Deferred. LOW items (cost-format strings in Minitest vs RSpec messages, inconsistent ArgumentError text in eval_definition) — left as-is per the audit's recommendation. Test suite: 1369 examples / 0 failures / 8 pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 12, 2026 13:42

Copilot started reviewing on behalf of justi April 12, 2026 13:42 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

justi requested a review from Copilot April 12, 2026 13:52

Copilot started reviewing on behalf of justi April 12, 2026 13:52 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

justi and others added 12 commits April 12, 2026 16:03

Add overview table to README: config → what it does → what you get

80f119f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix recommend row in overview table — remove unsubstantiated savings …

81edef3

…number Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix eval row — clarify online (LLM) vs offline (sample_response) modes

df63112

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Simplify eval row — be honest that regression detection needs real LLM

5d1847c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add before/after flow diagram to README

ff46d58

Visual: BEFORE (pick one model, hope) → CONFIGURE (4 lines) → GEM HANDLES (escalation flow with boxes) → YOU GET (outcomes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Put 'contract' front and center in flow diagram

44021eb

Config box: "YOU DEFINE A CONTRACT" Flow boxes: "contract" instead of "validate" — schema + business rules together are the contract that gates every LLM response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

justi requested a review from Copilot April 12, 2026 14:46

Copilot started reviewing on behalf of justi April 12, 2026 14:47 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

justi merged commit bf9cda9 into main Apr 12, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.6.0 — Model recommendation with reasoning_effort support#2

v0.6.0 — Model recommendation with reasoning_effort support#2
justi merged 15 commits into
mainfrom
v0.6.0-model-recommendation

justi commented Apr 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI Apr 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-        when String then { model: entry }
-        when Hash then entry
-        else raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+        when String
+          model = entry.strip
+          raise ArgumentError, "Candidate config model must be a non-empty String" if model.empty?
+          { model: model }
+        when Hash
+          candidate = entry.each_with_object({}) do |(key, value), normalized|
+            normalized[key.respond_to?(:to_sym) ? key.to_sym : key] = value
+          end
+          model = candidate[:model]
+          unless model.is_a?(String) && !model.strip.empty?
+            raise ArgumentError, "Candidate config must include a non-empty String :model"
+          end
+          normalized = { model: model.strip }
+          normalized[:reasoning_effort] = candidate[:reasoning_effort] unless candidate[:reasoning_effort].nil?
+          normalized
+        else
+          raise ArgumentError, "Expected String or Hash, got #{entry.class}"

		warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
		"This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."

-          warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
-               "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."
+          warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+               "This means no actual retry will happen. Add `attempts 2` or " \
+               "`escalate [{ model: \"model1\" }, { model: \"model2\" }]`."

	attempt_extra = extra_options.merge(config.except(:model))
	attempt_extra = extra_options.merge(config.reject { \|k, _\| k == :model })

	latency: report.avg_latency_ms \|\| 0,
	latency: report.avg_latency_ms \|\| Float::INFINITY,

Conversation

justi commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justi commented Apr 12, 2026 •

edited

Loading