Skip to content

v0.6.0 — Model recommendation with reasoning_effort support#2

Merged
justi merged 15 commits into
mainfrom
v0.6.0-model-recommendation
Apr 12, 2026
Merged

v0.6.0 — Model recommendation with reasoning_effort support#2
justi merged 15 commits into
mainfrom
v0.6.0-model-recommendation

Conversation

@justi
Copy link
Copy Markdown
Owner

@justi justi commented Apr 12, 2026

Summary

  • Step.recommend("eval", candidates: [...]) — runs eval on all candidates and returns Recommendation with optimal model, retry chain, rationale, savings vs current config, and to_dsl code output
  • Candidates as configurations{ model:, reasoning_effort: } hashes, not just model name strings
  • Per-attempt reasoning_effort in retry policiesescalate accepts config hashes, each attempt gets its own reasoning_effort forwarded to the provider
  • compare_models extended with candidates: parameter (backward compatible, models: still works)
  • pass_rate_ratio (float 0.0–1.0) on Report/ReportStats
  • History entries enriched with model, reasoning_effort, pass_rate_ratio

Test plan

  • 1279 tests pass, 0 failures
  • RubyCritic score 88.06
  • Two code review rounds (panel review + deep audit), all findings addressed
  • Two Copilot review rounds, all 9 comments addressed
  • Weak test audit: 13 findings fixed — exact value assertions, full ReportStats coverage, normalize_candidate_config spec, no rand() in helpers
  • Backward compatibility: existing models: %w[...] API unchanged
  • retry_policy models: %w[a b] still works identically
  • No mentions of o3 or non-GPT models in examples
  • README claims verified against code — no unsubstantiated numbers, honest wording

🤖 Generated with Claude Code

Step.recommend("eval", candidates: [...]) returns actionable advice:
optimal model, retry chain, savings, and to_dsl code output.

Key changes:
- Candidates are configurations (model + reasoning_effort), not just names
- RetryPolicy supports per-attempt config hashes via escalate()
- RetryExecutor forwards per-attempt reasoning_effort to provider
- compare_models extended with candidates: parameter
- New Recommender algorithm + Recommendation value object
- pass_rate_ratio (float) on Report/ReportStats
- History entries store model, reasoning_effort, pass_rate_ratio

1246 tests, 0 failures. RubyCritic score: 88.07.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 12, 2026 13:42
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR bumps the gem to v0.6.0 and introduces a model/config recommendation workflow driven by eval results, including support for per-attempt reasoning_effort in retry policies and richer eval/report metadata.

Changes:

  • Add Step.recommend(eval, candidates: [...]) plus new Eval::Recommender/Eval::Recommendation to return a best candidate, retry chain, rationale, and DSL output.
  • Extend compare_models to accept candidates: (config hashes) and enrich ModelComparison/history entries with reasoning_effort + pass_rate_ratio.
  • Upgrade retry policies/execution to use config-based escalation ({ model:, reasoning_effort: }) and forward per-attempt options to providers.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
spec/ruby_llm/contract/step/retry_policy_spec.rb Adds specs for config-based escalation normalization and attempt selection.
spec/ruby_llm/contract/step/retry_executor_spec.rb Adds specs ensuring per-attempt reasoning_effort forwarding and trace behavior.
spec/ruby_llm/contract/step/base_spec.rb Adds specs for .recommend and determining current model config.
spec/ruby_llm/contract/eval/report_stats_spec.rb Adds tests for numeric pass_rate_ratio.
spec/ruby_llm/contract/eval/recommender_spec.rb Adds tests for recommendation selection, retry chain, savings, warnings.
spec/ruby_llm/contract/eval/recommendation_spec.rb Adds tests for immutability and DSL output generation.
spec/ruby_llm/contract/eval/model_comparison_spec.rb Adds tests for candidate labeling/config mapping and pass_rate_ratio in to_h.
lib/ruby_llm/contract/version.rb Bumps gem version to 0.6.0.
lib/ruby_llm/contract/step/retry_policy.rb Switches from model lists to config lists (hash/string normalization).
lib/ruby_llm/contract/step/retry_executor.rb Forwards per-attempt config extras (e.g., reasoning_effort) and adds config to trace.
lib/ruby_llm/contract/step/base.rb Adds .recommend and a helper to determine current model config.
lib/ruby_llm/contract/eval/report.rb Delegates pass_rate_ratio from stats.
lib/ruby_llm/contract/eval/report_storage.rb Adds reasoning_effort to paths/entries and persists pass_rate_ratio.
lib/ruby_llm/contract/eval/report_stats.rb Implements numeric pass_rate_ratio.
lib/ruby_llm/contract/eval/recommender.rb New recommendation algorithm selecting best/first-try chain and computing savings/warnings.
lib/ruby_llm/contract/eval/recommendation.rb New Recommendation value object + to_dsl output.
lib/ruby_llm/contract/eval/model_comparison.rb Adds candidate labeling and config mapping support (vs model-only keys).
lib/ruby_llm/contract/eval.rb Requires new recommender/recommendation files.
lib/ruby_llm/contract/concerns/eval_host.rb Extends compare_models to accept candidates: and carry reasoning_effort into eval runs.
lib/ruby_llm/contract.rb Adds normalize_candidate_config helper used across retry + eval flows.
README.md Documents recommendation API and updates roadmap.
docs/guide/getting_started.md Updates reasoning_effort docs language.
docs/architecture.md Adds new recommender/recommendation components.
CHANGELOG.md Adds v0.6.0 entry describing new features.
Gemfile.lock Bumps locked gem version to 0.6.0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/ruby_llm/contract.rb Outdated
Comment on lines +83 to +85
when String then { model: entry }
when Hash then entry
else raise ArgumentError, "Expected String or Hash, got #{entry.class}"
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize_candidate_config returns the input Hash as-is, which means downstream callers that freeze configs (e.g., RetryPolicy/escalate and Recommendation) will end up freezing user-owned hashes. It also doesn’t validate that a Hash candidate actually contains a usable :model (or handle string keys), which can lead to model=nil being passed into adapters.

Consider returning a normalized copy (e.g., symbolize keys, require a non-empty :model String, and drop nil optional keys like reasoning_effort) so internal code can safely freeze without mutating caller objects and so invalid candidates fail fast with a clear ArgumentError.

Suggested change
when String then { model: entry }
when Hash then entry
else raise ArgumentError, "Expected String or Hash, got #{entry.class}"
when String
model = entry.strip
raise ArgumentError, "Candidate config model must be a non-empty String" if model.empty?
{ model: model }
when Hash
candidate = entry.each_with_object({}) do |(key, value), normalized|
normalized[key.respond_to?(:to_sym) ? key.to_sym : key] = value
end
model = candidate[:model]
unless model.is_a?(String) && !model.strip.empty?
raise ArgumentError, "Candidate config must include a non-empty String :model"
end
normalized = { model: model.strip }
normalized[:reasoning_effort] = candidate[:reasoning_effort] unless candidate[:reasoning_effort].nil?
normalized
else
raise ArgumentError, "Expected String or Hash, got #{entry.class}"

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +33
def escalate(*config_list)
@configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
@max_attempts = @configs.length if @max_attempts < @configs.length
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escalate freezes each config hash. If the caller passes a Hash (not a String), this currently freezes the same object they provided (because normalize_candidate_config returns the Hash unchanged). That’s a surprising side effect and can break callers that reuse or mutate their config hashes.

Prefer duplicating/normalizing configs before freezing (e.g., dup + symbolize/validate in the normalizer, or deep-dup-freeze here) so RetryPolicy never mutates caller-owned objects.

Copilot uses AI. Check for mistakes.
Comment on lines +7 to +19
attr_reader :best, :retry_chain, :score, :cost_per_call,
:rationale, :current_config, :savings, :warnings

def initialize(best:, retry_chain:, score:, cost_per_call:,
rationale:, current_config:, savings:, warnings:)
@best = best&.freeze
@retry_chain = retry_chain.freeze
@score = score
@cost_per_call = cost_per_call
@rationale = rationale.freeze
@current_config = current_config&.freeze
@savings = savings.freeze
@warnings = warnings.freeze
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendation is intended to be an immutable value object (it freezes itself), but retry_chain, savings, and warnings are only shallow-frozen; the nested Hashes/Arrays remain mutable, so callers can still mutate internal state through rec.retry_chain.first[:model] = ..., etc.

Consider using the existing Concerns::DeepFreeze helper (deep-dup-freeze) for these attributes so the object is deeply immutable without freezing caller-owned objects.

Suggested change
attr_reader :best, :retry_chain, :score, :cost_per_call,
:rationale, :current_config, :savings, :warnings
def initialize(best:, retry_chain:, score:, cost_per_call:,
rationale:, current_config:, savings:, warnings:)
@best = best&.freeze
@retry_chain = retry_chain.freeze
@score = score
@cost_per_call = cost_per_call
@rationale = rationale.freeze
@current_config = current_config&.freeze
@savings = savings.freeze
@warnings = warnings.freeze
include Concerns::DeepFreeze
attr_reader :best, :retry_chain, :score, :cost_per_call,
:rationale, :current_config, :savings, :warnings
def initialize(best:, retry_chain:, score:, cost_per_call:,
rationale:, current_config:, savings:, warnings:)
@best = best&.freeze
@retry_chain = deep_dup_freeze(retry_chain)
@score = score
@cost_per_call = cost_per_call
@rationale = rationale.freeze
@current_config = current_config&.freeze
@savings = deep_dup_freeze(savings)
@warnings = deep_dup_freeze(warnings)

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +83
def build_rationale(scored, best)
sorted = scored.sort_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
sorted.map { |s| rationale_line(s, best) }
end

def rationale_line(candidate, best)
header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"
notes = rationale_notes(candidate, best)
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_rationale sorts candidates purely by cost_per_call, but candidates with unknown pricing are represented with cost_per_call = 0.0, so they will sort to the top and appear “cheapest” in the rationale despite being excluded from selection and flagged as unknown pricing.

Consider sorting unknown-pricing candidates last (e.g., sort key [cost_known? ? cost_per_call : Float::INFINITY, latency, label]) and/or formatting unknown costs as “n/a” to avoid misleading rationale ordering/output.

Copilot uses AI. Check for mistakes.
Comment on lines 82 to 83
warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
"This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning message still says “no models”, but the API is now config-based (@configs, config_list). Updating the wording (and the suggested DSL snippet) to reflect configs would make the warning clearer for users reading logs.

Suggested change
warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
"This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."
warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
"This means no actual retry will happen. Add `attempts 2` or " \
"`escalate [{ model: \"model1\" }, { model: \"model2\" }]`."

Copilot uses AI. Check for mistakes.
… fix

- normalize_candidate_config validates :model, symbolizes keys, returns
  a new hash (never freezes caller's object)
- Recommendation uses deep_dup_freeze for nested immutability
- Rationale sorts unknown-pricing candidates last
- Warning message updated for config-based API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each section leads with the outcome and the money, not the API.
Move minor features to docs. Install near the top. Shorter code
blocks. Before/after framing on every killer feature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

model = policy.model_for_attempt(attempt_index, default_model)
config = policy.config_for_attempt(attempt_index, default_config)
model = config[:model]
attempt_extra = extra_options.merge(config.except(:model))
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash#except is not part of Ruby core (Ruby >=3.2 has Hash#slice but not except), so config.except(:model) will raise NoMethodError in production. Replace this with a core-Ruby approach (e.g., config.reject { |k,_| k == :model } or dup+delete) before merging into extra_options.

Suggested change
attempt_extra = extra_options.merge(config.except(:model))
attempt_extra = extra_options.merge(config.reject { |k, _| k == :model })

Copilot uses AI. Check for mistakes.
config: config,
score: report.score,
cost_per_call: cost_per_call,
latency: report.avg_latency_ms || 0,
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treating missing latency as 0 (report.avg_latency_ms || 0) will bias tiebreaking toward candidates with unknown latency (they’ll look fastest). Consider using a sentinel like Float::INFINITY (or keeping nil and sorting with nil last) so “unknown” doesn’t win latency comparisons.

Suggested change
latency: report.avg_latency_ms || 0,
latency: report.avg_latency_ms || Float::INFINITY,

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +86
header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"
notes = rationale_notes(candidate, best)
notes.any? ? "#{header} — #{notes.join(", ")}" : header
end

Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When pricing is unknown, cost_per_call becomes 0.0, so rationale lines currently render as at $0.0000/call even though the same candidate is flagged as “unknown pricing”. Consider formatting cost as n/a (or similar) when cost_known? is false to avoid misleading output.

Suggested change
header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at $#{format("%.4f", candidate[:cost_per_call])}/call"
notes = rationale_notes(candidate, best)
notes.any? ? "#{header}#{notes.join(", ")}" : header
end
header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{format_cost_per_call(candidate)}/call"
notes = rationale_notes(candidate, best)
notes.any? ? "#{header}#{notes.join(", ")}" : header
end
def format_cost_per_call(candidate)
return "n/a" unless cost_known?(candidate)
"$#{format("%.4f", candidate[:cost_per_call])}"
end

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +21
@rationale = rationale.freeze
@current_config = deep_dup_freeze(current_config)
@savings = deep_dup_freeze(savings)
@warnings = warnings.freeze
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendation includes Concerns::DeepFreeze, but @rationale = rationale.freeze and @warnings = warnings.freeze are only shallow freezes and keep the original object references. To ensure the object is truly immutable (and consistent with other DeepFreeze users), consider using deep_dup_freeze for rationale and warnings as well.

Suggested change
@rationale = rationale.freeze
@current_config = deep_dup_freeze(current_config)
@savings = deep_dup_freeze(savings)
@warnings = warnings.freeze
@rationale = deep_dup_freeze(rationale)
@current_config = deep_dup_freeze(current_config)
@savings = deep_dup_freeze(savings)
@warnings = deep_dup_freeze(warnings)

Copilot uses AI. Check for mistakes.
justi and others added 12 commits April 12, 2026 16:03
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…number

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uality link

- Tagline: "missing link between cost and quality" instead of generic pitch
- validate row: clarify it needs retry_policy for auto-retry
- Remove specific dollar amounts that depend on prompt length/pricing
- "first model that passed" not "cheapest" (depends on user ordering)
- savings: "calculated automatically" not hardcoded $17

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Visual: BEFORE (pick one model, hope) → CONFIGURE (4 lines) →
GEM HANDLES (escalation flow with boxes) → YOU GET (outcomes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, correct trace output

- output_schema: "send to provider + validate client-side" not "force"
- trace[:cost] returns float, not string with $
- Remove hardcoded 90/9/1 percentages from diagram
- Remove unsubstantiated savings example from recommend section
- "most requests stay on nano" instead of "90%"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Config box: "YOU DEFINE A CONTRACT"
Flow boxes: "contract" instead of "validate" — schema + business
rules together are the contract that gates every LLM response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diagram says "contract" — now all prose sections use the same word:
"contract failed/passed" instead of "validation_failed", "fails the
contract" instead of "fails validation". Consistent mental model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g, deep freeze

- Unknown latency → Float::INFINITY (not 0, which biased tiebreaking)
- Unknown pricing → "unknown pricing" in rationale (not $0.0000)
- Recommendation: deep_dup_freeze on rationale and warnings
- Hash#except is fine — Ruby >= 3.0, gem requires >= 3.2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…deep freeze

- Tiebreaker: equal cost → lower latency wins (was only testing lexicographic)
- Unknown latency does not win tiebreak (Float::INFINITY, not 0)
- Rationale shows "unknown pricing" not "$0.0000" for zero-cost candidates
- Deep immutability: nested hashes in retry_chain/savings raise FrozenError

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ReportStats: add tests for score, total_cost, avg_latency_ms, passed?,
  pass_rate (was only testing pass_rate_ratio — 1 of 11 methods)
- normalize_candidate_config: new spec covering string keys, empty input,
  missing model, whitespace, no-caller-mutation
- Recommender savings: assert exact per_call and monthly_at values
- ModelComparison to_h: verify full hash, not just one key
- Recommendation to_dsl: assert reasoning_effort: "high" with quotes
- Step.recommend: verify score, rationale count, unknown pricing warning
- Remove rand() from all test helpers — use deterministic names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 26 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@justi justi merged commit bf9cda9 into main Apr 12, 2026
5 checks passed
justi added a commit that referenced this pull request Apr 26, 2026
…ffort override

Bug Copilot caught on the second review pass:

  class MyStep < Step::Base
    thinking effort: :low, budget: 2048
  end

  MyStep.run(input, context: { reasoning_effort: :high })

Previously runtime_settings only injected `extra[:thinking]` when
context did NOT carry `:reasoning_effort`. With a per-call override
present, `:thinking` was dropped entirely — budget silently lost,
even though the adapter's `resolve_thinking_config` is capable of
merging effort over `thinking[:effort]` while preserving budget.

Fix: always pass class-level `:thinking` to extra_options when set.
Seed `:reasoning_effort` from `thinking[:effort]` only when context
did not already supply one. The adapter then resolves the precedence
correctly: per-call effort wins, class-level budget survives.

Two specs added in `runtime_settings integration` group:
- "context :reasoning_effort wins for effort but :thinking still travels"
- "context :reasoning_effort preserves class-level :budget through runtime_settings"

Suite: 1311 examples, 0 failures, 8 pending.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Batch 2 (partial) of the post-0.9.0 quality refactor.

B2-T1 — `with_retry_disabled` removed (Codex finding #3)
========================================================
The old implementation in `eval/retry_optimizer.rb` mutated the step
class's singleton `retry_policy` method around `compare_models`, then
restored it in an `ensure`. This was a concurrency hazard: two
concurrent `optimize_retry_policy` calls on the same step would race,
and the `ensure` does not protect against the race.

Replaced with `context.merge(retry_policy_override: nil)` — the step
already honours this key in `Step::Base#runtime_settings` via
`context.key?(:retry_policy_override)`. Same effect, no mutation, no
race.

Characterization tests written FIRST (4 tests pinning original
`with_retry_disabled` behaviour: nil-inside, restore-after, restore-
on-raise, no-retry-policy-guard). After refactor, replaced with 3
tests pinning the new contract (retry_policy_override propagation,
class-level retry_policy untouched, obsolete private method gone).

B2-T3 — CostCalculator.find_model exposed (finding #2)
======================================================
`step/base.rb` was calling `CostCalculator.send(:find_model)` and
`CostCalculator.send(:compute_cost)` to bypass `private_class_method`.
Visibility bypass via `send` is invisible to grep-for-callers and
fragile under refactor.

Changes:
- Removed `find_model` from `private_class_method` list in
  `cost_calculator.rb` with a comment explaining the public contract.
- Replaced `CostCalculator.send(:find_model, name)` with the direct
  call.
- Dropped `Base.estimated_cost_for` helper entirely; routed through
  the existing public `CostCalculator.calculate(model_name:, usage:)`
  which already does the work. Removes the second `send` call.

5 characterization tests added in
`spec/ruby_llm/contract/cost_calculator_spec.rb` pinning
`find_model` (nil-on-unknown, nil-on-nil, custom registry, custom
overrides built-in, calculate-uses-find-model).

Suite: 1344 examples / 0 failures (+8 net new vs Batch 1 baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Closes the post-0.9.0 quality refactor pull (Batch 1 + Batch 2 partial
from TODO.md). Patch-version bump because every change is internal —
no DSL macro added/removed, no Step.run signature change, no adapter
contract change. Adopters bumping to `~> 0.9.0` automatically pick up
the safer code path on next install.

Bundled fixes (Codex findings #2, #3, #7, #8):
- Drop singleton-method mutation in with_retry_disabled (concurrency
  hazard) — replaced with retry_policy_override: nil via context.
- Expose CostCalculator.find_model publicly; remove two
  CostCalculator.send workarounds; route estimate_cost through the
  existing public CostCalculator.calculate.
- Unify stub_step on a single thread-local storage path; remove
  duplicate allow/receive branch.
- Drop dead ObjectSpace.each_object fallback (unreachable on Ruby
  >= 3.2; gemspec requires 3.2.0+ where Class#subclasses ships).

Test discipline: every refactor wrote characterization tests BEFORE
touching production code, then replaced with new contract tests after.
Suite: 1346 examples / 0 failures (+12 net new vs 0.9.0 baseline).

Remaining refactor backlog (Codex findings #1, #4, #5, #6) documented
in TODO.md and deferred to 0.10.0+ — each requires more invasive
surgery on shared surfaces (DSL inheritance macros, Runner config,
RakeTask SuiteGate) and benefits from dedicated focus rather than
tail-end inclusion in a patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Closes Codex finding #5 (RakeTask#define_task god method, 52 LOC, 6+
responsibilities). Last remaining finding from the post-0.9.0 quality
audit — TODO.md now marked COMPLETE across all 8 findings.

EXTRACTED
=========

New `lib/ruby_llm/contract/rake_task/suite_gate.rb` — testable in
isolation, returns a `Verdict` Data struct:

  Verdict.new(passed:, abort_reason:, passed_reports:, suite_cost:)

The gate enforces a strict ordering preserved from the prior god
method:
  1. Cost gate runs FIRST (if maximum_cost set and exceeded → abort
     before any score check; passed_reports is empty).
  2. Score gate runs per-report (uses minimum_score threshold; falls
     back to report.passed? when no threshold set).
  3. Regression gate runs per-report (when fail_on_regression true,
     prints REGRESSIONS DETECTED to stdout and marks the report failed).
  4. Overall passed = ALL reports passed AND cost gate not tripped.

REFACTORED
==========

`RakeTask#define_task` reduced from ~52 LOC to ~25 LOC. Extracted
`collect_host_reports(context)` helper handles eval running, empty
case, per-host printing. The gate logic, regression detection, cost
comparison, and per-report scoring all delegated to `SuiteGate`.

TEST DISCIPLINE
===============

5 characterization tests written FIRST in
`spec/ruby_llm/contract/rake_task_gate_spec.rb` pinning all four
dimensions:
- gate passes → "All evals passed" output, no abort
- gate fails on score → SystemExit with "Eval suite FAILED"
- cost gate priority → cost message even when score would pass
  (registered a custom pricey model to force non-zero cost)
- baselines NOT saved on score failure
- baselines DO save when all reports pass

All 5 passed BEFORE refactor (current behaviour) AND after (new
behaviour). Existing rake_task tests in f3_track_history_spec.rb and
v02_findings_spec.rb continue to pass unchanged.

Suite: 1369 examples / 0 failures / 8 pending.

CODEX BACKLOG STATUS
====================

All 8 findings from the post-0.9.0 code-quality audit now resolved:
  #1 DRY inheritance walk         → Batch 3 (B3-T2)
  #2 CostCalculator.send          → Batch 2 (B2-T3)
  #3 with_retry_disabled mutation → Batch 2 (B2-T1)
  #4 _explicitly_unset shadows    → Batch 3 (B3-T3, UNSET sentinel)
  #5 RakeTask god method          → Batch 4 (B4-T2, THIS commit)
  #6 Runner.new 17 kwargs         → Batch 2 (B2-T4, RunnerConfig.build)
  #7 ObjectSpace dead code        → Batch 1 (B1-T1)
  #8 stub_step dual storage       → Batch 2 (B2-T2)

Ready for 0.10.0 minor-version release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
justi added a commit that referenced this pull request May 31, 2026
Three findings from a Codex DRY audit of lib/ (104 .rb files, ~7240 LOC).
HIGH-impact #1 already had silent drift in production code.

#1 [HIGH] Shared stub-helper concern (already drifting!)
========================================================

Before: `stub_step`, `stub_steps`, `stub_all_steps` plus their
`build_test_adapter` helper existed in parallel in Minitest helpers
(minitest.rb:50-157) and RSpec helpers (rspec/helpers.rb:24-121).
The RSpec version had already drifted — it ran every test response
through a `normalize_test_response` hook; Minitest did not. If an
adopter using Minitest registered custom normalization, their tests
would diverge from RSpec users running the same fixtures.

After: extracted `RubyLLM::Contract::Concerns::StubHelpers` containing
the three public methods and the `build_test_adapter` /
`normalize_test_response` private helpers. Both `MinitestHelpers` and
`RSpec::Helpers` now include it. Their host files shrink to a few
lines documenting the cleanup contract specific to each framework
(Minitest's `teardown` vs RSpec's `around(:each)`).

LOC: minitest.rb -107 / rspec/helpers.rb -116 / concerns/stub_helpers.rb
+95. Net: -128 LOC and a single source of truth for stub semantics.

#2 [HIGH] RetryExecutor now uses Concerns::UsageAggregator
==========================================================

Before: `RetryExecutor#aggregate_retry_usage` (retry_executor.rb:83-94)
duplicated the input/output token sum that `Concerns::UsageAggregator
#aggregate_usage` already implemented. Both produced
`{ input_tokens:, output_tokens: }`; both extracted usage via
`respond_to?(:usage)`; the only difference was the wrapper that
unpacked attempts versus raw traces.

Drift scenario this prevents: adding a third token field (e.g.
`:cache_read_tokens` for prompt-caching pricing) would have required
edits in both methods; missing one would silently undercount retry
costs.

After: `RetryExecutor` includes `Concerns::UsageAggregator` and
delegates: `aggregate_usage(all_attempts.map { |a| a[:result].trace })`.
The private duplicate is gone (~12 LOC).

#3 [MEDIUM] `validate_positive!` helper in Dsl
==============================================

Before: four DSL macros (`max_input`, `max_output`, `max_cost`,
`attachment_token_estimate`) each opened with the identical guard:

  unless value.is_a?(Numeric) && value.positive?
    raise ArgumentError, "X must be positive, got Y"
  end

After: extracted `validate_positive!(name, value)` private method
(mirrors the existing `CostCalculator.validate_price!` pattern).
Four call sites reduced from three lines to one.

NOT FIXED (Codex called these out as deferable)
================================================

#4 [MEDIUM]: `ReportStats` / `AggregatedReport` share a duck-type
contract (score, pass_rate, pass_rate_ratio, total_cost,
avg_latency_ms, passed?). Codex flagged this as MEDIUM impact but
the duck-type is intentionally load-bearing — making it explicit
via a shared interface module would be a bigger interface design
choice than this DRY pass warrants. Deferred.

LOW items (cost-format strings in Minitest vs RSpec messages,
inconsistent ArgumentError text in eval_definition) — left as-is
per the audit's recommendation.

Test suite: 1369 examples / 0 failures / 8 pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants