0.7.2 + batch: guide optimizations (optimizing_retry_policy + getting_started + output_schema + terminal labels) by justi · Pull Request #21 · justi/ruby_llm-contract

justi · 2026-04-22T16:52:02Z

Consolidates four guide optimizations into one batch. The work is thematically coupled — terminology alignment across the code-to-guide boundary, a DSL bug fix that blocks copy-paste, and narrative-continuity rewrites keeping every guide on the SummarizeArticle case from the README.

Previously split across three PRs (#18, #19, #20), now consolidated here. A fourth guide (eval_first.md) joined the batch in this branch.

What's in this PR

1. Version bump to 0.7.2

2. Terminal output labels renamed (non-breaking)

print_summary output strings aligned with the README narrative. Programmatic metric names stable.

Before	After
`Constraining eval:`	`Hardest eval:`
`Suggested chain:`	`Suggested fallback list:`
column `single-shot`	`first-attempt`
column `escalation`	`fallback %`

RetryOptimizer::Result now exposes hardest_eval as an alias for constraining_eval. Programmatic metric names (single_shot_cost, single_shot_latency_ms, escalation_rate) unchanged.
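A minimal sketch of how such an alias can sit next to the preserved field, assuming the result object behaves like a plain Struct. `SketchResult` and its `step` field are hypothetical stand-ins; only `constraining_eval` and `hardest_eval` are names taken from this PR.

```ruby
# Hypothetical stand-in for RetryOptimizer::Result (not the gem's actual class).
# Only constraining_eval / hardest_eval are named in this PR; :step is illustrative.
SketchResult = Struct.new(:step, :constraining_eval, keyword_init: true) do
  # Both names read the same field, so existing callers keep working.
  alias_method :hardest_eval, :constraining_eval
end

result = SketchResult.new(step: "SummarizeArticle", constraining_eval: "regression")

puts result.hardest_eval
puts result.hardest_eval.equal?(result.constraining_eval)
```

Because the alias points at the same reader, renaming the printed label does not require touching the stored field or any code that already calls `constraining_eval`.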

Copilot finding on model_comparison.rb (column widths didn't match after header rename) addressed: header and data row both use %-13s for the first-attempt column, separator bumped from chain_width + 60 → chain_width + 62.
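The alignment issue can be illustrated with a small sketch: when the header and the data row share one format string, a wider header label cannot drift out of line with its column. The `chain_width` value, the column set, and the sample cells below are illustrative assumptions, not the contents of model_comparison.rb.

```ruby
# Illustrative values only: the real chain_width and columns live in model_comparison.rb.
chain_width = 24
row_format  = "%-#{chain_width}s  %-13s  %-10s"

# Header and data row are rendered through the same format string,
# so "first-attempt" (13 characters) gets exactly the width its column reserves.
header    = format(row_format, "chain", "first-attempt", "fallback %")
data_row  = format(row_format, "model-a -> model-b", "$0.00012", "40%")
separator = "-" * header.length

puts header
puts separator
puts data_row
```

Deriving the separator from the rendered header length is one way to avoid hand-maintained offsets when a label changes width.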

3. `docs/guide/optimizing_retry_policy.md` rewritten

17.7k → 6.4k chars. Continues the SummarizeArticle narrative from README. Offline mode clearly positioned as a wiring check; real optimization via LIVE=1 RUNS=3. Output samples captured from actual print_summary runs.

Two codex review rounds (mid/senior Ruby persona): the verdict moved from BIGGER REWORK to ONE OR TWO TWEAKS, and those tweaks were applied.

4. `docs/guide/getting_started.md` rewritten

8.7k → 6.1k chars. Every example uses SummarizeArticle with the same schema and validates as the README. Walkthrough layers on max_input / max_output / max_cost / define_eval / run_eval / save_baseline! / pass_eval.

Section order reshuffled: Evals and CI gates before Budget caps (README links to this guide as "CI regression gates").

Removed: Structured Prompts + Dynamic Prompts (delegated to prompt_ast.md), "Already using ruby_llm?" (README covers), "Reasoning effort" (niche), "Model priority" paragraph (redundant).

Copilot finding on trace[:cost] addressed: # => 0.000042 → # => 0.00052 (sum of all attempts) — matches RetryExecutor sum semantics.
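A hedged sketch of the sum semantics described above: the trace-level cost is the total across attempts, not the cost of the last attempt alone. The hash shape, model names, and per-attempt figures are illustrative assumptions chosen to add up to roughly 0.00052; this is not output from RetryExecutor.

```ruby
# Hypothetical attempts list: the first model's output failed validation,
# the fallback model's output passed. Figures are made up for illustration.
attempts = [
  { model: "model-a", cost: 0.00012, latency_ms: 310 },
  { model: "model-b", cost: 0.00040, latency_ms: 820 }
]

# Trace-level cost under "sum of all attempts" semantics.
trace = {
  attempts: attempts,
  cost: attempts.sum { |attempt| attempt[:cost] }
}

puts format("%.5f", trace[:cost])
```

Documenting the summed value matters for budget checks, since a run that falls back pays for every attempt it made.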

API verified via tmp/verify_getting_started.rb against the real adapter.

5. `docs/guide/eval_first.md` refined

6.3k → 5.0k chars. Switched every example from ClassifyTicket to SummarizeArticle. Team workflow section trimmed to 5 one-line bullets linking back to getting_started.md for the matcher chain — the old version duplicated setup steps that now live there.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab), sample_response caveat, few-shot note, model-selection-after-prompt-stability ordering, Short Version. These are the philosophy bits that belong specifically in eval-first.

API calls verified against the code: compare_with and compared_with matcher chain both real.

6. `docs/guide/testing.md` refined

10.7k → 7.4k chars. Switched every example from ClassifyIntent / ClassifyTicket / EvaluatePersona / EvaluateComparative to SummarizeArticle.

Kept intact (unique to testing guide): Test Adapter, symbol-keys caveat, stub_step / stub_steps / stub_all_steps reference with block form, RSpec setup + Minitest equivalent, satisfy_contract + pass_eval matchers, Offline vs Online decision table, Inspecting failures (Report API), Soft observations, Baseline file format.

Cut or compressed with links back to proper homes: Threshold Gating (getting_started has the matcher chain), Rake Task (getting_started), Baseline Regression walkthrough (getting_started + eval_first), Prompt A/B walkthrough (eval_first).

7. P3 sanity pass (`best_practices.md`, `pipeline.md`, `migration.md`)

Terminology and case consistency over the three remaining guides.

best_practices.md: section 6 renamed "Model escalation" → "Model fallback" (matches README + Optimizing retry_policy). Fabricated 90% / 9% / 1% attempt distribution removed. AnalyzeCompetitor / diverse validate cases kept — this is a patterns reference, a SummarizeArticle monoculture would fight the topic.
pipeline.md: eval example fixed (TicketPipeline replaced with MeetingFollowUp, which is actually defined earlier). See also section added. MeetingFollowUp case kept — pipelines need multi-step, SummarizeArticle is single-step.
migration.md: ClassifyTicket replaced with SummarizeArticle across every example. The original ticket-classification case carried the same "fabricated case study" baggage README feedback called out. Before/After diff now shows a real article-summary service wrapped in a contract. See also section added.

8. `docs/guide/output_schema.md` DSL bug fix (critical)

The "Supported constraints" table documented keywords in camelCase (minLength, maxLength, minItems, maxItems, additionalProperties). Those are JSON Schema spec names, not the actual ruby_llm-schema DSL. The DSL accepts snake_case (min_length, min_items, …) and converts internally.

Every copy-paste from the previous table would have raised ArgumentError. Fixed across the table, added a short note on the internal conversion.

Verified: tmp/verify_schema_dsl.rb builds a schema using every snake_case constraint and round-trips to the expected camelCase in the JSON Schema output.
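The snake_case to camelCase round-trip can be pictured with a small sketch. This is an assumption about the general shape of such a conversion, not the ruby_llm-schema implementation; `camelize_key` and `to_json_schema_keys` are hypothetical helper names.

```ruby
# Hypothetical helpers: convert DSL-style snake_case constraint names
# into the camelCase names used by the JSON Schema specification.
def camelize_key(key)
  head, *tail = key.to_s.split("_")
  (head + tail.map(&:capitalize).join).to_sym
end

def to_json_schema_keys(constraints)
  constraints.transform_keys { |key| camelize_key(key) }
end

dsl_constraints = { min_length: 1, max_length: 200, min_items: 3, max_items: 5 }
emitted = to_json_schema_keys(dsl_constraints)

puts emitted.keys.join(", ")
```

Under this sketch a reader writes `min_length` in Ruby and sees `minLength` in the emitted schema, which is the behaviour the added note in the guide describes.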

Companion audit of prompt_ast.md — no changes needed.

CHANGELOG entry

0.7.2 — terminal label renames (non-breaking) + guide rewrites. Programmatic API unchanged.

Tests

bundle exec rspec — 1341 examples, 0 failures, 8 pending.

Size impact

File	Before	After	Δ
`optimizing_retry_policy.md`	17.7k	6.4k	−64%
`getting_started.md`	8.7k	6.1k	−30%
`eval_first.md`	6.3k	5.0k	−20%
`testing.md`	10.7k	7.4k	−31%
`output_schema.md`	3.3k	3.4k	+3% (note added)
`best_practices.md`	3.5k	3.5k	±0% (terminology)
`pipeline.md`	4.2k	4.3k	+2% (See also link)
`migration.md`	4.6k	5.4k	+17% (See also + clearer case)
Total guide content	54.3k	36.7k	−32%

Closed PRs

0.7.2: align output labels + rewrite optimizing_retry_policy guide (17.7k → 6.4k) #18 (docs/optimizing-retry-policy-rewrite) — content here.
docs(getting-started): rewrite for SummarizeArticle narrative + cut duplication (8.7k → 6.1k) #19 (docs/getting-started-rewrite) — content here.
docs(output-schema): fix camelCase DSL bug (copy-paste raised ArgumentError) #20 (docs/output-schema-prompt-ast-fixes) — content here.

…_retry_policy guide

Two coupled changes that together close the gap between the new mid/senior-focused README and the guide it links to as "Find the cheapest viable fallback list".

Changed: output labels

`print_summary` now prints terminology consistent with README:

- Constraining eval: X → Hardest eval: X
- Suggested chain: → Suggested fallback list:
- column "single-shot" → "first-attempt"
- column "escalation" → "fallback %"

Programmatic metric names are deliberately unchanged to avoid a breaking API bump: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. A `hardest_eval` alias is added to `RetryOptimizer::Result` for the narrative accessor. Two spec assertions updated; full suite 1341 examples, 0 failures.

Docs: optimizing_retry_policy.md

Rewritten from 17.7k to 6.4k characters, same radical-cut style as the README pass. Continues the `SummarizeArticle` narrative from README rather than introducing ClassifyThread / MyStep placeholders.

Structural fixes from two rounds of codex review:

- Offline mode repositioned as a wiring check (every candidate returns the same sample_response score), real optimization via `LIVE=1 RUNS=3` as the primary command.
- Sample outputs captured from an actual run against Test adapter so the format matches what `print_summary` really prints, not a plausible-looking invention.
- "Suggested fallback list" rows annotated with "Order matters" so two entries don't read as options rather than a chain.
- "Manual procedure" / duplicated troubleshooting / gpt-5-specific reasoning-effort case studies cut, moved to follow-up docs if ever needed.
- `Programmatic API names` section at the end names the metrics on Report / AggregatedReport so Kasia-style readers don't feel the guide is inconsistent with the code.

…r rename

…uplication

Guide was 8.7k chars using a separate ClassifyTicket case study, with three sections (Structured Prompts, Dynamic Prompts, and "Already using ruby_llm?") that either belonged in other guides or duplicated content the freshly-rewritten README now carries.

Changes

Narrative continuity with README. All examples use `SummarizeArticle` (the flagship step from README) with the same schema and validates. The walkthrough expands the README example by layering on `max_input`, `max_output`, `max_cost`, `define_eval`, `run_eval`, `save_baseline!`, and `pass_eval` matchers.

Section order reshuffled so CI gating reads first. README links to this guide as "CI regression gates" and "Budget caps"; the Evals and CI gates section now comes before Budget caps so the primary link target lands on what the reader clicked in for.

Removed sections:

- Structured Prompts / Dynamic Prompts: delegated to prompt_ast.md.
- "Already using ruby_llm?": the new README boundary table + knockout paragraph cover this, better.
- Reasoning effort: niche, not essential for Getting Started.
- "Model priority" explanation paragraph: redundant with retry_policy semantics.

Empirically verified. `tmp/verify_getting_started.rb` instantiates SummarizeArticle with every feature shown in the guide, runs the "smoke" eval end-to-end, exercises `run_eval`, `print_summary`, and `trace[:attempts]`. All pass against the real adapter. The `trace[:attempts]` example was updated to reflect the real hash shape (includes cost, latency_ms; usage abbreviated with "...").

Terminology aligned with README and optimizing_retry_policy.md: "escalate to a smarter model" → "fallback".

Size: 8,723 → 6,146 chars (−30%).

One round of codex review (mid/senior Ruby dev persona): verdict ONE OR TWO TWEAKS, three fixes applied: realistic eval sample (input/output aligned), Evals-before-Budget section order, `print_summary` call added after `run_eval`.

Full spec suite: 1341 examples, 0 failures.

… costs

The "Supported constraints" table documented DSL keywords in camelCase (`minLength`, `maxLength`, `minItems`, `maxItems`, `additionalProperties`), which matches the JSON Schema spec but **not** the `ruby_llm-schema` DSL. The DSL accepts snake_case (`min_length`, `min_items`, etc.) and converts to JSON Schema camelCase internally before sending to the provider.

Every code example that copy-pasted from the previous table would have raised `ArgumentError` when the schema was built. Changed to snake_case across the table, added a short note on the internal camelCase conversion so readers who recognize the JSON Schema names aren't confused.

Verified: `tmp/verify_schema_dsl.rb` builds a schema with `min_length`, `max_length`, `min_items`, `max_items` and every one round-trips to the expected `minLength` / `minItems` in the emitted JSON Schema.

Companion audit: `docs/guide/prompt_ast.md` checked for the same class of issue. `input_type Types::Hash.schema(...)` + `Types::String` etc. all build successfully. No changes needed there.

Copilot

Pull request overview

This PR batches a 0.7.2 release bump plus a set of guide/terminal-output terminology alignments, and fixes a docs DSL mismatch that previously caused copy-paste failures.

Changes:

Bump gem version to 0.7.2 and add a corresponding changelog entry.
Rename terminal output labels in retry optimization/model comparison output (non-breaking) and add a hardest_eval alias for constraining_eval.
Rewrite/streamline guides (optimizing_retry_policy, getting_started) and fix output_schema docs to use the correct snake_case DSL keywords.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.


File	Description
spec/ruby_llm/contract/eval/retry_optimizer_spec.rb	Updates spec expectations for the new printed labels.
lib/ruby_llm/contract/version.rb	Version bump to 0.7.2.
lib/ruby_llm/contract/eval/retry_optimizer.rb	Adds `hardest_eval` alias and updates printed labels in summaries.
lib/ruby_llm/contract/eval/model_comparison.rb	Renames production-mode table headers and updates formatting widths.
docs/guide/output_schema.md	Fixes documented DSL constraint keyword casing (snake_case) and explains conversion.
docs/guide/optimizing_retry_policy.md	Rewrites/shortens the guide and aligns terminology/output samples with new labels.
docs/guide/getting_started.md	Rewrites/shortens the guide around a consistent `SummarizeArticle` narrative and updated examples.
Gemfile.lock	Updates locked gem version to 0.7.2.
CHANGELOG.md	Adds 0.7.2 entry describing the changes.


…ith getting_started

Reduced 6.3k to 5.0k chars. Every example now uses SummarizeArticle (README flagship step) instead of ClassifyTicket. Team workflow section shortened to 5 one-line bullets with a link to getting_started.md for the full matcher chain; the previous version duplicated the setup steps.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab), sample_response caveat, few-shot note, model-selection-after-prompt-stability ordering, Short Version. These are the philosophy bits that belong in eval-first specifically.

API calls verified against the code: compare_with method and the compared_with matcher chain are both real. compared_with is past-tense on purpose so it reads naturally in RSpec as "was compared with OldPrompt".

Part of the guides-batch-rewrite PR, same narrative pass as getting_started, optimizing_retry_policy, output_schema.

… getting_started + eval_first

Reduced 10.7k to 7.4k chars. Every example now uses SummarizeArticle (README flagship step) instead of the previous mix of ClassifyIntent / ClassifyTicket / EvaluatePersona / EvaluateComparative.

Kept intact (unique to testing guide):

- Test Adapter (String / Hash / Array / sequential responses)
- "Output keys are always symbols" caveat
- stub_step / stub_steps / stub_all_steps full reference with block form
- RSpec setup, Minitest equivalent
- satisfy_contract + pass_eval matchers (with chain cross-referenced to getting_started for the full surface)
- Offline vs Online eval decision table
- Inspecting failures (Report API)
- Soft observations (observe blocks)
- Baseline file format reference

Cut / compressed with links back to proper homes:

- Threshold-Based Gating -> getting_started has the matcher chain
- Rake Task configuration -> getting_started
- Baseline Regression Detection walkthrough -> getting_started + eval_first
- Prompt A/B Testing walkthrough -> eval_first
- Per-section long narratives trimmed to what a test author actually needs to know

Part of the guides-batch-rewrite PR, same narrative pass as optimizing_retry_policy, getting_started, eval_first, output_schema.

Full spec suite: 1341 examples, 0 failures.

…st_practices, pipeline, migration

P3 sanity pass over the three remaining guides. Minimal structural changes; the goal was terminology consistency with README + optimize and the SummarizeArticle case where appropriate.

best_practices.md (3.5k unchanged in size):

- Section 6 renamed "Model escalation" to "Model fallback" so it matches README narrative and the Optimizing retry_policy guide.
- Commentary about fixed 90/9/1 attempt distribution removed: invented numbers not backed by data, same reason the similar line was cut from README.
- Summary table updated (last row: "Cost optimization via model fallback").
- Kept AnalyzeCompetitor / target_lang / priority-body examples. This guide is a reference of validate patterns; diverse examples are deliberate rather than a SummarizeArticle monoculture.

pipeline.md (4.2k nearly unchanged):

- Pipeline eval example: TicketPipeline renamed to MeetingFollowUp so the class actually exists earlier in the guide (previously referenced a pipeline never defined).
- "See also" section added with links to testing.md (pipeline-level adapters) and optimizing_retry_policy.md (per-step fallback).
- MeetingFollowUp case kept: pipelines need multiple steps; SummarizeArticle is single-step so it would fight the topic.

migration.md (4.6k -> 5.4k):

- ClassifyTicket replaced with SummarizeArticle across every example. The original "classify ticket" case was already called out in README feedback as a fabricated case study; migration guide inherited that baggage. Before/After diff now shows a real article-summary service getting wrapped in a contract.
- Eval cases rewritten against tone (analytical vs negative) matching the schema SummarizeArticle actually ships with.
- compare_models call uses candidates: keyword for consistency with Optimizing retry_policy (models: is still supported).
- Added "See also" links at the bottom to getting_started, testing, eval_first so each migration step lands near the full reference.

Full spec suite: 1341 examples, 0 failures.

Copilot

Pull request overview

This PR bumps ruby_llm-contract to v0.7.2 and aligns retry-optimization output terminology with the README while batching several documentation guide rewrites/fixes (including a critical copy/paste DSL correction in output_schema).

Changes:

Bump gem version to 0.7.2 (code + lockfile) and add a hardest_eval alias for RetryOptimizer::Result.
Rename terminal/summary labels (Hardest eval, Suggested fallback list, first-attempt, fallback %) and adjust formatting to match.
Rewrite/refine multiple guides to consistently use the SummarizeArticle narrative and fix the schema DSL constraint keywords to snake_case.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.


File	Description
`lib/ruby_llm/contract/version.rb`	Version bump to 0.7.2.
`Gemfile.lock`	Lockfile version update to 0.7.2.
`CHANGELOG.md`	Adds 0.7.2 entry describing label changes + one guide rewrite.
`lib/ruby_llm/contract/eval/retry_optimizer.rb`	Adds `hardest_eval` alias and updates printed labels.
`lib/ruby_llm/contract/eval/model_comparison.rb`	Updates production-mode table headers and column widths.
`spec/ruby_llm/contract/eval/retry_optimizer_spec.rb`	Updates assertions for renamed printed labels.
`docs/guide/optimizing_retry_policy.md`	Major rewrite focused on `SummarizeArticle` + updated terminology.
`docs/guide/getting_started.md`	Rewrite focused on `SummarizeArticle` walkthrough and CI gating.
`docs/guide/testing.md`	Rewrite focused on `SummarizeArticle`, stubbing/matchers, and mode table.
`docs/guide/eval_first.md`	Refine examples to `SummarizeArticle` and streamline workflow section.
`docs/guide/output_schema.md`	Fixes documented constraint keywords to snake_case + clarifies camelCase conversion.


 # Getting Started

-## When you need more
+The README shows a minimal `SummarizeArticle` step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your OpenAI account, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.


+## 0.7.2 (2026-04-22)
+
+### Changed
+
+- **Terminal output labels renamed for consistency with README narrative.** `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
+- **`docs/guide/optimizing_retry_policy.md` rewritten.** Reduced from 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode now clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match actual `print_summary` format. Terminology aligned with the new labels.


    extract:  { decisions: [...] },
    analyze:  { analyses: [...] },


+  SummarizeArticle => { response: { ... } },
+  RelatedArticles  => { response: { ... } }


…ding + changelog coverage

Four inline findings from two Copilot review rounds; all addressed:

1. docs/guide/getting_started.md:3: "OpenAI account" was provider-specific in a guide that inherits the README's "any ruby_llm provider" promise. Reworded to "LLM provider budget".
2. CHANGELOG.md: 0.7.2 entry listed only the optimizing_retry_policy rewrite. Added a Documentation subsection enumerating all five guide rewrites plus the output_schema DSL bug fix and the best_practices / pipeline / migration sanity pass, so release notes match what shipped.
3. docs/guide/testing.md:38: the pipeline.test example used `decisions: [...]` and `analyses: [...]`. `...` is not valid Ruby inside an array literal; copy-paste would raise SyntaxError. Replaced with minimal realistic array entries matching each step's schema from pipeline.md.
4. docs/guide/testing.md:104: stub_steps example had `response: { ... }`. Same issue as #3. Replaced with minimal response hashes matching the SummarizeArticle schema and a plausible RelatedArticles schema.

Full spec suite: 1341 examples, 0 failures.

justi · 2026-04-22T17:12:13Z

Four Copilot inline findings addressed in commit 7c968e9:

getting_started.md:3 — 'OpenAI account' → 'LLM provider budget' (provider-agnostic, consistent with README's multi-provider claim).
CHANGELOG.md:8 — 0.7.2 entry expanded with a Documentation subsection covering all five guide rewrites + output_schema DSL bug fix + P3 sanity pass.
testing.md:38 — replaced [...] placeholders in pipeline.test example with realistic minimal arrays matching each step's schema.
testing.md:104 — replaced { ... } placeholders in stub_steps example with minimal response hashes matching SummarizeArticle schema.

Both testing.md snippets now copy-paste without SyntaxError.
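A quick way to reproduce this class of problem locally is to ask the parser directly. The sketch below assumes CRuby (it relies on `RubyVM::InstructionSequence`), and the `parses?` helper and the sample strings are hypothetical, not the snippets from testing.md.

```ruby
# Returns true when +source+ parses as Ruby, false when the parser rejects it.
# RubyVM is CRuby-specific; this is a local check, not part of the gem.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# Placeholder-style snippets, as a reader would copy them from a guide.
array_placeholder = "payload = { decisions: [...] }"
hash_placeholder  = "payload = { response: { ... } }"

# A minimal realistic replacement (contents are made up for illustration).
realistic = 'payload = { decisions: ["Ship the release this week"] }'

puts parses?(array_placeholder)
puts parses?(hash_placeholder)
puts parses?(realistic)
```

Running a check like this over every fenced snippet is one option for catching elision placeholders before they reach readers.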

…es + empirical verification

Earlier passes left four guides on different cases (MeetingFollowUp in pipeline, AnalyzeCompetitor in best_practices, intent/confidence in output_schema, GenerateComment in prompt_ast). Per user directive "only and exclusively the example from the current readme.md", this commit rewrites all four to extend SummarizeArticle.

Part A: case alignment

- output_schema.md: Every example now builds on SummarizeArticle's schema (tldr / takeaways / tone). Nested-objects-in-arrays section demonstrates the case of attaching confidence per takeaway. "Why schema alone isn't enough" shows three realistic validates: UI-card length guard, uniqueness, and a cross-field rule ("negative tone requires at least one concrete risk").
- prompt_ast.md: Hash-input + interpolation example is now a SummarizeArticle variant that accepts article / audience / language. Cross-validate examples show SummarizeArticle-specific guards (tldr not just the article reprinted; no takeaway repeats the TL;DR).
- best_practices.md: All six validate patterns reframed around SummarizeArticle: empty / placeholder guards, length-based cross-validate, conditional tone / takeaways rule, content quality, pipeline carry-through, model fallback. The diverse-examples exception I argued for earlier doesn't hold against the "one case throughout" directive.
- pipeline.md: MeetingFollowUp replaced with a three-step ArticleCardPipeline built around the README step: SummarizeArticle → GenerateHashtags → BuildArticleCard. Each step has its own schema and at least one validate that refers to the previous step's output. Per-step-override and eval sections renamed accordingly.
- testing.md: Updated the two pipeline.test / stub_steps examples to use the new ArticleCardPipeline class names and realistic response shapes matching each step's schema (was MyPipeline / ArticlePipeline).

Part B: empirical verification

tmp/verify_all_guides.rb now exercises every pattern shown in the guides: SummarizeArticle build + run with Test adapter, smoke eval with sample_response, snake_case schema constraints round-trip, nested objects schema, Hash-input prompt variant, three-step ArticleCardPipeline build, cross-validate blocks, Test adapter array responses, and the migration-form variant. 11/11 pass. The file is the regression gate if any future edit breaks a snippet.

bundle exec rspec stays green: 1341 examples, 0 failures.

Size impact (this commit only):

- output_schema.md 3.4k -> 3.5k
- prompt_ast.md 1.5k -> 2.1k
- best_practices.md 3.5k -> 3.3k
- pipeline.md 4.3k -> 4.5k
- testing.md 7.4k -> 7.4k

justi · 2026-04-22T17:31:21Z

Addressed the full directive: all 8 guides extend SummarizeArticle, and every snippet is empirically verified.

Part A — case alignment across all guides

The four guides that were still on other cases now extend SummarizeArticle:

Guide	Was	Now
output_schema.md	intent / confidence / groups	SummarizeArticle schema + nested confidence per takeaway
prompt_ast.md	GenerateComment (reddit)	SummarizeArticle Hash-input variant (article / audience / language)
best_practices.md	AnalyzeCompetitor + diverse mini-cases	all 6 validate patterns framed around SummarizeArticle
pipeline.md	MeetingFollowUp (extract / analyze / email)	ArticleCardPipeline: SummarizeArticle → GenerateHashtags → BuildArticleCard

testing.md's pipeline.test / stub_steps examples updated to match the new ArticleCardPipeline class names and realistic response shapes.

Part B — empirical runner

tmp/verify_all_guides.rb exercises every pattern shown in the guides:

SummarizeArticle build + run with Test adapter
Smoke check with sample_response
snake_case schema constraints round-trip to JSON Schema camelCase
Nested objects in arrays
Hash-input prompt variant with interpolation
Three-step ArticleCardPipeline build + steps with 2-arity validates
Cross-validate blocks (length + conditional tone rule)
Test adapter array responses
Migration-form variant (model DSL + prompt do block)

11 / 11 green on bundle exec ruby -Ilib tmp/verify_all_guides.rb. The runner is a regression gate if any future edit breaks a snippet.
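The shape of such a runner can be sketched in a few lines. This is an assumption about the general pattern (named checks, each a callable, with a pass count at the end), not the contents of tmp/verify_all_guides.rb; the two checks and the `run_checks` helper are placeholders.

```ruby
# Each check is a name plus a callable that returns truthy on success.
# These two checks are placeholders, not the real guide snippets.
CHECKS = {
  "array literal builds"     => -> { %w[a b c].length == 3 },
  "output keys stay symbols" => -> { { tldr: "ok" }.key?(:tldr) }
}.freeze

# Runs every check, prints one line per check, and returns [passed, total].
def run_checks(checks)
  results = checks.map do |name, check|
    ok = begin
      check.call ? true : false
    rescue StandardError
      false # a snippet that raises counts as a failed check
    end
    puts "#{ok ? 'PASS' : 'FAIL'}  #{name}"
    ok
  end
  [results.count(true), results.length]
end

passed, total = run_checks(CHECKS)
puts "#{passed} / #{total} green"
```

A real gate would likely exit non-zero when `passed` differs from `total`, so that CI fails on a broken snippet; that step is left out here to keep the sketch side-effect free.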

bundle exec rspec stays green: 1341 examples, 0 failures.

…-standard terms

Searched the guides for terms that might read as invented hybrids rather than established industry vocabulary. Found two instances worth fixing:

1. optimizing_retry_policy.md:100 said "escalating 60% of the time". The rest of the guide uses "fallback" (matching README + code output). One leftover "escalating" verb clashed with the narrative. Changed to "falling back 60% of the time".
2. output_schema.md used "structural validates" as a compound noun in two places (header + intro bullet). "Validates" as a noun is Rails idiom for validate blocks; the "structural" modifier reads as invented. Replaced with "type and shape checks": plain English, same meaning.

Terms checked and kept as-is (not invented, established in the LLM / Ruby ecosystem):

- fallback / fallback list / fallback rate: LangChain, OpenAI cookbook
- first-attempt cost: clear compound, no hybrid
- hardest eval: plain English replacement for "constraining eval"
- eval-first: established in OpenAI / Anthropic / Braintrust docs
- Prompt AST: AST is standard CS; title kept for URL stability
- wiring check / smoke check: systems-testing idiom
- flywheel: mainstream startup term
- quality gate: CI/CD standard
- preflight check: standard operations term
- runaway inputs: idiomatic compound

Empirical runner tmp/verify_all_guides.rb: still 11/11 green; text-only changes do not break any snippet. Full spec suite: 1341 examples, 0 failures.

Widened the jargon audit from /docs/guide/ to every tracked markdown file. Four real inconsistencies found and fixed.

1. retry_optimizer.rb:23 + matching spec + guide output sample. print_summary printed "#{step} — retry chain optimization". Every other line in the same summary uses "fallback" language now (Hardest eval, Suggested fallback list). The header still said "retry chain", a hybrid left over from the old terminology. Changed to "— fallback list optimization". Updated the spec that asserted the old string, and the output sample in optimizing_retry_policy.md.
2. optimizing_retry_policy.md:63, "constraining row". Inside the `←` marker explanation, we still referred to the "constraining row". The outer sentence already says "the hardest eval", so the compact compound was restated as plain-English "the row that matters most".
3. docs/architecture.md:9, "retry with model escalation". The module tree listed Step::RetryExecutor with a comment that still described it as "retry with model escalation". Narrative across README, guides, and CHANGELOG 0.7.2 has switched to "fallback". The comment is narrative, not an API name, so it was updated to match.
4. examples/README.md, 12 × "invariant". The examples README called validate blocks "invariants" in twelve places (table rows, section descriptions). `invariant` IS a real alias for `validate` in the DSL, but the README and all the guides use "validate" as the primary term. The examples README was the outlier. Normalized to "validate" / "validates" / "validate blocks" for consistency; left the `invariant` method name intact in code examples that live in examples/*.rb files.

Terms checked across all .md files and kept deliberately:

- "validates" as Rails-idiomatic noun (plural of the DSL method)
- "AST" (standard CS term) for Prompt AST
- "CI regression gate", "baseline", "regression test"
- "fallback" / "fallback list" / "fallback rate"
- "hardest eval" (plain English; `constraining_eval` is the preserved struct field name, still referenced in API documentation)
- "wiring check", "smoke check", "preflight check"
- "flywheel", "quality gate"

Historical CHANGELOG entries (0.6.x and earlier) deliberately left alone: they describe the code as it was named at the time.

Empirical: tmp/verify_all_guides.rb still 11/11 green. Full spec suite: 1341 examples, 0 failures.

Codex did an honest pass over every tracked .md and the matching code. All 7 findings were real bugs I had missed, not nitpicks.

1. README.md:60: Step.recommend signature was wrong.
   Old: Step.recommend(candidates:, min_score:)
   New: Step.recommend("regression", candidates: [...], min_score: 0.95)
   The first positional arg (eval_name) is required and was missing, so a copy-paste would have raised ArgumentError.
2. README.md:82: Roadmap line still said "Latest: v0.7.1". CHANGELOG is on 0.7.2. Updated the line to describe the 0.7.2 work (terminal labels + guides alignment + output_schema DSL fix).
3. docs/guide/output_schema.md:6: "the model is forced to return JSON" is too strong and contradicts getting_started.md, which honestly says with_schema is a request that cheaper models can ignore. Softened to "asking the model to return JSON", with the client-side-validation reason for keeping point 1 spelled out.
4. docs/guide/testing.md:202: observe example was unrunnable. It had no prompt, no output_schema, no adapter, no response. Expanded into a complete runnable CompareArticles step with integer schema, validate + observe, and a Test adapter that returns two equal scores so the observation actually fires as demonstrated.
5. docs/guide/eval_first.md:40: sample_response({ takeaways: [...] }) is a Ruby SyntaxError. Replaced with a realistic 3-item array matching SummarizeArticle's schema.
6. examples/README.md: stale inventory.
   - Removed 06_reddit_promo.rb section (file does not exist).
   - Removed 06_reddit_promo references from the "Running" block.
   - Added 09_eval_dataset.rb and 10_reddit_full_showcase.rb sections, matching the actual examples/*.rb files.
   - Updated the "no API keys needed" footnote accordingly.
7. docs/architecture.md:40: RubyLLM::Contract::CI namespace does not exist in the codebase. RakeTask and Railtie live directly under RubyLLM::Contract. Tree diagram updated to reflect reality.

Empirical runner (tmp/verify_all_guides.rb) still 11/11 green. Full spec suite: 1341 examples, 0 failures.
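
Finding 1 can be illustrated without the gem. The sketch below uses a hypothetical `DemoStep` module whose `recommend` only mirrors the argument shape quoted above (one required positional eval name plus `candidates:` and `min_score:`); the candidate hashes are an assumed shape, not the gem's actual data format.

```ruby
# Hypothetical stand-in for Step.recommend. Only the argument shape (one
# required positional eval name plus two keywords) is taken from the PR text.
module DemoStep
  def self.recommend(eval_name, candidates:, min_score:)
    { eval: eval_name, candidates: candidates, min_score: min_score }
  end
end

candidates = [{ model: "gpt-5-nano" }, { model: "gpt-5-mini" }] # assumed shape

# Old README form: the required positional argument is missing.
old_form_error =
  begin
    DemoStep.recommend(candidates: candidates, min_score: 0.95)
    nil
  rescue ArgumentError => e
    e
  end

# Corrected form: eval name first, then the keywords.
recommendation = DemoStep.recommend("regression", candidates: candidates, min_score: 0.95)

puts old_form_error.class   # ArgumentError
puts recommendation[:eval]  # regression
```

The old call fails before any model is contacted, which is why a copy-paste from the previous README would have stopped at the first line.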

…rrency, around_call

Codex final-review follow-up: document API surface that was shipped but not mentioned in the guides batch.

- getting_started.md: estimate_cost / estimate_eval_cost preflight examples and on_unknown_pricing: :warn vs default :refuse under Budget caps.
- eval_first.md: run_eval(..., concurrency:) section and estimate_eval_cost for CI budgeting across candidate models.
- testing.md: around_call assertion example (fires once per run, receives final Result after retry fallback).

All 17 checks in local verify_all_guides.rb pass. 1341 specs, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e cases

Audit across all 9 guides: README, getting_started, optimizing_retry_policy, pipeline, migration already had strong business scenes. Four guides opened cold or were missing "what breaks in prod" consequences.

- eval_first.md: open with concrete production incident (customer success filter breaking because outage complaints are labelled analytical). Turns the abstract "prompt-by-feel" warning into a scene the reader recognises.
- prompt_ast.md: explain WHY AST over strings via a real multi-tenant / multi-language newsletter scenario; adds business framing to the Hash-input variable-interpolation section too.
- best_practices.md: add one-line "why it matters" to every validate section.
  - Empty output => broken UI card.
  - Cross-validate => catches lazy models echoing input.
  - Conditional logic => prevents silent routing breaks.
  - Content quality => leaked placeholders embarrass in front of users.
  - Model fallback => 80/20 cost math made concrete.
- output_schema.md: nested-objects example now anchored to a UI "confidence bar" feature instead of a bare schema shape.
- testing.md: open with CI speed/cost/flake cost to justify Test adapter.

Prose-only changes. All 17 verify_all_guides.rb checks still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Batch release/docs PR for 0.7.2 that aligns retry-optimization terminology with the README narrative (“fallback”), updates terminal summary/table labels (non-breaking), and rewrites multiple guides around the SummarizeArticle case while fixing a copy/paste-blocking output_schema DSL doc bug.

Changes:

Bump gem version to 0.7.2 and update release notes/lockfile references.
Rename print_summary/production-mode table labels to “fallback” terminology and add hardest_eval alias on RetryOptimizer::Result.
Rewrite/trim guides for narrative continuity and fix output_schema constraint keyword casing in docs.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.


File	Description
spec/ruby_llm/contract/eval/retry_optimizer_spec.rb	Updates assertions for the renamed terminal labels.
lib/ruby_llm/contract/version.rb	Version bump to `0.7.2`.
lib/ruby_llm/contract/eval/retry_optimizer.rb	Renames printed labels and adds `hardest_eval` alias.
lib/ruby_llm/contract/eval/model_comparison.rb	Renames production-mode column headers and adjusts formatting widths/separator.
examples/README.md	Terminology updates (“validate blocks”) and example index updates (adds 09/10, removes 06).
docs/guide/testing.md	Rewrites testing guide examples around `SummarizeArticle`, matcher chain, and related references.
docs/guide/prompt_ast.md	Updates prompt AST examples to `SummarizeArticle` and expands interpolation/2-arity validate guidance.
docs/guide/pipeline.md	Reframes pipeline guide as `SummarizeArticle` → hashtags → card, adds “See also”.
docs/guide/output_schema.md	Fixes snake_case constraint keywords and rewrites schema guidance/examples around `SummarizeArticle`.
docs/guide/optimizing_retry_policy.md	Major rewrite emphasizing “fallback list” workflow and live optimization.
docs/guide/migration.md	Rewrites migration walkthrough around `SummarizeArticle`, adds “See also”.
docs/guide/getting_started.md	Reorders/rewrites walkthrough with evals/CI gating first, then budget caps and estimates.
docs/guide/eval_first.md	Refines eval-first philosophy and examples using `SummarizeArticle`.
docs/guide/best_practices.md	Terminology alignment (fallback) + updated validate patterns and examples.
docs/architecture.md	Updates terminology and reflects `RakeTask`/`Railtie` as top-level constants.
README.md	Updates “Most useful next” callout and roadmap line for `0.7.2`.
Gemfile.lock	Locks gem version to `0.7.2`.
CHANGELOG.md	Adds `0.7.2` entry summarizing label changes and doc rewrites.


+validate("no markdown headings in the TL;DR") do |o, _|
+  !o[:tldr].match?(/^\#{1,6}\s/)
 end


justi · 2026-04-23T00:07:57Z

+| `{"takeaways": [{"text": "...", "confidence": 0.9}]}` | Array of objects | `array :takeaways do; object do; string :text; number :confidence; end; end` |

-The schema tells the LLM provider **exactly** what JSON structure to return. Without `object do...end`, `array :groups do; string :who; end` tells the provider "groups is an array of strings" — and that's what you get back.
+Without `object do...end`, `array :takeaways do; string :text; end` tells the provider "takeaways is an array of strings" — not objects. That's what you get back.


Verified empirically — the current wording is correct. Test:

s = RubyLLM::Schema.create do
  array :keywords do
    string :keyword
    number :probability
  end
end

s.new.to_json_schema[:schema][:properties][:keywords][:items]
# => { "type" => "string" }  ← first child wins; number :probability is ignored

Without object do...end the items type becomes the first declared primitive, not a compound object. This is the pitfall the guide documents and that spec/ruby_llm/contract/nested_schema_spec.rb:71 explicitly tests ("WRONG: array without object wrapper produces flat string items"). examples/07_keyword_extraction.rb:30 has the same bug in the wild — separate cleanup. No change needed here.

+```

-1. **Model escalation with quality gate.** Start every request on nano ($0.10/M tokens). When `validate` catches a bad answer, auto-retry on mini ($0.40/M), then full ($2.00/M). 90% of requests succeed on nano. At 10k requests/month: ~$40 instead of ~$200.
+Returns `nil` for a single call (or `0.0` summed) when pricing isn't registered — same failure mode as `max_cost`.


+result = ArticleCardPipeline.run("")
+result.failed?        # => true
+result.failed_step    # => :summarize (empty input fails schema / validate → stops here)
+# tag and card never run — no downstream tokens spent on garbage


…-based docs index + TL;DR boxes

Real-user feedback: adopters have trouble recognising what problems the gem solves. Diagnosis: docs started from *how* (API walkthrough) not *why* (production failure modes). Plan consulted with codex, sharpened scope.

Four coordinated workstreams, all constrained by "README stays short":

A) New docs/guide/why.md (failure gallery, sharp not exhaustive)
- 1 paragraph framing
- 4 fully worked failure cards: schema-valid-logically-wrong, silent prompt regression, refusal-as-valid-JSON, runaway cost / no fallback
- 2 code samples (not 7); codex: seven narratives would feel like a second sales README
- 3 "also catches" bullets (leaked placeholder, input echo, tone drift)
- failure → contract mechanism table
- exit ramps to getting_started / migration / eval_first

B) README micro-additions (zero bloat)
- "Do I need this?" 3-sentence prose block after Install, before Example (codex pushback: Q&A sprawls, prose stays tight)
- Reading-order hint: README → why.md → getting_started.md

D) Outcome-based docs index in README (codex's added workstream: "may matter more than TL;DR boxes" for value recognition)
- Renamed table column from blank to "What it does for your app"
- Implementation-centric guide labels replaced with outcome labels:
  * "Eval-First" → "Prevent silent prompt regressions"
  * "Optimizing retry_policy" → "Control retry cost and fallback behaviour"
  * "Best Practices" → "Write validate rules that catch real bugs"
  * "Testing" → "Stub LLM calls in tests"
  * "Migration" → "Adopt in an existing Rails app"
  * "Pipeline" → "Chain LLM calls into a pipeline"
- why.md listed first; getting_started.md second

C) TL;DR one-liner blockquote at top of every guide (9 guides)
- Compact single sentence ("Read this when X.")
- "Skip if Y" added only to eval_first / testing / migration per codex (elsewhere adds visual clutter without value)

Memory saved for future conversations: adopters don't recognise the problem → docs must lead with failure scenarios.

All 17 verify_all_guides.rb checks still pass. 1341 specs, 0 failures. Prose and markdown only; zero Ruby snippet changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot flagged 4 issues on the adoption-friction commit; 3 are real, 1 is a false positive.

- best_practices.md: replace String#exclude? + rescue hack with plain Ruby !include?. exclude? is ActiveSupport-only; snippet now works in non-Rails apps.
- getting_started.md: correct estimate_eval_cost description. Unlike max_cost (fail-closed / :warn), estimate_eval_cost silently sums unknown-pricing cases as $0.00, so the previous "same failure mode" framing was misleading. Now flagged as a floor, not a guarantee.
- pipeline.md: fail-fast snippet previously claimed "empty input fails schema / validate", but the SummarizeArticle definition above has no min_length on tldr. Replaced with a stubbed-adapter scenario where the TL;DR exceeds the "fits the card" validate, which is actually demonstrable.

Replied to the fourth comment (output_schema.md) inline on the PR with an empirical counter-example: `array :keywords do; string :keyword; number :probability; end` produces `items: {type: "string"}` in JSON Schema (first child wins; the number declaration is silently ignored). The current wording is correct and matches `nested_schema_spec.rb:71` ("WRONG: array without object wrapper produces flat string items").

verify_all_guides.rb grown from 17 to 18 checks: added empirical proof that the new pipeline.md fail-fast snippet actually stops at :summarize. 1341 specs pass. No version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
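
The best_practices.md fix rests on `String#include?` being core Ruby while `exclude?` comes from ActiveSupport. A minimal sketch of the portable form follows; the placeholder token and sample strings are hypothetical and are not taken from the guide itself.

```ruby
# Plain Ruby only: ActiveSupport is deliberately not required here.
placeholder = "{{article}}"            # hypothetical leaked template token
clean_tldr  = "The outage lasted two hours and affected checkout."
leaky_tldr  = "Summary of {{article}} goes here."

# String#include? is core Ruby, so negating it works in any app.
no_placeholder = ->(text) { !text.include?(placeholder) }

puts no_placeholder.call(clean_tldr)  # true
puts no_placeholder.call(leaky_tldr)  # false

# false in plain Ruby; true only once ActiveSupport's core extensions are loaded.
puts "".respond_to?(:exclude?)
```

A check of this shape could sit inside a validate block without forcing non-Rails adopters to pull in ActiveSupport.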

… variants (#22)

Lowest-friction entry point for evaluating the gem. Two runnable scripts, zero API keys, with inline expected output so readers see the fallback loop without cloning.

## examples/11_fallback_showcase.rb

Variance-induced tone/takeaways mismatch on gpt-5-nano (where temperature=1.0 is server-enforced) → cross-field validate rejects → retry_policy escalates to gpt-5-mini. Part A shows schema-only (refusal would ship); Part B shows the full contract recovering.

## examples/12_retry_variants.rb

Three retry_policy shapes beyond cross-model escalation:

- A: attempts: 3 on the same model. Sampling-variance absorption; replaces the typical begin/rescue/retry loop.
- B: reasoning_effort low → medium → high on one model.
- C: cross-provider Ollama → Anthropic → OpenAI (local first because it costs nothing; hosted last because it is the most accurate).

All three runnable through the Test adapter so no provider keys are needed.

## Side updates

- docs/guide/why.md Failure 3 reframed from "refusal as valid JSON" (edge case) to "sampling variance on fixed-temperature models" (universal for gpt-5 / o-series).
- Vocabulary audit: replaced invented compounds (temperature-locked, variance-induced, severity signals, takeaway drift) with industry-standard terms.
- examples/README.md entries for both showcases include abridged expected output inline.
- examples/09_eval_dataset.rb fixed: eval_case returns CaseResult, not Hash; the .passed? / .score / .output accessors are now used instead of [:passed] etc.

## No version bump

Docs + examples only; gem stays at 0.7.2.

## Reviews addressed

- 4 Copilot review rounds handled (US spelling, REFUSAL_PREFIXES array form, Ollama wording, running-list adapter notes, Part-A label clarity).
- External code review (codex) shaped the refusal-regex narrative and Part A → B framing.
- Branch was rebuilt on clean main after the initial push carried 17 pre-squash commits from #21; final branch is a single commit on top of main.
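
The fallback loop described for the showcase can be pictured in plain Ruby. This is a rough sketch only, not the gem's retry_policy implementation: `run_with_fallback`, the `Attempt` struct, the stub lambdas, and the specific cross-field rule are all hypothetical stand-ins chosen for illustration.

```ruby
# Hypothetical fallback loop; this is NOT the gem's retry_policy implementation.
# Each "model" is a lambda standing in for a provider call.
Attempt = Struct.new(:model, :output, :ok)

def run_with_fallback(models, validate)
  attempts = []
  models.each do |name, call|
    output = call.call
    ok = validate.call(output)
    attempts << Attempt.new(name, output, ok)
    return attempts if ok # stop at the first model whose output validates
  end
  attempts
end

# Assumed cross-field rule: a negative tone must come with at least one takeaway.
validate = ->(o) { !(o[:tone] == "negative" && o[:takeaways].empty?) }

models = {
  "nano-stub" => -> { { tone: "negative", takeaways: [] } },
  "mini-stub" => -> { { tone: "negative", takeaways: ["Outage lasted two hours"] } }
}

attempts = run_with_fallback(models, validate)
p attempts.map { |a| [a.model, a.ok] }
# [["nano-stub", false], ["mini-stub", true]]
```

Both stub outputs would pass a schema-only check, which is the Part A versus Part B contrast: only the cross-field rule makes the first attempt fail and triggers the next model.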

Adoption-friction release. No runtime behavior changes: every delta is in `docs/`, `examples/`, or `spec/integration/` (plus version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the consolidated runnable showcases, and one extra integration spec.

Consolidates 7 merged PRs (#21–#27) into one release:

- #21 Guide rewrite + adoption friction (why.md, "Do I need this?", outcome labels, TL;DR boxes)
- #22 Runnable aha-moment showcases (fallback + retry variants)
- #23 architecture.md refresh + docs/ideas untracked
- #24 Schema pitfall fix (5 example files) + expected output coverage
- #25 Examples consolidation: drop Reddit, renumber 00-06, restore pipeline + real-LLM minimal
- #26 Rails integration FAQ guide (7 pre-emptive questions)
- #27 Pipeline-level run_eval coverage; closes the "09 STEP 5" known issue from 0.7.2

Copilot review of the CHANGELOG itself flagged two inaccuracies before merge:

- "No gem-level code changes" replaced with "No runtime behavior changes" so version.rb / Gemfile.lock bumps are not misrepresented.
- Stale `examples/09_eval_dataset.rb` reference updated to current `05_eval_dataset.rb` after the renumber.

Verification: 1287 specs pass, 6/6 test-adapter examples run clean, bundle install resolves 0.7.3. Full changelog entry on main in CHANGELOG.md.

justi added 5 commits April 23, 2026 01:50

model_comparison: fix production-mode table column widths after header rename (d91b13e)

getting_started: fix trace[:cost] example to match sum of per-attempt costs (217acf0)

Copilot AI review requested due to automatic review settings April 22, 2026 16:52

Copilot started reviewing on behalf of justi April 22, 2026 16:52 View session

Copilot AI reviewed Apr 22, 2026


justi added 2 commits April 23, 2026 01:56

justi requested a review from Copilot April 22, 2026 17:03

Copilot started reviewing on behalf of justi April 22, 2026 17:04 View session

Copilot AI reviewed Apr 22, 2026


justi and others added 4 commits April 23, 2026 02:35

justi requested a review from Copilot April 22, 2026 23:30

Copilot started reviewing on behalf of justi April 22, 2026 23:30 View session

justi added the documentation Improvements or additions to documentation label Apr 22, 2026

justi self-assigned this Apr 22, 2026

Copilot AI reviewed Apr 22, 2026


justi and others added 2 commits April 23, 2026 08:45

justi merged commit 77a6c2d into main Apr 23, 2026
1 check passed

justi mentioned this pull request Apr 23, 2026

examples: add 11_fallback_showcase.rb — runnable 'aha moment' demo #22

Merged

This was referenced Apr 23, 2026

docs: add Rails integration FAQ guide (pre-emptive adoption answers) #26

Merged

0.7.3: adoption-friction release (docs + examples consolidation) #28

Merged

justi merged 17 commits into main from docs/guides-batch-rewrite



		SummarizeArticle => { response: { ... } },
		RelatedArticles => { response: { ... } }


What's in this PR

1. Version bump to 0.7.2

2. Terminal output labels renamed (non-breaking)

3. docs/guide/optimizing_retry_policy.md rewritten

4. docs/guide/getting_started.md rewritten

5. docs/guide/eval_first.md refined

6. docs/guide/testing.md refined

7. P3 sanity pass (best_practices.md, pipeline.md, migration.md)

8. docs/guide/output_schema.md DSL bug fix (critical)

CHANGELOG entry

Tests

Size impact

Closed PRs


justi commented Apr 22, 2026

Part A — case alignment across all guides

Part B — empirical runner
