0.7.2 + batch: guide optimizations (optimizing_retry_policy + getting_started + output_schema + terminal labels) #21
…_retry_policy guide

Two coupled changes that together close the gap between the new mid/senior-focused README and the guide it links to as "Find the cheapest viable fallback list".

Changed: output labels

`print_summary` now prints terminology consistent with README:
- Constraining eval: X → Hardest eval: X
- Suggested chain: → Suggested fallback list:
- column "single-shot" → "first-attempt"
- column "escalation" → "fallback %"

Programmatic metric names are deliberately unchanged to avoid a breaking API bump: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. A `hardest_eval` alias is added to `RetryOptimizer::Result` for the narrative accessor. Two spec assertions updated; full suite 1341 examples, 0 failures.

Docs: optimizing_retry_policy.md

Rewritten from 17.7k to 6.4k characters, same radical-cut style as the README pass. Continues the `SummarizeArticle` narrative from README rather than introducing ClassifyThread / MyStep placeholders.

Structural fixes from two rounds of codex review:
- Offline mode repositioned as a wiring check (every candidate returns the same sample_response score), real optimization via `LIVE=1 RUNS=3` as the primary command.
- Sample outputs captured from an actual run against Test adapter so the format matches what `print_summary` really prints, not a plausible-looking invention.
- "Suggested fallback list" rows annotated with "Order matters" so two entries don't read as options rather than a chain.
- "Manual procedure" / duplicated troubleshooting / gpt-5-specific reasoning-effort case studies cut; moved to follow-up docs if ever needed.
- `Programmatic API names` section at the end names the metrics on Report / AggregatedReport so Kasia-style readers don't feel the guide is inconsistent with the code.
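The non-breaking alias mentioned above can be illustrated in plain Ruby. This is a minimal sketch, assuming a Struct-like result object; `DemoResult` and its field values are hypothetical and this is not the gem's actual `RetryOptimizer::Result` definition.

```ruby
# Hypothetical stand-in for a result object; NOT the gem's RetryOptimizer::Result.
DemoResult = Struct.new(:constraining_eval, :fallback_list, keyword_init: true) do
  # Add a narrative-friendly reader; the original accessor keeps working,
  # so existing callers are not broken.
  alias_method :hardest_eval, :constraining_eval
end

result = DemoResult.new(constraining_eval: "regression", fallback_list: %w[model-a model-b])

puts result.hardest_eval      # new name
puts result.constraining_eval # old name, same value
```

Both readers return the same value, which is why a rename done this way does not need a major version bump.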
…uplication

Guide was 8.7k chars using a separate ClassifyTicket case study, with three sections (Structured Prompts, Dynamic Prompts, and "Already using ruby_llm?") that either belonged in other guides or duplicated content the freshly-rewritten README now carries.

Changes

Narrative continuity with README. All examples use `SummarizeArticle` (the flagship step from README) with the same schema and validates. The walkthrough expands the README example by layering on `max_input`, `max_output`, `max_cost`, `define_eval`, `run_eval`, `save_baseline!`, and `pass_eval` matchers.

Section order reshuffled so CI gating reads first. README links to this guide as "CI regression gates" and "Budget caps"; the Evals and CI gates section now comes before Budget caps so the primary link target lands on what the reader clicked in for.

Removed sections:
- Structured Prompts / Dynamic Prompts: delegated to prompt_ast.md.
- "Already using ruby_llm?": the new README boundary table + knockout paragraph cover this, better.
- Reasoning effort: niche, not essential for Getting Started.
- "Model priority" explanation paragraph: redundant with retry_policy semantics.

Empirically verified. `tmp/verify_getting_started.rb` instantiates SummarizeArticle with every feature shown in the guide, runs the "smoke" eval end-to-end, exercises `run_eval`, `print_summary`, and `trace[:attempts]`. All pass against the real adapter. The `trace[:attempts]` example was updated to reflect the real hash shape (includes cost, latency_ms; usage abbreviated with "...").

Terminology aligned with README and optimizing_retry_policy.md: "escalate to a smarter model" → "fallback".

Size: 8,723 → 6,146 chars (−30%).

One round of codex review (mid/senior Ruby dev persona): verdict ONE OR TWO TWEAKS, three fixes applied: realistic eval sample (input/output aligned), Evals-before-Budget section order, `print_summary` call added after `run_eval`.

Full spec suite: 1341 examples, 0 failures.
The "Supported constraints" table documented DSL keywords in camelCase (`minLength`, `maxLength`, `minItems`, `maxItems`, `additionalProperties`), which matches the JSON Schema spec but **not** the `ruby_llm-schema` DSL. The DSL accepts snake_case (`min_length`, `min_items`, etc.) and converts to JSON Schema camelCase internally before sending to the provider. Every code example that copy-pasted from the previous table would have raised `ArgumentError` when the schema was built.

Changed to snake_case across the table, added a short note on the internal camelCase conversion so readers who recognize the JSON Schema names aren't confused.

Verified: `tmp/verify_schema_dsl.rb` builds a schema with `min_length`, `max_length`, `min_items`, `max_items` and every one round-trips to the expected `minLength` / `minItems` in the emitted JSON Schema.

Companion audit: `docs/guide/prompt_ast.md` checked for the same class of issue. `input_type Types::Hash.schema(...)` + `Types::String` etc. all build successfully. No changes needed there.
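The snake_case to camelCase mapping described above can be sketched in a few lines of plain Ruby. This is only an illustration of the naming rule; `camelize_key` and `to_json_schema_keys` are hypothetical helpers, not the conversion code inside `ruby_llm-schema`.

```ruby
# Hypothetical helpers illustrating the naming rule only;
# NOT the ruby_llm-schema implementation.
def camelize_key(key)
  head, *tail = key.to_s.split("_")
  (head + tail.map(&:capitalize).join).to_sym
end

def to_json_schema_keys(constraints)
  constraints.transform_keys { |key| camelize_key(key) }
end

dsl_constraints = { min_length: 1, max_length: 200, min_items: 3, additional_properties: false }
p to_json_schema_keys(dsl_constraints)
```

The DSL side stays idiomatic Ruby while the emitted hash uses the JSON Schema keyword names that providers expect.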
Pull request overview
This PR batches a 0.7.2 release bump plus a set of guide/terminal-output terminology alignments, and fixes a docs DSL mismatch that previously caused copy-paste failures.
Changes:
- Bump gem version to 0.7.2 and add a corresponding changelog entry.
- Rename terminal output labels in retry optimization/model comparison output (non-breaking) and add a `hardest_eval` alias for `constraining_eval`.
- Rewrite/streamline guides (`optimizing_retry_policy`, `getting_started`) and fix `output_schema` docs to use the correct snake_case DSL keywords.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Updates spec expectations for the new printed labels. |
| lib/ruby_llm/contract/version.rb | Version bump to 0.7.2. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Adds hardest_eval alias and updates printed labels in summaries. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Renames production-mode table headers and updates formatting widths. |
| docs/guide/output_schema.md | Fixes documented DSL constraint keyword casing (snake_case) and explains conversion. |
| docs/guide/optimizing_retry_policy.md | Rewrites/shortens the guide and aligns terminology/output samples with new labels. |
| docs/guide/getting_started.md | Rewrites/shortens the guide around a consistent SummarizeArticle narrative and updated examples. |
| Gemfile.lock | Updates locked gem version to 0.7.2. |
| CHANGELOG.md | Adds 0.7.2 entry describing the changes. |
…ith getting_started

Reduced 6.3k to 5.0k chars. Every example now uses SummarizeArticle (README flagship step) instead of ClassifyTicket. Team workflow section shortened to 5 one-line bullets with a link to getting_started.md for the full matcher chain; the previous version duplicated the setup steps.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab), sample_response caveat, few-shot note, model-selection-after-prompt-stability ordering, Short Version. These are the philosophy bits that belong in eval-first specifically.

API calls verified against the code: compare_with method and the compared_with matcher chain are both real. compared_with is past-tense on purpose so it reads naturally in RSpec as "was compared with OldPrompt".

Part of the guides-batch-rewrite PR, same narrative pass as getting_started, optimizing_retry_policy, output_schema.
… getting_started + eval_first

Reduced 10.7k to 7.4k chars. Every example now uses SummarizeArticle (README flagship step) instead of the previous mix of ClassifyIntent / ClassifyTicket / EvaluatePersona / EvaluateComparative.

Kept intact (unique to testing guide):
- Test Adapter (String / Hash / Array / sequential responses)
- "Output keys are always symbols" caveat
- stub_step / stub_steps / stub_all_steps full reference with block form
- RSpec setup, Minitest equivalent
- satisfy_contract + pass_eval matchers (with chain cross-referenced to getting_started for the full surface)
- Offline vs Online eval decision table
- Inspecting failures (Report API)
- Soft observations (observe blocks)
- Baseline file format reference

Cut / compressed with links back to proper homes:
- Threshold-Based Gating -> getting_started has the matcher chain
- Rake Task configuration -> getting_started
- Baseline Regression Detection walkthrough -> getting_started + eval_first
- Prompt A/B Testing walkthrough -> eval_first
- Per-section long narratives trimmed to what a test author actually needs to know

Part of the guides-batch-rewrite PR, same narrative pass as optimizing_retry_policy, getting_started, eval_first, output_schema.

Full spec suite: 1341 examples, 0 failures.
…st_practices, pipeline, migration

P3 sanity pass over the three remaining guides. Minimal structural changes; the goal was terminology consistency with README + optimize and the SummarizeArticle case where appropriate.

best_practices.md (3.5k unchanged in size):
- Section 6 renamed "Model escalation" to "Model fallback" so it matches README narrative and the Optimizing retry_policy guide.
- Commentary about fixed 90/9/1 attempt distribution removed: invented numbers not backed by data, same reason the similar line was cut from README.
- Summary table updated (last row: "Cost optimization via model fallback").
- Kept AnalyzeCompetitor / target_lang / priority-body examples. This guide is a reference of validate patterns; diverse examples are deliberate rather than a SummarizeArticle monoculture.

pipeline.md (4.2k nearly unchanged):
- Pipeline eval example: TicketPipeline renamed to MeetingFollowUp so the class actually exists earlier in the guide (previously referenced a pipeline never defined).
- "See also" section added with links to testing.md (pipeline-level adapters) and optimizing_retry_policy.md (per-step fallback).
- MeetingFollowUp case kept: pipelines need multiple steps; SummarizeArticle is single-step so it would fight the topic.

migration.md (4.6k -> 5.4k):
- ClassifyTicket replaced with SummarizeArticle across every example. The original "classify ticket" case was already called out in README feedback as a fabricated case study; migration guide inherited that baggage. Before/After diff now shows a real article-summary service getting wrapped in a contract.
- Eval cases rewritten against tone (analytical vs negative) matching the schema SummarizeArticle actually ships with.
- compare_models call uses candidates: keyword for consistency with Optimizing retry_policy (models: is still supported).
- Added "See also" links at the bottom to getting_started, testing, eval_first so each migration step lands near the full reference.
Full spec suite: 1341 examples, 0 failures.
Pull request overview
This PR bumps ruby_llm-contract to v0.7.2 and aligns retry-optimization output terminology with the README while batching several documentation guide rewrites/fixes (including a critical copy/paste DSL correction in output_schema).
Changes:
- Bump gem version to 0.7.2 (code + lockfile) and add a `hardest_eval` alias for `RetryOptimizer::Result`.
- Rename terminal/summary labels (`Hardest eval`, `Suggested fallback list`, `first-attempt`, `fallback %`) and adjust formatting to match.
- Rewrite/refine multiple guides to consistently use the `SummarizeArticle` narrative and fix the schema DSL constraint keywords to snake_case.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lib/ruby_llm/contract/version.rb | Version bump to 0.7.2. |
| Gemfile.lock | Lockfile version update to 0.7.2. |
| CHANGELOG.md | Adds 0.7.2 entry describing label changes + one guide rewrite. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Adds hardest_eval alias and updates printed labels. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Updates production-mode table headers and column widths. |
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Updates assertions for renamed printed labels. |
| docs/guide/optimizing_retry_policy.md | Major rewrite focused on SummarizeArticle + updated terminology. |
| docs/guide/getting_started.md | Rewrite focused on SummarizeArticle walkthrough and CI gating. |
| docs/guide/testing.md | Rewrite focused on SummarizeArticle, stubbing/matchers, and mode table. |
| docs/guide/eval_first.md | Refine examples to SummarizeArticle and streamline workflow section. |
| docs/guide/output_schema.md | Fixes documented constraint keywords to snake_case + clarifies camelCase conversion. |
    # Getting Started

    ## When you need more
    The README shows a minimal `SummarizeArticle` step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your OpenAI account, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.

    ## 0.7.2 (2026-04-22)

    ### Changed

    - **Terminal output labels renamed for consistency with README narrative.** `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
    - **`docs/guide/optimizing_retry_policy.md` rewritten.** Reduced from 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode now clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match actual `print_summary` format. Terminology aligned with the new labels.

    extract: { decisions: [...] },
    analyze: { analyses: [...] },

    SummarizeArticle => { response: { ... } },
    RelatedArticles => { response: { ... } }
…ding + changelog coverage
Four inline findings from two Copilot review rounds; all addressed:
1. docs/guide/getting_started.md:3 — "OpenAI account" was provider-specific
in a guide that inherits the README's "any ruby_llm provider" promise.
Reworded to "LLM provider budget".
2. CHANGELOG.md — 0.7.2 entry listed only the optimizing_retry_policy
rewrite. Added a Documentation subsection enumerating all five guide
rewrites plus the output_schema DSL bug fix and the best_practices /
pipeline / migration sanity pass, so release notes match what shipped.
3. docs/guide/testing.md:38 — the pipeline.test example used `decisions:
[...]` and `analyses: [...]`. `...` is not valid Ruby inside an array
literal; copy-paste would raise SyntaxError. Replaced with minimal
realistic array entries matching each step's schema from pipeline.md.
4. docs/guide/testing.md:104 — stub_steps example had `response: { ... }`.
Same issue as #3. Replaced with minimal response hashes matching the
SummarizeArticle schema and a plausible RelatedArticles schema.
Full spec suite: 1341 examples, 0 failures.
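Findings 3 and 4 above are easy to reproduce without the gem. The following is a minimal sketch, assuming CRuby (it relies on `RubyVM::InstructionSequence`, which is MRI-specific); `parses?` is a hypothetical helper name, and the sample hashes are illustrative rather than the guide's actual snippets.

```ruby
# Hypothetical helper: true when the source compiles, false on a syntax error.
# Compiling (rather than eval-ing) means undefined constants do not matter.
# Note: SyntaxError is a ScriptError, so a bare `rescue` would NOT catch it.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

placeholder_array = 'x = { decisions: [...] }'
placeholder_hash  = 'x = { response: { ... } }'
realistic_array   = 'x = { decisions: ["ship the fix on Friday"] }'

puts parses?(placeholder_array)
puts parses?(placeholder_hash)
puts parses?(realistic_array)
```

A check like this can run over every documentation snippet before release, so placeholder ellipses are caught before a reader copy-pastes them.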
Four Copilot inline findings addressed in commit 7c968e9.
Both testing.md snippets now copy-paste without SyntaxError.
…es + empirical verification
Earlier passes left four guides on different cases (MeetingFollowUp in
pipeline, AnalyzeCompetitor in best_practices, intent/confidence in
output_schema, GenerateComment in prompt_ast). Per user directive
"tylko i wyłącznie przykład z aktualnego readme.md", this commit
rewrites all four to extend SummarizeArticle.
Part A — case alignment
- output_schema.md
Every example now builds on SummarizeArticle's schema (tldr / takeaways /
tone). Nested-objects-in-arrays section demonstrates the case of
attaching confidence per takeaway. "Why schema alone isn't enough"
shows three realistic validates: UI-card length guard, uniqueness,
and a cross-field rule ("negative tone requires at least one concrete
risk").
- prompt_ast.md
Hash-input + interpolation example is now a SummarizeArticle variant
that accepts article / audience / language. Cross-validate examples
show SummarizeArticle-specific guards (tldr not just the article
reprinted; no takeaway repeats the TL;DR).
- best_practices.md
All six validate patterns reframed around SummarizeArticle: empty /
placeholder guards, length-based cross-validate, conditional tone /
takeaways rule, content quality, pipeline carry-through, model
fallback. The diverse-examples exception I argued for earlier doesn't
hold against the "one case throughout" directive.
- pipeline.md
MeetingFollowUp replaced with a three-step ArticleCardPipeline built
around the README step: SummarizeArticle → GenerateHashtags →
BuildArticleCard. Each step has its own schema and at least one
validate that refers to the previous step's output. Per-step-override
and eval sections renamed accordingly.
- testing.md
Updated the two pipeline.test / stub_steps examples to use the new
ArticleCardPipeline class names and realistic response shapes matching
each step's schema (was MyPipeline / ArticlePipeline).
Part B — empirical verification
tmp/verify_all_guides.rb now exercises every pattern shown in the
guides: SummarizeArticle build + run with Test adapter, smoke eval with
sample_response, snake_case schema constraints round-trip, nested
objects schema, Hash-input prompt variant, three-step
ArticleCardPipeline build, cross-validate blocks, Test adapter array
responses, and the migration-form variant. 11/11 pass. The file is
the regression gate if any future edit breaks a snippet.
bundle exec rspec stays green: 1341 examples, 0 failures.
Size impact (this commit only):
output_schema.md 3.4k -> 3.5k
prompt_ast.md 1.5k -> 2.1k
best_practices.md 3.5k -> 3.3k
pipeline.md 4.3k -> 4.5k
testing.md 7.4k -> 7.4k
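The empirical runner described in Part B is not included in this PR, but the general pattern (named checks, a pass count, a boolean result for CI) can be sketched as follows. Everything here is hypothetical: `CHECKS`, `run_checks`, and the two sample checks are illustrative and are not the contents of `tmp/verify_all_guides.rb`.

```ruby
# Hypothetical check runner; NOT the real tmp/verify_all_guides.rb.
# Each check is a lambda that returns truthy on success or raises on failure.
CHECKS = {
  "snake_case constraint keys are symbols" => -> { { min_length: 1 }.key?(:min_length) },
  "array literal with real entries builds" => -> { ["a", "b", "c"].size == 3 }
}.freeze

def run_checks(checks)
  results = checks.map do |name, check|
    passed = begin
      check.call ? true : false
    rescue StandardError
      false # a raising snippet counts as a failed check
    end
    puts "#{passed ? 'PASS' : 'FAIL'}  #{name}"
    passed
  end

  passed_count = results.count(true)
  puts "#{passed_count}/#{results.size} pass"
  passed_count == results.size
end

all_green = run_checks(CHECKS)
# A CI wrapper could finish with: exit(1) unless all_green
```

Returning a boolean keeps the runner usable both from a script and from a spec.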
Addressed the full directive: all 8 guides extend SummarizeArticle, and every snippet is empirically verified.

Part A: case alignment across all guides

The four guides that were still on other cases now extend SummarizeArticle:
testing.md's pipeline.test / stub_steps examples updated to match the new ArticleCardPipeline class names and realistic response shapes.

Part B: empirical runner

11 / 11 green on
…-standard terms
Searched the guides for terms that might read as invented hybrids rather
than established industry vocabulary. Found two instances worth fixing:
1. optimizing_retry_policy.md:100 said "escalating 60% of the time". The
rest of the guide uses "fallback" (matching README + code output). One
leftover "escalating" verb clashed with the narrative. Changed to
"falling back 60% of the time".
2. output_schema.md used "structural validates" as a compound noun in
two places (header + intro bullet). "Validates" as a noun is Rails
idiom for validate blocks; the "structural" modifier reads as
invented. Replaced with "type and shape checks" — plain English,
same meaning.
Terms checked and kept as-is (not invented, established in the LLM / Ruby
ecosystem):
- fallback / fallback list / fallback rate: LangChain, OpenAI cookbook
- first-attempt cost: clear compound, no hybrid
- hardest eval: plain English replacement for "constraining eval"
- eval-first: established in OpenAI / Anthropic / Braintrust docs
- Prompt AST: AST is standard CS; title kept for URL stability
- wiring check / smoke check: systems-testing idiom
- flywheel: mainstream startup term
- quality gate: CI/CD standard
- preflight check: standard operations term
- runaway inputs: idiomatic compound
Empirical runner tmp/verify_all_guides.rb — still 11/11 green; text-only
changes do not break any snippet.
Full spec suite — 1341 examples, 0 failures.
Widened the jargon audit from /docs/guide/ to every tracked markdown
file. Four real inconsistencies found and fixed.
1. retry_optimizer.rb:23 + matching spec + guide output sample
print_summary printed "#{step} — retry chain optimization". Every
other line in the same summary uses "fallback" language now
(Hardest eval, Suggested fallback list). The header still said
"retry chain" — a hybrid left over from the old terminology.
Changed to "— fallback list optimization". Updated the spec that
asserted the old string, and the output sample in
optimizing_retry_policy.md.
2. optimizing_retry_policy.md:63 — "constraining row"
Inside the `←` marker explanation, we still referred to the
"constraining row". The outer sentence already says "the hardest
eval", so the compact compound was restated as plain-English
"the row that matters most".
3. docs/architecture.md:9 — "retry with model escalation"
The module tree listed Step::RetryExecutor with a comment that
still described it as "retry with model escalation". Narrative
across README, guides, and CHANGELOG 0.7.2 has switched to
"fallback". The comment is narrative, not an API name, so it
was updated to match.
4. examples/README.md — 12 × "invariant"
The examples README called validate blocks "invariants" in twelve
places (table rows, section descriptions). `invariant` IS a real
alias for `validate` in the DSL, but the README and all the
guides use "validate" as the primary term. The examples README
was the outlier. Normalized to "validate" / "validates" /
"validate blocks" for consistency; left the `invariant` method
name intact in code examples that live in examples/*.rb files.
Terms checked across all .md files and kept deliberately:
- "validates" as Rails-idiomatic noun (plural of the DSL method)
- "AST" (standard CS term) for Prompt AST
- "CI regression gate", "baseline", "regression test"
- "fallback" / "fallback list" / "fallback rate"
- "hardest eval" (plain English; `constraining_eval` is the preserved
struct field name, still referenced in API documentation)
- "wiring check", "smoke check", "preflight check"
- "flywheel", "quality gate"
Historical CHANGELOG entries (0.6.x and earlier) deliberately left
alone — they describe the code as it was named at the time.
Empirical: tmp/verify_all_guides.rb still 11/11 green.
Full spec suite: 1341 examples, 0 failures.
Codex did an honest pass over every tracked .md and the matching code.
All 7 findings were real bugs I had missed, not nitpicks.
1. README.md:60 — Step.recommend signature was wrong.
Old: Step.recommend(candidates:, min_score:)
New: Step.recommend("regression", candidates: [...], min_score: 0.95)
The first positional arg (eval_name) is required and was missing, so
a copy-paste would have raised ArgumentError.
2. README.md:82 — Roadmap line still said "Latest: v0.7.1".
CHANGELOG is on 0.7.2. Updated the line to describe the 0.7.2 work
(terminal labels + guides alignment + output_schema DSL fix).
3. docs/guide/output_schema.md:6 — "the model is forced to return JSON"
is too strong and contradicts getting_started.md which honestly says
with_schema is a request that cheaper models can ignore. Softened to
"asking the model to return JSON", with the client-side-validation
reason for keeping point 1 spelled out.
4. docs/guide/testing.md:202 — observe example was unrunnable. It had
no prompt, no output_schema, no adapter, no response. Expanded into
a complete runnable CompareArticles step with integer schema,
validate + observe, and a Test adapter that returns two equal
scores so the observation actually fires as demonstrated.
5. docs/guide/eval_first.md:40 — sample_response({ takeaways: [...] })
is Ruby SyntaxError. Replaced with a realistic 3-item array
matching SummarizeArticle's schema.
6. examples/README.md — stale inventory:
- Removed 06_reddit_promo.rb section (file does not exist).
- Removed 06_reddit_promo references from the "Running" block.
- Added 09_eval_dataset.rb and 10_reddit_full_showcase.rb sections,
matching the actual examples/*.rb files.
- Updated the "no API keys needed" footnote accordingly.
7. docs/architecture.md:40 — RubyLLM::Contract::CI namespace does not
exist in the codebase. RakeTask and Railtie live directly under
RubyLLM::Contract. Tree diagram updated to reflect reality.
Empirical runner (tmp/verify_all_guides.rb) still 11/11 green.
Full spec suite: 1341 examples, 0 failures.
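Finding 1 above (a missing required positional argument) can be reproduced with any method of the same shape. This is a minimal sketch; the `recommend` method below is a hypothetical stand-in that only mirrors the corrected signature and is not the gem's `Step.recommend`.

```ruby
# Hypothetical method with the same parameter shape as the corrected call:
# one required positional argument plus two required keywords.
def recommend(eval_name, candidates:, min_score:)
  "#{eval_name}: #{candidates.size} candidate(s), min_score #{min_score}"
end

# Old README form: the positional eval name is missing.
old_form_error =
  begin
    recommend(candidates: ["model-a"], min_score: 0.95)
    nil
  rescue ArgumentError => e
    e
  end

puts "old form raised: #{old_form_error.class}"

# Corrected form: eval name first, then keywords.
puts recommend("regression", candidates: ["model-a"], min_score: 0.95)
```

The exact error message differs between Ruby versions, so the sketch only relies on the exception class.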
…rrency, around_call

Codex final-review follow-up: document API surface that was shipped but not mentioned in the guides batch.

- getting_started.md: estimate_cost / estimate_eval_cost preflight examples and on_unknown_pricing: :warn vs default :refuse under Budget caps.
- eval_first.md: run_eval(..., concurrency:) section and estimate_eval_cost for CI budgeting across candidate models.
- testing.md: around_call assertion example (fires once per run, receives final Result after retry fallback).

All 17 checks in local verify_all_guides.rb pass. 1341 specs, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e cases

Audit across all 9 guides: README, getting_started, optimizing_retry_policy, pipeline, migration already had strong business scenes. Four guides opened cold or were missing "what breaks in prod" consequences.

- eval_first.md: open with concrete production incident (customer success filter breaking because outage complaints are labelled analytical). Turns the abstract "prompt-by-feel" warning into a scene the reader recognises.
- prompt_ast.md: explain WHY AST over strings via a real multi-tenant / multi-language newsletter scenario; adds business framing to the Hash-input variable-interpolation section too.
- best_practices.md: add one-line "why it matters" to every validate section. Empty output => broken UI card. Cross-validate => catches lazy models echoing input. Conditional logic => prevents silent routing breaks. Content quality => leaked placeholders embarrass in front of users. Model fallback => 80/20 cost math made concrete.
- output_schema.md: nested-objects example now anchored to a UI "confidence bar" feature instead of a bare schema shape.
- testing.md: open with CI speed/cost/flake cost to justify Test adapter.

Prose-only changes. All 17 verify_all_guides.rb checks still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Batch release/docs PR for 0.7.2 that aligns retry-optimization terminology with the README narrative (“fallback”), updates terminal summary/table labels (non-breaking), and rewrites multiple guides around the SummarizeArticle case while fixing a copy/paste-blocking output_schema DSL doc bug.
Changes:
- Bump gem version to `0.7.2` and update release notes/lockfile references.
- Rename `print_summary`/production-mode table labels to “fallback” terminology and add `hardest_eval` alias on `RetryOptimizer::Result`.
- Rewrite/trim guides for narrative continuity and fix `output_schema` constraint keyword casing in docs.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Updates assertions for the renamed terminal labels. |
| lib/ruby_llm/contract/version.rb | Version bump to 0.7.2. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Renames printed labels and adds hardest_eval alias. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Renames production-mode column headers and adjusts formatting widths/separator. |
| examples/README.md | Terminology updates (“validate blocks”) and example index updates (adds 09/10, removes 06). |
| docs/guide/testing.md | Rewrites testing guide examples around SummarizeArticle, matcher chain, and related references. |
| docs/guide/prompt_ast.md | Updates prompt AST examples to SummarizeArticle and expands interpolation/2-arity validate guidance. |
| docs/guide/pipeline.md | Reframes pipeline guide as SummarizeArticle → hashtags → card, adds “See also”. |
| docs/guide/output_schema.md | Fixes snake_case constraint keywords and rewrites schema guidance/examples around SummarizeArticle. |
| docs/guide/optimizing_retry_policy.md | Major rewrite emphasizing “fallback list” workflow and live optimization. |
| docs/guide/migration.md | Rewrites migration walkthrough around SummarizeArticle, adds “See also”. |
| docs/guide/getting_started.md | Reorders/rewrites walkthrough with evals/CI gating first, then budget caps and estimates. |
| docs/guide/eval_first.md | Refines eval-first philosophy and examples using SummarizeArticle. |
| docs/guide/best_practices.md | Terminology alignment (fallback) + updated validate patterns and examples. |
| docs/architecture.md | Updates terminology and reflects RakeTask/Railtie as top-level constants. |
| README.md | Updates “Most useful next” callout and roadmap line for 0.7.2. |
| Gemfile.lock | Locks gem version to 0.7.2. |
| CHANGELOG.md | Adds 0.7.2 entry summarizing label changes and doc rewrites. |
    validate("no markdown headings in the TL;DR") do |o, _|
      !o[:tldr].match?(/^\#{1,6}\s/)
    end
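The regular expression in the snippet above escapes the `#` because `#{` would otherwise start string interpolation inside a Ruby regex literal. A minimal, gem-free sketch of the same check follows; `heading_free?` and the sample strings are hypothetical names chosen for illustration.

```ruby
# `\#` keeps Ruby from reading `#{1,6}` as interpolation; the pattern means
# "one to six '#' characters at the start of any line, followed by whitespace".
MARKDOWN_HEADING = /^\#{1,6}\s/

# Hypothetical helper mirroring the validate block's body.
def heading_free?(text)
  !text.match?(MARKDOWN_HEADING)
end

puts heading_free?("Rates held steady; markets shrugged.")
puts heading_free?("## Summary\nRates held steady.")
puts heading_free?("Intro line\n# Title on the second line")
puts heading_free?("C# adoption keeps growing")
```

In Ruby, `^` anchors at the start of every line, so a heading on the second line of a TL;DR is caught as well.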
| `{"takeaways": [{"text": "...", "confidence": 0.9}]}` | Array of objects | `array :takeaways do; object do; string :text; number :confidence; end; end` |

The schema tells the LLM provider **exactly** what JSON structure to return. Without `object do...end`, `array :groups do; string :who; end` tells the provider "groups is an array of strings" — and that's what you get back.

Without `object do...end`, `array :takeaways do; string :text; end` tells the provider "takeaways is an array of strings" — not objects. That's what you get back.
Verified empirically; the current wording is correct. Test:

    s = RubyLLM::Schema.create do
      array :keywords do
        string :keyword
        number :probability
      end
    end
    s.new.to_json_schema[:schema][:properties][:keywords][:items]
    # => { "type" => "string" }  ← first child wins; number :probability is ignored

Without `object do...end` the items type becomes the first declared primitive, not a compound object. This is the pitfall the guide documents and that spec/ruby_llm/contract/nested_schema_spec.rb:71 explicitly tests ("WRONG: array without object wrapper produces flat string items"). examples/07_keyword_extraction.rb:30 has the same bug in the wild, which is a separate cleanup. No change needed here.
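To make the consequence of the two shapes concrete, here is a gem-free sketch that compares them as plain JSON Schema fragments. The hashes are hand-written illustrations (not output captured from `ruby_llm-schema`), and `item_matches?` is a hypothetical checker covering only the tiny subset of JSON Schema used here.

```ruby
# Hand-written JSON Schema fragments for the array's `items`;
# illustrative only, not captured from ruby_llm-schema.
FLAT_ITEMS = { "type" => "string" }.freeze
OBJECT_ITEMS = {
  "type" => "object",
  "properties" => {
    "keyword" => { "type" => "string" },
    "probability" => { "type" => "number" }
  }
}.freeze

# Hypothetical checker for the subset of JSON Schema used above.
def item_matches?(item, schema)
  case schema["type"]
  when "string" then item.is_a?(String)
  when "number" then item.is_a?(Numeric)
  when "object"
    item.is_a?(Hash) &&
      schema["properties"].all? { |key, sub| item_matches?(item[key], sub) }
  else
    false
  end
end

compound_item = { "keyword" => "ruby", "probability" => 0.9 }

puts item_matches?("ruby", FLAT_ITEMS)          # a bare string fits the flat shape
puts item_matches?(compound_item, FLAT_ITEMS)   # the compound item does not
puts item_matches?(compound_item, OBJECT_ITEMS) # it needs the object shape
```

With the flat shape, a provider that follows the schema has no place to put the second field, which is the data loss the pitfall describes.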
    ```

    1. **Model escalation with quality gate.** Start every request on nano ($0.10/M tokens). When `validate` catches a bad answer, auto-retry on mini ($0.40/M), then full ($2.00/M). 90% of requests succeed on nano. At 10k requests/month: ~$40 instead of ~$200.

    Returns `nil` for a single call (or `0.0` summed) when pricing isn't registered — same failure mode as `max_cost`.

    result = ArticleCardPipeline.run("")
    result.failed?      # => true
    result.failed_step  # => :summarize (empty input fails schema / validate → stops here)
    # tag and card never run — no downstream tokens spent on garbage
…-based docs index + TL;DR boxes
Real-user feedback: adopters have trouble recognising what problems the
gem solves. Diagnosis: docs started from *how* (API walkthrough) not *why*
(production failure modes). Plan consulted with codex, sharpened scope.
Four coordinated workstreams, all constrained by "README stays short":

A) New docs/guide/why.md (failure gallery, sharp not exhaustive)
   - 1 paragraph framing
   - 4 fully worked failure cards: schema-valid-logically-wrong, silent
     prompt regression, refusal-as-valid-JSON, runaway cost / no fallback
   - 2 code samples (not 7); codex: seven narratives would feel like
     a second sales README
   - 3 "also catches" bullets (leaked placeholder, input echo, tone drift)
   - failure → contract mechanism table
   - exit ramps to getting_started / migration / eval_first

B) README micro-additions (zero bloat)
   - "Do I need this?" 3-sentence prose block after Install, before Example
     (codex pushback: Q&A sprawls, prose stays tight)
   - Reading-order hint: README → why.md → getting_started.md

D) Outcome-based docs index in README (codex's added workstream: "may
   matter more than TL;DR boxes" for value recognition)
   - Renamed table column from blank to "What it does for your app"
   - Implementation-centric guide labels replaced with outcome labels:
     * "Eval-First" → "Prevent silent prompt regressions"
     * "Optimizing retry_policy" → "Control retry cost and fallback behaviour"
     * "Best Practices" → "Write validate rules that catch real bugs"
     * "Testing" → "Stub LLM calls in tests"
     * "Migration" → "Adopt in an existing Rails app"
     * "Pipeline" → "Chain LLM calls into a pipeline"
   - why.md listed first; getting_started.md second

C) TL;DR one-liner blockquote at top of every guide (9 guides)
   - Compact single sentence ("Read this when X.")
   - "Skip if Y" added only to eval_first / testing / migration per codex
     (elsewhere adds visual clutter without value)

Memory saved for future conversations: adopters don't recognise the
problem → docs must lead with failure scenarios.
All 17 verify_all_guides.rb checks still pass. 1341 specs, 0 failures.
Prose and markdown only; zero Ruby snippet changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged 4 issues on the adoption-friction commit; 3 are real, 1 is
a false positive.
- best_practices.md — replace String#exclude? + rescue hack with plain
Ruby !include?. exclude? is ActiveSupport-only; snippet now works in
non-Rails apps.
- getting_started.md — correct estimate_eval_cost description. Unlike
max_cost (fail-closed / :warn), estimate_eval_cost silently sums
unknown-pricing cases as $0.00, so the previous "same failure mode"
framing was misleading. Now flagged as a floor, not a guarantee.
- pipeline.md — fail-fast snippet previously claimed "empty input fails
schema / validate", but the SummarizeArticle definition above has no
min_length on tldr. Replaced with a stubbed-adapter scenario where the
TL;DR exceeds the "fits the card" validate — actually demonstrable.
Replied to the fourth comment (output_schema.md) inline on the PR with
an empirical counter-example: `array :keywords do; string :keyword; number
:probability; end` produces `items: {type: "string"}` in JSON Schema (first
child wins; the number declaration is silently ignored). The current
wording is correct and matches `nested_schema_spec.rb:71` ("WRONG: array
without object wrapper produces flat string items").
verify_all_guides.rb grew from 17 to 18 checks: added empirical proof
that the new pipeline.md fail-fast snippet actually stops at :summarize.
1341 specs pass.
No version bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
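The "floor, not a guarantee" point about `estimate_eval_cost` comes down to how unknown prices are summed. A minimal sketch of that arithmetic, assuming per-case estimates where `nil` means pricing is not registered (the case names and figures are hypothetical, and this is not the gem's code):

```ruby
# Hypothetical per-case cost estimates; nil means pricing is not registered.
case_costs = { "smoke_1" => 0.0004, "smoke_2" => nil, "smoke_3" => 0.0006 }

# Treating nil as 0.0 gives a lower bound on spend, not the real total.
floor = case_costs.values.sum { |cost| cost || 0.0 }
unpriced = case_costs.select { |_, cost| cost.nil? }.keys

puts format("estimated floor: $%.4f", floor)   # estimated floor: $0.0010
puts "unpriced cases: #{unpriced.join(', ')}"  # unpriced cases: smoke_2
```

Reporting the unpriced cases next to the sum is one way to keep a silent `$0.00` from reading as a real estimate.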
… variants (#22)

Lowest-friction entry point for evaluating the gem. Two runnable scripts, zero API keys, with inline expected output so readers see the fallback loop without cloning.

## examples/11_fallback_showcase.rb

Variance-induced tone/takeaways mismatch on gpt-5-nano (where temperature=1.0 is server-enforced) → cross-field validate rejects → retry_policy escalates to gpt-5-mini. Part A shows schema-only (refusal would ship); Part B shows the full contract recovering.

## examples/12_retry_variants.rb

Three retry_policy shapes beyond cross-model escalation:

- A: `attempts: 3` on the same model (sampling-variance absorption); replaces the typical begin/rescue/retry loop
- B: reasoning_effort low → medium → high on one model
- C: cross-provider Ollama → Anthropic → OpenAI (local first because it costs nothing; hosted last because it is the most accurate)

All three are runnable through the Test adapter, so no provider keys are needed.

## Side updates

- docs/guide/why.md Failure 3 reframed from "refusal as valid JSON" (edge case) to "sampling variance on fixed-temperature models" (universal for gpt-5 / o-series).
- Vocabulary audit: replaced invented compounds (temperature-locked, variance-induced, severity signals, takeaway drift) with industry-standard terms.
- examples/README.md entries for both showcases include abridged expected output inline.
- examples/09_eval_dataset.rb fixed: eval_case returns CaseResult, not Hash; the .passed? / .score / .output accessors are now used instead of [:passed] etc.

## No version bump

Docs + examples only; gem stays at 0.7.2.

## Reviews addressed

- 4 Copilot review rounds handled (US spelling, REFUSAL_PREFIXES array form, Ollama wording, running-list adapter notes, Part-A label clarity).
- External code review (codex) shaped the refusal-regex narrative and Part A → B framing.
- Branch was rebuilt on clean main after the initial push carried 17 pre-squash commits from #21; final branch is a single commit on top of main.
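The fallback loop that both showcases exercise can be sketched in plain Ruby. This is a toy under stated assumptions, not the gem's `retry_policy` implementation: `call_model`, `valid?`, `run_with_fallback` and the canned responses are hypothetical stand-ins, and the cross-field rule is deliberately simplistic.

```ruby
# Toy fallback loop: try each model in order until validate passes.
MODELS = %w[gpt-5-nano gpt-5-mini].freeze

# Canned outputs standing in for real model calls.
CANNED = {
  "gpt-5-nano" => { tone: "positive", takeaways: ["Revenue fell sharply"] },
  "gpt-5-mini" => { tone: "negative", takeaways: ["Revenue fell sharply"] }
}.freeze

def call_model(model, _input)
  CANNED.fetch(model)
end

# Cross-field rule: bad-news takeaways must not carry a positive tone.
def valid?(output)
  bad_news = output[:takeaways].any? { |t| t.include?("fell") }
  !(bad_news && output[:tone] == "positive")
end

def run_with_fallback(models, input)
  attempts = []
  models.each do |model|
    output = call_model(model, input)
    ok = valid?(output)
    attempts << { model: model, ok: ok }
    return { output: output, attempts: attempts } if ok
  end
  { output: nil, attempts: attempts }
end

result = run_with_fallback(MODELS, "article text")
result[:attempts].each { |a| puts "#{a[:model]}: #{a[:ok] ? 'ok' : 'rejected'}" }
# gpt-5-nano: rejected
# gpt-5-mini: ok
```

Variant A from the list above would pass the same model name several times instead of a list of different models.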
Adoption-friction release. No runtime behavior changes: every delta is in `docs/`, `examples/`, or `spec/integration/` (plus version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the consolidated runnable showcases, and one extra integration spec.

Consolidates 7 merged PRs (#21–#27) into one release:

- #21 Guide rewrite + adoption friction (why.md, "Do I need this?", outcome labels, TL;DR boxes)
- #22 Runnable aha-moment showcases (fallback + retry variants)
- #23 architecture.md refresh + docs/ideas untracked
- #24 Schema pitfall fix (5 example files) + expected output coverage
- #25 Examples consolidation: drop Reddit, renumber 00-06, restore pipeline + real-LLM minimal
- #26 Rails integration FAQ guide (7 pre-emptive questions)
- #27 Pipeline-level run_eval coverage; closes the "09 STEP 5" known issue from 0.7.2

Copilot review of the CHANGELOG itself flagged two inaccuracies before merge:

- "No gem-level code changes" replaced with "No runtime behavior changes" so version.rb / Gemfile.lock bumps are not misrepresented.
- Stale `examples/09_eval_dataset.rb` reference updated to current `05_eval_dataset.rb` after the renumber.

Verification: 1287 specs pass, 6/6 test-adapter examples run clean, bundle install resolves 0.7.3. Full changelog entry on main in CHANGELOG.md.
Consolidates four guide optimizations into one batch. The work is thematically coupled: terminology alignment across the code-to-guide boundary, a DSL bug fix that blocks copy-paste, and narrative-continuity rewrites keeping every guide on the `SummarizeArticle` case from the README.

Previously split across three PRs (#18, #19, #20), now consolidated here. A fourth guide (`eval_first.md`) joined the batch in this branch.

## What's in this PR

### 1. Version bump to 0.7.2

### 2. Terminal output labels renamed (non-breaking)

`print_summary` output strings aligned with the README narrative. Programmatic metric names stable.

- `Constraining eval:` → `Hardest eval:`
- `Suggested chain:` → `Suggested fallback list:`
- `single-shot` → `first-attempt`
- `escalation` → `fallback %`

`RetryOptimizer::Result` now exposes `hardest_eval` as an alias for `constraining_eval`. Programmatic metric names (`single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`) unchanged.

Copilot finding on `model_comparison.rb` (column widths didn't match after header rename) addressed: header and data row both use `%-13s` for the `first-attempt` column, separator bumped from `chain_width + 60` → `chain_width + 62`.

### 3. `docs/guide/optimizing_retry_policy.md` rewritten

17.7k → 6.4k chars. Continues the `SummarizeArticle` narrative from README. Offline mode clearly positioned as a wiring check; real optimization via `LIVE=1 RUNS=3`. Output samples captured from actual `print_summary` runs.

Two codex review rounds (mid/senior Ruby persona): `BIGGER REWORK` → `ONE OR TWO TWEAKS` → applied.

### 4. `docs/guide/getting_started.md` rewritten

8.7k → 6.1k chars. Every example uses `SummarizeArticle` with the same schema and validates as the README. Walkthrough layers on `max_input` / `max_output` / `max_cost` / `define_eval` / `run_eval` / `save_baseline!` / `pass_eval`.

Section order reshuffled: Evals and CI gates before Budget caps (README links to this guide as "CI regression gates").

Removed: Structured Prompts + Dynamic Prompts (delegated to `prompt_ast.md`), "Already using ruby_llm?" (README covers), "Reasoning effort" (niche), "Model priority" paragraph (redundant).

Copilot finding on `trace[:cost]` addressed: `# => 0.000042` → `# => 0.00052 (sum of all attempts)`, which matches `RetryExecutor` sum semantics.

API verified via `tmp/verify_getting_started.rb` against the real adapter.

### 5. `docs/guide/eval_first.md` refined

6.3k → 5.0k chars. Switched every example from `ClassifyTicket` to `SummarizeArticle`. Team workflow section trimmed to 5 one-line bullets linking back to `getting_started.md` for the matcher chain; the old version duplicated setup steps that now live there.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab), `sample_response` caveat, few-shot note, model-selection-after-prompt-stability ordering, Short Version. These are the philosophy bits that belong specifically in eval-first.

API calls verified against the code: `compare_with` and `compared_with` matcher chain both real.

### 6. `docs/guide/testing.md` refined

10.7k → 7.4k chars. Switched every example from `ClassifyIntent` / `ClassifyTicket` / `EvaluatePersona` / `EvaluateComparative` to `SummarizeArticle`.

Kept intact (unique to testing guide): Test Adapter, symbol-keys caveat, `stub_step` / `stub_steps` / `stub_all_steps` reference with block form, RSpec setup + Minitest equivalent, `satisfy_contract` + `pass_eval` matchers, Offline vs Online decision table, Inspecting failures (Report API), Soft observations, Baseline file format.

Cut or compressed with links back to proper homes: Threshold Gating (getting_started has the matcher chain), Rake Task (getting_started), Baseline Regression walkthrough (getting_started + eval_first), Prompt A/B walkthrough (eval_first).

### 7. P3 sanity pass (`best_practices.md`, `pipeline.md`, `migration.md`)

Terminology and case consistency over the three remaining guides.

- `best_practices.md`: section 6 renamed "Model escalation" → "Model fallback" (matches README + Optimizing retry_policy). Fabricated `90% / 9% / 1%` attempt distribution removed. `AnalyzeCompetitor` / diverse validate cases kept: this is a patterns reference, a SummarizeArticle monoculture would fight the topic.
- `pipeline.md`: eval example fixed (`TicketPipeline` replaced with `MeetingFollowUp`, which is actually defined earlier). `See also` section added. `MeetingFollowUp` case kept: pipelines need multi-step, SummarizeArticle is single-step.
- `migration.md`: `ClassifyTicket` replaced with `SummarizeArticle` across every example. The original ticket-classification case carried the same "fabricated case study" baggage README feedback called out. Before/After diff now shows a real article-summary service wrapped in a contract. `See also` section added.

### 8. `docs/guide/output_schema.md` DSL bug fix (critical)

The "Supported constraints" table documented keywords in camelCase (`minLength`, `maxLength`, `minItems`, `maxItems`, `additionalProperties`). Those are JSON Schema spec names, not the actual `ruby_llm-schema` DSL. The DSL accepts snake_case (`min_length`, `min_items`, …) and converts internally.

Every copy-paste from the previous table would have raised `ArgumentError`. Fixed across the table, added a short note on the internal conversion.

Verified: `tmp/verify_schema_dsl.rb` builds a schema using every snake_case constraint and round-trips to the expected camelCase in the JSON Schema output.

Companion audit of `prompt_ast.md`: no changes needed.

## CHANGELOG entry

`0.7.2`: terminal label renames (non-breaking) + guide rewrites. Programmatic API unchanged.

## Tests

`bundle exec rspec`: 1341 examples, 0 failures, 8 pending.

## Size impact

- `optimizing_retry_policy.md`
- `getting_started.md`
- `eval_first.md`
- `testing.md`
- `output_schema.md`
- `best_practices.md`
- `pipeline.md`
- `migration.md`

## Closed PRs

- `docs/optimizing-retry-policy-rewrite`: content here.
- `docs/getting-started-rewrite`: content here.
- `docs/output-schema-prompt-ast-fixes`: content here.
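
For readers wondering what "accepts snake_case and converts internally" in the `output_schema.md` fix amounts to, here is a rough sketch of such a key conversion. `camelize_key` and the constraint values are hypothetical; the actual conversion inside `ruby_llm-schema` may be implemented differently.

```ruby
# Hypothetical converter from DSL-style snake_case keys to JSON Schema names.
def camelize_key(key)
  key.to_s.gsub(/_([a-z])/) { Regexp.last_match(1).upcase }
end

# Illustrative constraint values, written the way the DSL expects them.
constraints = {
  min_length: 1,
  max_length: 280,
  min_items: 1,
  max_items: 5,
  additional_properties: false
}

json_schema = constraints.transform_keys { |key| camelize_key(key) }
puts json_schema.keys.inspect
# ["minLength", "maxLength", "minItems", "maxItems", "additionalProperties"]
```

The round-trip direction matters: the camelCase names are what appears in the JSON Schema output, while the snake_case names are what goes in the guide's table.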