0.7.2 + batch: guide optimizations (optimizing_retry_policy + getting_started + output_schema + terminal labels)#21

Merged
justi merged 17 commits into main from docs/guides-batch-rewrite
Apr 23, 2026

Conversation

@justi
Owner

@justi justi commented Apr 22, 2026

Consolidates four guide optimizations into one batch. The work is thematically coupled — terminology alignment across the code-to-guide boundary, a DSL bug fix that blocks copy-paste, and narrative-continuity rewrites keeping every guide on the SummarizeArticle case from the README.

Previously split across three PRs (#18, #19, #20), now consolidated here. A fourth guide (eval_first.md) joined the batch in this branch.

What's in this PR

1. Version bump to 0.7.2

2. Terminal output labels renamed (non-breaking)

print_summary output strings aligned with the README narrative. Programmatic metric names stable.

| Before | After |
| --- | --- |
| Constraining eval: | Hardest eval: |
| Suggested chain: | Suggested fallback list: |
| column single-shot | first-attempt |
| column escalation | fallback % |

RetryOptimizer::Result now exposes hardest_eval as an alias for constraining_eval. Programmatic metric names (single_shot_cost, single_shot_latency_ms, escalation_rate) unchanged.
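
A minimal sketch of the alias pattern, assuming a Struct-like result object. `ResultSketch` and its fields are hypothetical stand-ins for illustration, not the real `RetryOptimizer::Result` definition:

```ruby
# Hypothetical stand-in for RetryOptimizer::Result; the real class is not shown here.
ResultSketch = Struct.new(:constraining_eval, :single_shot_cost, :escalation_rate,
                          keyword_init: true) do
  # Narrative accessor: a second name for the same reader.
  # The original reader keeps working, so nothing breaks for existing callers.
  alias_method :hardest_eval, :constraining_eval
end

result = ResultSketch.new(
  constraining_eval: "regression", # illustrative values only
  single_shot_cost: 0.00052,
  escalation_rate: 0.1
)

puts result.hardest_eval
puts result.constraining_eval
```

Both calls return the same object, which is why the rename can ship in a patch release without touching the programmatic metric names.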

Copilot finding on model_comparison.rb (column widths didn't match after header rename) addressed: header and data row both use %-13s for the first-attempt column, separator bumped from chain_width + 60 to chain_width + 62.
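
The width fix can be illustrated with a small sketch. The format string, column names, and sample values below are assumptions for illustration and not the actual code in model_comparison.rb. The point is that header and data rows share one format string, so a renamed header cannot drift out of alignment:

```ruby
# Hypothetical format string: one definition shared by header and data rows.
# "first-attempt" is 13 characters, hence the %-13s column.
ROW_FORMAT = "%-24s %-13s %10s"

header    = format(ROW_FORMAT, "fallback list", "first-attempt", "fallback %")
row       = format(ROW_FORMAT, "model-a -> model-b", "$0.00052", "10%")
separator = "-" * header.length # derived from the header, so it tracks any width change

puts header
puts separator
puts row
```

Deriving the separator from the header length is one design option. The PR keeps an explicit width expression instead.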

3. docs/guide/optimizing_retry_policy.md rewritten

17.7k → 6.4k chars. Continues the SummarizeArticle narrative from README. Offline mode clearly positioned as a wiring check; real optimization via LIVE=1 RUNS=3. Output samples captured from actual print_summary runs.

Two codex review rounds (mid/senior Ruby persona): BIGGER REWORK → ONE OR TWO TWEAKS → applied.

4. docs/guide/getting_started.md rewritten

8.7k → 6.1k chars. Every example uses SummarizeArticle with the same schema and validates as the README. Walkthrough layers on max_input / max_output / max_cost / define_eval / run_eval / save_baseline! / pass_eval.

Section order reshuffled: Evals and CI gates before Budget caps (README links to this guide as "CI regression gates").

Removed: Structured Prompts + Dynamic Prompts (delegated to prompt_ast.md), "Already using ruby_llm?" (README covers), "Reasoning effort" (niche), "Model priority" paragraph (redundant).

Copilot finding on trace[:cost] addressed: # => 0.000042 → # => 0.00052 (sum of all attempts), which matches RetryExecutor sum semantics.
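
A rough sketch of the sum semantics, assuming each attempt is a hash with a `:cost` key. The model names and the second attempt's cost are made-up values chosen so the total matches the corrected comment. This is plain Ruby, not the gem's RetryExecutor:

```ruby
# Hypothetical attempts list; only the :cost key matters for this sketch.
attempts = [
  { model: "model-a", cost: 0.000042 }, # first attempt, rejected by a validate
  { model: "model-b", cost: 0.000478 }  # fallback attempt, accepted
]

# The trace-level cost is what the whole call cost, so it sums every attempt,
# not just the first or the last one.
total_cost = attempts.sum { |attempt| attempt[:cost] }

puts format("%.5f", total_cost)
```

Documenting only the first attempt's cost would understate what a call with a fallback actually bills.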

API verified via tmp/verify_getting_started.rb against the real adapter.

5. docs/guide/eval_first.md refined

6.3k → 5.0k chars. Switched every example from ClassifyTicket to SummarizeArticle. Team workflow section trimmed to 5 one-line bullets linking back to getting_started.md for the matcher chain — the old version duplicated setup steps that now live there.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab), sample_response caveat, few-shot note, model-selection-after-prompt-stability ordering, Short Version. These are the philosophy bits that belong specifically in eval-first.

API calls verified against the code: compare_with and the compared_with matcher chain are both real.

6. docs/guide/testing.md refined

10.7k → 7.4k chars. Switched every example from ClassifyIntent / ClassifyTicket / EvaluatePersona / EvaluateComparative to SummarizeArticle.

Kept intact (unique to testing guide): Test Adapter, symbol-keys caveat, stub_step / stub_steps / stub_all_steps reference with block form, RSpec setup + Minitest equivalent, satisfy_contract + pass_eval matchers, Offline vs Online decision table, Inspecting failures (Report API), Soft observations, Baseline file format.

Cut or compressed with links back to proper homes: Threshold Gating (getting_started has the matcher chain), Rake Task (getting_started), Baseline Regression walkthrough (getting_started + eval_first), Prompt A/B walkthrough (eval_first).

7. P3 sanity pass (best_practices.md, pipeline.md, migration.md)

Terminology and case consistency over the three remaining guides.

  • best_practices.md: section 6 renamed "Model escalation" → "Model fallback" (matches README + Optimizing retry_policy). Fabricated 90% / 9% / 1% attempt distribution removed. AnalyzeCompetitor / diverse validate cases kept — this is a patterns reference, a SummarizeArticle monoculture would fight the topic.
  • pipeline.md: eval example fixed (TicketPipeline replaced with MeetingFollowUp, which is actually defined earlier). See also section added. MeetingFollowUp case kept — pipelines need multi-step, SummarizeArticle is single-step.
  • migration.md: ClassifyTicket replaced with SummarizeArticle across every example. The original ticket-classification case carried the same "fabricated case study" baggage README feedback called out. Before/After diff now shows a real article-summary service wrapped in a contract. See also section added.

8. docs/guide/output_schema.md DSL bug fix (critical)

The "Supported constraints" table documented keywords in camelCase (minLength, maxLength, minItems, maxItems, additionalProperties). Those are JSON Schema spec names, not the actual ruby_llm-schema DSL. The DSL accepts snake_case (min_length, min_items, …) and converts internally.

Every copy-paste from the previous table would have raised ArgumentError. Fixed across the table, added a short note on the internal conversion.

Verified: tmp/verify_schema_dsl.rb builds a schema using every snake_case constraint and round-trips to the expected camelCase in the JSON Schema output.
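
For readers who want the conversion spelled out, here is a minimal sketch in plain Ruby. `camelize_key` and `to_json_schema_keys` are hypothetical helper names and do not come from ruby_llm-schema. The sketch only shows the snake_case to camelCase mapping described above:

```ruby
# Hypothetical helper: turns :min_length into :minLength.
def camelize_key(key)
  head, *tail = key.to_s.split("_")
  (head + tail.map(&:capitalize).join).to_sym
end

# Hypothetical helper: applies the mapping to a constraints hash.
def to_json_schema_keys(constraints)
  constraints.transform_keys { |key| camelize_key(key) }
end

dsl_constraints = { min_length: 1, max_length: 280, min_items: 3, max_items: 5 }
p to_json_schema_keys(dsl_constraints)
```

Keys without an underscore pass through unchanged, so single-word constraint names need no special handling in this sketch.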

Companion audit of prompt_ast.md — no changes needed.

CHANGELOG entry

0.7.2 — terminal label renames (non-breaking) + guide rewrites. Programmatic API unchanged.

Tests

bundle exec rspec — 1341 examples, 0 failures, 8 pending.

Size impact

| File | Before | After | Δ |
| --- | --- | --- | --- |
| optimizing_retry_policy.md | 17.7k | 6.4k | −64% |
| getting_started.md | 8.7k | 6.1k | −30% |
| eval_first.md | 6.3k | 5.0k | −20% |
| testing.md | 10.7k | 7.4k | −31% |
| output_schema.md | 3.3k | 3.4k | +3% (note added) |
| best_practices.md | 3.5k | 3.5k | ±0% (terminology) |
| pipeline.md | 4.2k | 4.3k | +2% (See also link) |
| migration.md | 4.6k | 5.4k | +17% (See also + clearer case) |
| Total guide content | 54.3k | 36.7k | −32% |

Closed PRs

justi added 5 commits April 23, 2026 01:50
…_retry_policy guide

Two coupled changes that together close the gap between the new
mid/senior-focused README and the guide it links to as
"Find the cheapest viable fallback list".

Changed — output labels

`print_summary` now prints terminology consistent with README:

  Constraining eval: X        →  Hardest eval: X
  Suggested chain:            →  Suggested fallback list:
  column "single-shot"        →  "first-attempt"
  column "escalation"         →  "fallback %"

Programmatic metric names are deliberately unchanged to avoid a
breaking API bump: `single_shot_cost`, `single_shot_latency_ms`,
`escalation_rate`. A `hardest_eval` alias is added to
`RetryOptimizer::Result` for the narrative accessor.

Two spec assertions updated; full suite 1341 examples, 0 failures.

Docs — optimizing_retry_policy.md

Rewritten from 17.7k to 6.4k characters, same radical-cut style as
the README pass. Continues the `SummarizeArticle` narrative from
README rather than introducing ClassifyThread / MyStep placeholders.

Structural fixes from two rounds of codex review:
- Offline mode repositioned as a wiring check (every candidate
  returns the same sample_response score), real optimization via
  `LIVE=1 RUNS=3` as the primary command.
- Sample outputs captured from an actual run against Test adapter
  so the format matches what `print_summary` really prints, not a
  plausible-looking invention.
- "Suggested fallback list" rows annotated with "Order matters" so
  two entries don't read as options rather than a chain.
- "Manual procedure" / duplicated troubleshooting / gpt-5-specific
  reasoning-effort case studies cut — moved to follow-up docs if
  ever needed.
- `Programmatic API names` section at the end names the metrics on
  Report / AggregatedReport so Kasia-style readers don't feel the
  guide is inconsistent with the code.
…uplication

Guide was 8.7k chars using a separate ClassifyTicket case study, with
three sections — Structured Prompts, Dynamic Prompts, and "Already
using ruby_llm?" — that either belonged in other guides or duplicated
content the freshly-rewritten README now carries.

Changes

Narrative continuity with README. All examples use `SummarizeArticle`
(the flagship step from README) with the same schema and validates.
The walkthrough expands the README example by layering on `max_input`,
`max_output`, `max_cost`, `define_eval`, `run_eval`, `save_baseline!`,
and `pass_eval` matchers.

Section order reshuffled so CI gating reads first. README links to
this guide as "CI regression gates" and "Budget caps"; the Evals and
CI gates section now comes before Budget caps so the primary link
target lands on what the reader clicked in for.

Removed sections:
- Structured Prompts / Dynamic Prompts — delegated to prompt_ast.md.
- "Already using ruby_llm?" — the new README boundary table + knockout
  paragraph cover this, better.
- Reasoning effort — niche, not essential for Getting Started.
- "Model priority" explanation paragraph — redundant with retry_policy
  semantics.

Empirically verified. `tmp/verify_getting_started.rb` instantiates
SummarizeArticle with every feature shown in the guide, runs the
"smoke" eval end-to-end, exercises `run_eval`, `print_summary`, and
`trace[:attempts]`. All pass against the real adapter. The
`trace[:attempts]` example was updated to reflect the real hash
shape (includes cost, latency_ms; usage abbreviated with "...").

Terminology aligned with README and optimizing_retry_policy.md:
"escalate to a smarter model" → "fallback".

Size: 8,723 → 6,146 chars (−30%).

One round of codex review (mid/senior Ruby dev persona): verdict
ONE OR TWO TWEAKS, three fixes applied — realistic eval sample
(input/output aligned), Evals-before-Budget section order, `print_summary`
call added after `run_eval`.

Full spec suite: 1341 examples, 0 failures.
The "Supported constraints" table documented DSL keywords in camelCase
(`minLength`, `maxLength`, `minItems`, `maxItems`, `additionalProperties`),
which matches the JSON Schema spec but **not** the `ruby_llm-schema` DSL.
The DSL accepts snake_case (`min_length`, `min_items`, etc.) and converts
to JSON Schema camelCase internally before sending to the provider.

Every code example that copy-pasted from the previous table would have
raised `ArgumentError` when the schema was built. Changed to snake_case
across the table, added a short note on the internal camelCase conversion
so readers who recognize the JSON Schema names aren't confused.

Verified: `tmp/verify_schema_dsl.rb` builds a schema with `min_length`,
`max_length`, `min_items`, `max_items` and every one round-trips to the
expected `minLength` / `minItems` in the emitted JSON Schema.

Companion audit: `docs/guide/prompt_ast.md` checked for the same class
of issue. `input_type Types::Hash.schema(...)` + `Types::String` etc.
all build successfully. No changes needed there.

Copilot AI left a comment

Pull request overview

This PR batches a 0.7.2 release bump plus a set of guide/terminal-output terminology alignments, and fixes a docs DSL mismatch that previously caused copy-paste failures.

Changes:

  • Bump gem version to 0.7.2 and add a corresponding changelog entry.
  • Rename terminal output labels in retry optimization/model comparison output (non-breaking) and add a hardest_eval alias for constraining_eval.
  • Rewrite/streamline guides (optimizing_retry_policy, getting_started) and fix output_schema docs to use the correct snake_case DSL keywords.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Updates spec expectations for the new printed labels. |
| lib/ruby_llm/contract/version.rb | Version bump to 0.7.2. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Adds hardest_eval alias and updates printed labels in summaries. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Renames production-mode table headers and updates formatting widths. |
| docs/guide/output_schema.md | Fixes documented DSL constraint keyword casing (snake_case) and explains conversion. |
| docs/guide/optimizing_retry_policy.md | Rewrites/shortens the guide and aligns terminology/output samples with new labels. |
| docs/guide/getting_started.md | Rewrites/shortens the guide around a consistent SummarizeArticle narrative and updated examples. |
| Gemfile.lock | Updates locked gem version to 0.7.2. |
| CHANGELOG.md | Adds 0.7.2 entry describing the changes. |

justi added 2 commits April 23, 2026 01:56
…ith getting_started

Reduced 6.3k to 5.0k chars. Every example now uses SummarizeArticle
(README flagship step) instead of ClassifyTicket. Team workflow section
shortened to 5 one-line bullets with a link to getting_started.md for the
full matcher chain — the previous version duplicated the setup steps.

Kept intact: Core Rule, Three eval kinds (smoke/regression/ab),
sample_response caveat, few-shot note, model-selection-after-prompt-stability
ordering, Short Version. These are the philosophy bits that belong in
eval-first specifically.

API calls verified against the code: compare_with method and the
compared_with matcher chain are both real. compared_with is past-tense on
purpose so it reads naturally in RSpec as "was compared with OldPrompt".

Part of the guides-batch-rewrite PR, same narrative pass as getting_started,
optimizing_retry_policy, output_schema.
… getting_started + eval_first

Reduced 10.7k to 7.4k chars. Every example now uses SummarizeArticle
(README flagship step) instead of the previous mix of ClassifyIntent /
ClassifyTicket / EvaluatePersona / EvaluateComparative.

Kept intact (unique to testing guide):
- Test Adapter (String / Hash / Array / sequential responses)
- "Output keys are always symbols" caveat
- stub_step / stub_steps / stub_all_steps full reference with block form
- RSpec setup, Minitest equivalent
- satisfy_contract + pass_eval matchers (with chain cross-referenced to
  getting_started for the full surface)
- Offline vs Online eval decision table
- Inspecting failures (Report API)
- Soft observations (observe blocks)
- Baseline file format reference

Cut / compressed with links back to proper homes:
- Threshold-Based Gating -> getting_started has the matcher chain
- Rake Task configuration -> getting_started
- Baseline Regression Detection walkthrough -> getting_started + eval_first
- Prompt A/B Testing walkthrough -> eval_first
- Per-section long narratives trimmed to what a test author actually
  needs to know

Part of the guides-batch-rewrite PR, same narrative pass as
optimizing_retry_policy, getting_started, eval_first, output_schema.

Full spec suite: 1341 examples, 0 failures.
…st_practices, pipeline, migration

P3 sanity pass over the three remaining guides. Minimal structural changes;
the goal was terminology consistency with README + optimize and the
SummarizeArticle case where appropriate.

best_practices.md (3.5k unchanged in size):
- Section 6 renamed "Model escalation" to "Model fallback" so it matches
  README narrative and the Optimizing retry_policy guide.
- Commentary about fixed 90/9/1 attempt distribution removed — invented
  numbers not backed by data, same reason the similar line was cut from
  README.
- Summary table updated (last row: "Cost optimization via model fallback").
- Kept AnalyzeCompetitor / target_lang / priority-body examples. This guide
  is a reference of validate patterns; diverse examples are deliberate
  rather than a SummarizeArticle monoculture.

pipeline.md (4.2k nearly unchanged):
- Pipeline eval example: TicketPipeline renamed to MeetingFollowUp so the
  class actually exists earlier in the guide (previously referenced a
  pipeline never defined).
- "See also" section added with links to testing.md (pipeline-level
  adapters) and optimizing_retry_policy.md (per-step fallback).
- MeetingFollowUp case kept — pipelines need multiple steps; SummarizeArticle
  is single-step so it would fight the topic.

migration.md (4.6k -> 5.4k):
- ClassifyTicket replaced with SummarizeArticle across every example.
  The original "classify ticket" case was already called out in README
  feedback as a fabricated case study; migration guide inherited that
  baggage. Before/After diff now shows a real article-summary service
  getting wrapped in a contract.
- Eval cases rewritten against tone (analytical vs negative) matching
  the schema SummarizeArticle actually ships with.
- compare_models call uses candidates: keyword for consistency with
  Optimizing retry_policy (models: is still supported).
- Added "See also" links at the bottom to getting_started, testing,
  eval_first so each migration step lands near the full reference.

Full spec suite: 1341 examples, 0 failures.

Copilot AI left a comment

Pull request overview

This PR bumps ruby_llm-contract to v0.7.2 and aligns retry-optimization output terminology with the README while batching several documentation guide rewrites/fixes (including a critical copy/paste DSL correction in output_schema).

Changes:

  • Bump gem version to 0.7.2 (code + lockfile) and add a hardest_eval alias for RetryOptimizer::Result.
  • Rename terminal/summary labels (Hardest eval, Suggested fallback list, first-attempt, fallback %) and adjust formatting to match.
  • Rewrite/refine multiple guides to consistently use the SummarizeArticle narrative and fix the schema DSL constraint keywords to snake_case.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lib/ruby_llm/contract/version.rb | Version bump to 0.7.2. |
| Gemfile.lock | Lockfile version update to 0.7.2. |
| CHANGELOG.md | Adds 0.7.2 entry describing label changes + one guide rewrite. |
| lib/ruby_llm/contract/eval/retry_optimizer.rb | Adds hardest_eval alias and updates printed labels. |
| lib/ruby_llm/contract/eval/model_comparison.rb | Updates production-mode table headers and column widths. |
| spec/ruby_llm/contract/eval/retry_optimizer_spec.rb | Updates assertions for renamed printed labels. |
| docs/guide/optimizing_retry_policy.md | Major rewrite focused on SummarizeArticle + updated terminology. |
| docs/guide/getting_started.md | Rewrite focused on SummarizeArticle walkthrough and CI gating. |
| docs/guide/testing.md | Rewrite focused on SummarizeArticle, stubbing/matchers, and mode table. |
| docs/guide/eval_first.md | Refine examples to SummarizeArticle and streamline workflow section. |
| docs/guide/output_schema.md | Fixes documented constraint keywords to snake_case + clarifies camelCase conversion. |

Comment thread docs/guide/getting_started.md Outdated
# Getting Started

## When you need more
The README shows a minimal `SummarizeArticle` step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your OpenAI account, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.
Comment thread CHANGELOG.md Outdated
Comment on lines +3 to +8
## 0.7.2 (2026-04-22)

### Changed

- **Terminal output labels renamed for consistency with README narrative.** `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
- **`docs/guide/optimizing_retry_policy.md` rewritten.** Reduced from 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode now clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match actual `print_summary` format. Terminology aligned with the new labels.
Comment thread docs/guide/testing.md Outdated
Comment on lines 37 to 38
extract: { decisions: [...] },
analyze: { analyses: [...] },
Comment thread docs/guide/testing.md Outdated
Comment on lines +103 to +104
SummarizeArticle => { response: { ... } },
RelatedArticles => { response: { ... } }
…ding + changelog coverage

Four inline findings from two Copilot review rounds; all addressed:

1. docs/guide/getting_started.md:3 — "OpenAI account" was provider-specific
   in a guide that inherits the README's "any ruby_llm provider" promise.
   Reworded to "LLM provider budget".

2. CHANGELOG.md — 0.7.2 entry listed only the optimizing_retry_policy
   rewrite. Added a Documentation subsection enumerating all five guide
   rewrites plus the output_schema DSL bug fix and the best_practices /
   pipeline / migration sanity pass, so release notes match what shipped.

3. docs/guide/testing.md:38 — the pipeline.test example used `decisions:
   [...]` and `analyses: [...]`. `...` is not valid Ruby inside an array
   literal; copy-paste would raise SyntaxError. Replaced with minimal
   realistic array entries matching each step's schema from pipeline.md.

4. docs/guide/testing.md:104 — stub_steps example had `response: { ... }`.
   Same issue as #3. Replaced with minimal response hashes matching the
   SummarizeArticle schema and a plausible RelatedArticles schema.
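
Findings 3 and 4 can be reproduced with a small sketch, assuming MRI (it relies on `RubyVM::InstructionSequence.compile`, which only parses and compiles the source without running it). The `parses?` helper and the sample strings are hypothetical, not taken from the guides:

```ruby
# Hypothetical helper: true when the snippet is syntactically valid Ruby.
# MRI-only, because RubyVM is not available on other Ruby implementations.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

placeholder_array = 'x = { decisions: [...] }'
placeholder_hash  = 'x = { response: { ... } }'
realistic_array   = 'x = { decisions: ["Ship the release on Friday"] }'

puts parses?(placeholder_array)
puts parses?(placeholder_hash)
puts parses?(realistic_array)
```

A check like this could run over every fenced snippet in a guide to catch placeholder ellipses before they reach readers.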

Full spec suite: 1341 examples, 0 failures.
@justi
Owner Author

justi commented Apr 22, 2026

Four Copilot inline findings addressed in commit 7c968e9:

  1. getting_started.md:3 — 'OpenAI account' → 'LLM provider budget' (provider-agnostic, consistent with README's multi-provider claim).
  2. CHANGELOG.md:8 — 0.7.2 entry expanded with a Documentation subsection covering all five guide rewrites + output_schema DSL bug fix + P3 sanity pass.
  3. testing.md:38 — replaced [...] placeholders in pipeline.test example with realistic minimal arrays matching each step's schema.
  4. testing.md:104 — replaced { ... } placeholders in stub_steps example with minimal response hashes matching SummarizeArticle schema.

Both testing.md snippets now copy-paste without SyntaxError.

…es + empirical verification

Earlier passes left four guides on different cases (MeetingFollowUp in
pipeline, AnalyzeCompetitor in best_practices, intent/confidence in
output_schema, GenerateComment in prompt_ast). Per user directive
"only and exclusively the example from the current readme.md", this commit
rewrites all four to extend SummarizeArticle.

Part A — case alignment

- output_schema.md
  Every example now builds on SummarizeArticle's schema (tldr / takeaways /
  tone). Nested-objects-in-arrays section demonstrates the case of
  attaching confidence per takeaway. "Why schema alone isn't enough"
  shows three realistic validates: UI-card length guard, uniqueness,
  and a cross-field rule ("negative tone requires at least one concrete
  risk").

- prompt_ast.md
  Hash-input + interpolation example is now a SummarizeArticle variant
  that accepts article / audience / language. Cross-validate examples
  show SummarizeArticle-specific guards (tldr not just the article
  reprinted; no takeaway repeats the TL;DR).

- best_practices.md
  All six validate patterns reframed around SummarizeArticle: empty /
  placeholder guards, length-based cross-validate, conditional tone /
  takeaways rule, content quality, pipeline carry-through, model
  fallback. The diverse-examples exception I argued for earlier doesn't
  hold against the "one case throughout" directive.

- pipeline.md
  MeetingFollowUp replaced with a three-step ArticleCardPipeline built
  around the README step: SummarizeArticle → GenerateHashtags →
  BuildArticleCard. Each step has its own schema and at least one
  validate that refers to the previous step's output. Per-step-override
  and eval sections renamed accordingly.

- testing.md
  Updated the two pipeline.test / stub_steps examples to use the new
  ArticleCardPipeline class names and realistic response shapes matching
  each step's schema (was MyPipeline / ArticlePipeline).

Part B — empirical verification

tmp/verify_all_guides.rb now exercises every pattern shown in the
guides: SummarizeArticle build + run with Test adapter, smoke eval with
sample_response, snake_case schema constraints round-trip, nested
objects schema, Hash-input prompt variant, three-step
ArticleCardPipeline build, cross-validate blocks, Test adapter array
responses, and the migration-form variant. 11/11 pass. The file is
the regression gate if any future edit breaks a snippet.

bundle exec rspec stays green: 1341 examples, 0 failures.

Size impact (this commit only):
  output_schema.md  3.4k -> 3.5k
  prompt_ast.md     1.5k -> 2.1k
  best_practices.md 3.5k -> 3.3k
  pipeline.md       4.3k -> 4.5k
  testing.md        7.4k -> 7.4k
@justi
Owner Author

justi commented Apr 22, 2026

Addressed the full directive: all 8 guides extend SummarizeArticle, and every snippet is empirically verified.

Part A — case alignment across all guides

The four guides that were still on other cases now extend SummarizeArticle:

| Guide | Was | Now |
| --- | --- | --- |
| output_schema.md | intent / confidence / groups | SummarizeArticle schema + nested confidence per takeaway |
| prompt_ast.md | GenerateComment (reddit) | SummarizeArticle Hash-input variant (article / audience / language) |
| best_practices.md | AnalyzeCompetitor + diverse mini-cases | all 6 validate patterns framed around SummarizeArticle |
| pipeline.md | MeetingFollowUp (extract / analyze / email) | ArticleCardPipeline: SummarizeArticle → GenerateHashtags → BuildArticleCard |

testing.md's pipeline.test / stub_steps examples updated to match the new ArticleCardPipeline class names and realistic response shapes.

Part B — empirical runner

tmp/verify_all_guides.rb exercises every pattern shown in the guides:

  • SummarizeArticle build + run with Test adapter
  • Smoke check with sample_response
  • snake_case schema constraints round-trip to JSON Schema camelCase
  • Nested objects in arrays
  • Hash-input prompt variant with interpolation
  • Three-step ArticleCardPipeline build + steps with 2-arity validates
  • Cross-validate blocks (length + conditional tone rule)
  • Test adapter array responses
  • Migration-form variant (model DSL + prompt do block)

11 / 11 green on bundle exec ruby -Ilib tmp/verify_all_guides.rb. The runner is a regression gate if any future edit breaks a snippet.
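
The shape of such a runner can be sketched as follows. This is not the content of tmp/verify_all_guides.rb, which is not shown in this PR. The check names and lambdas are hypothetical and use only plain Ruby, so the sketch runs without the gem:

```ruby
# Hypothetical checks: each entry maps a label to a lambda returning true or false.
CHECKS = {
  "sample array has three takeaways" => -> { %w[a b c].size == 3 },
  "snake_case key becomes camelCase" => lambda {
    "min_length".gsub(/_(\w)/) { Regexp.last_match(1).upcase } == "minLength"
  },
  "cost is summed across attempts" => -> { [1, 2].sum == 3 }
}.freeze

# Runs every check, treats an exception as a failure, and returns [passed, total].
def run_checks(checks)
  results = checks.map do |name, check|
    ok = begin
      check.call ? true : false
    rescue StandardError
      false
    end
    puts format("%-40s %s", name, ok ? "ok" : "FAIL")
    ok
  end
  [results.count(true), results.size]
end

passed, total = run_checks(CHECKS)
puts "#{passed} / #{total} green"
# A real gate would call `exit 1` here when passed < total.
```

Rescuing inside the loop keeps one broken snippet from hiding the status of the others.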

bundle exec rspec stays green: 1341 examples, 0 failures.

justi and others added 4 commits April 23, 2026 02:35
…-standard terms

Searched the guides for terms that might read as invented hybrids rather
than established industry vocabulary. Found two instances worth fixing:

1. optimizing_retry_policy.md:100 said "escalating 60% of the time". The
   rest of the guide uses "fallback" (matching README + code output). One
   leftover "escalating" verb clashed with the narrative. Changed to
   "falling back 60% of the time".

2. output_schema.md used "structural validates" as a compound noun in
   two places (header + intro bullet). "Validates" as a noun is Rails
   idiom for validate blocks; the "structural" modifier reads as
   invented. Replaced with "type and shape checks" — plain English,
   same meaning.

Terms checked and kept as-is (not invented, established in the LLM / Ruby
ecosystem):

- fallback / fallback list / fallback rate  — LangChain, OpenAI cookbook
- first-attempt cost                         — clear compound, no hybrid
- hardest eval                               — plain English replacement
                                               for "constraining eval"
- eval-first                                  — established in OpenAI /
                                               Anthropic / Braintrust docs
- Prompt AST                                  — AST is standard CS; title
                                               kept for URL stability
- wiring check / smoke check                  — systems-testing idiom
- flywheel                                    — mainstream startup term
- quality gate                                — CI/CD standard
- preflight check                             — standard operations term
- runaway inputs                              — idiomatic compound

Empirical runner tmp/verify_all_guides.rb — still 11/11 green; text-only
changes do not break any snippet.

Full spec suite — 1341 examples, 0 failures.
Widened the jargon audit from /docs/guide/ to every tracked markdown
file. Four real inconsistencies found and fixed.

1. retry_optimizer.rb:23 + matching spec + guide output sample

   print_summary printed "#{step} — retry chain optimization". Every
   other line in the same summary uses "fallback" language now
   (Hardest eval, Suggested fallback list). The header still said
   "retry chain" — a hybrid left over from the old terminology.
   Changed to "— fallback list optimization". Updated the spec that
   asserted the old string, and the output sample in
   optimizing_retry_policy.md.

2. optimizing_retry_policy.md:63 — "constraining row"

   Inside the `←` marker explanation, we still referred to the
   "constraining row". The outer sentence already says "the hardest
   eval", so the compact compound was restated as plain-English
   "the row that matters most".

3. docs/architecture.md:9 — "retry with model escalation"

   The module tree listed Step::RetryExecutor with a comment that
   still described it as "retry with model escalation". Narrative
   across README, guides, and CHANGELOG 0.7.2 has switched to
   "fallback". The comment is narrative, not an API name, so it
   was updated to match.

4. examples/README.md — 12 × "invariant"

   The examples README called validate blocks "invariants" in twelve
   places (table rows, section descriptions). `invariant` IS a real
   alias for `validate` in the DSL, but the README and all the
   guides use "validate" as the primary term. The examples README
   was the outlier. Normalized to "validate" / "validates" /
   "validate blocks" for consistency; left the `invariant` method
   name intact in code examples that live in examples/*.rb files.

Terms checked across all .md files and kept deliberately:

- "validates" as Rails-idiomatic noun (plural of the DSL method)
- "AST" (standard CS term) for Prompt AST
- "CI regression gate", "baseline", "regression test"
- "fallback" / "fallback list" / "fallback rate"
- "hardest eval" (plain English; `constraining_eval` is the preserved
  struct field name, still referenced in API documentation)
- "wiring check", "smoke check", "preflight check"
- "flywheel", "quality gate"

Historical CHANGELOG entries (0.6.x and earlier) deliberately left
alone — they describe the code as it was named at the time.

Empirical: tmp/verify_all_guides.rb still 11/11 green.
Full spec suite: 1341 examples, 0 failures.
Codex did an honest pass over every tracked .md and the matching code.
All 7 findings were real bugs I had missed, not nitpicks.

1. README.md:60 — Step.recommend signature was wrong.
   Old: Step.recommend(candidates:, min_score:)
   New: Step.recommend("regression", candidates: [...], min_score: 0.95)
   The first positional arg (eval_name) is required and was missing, so
   a copy-paste would have raised ArgumentError.

2. README.md:82 — Roadmap line still said "Latest: v0.7.1".
   CHANGELOG is on 0.7.2. Updated the line to describe the 0.7.2 work
   (terminal labels + guides alignment + output_schema DSL fix).

3. docs/guide/output_schema.md:6 — "the model is forced to return JSON"
   is too strong and contradicts getting_started.md which honestly says
   with_schema is a request that cheaper models can ignore. Softened to
   "asking the model to return JSON", with the client-side-validation
   reason for keeping point 1 spelled out.

4. docs/guide/testing.md:202 — observe example was unrunnable. It had
   no prompt, no output_schema, no adapter, no response. Expanded into
   a complete runnable CompareArticles step with integer schema,
   validate + observe, and a Test adapter that returns two equal
   scores so the observation actually fires as demonstrated.

5. docs/guide/eval_first.md:40 — sample_response({ takeaways: [...] })
   is a Ruby SyntaxError. Replaced with a realistic 3-item array
   matching SummarizeArticle's schema.

6. examples/README.md — stale inventory:
   - Removed 06_reddit_promo.rb section (file does not exist).
   - Removed 06_reddit_promo references from the "Running" block.
   - Added 09_eval_dataset.rb and 10_reddit_full_showcase.rb sections,
     matching the actual examples/*.rb files.
   - Updated the "no API keys needed" footnote accordingly.

7. docs/architecture.md:40 — RubyLLM::Contract::CI namespace does not
   exist in the codebase. RakeTask and Railtie live directly under
   RubyLLM::Contract. Tree diagram updated to reflect reality.
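
Finding 1 can be reproduced in plain Ruby. The sketch below uses a hypothetical `recommend` method with the same shape as the corrected call (one required positional argument plus two required keywords). It is not the gem's `Step.recommend`, and the model names are placeholders.

```ruby
# Hypothetical stand-in with the corrected signature shape.
def recommend(eval_name, candidates:, min_score:)
  { eval_name: eval_name, candidates: candidates, min_score: min_score }
end

# Old README form: the positional eval name is missing.
old_call_error =
  begin
    recommend(candidates: ["model-a"], min_score: 0.95)
    nil
  rescue ArgumentError => e
    e
  end

# Corrected form: eval name first, then keywords.
new_call = recommend("regression", candidates: ["model-a", "model-b"], min_score: 0.95)

puts "old call raised: #{old_call_error.class}"
puts "new call eval:   #{new_call[:eval_name]}"
```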

Empirical runner (tmp/verify_all_guides.rb) still 11/11 green.
Full spec suite: 1341 examples, 0 failures.
…rrency, around_call

Codex final-review follow-up: document API surface that was shipped but
not mentioned in the guides batch.

- getting_started.md — estimate_cost / estimate_eval_cost preflight examples
  and on_unknown_pricing: :warn vs default :refuse under Budget caps.
- eval_first.md — run_eval(..., concurrency:) section and
  estimate_eval_cost for CI budgeting across candidate models.
- testing.md — around_call assertion example (fires once per run, receives
  final Result after retry fallback).

All 17 checks in local verify_all_guides.rb pass. 1341 specs, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e cases

Audit across all 9 guides: README, getting_started, optimizing_retry_policy,
pipeline, migration already had strong business scenes. Four guides opened
cold or were missing "what breaks in prod" consequences.

- eval_first.md — open with concrete production incident (customer success
  filter breaking because outage complaints are labelled analytical). Turns
  the abstract "prompt-by-feel" warning into a scene the reader recognises.
- prompt_ast.md — explain WHY AST over strings via a real multi-tenant /
  multi-language newsletter scenario; adds business framing to the Hash-input
  variable-interpolation section too.
- best_practices.md — add one-line "why it matters" to every validate
  section. Empty output => broken UI card. Cross-validate => catches lazy
  models echoing input. Conditional logic => prevents silent routing breaks.
  Content quality => leaked placeholders embarrass in front of users.
  Model fallback => 80/20 cost math made concrete.
- output_schema.md — nested-objects example now anchored to a UI "confidence
  bar" feature instead of a bare schema shape.
- testing.md — open with CI speed/cost/flake cost to justify Test adapter.

Prose-only changes. All 17 verify_all_guides.rb checks still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justi justi added the documentation Improvements or additions to documentation label Apr 22, 2026
@justi justi self-assigned this Apr 22, 2026

Copilot AI left a comment


Pull request overview

Batch release/docs PR for 0.7.2 that aligns retry-optimization terminology with the README narrative (“fallback”), updates terminal summary/table labels (non-breaking), and rewrites multiple guides around the SummarizeArticle case while fixing a copy/paste-blocking output_schema DSL doc bug.

Changes:

  • Bump gem version to 0.7.2 and update release notes/lockfile references.
  • Rename print_summary/production-mode table labels to “fallback” terminology and add hardest_eval alias on RetryOptimizer::Result.
  • Rewrite/trim guides for narrative continuity and fix output_schema constraint keyword casing in docs.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
spec/ruby_llm/contract/eval/retry_optimizer_spec.rb Updates assertions for the renamed terminal labels.
lib/ruby_llm/contract/version.rb Version bump to 0.7.2.
lib/ruby_llm/contract/eval/retry_optimizer.rb Renames printed labels and adds hardest_eval alias.
lib/ruby_llm/contract/eval/model_comparison.rb Renames production-mode column headers and adjusts formatting widths/separator.
examples/README.md Terminology updates (“validate blocks”) and example index updates (adds 09/10, removes 06).
docs/guide/testing.md Rewrites testing guide examples around SummarizeArticle, matcher chain, and related references.
docs/guide/prompt_ast.md Updates prompt AST examples to SummarizeArticle and expands interpolation/2-arity validate guidance.
docs/guide/pipeline.md Reframes pipeline guide as SummarizeArticle → hashtags → card, adds “See also”.
docs/guide/output_schema.md Fixes snake_case constraint keywords and rewrites schema guidance/examples around SummarizeArticle.
docs/guide/optimizing_retry_policy.md Major rewrite emphasizing “fallback list” workflow and live optimization.
docs/guide/migration.md Rewrites migration walkthrough around SummarizeArticle, adds “See also”.
docs/guide/getting_started.md Reorders/rewrites walkthrough with evals/CI gating first, then budget caps and estimates.
docs/guide/eval_first.md Refines eval-first philosophy and examples using SummarizeArticle.
docs/guide/best_practices.md Terminology alignment (fallback) + updated validate patterns and examples.
docs/architecture.md Updates terminology and reflects RakeTask/Railtie as top-level constants.
README.md Updates “Most useful next” callout and roadmap line for 0.7.2.
Gemfile.lock Locks gem version to 0.7.2.
CHANGELOG.md Adds 0.7.2 entry summarizing label changes and doc rewrites.


Comment on lines +64 to 66
validate("no markdown headings in the TL;DR") do |o, _|
!o[:tldr].match?(/^\#{1,6}\s/)
end
| `{"takeaways": [{"text": "...", "confidence": 0.9}]}` | Array of objects | `array :takeaways do; object do; string :text; number :confidence; end; end` |

- The schema tells the LLM provider **exactly** what JSON structure to return. Without `object do...end`, `array :groups do; string :who; end` tells the provider "groups is an array of strings" — and that's what you get back.
+ Without `object do...end`, `array :takeaways do; string :text; end` tells the provider "takeaways is an array of strings" — not objects. That's what you get back.
Owner Author


Verified empirically — the current wording is correct. Test:

s = RubyLLM::Schema.create do
  array :keywords do
    string :keyword
    number :probability
  end
end
s.new.to_json_schema[:schema][:properties][:keywords][:items]
# => { "type" => "string" }   ← first child wins; number :probability is ignored

Without object do...end the items type becomes the first declared primitive, not a compound object. This is the pitfall the guide documents and that spec/ruby_llm/contract/nested_schema_spec.rb:71 explicitly tests ("WRONG: array without object wrapper produces flat string items"). examples/07_keyword_extraction.rb:30 has the same bug in the wild — separate cleanup. No change needed here.

Comment thread docs/guide/getting_started.md Outdated
```

1. **Model escalation with quality gate.** Start every request on nano ($0.10/M tokens). When `validate` catches a bad answer, auto-retry on mini ($0.40/M), then full ($2.00/M). 90% of requests succeed on nano. At 10k requests/month: ~$40 instead of ~$200.
Returns `nil` for a single call (or `0.0` summed) when pricing isn't registered — same failure mode as `max_cost`.
Comment thread docs/guide/pipeline.md Outdated
Comment on lines +93 to +96
result = ArticleCardPipeline.run("")
result.failed? # => true
result.failed_step # => :summarize (empty input fails schema / validate → stops here)
# tag and card never run — no downstream tokens spent on garbage
justi and others added 2 commits April 23, 2026 08:45
…-based docs index + TL;DR boxes

Real-user feedback: adopters have trouble recognising what problems the
gem solves. Diagnosis: docs started from *how* (API walkthrough) not *why*
(production failure modes). Plan consulted with codex, sharpened scope.

Four coordinated workstreams, all constrained by "README stays short":

A) New docs/guide/why.md (failure gallery, sharp not exhaustive)
   - 1 paragraph framing
   - 4 fully worked failure cards: schema-valid-logically-wrong, silent
     prompt regression, refusal-as-valid-JSON, runaway cost / no fallback
   - 2 code samples (not 7) — codex: seven narratives would feel like
     a second sales README
   - 3 "also catches" bullets (leaked placeholder, input echo, tone drift)
   - failure → contract mechanism table
   - exit ramps to getting_started / migration / eval_first

B) README micro-additions (zero bloat)
   - "Do I need this?" 3-sentence prose block after Install, before Example
     (codex pushback: Q&A sprawls, prose stays tight)
   - Reading-order hint: README → why.md → getting_started.md

D) Outcome-based docs index in README (codex's added workstream — "may
   matter more than TL;DR boxes" for value recognition)
   - Renamed table column from blank to "What it does for your app"
   - Implementation-centric guide labels replaced with outcome labels:
     * "Eval-First" → "Prevent silent prompt regressions"
     * "Optimizing retry_policy" → "Control retry cost and fallback behaviour"
     * "Best Practices" → "Write validate rules that catch real bugs"
     * "Testing" → "Stub LLM calls in tests"
     * "Migration" → "Adopt in an existing Rails app"
     * "Pipeline" → "Chain LLM calls into a pipeline"
   - why.md listed first; getting_started.md second

C) TL;DR one-liner blockquote at top of every guide (9 guides)
   - Compact single sentence ("Read this when X.")
   - "Skip if Y" added only to eval_first / testing / migration per codex
     (elsewhere adds visual clutter without value)

Memory saved for future conversations: adopters don't recognise the
problem → docs must lead with failure scenarios.

All 17 verify_all_guides.rb checks still pass. 1341 specs, 0 failures.
Prose and markdown only; zero Ruby snippet changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged 4 issues on the adoption-friction commit; 3 are real, 1 is
a false positive.

- best_practices.md — replace String#exclude? + rescue hack with plain
  Ruby !include?. exclude? is ActiveSupport-only; snippet now works in
  non-Rails apps.
- getting_started.md — correct estimate_eval_cost description. Unlike
  max_cost (fail-closed / :warn), estimate_eval_cost silently sums
  unknown-pricing cases as $0.00, so the previous "same failure mode"
  framing was misleading. Now flagged as a floor, not a guarantee.
- pipeline.md — fail-fast snippet previously claimed "empty input fails
  schema / validate", but the SummarizeArticle definition above has no
  min_length on tldr. Replaced with a stubbed-adapter scenario where the
  TL;DR exceeds the "fits the card" validate — actually demonstrable.
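
The getting_started.md point above (unknown pricing summed as $0.00, so the total is a floor) can be sketched in plain Ruby. The cost table and helper names are hypothetical and do not reproduce the gem's `estimate_eval_cost` implementation.

```ruby
# Hypothetical per-case estimates in USD; nil means pricing is not registered.
case_costs = {
  "short_article" => 0.002,
  "long_article"  => 0.005,
  "unpriced_case" => nil
}

# Unknown prices contribute 0.0, so the sum can only understate the real cost.
def estimated_floor(costs)
  costs.values.sum { |cost| cost || 0.0 }
end

# Listing the unpriced cases makes the gap visible instead of silent.
def unpriced_cases(costs)
  costs.select { |_name, cost| cost.nil? }.keys
end

floor = estimated_floor(case_costs)
missing = unpriced_cases(case_costs)
puts format("estimate: at least $%.4f (%d case(s) unpriced)", floor, missing.size)
```

Reporting the unpriced cases next to the total is one way to keep a reader from treating the floor as a guarantee.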

Replied to the fourth comment (output_schema.md) inline on the PR with
an empirical counter-example: `array :keywords do; string :keyword; number
:probability; end` produces `items: {type: "string"}` in JSON Schema (first
child wins; the number declaration is silently ignored). The current
wording is correct and matches `nested_schema_spec.rb:71` ("WRONG: array
without object wrapper produces flat string items").

verify_all_guides.rb grew from 17 to 18 checks: added empirical proof
that the new pipeline.md fail-fast snippet actually stops at :summarize.
1341 specs pass.
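
A rough plain-Ruby sketch of the fail-fast behavior that the new check verifies: when the first step's validate rejects an over-long TL;DR, later steps never run. The lambdas, the 40-character limit, and `run_fail_fast` are hypothetical and do not use the gem's pipeline API.

```ruby
calls = []

# Hypothetical steps; each returns [ok, output].
steps = {
  summarize: ->(text) { calls << :summarize; [text.length <= 40, text] },
  tag:       ->(text) { calls << :tag;       [true, "#{text} #news"] },
  card:      ->(text) { calls << :card;      [true, "[card] #{text}"] }
}

# Run steps in order and stop at the first one that fails.
def run_fail_fast(steps, input)
  value = input
  steps.each do |name, step|
    ok, value = step.call(value)
    return { failed_step: name, output: nil } unless ok
  end
  { failed_step: nil, output: value }
end

too_long_tldr = "x" * 41
result = run_fail_fast(steps, too_long_tldr)
puts "failed step: #{result[:failed_step].inspect}"
puts "steps that ran: #{calls.inspect}"
```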

No version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justi justi merged commit 77a6c2d into main Apr 23, 2026
1 check passed
justi added a commit that referenced this pull request Apr 23, 2026
… variants (#22)

Lowest-friction entry point for evaluating the gem. Two runnable scripts, zero API keys, with inline expected output so readers see the fallback loop without cloning.

## examples/11_fallback_showcase.rb

Variance-induced tone/takeaways mismatch on gpt-5-nano (where temperature=1.0 is server-enforced) → cross-field validate rejects → retry_policy escalates to gpt-5-mini. Part A shows schema-only (refusal would ship); Part B shows the full contract recovering.

## examples/12_retry_variants.rb

Three retry_policy shapes beyond cross-model escalation:
- A: attempts: 3 on the same model — sampling-variance absorption; replaces the typical begin/rescue/retry loop
- B: reasoning_effort low → medium → high on one model
- C: cross-provider Ollama → Anthropic → OpenAI (local first because it costs nothing; hosted last because it is the most accurate)

All three runnable through the Test adapter so no provider keys are needed.
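
The fallback loop these showcases demonstrate can be approximated in a few lines of plain Ruby. The lambdas stand in for model calls, the model names are placeholders, and the cross-field rule is a hypothetical example. The real scripts use the gem's `retry_policy` and Test adapter instead.

```ruby
# Hypothetical "models": the first returns a tone with no supporting takeaways.
models = {
  "small-model"  => ->(_article) { { tone: "negative", takeaways: [] } },
  "larger-model" => ->(_article) { { tone: "negative", takeaways: ["Checkout was down for two hours"] } }
}

# Hypothetical cross-field rule: a non-neutral tone needs at least one takeaway.
tone_backed_by_takeaways = ->(o) { o[:tone] == "neutral" || !o[:takeaways].empty? }

# Try each model in order and return the first output that passes validation.
def run_with_fallback(models, input, validate)
  attempts = []
  models.each do |name, call|
    output = call.call(input)
    attempts << name
    return { output: output, model: name, attempts: attempts } if validate.call(output)
  end
  { output: nil, model: nil, attempts: attempts }
end

result = run_with_fallback(models, "article text", tone_backed_by_takeaways)
puts "accepted from: #{result[:model]}"
puts "attempts: #{result[:attempts].join(' -> ')}"
```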

## Side updates

- docs/guide/why.md Failure 3 reframed from "refusal as valid JSON" (edge case) to "sampling variance on fixed-temperature models" (universal for gpt-5 / o-series).
- Vocabulary audit: replaced invented compounds (temperature-locked, variance-induced, severity signals, takeaway drift) with industry-standard terms.
- examples/README.md entries for both showcases include abridged expected output inline.
- examples/09_eval_dataset.rb fixed — eval_case returns CaseResult, not Hash; the .passed? / .score / .output accessors are now used instead of [:passed] etc.
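
The `09_eval_dataset.rb` fix comes down to reader methods versus Hash access. A small sketch with a hypothetical `FakeCaseResult` class (not the gem's `CaseResult`) shows why the `[:passed]` form fails on such an object:

```ruby
# Hypothetical stand-in exposing reader methods only, with no #[] method.
class FakeCaseResult
  attr_reader :score, :output

  def initialize(passed:, score:, output:)
    @passed = passed
    @score = score
    @output = output
  end

  def passed?
    @passed
  end
end

case_result = FakeCaseResult.new(passed: true, score: 1.0, output: { tldr: "Short summary" })

# Hash-style access raises because the object does not define #[].
hash_style_error =
  begin
    case_result[:passed]
    nil
  rescue NoMethodError => e
    e
  end

puts "hash-style access raised: #{hash_style_error.class}"
puts "passed? => #{case_result.passed?}, score => #{case_result.score}"
```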

## No version bump

Docs + examples only; gem stays at 0.7.2.

## Reviews addressed

- 4 Copilot review rounds handled (US spelling, REFUSAL_PREFIXES array form, Ollama wording, running-list adapter notes, Part-A label clarity).
- External code review (codex) shaped the refusal-regex narrative and Part A → B framing.
- Branch was rebuilt on clean main after the initial push carried 17 pre-squash commits from #21; final branch is a single commit on top of main.
justi added a commit that referenced this pull request Apr 24, 2026
Adoption-friction release. No runtime behavior changes — every delta is in `docs/`, `examples/`, or `spec/integration/` (plus version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the consolidated runnable showcases, and one extra integration spec.

Consolidates 7 merged PRs (#21 through #27) into one release:

- #21 Guide rewrite + adoption friction (why.md, "Do I need this?", outcome labels, TL;DR boxes)
- #22 Runnable aha-moment showcases (fallback + retry variants)
- #23 architecture.md refresh + docs/ideas untracked
- #24 Schema pitfall fix (5 example files) + expected output coverage
- #25 Examples consolidation — drop Reddit, renumber 00-06, restore pipeline + real-LLM minimal
- #26 Rails integration FAQ guide (7 pre-emptive questions)
- #27 Pipeline-level run_eval coverage — closes the "09 STEP 5" known issue from 0.7.2

Copilot review of the CHANGELOG itself flagged two inaccuracies before merge:
- "No gem-level code changes" replaced with "No runtime behavior changes" so version.rb / Gemfile.lock bumps are not misrepresented.
- Stale `examples/09_eval_dataset.rb` reference updated to current `05_eval_dataset.rb` after the renumber.

Verification: 1287 specs pass, 6/6 test-adapter examples run clean, bundle install resolves 0.7.3.

Full changelog entry on main in CHANGELOG.md.

Labels

documentation Improvements or additions to documentation


2 participants