RFC: custom-grammar tools for the default catalog (gpt-5.5+)

## Context

`2083d6ea` landed the gating infrastructure: any `ToolDefinition` with `grammar: Some(ToolGrammar { ... })` becomes an OpenAI `Tool::Custom` on `gpt-5.5+` and silently falls back to a `Tool::Function` (JSON-schema) on every other model. The plumbing is in place; no default tool currently declares a grammar.
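
To make the gating rule concrete, here is a minimal sketch of the selection logic described above. It is not the code from `2083d6ea`: the fields on `ToolGrammar` and `ToolDefinition`, the `supports_custom_grammar` predicate, and the `to_wire_tool` function are assumptions for illustration, and the real version check may work differently.

```rust
/// Hypothetical stand-ins for the types named above; the real definitions
/// live in the codebase and may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolGrammar {
    pub syntax: String,     // assumed field, e.g. "lark" or "regex"
    pub definition: String, // assumed field: the grammar source text
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToolDefinition {
    pub name: String,
    pub json_schema: String, // assumed: schema kept as a raw string for brevity
    pub grammar: Option<ToolGrammar>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Tool {
    Custom { name: String, grammar: ToolGrammar },
    Function { name: String, json_schema: String },
}

/// Assumed predicate: treats "gpt-<major>.<minor>[-suffix]" with a version
/// of 5.5 or higher as grammar-capable. Everything else is not.
pub fn supports_custom_grammar(model: &str) -> bool {
    let Some(rest) = model.strip_prefix("gpt-") else {
        return false;
    };
    // Take the leading version token, e.g. "5.5" from "5.5-mini".
    let version = rest.split('-').next().unwrap_or("");
    let mut parts = version.split('.');
    let major: u32 = match parts.next().and_then(|p| p.parse().ok()) {
        Some(v) => v,
        None => return false,
    };
    let minor: u32 = parts.next().and_then(|p| p.parse().ok()).unwrap_or(0);
    (major, minor) >= (5, 5)
}

/// Grammar tools become `Tool::Custom` only on capable models; every other
/// combination falls back to the JSON-schema `Tool::Function`.
pub fn to_wire_tool(def: &ToolDefinition, model: &str) -> Tool {
    match &def.grammar {
        Some(g) if supports_custom_grammar(model) => Tool::Custom {
            name: def.name.clone(),
            grammar: g.clone(),
        },
        _ => Tool::Function {
            name: def.name.clone(),
            json_schema: def.json_schema.clone(),
        },
    }
}
```

Under these assumptions, a definition with `grammar: Some(..)` maps to `Tool::Custom` for `gpt-5.5` and to `Tool::Function` for any other model string, while a definition with `grammar: None` always maps to `Tool::Function`.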

The open question: should we retrofit grammars onto existing default tools (e.g. `read`, `fs_search`) so they take advantage of the feature on `gpt-5.5+`?

## Why it's tempting

- **Compactness.** `read foo.rs:10-20` is fewer tokens than `{"file_path":"foo.rs","range":{"start_line":10,"end_line":20}}`. Multiplied across millions of tool calls, this is real latency + cost savings.
- **Token-level constraint.** A grammar prevents the model from emitting invalid syntax at all (it's enforced at decode time), where JSON-schema is *advisory* — the model can still produce malformed JSON we have to reject and retry.
- **Model affinity.** Anecdotal evidence suggests `gpt-5.5+` handles single-expression tool calls more naturally than JSON object construction.

## Why it's not obvious

1. **Multi-provider asymmetry.** Anthropic, Google, Codex, Bedrock, and OpenAI chat-completions don't support custom grammars. Every grammar-enabled tool needs *two* input shapes: a flat grammar form (`gpt-5.5+`) and a JSON form (everything else). That's twice the surface area to test, document, and keep in sync.
2. **Parser cost.** JSON args round-trip via `serde` for free. A grammar tool's text output has to be parsed back into the executor's structured input — every grammar tool is one more hand-written or generated parser, with its own bugs and edge cases.
3. **Brittle to extension.** Adding an optional field to a JSON-schema tool is one line. Adding it to a grammar tool means rewriting the grammar, the parser, and refreshing the test corpus.
4. **No win for already-structured inputs.** Default catalog tools (`read`, `write`, `patch`, `fs_search`, `shell`, `fetch`, `todo_*`) are all multi-field structured calls. JSON-schema is doing real work for them — type-checking integers, enforcing required fields, gating enums. Flattening into one grammar string strictly removes information.
5. **Debuggability.** JSON-schema validation errors are clear. A grammar parse failure mid-token is harder to diagnose, and partial parses can succeed silently with the wrong fields.
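
Points 2 and 5 are easier to weigh with a parser in hand. The following is a rough sketch of what a hand-written parser for the compact `read foo.rs:10-20` form might look like. `ReadArgs` and `parse_read_call` are hypothetical names, and the syntax is only the example used in this RFC, not an agreed format.

```rust
/// Hypothetical structured input for `read`, mirroring the JSON fields
/// `file_path`, `range.start_line`, and `range.end_line` used in this RFC.
#[derive(Debug, PartialEq)]
pub struct ReadArgs {
    pub file_path: String,
    /// (start_line, end_line), both 1-based and inclusive (an assumption).
    pub range: Option<(u32, u32)>,
}

/// Parses `read <path>` or `read <path>:<start>-<end>` into `ReadArgs`.
pub fn parse_read_call(input: &str) -> Result<ReadArgs, String> {
    let rest = input
        .trim()
        .strip_prefix("read ")
        .ok_or_else(|| "expected 'read <path>[:start-end]'".to_string())?
        .trim();
    if rest.is_empty() {
        return Err("missing path".to_string());
    }
    // Split on the LAST ':' so the range suffix is isolated. Note that a
    // path containing ':' with no range (e.g. a Windows drive letter) is
    // rejected here: exactly the kind of edge case JSON never hits.
    match rest.rsplit_once(':') {
        None => Ok(ReadArgs { file_path: rest.to_string(), range: None }),
        Some((path, range)) => {
            if path.is_empty() {
                return Err("missing path before ':'".to_string());
            }
            let (start, end) = range
                .split_once('-')
                .ok_or_else(|| format!("bad range '{range}'"))?;
            let start: u32 = start
                .parse()
                .map_err(|_| format!("bad start line '{start}'"))?;
            let end: u32 = end
                .parse()
                .map_err(|_| format!("bad end line '{end}'"))?;
            if start == 0 || end < start {
                return Err(format!("invalid range {start}-{end}"));
            }
            Ok(ReadArgs { file_path: path.to_string(), range: Some((start, end)) })
        }
    }
}
```

Even this small parser has to decide how to treat a missing range, a reversed range, and paths that themselves contain `:`, none of which come up when `serde` deserializes the JSON form.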

## Where grammar would actually help

Grammar tools shine when *the whole input is one free-form expression*, not when there's structured metadata. Plausible new tools, not retrofits:

- **`math_eval`**: `2 * (3 + 4)`. Grammar enforces well-formed arithmetic.
- **`sql_query`** with a known schema: grammar can enforce that referenced table and column names exist.
- **`code_search_dsl`**: `def:foo type:fn file:*.rs` style queries with a strict vocabulary.
- **`regex_tester`**: input is one regex, grammar enforces meta-regex syntax.
- **`shell_constrained`**: a vocabulary-restricted shell DSL (only `ls`, `grep`, `cat`, etc.).

The default catalog has none of these shapes today.
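
By contrast, a grammar-native tool such as `math_eval` has no structured metadata to lose. As a rough sketch of the executor side, assuming the grammar admits `+`, `-`, `*`, `/`, parentheses, and decimal numbers (this RFC does not specify one), a small recursive-descent evaluator could look like this. The `Parser` type and the `math_eval` function are hypothetical.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Assumed grammar (informal EBNF):
///   expr   := term (('+' | '-') term)*
///   term   := factor (('*' | '/') factor)*
///   factor := NUMBER | '(' expr ')' | '-' factor
struct Parser<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Parser<'a> {
    fn skip_ws(&mut self) {
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
    }

    /// Consumes `expected` (after whitespace) and reports whether it was there.
    fn eat(&mut self, expected: char) -> bool {
        self.skip_ws();
        if self.chars.peek() == Some(&expected) {
            self.chars.next();
            true
        } else {
            false
        }
    }

    fn expr(&mut self) -> Result<f64, String> {
        let mut acc = self.term()?;
        loop {
            if self.eat('+') {
                acc += self.term()?;
            } else if self.eat('-') {
                acc -= self.term()?;
            } else {
                return Ok(acc);
            }
        }
    }

    fn term(&mut self) -> Result<f64, String> {
        let mut acc = self.factor()?;
        loop {
            if self.eat('*') {
                acc *= self.factor()?;
            } else if self.eat('/') {
                let divisor = self.factor()?;
                if divisor == 0.0 {
                    return Err("division by zero".to_string());
                }
                acc /= divisor;
            } else {
                return Ok(acc);
            }
        }
    }

    fn factor(&mut self) -> Result<f64, String> {
        if self.eat('-') {
            return Ok(-self.factor()?);
        }
        if self.eat('(') {
            let value = self.expr()?;
            return if self.eat(')') {
                Ok(value)
            } else {
                Err("expected ')'".to_string())
            };
        }
        self.skip_ws();
        let mut text = String::new();
        while let Some(&c) = self.chars.peek() {
            if c.is_ascii_digit() || c == '.' {
                text.push(c);
                self.chars.next();
            } else {
                break;
            }
        }
        text.parse::<f64>()
            .map_err(|_| format!("expected a number, found '{text}'"))
    }
}

/// Evaluates one arithmetic expression; rejects trailing input.
pub fn math_eval(input: &str) -> Result<f64, String> {
    let mut parser = Parser { chars: input.chars().peekable() };
    let value = parser.expr()?;
    parser.skip_ws();
    match parser.chars.peek() {
        None => Ok(value),
        Some(c) => Err(format!("unexpected trailing character '{c}'")),
    }
}
```

Here the whole input is the expression, so there is no second JSON shape to keep in sync beyond a single string field on providers without grammar support. The error paths remain useful on those providers, where nothing constrains the model's output at decode time.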

## Recommendation

**Don't retrofit `read`, `fs_search`, etc.** The cost (dual input shapes, parser maintenance, brittleness under extension) outweighs the marginal compactness/correctness win on a single provider tier.

**Do prototype on a new, grammar-native tool** if/when one shows up. That gives us:
- A real measurement of the compactness/correctness benefit on `gpt-5.5+`.
- A reference implementation of the parser pattern.
- Something we can A/B test against an equivalent JSON-schema tool.

If after one experiment grammars demonstrably outperform JSON for our workload, *then* revisit retrofitting — but with data, not vibes.
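
One way to frame that experiment: record each tool call's output-token count and whether it executed successfully, once for the grammar arm and once for the JSON arm, then compare. The sketch below assumes a hypothetical `CallRecord` log shape and is only meant to pin down which two numbers the comparison needs; it says nothing about how the calls are collected.

```rust
/// Hypothetical per-call log entry for one arm of the A/B test.
#[derive(Debug, Clone, Copy)]
pub struct CallRecord {
    pub output_tokens: u32,
    /// True if the call parsed and executed without a retry (an assumed definition).
    pub succeeded: bool,
}

#[derive(Debug, PartialEq)]
pub struct ArmSummary {
    pub success_rate: f64,
    pub mean_output_tokens: f64,
}

/// Summarizes one arm; returns `None` for an empty log to avoid dividing by zero.
pub fn summarize(records: &[CallRecord]) -> Option<ArmSummary> {
    if records.is_empty() {
        return None;
    }
    let n = records.len() as f64;
    let ok = records.iter().filter(|r| r.succeeded).count() as f64;
    let tokens: u64 = records.iter().map(|r| u64::from(r.output_tokens)).sum();
    Some(ArmSummary {
        success_rate: ok / n,
        mean_output_tokens: tokens as f64 / n,
    })
}

/// Returns (success-rate delta, mean-token delta), grammar minus JSON.
/// A positive first value and a negative second value favor the grammar arm.
pub fn delta(grammar: &ArmSummary, json: &ArmSummary) -> (f64, f64) {
    (
        grammar.success_rate - json.success_rate,
        grammar.mean_output_tokens - json.mean_output_tokens,
    )
}
```

A real eval would also want confidence intervals and a fixed prompt set shared by both arms; this only covers the point estimates.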

## Open questions

- [ ] What's the actual token-savings and tool-call success-rate delta on `gpt-5.5+`? (Run an eval on a grammar-vs-JSON pair before deciding.)
- [ ] How does grammar-tool latency compare to JSON-tool latency on the OpenAI Responses API?
- [ ] Is there a candidate first-mover tool worth designing grammar-first? Math eval? Structured-query search? Something else?

## Related

- Grammar feature: `a7751534`
- Gating: `2083d6ea`
