Context
2083d6ea landed the gating infrastructure: any ToolDefinition with grammar: Some(ToolGrammar { ... }) becomes an OpenAI Tool::Custom on gpt-5.5+ and silently falls back to a Tool::Function (JSON-schema) on every other model. The plumbing is in place; no default tool currently declares a grammar.
The open question: should we retrofit grammars onto existing default tools (e.g. read, fs_search) so they take advantage of the feature on gpt-5.5+?
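To make the gating concrete, here is a minimal sketch of the fallback described above. Only the names ToolDefinition, ToolGrammar, Tool::Custom and Tool::Function come from this note. The field names, the `supports_custom_tools` helper and its version-parsing rule are assumptions for illustration, not the code that landed in 2083d6ea.

```rust
/// Hypothetical grammar payload; the real `ToolGrammar` fields are not shown in this note.
#[derive(Debug, Clone, PartialEq)]
struct ToolGrammar {
    definition: String,
}

/// Hypothetical, trimmed-down tool definition.
#[derive(Debug, Clone, PartialEq)]
struct ToolDefinition {
    name: String,
    json_schema: String,
    grammar: Option<ToolGrammar>,
}

/// Stand-in for the provider-facing tool enum.
#[derive(Debug, PartialEq)]
enum Tool {
    Custom { name: String, grammar: String },
    Function { name: String, json_schema: String },
}

/// Assumed model check: accepts `gpt-<major>[.<minor>]` at version 5.5 or later.
fn supports_custom_tools(model: &str) -> bool {
    let Some(version) = model.strip_prefix("gpt-") else {
        return false;
    };
    // Keep only the leading version part, ignoring any `-suffix` (an assumption).
    let version = version.split('-').next().unwrap_or("");
    let mut parts = version.splitn(2, '.');
    let major = parts.next().and_then(|p| p.parse::<u32>().ok());
    // A missing minor component counts as 0; a malformed one rejects the model.
    let minor = parts.next().map_or(Some(0), |p| p.parse::<u32>().ok());
    match (major, minor) {
        (Some(major), Some(minor)) => (major, minor) >= (5, 5),
        _ => false,
    }
}

/// A grammar is used only when the tool declares one AND the model supports it;
/// every other combination falls back to the JSON-schema function tool.
fn to_provider_tool(def: &ToolDefinition, model: &str) -> Tool {
    match &def.grammar {
        Some(grammar) if supports_custom_tools(model) => Tool::Custom {
            name: def.name.clone(),
            grammar: grammar.definition.clone(),
        },
        _ => Tool::Function {
            name: def.name.clone(),
            json_schema: def.json_schema.clone(),
        },
    }
}

fn main() {
    let def = ToolDefinition {
        name: "math_eval".to_string(),
        json_schema: "{}".to_string(),
        grammar: Some(ToolGrammar { definition: "expr".to_string() }),
    };
    println!("{:?}", to_provider_tool(&def, "gpt-5.5"));
    println!("{:?}", to_provider_tool(&def, "some-other-model"));
}
```

Because the fallback is silent, a tool that declares a grammar still has to ship a working JSON schema, which is where the dual-shape cost discussed below comes from.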
Why it's tempting
- Compactness. `read foo.rs:10-20` is fewer tokens than `{"file_path":"foo.rs","range":{"start_line":10,"end_line":20}}`. Multiplied across millions of tool calls, this is a real latency and cost saving.
- Token-level constraint. A grammar is enforced at decode time, so the model cannot emit invalid syntax at all, whereas JSON-schema is advisory: the model can still produce malformed JSON that we have to reject and retry.
- Model affinity. Anecdotal evidence suggests gpt-5.5+ handles single-expression tool calls more naturally than JSON object construction.
Why it's not obvious
- Multi-provider asymmetry. Anthropic, Google, Codex, Bedrock, and OpenAI chat-completions don't support custom grammars. Every grammar-enabled tool therefore needs two input shapes: a flat grammar form (gpt-5.5+) and a JSON form (everything else). That's twice the surface area to test, document, and keep in sync.
- Parser cost. JSON args round-trip via serde for free. A grammar tool's text output has to be parsed back into the executor's structured input, so every grammar tool is one more hand-written or generated parser, with its own bugs and edge cases.
- Brittle to extension. Adding an optional field to a JSON-schema tool is one line. Adding it to a grammar tool means rewriting the grammar, the parser, and refreshing the test corpus.
- No win for already-structured inputs. Default catalog tools (read, write, patch, fs_search, shell, fetch, todo_*) are all multi-field structured calls. JSON-schema is doing real work for them: type-checking integers, enforcing required fields, gating enums. Flattening into one grammar string strictly removes information.
- Debuggability. JSON-schema validation errors are clear. A grammar parse failure mid-token is harder to diagnose, and partial parses can succeed silently with the wrong fields.
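The parser cost and the silent-misparse risk above can be sketched with a hand-written parser for the flat `read foo.rs:10-20` form. This is an illustration only: the ReadArgs struct mirrors the field names in the JSON example, but the struct itself, the `parse_read` function, and the rule that the range suffix is optional are assumptions, not existing code.

```rust
/// Structured input the executor is assumed to want
/// (field names taken from the JSON example in this note).
#[derive(Debug, PartialEq)]
struct ReadArgs {
    file_path: String,
    range: Option<(u32, u32)>, // (start_line, end_line)
}

/// Parses the flat form `read <path>[:<start>-<end>]` (hypothetical syntax).
fn parse_read(input: &str) -> Result<ReadArgs, String> {
    let rest = input
        .strip_prefix("read ")
        .ok_or_else(|| "expected `read <path>`".to_string())?
        .trim();
    if rest.is_empty() {
        return Err("missing file path".to_string());
    }
    // Split on the LAST colon so a path that itself contains a colon
    // (for example a drive letter) keeps it.
    if let Some((path, range)) = rest.rsplit_once(':') {
        if let Some((start, end)) = range.split_once('-') {
            if let (Ok(start), Ok(end)) = (start.parse::<u32>(), end.parse::<u32>()) {
                if path.is_empty() {
                    return Err("missing file path".to_string());
                }
                if start == 0 || start > end {
                    return Err(format!("invalid line range {start}-{end}"));
                }
                return Ok(ReadArgs {
                    file_path: path.to_string(),
                    range: Some((start, end)),
                });
            }
        }
    }
    // No well-formed range suffix: the whole remainder is treated as the path.
    // This is exactly the silent-misparse hazard: `read foo.rs:10-x` succeeds
    // with a file path of `foo.rs:10-x` instead of reporting a bad range.
    Ok(ReadArgs {
        file_path: rest.to_string(),
        range: None,
    })
}

fn main() {
    println!("{:?}", parse_read("read foo.rs:10-20"));
    println!("{:?}", parse_read("read foo.rs:10-x"));
    println!("{:?}", parse_read("read foo.rs:20-10"));
}
```

Every decision in this parser (which colon to split on, whether the range is optional, what counts as an invalid range) is something the JSON-schema path gets from serde and the schema without extra code, and each one needs its own tests.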
Where grammar would actually help
Grammar tools shine when the whole input is one free-form expression, not when there's structured metadata. Plausible new tools, not retrofits:
- math_eval: `2 * (3 + 4)`. Grammar enforces well-formed arithmetic.
- sql_query with a known schema: grammar can enforce that table and column names exist.
- code_search_dsl: `def:foo type:fn file:*.rs` style queries with a strict vocabulary.
- regex_tester: input is one regex, grammar enforces meta-regex syntax.
- shell_constrained: a vocabulary-restricted shell DSL (only `ls`, `grep`, `cat`, etc.).
The default catalog has none of these shapes today.
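As an illustration of a grammar-native shape, the following sketch shows what the executor side of a math_eval tool might look like. The grammar in the doc comment, the integer-only arithmetic, and every name in the block are assumptions made for this example; no such tool exists in the catalog.

```rust
/// Recursive-descent evaluator for an assumed math_eval grammar:
///   expr   := term (('+' | '-') term)*
///   term   := factor (('*' | '/') factor)*
///   factor := INTEGER | '(' expr ')'
struct Parser<'a> {
    chars: std::iter::Peekable<std::str::Chars<'a>>,
}

fn overflow() -> String {
    "arithmetic overflow".to_string()
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { chars: input.chars().peekable() }
    }

    fn skip_ws(&mut self) {
        while self.chars.next_if(|c| c.is_whitespace()).is_some() {}
    }

    fn expr(&mut self) -> Result<i64, String> {
        let mut value = self.term()?;
        loop {
            self.skip_ws();
            if self.chars.next_if_eq(&'+').is_some() {
                value = value.checked_add(self.term()?).ok_or_else(overflow)?;
            } else if self.chars.next_if_eq(&'-').is_some() {
                value = value.checked_sub(self.term()?).ok_or_else(overflow)?;
            } else {
                return Ok(value);
            }
        }
    }

    fn term(&mut self) -> Result<i64, String> {
        let mut value = self.factor()?;
        loop {
            self.skip_ws();
            if self.chars.next_if_eq(&'*').is_some() {
                value = value.checked_mul(self.factor()?).ok_or_else(overflow)?;
            } else if self.chars.next_if_eq(&'/').is_some() {
                let divisor = self.factor()?;
                // checked_div returns None for division by zero and for overflow.
                value = value
                    .checked_div(divisor)
                    .ok_or_else(|| "division by zero or overflow".to_string())?;
            } else {
                return Ok(value);
            }
        }
    }

    fn factor(&mut self) -> Result<i64, String> {
        self.skip_ws();
        if self.chars.next_if_eq(&'(').is_some() {
            let value = self.expr()?;
            self.skip_ws();
            return match self.chars.next() {
                Some(')') => Ok(value),
                other => Err(format!("expected ')', found {other:?}")),
            };
        }
        let mut digits = String::new();
        while let Some(d) = self.chars.next_if(|c| c.is_ascii_digit()) {
            digits.push(d);
        }
        digits
            .parse::<i64>()
            .map_err(|_| format!("expected a number, found {:?}", self.chars.peek()))
    }
}

/// Evaluates the whole input; trailing characters are an error.
fn math_eval(input: &str) -> Result<i64, String> {
    let mut parser = Parser::new(input);
    let value = parser.expr()?;
    parser.skip_ws();
    match parser.chars.next() {
        None => Ok(value),
        Some(c) => Err(format!("unexpected trailing character {c:?}")),
    }
}

fn main() {
    println!("{:?}", math_eval("2 * (3 + 4)"));
    println!("{:?}", math_eval("2 * (3 + 4"));
}
```

Here the whole input is one expression, so the grammar and the parser describe the same thing and there is no structured metadata to lose. That is the property the default catalog tools lack.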
Recommendation
Don't retrofit read, fs_search, etc. The cost (dual input shapes, parser maintenance, brittleness under extension) outweighs the marginal compactness/correctness win on a single provider tier.
Do prototype on a new, grammar-native tool if/when one shows up. That gives us:
- A real measurement of the compactness/correctness benefit on gpt-5.5+.
- A reference implementation of the parser pattern.
- Something we can A/B test against an equivalent JSON-schema tool.
If after one experiment grammars demonstrably outperform JSON for our workload, then revisit retrofitting — but with data, not vibes.
Open questions
- [...] gpt-5.5+? (Run an eval on a grammar-vs-JSON pair before deciding.)
Related
- Grammar feature: a7751534
- Gating: 2083d6ea