
RFC: custom-grammar tools for the default catalog (gpt-5.5+) #11

@justrach

Context

2083d6ea landed the gating infrastructure: any ToolDefinition with grammar: Some(ToolGrammar { ... }) becomes an OpenAI Tool::Custom on gpt-5.5+ and silently falls back to a Tool::Function (JSON-schema) on every other model. The plumbing is in place; no default tool currently declares a grammar.
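The gating rule described above can be sketched roughly as follows. This is not the code from 2083d6ea: the field names, the `supports_custom_grammar` helper, and the way the model id is parsed are all assumptions made for illustration. Only the names `ToolDefinition`, `ToolGrammar`, `Tool::Custom`, and `Tool::Function` come from the description.

```rust
// Hypothetical stand-ins for the real types; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolGrammar {
    pub definition: String, // assumed: the grammar source text
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToolDefinition {
    pub name: String,
    pub json_schema: String, // assumed: schema kept as a raw string here
    pub grammar: Option<ToolGrammar>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Tool {
    Custom { name: String, grammar: String },
    Function { name: String, json_schema: String },
}

/// Assumed capability check: true for ids shaped like "gpt-<major>[.<minor>]"
/// (optionally followed by "-suffix") at version 5.5 or later.
pub fn supports_custom_grammar(model: &str) -> bool {
    let Some(rest) = model.strip_prefix("gpt-") else {
        return false;
    };
    let version = rest.split('-').next().unwrap_or("");
    let mut parts = version.split('.');
    let major: u32 = match parts.next().and_then(|p| p.parse().ok()) {
        Some(v) => v,
        None => return false,
    };
    let minor: u32 = parts.next().and_then(|p| p.parse().ok()).unwrap_or(0);
    (major, minor) >= (5, 5)
}

/// A grammar is used only when the tool declares one AND the model supports it;
/// every other combination falls back to the JSON-schema function form.
pub fn to_provider_tool(def: &ToolDefinition, model: &str) -> Tool {
    match &def.grammar {
        Some(g) if supports_custom_grammar(model) => Tool::Custom {
            name: def.name.clone(),
            grammar: g.definition.clone(),
        },
        _ => Tool::Function {
            name: def.name.clone(),
            json_schema: def.json_schema.clone(),
        },
    }
}
```

Because the fallback is silent, a tool that declares a grammar still has to ship a usable JSON schema, which is the dual-shape cost discussed below.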

The open question: should we retrofit grammars onto existing default tools (e.g. read, fs_search) so they take advantage of the feature on gpt-5.5+?

Why it's tempting

  • Compactness. read foo.rs:10-20 is fewer tokens than {"file_path":"foo.rs","range":{"start_line":10,"end_line":20}}. Multiplied across millions of tool calls, this is real latency + cost savings.
  • Token-level constraint. A grammar prevents the model from emitting invalid syntax at all (it's enforced at decode time), where JSON-schema is advisory — the model can still produce malformed JSON we have to reject and retry.
  • Model affinity. Anecdotal evidence suggests gpt-5.5+ handles single-expression tool calls more naturally than JSON object construction.

Why it's not obvious

  1. Multi-provider asymmetry. Anthropic, Google, Codex, Bedrock, and OpenAI chat-completions don't support custom grammars. Every grammar-enabled tool needs two input shapes: a flat grammar form (gpt-5.5+) and a JSON form (everything else). That's twice the surface area to test, document, and keep in sync.
  2. Parser cost. JSON args round-trip via serde for free. A grammar tool's text output has to be parsed back into the executor's structured input — every grammar tool is one more hand-written or generated parser, with its own bugs and edge cases.
  3. Brittle to extension. Adding an optional field to a JSON-schema tool is one line. Adding it to a grammar tool means rewriting the grammar, the parser, and refreshing the test corpus.
  4. No win for already-structured inputs. Default catalog tools (read, write, patch, fs_search, shell, fetch, todo_*) are all multi-field structured calls. JSON-schema is doing real work for them — type-checking integers, enforcing required fields, gating enums. Flattening into one grammar string strictly removes information.
  5. Debuggability. JSON-schema validation errors are clear. A grammar parse failure mid-token is harder to diagnose, and partial parses can succeed silently with the wrong fields.
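To make the parser cost in point 2 concrete, here is a minimal sketch of what a hand-written parser for the flat `read foo.rs:10-20` form might look like. The `ReadArgs` struct, the `ParseError` variants, and the exact surface syntax are assumptions for illustration, not the real `read` tool's input type.

```rust
// Hypothetical structured input for `read`; the real executor type may differ.
#[derive(Debug, PartialEq)]
pub struct ReadArgs {
    pub file_path: String,
    pub range: Option<(u32, u32)>, // (start_line, end_line), inclusive
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    MissingVerb,
    EmptyPath,
    BadRange(String),
}

/// Parses `read <path>` or `read <path>:<start>-<end>` into `ReadArgs`.
pub fn parse_read(input: &str) -> Result<ReadArgs, ParseError> {
    let rest = input
        .trim()
        .strip_prefix("read ")
        .ok_or(ParseError::MissingVerb)?
        .trim();

    // Split on the LAST ':' so that anything after it is treated as a range.
    let (path, range) = match rest.rsplit_once(':') {
        Some((p, r)) => (p, Some(r)),
        None => (rest, None),
    };
    if path.is_empty() {
        return Err(ParseError::EmptyPath);
    }

    let range = match range {
        None => None,
        Some(r) => {
            let bad = || ParseError::BadRange(r.to_string());
            let (s, e) = r.split_once('-').ok_or_else(|| bad())?;
            let start: u32 = s.parse().map_err(|_| bad())?;
            let end: u32 = e.parse().map_err(|_| bad())?;
            // Line numbers are assumed 1-based and the range non-empty.
            if start == 0 || start > end {
                return Err(bad());
            }
            Some((start, end))
        }
    };

    Ok(ReadArgs { file_path: path.to_string(), range })
}
```

Even this small sketch has an edge case that the JSON form does not: a path that itself contains `:` (for example a Windows drive prefix) with no range is rejected as a bad range. Each grammar tool would carry its own set of such decisions, in addition to the serde-derived JSON path that the other providers still need.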

Where grammar would actually help

Grammar tools shine when the whole input is one free-form expression, not when there's structured metadata. Plausible new tools, not retrofits:

  • math_eval — `2 * (3 + 4)`. Grammar enforces well-formed arithmetic.
  • sql_query with a known schema — grammar can enforce table/column names exist.
  • code_search_dsl — `def:foo type:fn file:*.rs` style queries with a strict vocabulary.
  • regex_tester — input is one regex, grammar enforces meta-regex syntax.
  • shell_constrained — a vocabulary-restricted shell DSL (only `ls`, `grep`, `cat`, etc.).

The default catalog has none of these shapes today.
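As an illustration of the "one free-form expression" shape, here is a rough sketch of the executor-side parser for a `code_search_dsl`-style query. The key set (`def`, `type`, `file`), the allowed `type` values, and the `SearchQuery` struct are hypothetical; on gpt-5.5+ the grammar itself would be expected to enforce the vocabulary at decode time, so this parser is mostly a safety net and the JSON-free entry point.

```rust
// Hypothetical query shape; keys and allowed values are assumptions.
#[derive(Debug, Default, PartialEq)]
pub struct SearchQuery {
    pub def: Option<String>,
    pub kind: Option<String>, // value of the `type:` key
    pub file: Option<String>,
}

/// Parses whitespace-separated `key:value` terms with a strict vocabulary.
pub fn parse_query(input: &str) -> Result<SearchQuery, String> {
    const KINDS: [&str; 4] = ["fn", "struct", "enum", "trait"];
    let mut q = SearchQuery::default();

    for term in input.split_whitespace() {
        let (key, value) = term
            .split_once(':')
            .ok_or_else(|| format!("expected key:value, got `{term}`"))?;
        if value.is_empty() {
            return Err(format!("empty value for `{key}`"));
        }

        // Pick the field this key writes to, rejecting unknown keys and values.
        let slot = match key {
            "def" => &mut q.def,
            "type" => {
                if !KINDS.contains(&value) {
                    return Err(format!("unknown type `{value}`"));
                }
                &mut q.kind
            }
            "file" => &mut q.file,
            other => return Err(format!("unknown key `{other}`")),
        };

        // Each key may appear at most once.
        if slot.replace(value.to_string()).is_some() {
            return Err(format!("duplicate key `{key}`"));
        }
    }
    Ok(q)
}
```

With these assumptions, `parse_query("def:foo type:fn file:*.rs")` fills all three fields, while `type:banana` or an unknown key is rejected with a message naming the offending term.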

Recommendation

Don't retrofit read, fs_search, etc. The cost (dual input shapes, parser maintenance, brittleness under extension) outweighs the marginal compactness/correctness win on a single provider tier.

Do prototype on a new, grammar-native tool if/when one shows up. That gives us:

  • A real measurement of the compactness/correctness benefit on gpt-5.5+.
  • A reference implementation of the parser pattern.
  • Something we can A/B test against an equivalent JSON-schema tool.

If after one experiment grammars demonstrably outperform JSON for our workload, then revisit retrofitting — but with data, not vibes.

Open questions

  • What's the actual token-savings and tool-call success-rate delta on gpt-5.5+? (Run an eval on a grammar-vs-JSON pair before deciding.)
  • How does grammar-tool latency compare to JSON-tool latency on the OpenAI Responses API?
  • Is there a candidate first-mover tool worth designing grammar-first? Math eval? Structured-query search? Something else?
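For the first question, the comparison could be summarized along these lines. This is only a sketch of the bookkeeping, assuming each eval run records output tokens and whether the tool call parsed and executed; the `Trial` and `ArmSummary` types are hypothetical and no real measurements are implied.

```rust
// One recorded tool call from an eval run (hypothetical record shape).
#[derive(Debug, Clone, Copy)]
pub struct Trial {
    pub output_tokens: u32,
    pub succeeded: bool,
}

#[derive(Debug, PartialEq)]
pub struct ArmSummary {
    pub mean_tokens: f64,
    pub success_rate: f64,
}

/// Summarizes one arm (grammar or JSON); returns None when there are no trials.
pub fn summarize(trials: &[Trial]) -> Option<ArmSummary> {
    if trials.is_empty() {
        return None;
    }
    let n = trials.len() as f64;
    let tokens: u64 = trials.iter().map(|t| u64::from(t.output_tokens)).sum();
    let ok = trials.iter().filter(|t| t.succeeded).count();
    Some(ArmSummary {
        mean_tokens: tokens as f64 / n,
        success_rate: ok as f64 / n,
    })
}

/// Returns (fractional token savings of grammar vs JSON, success-rate delta).
/// Assumes the JSON arm has a non-zero mean token count.
pub fn delta(grammar: &ArmSummary, json: &ArmSummary) -> (f64, f64) {
    (
        1.0 - grammar.mean_tokens / json.mean_tokens,
        grammar.success_rate - json.success_rate,
    )
}
```

A real eval would also want confidence intervals and the same prompts on both arms; the point here is only that both deltas fall out of the same per-call records.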

Related

  • Grammar feature: a7751534
  • Gating: 2083d6ea
