Skip to content

jmitchell/tree-sitter-dhall

Repository files navigation

tree-sitter-dhall

Build Status

Dhall grammar for tree-sitter.

Status

Work in progress (unstable)

There are some consequential bugs. However, 187/190 Dhall files found in dhall-haskell and its git submodules (including dhall-lang) parse as expected.

Tree-sitter unit tests in ./corpus and ./corpus-broken don't yet cover the full grammar. Corpus test coverage is currently tracked in TODO.md.

Usage

See tree-sitter's using parsers documentation.

Applications

Existing:

Potential:

  • Parser for Dhall ports to languages which support C FFI.
  • Portable Dhall syntax libraries and tools.
    • Validation
    • Formatting
    • Linting and refactoring
    • Querying
    • Syntactic diffing/patching
    • Conversion
      • user-friendly AST
      • programming languages
      • data formats, e.g. Dhall's binary representation
      • syntax highlighting for a document format, e.g. HTML+CSS
    • More editor integrations
      • simplest in editors which support tree-sitter

Approach

From the project plan:

My hope is as dhall.abnf evolves the process of updating tree-sitter-dhall, and any of its future dependencies, can be mostly automated.

abnf-to-tree-sitter compiled the standard dhall.abnf grammar from dhall-lang into generated_grammar.js, a starting point for a tree-sitter input grammar. Adjustments were needed, so manual changes were implemented in grammar.js.

Realizing full automation requires:

  1. Manually resolving bugs in grammar.js through changes that could conceivably be automated, and would ideally make sense for all ABNF grammars.
  2. Improving abnf-to-tree-sitter so it can deterministically generate the working grammar.js based on
    • dhall.abnf
    • initial production rule (unspecified in RFC 5234)
    • production rules to hide in the parse tree
    • set of pure heuristics to transform the grammar into an equivalent one that's preferable for some reason
    • parameters for deterministic search which is informed by
      • feedback from tree-sitter generate
      • feedback from tree-sitter parse on a set of known valid inputs
      • associativity and relative production precedence hints in case it's ambiguous in the ABNF
      • specification defining semantic reductions in terms of syntactic productions
    • ... (hopefully not much more)

Tradeoffs

Depending on your objectives this might not be the right approach.

Incremental and ambiguous parsing are overkill

This could be an issue when optimizing for one-shot parsing time/space.

Word to the wise: performance optimizations should be informed by benchmarking and profiling realistic input, and also balanced against potential performance regressions elsewhere.

Tree-sitter is designed to incrementally parse code as it is modified (common in code editors). The data structures and algorithms to support this may incur unnecessary overhead. I have not yet had a reason to compare performance of tree-sitter parsers against optimized parsers. NB: tree-sitter parse --time (used in CI tests) may be convenient for initial benchmarking.

Additionally, tree-sitter generates GLR parsers which support ambiguous grammars and therefore may search for many potential parses concurrently. The semantics of ABNF's / operator aren't fully specified, dhall.abnf relies on backtracking, and it's generally undecidable whether a given context-free grammar is inherently ambiguous so automation efforts must account for ambiguity by default. I also haven't had a reason to analyze which ambiguous parse paths are guaranteed to reach a dead-end and could therefore be safely pruned from the search.

Parse trees are verbose and subject to change according to the standard grammar

Concise abstract syntax trees or other high-level interfaces for Dhall syntax may be preferred for their relative ease of use.

However, if implementing these interfaces requires any steps which can't be derived automatically, future changes to the standard may conflict with those interfaces. As a result conflicts would require

  • introducing a breaking change,
  • implementing an unclean workaround, or
  • cutting off support for current and future versions of the standard

None of the options are ideal. The next part suggests how to manage this.

Automating parts that have so far been manual may cause regressions

Over the long-term permitting breaking changes is sometimes for the best. It's the only way to avoid introducing and maintaining hacks for the sake of short-term convenience and long-term backwards incompatiblity.

Explicitly dividing projects into two categories should help strike a balance.

  1. Expand the set of tooling that can be deterministically generated from declarative specifications. Breaking changes to the specifications automatically translate into corresponding breaking changes in these tools. Open standards development and version control should make it simple to trace these breaking changes to the original source.
  2. Manually build useful tools on top of this frontier of generated tooling. Take a best-effort approach to designing stable, user-friendly interfaces. Offer a vision for how future incompatible specification changes will be handled, for example support the latest N major versions of the specification concurrently.

Those who want to optimize for stability should primarily use the deterministically generated tooling, and those who are willing to sacrifice some stability for UX can instead opt for the manually crafted interfaces.

The goal is for this project and others like abnf-to-tree-sitter to abstract any special cases particular to Dhall and automate as much as possible. In other words, they are currently in category 2 with aspirations of being in category 1.

Contributing

Issues with the help wanted label are a good place to start.

References

About

(WIP) Dhall grammar for Tree-sitter

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages