Would you consider upstreaming the folds.scm and indents.scm files to grammar repos? #3944

Open
patrickt opened this issue Dec 7, 2022 · 34 comments
Labels
enhancement New feature or request

Comments

@patrickt

patrickt commented Dec 7, 2022

Hi there 👋🏻 I work on tree-sitter’s core, as well as various language grammars, as part of my work on GitHub’s Semantic Code team. I was really blown away by where the treesitter-nvim community has gone with tree-sitter as a technology: it hadn’t occurred to me to use tree-sitter queries for indentation and code folding, but it makes just so much sense. Well done.

I write because I and the team would love to bring the indentation and folding features into Tree-sitter as a first class citizen. We at GitHub are hoping to use queries like these to power more advanced features in our new code view, and I, though I use the other editor that shall not be named here, would love to bring what you’ve done to my editor ecosystem. I could very plausibly see a world where these powered tree-sitter indent and tree-sitter fold commands to the CLI.

My question to the team is this: is such an undertaking possible? I would love to see the folds.scm and indents.scm files living in their official grammar repositories, but looking at your project hierarchy, it’s clear that you already have a lot of processes in place. Is this an insurmountable amount of work? (We on the tree-sitter side of things would of course help however we can.) Would it be too much to ask of you as a maintenance team, or would it slow your development too much? (I see in the README that languages are divvied up across contributors; we would of course hand out commit bits to the relevant repositories.) What would you think about trying it with, say, Ruby and seeing to what degree that would affect your workflow?

Again, thanks for using tree-sitter so excitingly, and I hope we can put your work in even more people’s hands!

@patrickt patrickt added the enhancement New feature or request label Dec 7, 2022
@clason
Contributor

clason commented Dec 7, 2022

Hi! Nice to hear from you, and thanks for the kind words!

I can't speak for the whole team (and invite @theHamsta @vigoux @stsewd to weigh in), but I personally would prefer if queries were owned by the parser devs -- adapting queries to breaking parser changes is a big chunk of work. Some parsers (close to Neovim, like Viml and Vimdoc) already do this, although we sync these updates manually.

There are two reasons we aggregate (and probably will keep aggregating) the queries here:

  1. easier integration tests;
  2. some queries are editor-specific -- this is mostly highlights.scm, though; I believe folds.scm and indents.scm specifically should be agnostic and safe to centralize.

Of course, if tree-sitter would officially recommend this and we could rely on these (and more?) queries being part of the grammar repos, we'd look at automating this.

In any case, I'd personally welcome these queries being useful and used more widely, and I'd be fine with grammar maintainer upstreaming them (Apache 2.0 license allows this) -- maybe tag the listed maintainer in the PR for complete transparency.

What I'm not going to do is open PRs at the 100+ repos myself, though ;)

@stsewd
Member

stsewd commented Dec 7, 2022

Agree with what @clason said. Maybe with folds and indents the queries could be re-used across editors, but I guess that would depend on the implementation of those features, and if there is like a "core" implementation of those features that we can follow, I think that would be great and would help to standardize/re-use those queries.

@clason
Contributor

clason commented Dec 7, 2022

@patrickt And while I have you: It would be great if tree-sitter had an official binary parser and matching(!) queries distribution mechanism -- say, some sort of GitHub Action grammar maintainers can just add to their workflows ;)

(Seriously: One of the biggest pain points for us is the fact that queries are only compatible with a specific parser revision, with no way of telling whether they match before trying to run them and getting a bunch of errors. If parsers had a version field that could be inspected before loading a query -- which could have a corresponding ; version: modeline -- that would prevent a lot of headaches.)
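As a rough illustration of the idea, the sketch below reads a `; version:` modeline from a query file and compares it with a parser version before loading. Neither the modeline format nor a parser version field exists today; `query_version`, `is_compatible`, and the version strings are hypothetical names chosen for this example.

```python
import re

# Hypothetical modeline format, following the "; version:" idea above.
MODELINE = re.compile(r"^;\s*version:\s*(\S+)\s*$")


def query_version(query_text):
    """Return the version declared in a leading comment modeline, or None."""
    for line in query_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines before the first pattern are fine
        if not stripped.startswith(";"):
            break  # modelines must precede the first pattern
        match = MODELINE.match(stripped)
        if match:
            return match.group(1)
    return None


def is_compatible(query_text, parser_version):
    """True only if the query declares exactly the parser's version."""
    declared = query_version(query_text)
    return declared is not None and declared == parser_version


# Assumed example data: a query file pinned to a made-up parser version.
QUERY = """; version: 0.20.1
(identifier) @variable
"""
print(is_compatible(QUERY, "0.20.1"))
print(is_compatible(QUERY, "0.21.0"))
```

An exact-match comparison is the most conservative choice; a real mechanism might instead accept a version range.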

@figsoda
Contributor

figsoda commented Dec 13, 2022

perhaps we can add an entry in parsers.lua to specify the queries to pull from upstream, so we can explicitly use the upstream queries when the upstream grammar supports neovim

@theHamsta
Member

theHamsta commented Dec 13, 2022

We're very open to upstreaming the query files whenever a repo wants to maintain them. We could just copy them during our parser installation process to provide them to end users or rely on some official distribution mechanism for the parsers. We already reference scanner.c/parser.c and could do the same with queries. A registry for parsers and queries would be really handy as it could avoid the security problems from blindly compiling enormous auto-generated C files and also handle versioning and compatibility.

  • The highlight queries were forked at one point because the upstream queries were very limited and, since there was no documentation on which captures to use, very inconsistent. Since then, https://github.com/helix-editor/helix/tree/master/runtime/queries has forked from our queries, and Emacs has also created its own queries: https://github.com/emacs-tree-sitter/tree-sitter-langs/tree/599570cd2a6d1b43a109634896b5c52121e155e3/queries. Something that we find very handy, and that Helix also supports, is referencing the queries of parent languages when a grammar extends another, so that queries stay consistent across C, C++, CUDA, GLSL, and HLSL. I think sharing highlight queries, at least for some languages, would be possible, but it would require standardized captures that say what to use in which situation. It would also be possible to have different query dialects in the mainstream repo to follow the preferences of certain editors.
  • Folds might be easy to upstream. Again, referencing parent languages would be nice, but this could probably also be handled by copy/paste given the small size of the files.
  • Indents might also be possible, but this would require an official reference/reusable implementation of how indents behave. At the moment we have a quite hacky implementation, and queries are often written so that they fit the quirks of our implementation and let our tests pass.
  • Text objects might only be relevant to Neovim/Helix, but might also be interesting for editors with Vim keybindings: https://github.com/meain/evil-textobj-tree-sitter/

@patrickt what were your plans for the next steps? If the official tree-sitter/tree-sitter-* repos provided folds.scm/indents.scm, we could re-use them, and many other parser repos would probably follow. We just have to be careful about expectations, as there could be differences in what users expect an indent to do in Vim vs. other editors.

@patrickt
Author

patrickt commented Jan 6, 2023

Hey, everyone. Happy new year! Sorry for my radio silence, and thank you all for chiming in on this.

@clason mentioned that highlights.scm is very editor-specific, and I agree: I think, for velocity reasons at the very least, that should remain with editors, and that the highlights.scm in the grammar repos is intended both as a jumping-off-point for other editors and for consumption by GH’s highlighting service. But if we have the opportunity to fix GH-side rendering bugs by piggybacking off your work, we’d be tremendously appreciative. (Feel free to @ me in an issue or email me at patrickt@github.com if you encounter such an issue.)

On the other hand, it does strike me that folds.scm and indents.scm would be generally reusable, especially since this (if we integrate it into TS) would be a more greenfield effort. It sounds like we have an opportunity to maybe ease y’all’s maintenance burden with regards to the quirks of the indentation engine.

Oh, and @clason, your point about parser compatibility is very much on the money. I’m cc’ing @maxbrunsfeld, the tree-sitter author, to see if he has any thoughts on the issue. It seems like this is an issue that more and more people are going to encounter.

@theHamsta: My next steps are to talk with the internal GitHub teams that would have some sort of material interest (there are several aside from us on Semantic Code) in using tree-sitter to extract these data. Tree-sitter parsers and queries scale to GH traffic extremely well, so I think there’s a compelling business case to be made there, but I don’t know what other technologies are under consideration, or if they’ve got working prototypes already. It’s possible this all might not pan out, but it’s worth a try, right? 😄 I’m intrigued by the way Helix does referencing of parent languages… will definitely dig into that, too. (As we offer more features for JS/TS, effective reuse pays dividends, as I’m sure you all know well.)

I think I’m going to close this issue out now so that it doesn’t clog up your issues board, but I’ll leave further comments on this issue if and when I have an update. Thank you all very much for being so helpful!

@patrickt patrickt closed this as completed Jan 6, 2023
@clason
Contributor

clason commented Mar 2, 2023

@patrickt We have been thinking more about centralizing queries, in particular highlight queries. While I still don't believe it's possible to completely share queries verbatim, I'm very interested in minimizing the divergence where possible and collaborating on a single source of truth that can be used by multiple editors with only minimal customization (like adding our @conceal and @spell, which I don't think other editors have).

As GitHub is a major player in this field (albeit for a rather limited set of languages, compared with the ~170 we maintain here), I'd be interested in hearing your thoughts and plans on this.

I'm also tagging @the-mikedavis for Helix, which has a similarly broad language support strategy (but a somewhat different capture naming scheme, which I would love to synchronize with).

(Other editors using tree-sitter I'm aware of are Zed -- which is closed-source, and I couldn't find any info on query and captures for -- and Emacs -- which only just added tree-sitter support in core and doesn't yet bundle queries, while the older https://github.com/emacs-tree-sitter/tree-sitter-langs/tree/master/queries is comparatively limited and has low activity.)

@the-mikedavis

Being able to share queries would be really nice! I'm not sure how to accomplish it though because the queries tend to be implementation-specific. Indentation is a good example: the Neovim and Helix systems work differently so the queries end up looking different as well. Some features that are straightforward like folding would probably be easy to share though (although folding is not implemented in Helix so I can't say this confidently).

There's a big hurdle to sharing queries though: Neovim's queries have reversed precedence compared to Helix, the tree-sitter CLI and I think emacs as well as GitHub's syntax highlighting tool (based on the order of queries here). The last stanza in a nvim-treesitter query file will override any stanzas that come before it and match the same pattern, but in Helix query files, the first stanza that matches overrides any later stanzas that also match. For example, these two stanzas overlap in the Elixir highlights:

; Identifiers
(identifier) @variable
; Unused Identifiers
((identifier) @comment (#match? @comment "^_"))

In nvim-treesitter, the identifier gets the @comment capture if it starts with an underscore but in Helix that first stanza would always override the second and the identifier would be captured as @variable. So Helix has stanzas in reverse order compared to Neovim queries.
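The difference can be modelled with a toy resolver over the two stanzas above. This is only a sketch of the observable outcome, not either editor's implementation; `STANZAS`, `matching_captures`, and `resolve` are names made up for the example, and Python's `re` stands in for each editor's own regex engine.

```python
import re

# The two overlapping Elixir stanzas from above, in file order:
# (capture name, optional regex the node text must match).
STANZAS = [
    ("variable", None),   # (identifier) @variable
    ("comment", r"^_"),   # ((identifier) @comment (#match? @comment "^_"))
]


def matching_captures(text):
    """Captures of every stanza that matches an identifier, in file order."""
    return [name for name, pattern in STANZAS
            if pattern is None or re.search(pattern, text)]


def resolve(text, last_wins):
    """Pick one capture: the last match (nvim-treesitter) or the first (Helix)."""
    captures = matching_captures(text)
    return captures[-1] if last_wins else captures[0]


for ident in ("_unused", "count"):
    print(ident, resolve(ident, last_wins=True), resolve(ident, last_wins=False))
```

With first-match-wins, the stanza order would have to be reversed for `_unused` to be captured as a comment, which is why the same query file cannot be shared verbatim.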

It would be nice to align on the captures we use for syntax highlighting as well. That would be tricky to do in practice though because any scopes that change would break themes - if Helix switched from @variable.parameter to @parameter like Neovim for example, that would break Helix themes and the same goes for the other way around if I understand Neovim tree-sitter-based theming correctly.

@clason
Contributor

clason commented Mar 3, 2023

Being able to share queries would be really nice! I'm not sure how to accomplish it though because the queries tend to be implementation-specific.

Well, first step is talking about it seriously and taking stock of what would have to be done (and then do it, if it turns out to be feasible) ;)

One thing to keep in mind is that this repo was meant as a prototype for implementations that eventually end up in Neovim core. We are now in the process of (incrementally) moving things that work well into core -- which is a chance for changing things, so this is now a good time to have this discussion. (Tree-sitter integration in Neovim is still marked as experimental, so we are free to make breaking changes for good reasons -- and sharing queries would definitely be one!)

Indentation is a good example: the Neovim and Helix systems work differently so the queries end up looking different as well.

I haven't much used it myself, but my impression is that the system here (not in Neovim core!) is not working too well, at least for languages like Python. If Helix's works better, we should take a look at it and see if we can align when we implement it in core. If nothing else, indents and folds should be writeable in an editor-agnostic way.

There's a big hurdle to sharing queries though: Neovim's queries have reversed precedence compared to Helix, the tree-sitter CLI and I think emacs as well as GitHub's syntax highlighting tool (based on the order of queries here).

That is a good point. If Neovim is the odd one out, that's a strong incentive to change things. I suspect that the reason we did it this way is for easier user customization: we concatenate all queries on runtime path so users can override individual queries in their own config. Does Helix support extending/overriding queries like that? (We use ; extends and ; inherits keywords for that.) But there's no reason we can't concatenate them in reverse order.

EDIT Turns out that this is incorrect; we are just missing the sort of "early bail" logic from tree-sitter CLI (and Helix, which uses the same code) so we always highlight all matches and rely on draw order to make more specific matches have precedence (see neovim/neovim#22495).

Another issue I believe is the way we do injections (needing the #exclude_children! directive in many cases), which would also have to be changed to be compatible.

EDIT Upstream compatible implementation will be added in neovim/neovim#22518

It would be nice to align on the captures we use for syntax highlighting as well. That would be tricky to do in practice though because any scopes that change would break themes

We already did this once; we have no problem doing it again ;) (But that should be the last time...) I personally don't mind switching (mostly) to Helix' scheme, which I believe aligns (more) closely with Atom/TextMate naming? If you were ready to adapt some captures to ours (for example, I prefer the way we do LaTeX math captures), I'm sure we could come up with a joint document that upstream parser devs could also rely on (e.g., living in the tree-sitter Wiki).

One issue is that we make heavy use of Atom-style fallbacks (@keyword.special falls back to @keyword if the former is not defined in a theme); I expect Helix does too? Otherwise that would be hard to make align.

At the very least we could discuss naming and try to be as consistent as possible -- the more names are shared, the less work in adapting queries between us or from upstream! (If you prefer, we could discuss this and other things on Matrix; I can open a chat room and maybe try to find some people from Emacs to loop in, too?)

@clason
Contributor

clason commented Mar 3, 2023

(reopening this for more visibility and to get back to it more easily)

@clason
Contributor

clason commented Mar 4, 2023

@the-mikedavis another point of divergence is predicates and directives. It's clear that other editors won't support #lua-match? and that tree-sitter itself will probably not add too many, but do you think at least the two of us could agree on a "greatest common denominator" list both Neovim and Helix support? We'd then try to make do with those.

Or is adding new predicates so much harder for you than for us (where it's fairly trivial)?

@the-mikedavis

We are now in the process of (incrementally) moving things that work well into core -- which is a chance for changing things, so this is now a good time to have this discussion. (Tree-sitter integration in Neovim is still marked as experimental, so we are free to make breaking changes for good reasons -- and sharing queries would definitely be one!)

Ah excellent! We try to minimize breaking changes in general in Helix but I think we can afford to make some changes on our end for the sake of compatibility. We always end up with a handful of breaking changes per release anyways. Better compatibility of queries, themes and approaches to tree-sitter features seems to me like a win for everybody 🙂

I haven't much used it myself, but my impression is that the system here (not in Neovim core!) is not working too well, at least for languages like Python. If Helix's works better, we should take a look at it and see if we can align when we implement it in core. If nothing else, indents and folds should be writable in an editor-agnostic way.

I think that the Python indentations have some rough edges in Helix as well although I don't typically write Python myself. There are also some other improvements we want to make to our indentation like outdenting automatically. It seems like potentially a lot of work but maybe we can collaborate on an indentation approach that we can share?

(cc @Triton171 who authored most of the indentation code for Helix)

we concatenate all queries on runtime path so users can override individual queries in their own config. Does Helix support extending/overriding queries like that?

We have support for ; inherits: <lang1>,<lang2>,...,<langN> but that's mostly used for cases like C++ extending C's highlights. We don't have a way for users to override or extend the built-in queries with custom queries currently. I think a feature like that might be tricky to design because sometimes you want to disable a specific pattern or remove chunks of the query. We have helix-editor/helix#3346 tracking this and currently we're thinking of a $PATH-like solution but queries or parsers in user directories would replace the ones from installation rather than merging.

I personally don't mind switching (mostly) to Helix' scheme, which I believe aligns (more) closely with Atom/TextMate naming? If you were ready to adapt some captures to ours (for example, I prefer the way we do LaTeX math captures), I'm sure we could come up with a joint document that upstream parser devs could also rely on (e.g., living in the tree-sitter Wiki).

Yep we try to adhere to textmate when possible. I'm definitely open to changing up some scopes so our queries and themes can be closer, especially some of the lesser-used ones like math. Our captures aren't set in stone, we do like to revise them when there's an opportunity for improvements (for example helix-editor/helix#4892). It would be really cool to be able to trivially convert Neovim queries and themes to Helix queries and themes and vice versa!

We also use the fallbacks for captures extensively. I think our systems work the same way: @constant.numeric.integer falls back to @constant.numeric and then to @constant.

we could discuss this and other things on Matrix

Matrix would be perfect - we use Matrix for a lot of Helix discussion as well. If you make a room, would you mind inviting me (@the-mikedavis:matrix.org) and @archseer (@speed:matrix.org)?

another point of divergence is predicates and directives. It's clear that other editors won't support #lua-match? and that tree-sitter itself will probably not add too many, but do you think at least the two of us could agree on a "greatest common denominator" list both Neovim and Helix support?

Yeah the regular expression ones seem like a hard problem to solve very consistently between editors. Our #match? just compiles the string with the rust-lang/regex crate. I think it might be an anti-pattern to use fancy regular expression features in queries anyways so maybe the differences in engines won't really matter in practice.

Coming up with a list we can both support sounds do-able 👍. There are some from here that we already want to support like #any-of? that would help a lot with compatibility. Adding new predicates and properties isn't totally trivial in Helix but it's not terribly difficult either.

One other compatibility point - does neovim support combined injections (the (#set! injection.combined) property)? I thought that these were disabled in nvim because of some bugs but I am probably out of date on this. We use them mostly for templating languages like erb/ejs or markup like markdown so that you can combine injected ranges into a single document. Not so many languages use this though so it's probably not a huge deal.
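To illustrate why combining matters for templating languages, here is a small sketch using an ERB-like template. It is a simplification: real combined injections hand the parser a set of included ranges rather than a joined string, and `injected_ranges` and `injection_documents` are hypothetical helpers written for this example.

```python
import re

# Assumed example input: an ERB-like template with two code regions.
TEMPLATE = "<% if user %>Hello<% end %>"


def injected_ranges(text):
    """Character ranges of the code between '<%' and '%>' delimiters."""
    return [m.span(1) for m in re.finditer(r"<%(.*?)%>", text)]


def injection_documents(text, ranges, combined):
    """Return the documents handed to the injected language's parser.

    combined=True models injection.combined: all ranges form one document.
    combined=False parses every range on its own.
    """
    snippets = [text[start:end] for start, end in ranges]
    return ["\n".join(snippets)] if combined else snippets


ranges = injected_ranges(TEMPLATE)
print(injection_documents(TEMPLATE, ranges, combined=False))
print(injection_documents(TEMPLATE, ranges, combined=True))
```

Parsed separately, the second region contains only `end`, which is not valid Ruby on its own; combined, the `if` and `end` land in the same document.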

@archseer

archseer commented Mar 5, 2023

From what I understand, the implementation in Neovim is inherently different from the tree-sitter Rust crate and Helix, so even if we have compatible scopes it's not guaranteed to work well across editors. In particular ordering queries from least to most specific vs the other way around

Edit: Sorry, I see @the-mikedavis already clarified most of this! In general I think it's OK for editors to use different scopes since features work differently

@lewis6991
Member

In particular ordering queries from least to most specific vs the other way around

This only applies to highlighting. The TS implementation will check queries in the same order we do, but will stop adding highlights if it has already added a highlight for a specific range. Neovim on the other hand will just apply all the matches and will stack the highlights. This allows one match to just set underline whilst another will set the background, whilst another might set the foreground.
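The two strategies can be contrasted in a few lines. This is a minimal model under stated assumptions: ranges are plain tuples, styles are dicts of attributes, and `highlight_first_only` and `highlight_stacked` are illustrative names, not functions from tree-sitter or Neovim.

```python
def highlight_first_only(matches):
    """Keep only the first match per range (the range check described above)."""
    styles = {}
    for rng, attrs in matches:
        if rng not in styles:  # range already highlighted: skip later matches
            styles[rng] = dict(attrs)
    return styles


def highlight_stacked(matches):
    """Apply every match; later attributes are layered on top of earlier ones."""
    styles = {}
    for rng, attrs in matches:
        styles.setdefault(rng, {}).update(attrs)
    return styles


# Hypothetical matches for one node spanning offsets 0..5, in query order.
MATCHES = [
    ((0, 5), {"underline": True}),
    ((0, 5), {"bg": "yellow"}),
    ((0, 5), {"fg": "red"}),
]
print(highlight_first_only(MATCHES))
print(highlight_stacked(MATCHES))
```

With the range check, only the underline survives; with stacking, all three attributes end up on the same range.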

We've talked about adjusting our highlighter to add the same range check, and use a custom directive to support highlight stacking.

@lewis6991
Member

lewis6991 commented Mar 5, 2023

One other compatibility point - does neovim support combined injections (the (#set! injection.combined) property)?

Yes this is supported. I've also raised a PR to support the same injections formats as what you use. neovim/neovim#22518

@clason
Contributor

clason commented Mar 5, 2023

@archseer yes, just to be clear: the whole purpose of this discussion is to find out which differences are necessary and which are not; the goal is then to remove the latter (and document the former) so that more (most) of the queries can be shared -- ideally, we can have an upstream source of truth we both (and Emacs) can pull from with minor modifications where necessary. Just like we do with Vim, we'd ask contributors to make "common" improvements upstream instead of in our repo. We are also happy to make breaking changes in our code to facilitate that.

To summarize: the big differences we have identified so far are

  1. injection format -- we will change that to default to the upstream (and Helix) format before the 0.9 release (end of March/beginning of April), see Lewis' linked PR.
  2. query order -- we are discussing improving our implementation so that precedence of captures (where necessary) is determined by the same order of query stanzas as in Helix (probably after 0.9, so sometime in spring)
  3. capture names -- we are fine with changing our custom scheme to align more closely with Helix's; ideally we can put our heads together and come up with the best of both worlds that upstream queries can safely use (changing the queries here will be a longer effort over the summer, I'm afraid).
  4. predicates/directives -- this is probably the area where we have to accept the most divergence, but even there we should be able to come up with a set of common directives (that are named and behave the same!)

@clason
Contributor

clason commented Mar 5, 2023

@the-mikedavis

I think that the Python indentations have some rough edges in Helix as well although I don't typically write Python myself. There are also some other improvements we want to make to our indentation like outdenting automatically. It seems like potentially a lot of work but maybe we can collaborate on an indentation approach that we can share?

I think this would be a good idea: we should be able to share the logic (and capture names, of course) even if the implementation is different. A fresh implementation in core would be the ideal occasion.

We have support for ; inherits: <lang1>,<lang2>,...,<langN> but that's mostly used for cases like C++ extending C's highlights.

Same in Neovim.

We don't have a way for users to override or extend the built-in queries with custom queries currently. I think a feature like that might be tricky to design because sometimes you want to disable a specific pattern or remove chunks of the query.

Yes, I think that will be the biggest design issue -- user extensibility is baked into the DNA of (Neo)vim, so this is a primary concern for us.

We have helix-editor/helix#3346 tracking this and currently we're thinking of a $PATH-like solution but queries or parsers in user directories would replace the ones from installation rather than merging.

This is exactly how it (now) behaves in Neovim as well; we just have an additional ; extends directive to explicitly request merging. (The order is based on our different precedence behavior, but that is an implementation detail that is easy to change.)
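A minimal sketch of that lookup, assuming query files arrive in priority order with user directories first: the first file without a `; extends` modeline replaces all other non-extension files, and every `; extends` file is merged in. The function names and the exact merge order are assumptions made for illustration, not Neovim's actual code.

```python
def is_extension(query_text):
    """True if a leading comment line is the '; extends' modeline."""
    for line in query_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith(";"):
            return False  # reached the first pattern without a modeline
        if stripped.lstrip(";").strip() == "extends":
            return True
    return False


def resolve_query(files):
    """Combine query files given in priority order (highest priority first).

    Assumption: the first file without '; extends' is the base and hides
    all later non-extension files; '; extends' files are appended to it.
    """
    base = None
    extensions = []
    for text in files:
        if is_extension(text):
            extensions.append(text)
        elif base is None:
            base = text
    return ([base] if base is not None else []) + extensions


# Assumed example files: two possible user files and one built-in file.
USER_EXTENSION = "; extends\n(comment) @spell\n"
USER_REPLACEMENT = "(identifier) @variable.user\n"
BUILTIN = "(identifier) @variable\n"

print(resolve_query([USER_EXTENSION, BUILTIN]))
print(resolve_query([USER_REPLACEMENT, BUILTIN]))
```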

We also use the fallbacks for captures extensively. I think our systems work the same way: @constant.numeric.integer falls back to @constant.numeric and then to @constant.

Yep, that's exactly how it works (except that we have an additional implicit fallback @constant.numeric.integer.<lang> so people can use language-specific theming. Very useful!)
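The fallback chain, including the language-specific suffix, could be sketched as follows. The theme dictionary and the `resolve_capture` helper are invented for this example; real themes map captures to highlight groups rather than strings.

```python
def resolve_capture(capture, theme, lang=None):
    """Return the most specific theme entry for a capture, or None.

    Tries '<capture>.<lang>' first (if a language is given), then the
    capture itself, then each shorter dotted prefix.
    """
    parts = capture.split(".")
    candidates = []
    if lang is not None:
        candidates.append(".".join(parts + [lang]))
    for end in range(len(parts), 0, -1):
        candidates.append(".".join(parts[:end]))
    for name in candidates:
        if name in theme:
            return theme[name]
    return None


# Hypothetical theme that defines only some of the scopes.
THEME = {
    "@constant": "Constant",
    "@constant.numeric": "Number",
    "@constant.numeric.integer.python": "PythonInteger",
}
print(resolve_capture("@constant.numeric.integer", THEME))
print(resolve_capture("@constant.numeric.integer", THEME, lang="python"))
```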

Yeah the regular expression ones seem like a hard problem to solve very consistently between editors. Our #match? just compiles the string with the rust-lang/regex crate. I think it might be an anti-pattern to use fancy regular expression features in queries anyways so maybe the differences in engines won't really matter in practice.

Yeah, #match? is going to be the biggest issue. We actually have two predicates: #vim-match? uses Vim's (unique) regex engine, while #lua-match? uses Lua patterns. We currently have #match? as an alias for #vim-match?, but we should probably remove that (or restrict it to regex patterns that are so simple that they behave the same in any regex engine). I agree that these should be minimized (they are definite performance killers in some queries), but I don't think you can fully do without them for some "barebones" grammars (CMake, I'm looking at you...)

One other compatibility point - does neovim support combined injections (the (#set! injection.combined) property)?

At the risk of beating a dead horse: It will very soon, thanks to Lewis' PR that he linked. This will become our default, so injection queries should be fully compatible going forward.

Matrix would be perfect - we use Matrix for a lot of Helix discussion as well. If you make a room, would you mind inviting me (@the-mikedavis:matrix.org) and @archseer (@speed:matrix.org)?

Done!

@clason clason closed this as completed Mar 5, 2023
@Triton171

Since I've written a large part of Helix's current indentation system, I'd be very interested in collaborating in order to create a specification & reference implementation for tree-sitter indentation. It's definitely not easy to find a set of rules that work for all languages (especially since it'll be harder to change later if the system is used in multiple editors) but sharing the queries would eliminate a lot of duplicate work. If someone from nvim-treesitter wants to work on this as well, feel free to ping me or message me on Matrix (@triton171:matrix.org).

@leiserfg

leiserfg commented Mar 6, 2023

Yeah, #match? is going to be the biggest issue. We actually have two predicates: #vim-match? uses Vim's (unique) regex engine, while #lua-match? uses Lua patterns.

What do you think of using ECMAScript regex for matches? That's something everybody can agree on, as nvim and emacs could vendor libregexp from https://bellard.org/quickjs/ and helix could do the same (or use https://github.com/ridiculousfish/regress, though I'm not sure how good it is).

@clason
Contributor

clason commented Mar 6, 2023

What do you think of using ECMAScript regex for matches?

Sorry, no. We have Lua for performance and Vim for maximal flexibility; we are not going to add a third regex(-like) engine in core.

The real solution is to avoid #match? and use #any-of? instead wherever possible.
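As a sketch of what that rewrite looks like for the simplest case, the helper below turns an anchored alternation of plain words into an #any-of? predicate and leaves everything else untouched. `to_any_of` and `rewrite_predicate` are hypothetical helpers for illustration; no such tool is mentioned in this thread.

```python
import re

# Matches patterns of the form ^(word|word|...)$ with plain-word alternatives.
SIMPLE_ALTERNATION = re.compile(r"^\^\((\w+(?:\|\w+)*)\)\$$")


def to_any_of(pattern):
    """Return the alternatives if `pattern` is a plain anchored alternation."""
    match = SIMPLE_ALTERNATION.match(pattern)
    return match.group(1).split("|") if match else None


def rewrite_predicate(capture, pattern):
    """Rewrite (#match? ...) as (#any-of? ...) when that is safe, else keep it."""
    words = to_any_of(pattern)
    if words is None:
        return f'(#match? {capture} "{pattern}")'
    quoted = " ".join(f'"{word}"' for word in words)
    return f"(#any-of? {capture} {quoted})"


print(rewrite_predicate("@keyword", "^(if|else|while)$"))
print(rewrite_predicate("@comment", "^_"))
```

String comparison needs no regex engine at all, so a predicate rewritten this way behaves identically in every editor that supports #any-of?.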

@lewis6991
Member

How does https://github.com/bellard/quickjs/blob/master/libregexp.c compare to vim regex? Would it be possible to take the union? Surely the basics \s, ., *, (), [] all behave the same?

@clason
Contributor

clason commented Mar 6, 2023

Yes, I think the vast majority of captures will only require basic "standard" regex features, which should be documented. (A common design document for queries -- including documentation of the remaining necessary divergences as a Rosetta Stone -- is precisely one of the goals of this discussion.)

@leiserfg

leiserfg commented Mar 6, 2023

libregexp implements ECMAScript regex; Vim regex is similar but differs in important details. I tried to write a compiler from ECMAScript regex to Vim regex in pure Lua, but ended up abandoning the project because some backtracking behavior in Vim regex and ECMAScript doesn't work the same way. But using a common subset is indeed possible.
The only issue would then be to guarantee somehow that the patterns used stay within that subset of common features.
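One way to enforce such a subset would be a lint step that rejects any pattern containing a construct outside an agreed allowlist. The sketch below shows the mechanism only: the allowlist is a placeholder built from the constructs named in this thread, not a verified common subset of Vim and ECMAScript regex, and `unsupported_constructs` is a hypothetical helper.

```python
import re

# Placeholder allowlist (an assumption, not an agreed specification).
# Each alternative matches one construct that is allowed in a pattern.
ALLOWED_TOKEN = re.compile(
    r"""
      [A-Za-z0-9_]             # literal word characters
    | \\[s.]                   # \s and an escaped dot
    | [.^$*()|]                # any char, anchors, repetition, grouping
    | \[\^?[A-Za-z0-9_-]+\]    # simple character class such as [a-z_]
    """,
    re.VERBOSE,
)


def unsupported_constructs(pattern):
    """Return the characters of `pattern` that fall outside the allowlist."""
    rejected = []
    pos = 0
    while pos < len(pattern):
        match = ALLOWED_TOKEN.match(pattern, pos)
        if match:
            pos = match.end()
        else:
            rejected.append(pattern[pos])
            pos += 1
    return rejected


for candidate in ("^[A-Z][A-Z0-9_]*$", "(?<=a)b"):
    print(candidate, unsupported_constructs(candidate))
```

A check like this could run in CI for a shared query repository, so that patterns relying on engine-specific features are caught before they reach any editor.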

@justinmk

justinmk commented Mar 7, 2023

@jcs090218 (https://github.com/emacs-tree-sitter/ts-fold) and @meain (https://github.com/meain/evil-textobj-tree-sitter/) are working on emacs tree-sitter integration, would you be interested in aligning on #3944 (comment) ? Are there others in the emacs space we should contact?

@clason
Contributor

clason commented Mar 7, 2023

@jcs090218 (https://github.com/emacs-tree-sitter/ts-fold) and @meain (https://github.com/meain/evil-textobj-tree-sitter/) are working on emacs tree-sitter integration, would you be interested in aligning on #3944 (comment) ? Are there others in the emacs space we should contact?

Just hit me up on Matrix if you want to be part of the (ongoing) discussion!

@meain
Contributor

meain commented Mar 7, 2023

I've been passively following this thread. It would be great to have this upstreamed (though distribution might still have to happen within each repo, unless we can sort out something better). Let me ping on Matrix (@meain:matrix.org).

Tagging a few other folks from Emacs working on tree-sitter: @casouri @ubolonton

@maxbrunsfeld

maxbrunsfeld commented Mar 7, 2023

👋 I wrote Tree-sitter and I now work on the Zed code editor, which is not yet open-source, but will be (probably by the end of this year).

I'd be interested in an effort to standardize some queries, though I think it makes sense to be somewhat conservative about what gets included in the main Tree-sitter grammar repositories. All of these editors' approaches to syntax highlighting, code folding, and auto-indent are still evolving, and so it would be a shame to codify a flawed or limited system into the main language repos.

Highlighting

Syntax highlighting seems like the lowest-hanging fruit for standardization. Having now implemented Tree-sitter-based syntax highlighting twice (once in the tree-sitter-highlight crate used by GitHub, and once in Zed) I have some regrets about the way tree-sitter-highlight works.

  1. query precedence order - This is fairly minor, but right now, the tree-sitter-highlight crate prefers matches listed earlier in the query, so that you need to list the most specific patterns first. I noticed that Neovim uses the opposite convention, and I think this may be slightly more intuitive, because it matches the behavior of the "cascade" in CSS. In Zed, we also found it slightly easier to implement with the more specific matches occurring later (like Neovim).

  2. coupling to locals.scm query - Right now, the tree-sitter-highlight crate supports a "locals query" which can perform simple resolution of local variables, for the purpose of enforcing that all occurrences of a variable have the same highlight color. There are predicates supported in the highlights query that make use of this local variable tracking.

    This allows for better syntax highlighting, especially in some languages like Ruby, but it isn't really feasible to implement in a code editor, where you need to re-highlight changed regions. I wish that the local-variable functionality was somehow relegated to a different query, because it makes it impossible to re-use the upstream highlight queries in code editors.

  3. capture names - I'm mostly happy with the simple capture names that are used in the upstream highlight queries. It seems like Helix has mostly adopted the same conventions (and has documented them quite nicely). But that set of capture names was developed for use on GitHub.com, where there is only a single fixed (and very minimal) theme. In Zed, we're using a similar set of capture names successfully, but we've only ported a handful of themes from other editors.

    Tree-sitter-highlight's capture names are intentionally not the same as TextMate's scope names, which I still think is good, because TextMate's system was designed with some different goals (scope names were the only form of syntactic information in that editor, so they are set up to support non-highlighting uses, and have a complex nested structure).
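The two precedence conventions from item 1 can be illustrated with a small Python sketch. It is not taken from tree-sitter-highlight, Neovim, or Zed: `Match`, `pattern_index`, and `pick_capture` are hypothetical names, and it assumes each match on a node carries the index of the query pattern that produced it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    pattern_index: int  # position of the pattern within the query file
    capture: str        # capture name, e.g. "variable"


def pick_capture(matches: list[Match], later_wins: bool) -> str | None:
    """Choose the winning capture among all matches on a single node.

    later_wins=False models "earlier pattern wins" (most specific first);
    later_wins=True models the CSS-like "later pattern wins" convention.
    """
    if not matches:
        return None
    chooser = max if later_wins else min
    return chooser(matches, key=lambda m: m.pattern_index).capture


# A generic pattern at index 0 and a more specific one at index 3.
matches = [Match(0, "variable"), Match(3, "constant")]
later = pick_capture(matches, later_wins=True)
earlier = pick_capture(matches, later_wins=False)
```

With the same query file, `later` is "constant" and `earlier` is "variable", which is why a query written for one convention cannot be reused unchanged under the other.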

I'm not sure how to move forward on changing any of these things, but I'd be curious to hear other people's opinions on these issues. AFAIK, GitHub is the biggest stakeholder in the current system.

Injection

For what it's worth, Zed's language injection scheme uses the same query format as the upstream injections.scm query files for tree-sitter-highlight. The combined feature is supported, and is important for templating languages like PHP, ERB, etc. So as far as I know, it'd be reasonable for other text editors to standardize on the current format, or something pretty close to it.

Folding, Auto-indent

These systems are still evolving in Zed, and from what I can tell, they're still evolving in the other editors as well. Personally, I'd be reluctant to try to define a standardized format for indents and folds at this point, especially when even syntax highlighting queries are still not shared between different applications. Possibly I'm being too conservative though 🤷 .

@clason
Contributor

clason commented Mar 7, 2023

Thanks for chiming in! It's always great to hear that Neovim Did It Right (first try \o/) :)

More seriously, it's very interesting to hear about Zed's approach, which is a) the closest thing to "author intent" and b) a closed book to the general community right now.

I agree that we should not ossify things for the sake of it right now. The point is more to share ideas and experiences so people don't have to reinvent the wheel. It's also clear that significant effort has gone into creating queries for new parsers, and the more ecosystems profit from that effort, the better for everyone involved -- like LSP, economy of scale is one of the major (albeit not only) value propositions of using tree-sitter in an editor.

So the goal is less the standardization and more the communication (and documentation across editor silos).

(Also, reopening -- again -- since the good comments just keep coming.)

@clason
Contributor

clason commented Jun 7, 2023

@maxbrunsfeld Another related issue is parser versioning. As far as I understand it, the parsers under the tree-sitter org follow tree-sitter releases. That was reasonable when tree-sitter had frequent ABI changes, but things have been much more stable in recent times (and tree-sitter supports multiple ABI versions now). This means that the newest tags on most of these parsers are almost two years old! This is a problem, since it means that downstream (us, Helix) will track HEAD instead, which makes it much harder to coordinate compatible queries.

So my question to you (and the rest of the Tree-sitter Team): Would you consider decoupling parser releases -- and making more frequent releases -- for the grammars you maintain? Ideally, using semantic versioning along the lines of

  • patch releases for parser changes that do not affect queries at all (grammar.js or scanner.c bugfixes),
  • minor releases which add nodes (meaning old queries will still work, but need changes to make full use of the updated grammar),
  • major releases which remove or rename nodes (meaning old queries will no longer work).

This could be automated by requiring conventional commits (fix, feat, and feat!, respectively) and using something like https://github.com/google-github-actions/release-please-action.
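As a rough illustration of that mapping, here is a Python sketch that derives the version bump from commit subject lines. It is not how release-please works internally; `required_bump` and `next_version` are hypothetical helpers. It assumes that only the subject line is inspected and, following the Conventional Commits convention, that a "!" after any type (not only feat) marks a breaking change.

```python
from __future__ import annotations

import re

# Conventional-commit subject: type, optional (scope), optional "!", then ": ".
_HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?(?P<breaking>!)?: ")
_RANK = {None: 0, "patch": 1, "minor": 2, "major": 3}


def required_bump(subjects: list[str]) -> str | None:
    """Return the largest bump implied by the subjects, or None."""
    bump = None
    for subject in subjects:
        match = _HEADER.match(subject)
        if match is None:
            continue  # not a conventional commit; ignored in this sketch
        if match.group("breaking"):
            level = "major"   # removes or renames nodes: old queries break
        elif match.group("type") == "feat":
            level = "minor"   # adds nodes: old queries keep working
        elif match.group("type") == "fix":
            level = "patch"   # bugfix that does not affect queries
        else:
            level = None
        if _RANK[level] > _RANK[bump]:
            bump = level
    return bump


def next_version(current: str, subjects: list[str]) -> str:
    """Apply the implied bump to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    bump = required_bump(subjects)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

Downstream query maintainers could then pin to a major version and know that their queries still load, even if they do not yet use newly added nodes.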

@tgross35

tgross35 commented Jan 4, 2024

Should there just be a page in tree-sitter documentation that describes expected attributes and behavior? I don't see anything in the upstream tree-sitter but I may be able to start one (@Triton171 maybe you had something here, from #3944 (comment)).

Fold seems pretty straightforward with @fold, though there seem to be some nvim-specific directives to help (e.g. #trim!).

Indenting is trickier. Helix's indent queries are much better documented; based on the comments above, it seems like this could be a good starting point?

A universal text syntax would also be great so it's easy to test implementations against one another (similar to what exists for highlighting). Roughly something like:

!comment %
% comments will be stripped before sending to the engine (# comments stay)

def foo(a):      % check-fold
           % <- cursor-1
    % <-indent-1
    if a:        % check-fold
          % <- cursor-2
        % <-indent-2
        bar()
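A harness for a format like this could start by separating the source text from the annotations. The Python sketch below is one possible reading of the example, not a specification: it assumes the first line is a `!comment <marker>` header, that only comments starting with `check-` or `<-` are assertions (other comments are simply dropped), and that a comment-only line refers to the source line above it. `Assertion` and `parse_test_file` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    line: int    # 0-based line in the stripped source the assertion refers to
    column: int  # column of the comment marker in the original line
    text: str    # e.g. "check-fold" or "<- cursor-1"


def parse_test_file(text: str) -> tuple[str, list[Assertion]]:
    """Split a test file into engine input and a list of assertions."""
    lines = text.splitlines()
    header = "!comment "
    if not lines or not lines[0].startswith(header):
        raise ValueError("expected a '!comment <marker>' header")
    marker = lines[0][len(header):].strip()
    if not marker:
        raise ValueError("empty comment marker")

    source: list[str] = []
    assertions: list[Assertion] = []
    for raw in lines[1:]:
        code, sep, comment = raw.partition(marker)
        comment = comment.strip()
        if sep and not code.strip():
            # Comment-only line: it annotates the previous source line
            # and is not passed on to the engine.
            target = len(source) - 1
        else:
            target = len(source)
            source.append(code.rstrip())
        if comment.startswith(("check-", "<-")):
            assertions.append(Assertion(target, len(code), comment))
    return "\n".join(source), assertions
```

This naive split treats the first occurrence of the marker on a line as the start of a comment, so a marker that can also appear in real code (such as `%` in Python) would need escaping or a different marker choice.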

Other random thoughts:

  • Is there any sense in allowing some of these files to be combined? highlights.scm and locals.scm frequently have the same matchers, it would sometimes be nice to just do @local captures at the same time as highlighting.
  • bindings.gyp calls out include paths, source files, and flags for building the parser. Could nvim and others just read this instead of needing this information in parsers.lua? I think every tree-sitter repo has one. Maintainer could probably be read from package.json fields author + maintainers.
  • Perhaps there should be a registry under https://github.com/tree-sitter/ with a list of all active implementations. Any editor could then pull from this.

@clason
Contributor

clason commented Jan 5, 2024

> Is there any sense in allowing some of these files to be combined? highlights.scm and locals.scm frequently have the same matchers, it would sometimes be nice to just do @local captures at the same time as highlighting.

No, mixing locals and highlights is a mistake; leave semantic highlighting to language servers (which can actually do it properly). Locals are used for selection and movement in Neovim, and this will not change.

> bindings.gyp calls out include paths, source files, and flags for building the parser.

They need to be manually set, and from a brief check, many parsers don't bother with that and just have the default (incorrect) file. If we can't 100% rely on these, they're not really useful.

That would be nice if it worked, though!

> Maintainer could probably be read from package.json fields author + maintainers.

No, that is wrong. The maintainer here is the person responsible for the queries and the Neovim integration, which often is not the same person as the grammar maintainer (who may care little for Neovim).

> Perhaps there should be a registry under https://github.com/tree-sitter/ with a list of all active implementations. Any editor could then pull from this.

I'm not sure that's feasible; how do you tell which parser is "active"? I'm already tracking more than 400 parsers, and I'm sure I'm missing quite a few.

