Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add glob file type support #8006

Merged
merged 11 commits into from
Feb 11, 2024

Conversation

gjabell
Copy link
Contributor

@gjabell gjabell commented Aug 19, 2023

One of the issues I've run into with Helix (coming from nvim) is the ability to specify file types based on a full path pattern and not just a suffix. This is necessary for file types like YAML which might need different tooling based on their filesystem context (examples include GitHub Actions and Helm charts) or things like Dockerfile (a common paradigm is to add a suffix for different use cases, e.g. Dockerfile.dev).

It seems that this issue was already discussed in #2455; not sure if there are plans to introduce a different paradigm for file type matching which solves it without globs but I think it's a rather necessary feature either way. Other options would include:

  • adding a separate prefix option => this makes the file-types even more complicated and also might end up with conflicting matches
  • using regex instead of globbing => possibly a better solution (allows for more fine-grained matching), but it feels a bit unnecessary to me; globbing is much simpler and I think still covers all the cases where you'd want to match files based on path

I'm new to Rust and to Helix, so please let me know where improvements can be made or where I'm missing something 🙂

Fixes #7422?

@pascalkuthe
Copy link
Member

We can avoid adding a new deoendy by using globset (which we already depend on)

@pascalkuthe
Copy link
Member

I still want to solve #7029. I think if we do go ahead with this then that could be a good opportunity for fixing that aswell. { glob = "foo" } already behaves the same as the proposed filename matcher there and can be changed to behave the same as suffix by using { glob = "*foo" } so we could get rid of that special cases.

Technically globs can also match file extensions so we could theoretically replace all matching modes with globs for maximum consistency but that's sadly too much of a breaking change. Maybe one day when we migrate to script based config

@pascalkuthe pascalkuthe added E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. A-language-support Area: Support for programming/text languages labels Aug 21, 2023
@gjabell
Copy link
Contributor Author

gjabell commented Aug 21, 2023

I didn't notice we already use globset, thanks for the pointer!

I was considering trying to rewrite all the configs to use globbing for the file extension but I think it would add too many collisions. Maybe if we kept a separate map to try extension globs after the normal globs, but not sure how performant that would be.

@pascalkuthe
Copy link
Member

Yeah that would be too large of a breaking change. Breaking the suffix config is fine IMO, since it's pretty rarely used but for file extensions it would break a ton of user config.

@pascalkuthe pascalkuthe added S-waiting-on-review Status: Awaiting review from a maintainer. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 21, 2023
languages.toml Outdated
@@ -2775,7 +2775,7 @@ source = { git = "https://github.com/kylegoetz/tree-sitter-unison", rev = "98c4e
[[language]]
name = "todotxt"
scope = "text.todotxt"
file-types = [{ suffix = ".todo.txt" }, "todotxt"]
file-types = [{ glob = ".todo.txt" }, "todotxt"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is quite the same thing i believe the indention was to allow *.todo.txt although I never use that format so I am not sure... Which reminds me... I think we need to check glob matches before file extensions to correctly handle cases like this (otherwise this would be detxted as a .txt file so plaintext)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point, I'm also not sure what files this should match; this vim plugin seems to match todo.txt or *.todo.txt so maybe we should have a glob for both? Perhaps someone who uses the tool can provide some clarification.

Currently it tries to match exact filenames, then globs, then extensions (so we don't have collisions).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jan9103 As you added todotxt, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think?

AFAIK the official syntax by todotxt.org is that any filename matching regex ^(.*\.)?todo\.txt$ should be considered.

Under the assumption glob uses unix glob syntax i have a few notes on [{ glob = "todo.txt" }, { glob = ".todo.txt" }, "todotxt"]:

  • the code looks as if */$GLOB_VALUE (*/.todo.txt) is matched as glob, but it should be *.todo.txt
  • todo.txt should probably be matched, but its either the exact filename (and therefore dosn't need glob) or required a . and therefore is already matched by { glob = ".todo.txt" }.
  • .todotxt is a alternative extension sometimes used and therefore should get the same treatment as ".todo.txt". but since its not in the official specs it would be fine to leave it out.

therefore this should probably work:

file-types = [{ glob = "{,*.}todo{,.}txt" }]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for filling jn the details. It should be [{ glob = "todo.txt" }, { glob = "*.todo.txt" }, "todotxt", "todo.txt"] with the syntax from this PR. Plaintext now only matches file extensions with this PR, so you need a glob for the " only filename case". By default globs match relatively paths but if your glib starts with / or * it matches an absolute path instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@pascalkuthe pascalkuthe added R-breaking-change This PR is a breaking change for some behavior S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from a maintainer. labels Aug 21, 2023
@gjabell gjabell force-pushed the add-glob-file-type-support branch 2 times, most recently from 673a6b1 to 6f158a0 Compare August 26, 2023 09:53
.language_config_ids_by_glob
.iter()
.find_map(|(glob, id)| {
if glob.compile_matcher().is_match(path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fairly inefficient, I the very least we should precompile the glob here but even if we do that matching every glob in sequence is fairly slow.

Instead you cab jse a GlobSet. Matching a globset is not much slow than matching a single Glob but matches all globs at once. You can use https://docs.rs/globset/latest/globset/struct.GlobSet.html#method.matches to get a list of all matching patterns. You should then select the longest pattern (if there are multiple matches altough usually there would only be one of then)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, sounds much more efficient!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would make more sense to select the last pattern matched instead of longest pattern, since if there are conflicts it's likely because of someone's personal config. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the longest match makes sense as a tiebreaker. For example, if we switch to using globs for filextensions in the future then *.txt and *.todo.txt (example from above) would both match the same baths but .todo.txt should be used because its longer (its matches are more specific). I can imagine something similar for other globs. I can't really think of a case where would would want to use the more general pattern over the more specific one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new implementation that uses GlobSet; it got a little more complicated since we need to also build the GlobSet, but I think it's ok to handle that in Loader as we only load it once at startup (or when language config changes).

@David-Else
Copy link
Contributor

@gjabell I don't think my Rust is good enough for me to review your code, but thanks for asking :)

@Jan9103
Copy link
Contributor

Jan9103 commented Aug 28, 2023

I'm neither good at rust nor familiar with helixes code(-style) and therefore don't want to give a official approval either, but the code looks good to me.

pascalkuthe
pascalkuthe previously approved these changes Aug 28, 2023
Copy link
Member

@pascalkuthe pascalkuthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am gonna take this for a spin soonish but looking trough the source code this lgtm

gjabell and others added 9 commits February 11, 2024 16:46
Trying to match the file-type raw strings against both filename and
extension leads to files with the same name as the extension having the
incorrect syntax.
It's common practice to add a suffix to dockerfiles based on their
context, e.g. `Dockerfile.dev`, `Dockerfile.prod`, etc.
Match on `.env` or any `.env.*` files.
This is a refactor that improves the error handling for creating
the `helix_core::syntax::Loader` from the default and user language
configuration.
Copy link
Member

@the-mikedavis the-mikedavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look through the languages.toml and I think everything that is a full filename has been converted to a { glob = "<filename>" } 👍

I just have a suggestion for starlark file-types but otherwise this looks good to me

languages.toml Outdated Show resolved Hide resolved
Copy link
Member

@the-mikedavis the-mikedavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for working on this!

@the-mikedavis the-mikedavis added S-waiting-on-review Status: Awaiting review from a maintainer. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 11, 2024
Copy link
Member

@pascalkuthe pascalkuthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great to see this finished! I think moving fully to globs for all ft detection in the future would be great so we don't have to keep different ft variants around (potentially once we switch away from toml config)

@pascalkuthe pascalkuthe merged commit 581a1eb into helix-editor:master Feb 11, 2024
6 checks passed
@gjabell gjabell deleted the add-glob-file-type-support branch February 11, 2024 17:42
pascalkuthe pushed a commit that referenced this pull request Feb 18, 2024
* languages: add docker-compose language

it uses docker-compose-langserver as lsp
And yaml for syntax highlighting, indents and injections

* languages: add luajit as a shebang of lua

This helps to provide syntax highlighting and
other lua goodies when writing luajit

* book(update): run cargo xtask docgen

* since #8006 full filenames uses glob
uek-1 pushed a commit to uek-1/helix that referenced this pull request Feb 24, 2024
* languages: add docker-compose language

it uses docker-compose-langserver as lsp
And yaml for syntax highlighting, indents and injections

* languages: add luajit as a shebang of lua

This helps to provide syntax highlighting and
other lua goodies when writing luajit

* book(update): run cargo xtask docgen

* since helix-editor#8006 full filenames uses glob
cosmikwolf pushed a commit to cosmikwolf/helix that referenced this pull request Feb 26, 2024
* Replace FileType::Suffix with FileType::Glob

Suffix is rather limited and cannot be used to match files which have
semantic meaning based on location + file type (for example, Github
Action workflow files). This patch adds support for a Glob FileType to
replace Suffix, which encompasses the existing behavior & adds
additional file matching functionality.

Globs are standard Unix-style path globs, which are matched against the
absolute path of the file. If the configured glob for a language is a
relative glob (that is, it isn't an absolute path or already starts with
a glob pattern), a glob pattern will be prepended to allow matching
relative paths from any directory.

The order of file type matching is also updated to first match on globs
and then on extension. This is necessary as most cases where
glob-matching is useful will have already been matched by an extension
if glob matching is done last.

* Convert file-types suffixes to globs

* Use globs for filename matching

Trying to match the file-type raw strings against both filename and
extension leads to files with the same name as the extension having the
incorrect syntax.

* Match dockerfiles with suffixes

It's common practice to add a suffix to dockerfiles based on their
context, e.g. `Dockerfile.dev`, `Dockerfile.prod`, etc.

* Make env filetype matching more generic

Match on `.env` or any `.env.*` files.

* Update docs

* Use GlobSet to match all file type globs at once

* Update todo.txt glob patterns

* Consolidate language Configuration and Loader creation

This is a refactor that improves the error handling for creating
the `helix_core::syntax::Loader` from the default and user language
configuration.

* Fix integration tests

* Add additional starlark file-type glob

---------

Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
shortc pushed a commit to shortc/helix that referenced this pull request Mar 31, 2024
* Replace FileType::Suffix with FileType::Glob

Suffix is rather limited and cannot be used to match files which have
semantic meaning based on location + file type (for example, Github
Action workflow files). This patch adds support for a Glob FileType to
replace Suffix, which encompasses the existing behavior & adds
additional file matching functionality.

Globs are standard Unix-style path globs, which are matched against the
absolute path of the file. If the configured glob for a language is a
relative glob (that is, it isn't an absolute path or already starts with
a glob pattern), a glob pattern will be prepended to allow matching
relative paths from any directory.

The order of file type matching is also updated to first match on globs
and then on extension. This is necessary as most cases where
glob-matching is useful will have already been matched by an extension
if glob matching is done last.

* Convert file-types suffixes to globs

* Use globs for filename matching

Trying to match the file-type raw strings against both filename and
extension leads to files with the same name as the extension having the
incorrect syntax.

* Match dockerfiles with suffixes

It's common practice to add a suffix to dockerfiles based on their
context, e.g. `Dockerfile.dev`, `Dockerfile.prod`, etc.

* Make env filetype matching more generic

Match on `.env` or any `.env.*` files.

* Update docs

* Use GlobSet to match all file type globs at once

* Update todo.txt glob patterns

* Consolidate language Configuration and Loader creation

This is a refactor that improves the error handling for creating
the `helix_core::syntax::Loader` from the default and user language
configuration.

* Fix integration tests

* Add additional starlark file-type glob

---------

Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
shortc pushed a commit to shortc/helix that referenced this pull request Mar 31, 2024
* languages: add docker-compose language

it uses docker-compose-langserver as lsp
And yaml for syntax highlighting, indents and injections

* languages: add luajit as a shebang of lua

This helps to provide syntax highlighting and
other lua goodies when writing luajit

* book(update): run cargo xtask docgen

* since helix-editor#8006 full filenames uses glob
postsolar pushed a commit to postsolar/helix that referenced this pull request Apr 4, 2024
* Replace FileType::Suffix with FileType::Glob

Suffix is rather limited and cannot be used to match files which have
semantic meaning based on location + file type (for example, Github
Action workflow files). This patch adds support for a Glob FileType to
replace Suffix, which encompasses the existing behavior & adds
additional file matching functionality.

Globs are standard Unix-style path globs, which are matched against the
absolute path of the file. If the configured glob for a language is a
relative glob (that is, it isn't an absolute path or already starts with
a glob pattern), a glob pattern will be prepended to allow matching
relative paths from any directory.

The order of file type matching is also updated to first match on globs
and then on extension. This is necessary as most cases where
glob-matching is useful will have already been matched by an extension
if glob matching is done last.

* Convert file-types suffixes to globs

* Use globs for filename matching

Trying to match the file-type raw strings against both filename and
extension leads to files with the same name as the extension having the
incorrect syntax.

* Match dockerfiles with suffixes

It's common practice to add a suffix to dockerfiles based on their
context, e.g. `Dockerfile.dev`, `Dockerfile.prod`, etc.

* Make env filetype matching more generic

Match on `.env` or any `.env.*` files.

* Update docs

* Use GlobSet to match all file type globs at once

* Update todo.txt glob patterns

* Consolidate language Configuration and Loader creation

This is a refactor that improves the error handling for creating
the `helix_core::syntax::Loader` from the default and user language
configuration.

* Fix integration tests

* Add additional starlark file-type glob

---------

Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
postsolar pushed a commit to postsolar/helix that referenced this pull request Apr 4, 2024
* languages: add docker-compose language

it uses docker-compose-langserver as lsp
And yaml for syntax highlighting, indents and injections

* languages: add luajit as a shebang of lua

This helps to provide syntax highlighting and
other lua goodies when writing luajit

* book(update): run cargo xtask docgen

* since helix-editor#8006 full filenames uses glob
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-language-support Area: Support for programming/text languages E-medium Call for participation: Experience needed to fix: Medium / intermediate R-breaking-change This PR is a breaking change for some behavior S-waiting-on-review Status: Awaiting review from a maintainer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to specify file type prefixes for language detection
7 participants