Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support simple glob patterns for ignore rules #64

Closed
bpasero opened this issue Oct 1, 2021 · 7 comments · Fixed by #106
Closed

Support simple glob patterns for ignore rules #64

bpasero opened this issue Oct 1, 2021 · 7 comments · Fixed by #106

Comments

@bpasero
Copy link
Contributor

bpasero commented Oct 1, 2021

This is a feature request to enhance the ignore option to support simple glob patterns.

As an example, I recently came across this PR for nsfw that discusses such support for their library: Axosoft/nsfw#124

Bottom line: ignore patterns are very important on Linux at least, because of the operating system global limit of opened file handles, so you really need to exclude folders from the file watcher to not run into issues with opening too many file handles. Having support for glob patterns in the ignore option would mean, clients could define a pattern such as **/node_modules and have that work regardless of the folder they are in.

@devongovett
Copy link
Member

Glob patterns would indeed be useful! There are two ways we could implement them. The simplest approach is to do it in JavaScript, as there are already lots of glob libraries out there. No operating system API supports passing globs directly, so we will have to filter after receiving events anyway.

The other way would be to do it in C++. The benefit there would be performance. Since watching occurs on a background thread, we wouldn't overwhelm the JS thread with events that will be ignored anyway. However, implementing glob matching from scratch sounds hard. Perhaps there are some C++ libraries that implement globbing, though I also wouldn't want to bring in a large library if it has the potential for security issues. I suppose we could use Rust for the glob matching... 🤔

Any thoughts on that from your perspective?

@bpasero
Copy link
Contributor Author

bpasero commented Oct 2, 2021

@devongovett unfortunately the JavaScript approach wouldn't be a good solution for VS Code at least: for Linux it is important to apply the ignore rules at the level of where folders are recursively resolved and added to be watched. Since Linux has a natural upper limit of opened file handles, the ignore pattern is the only way for Linux users to avoid hitting that limit (related issue: microsoft/vscode#40898).

VS Code is already filtering the events on the receiving end but I would much prefer a native solution in C++ within the watcher.

Btw I am not necessarily saying you need full glob support to achieve this. Even VS Code itself implements a simple glob library that supports patterns such as ** and * but not everything the spec allows you to do. It is optimised for performance by avoiding to run a RegEx for all patterns when patterns have a specific form.

When looking into nsfw I noticed this pull request that seems to implement simple globbing in C++: https://github.com/Axosoft/nsfw/pull/124/files?authenticity_token=ifa16H3MpIoYvE3UXvzRn%2FP3eXbU9xziU2OqaQuUTyK99mBonpsX9DPygK%2BJyZCGMgrDUG2aXOne1QvQBXYGtw%3D%3D&file-filters%5B%5D=.cpp&file-filters%5B%5D=.gyp&file-filters%5B%5D=.h&file-filters%5B%5D=.md&file-filters%5B%5D=.ts#diff-7f14a4a5a7200888990f03862d86776107f7825be09854976d56e0760d1d01bb

Maybe that could be used as foundation going forward.

@joaomoreno
Copy link
Contributor

Together with @bpasero and @lszomoru, I've spend some time figuring out a way forward for this feature request, assuming you'd consider it. As @bpasero already mentioned, today VS Code matches globs at the JS level. Doing that at the C++ level would be ideal for obvious performance reasons.

Scenario

Let's consider the classic glob scenario. The user should be able to provide a set of glob patterns, relative to the watched directory. Glob patterns can be positive or negative (starting with !). Events would only be emitted if they match any of the positive glob patterns and none of the negative ones.

Consider the following example:

  • Directory to watch: /work
  • Glob Patterns:
    • src/**
    • build/**
    • !src/**/test/**
    • !build/output/**

The following files and directories would match the glob patterns:

  • /work/src/
  • /work/src/main.js
  • /work/src/widgets/
  • /work/src/widgets/bar.js
  • /work/build/compile.sh

The following ones would not:

  • /work/README.md
  • /work/src/test/
  • /work/src/test/main.js
  • /work/build/output/
  • /work/build/output/a.out

Implementation

Matching glob patterns against paths is commonly solved by converting glob patterns to regular expressions, eg VS Code, minimatch, etc. We suggest doing the same here. One implementation possibility would be to port VS Code's conversion code to C++, leveraging std::regex. Another, to depend on or embed an existing C++ implementation.

API

Then, there's the question of API. An option of a set of glob patterns conflicts with the existing ignore option. Also, it is not ideal to drop the ignore option altogether since it actually has performance benefits by being forwarded to some underlying native watcher implementations. I see two solutions.

We can have both ignore and glob options. The former would be kept to perserve today's behavior of forwarding that value to the underlying watcher as well as using it for filtering, while the former would only be used for additional filtering. This is simple and backwards-compatible, but not elegant: what would it mean to have { ignore: ['foo'], glob: ['foo/**'] }? By definition, none of the files within foo would be detected, despite being included in the glob patterns. Another unhappy example is { ignore: [], glob: ['!build/**'] }, in which we would lose an opportunity to provide build to be ignored by the underlying watcher and improve performance.

Another solution involves breaking API: drop the ignore option and have a single glob option. We would then detect specific negative glob patterns to be automatically provided as ignore rules to the underlying watcher. Specifically, if a negative pattern ends with /** and contains no other ** part, then it can be automatically converted to a path to be ignored. Take the example from above once again: the !build/output/** pattern is negative, ends with /** and contains no other ** parts, so the path /work/build/output should be provided to the underlying watcher to be fully ignored. Such patterns are extremely common in glob scenarios, so we can assume a high rate of successful detections.

Next Steps

Asking again, do you still consider glob pattern matching useful? Would you be open to a contribution from our side? What would be your the preferred approach? Do you maybe have other thoughts on the problem and/or approaches? We'd love to hear more from you and help move the Parcel watcher forward as it has provided real value to VS Code since we started using it.

@devongovett
Copy link
Member

Thanks for the writeup! I think this would be a great feature for you to contribute. 🥳

In terms of implementation, I like the idea of converting globs to regexes on the JavaScript side and passing them to C++. This only needs to happen once at startup and shouldn't be too expensive performance wise. JS glob libraries are widely used and well tested, and managing C++ dependencies is much more complex, so I think this is a good approach.

For the API, another option could be to keep the ignore option, but allow both paths and globs. We can detect whether each ignore pattern is a glob e.g. using https://github.com/micromatch/is-glob, and perhaps pass them as separate lists to the C++ backend but keep the JS API unified. Would that make sense to you?

@bpasero
Copy link
Contributor Author

bpasero commented Apr 6, 2022

What a fantastic idea to convert from JS to RegEx, I was not even thinking about that. We can then just use our existing code to convert from VSCode 💯

@joaomoreno
Copy link
Contributor

joaomoreno commented Apr 7, 2022

Lovely, great idea! I also like the API suggestion, I feel like that's actually the minimal API change we'd need and stays aligned with today's API. Given that, what do you think of directly using minimatch's makeRe API? That is, assuming JavaScript regular expressions are fairly compatible with C++11's... 🤔

@mischnic
Copy link
Member

mischnic commented Apr 7, 2022

That is, assuming JavaScript regular expressions are fairly compatible with C++11's... 🤔

There is an ECMAScript std::regex flavour which is also the default if not set explicitly in the constructor: https://en.cppreference.com/w/cpp/regex/syntax_option_type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants