Enable excluding files / directories #470

Open
norswap opened this issue Jan 19, 2022 · 18 comments
Labels
enhancement New feature or request

Comments

@norswap

norswap commented Jan 19, 2022

This might be a case of me being particularly dense, but there doesn't seem to be a way to exclude directories and files from being parsed by lychee (other than by not including them in the inputs). The existing exclude flags are all about patterns of links not to consider.

Why this matters: on my machine, lychee is quite slow at trudging through e.g. .git and node_modules to look for files that either aren't there or I don't want checked. (There are external reasons why it's slow, not least of which is using WSL.)

Still, as is, I'm forced to write:

lychee --exclude-mail README.md "./specs/**/*.md" "./meta/**/*.md" "./opnode/**/*.md"
  • this is multiple orders of magnitude faster than lychee --exclude-mail **/*.md
  • I'm just not bothering with linting my JavaScript packages, because it's too painful to manually list the files in there to avoid node_modules.

What I'd like to write:

lychee --exclude-mail --exclude-dir .git node_modules -- **/*.md

Ideally the "file patterns" should work just like .gitignore.
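
The speed-up requested here comes from never descending into excluded directories, as opposed to filtering paths after the whole tree has been walked. A minimal Python sketch of that idea follows. It is not lychee's implementation; `iter_markdown_files` and `EXCLUDED_DIRS` are hypothetical names used only for illustration.

```python
import os
from typing import Iterator

# Hypothetical default exclusions, mirroring the directories named above.
EXCLUDED_DIRS = frozenset({".git", "node_modules"})


def iter_markdown_files(root: str, excluded=EXCLUDED_DIRS) -> Iterator[str]:
    """Yield *.md files under root without descending into excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from entering those directories.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(filenames):
            if name.endswith(".md"):
                yield os.path.join(dirpath, name)
```

Because the pruning happens before the walk enters a directory, the cost of a large node_modules tree is never paid, which is the behaviour a .gitignore-style exclusion would need.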

@mre
Member

mre commented Jan 19, 2022

We also touch on that in #418.
I agree that there needs to be a solution.
ripgrep excludes .git and node_modules by default, which sounds sensible to me. Then in your case it would be

lychee --exclude-mail .

(Not exactly, because it would check html files as well, but that could be configurable as well.)

--exclude-dir might be a bit too narrow, because one might also want to exclude files. Then --exclude-path makes sense, because it covers both, but we also have --exclude-file, which is currently interpreted as a file with regex patterns for excluding URLs. That was a misnomer and will be deprecated in favour of --use-ignore-file.
With that, --exclude-path could work and support both directories and files.

@lebensterben
Member

@mre
I fully agree that lychee should be similar to ripgrep when dealing with hidden directories.

It could even ignore everything listed in .gitignore by default.

@san-slysz

san-slysz commented Jan 28, 2022

I ran into the same situation: I wanted to ignore (at least) node_modules. Ignoring everything listed in .gitignore would make sense to me.

@mre mre added the enhancement New feature or request label Feb 4, 2022
@aerfio

aerfio commented Apr 11, 2022

This feature would be really helpful. Right now I do something like

git ls-files '*.md' | xargs -n 1 lychee --

but I'd prefer some kind of way to ignore whole directories
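
The workaround above relies on `git ls-files` listing only tracked files, so anything covered by .gitignore never reaches lychee. A rough Python equivalent is sketched below, assuming `git` and `lychee` are on the PATH. The helper names `tracked_markdown_files` and `lychee_command` are hypothetical. Unlike `xargs -n 1`, this builds a single invocation with all files, in the same way the original request passes several inputs at once.

```python
import subprocess
from typing import List


def tracked_markdown_files(repo_dir: str = ".") -> List[str]:
    """Return tracked *.md files; ignored and untracked files are not listed by git."""
    result = subprocess.run(
        ["git", "ls-files", "-z", "--", "*.md"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    # -z separates entries with NUL bytes, which is safe for unusual file names.
    return [path for path in result.stdout.split("\0") if path]


def lychee_command(files: List[str]) -> List[str]:
    """Build one argv list; '--' marks the end of options, as in the shell version."""
    return ["lychee", "--", *files]


# Example use (requires git and lychee to be installed):
# subprocess.run(lychee_command(tracked_markdown_files()), check=False)
```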

@mre
Member

mre commented Nov 13, 2022

Update

--exclude-path exists now. It allows excluding files and directories from being checked.

Usage example based on the original request above:

lychee --exclude-path node_modules .git -- .

Regex patterns are supported.
Some more info on the lychee website

@norswap, as a side note, did you know that there's a Windows executable build, which could help you with any performance issues because you could avoid the WSL virtualization layer?

TODO

  • Exclude the entries in .gitignore automatically.

@norswap
Author

norswap commented Nov 14, 2022

Great to hear!

I was weak and purchased a mac :D
But I think execution generally isn't the problem with WSL performance; it's file system access.

@aj-stein-nist
Contributor

> Regex patterns are supported. Some more info on the lychee website

We are big fans of lychee, but I may have to file a bug report: I spent the last two days, in between other tasks, unable to get regex working at all with --exclude-path. I will need to check with all of you whether I am using it correctly.

@mre
Member

mre commented Oct 4, 2023

Sweet. I've added some more examples to the docs to help you get started. Feel free to add a comment here if you run into an issue.

@aj-stein-nist
Contributor

aj-stein-nist commented Oct 5, 2023

> Sweet. I've added some more examples to the docs to help you get started. Feel free to add a comment here if you run into an issue.

OK, I will not finish the draft of the separate bug report I was writing; I will move it here and edit it. We want to dynamically add or remove a collection of web pages per directory, and the directory names are based on git branch or tag names. I bring this up only because I cannot use .lycheeignore or the config file, as the directories will be created dynamically. We want to filter them out, and --exclude-path=site/public/models would be best. This is the directory structure; see usnistgov/OSCAL-Reference#23 for fuller details and the current development branch.

https://github.com/usnistgov/OSCAL-Reference/tree/fb13809fea9baf44b9d0f341f694ee1dae66e864/site

The Makefile, which is executed one directory above site (./site/..), renders the source content into the site under site/public. We scan links from there.

After cloning our code, I tried different placements of * before and after the relative path.

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="site/public/models*" \
  site/public/**/*.html

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="*site/public/models*" \
  site/public/**/*.html

lychee --exclude-file ./support/lychee_ignore.txt \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="*/models*" \
  site/public/**/*.html

I still see lychee link-check failures under site/public/models/develop and other subdirectories that should be excluded. Once I stop using (regex?) wildcards and exclude, for example, site/public/models/develop directly, like so, it works fine.

lychee --exclude-file ./support/lychee_ignore.txt  \
  --config ./support/lychee.toml \
  --output lychee_report.md \
  --verbose --format markdown \
  --exclude-path="site/public/models/develop" \
  site/public/**/*.html

I am using lychee v0.13.0. Do I have to use the WIP version of develop to test this feature working properly?

aj-stein-nist added a commit to aj-stein-nist/lychee that referenced this issue Oct 5, 2023
@mre
Member

mre commented Oct 5, 2023

Are you mixing up regex with glob, maybe?
Instead of models*, can you try models.*?
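
The difference between the two notations can be seen in a short Python sketch. It uses Python's `fnmatch` and `re` modules purely for illustration; lychee itself is written in Rust, and the sample path `models/develop` is just the tail of the directory discussed above.

```python
import fnmatch
import re

path = "models/develop"

# Glob: "*" stands for any run of characters.
glob_hit = fnmatch.fnmatchcase(path, "models*")

# Regex: "s*" means "zero or more 's'", so the pattern cannot cover the whole path.
regex_star = re.fullmatch(r"models*", path) is not None

# Regex: ".*" means "any run of characters", the closest analogue of the glob "*".
regex_dot_star = re.fullmatch(r"models.*", path) is not None

print(glob_hit, regex_star, regex_dot_star)  # True False True
```

With an unanchored search, `models*` would still find the substring `models`, so the difference only shows when the pattern has to cover the whole path. Which of the two lychee does is not established in this thread.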

@aj-stein-nist
Contributor

aj-stein-nist commented Oct 5, 2023

> Are you mixing up regex with glob, maybe? Instead of models*, can you try models.*?

I see what I did there, apologies. Ah, well, in that case I'll go back to testing and report back later. 😬

@askalski85
Contributor

askalski85 commented Dec 11, 2023

Hi folks,
I am playing around with --exclude-path in a simple scenario:

.
└── a
    ├── a.html
    └── b
        ├── b.html
        └── c
            ├── c.html
            └── d
                └── d.html

I am able to exclude a directory using a/b or a file using a/b/b.html, but neither an asterisk * nor .* works for me as a wildcard for file/folder names.

lychee -vv --exclude-path 'a/b/.*' .
lychee -vv --exclude-path 'a/b/*' .

# same output for both
[./a/b/b.html]:
✗ [ERR] https://badlink.b.com/ | Failed: Network error: dns error: no record found for Query { name: Name("badlink.b.com.fritz.box."), query_type: AAAA, query_class: IN }

Same for 0.13.0 and for the nightly 11d8d44
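
For anyone who wants to reproduce this, the layout can be recreated with a small Python sketch. The file paths come from the tree above and the URLs from the lychee output quoted in this thread. The HTML markup and the helper name `create_fixture` are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List

# Relative file paths from the tree above.
FILES = ["a/a.html", "a/b/b.html", "a/b/c/c.html", "a/b/c/d/d.html"]


def create_fixture(root: Path) -> List[Path]:
    """Create the four HTML files, each with one github.com link and one broken link."""
    created = []
    for rel in FILES:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        stem = target.stem  # "a", "b", "c" or "d"
        # The markup is an assumption; only the URLs appear in the thread.
        target.write_text(
            f'<a href="https://github.com/#{stem}">ok</a>\n'
            f'<a href="https://badlink.{stem}.com/">broken</a>\n'
        )
        created.append(target)
    return created


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for path in create_fixture(root):
        print(path.relative_to(root).as_posix())
```

Running lychee with the various --exclude-path values from inside the generated directory should then give outputs comparable to the ones posted here.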

@mre
Member

mre commented Dec 11, 2023

Try

lychee --dump --exclude-path 'a/b' .

@askalski85
Contributor

> lychee --dump --exclude-path 'a/b' .

tmp % lychee -vv --dump --exclude-path 'a/b' .
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)

Yes, this works. But the online documentation states I can do things like */dev/*, which apparently does not work:

tmp % lychee -vv --dump --exclude-path '*/b/*' .
https://github.com/#b (./a/b/b.html)
https://badlink.b.com/ (./a/b/b.html)
https://github.com/#c (./a/b/c/c.html)
https://badlink.c.com/ (./a/b/c/c.html)
https://github.com/#d (./a/b/c/d/d.html)
https://badlink.d.com/ (./a/b/c/d/d.html)
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)

@mre
Member

mre commented Dec 11, 2023

Well, the documentation mentions */dev/*, but I realized that it doesn't describe what it does.
I think */dev/* is actually incorrect. It's not a regular expression to begin with. When I put it into a regex tester, I get errors like "Error: invalid target for quantifier". The reason is that the first * is a quantifier, which doesn't quantify anything.
It's a glob pattern, but --exclude-path uses regex matching.
So, we should remove the example.
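
The "quantifier without a target" problem is easy to reproduce. The sketch below uses Python's `re` module as a stand-in regex engine (lychee uses a Rust regex engine, so error messages will differ). The sample path `docs/dev/index.html` is hypothetical.

```python
import fnmatch
import re

glob_pattern = "*/dev/*"

# As a regex, the leading "*" has nothing to repeat, so compilation fails.
try:
    re.compile(glob_pattern)
    compiles = True
except re.error:
    compiles = False

# fnmatch.translate returns the regex Python's glob matcher uses for the same pattern.
translated = fnmatch.translate(glob_pattern)

# A hand-written regex with the intended meaning.
regex = re.compile(r".*/dev/.*")

sample = "docs/dev/index.html"  # hypothetical path
print(compiles)                              # False
print(bool(re.match(translated, sample)))    # True
print(bool(regex.fullmatch(sample)))         # True
```

In other words, a glob has to be translated before it can be used where a regex is expected; the two notations are not interchangeable.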

@mre
Member

mre commented Dec 11, 2023

Replaced the pattern with .*/dev/.*.

I haven't looked into why your original regex didn't work. I think it should (?).
For future reference, here is the module that handles path exclusions:
https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/types/input.rs
I can see that there are some missing cases in our unit tests; e.g. excluding files like foo.html.

If someone finds the time, I'd appreciate a pull request for adding more cases. Maybe there is a bug in the path exclusion handling (or we need to document it better).

@askalski85
Contributor

FYI: the suggested .*/dev/.* also does not seem to work.

tmp % lychee -vv --dump --exclude-path '.*/b/.*' .
https://badlink.c.com/ (./a/b/c/c.html)
https://github.com/#c (./a/b/c/c.html)
https://badlink.b.com/ (./a/b/b.html)
https://github.com/#b (./a/b/b.html)
https://badlink.a.com/ (./a/a.html)
https://github.com/#a (./a/a.html)
https://badlink.d.com/ (./a/b/c/d/d.html)
https://github.com/#d (./a/b/c/d/d.html)

@mre
Member

mre commented Dec 13, 2023

Looks like it doesn't match the full path, but just the last part? Can you play around with various regexes that match the filename and extension? Like *.html or (a|b).html and c\.htm.?
(It could also be related to the backslash escaping)
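
To make the two hypotheses concrete, here is a small Python sketch that applies a pattern either to the whole relative path or to the file name only. It models the question being asked, not lychee's actual code; `kept` and its `target` parameter are hypothetical, and Python's `re` stands in for the real regex engine.

```python
import re
from pathlib import PurePosixPath

PATHS = ["a/a.html", "a/b/b.html", "a/b/c/c.html", "a/b/c/d/d.html"]


def kept(paths, pattern, target):
    """Return the paths that survive exclusion.

    target="full" tests the regex against the whole relative path,
    target="name" tests it against the final path component only.
    """
    regex = re.compile(pattern)
    result = []
    for path in paths:
        subject = path if target == "full" else PurePosixPath(path).name
        if regex.search(subject) is None:
            result.append(path)
    return result


print(kept(PATHS, r".*/b/.*", "full"))  # ['a/a.html']
print(kept(PATHS, r".*/b/.*", "name"))  # all four paths: no file name contains "/b/"
print(kept(PATHS, r"c\.htm.", "name"))  # everything except a/b/c/c.html
```

Neither mode reproduces every observation in this thread on its own: full-path matching would have excluded everything under a/b for `.*/b/.*`, while name-only matching would not explain why plain `a/b` works. Treat the sketch as a way to structure the experiments, not as an explanation.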
