Remapping file-extension: .md to .html #1

Closed
tgaff opened this issue Oct 11, 2023 · 13 comments

@tgaff
Contributor

tgaff commented Oct 11, 2023

I have a GitHub pages site written in markdown. I want to test the live deployed links from the site. Lychee doesn't support recursion or crawling the site but I thought I could use --base and --remap to generate the links to check from the original markdown before it's turned to HTML.

❯ lychee --max-concurrency 8 --base https://tgaff.github.io/some-gh-pages-site/ --remap "(.*).md  $1.html" index.md
Error: Remaps must be of the form '<pattern> <uri>' (separated by whitespace)

Caused by:
    0: Cannot parse string `.html` as website url: relative URL without a base
    1: relative URL without a base

index.md contains a link like

[link](/bad-links/nested/page.md)

I expected the capture group $1 to contain /bad-links/nested/page and to append .html. Instead I get an error.

Is there any way to accomplish this?

@mre
Member

mre commented Oct 17, 2023

I thought I answered that in a separate issue, but I can't remember right now, so here are some thoughts.

As the error message says, the second parameter of the remap must be a valid URI.
Can you try to change the remap to

❯ lychee --max-concurrency 8 --remap "(.*).md  https://tgaff.github.io/some-gh-pages-site/ $1.html" index.md

I don't know if this works, but it's worth a try.

You might need the --base as well.

@tgaff
Contributor Author

tgaff commented Oct 24, 2023

I made a few attempts based on this and still can't seem to get it to work. It looks like the capture group isn't capturing.

Any ideas?

First try:

lychee --max-concurrency 8 --remap "(.*).md https://tgaff.github.io/some-gh-pages-site/ $1.html" index.md

Result:

Error: Remaps must be of the form '<pattern> <uri>' (separated by whitespace)

Caused by:
    Cannot parse into URI remapping, must be a Regex pattern and a URL separated by whitespaces: `(.*).md  https://tgaff.github.io/some-gh-pages-site/ .html`

Second try, with the space removed:

I think the space may have been a typo. I tried again without it and got a different message:

❯  lychee --max-concurrency 8 --remap "(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html" index.md

  2/2 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[index.md]:
✗ [404] https://tgaff.github.io/some-gh-pages-site/.html | Failed: Network error: Not Found

🔍 2 Total ✅ 1 OK 🚫 1 Error (HTTP:1)

It seems like the capture group didn't work again.

Try with --base pointing to the site URL
❯ lychee --max-concurrency 8 --base https://tgaff.github.io/some-gh-pages-site/ --remap "(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html" index.md
  3/3 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[index.md]:
✗ [404] https://tgaff.github.io/some-gh-pages-site/.html | Failed: Network error: Not Found

🔍 3 Total ✅ 1 OK 🚫 2 Errors (HTTP:2)

This seems the same.

Try with --base and no full URL in the substitution
❯ lychee --max-concurrency 8 --base https://tgaff.github.io/some-gh-pages-site/ --remap "(.*).md $1.html" index.md
Error: Remaps must be of the form '<pattern> <uri>' (separated by whitespace)

Caused by:
    0: Cannot parse string `.html` as website url: relative URL without a base
    1: relative URL without a base

Version:

❯ lychee --version
lychee 0.13.0

@mre
Member

mre commented Oct 27, 2023

Yup, the capture group is not working. This is fixed in master, I hope. Can you try again with this?

uses: lycheeverse/lychee-action@master

Still need to release that version.

tgaff added a commit to tgaff/some-gh-pages-site that referenced this issue Nov 20, 2023
❯   lychee --max-concurrency 8 --remap "(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html" index.md

  2/2 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[index.md]:
✗ [404] https://tgaff.github.io/some-gh-pages-site/.html | Failed: Network error: Not Found

🔍 2 Total ✅ 1 OK 🚫 1 Error (HTTP:1)
@tgaff
Contributor Author

tgaff commented Nov 20, 2023

I was actually running it locally, so I set up an action to test it using the master branch.

That run is here. I think it's still a failure, though I did have some quoting issues and it's possible it's my fault.
https://github.com/tgaff/some-gh-pages-site/actions/runs/6929634262/job/18847729637?pr=7

@mre
Member

mre commented Nov 30, 2023

I noticed that you're using single quotes in the linked run.

--max-concurrency 8 --remap '(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html' index.md

Did you try double quotes as well?

--max-concurrency 8 --remap "(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html" index.md

If that doesn't work, try to escape the double quotes \".
My suspicion would be that the single quotes cause verbatim interpretation and the $ is not substituted anymore.
Just an idea.
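
A quick way to check which quoting style actually hands `$1` to lychee is to print both strings from the shell. This is only a sketch of standard bash/zsh quoting behavior, not something tested against lychee itself: inside double quotes the shell expands `$1` as a positional parameter (empty in a fresh interactive shell), while single quotes pass it through untouched. That would be consistent with the `.html`-only URLs in the earlier output, though it does not rule out a separate capture-group bug in lychee 0.13.

```shell
# Clear positional parameters so "$1" is empty, as in a fresh interactive shell.
set --

# Double quotes: the shell expands $1 (to nothing) before the program sees it.
double_quoted="(.*).md https://tgaff.github.io/some-gh-pages-site/$1.html"

# Single quotes: the text is passed through verbatim, including $1.
single_quoted='(.*).md https://tgaff.github.io/some-gh-pages-site/$1.html'

printf 'double: %s\n' "$double_quoted"
# prints: double: (.*).md https://tgaff.github.io/some-gh-pages-site/.html
printf 'single: %s\n' "$single_quoted"
# prints: single: (.*).md https://tgaff.github.io/some-gh-pages-site/$1.html
```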

@tgaff
Contributor Author

tgaff commented Jan 17, 2024

I gave it a shot and I get the same ✗ [404] https://tgaff.github.io/some-gh-pages-site/.html | Failed: Network error: Not Found with both. For kicks I switched from zsh to bash and got the same output.

I may be on an older version now though. I imagine there's been a release since the last time I checked on this.

@mre
Member

mre commented Jan 17, 2024

Yup. The latest release contains some changes around remapping, so it might be worth a shot.

@tgaff
Contributor Author

tgaff commented Jan 21, 2024

OK, I moved from 0.13 to 0.14.1. This version behaves quite differently. 👍

Initially, I tried the same thing we were discussing:

❯ lychee --max-concurrency 8 --remap '(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html' index.md

  2/2 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
Issues found in 1 input. Find details below.

[index.md]:
✗ [404] https://tgaff.github.io/some-gh-pages-site/file:///Users/tgaff/devel/guides/technical_guides/chef.html | Failed: Network error: Not Found
✗ [404] https://tgaff.github.io/some-gh-pages-site/file:///Users/tgaff/devel/guides/technical_guides/css.html | Failed: Network error: Not Found

You can see that the capture group is capturing everything up to the top level directory (note the file://) and trying to sub that in. It's possible to work around this.

❯ lychee --max-concurrency 8 --remap 'file:\/\/\/Users\/tgaff\/devel\/guides\/(.*).md  https://tgaff.github.io/some-gh-pages-site/$1.html' index.md
  2/2 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 2 Total (in 0s) ✅ 2 OK 🚫 0 Errors

Since I do want to check sub-directories I think I'm stuck with this slightly cumbersome path statement. It works though. 👍

Does this belong in the remaps list?


One final thing, since the page contains some links that start with / e.g. [link](/other-dir/other-page.md), it's necessary to also append --base . per lycheeverse/lychee-action#211

@mre
Member

mre commented Jan 22, 2024

I'm still trying to understand what the difference in your regex is, such that the first version captures the protocol but the second one doesn't.

Apart from that, yes, that should go into the remaps list.

@tgaff
Contributor Author

tgaff commented Jan 23, 2024

The first version uses an exceedingly greedy capture group, (.*), so it captures the protocol and the full file-system path.
The second version explicitly declares the protocol and file-system path outside of the capture group (.*), so only the remainder of the path is captured.

My understanding is that only the capture group is used in the remap.
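
The difference between the two patterns can be previewed without lychee. The sketch below uses `sed -E` as a stand-in for the remap step, which is an assumption: sed uses POSIX extended regex and `\1` rather than the `$1` syntax shown in the commands above, but for these simple patterns the greedy matching behaves the same way. `remap_preview` is a hypothetical helper name, not a lychee feature.

```shell
# Hypothetical helper: apply one pattern/replacement pair to a URI with sed -E.
# $1 = extended regex, $2 = replacement in sed syntax (\1, not $1), $3 = input URI
remap_preview() {
  printf '%s\n' "$3" | sed -E "s#$1#$2#"
}

uri='file:///Users/tgaff/devel/guides/technical_guides/chef.md'
replacement='https://tgaff.github.io/some-gh-pages-site/\1.html'

# Greedy group with nothing in front of it: it swallows the scheme and local path.
remap_preview '(.*).md' "$replacement" "$uri"
# prints https://tgaff.github.io/some-gh-pages-site/file:///Users/tgaff/devel/guides/technical_guides/chef.html

# Literal prefix outside the group: only the site-relative path is captured.
remap_preview 'file:///Users/tgaff/devel/guides/(.*).md' "$replacement" "$uri"
# prints https://tgaff.github.io/some-gh-pages-site/technical_guides/chef.html
```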

@mre
Member

mre commented Jan 24, 2024

Ah, got it. Thanks.

I have two small tips for you.

First, you can simplify your regex like this:

file:///.*/(.*).md$
  • Forward slashes don't need to be escaped in Rust's regex crate, because they have no special meaning there.
  • The user path probably doesn't matter to you, so it's fine to capture only the part after the last slash. The greedy .*/ in front of the group consumes everything up to the last slash, and the trailing $ anchors the match to the end of the URI.

Second, you can quickly experiment with regex ideas by using echo and --dump like so:

echo 'file:///Users/tgaff/devel/guides/technical_guides/chef.md' | lychee --dump --remap 'file:///.*/(.*).md$  https://guides.labzero.com/$1.html' -

If you have the time, I'd be thankful for a pull request to the remaps list. 😃

@tgaff
Contributor Author

tgaff commented Jan 26, 2024

Thanks, I assumed I needed to escape the / and I did not know about --dump!

First, you can simplify your regex like this:

file:///.*/(.*).md$

This won't quite work, because I do need to capture all of the path after the repo's root, so that it's mapped in as well.
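
To make the problem concrete, here is a rough preview of what the simplified pattern captures, using `sed -E` as a stand-in for lychee's remap (an assumption; sed writes the group as `\1` instead of `$1`). The directory `technical_guides` is lost because only the text after the last slash ends up in the group.

```shell
uri='file:///Users/tgaff/devel/guides/technical_guides/chef.md'

# Same pattern as suggested above, with sed's \1 in place of $1.
short=$(printf '%s\n' "$uri" |
  sed -E 's#file:///.*/(.*).md$#https://guides.labzero.com/\1.html#')

printf '%s\n' "$short"
# prints https://guides.labzero.com/chef.html
```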

Imagine my repo is at /Users/tgaff/devel/my_site and I have directories within it like

my_site 
  + technical_guides/
     - chef.md
     - docker.md
  + posts/
     - 20230201_painting_the_house.md
     - 20230712_great_day_for_fishin.md

I need to capture the directory names posts and technical_guides as part of the path to remap because they take the same paths on my site (e.g. https://www.mysite.com/technical_guides/chef.html).
Though you're right that it could still be shortened up with another wild-card as long as I don't nest a guides directory inside of guides (or my_site inside my_site): .*/guides/(.*).md
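
As a sketch of that shortened pattern, the loop below runs two repo-relative files through `sed -E`, again only as a stand-in for lychee's remap. The assumptions here: the repo root is the `guides` directory from earlier in the thread, the file names are borrowed from the hypothetical layout above, and the escaped dot plus trailing `$` are a small tightening that is not in the pattern as written.

```shell
root='file:///Users/tgaff/devel/guides'

# Build the file:// URIs, then apply the shortened pattern to each line.
remapped=$(
  for rel in technical_guides/chef.md posts/20230201_painting_the_house.md; do
    printf '%s/%s\n' "$root" "$rel"
  done | sed -E 's#.*/guides/(.*)\.md$#https://www.mysite.com/\1.html#'
)

printf '%s\n' "$remapped"
# prints:
# https://www.mysite.com/technical_guides/chef.html
# https://www.mysite.com/posts/20230201_painting_the_house.html
```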


I'll try to get to a PR here soon.

@mre
Member

mre commented Jan 26, 2024

You can capture the last directory as well:

echo 'file:///Users/tgaff/devel/guides/technical_guides/chef.md' | lychee --dump --remap 'file:///.*/([^/]+/[^/]+)\.md$  https://guides.labzero.com/$1.html' -
https://guides.labzero.com/technical_guides/chef.html

Or you can introduce a second capture group:

echo 'file:///Users/tgaff/devel/guides/technical_guides/chef.md' | lychee --dump --remap 'file:///.*/([^/]+)/([^/]+)\.md$  https://guides.labzero.com/$1/$2.html' -
https://guides.labzero.com/technical_guides/chef.html
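
One caveat worth checking before relying on the `[^/]+/[^/]+` form: it always captures exactly the last two path segments. The sketch below previews this with `sed -E` as a stand-in for lychee's remap (an assumption), using two hypothetical files that do not appear in the thread: a page at the repo root and a page nested two directories deep.

```shell
pattern='file:///.*/([^/]+/[^/]+)\.md$'

# Hypothetical helper: apply the two-segment pattern to one URI.
two_segments() {
  printf '%s\n' "$1" | sed -E "s#$pattern#https://guides.labzero.com/\\1.html#"
}

# A file at the repo root: the repo directory itself leaks into the URL.
two_segments 'file:///Users/tgaff/devel/guides/index.md'
# prints https://guides.labzero.com/guides/index.html

# A file two levels deep: the first directory level is dropped.
two_segments 'file:///Users/tgaff/devel/guides/technical_guides/aws/s3.md'
# prints https://guides.labzero.com/aws/s3.html
```

If the site has pages at more than one depth, anchoring on the repo directory name and capturing everything after it may be the safer choice.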
