Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect relative URLs on top-level landing pages #155

Closed
Jaifroid opened this issue Aug 14, 2022 · 8 comments
Closed

Incorrect relative URLs on top-level landing pages #155

Jaifroid opened this issue Aug 14, 2022 · 8 comments
Milestone

Comments

@Jaifroid
Copy link

Jaifroid commented Aug 14, 2022

At least in the Type 1 internet-encyclopedia-philosophy_en_all_2022-07.zim (currently in dev directory), the image on the landing page has a URL of ../wp-content/media/mill.jpg. The landing page's URL is C/A/iep.utm.edu/. Logically, therefore, the relative URL points to the image being at C/A/wp-content/media/mill.jpg. However this is of course wrong, as the image is in fact located at C/A/iep.utm.edu/wp-content/media/mill.jpg.

I have also seen this referencing error on the landing page of the Type 0 military medicine ZIM. So I think the issue is in fact with landing pages, or with the root directory of a Zimit stored domain, and not with Type 0 and Type 1.

Kiwix Desktop is able to locate the image on the landing page of the IAP ZIM and the military medicine ZIM. Perhaps it has a feature that prevents a relative link from walking higher up the tree than the domain part of the WARC-style URL? It's still strange that the URLs are technically incorrect, given that this does not occur on pages other than the landing page (in my experience). I guess I need to work around this?

@Jaifroid Jaifroid changed the title Incorrect relative url on front page of Type 1 Zimit ZIM (Internet Encyclopaedia of Philosophy) Incorrect relative URLs on top-level landing pages Aug 14, 2022
@Jaifroid
Copy link
Author

OK, I had forgotten that I had in fact already worked around this issue in relation to Type 0 ZIMs but had forgotten to report the issue. I have code that checks whether a URL beginning with .. is located at the top level of the domain represented in the WARC-style ZIM, and removes the ../ if so. Now of course because of the extra /A/ that Type 1 ZIMs have as a prefix to the URL, the detection of this failed, and the code believes that A/iep.utm.edu/does not represent a top-level URL. This is easily fixed.

Nevertheless, I do think there is an issue in the way these URLs have been stored in the ZIM -- even though the Replay system is a bit of a black box (albeit with source code), URLs should still be stored in an internally consistent way in the ZIM, IMHO!

@kelson42
Copy link
Contributor

@Jaifroid Indeed, most important is to report bug!

@kelson42
Copy link
Contributor

@Jaifroid I don't understand why kiwix-desktop is part of the tickez, considering that kiwix-desktop has no support for SW based ZIM files !

@Jaifroid
Copy link
Author

@kelson42 I was using the indirect support, i.e. when you allow Kiwix Desktop to serve the decoded ZIM files directly to the browser. I was a bit imprecise, I guess it's the server feature of Kiwix Desktop, rather than Kiwix Desktop itself.

@rgaudin
Copy link
Member

rgaudin commented Aug 15, 2022

I believe this can be either handled in replay/reader (what you did – but then that needs to be properly documented) or at crawling level. Would you have a an URL that we could test it with the WARC toolchain (so small or easily scopable) and maybe discuss having it handled on the crawler ?

@kelson42 kelson42 added this to the 2.0.0 milestone Apr 24, 2023
@Jaifroid
Copy link
Author

Jaifroid commented Jan 2, 2024

I now have more experience of this issue. It arises because a document located in the root of www.example.com can have URLs (incorrectly) coded as ../icon.jpg, and the browser will still find those links because it knows it can't navigate any higher than root. When running in another context, though, e.g. library.kiwix.org/www.example.com/, then the browser can attempt to navigate higher.

I can now confirm this case is already handled by Wombat and/or the Service Worker. It might be worth keeping this open as a reminder in case such issues crop up in Zimit v2.0, but if you prefer to close it @rgaudin, and re-open if it turns out to be an issue, that's fine by me.

@rgaudin
Copy link
Member

rgaudin commented Jan 2, 2024

It seems to me this is correctly handled in the new rewriting but I'll let @mgautierfr confirm and close if he thinks that's OK

@mgautierfr
Copy link
Contributor

At least in the Type 1 internet-encyclopedia-philosophy_en_all_2022-07.zim (currently in dev directory), the image on the landing page has a URL of ../wp-content/media/mill.jpg. The landing page's URL is C/A/iep.utm.edu/. Logically, therefore, the relative URL points to the image being at C/A/wp-content/media/mill.jpg. However this is of course wrong, as the image is in fact located at C/A/iep.utm.edu/wp-content/media/mill.jpg.

I wonder if there is not a issue with the last / in C/A/iep.utm.edu/ when relative link is generated. But it may be a bug in the code generating the original site.

I'll let @mgautierfr confirm and close if he thinks that's OK

I confirm it should be correctly handled. Related test is https://github.com/openzim/warc2zim/blob/warc2zim2/tests/test_url_rewriting.py#L26-L29

I'm closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants