New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect relative URLs on top-level landing pages #155
Comments
OK, I had forgotten that I had in fact already worked around this issue in relation to Type 0 ZIMs but had forgotten to report the issue. I have code that checks whether a URL beginning with Nevertheless, I do think there is an issue in the way these URLs have been stored in the ZIM -- even though the Replay system is a bit of a black box (albeit with source code), URLs should still be stored in an internally consistent way in the ZIM, IMHO! |
@Jaifroid Indeed, most important is to report bug! |
@Jaifroid I don't understand why kiwix-desktop is part of the tickez, considering that kiwix-desktop has no support for SW based ZIM files ! |
@kelson42 I was using the indirect support, i.e. when you allow Kiwix Desktop to serve the decoded ZIM files directly to the browser. I was a bit imprecise, I guess it's the server feature of Kiwix Desktop, rather than Kiwix Desktop itself. |
I believe this can be either handled in replay/reader (what you did – but then that needs to be properly documented) or at crawling level. Would you have a an URL that we could test it with the WARC toolchain (so small or easily scopable) and maybe discuss having it handled on the crawler ? |
I now have more experience of this issue. It arises because a document located in the root of www.example.com can have URLs (incorrectly) coded as I can now confirm this case is already handled by Wombat and/or the Service Worker. It might be worth keeping this open as a reminder in case such issues crop up in Zimit v2.0, but if you prefer to close it @rgaudin, and re-open if it turns out to be an issue, that's fine by me. |
It seems to me this is correctly handled in the new rewriting but I'll let @mgautierfr confirm and close if he thinks that's OK |
I wonder if there is not a issue with the last
I confirm it should be correctly handled. Related test is https://github.com/openzim/warc2zim/blob/warc2zim2/tests/test_url_rewriting.py#L26-L29 I'm closing. |
At least in the Type 1
internet-encyclopedia-philosophy_en_all_2022-07.zim
(currently in dev directory), the image on the landing page has a URL of../wp-content/media/mill.jpg
. The landing page's URL isC/A/iep.utm.edu/
. Logically, therefore, the relative URL points to the image being atC/A/wp-content/media/mill.jpg
. However this is of course wrong, as the image is in fact located atC/A/iep.utm.edu/wp-content/media/mill.jpg
.I have also seen this referencing error on the landing page of the Type 0 military medicine ZIM. So I think the issue is in fact with landing pages, or with the root directory of a Zimit stored domain, and not with Type 0 and Type 1.
Kiwix Desktop is able to locate the image on the landing page of the IAP ZIM and the military medicine ZIM. Perhaps it has a feature that prevents a relative link from walking higher up the tree than the domain part of the WARC-style URL? It's still strange that the URLs are technically incorrect, given that this does not occur on pages other than the landing page (in my experience). I guess I need to work around this?
The text was updated successfully, but these errors were encountered: