External links seem not (always) kept in their original form #163
Remark: These articles have a lot of links to Wikipedia which work fine.
The missing link is an internal one (same domain), while the Wikipedia ones are direct links.
This is normal / nothing new (AFAIK) because:
The URL is hence not an external one (it is on the same domain), so it is rewritten. But it is not in the ZIM because it has been excluded, AFAIK.
And to complete the information, I decided to exclude these URLs because, from my point of view, it makes no sense to have them in offline content.
And in version 1 the settings were different and put a lot of "useless" stuff in the ZIM.
The original URL in https://mesquartierschinois.wordpress.com/ is
The problem is that the query string character (
It is normal and wanted that we encode. I would say that the problem is that we remove the query string when we check whether the URL points to an existing entry. But anyway, even when the query string is not removed during rewrite, the URL is actually detected as not in the ZIM, so it is not rewritten (
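To illustrate the query string point above, here is a minimal sketch of how an existence check can give a different answer depending on whether the query string is kept. Everything in it is an assumption for illustration: the `entry_key` and `is_in_zim` helpers, the "host + path" key format, and the `example.org` URLs are hypothetical and are not the actual warc2zim code or data.

```python
from urllib.parse import urlsplit


def entry_key(url: str, keep_query: bool) -> str:
    """Build a hypothetical ZIM lookup key (host + path, optionally + query)."""
    parts = urlsplit(url)
    key = parts.netloc + (parts.path or "/")
    if keep_query and parts.query:
        key += "?" + parts.query
    return key


def is_in_zim(url: str, entries: set[str], keep_query: bool) -> bool:
    """Return True when the key derived from the URL is a known entry."""
    return entry_key(url, keep_query) in entries


# Hypothetical ZIM content: only the home page was kept.
entries = {"example.org/"}
# Hypothetical link whose target (with its query string) was excluded.
link = "https://example.org/?share=hypothetical"

# Query string dropped: the link collapses onto the home page entry,
# so it looks "in the ZIM" and would be rewritten as a local link.
stripped_result = is_in_zim(link, entries, keep_query=False)

# Query string kept: no such entry exists, so the link stays external.
kept_result = is_in_zim(link, entries, keep_query=True)
```

Under these assumptions, only the check that keeps the query string leaves the link pointing at the original online content.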
This is a remark for all ZIM files. There are only two kinds of URL in ZIM HTML content:
If this is not the case, please report it as a bug. If someone disagrees with this, please let me know ASAP.
Regarding warc2zim, it has been made clear that internal and external URLs would have to be differentiated based on the WARC entry index; see for example #137 (comment).
While it is true that a URL is either in the ZIM or it is not, with Zimit ZIMs it is actually harder than it seems to determine which is which. See the discussion of the "boundary problem" here: #65 (comment). In the case of zimit2, if links are produced dynamically, then they won't be pre-converted, and we rely on Wombat to transform them into the correct format. However, Wombat doesn't know what's in the ZIM, so it converts all absolute URLs. When a user clicks on such a link, or when a link is launched programmatically by an onclick function (quite common), the only reliable way to know whether the link is in the ZIM is to look for it at its ZIM URL and, if it's not found, assume it's an external link. Alternatively, you can make some shortcut assumptions, such as: if the link doesn't belong to the main domain of the ZIM, it's probably external. This isn't always a safe assumption, as Zimit ZIMs can contain an arbitrary number of domains. The way zimit1 deals with this is that Wombat converts ALL links, whether static or produced dynamically client-side, into a local link with a special format that can be read by the Service Worker. The Service Worker traps all Fetch requests for local URLs and tries to fetch them from the ZIM. If a URL is not found (in the case of a clicked link), it shows its own page informing the user that the link is not in the ZIM and offers to redirect the user to the external site.
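The two strategies described in this comment (look the link up in the ZIM, or guess from the main domain) can be contrasted with a small sketch. It is not the Wombat or Service Worker implementation: the `zim_path`, `classify_by_lookup` and `classify_by_domain` helpers, the "host + path" entry format, and the `example.org` / `cdn.example.net` URLs are all hypothetical and only serve to show why the domain shortcut can be wrong in both directions.

```python
from urllib.parse import urlsplit


def zim_path(url: str) -> str:
    """Map an absolute URL to a hypothetical ZIM entry path (host + path + query)."""
    parts = urlsplit(url)
    path = parts.netloc + (parts.path or "/")
    if parts.query:
        path += "?" + parts.query
    return path


def classify_by_lookup(url: str, zim_entries: set[str]) -> str:
    """Lookup strategy: internal only if the entry really exists in the ZIM."""
    return "internal" if zim_path(url) in zim_entries else "external"


def classify_by_domain(url: str, main_domain: str) -> str:
    """Shortcut strategy: internal if the host matches the ZIM's main domain."""
    return "internal" if urlsplit(url).netloc == main_domain else "external"


# Hypothetical ZIM holding content from two domains.
zim_entries = {"example.org/", "cdn.example.net/style.css"}
main_domain = "example.org"

# Same-domain page that was excluded from the ZIM.
excluded = "https://example.org/excluded"
# Resource from a second domain that is present in the ZIM.
asset = "https://cdn.example.net/style.css"

results = {
    "excluded": (classify_by_lookup(excluded, zim_entries),
                 classify_by_domain(excluded, main_domain)),
    "asset": (classify_by_lookup(asset, zim_entries),
              classify_by_domain(asset, main_domain)),
}
```

In this sketch the lookup treats the excluded same-domain page as external and the second-domain asset as internal, while the domain shortcut gets both wrong. The cost of the lookup is that it needs access to the list of entries, which is exactly what dynamic rewriting lacks.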
@mgautierfr @benoit74 Can we have an update on this ticket please? I have tested again on a fresh scrape, and it seems to work now!
On my side, I still haven't totally understood this ticket, so I can't tell whether this is really fixed or why. But I agree it seems to work now.
Closing then; it can be reopened if necessary.
This "new" bug is a known limitation tracked in #276. These links are dynamically injected into the page by the JS script responsible for handling these widgets. The link is hence dynamically rewritten by Wombat / our custom dynamic rewrite function. Dynamic rewriting has no clue about which entries are present inside the ZIM and hence rewrites all links. I will hence close this issue again since this is already tracked. I will just add a comment in the other issue about this interesting "business case" where better handling of dynamic rewriting would help quite a lot.
Opening https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-01/mesquartierschinois.wordpress.com/
And clicking on the Twitter icon, I get a 404 error.
Considering the targeted content is not in the ZIM, this is an external link and should be kept as an external link (pointing to the original online content).
This also seems to be a regression, as it was working fine with version 1, see: