External links seems not (always) kept in there original form #163

kelson42 · 2024-01-29T08:03:04Z

Opening https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-01/mesquartierschinois.wordpress.com/

And clicking on the Twitter icon, I get at 404 error.

Considering the targeted content is not in the ZIM, this is an external link and should be kept as external link (pointing to original online content).

Seems also a regression as it was working fine with version 1, see:

kelson42 · 2024-01-29T08:04:30Z

Remark: These articles have a lot of links to Wikipedia which work fine.

rgaudin · 2024-01-29T08:05:42Z

Remark: These articles have a lot of links to Wikipedia which work fine.

The missing link is an internal one (same domain) while WP ones are direct ones.

benoit74 · 2024-01-29T08:08:51Z

This is normal / nothing new (AFAIK) because:

this URL is on the domain which is been scraped
this URL has been excluded from the scrape by the exclude parameter: (\/\?|\/\#|\/wp-admin)

The URL is hence not an external one (it is one the same domain) so it is rewritten. But is it not in the ZIM because it has been excluded.

AFAIK warc2zim never took into consideration browsertrix crawler include / exclude rules to decide what is an internal link and should be rewritten. Maybe it is time to open a feature request for that, but it wouldn't be a simple feature in term of user experience.

benoit74 · 2024-01-29T08:10:55Z

And to complete the information, I decided to exclude these URLs because it makes no sense to have them in an offline content from my PoV

benoit74 · 2024-01-29T08:12:05Z

And in version 1 the settings were different / putting a lot of "useless" stuff in the ZIM

mgautierfr · 2024-01-29T08:47:28Z

The original url in https://mesquartierschinois.wordpress.com/ is https://mesquartierschinois.wordpress.com/2014/07/18/carme/?share=twitter&nb=1 which is a internal url. And mesquartierschinois.wordpress.com/2014/07/18/carme/ is actually inside the zim.

rgaudin · 2024-01-29T08:50:17Z

The problem is that the query string character (?) is url-encoded (%3F)

mgautierfr · 2024-01-29T09:23:58Z

It is normal and wanted that we encode ?. This "query string separator" for wordpress is not one for kiwix. We don't want the kiwix server to discard it (well, here, we may, but in general not)

I would say that the problem is that we remove the query string when we check if the url point to an existing entry.

But anyway, even when not removing the querystring during rewrite. The url is actually detected as not in the zim, so not rewritten (https://mesquartierschinois.wordpress.com/2014/07/18/carme/?share=twitter). But at runtime, wordpress js is rewritting the url (with a &nb=1 prepend), so wombat is catching this rewrite and transform it.
Keeping external link is working at creation time, all url are rewritten by wombat.

kelson42 · 2024-01-29T13:55:06Z

The URL is hence not an external one (it is one the same domain) so it is rewritten. But is it not in the ZIM because it has been excluded.

This is a remark for all ZIM files.

There is only two kinds of URL in a ZIM HTML content:

Either internal (and target should exist) OR
External

If this is not the case please report as bug. If someone disagree with this, please let me know ASAP.

AFAIK warc2zim never took into consideration browsertrix crawler include / exclude rules to decide what is an internal link and should be rewritten. Maybe it is time to open a feature request for that, but it wouldn't be a simple feature in term of user experience.

Regarding warc2zim this has been made clear that internal and external would have to be differentiated based on the WARC entry index, see for example #137 (comment)

Jaifroid · 2024-02-10T14:04:35Z

There is only two kinds of URL in a ZIM HTML content:

Either internal (and target should exist) OR

External

If this is not the case please report as bug. If someone disagree with this, please let me know ASAP.

While true that a URL is either in the ZIM or it is not, with Zimit ZIMs it is actually harder than it seems to determine which is which. See discussion of the "boundary problem" here: #65 (comment).

In the case of zimit2, if links are produced dynamically, then they won't be pre-converted, and we rely on Wombat to transform them into the correct format. However, Wombat doesn't know what's in the ZIM, so it converts all absolute URLs. When clicking on such a link, or if a link gets launched programmatically by an onclick function (quite common), the only true way to know if a link is in the ZIM is by looking for it at its ZIM URL and if it's not found, assume it's an external link. Or you can make some shortcut assumptions such as if the link doesn't belong to the main domain of the ZIM, it's probably external. This isn't always a safe assumption, as Zimit ZIMs can contain an arbitrary number of domains.

The way zimit1 deals with this, is that Wombat converts ALL links, whether static or produced dynamically client-side, into a local link with a special format that can be read by the Service Worker. The Service Worker traps all Fetch requests for local URLs and tries to fetch it from the ZIM. If not found (in the case of a clicked link), it shows its own page informing the user that the link is not in the ZIM and offers to redirect the user to the external site.

kelson42 · 2024-03-06T12:22:55Z

@mgautierfr @benoit74 Can we have an update on this ticket please? I have tested again on a fresh scrape, and it seems to work now!
See https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-03/mesquartierschinois.wordpress.com/

benoit74 · 2024-03-07T09:34:40Z

On my side I still didn't totally understood this ticket, so I can't tell if this is really fixed / why. But I agree it seems to work now.

@mgautierfr ?

kelson42 · 2024-04-18T06:13:46Z

Closing then, can anyway be reopen if necessary.

kelson42 · 2024-05-31T10:52:36Z

Now, again broken

benoit74 · 2024-05-31T12:17:50Z

This "new" bug is a known limitation tracked in #276.

These links are dynamically injected in the page via JS script responsible to handle these widgets. The link is hence dynamically rewritten by wombat / our custom dynamic rewrite function. Dynamic rewriting has no clue about which entry are present inside the ZIM and hence rewrite all links.

I will hence close again this issue since this is already tracked. I will just add a comment in other issue about this interesting "business case" where better handling of dynamic rewriting would help quite a lot.

kelson42 added bug Something isn't working question Further information is requested labels Jan 29, 2024

kelson42 assigned mgautierfr and benoit74 Jan 29, 2024

kelson42 added this to the 2.0.0 milestone Jan 29, 2024

Jaifroid mentioned this issue Jan 29, 2024

[Query] About encoding of URLs and query strings in zimit2 #164

Closed

Jaifroid mentioned this issue Mar 14, 2024

Fix ambiguities around the usage of URLs in ZIM file format openzim/libzim#865

Closed

kelson42 closed this as completed Apr 18, 2024

kelson42 reopened this May 31, 2024

kelson42 unassigned mgautierfr May 31, 2024

benoit74 closed this as completed May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

External links seems not (always) kept in there original form #163

External links seems not (always) kept in there original form #163

kelson42 commented Jan 29, 2024 •

edited

Loading

kelson42 commented Jan 29, 2024

rgaudin commented Jan 29, 2024

benoit74 commented Jan 29, 2024

benoit74 commented Jan 29, 2024

benoit74 commented Jan 29, 2024

mgautierfr commented Jan 29, 2024

rgaudin commented Jan 29, 2024

mgautierfr commented Jan 29, 2024 •

edited

Loading

kelson42 commented Jan 29, 2024 •

edited

Loading

Jaifroid commented Feb 10, 2024

kelson42 commented Mar 6, 2024

benoit74 commented Mar 7, 2024

kelson42 commented Apr 18, 2024

kelson42 commented May 31, 2024 •

edited

Loading

benoit74 commented May 31, 2024

External links seems not (always) kept in there original form #163

External links seems not (always) kept in there original form #163

Comments

kelson42 commented Jan 29, 2024 • edited Loading

kelson42 commented Jan 29, 2024

rgaudin commented Jan 29, 2024

benoit74 commented Jan 29, 2024

benoit74 commented Jan 29, 2024

benoit74 commented Jan 29, 2024

mgautierfr commented Jan 29, 2024

rgaudin commented Jan 29, 2024

mgautierfr commented Jan 29, 2024 • edited Loading

kelson42 commented Jan 29, 2024 • edited Loading

Jaifroid commented Feb 10, 2024

kelson42 commented Mar 6, 2024

benoit74 commented Mar 7, 2024

kelson42 commented Apr 18, 2024

kelson42 commented May 31, 2024 • edited Loading

benoit74 commented May 31, 2024

kelson42 commented Jan 29, 2024 •

edited

Loading

mgautierfr commented Jan 29, 2024 •

edited

Loading

kelson42 commented Jan 29, 2024 •

edited

Loading

kelson42 commented May 31, 2024 •

edited

Loading