Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External links seems not (always) kept in there original form #163

Closed
kelson42 opened this issue Jan 29, 2024 · 15 comments
Closed

External links seems not (always) kept in there original form #163

kelson42 opened this issue Jan 29, 2024 · 15 comments
Assignees
Labels
bug Something isn't working question Further information is requested
Milestone

Comments

@kelson42
Copy link
Contributor

kelson42 commented Jan 29, 2024

Opening https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-01/mesquartierschinois.wordpress.com/

And clicking on the Twitter icon, I get at 404 error.
image

Considering the targeted content is not in the ZIM, this is an external link and should be kept as external link (pointing to original online content).

Seems also a regression as it was working fine with version 1, see:
image

@kelson42 kelson42 added bug Something isn't working question Further information is requested labels Jan 29, 2024
@kelson42 kelson42 added this to the 2.0.0 milestone Jan 29, 2024
@kelson42
Copy link
Contributor Author

Remark: These articles have a lot of links to Wikipedia which work fine.

@rgaudin
Copy link
Member

rgaudin commented Jan 29, 2024

Remark: These articles have a lot of links to Wikipedia which work fine.

The missing link is an internal one (same domain) while WP ones are direct ones.

@benoit74
Copy link
Collaborator

This is normal / nothing new (AFAIK) because:

  • this URL is on the domain which is been scraped
  • this URL has been excluded from the scrape by the exclude parameter: (\/\?|\/\#|\/wp-admin)

The URL is hence not an external one (it is one the same domain) so it is rewritten. But is it not in the ZIM because it has been excluded.

AFAIK warc2zim never took into consideration browsertrix crawler include / exclude rules to decide what is an internal link and should be rewritten. Maybe it is time to open a feature request for that, but it wouldn't be a simple feature in term of user experience.

@benoit74
Copy link
Collaborator

And to complete the information, I decided to exclude these URLs because it makes no sense to have them in an offline content from my PoV

@benoit74
Copy link
Collaborator

And in version 1 the settings were different / putting a lot of "useless" stuff in the ZIM

@mgautierfr
Copy link
Contributor

The original url in https://mesquartierschinois.wordpress.com/ is https://mesquartierschinois.wordpress.com/2014/07/18/carme/?share=twitter&nb=1 which is a internal url. And mesquartierschinois.wordpress.com/2014/07/18/carme/ is actually inside the zim.

@rgaudin
Copy link
Member

rgaudin commented Jan 29, 2024

The problem is that the query string character (?) is url-encoded (%3F)

@mgautierfr
Copy link
Contributor

mgautierfr commented Jan 29, 2024

It is normal and wanted that we encode ?. This "query string separator" for wordpress is not one for kiwix. We don't want the kiwix server to discard it (well, here, we may, but in general not)

I would say that the problem is that we remove the query string when we check if the url point to an existing entry.

But anyway, even when not removing the querystring during rewrite. The url is actually detected as not in the zim, so not rewritten (https://mesquartierschinois.wordpress.com/2014/07/18/carme/?share=twitter). But at runtime, wordpress js is rewritting the url (with a &nb=1 prepend), so wombat is catching this rewrite and transform it.
Keeping external link is working at creation time, all url are rewritten by wombat.

@kelson42
Copy link
Contributor Author

kelson42 commented Jan 29, 2024

The URL is hence not an external one (it is one the same domain) so it is rewritten. But is it not in the ZIM because it has been excluded.

This is a remark for all ZIM files.

There is only two kinds of URL in a ZIM HTML content:

  • Either internal (and target should exist) OR
  • External

If this is not the case please report as bug. If someone disagree with this, please let me know ASAP.

AFAIK warc2zim never took into consideration browsertrix crawler include / exclude rules to decide what is an internal link and should be rewritten. Maybe it is time to open a feature request for that, but it wouldn't be a simple feature in term of user experience.

Regarding warc2zim this has been made clear that internal and external would have to be differentiated based on the WARC entry index, see for example #137 (comment)

@Jaifroid
Copy link

There is only two kinds of URL in a ZIM HTML content:

  • Either internal (and target should exist) OR
  • External

If this is not the case please report as bug. If someone disagree with this, please let me know ASAP.

While true that a URL is either in the ZIM or it is not, with Zimit ZIMs it is actually harder than it seems to determine which is which. See discussion of the "boundary problem" here: #65 (comment).

In the case of zimit2, if links are produced dynamically, then they won't be pre-converted, and we rely on Wombat to transform them into the correct format. However, Wombat doesn't know what's in the ZIM, so it converts all absolute URLs. When clicking on such a link, or if a link gets launched programmatically by an onclick function (quite common), the only true way to know if a link is in the ZIM is by looking for it at its ZIM URL and if it's not found, assume it's an external link. Or you can make some shortcut assumptions such as if the link doesn't belong to the main domain of the ZIM, it's probably external. This isn't always a safe assumption, as Zimit ZIMs can contain an arbitrary number of domains.

The way zimit1 deals with this, is that Wombat converts ALL links, whether static or produced dynamically client-side, into a local link with a special format that can be read by the Service Worker. The Service Worker traps all Fetch requests for local URLs and tries to fetch it from the ZIM. If not found (in the case of a clicked link), it shows its own page informing the user that the link is not in the ZIM and offers to redirect the user to the external site.

@kelson42
Copy link
Contributor Author

kelson42 commented Mar 6, 2024

@mgautierfr @benoit74 Can we have an update on this ticket please? I have tested again on a fresh scrape, and it seems to work now!
See https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-03/mesquartierschinois.wordpress.com/

Image

@benoit74
Copy link
Collaborator

benoit74 commented Mar 7, 2024

On my side I still didn't totally understood this ticket, so I can't tell if this is really fixed / why. But I agree it seems to work now.

@mgautierfr ?

@kelson42
Copy link
Contributor Author

Closing then, can anyway be reopen if necessary.

@kelson42
Copy link
Contributor Author

kelson42 commented May 31, 2024

Now, again broken
image

@benoit74
Copy link
Collaborator

This "new" bug is a known limitation tracked in #276.

These links are dynamically injected in the page via JS script responsible to handle these widgets. The link is hence dynamically rewritten by wombat / our custom dynamic rewrite function. Dynamic rewriting has no clue about which entry are present inside the ZIM and hence rewrite all links.

I will hence close again this issue since this is already tracked. I will just add a comment in other issue about this interesting "business case" where better handling of dynamic rewriting would help quite a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants