zimit v2. [libzim/libkiwix/warc2zim part] #95

Closed
7 of 12 tasks
mgautierfr opened this issue Sep 18, 2023 · 13 comments
@mgautierfr
Member

mgautierfr commented Sep 18, 2023

This is a ticket to list what needs to be done to take the PR openzim/warc2zim#113 from a POC to a real feature.


Specification

Improve the current specification to support warc2zim's requirements.

Following openzim/warc2zim#113, we need to evolve the current kiwix/zim format.

The zim file format itself (the binary way to store content) will not evolve; what will evolve is the "kiwix" format (what we store and how we interpret it).
Although it is the "kiwix" format that evolves, this is still a low-level change, and we may need to change libzim itself (both at reading and creation time) to support this new format.

The main changes are:

  • To store aliases
  • [ ] To store fuzzy matching rules

Aliasing

A WARC file contains revisit records: entries which need to be served with the content of another entry.
The current POC uses the H namespace to store redirects that need to be handled as aliases.
We can do the same. Or we can do it the way hard links are done:
two (or more) entries are content entries and point to the same content (blob/cluster id or redirect id).

Using "hard links" would require adapting the libzim creator side, but no change at all would be needed in the specification or on the reading side.
(However, zim-check would need to be adapted, as it will report duplicate content.)

Fuzzy Matching

Fuzzy matching is a way to transform a (potentially not fixed) url into a fixed, known one.

There are two parts to fuzzy matching:

  • At creation. Fuzzy matching is done by processing the input url with a set of regexes. The processed url (or reduced url) is used to store the entry in the zim file (as an alias to the real entry ?)
  • At reading. Fuzzy matching is applied to the input url, which generates different versions of reduced urls. We search for those versions in the zim file and return the first one found. We need access to the query string when we apply the fuzzy matching.
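
The two sides described above can be sketched as follows. This is only an illustration: the rule format (a `match` regex plus a `replace` template), the sample cache-busting rule, the `Map` standing in for a zim file, and the helper names `reduceUrl`, `addEntry` and `lookup` are all assumptions, not the format used by warc2zim or wabac.js.

```javascript
// Hypothetical rule set: each rule rewrites a variable url into a fixed one.
const FUZZY_RULES = [
  // Illustrative rule only: drop a trailing cache-busting "_=<digits>" parameter.
  { match: /^(.*)[?&]_=\d+$/, replace: "$1" },
];

// Return every reduced form of `url` produced by the rules (may be empty).
function reduceUrl(url, rules = FUZZY_RULES) {
  const reduced = [];
  for (const { match, replace } of rules) {
    if (match.test(url)) {
      reduced.push(url.replace(match, replace));
    }
  }
  return reduced;
}

// Creation side: store the entry under its real url, plus one alias per reduced url.
// `zim` is a plain Map standing in for the zim file.
function addEntry(zim, url, content) {
  zim.set(url, content);
  for (const alias of reduceUrl(url)) {
    if (!zim.has(alias)) zim.set(alias, content);
  }
}

// Reading side: try the exact url first, then each reduced version,
// and return the first one found (null if nothing matches).
function lookup(zim, url) {
  for (const candidate of [url, ...reduceUrl(url)]) {
    if (zim.has(candidate)) return zim.get(candidate);
  }
  return null;
}
```

With this sketch, an entry created from `example.com/api/data?_=1700000000` can later be found from a request for `example.com/api/data?_=42`, because both reduce to `example.com/api/data`. Note that the reading side needs the full query string, which is the constraint raised above.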

On the specification side, we need to define how we store the fuzzy matching rules used at reading time.
We also need to define who applies them (we need access to the query string: is it libkiwix making several requests to libzim ? Or libzim doing the transformation, in which case we need to pass it the query string ?)

Implementation

  • libzim: Support aliasing creation
  • [ ] libzim: Support storing and retrieving fuzzy rules (including parsing of them)
  • [ ] libzim/libkiwix: Evolve the "routing" part (apply fuzzy rules, search for potential entries, ...). As the routing becomes more complex (than simply searching for an entry from the given url), it may be time to implement "Expose (kind of) InternalServer in the public API ?" (libkiwix#740). Not needed: the specific naming scheme (and especially the fact that urls are url-encoded) allows everything to be resolved from libzim.

Warc2zim

Once libzim/libkiwix provides the needed features, we need to adapt warc2zim.

Common url schema

We need to define where (using which url) we store our entries.
I suggest:

  • For origin host (the website being scraped), store it as absolute/path
  • For other hosts, store it as host.tld/absolute/path.

This way, "origin host" urls are the same as in a "non zimit" zim file.
We also remove the A/H sub-directory, which is a relic of namespaces.
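
A minimal sketch of this mapping, assuming the origin host is known up front. The function name `entryPath` is hypothetical, and keeping the query string as part of the entry path is an assumption of this sketch, not something the proposal above specifies.

```javascript
// Map an absolute url to a zim entry path under the proposed schema.
// `mainHost` is the origin host of the scrape (assumed to be known).
function entryPath(absoluteUrl, mainHost) {
  const { host, pathname, search } = new URL(absoluteUrl);
  // Drop the leading "/" so the result is a path inside the zim file.
  const path = pathname.replace(/^\//, "") + search;
  // The origin host is elided; every other host becomes the first path segment.
  return host === mainHost ? path : `${host}/${path}`;
}
```

For a scrape of kiwix.org, `entryPath("http://kiwix.org/en/about-us/", "kiwix.org")` gives `en/about-us/`, while `entryPath("http://foo-cdn.com/bar/baz.img", "kiwix.org")` gives `foo-cdn.com/bar/baz.img`.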

Implementation

  • warc2zim : url rewriting. We need to parse the content (html/css) and rewrite the urls. Rewritten urls must be relative paths and conform to the common url schema.
    We can reuse pywb with monkey patching as we do in the POC, or re-implement our own.
    This part is relatively simple. We don't do any fuzzy matching or anything else; we simply rewrite urls to the common schema. It may be simpler to start from scratch than to integrate a project not designed to be integrated.

  • warc2zim : Integrate dynamic url rewriting. Definitely too complex to re-implement. Wombat.js has been moved to a separate repository and is advertised as embeddable. Let's do it ! So we must modify html (and js content ?) to insert some js installing wombat in each page.

    . [ ] wombat : While wombat is supposed to be embeddable (and it is), it seems there is no way to specify our own url rewrite function. We need to make a PR on wombat to make this configurable.
    . [x] wombat/warc2zim: Implement the rewrite function (js) and use it in wombat
    . [x] wombat/warc2zim: In the POC, we configure wombat with a few fixed, absolute values. We need to make this relative. This may be simple configuration or may need a PR on wombat.

Other projects:

  • python-libzim: update to new libzim api
  • java-libkiwix: update to new libzim/libkiwix api
  • kiwix-desktop: use new features (routing handling). (Or do it ourselves)
  • kiwix-android: use new features (routing handling). (Or do it ourselves)
  • kiwix-serve: use new features. (But this will probably be done along with other tasks, as it is more or less the first-class implementation since it is embedded in libkiwix)
  • kiwix-ios/macos : use new features.
  • zim-check : adapt zim-check to correctly detect duplicated content. Plus other checks (fuzzy matching check ?..)
  • zim-dump : We have to properly dump aliases (hard-links ?) and fuzzy rules (nginx rewrite rules ?). See "How should zimdump deal with aliases" (openzim/zim-tools#395)

Open questions :

  • WARC Headers contain... headers. Apart from a simple parsing to detect revisit, we (POC) currently don't care and it seems to work. We have to investigate this part. Do we need them ? If yes, how ? (Storage in zim file, handling of headers in the routing...)
    (This may require adapting the zim/kiwix specification, warc2zim and libzim/libkiwix routing)
  • Video and other binary content : How are videos stored ? From the way WARCs are built, we store one record per request. So if the scraper used several requests to get/play a video, we should have several entries in the zim file.
    But the android video player uses direct access to read the video. So we need to "regroup" the records in one entry.
    TODO : See if videos are really composed of several records. If yes, how do we detect that records are about the same video ? How do we rebuild a single entry ? How will fuzzy rule matching work with a regrouped entry ?
@mgautierfr
Member Author

As @rgaudin mentioned in kiwix/kiwix-android#3485 (comment) (I had totally missed the behavior explained there), we have to properly handle external links.

Static rewriting

We are (will be) rewriting all links (<scheme>://host.tld/path) to /host.tld/path (made relative to the path of the current content). So all external links are now internal.

The only way I see to avoid that (if we want to avoid it) is to parse the warc a first time to know all the entries, and then do the (classic) handling of content but rewrite only links to existing entries (and keep other links as external links).
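
The two-pass idea can be sketched like this: a first pass collects the set of known entry paths, and the second pass rewrites a link only when its target is in that set. The function names and the use of a `Set` of `host/path` strings are assumptions; for simplicity the sketch keeps the host in every entry path.

```javascript
// Compute a relative path from the entry `fromPath` to the entry `toPath`.
// Both are entry paths of the form "host.tld/some/path".
function relativePath(fromPath, toPath) {
  const fromDirs = fromPath.split("/").slice(0, -1); // directories of the current entry
  const toParts = toPath.split("/");
  let common = 0;
  while (
    common < fromDirs.length &&
    common < toParts.length - 1 &&
    fromDirs[common] === toParts[common]
  ) {
    common++;
  }
  const up = fromDirs.slice(common).map(() => "..");
  // "./" covers the case where the target is the directory of the current entry.
  return [...up, ...toParts.slice(common)].join("/") || "./";
}

// Second pass: rewrite `href` (found in the document at `currentUrl`) only if the
// target was seen during the first pass; otherwise keep it as an external link.
function rewriteLink(href, currentUrl, knownEntries) {
  const current = new URL(currentUrl);
  const target = new URL(href, current);
  const currentPath = current.host + current.pathname;
  const targetPath = target.host + target.pathname + target.search;
  if (!knownEntries.has(targetPath)) return target.href;
  return relativePath(currentPath, targetPath);
}
```

For a page at `http://kiwix.org/en/about-us/`, a link to `http://foo-cdn.com/bar/baz.img` that exists in the archive becomes `../../../foo-cdn.com/bar/baz.img`, while a link to a page that was never scraped is left as an absolute, external url.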

Dynamic rewrite

We do the same as for static rewriting. But here we are in the browser, and checking whether the entry exists before we rewrite means at least one request.
It may be better to rewrite everything, send the request to the server and let the server handle it.

Server handling

If we get a request for a non-existent entry and the path is /host.tld/path, we may want to create a redirect to <scheme>://host.tld/path and send it back to the client.
(If we do so, we may not need to do a complex static rewrite, as we will handle it here.)
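
A rough sketch of that server-side fallback. Everything here is an assumption made for illustration: the `handleRequest` name, the `Map` of entries, the default `https` scheme (the original scheme is not stored anywhere yet), and the naive "first segment contains a dot" host test.

```javascript
// Naive test: does the first path segment look like "host.tld"?
const HOST_LIKE = /^[a-z0-9-]+(\.[a-z0-9-]+)+$/i;

// Serve an entry if it exists; otherwise redirect host-like paths to the live web.
function handleRequest(path, entries, scheme = "https") {
  const clean = path.replace(/^\//, "");
  if (entries.has(clean)) {
    return { status: 200, body: entries.get(clean) };
  }
  const [firstSegment, ...rest] = clean.split("/");
  if (HOST_LIKE.test(firstSegment)) {
    // Missing entry whose path starts with a host: treat it as an external link.
    return { status: 302, location: `${scheme}://${firstSegment}/${rest.join("/")}` };
  }
  // Missing internal link: nothing sensible to redirect to.
  return { status: 404 };
}
```

The host test is deliberately simplistic: it would also treat a missing `index.html` at the root as a host named `index.html`. That weakness is exactly what the questions below about the scheme and the `.tld` are about.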

Questions :

  • How to store the scheme of the initial request ?
  • How to differentiate google.com/path/on/google/site from a/missing/internal/link ? Is the .tld enough to know it was an external link ? Or can we use the existence of a scheme (see first question) ?
  • How to escape the viewer sandboxing ?
  • We may want to block specific external links all the time (accept links navigating to other websites, block js requests to other websites, block http PUSH requests, ...).

In fact, if we accept only links navigating to other websites and we assume they can only be in html pages, we can simply do the "complex" static rewrite and we are good.

@Jaifroid
Member

Jaifroid commented Nov 14, 2023

@mgautierfr Thank you for documenting your thinking so carefully. I've belatedly read through it.

There's a lot here, but three things stood out for me:

1. Common URL schema

You propose:

  • For origin host (the website being scraped), store it as absolute/path
  • For other host, store it as host.tld/absolute/path.

From the work I've done with warc2zim, I'm really not sure this is a valid distinction. I have noticed that some ZIMs contain valid resources to a wide range of sites. And if you think about it, this is necessary given that a page may be grabbing its JS from a CDN, or images from another domain owned by the company, and especially for video which is almost always from a different domain but is often embedded in a page, and may be first party (or may be YouTube).

It could get very difficult to decide what is first-party and what is third-party, and I think having a rigid distinction like that could break some sites.

An example: a recent Mozilla Developer Network scrape contains not only pages from MDN, but also several older MDN pages from archive.org that are linked to and scraped and displayed offline in the ZIM inside an archive.org frame! Now, that may be a mistake by the person who launched that scrape, but in other cases it won't be a mistake. I'm not sure the distinction holds. It might be better to design a more flexible format upfront that allows arbitrary numbers of domains to be stored. Currently this is actually quite logical. The domain name is included in the ZIM URL, like C/A/iep.utm.edu, without any distinction or hierarchy about what is first-party and what isn't. Look at the variation here at the beginning of the URL index of Internet Encyclopaedia of Philosophy (several different domains recorded):

[image: beginning of the URL index of the Internet Encyclopaedia of Philosophy ZIM]

2. Usefulness of Headers (pseudo H namespace)

My custom implementation in the PWA is designed mostly to make largely static resources readable (though it can rewrite most links in CSS and JS scripts, just not those that are constructed highly dynamically at run-time unless I'm lucky). Although I mostly ignore the headers, I found that sometimes they are needed. The main use case was to find a redirected resource. Sometimes that information is in the initial response body, but sometimes the server has only sent a redirect header, and there is no Response body. So, I have a recursive lookup: if a requested resource is not found at C/A/some.web.site/some_resource.html?very&cool&one (and there is no response body I can parse), then I launch a lookup for C/H/some.web.site/some_resource.html?very&cool&one, and look in the header for a moved permanently redirect, and follow it if necessary. If the header lookup fails to yield a resource, then I can know that we're dealing with an external resource link that wasn't scraped. In that case, I throw up an external link dialogue box for the user to decide if they want to leave the app and open the link in a browser.
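
The recursive lookup described above can be sketched roughly as follows. This is not the PWA's actual code: the `Map` standing in for the ZIM, the shape of the stored header object (a plain object with a `location` field already expressed as `host/path`), the hop limit and the `resolve` name are all assumptions.

```javascript
const MAX_HOPS = 5; // arbitrary guard against redirect loops (assumption)

// Look for a body under C/A/, then for a redirect header under C/H/.
// If neither yields a resource, treat the url as an unscraped external link.
function resolve(zim, url, hops = 0) {
  const body = zim.get("C/A/" + url);
  if (body !== undefined) {
    return { kind: "content", url, body };
  }
  const headers = zim.get("C/H/" + url);
  if (headers && headers.location && hops < MAX_HOPS) {
    // Header-only redirect: follow it and look up the new location.
    return resolve(zim, headers.location, hops + 1);
  }
  // Nothing in the ZIM: the caller can show the external link dialogue.
  return { kind: "external", url };
}
```

A caller would show the content when `kind` is `"content"` and offer to open the link in a browser when it is `"external"`.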

Now, while redirect may be the main use case, there are several other reasons to use the headers in more dynamic situations. The Service Worker has the logic that deals with this. I found 18 references to a function response.headers.get() in wabac.js, dealing with these situations, which gives an idea of the contexts in which they are needed. Note that Headers can either be of type "response" or of type "request". There are many more references to response headers (what is mostly stored in H/ namespace) than request headers, though there are some. I focus on response headers here:


// 1. REDIRECTS:

const status = Number(response.headers.get("x-redirect-status") || response.status);
const statusText = response.headers.get("x-redirect-statusText") || response.statusText;

// 2. MIME TYPES / transfer encoding

mime = response.headers.get("Content-Type") || "";
const encoding = response.headers.get("content-encoding");
const te = response.headers.get("transfer-encoding");

// 3. COOKIES

let presetCookie = response.headers.get("x-wabac-preset-cookie") || "";
const setCookie = response.headers.get("Set-Cookie");

// 4. COMPENSATING FOR SW RUNNING IN AN EXTENSION (**could be important for Kiwix JS!**)

// necessary as service worker seem to not be allowed to return a redirect in some circumstances (eg. in extension)
    if ((request.destination === "video" || request.destination === "audio") && request.mode !== "navigate") {
      while (response && (response.status >= 301 && response.status < 400)) {
        const newUrl = new URL(response.headers.get("location"), url);

// 5. FORMS and UPLOADS

// ... in a series of functions dealing with forms and posting content / authorizations
const lengthHeader = response.headers.get('x-ipfs-datasize') || response.headers.get('Content-Length') 
// ... function dealing with uploading files
return response.headers.get('Location')

// 6. FETCH AND RANGE REQUESTS

// ... In the Fetch Range Loaders class
this.canLoadOnDemand = ((response.status === 206) || response.headers.get("Accept-Ranges") === "bytes");
// ... Getting content length of range requests
this.length = Number(response.headers.get("Content-Length"));
let range = response.headers.get("Content-Range");
// ... In the Remote WARC proxy class (there are some comments here referring to bugs in Kiwix Serve!)
let { headers, encodedUrl, date, status, statusText, hasPayload } = headersData;
      if (reqHeaders.has("Range")) {
        const range = reqHeaders.get("Range");
        // ensure uppercase range to avoid bug in kiwix-serve
        reqHeaders = {"Range": range};
      }

// 7. AJAX REQUESTS

try {
      if (this.allowRewrittenCache && !range) {
        const response = await self.caches.match(request);
        if (response && !!response.headers.get(IS_AJAX_HEADER) === isAjax) {
          return response;
        }
      }
    }

My conclusion about Headers

I found the seven broad categories above where Response Headers are needed (and there is some code for Request Headers too). So, ISTM that to deal with the huge variety of situations in which we may have things such as range requests (especially for streaming data), or AJAX or Fetch requests, and the fact that WARC can intercept these and record the responses, it would be risky to ditch the capacity for storing and using the Headers.


3. Video BLOBs or streams of requests and responses?

You ask above whether video is stored (effectively) as BLOBs or as streams (chunks). I think the point of the WARC format is that it could be either. I don't think the fact that the Android app reads BLOBs from the ZIM in a normal (non-WARC ZIM) is relevant. If the Service Worker is doing its job correctly, it will bypass that. All the Service Worker is doing is effectively intercepting requests and providing responses (yes, it has a lot of logic to do transformations, but basically it is just doing what all Service Workers do: there is an event listener on the Fetch event, and the SW does event.respondWith( [Response with Data] )). WARC is just a recorder of Requests and Responses.

So, my experience is that in MOST cases of YouTube videos (the ones I have implemented in the PWA), there is an identifiable MP4 BLOB (after fuzzy URL transformation / reduction). But of course YouTube COULD simply stream video chunks, and have some complex JS reader that recombines them only when the right authentication response has been sent to the server. The WARC format doesn't care about this. It will merely record the authentication response sent to the server and the encoded chunks received, and the piece of JS that recombines the chunks will be happy. And, I think, Kiwix Android will also be happy because it's not reading the video in the way it would read video from a Wikimedia ZIM file. The webview is just making a request, and the response is elicited from the ZIM by the Service Worker's transformation functions, and these are sent back to the WebView, which has a JS player, and all is good (maybe!).

In any case, I don't think it's safe to assume we'll always have a BLOB to play rather than a stream. We need to design Zimit 2.0 in a way that is flexible and future-proof, which means that multimedia content is also just a set of requests and responses.

@mgautierfr
Member Author

1. Common URL schema

I think you misunderstood the url schema.
When we scrape (or convert a warc of) a website (e.g. http://kiwix.org), we know that the main domain is kiwix.org. So when we need to store an entry with a url:

  • http://kiwix.org/en/about-us/, we can store it as en/about-us/.
  • http://foo-cdn.com/bar/baz.img, we store it under foo-cdn.com/bar/baz.img.

We can still store any content from any website, without any limitation. It is just that we have one domain which is elided from the entry path, and we know this is the "main" domain of the scraped website.

The main purpose is to avoid having the domain visible in the url from a user's point of view
(http://library.kiwix.org/viewer#kiwix/kiwix.org/en/about-us/ vs http://library.kiwix.org/viewer#kiwix/en/about-us/).

2. Usefulness of Headers (pseudo H namespace)

1. REDIRECTS:

const status = Number(response.headers.get("x-redirect-status") || response.status);
const statusText = response.headers.get("x-redirect-statusText") || response.statusText;

We already have a mechanism for redirection. We should use it. If a warc record contains a redirection response, we must create a redirect entry. No need for headers for that.

2. MIME TYPES / transfer encoding

For mime types, as for redirects, we can already store them in the zim.
Encoding is part of the negotiation between the server and the client. We MUST handle it correctly. We cannot return deflated content if the client can't inflate it (even if we scraped it with a client which can).

3. COOKIES

let presetCookie = response.headers.get("x-wabac-preset-cookie") || "";
const setCookie = response.headers.get("Set-Cookie");

That's an interesting point. But it happens that cookies are the next thing I need to make work. So I will see :)

4. COMPENSATING FOR SW RUNNING IN AN EXTENSION (could be important for Kiwix JS!)

You should be able to get this information (redirect) from a classic zim file, as we will store classic redirect entries (or aliases, which will lead to even less work on your side).

5. FORMS and UPLOADS

// ... in a series of functions dealing with forms and posting content / authorizations
const lengthHeader = response.headers.get('x-ipfs-datasize') || response.headers.get('Content-Length')

I wonder why you need the lengthHeader. By definition, the server doesn't handle POST requests, so it is somewhat useless to send data to the server. (And in warc2zim, we move all the data of a POST request into the query string of the entry path: __wb_method=POST&<post_data>)
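
As an illustration of that convention, here is a small sketch that folds a POST body into the entry path. The `postEntryPath` name is hypothetical, and the sketch assumes the body is already form-urlencoded; how warc2zim really encodes other body types is not covered here.

```javascript
// Build the entry path for a POST request by appending
// "__wb_method=POST&<post_data>" to the query string.
// Assumes `postData` is already form-urlencoded (e.g. "q=kiwix&lang=en").
function postEntryPath(path, postData) {
  const separator = path.includes("?") ? "&" : "?";
  return `${path}${separator}__wb_method=POST&${postData}`;
}
```

For example, a POST of `q=kiwix` to `example.com/search` would be stored under `example.com/search?__wb_method=POST&q=kiwix`, so a plain GET-style lookup on that path is enough to find the recorded response.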

// ... function dealing with uploading files
return response.headers.get('Location')

This is the same as a redirect.

6. FETCH AND RANGE REQUESTS

Indeed, this is something we have to handle. But we can move this information into the path, as we do for POST data.

7. AJAX REQUESTS

try {
  if (this.allowRewrittenCache && !range) {
    const response = await self.caches.match(request);
    if (response && !!response.headers.get(IS_AJAX_HEADER) === isAjax) {
      return response;
    }
  }
}

What do you do if there is no response ? Rewrite the content ?
If yes, it will be handled by warc2zim, which has access to the headers. The server never rewrites content.

it would be risky to ditch the capacity for storing and using the Headers.

We never had the capacity to store and use the headers :) So we are ditching nothing :)
Adding the feature now, without knowing how to use it, is useless.
If we find that we have to store and use headers, we will deal with it at that time.

3. Video BLOBs or streams of requests and responses?

I don't think the fact that the Android app reads BLOBs from the ZIM in a normal (non-WARC ZIM) is relevant. If the Service Worker is doing its job correctly, it will bypass that.

Well, the purpose of zimit v2 is to not have a Service Worker. So there is none to do its job correctly (or not).

And, I think, Kiwix Android will also be happy because it's not reading the video in the way it would read video from a Wikimedia ZIM file. The webview is just making a request, and the response is elicited from the ZIM by the Service Worker's transformation functions, and these are sent back to the WebView, which has a JS player, and all is good (maybe!).

If I understand the android behavior correctly, the purpose is to not use the js player (or the web player) but to use the "native player". It allows the video to be played directly by android native code, bypassing all the app/webview/server/libzim code. But to do this, we need contiguous data.

In any case, I don't think it's safe to assume we'll always have a BLOB to play rather than a stream. We need to design Zimit 2.0 in a way that is flexible and future-proof, which means that multimedia content is also just a set of requests and responses.

I agree, but it has an impact on readers that make this assumption. (And it is a valid one, as we didn't have a way to store different ranges of data in different entries, so we always had one entry per content.)


BTW, here is a small teaser of a zim created with the dev version of warc2zim. It works without a service worker and should work without fuzzy matching or any fancy stuff. (Not working, at least: cookies, external link handling.)

@rgaudin
Member

rgaudin commented Nov 20, 2023

Common URL schema

I'd also prefer a single way to store entries, for the sake of not having to handle two. Maybe this was chosen to have better-looking paths for the main domain.
You discussed mostly resources which are indeed frequently on different domains but first parties on various domains are allowed. We don't use it much but it's perfectly valid to have pages on multiple domains (even if not related) and browsertrix makes no distinction. It just has a concept of seedUrls and it's only in warc2zim that we look at initial URL to set the homepage.

@mgautierfr what's the reason for the two entries format?

Usefulness of Headers

Thank you for laying them all out. It's really useful. We've discussed a couple of them as theoretical possibilities but haven't encountered them in reality.

It all looks like it can be gradually introduced back. We should probably set up a bunch of websites that trigger and use some of those use cases so we can have automated tests.

Video BLOBs or streams of requests and responses?

I've said the same thing a few times but lacked an actual use case to back it up. It's very frequent on my own laptop to see non-blobs being transferred ; and there are multiple competing stream technologies.
I think this can be somewhat controlled though at scraping time because most platforms support the various client capabilities that are found in the wild.

Each needs to be implemented, though.

@mgautierfr
Member Author

You discussed mostly resources which are indeed frequently on different domains but first parties on various domains are allowed. We don't use it much but it's perfectly valid to have pages on multiple domains (even if not related) and browsertrix makes no distinction

It is still allowed with the schema proposed (and implemented for now).

@mgautierfr what's the reason for the two entries format?

Just to have urls which look like what we are used to.

Storing the host in the entry path (<host.tld>/foo/bar) is "just" needed to avoid conflicts between resources with the same path but different sources (kiwix.org/index.html vs wikipedia.org/index.html). But we can elide one (and only one) domain from the path and conflicts are still avoided (index.html vs wikipedia.org/index.html).

http://public.kymeria.fr/KIWIX/zimit2/kiwix_no_main_domain.zim is the same zim without url simplification.

@rgaudin
Member

rgaudin commented Nov 20, 2023

Yep, I saw your comment just after publishing mine.

@Jaifroid
Member

Thanks for the explanations and reassurances, @mgautierfr. I hope at least that the research on the use cases of headers was useful. I hadn't understood the logic behind the URL proposal -- I see now that it's just a form of abbreviation, and in fact it works just as well without the abbreviation, so it's optional. Presumably the main use case for abbreviated URLs is in browsers accessing a ZIM via Kiwix Serve, because I don't think in any other context users are particularly aware of URLs (and in many contexts, they can't see them at all).

The main reason for POST requests would be to record visits to sites where a POST is used to get a resource without it being in the URL (as POSTing without relying on querystrings is considered more secure). But I imagine this is a bit unlikely for a ZIM, except for google video, which you've already implemented via a separate process.

Congratulations 🎉on those ZIM samples. I've tested both in Kiwix JS and in Kiwix PWA, and (apart from a small issue with some hyperlinks having a /C/ in them that should be easy to fix in our Service Worker, that comes from differences in our backend) they are working very well: all JS, CSS, etc. is loading correctly on the landing page, and most hyperlinks work fine. That's certainly remarkable!

@kelson42
Contributor

kelson42 commented Nov 20, 2023

I'm jumping into this very long discussion. I hope I get it right and make a useful comment. That said, I would really prefer to have one ticket per fundamental change. Anyway:

  • If we do elide one (main) FQDN, like we do today, we cannot fully avoid conflicts of this kind: www.kiwix.org/www.cloudfare.com/index.html with www.cloudfare.com/index.html. Yes, that sounds improbable, but when it does happen... what should be done?
  • I think eliding the main FQDN is a nice feature, but before continuing with it, I would like to be sure AFAP about the handling of all the edge cases... otherwise let's keep it simple and remove this "optimisation".
  • All the discussion about this URL format does not seem to be strongly related to the handling of external links, for which my first impression would be to do a static rewriting - like in other scrapers - but I have not read your arguments on that point.
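
The conflict in the first point can be reproduced with a short sketch. The `findCollisions` helper is hypothetical and simply applies the "elide the main host" rule to a list of urls; passing `null` as the main host disables the eliding.

```javascript
// Return [firstUrl, secondUrl, entryPath] for every pair of urls that
// would end up at the same entry path once `mainHost` is elided.
function findCollisions(urls, mainHost) {
  const seen = new Map(); // entry path -> first url stored there
  const collisions = [];
  for (const url of urls) {
    const { host, pathname } = new URL(url);
    // Main host: path only. Any other host: host + path.
    const path = host === mainHost ? pathname.slice(1) : host + pathname;
    if (seen.has(path)) {
      collisions.push([seen.get(path), url, path]);
    } else {
      seen.set(path, url);
    }
  }
  return collisions;
}

const urls = [
  "http://www.kiwix.org/www.cloudfare.com/index.html",
  "http://www.cloudfare.com/index.html",
];
const withEliding = findCollisions(urls, "www.kiwix.org"); // one collision
const withoutEliding = findCollisions(urls, null); // no collision
```

With the main host elided, both urls map to `www.cloudfare.com/index.html`; with the host always kept, they stay distinct. A creator could run a check of this kind and fall back to non-elided paths when a collision is found, though that is only one possible answer to the question above.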

@rgaudin
Member

rgaudin commented Nov 21, 2023

If we do elide one (main) FQDN, like we do today, we can not fully avoid conflict of that kind:

Good point!

my first impression would be to do a static rewriting - like in other scrapers

Attention, this comparison is too simple: we only do this in select scrapers (sotoki, mwoffliner and maybe wikihow) for which we know we're working off a tiny list of basic nodes. This can't be compared with zimit where possibilities are all those offered by HTML and JS. That's why we rely (or will be relying) on Wombat.js

Not sure what you meant by “static rewriting”, but if the goal is the same, the implementation is going to be different (and more complex): an external link has no other property than “not being in the ZIM”. With Wombat running in the client (to intercept calls), the client side must be able to find out whether an entry is in the ZIM or not.

@mgautierfr
Member Author

If we do elide one (main) FQDN, like we do today, we can not fully avoid conflict of that kind: www.kiwix.org/www.cloudfare.com/index.html with www.cloudfare.com/index.html. Yes that sounds improbable, but when this will happen... what should be done?
I think eliding the main FQDN is a nice feature, but before continuing with it, I would like to be sure AFAP about the handling of all the edge cases... otherwise lets keep it simple and remove this "optimisation".

I agree. Wombat is too complex and sensitive (at least given my current knowledge) to play too much with it. I have made the eliding optional and I'm testing without it.

my first impression would be to do a static rewriting - like in other scrapers

We do (will do) static rewriting. In the example zim files, all html content (almost all; not the html of ajax responses) is statically rewritten. But we also need to dynamically rewrite urls (coming from js requests) and content (responses to ajax requests).

@Jaifroid
Member

Jaifroid commented Nov 27, 2023

@mgautierfr Having finally managed to integrate the Replay system (with Service Worker) in kiwix/kiwix-js#1173, I have a better understanding of the importance of headers. You wrote above:

WARC Headers contain... headers. Apart from a simple parsing to detect revisit, we (POC) currently don't care and it seems to work. We have to investigate this part. Do we need them ? If yes, how ? (Storage in zim file, handling of headers in the routing...)

I've come to realize (belatedly) that while the headers are not (generally) important for looking up assets from the backend / server, they are potentially important instructions to the user's browser about how to deal with those assets. I apologize in advance if that's really obvious to everyone else, but I think in previous discussions we (or at least I) were focusing on how they might help us look up assets directly (the simple revisits you mention), rather than the fact they tell the browser how to deal with retrieved payloads.

Overview

Currently a client accessing a Zimit article via Kiwix Serve will:

  1. look up the Headers (via a C/H/... ZIM url);
  2. then look up the Response body, assuming the header indicates it has a payload (via a C/A/... ZIM url);
  3. if there is no payload, it will create a Response with the Headers and an empty Uint8Array for the payload;
  4. the Service Worker combines the retrieved Header and the retrieved Payload into a single Response that is returned to the browser;
  5. the browser decides what to do with the Response, based on standard instructions in the Header.

Code

The high-level code that does this in the Replay Service Worker is below. I've added some comments to make it quicker to parse (for a human), but the comment about a bug in Kiwix serve, and the if (this.type === "kiwix") block are in the original.

Obviously, there's a lot more going on behind this top-level code, but for me it gives the clearest picture of what is happening, and therefore how / if to emulate use of headers in Zimit 2.0. Again, sorry if this is stating the obvious, but it seems useful to document it, even if only for the benefit of others finding this:

// The main function for getting the resource from the ZIM / server
async getResource(request, prefix) {
    const { url, headers } = request.prepareProxyRequest(prefix);
    let reqHeaders = headers;

    if (this.type === "kiwix") {

      // Get the headers from the ZIM
      // console.debug('Attempting to resolve canonical headers for url', url);
      let headersData = await this.resolveHeaders(url);

      // If we couldn't find the requested header, do some "fuzzy matching"
      if (!headersData) {
        for (const newUrl of fuzzyMatcher.getFuzzyCanonsWithArgs(url)) {
          // A bunch of code that deals with fuzzy matching...
        }
      }

      // If we still can't find the headers, show a Not Found page
      if (!headersData) {
        // use custom error page for navigate events [ORIGINAL COMMENT]
        if (this.notFoundPageUrl && request.mode === "navigate") {
          // Code that deals with the built-in Not Found page
          ....
        }
        return null;
      }

      // Define Header data, some of which we have to construct if missing
      let { headers, encodedUrl, date, status, statusText, hasPayload } = headersData;

      // Deal with Range requests if there is a Range header
      if (reqHeaders.has("Range")) {
        const range = reqHeaders.get("Range");
        // ensure uppercase range to avoid bug in kiwix-serve [ORIGINAL COMMENT]
        reqHeaders = {"Range": range};
      }

      let payload = null;
      let response = null;

      if (hasPayload) {
        // Get the response from the ZIM in the case of Kiwix, if the headers indicated a payload was recorded
        response = await fetch(this.sourceUrl + "A/" + encodedUrl, {headers: reqHeaders});
        if (response.body) {
          payload = new c(response.body.getReader(), false);
        }

        // Deal with partial responses (probably from a Range request above)
        if (response.status === 206) {
          status = 206;
          statusText = "Partial Content";
          headers.set("Content-Length", response.headers.get("Content-Length"));
          headers.set("Content-Range", response.headers.get("Content-Range"));
          headers.set("Accept-Ranges", "bytes");
        }
      }

      // Deal with responses that don't have a payload, i.e. pure headers
      if (!payload) {
        payload = new Uint8Array([]);
      }
      if (!date) {
        date = new Date();
      }
      if (!headers) {
        headers = new Headers();
      }

      const isLive = false;
      const noRW = false;

      // Finally, construct a Response from the collected data and return it to the browser
      return new ArchiveResponse({payload, status, statusText, headers, url, date, noRW, isLive});

    }
  }
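
For readers without ArchiveResponse to hand, here is a rough sketch of how the same combination step might look using only the standard Fetch API classes (Headers and Response). The helper name buildResponse and the shape of headersData are assumptions made for illustration; this is not wabac.js code, and it leaves out the rewriting that ArchiveResponse takes care of.

```javascript
// Hypothetical helper, NOT part of wabac.js: combines stored headers with an
// optional fetched payload into a standard Response.
// Statuses for which the Response constructor refuses a body.
const NULL_BODY_STATUSES = new Set([101, 204, 205, 304]);

function buildResponse(headersData, fetched) {
  // headersData: { headers, status, statusText } (assumed shape of the header entry)
  // fetched: the Response for the payload entry, or null if none was recorded
  const headers = new Headers(headersData.headers || {});
  let status = headersData.status || 200;
  let statusText = headersData.statusText || "";
  let body = new Uint8Array([]); // empty payload when only headers exist

  if (fetched) {
    if (fetched.body) {
      body = fetched.body;
    }
    // A Range request produced a partial response: reflect that in the result
    if (fetched.status === 206) {
      status = 206;
      statusText = "Partial Content";
      for (const name of ["Content-Length", "Content-Range"]) {
        const value = fetched.headers.get(name);
        if (value !== null) {
          headers.set(name, value);
        }
      }
      headers.set("Accept-Ranges", "bytes");
    }
  }

  if (NULL_BODY_STATUSES.has(status)) {
    body = null;
  }
  return new Response(body, { status, statusText, headers });
}
```

A header-only entry, such as a recorded redirect, would go through as buildResponse({ headers: { Location: "/new" }, status: 301, statusText: "Moved Permanently" }, null), and the browser then follows the Location header as usual.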

@Jaifroid
Member

@mgautierfr A potentially interesting observation from my work on enabling Replay support in Kiwix JS-family apps. NB, this is not a recommendation to change approach, just an observation that might be useful as a fallback. So, I'm just putting it out there. Feel free to shoot this down! I think your approach is ultimately more universal, as it creates a standard ZIM that existing readers should be able to use without changes to their backends.

I realized belatedly that wabac.js can run as a Web Worker instead of as a Service Worker. When you run it as a Worker, it writes ww init to the console instead of sw init, and adjusts itself accordingly. In that mode, you simply postMessage the URLs you want transformed to it, and it replies with the ZIM URL once it has transformed each one (so long as it's initialized the right way).
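
As a rough illustration of that request/reply pattern, the sketch below wraps a postMessage round trip in a Promise. The message shapes ({ id, url } going out, { id, zimUrl } coming back) and the name requestZimUrl are purely hypothetical: the actual messages wabac.js expects aren't documented in this thread, so treat this as the general technique rather than its real protocol.

```javascript
// Hypothetical message shapes: { id, url } out, { id, zimUrl } back.
// These are assumptions for illustration, not the wabac.js protocol.
let nextRequestId = 0;

function requestZimUrl(worker, url) {
  return new Promise((resolve) => {
    const id = nextRequestId++;
    const onMessage = (event) => {
      // Ignore replies that belong to other in-flight requests
      if (event.data && event.data.id === id) {
        worker.removeEventListener("message", onMessage);
        resolve(event.data.zimUrl);
      }
    };
    // Listen first, then send, so a fast reply cannot be missed
    worker.addEventListener("message", onMessage);
    worker.postMessage({ id, url });
  });
}
```

Tagging each request with an id lets several URLs be in flight at once without the replies getting mixed up, which matters if requests for more than one ZIM arrive interleaved.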

Now I wondered if the webview used by Kiwix Desktop can run Web Workers. They're pretty old technology. Even IE11 can run a Web Worker (though obviously not this one, because it uses very advanced JS, lots of async, etc.). And if that's the case, can the webview catch all Fetch requests within its scope (as a Service Worker does)? Again, if that's the case, ISTM that there might be a "simple" solution for supporting current Zimit ZIMs. (Well, nothing is ever "simple"...)

Of course, this would also require work in the Kiwix Desktop backend, and probably, like in Kiwix JS, would require hosting your own custom copy of wabac.js (I renamed it replayWorker.js for clarity) and topFrame.html, and also forcing the settings in the Web Worker so it recognizes the path of URLs sent to it. I had to write some code that alters the internal state of the Web Worker's configuration and, if necessary, changes it each time there's a request, so that I could support multi-ZIM requests coming in arbitrarily. There's probably a standard way of doing it, but there's no API documentation, and much of how it initializes itself is pretty obscure (almost no comments in the code), so I wrote a simple routine that does the job fast.

Caveats:

  • In Kiwix JS, replayWorker.js is imported by the Service Worker, and so it runs in the scope of the Service Worker. So I disable the message port listeners at several points in replayWorker.js (a.k.a. wabac.js) and simply call the functions directly from our Service Worker (this helps with speed). But it's the same principle, and doing it via postMessage ought to work in the same way (and maybe mean less tinkering with wabac.js).
  • I suppose the Web Worker would need to be loaded by topFrame.html, and this means bypassing load.js, and means a non-standard configuration (hence needing to "massage" the internal state of the Worker to get the expected result). It would need at least a function in wabac.js to do the dirty work, unless we can work out a standard way of initializing it. A lot depends on whether you need to support random multi-ZIM access or not.
  • Kiwix Desktop would need to intercept the Fetch requests of the URLs transformed by wombat.js, recognize them as needing further transformation, and postMessage them to wabac.js. In my configuration, wabac.js doesn't postMessage them back; it just calls back with the transformed link. I think, but it needs testing, that it will reply to a postMessage if run as a Worker (rather than issuing a Fetch request). If not, it would need patching to do that.
  • No guarantees! It's hypothetical and untested. I considered changing approach halfway through my development work as it seemed simpler (Web Workers are not put to sleep after 30 seconds or so, whereas Service Workers are, and so need to be reinitialized when a Fetch request is made), but in the end, I had invested too much time in running it directly off the Service Worker, so I stuck with the latter approach. Swings and roundabouts.
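
On the last point, the usual way to cope with a Service Worker being put to sleep is to guard every request with a lazy, cached initialization. The sketch below shows that general pattern only; makeLazyInit and the init callback are hypothetical names standing in for whatever sets up the replay engine, not anything from wabac.js or Kiwix JS.

```javascript
// Hypothetical helper: `init` stands in for whatever (re)builds the replay state.
function makeLazyInit(init) {
  // This variable is lost whenever the Service Worker is terminated, so the
  // first request after a restart triggers initialization again.
  let ready = null;
  return function ensureReady() {
    if (!ready) {
      ready = Promise.resolve()
        .then(init)
        .catch((err) => {
          ready = null; // allow a retry on the next request
          throw err;
        });
    }
    return ready;
  };
}
```

In a fetch handler this would be used along the lines of event.respondWith(ensureReady().then(() => handleRequest(event.request))), where handleRequest is again a placeholder. A Web Worker that stays alive would pay the initialization cost only once, which is the trade-off described above.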

@kelson42
Copy link
Contributor

kelson42 commented Jan 13, 2024

Closing in favour of openzim/zimit#193. See also https://github.com/orgs/openzim/projects/10
