warc2zim ZIM files are not playable with Kiwix JS #644

Closed
kelson42 opened this issue Jul 18, 2020 · 26 comments

Comments

@kelson42
Collaborator

We are working on warc2zim, a tool that creates a ZIM file from a WARC file. These ZIM files integrate a JS player that relies on a Service Worker. You can find an example file here:
http://tmp.kiwix.org/warc2zim.zim

I have not managed, at all, to read this ZIM file with:

@kelson42 kelson42 added the bug label Jul 18, 2020
@kelson42 kelson42 changed the title warc2zim ZIM files are playable with Kiwix JS warc2zim ZIM files are NOT playable with Kiwix JS Jul 18, 2020
@Jaifroid
Member

Using Kiwix JS Windows (because it has more console logging turned on for debugging) running in SW mode, I see the following:

image

This tells me that the main HTML loads, and that the HTML is asking for two JS files called replay/ui.js and replay/sw.js. They are successfully extracted. However, ui.js believes it has received a bad response code 404 when attempting to retrieve replay/sw.js. Our Service Worker is clearly trapping the request for sw.js, because it gets retrieved, but ui.js cannot read the response. I can see in the code of ui.js that there are a lot of assumptions about the environment in which it is running, including extracting URL strings from window.location.href, which may well not work inside an iframe embedded in an iframe. But this is just a guess, it needs careful debugging.

That's as far as I've got for now.

@kelson42
Collaborator Author

@Jaifroid Thank you so much for your quick and precise feedback. @ikreymer is the author of warc2zim. @ikreymer it would be great if you could collaborate with @Jaifroid to identify the problems and maybe fix them as part of implementing this ticket, or at least to better understand the overall challenges.

@ikreymer

This system is using its own service worker, right? It'll need to use the one provided in the zim for this to work, which I think should work as it registers a more specific path. I've made an even simpler .zim to test service worker loading from zim, available in this comment: kiwix/kiwix-desktop#487 (comment)
Perhaps this will help determine what's going on.

@kelson42
Collaborator Author

@ikreymer I have uploaded the new ZIM here as well: http://tmp.kiwix.org/basicsw.zim. AFAIK, and contrary to Kiwix Desktop Linux/Windows, it should theoretically work with Kiwix JS on Chromium, so it would be interesting to understand why it does not. Does the SW of Kiwix JS conflict here with the one in the ZIM? Is there no way to have both?

@Jaifroid
Member

This system is using its own service worker, right? It'll need to use the one provided in the zim for this to work, which I think should work as it registers a more specific path. I've made an even simpler .zim to test service worker loading from zim,

@ikreymer So long as the Service Workers are registered for different scopes, it should be OK. Kiwix JS runs in two configurations: one is a Service Worker configuration where every https request to a ZIM URL is trapped and retrieved from the decompression engine, and the other is one where we have no Service Worker but instead parse the HTML of the article page and retrieve and attach the assets that are required by the DOM. The latter mode doesn't attach JavaScript, except in an experimental branch. I'll try out your code in that branch, as it might tell us if the Service Workers are in conflict with each other.

@Jaifroid
Member

@ikreymer One issue (with some readers) may be that this ZIM does not follow the ZIM specification of having JavaScript in the -/ namespace. This is the HTML of the landing page:

<html>
    <head>
        <style>
            body {
              width: 100%
              height: 100%;
              overflow-y: hidden;
              margin: 0px;
              padding: 0px;
            }
        </style>
        <script src="./replay/ui.js"></script>
    </head>
    <body>
        <replay-web-page
           source="../netpreserve-twitter.warc"
           replayBase="./replay/"
           deepLink="true"
           embed="replayonly"
           url="https://twitter.com/netpreserve">
        </replay-web-page>
    </body>
</html>

In a ZIM conforming to the specification, I would expect the ui.js src to appear as ../-/ui.js. For readers that do any kind of URL checking or parsing, this could be an issue. It's an issue for Kiwix JS in jQuery mode (the mode that parses the DOM for elements to attach) because it does not recognize the ZIM URL as a script asset that needs to be extracted from the ZIM. I am not sure how relevant this is for other readers.

@kelson42 You mentioned at some point that namespaces may be removed from the ZIM specification. Is that still the plan? Do you see an issue with this ZIM format not conforming to the namespace spec?

@Jaifroid
Member

@ikreymer Will the reference to https://twitter.com/netpreserve cause the reader to fetch something from Twitter, or is it just a string to display the original source of the data? If the former, there would be CORS issues.

@Jaifroid
Member

Jaifroid commented Jul 19, 2020

So, I patched my branch to read the .js file from the non-conforming location, and these are the error messages logged after attaching and running ui.js:

image

As you can see here ui.js is not calculating the ZIM URL for sw.js correctly. At the very least, the URL should probably be: http://localhost/Repos/kiwix-js/www/A/replay/sw.js, since the namespace of the current article is A, and ui.js has ZIM URL ../A/replay/ui.js.

EDIT: I think your script is probably calculating the location correctly, but some readers do not set the location of the iframe to match the subdirectory that an asset is in, which is the case for Kiwix JS in jQuery (DOM-parsing) mode. Instead, the reader relies on identifying a ZIM URL by its namespace. Hence the error logged.

In Service Worker mode, Kiwix JS does set the iframe URL correctly, so that assets can be referred to with pure relative links and no further processing of URLs. However, in SW mode we are still getting the bad HTTP response code, even though I can confirm that sw.js is in fact appearing in our Service Worker's cache.

@kelson42
Collaborator Author

@ikreymer We should not have any iframe. Would you be able to build an example without iframe please?

@ikreymer

ikreymer commented Jul 19, 2020

The replay system uses an iframe internally (it's not visible); that's not the issue. That's how the replay works. I have made a simpler ZIM that just tests a service worker (see: kiwix/kiwix-desktop#487 (comment)).
Having thought about this a bit more, I don't think this approach will work.

The replay system assumes that it is loaded from the ZIM and then installs its own service worker, which attempts to load paths from the current directory, e.g. /A/. It expects a server to answer requests under /A/ by loading the content from the ZIM.

In kiwix-js, it is assumed that the service worker does the loading from ZIM, so once the replay system loads and replaces that SW, it won't be able to load from ZIM anymore! There can only be one SW per page, and they can't be chained.
To make this work, the ZIM loading code would need to be built into the service worker that's stored in the ZIM (similar to the way I've done it before with just xzdec) and activated if it is running in full client-side mode.

@ikreymer

To put it another way, the way this is designed to work is that the service worker will make a fetch request to upstream:
fetch('./A/https://example.com/') and for headers fetch('./h/https://example.com/').
When running in kiwix serve, that will translate to http://<kiwix-serve:port>/A/https://example.com/.
For some of the other tools, it would translate to zim://<zim>/A/https://example.com/ (though they don't work for other reasons)
But in kiwix-js, there is no upstream server, so an upstream fetch() won't work. It either needs to have the ZIM loading there in the service worker, or load it some other way - it might be doable with a MessageChannel, where messages are sent from the service worker to the outer frame page which has already loaded the ZIM - eg. it'll need to send a message to the kiwix-js page to get a ZIM record, and wait for it to be returned - https://developer.mozilla.org/en-US/docs/Web/API/MessageChannel. This could work, I haven't tried it before.

Again, there is sort of a conflict because in kiwix-js, you're assuming that the reader has the service worker and loads static paths from ZIM. But in this new system, the service worker is coming from the ZIM and takes over the loading/rendering and needs to be able to access other files in the ZIM directly.

@Jaifroid
Member

OK, that makes a lot of sense. It is possible for Kiwix JS in DOM parsing mode to load a Service Worker from the ZIM and install it. The problem is that the Service Worker installed would need to be aware of the environment it's operating in and use a message channel to the back end to retrieve files. The messages are quite simple. But it would be a custom, not generic, solution.
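
For illustration only, the message round trip discussed above could look roughly like the sketch below. It is not code from kiwix-js or from the Replay system: `fakeZim`, `servePort`, `createRequester` and the `{ id, zimUrl }` message shape are all hypothetical names, and both ends of the channel live in one script here, whereas in a real reader one port would be held by the Service Worker and the other by the app page that has the ZIM open.

```javascript
// Hypothetical stand-in for the reader's ZIM backend (assumption, not kiwix-js code).
const fakeZim = new Map([["A/index.html", "<html>hello</html>"]]);

// "App page" side: answer every request arriving on the port with a ZIM record.
function servePort(port) {
  port.onmessage = (event) => {
    const { id, zimUrl } = event.data;
    const found = fakeZim.has(zimUrl);
    port.postMessage({ id, zimUrl, found, content: found ? fakeZim.get(zimUrl) : null });
  };
}

// "Service Worker" side: returns a function that sends a request and
// resolves once the reply carrying the same id comes back.
function createRequester(port) {
  let nextId = 0;
  const pending = new Map();
  port.onmessage = (event) => {
    const resolve = pending.get(event.data.id);
    if (resolve) {
      pending.delete(event.data.id);
      resolve(event.data);
    }
  };
  return (zimUrl) =>
    new Promise((resolve) => {
      const id = nextId++;
      pending.set(id, resolve);
      port.postMessage({ id, zimUrl });
    });
}

async function demo() {
  const channel = new MessageChannel();
  servePort(channel.port2);
  const getRecord = createRequester(channel.port1);
  const hit = await getRecord("A/index.html");
  const miss = await getRecord("A/missing.html");
  // Close both ports so that nothing is left listening.
  channel.port1.close();
  channel.port2.close();
  return { hit, miss };
}

demo().then(({ hit, miss }) => console.log(hit.found, miss.found));
```

The id on each message is what lets several requests be in flight at once over a single channel. The Service Worker would still have to wrap the returned content in a Response itself, which is the environment-specific part that makes this a custom rather than generic solution.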

@Jaifroid
Member

There can only be one SW per page, and they can't be chained.

@ikreymer Are you sure this is quite true? We load all ZIM articles in an iframe, which surely must be able to register its own Service Worker? Our Service Worker is registered for the whole scope of the app, but it only intercepts URLs that match this Regular Expression:

var regexpZIMUrlWithNamespace = /(?:^|\/)([^\/]+\/)([-ABIJMUVWX])\/(.+)/;

If, as you say, all your requests are in namespace A, then they should be intercepted.
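
To make the matching concrete, here is a small sketch that runs the regular expression quoted above against two URLs. `parseZimUrl` is a hypothetical helper written only for this illustration, and the example URLs are assumptions modelled on the localhost paths mentioned elsewhere in this thread.

```javascript
// Regular expression quoted in the comment above.
var regexpZIMUrlWithNamespace = /(?:^|\/)([^\/]+\/)([-ABIJMUVWX])\/(.+)/;

// Hypothetical helper: split a URL into ZIM name, namespace and in-ZIM path,
// or return null when the URL does not look like a ZIM URL.
function parseZimUrl(url) {
  var match = regexpZIMUrlWithNamespace.exec(url);
  if (!match) return null;
  return { zim: match[1], namespace: match[2], path: match[3] };
}

// A request for an article in namespace A is recognized...
var articleUrl = "http://localhost:8080/kiwix-js-windows/fas-military-medicine_2022-03.zim/A/index.html";
// ...while a request for one of the app's own files is not.
var appUrl = "http://localhost:8080/kiwix-js-windows/index.html";

console.log(parseZimUrl(articleUrl));
console.log(parseZimUrl(appUrl));
```

Note that the expression only decides which requests our Service Worker is willing to handle; it says nothing about whether the fetch event reaches our Service Worker in the first place.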

@kelson42
Collaborator Author

kelson42 commented Mar 4, 2021

Starting tomorrow, ZIM files of this kind will be officially published at https://download.kiwix.org/zim/zimit

@Jaifroid
Member

Jaifroid commented Apr 19, 2022

I've been investigating this a little more, since more and more people are using Zimit ZIMs, despite poor support amongst Kiwix clients.

Through some trickery, I can get the sw.js Service Worker to load, but because it loads with the scope (e.g.) http://localhost:8080/kiwix-js-windows/fas-military-medicine_2022-03.zim/A/, it then takes over Fetch intercepts for that scope.

Unfortunately, this means that our own Service Worker, while still installed and running, no longer intercepts Fetch events for that scope. Therefore, when sw.js attempts to load http://localhost:8080/kiwix-js-windows/fas-military-medicine_2022-03.zim/A/index.html it is not intercepted and never gets extracted from the ZIM. An error is thrown, and the whole app becomes unresponsive (even loading other ZIMs no longer works until the app is restarted).
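
A simplified model of that takeover, as I understand it: when several registrations have a scope that is a prefix of a page's URL, the browser gives control to the one with the longest scope. The function below only imitates that rule in plain JavaScript and is not a browser API; `controllingScope` is a hypothetical name, and the app scope is an assumption based on the localhost path above.

```javascript
// Simplified model (not a browser API): pick the registration that would control
// a client at clientUrl, i.e. the longest scope that is a prefix of that URL.
function controllingScope(scopes, clientUrl) {
  const matches = scopes.filter((scope) => clientUrl.startsWith(scope));
  if (matches.length === 0) return null;
  return matches.reduce((best, scope) => (scope.length > best.length ? scope : best));
}

// Assumed scope of the reader's own Service Worker (the whole app).
const appScope = "http://localhost:8080/kiwix-js-windows/";
// Scope that sw.js from the ZIM registers, as reported above.
const replayScope = "http://localhost:8080/kiwix-js-windows/fas-military-medicine_2022-03.zim/A/";
const iframeUrl = replayScope + "index.html";

// Before sw.js is registered, the app's Service Worker controls the article iframe.
console.log(controllingScope([appScope], iframeUrl));
// Afterwards, the more specific replay scope wins for the same iframe.
console.log(controllingScope([appScope, replayScope], iframeUrl));
```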

The only solution I can think of to support these ZIM types would be to incorporate the logic that reads assets from the replay system into our app, which would not be a generic solution. It's also, currently, a lot of minified code that is hard to understand.

@Jaifroid
Member

Jaifroid commented Apr 19, 2022

Ironically, our jQuery mode can access files in a Zimit ZIM from the title index, and can load index.html, as well as extract the PDFs (in the case of the military medicine ZIM used here). Of course, this completely bypasses the replay system, and stylesheets, etc., are not loaded because of the custom URLs used for them.

image

@Jaifroid
Member

@ikreymer It strikes me that one possibility for leveraging our existing Service Worker is to use the self.importScripts('sw.js') method within our SW (see https://developer.mozilla.org/en-US/docs/Web/API/WorkerGlobalScope/importScripts).

I've tried this with sw.js (including a copy of sw.js in our implementation). This works "too well" as in, it fully takes over our fetch event listener, but of course it then can't extract anything from the ZIM. While it might be patchable so that we maintain our fetch listener, and we could feed event requests to sw.js functions as needed, we'd need an un-minified version of sw.js in order to be able to experiment.

I've looked on https://github.com/webrecorder/replayweb.page/tree/main/src, but all the source code is modularized (and we don't use that module system in our code).

The good news is that sw.js can be imported successfully, without error, and it does look like a viable approach with code patching. See screenshot.

It would be great if we could get support for Zimit ZIMs working in more clients than just Android and Kiwix Serve, but I think we'd need a bit of help from you with this!

image

@Jaifroid
Member

Jaifroid commented Apr 21, 2022

Unrelated to the fix proposed above, I've enabled limited support for Zimit ZIMs (primarily static content) in the experimental branch of Kiwix JS Windows "Attempt WARC loading". See screenshot. It does not even attempt to load the Replay system (in fact it has to block it, to prevent fatal errors), but instead loads as many assets as it can for which it's possible to determine and patch the ZIM URL. It loads CSS and some of each page's JavaScript, and images where the storage location can be worked out.

The reason I haven't attempted to apply this to Kiwix JS is that the code does things I know would be unacceptable in this Repo: it alters and patches content on the Message Channel, and changes hyperlinks in each article in Service Worker mode (when it detects a Zimit ZIM, which it does from the warc-header mime type in the ZIM metadata). It has to find the ZIM's home page by inspecting the JavaScript on the landing page (which is the Replay system).

This is definitely a "poor man's" version that will only be useful for informational ZIMs. The WARC system looks great and is very clever, a natural fit with Kiwix. But the format of WARC ZIMs is quite funky and breaks the ZIM spec in a few ways (hyperlinks do not contain ZIM URLs or are not relative, JavaScript doesn't use the correct ZIM URL for loading assets, the landing page is not the intended home page, etc.). There is also custom redirect code on each page that seems to patch the Android app specifically, which has to be worked around.

A proper solution would involve loading the Replay system and allowing it to live alongside the existing Service Worker as suggested above. For now, this is the best I can do -- but it's definitely monkey wrenching.

image

@Jaifroid
Member

The development PWA on Kiwix JS Windows can now be used for testing Zimit ZIM support, but see instructions here. It works in both jQuery and SW modes (the best experience is in SW mode).

@mossroy I began some work on enabling support for Zimit ZIMs in Kiwix JS, because it seems to me that this would be quite a big feature for us. For example, there is a ZIM of most of the content of Mozilla Developer Network, and it works well, even in jQuery mode. I would propose to do it as a transformation layer: a separate module called transformZimit.js. This module transforms all the links on a Zimit page into a format that can be read by our existing app. I think this would be uncontroversial in jQuery mode. However, in Service Worker mode, this transformation has to occur in the backend, before the HTML, CSS, etc. are delivered to the Service Worker. You can think of it as a Server Side transform.

Is this something you would be interested in? I think it can be done in a way that isn't "monkey-wrenching", but I don't want to work on it if you'd rather wait for a way of running the Replay system alongside our Service Worker. As I mentioned above, this really needs enabling in the sw.js code that is included in the ZIM. Alternatively, we could provide our own sw.js and patch it. However, I'm not really sure what benefit the Replay system provides, as all assets seem to be included in these ZIMs anyway. As I understand it, the Replay system provides cached requests and responses in json files, but I'm not sure if most of this just duplicates content included in the ZIM, or if it does something special. On the Android app, I can't see any real difference (but it's buggy).

@Jaifroid
Member

Jaifroid commented Apr 24, 2022

PS, so far I've only tested:

  • lowtechmagazine.com_en_all_2021-12.zim
  • musictheory.net_en_all_2021-12.zim (audio not working)
  • developer.mozilla.org_en_all_2022-03.zim (good support)
  • fas-military-medicine_2022-03.zim
  • stacks.math.columbia.edu_en_all_2021-07.zim
  • bibnum_fr_all_2021-12.zim (was only working in jQuery mode; fixed in v1.9.72)

Note most content isn't available via library.kiwix.org, so use https://download.kiwix.org/zim/zimit/ instead.

@mossroy
Contributor

mossroy commented Apr 24, 2022

Well, that would need a discussion with @kelson42 (and maybe others), first of all to understand the need.

We also need to know how this use of a ServiceWorker in a ZIM file is "standardized" or not.
It's no use trying to "run" after these changes if there is not a kind of "contract" we could rely on (a bit like the openzim format is a contract). Especially if we need to "merge" both ServiceWorkers.
If we decide it's important to try to support these files, my first impression is we should focus on ServiceWorker mode, and not waste time on jQuery mode.

We might discuss all this at the hackathon, for example (we could schedule a video meeting with you, @Jaifroid, if you do not attend, and if you can and wish to).

@Jaifroid
Member

Thank you @mossroy, that makes a lot of sense. We can discuss (probably virtually) with @kelson42.

Just to be clear, the translation layer I propose would work equally well for jQuery mode as for Service Worker mode. It wouldn't need special coding for one or the other. This is because it would simply be a way of converting the absolute hyperlinks on a Replay page into ZIM links. It's all that's needed to read these files (albeit with some possible limitations) in their current format (plus logic to detect that we are dealing with a Replay ZIM).

A version that relies on loading or incorporating the Replay Service Worker would of course only work in Service Worker mode.

There is a proposal to change the format and include the warcz gzip file inside the ZIM, instead of assets stored with xz or zstd compression. See openzim/warc2zim#81. I think this would be a regression in usability, as the ZIM would rely completely on the Replay software, and there would be no other way to access the contents, which seems to be against the spirit and the spec of Open ZIM. However, nothing has happened with this issue in 16 months.

@ikreymer worked quite closely with @rgaudin to enable reading Zimit archives in the Android app. The solution adopted there was to register the Replay Service Worker on the domain https://kiwix.app, even though this domain does not serve anything. The Replay system included in the ZIMs contains custom logic to help the Android app. We would need something similar if adopting the Service Worker approach. However, I have not seen a good case as to why this is necessary, when translation of the links into ZIM URLs seems to work just as well in my prototype -- well some small bugs still need ironing out, ironically in Service Worker mode! -- and could be a good approach for Kiwix Desktop and other apps that can't run Service Workers in the Web view.

Currently only the Android app and Kiwix Serve can read these archives, if I've understood correctly.

@Jaifroid
Member

Here's an example of a quite complex Zimit ZIM running in jQuery mode with no Service Worker either in the iframe or in the reader. Indeed, JavaScript is completely disabled here. MSDN also works just as well in either mode, despite the warning box I put at the top.

image

ZIM is bibnum_fr_all_2021-12.zim

@Jaifroid
Member

Jaifroid commented Apr 25, 2022

PS I should explain why translation of links cannot currently be done fully in our service worker. It's because many links in these ZIMs are absolute, of type https://some.random.site/some-content-link.html. We need to translate these to /fr_zimname_04_22.zim/A/some.random.site/some-content-link.html before our SW will pick them up (the namespace could also be /C/, but we get it from the dirEntry). From what I can tell, one ZIM can contain more than one scraped domain, so we must translate all of them, and handle failure with a dialogue box offering to open a resource online if not in ZIM.

Because we don't have a WebView, we can't trap ALL webrequests like the Android app can, but we do have a backend that can see all data coming from the ZIM and transform them. Hence my likening this to a Server Side Transform.
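
As a rough sketch of the kind of link translation described above (not the actual transform code): `toZimUrl` is a hypothetical helper, the default namespace would really come from the dirEntry, and carrying the query string and fragment over unchanged is an assumption about how such links should be handled.

```javascript
// Hypothetical helper: rewrite an absolute http(s) link into a ZIM URL of the
// form /<zimName>/<namespace>/<host><path>. Anything else is returned unchanged.
function toZimUrl(href, zimName, namespace = "A") {
  let url;
  try {
    url = new URL(href); // throws for relative links, which need no translation
  } catch {
    return href;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return href;
  // Assumption: query string and fragment are carried over as-is.
  return "/" + zimName + "/" + namespace + "/" + url.host + url.pathname + url.search + url.hash;
}

console.log(toZimUrl("https://some.random.site/some-content-link.html", "fr_zimname_04_22.zim"));
console.log(toZimUrl("images/logo.png", "fr_zimname_04_22.zim"));
```

Because the host becomes part of the ZIM path, the same function covers ZIMs that contain more than one scraped domain. Detecting that a translated resource is missing from the ZIM, and offering to open it online, would have to happen later, at lookup time.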

One curiosity I'm noticing is that many sites seem to run better if we disable JS, i.e. they run better in jQuery mode. This is probably an artefact of the fact that I'm not yet translating links inside JavaScript (still experimental).

@Jaifroid
Member

One curiosity I'm noticing is that many sites seem to run better if we disable JS, i.e. they run better in jQuery mode. This is probably an artefact of the fact that I'm not yet translating links inside JavaScript (still experimental).

For info, this part of my comment above is no longer true. It was indeed caused by not translating links inside JS files, now fixed for the most part.

@mossroy mossroy added enhancement and removed bug labels Jun 1, 2022
@mossroy mossroy changed the title warc2zim ZIM files are NOT playable with Kiwix JS warc2zim ZIM files are not playable with Kiwix JS Jun 1, 2022
@Jaifroid
Member

Superseded by #1009.
