warc2zim ZIM files are not playable with Kiwix JS #644
Using Kiwix JS Windows (because it has more console logging turned on for debugging) running in SW mode, I see the following: This tells me that the main HTML loads, and that the HTML is asking for two JS files called That's as far as I've got for now. |
This system is using its own service worker, right? It'll need to use the one provided in the zim for this to work, which I think should work as it registers a more specific path. I've made an even simpler .zim to test service worker loading from zim, |
@ikreymer I have uploaded the new ZIM here as well: http://tmp.kiwix.org/basicsw.zim. AFAIK, and contrary to Kiwix Desktop Linux/Windows, it should theoretically work with Kiwix JS on Chromium. It would be interesting to understand why it does not. So the SW of Kiwix JS conflicts here with the one in the ZIM? Is there no way to have both? |
@ikreymer So long as the Service Workers are registered for different scopes, it should be OK. Kiwix JS runs in two configurations: one is a Service Worker configuration where every https request to a ZIM URL is trapped and retrieved from the decompression engine, and the other is one where we have no Service Worker but instead parse the HTML of the article page and retrieve and attach the assets that are required by the DOM. The latter mode doesn't attach JavaScript, except in an experimental branch. I'll try out your code in that branch, as it might tell us if the Service Workers are in conflict with each other. |
@ikreymer One issue (with some readers) may be that this ZIM does not follow the ZIM specification of having JavaScript in the
In a ZIM conforming to the specification I would expect to see the @kelson42 You mentioned at some point that namespaces may be removed from the ZIM specification. Is that still the plan? Do you see an issue with this ZIM format not conforming to the namespace spec? |
@ikreymer Will the reference to |
So, I patched my branch to read the .js file from the non-conforming location, and these are the error messages logged after attaching and running
EDIT: I think your script is probably calculating the location correctly, but some readers do not set the location of the iframe to match the subdirectory that an asset is in, which is the case for Kiwix JS in jQuery (DOM-parsing) mode. Instead, the reader relies on identifying a ZIM URL by its namespace. Hence the error logged. In Service Worker mode, Kiwix JS does set the iframe URL correctly so that assets can be referred to with pure relative links and no further processing of URLs, however in SW mode we are getting the bad HTTP response code, even though I can confirm that sw.js is in fact appearing in our Service Worker's cache. |
@ikreymer We should not have any iframe. Would you be able to build an example without iframe please? |
The replay system uses an iframe internally (it's not visible); that's not the issue. That's how the replay works. I have made a simpler ZIM that just tests a service worker (see: kiwix/kiwix-desktop#487 (comment)). The replay system assumes that it is loaded from the ZIM and then installs its own service worker, which attempts to load paths from the current directory, e.g. In kiwix-js, it is assumed that the reader's service worker does the loading from the ZIM, so once the replay system loads and replaces that SW, it won't be able to load from the ZIM anymore! There can only be one SW per page, and they can't be chained. |
To put it another way, the way this is designed to work is that the service worker will make a fetch request to upstream: Again, there is sort of a conflict because in kiwix-js, you're assuming that the reader has the service worker and loads static paths from ZIM. But in this new system, the service worker is coming from the ZIM and takes over the loading/rendering and needs to be able to access other files in the ZIM directly. |
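The scope conflict described above can be illustrated with a small sketch. As I understand the Service Worker specification, when several registrations exist on one origin, the page is controlled by the registration whose scope is the longest prefix of the page URL. The helper name `controllingScope` and the `reader.example` URLs below are hypothetical, chosen only for illustration; this is not code from Kiwix JS or the replay system.

```javascript
// Hypothetical helper (not part of Kiwix JS): given the scope URLs of all
// registered Service Workers, return the scope that would control a page at
// pageUrl, or null if no registration matches.
function controllingScope(scopes, pageUrl) {
  // A registration matches when the page URL starts with its scope URL.
  const matches = scopes.filter((scope) => pageUrl.startsWith(scope));
  if (matches.length === 0) return null;
  // The longest (most specific) matching scope wins.
  return matches.reduce((best, s) => (s.length > best.length ? s : best));
}

// Assumed example scopes: the reader's app-wide worker, and a worker
// registered from inside the ZIM for a more specific path.
const scopes = [
  "https://reader.example/",
  "https://reader.example/archive.zim/A/",
];

const zimPage = "https://reader.example/archive.zim/A/index.html";
const appPage = "https://reader.example/index.html";

console.log(controllingScope(scopes, zimPage));
console.log(controllingScope(scopes, appPage));
```

Under these assumptions, a page inside the ZIM path is controlled only by the worker with the more specific scope, so the reader's app-wide worker would no longer see its fetch events.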
OK, that makes a lot of sense. It is possible for Kiwix JS in DOM parsing mode to load a Service Worker from the ZIM and install it. The problem is that the Service Worker installed would need to be aware of the environment it's operating in and use a message channel to the back end to retrieve files. The messages are quite simple. But it would be a custom, not generic, solution. |
@ikreymer Are you sure this is quite true? We load all ZIM articles in an iframe, which surely must be able to register its own Service Worker? Our Service Worker is registered for the whole scope of the app, but it only intercepts URLs that match this Regular Expression:
If, as you say, all your requests are in namespace A, then they should be intercepted. |
From tomorrow, ZIM files of this kind will be officially published at https://download.kiwix.org/zim/zimit |
I've been investigating this a little more, since more and more people are using Zimit ZIMs, despite poor support amongst Kiwix clients. Through some trickery, I can get the Unfortunately, this means that our own Service Worker, while still installed and running, no longer intercepts Fetch events for that scope. Therefore, when The only solution I can think of to support these ZIM types would be to incorporate the logic that reads assets from the replay system into our app, which would not be a generic solution. It's also, currently, a lot of minified code that is hard to understand. |
Ironically, our jQuery mode can access files in a Zimit ZIM from the title index, and can load index.html, as well as extract the PDFs (in the case of the military medicine ZIM used here). Of course this completely bypasses the replay system, and stylesheets, etc., are not loaded because of the custom URL for them. |
@ikreymer It strikes me that one possibility for leveraging our existing Service Worker is to use the I've tried this with I've looked on https://github.com/webrecorder/replayweb.page/tree/main/src, but all the source code is modularized (and we don't use that module system in our code). The good news is that sw.js can be imported successfully, without error, and it does look like a viable approach with code patching. See screenshot. It would be great if we could get support for Zimit ZIMs working in more clients than just Android and Kiwix Serve, but I think we'd need a bit of help from you with this! |
Unrelated to the fix proposed above, I've enabled limited support for Zimit ZIMs (primarily static content) in the experimental branch of Kiwix JS Windows "Attempt WARC loading". See screenshot. It does not even attempt to load the Replay system (in fact it has to block it, to prevent fatal errors), but instead loads as many assets as it can for which it's possible to determine and patch the ZIM URL. It loads CSS and some of each page's JavaScript, and images where the storage location can be worked out. The reason I haven't attempted to apply this to Kiwix JS is that the code does things I know would be unacceptable in this repo: it alters and patches content on the Message Channel, and changes hyperlinks in each article in Service Worker mode (when it detects a Zimit ZIM, which it does from the This is definitely a "poor man's" version that will only be useful for informational ZIMs. The WARC system looks great, and is very clever, a natural fit with Kiwix. But the format of WARC ZIMs is quite funky and breaks the ZIM spec in a few ways (hyperlinks do not contain ZIM URLs or are not relative, JavaScript doesn't use the correct ZIM URL for loading assets, the landing page is not the intended home page, etc.). There is also custom redirect code on each page that seems to patch the Android app specifically, which has to be worked around. A proper solution would involve loading the Replay system and allowing it to live alongside the existing Service Worker as suggested above. For now, this is the best I can do -- but it's definitely monkey wrenching. |
The development PWA on Kiwix JS Windows can now be used for testing Zimit ZIM support, but see instructions here. It works in both jQuery and SW modes (but the best experience is in SW mode). @mossroy I began some work on enabling support for Zimit ZIMs in Kiwix JS, because it seems to me that this would be quite a big feature for us. For example, there is a ZIM of most of the content of Mozilla Developer Network, and it works well, even in jQuery mode. I would propose to do it as a transformation layer: a separate module called Is this something you would be interested in? I think it can be done in a way that isn't "monkey-wrenching", but I don't want to work on it if you'd rather wait for a way of running the Replay system alongside our Service Worker. As I mentioned above, this really needs enabling in the sw.js code that is included in the ZIM. Alternatively, we could provide our own sw.js and patch it. However, I'm not really sure what benefit the Replay system provides, as all assets seem to be included in these ZIMs anyway. As I understand it, the Replay system provides cached requests and responses in json files, but I'm not sure if most of this just duplicates content included in the ZIM, or if it does something special. On the Android app, I can't see any real difference (but it's buggy). |
PS, so far I've only tested:
Note most content isn't available via library.kiwix.org, so use https://download.kiwix.org/zim/zimit/ instead. |
Well, that would need a discussion with @kelson42 (and maybe others). We also need to know whether this use of a ServiceWorker in a ZIM file is "standardized" or not. We might discuss all this at the hackathon, for example (we could schedule a video meeting with you @Jaifroid if you do not attend, and if you can/wish). |
Thank you @mossroy, that makes a lot of sense. We can discuss (probably virtually) with @kelson42. Just to be clear, the translation layer I propose would work equally well for jQuery mode as for Service Worker mode. It wouldn't need special coding for one or the other. This is because it would simply be a way of converting the absolute hyperlinks on a Replay page into ZIM links. It's all that's needed to read these files (albeit with some possible limitations) in their current format (plus logic to detect that we are dealing with a Replay ZIM). A version that relies on loading or incorporating the Replay Service Worker would of course only work in Service Worker mode. There is a proposal to change the format and include the warcz gzip file inside the ZIM, instead of assets stored with xz or zstd compression. See openzim/warc2zim#81 . I think this would be a regression in usability, as the ZIM would rely completely on the Replay software, and there would be no other way to access contents, which seems to be against the spirit and the spec of Open ZIM. However, nothing has happened with this issue in 16 months. @ikreymer worked quite closely with @rgaudin to enable reading Zimit archives in the Android app. The solution adopted there was to register the Replay Service Worker on the domain https://kiwix.app, even though this domain does not serve anything. The Replay system included in the ZIMs contains custom logic to help the Android app. We would need something similar if adopting the Service Worker approach. However, I have not seen a good case as to why this is necessary, when translation of the links into ZIM URLs seems to work just as well in my prototype -- well some small bugs still need ironing out, ironically in Service Worker mode! -- and could be a good approach for Kiwix Desktop and other apps that can't run Service Workers in the Web view. Currently only the Android app and KiwixServe can read these archives, if I've understood correctly. |
PS I should explain why translation of links cannot currently be done fully in our service worker. It's because many links in these ZIMs are absolute, of type https://some.random.site/some-content-link.html. We need to translate these to /fr_zimname_04_22.zim/A/some.random.site/some-content-link.html before our SW will pick them up (the namespace could also be /C/, but we get it from the dirEntry). From what I can tell, one ZIM can contain more than one scraped domain, so we must translate all of them, and handle failure with a dialogue box offering to open a resource online if not in ZIM. Because we don't have a WebView, we can't trap ALL webrequests like the Android app can, but we do have a backend that can see all data coming from the ZIM and transform them. Hence my likening this to a Server Side Transform. One curiosity I'm noticing is that many sites seem to run better if we disable JS, i.e. they run better in jQuery mode. This is probably an artefact of the fact that I'm not yet translating links inside JavaScript (still experimental). |
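A minimal sketch of the link translation described above, assuming the ZIM file name and namespace are already known (for example, the namespace taken from the dirEntry). The function name `toZimUrl` and its parameters are hypothetical and are not taken from the Kiwix JS code base; relative links and non-HTTP schemes are simply returned unchanged.

```javascript
// Hypothetical helper (not the actual Kiwix JS implementation): rewrite an
// absolute http(s) link into a ZIM URL of the form
//   /<zimName>/<namespace>/<host><path>
// so that a Service Worker matching ZIM URLs can pick it up.
function toZimUrl(href, zimName, namespace) {
  let url;
  try {
    // new URL() without a base throws for relative links.
    url = new URL(href);
  } catch (e) {
    return href; // relative or malformed link: leave untouched
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return href; // e.g. mailto: or data: links are not translated
  }
  return "/" + zimName + "/" + namespace + "/" + url.host +
    url.pathname + url.search + url.hash;
}

const translated = toZimUrl(
  "https://some.random.site/some-content-link.html",
  "fr_zimname_04_22.zim",
  "A"
);
console.log(translated);
// → /fr_zimname_04_22.zim/A/some.random.site/some-content-link.html
```

Because the host is kept as part of the path, links from several scraped domains in one ZIM map to distinct ZIM URLs. Handling a link whose target is missing from the ZIM (the dialogue box mentioned above) is left out of this sketch.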
For info, this part of my comment above is no longer true. It was indeed caused by not translating links inside JS files, now fixed for the most part. |
Superseded by #1009. |
We are working on warc2zim, a tool that creates a ZIM file from a WARC file. These ZIM files integrate a JS player that relies on a Service Worker. You can find an example file here:
http://tmp.kiwix.org/warc2zim.zim
I have not managed, at all, to read this ZIM file with: