Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a queue for extraction of assets #412

Closed
Jaifroid opened this issue Sep 17, 2018 · 8 comments
Closed

Add a queue for extraction of assets #412

Jaifroid opened this issue Sep 17, 2018 · 8 comments
Assignees
Milestone

Comments

@Jaifroid
Copy link
Member

Jaifroid commented Sep 17, 2018

As explained in #219 (here and here), current master delays rendering the page until all the CSS is extracted from the ZIM, but CSS is heavily compressed and is often therefore extracted after all other assets on the page have been extracted. This can lead to long waits for the page to render if a page has many images on it. This is especially the case when there are lots of SVGs (almost any page on a mathematics ZIM). This problem affects jQuery and Service Worker modes. It is ameliorated by caching, but not eliminated. We therefore need a queueing mechanism to control the order of extraction of various assets.

Another argument in favour of queuing would be to allow us to prioritize subcategories of assets, for a better user experience. For example, we could ensure the following extraction order simply by queuing functions to do the following things:

  1. Extract html of page;
  2. Extract all CSS required by page;
  3. Render page;
  4. Extract all JPEG or PNG images above the fold;
  5. Extract any SVGs above the fold;
  6. Extract JavaScript.;
  7. Extract remaining JPEGs etc;
  8. Extract remaining SVGs.

I propose one possibility for an extensible queuing framework in https://github.com/kiwix/kiwix-js/tree/Add-simple-jQuery-queue . Currently it only applies to jQuery mode. We could in principle hook into Service Worker requests as they come into app.js and queue responses to create a similar effect in Service Worker mode. It may well be possible to implement a "universal" queuing mechanism in a lower-level process such as zimDirEntry.js.

Before finalizing any framework, and before creating a PR, we need to decide whether the framework should include ServiceWorker mode or only jQuery mode. Your thoughts, @mossroy ?

@Jaifroid Jaifroid added this to the v2.4 milestone Sep 17, 2018
@Jaifroid Jaifroid self-assigned this Sep 17, 2018
@mossroy
Copy link
Contributor

mossroy commented Sep 18, 2018

In ServiceWorker mode, the browser seems to prioritize the requests as we would expect : first the HTML, then all the CSS, then the images that are immediately visible, then the other images. I tested on both Firefox and Chromium.
It makes sense because it's what browsers do when you open an online page. In ServiceWorker mode, we benefit from this behavior for free.
So I don't think we should try to implement this queuing for ServiceWorker mode. And we would have the issue of not knowing if all the requests have been sent to the ServiceWorker (where we can count them in jQuery mode).

Anyway, for some reason, in ServiceWorker mode, the page does not render before all the resources are fetched. I was hoping it would render earlier, as it does when browsing online. But that's a separate issue I will create, that will certainly be handled differently in ServiceWorker mode.

To sum up, I think you should focus on jQuery mode in this issue, because I don't think it's possible to handle both modes with the same code.

@Jaifroid
Copy link
Member Author

Jaifroid commented Sep 18, 2018

In ServiceWorker mode, the browser seems to prioritize the requests as we would expect : first the HTML, then all the CSS, then the images that are immediately visible, then the other images. I tested on both Firefox and Chromium.

I have been testing the extraction order in Service Worker (SW), using Firefox Quantum (as part of my attempts to solve the issues experienced with #417 ). I have uploaded a branch that counts the CSS in SW mode (and the console logging currently going on in SW). The branch is not changing the order in which assets are sent to SW. Below is the result of loading an average article ("Helston" in English Wikipedia) -- screenshot of the FF console. As you can see, the last CSS is the penultimate thing to be returned, after nearly all the images.

As you rightly say above, Firefox and Chrome will not render the page until this last CSS is received, but that is too late for us. On a small page like this, it doesn't matter, but we could greatly improve page display speed on files with hundreds of images (and especially SVGs) if we prioritize extraction of CSS.

NB The reason CSS is left to last is that extraction of images does not engage xzdec. They are stored as-is, with no further compression (as they are already compressed). CSS requires costly CPU cycles to decompress, so it gets put on the back burner by the async engine. This is the opposite of what we want!

EDIT: You can observe quite a dramatic version of the problem if you run my branch, with console open, on "Trigonometric Functions" from the latest Wikipedia Mathematics ZIM. Most of the CSS is extracted after most of the SVGs.

image

@Jaifroid
Copy link
Member Author

Based on the above experiments, I've now had a go at producing a single, universal queue for jQuery and Service Worker modes:

https://github.com/kiwix/kiwix-js/tree/Universal-queue

Currently the queue only prioritizes extraction of CSS, and then it lets everything else get extracted in whatever order it returns from the decompressor. But it is easily extensible to extract, say, binary images before JavaScript, and JavaScript before SVGs. This can be a TODO -- I wanted a proof of principle.

I'm still doing a bit of tweaking and performance testing.

@mossroy
Copy link
Contributor

mossroy commented Sep 19, 2018

Interesting.
We indeed have a bit of control in the backend.
But I'm still not comfortable with the use of queuing in SW mode.
IMHO, there are several issues :

  • the counting done in the backend only works if the browser has the time to send all the CSS requests before they are all read. But this assumption might become wrong. For example, some browsers (or browser extensions) can pre-fetch other web pages (see https://en.wikipedia.org/wiki/Link_prefetching). In this case, the queue would count all the CSS requests sent by the browser for all the pages (including the prefetched ones), and might display the page only when all the CSS of the prefetched pages would be received. It's probably an edge case, but it illustrates the fact that this counting is not reliable
  • in terms of architecture, I think we should not modify the DOM in the handleMessageChannelMessage function of app.js. Ideally, the SW would be able to read the ZIM file by itself (without going through app.js), and would not have access to the DOM. Unfortunately, we had to use app.js because we can not access the ZIM file from the SW. But I don't think it's a good idea that an HTTP request sent by the browser (and captured by the SW) can lead to a modification of the DOM.

@Jaifroid
Copy link
Member Author

the counting done in the backend only works if the browser has the time to send all the CSS requests before they are all read. But this assumption might become wrong. For example, some browsers (or browser extensions) can pre-fetch other web pages

I'm not sure I follow. Surely a new html request would have to be on a separate browser thread. Otherwise there would be terrible confusion all round if we have, e.g. CSS files with the same name being requested and dumped into the wrong DOM. I don't know how prefetching works low-down, but it would most likely be at the level of populating the browser cache.

I think we should not modify the DOM in the handleMessageChannelMessage function of app.js

Just to be clear, there is no modification of the DOM whatsoever. All I am doing here is delaying the reading of non-CSS or HTML files. We have to extract the files from the ZIM, so intervening in the extraction process is a viable "universal" solution that does not require DOM manipulation.

I understand the objection to adding code in the message channel. Would it be better to do the queuing at a lower level, e.g. at readBinaryFile and readUTFFile level?

@mossroy
Copy link
Contributor

mossroy commented Sep 19, 2018

I'm not sure I follow. Surely a new html request would have to be on a separate browser thread. Otherwise there would be terrible confusion all round if we have, e.g. CSS files with the same name being requested and dumped into the wrong DOM. I don't know how prefetching works low-down, but it would most likely be at the level of populating the browser cache.

Maybe it would be more clear if you think about the ServiceWorker as if it was an HTTP proxy server (except it's dedicated to one window/tab of the browser). It's a usual comparison.
The SW receives HTTP requests, handles them, but has not more information.
If the browser does some prefetching, it will send HTTP requests for the displayed content, but will also send HTTP requests for the content behind some hyperlinks. All of these HTTP requests would go through the same SW. The SW will receive these HTTP requests in an arbitrary order, so there will be race conditions in the counting.

I think we should not modify the DOM in the handleMessageChannelMessage function of app.js

Just to be clear, there is no modification of the DOM whatsoever. All I am doing here is delaying the reading of non-CSS or HTML files. We have to extract the files from the ZIM, so intervening in the extraction process is a viable "universal" solution that does not require DOM manipulation.

I was referring to the following lines of code inside handleMessageChannelMessage function :

                    // Prevent render until CSS fulfilled
                    $('#articleContent').hide();
                    $("#articleName").html(title);
                    $('#readingArticle').show();

I understand the purpose, but we should not do that here. We should keep an architecture like :
web page <--HTTP request---> ServiceWorker <--content--> ZIM Backend
where only the web page can modify the DOM. That's why the ServiceWorkers are designed to not have access the DOM.
In our case however, the backend needs to access the ZIM file. If ZIM files were smaller, we would read and transfer them to the ServiceWorker and backend, so that they can do their job independently from the DOM, and in a multi-threaded way.
Unfortunately, we had to fallback to ask app.js to give us the content, through a MessagePort (that makes the decompression single-threaded, by the way). It's not ideal. Maybe it will be possible in the future to transfer the File object and cleanly do the decompression in an isolated environment. Similar things seem to be done in https://medium.com/@kennethrohde/on-the-fly-webp-decoding-using-wasm-and-a-service-worker-33e519d8c21e

I understand the objection to adding code in the message channel. Would it be better to do the queuing at a lower level, e.g. at readBinaryFile and readUTFFile level?

No, it's the same issue : the ZIM backend should be completely independent from the browser, at least because we'll try to replace it by a wasm version of libzim.

@Jaifroid
Copy link
Member Author

Thank you for the detailed explanation! Very helpful. I was thinking of my code as something along the lines of server load balancing or QoS, freeing up CPU time to deal with the time-sensitive requests. But if it's more like a proxy, then I see the potential for race conditions.

@mossroy mossroy modified the milestones: v2.4, v2.5 Sep 22, 2018
@mossroy mossroy modified the milestones: v2.5, Backlog Jan 9, 2019
@Jaifroid
Copy link
Member Author

I think this is no longer relevant.

@Jaifroid Jaifroid modified the milestones: Backlog, v3.8 Apr 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants