ZIM backend : Try to cross-compile the libZIM library in webassembly with emscripten #116

Open
mossroy opened this Issue Jun 7, 2015 · 34 comments

@mossroy
Member

mossroy commented Jun 7, 2015

It would avoid re-coding it in JavaScript, and would also ease support for future evolutions of the file format.
But I'm not sure how Emscripten would handle the file I/O.

@mossroy mossroy added the enhancement label Jun 7, 2015

@mossroy mossroy added this to the v2.0 milestone Jun 7, 2015

@mossroy mossroy changed the title from Try to cross-compile the libZIM library in javascript with emscripten to ZIM backend : Try to cross-compile the libZIM library in javascript with emscripten Jun 7, 2015

@peter-x

Collaborator

peter-x commented Jun 7, 2015

Being able to use libZIM as is depends on whether we can use synchronous I/O on the low-level JavaScript side.
This probably has to wait until we know how we can access the filesystem from ServiceWorkers.

@mossroy

Member

mossroy commented Jun 7, 2015

I don't think there is a way to do synchronous I/O on files in javascript.
Regarding ServiceWorkers, I suppose we have access to the same APIs as in any javascript.

@mossroy mossroy modified the milestones: v2.1, v2.0 Aug 29, 2015

@mossroy mossroy assigned dattaz and unassigned peter-x Apr 8, 2017

@mossroy

Member

mossroy commented Apr 8, 2017

Dattaz, I assign you this issue, as you expressed some interest in it.
No hurry and no obligation of course

@mossroy

Member

mossroy commented Apr 8, 2017

@thiolliere if you want to have a look, too

@mossroy

Member

mossroy commented Apr 22, 2017

I met @bnjbvr yesterday (he was giving a talk about WebAssembly at @mixitconf), and talked to him about this idea. He said he might put us in touch with the maintainer of emscripten if necessary.
Obviously, it's too early for now.

I found this article that tackles our issue : https://hacks.mozilla.org/2015/02/synchronous-execution-and-filesystem-access-in-emscripten/
There seems to be a notion of "virtual filesystem" that allows synchronous access to files that are preloaded in memory. It's clearly not possible for us because of the size of our ZIM files.
Refactoring the zimlib code to make asynchronous file I/O is probably complicated. It's a change that would only be interesting for us (not for the other applications using this library) and would probably make the code less readable.
Maybe the Emterpreter might be an option, but they say it would be slower. So maybe we would have to find a way to use the Emterpreter only on the code that reads the ZIM file, and compile everything else in WebAssembly, if it's technically possible.

@dattaz

Member

dattaz commented May 10, 2017

I have compiled libzim to WebAssembly; you can check the demo here : https://dattaz.github.io/libzim_wasm/ (but the ZIM file, meta.esperanto.stackexchange.com_eng_all_2017-05.zim, 1.9M, is embedded). Now we have to deal with the filesystem :)

Source code to build is here : https://github.com/dattaz/libzim_wasm

@mossroy

Member

mossroy commented May 10, 2017

Yeah! @dattaz, you rock!
It works pretty well, that's promising.
As you said, the next challenge is to deal with file I/O. It might be the most difficult part, and might not even be technically possible.

The virtual filesystem you used for the demo cannot work with bigger ZIM files that would not fit into memory.
We're left with the options in previous comment :

  • refactor libzim to use asynchronous file I/O : it's the recommended way, and the best option for performance, but I don't know whether it would be too complicated to do
  • or use Emterpreter to keep synchronous file I/Os : it would be much slower, according to Mozilla

In both cases, we have to make it use a javascript File object that we would pass to the wasm code : I hope it's possible
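
For context on why this is hard: on the main thread, a JavaScript File only exposes asynchronous reads, while libzim expects a read call to return its bytes immediately. A minimal sketch of the asynchronous side, using the standard `Blob.slice` and `Blob.arrayBuffer` methods of current browsers (the helper name `readSlice` is hypothetical, not something from libzim or kiwix-js):

```javascript
// Reads `length` bytes starting at `offset` from a File or Blob.
// The bytes are only available asynchronously, through a Promise,
// which is the mismatch with libzim's synchronous read calls.
async function readSlice(blob, offset, length) {
  const buffer = await blob.slice(offset, offset + length).arrayBuffer();
  return new Uint8Array(buffer);
}
```

A caller has to `await readSlice(file, 0, 80)` (or chain `.then`), which is exactly what synchronous C++ code cannot do without refactoring or the Emterpreter.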

@dattaz

Member

dattaz commented May 26, 2017

Emscripten has a file system called WORKERFS, which lets a File object be loaded as a file in its FS. It is only available in a web worker. According to the doc (https://kripken.github.io/emscripten-site/docs/api_reference/Filesystem-API.html), it seems really close to what we want to do : "This file system provides read-only access to File and Blob objects inside a worker without copying the entire data into memory and can potentially be used for huge files."
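
As a rough sketch of what the quoted documentation describes, mounting a File through WORKERFS could look like the following. `FS.mkdir`, `FS.mount` and the `files` option come from the Emscripten Filesystem API; the function name `mountZimFile` and the choice to pass `FS` and `WORKERFS` in as parameters are assumptions made for illustration only.

```javascript
// Sketch only: mounts a File object read-only through Emscripten's WORKERFS.
// It has to run inside a web worker, where the Emscripten runtime provides
// the FS and WORKERFS objects passed in here.
function mountZimFile(FS, WORKERFS, file, mountPoint) {
  // Create the directory that will serve as the mount point.
  FS.mkdir(mountPoint);
  // WORKERFS reads from the File on demand instead of copying it into memory.
  FS.mount(WORKERFS, { files: [file] }, mountPoint);
  // The file is then visible to the compiled code under its own name.
  return mountPoint + '/' + file.name;
}
```

The returned path is what the compiled code would open, for example `mountZimFile(FS, WORKERFS, file, '/work')`.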

Here is a (little) demo : https://dattaz.github.io/libzim_wasm/file_api/index.html

Note that for the moment there is an issue with files bigger than 2GB : kripken/emscripten#5250

@mossroy

Member

mossroy commented May 27, 2017

I did not know about workerfs : it looks great for our need!
This is very promising : if we manage to expose all the libzim functions through javascript, it might replace the low-level zim javascript code. It would be a huge improvement for kiwix-html5

This was referenced May 27, 2017

@mossroy mossroy modified the milestones: v2.2, v2.3 Jan 4, 2018

@kelson42 kelson42 changed the title from ZIM backend : Try to cross-compile the libZIM library in javascript with emscripten to ZIM backend : Try to cross-compile the libZIM library in webassembly with emscripten Jan 10, 2018

@mossroy mossroy added the help wanted label Jan 10, 2018

@mossroy

Member

mossroy commented Sep 19, 2018

The libzim has evolved a lot since last year. It now relies on ninja + meson to build.
It's pretty easy to build it with the instructions from https://github.com/openzim/libzim.
But now we need to adapt it to compile with emscripten.

I managed to make it spit a build/src/libzim.so.4.0.4 that file recognizes as "LLVM IR bitcode".
That looks like a good first step to me.
I'm not very proud of how I did it, it's probably not the best way :
I modified the generated build/build.ninja to replace cc by emcc and c++ by em++, and also had to modify src/debug.h to remove the dependency on execinfo.h and backtrace. I also manually copied the dependencies (lzma, xapian, zlib, zconf, and unicode) in build/include from /usr/include (it's probably not the right way to do it, but it seemed to work). Finally, I had to remove the warning about ABI version in xapian/version.h.
It does not manage to generate libzim.so.4.0.4.symbols, I hope it's not needed in our case?

Next step is to try to build the source code of the prototype https://github.com/dattaz/libzim_wasm with this new version of libzim, and check that it works the same. Before trying to plug it with kiwix-js

@mossroy

Member

mossroy commented Sep 20, 2018

Things are going forward, but it's still a long journey. I now need to compile the dependencies.
For the record :
The ninja build gives warnings about these dependencies if I don't compile them :

WARNING:root:emcc: cannot find library "lzma"
WARNING:root:emcc: cannot find library "z"
WARNING:root:emcc: cannot find library "xapian"
WARNING:root:emcc: cannot find library "icui18n"
WARNING:root:emcc: cannot find library "icuuc"
WARNING:root:emcc: cannot find library "icudata"

I also managed to compile demo_file_api.cpp to wasm with :
em++ ../../libzim_wasm/demo_file_api.cpp -I../../libzim_wasm -Iinclude -I../include -fdiagnostics-color=always -pipe -Wall -Winvalid-pch -Wnon-virtual-dtor -Werror -std=c++11 -O0 -g -D_LARGEFILE64_SOURCE=1 -D_FILE_OFFSET_BITS=64 -pthread --pre-js ../../libzim_wasm/prejs_file_api.js --post-js ../../libzim_wasm/postjs_file_api.js
But with these warnings :

warning: unresolved symbol: _ZN3zim4FileC1ERKNSt3__212basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE
warning: unresolved symbol: _ZNK3zim4File10getArticleEj
warning: unresolved symbol: _ZNK3zim4File11getFilesizeEv
warning: unresolved symbol: _ZNK3zim4File16getCountArticlesEv
warning: unresolved symbol: _ZNK3zim4File17getArticleByTitleEj
warning: unresolved symbol: _ZNK3zim4File3endEv
warning: unresolved symbol: _ZNK3zim4File5beginEv
warning: unresolved symbol: _ZNK3zim7Article6getUrlEv
warning: unresolved symbol: _ZNK3zim7Article8getTitleEv

Then the prototype manages to start, runs some wasm code, but then fails with this error message :

missing function: _ZN3zim4FileC1ERKNSt3__212basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE

I'll try to compile the dependencies with emscripten one by one, to hopefully fix these problems

@mossroy

Member

mossroy commented Sep 20, 2018

On ICU, I had the following error :

Target architecture was not detected as supported by Double-Conversion.

Quick workaround : change line 101 of i18n/double-conversion-utils.h to replace the error by :
#define DOUBLE_CONVERSION_CORRECT_DOUBLE_OPERATIONS 1
(hoping the emscripten architecture does not have the rounding issue on double that is mentioned in the comments of this source file)

At the end, it fails on running bin/icupkg, but that might not be necessary in our context.

@Jaifroid

Collaborator

Jaifroid commented Sep 20, 2018

Bon courage! It'll be a game changer if you can get this working.

@mossroy

Member

mossroy commented Sep 20, 2018

To compile xapian :
emconfigure ./configure --prefix=`pwd`/../xapian "CFLAGS=-I`pwd`/../z/include -L`pwd`/../z/lib" "CXXFLAGS=-I`pwd`/../z/include -L`pwd`/../z/lib" --disable-backend-remote
emmake make "CFLAGS=-I`pwd`/../z/include -L`pwd`/../z/lib -std=c++11" "CXXFLAGS=-I`pwd`/../z/include -L`pwd`/../z/lib -std=c++11"

I managed to compile the prototype without warnings with :
em++ ../../libzim_wasm/demo_file_api.cpp src/libzim.so -I../../libzim_wasm -Iinclude -I../include -fdiagnostics-color=always -pipe -Wall -Winvalid-pch -Wnon-virtual-dtor -Werror -std=c++11 -O0 -g -D_LARGEFILE64_SOURCE=1 -D_FILE_OFFSET_BITS=64 -pthread --pre-js ../../libzim_wasm/prejs_file_api.js --post-js ../../libzim_wasm/postjs_file_api.js -s DISABLE_EXCEPTION_CATCHING=0 -s "EXTRA_EXPORTED_RUNTIME_METHODS=['ALLOC_NORMAL','printErr','ALLOC_STACK','ALLOC_STATIC','ALLOC_DYNAMIC','ALLOC_NONE']" -s DEMANGLE_SUPPORT=1

It starts, but then crashes when it starts reading the ZIM file :

will print first 100 url/title of /work/wikipedia_en_ray_charles_2015-06.zim
error reading zim-file header.

Adding some debug info inside libzim gives us the underlying exception :

Cannot mmap size 80 at off 0 : No such device

@mossroy

Member

mossroy commented Sep 20, 2018

I disabled MMAP in a very dirty way, by removing what's inside the ifdef of src/file_reader.cpp (it's certainly possible to do that in a cleaner way with the ENABLE_USE_MMAP option).

... and the prototype works!

@mossroy

Member

mossroy commented Sep 20, 2018

I needed to add the same -s TOTAL_MEMORY=83886080 compilation option for the last compilation step (with em++), as we did for the xzdec compilation.

Good news : if I also add -s WASM=0 compilation option, it is generated in asm.js instead of wasm, and also works. So it might be compatible with older browsers that do not support WASM

Bad news : big ZIM files still fail (apparently >2GB), with various error messages :

last cluster offset larger than file size; file corrupt

or

Checksum position is not valid

Same behavior on Firefox and Chromium. So I suppose kripken/emscripten#5250 has not been fixed since last year
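
The threshold is suspicious because 2GB is where a byte offset stops fitting in a signed 32-bit integer. Whether that is really what happens inside WORKERFS is an assumption here, not something confirmed in this thread, but the effect itself is easy to reproduce in plain JavaScript, where `| 0` coerces a number to a signed 32-bit integer:

```javascript
// Largest value a signed 32-bit integer can hold: 2 GiB minus one byte.
const INT32_MAX = 2 ** 31 - 1;

// Emulates storing a file offset or size in a signed 32-bit integer.
function toInt32(offset) {
  return offset | 0;
}

console.log(toInt32(INT32_MAX));     // 2147483647, still correct
console.log(toInt32(INT32_MAX + 1)); // -2147483648, wrapped to a negative value
```

A file size or offset that wraps like this no longer compares correctly against the other values in the ZIM header, which would be consistent with the error messages above.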

@Jaifroid

Collaborator

Jaifroid commented Sep 20, 2018

Still very encouraging results!

Do you know what File System emulation it's using? The problem @dattaz reported was with WORKERFS, but our current xzdec can be explicitly compiled with no filesystem: NO_FILESYSTEM=1.

I realize libzim is very different, but might it be made to accept a FileReader array buffer? Or must it work with an emulated file system?

@mossroy

Member

mossroy commented Sep 20, 2018

It is still WORKERFS, with good reasons : while xzdec takes a byte array as its input (because it only handles decompression), libzim actually reads the files (because it handles much more things).
Among the "filesystems" supported by emscripten (https://kripken.github.io/emscripten-site/docs/api_reference/Filesystem-API.html), I don't see another one that could fit our needs (using a javascript File object)

@Jaifroid

Collaborator

Jaifroid commented Sep 20, 2018

I'm clutching at small straws here, but what about:
Synchronous Virtual XHR Backed File System ?

In particular this:

The backend can improve start up time as the whole file system does not need to be preloaded before compiled code is run. It can also be very efficient if the web server supports byte serving — in this case Emscripten can just read the parts of files that are actually needed.

Perhaps this is the technology behind WORKERFS.

I guess the 2GB limitation is probably a version of https://en.wikipedia.org/wiki/2_GB_limit ? We might be back to using split ZIMs...

@mossroy

Member

mossroy commented Sep 20, 2018

Next step is probably to check we can call the libzim APIs from our javascript code. See http://kripken.github.io/emscripten-site/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html#interacting-with-code-binding-cpp
Maybe it will involve adding C++ code that wraps what we need into simple APIs that could be more easily called by javascript.

After discussing with other people at the hackathon, I understand that libzim is the low-level reference implementation to read ZIM files. It might be enough to replace most of our current javascript backend pieces, but https://github.com/kiwix/kiwix-lib provides more high-level APIs that should be more efficient for us. But we would need to compile it too, with its own dependencies. Hopefully we can do that later.

Our need would be to replace the calls to functions of zimArchive.js (the ones called by app.js) by some equivalent APIs running in wasm/asm.js
It might involve refactoring our javascript code in app.js.

One idea to test that might be to replace the calls made by handleMessageChannelMessage in app.js (getDirEntryByTitle, resolveRedirect and readBinaryFile) by a single call to getArticleByUrl (from libzim). It would not handle redirects properly (especially in stackoverflow ZIM files), but it would be an interesting step to see if it can work.

@mossroy

Member

mossroy commented Sep 20, 2018

The recompiled prototype (in its current form) can be tested on https://mossroy.github.io/libzim_wasm/ (it is the wasm version).
The source code is not up-to-date : I'll update it.
Currently, the last step of this prototype tries to read an article of the Ray Charles archive. So it's normal that it fails on other ZIM files with "article index out of range"

@Jaifroid

Collaborator

Jaifroid commented Sep 21, 2018

> One idea to test that might be to replace the calls made by handleMessageChannelMessage in app.js (getDirEntryByTitle, resolveRedirect and readBinaryFile) by a single call to getArticleByUrl (from libzim). It would not handle redirects properly (especially in stackoverflow ZIM files), but it would be an interesting step to see if it can work.

This sounds do-able, as an experiment. The important thing is for us to build a clearly understandable pipeline / process for requesting articles. Anything that simplifies the micro-management of checking dirEntries and decompressing would be great (so long as it is performant).

> The recompiled prototype (in its current form) can be tested on https://mossroy.github.io/libzim_wasm/ (it is the wasm version).

I've tested on Edge with the same results that you report for the other two browsers. It does not, of course, work in IE, or in Firefox ESR 52.5.2 (the only version I could get that still supports the FFOS simulator). The latter spits out "No WebAssembly support found". But building an asm version would solve that (and as we have seen, may be just as performant).

The 2GB limit on files seen by the virtual file system (if that's what it is) might be a blocker unless it allows for split ZIMs. On the other hand, Kiwix seems to be phasing out pre-built split ZIMs. I guess the code would still be useful for packaged ZIMs < 2GB.

@mossroy

Member

mossroy commented Sep 21, 2018

I finally managed to push to the https://github.com/mossroy/libzim_wasm repo everything that is necessary to compile libzim and the prototype, including the modifications I had to make to some of the source code.
It took me quite some time, but I did not want to risk forgetting a step or losing a file. Now everything is on git.

@mossroy

Member

mossroy commented Sep 21, 2018

I just pushed a new version that measures the execution time of reading an article from its URL.
It seems way faster than our current implementation : at least two to three times faster, sometimes much more (most articles are read from the Ray Charles archive in 1 to 3 milliseconds, and from a 1.2GB ZIM file in around 70 milliseconds). It might come from the fact that libzim has its own built-in cache.
It has to be verified in real conditions (because the content still needs to be transferred in javascript), but that might allow us to avoid adding other kind of caches like #411, #414 or #415. My personal opinion would be to put on hold the work on these custom cache implementations for now, until we find out if they will still be necessary.
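
For anyone who wants to reproduce this kind of measurement, a minimal sketch of timing a single call with the standard `performance.now()` is shown below. The helper name `timeCall` is hypothetical and this is not necessarily how the prototype measures it.

```javascript
// Times a single synchronous call and returns both its result and the
// elapsed time in milliseconds.
function timeCall(fn, ...args) {
  const start = performance.now();
  const result = fn(...args);
  const elapsedMs = performance.now() - start;
  return { result, elapsedMs };
}
```

Timing on the JavaScript side like this includes the cost of transferring the content out of wasm, which is the part that still has to be verified in real conditions.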

@Jaifroid

Collaborator

Jaifroid commented Sep 21, 2018

Excellent! I'm not working more on my cache code -- I did that in May, and only updated it to latest master since it was an outstanding issue.

I guess we still need to investigate the 2GB file limit issue, as this was a blocker last year.

@mossroy

Member

mossroy commented Sep 21, 2018

Another piece of good news is that I get similar performance with asm.js (see branch https://github.com/mossroy/libzim_wasm/tree/compiled-in-asm.js) IF I add the -O3 parameter at compilation time (otherwise it is very slow because it is "almost asm" instead of "use asm").
I tried to put this same -O3 parameter in wasm version : it does not seem to have a significant impact on performance, but it makes generated files around one third smaller. I will keep -O0 while we're still testing.

@mossroy mossroy modified the milestones: v2.4, v2.5 Sep 22, 2018

@mossroy

Member

mossroy commented Sep 22, 2018

I'm trying to call the C code from javascript, but it's not easy.
I managed (easily) to create some bindings using http://kripken.github.io/emscripten-site/docs/porting/connecting_cpp_and_javascript/WebIDL-Binder.html#webidl-binder for File.getArticleByUrl that look good, but I can't manage to call them.
The generated a.out.js fails on load with :

ReferenceError: addOnPreMain is not defined

If I remove this call to addOnPreMain, it loads but fails when I call the constructor of File object with :

ReferenceError: assert is not defined

or

ReferenceError: intArrayFromString is not defined

@mossroy

Member

mossroy commented Sep 22, 2018

It seems to work a bit better with embind, and is not more complicated to setup if we wrap everything into high-level functions that don't use classes. I created a new branch for it.
But there might be issues with size of returned content, that we will have to work on, as it says :

Assertion failed: provided buffer should be 16777216 bytes, but it is 134217728

@mossroy

Member

mossroy commented Sep 22, 2018

After giving a -s TOTAL_MEMORY=83886080 to the compiler (instead of letting the memory grow), this error disappears.
I've managed, in branch https://github.com/mossroy/libzim_wasm/tree/embind-experiments, to read an HTML content with libzim, and get it in javascript. That's cool!
I just pushed it to gh-pages so that it can be tested on https://mossroy.github.io/libzim_wasm/ (only on Ray Charles archive, and you have to wait for the "libzim initialized" message in the browser console before clicking on the second button)
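
The byte counts in the assertion message and in the TOTAL_MEMORY option are easier to compare once converted to MiB. A small sketch of the arithmetic (the helper name `toMiB` is only for illustration):

```javascript
const MIB = 1024 * 1024;

// Converts a byte count to mebibytes.
function toMiB(bytes) {
  return bytes / MIB;
}

console.log(toMiB(16777216));  // 16  (size the assertion expected)
console.log(toMiB(134217728)); // 128 (size of the buffer actually provided)
console.log(toMiB(83886080));  // 80  (value passed to -s TOTAL_MEMORY)
```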

@mossroy

Member

mossroy commented Sep 23, 2018

I've created a branch on kiwix-js to experiment with using libzim : https://github.com/kiwix/kiwix-js/tree/libzim-experiments
It's only a rough prototype to see how far we can get.
I only plugged it for the first HTTP request (because it only works once, for now), and only in SW mode. There is an unhandled race condition : the libzim needs to be initialized before it can be used : that's why I removed the automatic opening of the main article (you have to wait a bit before opening it).
As with the libzim_wasm prototype, it only works on ZIM files of size <2GB.
And there is another issue : on some ZIM files, the HTML string returned by libzim seems to be a concatenation of many articles instead of only the expected article (making the browser struggle to handle everything), as if it was continuing to read the ZIM file content after the article end.
I also only implemented a simple C call (getArticleByUrl) that only handles strings : we'll probably have to handle byte arrays and classes, which should be more difficult.

So there are still a lot of issues to handle, but it gives hope that there might be a way to make this work in the future.
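
One possible way to handle the initialization race, sketched under the assumption that waiting for the Emscripten runtime is enough: wrap the `Module.onRuntimeInitialized` callback (a real Emscripten hook) in a Promise. The helper name `whenRuntimeReady` is hypothetical, and the prototype may need to wait for more than this (for example, for the file system to be mounted).

```javascript
// Wraps Emscripten's onRuntimeInitialized callback in a Promise so callers
// can wait for the runtime instead of guessing when it is ready.
// It must be called before the runtime finishes initializing; if the
// runtime is already up, the callback never fires and the Promise stays pending.
function whenRuntimeReady(Module) {
  return new Promise((resolve) => {
    const previous = Module.onRuntimeInitialized;
    Module.onRuntimeInitialized = () => {
      // Preserve any callback that was installed earlier.
      if (typeof previous === 'function') previous();
      resolve(Module);
    };
  });
}
```

The first article request could then be chained on `whenRuntimeReady(Module).then(...)` rather than relying on the user waiting a bit.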

@Jaifroid

Collaborator

Jaifroid commented Sep 23, 2018

Excellent work! Do you think the 2GB limit might be overcome by using split ZIMs at 2,048MiB?

@mossroy

Member

mossroy commented Sep 26, 2018

Using split ZIM would probably be a workaround, but I could not check for now.
Regarding this 2GB limit, we got an answer on kripken/emscripten#5250 , I'll try to ask for more details
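
To make the split-ZIM idea concrete: if the archive is cut into parts of equal size (the last one possibly shorter), any offset in the whole archive maps to a part and an offset inside that part. The sketch below only shows that arithmetic; the function name is hypothetical, and whether libzim running on WORKERFS would accept parts exposed this way has not been checked here.

```javascript
// Maps a byte offset in the whole archive to the part that contains it,
// assuming every part except possibly the last has exactly `partSize` bytes.
function locateInSplitArchive(globalOffset, partSize) {
  return {
    partIndex: Math.floor(globalOffset / partSize),
    localOffset: globalOffset % partSize,
  };
}

// With parts of 2,048MiB, every local offset stays below 2^31.
console.log(locateInSplitArchive(5000000000, 2048 * 1024 * 1024));
// { partIndex: 2, localOffset: 705032704 }
```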
