Add support the Zstandard (aka zstd) library for compression #84

Closed
data-man opened this issue Nov 5, 2017 · 32 comments · Fixed by #308

Comments

@data-man
Copy link
Contributor

data-man commented Nov 5, 2017

See

Advantages:

  • Compression ratio: similar
  • Compression speed: similar
  • Decompression speed: roughly 8x faster
@kelson42
Copy link
Contributor

kelson42 commented Nov 5, 2017

@data-man To move ahead on this, the best argument would be to build a libzim/zimwriterfs/ZIM file with that algorithm and clearly demonstrate what kind of improvement we are talking about.

@data-man
Copy link
Contributor Author

data-man commented Nov 5, 2017

@kelson42
My suggestion is based on lzbench benchmarks.
It would be great if the ZIM format supported zstd.

@data-man
Copy link
Contributor Author

data-man commented Nov 5, 2017

zstdmt also deserves attention.
It supports multi-threaded compression/decompression for several algorithms through a single API.
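For illustration, libzstd itself also exposes multi-threaded compression through its advanced API. A minimal sketch in C++ (illustrative only; this is neither the zstdmt API nor libzim code):

```cpp
// Minimal sketch: multi-threaded zstd compression via libzstd's advanced API.
// Requires a libzstd built with multi-threading support (the default in release builds).
#include <zstd.h>
#include <stdexcept>
#include <string>

std::string compressMT(const std::string& input, int level = 19, int workers = 4) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, workers);  // 0 = single-threaded

    std::string out(ZSTD_compressBound(input.size()), '\0');
    size_t n = ZSTD_compress2(cctx, &out[0], out.size(), input.data(), input.size());
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(n))
        throw std::runtime_error(ZSTD_getErrorName(n));
    out.resize(n);
    return out;
}
```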

@kelson42
Copy link
Contributor

@data-man This feature request has been open for two years now, and it is more attractive to me than it was. Would you volunteer to make a POC integration of zstd in libzim?

@kelson42
Copy link
Contributor

Implementing this would help to avoid problems like kiwix/kiwix-tools#345.

@data-man
Copy link
Contributor Author

@kelson42
I'll try but I can't promise a quick result.
But definitely faster than two years. :)

Some thoughts:

  • add an additional parameter to zimrecreate: the compression method
  • maybe also add a language parameter

// [TODO] Use the correct language

@kelson42
Copy link
Contributor

> @kelson42
> I'll try but I can't promise a quick result.
> But definitely faster than two years. :)

That would be really awesome!

> Some thoughts:
> • add an additional parameter to zimrecreate: the compression method

Yes, but if zstd is better, then in the mid-term we would probably just switch to it by default.

> • maybe also add a language parameter

This is an idea; please open a dedicated ticket in zim-tools.

> // [TODO] Use the correct language

@mgautierfr
Copy link
Collaborator

@data-man please check with me before trying to implement this. (I'm starmad on IRC, mgautier on our Slack, or simply here.)

I've investigated zstandard a bit and it may provide a feature we have wanted for a long time: random-access decompression.
Until now, we have grouped the content into clusters, so we have to decompress a whole cluster to access the content of an article. We would like to be able to decompress only the article we want to read. It may be possible to do this with zstd by using separate frames sharing the same dictionary.

But doing this would require a lot of work in libzim: changing the cluster format, using an article cache size (instead of a cluster cache size), maybe changing the way we group articles into clusters, ...

It is also (of course) possible to simply use zstd to compress the cluster as we already do now. But at the very least we should prepare for the possibility of changing the way we store articles in clusters.
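As a rough sketch of the per-article-frame idea described above, using libzstd's dictionary API (the function name and the dictionary here are hypothetical; libzim does not work this way today):

```cpp
// Sketch: compress each article as its own zstd frame, all sharing one
// dictionary (e.g. trained with ZDICT_trainFromBuffer), so a single article
// can later be decompressed without touching the rest of the cluster.
#include <zstd.h>
#include <string>
#include <vector>

std::vector<std::string> compressArticles(const std::vector<std::string>& articles,
                                          const std::string& dict, int level = 19) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    std::vector<std::string> frames;
    for (const auto& article : articles) {
        std::string out(ZSTD_compressBound(article.size()), '\0');
        size_t n = ZSTD_compress_usingDict(cctx, &out[0], out.size(),
                                           article.data(), article.size(),
                                           dict.data(), dict.size(), level);
        if (!ZSTD_isError(n)) {
            out.resize(n);
            frames.push_back(std::move(out));  // one independent frame per article
        }
    }
    ZSTD_freeCCtx(cctx);
    return frames;
}
```

Reading back a single article would then be one ZSTD_decompress_usingDict() call on that article's frame with the same dictionary.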

@kelson42
Copy link
Contributor

@mgautierfr What you are talking about corresponds to #76 and #78. If we could tackle these problems as well, this would be awesome^2!

@kelson42
Copy link
Contributor

@mgautierfr @veloman-yunkan is going to work on this.

@data-man
Copy link
Contributor Author

> But doing this would require a lot of work in libzim: changing the cluster format, using an article cache size (instead of a cluster cache size), maybe changing the way we group articles into clusters, ...

💯

BTW, I changed my mind. :)
kanzi-cpp (wiki) is my favorite now:

  • C++
  • very fast
  • multithreading
  • includes many compression codecs

@veloman-yunkan
Copy link
Collaborator

I have started working on this in the zstd branch. I am not going to address #76 and/or #78 while working on this issue. Those enhancements would have a significant impact on libzim and must be done separately.

@Jaifroid
Copy link

Jaifroid commented Apr 4, 2020

@kelson42 Is there a test ZIM file that uses zstd compression, to work with on kiwix/kiwix-js#611?

@kelson42
Copy link
Contributor

kelson42 commented Apr 4, 2020

@data-man Would you please upload one (a ZIM file with zstd compression) somewhere?

I have created a ticket so that in the future one can easily be created with zimrecreate: kiwix/kiwix-tools#372

@data-man
Copy link
Contributor Author

data-man commented Apr 4, 2020

foo-zstd.zip

zimdump -l foo-zstd.zim
1
10
11
12
13
14
15
16
2
3
4
5
6
7
8
9
fulltext/xapian
title/xapian

@kelson42
Copy link
Contributor

kelson42 commented Apr 4, 2020

@data-man Perfect. Thx. I have uploaded it to http://tmp.kiwix.org/foo-zstd.zim as well.

@Jaifroid
Copy link

@data-man I'm trying to implement zstd decompression in Kiwix JS, our JavaScript reader (using a Node version of the zstandard library); see kiwix/kiwix-js#611 (comment). Regarding the sample ZIM you kindly supplied, I have some questions:

  1. What kind of data is in each numbered article? Is each one an HTML file?
  2. Which zstandard API did you use to compress the data? The Simple API or the Streaming API? Is there anything further I need to know to be able to decompress a cluster?
  3. Is each cluster compressed as one unit apart from the first (informational) byte of the cluster, as per the openZIM spec, or do I need to target a specific part of the cluster?
  4. This ZIM file only appears to have two clusters, the first one containing only 121 bytes according to the Cluster Pointer List, which appears too small. Have I made an error?

You will see from the comment referenced above that I can present what I believe is the compressed cluster's data to the decompressor using the Simple API, but it returns null; if I use the Streaming API I get an out-of-memory error, which may just be because the input is not in the expected format.

@data-man
Copy link
Contributor Author

data-man commented Apr 11, 2020

@Jaifroid
I'm using createZimExample and zimrecreate with a patched libzim:

writer/creatordata.h
103: CompressionType compression = zimcompZstd;

> Is each one an HTML file?

No. createZimExample uses text/plain.

@veloman-yunkan
Copy link
Collaborator

@Jaifroid

> 2. Which zstandard API did you use to compress the data? The Simple API or the Streaming API? Is there anything further I need to know to be able to decompress a cluster?

The API used to compress the data shouldn't matter. Your question is akin to asking about the application used to create a PNG file, in order to decide which viewer to use for opening it. Data compressed with any zstd API complies with the zstd codec specification, and should be correctly decoded with any of its decompression APIs.

> 3. Is each cluster compressed as one unit apart from the first (informational) byte of the cluster, as per the openZIM spec, or do I need to target a specific part of the cluster?

Introducing zstd support at this stage doesn't bring new approaches to compression in the ZIM file format; it's just a new method of compressing the entire cluster as a whole.
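In other words, a reader-side sketch under the current format (names and error handling are illustrative, not libzim code):

```cpp
// Sketch: a compressed cluster is one info byte followed by a single zstd frame
// that holds the blob offset list plus the blob data.
#include <cstdint>
#include <zstd.h>

// clusterData points at the cluster's first byte; availableSize is any upper
// bound on the bytes available after that offset (e.g. up to end of file).
bool inspectCluster(const uint8_t* clusterData, size_t availableSize) {
    uint8_t info = clusterData[0];
    uint8_t compression = info & 0x0F;        // lower bits: compression type
    if (compression != 5)                     // 5 = zstd in libzim (zimcompZstd)
        return false;

    const uint8_t* frame = clusterData + 1;
    size_t frameMax = availableSize - 1;

    // The ZIM file does not store the compressed size, but zstd can recover it
    // by scanning the frame's block headers:
    size_t compressedSize = ZSTD_findFrameCompressedSize(frame, frameMax);
    return !ZSTD_isError(compressedSize);
}
```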

@Jaifroid
Copy link

Thank you @data-man and @veloman-yunkan. The info is helpful for pinpointing where to look for troubleshooting our code (I'm getting a null result when I try to decompress a cluster in the test ZIM file using the JS version of the codec).

@mgautierfr
Copy link
Collaborator

> 4. This ZIM file only appears to have two clusters, the first one containing only 121 bytes according to the Cluster Pointer List, which appears too small. Have I made an error?

Just to mention that clusters don't have to be written consecutively, so you cannot use the ClusterPointerList to get the size of a cluster.
At best you can use it to get a maximum size, and even that is not guaranteed: you may have clusterPtrPos[N+1] < clusterPtrPos[N].
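So the best you can get from the pointer list alone is an upper bound, something like this sketch (illustrative names, not libzim code):

```cpp
// Sketch: a safe upper bound on the on-disk size of cluster n is the distance
// to the nearest known offset after it, not simply the next pointer-list entry,
// because clusters need not be stored in order.
#include <cstdint>
#include <vector>

uint64_t clusterSizeUpperBound(const std::vector<uint64_t>& clusterPtrPos,
                               size_t n, uint64_t fileSize) {
    uint64_t start = clusterPtrPos[n];
    uint64_t bound = fileSize;             // fall back to the end of the file
    for (uint64_t off : clusterPtrPos)     // nearest cluster offset after `start`
        if (off > start && off < bound)
            bound = off;
    // Offsets of other regions (dirents, pointer lists, checksum, ...) could
    // tighten this; the real compressed frame may still end well before `bound`.
    return bound - start;
}
```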

@Jaifroid
Copy link

@mgautierfr OK, thank you for that important clarification. This most likely explains the issue I'm having.

@Jaifroid
Copy link

@mgautierfr, I understand from your comment in #210 (quoted below) that we cannot know the size-on-disk of a compressed cluster, and have to decompress the data chunk by chunk. Is the chunk size the same for zstandard as it is for xz? This seems to be 1024 x 5 in Kiwix JS for xz-compressed chunks.

> In fact, there is no way to know the size of the compressed buffer
> without decompressing the data until we reach the end of the
> compression stream. So we have to decompress the data, chunk by chunk,
> until we have decompressed the whole cluster.

(#210)

@mgautierfr
Copy link
Collaborator

> I understand from your comment in #210 (quoted below) that we cannot know the size-on-disk of a compressed cluster

Yes. At best you know that if something starts at an offset after the cluster (another cluster, or anything else such as a dirent, the clusterPtrPos, ...), the compressed cluster will finish before that offset. But the actual size may be smaller.

> Is the chunk size the same for zstandard as it is for xz?

There is no "chunk size" specific to zstandard or xz.
A cluster on disk is one byte (telling how/whether the following content is compressed) followed by the content, and you don't know how long that content is. Once you have decompressed the content, you can parse it to get the offsets of the blobs and so on.

As @veloman-yunkan said, how the content has been compressed is irrelevant to how you should decompress it. Of course, you must decompress xz content with the xz algorithm and zstd content with zstd, but you don't care about the chunk size used at compression time or other such details.

For decompression you can use the simple API (all at once) or the streaming API (chunk by chunk); it doesn't matter. It may matter for performance/memory usage, but the result will be the same.
In the C++ libzim implementation, we decompress the content chunk by chunk to avoid copying a lot of data in memory (and it would be difficult anyway, since we don't know the size of the compressed data), but you can do whatever you want depending on your implementation and needs.
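Chunk-by-chunk decompression of a zstd cluster might look like this sketch (readMore() is a hypothetical callback supplying the next bytes from the file; this is not the libzim code):

```cpp
// Sketch: stream-decompress one zstd frame; ZSTD_decompressStream() returns 0
// exactly when the frame ends, so extra trailing bytes are simply ignored.
#include <zstd.h>
#include <cstdint>
#include <functional>
#include <vector>

std::vector<uint8_t> decompressCluster(
        const std::function<size_t(uint8_t*, size_t)>& readMore) {
    ZSTD_DStream* ds = ZSTD_createDStream();
    ZSTD_initDStream(ds);

    std::vector<uint8_t> out;
    std::vector<uint8_t> inBuf(ZSTD_DStreamInSize());
    std::vector<uint8_t> outBuf(ZSTD_DStreamOutSize());

    size_t ret = 1;                                   // non-zero: frame not finished
    while (ret != 0) {
        size_t got = readMore(inBuf.data(), inBuf.size());
        if (got == 0) break;                          // truncated input
        ZSTD_inBuffer in { inBuf.data(), got, 0 };
        while (in.pos < in.size && ret != 0) {
            ZSTD_outBuffer o { outBuf.data(), outBuf.size(), 0 };
            ret = ZSTD_decompressStream(ds, &o, &in);
            if (ZSTD_isError(ret)) { out.clear(); ret = 0; break; }  // corrupt data
            out.insert(out.end(), outBuf.data(), outBuf.data() + o.pos);
        }
    }
    ZSTD_freeDStream(ds);
    return out;   // decompressed cluster: blob offset list followed by the blobs
}
```

A reader that only needs one blob could stop the loop early once `out` already covers that blob's offset and length, which is essentially the strategy discussed below.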

@Jaifroid
Copy link

Thank you very much for the hints, @mgautierfr. As you can probably tell, I'm rather new to decompressing streams, but I think I've got my head round it now with your help. Kiwix JS faces becoming obsolete overnight once new ZIMs are produced with zstd compression, hence the effort to reproduce the libzim process in JavaScript. (We would much rather use libzim directly, but so far it has proved impossible to compile it to asm.js/WebAssembly with Emscripten in a usable state, mostly, we think, due to filesystem limitations.)

@data-man
Copy link
Contributor Author

GoldenDict now supports ZIM files with zstd (commit).

@Jaifroid
Copy link

I am a bit confused by the GoldenDict implementation. It appears to calculate the cluster size before decompressing the cluster by subtracting the beginning of the cluster from the beginning of the next cluster:

https://github.com/goldendict/goldendict/blob/master/zim.cc#L322

Is this actually a useful heuristic to know "roughly" how much data we are dealing with?

@mgautierfr
Copy link
Collaborator

Yes. This is a useful heuristic, but there is no guarantee that clusters are written sequentially.
We changed this in libzim in #210.

@Jaifroid
Copy link

OK, thanks, @mgautierfr. I'm now quite close to completing the Kiwix JS implementation of zstd decompression. The GoldenDict version briefly made me think our implementation was over-complicated and that we might just be able to decompress a whole cluster in one go, but it's clearly not safe even if it would work most of the time.

So instead we decompress from the start of the cluster up to the end of the offset + data length we need. A side effect of this is that we have to restart decompression from the beginning of a cluster each time we want a blob from that cluster, even if the next blob we want is stored consecutively in the cluster after a previously retrieved blob. This seems like a waste of CPU cycles even if the decompression is fast... (I know it can be mitigated with cluster caching.)
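The cluster caching mentioned here is essentially a small LRU map from cluster index to decompressed bytes. A hypothetical sketch (not the actual Kiwix JS or libzim cache), in the same C++ used above:

```cpp
// Sketch: a tiny LRU cache of decompressed clusters, so consecutive blob reads
// from the same cluster don't restart decompression from the beginning.
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class ClusterCache {
    size_t capacity_;
    std::list<uint32_t> lru_;                       // front = most recently used
    std::unordered_map<uint32_t,
        std::pair<std::vector<uint8_t>, std::list<uint32_t>::iterator>> map_;
public:
    explicit ClusterCache(size_t capacity) : capacity_(capacity) {}

    // Returns nullptr on a miss; the caller decompresses the cluster and put()s it.
    const std::vector<uint8_t>* get(uint32_t clusterIdx) {
        auto it = map_.find(clusterIdx);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second.second);   // mark as recent
        return &it->second.first;
    }

    void put(uint32_t clusterIdx, std::vector<uint8_t> data) {
        if (map_.count(clusterIdx)) return;          // already cached
        if (map_.size() >= capacity_) {              // evict the least recently used
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(clusterIdx);
        map_.emplace(clusterIdx, std::make_pair(std::move(data), lru_.begin()));
    }
};
```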

@mgautierfr
Copy link
Collaborator

> but it's clearly not safe even if it would work most of the time.

Why is it not safe to decompress the whole cluster?

> So instead we decompress from the start of the cluster up to the end of the offset + data length we need. A side effect of this is that we have to restart decompression from the beginning of a cluster each time we want a blob from that cluster, even if the next blob we want is stored consecutively in the cluster after a previously retrieved blob. This seems like a waste of CPU cycles even if the decompression is fast... (I know it can be mitigated with cluster caching.)

This is a complex problem (not on the code side, but in terms of what the best strategy is).
We are currently trying to solve this with #78, #394, #395 and #411.

@Jaifroid
Copy link

> Why is it not safe to decompress the whole cluster?

I meant it's not safe to do it using values calculated by subtracting the current cluster's offset from the next cluster's offset, for the reason you state (clusters are not guaranteed to be written consecutively). So in Kiwix JS we currently decompress just "enough" data for the requested blob, and then start again for the next blob. We have an experimental cluster cache made by peter-x some years ago, which works well, but of course, like all caches, it's hit-and-miss and has a high churn rate for large ZIMs.

@mgautierfr
Copy link
Collaborator

Indeed, if you use the next offset as the end offset, it is not safe.

In libzim (for now), we decompress the whole compressed stream (and let lzma/zlib/... detect the end of the stream).
