Add support the Zstandard (aka zstd) library for compression #84

Closed
data-man opened this issue Nov 5, 2017 · 32 comments · Fixed by #308

Comments

@data-man
Copy link
Contributor

data-man commented Nov 5, 2017

See

Advantages:

  • Compression ratio: similar
  • Compression speed: similar
  • Decompression speed: roughly 8x faster
@kelson42
Copy link
Contributor

kelson42 commented Nov 5, 2017

@data-man To move ahead on this, the best argument would be to build a libzim/zimwriterfs/ZIM file with that algorithm and clearly demonstrate what kind of improvement we are talking about.

@data-man
Copy link
Contributor Author

data-man commented Nov 5, 2017

@kelson42
My suggestion is based on lzbench benchmarks.
It would be great if the ZIM format supported zstd.

@data-man
Copy link
Contributor Author

data-man commented Nov 5, 2017

zstdmt also deserves attention.
It supports multi-threaded compression/decompression for several algorithms through a single API.
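For illustration, libzstd itself also exposes multi-threaded compression through its advanced API. A minimal sketch in C++ (illustrative only; this is neither the zstdmt API nor libzim code):

```cpp
// Minimal sketch: multi-threaded zstd compression via libzstd's advanced API.
// Requires a libzstd built with multi-threading support (the default in release builds).
#include <zstd.h>
#include <stdexcept>
#include <string>

std::string compressMT(const std::string& input, int level = 19, int workers = 4) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, workers);  // 0 = single-threaded

    std::string out(ZSTD_compressBound(input.size()), '\0');
    size_t n = ZSTD_compress2(cctx, &out[0], out.size(), input.data(), input.size());
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(n))
        throw std::runtime_error(ZSTD_getErrorName(n));
    out.resize(n);
    return out;
}
```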

@kelson42
Copy link
Contributor

@data-man This feature request has been open for two years now, and it is more attractive to me than it was. Would you volunteer to make a POC integration of zstd in libzim?

@kelson42
Copy link
Contributor

Implementing this would help to avoid problems like kiwix/kiwix-tools#345.

@data-man
Copy link
Contributor Author

@kelson42
I'll try but I can't promise a quick result.
But definitely faster than two years. :)

Some thoughts:

  • add an additional parameter to zimrecreate: the compression method
  • maybe also add a language parameter

// [TODO] Use the correct language

@kelson42
Copy link
Contributor

> @kelson42
> I'll try but I can't promise a quick result.
> But definitely faster than two years. :)

That would be really awesome!

> Some thoughts:
> • add an additional parameter to zimrecreate: the compression method

Yes, but if zstd is better, then in the mid-term we would probably just switch to it by default.

> • maybe also add a language parameter

This is an idea; please open a dedicated ticket in zim-tools.

> // [TODO] Use the correct language

@mgautierfr
Copy link
Collaborator

@data-man please check with me before trying to implement this. (I'm starmad on IRC, mgautier on our Slack, or simply here.)

I've investigated zstandard a bit and it may provide a feature we have wanted for a long time: random-access decompression.
Until now, we have grouped the content into clusters, so we have to decompress a whole cluster to access the content of an article. We would like to be able to decompress only the article we want to read. It may be possible to do this with zstd by using separate frames sharing the same dictionary.

But doing this would require a lot of work in libzim: changing the cluster format, using an article cache size (instead of a cluster cache size), maybe changing the way we group articles into clusters, ...

It is also (of course) possible to simply use zstd to compress the cluster as we already do now. But at the very least we should prepare for the possibility of changing the way we store articles in clusters.
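As a rough sketch of the per-article-frame idea described above, using libzstd's dictionary API (the function name and the dictionary here are hypothetical; libzim does not work this way today):

```cpp
// Sketch: compress each article as its own zstd frame, all sharing one
// dictionary (e.g. trained with ZDICT_trainFromBuffer), so a single article
// can later be decompressed without touching the rest of the cluster.
#include <zstd.h>
#include <string>
#include <vector>

std::vector<std::string> compressArticles(const std::vector<std::string>& articles,
                                          const std::string& dict, int level = 19) {
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    std::vector<std::string> frames;
    for (const auto& article : articles) {
        std::string out(ZSTD_compressBound(article.size()), '\0');
        size_t n = ZSTD_compress_usingDict(cctx, &out[0], out.size(),
                                           article.data(), article.size(),
                                           dict.data(), dict.size(), level);
        if (!ZSTD_isError(n)) {
            out.resize(n);
            frames.push_back(std::move(out));  // one independent frame per article
        }
    }
    ZSTD_freeCCtx(cctx);
    return frames;
}
```

Reading back a single article would then be one ZSTD_decompress_usingDict() call on that article's frame with the same dictionary.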

@kelson42
Copy link
Contributor

@mgautierfr What you are talking about corresponds to #76 and #78. If we could tackle these problems as well, this would be awesome^2!

@kelson42
Copy link
Contributor

@mgautierfr @veloman-yunkan is going to work on this.

@data-man
Copy link
Contributor Author

> But doing this would require a lot of work in libzim: changing the cluster format, using an article cache size (instead of a cluster cache size), maybe changing the way we group articles into clusters, ...

💯

BTW, I changed my mind. :)
kanzi-cpp (wiki) is my favorite now:

  • C++
  • very fast
  • multithreading
  • includes many compression codecs

@veloman-yunkan
Copy link
Collaborator

I have started working on this in the zstd branch. I am not going to address #76 and/or #78 while working on this issue. Those enhancements would have a significant impact on libzim and must be done separately.

@Jaifroid
Copy link

Jaifroid commented Apr 4, 2020

@kelson42 Is there a test ZIM file that uses zstd compression, to work with on kiwix/kiwix-js#611?

@kelson42
Copy link
Contributor

kelson42 commented Apr 4, 2020

@data-man Would you please upload one (a ZIM file with zstd compression) somewhere?

I have created a ticket so that in the future one can easily be created with zimrecreate: kiwix/kiwix-tools#372

@data-man
Copy link
Contributor Author

data-man commented Apr 4, 2020

foo-zstd.zip

zimdump -l foo-zstd.zim
1
10
11
12
13
14
15
16
2
3
4
5
6
7
8
9
fulltext/xapian
title/xapian

@kelson42
Copy link
Contributor

kelson42 commented Apr 4, 2020

@data-man Perfect. Thx. I have uploaded it to http://tmp.kiwix.org/foo-zstd.zim as well.

@Jaifroid
Copy link

@data-man I'm trying to implement zstd decompression in Kiwix JS, our JavaScript reader (using a Node version of the zstandard library); see kiwix/kiwix-js#611 (comment). Regarding the sample ZIM you kindly supplied, I have some questions:

  1. What kind of data is in each numbered article? Is each one an HTML file?
  2. Which zstandard API did you use to compress the data? The Simple API or the Streaming API? Is there anything further I need to know to be able to decompress a cluster?
  3. Is each cluster compressed as one unit apart from the first (informational) byte of the cluster, as per the openZIM spec, or do I need to target a specific part of the cluster?
  4. This ZIM file only appears to have two clusters, the first one containing only 121 bytes according to the Cluster Pointer List, which appears too small. Have I made an error?

You will see from the comment referenced above that I can present what I believe is the compressed cluster's data to the decompressor using the Simple API, but it returns null; if I use the Streaming API I get an out-of-memory error, which may just be because the input is not in the expected format.

@data-man
Copy link
Contributor Author

data-man commented Apr 11, 2020

@Jaifroid
I'm using createZimExample and zimrecreate with a patched libzim:

writer/creatordata.h
103: CompressionType compression = zimcompZstd;

> Is each one an HTML file?

No. createZimExample uses text/plain.

@veloman-yunkan
Copy link
Collaborator

@Jaifroid

> 2. Which zstandard API did you use to compress the data? The Simple API or the Streaming API? Is there anything further I need to know to be able to decompress a cluster?

The API used to compress the data shouldn't matter. Your question is akin to asking about the application used to create a PNG file, in order to decide which viewer to use for opening it. Data compressed with any zstd API complies with the zstd codec specification, and should be correctly decoded with any of its decompression APIs.

> 3. Is each cluster compressed as one unit apart from the first (informational) byte of the cluster, as per the openZIM spec, or do I need to target a specific part of the cluster?

Introducing zstd support at this stage doesn't bring new approaches to compression in the ZIM file format; it's just a new method of compressing the entire cluster as a whole.
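In other words, a reader-side sketch under the current format (names and error handling are illustrative, not libzim code):

```cpp
// Sketch: a compressed cluster is one info byte followed by a single zstd frame
// that holds the blob offset list plus the blob data.
#include <cstdint>
#include <zstd.h>

// clusterData points at the cluster's first byte; availableSize is any upper
// bound on the bytes available after that offset (e.g. up to end of file).
bool inspectCluster(const uint8_t* clusterData, size_t availableSize) {
    uint8_t info = clusterData[0];
    uint8_t compression = info & 0x0F;        // lower bits: compression type
    if (compression != 5)                     // 5 = zstd in libzim (zimcompZstd)
        return false;

    const uint8_t* frame = clusterData + 1;
    size_t frameMax = availableSize - 1;

    // The ZIM file does not store the compressed size, but zstd can recover it
    // by scanning the frame's block headers:
    size_t compressedSize = ZSTD_findFrameCompressedSize(frame, frameMax);
    return !ZSTD_isError(compressedSize);
}
```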

@Jaifroid
Copy link

Thank you @data-man and @veloman-yunkan. The info is helpful for pinpointing where to look for troubleshooting our code (I'm getting a null result when I try to decompress a cluster in the test ZIM file using the JS version of the codec).

@mgautierfr
Copy link
Collaborator

> 4. This ZIM file only appears to have two clusters, the first one containing only 121 bytes according to the Cluster Pointer List, which appears too small. Have I made an error?

Just to mention that clusters don't have to be written consecutively, so you cannot use the ClusterPointerList to get the size of a cluster.
At best you can use it to get a maximum size, and even that is not guaranteed: you may have clusterPtrPos[N+1] < clusterPtrPos[N].
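So the best you can get from the pointer list alone is an upper bound, something like this sketch (illustrative names, not libzim code):

```cpp
// Sketch: a safe upper bound on the on-disk size of cluster n is the distance
// to the nearest known offset after it, not simply the next pointer-list entry,
// because clusters need not be stored in order.
#include <cstdint>
#include <vector>

uint64_t clusterSizeUpperBound(const std::vector<uint64_t>& clusterPtrPos,
                               size_t n, uint64_t fileSize) {
    uint64_t start = clusterPtrPos[n];
    uint64_t bound = fileSize;             // fall back to the end of the file
    for (uint64_t off : clusterPtrPos)     // nearest cluster offset after `start`
        if (off > start && off < bound)
            bound = off;
    // Offsets of other regions (dirents, pointer lists, checksum, ...) could
    // tighten this; the real compressed frame may still end well before `bound`.
    return bound - start;
}
```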

@Jaifroid
Copy link

@mgautierfr OK, thank you for that important clarification. This most likely explains the issue I'm having.

@Jaifroid
Copy link

@mgautierfr, I understand from your comment in #210 (quoted below) that we cannot know the size-on-disk of a compressed cluster, and have to decompress the data chunk by chunk. Is the chunk size the same for zstandard as it is for xz? This seems to be 1024 x 5 in Kiwix JS for xz-compressed chunks.

> In fact, there is no way to know the size of the compressed buffer
> without decompressing the data until we reach the end of the
> compression stream. So we have to decompress the data, chunk by chunk,
> until we have decompressed the whole cluster.

(#210)

@mgautierfr
Copy link
Collaborator

> I understand from your comment in #210 (quoted below) that we cannot know the size-on-disk of a compressed cluster

Yes. At best you know that if something starts at an offset after the cluster (another cluster, or anything else such as a dirent, the clusterPtrPos, ...), the compressed cluster will finish before that offset. But the actual size may be smaller.

> Is the chunk size the same for zstandard as it is for xz?

There is no "chunk size" specific to zstandard or xz.
A cluster on disk is one byte (telling how/whether the following content is compressed) followed by the content, and you don't know how long that content is. Once you have decompressed the content, you can parse it to get the offsets of the blobs and so on.

As @veloman-yunkan said, how the content has been compressed is irrelevant to how you should decompress it. Of course, you must decompress xz content with the xz algorithm and zstd content with zstd, but you don't care about the chunk size used at compression time or other such details.

For decompression you can use the simple API (all at once) or the streaming API (chunk by chunk); it doesn't matter. It may matter for performance/memory usage, but the result will be the same.
In the C++ libzim implementation, we decompress the content chunk by chunk to avoid copying a lot of data in memory (and it would be difficult anyway, since we don't know the size of the compressed data), but you can do whatever you want depending on your implementation and needs.
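Chunk-by-chunk decompression of a zstd cluster might look like this sketch (readMore() is a hypothetical callback supplying the next bytes from the file; this is not the libzim code):

```cpp
// Sketch: stream-decompress one zstd frame; ZSTD_decompressStream() returns 0
// exactly when the frame ends, so extra trailing bytes are simply ignored.
#include <zstd.h>
#include <cstdint>
#include <functional>
#include <vector>

std::vector<uint8_t> decompressCluster(
        const std::function<size_t(uint8_t*, size_t)>& readMore) {
    ZSTD_DStream* ds = ZSTD_createDStream();
    ZSTD_initDStream(ds);

    std::vector<uint8_t> out;
    std::vector<uint8_t> inBuf(ZSTD_DStreamInSize());
    std::vector<uint8_t> outBuf(ZSTD_DStreamOutSize());

    size_t ret = 1;                                   // non-zero: frame not finished
    while (ret != 0) {
        size_t got = readMore(inBuf.data(), inBuf.size());
        if (got == 0) break;                          // truncated input
        ZSTD_inBuffer in { inBuf.data(), got, 0 };
        while (in.pos < in.size && ret != 0) {
            ZSTD_outBuffer o { outBuf.data(), outBuf.size(), 0 };
            ret = ZSTD_decompressStream(ds, &o, &in);
            if (ZSTD_isError(ret)) { out.clear(); ret = 0; break; }  // corrupt data
            out.insert(out.end(), outBuf.data(), outBuf.data() + o.pos);
        }
    }
    ZSTD_freeDStream(ds);
    return out;   // decompressed cluster: blob offset list followed by the blobs
}
```

A reader that only needs one blob could stop the loop early once `out` already covers that blob's offset and length, which is essentially the strategy discussed below.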

@Jaifroid
Copy link

Thank you very much for the hints, @mgautierfr. As you can probably tell, I'm rather new to decompressing streams, but I think I've got my head round it now with your help. Kiwix JS faces becoming obsolete overnight once new ZIMs are produced with zstd compression, hence the effort to reproduce the libzim process in JavaScript. (We would much rather use libzim directly, but so far it has proved impossible to compile it to asm.js/WebAssembly with Emscripten in a usable state, mostly, we think, due to filesystem limitations.)

@data-man
Copy link
Contributor Author

GoldenDict now supports ZIM files with zstd (commit).

@Jaifroid
Copy link

I am a bit confused by the GoldenDict implementation. It appears to calculate the cluster size before decompressing the cluster by subtracting the beginning of the cluster from the beginning of the next cluster:

https://github.com/goldendict/goldendict/blob/master/zim.cc#L322

Is this actually a useful heuristic to know "roughly" how much data we are dealing with?

@mgautierfr
Copy link
Collaborator

Yes. This is a useful heuristic, but there is no guarantee that clusters are written sequentially.
We changed this in libzim in #210.

@Jaifroid
Copy link

OK, thanks, @mgautierfr. I'm now quite close to completing the Kiwix JS implementation of zstd decompression. The GoldenDict version briefly made me think our implementation was over-complicated and that we might just be able to decompress a whole cluster in one go, but it's clearly not safe even if it would work most of the time.

So instead we decompress from the start of the cluster up to the end of the offset + data length we need. A side effect of this is that we have to restart decompression from the beginning of a cluster each time we want a blob from that cluster, even if the next blob we want is stored consecutively in the cluster after a previously retrieved blob. This seems like a waste of CPU cycles even if the decompression is fast... (I know it can be mitigated with cluster caching.)
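The cluster caching mentioned here is essentially a small LRU map from cluster index to decompressed bytes. A hypothetical sketch (not the actual Kiwix JS or libzim cache), in the same C++ used above:

```cpp
// Sketch: a tiny LRU cache of decompressed clusters, so consecutive blob reads
// from the same cluster don't restart decompression from the beginning.
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class ClusterCache {
    size_t capacity_;
    std::list<uint32_t> lru_;                       // front = most recently used
    std::unordered_map<uint32_t,
        std::pair<std::vector<uint8_t>, std::list<uint32_t>::iterator>> map_;
public:
    explicit ClusterCache(size_t capacity) : capacity_(capacity) {}

    // Returns nullptr on a miss; the caller decompresses the cluster and put()s it.
    const std::vector<uint8_t>* get(uint32_t clusterIdx) {
        auto it = map_.find(clusterIdx);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second.second);   // mark as recent
        return &it->second.first;
    }

    void put(uint32_t clusterIdx, std::vector<uint8_t> data) {
        if (map_.count(clusterIdx)) return;          // already cached
        if (map_.size() >= capacity_) {              // evict the least recently used
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(clusterIdx);
        map_.emplace(clusterIdx, std::make_pair(std::move(data), lru_.begin()));
    }
};
```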

@mgautierfr
Copy link
Collaborator

> but it's clearly not safe even if it would work most of the time.

Why is it not safe to decompress the whole cluster?

> So instead we decompress from the start of the cluster up to the end of the offset + data length we need. A side effect of this is that we have to restart decompression from the beginning of a cluster each time we want a blob from that cluster, even if the next blob we want is stored consecutively in the cluster after a previously retrieved blob. This seems like a waste of CPU cycles even if the decompression is fast... (I know it can be mitigated with cluster caching.)

This is a complex problem (not on the code side, but in terms of what the best strategy is).
We are currently trying to solve this with #78, #394, #395 and #411.

@Jaifroid
Copy link

> Why is it not safe to decompress the whole cluster?

I meant it's not safe to do it using values calculated by subtracting the current cluster's offset from the next cluster's offset, for the reason you state (clusters are not guaranteed to be written consecutively). So in Kiwix JS we currently decompress just "enough" data for the requested blob, and then start again for the next blob. We have an experimental cluster cache made by peter-x some years ago, which works well, but of course, like all caches, it's hit-and-miss and has a high churn rate for large ZIMs.

@mgautierfr
Copy link
Collaborator

Indeed, if you use the next offset as the end offset, it is not safe.

In libzim (for now), we decompress the whole compressed stream (and let lzma/zlib/... detect the end of the stream).
