IPLD and compression #76
I've had several conversations about this as well. My current view is that we need to put some work into transport-level and storage-level compression soon. The longer we wait, the more incentive there will be for people to solve this at the codec layer, which we actively don't want because it would make transport- and storage-level compression redundant.
A few remarks from somebody who really needs this:
So a quick and dirty solution would be to just have another multiformats format like cbor-deflate. But then of course you have an explosion of formats: different encodings x different compression algorithms.

A better solution might be: when you store something using the block API, you can hint whether compression is likely to be useful. This is so you can disable compression when storing something that is already compressed, like images and videos. You get the same hash regardless of which compression, if any, is used. For storage and transfer, you offer and accept the data for a hash with different representations. Then both participants of an exchange figure out the best exchange format and use that. E.g.

However, doing this correctly, with no unnecessary transcoding between storage and transmission, will probably require quite a bit of change in many layers, whereas just adding cbor-deflate to https://github.com/multiformats/multicodec/blob/master/table.csv and implementing it is a very small and isolated code change. So it might be better to do this first before opening up the huge can of worms that is doing this correctly...
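To make the block-API hint idea more concrete, here is a rough Go sketch. Everything in it (`CompressionHint`, `putBlock`) is hypothetical and only uses the standard library's deflate; it is not an existing IPFS API, just an illustration of "same hash, optionally compressed storage":

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// CompressionHint tells the block store whether the payload is likely
// to benefit from compression. The hash is always computed over the
// uncompressed bytes, so the CID does not depend on how the block is
// actually stored.
type CompressionHint int

const (
	HintAuto           CompressionHint = iota // let the store decide
	HintCompress                              // e.g. CBOR, JSON, text
	HintIncompressible                        // e.g. mp4/jpeg chunks
)

// putBlock is a hypothetical API: deflate the payload on disk if the
// caller hinted that this is worthwhile, but only keep the compressed
// form when it actually saves space.
func putBlock(raw []byte, hint CompressionHint) ([]byte, error) {
	if hint != HintCompress {
		return raw, nil
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(raw); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	if buf.Len() >= len(raw) {
		return raw, nil // compression didn't help, store as-is
	}
	return buf.Bytes(), nil
}

func main() {
	data := bytes.Repeat([]byte(`{"name":"example"}`), 100)
	stored, _ := putBlock(data, HintCompress)
	fmt.Printf("raw %d bytes, stored %d bytes\n", len(data), len(stored))
}
```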
Compression at the storage layer should be configurable. Futile transport-layer compression is pretty common; the vast majority of HTTP traffic is gzip-compressed even when it's video and image content :/ Also, even if the binary data in blocks is already compressed, there's plenty of other stuff going on in the transport that will benefit from compression.
I am aware of that, but gzip is not going to be able to extract any redundancy out of a chunk of an mp4 file, so why do it?
Sure. But I think the benefit of sending bytes over the wire exactly as they are in storage probably outweighs the advantage of saving a few bytes by compressing the protocol overhead. Depends on the exact use case, of course.
The answer to both of these is the same: because it's very easy to just compress the entire transport stream. It would be much, much harder to compress only parts of the transport stream and to negotiate each part by pulling extra metadata out of the storage format.
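For a sense of scale, whole-stream compression really is only a handful of lines. Here is a sketch over a plain `net.Conn` using the standard library's gzip; this is not how libp2p transports are actually built, just an illustration of why this option is attractive:

```go
package streamcompress

import (
	"compress/gzip"
	"net"
)

// compressedConn wraps a net.Conn so that everything written is
// gzip-compressed and everything read is gzip-decompressed. Protocol
// framing, CIDs and block payloads all go through the same stream,
// whether or not the payloads are already compressed.
type compressedConn struct {
	net.Conn
	r *gzip.Reader
	w *gzip.Writer
}

func newCompressedConn(c net.Conn) *compressedConn {
	return &compressedConn{Conn: c, w: gzip.NewWriter(c)}
}

func (c *compressedConn) Read(p []byte) (int, error) {
	if c.r == nil {
		// Created lazily: gzip.NewReader blocks until it can read
		// the gzip header written by the peer.
		r, err := gzip.NewReader(c.Conn)
		if err != nil {
			return 0, err
		}
		c.r = r
	}
	return c.r.Read(p)
}

func (c *compressedConn) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if err != nil {
		return n, err
	}
	// Flush so the peer sees the data without waiting for Close.
	return n, c.w.Flush()
}
```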
So, the key issue here is that compressing before hashing causes some significant issues with deduplication. Otherwise, our options are effectively equivalent in terms of performance at the end of the day; they just move complexity around. Even if we compress at the transport/storage layer, we can still have smart transports and smart storage systems that:
I agree about this being easier. You can just pipe the entire transport stream through inflate/deflate. But I would think that doing so for data that cannot be reasonably compressed, such as video data, would slow things down and/or increase energy consumption for no good reason. This will matter especially when running on low-powered devices. You will have a tradeoff between using a fast but inefficient compression algorithm like LZ4 and a slow but space-efficient one like bzip2. You will also have overhead for the compression/decompression state of each connection. And all that might be for nothing if the majority of the content being exchanged is already compressed (videos, images etc.), which might frequently be the case. If you limit your efforts to data where you expect a large gain from compression, such as IPLD data encoded as CBOR, and make sure not to transcode between storage and transport, things should be more efficient. I think this can still be done in a way that keeps the storage and transport layers generic.
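One cheap way to avoid spending cycles on incompressible data (a purely hypothetical heuristic, not something go-ipfs does) is to trial-compress a small sample of each block and skip compression when the sample doesn't shrink. A minimal sketch, with an arbitrary sample size and threshold:

```go
package compresshint

import (
	"bytes"
	"compress/flate"
)

// worthCompressing trial-compresses up to the first 4 KiB of a block
// with a fast deflate level and reports whether the sample shrank by
// at least ~10%. mp4/jpeg chunks will almost always fail this test;
// CBOR/JSON/text will almost always pass it.
func worthCompressing(block []byte) bool {
	sample := block
	if len(sample) > 4096 {
		sample = sample[:4096]
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		return false
	}
	if _, err := w.Write(sample); err != nil {
		return false
	}
	if err := w.Close(); err != nil {
		return false
	}
	return buf.Len() < len(sample)*9/10
}
```

The check is much cheaper than compressing a full 256 KB block that won't shrink anyway, which is the concern about low-powered devices above.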
Agree with most of what you say. But personally I don't care that much about deduplication not working due to different multiformats encodings. Let's say I have an application that adds some data in a hypothetical multiformats format. What I do care about most is that the data is still IPLD data that can be traversed and inspected, instead of essentially a binary blob (which is what I have implemented now because I really need compression for my use case).
That's the reason why I didn't bring that up as a possible solution. I'd rather go with the format wrapping, so you only have one new format per compression algorithm. You'd also get inspection/traversal capabilities from IPLD.
This sounds nice, but if I think about the implementation, I fear it'll make things quite complicated/error-prone, as you somehow need to keep track of which items are compressed or not (and keep it in sync). Though we can always start simple and go from there.
Definitely. Really, I'd hide this behind types. That is, a "smart" transport would produce normal blocks.
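A rough sketch of what "hiding it behind types" could mean in Go; the interface names here are made up for illustration, not the real go-ipfs/go-ipld types:

```go
package blockstore

// Callers always see plain blocks; a compression-aware store or
// transport can additionally ask for the stored representation.

// Block is the normal, uncompressed view that all existing code uses.
type Block interface {
	RawData() []byte // uncompressed bytes, the ones the CID hashes
	Cid() string     // stand-in for a real cid.Cid
}

// Blockstore is the interface a "dumb" transport keeps using.
type Blockstore interface {
	Get(cid string) (Block, error)
	Put(Block) error
}

// CompressedBlockstore is what a "smart" transport would look for via
// a type assertion, so it can ship the on-disk representation as-is.
type CompressedBlockstore interface {
	Blockstore
	// GetCompressed returns the stored representation plus an
	// identifier of the compression used.
	GetCompressed(cid string) (data []byte, codec string, err error)
}
```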
I just spent about half a day running a simulation.

TL;DR: A use case practically designed to show the benefits of de-duplication doesn't even touch the performance gains of just compressing an entire archive on every release and not de-duplicating anything.

I'm still processing the results of this investigation, but it's starting to make me think that we might be focusing on the wrong performance cases in IPLD/IPFS across the board. Compression is incredibly powerful, and I'm starting to get skeptical of any numbers we're putting together without transport or storage compression in place. Reducing round trips is going to help out no matter what, but compression is probably the biggest thing we could do to impact the performance of most use cases.
If we were bandwidth constrained, that would be the issue. However, I've never seen a bandwidth-constrained test except on localhost. I do agree that we need to deal with this ASAP.
Not necessarily.
WRT packet loss, a large issue there is that go-ipfs currently sends out way too many packets (we need to buffer better). WRT compression, I'd be surprised if intermediate nodes were all that compressible; they tend to mostly be composed of hashes.
I'm sorry I obviously overlooked this ticket when I created ipld/legacy-unixfs-v2#34. My idea was mentioned (partly) before:
Security issues with transport compression

The compression used inside an encrypted connection is prone to be a security issue; see the BREACH and CRIME exploits. If you want to use transport compression, it needs to be carefully designed so that it does not leak private content. This isn't an issue when we send the already compressed data from storage through a connection, or decompress the data (that we have stored compressed) and send it through a connection.

My thoughts on a possible implementation

The only addition would be the implementation details. There are three types of compression:
The last two apply to zstd (zstd can use/create an external dictionary), the first applies to brotli.

Brotli and zstd

Brotli is interesting to support because it's supported by browsers, compresses pretty well and doesn't need a dictionary inside the archive. Zstd is extremely interesting because it can analyze a dataset, store a dictionary and compress the dataset with it. This allows us to analyze, say, a large number of small HTML files, create a dictionary, store it in IPFS and put a reference to the dictionary in the file/folder metadata. All further writes containing HTML files could use the dictionary and therefore increase the compression/decompression speed massively.

Drawbacks of using a dictionary

The information about which dictionary was used can also be stored inside the zstd format. This allows us to easily look up the correct dictionary in a list of all dictionaries used on the local node. If a remote node is requesting blocks, it needs a way to request the CID matching the dictionary ID of the blocks it received. By default, the dictionary ID used on generation is a 4-byte random number, but it can also be set to a specific value. Keep in mind that this value needs to be stored in all compressed files, so it shouldn't be an IPNS hash, for example.
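As a sketch of the external-dictionary workflow, assuming the github.com/klauspost/compress/zstd package and a dictionary file trained elsewhere (e.g. with `zstd --train` over a corpus of small HTML files); the file name and content are made up:

```go
package main

import (
	"fmt"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// A pretrained dictionary, in practice fetched by its CID.
	dict, err := os.ReadFile("html.dict")
	if err != nil {
		panic(err)
	}

	// Encoder and decoder share the external dictionary; the
	// dictionary itself is NOT stored in each compressed block,
	// only its ID is recorded in the zstd frame header.
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		panic(err)
	}
	dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		panic(err)
	}

	block := []byte("<html><body><h1>hello</h1></body></html>")
	compressed := enc.EncodeAll(block, nil)
	restored, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("block %d bytes, compressed %d bytes, roundtrip ok: %v\n",
		len(block), len(compressed), string(restored) == string(block))
}
```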
Drawbacks of using no dictionary

We have a lot of small chunks in IPFS, since the default maximum block size is 256 KByte. Without a dictionary, a compression algorithm needs to create a dictionary and store it inside the block archive, for each block. This decreases the efficiency of the compression. With an external dictionary, we can have both: good efficiency and the ability to decode each block on its own (together with the dictionary).

Compression test

Since we care about a specific scenario, I tried zstd on a datastore with various settings. The corpus was the Linux git repository (without hidden folders/files) added with standard settings on ipfs 0.4.23 to an empty flatfs datastore. The commit was

Uncompressed 866M (with

Highest compression ratio with zstd (1 MB dictionary):

Apart from the quoted level 22, I also checked compression levels 3, 9 and 15. The same picture, except that at lower compression levels a 1 MB dictionary makes no sense anymore.

Offer pretrained dictionaries

The best thing about using zstd is that we can provide already-trained dictionaries for common applications, like compressing C code, JS code, Go code, HTML/CSS files etc., since the file metadata would just contain a reference to the dictionary. IPFS can just use the dictionaries that are in the IPFS network, while we don't have to add them to the binaries.

Brotli

Brotli can achieve on the same dataset on the highest setting

Other compression algorithms for comparison

Highest compression ratio with bzip2:
Thanks for the detailed analysis. Custom dictionaries would be really cool, though making it actually work sounds pretty difficult, as you need to transfer the information and the dict somehow. If someone wants to prototype this, I'd probably start with no-dict or pre-defined-dict compression.
I wouldn't use zstd without a dictionary at all; it would slow down operations significantly.

The easiest way to start would be to create a set of dictionaries and add custom numbers to them. If we use 0-255 for this, it will only increase the size of the files by 1 byte compared to no dictionary ID. We could simply throw a lot of files from the web for each MIME type into the analyzer and generate up to 256 dictionaries, one per MIME type. We would store just a list of the IDs and the corresponding CIDs in the IPFS code, and the dictionaries would be fetched from the network when needed. This way we can make sure all nodes use the same dictionary, while we don't blow up the IPFS binary.

Custom dictionaries provided by the user, on the other hand, would need a lot of additional work on the protocol and in the definition of objects. To avoid having to specify the whole CID on each block or file, it would make sense to add an optional field to the 'folder' definition in UnixFSv2 for defining a dictionary ID and a corresponding CID. But this will get pretty difficult when a file gets pulled by a client which doesn't want to hold the full folder, since we need to transmit this information efficiently if the client doesn't have the dictionary already.

TL;DR: stick to 256 MIME dictionaries and sort the mess out later.

Checksums inside the zstd header

zstd can also save a checksum of the data inside the archive. I would deactivate this feature, since it uses space unnecessarily.
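A minimal sketch of that ID-to-CID table; the IDs and CIDs below are placeholders, not real published dictionaries:

```go
package dictionaries

// Hypothetical registry mapping the 1-byte dictionary ID recorded in
// each zstd frame to the CID under which that pretrained dictionary
// is published in the IPFS network. Nodes ship only this table and
// fetch the dictionary blocks on demand.
var pretrained = map[uint8]string{
	// Placeholder IDs and CIDs, for illustration only.
	1: "bafy...dict-text-html",
	2: "bafy...dict-source-c",
	3: "bafy...dict-source-go",
	4: "bafy...dict-source-js",
}

// DictionaryCID returns the CID for a dictionary ID found in a
// compressed block, or ok=false if the ID is unknown locally.
func DictionaryCID(id uint8) (cid string, ok bool) {
	cid, ok = pretrained[id]
	return
}
```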
Should I create a new ticket for this under go-ipfs?
@RubenKelevra I don't think a new ticket is needed. You could just link from a PR to this issue.
@RubenKelevra Though if you want to discuss Go-specific things or want to make sure such a PR would be accepted before working on it, it makes sense to open an issue there.
In this case, this ticket should be reopened :) I'll create a ticket just for the creation of the dictionaries, and the further stuff can either be discussed here or in a new ticket, if necessary.
Compression has already come up several times. I think there should be some kind of compression for IPLD.
I can think of two different ways which serve different purposes.
On the application level
Add a new container format for compression which embeds the original one. There are several ways of doing that; one could be just adding a new format for each compression method. The format implementation would then extract the data and you could access the underlying one. Hashes would then be those of the compressed data.
Use cases for doing it this way are when you are mostly collecting data (e.g. from sensors) and don't look at it too often.
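A rough Go sketch of such a wrapper format, using deflate and sha256 as stand-ins for a real multicodec and multihash; the Node shape and field names are made up for illustration:

```go
package wrapped

import (
	"bytes"
	"compress/flate"
	"crypto/sha256"
	"io"
)

// Node is a hypothetical "wrapper" object for application-level
// compression: it carries the compressed payload plus enough metadata
// to get back to the inner IPLD block. The hash (and therefore the
// CID) is computed over the compressed data.
type Node struct {
	InnerCodec  string // e.g. "dag-cbor": codec of the wrapped block
	Compression string // e.g. "deflate"
	Data        []byte // compressed bytes of the inner block
}

// Wrap deflate-compresses an encoded inner block and hashes the result.
func Wrap(innerCodec string, encoded []byte) (Node, [32]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return Node{}, [32]byte{}, err
	}
	if _, err := w.Write(encoded); err != nil {
		return Node{}, [32]byte{}, err
	}
	if err := w.Close(); err != nil {
		return Node{}, [32]byte{}, err
	}
	n := Node{InnerCodec: innerCodec, Compression: "deflate", Data: buf.Bytes()}
	// Hashing the compressed bytes means identical logical content
	// compressed differently yields a different hash (the dedup
	// downside discussed earlier in this thread).
	return n, sha256.Sum256(n.Data), nil
}

// Unwrap returns the encoded inner block for traversal/inspection.
func (n Node) Unwrap() ([]byte, error) {
	r := flate.NewReader(bytes.NewReader(n.Data))
	defer r.Close()
	return io.ReadAll(r)
}
```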
Deep inside IPLD/IPFS
Compression could also happen on a layer below IPLD. Then the hashes would be based on the actual (uncompressed) content. The idea I had is that you would implement it on two layers: the storage and the transport.
The storage layer (the repo) would have compression enabled and just compress all the data that comes in. When you retrieve the data, it would be uncompressed by default, so things would work as they do today. There could also be an option to return the compressed data, which could then be passed on to the transport layer.
The transport layer takes an uncompressed block by default, compresses it for transport and uncompresses it again on the other side. There could also be an option for it to take the compressed data directly from the storage layer and transmit that data. Then it would only need to be decompressed on the receiving side to verify that the hashes match.
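A sketch of the receiving side of that option, with gzip and sha256 standing in for the repo's real compression format and multihash (the function and its signature are hypothetical):

```go
package receive

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"errors"
	"io"
)

// ReceiveStoredCompressed handles a block that the sender shipped
// exactly as it sits in its (gzip-compressed) repo: we must decompress
// before we can check that it matches the requested hash.
func ReceiveStoredCompressed(compressed []byte, want [32]byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	// Hashes are over the uncompressed content, so the CID is the
	// same no matter which representation travelled over the wire.
	if sha256.Sum256(raw) != want {
		return nil, errors.New("block does not match requested hash")
	}
	return raw, nil
}
```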