Essential content coding metadata: header or body? #2770
On the encoding side, the main downside is that the cli tooling for brotli and Zstandard doesn't currently do the embedding, so tooling would have to be added both to prepend the hash to the files in a way that isn't standard for either (yet, anyway) and to manually decompress the files. Zstandard has identifiers for dictionaries when using non-raw dictionaries, but both assume that raw dictionaries will be negotiated out of band. Technically it would be a pretty trivial modification for clients and servers that are doing the work; I'm just a bit concerned about the developer experience changes (and whatever needs to be done to get both brotli and Zstandard to understand what amounts to new file formats). |
I opened issues with brotli and Zstandard to see if they would consider adding it to their respective file formats. If it's an optional metadata tag that is backwards compatible with existing encoders and decoders, I could see it providing quite a bit of value, even in the non-HTTP case of dictionary compression. |
There was some discussion in the Zstandard repo about possibly reserving one of the skippable frame magic numbers for embedding a dictionary ID, but there's some level of risk of collision with people who may be using those frames for watermarking or other application-specific use cases. As best I can tell, the brotli stream format doesn't have a similar frame capability for metadata or application-specific data.

We could create a container format that holds the dictionary ID and the stream (a header, basically, not unlike zip vs deflate), but that feels like a fairly large effort, and the tooling would have to catch up to make it easy for developers to work with. At this point I'm hesitant to recommend adding anything to the payload itself that the existing brotli and zstd tooling can't process. Being able to create, fetch and test the raw files is quite useful for the developer workflow and for debugging deployments.

Would it make sense to allow future encoding formats to include a dictionary ID in the file itself and make the header optional in those cases (with the embedded ID authoritative)? I'm not sure whether that belongs in this draft, since it is limited to the 2 existing encodings and could be addressed in a new draft when new encodings are created, or whether it makes sense to allow for it here without requiring it. |
I don't think that you want to change the zstd or brotli format, only the content coding. That is, something like this:

```python
def decode(body):
    dict_hash = body[:32]
    dictionary = lookup_dict(dict_hash)
    return decompress(body[32:], dict=dictionary)
```

This does partly work against the idea that you might have a bunch of files that contain compressed versions of content. You can't just point the existing command-line tools at those files. |
I do think we need to define a file format for it if we go down this path, to ease adoption, and it should probably have a magic signature at the beginning. Maybe something like what gzip is to deflate, but with a simple 3-byte signature followed by the hash followed by the stream data. Assuming we create a cli tool to do all of the work of compressing/decompressing them, I'll ping several of the current origin trial participants to see how it will fit into their workflow. Something like the sketch below:
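(A minimal sketch of what that wrapper could look like, assuming a placeholder 3-byte signature and a SHA-256 dictionary hash; the signature bytes, function names, and CLI surface are illustrative, not anything defined by the draft.)

```python
import hashlib

MAGIC = b"DCB"  # hypothetical 3-byte signature; the real value would be picked by the draft

def wrap(compressed_stream: bytes, dictionary: bytes) -> bytes:
    # 3-byte signature + 32-byte SHA-256 of the dictionary + compressed stream
    return MAGIC + hashlib.sha256(dictionary).digest() + compressed_stream

def unwrap(data: bytes) -> tuple[bytes, bytes]:
    # Returns (dictionary_hash, compressed_stream); rejects files without the signature
    if data[:3] != MAGIC:
        raise ValueError("not a dictionary-compressed file")
    return data[3:35], data[35:]
```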
I'm assuming something like this would be better off as its own draft that this one references, or do you think it makes sense to define it here? I agree there are significant benefits to having the hash paired directly with the resource; I just want to be careful to make sure that whatever we do fits well into the developer workflow. |
Adding the header would add some work to the typical CI workflow. At least in my case, the diff file stream is created with the existing cli tooling. If this can be done in a published script it would simplify the logic (FYI, node.js does not yet support the relevant brotli bindings, see nodejs/node#52250), but that's not a requirement IMO. Do I understand correctly that with this idea implemented the separate header is no longer needed? |
Yes, this would eliminate the need for the separate header. |
I have a question. If we can use the new dcb and dcz in the Content-Encoding header, why do we need to have the "3-byte signature indicating hash and compression type" in the response body? |
It's technically not required, but it makes it safer and easier to operate on the files outside of the HTTP case.

For example, here's some discussion on the brotli issue tracker from 2016 asking for the same: google/brotli#298
|
I don't have a problem with doing that in this document, if that is the way people want to go. I personally don't think that a new format is needed because this is a content-encoding, not a media type. But if a media type (and tooling) helps with the deployment of the content-encoding, then maybe that is the right thing to do. Either way, I don't think that you should make that decision (it's a significant one) without having a broader discussion than just this issue. |
@martinthomson - could you help me understand this one? (mainly, why wouldn't we be able to apply delta encoding in the absence of an in-body hash?) This will definitely add complexity (e.g. the need to define a new file format that would wrap dictionary-compressed resources, along with the required tooling). It's not currently clear to me what the advantage of this would be. |
@yoavweiss is the complexity you are concerned about limited to tooling and the spec process, or do you also see it as being more complex after we are at a good state with tooling? Assuming the brotli and Zstandard libs and cli tools have been updated to add a flag for "embedded dictionary hash" format streams, does one option start to look better?

For me, having the hash embedded in the stream/file removes fragility from the system. It blocks decoding the file with the wrong dictionary and enables rebuilding of the metadata that maps dictionaries to compressed files if, for some reason, that metadata got lost (file names truncated, etc). It also feels like it simplifies the serving path a bit, bringing us back to "serve this file" based on the request headers alone.

On the size side of things, I expect it will likely be a wash. Delta-compressed responses will be a few bytes smaller because the header (name and value) is larger than the file header (10's of bytes, not huge by any means). In the dynamic resource case where multiple responses re-use the same dictionary, the header can be compressed away with HPACK/QPACK, making the header case a bit smaller.

I don't think it's a big change in complexity/fragility one way or the other, but it does feel like there are fewer moving pieces, once the tooling is taken care of, to have the file contents themselves specify the dictionary they were compressed with. The need for tooling changes would delay adoption a little, so it's not a free decision, but I want to make sure we don't sacrifice future use cases and simplicity for an easier launch. |
I'm not sure that I see the tooling process as critical, relative to the robustness. And performance will be a wash (though I tend to view the first interaction as more important than repeated interactions).

For tooling, if this content is produced programmatically, then there should be no issue; integrated tooling can do anything. If content is produced and stored in files, then I don't see the definition of new media types as necessary as part of that. I don't see these files being used outside of HTTP, ever. Maybe you could teach the command-line decompression tools to recognize the format and complain in a useful way, but that's about the extent of the work I'd undertake there. You could do as @pmeenan suggests as well, which would be even more useful, but given the usage context, that's of pretty narrow applicability. |
I'm mostly concerned about official tooling and the latency of getting it to where developers need it to be. Compression dictionaries already require jumping through some custom hoops, due to the latency between brotli releases and adoption by the different package managers. At the same time, if y'all feel strongly about this, this extra initial complexity won't be a deal breaker. |
I think there are enough robustness benefits that it is worth some short-term pain that hopefully we will all forget about in a few years.

On the header side of things, how do you all feel with respect to a bare 32-byte hash vs a 35-byte header with a 3-byte signature followed by the hash (or a 3-byte signature followed by a 1-byte header size followed by the hash, to allow for changes as well as 4-byte alignment)?

It's possible I'm mentally stuck in the old days of sniffing content, but since the hash can literally be any value, including one that accidentally looks like something else, I like the explicit nature of a magic signature at the beginning of the stream. It essentially becomes a minimal, self-describing file format at that point. |
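(For concreteness, a sketch of how the three candidate prefixes would parse; the signature bytes and the exact meaning of the size byte are assumptions for illustration, not anything defined in the draft.)

```python
MAGIC = b"DCB"  # hypothetical 3-byte signature

def parse_bare(body: bytes):
    # Option 1: bare 32-byte hash, no signature
    return body[:32], body[32:]

def parse_signed(body: bytes):
    # Option 2: 3-byte signature + 32-byte hash (35-byte header)
    if body[:3] != MAGIC:
        raise ValueError("bad signature")
    return body[3:35], body[35:]

def parse_sized(body: bytes):
    # Option 3: 3-byte signature + 1-byte header size + 32-byte hash (36 bytes, 4-byte aligned);
    # the size byte lets the header grow in later revisions
    if body[:3] != MAGIC:
        raise ValueError("bad signature")
    header_size = body[3]
    return body[4:36], body[header_size:]
```

Each variant returns the dictionary hash and the remaining compressed stream.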
+1 to adding a 3-byte magic signature if that's the route we're taking. |
I'm ambivalent on the added signature, so I'll defer to others. I can see how it might help a command-line tool more easily distinguish between this and a genuine brotli- or zstd-compressed file and do the right thing with it. On the other hand, it's 3 more bytes and - if the formats themselves have a magic sequence - the same tools could equally skip 32 bytes and check for their magic there. |
I think I am having the same response as you all, in the opposite direction: to me it feels preferable to make it HTTP's problem so that my layer doesn't have to deal with additional complexity. 😃 But if I overcome that bias and accept that it would be nice to avoid an additional HTTP header, here's what I think:

If we used a Zstd skippable frame, that would change the stream overhead to 8 bytes (4-byte magic + 4-byte length) + the 32-byte hash. But it would mean that existing zstd decoders could consume the stream as-is, skipping over the frame.

And I've thought about it more and I'm actually not concerned about colliding a skippable frame type with someone else's existing use case. It would be more of a problem if we were trying to spec a universal solution, but if we scope this to just this content-encoding, then we're free to reserve whatever code points we want and attach whatever semantics we want. |
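(A sketch of how such a frame could be prepended, assuming one of the skippable-frame magic numbers in the 0x184D2A50-0x184D2A5F range were reserved for this; the specific code point here is a placeholder.)

```python
import struct

SKIPPABLE_MAGIC = 0x184D2A50  # placeholder; any of the 16 skippable-frame code points could be chosen

def prepend_hash_frame(dict_hash: bytes, zstd_stream: bytes) -> bytes:
    # 4-byte magic + 4-byte frame size (both little-endian) + 32-byte hash = 40 bytes of overhead
    assert len(dict_hash) == 32
    frame = struct.pack("<II", SKIPPABLE_MAGIC, len(dict_hash)) + dict_hash
    return frame + zstd_stream
```

An existing zstd decoder should skip that frame and decode the rest of the stream unchanged, which is the compatibility property being discussed here.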
Thanks. I guess the main question I have is whether there would be benefits to Zstandard itself in having the files be self-contained, carrying the identification of the dictionary they were encoded with, or whether the risk of mismatching dictionaries between compression and decompression (or the need to find the matching dictionary given just the compressed file) are issues that are HTTP-specific.

I'm fine with specifying that the encoded files carry the dictionary hash (before the compressed stream data) and having different ways for Zstandard and Brotli to do the actual embedding. That said, the tooling gets more complicated on the encode side, to generate the frame and insert it in the correct place in the compressed file, and on the decode side it makes the client more format-aware, having to parse more of the stream to extract the dictionary hash and then send the full stream through the decoder (at least until the decoder library becomes aware of the embedded dictionary hash). |
Is this really what you want in this case? The decoder needs to know where to find the dictionary, so wouldn't you want this to be a breaking change to the format, such that a decoder that has a dictionary is fine and a decoder that doesn't knows to go get one? ... And - importantly - an older decoder will abort. (I confess that I don't know what the zstd frame extension model is and didn't check.) |
I don't think we're contemplating a model where Zstd can ingest a frame, figure out on its own where to find the dictionary it needs, and then load it and use it. I expect that the enclosing application will parse this header, get the hash, and then find the dictionary and provide it to Zstd. The advantage of using the skippable frame is that you can then provide the whole input to Zstd (including existing versions) and it will work, rather than having to pull the header off first. |
One thing I realized now - by assuming that the hash length is 32 bytes, we're assuming the hash will remain SHA-256 forever. That might be how things play out, but it might also be the case that we'd need to change hashes at some point. If we were to do that, having a fixed-length hash as part of the format would make things more complex. |
Having a fixed-length hash (or fixed hash) as part of a content coding is perfectly fine. If there is a need to update hashes, it is easy to define a new content coding. |
@felixhandte the current PR doesn't use skippable frames; it uses the same custom header for Brotli and Zstandard (with different magic numbers). I can switch to using a skippable frame instead (which effectively just becomes an 8-byte magic number, since the frame length is always the same), but I'm wondering if it makes sense and is worth adding 4 bytes. It won't help in creating the files, so it's only useful at decode time, and the main benefit is that you can use the existing zstd cli and libraries to decode the stream without stripping the header; but those also won't verify the dictionary hash, they will just skip over it. That might not be a problem, but part of the decode process will be to fail the request if the hashes don't match. |
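(A sketch of that decode-time check, assuming the 8-byte skippable-frame header followed by a 32-byte SHA-256 hash as discussed above; the function name and error handling are illustrative.)

```python
import hashlib
import struct

def verify_embedded_hash(body: bytes, dictionary: bytes) -> bytes:
    # Parse the skippable-frame header: 4-byte magic + 4-byte length, both little-endian
    _magic, length = struct.unpack("<II", body[:8])
    if length != 32:
        raise ValueError("unexpected embedded-hash length")
    if body[8:40] != hashlib.sha256(dictionary).digest():
        raise ValueError("dictionary hash mismatch; the request should fail")
    # The full body (including the skippable frame) can then be handed to the zstd decoder,
    # which skips the frame on its own.
    return body
```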
In talking to the Brotli team, it looks like Brotli already embeds a hash of the dictionary and validates it during decode to select the dictionary to use. It uses a "256-bit Highwayhash checksum", so we can't use the hash to look up a dictionary indexed by SHA-256, but we can use it to guarantee that decompression doesn't use a different dictionary (and the existing libraries and cli tools already use it).

@martinthomson when you were concerned about the client identifying which dictionary was used by the server, was it for both lookup and validation or just validation? I'm wondering if we can use the existing brotli streams as they are, or if we should still add a header to support locating the dictionary by SHA-256 hash. |
There are a few places in HTTP where the interpretation of a response depends on something in the request. Every one of those turns out to be awful for generic processing of responses. That's my main reasoning, so I'd say both: lookup first, then validation.

That is, my expectation is not that the client has a singular candidate dictionary, or that it commits to one in particular. So in the original design, when you had the client pick one, that didn't seem like a great idea to me.

For validation, Highwayhash (NIH much?) doesn't appear to be pre-image resistant, so I'd be concerned if we were relying on the pre-image resistance properties of SHA-2. Can we be confident that this is not now subject to potential polyglot attacks if we relied on that hash alone? That is, could two clients think that they have the same resource, but do not? |
I wouldn't be comfortable switching hashes given the wider and proven use of SHA-256 and agree on the lookup case to allow content-encoding with different and multiple dictionary negotiations.

Looks like a separate header is still the cleanest, so the main remaining question is if we use the same style of header for both or a skippable frame for Zstandard.

I also need to update the brotli citations to point to the shared dictionary draft format instead of the original brotli format.
|
Sorry for the late comment. It seems to me that conveying this information in the content eases the integration with Signatures and Digest. It is not clear to me if there are still possible cases where the response does not contain all the information required for processing. Thanks |
This is a discussion we've had several times when defining content codings, but it seems like there is never really a single answer.
Should the content coding be self-describing, or can it rely on metadata in fields?
The compression dictionary work uses header fields to identify which compression dictionary is in use. Originally, the client would indicate a dictionary and the server would indicate use of that dictionary by choosing the content coding. This made interpreting the body of the response quite challenging in that you needed to have a request in order to make sense of it.
More recently, the specification has changed to having the client list available dictionaries, with the server echoing the one it chooses. Both use header fields.
There is a third option, which is to embed the dictionary identification (which is a hash of the dictionary) ahead of the compressed content. This has some real advantages:

- The response becomes self-describing: it can be interpreted without needing the request (or other header fields) for context.
- It is harder to accidentally decode content with the wrong dictionary, and the mapping from compressed files to dictionaries can be reconstructed from the files themselves.

It also comes with disadvantages:

- Existing brotli and zstd tooling cannot produce or consume the resulting files without changes, so new tooling (and possibly a format definition) is needed.
- It adds a few tens of bytes to every response.