Docker's layer content hashing scheme doesn't follow the canonicalization rules #27970
Comments
ping @tonistiigi |
ping @stevvooe |
@ixmatus Thanks for the report! If we look at the struct that generated this json, The main problem here is the fundamental defect of CAS applications that require round trip serialization. Even when you think you've handled every possible source of instability, there is always something that is missed. Keeping this consistent from version to version is hard enough, let alone across implementations. In this case, we did due diligence on specifying the use of canonical json, but didn't ensure that we used it everywhere. The effect is that given any change in the output, the entire image itself is a different image. Fortunately, this is okay. We address this problem by ensuring that this json is only ever generated once. We also have tests that ensure that from version to version, the generated json is as consistent as possible. In addition, content is verified at the byte level, rather than at the structure level, to ensure that we can consume content generated with a different ordering or standard. However, if we do change this, we end up with incompatibility of image ids between versions. Is there a particular problem you're trying to address? |
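To illustrate the byte-level point above, here is a small Python sketch. It is not Docker code; the sample JSON and the `digest` helper are made up for the example. It shows two byte strings that decode to the same JSON value yet hash differently, which is why verifying the stored bytes is more robust than verifying a re-serialized structure.

```python
import hashlib
import json


def digest(raw: bytes) -> str:
    """Return a sha256 digest string over the exact bytes given."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


# Hypothetical JSON as first written by some producer (keys not sorted).
original = b'{"b":1,"a":{"d":2,"c":3}}'

# The same value after a parse/serialize round trip with recursive key sorting.
reserialized = json.dumps(
    json.loads(original), sort_keys=True, separators=(",", ":")
).encode()

same_structure = json.loads(original) == json.loads(reserialized)
same_digest = digest(original) == digest(reserialized)

print(same_structure, same_digest)
```

The structures compare equal while the digests do not, so any identifier derived from the bytes changes as soon as the content passes through a serializer with different ordering rules.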
@stevvooe This came up because I've written a Haskell CLI utility to download, from docker-distribution, a container's layers and config JSON and assemble those artifacts into a docker image V1 tar archive that can be loaded by a daemon. This meant I needed to re-implement the Chain ID and Content ID hashing schemes (it would be nice if these were documented in the V1 image spec, too) implemented in Golang by Docker in order to assemble the fetched artifacts correctly, because I couldn't find a way to fetch any of that information from the registry (as I think you could in the V1, Python-based registry). It doesn't appear to break anything if I load an image produced by my tool, but it's an unfortunate inconsistency because it means the hashes that are produced are idiosyncratic to the Go language environment, if I'm understanding you correctly. So, what I fetch and construct as an image and can load into the daemon will not be the same thing that the daemon produces when saving the same loaded image. That is unless I preserve the non-canonical sub-object key ordering (which is unspecified and subject to change based on decisions made in Golang's |
@ixmatus Make sure you are declaring to the registry that you accept schema2 manifests by including In general, when you download the content, you don't need to ever process the content through a json parser to calculate the various identifiers, other than to extract hash material for ChainID. Expecting json to round trip is a practice in futility 🐹. Just save the bytes, deserialize the portion you need for the id, then write the bytes to their destination. If you could show me the code, I may be able to point out the issue. Note: it is not called Golang. |
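A minimal sketch of the "save the bytes, deserialize only what you need" approach, in Python. The function name is hypothetical, and it assumes the downloaded config is a Docker >= 1.10 style image config that lists layer digests under `rootfs.diff_ids`, with the image ID taken as the sha256 of the config bytes. Treat both as assumptions to check against the image spec rather than as the daemon's implementation.

```python
import hashlib
import json


def identify_config(config_bytes: bytes):
    """Hash the config exactly as downloaded and read only the layer digests.

    The bytes are never re-serialized: the caller writes `config_bytes`
    to its destination unchanged.
    """
    image_id = "sha256:" + hashlib.sha256(config_bytes).hexdigest()
    parsed = json.loads(config_bytes)
    # Assumed location of the per-layer digests in the config JSON.
    diff_ids = parsed.get("rootfs", {}).get("diff_ids", [])
    return image_id, diff_ids


# Made-up config bytes, for illustration only.
sample = b'{"rootfs":{"type":"layers","diff_ids":["sha256:aa","sha256:bb"]}}'
sample_id, sample_diff_ids = identify_config(sample)
print(sample_id, sample_diff_ids)
```

Because the parsed value is used only to read `diff_ids`, key ordering inside the config never affects the computed identifier.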
This is a good suggestion, but note that you will only get such a manifest if a sufficiently recent version of Docker (>= 1.10) pushed one in this format. The registry doesn't convert from schema1 to schema2, to avoid affecting manifest digests. In general, the need to create your own config JSON only arises with old-style schema1 manifests. This is an unavoidable consequence of the transition from the old, non-content-addressable scheme. But the current manifest format avoids the need to create JSON and the canonicalization frustrations that come with that. |
@stevvooe yes I specifically download the V2 manifest because that points at the image's config JSON. It's sufficiently safe for me to do so because we only push from Docker (>= 1.10) to our private registry. The tool downloads the image config and layers referenced in that manifest. Both artifacts are not enough to assemble a docker image though because the layer directory's name (which is a hash) is not described by the manifest or the image config json. The json file within each of those directories contains an In order to produce something that resembles what Is there a method I do not know about to get those values without the need to derive them?
I would be happy to share it once I have permission from my company to release it with an OSS license.
I may be misunderstanding how Docker is attempting to tackle addressable content hashes. Also, the problem from what I can tell is not specifically with the image's config JSON. I do in fact do as you're saying, but the image config JSON has nothing to do with any of the layers except for the top-most layer (at least from what I understand of what's going on). The problem, for me at least, is in generating the empty JSON config files which are used to produce the Content IDs for each layer using the Chain IDs.
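For readers following along, the Chain ID derivation mentioned here can be sketched in Python as below. This follows the recurrence given in the v1.2 image spec (the first layer's Chain ID is its DiffID; each later one is the sha256 of the previous Chain ID, a space, and the layer's DiffID). The function name and the sample digests are hypothetical, and the sketch does not cover the Content ID step, which depends on the exact JSON the daemon generates.

```python
import hashlib


def chain_ids(diff_ids):
    """Return the Chain ID for each layer, bottom-most layer first."""
    chain = []
    for index, diff_id in enumerate(diff_ids):
        if index == 0:
            # The bottom layer's Chain ID is simply its DiffID.
            chain.append(diff_id)
        else:
            # Hash "<parent chain id> <diff id>", digests written with
            # their "sha256:" prefix.
            material = (chain[-1] + " " + diff_id).encode("ascii")
            chain.append("sha256:" + hashlib.sha256(material).hexdigest())
    return chain


# Made-up DiffIDs, for illustration only.
example = chain_ids(["sha256:" + "a" * 64, "sha256:" + "b" * 64])
print(example)
```

Only strings are hashed here, so no JSON serialization is involved at this stage.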
Sure. I wrote it that way because
@aaronlehmann I may not understand fully what is going on. I could not find any other way of getting the same hash; I even tried a few different |
@ixmatus: Have a look at https://github.com/docker/docker/blob/master/image/spec/v1.2.md#combined-image-json--filesystem-changeset-format This builds on top of the old |
@aaronlehmann interesting...it appears I'm doing both! Thank you for clarifying this for me. |
Looks like this question is answered, so I'll close the issue for housekeeping, but feel free to continue the discussion |
@aaronlehmann @stevvooe FYI I have this working now with the feedback you provided me. So thank you. Something we've now run across, which isn't entirely unexpected, is the time it takes to I don't fully understand Docker's storage model though and how deeply intertwined the stateful storage (devicemapper, etc.) functionality is with storage of image layers... Could you perhaps provide me with some guidance on what to look at or read? |
I'm reading this, and it's helping: https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/ |
@ixmatus https://medium.com/microscaling-systems/spot-the-docker-difference-9f99adcc4aaf#.jrhc8mp2k may provide some more insight. To build the level of understanding you're looking for, I'd recommend reading the code. When analyzing the docker code base, I generally start from the handlers. The pull handler can be understood from https://github.com/docker/docker/blob/master/daemon/image_pull.go#L76. From there, focus on the interaction with the image store. Many of the design constraints will come to light with an understanding of interactions present there. Let me know if you need further pointers. |
Description
When saving an image to the filesystem, Docker computes a Content ID hash for the layer using a Chain ID injected into a JSON object with null or empty keys. In cases where the layer has a parent, the parent's Content ID is also injected. The top-level keys of this object follow Docker's rules for canonicalized JSON in that they are lexically sorted; however, this property is not applied recursively to the keys of the sub-object produced by serializing the empty struct datatype.
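The inconsistency described above can be reproduced in miniature with the Python sketch below. Every key name in `doc` is a hypothetical stand-in, not the object Docker actually serializes. The sketch compares a serialization that sorts only top-level keys (nested keys stay in declaration order) against one that sorts keys recursively, and shows that the resulting hashes differ.

```python
import hashlib
import json

# Hypothetical layer object: nested keys are in declaration order, not sorted.
doc = {
    "parent_id": "",
    "chain_id": "sha256:" + "0" * 64,
    "settings": {"Hostname": "", "Domainname": "", "User": ""},
}


def top_level_sorted(obj: dict) -> bytes:
    """Sort only the outermost keys; nested objects keep their own order."""
    ordered = {key: obj[key] for key in sorted(obj)}
    return json.dumps(ordered, separators=(",", ":")).encode()


def fully_canonical(obj: dict) -> bytes:
    """Sort keys at every nesting level."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


partial_hash = hashlib.sha256(top_level_sorted(doc)).hexdigest()
canonical_hash = hashlib.sha256(fully_canonical(doc)).hexdigest()
print(partial_hash == canonical_hash)
```

An independent implementation that canonicalizes recursively would therefore compute a different Content ID unless it reproduces the nested key order exactly.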