Proposal: Global Image/Layer Namespace #14049
Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while simultaneously supporting unmodified clients. Not to be confused with Linux kernel namespaces (pid, net, etc.) as used by LXC and containers.
Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.
Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as many as the number of atoms in the observable universe. Beyond this unimaginably large space, sha256 has all the desirable cryptographic qualities, including collision resistance, the avalanche effect for small changes, pre-image resistance, and second pre-image resistance.
The same applies to the sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2, to name a few.
Those are the distinct properties that allow us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable "asset".
This simple fact can be used not only to self-validate images and index them locally via the Graph’s in-memory index; it can also be used to support a global container/image namespace and global deduplication.
3. Docker Cluster
The rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:
The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs the docker daemon, and each node performs a dual role: it is an NFS server to all other nodes, with its NFS share sitting directly on the node’s local rootfs; simultaneously, each node is an NFS client, as per the diagram below:
Blue arrows reflect actual NFS mounts.
There are no separate NAS servers: each node, on one hand, shares its docker metadata (layers, images) and, separately, its driver-specific data. And vice versa: each node mounts all clustered shares locally, under the respective hostnames as shown above.
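Concretely, the cross-mounting could look like the following fstab-style fragment on one node of a 3-node cluster (the hostnames, mount points, and sub-paths here are purely illustrative, not taken from the patch):

```shell
# Hypothetical /etc/fstab entries on node "host1":
# every peer's /var/lib/docker share is mounted under /mnt/cluster/<hostname>
host2:/var/lib/docker  /mnt/cluster/host2  nfs  defaults,ro  0  0
host3:/var/lib/docker  /mnt/cluster/host3  nfs  defaults,ro  0  0
```

Each peer, in turn, carries the symmetric pair of entries for the other two nodes.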
Oftentimes this type of clustered symmetry, combined with the lack of a physically separate storage backend, is referred to as storage/compute "hyper-convergence". But that's another big story, outside this scope.
Note: runtime mounting
As far as this initial implementation goes (link above), all NFS shares are mounted statically, prior to the daemon’s startup. This can be changed to on-demand mounting and more.
Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). The patch modifies both; the latter part is currently done for aufs only.
It's been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea: it makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.
6. What’s Next
The patch works as is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident, others less so.
The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would in turn be configurable to use the underlying OS's aufs or overlayfs mount/unmount.
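Such a multi-rooted driver might look roughly like the sketch below. All names and the path layout are purely illustrative assumptions, not Docker's actual graphdriver API; the point is only that layer lookup fans out across a local root plus the remote (e.g. NFS-mounted) roots:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Backend selects the union filesystem used for mounting; hypothetical.
type Backend string

const (
	AUFS      Backend = "aufs"
	OverlayFS Backend = "overlayfs"
)

// NamespaceDriver is an illustrative multi-rooted driver: it resolves a
// layer ID against several roots, local first, then the clustered mounts.
type NamespaceDriver struct {
	Backend Backend
	Roots   []string // e.g. /var/lib/docker plus /mnt/cluster/<host> mounts
}

// Resolve returns the candidate on-disk paths for a layer, in lookup order.
func (d *NamespaceDriver) Resolve(layerID string) []string {
	paths := make([]string, 0, len(d.Roots))
	for _, root := range d.Roots {
		// The "aufs/diff" sub-path mirrors the aufs driver's layout;
		// a real driver would derive it from d.Backend.
		paths = append(paths, filepath.Join(root, "aufs", "diff", layerID))
	}
	return paths
}

func main() {
	d := &NamespaceDriver{
		Backend: OverlayFS,
		Roots:   []string{"/var/lib/docker", "/mnt/cluster/host2"},
	}
	for _, p := range d.Resolve("abc123") {
		fmt.Println(p)
	}
}
```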
This is easy but this, as well as the other points below, requires positive feedback and consensus.
Other immediate steps include:
Once done, next steps could be:
Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.
7. Instead of Conclusion
In the end, the one thing that makes all of the above doable and feasible is the immutable nature of image layers and their unique, global naming via cryptographic content hashes.
I agree with the direction of the proposal, but we currently have different plans for how to get there, and we are still trying to plan out the immediate steps (for Docker 1.9) related to the graph driver. We want both a significant code refactor to more cleanly separate the graph store from the tag store, and to break the graph store down into an object store and a layer store. I could see such a broken-out layer interface supporting an implementation for clustering based on NFS.
I would love to include you in these discussions, as it is very common for code in this area to slip a release due to focus being shifted elsewhere. It is becoming more and more a focal point, though, for distribution-related problems. @stevvooe is the right person to continue the discussion with. I would also take a look at https://github.com/docker/blobber which addresses the problem from a different angle.
@alex-aizman What exactly are you proposing and what specific problems does this solve?
From an initial reading, it sounds like the proposal is to integrate NFS mounts into docker image storage to leverage better content sharing. We would likely never require people to configure NFS as part of a docker install. Aside from being a leaky abstraction, NFS is a nasty single point of failure, with a lot of caveats and spotty support.
This seems like an interesting operational layout that can be supported by providing a sane path layout under
This current constraint makes a lot of this work much harder. If we can externalize actual image storage from the graph driver, we make a lot of these problems easier. We are working on a project to make this easier.
The steps on the tech side of things are very clear. There's this key concept, call it "centralized repository of immutable layers". The stuff can be designed around this concept, and 'docker pull', 'docker run' and friends will have to be changed accordingly. NFS of course must be one of the transport choices, etc. The works.
@alex-aizman In the patches provided, I don't really see anything NFS-specific except for a reference to some sort of shared root folder. I think separating the pull-cache, artifact storage and other paths carefully would have the same effect.
However, there is something to be said about data locality. If one starts up the same image simultaneously across several cluster nodes, the NFS server will have to serve up that hot set nearly every time (depending on cache configuration). I can't see this performing much better than pulling from a central registry. In such scenarios, replicating data across disks gives you much better IO scaling (i.e., broadcast or p2p).
A static artifact store (just a filesystem path, really) that can be shared via arbitrary protocol would probably be a more scalable approach. This could be shared with NFS or bittorrent or anything.
The patch is more than a year old, and I'd suggest moving beyond this particular patch at this point, to the original motivation that caused it in the first place, which back then and today remains the same: duplication. It is a shame to keep duplicating the same immutable bits. As far as NFS goes, see e.g. my text at storagetarget.com. NFS is just a standard and ubiquitous storage transport today. Nobody in their right mind would say it is optimal for docker image layers, etc. But NFS exists, it is totally prevalent in the world of file storage, and it therefore must be designed in...