
Proposal: Global Image/Layer Namespace #14049

Open
alex-aizman opened this issue Jun 19, 2015 · 10 comments

@alex-aizman
commented Jun 19, 2015

1. Terms

Global Namespace: often refers to the capability to aggregate remote filesystems via unified (file/directory) naming while at the same time supporting unmodified clients. Not to be confused with Linux kernel namespaces (pid, net, etc.) used by LXC.

2. sha256

Docker Registry V2 introduces content-addressable globally unique (*) digests for both image manifests and image layers. The default checksum is sha256.

Side note: sha256 covers a space of more than 10 ** 77 unique random digests, which is about as much as the number of atoms in the observable universe. Beyond this unimaginable size, sha256 has all the desirable cryptographic qualities, including collision resistance, the avalanche effect for small changes, pre-image resistance, and second pre-image resistance.

The same applies to sha512 and SHA-3 crypto-checksums, as well as, likely, Edon-R and Blake2 to name a few.

Those are the distinct properties that allow us to say the following: two docker images that have the same sha256 digest are bitwise identical; the same holds for layers and manifests or, for that matter, any other sha256 content-addressable "asset".

This simple fact can be used not only to self-validate images and index them locally via the Graph’s in-memory index. It can further be used to support a global container/image namespace and global deduplication. That is:

  • Global Namespace
  • Global Deduplication

for image layers. Hence, this Proposal.
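Deduplication falls out of content addressing almost for free. A minimal sketch, with a hypothetical digest-to-path index (names and paths are illustrative, not from the patch):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// layerIndex maps a layer's sha256 digest to where its bits are stored.
// Since layers are content-addressed, seeing a known digest again means
// the bits already exist and the new copy can be deduplicated away.
type layerIndex map[[32]byte]string

// add stores a layer unless an identical one is already indexed; it
// returns the canonical path and whether the layer was a duplicate.
func (idx layerIndex) add(content []byte, path string) (stored string, dedup bool) {
	d := sha256.Sum256(content)
	if existing, ok := idx[d]; ok {
		return existing, true // identical bits already stored
	}
	idx[d] = path
	return path, false
}

func main() {
	idx := layerIndex{}
	p1, dup1 := idx.add([]byte("layer bits"), "/var/lib/docker/node1/layer-a")
	p2, dup2 := idx.add([]byte("layer bits"), "/var/lib/docker/node2/layer-a")
	fmt.Println(p1, dup1) // /var/lib/docker/node1/layer-a false
	fmt.Println(p2, dup2) // /var/lib/docker/node1/layer-a true
}
```

The same idea scales from one daemon's Graph index to a cluster-wide namespace: the digest is the name, everywhere.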

3. Docker Cluster

The rest of this document describes only the initial implementation and the corresponding proof-of-concept patch:

The setup is a number (N >= 2) of hosts or VMs, logically grouped in a cluster and visible to each other through, for instance, NFS. Every node in the cluster runs the docker daemon, and each node performs a dual role: it is an NFS server to all other nodes, with its NFS share sitting directly on the node’s local rootfs, and simultaneously an NFS client, as per the diagram below:

[diagram: docker-namespace-federated]

Blue arrows reflect actual NFS mounts.

There are no separate NAS servers: each node, on one hand, shares its docker (layers, images) metadata and, separately, its driver-specific data. And vice versa, each node mounts all clustered shares locally, under the respective hostnames as shown above.
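The per-hostname mount layout can be sketched as follows (hostnames and the exact paths are hypothetical; the actual patch establishes the mounts before the daemon starts):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// mountCmds returns the NFS mount commands a given node would run to
// attach every peer's /var/lib/docker export under a per-hostname
// directory, mirroring the symmetric layout in the diagram.
func mountCmds(self string, nodes []string) []string {
	var cmds []string
	for _, peer := range nodes {
		if peer == self {
			continue // local rootfs; no NFS mount needed
		}
		export := fmt.Sprintf("%s:/var/lib/docker", peer)
		mountpoint := filepath.Join("/var/lib/docker", peer)
		cmds = append(cmds, fmt.Sprintf("mount -t nfs %s %s", export, mountpoint))
	}
	return cmds
}

func main() {
	// A hypothetical three-node cluster, from node1's point of view.
	for _, c := range mountCmds("node1", []string{"node1", "node2", "node3"}) {
		fmt.Println(c)
	}
}
```

Every node runs the equivalent loop, so each one both exports its local share and mounts all of its peers', with no dedicated NAS in the picture.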

Note: hyper-convergence

Oftentimes this type of depicted clustered symmetry, combined with the lack of a physically separate storage backend, is referred to as storage/compute "hyper-convergence". But that's another big story, outside this scope.

Note: runtime mounting

As far as this initial implementation (link above) goes, all NFS shares are mounted statically, prior to the daemon’s startup. This can be changed to on-demand mounting and more.

Back to the diagram. There are two logical layers: Graph (image and container metadata) and Driver (image and container data). The patch modifies both; the latter is currently done for aufs only.

4. Benefits

  • An orchestrator can run a container on an image-less node, without waiting for the image to be pulled
  • Scale-out: by adding a new node to the cluster, we incrementally add CPU, memory, and storage capacity for more docker images and containers that, in turn, can use the aggregated resources
  • Deduplication: any image or layer that exists in two or more instances can be, effectively, deduplicated. This may require pause/commit and restart of the associated containers; it will also require reference counting (next)

5. Comments

It's been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea. From the clustered perspective it is easy to see that it is definitely not a good idea; it makes sense to fork /var/lib/docker/graph/images and /var/lib/docker/graph/containers, or similar.

6. What’s Next

The patch works as it is, with the capability to “see” and run remote images. There are multiple next steps, some self-evident, others less so.

The most obvious one is to un-HACK aufs and introduce a new multi-rooted (suggested name: namespace) driver that would in turn be configurable to use the underlying OS’s aufs or overlayfs mount/unmount.

This is easy but this, as well as the other points below, requires positive feedback and consensus.

Other immediate steps include:

  • graph.TagStore to tag all layers including remote
  • rootNFS setting via .conf for Graph
  • fix migrate.go accordingly

Once done, next steps could be:

  • on demand mounting and remounting via distributed daemon (likely etcd)
  • node add/delete runtime support - same
  • local cache invalidation upon new-image-pulled, image-deleted, etc. events (“cache” here implies Graph.idIndex, etc.)
  • image/layer reference counting, to correctly handle remote usage vs. ‘docker rmi’ for instance
  • and more
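The reference-counting step could look something like the sketch below (types and names are illustrative, not from the patch): a layer's bits may be deleted only once no container, local or remote, references its digest.

```go
package main

import "fmt"

// refCounter tracks how many users, local and remote, reference a layer.
// With a global namespace, 'docker rmi' on one node must not delete bits
// that a container on another node is still using.
type refCounter struct {
	refs map[string]int // layer digest -> reference count
}

func newRefCounter() *refCounter {
	return &refCounter{refs: map[string]int{}}
}

// acquire records one more user of the layer.
func (rc *refCounter) acquire(digest string) {
	rc.refs[digest]++
}

// release drops one reference and reports whether the layer is now
// unreferenced anywhere and may be physically removed.
func (rc *refCounter) release(digest string) bool {
	rc.refs[digest]--
	if rc.refs[digest] <= 0 {
		delete(rc.refs, digest)
		return true
	}
	return false
}

func main() {
	rc := newRefCounter()
	rc.acquire("sha256:abc") // local container starts
	rc.acquire("sha256:abc") // a remote node runs the same layer
	fmt.Println(rc.release("sha256:abc")) // false: still in use remotely
	fmt.Println(rc.release("sha256:abc")) // true: safe to delete
}
```

In a clustered setting the counter itself would have to be distributed, which is why the list above pairs it with a distributed daemon such as etcd.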

And later:

  • shadow copying of read-only layers, to trade local space for performance
  • and vice versa, removal of duplicated layers (the “dedup”)
  • container inter-node migration
  • container HA failover
  • object storage as the alternative backend for docker images and layers (which are in fact immutable versioned objects, believe it or not).

Some of these are definitely beyond just the docker daemon and would require API and orchestrator (cluster-level) awareness. But that’s, again, outside the scope of this proposal.

7. Instead of Conclusion

In the end, the one thing that makes all of the above doable and feasible is the immutable nature of image layers and their unique, global naming via crypto-content-hashes.

@thaJeztah

Member

commented Jul 26, 2015

ping @stevvooe @dmcgowan (I think)

@dmcgowan

Member

commented Jul 27, 2015

I agree on the direction of the proposal, but we currently have different plans for how to get there, although we are still trying to plan out the immediate steps (for Docker 1.9) related to the graph driver. We want a significant code refactor both to separate the graph store more cleanly from the tag store and to break the graph store down into an object store and a layer store. I could see such a broken-out layer interface supporting an implementation for clustering based on NFS.

I would love to include you in these discussions, as it is very common for code in this area to slip a release due to focus being shifted elsewhere. It is becoming more and more a focal point, though, for distribution-related problems. @stevvooe is the right person to continue the discussion with. I would also take a look at https://github.com/docker/blobber which addresses the problem from a different angle.

@thaJeztah

Member

commented Jul 27, 2015

@dmcgowan I think https://github.com/docker/blobber is currently "private", because I get a 404 there

@dmcgowan

Member

commented Jul 27, 2015

Ahh yeah, it's private, a late-night oversight. Thanks for keeping me honest @thaJeztah 😄. I just wanted to show the design objectives in the README; let me see if we can get that into another document.

@stevvooe

Contributor

commented Jul 28, 2015

@alex-aizman What exactly are you proposing and what specific problems does this solve?

From an initial reading, it sounds like the proposal is to integrate NFS mounts into docker image storage to leverage better content sharing. We would likely never require people to configure NFS as part of a docker install. Aside from being a leaky abstraction, NFS is a nasty single point of failure without a lot of caveats and spotty support.

This seems like an interesting operational layout that can be supported by providing a sane path layout under /var/lib/docker.

It's been noted in the forums and elsewhere that mixing images and containers in the Graph layer is probably not a good idea.

This current constraint makes a lot of this work much harder. If we can externalize actual image storage from the graph driver, we make a lot of these problems easier. We are working on a project to make this easier.

@LK4D4

Contributor

commented Sep 15, 2016

@stevvooe @dmcgowan did we implement this differently? Is this still relevant?

@stevvooe

Contributor

commented Sep 15, 2016

@LK4D4 This is an ambitious proposal. If we could divide the problems and solutions, we may be able to make it a little more actionable. There is still likely work to be done to allow a cluster of machines to share "at-rest" image storage.

@alex-aizman

Author

commented Sep 15, 2016

The steps on the tech side of things are very clear. There's this key concept, call it "centralized repository of immutable layers". The stuff can be designed around this concept, and 'docker pull', 'docker run' and friends will have to be changed accordingly. NFS of course must be one of the transport choices, etc. The works.

@stevvooe

Contributor

commented Sep 15, 2016

@alex-aizman In the patches provided, I don't really see anything NFS-specific except for a reference to some sort of shared root folder. I think separating the pull-cache, artifact storage, and other paths carefully would have the same effect.

However, there is something to be said about data locality. If one starts up the same image simultaneously across several cluster nodes, the NFS server will have to serve up that hot set nearly every time (depending on cache configuration). I can't see this performing much better than pulling from a central registry. In such scenarios, repeating data across disks gives you much better IO scaling (i.e., broadcast or p2p).

A static artifact store (just a filesystem path, really) that can be shared via arbitrary protocol would probably be a more scalable approach. This could be shared with NFS or bittorrent or anything.

@alex-aizman

Author

commented Sep 15, 2016

The patch is more than a year old, and I'd suggest moving beyond this particular patch at this point, to the original motivation that caused it in the first place, which back then and today remains the same: duplication. It is a shame to keep duplicating the same immutable bits. As far as NFS, see e.g. my text at storagetarget.com. NFS is just a standard and ubiquitous storage transport today. Nobody in their right mind will say it is optimal for docker image layers, etc. But NFS exists, it is totally prevalent in the world of file storage, and it therefore must be designed in...
