[Draft Proposal] Swarm Cluster Volume Support with CSI Plugins #39624

dperny opened this issue Jul 29, 2019 · 3 comments




@dperny dperny commented Jul 29, 2019

This is a very early draft of a proposal I've been bouncing around and trying to drum up support for over the past few months. It's a lot of work, and probably the biggest feature suggested for Swarm since Swarm, but it would put to bed the persistent trouble of persistent storage in a Swarm cluster.

Because this is an early draft, comments of all natures are welcome. Hopefully in the next couple of months, we can work up the critical mass to actually build this.

Swarmkit Cluster Volume Support with CSI Plugins


SwarmKit is the native cluster orchestrator in the Docker container platform and is widely used in production. Some of the details on using Docker volume drivers are captured here:


“Container Storage Interface” (CSI) is an industry standard that enables storage vendors to develop a plugin once and have it work across a number of container orchestration (CO) systems. The CSI spec is available here.

The CSI document defines two terms that are relevant to the swarmkit implementation. The first, “Container Orchestrator” (“CO”) is obvious; swarmkit is a container orchestrator. There is additionally a “Plugin Supervisor”, which is defined as the “Process which governs the lifecycle of the plugin, MAY be the CO”.


Storage support in the docker platform is currently provided using the docker volume driver. Docker volume drivers allow docker engines to be extended to support various storage platforms. The Docker volume driver uses the plugin system in Docker. This approach has shortcomings due to lack of API richness as well as lack of orchestrator integration with swarmkit. The goal of this document is to propose storage related improvements in the docker platform, primarily by integrating with CSI and secondarily by building storage awareness in Swarm.

Importantly, this design should not be seen as “supporting CSI plugins in Swarm”. While the CSI spec and plugins form the underpinnings of this design, its scope goes beyond simply dumb support; and, at the same time, the full range of CSI plugin capabilities is not included.

This document should include the full scope of behavior for cluster volumes, and there should be no open questions or undefined behavior before work proceeds.

Goals


  • Provide a design for allowing Swarmkit to support CSI Plugins in the role of Container Orchestrator.
  • Provide support for assigning and scheduling CSI volumes with services.
  • Provide guidance on what other components of the platform are needed to fully support CSI Plugins natively and seamlessly in the Docker platform.

Non-Goals


  • Provide a detailed design for allowing Swarmkit to support CSI Plugins in the role of Plugin Supervisor. This work is important, but sufficiently divorced from this subject matter that it shall be addressed in another document.
  • Provide, as with overlay networking, a working, basic, opinionated default implementation for volumes.
  • Support for assigning volumes based on resource requirements and limits.
  • Support for the following CSI Controller Service Capabilities:
  • Support for the following CSI Node Service Capabilities:

Overview of CSI

CSI, the Container Storage Interface, is a standard API defined for storage providers to expose volumes to a cluster orchestrator. The target audience of the CSI spec is storage providers, not container orchestrators. This means that container orchestrators must work backwards from the spec, seeing what functionality the storage providers expose and what behavior they expect. However, the document is not entirely without guidance for container orchestrators, and does provide a number of behaviors that orchestrators should and must implement.

The CSI spec defines two kinds of plugins that container orchestrators are expected to support.

The first, “Node Plugins”, runs only on the node on which the volume is meant to be published. These plugins are most analogous to Docker Volume plugins, and support for them is the easiest to implement, requiring no intervention on the manager side.

The second type, “Controller Plugins”, is slightly more complicated. Controller Plugins can be run anywhere, including outside of the cluster entirely. Controller Plugins can be used to do things like manage distributed volumes, but do not have to do so.

Comparison of Docker Volume Plugins to CSI Plugins

In order to fully understand how CSI plugins fit in with the existing Docker ecosystem, one must first understand how they compare to the existing Docker Volume Plugins. The two specs take up a similar but slightly different niche, and CSI is the more complicated of the two.

The spec for Docker Volume plugins defines a basic set of “CRUD” actions: Create, Remove, Mount, Path, Unmount, Get, List, and Capabilities are the only possible endpoints. Additionally, these actions are all defined over a REST interface, and the values expected are JSON. The Capabilities endpoint is optional, and the only supported capability according to the spec is “Scope”. The “Scope” capability allows a Volume driver to specify that it functions over an entire cluster, but otherwise the Volume Plugin spec is devoid of concerns related to orchestrated environments.
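To make the JSON-over-REST style concrete, here is a minimal sketch of how a caller might read the “Scope” capability from a plugin's Capabilities response. The Go type and function names are invented for illustration, and the fallback to "local" when no scope is reported is an assumption about the default, not something this proposal specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// capabilitiesResponse mirrors the JSON body returned by a volume plugin's
// Capabilities endpoint. The Go type name is illustrative only.
type capabilitiesResponse struct {
	Capabilities struct {
		Scope string `json:"Scope"`
	} `json:"Capabilities"`
}

// parseScope extracts the Scope capability from a response body. Because the
// endpoint is optional, an absent scope falls back to "local" (an assumption).
func parseScope(body []byte) (string, error) {
	var resp capabilitiesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Capabilities.Scope == "" {
		return "local", nil
	}
	return resp.Capabilities.Scope, nil
}

func main() {
	scope, err := parseScope([]byte(`{"Capabilities": {"Scope": "global"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(scope)
}
```

A single string is all the orchestration-related information a Docker Volume plugin can convey, which is the gap the CSI lifecycle RPCs are meant to fill.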

In contrast, the CSI spec was designed first and foremost for clustered environments. It defines not only actions related to how the end-node manages the volume, but provides a set of RPCs providing an entire lifecycle with various in-progress hooks for volumes. The CSI spec in full defines dozens of different gRPC methods.

It may be possible to create a shim allowing Docker Volume plugins to act as CSI plugins, assuming that the existing Docker Volume plugins expose a subset of CSI behavior. This would allow Swarmkit and the Docker Engine to use one unified code-path (the CSI path) to handle volume plugins of either type. The creation of such a shim is out of scope of this design, but the possibility of its creation is quietly taken into consideration with the design in this document.

General Behavior

Volume Assignment

A Service requests one or more Volumes by name or group which are used for the Task. Templating is not included for volumes. Instead, the concept of “groups” will be used to assign different volumes to different tasks.

Volumes requested will not be reserved specifically for the Service or Tasks, beyond the lifespan of the Task using the Volume. If a Volume is only accessible to one Task at a time, but multiple services are requesting the same volume, then when the Task currently using the Volume dies, the Volume is immediately released and another Task in another Service entirely may acquire it. This is in contrast with the "Dynamic Reservation" case, which is described in the section on Unsupported Use Cases.

Accessible Topology

Volumes have the capability of being created with an “Accessible Topology”, which defines a set of nodes from which volumes should be accessible. The accessible topology of a volume will be considered a scheduling constraint on the service. If no node meets both the constraints and the accessible topology of the required volume, then the behavior will be the same as if a service is scheduled where no node meets a constraint.

When creating a volume, the user can specify two values: the requisite topology and the preferred topology. The requisite topology is a list of topologies, one of which must be used for the volume. The preferred topology is a subset of the requisite topologies which should be preferred, in order, for volume placement. These express the user's preference for topology. When a volume is created, part of the response contains the accessible topology, which reflects the reality of where the volume can be accessed, and is the part that is actually taken into account for scheduling decisions.
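As a rough sketch of how the accessible topology could act as a scheduling constraint, the function below checks whether a node's reported topology satisfies at least one of a volume's accessible topologies. The `Topology` type and `nodeCanAccess` are hypothetical names, and treating an empty accessible topology as "reachable from every node" is an assumption made for the example.

```go
package main

import "fmt"

// Topology maps topology domains (such as "zone") to segments (such as
// "zone3"). This is a simplified, hypothetical stand-in for the CSI type.
type Topology map[string]string

// nodeCanAccess reports whether a node located at nodeTopo can reach a volume
// with the given accessible topologies. The node qualifies when every segment
// of at least one accessible topology matches the node's own segments.
// Assumption: an empty list means the volume is reachable from any node.
func nodeCanAccess(nodeTopo Topology, accessible []Topology) bool {
	if len(accessible) == 0 {
		return true
	}
	for _, t := range accessible {
		match := true
		for domain, segment := range t {
			if got, ok := nodeTopo[domain]; !ok || got != segment {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	volume := []Topology{{"zone": "z1"}, {"zone": "z2"}}
	fmt.Println(nodeCanAccess(Topology{"zone": "z1", "rack": "r9"}, volume))
	fmt.Println(nodeCanAccess(Topology{"zone": "z3"}, volume))
}
```

A scheduler would apply a check like this alongside the service's ordinary placement constraints, so a node has to pass both to be a candidate.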

Volume Sharing

The CSI spec defines several access modes for volumes, which define on how many nodes and by how many writers a volume can be used simultaneously. These are defined in the CSI spec as an enum of several values. To both simplify the UI for selecting access modes, and to additionally provide more granularity of access at the swarm level, these access modes will be selected via the matrix of two parameters: Sharing and Scope. Sharing defines how a particular volume may be used by tasks. Scope defines how many nodes a volume is available on at the same time. Note that even a volume with scope “single” might be publishable on many different nodes, depending on the plugin’s implementation, but it will only ever be in use on one node at a time.

| Sharing   | Scope single | Scope multi |
|-----------|--------------|-------------|
| none      | Equivalent to SINGLE_NODE_WRITER, but swarmkit will enforce only 1 user of the volume at a time. | Not a valid access mode |
| readonly  | Equivalent to SINGLE_NODE_READER_ONLY | Equivalent to MULTI_NODE_READER_ONLY |
| onewriter | Equivalent to SINGLE_NODE_WRITER, but swarmkit will enforce only 1 writer and mount anyone after the first as read-only | Equivalent to MULTI_NODE_SINGLE_WRITER |
| all       | Equivalent to SINGLE_NODE_WRITER, but swarmkit will allow unlimited sharing. | Equivalent to MULTI_NODE_MULTI_WRITER |
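The matrix above can be read as a lookup from (Sharing, Scope) to a CSI access mode. The sketch below encodes that lookup with plain strings so that it does not depend on the CSI Go bindings; the type names, constants, and `csiAccessMode` are illustrative and not part of the proposed API.

```go
package main

import (
	"errors"
	"fmt"
)

// Scope and Sharing are hypothetical Go counterparts of the two parameters.
type Scope string
type Sharing string

const (
	ScopeSingle Scope = "single"
	ScopeMulti  Scope = "multi"

	SharingNone      Sharing = "none"
	SharingReadOnly  Sharing = "readonly"
	SharingOneWriter Sharing = "onewriter"
	SharingAll       Sharing = "all"
)

var errInvalidMode = errors.New("invalid access mode")

// csiAccessMode returns the name of the CSI access mode for a combination of
// sharing and scope. For scope "single", the extra limits on users or writers
// are enforced by swarmkit, not by the CSI mode, so three sharing values map
// to the same mode.
func csiAccessMode(sharing Sharing, scope Scope) (string, error) {
	switch scope {
	case ScopeSingle:
		switch sharing {
		case SharingNone, SharingOneWriter, SharingAll:
			return "SINGLE_NODE_WRITER", nil
		case SharingReadOnly:
			return "SINGLE_NODE_READER_ONLY", nil
		}
	case ScopeMulti:
		switch sharing {
		case SharingReadOnly:
			return "MULTI_NODE_READER_ONLY", nil
		case SharingOneWriter:
			return "MULTI_NODE_SINGLE_WRITER", nil
		case SharingAll:
			return "MULTI_NODE_MULTI_WRITER", nil
		}
	}
	// Covers sharing "none" with scope "multi" and any unknown value.
	return "", errInvalidMode
}

func main() {
	mode, err := csiAccessMode(SharingOneWriter, ScopeMulti)
	fmt.Println(mode, err)
}
```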

Volume Groups

Swarmkit Volumes will include an extra field, “group”, which defines a set of volumes that should be considered interchangeable to the orchestrator. This allows the user to specify several volumes which can all be used by tasks. This avoids complicated templating logic for more advanced use cases by putting the onus for naming and placement on volume creation. “Group” is, essentially, a specially privileged label, and nothing about setting a group requires that volumes in the group be in any way identical. It would be best practice to use one group per service, but nothing will enforce this behavior.

In the interest of avoiding the hazard-fraught realm of checking if group name is an empty string, the empty string shall be considered a valid group name, and all volumes without a group name set will belong to that same group.

For the probably common use case of creating a service where each task is bound to a particular volume, it would be sufficient to create a group of volumes, each of which has scope “single” and sharing “none”, whose number is equal to the number of replicas of the service.

There is a weird interaction here with the concept of accessible topology. The accessible topologies of the volumes in a group are not required to be the same. If two services with different placement constraints use the same volume group, one service may take a subset of the volumes such that none of the remaining volumes has an accessible topology compatible with the other service's placement constraints.

For example:

  • Volumes are available on nodes A, B, C, and D
  • Service 1 is constrained to nodes A, B, C, and D
  • Service 2 is constrained to nodes A and B
  • If Service 1 is assigned volumes on nodes A and B, Service 2 cannot be scheduled
  • If Service 1 is assigned volumes on nodes C and D, Service 2 can still be scheduled on A and B.

This is why it is recommended that a volume group be used by no more than one service.
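The example above can be reproduced with a small simulation. It assumes, purely for illustration, that the group holds one volume per node, that each volume is reachable only from its own node, and that each volume can be used by one task at a time; the `volume` type and helper names are hypothetical.

```go
package main

import "fmt"

// volume is a simplified group member: reachable from exactly one node and
// usable by a single task at a time (both are simplifying assumptions).
type volume struct {
	node  string
	inUse bool
}

// canSchedule reports whether any free volume in the group is reachable from
// one of the nodes the service is constrained to.
func canSchedule(group []volume, allowedNodes map[string]bool) bool {
	for _, v := range group {
		if !v.inUse && allowedNodes[v.node] {
			return true
		}
	}
	return false
}

// groupAfter builds a group with one volume on each of nodes A through D and
// marks the volumes on takenNodes as already claimed by Service 1.
func groupAfter(takenNodes ...string) []volume {
	taken := make(map[string]bool)
	for _, n := range takenNodes {
		taken[n] = true
	}
	var group []volume
	for _, n := range []string{"A", "B", "C", "D"} {
		group = append(group, volume{node: n, inUse: taken[n]})
	}
	return group
}

func main() {
	service2 := map[string]bool{"A": true, "B": true}
	fmt.Println(canSchedule(groupAfter("A", "B"), service2)) // false
	fmt.Println(canSchedule(groupAfter("C", "D"), service2)) // true
}
```

Whether Service 2 can run depends entirely on which volumes Service 1 happened to receive, which is the hazard the one-group-per-service recommendation avoids.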

Unsupported Use Cases

Dynamic Reservation From A Pool Of Available Volumes

There may be demand for a mode where a pre-provisioned pool of Volumes exists, any of which can be assigned to a particular workload. Once the Volume is assigned to the Service, it would only be reused within that Service. This mode of provisioning introduces great complexity to the design of Volumes workflows, and is considered out of scope for this iteration of Volume support.

Resource Requirements and Limits

In Kubernetes, it is possible to assign the same volume to multiple Pods, while taking into account the volume’s present available storage and the requirements set by the user for the Pod. This functionality is out of scope for this iteration.

User Interface

Volume CLI

The existing docker volume CLI can be reused to support cluster volumes. This will require the addition of many new flags. An example of the volume create help text is below:

Usage:    docker volume create [OPTIONS] [VOLUME]

Create a volume

  -d, --driver string     Specify volume driver name (default "local")
      --label list        Set metadata for a volume
  -o, --opt map           Set driver specific options (default map[])

For swarm volumes, the following flags are available:

  --secret    Secret key and swarmkit secret identifier passed to the
              CSI Plugin Controller Service. Format 'key=identifier'.
              Use more than once to specify multiple values.
  --type      Type of volume to create, accepts "mount" or "block"
  --group     The group this volume belongs to. Volumes of the same group
              can be used interchangeably.

To set the capacity range of a volume, the following flags can be used:

  --capacity-min    The minimum required capacity in bytes. The volume will
                    be no smaller than this.
  --capacity-max    The maximum capacity of the volume in bytes. The volume
                    will be no larger than this.

To declare what topology the volume is accessible from, use the following
flags. To specify more than one topology, repeat the flag.

  --topology-requisite   The node topology that the volume must be 
                         accessible from.
  --topology-preferred   The topology that the volume should preferentially
                         be accessible from.

To set an access mode, use a combination of the following flags:

  --scope     Access mode scope of the volume. Accepts "single" or
              "multi" (default "single")
  --sharing   Access mode sharing of the volume. Accepts "none",
              "readonly", "onewriter", or "all" (default "none")

Services CLI

In order to best disambiguate CSI volumes, which have much more capability and complexity, from existing docker volumes, a new option, type=csi, will be added to the --mount flag. The existing mount options will have this behavior for CSI mounts:

| Option | Description |
|--------|-------------|
| src or source | Required. The name or group of a volume. To specify a group called groupname, use src=group:groupname. Unlike volume, Docker will not create a new volume if the specified one does not exist. |
| dst or destination or target | Same as bind and volume |
| readonly or ro | Same as bind and volume |
| consistency | Same as bind and volume |

Because, unlike traditional volume mounts, the CSI volume specified must be created in advance, there is no need to specify the volume driver for CSI volumes. If the volume does not exist, the command will return an error.
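Telling a named volume from a group comes down to inspecting the mount source for the `group:` prefix. A minimal sketch of that check follows; `parseSource` and `groupPrefix` are hypothetical names, not part of the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// groupPrefix marks a mount source that names a volume group, not a single
// volume, following the src=group:groupname form described above.
const groupPrefix = "group:"

// parseSource splits a CSI mount source into a name and a flag saying whether
// the name refers to a group. A bare "group:" yields the empty group name,
// which the proposal treats as a valid group.
func parseSource(src string) (name string, isGroup bool) {
	if strings.HasPrefix(src, groupPrefix) {
		return strings.TrimPrefix(src, groupPrefix), true
	}
	return src, false
}

func main() {
	for _, src := range []string{"mycsivolume", "group:groupname"} {
		name, isGroup := parseSource(src)
		fmt.Printf("%q -> name=%q group=%v\n", src, name, isGroup)
	}
}
```

One consequence of this scheme is that a single volume whose name itself begins with `group:` cannot be requested by name, so volume creation would likely need to reject such names.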

For example, to create a service called mystatefulservice, using a CSI volume called mycsivolume, running the nginx image, the command would be:

$ docker service create --name mystatefulservice \
  --mount type=csi,src=mycsivolume,dst=/some/path \
  nginx

Application Programming Interface

Protocol Buffer API

In swarm, protocol buffers define the internal API and object structure. To support CSI plugins, we both include types imported from the CSI plugin spec and add our own information. Swarmkit will not contain verbatim CSI RPC request or response objects. Rather, the requests will be generated from information in the swarmkit object, and the responses unpacked into appropriate fields.

Volume Object

// Volume is a top-level object representing a volume managed by swarmkit. The
// Volume contains the user's VolumeSpec, the Volume status, and the Volume
// object returned by the CSI plugin.
message Volume {
  // there would be docker store object plugin boilerplate here

  // ID is the swarmkit-internal ID for this volume. This ID has no relation to
  // and is different from the CSI volume identifier provided by the CSI plugin
  string id = 1;

  Meta meta = 2 [(gogoproto.nullable) = false];

  // Spec defines the desired state of this Volume
  VolumeSpec spec = 3 [(gogoproto.nullable) = false];

  // Status contains information about how the volume is currently being
  // employed. This allows the user to see and understand errors in volume
  // provisioning and use
  VolumeStatus status = 4 [(gogoproto.nullable) = false];

  // VolumeDetail is the Volume object returned by the CSI plugin when the
  // volume is created.
  csi::Volume volume_detail = 5;
}

Volume Secrets

CSI Plugins may require that secrets be passed to the plugin. Though a CSI plugin accepts secrets as a map<string,string>, Swarm will leverage its native Secrets functionality to distribute CSI secret data across the cluster.

// VolumeSecret indicates a secret value that must be passed to CSI plugin
// operations.
message VolumeSecret {
  // Key represents the key that will be passed as a controller secret to
  // the CSI plugin
  string key = 1;
  // Secret represents the swarmkit Secret object from which to read data to
  // use as the value to pass to the CSI plugin. This can be a secret name
  // or secret ID.
  string secret = 2;
}
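Before a CSI RPC is issued, the list of VolumeSecret references has to be turned into the plain map<string,string> the plugin expects. The sketch below shows one way that resolution could look. The `lookup` callback is a hypothetical stand-in for however swarmkit reads Secret data, and the secret names and values are made up.

```go
package main

import "fmt"

// VolumeSecret pairs the key passed to the CSI plugin with the name or ID of
// the swarmkit Secret that supplies the value.
type VolumeSecret struct {
	Key    string
	Secret string
}

// resolveSecrets builds the key/value map sent with a CSI RPC. It fails when
// any referenced Secret cannot be found, so that no partial set is sent.
func resolveSecrets(refs []VolumeSecret, lookup func(ref string) (string, bool)) (map[string]string, error) {
	resolved := make(map[string]string, len(refs))
	for _, ref := range refs {
		value, ok := lookup(ref.Secret)
		if !ok {
			return nil, fmt.Errorf("secret %q for key %q not found", ref.Secret, ref.Key)
		}
		resolved[ref.Key] = value
	}
	return resolved, nil
}

func main() {
	// An in-memory map stands in for the cluster's secret store.
	store := map[string]string{"storage-admin-password": "example-value"}
	lookup := func(ref string) (string, bool) {
		v, ok := store[ref]
		return v, ok
	}
	refs := []VolumeSecret{{Key: "password", Secret: "storage-admin-password"}}
	secrets, err := resolveSecrets(refs, lookup)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(secrets), "secret(s) resolved")
}
```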

Volume Spec

// VolumeSpec is the spec for the volume. Once a VolumeSpec is created, it
// will never be altered for the lifetime of the Volume.
message VolumeSpec {
  // Annotations includes the name and labels of a volume. The name used in the
  // spec's Annotations will be passed to the Plugin as the "Name" in the
  // CreateVolume request.
  Annotations annotations = 1 [(gogoproto.nullable) = false];
  // Group defines the volume group this particular volume belongs to. When 
  // requesting volumes for a workload, group name can be used instead of
  // the volume's name, which allows swarmkit to pick one of many volumes to use
  // for the workload.
  string group = 2;

  // Driver represents the CSI Plugin object and its configuration parameters.
  // The "options" field of the Driver object is passed in the CSI
  // CreateVolumeRequest as the "parameters" field. The Driver must be
  // specified; there is no default CSI Plugin.
  Driver driver = 3;
  // AccessMode is similar to, and used to determine, the volume's access mode
  // as defined in the CSI spec.
  VolumeAccessMode access_mode = 4;

  // In the interest of maximum simplicity, Swarmkit's spec uses some of the
  // CSI volume types directly

  // VolumeContentSource represents the source data from which to create the
  // volume. Swarmkit does not manage snapshots, and passes this data through
  // to the plugin unquestioningly
  csi::VolumeContentSource volume_content_source = 5;

  // AccessibilityRequirements specifies where a volume must be accessible
  // from. See the CSI spec for a more complete explanation. If this field is
  // specified but the plugin does not support
  // VOLUME_ACCESSIBILITY_CONSTRAINTS, then the volume will not be created. If
  // this field is not specified but is supported, swarmkit will assume that
  // the entire cluster is a valid target, and may place the volume anywhere
  // it chooses.
  csi::TopologyRequirement accessibility_requirements = 6;

  // Secrets represents a set of key/value pairs to pass to the CSI plugin to
  // complete RPCs. The keys of the secrets can be anything, but the values of
  // the secrets must be swarmkit Secret objects. See the "Secrets
  // Requirements" section of the CSI Plugin Spec for more information.
  // TODO(dperny): we may need to pass different secrets for each RPC, so we
  // may need fields for each RPC instead of just one repeated field passed to
  // each plugin
  repeated VolumeSecret secrets = 7;
}


// VolumeAccessMode defines the access mode of the volume, and is used to determine
// the CSI AccessMode value. It is the combination of two values, the Sharing and
// the Scope.
message VolumeAccessMode {
  // Scope defines on how many nodes this volume can be accessed simultaneously
  enum Scope {
    SINGLE_NODE = 1;
    MULTI_NODE = 2;
  }

  // Sharing defines how many tasks can use this volume at the same time, and in
  // what ways.
  enum Sharing {
    NONE = 1;
    READONLY = 2;
    ONEWRITER = 3;
    ALL = 4;
  }

  Scope scope = 1;
  Sharing sharing = 2;
}


When a CSI volume is assigned to a node, the CSI plugin’s Node Service requires additional calls to the plugin. These calls need only be made once per node, and so the information does not have to be present on every Task using a volume.

Swarmkit, as a product of work in Secrets and Configs, has a mechanism to distribute assignments other than Tasks to a node. To support CSI Volumes, an additional assignment type, VolumeAssignment, will be needed. This type includes the information necessary to make the NodeStageVolume and NodePublishVolume calls (and their opposites, NodeUnstageVolume and NodeUnpublishVolume).

// VolumeAssignment is the information needed by the node to make the necessary
// RPC calls and use the volume. The Volume object should never be passed to the
// worker; only the information necessary to stage and publish the Volume should
// be included.
message VolumeAssignment {
  // ID is the swarmkit ID for the Volume. This is used by all Swarmkit components
  // to identify the volume.
  string id = 1;
  // VolumeID represents the CSI volume ID, as returned from CreateVolume. It
  // is not the swarmkit internal ID. This is the ID used when calling CSI RPCs.
  string volume_id = 2;
  // VolumeContext is a map returned from the Controller service when a Volume is
  // created. It is optional, but if returned, must be passed to subsequent calls.
  map<string,string> volume_context = 3;

  // PublishContext is a map returned from the Controller service when 
  // ControllerPublishVolume is called. It is optional, but if returned, must be
  // passed to subsequent calls.
  map<string,string> publish_context = 4;

  // VolumeCapability represents the capabilities expected from the volume
  csi::VolumeCapability volume_capability = 5;
  // Secrets is the set of secrets required by the CSI plugin. These refer to
  // swarmkit Secrets that will be distributed to the node separately, through
  // the usual method.
  repeated VolumeSecret secrets = 6;
}


The Mount API type defines the parameters of a swarmkit storage mount. It is part of the Task spec. It currently has fields for BindOptions, VolumeOptions, and TmpfsOptions. A fourth options type, CSIOptions, will be added for CSI plugin volumes. However, because of the limited scope of the current proposal, it needs no content at this time.

The Source field of the Mount object will be inspected to determine whether a volume group or a specific volume is requested.

// CSIOptions describes the options associated with CSI mounts.
message CSIOptions {
  // Deliberately left empty; the features of CSI mounts are adequately handled
  // by the existing generic options.
}


The swarmkit worker needs to be aware of which Volume is associated with which Task. Conveying this information, as well as any task-specific volume configuration information, is the role of the VolumeAttachment object. Because a task may mount multiple volumes, and each volume may be valid for more than one Mount, the Source and Target fields from the Mount object are used to disambiguate which Mount this attachment is assigned for.

// VolumeAttachment defines the task-specific configuration options of the
// volume.
message VolumeAttachment {
  // ID is the swarmkit ID of the volume assigned to this task. It is not
  // the CSI volume's ID.
  string id = 1;
  // Source indicates the Mount source that this volume is assigned for.
  string source = 2;
  // Target indicates the Mount target that this volume is assigned for.
  string target = 3;
}
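On the worker, the pairing of Source and Target is what lets two mounts that request the same group end up with different volumes. A minimal sketch of that lookup, using pared-down hypothetical Go versions of the two types:

```go
package main

import "fmt"

// Mount and VolumeAttachment are reduced to the fields that matter for
// matching; the real objects carry more information.
type Mount struct {
	Source string
	Target string
}

type VolumeAttachment struct {
	ID     string // swarmkit volume ID, not the CSI volume ID
	Source string
	Target string
}

// volumeForMount returns the swarmkit ID of the volume assigned to a mount by
// matching on both Source and Target. Source alone is not enough when one
// task mounts the same group at several targets.
func volumeForMount(m Mount, attachments []VolumeAttachment) (string, bool) {
	for _, a := range attachments {
		if a.Source == m.Source && a.Target == m.Target {
			return a.ID, true
		}
	}
	return "", false
}

func main() {
	attachments := []VolumeAttachment{
		{ID: "vol-1", Source: "group:data", Target: "/data/a"},
		{ID: "vol-2", Source: "group:data", Target: "/data/b"},
	}
	id, ok := volumeForMount(Mount{Source: "group:data", Target: "/data/b"}, attachments)
	fmt.Println(id, ok)
}
```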


This is the same NodeDescription already defined in swarmkit, but CSI-specific fields are added.

message NodeDescription {
  // CSINodeInfo represents the information about a Node returned by calling
  // the NodeGetInfo RPC on the CSI plugin present on the node.
  message CSINodeInfo {
    // PluginName is the name of the CSI plugin
    string plugin_name = 1;

    // NodeID is the ID of the Node as reported by the CSI plugin
    string node_id = 2;

    // MaxVolumesPerNode is the maximum number of volumes which can be
    // published to the node, as reported by the CSI plugin
    int64 max_volumes_per_node = 3;

    // AccessibleTopology indicates the location of this node in the CSI
    // plugin's topology, as reported by the plugin.
    csi::Topology accessible_topology = 4;
  }

  // NodeCSIInfo is a list of node info reported by all supported CSI plugins
  // on this node. If a NodeDescription includes a CSINodeInfo object for the
  // plugin, then the Node can be assumed to have that plugin, and if it does
  // not, the node should be assumed to not have that plugin. NodeCSIInfo is
  // included in the NodeDescription because it is reported by the Node, and
  // should be treated as relatively untrusted like other fields in the
  // NodeDescription.
  repeated CSINodeInfo node_csi_info = 7 [(gogoproto.customname) = "CSINodeInfo"];
}
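One way a scheduler might consume this per-node information is to rule out nodes that lack the plugin or have already reached the plugin's volume limit. The sketch below is illustrative only: `canPublish` is a hypothetical helper, and treating a limit of 0 as "no limit reported" is an assumption the proposal does not state.

```go
package main

import "fmt"

// CSINodeInfo is a reduced Go counterpart of the nested proto message.
type CSINodeInfo struct {
	PluginName        string
	NodeID            string
	MaxVolumesPerNode int64
}

// canPublish reports whether one more volume from the named plugin could be
// published to a node that already has alreadyPublished volumes from it.
// A node with no entry for the plugin is assumed not to run the plugin.
// Assumption: MaxVolumesPerNode == 0 means the plugin reported no limit.
func canPublish(infos []CSINodeInfo, plugin string, alreadyPublished int64) bool {
	for _, info := range infos {
		if info.PluginName != plugin {
			continue
		}
		return info.MaxVolumesPerNode == 0 || alreadyPublished < info.MaxVolumesPerNode
	}
	return false
}

func main() {
	infos := []CSINodeInfo{{PluginName: "example-plugin", NodeID: "node-a", MaxVolumesPerNode: 2}}
	fmt.Println(canPublish(infos, "example-plugin", 1))
	fmt.Println(canPublish(infos, "example-plugin", 2))
	fmt.Println(canPublish(infos, "other-plugin", 0))
}
```

Because the node reports these values itself, a manager would presumably treat the result as a hint for scheduling and not as a guarantee.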


The VolumeStatus object is used to keep track of the status of a particular volume.

// VolumeStatus is the status of a particular volume.
message VolumeStatus {
  // PublishedNodes is a list of all node IDs to which this volume is known
  // to be published. If the CSI Plugin supports PUBLISH_UNPUBLISH_VOLUME,
  // the Node will not be added to the list of nodes until the call to
  // ControllerPublishVolume has succeeded.
  repeated string published_nodes = 1;
}


The Docker Engine is controlled primarily through a REST API, where the object types are defined as Go structs. The REST API generally draws its inspiration from the Protocol Buffers, but may have differences. Objects sufficiently identical to the protos have been omitted.

For brevity's sake, when doc comments are identical to those of the protos, they have been omitted.

Volume Object

// Volume is a top-level object representing a volume managed by swarmkit. The
// Volume contains the user's VolumeSpec, the Volume status, and the Volume
// object returned by the CSI plugin
type Volume struct {
  ID     string
  Spec   VolumeSpec
  Status VolumeStatus
  // VolumeDetail contains the csi::Volume object returned by the CSI plugin
  // when the volume is created. It is not the csi::Volume type directly; it
  // merely conveys the same information.
  VolumeDetail *CSIVolumeDetail
}


To avoid directly depending on the CSI protocol buffers at the REST API level, we instead cast the information from csi::Volume into a separately-defined isomorphic type, CSIVolumeDetail.

// CSIVolumeDetail contains the same information present in the csi::Volume proto,
// but defined separately. See the csi::Volume type from the CSI spec for more
// information on these values.
type CSIVolumeDetail struct {
  // CapacityBytes is the capacity of the volume in bytes. If 0, the capacity is
  // unknown.
  CapacityBytes int64 `json:",omitempty"`
  // VolumeID is the identifier for the volume generated by the CSI plugin.
  VolumeID string
  // VolumeContext is properties of the volume returned by the CSI plugin
  VolumeContext map[string]string `json:",omitempty"`
  // TODO: ContentSource is not supported, because creating volumes from
  // Snapshots or other Volumes is not yet supported.
  // AccessibleTopology is the topology from which this volume is available.
  AccessibleTopology []Topology `json:",omitempty"`
}


// Topology is a CSI Topology, used to indicate cluster topology from the CSI
// plugin's perspective.
type Topology struct {
  // Segments is a map of Topology Domains (like "zone" or "rack") to Topology
  // Segments (like "zone3" or "rack3")
  Segments map[string]string `json:",omitempty"`
}
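To show what these types would look like on the wire, the sketch below marshals a CSIVolumeDetail with the standard library's encoding/json. The struct definitions repeat the ones above, `encodeDetail` is a hypothetical helper, and the volume ID and zone values are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Topology and CSIVolumeDetail repeat the REST types defined above.
type Topology struct {
	Segments map[string]string `json:",omitempty"`
}

type CSIVolumeDetail struct {
	CapacityBytes      int64             `json:",omitempty"`
	VolumeID           string
	VolumeContext      map[string]string `json:",omitempty"`
	AccessibleTopology []Topology        `json:",omitempty"`
}

// encodeDetail renders the detail as it would appear in a REST response body.
func encodeDetail(d CSIVolumeDetail) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	detail := CSIVolumeDetail{
		VolumeID:           "vol-1",
		AccessibleTopology: []Topology{{Segments: map[string]string{"zone": "zone3"}}},
	}
	out, err := encodeDetail(detail)
	if err != nil {
		panic(err)
	}
	// Prints {"VolumeID":"vol-1","AccessibleTopology":[{"Segments":{"zone":"zone3"}}]}
	fmt.Println(out)
}
```

With omitempty, an unknown capacity (0) and an empty VolumeContext vanish from the output, so a client cannot tell "unknown" from "not reported", which matches the comment on CapacityBytes.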


@dperny dperny commented Jul 29, 2019

Oh and also, I made these dope images using graphviz in an earlier iteration of this proposal, but they're not completely germane so I left them out:





@cpuguy83 cpuguy83 commented Aug 6, 2019

So I would like to make sure we aren't just ticking off feature boxes like "Swarm now supports CSI".
Instead it would be great to have a few goals in mind, such as running etcd and/or postgres on swarm, and how swarm can help those services recover from certain failure conditions.



@trajano trajano commented Sep 29, 2019

I was thinking more of having a configuration like this:

      driver: block | object [ block store which is the default allows management of the volume in terms of blocks like a traditional file system, object would be like S3.
       replicas: 4
         # replication settings
         min_replica: # number of replicas that must have the data before considering the file written
         constraints: [same-constraints as services]
            size: 5G
            size: 2G # reserve is meant to find a docker swarm volume that would have the space to place the data
            memory: 5M # memory specifies the memory used for storing the index, buffer or caching

However, for something special like Postgres or ETCd, there may be a specific driver that would manage the store for them more efficiently. But I think having plugins be managed by the swarm orchestrator should be a higher priority.
