Real Kubernetes clusters have a variety of volumes which differ widely in size, IOPS performance, retention policy, and other characteristics. Administrators need a way to dynamically provision volumes of these different types to automatically meet user demand.
A new mechanism called 'storage classes' is proposed to provide this capability.
In Kubernetes 1.2, an alpha form of limited dynamic provisioning was added that allows a single volume type to be provisioned in clouds that offer special volume types.
In Kubernetes 1.3, a label selector was added to persistent volume claims to allow administrators to create a taxonomy of volumes based on the characteristics important to them, and to allow users to make claims on those volumes based on those characteristics. This allows flexibility when claiming existing volumes; the same flexibility is needed when dynamically provisioning volumes.
Having gained experience with dynamic provisioning since the 1.2 release, we want to create a more flexible feature that allows configuration of how different storage classes are provisioned and supports provisioning multiple types of volumes within a single cloud.
One of our goals is to enable administrators to create out-of-tree provisioners, that is, provisioners whose code does not live in the Kubernetes project.
This design represents the minimally viable changes required to provision based on storage class configuration. Additional incremental features may be added as a separate effort.
We propose that:
- For both in-tree and out-of-tree storage provisioners, the PV created by a provisioner must match the PVC that led to its creation. If a provisioner is unable to provision such a matching PV, it reports an error to the user.

- The above point also applies to the PVC label selector. If a user submits a PVC with a label selector, the provisioner must provision a PV with matching labels. This directly implies that the provisioner understands the meaning behind these labels: if a user submits a claim with a selector that wants a PV with label "region" not in "[east,west]", the provisioner must understand what the label "region" means and which regions are available, and choose e.g. "north".

  In other words, provisioners should either refuse to provision a volume for a PVC that has a selector, or select a few labels that are allowed in selectors (such as the "region" example above), implement the necessary logic for parsing them, document them, and refuse any selector that references unknown labels.

- An API object will be incubated in storage.k8s.io/v1beta1 to hold the StorageClass API resource. Each StorageClass object contains parameters required by the provisioner to provision volumes of that class. These parameters are opaque to the user.

- A PersistentVolume.Spec.Class attribute is added to volumes. This attribute is optional and specifies which StorageClass instance represents the storage characteristics of a particular PV.

  During incubation, Class is an annotation and not an actual attribute.

- PersistentVolume instances do not require labels by the provisioner.

- A PersistentVolumeClaim.Spec.Class attribute is added to claims. This attribute specifies that only a volume with an equal PersistentVolume.Spec.Class value can satisfy a claim.

  During incubation, Class is just an annotation and not an actual attribute.

- The existing provisioner plugin implementations will be modified to accept parameters as specified via StorageClass.

- The persistent volume controller will be modified to invoke provisioners using StorageClass configuration and to bind claims with PersistentVolumeClaim.Spec.Class to volumes with an equivalent PersistentVolume.Spec.Class.

- The existing alpha dynamic provisioning feature will be phased out in the next release.
0. The Kubernetes administrator can configure the name of a default StorageClass. This StorageClass instance is then used when a user requests a dynamically provisioned volume but does not specify a StorageClass, in other words when claim.Spec.Class == "" (or annotation volume.beta.kubernetes.io/storage-class == "").

1. When a new claim is submitted, the controller attempts to find an existing volume that will fulfill the claim.

   - If the claim has a non-empty claim.Spec.Class, only PVs with the same pv.Spec.Class are considered.
   - If the claim has an empty claim.Spec.Class, only PVs with an unset pv.Spec.Class are considered.

   All "considered" volumes are evaluated and the smallest matching volume is bound to the claim.

2. If no volume is found for the claim and claim.Spec.Class is not set or is an empty string, dynamic provisioning is disabled.

3. If claim.Spec.Class is set, the controller tries to find an instance of StorageClass with this name. If no such StorageClass is found, the controller goes back to step 1 and periodically retries finding a matching volume or storage class until a match is found. The claim is Pending during this period.

4. With the StorageClass instance, the controller updates the claim:

   claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner

5. In-tree provisioning

   The controller tries to find an internal volume plugin referenced by storageClass.Provisioner. If it is found:

   - The internal provisioner implements the interface ProvisionableVolumePlugin, which has a method called NewProvisioner that returns a new provisioner.
   - The controller calls the volume plugin's Provision with Parameters from the StorageClass configuration object.
   - If Provision returns an error, the controller generates an event on the claim and goes back to step 1, i.e. it will retry provisioning periodically.
   - If Provision returns no error, the controller creates the returned api.PersistentVolume, fills its Class attribute with claim.Spec.Class, and makes it already bound to the claim.
     - If the create operation for the api.PersistentVolume fails, it is retried.
     - If the create operation does not succeed in a reasonable time, the controller attempts to delete the provisioned volume and creates an event on the claim.

   Existing behavior is unchanged for claims that do not specify claim.Spec.Class.
6. Out of tree provisioning

   Following step 4 above, the controller tries to find an internal plugin for the StorageClass. If it is not found, the controller does nothing except periodically go back to step 1, i.e. it tries to find an available matching PV.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
An external provisioner must have these features:

- It MUST have a distinct name, following the Kubernetes plugin naming scheme <vendor name>/<provisioner name>, e.g. gluster.org/gluster-volume.
- The provisioner SHOULD send events on a claim to report any errors related to provisioning a volume for the claim. This way, users get the same experience as with internal provisioners.
- The provisioner MUST also implement a deleter. It must be able to delete storage assets it created. It MUST NOT assume that any other internal or external plugin is present.
The external provisioner runs in a separate process which watches claims, be it an external storage appliance, a daemon, or a Kubernetes pod. For every claim creation or update, it implements these steps:

i. The provisioner inspects whether claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] == <provisioner name>. All other claims MUST be ignored.

ii. The provisioner MUST check that the claim is unbound, i.e. its claim.Spec.VolumeName is empty. Bound claims MUST be ignored.

   The race condition in which the provisioner provisions a new PV for a claim while Kubernetes binds the same claim to another PV that was just created by an admin is discussed below.

iii. It tries to find the StorageClass instance referenced by annotation claim.Annotations["volume.beta.kubernetes.io/storage-class"]. If not found, it SHOULD report an error (by sending an event to the claim) and it SHOULD retry periodically with step i.

iv. The provisioner MUST parse arguments in the StorageClass and claim.Spec.Selector and provision an appropriate storage asset that matches both the parameters and the selector. When it encounters unknown parameters in storageClass.Parameters or claim.Spec.Selector, or the combination of these parameters is impossible to achieve, it SHOULD report an error and it MUST NOT provision a volume. All errors found during parsing or provisioning SHOULD be sent as events on the claim and the provisioner SHOULD retry periodically with step i.

   As parsing (and understanding) claim selectors is hard, the sentence "MUST parse ... claim.Spec.Selector" will in the typical case lead to simple refusal of claims that have any selector:

       if pvc.Spec.Selector != nil {
           return Error("can't parse PVC selector!")
       }

v. When the volume is provisioned, the provisioner MUST create a new PV representing the storage asset and save it in Kubernetes. When this fails, it SHOULD retry creating the PV a few times. If all attempts fail, it MUST delete the storage asset. All errors SHOULD be sent as events to the claim.
The created PV MUST have these properties:

- pv.Spec.ClaimRef MUST point to the claim that led to its creation (including the claim UID).

  This way, the PV will be bound to the claim.

- pv.Annotations["pv.kubernetes.io/provisioned-by"] MUST be set to the name of the external provisioner. This provisioner will be used to delete the volume.

  The provisioner/deleter should not assume there is any other provisioner/deleter available that would delete the volume.

- pv.Annotations["volume.beta.kubernetes.io/storage-class"] MUST be set to the name of the storage class requested by the claim.

  So the created PV matches the claim.

- The provisioner MAY store any other information to the created PV as annotations. It SHOULD save any information that is needed to delete the storage asset there, as the appropriate StorageClass instance may not exist when the volume is deleted. However, references to a Secret instance or a direct username/password to a remote storage appliance MUST NOT be stored there, see issue #34822.

- pv.Labels MUST be set to match claim.Spec.Selector. The provisioner MAY add additional labels.

  So the created PV matches the claim.

- pv.Spec MUST be set to match the requirements in claim.Spec, especially access mode and PV size. The provisioned volume size MUST NOT be smaller than the size requested in the claim, however it MAY be larger.

- Kubernetes v1.9 and later can deploy a raw block volume instead of a filesystem volume. To support this feature, a volumeMode parameter, which takes the values Filesystem and Block, was added to pv.Spec and pvc.Spec. To deploy a block volume via an external provisioner, the following conditions are REQUIRED:

  - The storage is able to create a raw block type of volume.
  - The block volume feature is supported by the volume plugin.
  - The external provisioner MUST set volumeMode in pv.Spec to match the requirements in claim.Spec.

  So the created PV matches the claim.

- pv.Spec.PersistentVolumeSource MUST be set to point to the created storage asset.

- pv.Spec.PersistentVolumeReclaimPolicy SHOULD be set to Delete unless the user manually configures another reclaim policy.

- pv.Name MUST be unique. Internal provisioners use a name based on claim.UID to produce conflicts when two provisioners accidentally provision a PV for the same claim; external provisioners, however, can use any mechanism to generate a unique PV name.
Example 1) a claim that is to be provisioned by an external provisioner for foo.org/foo-volume:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      annotations:
        volume.beta.kubernetes.io/storage-class: myClass
        volume.beta.kubernetes.io/storage-provisioner: foo.org/foo-volume
      name: fooclaim
      namespace: default
      resourceVersion: "53"
      uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 4Gi
      # volumeName: must be empty!

Example 1) the created PV:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      annotations:
        pv.kubernetes.io/provisioned-by: foo.org/foo-volume
        volume.beta.kubernetes.io/storage-class: myClass
        foo.org/provisioner: "any other annotations as needed"
      labels:
        foo.org/my-label: "any labels as needed"
      generateName: "foo-volume-"
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Filesystem
      awsElasticBlockStore:
        fsType: ext4
        volumeID: aws://us-east-1d/vol-de401a79
      capacity:
        storage: 4Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: fooclaim
        namespace: default
        resourceVersion: "53"
        uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
      persistentVolumeReclaimPolicy: Delete

Example 2) a claim that provisions a volumeMode: Block volume:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      ...
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Block
      resources:
        requests:
          storage: 4Gi
      # volumeName: must be empty!

Example 2) the created PV:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      ...
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Block
      awsElasticBlockStore:
        volumeID: aws://us-east-1d/vol-de401a79
      capacity:
        storage: 4Gi
      claimRef:
        ...
      persistentVolumeReclaimPolicy: Delete
As a result, Kubernetes has a PV that represents the storage asset and is bound to the claim. When everything went well, Kubernetes completed binding of the claim to the PV.

Kubernetes was not blocked in any way during the provisioning: it could have bound the claim to another PV that was created by the user, or the claim may even have been deleted by the user. In both cases, Kubernetes will mark the PV to be deleted using the protocol below.

The external provisioner MAY save any annotations to the claim that is provisioned; however, the claim may be modified or even deleted by the user at any time.
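The external provisioner steps and the required PV properties can be sketched in Go as follows. This is only an illustration under stated assumptions: Claim and Volume are hypothetical, flattened stand-ins for the real API types, the function names are invented here, and the "pvc-" + UID naming is just one possible way to get a unique PV name. Watching claims, creating the storage asset, and sending events are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation keys as named in the proposal text.
const (
	annStorageProvisioner = "volume.beta.kubernetes.io/storage-provisioner"
	annStorageClass       = "volume.beta.kubernetes.io/storage-class"
	annProvisionedBy      = "pv.kubernetes.io/provisioned-by"
)

// Claim and Volume are hypothetical, flattened stand-ins for the real
// PersistentVolumeClaim and PersistentVolume API types.
type Claim struct {
	UID          string
	Annotations  map[string]string
	VolumeName   string // claim.Spec.VolumeName
	HasSelector  bool   // true when claim.Spec.Selector != nil
	RequestBytes int64
	VolumeMode   string
}

type Volume struct {
	Name          string
	Annotations   map[string]string
	ClaimRefUID   string
	CapacityBytes int64
	VolumeMode    string
	ReclaimPolicy string
}

var errSelectorUnsupported = errors.New("can't parse PVC selector")

// shouldProvision covers steps i and ii: the claim must name this
// provisioner and must still be unbound.
func shouldProvision(c Claim, provisioner string) bool {
	return c.Annotations[annStorageProvisioner] == provisioner && c.VolumeName == ""
}

// checkClaim covers the simple form of step iv: refuse any selector.
func checkClaim(c Claim) error {
	if c.HasSelector {
		return errSelectorUnsupported
	}
	return nil
}

// buildVolume fills in the PV properties required by step v for a storage
// asset of assetBytes that was created for the claim.
func buildVolume(c Claim, provisioner string, assetBytes int64) (Volume, error) {
	if assetBytes < c.RequestBytes {
		return Volume{}, fmt.Errorf("asset of %d bytes is smaller than the %d bytes requested", assetBytes, c.RequestBytes)
	}
	return Volume{
		Name: "pvc-" + c.UID, // assumption: any unique naming scheme would do
		Annotations: map[string]string{
			annProvisionedBy: provisioner,
			annStorageClass:  c.Annotations[annStorageClass],
		},
		ClaimRefUID:   c.UID,
		CapacityBytes: assetBytes,
		VolumeMode:    c.VolumeMode,
		ReclaimPolicy: "Delete",
	}, nil
}

func main() {
	const name = "foo.org/foo-volume"
	claim := Claim{
		UID: "5a294561-7e5b-11e6-a20e-0eb6048532a3",
		Annotations: map[string]string{
			annStorageProvisioner: name,
			annStorageClass:       "myClass",
		},
		RequestBytes: 4 << 30,
		VolumeMode:   "Filesystem",
	}
	if !shouldProvision(claim, name) {
		return
	}
	if err := checkClaim(claim); err != nil {
		fmt.Println("event on claim:", err)
		return
	}
	pv, err := buildVolume(claim, name, 4<<30)
	fmt.Println(pv.Name, pv.Annotations[annProvisionedBy], err)
}
```

A real provisioner would run checkClaim before creating the storage asset, so that a refused claim never leaves an orphaned asset behind.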
When the controller decides that a volume should be deleted, it performs these steps:

1. The controller changes pv.Status.Phase to Released.

2. The controller looks for pv.Annotations["pv.kubernetes.io/provisioned-by"]. If found, it uses this provisioner/deleter to delete the volume.

3. If the volume is not annotated by pv.kubernetes.io/provisioned-by, the controller inspects pv.Spec and finds an in-tree deleter for the volume.

4. If the deleter found by steps 2 or 3 is internal, the controller calls it and deletes the storage asset together with the PV that represents it.

5. If the deleter is not known to Kubernetes, the controller does not do anything.

6. External deleters MUST watch for PV changes. When pv.Status.Phase == Released && pv.Annotations['pv.kubernetes.io/provisioned-by'] == <deleter name>, the deleter:

   - MUST check the reclaim policy of the PV and ignore all PVs whose Spec.PersistentVolumeReclaimPolicy is not Delete.
   - MUST delete the storage asset.
   - MUST delete the PV object in Kubernetes, but only after the storage asset was successfully deleted.
   - SHOULD send any error as an event on the PV being deleted and SHOULD retry deleting the volume periodically.
   - SHOULD NOT use any information from the StorageClass instance referenced by the PV. This is different from internal deleters, which need the StorageClass instance to be present at the time of deletion to read Secret instances (see the Gluster provisioner for example); however, we would like to phase out this behavior.

   Note that watching pv.Status has been frowned upon in the past; however, in this particular case we could use it quite reliably to trigger deletion. It's not trivial to find out if a PV is not needed and should be deleted. Alternatively, an annotation could be used.
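The external deleter rules can be sketched in Go as follows. This is a minimal illustration, not the actual implementation: the Volume type, the Backend interface and its method names, and the recorder fake are all hypothetical. The sketch shows the filter (phase, annotation, reclaim policy) and the ordering rule that the PV object is removed only after the storage asset is gone.

```go
package main

import (
	"errors"
	"fmt"
)

const annProvisionedBy = "pv.kubernetes.io/provisioned-by"

// Volume is a hypothetical, flattened stand-in for the PersistentVolume type.
type Volume struct {
	Name          string
	Phase         string // pv.Status.Phase
	ReclaimPolicy string // pv.Spec.PersistentVolumeReclaimPolicy
	Annotations   map[string]string
}

// Backend abstracts the two deletions a deleter performs. The interface and
// its method names are illustrative only.
type Backend interface {
	DeleteAsset(pv Volume) error
	DeletePV(pv Volume) error
}

// shouldDelete reports whether this deleter is responsible for the PV.
func shouldDelete(pv Volume, deleter string) bool {
	return pv.Phase == "Released" &&
		pv.Annotations[annProvisionedBy] == deleter &&
		pv.ReclaimPolicy == "Delete"
}

// reclaim deletes the storage asset first and removes the PV object only if
// that succeeded. A returned error means the caller should send an event on
// the PV and retry later.
func reclaim(pv Volume, deleter string, b Backend) error {
	if !shouldDelete(pv, deleter) {
		return nil
	}
	if err := b.DeleteAsset(pv); err != nil {
		return fmt.Errorf("deleting storage asset for %s: %w", pv.Name, err)
	}
	return b.DeletePV(pv)
}

var errBackendDown = errors.New("backend down")

// recorder is a fake Backend that records the order of calls.
type recorder struct {
	calls    []string
	assetErr error
}

func (r *recorder) DeleteAsset(Volume) error {
	r.calls = append(r.calls, "asset")
	return r.assetErr
}

func (r *recorder) DeletePV(Volume) error {
	r.calls = append(r.calls, "pv")
	return nil
}

func main() {
	pv := Volume{
		Name:          "pv1",
		Phase:         "Released",
		ReclaimPolicy: "Delete",
		Annotations:   map[string]string{annProvisionedBy: "foo.org/foo-volume"},
	}
	ok := &recorder{}
	err := reclaim(pv, "foo.org/foo-volume", ok)
	fmt.Println(err, ok.calls)

	failing := &recorder{assetErr: errBackendDown}
	err = reclaim(pv, "foo.org/foo-volume", failing)
	fmt.Println(err, failing.calls)
}
```

Note that reclaim never consults a StorageClass: everything it needs is taken from the PV itself, in line with the rule that the deleter should not depend on the class still existing.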
Both internal and external provisioners and deleters may need access to credentials (e.g. username+password) of an external storage appliance to provision and delete volumes.

- For internal provisioners, a Secret instance in a well secured namespace should be used. A pointer to the Secret instance shall be a parameter of the StorageClass, and it MUST NOT be copied around the system, e.g. in annotations of PVs. See issue #34822.
- External provisioners running in a pod should have appropriate credentials mounted as a Secret inside the pods that run the provisioner. The namespace with the pods and the Secret instance should be well secured.
A new API group should hold the API for storage classes, following the pattern of autoscaling, metrics, etc. To allow for future storage-related APIs, we should call this new API group storage.k8s.io and incubate in storage.k8s.io/v1beta1.

Storage classes will be represented by an API object called StorageClass:
package storage

// StorageClass describes the parameters for a class of storage for
// which PersistentVolumes can be dynamically provisioned.
//
// StorageClasses are non-namespaced; the name of the storage class
// according to etcd is in ObjectMeta.Name.
type StorageClass struct {
	unversioned.TypeMeta `json:",inline"`
	ObjectMeta           `json:"metadata,omitempty"`

	// Provisioner indicates the type of the provisioner.
	Provisioner string `json:"provisioner,omitempty"`

	// Parameters for dynamic volume provisioner.
	Parameters map[string]string `json:"parameters,omitempty"`
}
PersistentVolumeClaimSpec and PersistentVolumeSpec both get a Class attribute (the existing annotation is used during incubation):

type PersistentVolumeClaimSpec struct {
	// Name of requested storage class. If non-empty, only PVs with this
	// pv.Spec.Class will be considered for binding and if no such PV is
	// available, StorageClass with this name will be used to dynamically
	// provision the volume.
	Class string
	...
}

type PersistentVolumeSpec struct {
	// Name of StorageClass instance that this volume belongs to.
	Class string
	...
}
Storage classes are natural to think of as a global resource, since they:
- Align with PersistentVolumes, which are a global resource
- Are administrator controlled
With the scheme outlined above, the provisioner creates PVs using parameters specified in the StorageClass object.

The struct volume.VolumeOptions (containing parameters for a provisioner plugin) will be extended to contain StorageClass.Parameters.

The existing provisioner implementations will be modified to accept the StorageClass configuration object.
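As a rough illustration of how a provisioner might consume StorageClass.Parameters, the Go sketch below parses the opaque string map into typed options and rejects anything it does not understand. The ebsOptions type and the parseParameters function are hypothetical; the zone and type keys and the ssd and spinning values are borrowed from the aws-fast and aws-slow example classes in this proposal, not from a real plugin.

```go
package main

import (
	"fmt"
	"sort"
)

// ebsOptions is a hypothetical typed view of StorageClass.Parameters for an
// EBS-like provisioner.
type ebsOptions struct {
	Zone string
	Type string
}

// parseParameters converts the opaque parameter map into typed options. Any
// unknown key or unsupported value is an error, so the caller can refuse to
// provision instead of guessing.
func parseParameters(params map[string]string) (ebsOptions, error) {
	// Sort the keys so that the first error reported is deterministic.
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var opts ebsOptions
	for _, k := range keys {
		v := params[k]
		switch k {
		case "zone":
			opts.Zone = v
		case "type":
			if v != "ssd" && v != "spinning" {
				return ebsOptions{}, fmt.Errorf("unsupported volume type %q", v)
			}
			opts.Type = v
		default:
			return ebsOptions{}, fmt.Errorf("unknown parameter %q", k)
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseParameters(map[string]string{"zone": "us-east-1b", "type": "ssd"})
	fmt.Println(opts, err)

	_, err = parseParameters(map[string]string{"iops": "100"})
	fmt.Println(err)
}
```

Failing on unknown keys keeps a typo in a StorageClass from silently producing a volume with default characteristics.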
The persistent volume controller will be modified to implement the new workflow described in this proposal. The changes will be limited to the provisionClaimOperation method, which is responsible for invoking the provisioner, and to favoring existing volumes before provisioning a new one.
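The controller workflow described in this proposal can be summarized in a small Go sketch. It is a simplification under stated assumptions, not the real provisionClaimOperation: all types, the Action values, and the syncClaim and findSmallestMatch names are hypothetical, the default StorageClass is not modeled, and access modes and selectors are ignored when matching.

```go
package main

import "fmt"

const annStorageProvisioner = "volume.beta.kubernetes.io/storage-provisioner"

// Volume, Claim and StorageClass are hypothetical, flattened stand-ins for
// the real API types.
type Volume struct {
	Name          string
	Class         string // pv.Spec.Class, "" when unset
	CapacityBytes int64
	Bound         bool
}

type Claim struct {
	Class        string // claim.Spec.Class, "" when unset
	RequestBytes int64
	Annotations  map[string]string
}

type StorageClass struct {
	Provisioner string
}

// Action names what the controller does next for a claim.
type Action string

const (
	Bind            Action = "bind existing volume"
	NoProvisioning  Action = "dynamic provisioning disabled"
	Retry           Action = "storage class not found, retry later"
	ProvisionInTree Action = "provision with in-tree plugin"
	WaitForExternal Action = "wait for external provisioner"
)

// findSmallestMatch returns the smallest unbound volume of the claim's class
// that is large enough, or nil. An empty claim class matches only volumes
// with an unset class.
func findSmallestMatch(claim Claim, volumes []Volume) *Volume {
	var best *Volume
	for i := range volumes {
		v := &volumes[i]
		if v.Bound || v.Class != claim.Class || v.CapacityBytes < claim.RequestBytes {
			continue
		}
		if best == nil || v.CapacityBytes < best.CapacityBytes {
			best = v
		}
	}
	return best
}

// syncClaim favors existing volumes and only then considers provisioning.
func syncClaim(claim *Claim, volumes []Volume, classes map[string]StorageClass, inTree map[string]bool) (Action, *Volume) {
	if v := findSmallestMatch(*claim, volumes); v != nil {
		return Bind, v
	}
	if claim.Class == "" {
		return NoProvisioning, nil
	}
	class, ok := classes[claim.Class]
	if !ok {
		return Retry, nil
	}
	if claim.Annotations == nil {
		claim.Annotations = map[string]string{}
	}
	claim.Annotations[annStorageProvisioner] = class.Provisioner
	if inTree[class.Provisioner] {
		return ProvisionInTree, nil
	}
	return WaitForExternal, nil
}

func main() {
	volumes := []Volume{
		{Name: "big", Class: "aws-fast", CapacityBytes: 10 << 30},
		{Name: "small", Class: "aws-fast", CapacityBytes: 5 << 30},
	}
	classes := map[string]StorageClass{"aws-fast": {Provisioner: "kubernetes.io/aws-ebs"}}
	inTree := map[string]bool{"kubernetes.io/aws-ebs": true}

	claim := &Claim{Class: "aws-fast", RequestBytes: 4 << 30}
	action, pv := syncClaim(claim, volumes, classes, inTree)
	fmt.Println(action, pv.Name)
}
```

Because the class comparison is a plain string equality, the same filter handles both cases of the matching rule: a claim with a class sees only volumes of that class, and a claim without one sees only volumes without one.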
This example shows two storage classes, "aws-fast" and "aws-slow".
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: aws-fast
provisioner: kubernetes.io/aws-ebs
parameters:
  zone: us-east-1b
  type: ssd

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: aws-slow
provisioner: kubernetes.io/aws-ebs
parameters:
  zone: us-east-1b
  type: spinning
- Annotation volume.alpha.kubernetes.io/storage-class is used instead of claim.Spec.Class and volume.Spec.Class during incubation.
- claim.Spec.Selector and claim.Spec.Class are mutually exclusive for now (1.4). A user can either match existing volumes with Selector, XOR match existing volumes with Class and get dynamic provisioning by using Class. This simplifies the initial PR and also the provisioners. This limitation may be lifted in future releases.
Since the volume.alpha.kubernetes.io/storage-class annotation is in use, a StorageClass must be defined to support provisioning. No default is assumed as before.