LocalStorage api #44640
Conversation
pkg/api/types.go
Outdated
@@ -302,6 +302,11 @@ type VolumeSource struct {
	// ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
	// +optional
	ScaleIO *ScaleIOVolumeSource
	// LocalStorage represents a local volume accessible only from one node
	// This volume type cannot be used as a direct volume source and is only required here
The volume plugin interface requires the volume type to be part of the VolumeSource in order to do volume reconstruction. But for local storage, we do not want to allow users to be able to specify it as a VolumeSource. I'm not sure if it's possible to change the volume reconstruction. @saad-ali
Just checked with Saad. ConstructVolumeSpec can actually support PersistentVolumeSource too, so then I can remove this from the Volume Source. :-D
pkg/api/validation/validation.go
Outdated
	allErrs = append(allErrs, field.Required(specPath.Child("storageClassName"), "LocalStorage volumes require a StorageClass"))
}
// hostname label is required
hostname, found := pv.Labels[metav1.LabelHostname]
I'm not sure if users would want to use other labels as the topology key for local volumes. For the recommended use case, if they accidentally forget the hostname label, then it won't trigger the node stickiness, which could be a dangerous situation (pod could mount any random path on any node).
Discussed with @vishh, it would be better for the system to not depend on any specific labels. So we should not enforce that a particular label is set. This means that any admin/user that is using this feature really needs to understand what they're doing because mistakes, like forgetting a label, can have serious consequences. As part of this feature, we are going to provide a default external provisioner that will set the appropriate labels and be thoroughly tested.
Another possibility is to add an explicit node field to the LocalStorageVolumeSource, and then have an admission controller that will add the hostname label to the PV object. This is similar to how it works with zones on PVs.
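The admission-controller idea could be sketched roughly as below. This is a hypothetical illustration, not the real Kubernetes types or plugin: `LocalVolumeSource`, `PersistentVolume`, and `admitLocalPV` are stand-in names, mirroring how the zone-label admission controller decorates PVs today.

```go
package main

import "fmt"

// Minimal stand-ins for the real API types, for illustration only.
type LocalVolumeSource struct {
	NodeName string
	Path     string
}

type PersistentVolume struct {
	Labels map[string]string
	Local  *LocalVolumeSource
}

const labelHostname = "kubernetes.io/hostname"

// admitLocalPV copies the explicit node field from the local volume source
// onto the PV's labels, so node stickiness does not depend on the user
// remembering to set the label by hand.
func admitLocalPV(pv *PersistentVolume) {
	if pv.Local == nil || pv.Local.NodeName == "" {
		return
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels[labelHostname] = pv.Local.NodeName
}

func main() {
	pv := &PersistentVolume{Local: &LocalVolumeSource{NodeName: "node-1", Path: "/mnt/disks/ssd0"}}
	admitLocalPV(pv)
	fmt.Println(pv.Labels[labelHostname]) // node-1
}
```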
+1 for admission controller.
We should trust the admin that correct labels are set or the admission plugin is running. To be extra careful, SetUp() call of volume plugin for LocalStorage could check that it runs on the right node before mount.
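The suggested SetUp() safety check might look like the following sketch. `checkNodeAffinity` and `nodeNameForPlugin` are illustrative names; a real plugin would obtain the node name from the kubelet's VolumeHost rather than the environment or hostname.

```go
package main

import (
	"fmt"
	"os"
)

// nodeNameForPlugin guesses the node the plugin runs on. This is only a
// fallback for illustration; hostname is not guaranteed to equal node name.
func nodeNameForPlugin() (string, error) {
	if n := os.Getenv("NODE_NAME"); n != "" {
		return n, nil
	}
	return os.Hostname()
}

// checkNodeAffinity rejects a SetUp() call when the local PV is pinned to a
// different node than the one the plugin is running on, guarding against a
// mislabeled PV being mounted on a random node.
func checkNodeAffinity(pvNodeName, currentNode string) error {
	if pvNodeName != currentNode {
		return fmt.Errorf("local volume is bound to node %q but SetUp() is running on %q", pvNodeName, currentNode)
	}
	return nil
}

func main() {
	fmt.Println(checkNodeAffinity("node-1", "node-1") == nil) // true
	fmt.Println(checkNodeAffinity("node-1", "node-2") != nil) // true: mismatch is rejected
}
```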
pkg/api/validation/validation.go
Outdated
numVolumes++
allErrs = append(allErrs, validateLocalStorageVolumeSource(pv.Spec.LocalStorage, specPath.Child("localStorage"))...)
// StorageClass is required
if len(pv.Spec.StorageClassName) == 0 {
What I actually want to verify is that the StorageClass exists and also has the topology-key set. I'm not sure if we can do that at this point though.
cc @vishh @kubernetes/sig-storage-api-reviews @kubernetes/sig-storage-pr-reviews
LGTM for types.go and validation
pkg/api/types.go
Outdated
@@ -370,6 +370,9 @@ type PersistentVolumeSource struct {
	// ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
	// +optional
	ScaleIO *ScaleIOVolumeSource
	// LocalStorage represents a local volume accessible only from one node
	// +optional
	LocalStorage *LocalStorageVolumeSource
LocalStorage -> Local? Or LocalVolume, following the PortworxVolume and PortworxVolumeSource precedent?
I could go with LocalVolume.
Hmm. I just realized that hostname is not necessarily the same as node name. So the approach of using the hostname label will not work, because the scheduler operates on node name, for which there is no label. We could add a node name label to the node object, but that seems kind of redundant. It would fit into the current design of using topologyKey, though.
pkg/apis/extensions/types.go
Outdated
@@ -918,6 +918,7 @@ var (
	Projected      FSType = "projected"
	PortworxVolume FSType = "portworxVolume"
	ScaleIO        FSType = "scaleIO"
	LocalStorage   FSType = "localStorage"
This file is for podsecuritypolicy, where you can control which pods are allowed to use what types of volumes. However, it only seems to work for direct volumes, and not persistent volumes. So I'm not sure if this is needed, unless there are plans to also enforce persistent volume types. @timothysc
+1 for removing it for now. Access can also be restricted using quota for PVs
@@ -65,6 +65,7 @@ func GetAllFSTypesAsSet() sets.String {
	string(extensions.Projected),
	string(extensions.PortworxVolume),
	string(extensions.ScaleIO),
	string(extensions.LocalStorage),
Same thing, may not be needed.
Force-pushed from effcd1b to 492e876 (compare)
pkg/api/types.go
Outdated
@@ -385,6 +388,9 @@ const (
	// BetaStorageClassAnnotation represents the beta/previous StorageClass annotation.
	// It's currently still used and will be held for backwards compatibility
	BetaStorageClassAnnotation = "volume.beta.kubernetes.io/storage-class"
	// AlphaStorageTopologyAnnotation is set in the StorageClass and gives the scheduler
	// a hint that the volumes in this StorageClass have a topology constraint
	AlphaStorageTopologyAnnotation = "storage.alpha.kubernetes.io/topology-key"
What is the value for this key? We can only recognize "zones" and "nodes" as of now across the system. We need to mention what the possible values are.
The value can be any label that exists on the PV + Node. I can add that to the comment.
The pod affinity feature specifies label selector and topologyKey as separate fields. Maybe we can do the same here.
{{LabelSelector: P1, TopologyKey: "node"}}
As discussed offline, I think we can use labels for expressing topology. Here is the algorithm that we discussed:
- PVs expose a single topology label, e.g. topology.kubernetes.io/node: foo or topology.k8s.io/zone: bar
- PVCs can already select using this topology label. Storage Classes should include a label selector that can specify topology constraints.

kind: StorageClass
spec:
  selector:
  - key: "topology.k8s.io/node"

or

kind: StorageClass
spec:
  selector:
  - key: "topology.k8s.io/zone"
    operator: In
    values:
    - bar-zone

- Nodes will expose all aspects of topology via consistent label keys.

kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.k8s.io/zone: bar

- The scheduler can then combine the NodeSelector on the pod with the selector on the StorageClass while identifying Nodes that can meet the storage locality requirements.

This method would require using consistent label keys across nodes and PVs. I hope that's not a non-starter.
cc @thockin Does this align with your thoughts on topology?
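The matching step the scheduler would perform under this proposal can be sketched as a plain label comparison. `nodeMatchesTopology` is an illustrative helper under the assumptions above, not real scheduler code.

```go
package main

import "fmt"

// nodeMatchesTopology checks that every required topology key/value pair
// (e.g. from a StorageClass selector) appears in the node's labels.
func nodeMatchesTopology(nodeLabels, required map[string]string) bool {
	for k, v := range required {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{
		"topology.kubernetes.io/node": "foo",
		"topology.k8s.io/zone":        "bar",
	}
	fmt.Println(nodeMatchesTopology(node, map[string]string{"topology.k8s.io/zone": "bar"}))   // true
	fmt.Println(nodeMatchesTopology(node, map[string]string{"topology.k8s.io/zone": "other"})) // false
}
```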
I'm not sure about the behavior of just specifying the key without an operator or value. For pod affinity, they actually have two fields, a label selector and a topologyKey. The topologyKey field means to match the labels between two objects (in this case, PV and Node). Whereas the selector is applying the filters to a single object.
I also spoke with @lavalamp about using labels to express the topology constraints. Would we ever want to express in the PV more than a single label/value? Like, this PV can be accessed on node1,2, and 3? Or define a new single label "mylabel=foo", and put that same label/value on node1,2,3?
Sorry, maybe label selector is not the right keyword. We definitely need a better selection scheme, similar to that of Affinity/Anti-affinity. I think we need a value in the case of storage, in addition to just a key.

> Like, this PV can be accessed on node1, 2, and 3

In such a scenario, a PV can include multiple "node" topology labels.
I'm going to remove this annotation from this PR for now because there is still more discussion on this topic that we have to do, and I don't want it to block everything else. Let's use #43504 to continue the discussion.
pkg/api/types.go
Outdated
	NodeName string
	// Fs represents a filesystem-based local volume
	// +optional
	Fs *LocalFsVolume
Did you consider having the "access types" (block or fs) as a separate field instead of this approach?
type LocalVolumeSource struct {
    NodeName   string
    Path       string
    AccessType string // possible values are "block" or "fs"
}

I see AccessType becoming a unique field in Storage Class and PVC as well, so we can have a consistent API.
@jsafrane prefers not to overload the Path field. It can also make sense to have two separate types if they end up requiring different fields to be specified.
If for block support, we end up adding some access type field, I think it would make sense for that to be in the PV and not the specific volume type.
Why is path to a block device file not ideal?
Isn't the accessType tied to the volume? A PV cannot expose a block device as a filesystem without formatting it. So wouldn't a PV still want to know what the access type is for an underlying volume?
Path to a block device makes sense, but we're not sure yet if there may be any other additional fields required. So we prefer to have two different types to make field expansion easier in the future.
We are looking at exposing PVs as block devices in the future, so I think having that type field in the PV would make more sense.
> but we're not sure yet if there may be any other additional fields required

That's OK. Path will be forwards compatible, right?

> We are looking at exposing PVs as block devices in the future

If the assumption is that PV is representing an attribute of the underlying storage type, then that's OK. In theory, a storage type should specify its access type. A PV is merely a wrapper around an existing storage type.
Is this a hypothetical scenario or do you have concrete examples?
It's hypothetical, as we don't have a detailed investigation or plan for block support. But I think it is a cleaner API to clearly separate the two types of sources, and also makes future expansion easier.
Can we sync with Saad on CSI? Overloading path for block devices is exactly what he's planning, I think.
@saad-ali can you weigh in here? If you are also planning to overload the path for block and file, then I should do the same here to be consistent.
> Is this a hypothetical scenario or do you have concrete examples?

The local volume proposal wants to cover both devices mounted outside of Kubernetes and block devices mounted by Kubernetes (or presented as raw devices to pods somehow). To me, these are two quite different volumes. The first one needs a path to a local directory; the second one needs a path to a device name (which is not a directory) and at least an fsType that should be created there if it's going to be used as a regular PV (i.e. mounted by Kubernetes). Hence separate structures Fs and Block (in the future).
Overloading Path with both a mount directory and a device path looks really ugly to me.
pkg/api/validation/validation.go
Outdated
}
allErrs = append(allErrs, validateLocalVolumeSource(pv.Spec.LocalVolume, specPath.Child("localVolume"))...)
// StorageClass is required
if len(pv.Spec.StorageClassName) == 0 {
Why is storage class required?
Without a storageclass, the scheduler won't know to look for a storage topology and the pod will get scheduled to any node.
How will storage topology be reflected in PVCs? I feel we need to include support for PVCs without a storage class too. Am I the only one wondering whether this is necessary?
The main concern I have is that PVC is a user construct, and the user shouldn't care about the underlying storage topology. We could push towards a model where StorageClass is required in order to express more advanced storage concepts.
Should we "Recommend" instead of "push"?
Yes, this is a valid use case. But what I think we are lacking here is a way for the user or admin to unbind a PV and PVC such that they can be put back into the scheduling pool. Otherwise we end up exhausting all scheduling capabilities.
So the deletion flow is as follows:
- User deletes the pod
- User deletes the PVC
- If the PV reclaim policy was Retain, then the PV stays in the Released state, and the user has to clean up the PV themselves, delete it, and recreate it.
- If the PV reclaim policy was Delete, then the PV goes into the Released state
- The local storage DaemonSet will clean up Released PVs and then create new PVs again.
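The deletion flow above can be condensed into a tiny helper for clarity. All names here are illustrative, not the real controller code.

```go
package main

import "fmt"

type ReclaimPolicy string

const (
	Retain ReclaimPolicy = "Retain"
	Delete ReclaimPolicy = "Delete"
)

// nextStepAfterClaimDeleted describes what happens to a local PV once its
// pod and PVC have been deleted, depending on the reclaim policy.
func nextStepAfterClaimDeleted(p ReclaimPolicy) string {
	switch p {
	case Retain:
		return "user cleans up the PV contents, deletes the PV, and recreates it"
	case Delete:
		return "PV goes Released; the local storage DaemonSet cleans it up and creates a new PV"
	default:
		return "unknown reclaim policy"
	}
}

func main() {
	fmt.Println("Retain:", nextStepAfterClaimDeleted(Retain))
	fmt.Println("Delete:", nextStepAfterClaimDeleted(Delete))
}
```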
👌 thx for the quick turnaround.
We could put the topologyKey in the PVC instead, but I prefer the idea of the StorageClass encapsulating the topology constraints so that the user does not need to be aware.
A user can still manually bind local PVs to PVCs without storage classes. Why force every local PV to expose a storage class?
I'm personally not a fan of tying storage classes to PVs during static provisioning.
In the future, will it be possible to keep these operations separate?
- Provisioning local volumes
- Defining storage classes for grouping local volumes
I am ok with not requiring StorageClass for the case where the user manually prebinds PVCs to PVs. What I do not want is the user to have to specify the topologyKey in the PVC instead of StorageClass to trigger the new scheduling/binding flow.
@@ -1413,6 +1413,12 @@ message LoadBalancerStatus {
	repeated LoadBalancerIngress ingress = 1;
}

// LocalFsVolume represents a filesystem-based local volume
Are the staging artifacts expected to be committed? cc @caesarxuchao
Yep. This file is copied from k8s.io/kubernetes/pkg/api/v1/generated.proto. If they are out of sync, verification-all.sh will fail.
pkg/api/types.go
Outdated
@@ -388,8 +391,24 @@ const (

	// MountOptionAnnotation defines mount option annotation used in PVs
	MountOptionAnnotation = "volume.beta.kubernetes.io/mount-options"

	// StorageTopologyConstraint defines topology constraints of a PV
	// Value is type []TopologyConstraint
I'm thinking of replacing this with the NodeAffinity type instead, since I think this can be expressed through NodeSelector. Then we don't have to add this new type for topology constraints. For alpha at least and maybe even beta, we will only implement the RequiredDuringSchedulingIgnoredDuringExecution NodeSelector part and ignore/error on the other fields.
What do you guys think?
+1 for using NodeAffinity with additional documentation and validation. In any case, I'd stick to annotations for v1.7.
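Assuming the annotation would carry a JSON-encoded NodeAffinity (an assumption for this sketch), the alpha-level matching of a PV's required terms against node labels might look like the following. The types and the In-only evaluation are minimal stand-ins for the real API types, matching the suggestion to support just the RequiredDuringSchedulingIgnoredDuringExecution NodeSelector and error on everything else.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the Kubernetes NodeSelector types.
type NodeSelectorRequirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions"`
}

// matchTerm implements only the "In" operator; any other operator is treated
// as a non-match, mirroring the "ignore/error on other fields" suggestion.
func matchTerm(term NodeSelectorTerm, nodeLabels map[string]string) bool {
	for _, req := range term.MatchExpressions {
		if req.Operator != "In" {
			return false
		}
		matched := false
		for _, v := range req.Values {
			if nodeLabels[req.Key] == v {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	raw := `{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["node-1"]}]}`
	var term NodeSelectorTerm
	if err := json.Unmarshal([]byte(raw), &term); err != nil {
		panic(err)
	}
	fmt.Println(matchTerm(term, map[string]string{"kubernetes.io/hostname": "node-1"})) // true
	fmt.Println(matchTerm(term, map[string]string{"kubernetes.io/hostname": "node-2"})) // false
}
```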
Force-pushed from d4273de to cd3c878 (compare)
/lgtm
/approve
Includes:
- A new volume type, LocalVolumeSource. This only supports file-based local volumes for now.
- New alpha annotation in PV: NodeAffinity
- Validation + tests for specifying LocalVolumeSource and PV NodeAffinity
- Alpha feature gate
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, thockin, vishh
Automatic merge from submit-queue
What this PR does / why we need it:
API changes to support persistent local volumes, as described here
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): Part of #43640
Special notes for your reviewer:
There were a few items I was concerned about. Will add review comments in those places.
Release note:
NONE
Note will be added in subsequent PR with the volume plugin changes