
LocalStorage api #44640

Merged: 2 commits into kubernetes:master from msau42:local-pv-api on May 22, 2017

Conversation

@msau42 (Member) commented Apr 18, 2017:

What this PR does / why we need it:
API changes to support persistent local volumes, as described here

Which issue this PR fixes: Part of #43640

Special notes for your reviewer:
There were a few items I was concerned about. Will add review comments in those places.

Release note:

NONE

A note will be added in a subsequent PR with the volume plugin changes.

@k8s-ci-robot added the cncf-cla: yes label on Apr 18, 2017
@k8s-reviewable: This change is Reviewable

@k8s-github-robot added the kind/api-change, size/XXL, and release-note-label-needed labels on Apr 18, 2017
pkg/api/types.go Outdated
@@ -302,6 +302,11 @@ type VolumeSource struct {
// ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
// +optional
ScaleIO *ScaleIOVolumeSource
// LocalStorage represents a local volume accessible only from one node
// This volume type cannot be used as a direct volume source and is only required here
msau42 (Member, Author):

The volume plugin interface requires the volume type to be part of the VolumeSource in order to do volume reconstruction. But for local storage, we do not want users to be able to specify it as a VolumeSource. I'm not sure if it's possible to change the volume reconstruction. @saad-ali

msau42 (Member, Author):

Just checked with Saad. ConstructVolumeSpec can actually support PersistentVolumeSource too, so then I can remove this from the Volume Source. :-D

allErrs = append(allErrs, field.Required(specPath.Child("storageClassName"), "LocalStorage volumes require a StorageClass"))
}
// hostname label is required
hostname, found := pv.Labels[metav1.LabelHostname]
msau42 (Member, Author):

I'm not sure if users would want to use other labels as the topology key for local volumes. For the recommended use case, if they accidentally forget the hostname label, then it won't trigger the node stickiness, which could be a dangerous situation (pod could mount any random path on any node).

msau42 (Member, Author), Apr 18, 2017:

Discussed with @vishh, it would be better for the system to not depend on any specific labels. So we should not enforce that a particular label is set. This means that any admin/user that is using this feature really needs to understand what they're doing because mistakes, like forgetting a label, can have serious consequences. As part of this feature, we are going to provide a default external provisioner that will set the appropriate labels and be thoroughly tested.

Another possibility is to add an explicit node field to the LocalStorageVolumeSource, and then have an admission controller that will add the hostname label to the PV object. This is similar to how it works with zones on PVs.
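A minimal sketch of that admission-controller idea, assuming a hypothetical explicit node field on the local volume source and the standard kubernetes.io/hostname label key; this is illustrative only, not the merged implementation:

package main

import "fmt"

// labelHostname is the standard hostname label key used on Node objects.
const labelHostname = "kubernetes.io/hostname"

// localPV is a simplified, hypothetical stand-in for a PersistentVolume
// whose local volume source carries an explicit NodeName field.
type localPV struct {
	Labels   map[string]string
	NodeName string
}

// addHostnameLabel copies the explicit node field into the hostname label,
// mirroring how zone labels are added to PVs by an admission controller.
func addHostnameLabel(pv *localPV) {
	if pv.NodeName == "" {
		return
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels[labelHostname] = pv.NodeName
}

func main() {
	pv := &localPV{NodeName: "node-1"}
	addHostnameLabel(pv)
	fmt.Println(pv.Labels) // map[kubernetes.io/hostname:node-1]
}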

Member:

+1 for admission controller.

We should trust that the admin has set the correct labels or that the admission plugin is running. To be extra careful, the SetUp() call of the LocalStorage volume plugin could check that it is running on the right node before mounting.
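For illustration, the safety check suggested above could look roughly like the following sketch; the function and parameter names are made up, and a real plugin would obtain the node names from the PV spec and the kubelet's volume host:

package main

import "fmt"

// verifyLocalVolumeNode refuses to mount a local volume whose recorded node
// does not match the node the plugin is currently running on.
func verifyLocalVolumeNode(pvNodeName, currentNodeName string) error {
	if pvNodeName != currentNodeName {
		return fmt.Errorf("local volume belongs to node %q but SetUp() is running on node %q", pvNodeName, currentNodeName)
	}
	return nil
}

func main() {
	if err := verifyLocalVolumeNode("node-a", "node-b"); err != nil {
		fmt.Println("refusing to mount:", err) // mismatch detected before mounting
	}
}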

numVolumes++
allErrs = append(allErrs, validateLocalStorageVolumeSource(pv.Spec.LocalStorage, specPath.Child("localStorage"))...)
// StorageClass is required
if len(pv.Spec.StorageClassName) == 0 {
msau42 (Member, Author):

What I actually want to verify is that the StorageClass exists and also has the topology-key set. I'm not sure if we can do that at this point though.

@msau42 (Member, Author) commented Apr 18, 2017:

cc @vishh @kubernetes/sig-storage-api-reviews @kubernetes/sig-storage-pr-reviews

@jsafrane (Member):
LGTM for types.go and validation

pkg/api/types.go Outdated
@@ -370,6 +370,9 @@ type PersistentVolumeSource struct {
// ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
// +optional
ScaleIO *ScaleIOVolumeSource
// LocalStorage represents a local volume accessible only from one node
// +optional
LocalStorage *LocalStorageVolumeSource
Member:

LocalStorage -> Local? Or LocalVolume, following PortworxVolume and PortworxVolumeSource precedent?

msau42 (Member, Author):

I could go with LocalVolume.

@msau42 (Member, Author) commented Apr 20, 2017:

Hmm. I just realized that hostname is not necessarily the same as node name. So the approach of using the hostname label will not work, because the scheduler operates on node names, and there is no label for node name.

We could add a node name label to the node object, but that seems kind of redundant. It would fit into the current design of using topologyKey though.

@@ -918,6 +918,7 @@ var (
Projected FSType = "projected"
PortworxVolume FSType = "portworxVolume"
ScaleIO FSType = "scaleIO"
LocalStorage FSType = "localStorage"
msau42 (Member, Author):

This file is for PodSecurityPolicy, where you can control which volume types pods are allowed to use. However, it only seems to apply to direct volumes, not persistent volumes. So I'm not sure if this is needed, unless there are plans to also enforce persistent volume types. @timothysc

Contributor:

+1 for removing it for now. Access can also be restricted using quota for PVs

@@ -65,6 +65,7 @@ func GetAllFSTypesAsSet() sets.String {
string(extensions.Projected),
string(extensions.PortworxVolume),
string(extensions.ScaleIO),
string(extensions.LocalStorage),
msau42 (Member, Author):

Same thing, may not be needed.

@msau42 force-pushed the local-pv-api branch 2 times, most recently from effcd1b to 492e876 on April 21, 2017 16:37
@msau42 mentioned this pull request on Apr 25, 2017
@k8s-github-robot added the needs-rebase label on Apr 26, 2017
pkg/api/types.go Outdated
@@ -385,6 +388,9 @@ const (
// BetaStorageClassAnnotation represents the beta/previous StorageClass annotation.
// It's currently still used and will be held for backwards compatibility
BetaStorageClassAnnotation = "volume.beta.kubernetes.io/storage-class"
// AlphaStorageTopologyAnnotation is set in the StorageClass and gives the scheduler
// a hint that the volumes in this StorageClass have a topology constraint
AlphaStorageTopologyAnnotation = "storage.alpha.kubernetes.io/topology-key"
Contributor:

What is the value for this key? We can only recognize "zones" and "nodes" as of now across the system. We need to mention what the possible values are.

msau42 (Member, Author):

The value can be any label that exists on the PV + Node. I can add that to the comment.

The pod affinity feature specifies label selector and topologyKey as separate fields. Maybe we can do the same here.
{{LabelSelector: P1, TopologyKey: "node"}}
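For reference, the existing pod affinity API already splits the two concerns mentioned above into a label selector and a topology key. A small sketch using current import paths (which postdate this 2017 PR):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The selector filters objects by their labels; the topology key names
	// the label whose value must match between objects.
	term := corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "database"},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
	fmt.Printf("%+v\n", term)
}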

Contributor:

As discussed offline, I think we can use labels for expressing topology. Here is the algorithm that we discussed:

  1. PVs expose a single topology label, e.g. topology.kubernetes.io/node: foo or topology.k8s.io/zone: bar.
  2. PVCs can already select using this topology label. StorageClasses should include a label selector that can specify topology constraints:

kind: StorageClass
spec:
  selector:
    - key: "topology.k8s.io/node"

or

kind: StorageClass
spec:
  selector:
    - key: "topology.k8s.io/zone"
      operator: In
      values:
      - bar-zone

  3. Nodes will expose all aspects of topology via consistent label keys:

kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.k8s.io/zone: bar

  4. The scheduler can then combine the NodeSelector on the pod with the selector on the StorageClass while identifying Nodes that can meet the storage locality requirements.

This method would require using consistent label keys across Nodes and PVs. I hope that's not a non-starter.

cc @thockin Does this align with your thoughts on topology?

msau42 (Member, Author):

I'm not sure about the behavior of just specifying the key without an operator or value. For pod affinity, they actually have two fields, a label selector and a topologyKey. The topologyKey field means to match the labels between two objects (in this case, PV and Node). Whereas the selector is applying the filters to a single object.

I also spoke with @lavalamp about using labels to express the topology constraints. Would we ever want to express more than a single label/value in the PV? Like, this PV can be accessed on nodes 1, 2, and 3? Or define a new single label "mylabel=foo" and put that same label/value on nodes 1, 2, and 3?

Contributor:

Sorry, maybe label selector is not the right keyword. We definitely need a better selection scheme, similar to that of Affinity/Anti-affinity. I think we need a value in the case of storage, in addition to just a key.

Like, this PV can be accessed on node1,2, and 3

In such a scenario, a PV can include multiple "node" topology labels.

msau42 (Member, Author):

I'm going to remove this annotation from this PR for now because there is still more discussion on this topic that we have to do, and I don't want it to block everything else. Let's use #43504 to continue the discussion.

pkg/api/types.go Outdated
NodeName string
// Fs represents a filesystem-based local volume
// +optional
Fs *LocalFsVolume
Contributor:

Did you consider having the "access types" (block or fs) as a separate field instead of this approach?

type LocalVolumeSource struct {
 NodeName string
 Path string
 AccessType string // possible values are "block" or "fs"
}

I see AccessType becoming a unique field in StorageClass and PVC as well, so we can have a consistent API.

msau42 (Member, Author):

@jsafrane prefers not to overload the Path field. It can also make sense to have two separate types if they end up requiring different fields to be specified.

If for block support, we end up adding some access type field, I think it would make sense for that to be in the PV and not the specific volume type.

Contributor:

Why is a path to a block device file not ideal?
Isn't the accessType tied to the volume? A PV cannot expose a block device as a filesystem without formatting it. So wouldn't a PV still want to know what the access type of the underlying volume is?

msau42 (Member, Author):

Path to a block device makes sense, but we're not sure yet if there may be any other additional fields required. So we prefer to have two different types to make field expansion easier in the future.

We are looking at exposing PVs as block devices in the future, so I think having that type field in the PV would make more sense.

Contributor:

but we're not sure yet if there may be any other additional fields required

That's OK. Path will be forward compatible, right?

We are looking at exposing PVs as block devices in the future

If the assumption is that the PV represents an attribute of the underlying storage type, then that's OK. In theory, a storage type should specify its access type. A PV is merely a wrapper around an existing storage type.

Contributor:

Is this a hypothetical scenario or do you have concrete examples?

msau42 (Member, Author):

It's hypothetical, as we don't have a detailed investigation or plan for block support. But I think it is a cleaner API to clearly separate the two types of sources, and also makes future expansion easier.

Member:

Can we sync with Saad on CSI? Overloading path for block devices is exactly what he's planning, I think.

msau42 (Member, Author):

@saad-ali can you weigh in here? If you are also planning to overload the path for block and file, then I should do the same here to be consistent.

Member:

Is this a hypothetical scenario or do you have concrete examples?

The local volume proposal wants to cover both devices mounted outside of Kubernetes and block devices mounted by Kubernetes (or presented as raw devices to pods somehow). To me, these are two quite different volumes. The first needs a path to a local directory; the second needs a path to a device (which is not a directory) and at least an fsType to create there if it's going to be used as a regular PV (i.e. mounted by Kubernetes). Hence the separate FS and Block structures (the latter in the future).

Overloading Path with both a mount directory and a device path looks really ugly to me.
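A sketch of the two-struct shape described above; the names follow the discussion, only the filesystem variant was part of this PR, and the block fields are hypothetical:

package api

// LocalFsVolume points at a directory that is already mounted on the node.
type LocalFsVolume struct {
	// Path is the full path to the local directory.
	Path string
}

// LocalBlockVolume would point at a raw device and carry block-specific
// fields; it is not part of this PR and is shown only for comparison.
type LocalBlockVolume struct {
	// Path is the full path to the block device (not a directory).
	Path string
	// FSType is the filesystem to create if the device is mounted as a
	// regular PV.
	FSType string
}

// LocalVolumeSource keeps the two access types as separate sub-structs so
// that each can grow its own fields later.
type LocalVolumeSource struct {
	Fs    *LocalFsVolume
	Block *LocalBlockVolume
}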

}
allErrs = append(allErrs, validateLocalVolumeSource(pv.Spec.LocalVolume, specPath.Child("localVolume"))...)
// StorageClass is required
if len(pv.Spec.StorageClassName) == 0 {
Contributor:

Why is storage class required?

msau42 (Member, Author):

Without a StorageClass, the scheduler won't know to look for a storage topology constraint, and the pod will get scheduled to any node.

Contributor:

How will storage topology be reflected in PVCs? I feel we need to include support for PVCs without a storage class too. Am I the only one who thinks this is necessary?

msau42 (Member, Author):

The main concern I have is that the PVC is a user construct, and the user shouldn't care about the underlying storage topology. We could push towards a model where StorageClass is required in order to express more advanced storage concepts.

Contributor:

Should we "Recommend" instead of "push"?

Contributor:

Yes, this is a valid use case. But what I think we are lacking here is a way for a user or admin to unbind a PV and PVC so that they can be put back into the scheduling pool. Otherwise, won't we end up exhausting all scheduling capacity?

msau42 (Member, Author):

So the deletion flow is as follows:

  1. User deletes the pod.
  2. User deletes the PVC.
  3. If the PV reclaim mode was Retain, then the state becomes "Retain", and the user has to clean up the PV themselves, delete it, and recreate it.
  4. If the PV reclaim mode was Delete, then the PV goes into the Released state.
  5. The local storage DaemonSet will clean up Released PVs and then create new PVs again.
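For reference, the two reclaim behaviors in the flow above map onto the existing PV reclaim policy constants; a sketch with current import paths:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Retain: the PV is kept after the PVC is deleted; the admin cleans it
	// up, deletes it, and recreates it (step 3 above).
	retain := corev1.PersistentVolumeReclaimRetain

	// Delete: the PV goes into the Released state and an external cleaner,
	// such as the local storage DaemonSet, recycles it (steps 4-5 above).
	del := corev1.PersistentVolumeReclaimDelete

	fmt.Println(retain, del) // Retain Delete
}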

Contributor:

👌 Thanks for the quick turnaround.

Contributor:

We could put the topologyKey in the PVC instead, but I prefer the idea of the StorageClass encapsulating the topology constraints so that the user does not need to be aware.

A user can still manually bind local PVs to PVCs without storage classes. Why force every local PV to expose a storage class?
I'm personally not a fan of tying storage classes to PVs during static provisioning.
In the future, will it be possible to keep these operations separate?

  1. Provisioning local volumes
  2. Defining storage classes for grouping local volumes

msau42 (Member, Author):

I am ok with not requiring StorageClass for the case where the user manually prebinds PVCs to PVs. What I do not want is the user to have to specify the topologyKey in the PVC instead of StorageClass to trigger the new scheduling/binding flow.


@@ -1413,6 +1413,12 @@ message LoadBalancerStatus {
repeated LoadBalancerIngress ingress = 1;
}

// LocalFsVolume represents a filesystem-based local volume
Contributor:

Are the staging artifacts expected to be committed? cc @caesarxuchao

Member:

Yep. This file is copied from k8s.io/kubernetes/pkg/api/v1/generated.proto. If they are out of sync, verification-all.sh will fail.

@k8s-github-robot removed the needs-rebase label on Apr 27, 2017
@msau42 (Member, Author) commented Apr 27, 2017:

@k8s-bot unit test this

@k8s-github-robot removed the needs-rebase label on May 18, 2017
@msau42 (Member, Author) commented May 18, 2017:

@k8s-bot kops aws e2e test this

@k8s-github-robot added the needs-rebase label on May 18, 2017
pkg/api/types.go Outdated
@@ -388,8 +391,24 @@ const (

// MountOptionAnnotation defines mount option annotation used in PVs
MountOptionAnnotation = "volume.beta.kubernetes.io/mount-options"

// StorageTopologyConstraint defines topology constraints of a PV
// Value is type []TopologyConstraint
msau42 (Member, Author):

I'm thinking of replacing this with the NodeAffinity type instead, since I think this can be expressed through NodeSelector. Then we don't have to add this new type for topology constraints. For alpha at least and maybe even beta, we will only implement the RequiredDuringSchedulingIgnoredDuringExecution NodeSelector part and ignore/error on the other fields.

What do you guys think?
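For concreteness, here is a sketch of the NodeAffinity shape this would reuse, limited to the RequiredDuringSchedulingIgnoredDuringExecution NodeSelector; it uses current import paths, and the hostname key and value are just examples:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Only the required NodeSelector would be honored for alpha; the
	// preferred terms would be ignored or rejected.
	affinity := corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: corev1.NodeSelectorOpIn,
					Values:   []string{"node-1"},
				}},
			}},
		},
	}
	fmt.Printf("%+v\n", affinity)
}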

@vishh (Contributor) commented May 18, 2017 via email

@msau42 force-pushed the local-pv-api branch 2 times, most recently from d4273de to cd3c878 on May 18, 2017 22:27
@k8s-github-robot removed the needs-rebase label on May 18, 2017
@msau42 (Member, Author) commented May 19, 2017:

I updated to use NodeAffinity in the annotation. @vishh @thockin PTAL

@k8s-github-robot added the needs-rebase label on May 20, 2017
@vishh (Contributor) commented May 20, 2017:

/lgtm

@k8s-ci-robot added the lgtm label on May 20, 2017
@thockin (Member) commented May 20, 2017:

/approve

@k8s-github-robot added the approved label on May 20, 2017
Includes:
- A new volume type, LocalVolumeSource. This only supports file-based local volumes for now.
- New alpha annotation in PV: NodeAffinity
- Validation + tests for specifying LocalVolumeSource and PV NodeAffinity
- Alpha feature gate
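For illustration, a hedged sketch of how a serialized NodeAffinity could be attached to a PV via an alpha annotation, per the commit summary above; the annotation key and object shape shown are assumptions for illustration, not taken from the merged code:

package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Build a node affinity restricting the PV to a single node.
	affinity := corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: corev1.NodeSelectorOpIn,
					Values:   []string{"node-1"},
				}},
			}},
		},
	}
	data, err := json.Marshal(&affinity)
	if err != nil {
		panic(err)
	}

	// Store the serialized affinity in an alpha annotation on the PV.
	// The annotation key here is assumed, not taken from the merged code.
	pv := corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "example-local-pv",
			Annotations: map[string]string{"volume.alpha.kubernetes.io/node-affinity": string(data)},
		},
	}
	fmt.Println(pv.Annotations)
}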
@k8s-github-robot removed the lgtm and needs-rebase labels on May 21, 2017
@msau42 (Member, Author) commented May 22, 2017:

@k8s-bot pull-kubernetes-federation-e2e-gce test this

@thockin (Member) commented May 22, 2017:

/lgtm

@k8s-ci-robot added the lgtm label on May 22, 2017
@k8s-github-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, thockin, vishh

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@saad-ali removed the do-not-merge label on May 22, 2017
@k8s-ci-robot (Contributor) commented May 22, 2017:

@msau42: The following test(s) failed:

Test name: pull-kubernetes-federation-e2e-gce
Commit: 936269a
Rerun command: @k8s-bot pull-kubernetes-federation-e2e-gce test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot: Automatic merge from submit-queue

@k8s-github-robot merged commit f6b3d08 into kubernetes:master on May 22, 2017
@msau42 deleted the local-pv-api branch on August 15, 2017 01:40
Labels: approved, cncf-cla: yes, kind/api-change, lgtm, release-note-none, size/XXL