
Migrate PV binding from PV.Spec.ClaimRef to Binding subresource #7438

Closed
markturansky opened this issue Apr 28, 2015 · 31 comments · Fixed by #7529
Labels
priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@markturansky
Contributor

@bgrant0607 this is a breaking API change.

@thockin @smarterclayton

From #6105

This should use a "binding" sub-resource, as with pods. We've drawn the analogy of claims and PVs being like pods and nodes - we should follow the same pattern. That seems like a significant change, so I am willing to proceed on this path as long as we have a strong promise to fix it ASAP. This is, strictly speaking, an API change (albeit constrained to ~1 consumer - you).

As per our IRC chat, there are a lot of things this might change:

Do we bind the Claim to the PV or vice-versa?
Do we move the Claim.VolumeRef to Spec?
Do we allow users to ask for a specific PV?

and

As per the binding sub-resource discussion, I sense that we have this backwards. PersVolume.Status should host a ClaimRef and PersVolumeClaim.Spec should host a VolumeName (not ref, I think).

Per @thockin's questions above...

I think the PersistentVolume should own the binding. It is 1:1 with Claims and it is more natural to query for volumes and see what's bound than to traverse all namespaces to see if a claim exists for a given volume.

Claim.Status.VolumeRef (potentially changed to VolumeName, since Namespace is unneeded) can stay where it is. That status is reconstructable by observing the bound volume.

Allowing users to ask for a specific PV breaks the separation between them. We can steer users towards some volumes (through admission control that restricts some types/sizes of volumes to certain namespaces) but still not give them the exact volume.

I can add the Binding subresource similar to pod, but I need some help understanding how to use it effectively. It looks like Pod's use of Binding is still in flux (see https://github.com/GoogleCloudPlatform/kubernetes/blob/master/plugin/pkg/scheduler/factory/factory.go#L294).

Technical advice and guidance is required :)

@smarterclayton
Contributor

It looks like Pod's use of Binding is still in flux (see https://github.com/GoogleCloudPlatform/kubernetes/blob/master/plugin/pkg/scheduler/factory/factory.go#L294).

Not in flux, just no one went back to fix the todo yet.

@markturansky
Contributor Author

Looking through pods, it appears the binding is stored on pod.Spec.Host inside a GuaranteedUpdate helper. Data-wise, this is 1:1 with pv.Spec.ClaimRef.

The introduction of BindingREST for pods is meant to narrow who can call Bind? This would be similar to the status subresource, which prevents users, for example, from setting status.

If my data is correct, then this is not a breaking API change. This is an improvement in the client for PVs and an improvement in the transaction (the update helper).

@smarterclayton Does that sound correct to you? If it does, I think this lessens the immediacy of this change, even as it remains a good and necessary next step for PV.
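
A minimal sketch of that pattern, assuming simplified stand-in types rather than the real API objects or the actual GuaranteedUpdate helper: a Binding names the claim and the volume, and the bind is applied inside a read-modify-write loop that retries on optimistic-concurrency conflicts.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the real objects; field names follow the discussion.
type PersistentVolume struct {
	Name     string
	ClaimRef string // empty means unbound
	version  int    // stands in for resourceVersion (optimistic concurrency)
}

type Binding struct {
	ClaimName  string // "namespace/name" of the claim
	VolumeName string // name of the PV to bind it to
}

var errConflict = errors.New("resource version conflict")

// store is a toy in-memory registry, used only to make the sketch runnable.
type store struct{ pvs map[string]PersistentVolume }

func (s *store) get(name string) (PersistentVolume, error) {
	pv, ok := s.pvs[name]
	if !ok {
		return PersistentVolume{}, fmt.Errorf("pv %q not found", name)
	}
	return pv, nil
}

func (s *store) update(pv PersistentVolume) error {
	if s.pvs[pv.Name].version != pv.version {
		return errConflict // another writer got there first
	}
	pv.version++
	s.pvs[pv.Name] = pv
	return nil
}

// bind applies the binding inside a read-modify-write retry loop, which is
// the essence of what a GuaranteedUpdate-style helper provides.
func bind(s *store, b Binding) error {
	for attempt := 0; attempt < 5; attempt++ {
		pv, err := s.get(b.VolumeName)
		if err != nil {
			return err
		}
		if pv.ClaimRef != "" && pv.ClaimRef != b.ClaimName {
			return fmt.Errorf("pv %q already bound to %q", pv.Name, pv.ClaimRef)
		}
		pv.ClaimRef = b.ClaimName
		switch err := s.update(pv); err {
		case nil:
			return nil
		case errConflict:
			continue // re-read the latest version and try again
		default:
			return err
		}
	}
	return errors.New("too many conflicts while binding")
}

func main() {
	s := &store{pvs: map[string]PersistentVolume{"pv0001": {Name: "pv0001"}}}
	if err := bind(s, Binding{ClaimName: "default/myclaim", VolumeName: "pv0001"}); err != nil {
		fmt.Println("bind failed:", err)
		return
	}
	fmt.Println("pv0001 bound to", s.pvs["pv0001"].ClaimRef)
}
```

The real helper works against etcd resource versions, but the retry-on-conflict shape is the part that matters for the transaction improvement mentioned above.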

@smarterclayton
Contributor

The introduction of BindingREST for pods is meant to narrow who can call Bind? This would be similar to the status subresource, which prevents users, for example, from setting status.

Yes.

For security reasons PV and binding have almost the same users (only admins can bind or create volumes) so establishing the "correct" API pattern without breaking backwards compat is fine (same thing we said about pod bindings, basically).

@markturansky
Contributor Author

@thockin's argument is also that I have the bind on the wrong object in the paradigm.

The only difference is that Node:Pod is 1:N while PV:Claim is 1:1. Querying for volumes with non-nil ClaimRefs == querying for Claims with non-nil VolumeNames. I think it's a wash, honestly.

An array of pods on Node would be ugly. A single ref on either PV or PVC is equally clean, IMHO.

There's an argument to be made to limit exposure and change the ref to claim.Status.VolumeName. This is an easy change.

@a-robinson added priority/backlog and team/cluster labels Apr 28, 2015
@thockin
Member

thockin commented Apr 28, 2015

Looking through pods, it appears the binding is stored on pod.Spec.Host inside a GuaranteedUpdate helper. Data-wise, this is 1:1 with pv.Spec.ClaimRef.

If pod is to node as claim is to PV, then this is not right. The node is consumed by the pod. The PV is consumed by the claim. I'm not going to be dogmatic about the analogy because it is not 100%, but I want to understand WHY it has to be different, if we do it differently.

In pod-space, you assign the pod to the host. You assign the claim to the PV. The Node or PV is long-lived, the pod or claim is not. You're proposing the opposite - bind the PV to the claim, if I understand.

Now, there's a difference in that we don't offer an API to traverse from a node object to the pod objects that consume it.

As we move toward this for networking, you'll see something like a LoadBalancer object, and a Service gets assigned to a LoadBalancer. Which direction does binding go?

Allowing users to ask for a specific PV breaks the separation between them

I don't buy that. We allow users to choose a node. It's not the common case, but it exists for a reason.

@markturansky
Contributor Author

There is no reason the pod:node/claim:pv bind has to work differently. Perhaps this nuance wasn't discussed during API review because it's 1:1 whereas 1:N for pods and nodes is a much bigger difference.

I think the nuance you're defining is correct. It would be achieved exactly the same way as the current bind, just with the objects flipped, because it's 1:1. It is not an insignificant amount of rework.

The proposed API changes are:

  • move bind from pv.Spec.ClaimRef to claim.Spec.VolumeName
  • move pv.Spec.ClaimRef to pv.Status.ClaimRef
  • remove claim.Status.VolumeRef

Other improvements:

  • use GuaranteedUpdate when saving the bind
  • create BindingREST subresource similar to pod
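
To make the proposed field moves concrete, here is a rough before/after sketch with simplified illustrative structs (not the actual API types):

```go
package main

// Before: the PV owns the authoritative bind in its spec, and the claim's
// status mirrors it.
type PVSpecBefore struct {
	ClaimRef string // authoritative bind lives here today
}

type PVClaimStatusBefore struct {
	VolumeRef string // derived from the PV's ClaimRef
}

// After (as proposed above): the claim owns the bind in its spec, and the
// PV's status reports it, reconstructable by observing the bound claim.
type PVClaimSpecAfter struct {
	VolumeName string // authoritative bind moves here
}

type PVStatusAfter struct {
	ClaimRef string // observable status, rebuilt from the claims if lost
}

func main() {}
```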

@markturansky
Contributor Author

It's not just API changes, of course, but the binder as well. The volume plugin is easier.

@thockin
Member

thockin commented Apr 29, 2015

I am confused. This SOUNDS like what I wanted - which you were arguing against. Did I miss something?

To be clear: the Binding is POSTed to $PVClaim/binding, referencing the PV, yes? So the pod:node vs claim:pv analogy holds...

@markturansky
Contributor Author

@thockin Yes, these are the API changes you are wanting.

I understand the semantic difference between the two ways of binding. Personally, I don't think having the bind on PV is wrong. Having it on Claim is also not wrong. To me, they are the same because this is 1:1, which is very different than 1:N for node/pod. Moving it is a matter of "purity", I think, because they are the same technically.

I have no problem making these changes, though. I cringe at the rework but this won't be the first time I've had to rework something (nor the last).

@thockin
Member

thockin commented Apr 29, 2015

I think the most important part for the short term is getting the *Ref fields the way we want them. Once that is in, we can ungate it and make it operate through a Binding as a second lower-prio pass. Does that ease up on pressure at all? I just don't want it to become a "yeah, we should have done that, but never got around to it" thing.

@markturansky
Contributor Author

Agreed that the API changes are required first and the other improvements can happen subsequently.

@markturansky
Contributor Author

I've been thinking about the impact this change will have on the binder. I think it'll be simpler, actually. I retract my previous comments about the two being technically the same.

@bgrant0607
Member

Sorry I've been MIA on this.

One concern I have with flipping the reference direction: We need a protocol for removing or recycling a PV once a PVC on it has been deleted. We don't want to immediately make it available for re-binding. Where would that information be recorded? Status isn't sufficient, since status must be observable from somewhere else. With pods, this information is in the container runtime (and on the filesystem).

If we always wanted to delete the PV object upon removal of the PVC, we could record it in an existence dependency, as I proposed in #1535.

@davidopp
Member

davidopp commented May 7, 2015

/subscribe

@markturansky
Contributor Author

@bgrant0607 I see what you mean. With PV.Spec.ClaimRef, we know the volume was once bound. If that claim does not exist, we know the volume has been released. We can build Status from that.

Conversely, if we move the bind to claim.Spec.VolumeName and the claim is deleted, we can't rebuild PV's Status should that data be lost for some reason.

@markturansky
Contributor Author

If @bgrant0607 can convince @thockin that the current bind is correct, then the PV framework can be ungated. It is already merged with the bind on pv.Spec.ClaimRef.

@smarterclayton
Contributor

Can we reach closure tomorrow on a direction?

I would argue that claims and PVs should bond in both directions. The claim is released, and reclamation happens with the previous claim marked no longer in use. The PV is the entity that determines whether a claim is official - claims shouldn't be able to spuriously associate themselves with random PVs by guessing global names (I'm not allowed to create a claim bound to your PV unless the PV points to me).
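
A minimal sketch of that rule, using illustrative types and names rather than the actual binder code: a claim's bind only counts when the PV points back at it, so a claim cannot attach itself to an arbitrary PV just by guessing its global name.

```go
package main

import "fmt"

// Trimmed-down, illustrative object shapes; not the real API types.
type PVClaim struct {
	Namespace, Name string
	VolumeName      string // the PV this claim says it is bound to
}

type PV struct {
	Name     string
	ClaimRef string // "namespace/name" of the claim this PV has accepted
}

// boundTo reports whether the bind is mutual: the claim names the PV and the
// PV names the claim back. A claim that merely guesses a PV's name does not count.
func boundTo(claim PVClaim, pv PV) bool {
	return claim.VolumeName == pv.Name &&
		pv.ClaimRef == claim.Namespace+"/"+claim.Name
}

func main() {
	pv := PV{Name: "pv0001", ClaimRef: "default/myclaim"}
	legit := PVClaim{Namespace: "default", Name: "myclaim", VolumeName: "pv0001"}
	impostor := PVClaim{Namespace: "other", Name: "sneaky", VolumeName: "pv0001"}
	fmt.Println(boundTo(legit, pv), boundTo(impostor, pv)) // true false
}
```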

@markturansky
Contributor Author

@smarterclayton It currently works nearly exactly like you describe, except that the PVC is deleted instead of being flagged unused. The PV retains ClaimRef as the binding reference and the absence of the PVC indicates the volume was released, advancing pv.Status.Phase.

In another thread @bgrant0607 (and I think @thockin) were keen to make sure ClaimRef was not deleted by the binder on release, but that another process would handle reclamation and remove ClaimRef. I'm working on the recycler now.
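
As a rough illustration of that lifecycle, assuming simplified types and phase names rather than the real binder: a PV whose retained ClaimRef names a claim that no longer exists is treated as released, and a separate reclamation process takes it from there.

```go
package main

import "fmt"

// Illustrative shapes only; the real binder works against the API server.
type PV struct {
	Name     string
	ClaimRef string // retained as the binding reference even after the claim is gone
	Phase    string // e.g. "Available", "Bound", "Released" (names assumed for the sketch)
}

// syncReleased applies the rule described above: a bound PV whose ClaimRef
// names a claim that no longer exists has been released, so its phase advances
// and a separate reclamation/recycler process can pick it up.
func syncReleased(pv *PV, claimExists func(ref string) bool) {
	if pv.ClaimRef == "" {
		return // never bound, nothing to release
	}
	if !claimExists(pv.ClaimRef) && pv.Phase != "Released" {
		pv.Phase = "Released"
	}
}

func main() {
	existingClaims := map[string]bool{} // the claim has already been deleted
	pv := &PV{Name: "pv0001", ClaimRef: "default/myclaim", Phase: "Bound"}
	syncReleased(pv, func(ref string) bool { return existingClaims[ref] })
	fmt.Println(pv.Phase) // Released
}
```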

@bgrant0607
Member

This didn't get mentioned during the hangout. :-( The API churn right now is impossible to keep up with. Expect latency to be high on any API changes. We may just need to stack-rank them and resolve them in order.

I'd be happy with bi-directional. We're going to need something similar in order to support sole tenancy of users' pods on nodes. That's somehow going to need to be represented on the node once a pod requiring sole tenancy is bound to it. So, bind on PVC and it flows to PV (if allowed).

@bgrant0607
Member

A PV should have a way to "fail" a PVC that's not allowed, similar to how kubelet now can reject pods.

@markturansky
Contributor Author

So we're clear, is the existing implementation correct, then? The single bind from which I can build all other objects and status is pv.Spec.ClaimRef, from which I derive pvc.Status.VolumeRef.

I'm working through the recycler implementation and the bind as described above is a natural fit for the volume's lifecycle.

If this is correct, I need to just remove the gate and this feature will be enabled by default.

@markturansky
Contributor Author

If we want the bind on pvc.Spec.VolumeName, then please review #7529. Either way, please let me know how to proceed and I'm happy to follow up.

@thockin
Member

thockin commented May 9, 2015

I may be on the wrong side of this battle, but I'll try anyway.

Status isn't sufficient, since status must be observable from somewhere else.

Technically, you can rediscover status by scanning all PVClaims and seeing if any of them point back to a given PV. Granted, that's somewhat less than pleasant, but it fits the requirement in principle. To maintain the crumbling analogy, we bind a pod to a node, where pod.spec holds the authoritative assignment. A node discovers what pods are bound to it by selecting all pods where host == self. We could bind nodes to pods and keep a list of pods per-node, and it might even be better. Consider:

  1. user asks for a Pod P with .host = abc123
  2. scheduler decides that is a feasible scheduling decision
  3. scheduler posts a binding to Node abc123 for Pod P.
  4. append P to Node abc123.spec.pods[]
  5. scheduler posts a status to Pod P with status.Host = abc123

It's now trivial for a node to ask "what pods are assigned to me". It's unambiguous, given a Pod object, whether it has been scheduled or not. Is this something we considered in the scheduler binding model?
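
Purely for illustration, the alternative in the numbered steps might look roughly like this with hypothetical, simplified fields (none of these exist in the actual API):

```go
package main

import "fmt"

// Hypothetical, simplified shapes for the alternative in the numbered steps
// above; the field names are assumptions for the sketch.
type PodSpec struct {
	Host string // step 1: the user may ask for a specific node
}

type PodStatus struct {
	Host string // step 5: written back once the binding is accepted
}

type NodeSpec struct {
	Pods []string // step 4: pods bound to this node
}

func main() {
	pod := struct {
		Spec   PodSpec
		Status PodStatus
	}{Spec: PodSpec{Host: "abc123"}}
	node := NodeSpec{}

	// steps 2-5: the scheduler accepts the placement, posts the binding to the
	// node, and records the result on the pod's status.
	node.Pods = append(node.Pods, "default/P")
	pod.Status.Host = "abc123"

	// it is now trivial for the node to answer "what pods are assigned to me",
	// and unambiguous from the pod whether it has been scheduled.
	fmt.Println(node.Pods, pod.Status.Host)
}
```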

If we care that it be easy to go from PV to any claim on that PV, we should equally care that it be easy to go from a Node to any claims on that Node.

Alternately, we could write to Spec on both PV and PVClaim, and by analogy Node and Pod.

I don't mean to jerk you around, Mark. We should make a call and just live with it, even if that means NOT fixing the API until v2. I think it's a pretty deeply rooted design decision, so we should not rush it. Propose that we put together a hangout or something Tuesday and make a decision by Wednesday? Clayton, Brian, David, you OK with that?

Tim

@markturansky
Contributor Author

@thockin I'm ready to put this to bed. A hangout Tuesday to discuss works for me. How do we best schedule it? I think some of you have more demands on your time than I do.

@markturansky
Contributor Author

you can rediscover status by scanning all PVClaims and seeing if any of them point back to a given PV.

Except for a PVClaim that is deleted. volume.Status.ClaimRef might still be set on the PV, but that status is not observable by itself and cannot be rebuilt because there would be no corresponding claim. The volume wouldn't know it requires recycling.

Perhaps this is why @smarterclayton suggested keeping PVClaim around until after the PV is recycled. That requires another Phase for PVClaim (such as Released with the subsequent delete of the claim happening after the volume is recycled).

But adding extra phases to achieve the same effect seems like a violation of Occam's Razor.

@thockin
Member

thockin commented May 11, 2015

Of course. Not sure how I let that escape. Shame. PV/PVC is different from Node/Pod in this regard.

@markturansky
Contributor Author

@thockin does that mean the current implementation w/ bind on PV.Spec.ClaimRef is correct? If so, I can submit a PR to remove the gate.

@thockin
Member

thockin commented May 12, 2015

Spoke with Brian, he agreed with my feeling. Binding should be posted to the less-durable object, referencing the more-durable object (bind pod to node, bind PVC to PV). The fact that PV's ClaimRef needs to be more durable than Status allows just means that it should be in Spec. Yes, the bi-di refs are both in spec on their respective objects.

Workable? Or did I miss something?
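
A minimal sketch of that agreed shape, with simplified illustrative structs rather than the real API types: the bind is posted against the claim, and both sides record the reference in Spec, so the PV's half survives deletion of the claim.

```go
package main

import "fmt"

// Simplified, illustrative shapes only; not the actual API structs.
type PVClaimSpec struct {
	VolumeName string // claim -> PV, set via the /binding subresource
}

type PVSpec struct {
	// PV -> claim, written back when the bind is accepted; kept in Spec rather
	// than Status so it remains durable after the claim itself is deleted.
	ClaimRef string
}

func main() {
	claim := PVClaimSpec{VolumeName: "pv0001"}
	pv := PVSpec{ClaimRef: "default/myclaim"}
	fmt.Printf("claim -> %s, pv -> %s\n", claim.VolumeName, pv.ClaimRef)
}
```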


@markturansky
Contributor Author

Workable and requires rework of #7529.

The change from two weeks ago is to keep pv.Spec.ClaimRef as opposed to changing it to pv.Status.ClaimRef.

I'll work through #7529 today.

@bgrant0607
Member

We will eventually need to support sole tenancy for pods. I want to follow the same model in that case: the pod has /binding, as in the non-exclusive case, and the exclusive reservation is then confirmed/recorded by writing the spec of the node. An async. process can determine when the exclusive reservation is valid to remove.

@thockin
Member

thockin commented May 12, 2015

Brian,

I think what we have here is compatible with that goal.

