
migrate leader election to lease API #81030

Open · wants to merge 1 commit into base: master
Conversation

@ricky1993
Contributor

commented Aug 6, 2019

Change-Id: I21fd5cdc1af59e456628cf15fc84b2d79db2eda0

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Re-implement #80508, due to the suggestion at #80508 (comment)

If a user uses the endpoint lock and wants to migrate to the lease lock, they can switch all components (kcm, scheduler, etc.) to the "endpointlease" lock and then switch to the lease lock safely. Note that the old endpoint lock will not be cleaned up.
Which issue(s) this PR fixes:

Ref #80289

Special notes for your reviewer:
Implement two composite resource locks for migrating between lease and endpoint, and between lease and configmap.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig scalability
/assign @wojtek-t
/cc @mikedanese @timothysc any suggestions?

@k8s-ci-robot

Contributor

commented Aug 6, 2019

Hi @ricky1993. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wojtek-t

Member

commented Aug 7, 2019

/ok-to-test

This looks reasonable to me.
@mikedanese @timothysc - thoughts?

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch from 44fa6ff to 621a0bc Aug 7, 2019

@timothysc

Member

commented Aug 7, 2019

I'm still confused by your motivations for this change, and imo release notes are required:

If a user uses the endpoint lock and wants to migrate to the lease lock, they can switch all components (kcm, scheduler, etc.) to the "endpointlease" lock and then switch to the lease lock safely. Note that the old endpoint lock will not be cleaned up.

^ What is the user story driving this? You want to switch the locking on the fly, that seems like a really bad idea? What am I missing here?

/hold

@ricky1993

Contributor Author

commented Aug 7, 2019

^ What is the user story driving this? You want to switch the locking on the fly, that seems like a really bad idea? What am I missing here?

The ref issue.
"Given that both of these are watched by different components, this is generating a lot of unnecessary load.

We should migrate all leader-election to use Lease API (that was designed exactly for this case)."

@wojtek-t

Member

commented Aug 7, 2019

You want to switch the locking on the fly, that seems like a really bad idea? What am I missing here?

@timothysc - endpoints or configmaps are not the objects that we should be doing leader election against. We have an API that was designed exactly for this purpose, which is coordination.Lease.
We would like to use in our components what we recommend to others.
In addition to that, using Endpoints object for leader election is bad from performance/scalability POV - Endpoints are watched by kube-proxy on every single node, which means we don't really want to use Endpoints object.

And we explicitly don't want to change it abruptly, but in two phases:
(1) acquire the old object lock; once acquired, acquire the new object lock; only when both are acquired do we hold the lock and can proceed
(2) (next release) since the master now has to acquire the lock on the new object (lease) to be master, we can drop the requirement to acquire the old object (endpoints or configmap)
And then we're using the lease as it should be.

Is that more clear now?
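The two-phase scheme above can be sketched as follows. This is a minimal toy model, not the actual client-go API: the `lock` interface, `memLock`, and `acquireBoth` are illustrative stand-ins, and real resource locks also handle renewal, identity comparison, and API errors.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the client-go resourcelock types; names are
// illustrative, not the actual library API.
type record struct{ HolderIdentity string }

type lock interface {
	Get() (*record, error)
	Create(r record) error
}

var errNotFound = errors.New("not found")

// memLock is a toy in-memory lock used only for this sketch.
type memLock struct{ r *record }

func (m *memLock) Get() (*record, error) {
	if m.r == nil {
		return nil, errNotFound
	}
	return m.r, nil
}

func (m *memLock) Create(r record) error {
	if m.r != nil {
		return errors.New("already exists")
	}
	m.r = &r
	return nil
}

// acquireBoth implements phase (1) of the migration: the caller is leader
// only if it holds the old (primary) lock AND the new (secondary) lock.
func acquireBoth(primary, secondary lock, id string) bool {
	for _, l := range []lock{primary, secondary} {
		if _, err := l.Get(); errors.Is(err, errNotFound) {
			if l.Create(record{HolderIdentity: id}) != nil {
				return false
			}
		} else if err != nil {
			return false
		} else {
			// Held by someone: the real library would compare identities
			// and renew times; this sketch just fails.
			return false
		}
	}
	return true
}

func main() {
	ep, lease := &memLock{}, &memLock{}
	fmt.Println(acquireBoth(ep, lease, "kcm-a")) // first client wins both
	fmt.Println(acquireBoth(ep, lease, "kcm-b")) // second client gets neither
}
```

In phase (2), once every component requires the lease to consider itself leader, the primary (endpoints/configmap) leg of the loop can simply be dropped.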

@ricky1993

Contributor Author

commented Aug 7, 2019

/test pull-kubernetes-e2e-gce

@fedebongio

Contributor

commented Aug 8, 2019

/cc @liggitt

@timothysc

Member

commented Aug 12, 2019

We have an API that was designed exactly for this purpose, which is coordination.Lease.
We would like to use in our components what we recommend others.

lol, we argued for this years ago but Brian Grant said no way. I don't know how you slid this through, but works for me.

/hold cancel

@timothysc

Member

commented Aug 12, 2019

/assign @liggitt
^ I wasn't involved in this API's creation, so deferring to the api-approver.

@timothysc timothysc removed their assignment Aug 12, 2019

}
// check if the older resource lock is released
if leaseLock.RenewTime.Before(&configMapLock.RenewTime) {
return leaseLock, nil


liggitt Aug 12, 2019

Member

this doesn't check that the HolderIdentity agrees between the two... doesn't that mask scenarios where both objects exist but have conflicting identities recorded?


mikedanese Aug 13, 2019

Member

The value of HolderIdentity is informational only. The client doesn't actually act on it.


ricky1993 Aug 13, 2019

Author Contributor

If we return an error when HolderIdentity differs between the two, the function tryAcquireOrRenew will return at the first step.
https://github.com/kubernetes/client-go/blob/4f902818859a531e018f805b2d3597f14bf2c494/tools/leaderelection/leaderelection.go#L331
As a result, no client can acquire the leader lease successfully.
Correct me if I am wrong.

// Create attempts to create both configmap and lease
func (cmll *ConfigMapLeaseLock) Create(ler LeaderElectionRecord) error {
err := cmll.ConfigMapLock.Create(ler)
if err != nil && !apierrors.IsAlreadyExists(err) {


liggitt Aug 12, 2019

Member

if this already exists, and we do not verify the object that already exists matches the LeaderElectionRecord we intended to create, doesn't that cause problems?


liggitt Aug 12, 2019

Member

doesn't this also put the configmap holder in an invalid state (it makes the internal cm variable nil in an AlreadyExists case)?


wojtek-t Aug 13, 2019

Member

Not sure I understand it - this could potentially be a problem already (even without multilock).


ricky1993 Aug 13, 2019

Author Contributor

I think there is an assumption here:
We call Get() first; if the object is not found, we will not hit an AlreadyExists error here, and we can initialize the cm variable successfully.
If the object already exists, Get() will initialize the cm variable.

And as wojtek said, this could potentially be a problem already (even without the multilock).


liggitt Aug 13, 2019

Member

this could potentially be a problem already (even without multilock).

I don't think so... we don't mask AlreadyExists errors in ConfigMapLock#Create, so tryAcquireOrRenew fails and retries, where the next Get() would return the object


ricky1993 Aug 13, 2019

Author Contributor

I don't think so... we don't mask AlreadyExists errors in ConfigMapLock#Create, so tryAcquireOrRenew fails and retries, where the next Get() would return the object

I think I get your point.
If two clients start against a fresh cluster, they may both call Get() and receive a NotFound error, then both try to Create() and both appear to succeed.
The problem is that Get() and Create() are not atomic.

How about masking AlreadyExists errors only when the primary lock was created by this client?
The reason for masking AlreadyExists errors at all is that Get() returns a NotFound error, but we don't know which of the two locks was not found, so we try to create both of them.
If the primary lock belongs to this client and the secondary lock is NotFound, we would create it; otherwise, if the primary was created by another client, we would return an error and tryAcquireOrRenew would fail.
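The masking rule proposed here could be sketched as follows. Hypothetical names throughout; real code would test the API error with apierrors.IsAlreadyExists rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for an apierrors.IsAlreadyExists check.
var errAlreadyExists = errors.New("already exists")

// createSecondary sketches the rule proposed above: when creating the
// secondary lock, an AlreadyExists error is tolerated only if this client
// already holds the primary lock. Otherwise the error is surfaced so that
// tryAcquireOrRenew fails and retries with a fresh Get().
func createSecondary(primaryHolder, myID string, create func() error) error {
	err := create()
	if errors.Is(err, errAlreadyExists) && primaryHolder == myID {
		return nil // we own the primary; a pre-existing secondary is fine
	}
	return err
}

func main() {
	exists := func() error { return errAlreadyExists }
	// The holder of the primary may absorb the race on the secondary.
	fmt.Println(createSecondary("kcm-a", "kcm-a", exists))
	// A non-holder must see the error and back off.
	fmt.Println(createSecondary("kcm-a", "kcm-b", exists))
}
```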

if err != nil {
return nil, err
}
// check if the older resource lock is released


mikedanese Aug 13, 2019

Member

RenewTime is not meaningful as it is not a locally collected timestamp. See:

// A client only acts on timestamps captured locally to infer the state of the
// leader election. The client does not consider timestamps in the leader
// election record to be accurate because these timestamps may not have been
// produced by a local clock. The implementation does not depend on their
// accuracy and only uses their change to indicate that another client has
// renewed the leader lease. Thus the implementation is tolerant to arbitrary
// clock skew, but is not tolerant to arbitrary clock skew rate.


ricky1993 Aug 13, 2019

Author Contributor

We want to panic and restart the component when losing either of the two locks. And I don't want to break the interface. Any better implementation suggestions here?

@mikedanese

Member

commented Aug 13, 2019

I see two ways to handle this transition. Option one, during the transition:

  • continue to use the old object as the primary object to implement correct leader election.
  • publish results to the new object without consideration of its current state.

Option two, during the transition:

  • clients maintain leases on both objects.

Option two may be more correct but requires a more significant change. This would need to change to a comparison over both locks:

// 2. Record obtained, check the Identity & Time
if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {

Option 1 is going to be a lot simpler and less risky, so it's probably a better option if it's good enough.

@liggitt

Member

commented Aug 13, 2019

  • publish results to the new object without consideration of its current state.

wouldn't that break the aspirational future version that only relies on the new object, and assumes once it has acquired that lock, it actually holds it?

@wojtek-t

Member

commented Aug 13, 2019

wouldn't that break the aspirational future version that only relies on the new object, and assumes once it has acquired that lock, it actually holds it?

I think I agree with Jordan. Even though the probability of not holding the second lock seems relatively low, it's non-zero. And if we want to rely on that in future releases, we should do that correctly.

[I wouldn't say it's very aspirational - if we do this step correctly, the final migration to just use the lease should be trivial.]

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch from 621a0bc to 2fbc3aa Aug 13, 2019

@k8s-ci-robot

Contributor

commented Aug 13, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ricky1993
To complete the pull request process, please assign liggitt
You can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch from 2fbc3aa to 7473fa3 Aug 13, 2019

@ricky1993

Contributor Author

commented Aug 14, 2019

/retest

if err != nil {
return nil, err
}
// check if the older resource lock is expired


wojtek-t Aug 14, 2019

Member

Copying some comments from the previous version:
@mikedanese

RenewTime is not meaningful as it is not a locally collected timestamp. See:

// A client only acts on timestamps captured locally to infer the state of the
// leader election. The client does not consider timestamps in the leader
// election record to be accurate because these timestamps may not have been
// produced by a local clock. The implementation does not depend on their
// accuracy and only uses their change to indicate that another client has
// renewed the leader lease. Thus the implementation is tolerant to arbitrary
// clock skew, but is not tolerant to arbitrary clock skew rate.

@ricky1993

We want to panic and restart the component when losing either of the two locks. And I don't want to break the interface. Any better implementation suggestions here?


wojtek-t Aug 14, 2019

Member

I think what you're doing here isn't really the right thing.
What if:

  • I was the owner in the past
  • then someone else wanted to acquire it, but only acquired the primary one

Now I'm getting it and I own the Secondary (with an older RenewTime), but I don't even own the Primary?

Technically, I would say that a lock is valid only if both the primary and secondary agree.
So I would say we should probably introduce a dedicated error and return it here when the two locks differ. Obviously that would also require proper handling in the library itself.


mikedanese Aug 14, 2019

Member

We have to be careful to avoid situations where a corrupt state (e.g. different leaders in each record) causes unrecoverable errors. Another option is to change the Get() signature to return (record LeaderElectionRecord, rawRecord []byte, err error), store the raw record as le.observedRawRecord, and change this DeepEqual to a byte comparison:

// 2. Record obtained, check the Identity & Time
if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
le.observedRecord = *oldLeaderElectionRecord
le.observedTime = le.clock.Now()
}

We need a HolderIdentity that isn't "" to pass back from Get() when the two objects disagree on the leader. We can pick a placeholder like "leaderelection.k8s.io/unknown". What we get is:

  • transitional clients respect other clients that are using either the old or new resource only.
  • transitional clients treat corruption as just another leader named "unknown" and will attempt to steal the lease after LeaseDuration.
  • no dependencies on time.

This seems like a pretty minimal change to the leaderelection client, EndpointsLock, and ConfigMapLock, but LeaseLock is a bit tricky since we don't currently handle raw bytes for that record: it's stored in the object itself, not in an annotation.
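The raw-record idea above could be sketched as follows. Names are hypothetical (`concatRawRecords`, `renewed`); in the real proposal, the raw bytes would come from serializing each lock's record, and change detection would replace the DeepEqual in tryAcquireOrRenew.

```go
package main

import (
	"bytes"
	"fmt"
)

// concatRawRecords sketches what a transitional multilock would return from
// Get(): the primary and secondary serialized records concatenated. A
// steady-state lock would return its single serialized record unchanged.
func concatRawRecords(primary, secondary []byte) []byte {
	return append(append([]byte{}, primary...), secondary...)
}

// renewed replaces the DeepEqual check with a plain byte comparison, so a
// change in EITHER underlying object is observed as a change in the lock.
func renewed(observedRaw, currentRaw []byte) bool {
	return !bytes.Equal(observedRaw, currentRaw)
}

func main() {
	old := concatRawRecords([]byte(`{"holder":"a","renew":1}`), []byte(`{"holder":"a","renew":1}`))
	upd := concatRawRecords([]byte(`{"holder":"a","renew":1}`), []byte(`{"holder":"a","renew":2}`))
	fmt.Println(renewed(old, old), renewed(old, upd))
}
```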


wojtek-t Aug 19, 2019

Member

Not sure I'm really following what rawRecord gives us - DeepEqual should be doing the same stuff, right?


mikedanese Aug 21, 2019

Member

The client detects a refresh of the lease by checking if anything changed in the record. This is done with a DeepEqual. If we want a lock to be valid only if both primary and secondary are the same, transitional clients (moving to a new record) need to compare two locks while steady-state clients need to compare only one lock. This is certainly possible to do:

  • with DeepEqual by returning both records and comparing both in the client, then handling either one or two records everywhere after.
  • with comparing both in the resource lock implementation, then returning an error. This would also require modifying the leaderelection client today to handle this kind of error and we need to be careful to make sure that corrupt states are recoverable.

I'm suggesting a third option:

  • change the DeepEqual to be a bytes.Compare. Steady state clients return a single serialized record, transitional clients return two concatenated records.
  • construct a LeaderElectionRecord struct in the transitional client picking the best fields to preserve all the informational logging.

The one field in LeaderElectionRecord that we use non-informationally is the Holder field. We use it to implement reentrance when a leader container restarts, and to speed up the first election or elections after the leader explicitly releases the lease. A client doesn't have to wait to reclaim the lease if the holder is empty or the holder == itself:

if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
!le.IsLeader() {

I'm not sure this behavior is actually useful, but it's there for now. The client should work correctly, just a little slower, without these optimizations. This is why I suggested that transitional clients treat mismatched HolderIdentities between the two records as just another leader named "unknown" and attempt to steal the lease after LeaseDuration. This would be functionally equivalent to a steady-state client reading a LeaderElectionRecord owned by another client (assuming that client isn't named "unknown" :)). We are just skipping the HolderIdentity optimizations.
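The placeholder rule described above could be sketched as follows. `mergedHolder` is an illustrative name, not the library's API; the sentinel string comes from the suggestion earlier in this thread.

```go
package main

import "fmt"

// Sentinel identity suggested above for when the two records disagree.
const unknownHolder = "leaderelection.k8s.io/unknown"

// mergedHolder sketches the rule: if the primary and secondary records
// disagree on who holds the lock, report the sentinel identity, so the
// transitional client simply treats it as another leader and waits out
// LeaseDuration before trying to steal the lease.
func mergedHolder(primary, secondary string) string {
	if primary == secondary {
		return primary
	}
	return unknownHolder
}

func main() {
	fmt.Println(mergedHolder("kcm-a", "kcm-a")) // agreement: real holder
	fmt.Println(mergedHolder("kcm-a", "kcm-b")) // corruption: sentinel
}
```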


wojtek-t Aug 22, 2019

Member

I'm suggesting a third option:
change the DeepEqual to be a bytes.Compare. Steady state clients return a single serialized record, transitional clients return two concatenated records.

OK - that makes a lot of sense to me.

Re holder-identity - yes, that I also already agreed with before.


ricky1993 Aug 22, 2019

Author Contributor

The one field in LeaderElectionRecord that we use non-informationally is the Holder field. We use this to implement reentrance when a leader container restarts and to speed up the first election or elections after the leader explicty releases the lease. I client doesn't have too wait to reclaim the lease if the holder is empty or the holder == itself:

I think the optimization does not work when restarting kcm or the scheduler. The holder identity is generated from the hostname and a UUID, so the old holder will not equal the new one.


mikedanese Aug 23, 2019

Member

Yup, it doesn't get hit e.g. during a master upgrade or a master restart.

@yue9944882

Member

commented Aug 19, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from yue9944882 Aug 19, 2019

@wojtek-t

Member

commented Aug 22, 2019

@ricky1993 - so I guess to summarize, I'm with @mikedanese on his proposal. Can you please update this PR to do what he suggested?

@ricky1993

Contributor Author

commented Aug 22, 2019

@ricky1993 - so I guess to summarize, I'm with @mikedanese on his proposal. Can you please update this PR to do what he suggested?

I also agree with him, the PR will be updated soon.

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch 3 times, most recently from 66455ac to 496ac23 Aug 23, 2019

@ricky1993

Contributor Author

commented Aug 29, 2019

@wojtek-t @mikedanese PTAL. I will add some unit tests for the multilock in "staging/src/k8s.io/client-go/tools/leaderelection/leaderelection_test.go" soon.

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch from 496ac23 to e7980b8 Sep 4, 2019

@k8s-ci-robot k8s-ci-robot added size/XL and removed size/L labels Sep 4, 2019

@ricky1993

Contributor Author

commented Sep 4, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@wojtek-t
Member

left a comment

I didn't carefully review the tests yet, but the non-test logic looks reasonable to me.
Before I review the tests, though, I would prefer someone else to also review this.

@mikedanese - can you please take a look?

migrate leader election to lease API
Change-Id: I21fd5cdc1af59e456628cf15fc84b2d79db2eda0

@ricky1993 ricky1993 force-pushed the ricky1993:leader_election_migrate branch from e7980b8 to c5e85d9 Sep 4, 2019

@ricky1993

Contributor Author

commented Sep 10, 2019

friendly ping~ @mikedanese

@timothysc timothysc removed their request for review Sep 10, 2019
