Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync etcd endpoints before lease acquistion #13082

Merged

Conversation

ncdc
Copy link
Contributor

@ncdc ncdc commented Feb 23, 2017

Auto-sync etcd endpoints while trying to acquire the leader lease. If
any of the etcd cluster members is down, the sync operation will update
the client's member list to remove the unavailable members.

Because the lease acquistion key "set" operation is a one-shot attempt,
if the first member the etcd client tries to contact is unavailable, it
will not try any other members and it just fails. With the sync, we
should be able to work around this.

Fixes bug 1426183
https://bugzilla.redhat.com/show_bug.cgi?id=1426183

cc @smarterclayton @deads2k @sttts @liggitt @mfojtik @derekwaynecarr

@ncdc
Copy link
Contributor Author

ncdc commented Feb 23, 2017

If this looks ok, do you want me to add a call to e.client.AutoSync() also?

@ncdc
Copy link
Contributor Author

ncdc commented Feb 23, 2017

cc @eparis

@smarterclayton
Copy link
Contributor

What is the downside of autosync?

@ncdc
Copy link
Contributor Author

ncdc commented Feb 24, 2017

@smarterclayton not much of one that I can think of. They recommend running with an interval between 10 and 60 seconds. https://github.com/coreos/etcd/blob/02f4a9a034f65f588599eaf6d756b98b55640c98/client/client.go#L181-L195

@smarterclayton
Copy link
Contributor

smarterclayton commented Feb 24, 2017 via email

@ncdc
Copy link
Contributor Author

ncdc commented Feb 24, 2017

No, I think only calling AutoSync should be sufficient. I have a standalone test driver I will try it in first to be sure.

@ncdc ncdc force-pushed the sync-endpoints-before-leader-election branch from 61ade16 to 28872e8 Compare February 24, 2017 15:45
@ncdc
Copy link
Contributor Author

ncdc commented Feb 24, 2017

@smarterclayton updated, ptal

// until it succeeds. Assuming it does, the client's list of endpoints is updated, and any
// unavailable members are removed from the list.
for {
err := e.client.AutoSync(ctx, autoSyncInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scarily looks like a hotloop. :)

If the hotloop dies, should it cancel the lease?

Copy link
Contributor Author

@ncdc ncdc Feb 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - if the inner call to Sync fails repeatedly, it would be a hotloop. At the very least, after an error, it should sleep for e.pauseInterval or maybe autoSyncInterval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed an update where it sleeps e.pauseInterval after an AutoSync error

Auto-sync etcd endpoints while trying to acquire the leader lease. If
any of the etcd cluster members is down, the sync operation will update
the client's member list to remove the unavailable members.

Because the lease acquistion key "set" operation is a one-shot attempt,
if the first member the etcd client tries to contact is unavailable, it
will not try any other members and it just fails. With the sync, we
should be able to work around this.
@ncdc ncdc force-pushed the sync-endpoints-before-leader-election branch from 28872e8 to d960c59 Compare February 24, 2017 16:03
Copy link
Contributor

@smarterclayton smarterclayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm [merge]

@ncdc
Copy link
Contributor Author

ncdc commented Feb 24, 2017

[merge]

@openshift-bot
Copy link
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to d960c59

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/557/) (Base Commit: f984cc7)

@ncdc
Copy link
Contributor Author

ncdc commented Feb 26, 2017 via email

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to d960c59

@openshift-bot
Copy link
Contributor

openshift-bot commented Feb 26, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/570/) (Base Commit: d757d20) (Image: devenv-rhel7_5984)

@openshift-bot openshift-bot merged commit 5357357 into openshift:master Feb 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants