Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd: reuse leases for keys in a time window #64539

Merged
merged 1 commit into from Jun 21, 2018

Conversation

@ccding
Copy link
Contributor

ccding commented May 31, 2018

Reuse leases for keys in a time window, to reduce the overhead to etcd
caused by using massive number of leases

Fixes #47532

Fix a scalability issue where high rates of event writes degraded etcd performance.
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented May 31, 2018

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ccding

This comment has been minimized.

Copy link
Contributor Author

ccding commented May 31, 2018

signed CLA

@wgliang

This comment has been minimized.

Copy link
Member

wgliang commented May 31, 2018

/ok-to-test

@ccding ccding force-pushed the cfork:lease branch from 5a71572 to e01904f May 31, 2018

@@ -33,7 +33,8 @@ import (
// The value for timeout should effectively be "forever." Obviously we don't want our tests to truly lock up forever, but 30s
// is long enough that it is effectively forever for the things that can slow down a run on a heavily contended machine
// (GC, seeks, etc), but not so long as to make a developer ctrl-c a test run if they do happen to break that test.
var ForeverTestTimeout = time.Second * 30
// Since we use one etcd lease to handle events within 60s, we use 30+60=90s in the test for forever timeout.
var ForeverTestTimeout = time.Second * 90

This comment has been minimized.

@hongchaodeng

hongchaodeng May 31, 2018

Member

Why change this to 90s?

This comment has been minimized.

@ccding

ccding May 31, 2018

Author Contributor

I added comments for this "Since we use one etcd lease to handle events within 60s, we use 30+60=90s in the test for forever timeout."

This comment has been minimized.

@hongchaodeng

hongchaodeng May 31, 2018

Member

Can you give an example which test would need this long?

This comment has been minimized.

@wojtek-t

wojtek-t Jun 1, 2018

Member

I really don't like it. Let's change that test to not require 60s, but instead maybe 30s or sth...
We really shouldn't change some system-wide (even test-only) constant just for the purpose of a single test.

This comment has been minimized.

@xiang90

xiang90 Jun 1, 2018

Contributor

@ccding you should change the const defined (also change it to a var) in store.go to make the lease shorter for testing instead of changing this global thing.

@@ -70,6 +73,10 @@ type store struct {
pathPrefix string
watcher *watcher
pagingEnabled bool

leaseLock *sync.Mutex
prevLeaseExpirationTime int64

This comment has been minimized.

@hongchaodeng

hongchaodeng May 31, 2018

Member

add a comment what prevLeaseExpirationTime and prevLeaseID are for?

This comment has been minimized.

@ccding

ccding May 31, 2018

Author Contributor

done

// If the previous lease grants valid TTL, reuse it; Otherwise create
// a new one with a little extra time to make it possible to be reused.
now := time.Now().Unix()
if now+ttl <= s.prevLeaseExpirationTime {

This comment has been minimized.

@hongchaodeng

hongchaodeng May 31, 2018

Member
if now+ttl > s.prevLeaseExpirationTime {
 ...
}

No need to have empty blocks.

This comment has been minimized.

@ccding

ccding May 31, 2018

Author Contributor

done

if now+ttl <= s.prevLeaseExpirationTime {
// do nothing and reuse the previous lease
} else {
s.prevLeaseExpirationTime = now + leaseReuseLength + ttl

This comment has been minimized.

@hongchaodeng

hongchaodeng May 31, 2018

Member

Why not

prevLeaseExpirationTime = now + leaseReuseLength

and check

if now > prevLeaseExpirationTime

This comment has been minimized.

@ccding

ccding May 31, 2018

Author Contributor

I guess the values of ttls might be different for different events?

@ccding ccding force-pushed the cfork:lease branch from e01904f to 8a03d71 May 31, 2018

@xiang90

This comment has been minimized.

Copy link
Contributor

xiang90 commented May 31, 2018

/cc @jpbetz

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz May 31, 2018

@@ -33,7 +33,8 @@ import (
// The value for timeout should effectively be "forever." Obviously we don't want our tests to truly lock up forever, but 30s
// is long enough that it is effectively forever for the things that can slow down a run on a heavily contended machine
// (GC, seeks, etc), but not so long as to make a developer ctrl-c a test run if they do happen to break that test.
var ForeverTestTimeout = time.Second * 30
// Since we use one etcd lease to handle events within 60s, we use 30+60=90s in the test for forever timeout.
var ForeverTestTimeout = time.Second * 90

This comment has been minimized.

@wojtek-t

wojtek-t Jun 1, 2018

Member

I really don't like it. Let's change that test to not require 60s, but instead maybe 30s or sth...
We really shouldn't change some system-wide (even test-only) constant just for the purpose of a single test.

// previous lease. If a new event has an expiration time earlier than
// the previous one, the previous lease will be reused to reduce
// etcd's overhead.
leaseLock *sync.Mutex

This comment has been minimized.

@wojtek-t

wojtek-t Jun 1, 2018

Member

Why pointer to mutex? We generally use sync.Mutex in whole codebase.

This comment has been minimized.

@ccding

ccding Jun 1, 2018

Author Contributor

Sorry. I will change it.

@wojtek-t wojtek-t self-assigned this Jun 1, 2018

@wojtek-t

This comment has been minimized.

Copy link
Member

wojtek-t commented Jun 1, 2018

@jpbetz - I would like to target it for 1.11 - WDYT?

@ccding ccding force-pushed the cfork:lease branch from 8a03d71 to 30cc20e Jun 1, 2018

@@ -1182,6 +1182,7 @@ func TestListContinuation(t *testing.T) {
}

func testSetup(t *testing.T) (context.Context, *store, *integration.ClusterV3) {
LeaseReuseLength = 10

This comment has been minimized.

@xiang90

xiang90 Jun 1, 2018

Contributor

comment on why we set this to 10.

This comment has been minimized.

@ccding

ccding Jun 1, 2018

Author Contributor

Anything smaller than 30 (minus a very small number) would work. I don't think it matters. Any suggestions?

This comment has been minimized.

@xiang90

xiang90 Jun 1, 2018

Contributor

we shall comment on why it is 10 not 1000 for example. saying 30 is the default timeout for testing, so we cannot wait longer than that in a single time ... should be good enough.

This comment has been minimized.

@ccding

ccding Jun 1, 2018

Author Contributor

done

@@ -59,6 +60,9 @@ func (d authenticatedDataString) AuthenticatedData() []byte {

var _ value.Context = authenticatedDataString("")

// LeaseReuseLength defines the period of time in seconds that each lease is reused.
var LeaseReuseLength int64 = 60

This comment has been minimized.

@xiang90

xiang90 Jun 1, 2018

Contributor

comment on why this is a var not a const (for example, we should tell people this is for testing purpose etc..)

This comment has been minimized.

@ccding

ccding Jun 1, 2018

Author Contributor

done

@ccding ccding force-pushed the cfork:lease branch 2 times, most recently from 6764dea to 31362a1 Jun 1, 2018

@ccding

This comment has been minimized.

Copy link
Contributor Author

ccding commented Jun 1, 2018

all fixed. PTAL

/cc @wojtek-t

@k8s-ci-robot k8s-ci-robot requested a review from wojtek-t Jun 1, 2018

@@ -59,6 +60,10 @@ func (d authenticatedDataString) AuthenticatedData() []byte {

var _ value.Context = authenticatedDataString("")

// LeaseReuseLength defines the period of time in seconds that each lease is
// reused. We use var instead of const for testing purposes.
var LeaseReuseLength int64 = 60

This comment has been minimized.

@jpbetz

jpbetz Jun 1, 2018

Contributor

Since we use in other places like testSetup, would a more explicit name like LeaseReuseDurationSeconds be worth it?

This comment has been minimized.

@ccding

ccding Jun 1, 2018

Author Contributor

done

@ccding ccding force-pushed the cfork:lease branch from 31362a1 to d12a6de Jun 1, 2018

@jpbetz

This comment has been minimized.

Copy link
Contributor

jpbetz commented Jun 1, 2018

/lgtm

Thanks @ccding!

@wojtek-t Yes, we should target 1.11. This is urgently needed.

@xiang90 xiang90 added this to the v1.12 milestone Jun 21, 2018

@xiang90

This comment has been minimized.

Copy link
Contributor

xiang90 commented Jun 21, 2018

/lgtm

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jun 21, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ccding, jpbetz, xiang90

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jun 21, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@ccding @jpbetz @wojtek-t @xiang90

Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help
@xiang90

This comment has been minimized.

Copy link
Contributor

xiang90 commented Jun 21, 2018

/test pull-kubernetes-verify

@xiang90

This comment has been minimized.

Copy link
Contributor

xiang90 commented Jun 21, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jun 21, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jun 21, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8ba3297 into kubernetes:master Jun 21, 2018

16 of 17 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-kubemark-e2e-gce-big
Details
cla/linuxfoundation ccding authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details

@ccding ccding deleted the cfork:lease branch Jun 28, 2018

k8s-github-robot pushed a commit that referenced this pull request Jun 28, 2018

Kubernetes Submit Queue
Merge pull request #65539 from wenjiaswe/automated-cherry-pick-of-#64…
…539-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #64539: etcd: reuse leases for keys in a time window

Cherry pick of #64539 on release-1.8.

#64539: etcd: reuse leases for keys in a time window

This is it an important bug fix for scalability.
@jpbetz

This comment has been minimized.

Copy link
Contributor

jpbetz commented Jun 29, 2018

@shyamjvs Could we get a performance run on this PR? We're proposing cherrypicks to the release branches and would like to get a stronger signal on the impact of this change. Thanks!

k8s-github-robot pushed a commit that referenced this pull request Jul 5, 2018

Kubernetes Submit Queue
Merge pull request #65510 from wenjiaswe/automated-cherry-pick-of-#64…
…539-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #64539: etcd: reuse leases for keys in a time window

Cherry pick of #64539 on release-1.10.

#64539: etcd: reuse leases for keys in a time window

This is it an important bug fix for scalability.

cc @jpbetz

k8s-github-robot pushed a commit that referenced this pull request Jul 13, 2018

Kubernetes Submit Queue
Merge pull request #65511 from wenjiaswe/automated-cherry-pick-of-#64…
…539-upstream-release-1.11

Automatic merge from submit-queue.

Automated cherry pick of #64539: etcd: reuse leases for keys in a time window

Cherry pick of #64539 on release-1.11.

#64539: etcd: reuse leases for keys in a time window

This is it an important bug fix for scalability.

cc @jpbetz

k8s-github-robot pushed a commit that referenced this pull request Jul 24, 2018

Kubernetes Submit Queue
Merge pull request #65509 from wenjiaswe/automated-cherry-pick-of-#64…
…539-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #64539: etcd: reuse leases for keys in a time window

Cherry pick of #64539 on release-1.9.

#64539: etcd: reuse leases for keys in a time window

This is it an important bug fix for scalability.

cc @jpbetz

@wojtek-t wojtek-t referenced this pull request Jan 23, 2019

Open

Implementing logic for v1beta1.Event API #65782

0 of 3 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.