
Refresh discovery server resources for memCacheClient in parallel #88382

Merged

Conversation

jpbetz
Contributor

@jpbetz jpbetz commented Feb 20, 2020

What type of PR is this?

/kind flake

What this PR does / why we need it:

Run discovery API calls in parallel to reduce cache refresh latency for the in-memory discovery client. This cuts each cache refresh from ~15 seconds to ~2 seconds.

This helps e2e tests, since they run in parallel and some tests remove CRDs at the same time others are running, forcing the cache to be refreshed often.

In e2e tests we're seeing a groupVersion list length of around 27, which means we make 27 API calls in series.

See #87668 and #88297 for debugging details.

There might be other ways to optimize this further, but running the requests in parallel is an obvious win, so I'd like to merge this before pursuing other optimizations.
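The parallelization described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual client-go implementation: `fetchResources` and `fetchAllParallel` are invented placeholders standing in for the per-groupVersion discovery round trips.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchResources stands in for a per-groupVersion discovery call
// (in the real client this is a network round trip).
func fetchResources(groupVersion string) []string {
	return []string{groupVersion + "/resourceA", groupVersion + "/resourceB"}
}

// fetchAllParallel issues one fetch per groupVersion concurrently and
// collects results under a mutex, so total latency is roughly the
// slowest single call rather than the sum of all ~27 calls in series.
func fetchAllParallel(groupVersions []string) map[string][]string {
	results := make(map[string][]string, len(groupVersions))
	var (
		wg         sync.WaitGroup
		resultLock sync.Mutex
	)
	for _, gv := range groupVersions {
		wg.Add(1)
		go func(gv string) {
			defer wg.Done()
			resources := fetchResources(gv)
			resultLock.Lock()
			defer resultLock.Unlock()
			results[gv] = resources
		}(gv)
	}
	wg.Wait()
	return results
}

func main() {
	gvs := []string{"apps/v1", "batch/v1", "networking.k8s.io/v1"}
	fmt.Println(len(fetchAllParallel(gvs))) // prints 3
}
```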

Last 9 pull-kubernetes-e2e-gce runs are below. No GC failures. This is with SSA at 10%.

1230671712341725185 - pass - 53m18s
1230729716571312129 - pass - 49m35s
1230742932269568000 - pass - 47m0s
1230756271573962752 - pass - 45m38s
1230767966954459136 - pass - 51m17s
1230781434029936640 - 12 In-tree Volumes failures - 48m41s
1230885123486912512 - 1 volume mount failure - 50m28s
1230899588768993280 - 1 volume mount failure - 51m12s
1230916073155465216 - pass - 49m40s

From #88297 here are the last 9 runs with SSA at 100%. No GC failures.

1230746203231096836 - pass - 1h1m41s
1230762058010595328 - 1 volume/CSIDriver failure - 1h0m17s
1230778298858999808 - 15 In-tree Volumes failures - 50m16s
1230885241724342272 - pass - 51m50s
1230899588773187585 - pass - 52m36s
1230919719050022912 - pass - 57m56s
1230949943569551360 - 1 In-tree Volumes failures, 2 PersistentVolumes-local failures - 59m57s
1230966546394779648 - 1 CRD publish OpenAPI failure - 58m36s
1230985169494609921 - pass - 55m29s

Which issue(s) this PR fixes:

Fixes #87668

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 20, 2020
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 20, 2020
@jpbetz jpbetz mentioned this pull request Feb 20, 2020
@apelisse
Member

Thanks
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2020
@jpbetz
Contributor Author

jpbetz commented Feb 20, 2020

e2e tests are failing; requests are getting throttled:

W0220 21:49:48.390781       1 garbagecollector.go:661] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
I0220 21:49:48.390954       1 garbagecollector.go:203] syncing garbage collector with updated resources from discovery (attempt 5): added: [/v1, Resource=configmaps /v1, Resource=endpoints /v1, Resource=events /v1, Resource=limitranges /v1, Resource=namespaces /v1, Resource=nodes /v1, Resource=persistentvolumeclaims /v1, Resource=persistentvolumes /v1, Resource=pods /v1, Resource=podtemplates /v1, Resource=replicationcontrollers /v1, Resource=resourcequotas /v1, Resource=secrets /v1, Resource=serviceaccounts /v1, Resource=services admissionregistration.k8s.io/v1, Resource=mutatingwebhookconfigurations admissionregistration.k8s.io/v1, Resource=validatingwebhookconfigurations apiextensions.k8s.io/v1, Resource=customresourcedefinitions apiregistration.k8s.io/v1, Resource=apiservices apps/v1, Resource=controllerrevisions apps/v1, Resource=daemonsets apps/v1, Resource=deployments apps/v1, Resource=replicasets apps/v1, Resource=statefulsets autoscaling/v1, Resource=horizontalpodautoscalers batch/v1, Resource=jobs batch/v1beta1, Resource=cronjobs certificates.k8s.io/v1beta1, Resource=certificatesigningrequests coordination.k8s.io/v1, Resource=leases discovery.k8s.io/v1beta1, Resource=endpointslices events.k8s.io/v1beta1, Resource=events extensions/v1beta1, Resource=ingresses networking.k8s.io/v1, Resource=networkpolicies networking.k8s.io/v1beta1, Resource=ingresses node.k8s.io/v1beta1, Resource=runtimeclasses policy/v1beta1, Resource=poddisruptionbudgets policy/v1beta1, Resource=podsecuritypolicies rbac.authorization.k8s.io/v1, Resource=clusterrolebindings rbac.authorization.k8s.io/v1, Resource=clusterroles rbac.authorization.k8s.io/v1, Resource=rolebindings rbac.authorization.k8s.io/v1, Resource=roles scalingpolicy.kope.io/v1alpha1, Resource=scalingpolicies scheduling.k8s.io/v1, Resource=priorityclasses settings.k8s.io/v1alpha1, Resource=podpresets snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshotclasses snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshotcontents snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshots storage.k8s.io/v1, Resource=csinodes storage.k8s.io/v1, Resource=storageclasses storage.k8s.io/v1, Resource=volumeattachments storage.k8s.io/v1beta1, Resource=csidrivers], removed: []

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2020
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

Need to prevent client-side rate-limiter-induced timeouts. Trying a concurrency limit of 5 to see if that helps.
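A common Go pattern for capping concurrency like this is a buffered channel used as a semaphore. The sketch below is illustrative only (`fetchAllBounded` and its placeholder `fetch` are invented names, not the PR's actual code), showing how at most `limit` calls are in flight at once so the client-side rate limiter isn't overwhelmed:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAllBounded runs fetch for each item concurrently, but never
// more than limit at a time: each goroutine must acquire a slot from
// the buffered channel before doing any work.
func fetchAllBounded(items []string, limit int, fetch func(string) string) []string {
	sem := make(chan struct{}, limit) // semaphore with `limit` slots
	results := make([]string, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i int, item string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release it when done
			results[i] = fetch(item)
		}(i, item)
	}
	wg.Wait()
	return results
}

func main() {
	items := []string{"a", "b", "c", "d", "e", "f", "g"}
	out := fetchAllBounded(items, 5, func(s string) string { return s + "!" })
	fmt.Println(out) // prints [a! b! c! d! e! f! g!]
}
```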

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch 4 times, most recently from 412c453 to 4d09f47 Compare February 21, 2020 01:52
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

I misunderstood the error; it looks like there is a correctness problem. Fixing.

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from 4d09f47 to 8beb555 Compare February 21, 2020 05:42
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

two passes in a row, going to try a few more

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

3 similar comments
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

@liggitt, @apelisse I think this is ready to go. Would you have a look?

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

cc @lavalamp @caesarxuchao

@apelisse
Member

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@liggitt
Member

liggitt commented Feb 21, 2020

/approve

two nits, then lgtm

@liggitt
Member

liggitt commented Feb 21, 2020

/hold for nits

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2020
@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from 8beb555 to a2e5a14 Compare February 21, 2020 23:48
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2020
@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch 2 times, most recently from e6968cf to cc4a2af Compare February 21, 2020 23:51
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/hold cancel
nits fixed

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2020
@apelisse
Member

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@@ -190,16 +190,29 @@ func (d *memCacheClient) refreshLocked() error {
return err
}

wg := &sync.WaitGroup{}
resultLock := &sync.Mutex{}
Member


var resultLock sync.Mutex

works fine here, I think? No need to change it, we'll just wait for an "ineffassign" fix...
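For background on this suggestion: the zero value of `sync.Mutex` is an unlocked, ready-to-use mutex, so a plain `var resultLock sync.Mutex` works without allocating via `&sync.Mutex{}`. A minimal illustration (`countConcurrently` is an invented example, not code from this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// countConcurrently increments a shared counter from n goroutines.
// The zero value of sync.Mutex (and sync.WaitGroup) is ready to use:
// `var mu sync.Mutex` needs no constructor and no pointer allocation.
func countConcurrently(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			counter++
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countConcurrently(10)) // prints 10
}
```

The pointer form is only needed when the mutex must be shared across copies of a value; a local used via closure, as here, can stay a plain value.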

@lavalamp
Member

/lgtm

(I want credit for reading it ;) )

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from cc4a2af to 190a723 Compare February 22, 2020 02:35
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2020
@lavalamp
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2020
@k8s-ci-robot k8s-ci-robot merged commit 62b0bbc into kubernetes:master Feb 22, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 22, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[flaky test] Garbage collector timeout tests
5 participants