
Refresh discovery server resources for memCacheClient in parallel #88382

Merged

Conversation

jpbetz
Contributor

@jpbetz jpbetz commented Feb 20, 2020

What type of PR is this?

/kind flake

What this PR does / why we need it:

Run discovery API calls in parallel to reduce cache refresh latency for the in-memory discovery client. This cuts each cache refresh from ~15 seconds to ~2 seconds.

This helps e2e tests, since they run in parallel and some tests remove CRDs at the same time others are running, forcing the cache to be refreshed often.

In e2e tests we're seeing a groupVersion list length of around 27, which means we make 27 API calls in series.

See #87668 and #88297 for debugging details.

There might be other ways to optimize this further, but running the requests in parallel is an obvious win, so I'd like to merge this before pursuing other optimizations.
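The parallelization described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual client-go implementation: `fetchResources` and `fetchAllParallel` are invented placeholders standing in for the per-groupVersion discovery round trips.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchResources stands in for a per-groupVersion discovery call
// (in the real client this is a network round trip).
func fetchResources(groupVersion string) []string {
	return []string{groupVersion + "/resourceA", groupVersion + "/resourceB"}
}

// fetchAllParallel issues one fetch per groupVersion concurrently and
// collects results under a mutex, so total latency is roughly the
// slowest single call rather than the sum of all ~27 calls in series.
func fetchAllParallel(groupVersions []string) map[string][]string {
	results := make(map[string][]string, len(groupVersions))
	var (
		wg         sync.WaitGroup
		resultLock sync.Mutex
	)
	for _, gv := range groupVersions {
		wg.Add(1)
		go func(gv string) {
			defer wg.Done()
			resources := fetchResources(gv)
			resultLock.Lock()
			defer resultLock.Unlock()
			results[gv] = resources
		}(gv)
	}
	wg.Wait()
	return results
}

func main() {
	gvs := []string{"apps/v1", "batch/v1", "networking.k8s.io/v1"}
	fmt.Println(len(fetchAllParallel(gvs))) // prints 3
}
```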

Last 9 pull-kubernetes-e2e-gce runs are below. No GC failures. This is with SSA at 10%.

1230671712341725185 - pass - 53m18s
1230729716571312129 - pass - 49m35s
1230742932269568000 - pass - 47m0s
1230756271573962752 - pass - 45m38s
1230767966954459136 - pass - 51m17s
1230781434029936640 - 12 In-tree Volumes failures - 48m41s
1230885123486912512 - 1 volume mount failure - 50m28s
1230899588768993280 - 1 volume mount failure - 51m12s
1230916073155465216 - pass - 49m40s

From #88297 here are the last 9 runs with SSA at 100%. No GC failures.

1230746203231096836 - pass - 1h1m41s
1230762058010595328 - 1 volume/CSIDriver failure - 1h0m17s
1230778298858999808 - 15 In-tree Volumes failures - 50m16s
1230885241724342272 - pass - 51m50s
1230899588773187585 - pass - 52m36s
1230919719050022912 - pass - 57m56s
1230949943569551360 - 1 In-tree Volumes failures, 2 PersistentVolumes-local failures - 59m57s
1230966546394779648 - 1 CRD publish OpenAPI failure - 58m36s
1230985169494609921 - pass - 55m29s

Which issue(s) this PR fixes:

Fixes #87668

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 20, 2020
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 20, 2020
@jpbetz jpbetz mentioned this pull request Feb 20, 2020
@apelisse
Member

Thanks
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2020
@jpbetz
Contributor Author

jpbetz commented Feb 20, 2020

e2e tests are failing; requests are getting throttled:

W0220 21:49:48.390781       1 garbagecollector.go:661] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
I0220 21:49:48.390954       1 garbagecollector.go:203] syncing garbage collector with updated resources from discovery (attempt 5): added: [/v1, Resource=configmaps /v1, Resource=endpoints /v1, Resource=events /v1, Resource=limitranges /v1, Resource=namespaces /v1, Resource=nodes /v1, Resource=persistentvolumeclaims /v1, Resource=persistentvolumes /v1, Resource=pods /v1, Resource=podtemplates /v1, Resource=replicationcontrollers /v1, Resource=resourcequotas /v1, Resource=secrets /v1, Resource=serviceaccounts /v1, Resource=services admissionregistration.k8s.io/v1, Resource=mutatingwebhookconfigurations admissionregistration.k8s.io/v1, Resource=validatingwebhookconfigurations apiextensions.k8s.io/v1, Resource=customresourcedefinitions apiregistration.k8s.io/v1, Resource=apiservices apps/v1, Resource=controllerrevisions apps/v1, Resource=daemonsets apps/v1, Resource=deployments apps/v1, Resource=replicasets apps/v1, Resource=statefulsets autoscaling/v1, Resource=horizontalpodautoscalers batch/v1, Resource=jobs batch/v1beta1, Resource=cronjobs certificates.k8s.io/v1beta1, Resource=certificatesigningrequests coordination.k8s.io/v1, Resource=leases discovery.k8s.io/v1beta1, Resource=endpointslices events.k8s.io/v1beta1, Resource=events extensions/v1beta1, Resource=ingresses networking.k8s.io/v1, Resource=networkpolicies networking.k8s.io/v1beta1, Resource=ingresses node.k8s.io/v1beta1, Resource=runtimeclasses policy/v1beta1, Resource=poddisruptionbudgets policy/v1beta1, Resource=podsecuritypolicies rbac.authorization.k8s.io/v1, Resource=clusterrolebindings rbac.authorization.k8s.io/v1, Resource=clusterroles rbac.authorization.k8s.io/v1, Resource=rolebindings rbac.authorization.k8s.io/v1, Resource=roles scalingpolicy.kope.io/v1alpha1, Resource=scalingpolicies scheduling.k8s.io/v1, Resource=priorityclasses settings.k8s.io/v1alpha1, Resource=podpresets snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshotclasses snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshotcontents snapshot.storage.k8s.io/v1beta1, Resource=volumesnapshots storage.k8s.io/v1, Resource=csinodes storage.k8s.io/v1, Resource=storageclasses storage.k8s.io/v1, Resource=volumeattachments storage.k8s.io/v1beta1, Resource=csidrivers], removed: []

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2020
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

Need to prevent client-side rate-limiter-induced timeouts. Trying a concurrency limit of 5 to see if that helps.
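A common Go pattern for capping concurrency like this is a buffered channel used as a semaphore. The sketch below is illustrative only (`fetchAllBounded` and its placeholder `fetch` are invented names, not the PR's actual code), showing how at most `limit` calls are in flight at once so the client-side rate limiter isn't overwhelmed:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAllBounded runs fetch for each item concurrently, but never
// more than limit at a time: each goroutine must acquire a slot from
// the buffered channel before doing any work.
func fetchAllBounded(items []string, limit int, fetch func(string) string) []string {
	sem := make(chan struct{}, limit) // semaphore with `limit` slots
	results := make([]string, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i int, item string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release it when done
			results[i] = fetch(item)
		}(i, item)
	}
	wg.Wait()
	return results
}

func main() {
	items := []string{"a", "b", "c", "d", "e", "f", "g"}
	out := fetchAllBounded(items, 5, func(s string) string { return s + "!" })
	fmt.Println(out) // prints [a! b! c! d! e! f! g!]
}
```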

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch 4 times, most recently from 412c453 to 4d09f47 Compare February 21, 2020 01:52
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

I misunderstood the error; it looks like there is a correctness problem. Fixing.

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from 4d09f47 to 8beb555 Compare February 21, 2020 05:42
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

two passes in a row, going to try a few more

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

3 similar comments
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/test pull-kubernetes-e2e-gce

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

@liggitt, @apelisse I think this is ready to go. Would you have a look?

@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

cc @lavalamp @caesarxuchao

@apelisse
Member

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@liggitt
Member

liggitt commented Feb 21, 2020

/approve

two nits, then lgtm

@liggitt
Member

liggitt commented Feb 21, 2020

/hold for nits

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2020
@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from 8beb555 to a2e5a14 Compare February 21, 2020 23:48
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2020
@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch 2 times, most recently from e6968cf to cc4a2af Compare February 21, 2020 23:51
@jpbetz
Contributor Author

jpbetz commented Feb 21, 2020

/hold cancel
nits fixed

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2020
@apelisse
Member

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2020
@@ -190,16 +190,29 @@ func (d *memCacheClient) refreshLocked() error {
return err
}

wg := &sync.WaitGroup{}
resultLock := &sync.Mutex{}
Member


var resultLock sync.Mutex

works fine here, I think? No need to change it, we'll just wait for an "ineffassign" fix...
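For background on this suggestion: the zero value of `sync.Mutex` is an unlocked, ready-to-use mutex, so a plain `var resultLock sync.Mutex` works without allocating via `&sync.Mutex{}`. A minimal illustration (`countConcurrently` is an invented example, not code from this PR):

```go
package main

import (
	"fmt"
	"sync"
)

// countConcurrently increments a shared counter from n goroutines.
// The zero value of sync.Mutex (and sync.WaitGroup) is ready to use:
// `var mu sync.Mutex` needs no constructor and no pointer allocation.
func countConcurrently(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			counter++
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countConcurrently(10)) // prints 10
}
```

The pointer form is only needed when the mutex must be shared across copies of a value; a local used via closure, as here, can stay a plain value.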

@lavalamp
Member

/lgtm

(I want credit for reading it ;) )

@jpbetz jpbetz force-pushed the parallel-mem-client-resource-discovery branch from cc4a2af to 190a723 Compare February 22, 2020 02:35
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2020
@lavalamp
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2020
@k8s-ci-robot k8s-ci-robot merged commit 62b0bbc into kubernetes:master Feb 22, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 22, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[flaky test] Garbage collector timeout tests
5 participants