
test(e2e_node): Parallelize prepulling all images in e2e_node tests #91007

Merged
merged 1 commit into kubernetes:master from lsytj0413:fix-89443 on Jul 8, 2020

Conversation

lsytj0413
Contributor

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Improve the runtime of the test suite by prepulling the images in parallel.

Which issue(s) this PR fixes:

Fixes #89443

Special notes for your reviewer:

/cc @mattjmcnaughton

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 12, 2020
@k8s-ci-robot
Contributor

Hi @lsytj0413. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 12, 2020
@Joseph-Irving
Member

/ok-to-test
/kind feature

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 12, 2020
for i := 0; i < maxImagePullRetries; i++ {
if i > 0 {
time.Sleep(imagePullRetryDelay)
wg.Add(1)
Member

could you instead do

wg.Add(len(images))

outside the loop as the length is pre-determined?
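For reference, a minimal sketch of that suggestion, with the WaitGroup sized once before the loop (the pull step here is a stand-in, not the PR's actual puller):

```go
package main

import (
	"fmt"
	"sync"
)

func prePullAll(images []string) {
	var wg sync.WaitGroup
	wg.Add(len(images)) // size the WaitGroup once; the count is known up front
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			fmt.Println("pulling", image) // stand-in for the real pull
		}(image)
	}
	wg.Wait()
}

func main() {
	prePullAll([]string{"busybox:1.29", "nginx:1.14-alpine"})
}
```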

Contributor Author

done, PTAL

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Wonderful, thanks for acting so fast on this issue :)

Overall, looks like a strong approach. I had a couple of questions I'm curious to hear your answer on. Please let me know if they don't make sense :)

Also, could you please add a release note of:

NONE

Thanks!

for i := 0; i < maxImagePullRetries; i++ {
if i > 0 {
time.Sleep(imagePullRetryDelay)
go func(image string) {
Contributor

One thing I was thinking about when I originally filed the issue - do we want to limit the number of images that we try and pull at once?

In other words, if the images list grows long, could we envision it becoming too expensive to try and download all of them in parallel? How could we protect ourselves against that happening?

Contributor Author

At the moment the images list is 20+ entries, so this may not be expensive? In other words, we could fix this when it actually happens.

If we need to fix it now, how about using a semaphore to limit the number of goroutines?

Contributor

Hmmm, I feel like while we are working on this code, we might as well make it robust for the future? What do you think?

A semaphore is definitely one path to consider. I'd also look at a worker pool, which lets us leverage channels, a slightly more idiomatic way of doing concurrency in Go.
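A rough sketch of the worker-pool idea, assuming a fixed worker count and a stand-in pull step (not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// A bounded number of workers drain a channel of image names, which limits
// how many pulls run at once no matter how long the images list grows.
func prePullAll(images []string, workers int) {
	imageCh := make(chan string, len(images))
	for _, image := range images {
		imageCh <- image
	}
	close(imageCh) // workers' range loops end once the channel is drained

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for image := range imageCh {
				fmt.Println("pulling", image) // stand-in for the real pull
			}
		}()
	}
	wg.Wait()
}

func main() {
	prePullAll([]string{"busybox:1.29", "nginx:1.14-alpine", "perl:5.26"}, 2)
}
```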

Contributor Author

Agreed that we should make this as robust as we can. I will rework it to use a worker pool.

Contributor

Awesome, thank you :)

Contributor Author

done, PTAL

if pullErr != nil {
klog.Warningf("Could not pre-pull image %s %v output: %s", image, pullErr, output)
once.Do(func() {
err = pullErr
Contributor

Using once.Do is definitely one option here. I'm wondering if you considered any alternatives? For example, we could also concatenate all of the errors from pulling individual images.
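For context, a minimal sketch of the once.Do pattern being discussed, where only the first failure is kept (the pull function is a hypothetical stand-in):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func pullAll(images []string) error {
	var (
		err  error
		once sync.Once
		wg   sync.WaitGroup
	)
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			if pullErr := pull(image); pullErr != nil {
				once.Do(func() { err = pullErr }) // record only the first failure
			}
		}(image)
	}
	wg.Wait()
	return err
}

func pull(image string) error { // hypothetical pull that always fails
	return errors.New("could not pull " + image)
}

func main() {
	fmt.Println(pullAll([]string{"imageA", "imageB"}))
}
```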

Contributor Author

Yes, we could definitely use a []error to collect every error that occurs and concatenate them, but I don't think that's meaningful if the caller only compares err with nil.

Contributor

I don't have strong opinions :) Maybe a slight preference for saving as much error information as possible, but I'm flexible :)

}
if output, err = puller.Pull(image); err == nil {
break
if pullErr != nil {
Contributor

One behavior of the serialized approach was that if we failed to pull any single image, we would immediately fail the whole function. Thoughts on whether we want to retain that behavior in the parallel approach? Right now, we wait for all of the images to attempt a pull before returning any type of error value.

Contributor Author

puller.Pull doesn't take anything like a ctx param to cancel an in-flight pull (neither the exec.Command path nor the ImageService interface does), so we must wait for each pull to return. If we exited PrePullAllImages without waiting for the sub-goroutines to finish, we would leak goroutines.

In other words, we can only fail the pull process early at the point where each pull retry starts.

Contributor

Just to confirm - you're saying that we can't cancel the current pull, BUT we can prevent it from retrying if it fails (if another image pull has already failed)? I'm comfortable with waiting until the possible retry to fail (and agree with you there isn't another option). I still think that's preferable to letting all of the attempted pulls go through the full number of retries.

Contributor Author

Yes, we can prevent it from retrying if any single image pull failed.

@lsytj0413
Contributor Author

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 14, 2020
Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/retest

Thanks for your thoughtful responses to my qs :) I responded - lmk if you have any qs.

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Looking good!

Thanks for implementing the limit on the number of parallel images we pull at once.

I have one last thought on how we can communicate to the different workers that they need to stop, but other than that, I think this is looking good!

output []byte
)
for retryCount := 0; retryCount < maxImagePullRetries; retryCount++ {
if atomic.LoadInt32(&pullProgressStoped) == 1 {
Contributor

I'm open to using atomic here - however, it seems like we're trying to communicate via shared memory, and my understanding is that Golang tries to prefer channels instead of shared memory.

Thoughts on having a quit channel, and then whenever we want to stop, we can send a signal to the quit channel? We could use a select statement with a fall through to reattempting the pull as the default case.

Let me know if a code sample could be helpful :)
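Something along these lines is presumably what's meant; a sketch of the quit-channel idea with a select before each retry (names and the pull step are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Each worker checks the quit channel before every retry and stops early
// once it has been closed.
func pullWithRetries(image string, quit <-chan struct{}) error {
	const maxRetries = 5
	const retryDelay = 10 * time.Millisecond
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		select {
		case <-quit:
			return errors.New("pre-pull cancelled for " + image)
		default: // fall through and attempt the pull
		}
		if lastErr = pull(image); lastErr == nil {
			return nil
		}
		time.Sleep(retryDelay)
	}
	return lastErr
}

func pull(image string) error { return nil } // stand-in for the real pull

func main() {
	quit := make(chan struct{})
	fmt.Println(pullWithRetries("busybox:1.29", quit))
}
```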

Contributor Author

@lsytj0413 lsytj0413 May 19, 2020

Using channels instead of shared memory is recommended in Go, and yes, we could use a channel to solve this problem.

But if we use a channel there are other points to consider: we can't simply close the quit channel to notify the other goroutines to stop, because another goroutine might close it as well and a double close panics. We would have to guard against that somehow, which brings us back to the same two options: shared memory or channels :).

If we instead send signals on the quit channel, we must send at least parallelImagePullCount values to ensure every goroutine receives one. I think that is a little too complicated for this problem.

Finally, we could use a context and its cancel function to simplify the solution, since cancel can be called multiple times safely. Do you think that would be better?
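A small sketch of what that would look like; context.CancelFunc is documented as doing nothing after the first call, so any failing worker can call cancel() without coordination:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cancel() // first failing worker cancels
	cancel() // subsequent calls do nothing, no panic

	select {
	case <-ctx.Done():
		fmt.Println("cancelled:", ctx.Err()) // context.Canceled
	default:
		fmt.Println("still running")
	}
}
```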

Contributor

Can you share more around where you're reading about contexts being able to be cancelled more than once? https://www.sohamkamani.com/golang/2018-06-17-golang-using-context-cancellation/#gotchas-and-caveats seems to suggest otherwise.

Another option is that we could set up a channel in the main goroutine on which the worker goroutines can send errors back to the main goroutine. The first time that channel receives an error, it would close the quit channel. In that way, only a single thread can try and close the quit channel, and we don't need to worry about race conditions. It also has the side benefit of returning the errors to the main goroutine.
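A sketch of that single-closer pattern, with illustrative names: workers report failures on errCh and only the collector goroutine ever closes quit, so a double close cannot happen.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func main() {
	images := []string{"imageA", "imageB", "imageC"}
	errCh := make(chan error, len(images)) // buffered so workers never block
	quit := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			select {
			case <-quit:
				return // another pull already failed; skip further work
			default:
			}
			errCh <- errors.New("could not pull " + image) // stand-in failure
		}(image)
	}

	// Only this goroutine closes quit.
	go func() {
		if err := <-errCh; err != nil {
			fmt.Println("first failure:", err)
			close(quit)
		}
	}()

	wg.Wait()
}
```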

Contributor Author

Got this from a comment in the Go project's source code.

Contributor

Ah, that certainly seems pretty convincing :)

Mind just testing to verify that calling cancel multiple times works ok? If yes, then using a context sounds perfect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with 100 goroutines (each invoking cancel 100 times); it works as expected.

Refactored from atomic to context, PTAL :)
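Roughly the check described; many goroutines invoking the same cancel function repeatedly must complete without a panic:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				cancel() // repeated and concurrent calls are no-ops after the first
			}
		}()
	}
	wg.Wait()
	fmt.Println("no panic, ctx.Err() =", ctx.Err()) // context.Canceled
}
```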

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/retest

One final request around error handling but the context use looks perfect - thanks for working with me on that :)

}

wg.Wait()
for _, err := range pullErrs {
Contributor

Final request - can we aggregate all of the non-nil pullErrors into a single error, instead of just returning the first error? Thanks

Contributor Author

Aggregated all pullErrs into a single error; if there are no errors in the slice, utilerrors.NewAggregate returns nil. PTAL
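For reference, a small sketch of how utilerrors.NewAggregate (from k8s.io/apimachinery/pkg/util/errors) behaves; it skips nil entries and returns nil when nothing is left:

```go
package main

import (
	"errors"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	pullErrs := []error{
		nil, // a successful pull contributes no error
		errors.New("could not pull imageA"),
		errors.New("could not pull imageB"),
	}
	fmt.Println(utilerrors.NewAggregate(pullErrs))   // both messages, nil entry skipped
	fmt.Println(utilerrors.NewAggregate(nil) == nil) // true: nothing to report
}
```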

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Awesome!

I think the final step is to run ./hack/update-bazel.sh, which should fix the hack-verify errors.

Also, do you mind squashing your commits into a single commit?

Thank you!

@mattjmcnaughton
Contributor

/retest

@lsytj0413
Contributor Author

Awesome!

I think the final step is to run ./hack/update-bazel.sh, which should fix the hack-verify errors.

Also, do you mind squashing your commits into a single commit?

Thank you!

Squashed into a single commit and updated BUILD files with ./hack/update-bazel.sh; the hack-verify errors are gone. (There are still three failed jobs; I don't know whether they are caused by this PR, so let me know if they need to be fixed :).

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/lgtm

Awesome, thank you! I agree that the failures appear unrelated to this diff. Retesting to verify.

/retest

@lsytj0413
Contributor Author

/test pull-kubernetes-e2e-kind

1 similar comment
@lsytj0413
Contributor Author

/test pull-kubernetes-e2e-kind

break

imageCh := make(chan int, len(images))
for i := range images {
Contributor Author

There was a deadlock in the previous implementation; I fixed it and all tests pass now. PTAL @mattjmcnaughton

Contributor

Can you share more about what the deadlock was @lsytj0413? I would also love to know how we are confident this implementation avoids the deadlock. Thanks :)

Contributor Author

The previous implementation used the variable imageCh in the for range statement where it should have used images, so reading from imageCh deadlocked.
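A self-contained illustration of that mix-up (names are illustrative): the buggy loop ranges over the channel it is supposed to fill, so it blocks forever waiting for a value nobody sends, while the fixed version ranges over images and closes the channel.

```go
package main

func main() {
	images := []string{"busybox:1.29", "nginx:1.14-alpine"}
	imageCh := make(chan int, len(images))

	// Buggy version: `for i := range imageCh { imageCh <- i }` blocks on the
	// first receive from the empty channel and the program deadlocks.

	// Fixed version: iterate over images, then close the channel so the
	// workers' `for i := range imageCh` loops terminate.
	for i := range images {
		imageCh <- i
	}
	close(imageCh)

	for i := range imageCh {
		_ = images[i] // a worker would pull images[i] here
	}
}
```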

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/lgtm

I've read through this and it looks good to me!

I'd love whoever gives the final sign-off to also take a close look, as concurrency can be tricky :)

@lsytj0413 do you have stats around how much faster this pulling images step is w/ the parallelism? I think knowing the stats will help the final approver judge whether the speed up is worth the complexity added by the concurrency.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2020
@lsytj0413
Contributor Author

/lgtm

I've read through this and it looks good to me!

I'd love whoever gives the final sign-off to also take a close look, as concurrency can be tricky :)

@lsytj0413 do you have stats around how much faster this pulling images step is w/ the parallelism? I think knowing the stats will help the final approver judge whether the speed up is worth the complexity added by the concurrency.

I'd love to provide these stats. Is there a command I can use to measure it, or should I find my own way to collect them?

@mattjmcnaughton
Contributor

mattjmcnaughton commented Jun 5, 2020 via email

@lsytj0413
Contributor Author

I tested the image prepull step both concurrently and sequentially, and collected the following stats:

All images list:

docker.io/library/busybox:1.29
docker.io/library/httpd:2.4.38-alpine
docker.io/library/nginx:1.14-alpine
docker.io/library/perl:5.26
docker.io/nfvpe/sriov-device-plugin:v3.1
gcr.io/kubernetes-e2e-test-images/ipc-utils:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/npb-ep:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/npb-is:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/tf-wide-deep-amd64:1.0
gcr.io/kubernetes-e2e-test-images/nonewprivs:1.0
gcr.io/kubernetes-e2e-test-images/nonroot:1.0
gcr.io/kubernetes-e2e-test-images/volume/gluster:1.0
gcr.io/kubernetes-e2e-test-images/volume/nfs:1.0
google/cadvisor:latest
k8s.gcr.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff
k8s.gcr.io/node-problem-detector:v0.6.2
k8s.gcr.io/nvidia-gpu-device-plugin@sha256:4b036e8844920336fa48f36edeb7d4398f426d6a934ba022848deed2edbf09aa
k8s.gcr.io/pause:3.2
k8s.gcr.io/stress:v1
us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.13

Concurrent image pull times:

354.73s
291.34s
271.33s

The average is 305.80s.

Sequential image pull times:

684.78s
711.86s
797.39s

The average is 731.34s.

As the stats show, concurrent pulling is roughly 2x faster than sequential (305.80s vs 731.34s on average). I hope this is helpful in judging this PR.

/cc @mattjmcnaughton

/cc @tallclair
Could you take a look at this PR?

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Awesome, thanks for the stats @lsytj0413 !

I personally think a 2x speed up is worth it for the added complexity of running this in parallel.

/lgtm
/assign @tallclair for final sign off.

@dims
Member

dims commented Jul 7, 2020

/assign @spiffxp @BenTheElder

@dims
Member

dims commented Jul 7, 2020

/test pull-kubernetes-e2e-gce-ubuntu-containerd

Contributor

@alejandrox1 alejandrox1 left a comment

Thank you for working on this @lsytj0413! It looks great.

/approve
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 7, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alejandrox1, lsytj0413

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2020
@alejandrox1 alejandrox1 added this to Reviewer approved in SIG Node CI/Test Board Jul 7, 2020
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@spiffxp
Member

spiffxp commented Jul 8, 2020

/test pull-kubernetes-kubemark-e2e-gce-big

@k8s-ci-robot k8s-ci-robot merged commit 9eced04 into kubernetes:master Jul 8, 2020
SIG Node CI/Test Board automation moved this from Reviewer approved to Done Jul 8, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 8, 2020
@lsytj0413 lsytj0413 deleted the fix-89443 branch July 9, 2020 01:06
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Parallelize prepulling all images in e2e_node tests
10 participants