
test(e2e_node): Parallelize prepulling all images in e2e_node tests #91007

Merged
merged 1 commit into kubernetes:master from lsytj0413:fix-89443 on Jul 8, 2020

Conversation

lsytj0413
Contributor

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Improve the runtime of the test suite by prepulling the images in parallel.

Which issue(s) this PR fixes:

Fixes #89443

Special notes for your reviewer:

/cc @mattjmcnaughton

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 12, 2020
@k8s-ci-robot
Contributor

Hi @lsytj0413. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 12, 2020
@Joseph-Irving
Member

/ok-to-test
/kind feature

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 12, 2020
for i := 0; i < maxImagePullRetries; i++ {
if i > 0 {
time.Sleep(imagePullRetryDelay)
wg.Add(1)
Member

could you instead do

wg.Add(len(images))

outside the loop as the length is pre-determined?
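For reference, a minimal sketch of that suggestion, with the WaitGroup sized once before the loop (the pull step here is a stand-in, not the PR's actual puller):

```go
package main

import (
	"fmt"
	"sync"
)

func prePullAll(images []string) {
	var wg sync.WaitGroup
	wg.Add(len(images)) // size the WaitGroup once; the count is known up front
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			fmt.Println("pulling", image) // stand-in for the real pull
		}(image)
	}
	wg.Wait()
}

func main() {
	prePullAll([]string{"busybox:1.29", "nginx:1.14-alpine"})
}
```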

Contributor Author

done, PTAL

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Wonderful, thanks for acting so fast on this issue :)

Overall, looks like a strong approach. I had a couple of questions I'm curious to hear your answer on. Please let me know if they don't make sense :)

Also, could you please add a release note of:

NONE

Thanks!

for i := 0; i < maxImagePullRetries; i++ {
if i > 0 {
time.Sleep(imagePullRetryDelay)
go func(image string) {
Contributor

One thing I was thinking about when I originally filed the issue - do we want to limit the number of images that we try and pull at once?

In other words, if the images list grows long, could we envision it becoming too expensive to try and download all of them in parallel? How could we protect ourselves against that happening?

Contributor Author

At the moment the images list is 20+ entries, so this may not be expensive? In other words, we could fix this when it actually happens.

If we need to fix it now, how about using a semaphore to limit the number of goroutines?

Contributor

Hmmm, I feel like while we are working on this code, we might as well make it robust for the future? What do you think?

A semaphore is definitely one path to consider. I'd also look at a worker pool, which lets us leverage channels, a slightly more idiomatic way of doing concurrency in Go.
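A rough sketch of the worker-pool idea, assuming a fixed worker count and a stand-in pull step (not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// A bounded number of workers drain a channel of image names, which limits
// how many pulls run at once no matter how long the images list grows.
func prePullAll(images []string, workers int) {
	imageCh := make(chan string, len(images))
	for _, image := range images {
		imageCh <- image
	}
	close(imageCh) // workers' range loops end once the channel is drained

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for image := range imageCh {
				fmt.Println("pulling", image) // stand-in for the real pull
			}
		}()
	}
	wg.Wait()
}

func main() {
	prePullAll([]string{"busybox:1.29", "nginx:1.14-alpine", "perl:5.26"}, 2)
}
```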

Contributor Author

Agreed that we should make this as robust as we can. I will rework it to use a worker pool.

Contributor

Awesome, thank you :)

Contributor Author

done, PTAL

if pullErr != nil {
klog.Warningf("Could not pre-pull image %s %v output: %s", image, pullErr, output)
once.Do(func() {
err = pullErr
Contributor

Using once.Do is definitely one option here. I'm wondering if you considered any alternatives? For example, we could also concatenate all of the errors from pulling individual images.
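For context, a minimal sketch of the once.Do pattern being discussed, where only the first failure is kept (the pull function is a hypothetical stand-in):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func pullAll(images []string) error {
	var (
		err  error
		once sync.Once
		wg   sync.WaitGroup
	)
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			if pullErr := pull(image); pullErr != nil {
				once.Do(func() { err = pullErr }) // record only the first failure
			}
		}(image)
	}
	wg.Wait()
	return err
}

func pull(image string) error { // hypothetical pull that always fails
	return errors.New("could not pull " + image)
}

func main() {
	fmt.Println(pullAll([]string{"imageA", "imageB"}))
}
```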

Contributor Author

Yes, we could definitely use a []error to collect every error that occurs and concatenate them, but I don't think that's meaningful if the caller only compares err with nil.

Contributor

I don't have strong opinions :) Maybe a slight preference for saving as much error information as possible, but I'm flexible :)

}
if output, err = puller.Pull(image); err == nil {
break
if pullErr != nil {
Contributor

One behavior of the serialized approach was that if we failed to pull any single image, we would immediately fail the whole function. Thoughts on whether we want to retain that behavior in the parallel approach? Right now, we wait for all of the images to attempt a pull before returning any type of error value.

Contributor Author

puller.Pull doesn't take anything like a ctx param to cancel an in-flight pull (neither the exec.Command path nor the ImageService interface does), so we must wait for each pull to return. If we exited PrePullAllImages without waiting for the sub-goroutines to finish, we would leak goroutines.

In other words, we can only fail the pull process early at the point where each pull retry starts.

Contributor

Just to confirm - you're saying that we can't cancel the current pull, BUT we can prevent it from retrying if it fails (if another image pull has already failed)? I'm comfortable with waiting until the possible retry to fail (and agree with you there isn't another option). I still think that's preferable to letting all of the attempted pulls go through the full number of retries.

Contributor Author

Yes, we can prevent it from retrying if any single image pull failed.

@lsytj0413
Contributor Author

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 14, 2020
Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/retest

Thanks for your thoughtful responses to my qs :) I responded - lmk if you have any qs.

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Looking good!

Thanks for implementing the limit on the number of parallel images we pull at once.

I have one last thought on how we can communicate to the different workers that they need to stop, but other than that, I think this is looking good!

output []byte
)
for retryCount := 0; retryCount < maxImagePullRetries; retryCount++ {
if atomic.LoadInt32(&pullProgressStoped) == 1 {
Contributor

I'm open to using atomic here - however, it seems like we're trying to communicate via shared memory, and my understanding is that Golang tries to prefer channels instead of shared memory.

Thoughts on having a quit channel, and then whenever we want to stop, we can send a signal to the quit channel? We could use a select statement with a fall through to reattempting the pull as the default case.

Let me know if a code sample could be helpful :)
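Something along these lines is presumably what's meant; a sketch of the quit-channel idea with a select before each retry (names and the pull step are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Each worker checks the quit channel before every retry and stops early
// once it has been closed.
func pullWithRetries(image string, quit <-chan struct{}) error {
	const maxRetries = 5
	const retryDelay = 10 * time.Millisecond
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		select {
		case <-quit:
			return errors.New("pre-pull cancelled for " + image)
		default: // fall through and attempt the pull
		}
		if lastErr = pull(image); lastErr == nil {
			return nil
		}
		time.Sleep(retryDelay)
	}
	return lastErr
}

func pull(image string) error { return nil } // stand-in for the real pull

func main() {
	quit := make(chan struct{})
	fmt.Println(pullWithRetries("busybox:1.29", quit))
}
```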

Contributor Author

@lsytj0413 lsytj0413 May 19, 2020

Using channels instead of shared memory is recommended in Go, and yes, we could use a channel to solve this problem.

But if we use a channel there are other points to consider: we can't simply close the quit channel to notify the other goroutines to stop, because another goroutine might close it as well and a double close panics. We would have to guard against that somehow, which brings us back to the same two options: shared memory or channels :).

If we instead send signals on the quit channel, we must send at least parallelImagePullCount values to ensure every goroutine receives one. I think that is a little too complicated for this problem.

Finally, we could use a context and its cancel function to simplify the solution, since cancel can be called multiple times safely. Do you think that would be better?
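A small sketch of what that would look like; context.CancelFunc is documented as doing nothing after the first call, so any failing worker can call cancel() without coordination:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cancel() // first failing worker cancels
	cancel() // subsequent calls do nothing, no panic

	select {
	case <-ctx.Done():
		fmt.Println("cancelled:", ctx.Err()) // context.Canceled
	default:
		fmt.Println("still running")
	}
}
```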

Contributor

Can you share more around where you're reading about contexts being able to be cancelled more than once? https://www.sohamkamani.com/golang/2018-06-17-golang-using-context-cancellation/#gotchas-and-caveats seems to suggest otherwise.

Another option is that we could set up a channel in the main goroutine on which the worker goroutines can send errors back to the main goroutine. The first time that channel receives an error, it would close the quit channel. In that way, only a single thread can try and close the quit channel, and we don't need to worry about race conditions. It also has the side benefit of returning the errors to the main goroutine.
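A sketch of that single-closer pattern, with illustrative names: workers report failures on errCh and only the collector goroutine ever closes quit, so a double close cannot happen.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func main() {
	images := []string{"imageA", "imageB", "imageC"}
	errCh := make(chan error, len(images)) // buffered so workers never block
	quit := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			select {
			case <-quit:
				return // another pull already failed; skip further work
			default:
			}
			errCh <- errors.New("could not pull " + image) // stand-in failure
		}(image)
	}

	// Only this goroutine closes quit.
	go func() {
		if err := <-errCh; err != nil {
			fmt.Println("first failure:", err)
			close(quit)
		}
	}()

	wg.Wait()
}
```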

Contributor Author

Got this from a comment in the Go project's source code.

Contributor

Ah, that certainly seems pretty convincing :)

Mind just testing to verify that calling cancel multiple times works ok? If yes, then using a context sounds perfect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with 100 goroutines (each invoking cancel 100 times); it works as expected.

Refactored from atomic to context, PTAL :)
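Roughly the check described; many goroutines invoking the same cancel function repeatedly must complete without a panic:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				cancel() // repeated and concurrent calls are no-ops after the first
			}
		}()
	}
	wg.Wait()
	fmt.Println("no panic, ctx.Err() =", ctx.Err()) // context.Canceled
}
```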

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/retest

One final request around error handling but the context use looks perfect - thanks for working with me on that :)

}

wg.Wait()
for _, err := range pullErrs {
Contributor

Final request - can we aggregate all of the non-nil pullErrors into a single error, instead of just returning the first error? Thanks

Contributor Author

Aggregated all pullErrs into a single error; if there are no errors in the slice, utilerrors.NewAggregate returns nil. PTAL
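For reference, a small sketch of how utilerrors.NewAggregate (from k8s.io/apimachinery/pkg/util/errors) behaves; it skips nil entries and returns nil when nothing is left:

```go
package main

import (
	"errors"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	pullErrs := []error{
		nil, // a successful pull contributes no error
		errors.New("could not pull imageA"),
		errors.New("could not pull imageB"),
	}
	fmt.Println(utilerrors.NewAggregate(pullErrs))   // both messages, nil entry skipped
	fmt.Println(utilerrors.NewAggregate(nil) == nil) // true: nothing to report
}
```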

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Awesome!

I think the final step is to run ./hack/update-bazel.sh, which should fix the hack-verify errors.

Also, do you mind squashing your commits into a single commit?

Thank you!

@mattjmcnaughton
Contributor

/retest

@lsytj0413
Contributor Author

Awesome!

I think the final step is to run ./hack/update-bazel.sh, which should fix the hack-verify errors.

Also, do you mind squashing your commits into a single commit?

Thank you!

Squashed into a single commit and updated BUILD files with ./hack/update-bazel.sh; the hack-verify errors are gone. (There are still three failed jobs; I don't know whether they are caused by this PR, so let me know if they need to be fixed :).

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/lgtm

Awesome, thank you! I agree that the failures appear unrelated to this diff. Retesting to verify.

/retest

@lsytj0413
Contributor Author

/test pull-kubernetes-e2e-kind

1 similar comment
@lsytj0413
Contributor Author

/test pull-kubernetes-e2e-kind

break

imageCh := make(chan int, len(images))
for i := range images {
Contributor Author

There was a deadlock in the previous implementation; I fixed it and all tests pass now. PTAL @mattjmcnaughton

Contributor

Can you share more about what the deadlock was @lsytj0413? I would also love to know how we are confident this implementation avoids the deadlock. Thanks :)

Contributor Author

The previous implementation used the variable imageCh in the for range statement where it should have used images, so reading from imageCh deadlocked.
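A self-contained illustration of that mix-up (names are illustrative): the buggy loop ranges over the channel it is supposed to fill, so it blocks forever waiting for a value nobody sends, while the fixed version ranges over images and closes the channel.

```go
package main

func main() {
	images := []string{"busybox:1.29", "nginx:1.14-alpine"}
	imageCh := make(chan int, len(images))

	// Buggy version: `for i := range imageCh { imageCh <- i }` blocks on the
	// first receive from the empty channel and the program deadlocks.

	// Fixed version: iterate over images, then close the channel so the
	// workers' `for i := range imageCh` loops terminate.
	for i := range images {
		imageCh <- i
	}
	close(imageCh)

	for i := range imageCh {
		_ = images[i] // a worker would pull images[i] here
	}
}
```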

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/lgtm

I've read through this and it looks good to me!

I'd love whoever gives the final sign-off to also take a close look, as concurrency can be tricky :)

@lsytj0413 do you have stats around how much faster this pulling images step is w/ the parallelism? I think knowing the stats will help the final approver judge whether the speed up is worth the complexity added by the concurrency.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2020
@lsytj0413
Contributor Author

/lgtm

I've read through this and it looks good to me!

I'd love whoever gives the final sign-off to also take a close look, as concurrency can be tricky :)

@lsytj0413 do you have stats around how much faster this pulling images step is w/ the parallelism? I think knowing the stats will help the final approver judge whether the speed up is worth the complexity added by the concurrency.

I'd love to provide these stats. Is there a command I can use to measure it, or should I find my own way to collect them?

@mattjmcnaughton
Contributor

mattjmcnaughton commented Jun 5, 2020 via email

@lsytj0413
Contributor Author

I tested the image prepull step both concurrently and sequentially, and collected the following stats:

All images list:

docker.io/library/busybox:1.29
docker.io/library/httpd:2.4.38-alpine
docker.io/library/nginx:1.14-alpine
docker.io/library/perl:5.26
docker.io/nfvpe/sriov-device-plugin:v3.1
gcr.io/kubernetes-e2e-test-images/ipc-utils:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/npb-ep:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/npb-is:1.0
gcr.io/kubernetes-e2e-test-images/node-perf/tf-wide-deep-amd64:1.0
gcr.io/kubernetes-e2e-test-images/nonewprivs:1.0
gcr.io/kubernetes-e2e-test-images/nonroot:1.0
gcr.io/kubernetes-e2e-test-images/volume/gluster:1.0
gcr.io/kubernetes-e2e-test-images/volume/nfs:1.0
google/cadvisor:latest
k8s.gcr.io/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff
k8s.gcr.io/node-problem-detector:v0.6.2
k8s.gcr.io/nvidia-gpu-device-plugin@sha256:4b036e8844920336fa48f36edeb7d4398f426d6a934ba022848deed2edbf09aa
k8s.gcr.io/pause:3.2
k8s.gcr.io/stress:v1
us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.13

Concurrent image pull times:

354.73s
291.34s
271.33s

The average is 305.80s.

Sequential image pull times:

684.78s
711.86s
797.39s

The average is 731.34s.

As the stats show, concurrent pulling is roughly 2x faster than sequential (305.80s vs 731.34s on average). I hope this is helpful in judging this PR.

/cc @mattjmcnaughton

/cc @tallclair
Could you take a look at this PR?

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Awesome, thanks for the stats @lsytj0413 !

I personally think a 2x speed up is worth it for the added complexity of running this in parallel.

/lgtm
/assign @tallclair for final sign off.

@dims
Member

dims commented Jul 7, 2020

/assign @spiffxp @BenTheElder

@dims
Member

dims commented Jul 7, 2020

/test pull-kubernetes-e2e-gce-ubuntu-containerd

Contributor

@alejandrox1 alejandrox1 left a comment

Thank you for working on this @lsytj0413! It looks great.

/approve
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 7, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alejandrox1, lsytj0413

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2020
@alejandrox1 alejandrox1 added this to Reviewer approved in SIG Node CI/Test Board Jul 7, 2020
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@spiffxp
Member

spiffxp commented Jul 8, 2020

/test pull-kubernetes-kubemark-e2e-gce-big

@k8s-ci-robot k8s-ci-robot merged commit 9eced04 into kubernetes:master Jul 8, 2020
SIG Node CI/Test Board automation moved this from Reviewer approved to Done Jul 8, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 8, 2020
@lsytj0413 lsytj0413 deleted the fix-89443 branch July 9, 2020 01:06
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Parallelize prepulling all images in e2e_node tests
10 participants