test(e2e_node): Parallelize prepulling all images in e2e_node tests
#91007
Conversation
Hi @lsytj0413. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
test/e2e_node/image_list.go
Outdated
for i := 0; i < maxImagePullRetries; i++ {
	if i > 0 {
		time.Sleep(imagePullRetryDelay)
wg.Add(1)
Could you instead do wg.Add(len(images)) outside the loop, as the length is pre-determined?
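For example, a minimal sketch of the shape I mean (pullSingleImage is a placeholder, not the PR's actual helper):

```go
package main

import "sync"

// pullSingleImage is a placeholder for the real pull logic in image_list.go.
func pullSingleImage(image string) {}

func prePullAll(images []string) {
	var wg sync.WaitGroup
	// The number of goroutines is known up front, so size the WaitGroup once.
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			pullSingleImage(image)
		}(image)
	}
	wg.Wait()
}

func main() {
	prePullAll([]string{"image-a", "image-b"})
}
```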
done, PTAL
Wonderful, thanks for acting so fast on this issue :)
Overall, looks like a strong approach. I had a couple of questions I'm curious to hear your answer on. Please let me know if they don't make sense :)
Also, could you please add a release note of:
NONE
Thanks!
test/e2e_node/image_list.go
Outdated
for i := 0; i < maxImagePullRetries; i++ {
	if i > 0 {
		time.Sleep(imagePullRetryDelay)
go func(image string) {
One thing I was thinking about when I originally filed the issue - do we want to limit the number of images that we try and pull at once?
In other words, if images grows long, could we envision it becoming too expensive to try and download all of them in parallel? How could we protect ourselves against that happening?
At the moment the images list is only 20+ entries, so this may not be expensive? In other words, we could fix this when it actually happens.
If we need to fix it now, how about using a semaphore to limit the number of goroutines?
Hmmm, I feel like while we are working on this code, we might as well make it robust for the future? What do you think?
Semaphores are definitely one path to consider. I'd also wonder about worker pools, which let us leverage channels; I think that's a slightly more idiomatic way of doing concurrency in Go.
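Roughly the shape I'm imagining (the limit and the pullImage helper here are only illustrative, not what the PR has to use):

```go
package main

import (
	"fmt"
	"sync"
)

const parallelImagePullCount = 5 // illustrative limit, not a value from the PR

func pullImage(image string) error { fmt.Println("pulling", image); return nil }

func prePullAllImages(images []string) {
	// Queue every image as a work item.
	imageCh := make(chan string, len(images))
	for _, image := range images {
		imageCh <- image
	}
	close(imageCh)

	var wg sync.WaitGroup
	wg.Add(parallelImagePullCount)
	// A fixed number of workers drain the channel, bounding how many pulls run at once.
	for i := 0; i < parallelImagePullCount; i++ {
		go func() {
			defer wg.Done()
			for image := range imageCh {
				_ = pullImage(image)
			}
		}()
	}
	wg.Wait()
}

func main() {
	prePullAllImages([]string{"busybox", "nginx", "pause"})
}
```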
Agree with you that we should make it as robust as we can. I will rework it to use a worker pool.
Awesome, thank you :)
done, PTAL
test/e2e_node/image_list.go
Outdated
if pullErr != nil {
	klog.Warningf("Could not pre-pull image %s %v output: %s", image, pullErr, output)
	once.Do(func() {
		err = pullErr
Using once.Do is definitely one option here. I'm wondering if you considered any alternatives? For example, we could also concatenate all of the errors from pulling individual images.
Yes, we could definitely use a []error to collect all of the errors and concatenate them, but I didn't think it was meaningful if the caller just compares the returned err with nil.
I don't have strong opinions :) Maybe a slight preference for saving as much error information as possible, but I'm flexible :)
test/e2e_node/image_list.go
Outdated
}
if output, err = puller.Pull(image); err == nil {
	break
if pullErr != nil {
One behavior from the serialized approach was that if we failed to pull any single image, we would immediately fail the whole function. Thoughts on if we want to retain that behavior in the parallel approach? Right now, we wait for all of the images to attempt a pull before returning any type of error value.
puller.Pull doesn't take any parameter like a ctx to cancel the process (neither the exec.Command-based puller nor the ImageService interface does), so we must wait for an in-flight pull to return (if we exited PrePullAllImages without waiting for the sub-goroutines to finish, we would leak goroutines).
In other words, the earliest point at which we can abort the pull process is when each pull retry starts.
Just to confirm - you're saying that we can't cancel the current pull, BUT we can prevent it from retrying if it fails (if another image pull has already failed)? I'm comfortable with waiting until the possible retry to fail (and agree with you there isn't another option). I still think that's preferable to letting all of the attempted pulls go through the full number of retries.
Yes, we can prevent it from retrying if any single image pull failed.
/release-note-none
/retest
Thanks for your thoughtful responses to my qs :) I responded - lmk if you have any qs.
Looking good!
Thanks for implementing the limit on the number of parallel images we pull at once.
I have one last thought on how we can communicate to the different workers that they need to stop, but other than that, I think this is looking good!
test/e2e_node/image_list.go
Outdated
	output []byte
)
for retryCount := 0; retryCount < maxImagePullRetries; retryCount++ {
	if atomic.LoadInt32(&pullProgressStoped) == 1 {
I'm open to using atomic here - however, it seems like we're trying to communicate via shared memory, and my understanding is that Go prefers channels over shared memory.
Thoughts on having a quit channel, and then whenever we want to stop, we send a signal to the quit channel? We could use a select statement that falls through to reattempting the pull as the default case.
Let me know if a code sample could be helpful :)
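For example, a rough sketch of what I mean (the constants and the pullImage helper are placeholders, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxImagePullRetries = 3
	imagePullRetryDelay = time.Second
)

// pullImage stands in for the real puller; it always fails here for illustration.
func pullImage(image string) error { return errors.New("simulated failure") }

// pullWithRetries keeps retrying unless something closes quit first.
func pullWithRetries(image string, quit <-chan struct{}) error {
	var err error
	for i := 0; i < maxImagePullRetries; i++ {
		select {
		case <-quit:
			// Another worker already failed; stop retrying.
			return fmt.Errorf("aborted pull of %s", image)
		default:
			if i > 0 {
				time.Sleep(imagePullRetryDelay)
			}
			if err = pullImage(image); err == nil {
				return nil
			}
		}
	}
	return err
}

func main() {
	quit := make(chan struct{})
	close(quit) // simulate an earlier failure elsewhere
	fmt.Println(pullWithRetries("busybox", quit))
}
```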
Using channels instead of shared memory is recommended in Go, and yes, we could use a channel to solve this problem.
But if we use a channel there are other points to consider: we can't simply close the quit channel to notify the other goroutines to stop (other goroutines may close the quit channel as well, and a double close panics), so we would need something extra to guard against that (and then we are back to the same two options: shared memory or channels :)).
If we instead send signals to the quit channel, we must send at least parallelImagePullCount values to ensure every goroutine receives one from the channel. I think that is a little complicated for this problem.
Finally, we could use a context and its cancel function to simplify the solution; cancel supports being called multiple times. Do you think this would be better?
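Roughly what I'm thinking of (the pullWorker helper is only for illustration):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pullWorker is illustrative only; on failure every worker may call cancel.
func pullWorker(ctx context.Context, image string, cancel context.CancelFunc) {
	select {
	case <-ctx.Done():
		fmt.Println("skipping", image, "because an earlier pull already failed")
		return
	default:
	}
	// ... do the pull here; pretend it failed:
	cancel() // safe: calling a CancelFunc more than once is a no-op
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go pullWorker(ctx, "busybox", cancel)
	go pullWorker(ctx, "nginx", cancel)
	time.Sleep(100 * time.Millisecond) // crude wait, just for the sketch
}
```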
Can you share more around where you're reading about contexts being able to be cancelled more than once? https://www.sohamkamani.com/golang/2018-06-17-golang-using-context-cancellation/#gotchas-and-caveats seems to suggest otherwise.
Another option is that we could set up a channel in the main goroutine on which the worker goroutines can send errors back to the main goroutine. The first time that channel receives an error, it would close the quit channel. That way, only a single goroutine ever tries to close the quit channel, and we don't need to worry about race conditions. It also has the side benefit of returning the errors to the main goroutine.
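A rough sketch of that idea (failures are simulated; a real implementation would also handle the case where no error ever arrives):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func main() {
	images := []string{"busybox", "nginx", "pause"}
	errCh := make(chan error, len(images)) // buffered so workers never block on send
	quit := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(len(images))
	for _, image := range images {
		go func(image string) {
			defer wg.Done()
			select {
			case <-quit:
				return // another pull already failed; give up
			default:
			}
			errCh <- errors.New("failed to pull " + image) // simulated failure
		}(image)
	}

	// Only this goroutine closes quit, so there is no double-close race.
	go func() {
		if err := <-errCh; err != nil {
			fmt.Println("first error:", err)
			close(quit)
		}
	}()

	wg.Wait()
}
```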
I got this from a comment in the Go project's source code.
Ah, that certainly seems pretty convincing :)
Mind just testing to verify that calling cancel multiple times works ok? If yes, then using a context sounds perfect.
Tested with 100 goroutines (each invoking cancel 100 times); it works as expected.
Refactored from atomic to context, PTAL :)
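For reference, a simplified version of the experiment I ran:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Many goroutines each call cancel repeatedly; no panic and no double-cancel effect.
func main() {
	ctx, cancel := context.WithCancel(context.Background())

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				cancel() // documented to be a no-op after the first call
			}
		}()
	}
	wg.Wait()

	<-ctx.Done()
	fmt.Println("ctx cancelled exactly once, no panic:", ctx.Err())
}
```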
/retest
One final request around error handling but the context use looks perfect - thanks for working with me on that :)
test/e2e_node/image_list.go
Outdated
}

wg.Wait()
for _, err := range pullErrs {
Final request - can we aggregate all of the non-nil pullErrs into a single error, instead of just returning the first error? Thanks!
Aggregated all pullErrs into a single error; if there are no errors in the slice, utilerrors.NewAggregate returns nil. PTAL
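For reference, a standalone sketch of the behavior I'm relying on (assuming k8s.io/apimachinery is on the module path, as it is for this test code):

```go
package main

import (
	"errors"
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

func main() {
	// Nil entries are skipped; if nothing remains, the aggregate is nil,
	// so callers can keep comparing the returned error against nil.
	pullErrs := []error{nil, errors.New("failed to pull busybox"), nil}
	fmt.Println(utilerrors.NewAggregate(pullErrs))                 // -> failed to pull busybox
	fmt.Println(utilerrors.NewAggregate([]error{nil, nil}) == nil) // -> true
}
```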
Awesome!
I think the final step is to run ./hack/update-bazel.sh, which should fix the hack-verify errors.
Also, do you mind squashing your commits into a single commit?
Thank you!
/retest
Squashed into a single commit and updated BUILD with ./hack/update-bazel.sh
/lgtm
Awesome, thank you! I agree that the failures appear unrelated to this diff. Retesting to verify.
/retest
/test pull-kubernetes-e2e-kind
break

imageCh := make(chan int, len(images))
for i := range images {
There was a deadlock in the previous implementation; I fixed it and all tests pass now. PTAL @mattjmcnaughton
Can you share more around what the deadlock was @lsytj0413 ? Would also love to know more around how we are confident this implementation avoids the deadlock. Thanks :)
In the previous implementation I used the variable imageCh in the for range statement when it should have been images, and ranging over imageCh there deadlocks: the producer blocks reading from the channel it is supposed to fill.
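A standalone sketch of the corrected pattern (image names and worker count are only illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	images := []string{"busybox", "nginx", "pause"}

	// Fill the work channel by ranging over the slice. Ranging over imageCh here
	// was the bug: the producer would block reading from its own empty channel.
	imageCh := make(chan int, len(images))
	for i := range images {
		imageCh <- i
	}
	close(imageCh)

	const workers = 2
	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range imageCh {
				fmt.Println("pulling", images[i])
			}
		}()
	}
	wg.Wait()
}
```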
/lgtm
I've read through this and it looks good to me!
I'd love whoever gives the final sign-off to also look with close eyes as concurrency can be tricky :)
@lsytj0413 do you have stats around how much faster this pulling images step is w/ the parallelism? I think knowing the stats will help the final approver judge whether the speed up is worth the complexity added by the concurrency.
I'd love to provide these stats. Is there a command I can use to measure this, or should I find my own way to collect them?
I think the simplest way would just be to add code to the parallel version (and to the old serial version) that outputs how long the function to pull all images took. You can then run the e2e tests a couple of times and manually aggregate the results for time spent pulling images.
Very possible there's a more elegant way to do it as well :)
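For example, something as simple as this (prePullAllImages here is just a stand-in for the real function):

```go
package main

import (
	"fmt"
	"time"
)

func prePullAllImages() error {
	time.Sleep(50 * time.Millisecond) // stand-in for the real pull work
	return nil
}

func main() {
	start := time.Now()
	err := prePullAllImages()
	// Log the elapsed time so repeated e2e runs can be aggregated by hand.
	fmt.Printf("pre-pulling all images took %v (err=%v)\n", time.Since(start), err)
}
```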
I have tested the image pulling step both concurrently and sequentially, and collected the stats as follows.
All images list:
Concurrent pulling cost:
The average is 305.80s.
Sequential pulling cost:
The average is 731.34s.
As the stats show, concurrent pulling is about 2x faster than sequential. I hope this is helpful for judging this PR. /cc @mattjmcnaughton /cc @tallclair
Awesome, thanks for the stats @lsytj0413 !
I personally think a 2x speed up is worth it for the added complexity of running this in parallel.
/lgtm
/assign @tallclair for final sign off.
/assign @spiffxp @BenTheElder
/test pull-kubernetes-e2e-gce-ubuntu-containerd
Thank you for working on this @lsytj0413! It looks great.
/approve
/priority important-longterm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alejandrox1, lsytj0413
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Review the full test history for this PR. Silence the bot with an
/test pull-kubernetes-kubemark-e2e-gce-big
What type of PR is this?
What this PR does / why we need it:
Improve the runtime of the test suite by prepulling the images in parallel.
Which issue(s) this PR fixes:
Fixes #89443
Special notes for your reviewer:
/cc @mattjmcnaughton
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: