New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xDS interop] Updating the config update timeout to 600s #26090
Conversation
This PR in general SG. One thing is that we will need to increase the kokoro timeout, so that the job won't be killed by kokoro, because then we will lose all the logs. Another concern is that, it will also take much longer to find bugs in gRPC clients. We should combine this with the CSDS change. The CSDS check can have a 10 minutes timeout, but the gRPC logic shouldn't need that long. But CSDS only works for branch >= 1.37. So in the old releases, we still need to increase the timeout for everything. That should be OK, because the released clients are quite stable, and we don't expect many failures caused by them. |
I'll defer to Menghan on how this will impact debuggability of gRPC clients, but from a test framework health perspective:
I agree. It is currently very difficult to debug jobs killed by Kokoro mid-run due to a job-level timeout. We avoid that (mostly) by giving most of the Kokoro jobs a 3 hour time out. But this 3 hours was not chosen based on looking at the worst-case run times for each test case, it was just chosen heuristically by slowly increasing the job timeout until we stopped seeing many failures. Bumping (doubling) most of the operation-level timeouts without per-test-case time bounds may lead to more of these problems. Just something to keep an eye out for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, as Stanley pointed out on #26168, with the three new tests added recently we're already bumping up against the existing Kokoro timeouts in each language. That's definitely a problem but I'm no longer concerned about this PR making that worse (I mean, it might, but it's already a problem we need to address directly and fix outside of this PR).
Since we're already having overall job timeout issues, and this change may reduce other flakes, I think we should go ahead with this change and just bump the timeouts. We can then look at the average per-job time and work on fixing that directly (the hope is that moving to the k8s framework (and more pre-built images) will make the timeout issue less of a problem, but all of that is orthogonal to this change).
Lidi, it seems like this should probably also be accompanied with changing the other grpc_<lang>.cfg
files to increase the timeouts to something like the 360 minutes you chose for Java and Go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG.
The PR to increase go's timeout was merged. We will need to backport that to previous releases if we backport this PR.
Sounds good. After #26187 is merged, I can start backporting. |
This PR and #26187 contain many conflicting changes against previous releases. I failed to automate the backport process. Maybe we need to spend an hour or two to handcraft the backport PRs... |
If you make one PR manually for 1.37.x, backporting that to the other releases might be easier 🤷♂️ |
* Updating the config update timeout to 600s * Remove unnecessary lines and comments
This may need to be reverted. Basically every xDS v3 test suite started timing out after 6 hours once this PR went in. Since the tests were previously passing in under 3 hours, it suggests something went wrong (besides #26196) amidst all of the whitespace reformatting here. The logs before timeout repeatedly show:
Maybe the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's revert this first. @ericgribkoff Did you send a PR?
success = False | ||
if success: | ||
logger.info('success') | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this break only breaks from the inner loop, but not the outer While True
.
|
…00s (grpc#26090)" (grpc#26197)" This reverts commit cf7f5a0.
For interop testing the control plane guarantees the eventual delivery of the config update. We agreed upon 600s as the reasonable config update deadline.
Most cases only have one config update, and their deadline will be extended to 600s. For test cases with multiple config updates, each config update will be waited for 600s.
This PR includes many blank space changes, and they will be easier to review in Reviewable.io.
Manual run with C++ passed.
This change is