
[xDS interop] Updating the config update timeout to 600s #26090

Merged · 2 commits merged into grpc:master on May 6, 2021

Conversation

@lidizheng (Contributor) commented on Apr 26, 2021

For interop testing, the control plane guarantees eventual delivery of config updates. We agreed on 600s as a reasonable config update deadline.

Most test cases have only one config update, and their deadline will be extended to 600s. For test cases with multiple config updates, each config update is waited on for up to 600s.
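For illustration, here is a minimal sketch of the per-update deadline semantics described above; the helper name and structure are hypothetical and are not taken from this PR's diff:

```python
import time

_CONFIG_UPDATE_TIMEOUT_SEC = 600  # the agreed-upon per-update deadline


def wait_for_config_update(check_fn, poll_interval_sec=2):
    """Polls check_fn until it succeeds or the 600s budget expires.

    check_fn is a hypothetical callable that returns True once the client
    observes the expected config. In a test case with multiple config
    updates, each update gets its own full 600s budget.
    """
    deadline = time.time() + _CONFIG_UPDATE_TIMEOUT_SEC
    while time.time() < deadline:
        if check_fn():
            return True
        time.sleep(poll_interval_sec)
    return False
```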

This PR includes many whitespace-only changes; they will be easier to review in Reviewable.io.


Manual run with C++ passed.



@lidizheng added the lang/Python and release notes: no labels on Apr 26, 2021
@menghanl (Contributor) commented:

This PR in general SG. One thing is that we will need to increase the Kokoro timeout so that the job won't be killed by Kokoro; otherwise we will lose all the logs.

Another concern is that it will also take much longer to find bugs in gRPC clients. We should combine this with the CSDS change: the CSDS check can have a 10-minute timeout, but the gRPC logic shouldn't need that long.

But CSDS only works for branches >= 1.37, so for the old releases we still need to increase the timeout for everything. That should be OK, because the released clients are quite stable and we don't expect many failures caused by them.

@ericgribkoff (Contributor) commented:

I'll defer to Menghan on how this will impact debuggability of gRPC clients, but from a test framework health perspective:

> This PR in general SG. One thing is that we will need to increase the Kokoro timeout so that the job won't be killed by Kokoro; otherwise we will lose all the logs.

I agree. It is currently very difficult to debug jobs killed by Kokoro mid-run due to a job-level timeout. We avoid that (mostly) by giving most of the Kokoro jobs a 3-hour timeout. But this 3 hours was not chosen by looking at the worst-case run time for each test case; it was chosen heuristically, by slowly increasing the job timeout until we stopped seeing many failures. Bumping (doubling) most of the operation-level timeouts without per-test-case time bounds may lead to more of these problems. Just something to keep an eye out for.

@ericgribkoff (Contributor) left a comment:

So, as Stanley pointed out on #26168, with the three new tests added recently we're already bumping up against the existing Kokoro timeouts in each language. That's definitely a problem but I'm no longer concerned about this PR making that worse (I mean, it might, but it's already a problem we need to address directly and fix outside of this PR).

Since we're already having overall job timeout issues, and this change may reduce other flakes, I think we should go ahead with this change and just bump the timeouts. We can then look at the average per-job time and work on fixing that directly. The hope is that moving to the k8s framework, with more pre-built images, will make the timeout issue less of a problem, but all of that is orthogonal to this change.

Lidi, it seems like this should probably also be accompanied by changing the other grpc_<lang>.cfg files to increase their timeouts to something like the 360 minutes you chose for Java and Go?

@menghanl (Contributor) left a comment:

SG.

The PR to increase Go's timeout was merged. We will need to backport that to previous releases if we backport this PR.

@lidizheng (Contributor, Author) commented:

Sounds good. After #26187 is merged, I can start backporting.

@lidizheng merged commit 9bc421c into grpc:master on May 6, 2021
@lidizheng (Contributor, Author) commented:

This PR and #26187 contain many changes that conflict with previous releases. I failed to automate the backport process; maybe we need to spend an hour or two handcrafting the backport PRs...

@menghanl (Contributor) commented on May 7, 2021:

If you make one PR manually for 1.37.x, backporting that to the other releases might be easier 🤷‍♂️

paulosjca pushed a commit to paulosjca/grpc that referenced this pull request May 7, 2021
* Updating the config update timeout to 600s

* Remove unnecessary lines and comments
@ericgribkoff (Contributor) commented:

This may need to be reverted. Basically every xDS v3 test suite started timing out after 6 hours once this PR went in. Since the tests were previously passing in under 3 hours, it suggests something went wrong (besides #26196) amidst all of the whitespace reformatting here.

The logs before timeout repeatedly show:

2021-05-06 20:35:30,299: INFO     zero_percent_fault_injection: attempt 0
2021-05-06 20:35:40,310: DEBUG    Invoking GetClientAccumulatedStats RPC to localhost:8079:
....
2021-05-06 20:35:40,313: INFO     success
2021-05-06 20:35:40,313: INFO     starting case zero_percent_fault_injection: attempt 1389

Maybe the while True loop introduced in this PR never terminates?
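For reference, here is a distilled sketch of the suspected failure mode; it is illustrative only and not the actual test driver code. A break inside an inner loop exits only that loop, so an outer while True keeps restarting the case forever, which would match the ever-increasing attempt counter in the log above:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_checks():
    # Stand-in for the real per-attempt verification; here it always passes.
    return True


attempt = 0
while True:                      # outer retry loop: nothing ever exits it
    logger.info('starting case zero_percent_fault_injection: attempt %d', attempt)
    while True:                  # inner polling loop
        if run_checks():
            logger.info('success')
            break                # exits only the inner loop...
    attempt += 1                 # ...so the case is restarted indefinitely
```

Running this sketch logs 'success' on every attempt yet never terminates, which is the behavior suggested above.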

ericgribkoff added a commit to ericgribkoff/grpc that referenced this pull request May 7, 2021
@menghanl (Contributor) left a comment:

Yeah, let's revert this first. @ericgribkoff Did you send a PR?

success = False
if success:
    logger.info('success')
    break

Seems this break only breaks out of the inner loop, not the outer while True.
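One conventional way to fix this shape of loop, sketched here generically rather than as the actual change that landed, is to record success in a flag that the outer loop's condition checks (or to drop the unconditional outer while True entirely):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MAX_POLLS = 300        # illustrative bound, not a value from the PR
POLL_INTERVAL_SEC = 2  # illustrative


def run_checks():
    # Stand-in for the real per-attempt verification; here it always passes.
    return True


done = False
attempt = 0
while not done:                      # outer loop now has a real exit condition
    logger.info('starting attempt %d', attempt)
    attempt += 1
    for _ in range(MAX_POLLS):       # bounded inner polling loop
        if run_checks():
            logger.info('success')
            done = True              # recorded where the outer loop can see it
            break                    # still exits only the inner loop...
        time.sleep(POLL_INTERVAL_SEC)
# ...but `while not done` stops retrying once success has been observed.
```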

menghanl added a commit to menghanl/grpc that referenced this pull request May 7, 2021
@menghanl (Contributor) commented on May 7, 2021:

I filed #26199 to revert.
@ericgribkoff already sent #26197.

lidizheng pushed a commit that referenced this pull request May 7, 2021
lidizheng added a commit to lidizheng/grpc that referenced this pull request May 10, 2021