Enhance RetryingRouteInconsistency #7390
Conversation
knative-prow-robot left a comment
@mgencur: 0 warnings.
In response to this:
- also update TestSingleConcurrency to provide more logging on failure
Fixes #
Proposed Changes
Release Note
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```go
// RetryingRouteInconsistency retries common requests seen when creating a new route
func RetryingRouteInconsistency(innerCheck spoof.ResponseChecker) spoof.ResponseChecker {
	const neededSuccesses = 5
```
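For illustration, here is a minimal, self-contained sketch of the consecutive-success idea. The `spoof` types are stubbed out (`Response` and `ResponseChecker` here only mirror the real package's shape); `neededSuccesses` follows the snippet above, everything else is an assumption:

```go
package main

import "fmt"

// Response is a stand-in for spoof.Response (stubbed for this sketch).
type Response struct {
	StatusCode int
}

// ResponseChecker mirrors the shape of spoof.ResponseChecker:
// it returns (done, err); done=false means "keep polling".
type ResponseChecker func(resp *Response) (bool, error)

// RetryingRouteInconsistency only reports success after the inner check
// has passed neededSuccesses times in a row, so a request that happens
// to hit one healthy router doesn't end the poll early.
func RetryingRouteInconsistency(innerCheck ResponseChecker) ResponseChecker {
	const neededSuccesses = 5
	seen := 0
	return func(resp *Response) (bool, error) {
		done, err := innerCheck(resp)
		if err != nil || !done {
			seen = 0 // any failure resets the streak
			return done, err
		}
		seen++
		return seen >= neededSuccesses, nil
	}
}

func main() {
	isOK := func(resp *Response) (bool, error) {
		return resp.StatusCode == 200, nil
	}
	check := RetryingRouteInconsistency(isOK)
	// 4 successes, one failure, then 5 successes: only the 5th
	// consecutive success finishes the poll.
	codes := []int{200, 200, 200, 200, 503, 200, 200, 200, 200, 200}
	for i, c := range codes {
		done, _ := check(&Response{StatusCode: c})
		fmt.Printf("request %d (status %d): done=%v\n", i+1, c, done)
	}
}
```

The streak reset is the important design choice: a failure anywhere restarts the count, so the check only passes once every router in the rotation appears healthy.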
Same as for the thing below, basically: we need this downstream because our routers are inconsistent among themselves too 😂. So we're adding a retry that checks we've seen enough successful requests in a row to assume all routers are good.
Not sure we want this upstream though, as this problem currently doesn't exist here AFAIK.
On a broader note: we might want to discuss having helpers like the ones we need here upstream and configurable. I could imagine other systems having similar issues if they really want to run the Knative E2E tests against a public service, for example, with domains etc. all set up.
Wondering if we might want to add flags to control this, i.e. --ingress-retriable-status-code="404,503" and --ingress-consecutive-requests="5", and Knative upstream's CI just runs with both of these turned off?
Just a thought off the top of my head, but definitely something we might want to discuss and explore, I'd think.
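As a rough sketch of what those proposed flags could look like with Go's standard `flag` package — the flag names come from the comment above; the parsing helper and defaults are hypothetical (a default of 1 / empty string would match "turned off" upstream):

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
	"strings"
)

var (
	// Comma-separated HTTP status codes that should be retried rather
	// than treated as hard failures; empty disables the behavior.
	retriableCodes = flag.String("ingress-retriable-status-code", "",
		`comma-separated HTTP status codes to retry, e.g. "404,503"`)
	// How many consecutive successful responses are required before a
	// route is considered consistently ready; 1 disables the check.
	consecutiveRequests = flag.Int("ingress-consecutive-requests", 1,
		"number of consecutive successful requests required")
)

// parseCodes turns a string like "404,503" into a lookup set.
func parseCodes(s string) (map[int]bool, error) {
	set := map[int]bool{}
	if s == "" {
		return set, nil
	}
	for _, part := range strings.Split(s, ",") {
		code, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, err
		}
		set[code] = true
	}
	return set, nil
}

func main() {
	flag.Parse()
	codes, err := parseCodes(*retriableCodes)
	if err != nil {
		panic(err)
	}
	fmt.Println("retriable codes:", codes, "consecutive:", *consecutiveRequests)
}
```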
Force-pushed from 8126a6d to 9f3292c
I've extracted the changes to TestSingleConcurrency into #7392
The following jobs failed:
Automatically retrying due to test flakiness...
Please add a description as to why this change is needed. I think we should bring this up with the test/productivity working group and see if we can find common ground with these helpers. I completely agree with the sentiment of being willing to run these tests in virtually any environment, but I think we need to be able to relax/tighten requirements here.
Martin, are you still pursuing this?
@vagababov Yes. I'd like this to be merged. The function solves a downstream problem, same as before. It is just more robust, so I don't see a reason for not merging it.
We'll need to send this via the respective working group as mentioned above. I don't think we want to generally water down our assertions here because of downstream projects. At least not unconditionally.
@markusthoemmes I am fine with this change
Historically we needed this because of real issues, and I think we stamped out most of them. I'm hesitant to relax our upstream checks because of this, while I understand the downstream pain here. Should we put some additional retry logic under a flag that's off by default? If anything, I think I'd rather see this adopt an SLO-based approach similar to the probers we run, because this check lets through 100 failures followed by 5 successes. I think that if we had an SLO we were trying to achieve that was configurable via a flag (defaulting to 100%, no minimum population), then we could simply check until that's satisfied. This could be allowed to tolerate some early failures, so long as over time it rises above some minimum acceptable level. WDYT?
That matches my sentiment too, yeah. It should be configurable via a flag. Not sure if we need SLO support though. In this case it's actually okay to fail 100 times and then pass 5 consecutive times.
Do you mean in the case you folks see downstream, or in general?
Definitely behind a flag.
OK. It seems the consensus is to have it behind a flag. I can see a small problem with the SLO approach in this case: it could pass even if the failures (though infrequent) happen at the end of the checks. At that point we really expect that everything works and is ready.
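A tiny numeric illustration of that concern (the numbers are made up): with a 95% target over 100 requests, the overall ratio can meet the SLO even when the most recent responses failed, which is exactly when route readiness matters:

```go
package main

import "fmt"

func main() {
	// 97 successes followed by 3 failures at the very end.
	total, ok := 100, 97
	target := 0.95
	ratio := float64(ok) / float64(total)
	// The SLO is satisfied, yet the last responses failed,
	// so the route may not actually be ready right now.
	fmt.Printf("ratio=%.2f meets %.2f target: %v\n", ratio, target, ratio >= target)
}
```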
Yep, I agree.
I'm afraid any self-explanatory name will be too long. A comment on the flag will help clarify the exact meaning. E.g.
Force-pushed from 9f3292c to 1d8c080
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mgencur. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
The first impl is above. Another option would be to move RetryingRouteInconsistency to knative.dev/pkg and put it inside the default response checkers. For example: This would allow us to delete all occurrences of RetryingRouteInconsistency from all tests in Knative Serving. But we'd need to move the flag too :-/
@mgencur: PR needs rebase.
@mgencur: The following test failed, say
Full PR test history. Your PR dashboard.
@mgencur is this PR still needed? can you please rebase? thanks
@tcnghia my impression was that this is not desired, so I didn't rebase lately. Any comment on that, @markusthoemmes?
@mgencur in that case let's close the PR? thanks
Fixes a downstream problem with routing traffic to the application. There can be multiple routers which start routing properly at different times. This PR works around that by requiring several successful responses in sequence, which may go through different routers. All of them must work before proceeding.
Fixes #