Enhance RetryingRouteInconsistency #7390
Conversation
knative-prow-robot left a comment
@mgencur: 0 warnings.
In response to this:
- also update TestSingleConcurrency to provide more logging on failure
Fixes #
Proposed Changes
Release Note
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```go
// RetryingRouteInconsistency retries common requests seen when creating a new route
func RetryingRouteInconsistency(innerCheck spoof.ResponseChecker) spoof.ResponseChecker {
	const neededSuccesses = 5
```
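For illustration, here is a minimal, self-contained sketch of the consecutive-success idea. The `spoof` types are stubbed out (`Response` and `ResponseChecker` here only mirror the real package's shape); `neededSuccesses` follows the snippet above, everything else is an assumption:

```go
package main

import "fmt"

// Response is a stand-in for spoof.Response (stubbed for this sketch).
type Response struct {
	StatusCode int
}

// ResponseChecker mirrors the shape of spoof.ResponseChecker:
// it returns (done, err); done=false means "keep polling".
type ResponseChecker func(resp *Response) (bool, error)

// RetryingRouteInconsistency only reports success after the inner check
// has passed neededSuccesses times in a row, so a request that happens
// to hit one healthy router doesn't end the poll early.
func RetryingRouteInconsistency(innerCheck ResponseChecker) ResponseChecker {
	const neededSuccesses = 5
	seen := 0
	return func(resp *Response) (bool, error) {
		done, err := innerCheck(resp)
		if err != nil || !done {
			seen = 0 // any failure resets the streak
			return done, err
		}
		seen++
		return seen >= neededSuccesses, nil
	}
}

func main() {
	isOK := func(resp *Response) (bool, error) {
		return resp.StatusCode == 200, nil
	}
	check := RetryingRouteInconsistency(isOK)
	// 4 successes, one failure, then 5 successes: only the 5th
	// consecutive success finishes the poll.
	codes := []int{200, 200, 200, 200, 503, 200, 200, 200, 200, 200}
	for i, c := range codes {
		done, _ := check(&Response{StatusCode: c})
		fmt.Printf("request %d (status %d): done=%v\n", i+1, c, done)
	}
}
```

The streak reset is the important design choice: a failure anywhere restarts the count, so the check only passes once every router in the rotation appears healthy.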
Same as for the thing below, basically: we need this downstream because our routers are inconsistent among themselves too 😂. So we're adding a retry that checks we've seen enough successful requests in a row to assume all routers are good.
Not sure we want this upstream though, as this problem currently doesn't exist here AFAIK.
On a broader note: we might want to discuss having helpers like the ones we need here upstream and configurable. I could imagine other systems having similar issues if they really want to run the Knative E2E tests against a public service, for example, with domains etc. all set up.
Wondering if we might want to add flags to control this, i.e. --ingress-retriable-status-code="404,503" and --ingress-consecutive-requests="5", and Knative upstream's CI just runs with both of these turned off?
Just a thought off the top of my head, but definitely something we might want to discuss and explore, I'd think.
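As a rough sketch of what those proposed flags could look like with Go's standard `flag` package — the flag names come from the comment above; the parsing helper and defaults are hypothetical (a default of 1 / empty string would match "turned off" upstream):

```go
package main

import (
	"flag"
	"fmt"
	"strconv"
	"strings"
)

var (
	// Comma-separated HTTP status codes that should be retried rather
	// than treated as hard failures; empty disables the behavior.
	retriableCodes = flag.String("ingress-retriable-status-code", "",
		`comma-separated HTTP status codes to retry, e.g. "404,503"`)
	// How many consecutive successful responses are required before a
	// route is considered consistently ready; 1 disables the check.
	consecutiveRequests = flag.Int("ingress-consecutive-requests", 1,
		"number of consecutive successful requests required")
)

// parseCodes turns a string like "404,503" into a lookup set.
func parseCodes(s string) (map[int]bool, error) {
	set := map[int]bool{}
	if s == "" {
		return set, nil
	}
	for _, part := range strings.Split(s, ",") {
		code, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, err
		}
		set[code] = true
	}
	return set, nil
}

func main() {
	flag.Parse()
	codes, err := parseCodes(*retriableCodes)
	if err != nil {
		panic(err)
	}
	fmt.Println("retriable codes:", codes, "consecutive:", *consecutiveRequests)
}
```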
Force-pushed from 8126a6d to 9f3292c
I've extracted the changes to TestSingleConcurrency into #7392
The following jobs failed:
Automatically retrying due to test flakiness...
Please add a description as to why this change is needed. I think we should bring this up with the test/productivity working group and see if we can find common ground with these helpers. I completely agree with the sentiment of being willing to run these tests in virtually any environment, but I think we need to be able to relax/tighten requirements here.
Martin, are you still pursuing this?
@vagababov Yes. I'd like this to be merged. The function solves a downstream problem, same as before. It is just more robust, so I don't see a reason for not merging it.
We'll need to send this via the respective working group as mentioned above. I don't think we want to generally water down our assertions here because of downstream projects. At least not unconditionally.
@markusthoemmes I am fine with this change
Historically we needed this because of real issues, and I think we stamped out most of them. I'm hesitant to relax our upstream checks because of this, while I understand the downstream pain here. Should we put some additional retry logic under a flag that's off by default? If anything, I think I'd rather see this adopt an SLO-based approach similar to the probers we run, because this check lets through 100 failures followed by 5 successes. I think that if we had an SLO we were trying to achieve that was configurable via a flag (defaulting to 100%, no minimum population), then we could simply check until that's satisfied. This could be allowed to tolerate some early failures, so long as over time it rises above some minimum acceptable level. WDYT?
That matches my sentiment too, yeah. It should be configurable via a flag. Not sure if we need SLO support though. In this case it's actually okay to fail 100 times and then pass 5 consecutive times.
Do you mean in the case you folks see downstream, or in general?
Definitely behind a flag.
OK. It seems the consensus is to have it behind a flag. I can see a small problem with the SLO approach in this case: it could pass even if the failures (though infrequent) happen at the end of the checks. At that point we really expect that everything works and is ready.
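A tiny numeric illustration of that concern (the numbers are made up): with a 95% target over 100 requests, the overall ratio can meet the SLO even when the most recent responses failed, which is exactly when route readiness matters:

```go
package main

import "fmt"

func main() {
	// 97 successes followed by 3 failures at the very end.
	total, ok := 100, 97
	target := 0.95
	ratio := float64(ok) / float64(total)
	// The SLO is satisfied, yet the last responses failed,
	// so the route may not actually be ready right now.
	fmt.Printf("ratio=%.2f meets %.2f target: %v\n", ratio, target, ratio >= target)
}
```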
Yep, I agree.
I'm afraid any self-explanatory name will be too long. A comment on the flag will help clarify the exact meaning. E.g.
Force-pushed from 9f3292c to 1d8c080
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mgencur. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
The first impl is above. Another option would be to move RetryingRouteInconsistency to knative.dev/pkg and put it inside the default response checkers. For example: This would allow us to delete all occurrences of RetryingRouteInconsistency from all tests in Knative Serving. But we'd need to move the flag too :-/
@mgencur: PR needs rebase.
@mgencur: The following test failed, say
Full PR test history. Your PR dashboard.
@mgencur is this PR still needed? can you please rebase? thanks
@tcnghia my impression was that this is not desired, so I didn't rebase lately. Any comment on that, @markusthoemmes?
@mgencur in that case let's close the PR? thanks
Fixes a downstream problem with routing traffic to the application. There can be multiple routers which start routing properly at different times. This PR works around that by requiring several successful responses in sequence, which may go through different routers. All of them must work before proceeding.
Fixes #