Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: fix issue when skipping first request #39916

Merged
merged 1 commit into from
Jul 14, 2022

Conversation

howardjohn
Copy link
Member

@howardjohn howardjohn commented Jul 13, 2022

The bug (example failure:
https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39896/integ-cni_istio/1546984256612339712)

  • ingressgateway has 1 Gateway
  • Gateway is removed
  • envoy disconnects
  • Envoy reconnects, requests RDS. This hits the INIT/RECONNECT flow.
  • RDS hits 'Gateway missing for route' path and gives no response at all
  • Next RDS request, we get a "stale nonce" since we have no previously
    sent nonce
  • Envoy stuck forever

The fix:

  • Remove code path to return empty route instead of no route (matching
    other paths)
  • Add assertions to ensure that we don't send empty response to requests
    and that we never count a "stale nonce" if we somehow have no
    previously sent nonce, to ensure there aren't any other issues

Please provide a description of this PR:

The bug (example failure:
https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/39896/integ-cni_istio/1546984256612339712)
* ingressgateway has 1 Gateway
* Gateway is removed
* envoy disconnects
* Envoy reconnects, requests RDS. This hits the INIT/RECONNECT flow.
* RDS hits 'Gateway missing for route' path and gives no response at all
* Next RDS request, we get a "stale nonce" since we have no previously
  sent nonce
* Envoy stuck forever

The fix:
* Remove code path to return empty route instead of no route (matching
  other paths)
* Add assertions to ensure that we don't send empty response to requests
  and that we never count a "stale nonce" if we somehow have no
previously sent nonce, to ensure there aren't any other issues
@howardjohn howardjohn added the release-notes-none Indicates a PR that does not require release notes. label Jul 13, 2022
@howardjohn howardjohn requested review from a team as code owners July 13, 2022 18:47
@istio-policy-bot
Copy link

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 13, 2022
@howardjohn
Copy link
Member Author

/test integ-assertion_istio

@howardjohn
Copy link
Member Author

/retest

howardjohn added a commit to howardjohn/istio that referenced this pull request Jul 14, 2022
This flag requires the control plane to also have support, and its not
enabled. This also breaks a (test only) assertion added in
istio#39916.
istio-testing pushed a commit that referenced this pull request Jul 14, 2022
This flag requires the control plane to also have support, and its not
enabled. This also breaks a (test only) assertion added in
#39916.
@hzxuzhonghu
Copy link
Member

When met Gateway missing for route, isn't the listener will be removed too?

@howardjohn
Copy link
Member Author

No, the issue happens because listener is still present for 45s (draining) after removal of the Gateway

/test integ-assertion_istio

(assertion failed due to unrelated issue, fixed in another pr)

@hzxuzhonghu
Copy link
Member

IC, should we fix envoy which should not send rds request for a listener in draining state?

@howardjohn
Copy link
Member Author

It seems reasonable to me for envoy to do that otherwise it cannot serve traffic on the draining listener appropriately. We only set drain to 45s but some may do it for hours even where it makes more sense

@ramaraochavali
Copy link
Contributor

LGTM. Would be good if we can add a test case that covers this. But if you want to add later, I am OK to approve this. Let me know.

@howardjohn
Copy link
Member Author

howardjohn commented Jul 14, 2022 via email

@istio-testing
Copy link
Collaborator

@howardjohn: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
integ-security-multicluster_istio aadf93e link unknown /test integ-security-multicluster_istio

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@howardjohn
Copy link
Member Author

/retest

@istio-testing istio-testing merged commit a52abd8 into istio:master Jul 14, 2022
@SpecialYang
Copy link
Member

@howardjohn When we met the case Gateway missing for route, our gateway respond 404 for a while. We think this pr is the reason.

Is it right that we push empty RDS?

// This can happen when a gateway has recently been deleted. Envoy will still request route
// information due to the draining of listeners, so we should not return an error.
return &route.RouteConfiguration{
Name: routeName,
VirtualHosts: []*route.VirtualHost{},
ValidateClusters: proto.BoolFalse,
}

@howardjohn
Copy link
Member Author

Its hard to say without more info. what did you actually do in terms of creating/deleting GW/VS?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes-none Indicates a PR that does not require release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants