Manage endpoints when KIngress status contains an IP address #11843

Merged
merged 10 commits into from
Aug 26, 2021

Conversation

dprotaso
Member

@dprotaso dprotaso commented Aug 20, 2021

Part of: #11821

Proposed Changes

  • add an error status where the route doesn't own the endpoints object
  • Ingress LBs with IP now have priority over Domain & DomainInternal
  • continue preserving the clusterIP if set - include a test
  • refactor the route constructor to remove duplication
  • manage endpoints when the Ingress returns an IP load balancer status

TODO

  • Test with a fork of net-istio
  • Test with a fork of net-contour
  • Test with a fork of net-kourier

Release Note

NONE

@knative-prow-robot knative-prow-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 20, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Aug 20, 2021
@knative-prow-robot knative-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/API API objects and controllers area/networking approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 20, 2021
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2021
@codecov

codecov bot commented Aug 20, 2021

Codecov Report

Merging #11843 (e1d50a1) into main (21e0d8e) will decrease coverage by 0.07%.
The diff coverage is 84.61%.

@@            Coverage Diff             @@
##             main   #11843      +/-   ##
==========================================
- Coverage   87.81%   87.73%   -0.08%     
==========================================
  Files         196      196              
  Lines        9393     9430      +37     
==========================================
+ Hits         8248     8273      +25     
- Misses        890      895       +5     
- Partials      255      262       +7     
| Impacted Files | Coverage Δ |
| --- | --- |
| pkg/reconciler/route/route.go | 79.89% <ø> (+0.19%) ⬆️ |
| pkg/reconciler/route/reconcile_resources.go | 76.66% <71.42%> (-4.94%) ⬇️ |
| pkg/reconciler/route/resources/service.go | 93.02% <97.10%> (+1.77%) ⬆️ |
| pkg/apis/serving/v1/route_lifecycle.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/route/controller.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/revision/reconcile_resources.go | 80.72% <0.00%> (-2.41%) ⬇️ |
| pkg/reconciler/revision/controller.go | 86.00% <0.00%> (-0.28%) ⬇️ |
| pkg/reconciler/gc/controller.go | 0.00% <0.00%> (ø) |
| pkg/queue/health/health_state.go | 100.00% <0.00%> (ø) |
| ... and 8 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21e0d8e...e1d50a1. Read the comment docs.

@dprotaso
Member Author

Contour changes are here: knative-extensions/net-contour#582 - all I'm doing is including the service IP with the Domain*

@dprotaso
Member Author

Istio changes are here: knative-extensions/net-istio#731 - same sorta diff as contour

note: probably not worth including this change to istio unless they disable external name forwarding like contour

@dprotaso
Member Author

Kourier changes are here: knative-extensions/net-kourier#605 - same diff as the others

@dprotaso
Member Author

/retest

@dprotaso
Member Author

I think the istio mesh tests are just flaky, as the code path hasn't changed
/retest

@dprotaso
Member Author

Everything else went green which is great

Screen Shot 2021-08-21 at 8 49 41 PM

@knative-prow-robot
Contributor

@dprotaso: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-knative-serving-istio-stable-mesh | c5b8f66 | link | /test pull-knative-serving-istio-stable-mesh |

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2021
@nak3
Contributor

nak3 commented Aug 23, 2021

The mesh test is often killed in the middle of the scale test.
This is related to proposal #11552, which changes the test order. (That may not fix the scale test completely, but it exposes some HA tests that are currently hidden by the scale test.)

@@ -60,6 +61,7 @@ func newController(
 ) *controller.Impl {
 	logger := logging.FromContext(ctx)
 	serviceInformer := serviceinformer.Get(ctx)
+	endpointsInformer := endpointsinformer.Get(ctx)
Contributor

Don't we need to add an event handler for the endpoints, i.e. endpointsInformer.Informer().AddEventHandler(handleControllerOf)?

Member Author

@dprotaso dprotaso Aug 23, 2021

I thought about it but decided it wasn't necessary since:

  1. Users will probably not have access to modify this endpoint object (because of CVE-2021-25740: Endpoint & EndpointSlice permissions allow cross-Namespace forwarding kubernetes/kubernetes#103675) - and we have the option for the visibility label on the service
  2. There's no information from the endpoint being propagated to the route

Though we may want to 'scope' the endpoint informer to only watch endpoints with that controller route label.
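
One way to picture the "scoping" idea is a filter that only lets through Endpoints objects carrying a route label. The sketch below is an assumption-laden illustration: the `Endpoints` struct is a trimmed-down stand-in for the Kubernetes type, the label key is assumed, and `scoped` is a hypothetical helper rather than an informer API.

```go
package main

import "fmt"

// routeLabelKey is assumed here for illustration only.
const routeLabelKey = "serving.knative.dev/route"

// Endpoints is a hypothetical, trimmed-down stand-in for corev1.Endpoints.
type Endpoints struct {
	Name   string
	Labels map[string]string
}

// hasRouteLabel reports whether the object carries the route label at all.
func hasRouteLabel(e Endpoints) bool {
	_, ok := e.Labels[routeLabelKey]
	return ok
}

// scoped keeps only the Endpoints a route controller would need to watch.
func scoped(all []Endpoints) []Endpoints {
	var out []Endpoints
	for _, e := range all {
		if hasRouteLabel(e) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	all := []Endpoints{
		{Name: "hello", Labels: map[string]string{routeLabelKey: "hello"}},
		{Name: "kubernetes"},
	}
	fmt.Println(len(scoped(all)))
}
```

In a real controller the same predicate would more likely be applied on the informer side (for example as a label selector), so that unrelated Endpoints never reach the cache.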

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 23, 2021
pkg/reconciler/route/resources/service.go (outdated, resolved)

 	lbStatus := ingressStatus.PublicLoadBalancer
-	if isPrivate || ingressStatus.PrivateLoadBalancer != nil {
+	if isPrivate || privateLB != nil && len(privateLB.Ingress) != 0 {
Contributor

Is len(privateLB.Ingress) != 0 needed here? The following line already checks whether len(privateLB.Ingress) is 0 or more than 1.

Member Author

@dprotaso dprotaso Aug 24, 2021

I added this because I don't think we should error out if the private load balancer has no ingress statuses but the public one does.
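
The condition under discussion relies on Go's operator precedence: `&&` binds tighter than `||`, so `a || b && c` is evaluated as `a || (b && c)`. A small sketch of the resulting behaviour, using a hypothetical `LoadBalancer` type and `usePrivate` helper in place of the real ones:

```go
package main

import "fmt"

// LoadBalancer is a hypothetical stand-in for the KIngress load balancer
// status; Ingress holds one string per reported ingress entry.
type LoadBalancer struct {
	Ingress []string
}

// usePrivate mirrors the shape of the reviewed condition. The parentheses
// are redundant in Go but make the grouping explicit.
func usePrivate(isPrivate bool, privateLB *LoadBalancer) bool {
	return isPrivate || (privateLB != nil && len(privateLB.Ingress) != 0)
}

func main() {
	// An empty private status no longer forces the private path,
	// so the public load balancer can still be used.
	fmt.Println(usePrivate(false, &LoadBalancer{}))
	fmt.Println(usePrivate(false, &LoadBalancer{Ingress: []string{"10.0.0.7"}}))
	fmt.Println(usePrivate(true, nil))
}
```

With the earlier condition (`privateLB != nil` alone), the first call would have selected the private load balancer and then failed the follow-up length check.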

pkg/reconciler/route/reconcile_resources.go (outdated, resolved)
@nak3
Contributor

nak3 commented Aug 23, 2021

/test pull-knative-serving-istio-stable-mesh-short

case corev1.ServiceTypeExternalName:
	canUpdate = true
default:
	// Transitions from ClusterIP to ExternalName Fail
Contributor

We need to delete the placeholder services manually when downgrading Serving. Should we document that?
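
To make the downgrade concern concrete, here is a sketch of the kind of check the snippet above hints at. It assumes that the default branch rejects only the ClusterIP to ExternalName transition; `ServiceType` and `canUpdateInPlace` are hypothetical names, not the identifiers used in the PR.

```go
package main

import "fmt"

// ServiceType is a simplified stand-in for corev1.ServiceType.
type ServiceType string

const (
	ClusterIP    ServiceType = "ClusterIP"
	ExternalName ServiceType = "ExternalName"
)

// canUpdateInPlace reports whether an existing placeholder Service can be
// updated to the desired type without deleting and recreating it.
// Assumption: only ClusterIP -> ExternalName is treated as failing.
func canUpdateInPlace(existing, desired ServiceType) bool {
	switch existing {
	case ExternalName:
		return true
	default:
		return desired != ExternalName
	}
}

func main() {
	fmt.Println(canUpdateInPlace(ExternalName, ClusterIP))
	fmt.Println(canUpdateInPlace(ClusterIP, ExternalName))
}
```

Under that assumption, an older Serving release that only knows how to write ExternalName Services would be unable to update a ClusterIP placeholder in place, which is why manual deletion comes up in this thread.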

Member Author

Good point, I'll add this as a release note.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is bad, as it could cause downgrades to fail by default for all users unless they do manual work. I really feel that we should handle the backward compatibility issue instead of just documenting it.

Would it be OK to:

  1. put this feature behind a flag and set the flag to false in the 0.26 release,
  2. handle the backward compatibility issue in the 0.26 release, and
  3. enable the feature by default in the 0.27 release?

Member Author

Did you see my comment here: #11843 (comment) about a potential downgrade path that doesn't require manually deleting k8s services?

Also, the prior behaviour is preserved: the changes in this PR only take effect once the net-* plugins start setting the LoadBalancer status IP. So if net-istio/kourier don't plan on setting that value, downgrade will work.

@ZhiminXiang ZhiminXiang Aug 25, 2021

Thanks for the pointer, Dave. The comment makes sense.

For downgrading, I am not sure how easy it would be for users to follow the order described in that comment. To make downgrades easier, I would suggest we ship this PR in the 0.26 release and populate the KIngress IP from the net-* repos in 0.27+ releases, if the maintainers are OK with it. WDYT @nak3?

Contributor

Yes, net-contour needs this change ASAP but net-istio/kourier do not need to rush so adopting it on 0.27+ would be fine. It depends on each repo maintainer's decision as you also said, though.

Member Author

> net-istio/kourier do not need to rush so adopting it on 0.27+ would be fine. It depends on each repo maintainer's decision as you also said, though.

I'd say only set the IP address if you need to

Member

@julz julz left a comment

this is awesome, very cool that this works \o/ ❤️

pkg/apis/serving/v1/route_lifecycle.go (resolved)
pkg/reconciler/route/reconcile_resources.go (outdated, resolved)
pkg/reconciler/route/reconcile_resources.go (resolved)
pkg/reconciler/route/reconcile_resources.go (outdated, resolved)
Member

@evankanderson evankanderson left a comment

Do we need a flag here (default false this release) to handle the downgrade case?

@dprotaso
Member Author

Re: Downgrade

So it's actually the ingress provider that drives us to manage an endpoints object, by setting the IP property in the KIngress status. Upgrade and downgrade will therefore work fine as long as the net-* version stays the same.

If the net-* version were to change, the safest thing to do would be:

When upgrading:

  1. Upgrade knative-serving
  2. Upgrade net-* plugin

When downgrading:

  1. Downgrade net-* plugin
  2. Wait for things to reconcile back to what they were before
  3. Downgrade knative-serving

@dprotaso
Member Author

dprotaso commented Aug 24, 2021

Probably worth stating explicitly:

I'm not going to PR the mentioned IP changes to net-kourier and net-istio; they were done only to verify the changes in this PR. I think it's for the net-* maintainers to make that call, and also to decide whether they can even support routing traffic to headless services (I don't see why it wouldn't work).

For net-contour we'll make the change since the default setting is to not support these ExternalName Services.

@dprotaso
Member Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2021
pkg/reconciler/route/resources/service.go (outdated, resolved)
@nak3
Contributor

nak3 commented Aug 26, 2021

/lgtm
/hold

/hold for other reviewers.

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Aug 26, 2021
Member

@julz julz left a comment

🔥 this is great

/lgtm

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, julz


@ZhiminXiang ZhiminXiang left a comment

/lgtm

@dprotaso
Member Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2021
@knative-prow-robot knative-prow-robot merged commit ce627e5 into knative:main Aug 26, 2021
@dprotaso dprotaso deleted the route-endpoint-management branch August 26, 2021 20:08
nealhu pushed a commit to nealhu/serving that referenced this pull request Aug 28, 2021
…#11843)

* add an error status were the route doesn't own the endpoints object

* Ingress LBs with IP now have priority over Domain & DomainInternal

* continue preserving the clusterIP if set - include a test

* refactor the route constructor to remove duplication

* manage endpoints when the Ingress returns an IP load balancer status

* fix comment & drop deleted function

* fix comment

* fix linter warning - remove unused function

* address PR feedback

* fix comment
nak3 added a commit to nak3/serving that referenced this pull request Oct 12, 2021