WIP: Support EndpointSlice in sdn and test handling terminating endpoints #271
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: smarterclayton.
@Miciah this is where I'm testing it
EDIT: Feature gate enablement was in the wrong place
16d1b86 to c918927
Well, the test went green except for an unidling test (I may have broken it just because of the use of the slice for annotations instead of endpoints). Will kick off some upgrade jobs and see how it plays out, as well as a network stress test.
This is a test cherry-pick based on the current vendor state containing upstream 97238, which allows the proxier to handle terminating endpoints. This is not sufficient by itself because we need to test endpoint slices, but ensures the right code is in place.
OK, got one run of e2e and two runs of upgrade with endpoint slices on but the termination gate off, and behavior was similar to endpoints except for the idling bug. Now testing with the termination gate on in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1368251013236002816 as an upgrade and https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1368251449502339072 as an e2e.
Passes on those (for the factors that matter) without EndpointSliceTerminating set on the server side (so we're safe to roll this out first to kube-proxy).
// we track all endpoints in the unidling endpoints handler so that we can successfully
// detect when a service becomes unidling
klog.V(6).Infof("hybrid proxy: (always) add ep %s/%s in unidling proxy", endpoints.Namespace, endpoints.Name)
p.unidlingProxy.OnEndpointsAdd(endpoints)
p.unidlingProxy.OnEndpointSliceAdd(endpoints)
I'm not familiar at all with the sdn/proxy implementation, so maybe this information is redundant, but there can be multiple slices for the same service, and the slices can contain duplicate endpoints. kube-proxy uses a cache to deal with this:
https://github.com/kubernetes/kubernetes/blob/2bcbc527a760106ec89647fcf6852f37c804f4ed/pkg/proxy/endpointslicecache.go#L43-L49
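To illustrate why a consumer of EndpointSlice events might want such a cache, here is a minimal sketch. It is not the kube-proxy implementation; the `sliceCache` type, the simplified `Endpoint` struct, and the method names are all hypothetical. It only shows the idea of tracking each slice separately per service and deduplicating when the merged view is read.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a simplified, hypothetical stand-in for one address/port pair.
type Endpoint struct {
	IP   string
	Port int
}

// sliceCache is a hypothetical cache keyed by service, then by slice name.
type sliceCache struct {
	slices map[string]map[string][]Endpoint
}

func newSliceCache() *sliceCache {
	return &sliceCache{slices: map[string]map[string][]Endpoint{}}
}

// update stores or replaces the contents of one slice of a service.
func (c *sliceCache) update(service, sliceName string, eps []Endpoint) {
	if c.slices[service] == nil {
		c.slices[service] = map[string][]Endpoint{}
	}
	c.slices[service][sliceName] = eps
}

// remove forgets one slice of a service.
func (c *sliceCache) remove(service, sliceName string) {
	delete(c.slices[service], sliceName)
}

// endpointsFor merges all slices of a service and drops duplicates.
func (c *sliceCache) endpointsFor(service string) []Endpoint {
	seen := map[Endpoint]bool{}
	merged := []Endpoint{}
	for _, eps := range c.slices[service] {
		for _, ep := range eps {
			if !seen[ep] {
				seen[ep] = true
				merged = append(merged, ep)
			}
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].IP != merged[j].IP {
			return merged[i].IP < merged[j].IP
		}
		return merged[i].Port < merged[j].Port
	})
	return merged
}

func main() {
	c := newSliceCache()
	c.update("ns/x", "x-1", []Endpoint{{"10.0.0.1", 80}, {"10.0.0.2", 80}})
	c.update("ns/x", "x-2", []Endpoint{{"10.0.0.2", 80}, {"10.0.0.3", 80}})
	fmt.Println(len(c.endpointsFor("ns/x"))) // prints 3: 10.0.0.2 is counted once
}
```

A handler that reacted to each slice event in isolation would see 10.0.0.2 twice, or would lose the endpoints of `x-1` when only `x-2` changed.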
The limit of endpoints per slice is 100 by default, so if you have more than 100 endpoints, say 110 for service X, you'll receive two slices, Y1 and Y2, with perhaps 100 and 10 endpoints respectively.
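The arithmetic can be sketched as follows. This assumes the simplest possible packing, where every slice is filled to the limit before the next one is started; the real EndpointSlice controller is not guaranteed to distribute endpoints this way, and `sliceSizes` is a hypothetical helper.

```go
package main

import "fmt"

// maxEndpointsPerSlice mirrors the per-slice limit of 100 mentioned above.
const maxEndpointsPerSlice = 100

// sliceSizes returns how many endpoints each slice would hold if slices
// were filled greedily up to max.
func sliceSizes(total, max int) []int {
	if max <= 0 {
		return nil
	}
	var sizes []int
	for total > 0 {
		n := total
		if n > max {
			n = max
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	fmt.Println(sliceSizes(110, maxEndpointsPerSlice)) // prints [100 10]
}
```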
We really should have an e2e test in upstream that creates a service with > 100 endpoints then to exercise this. Is there one you know of I can crib?
Actually, for idling, I'm wondering whether we even need to support more than three or four endpoints. The only time the userspace proxy should be in play is for an idle service, which has no endpoints.
> whether we even need to support > three or four endpoints.
As I said, I'm not familiar with this code; I'm just raising some points that I think may need to be taken into consideration. If that is the case, it seems we shouldn't worry about this.
> We really should have an e2e test in upstream that creates a service with > 100 endpoints
This is well tested upstream.
The unidling proxy can just ignore the endpointslices and work with the endpoints like it always did. Endpoints objects always contain the full set of endpoints, even in cases where the EndpointSlice controller would start splitting things up; it just means that code working with the Endpoints objects doesn't get the efficiency wins that code working with the EndpointSlice objects would get.
Idling is broken because this flips us to only using EndpointSlice, but the userspace proxy (which is used by idling) doesn't support EndpointSlice.
@@ -245,6 +245,11 @@ func (o *Options) Complete() error {
return err
}

// DO NOT MERGE: hack endpoint slice on
You need to revert 12b21f8 from #227. (That will result in EndpointSlice and EndpointSliceProxying being enabled since they're enabled by default.) For EndpointSliceTerminatingCondition we should eventually handle that like other feature gates, but CNO doesn't watch the FeatureGate resource yet (https://issues.redhat.com/browse/SDN-1325).
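The "enabled by default" reasoning can be sketched like this. It is a stand-alone illustration, not the Kubernetes feature-gate API: `resolveGates` is a hypothetical helper, the explicit `false` overrides are an assumption about what the reverted commit did, and the `false` default for EndpointSliceTerminatingCondition is also an assumption for this release.

```go
package main

import "fmt"

// resolveGates returns the effective gate states: defaults first,
// then any explicit overrides on top.
func resolveGates(defaults, overrides map[string]bool) map[string]bool {
	effective := make(map[string]bool, len(defaults))
	for name, on := range defaults {
		effective[name] = on
	}
	for name, on := range overrides {
		effective[name] = on
	}
	return effective
}

func main() {
	defaults := map[string]bool{
		"EndpointSlice":                     true,
		"EndpointSliceProxying":             true,
		"EndpointSliceTerminatingCondition": false, // assumed default
	}

	// Assumed state before the revert: the gates are explicitly forced off.
	forcedOff := map[string]bool{"EndpointSlice": false, "EndpointSliceProxying": false}
	before := resolveGates(defaults, forcedOff)
	fmt.Println(before["EndpointSliceProxying"]) // prints false

	// After the revert there is no override, so the defaults apply.
	after := resolveGates(defaults, nil)
	fmt.Println(after["EndpointSliceProxying"]) // prints true
}
```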
oh, wait, no, Aniket already reverted that a while back; we'll need to revert the relevant part of openshift/cluster-network-operator#905
proxyconfig.NoopEndpointSliceHandler
// TODO implement https://github.com/kubernetes/enhancements/pull/640
proxyconfig.NoopNodeHandler
NoopEndpointsHandler
The HybridProxier can't no-op Endpoints handling; it has to pass EndpointSlice events down to the iptables proxier and Endpoints events to the userspace proxier. And since OsdnProxy acts as a filter on top of HybridProxier, it needs to also pass both sets of events down to the proxier it's wrapping.
I updated the userspace proxier to use EndpointSlice; I thought we were already going to have to switch to using Service instead of Endpoints for idling.
Oops, I see what you mean. Why wasn't userspace proxier updated? Just no one signed up for it?
> Why wasn't userspace proxier updated? Just no one signed up for it?
Upstream doesn't care about the userspace proxier any more (Tim would probably have already deleted it if OCP wasn't using it for unidling) and Red Hat had thought we weren't going to have to use EndpointSlice in openshift-sdn, so we didn't care about updating it either.
At any rate, I think we don't actually need to update userspace to use EndpointSlice; we just need to make HybridProxier and OsdnProxy pass both endpoint events and endpointslice events down to their wrapped proxiers, and then eventually the iptables proxy will act on the endpointslice events and the userspace proxy will act on the endpoint events.
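A minimal sketch of that pass-through design follows. The types here are simplified stand-ins, not the openshift-sdn or kube-proxy code: `Endpoints` and `EndpointSlice` are reduced to a namespace and name, only the `Add` events are modeled, and `hybridProxier`, `filterProxy`, and `recorder` are hypothetical names. Only the method names `OnEndpointsAdd` and `OnEndpointSliceAdd` come from the diff under review.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API objects.
type Endpoints struct{ Namespace, Name string }
type EndpointSlice struct{ Namespace, Name string }

// Proxier receives both kinds of events.
type Proxier interface {
	OnEndpointsAdd(ep *Endpoints)
	OnEndpointSliceAdd(slice *EndpointSlice)
}

// recorder stands in for a concrete proxier and logs what it receives.
type recorder struct{ events []string }

func (r *recorder) OnEndpointsAdd(ep *Endpoints) {
	r.events = append(r.events, "endpoints "+ep.Namespace+"/"+ep.Name)
}

func (r *recorder) OnEndpointSliceAdd(s *EndpointSlice) {
	r.events = append(r.events, "slice "+s.Namespace+"/"+s.Name)
}

// hybridProxier routes Endpoints events to the userspace proxier and
// EndpointSlice events to the iptables proxier.
type hybridProxier struct {
	userspace Proxier
	iptables  Proxier
}

func (p *hybridProxier) OnEndpointsAdd(ep *Endpoints) { p.userspace.OnEndpointsAdd(ep) }

func (p *hybridProxier) OnEndpointSliceAdd(s *EndpointSlice) { p.iptables.OnEndpointSliceAdd(s) }

// filterProxy plays the role of the wrapping filter: it must forward both
// event types to the proxier it wraps (the filtering itself is omitted).
type filterProxy struct{ next Proxier }

func (f *filterProxy) OnEndpointsAdd(ep *Endpoints) { f.next.OnEndpointsAdd(ep) }

func (f *filterProxy) OnEndpointSliceAdd(s *EndpointSlice) { f.next.OnEndpointSliceAdd(s) }

func main() {
	userspace, iptables := &recorder{}, &recorder{}
	top := &filterProxy{next: &hybridProxier{userspace: userspace, iptables: iptables}}

	top.OnEndpointsAdd(&Endpoints{Namespace: "ns", Name: "svc"})
	top.OnEndpointSliceAdd(&EndpointSlice{Namespace: "ns", Name: "svc-abc"})

	fmt.Println(userspace.events) // prints [endpoints ns/svc]
	fmt.Println(iptables.events)  // prints [slice ns/svc-abc]
}
```

If either wrapper embedded a no-op handler for one event type, the proxier below it would never see those events, which matches the idling breakage described earlier in the thread.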
FYI #296 is a more complete EndpointSlice PR
/close
@danwinship: Closed this PR.
86ceaca (Clayton Coleman, 5 minutes ago)
DO NOT MERGE: Force EndpointSliceProxying on
bf1fd9b (Clayton Coleman, 12 minutes ago)
DO NOT MERGE: UPSTREAM: 97238: Handle terminating endpoints
This is a test cherry-pick based on the current vendor state containing
upstream 97238, which allows the proxier to handle terminating endpoints.
This is not sufficient by itself because we need to test endpoint slices,
but ensures the right code is in place.