Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update apiserver endpoint reconcilers to support dual-stack #117704

Closed

Conversation

danwinship
Copy link
Contributor

What type of PR is this?

/kind feature
/kind bug

What this PR does / why we need it:

This makes the apiserver endpoint reconcilers support dual-stack. It does not add dual-stack support to anything outside of pkg/controlplane/reconcilers; this is just step 1 to making the apiservers dual-stack.

In fact, most of this PR is just rewriting the reconcilers in preparation for supporting dual-stack. In current git, the way that the reconcilers work is:

That won't work for dual-stack, because the v1.Endpoints object can't represent dual-stack endpoints. So the PR changes the adapter/reconciler interface so that now the reconcilers only work with IPs and ports, and the adapter takes the IPs/ports and creates/updates an appropriate set of Endpoints and one or more EndpointSlices (simplifying the code and fixing a handful of bugs in the process).

Which issue(s) this PR fixes:

Fixes #101070
(and does more on top of that)

Special notes for your reviewer:

The first 4 commits (up to "Misc endpointsadapter_test.go improvements") are just improvements/cleanups to the unit tests.

Then the next 4 (up to "Remove unused EndpointsAdapter Create/Update return values") are minor API improvements and code cleanups.

The next 2 ("Merge Create/Update/EnsureEndpointSliceFromEndpoints together" and "Refactor duplicated EndpointReconciler code into EndpointsAdapter") are the core of the PR, changing the adapter/reconciler interface as described above so that most of the work is done in the adapter and the reconcilers only work with IPs/ports. (With the following commit, "Use retry.OnError in reconciler Sync" as a fixup to the new API.)

Then the next 2 ("Make the endpoint reconciler explicitly aware of the primary IP family" and "Generate both Endpoints and EndpointSlices from scratch") prepare it to deal with multiple EndpointSlices rather than only one, and "Make the reconciler correctly handle unexpected apiserver EndpointSlices" actually implements that, fixing #101070 and in particular fixing the "downgrading from dual-stack post-KEP-2438 apiservers to pre-KEP-2438 apiservers" problem.

Then finally the last commit actually adds dual-stack support. FWIW, we could omit that commit for now and just merge everything else, since it doesn't do anything externally-visible yet, and maybe the way I changed the API (in terms of separate primary/secondary IP args vs a single array arg) will turn out to not work great with the eventual apiserver changes.

However, it definitely makes sense to commit everything up to "Make the reconciler correctly handle unexpected apiserver EndpointSlices", since that is something we want before having proper dual-stack support, for rollback support.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2438-dual-stack-apiserver

/sig network
/sig api-machinery

/assign @aojea @robscott

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels May 1, 2023
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels May 1, 2023
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 1, 2023
@danwinship danwinship marked this pull request as ready for review May 2, 2023 12:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 2, 2023
@danwinship danwinship force-pushed the dual-stack-endpoint-reconciling branch from b96bc83 to 9996147 Compare May 2, 2023 17:40
@cici37
Copy link
Contributor

cici37 commented May 2, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 2, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2023
to produce better outputs on errors
The unit tests were testing the behavior of the code when the name or
namespace of the kubernetes service changed, but (a) this has never
happened, (b) this probably won't ever happen, and (c) if it was going
to happen, the behavior currently being asserted by the unit tests is
clearly not the behavior we would actually want (it leaves a stale set
of endpoints behind forever, rather than either deleting the old
object or continuing to update both objects) so we would have to
completely rewrite those tests at that point anyway.

So simplify the unit tests by not testing that; just use
default/kubernetes in all the tests.
Sort the test case data so the "setup" fields appear before the
"expected result" fields, to make it easier to understand each case.

Add descriptions to the tests that didn't have any. Swap the names of
two test cases that did the opposite of what they claimed.

Improve the names of the Endpoints and EndpointSlice objects used by
the tests to make it clearer what some of them are, and so that the
two basic endpoint sets (endpoints1/epSlice1 and endpoints2/epSlice2)
are the same in both the Create test and the Update test.
They are both compile-time constants; previously the namespace was
hardcoded in multiple places and the name was passed from the
controller to the reconciler on every call. Simplify by just passing
them to the adapter at construct time instead.
The unit tests were checking that TargetRef got copied from the
Endpoints to the EndpointSlice, but neither of the reconcilers creates
Endpoints with TargetRefs set (and they couldn't even if they wanted
to, since the reconciler currently has no information about how/where
the apiserver is running.)
They were only used by the unit tests, but even there they are now
redundant with expectCreate/expectUpdate.

Also remove an incorrect usage of RetryOnConflict in one of the
reconcilers; we'll fix this generically later.
Make Endpoints updates work the same way as EndpointSlice updates do,
where we just check if the Endpoints provided by the reconciler is
(relevantly) different from the one on the server, and Update it if so
(or Create it if it doesn't already exist).

Merge the Create and Update unit tests together, and update for the
fact that Sync will do nothing on a no-op update, and won't try to
Create something that already exists or Update something that doesn't.
Rather than having EndpointsAdapter.Get return a v1.Endpoints object
to the reconciler, and having the reconciler modify that and pass it
back, and duplicating a bunch of code between the two reconcilers,
have the adapter just return the list of apiserver IPs instead, and
have the reconcilers work with just that, and move all the shared
Endpoints comparing/fixing/creating/updating code into the adapter.
When reconciling endpoints, correctly handle the cases where we race
with another apiserver on a Create, Update, or Delete.
Pass the service IP to the EndpointsAdapter at construct time so it
can reliably filter the endpoint IPs and set the slice AddressType
correctly.
Rather than generating the Endpoints from scratch and then generating
the EndpointSlice from the Endpoints, generate both of them from
scratch (preparing for the possibility of supporting
EndpointSlice-only features, such as dual-stack).
In particular, if the reconciler finds itself in a cluster that has a
dual-stack pair of EndpointSlices for default/kubernetes, then it
should delete the extra slice, since it is probably stale data left
behind by a rollback.
(This does not add user-facing dual-stack apiserver support; it just
updates the backend so that the rest of the feature can be added
later.)
@danwinship danwinship force-pushed the dual-stack-endpoint-reconciling branch from 9996147 to 02cd140 Compare October 17, 2023 15:55
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danwinship
Once this PR has been reviewed and has the lgtm label, please assign mikedanese for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

@danwinship: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-linter-hints 02cd140 link false /test pull-kubernetes-linter-hints

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

apiserver endpointslice reconciler should be smarter about what slices it owns
6 participants