
fix endpointslicemirroring controller not creating an endpointslice when an endpointslice is deleted after kube-controller-manager restarts #112197

Closed
wants to merge 2 commits

Conversation

Dingshujie
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes the endpointslicemirroring controller not creating an EndpointSlice when an EndpointSlice is deleted after kube-controller-manager has restarted.

Which issue(s) this PR fixes:

Fixes #112143

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

@Dingshujie: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2022
@k8s-ci-robot
Contributor

@Dingshujie: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2022
@Dingshujie
Member Author

@thockin @robscott @swetharepakula @aojea PTAL, thanks

@Dingshujie
Member Author

/assign @thockin @robscott @swetharepakula @aojea

@Dingshujie
Member Author

/test pull-kubernetes-conformance-kind-ga-only-parallel

return
}
// requeue the service for another sync, if
// 1. endpointSliceTracker don't has endpoint.
Member


Why shouldn't it be requeued if it is managed and the endpointSliceTracker has the EndpointSlice?

  1. I create an Endpoints object and it is mirrored; is it not added to the tracker here?
  2. I delete the EndpointSlice; it is in the tracker but we don't requeue, so it is not recreated.
    I think it should be recreated

Member Author

@Dingshujie Dingshujie Sep 2, 2022


Sorry, my mistake, I misspelled it: the endpointSliceTracker doesn't have this EndpointSlice, not endpoint.

  1. I create an Endpoints object and it is mirrored; is it not added to the tracker here?

When the EndpointSliceMirroring controller creates this EndpointSlice, it adds the EndpointSlice to the tracker, but if kube-controller-manager restarts, the slice won't be added back to the tracker.

  2. I delete the EndpointSlice; it is in the tracker but we don't requeue, so it is not recreated.
    I think it should be recreated

Under normal conditions (the mirrored EndpointSlice was created and kcm has not restarted), if I delete an EndpointSlice that exists in the tracker, onEndpointSliceDelete checks whether the tracker has this EndpointSlice and whether it is marked as expected to be deleted. So if the EndpointSlice I delete is in the tracker, the controller will requeue it.

But if kcm has restarted, the tracker doesn't have this EndpointSlice, so if I delete the EndpointSlice, it is not requeued.

// onEndpointSliceDelete queues the relevant Endpoints resource for another
// sync if the EndpointSlice resource version does not match the expected
// version in the endpointSliceTracker.
func (c *Controller) onEndpointSliceDelete(obj interface{}) {
	endpointSlice := getEndpointSliceFromDeleteAction(obj)
	if endpointSlice == nil {
		utilruntime.HandleError(fmt.Errorf("onEndpointSliceDelete() expected type discovery.EndpointSlice, got %T", obj))
		return
	}
	if managedByController(endpointSlice) && c.endpointSliceTracker.Has(endpointSlice) {
		// This returns false if we didn't expect the EndpointSlice to be
		// deleted. If that is the case, we queue the Service for another sync.
		if !c.endpointSliceTracker.HandleDeletion(endpointSlice) {
			c.queueEndpointsForEndpointSlice(endpointSlice)
		}
	}
}

So I changed the check: if the tracker doesn't have this EndpointSlice, requeue the Endpoints resource.

Member


Ah, no, I didn't mean the naming. I mean: what happens if the endpointSliceTracker has the EndpointSlice and it is deleted? Why don't we reconcile?

Isn't the problem that the slice is not in the tracker? Shouldn't we handle this case and add it to the tracker?

@robscott has a better understanding of the internals of this controller

Member Author


In my understanding, the tracker avoids some race conditions: if the controller creates an EndpointSlice but has not yet received the event from the apiserver, the lister cannot return this EndpointSlice, and reconciling at that point may do the wrong thing. So the controller calls the tracker's ExpectDeletion/Update functions when it calls the Create/Update/Delete APIs, recording the new generation in the tracker.

If we have stale information, this sync is skipped and retried later:

if c.endpointSliceTracker.StaleSlices(svc, endpointSlices) {
	return endpointsliceutil.NewStaleInformerCache("EndpointSlice informer cache is out of date")
}

@aojea
Member

aojea commented Sep 2, 2022

/assign @robscott

@aojea
Member

aojea commented Sep 2, 2022

I'd like to see an integration test verifying this scenario, can you please add one?

https://github.com/kubernetes/kubernetes/blob/master/test/integration/endpointslice/endpointslicemirroring_test.go

@Dingshujie
Member Author

I'd like to see an integration test verifying this scenario, can you please add one?

https://github.com/kubernetes/kubernetes/blob/master/test/integration/endpointslice/endpointslicemirroring_test.go

yeah, my pleasure.

…elete endpointslice after kube-controller-manager restarted

Signed-off-by: DingShujie <dingshujie@huawei.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Dingshujie
Once this PR has been reviewed and has the lgtm label, please assign mrhohn for approval by writing /assign @mrhohn in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Sep 2, 2022
@@ -528,6 +528,123 @@ func TestEndpointSliceMirroringSelectorTransition(t *testing.T) {
}
}

func TestEndpointSliceMirroringDeleteWhenEndpoointSliceMirroringControllerRestart(t *testing.T) {
Member


If I run this test without the other commit, it passes.

t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
}
// wait endpoint to be created
err = waitForMirroredSlices(t, client, ns.Name, service.Name, len(esList.Items))
Member


I think that you have to replace len(esList.Items) with 1 here

Member


ah, nevermind, this has to be > 0

Member


this is ok

Member


I think first we have to assert that the current slice has been deleted; otherwise the List can return the slice that is being deleted

diff --git a/test/integration/endpointslice/endpointslicemirroring_test.go b/test/integration/endpointslice/endpointslicemirroring_test.go
index 1959872d219..993a33a0db3 100644
--- a/test/integration/endpointslice/endpointslicemirroring_test.go
+++ b/test/integration/endpointslice/endpointslicemirroring_test.go
@@ -26,7 +26,9 @@ import (
        corev1 "k8s.io/api/core/v1"
        discovery "k8s.io/api/discovery/v1"
        apiequality "k8s.io/apimachinery/pkg/api/equality"
+       apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/informers"
        clientset "k8s.io/client-go/kubernetes"
@@ -638,8 +640,22 @@ func TestEndpointSliceMirroringDeleteWhenEndpoointSliceMirroringControllerRestar
        if err != nil {
                t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
        }
+       // wait endpoint slice to be deleted
+       err = wait.PollImmediate(1*time.Second, wait.ForeverTestTimeout, func() (bool, error) {
+               _, err := client.DiscoveryV1().EndpointSlices(ns.Name).Get(ctx, esList.Items[0].Name, metav1.GetOptions{})
+               if err != nil {
+                       if apierrors.IsNotFound(err) {
+                               return true, nil
+                       }
+                       return false, err
+               }
+               return false, nil
+       })
+       if err != nil {
+               t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
+       }
        // wait endpoint to be created

Member


this way we avoid the race

@Dingshujie
Member Author

It is weird, the test doesn't reproduce the behavior you are describing #112197 (comment): the slice is being recreated. Just run it without the fix; it passes

It needs to wait for the add event to finish processing; I will try to update the integration test to reproduce it

@aojea
Member

aojea commented Sep 2, 2022

It needs to wait for the add event to finish processing; I will try to update the integration test to reproduce it

With this patch #112197 (comment) it waits until it confirms the original has been deleted and asserts that it has been recreated... I think the controller does recreate it

@Dingshujie Dingshujie force-pushed the fix_endpointslice branch 2 times, most recently from 6460737 to c4c56a6 Compare September 2, 2022 08:37
@Dingshujie
Member Author

@aojea PTAL, I ran this test without these commits and it failed

@aojea
Member

aojea commented Sep 2, 2022

@aojea PTAL, I ran this test without these commits and it failed

the test is racy https://github.com/kubernetes/kubernetes/pull/112197/files#r961381285

Once you Delete the slice, it is not immediately removed, and the List can return the slice that is being deleted, giving a false positive. We have to assert that the slice is deleted and a new one is created; maybe you can store the UID from the old one and compare it to the new one?

…ntSliceMirroringControllerRestart

Signed-off-by: DingShujie <dingshujie@huawei.com>
@Dingshujie
Member Author

Once you Delete the slice, it is not immediately removed, and the List can return the slice that is being deleted, giving a false positive. We have to assert that the slice is deleted and a new one is created; maybe you can store the UID from the old one and compare it to the new one?

Good point, I changed it to compare UIDs. PTAL, thanks

@Dingshujie
Member Author

/test pull-kubernetes-e2e-kind

@Dingshujie
Member Author

@robscott @aojea @thockin could you please take a look at this PR?

@@ -228,6 +228,12 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
<-stopCh
}

// Queue returns the Controller's rate-limiting queue.
// Only for testing.
func (c *Controller) Queue() workqueue.RateLimitingInterface {
Member


Why do we want to export the queue and check the queue length in the tests if we can assert on the slices created?

Member


this comment still stands

@aojea
Member

aojea commented Sep 5, 2022

I will not have much time this week, but I still wonder if this scenario should be part of the reconcile loop:

func (r *reconciler) reconcile(endpoints *corev1.Endpoints, existingSlices []*discovery.EndpointSlice) error {
...
if endpoint matches endpointslices and no additional action is required
   add slice to the tracker
   return nil
...

but better wait for @robscott , he will return from vacation soon

@Dingshujie
Member Author

Dingshujie commented Sep 6, 2022

I will not have much time this week, but I still wonder if this scenario should be part of the reconcile loop:

func (r *reconciler) reconcile(endpoints *corev1.Endpoints, existingSlices []*discovery.EndpointSlice) error {
...
if endpoint matches endpointslices and no additional action is required
   add slice to the tracker
   return nil
...

but better wait for @robscott , he will return from vacation soon

OK, waiting for @robscott's reply

@Dingshujie
Member Author

@robscott do you have time to review this bugfix?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2023
@Dingshujie
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
