
fix endpointslicemirroring controller not creating an endpointslice when an endpointslice is deleted after kube-controller-manager restarts #112197

Closed
wants to merge 2 commits

Conversation

Dingshujie
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes the endpointslicemirroring controller not creating an EndpointSlice when an EndpointSlice is deleted after kube-controller-manager has restarted.

Which issue(s) this PR fixes:

Fixes #112143

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

@Dingshujie: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2022
@k8s-ci-robot
Contributor

@Dingshujie: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2022
@Dingshujie
Member Author

@thockin @robscott @swetharepakula @aojea PTAL, thanks

@Dingshujie
Member Author

/assign @thockin @robscott @swetharepakula @aojea

@Dingshujie
Member Author

/test pull-kubernetes-conformance-kind-ga-only-parallel

return
}
// requeue the service for another sync, if
// 1. endpointSliceTracker don't has endpoint.
Member


Why shouldn't it be requeued if it is managed and the endpointSliceTracker has the EndpointSlice?

  1. I create an Endpoints object and it is mirrored; is it not added to the tracker here?
  2. I delete the EndpointSlice; it is in the tracker but we don't requeue, so it is not recreated.
    I think it should be recreated

Member Author

@Dingshujie Dingshujie Sep 2, 2022


Sorry, my mistake, I misspelled it: the endpointSliceTracker doesn't have this EndpointSlice, not endpoint.

  1. I create an Endpoints object and it is mirrored; is it not added to the tracker here?

When the EndpointSliceMirroring controller creates this EndpointSlice, it adds the EndpointSlice to the tracker, but if kube-controller-manager restarts, the slice won't be added back to the tracker.

  2. I delete the EndpointSlice; it is in the tracker but we don't requeue, so it is not recreated.
    I think it should be recreated

Under normal conditions (the mirrored EndpointSlice was created and kcm has not restarted), if I delete an EndpointSlice that exists in the tracker, onEndpointSliceDelete checks whether the tracker has this EndpointSlice and whether it is marked as expected to be deleted. So if the EndpointSlice I delete is in the tracker, the controller will requeue it.

But if kcm has restarted, the tracker doesn't have this EndpointSlice, so if I delete the EndpointSlice, it is not requeued.

// onEndpointSliceDelete queues the relevant Endpoints resource for another
// sync if the EndpointSlice resource version does not match the expected
// version in the endpointSliceTracker.
func (c *Controller) onEndpointSliceDelete(obj interface{}) {
	endpointSlice := getEndpointSliceFromDeleteAction(obj)
	if endpointSlice == nil {
		utilruntime.HandleError(fmt.Errorf("onEndpointSliceDelete() expected type discovery.EndpointSlice, got %T", obj))
		return
	}
	if managedByController(endpointSlice) && c.endpointSliceTracker.Has(endpointSlice) {
		// This returns false if we didn't expect the EndpointSlice to be
		// deleted. If that is the case, we queue the Service for another sync.
		if !c.endpointSliceTracker.HandleDeletion(endpointSlice) {
			c.queueEndpointsForEndpointSlice(endpointSlice)
		}
	}
}

So I changed the check: if the tracker doesn't have this EndpointSlice, requeue the Endpoints resource.

Member


Ah, no, I didn't mean the naming. I mean: what happens if the endpointSliceTracker has the EndpointSlice and it is deleted? Why don't we reconcile?

Isn't the problem that the slice is not in the tracker? Shouldn't we handle this case and add it to the tracker?

@robscott has a better understanding of the internals of this controller

Member Author


In my understanding, the tracker avoids some race conditions: if the controller creates an EndpointSlice but has not yet received the event from the apiserver, the lister cannot return this EndpointSlice, and reconciling at that point may do the wrong thing. So the controller calls the tracker's ExpectDeletion/Update functions when it calls the Create/Update/Delete APIs, recording the new generation in the tracker.

If we have stale information, this sync is skipped and retried later:

if c.endpointSliceTracker.StaleSlices(svc, endpointSlices) {
	return endpointsliceutil.NewStaleInformerCache("EndpointSlice informer cache is out of date")
}

@aojea
Member

aojea commented Sep 2, 2022

/assign @robscott

@aojea
Member

aojea commented Sep 2, 2022

I'd like to see an integration test verifying this scenario, can you please add one?

https://github.com/kubernetes/kubernetes/blob/master/test/integration/endpointslice/endpointslicemirroring_test.go

@Dingshujie
Member Author

I'd like to see an integration test verifying this scenario, can you please add one?

https://github.com/kubernetes/kubernetes/blob/master/test/integration/endpointslice/endpointslicemirroring_test.go

yeah, my pleasure.

…elete endpointslice after kube-controller-manager restarted

Signed-off-by: DingShujie <dingshujie@huawei.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Dingshujie
Once this PR has been reviewed and has the lgtm label, please assign mrhohn for approval by writing /assign @mrhohn in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Sep 2, 2022
@@ -528,6 +528,123 @@ func TestEndpointSliceMirroringSelectorTransition(t *testing.T) {
}
}

func TestEndpointSliceMirroringDeleteWhenEndpoointSliceMirroringControllerRestart(t *testing.T) {
Member


If I run this test without the other commit, it passes.

t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
}
// wait endpoint to be created
err = waitForMirroredSlices(t, client, ns.Name, service.Name, len(esList.Items))
Member


I think that you have to replace len(esList.Items) with 1 here

Member


ah, nevermind, this has to be > 0

Member


this is ok

Member


I think first we have to assert that the current slice has been deleted; otherwise the List can return the slice that is being deleted

diff --git a/test/integration/endpointslice/endpointslicemirroring_test.go b/test/integration/endpointslice/endpointslicemirroring_test.go
index 1959872d219..993a33a0db3 100644
--- a/test/integration/endpointslice/endpointslicemirroring_test.go
+++ b/test/integration/endpointslice/endpointslicemirroring_test.go
@@ -26,7 +26,9 @@ import (
        corev1 "k8s.io/api/core/v1"
        discovery "k8s.io/api/discovery/v1"
        apiequality "k8s.io/apimachinery/pkg/api/equality"
+       apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/informers"
        clientset "k8s.io/client-go/kubernetes"
@@ -638,8 +640,22 @@ func TestEndpointSliceMirroringDeleteWhenEndpoointSliceMirroringControllerRestar
        if err != nil {
                t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
        }
+       // wait endpoint slice to be deleted
+       err = wait.PollImmediate(1*time.Second, wait.ForeverTestTimeout, func() (bool, error) {
+               _, err := client.DiscoveryV1().EndpointSlices(ns.Name).Get(ctx, esList.Items[0].Name, metav1.GetOptions{})
+               if err != nil {
+                       if apierrors.IsNotFound(err) {
+                               return true, nil
+                       }
+                       return false, err
+               }
+               return false, nil
+       })
+       if err != nil {
+               t.Fatalf("Error deleting EndpointSlices(%s/%s): %v", ns.Name, esList.Items[0].Name, err)
+       }
        // wait endpoint to be created

Member


this way we avoid the race

@Dingshujie
Member Author

It is weird, the test doesn't reproduce the behavior you are describing #112197 (comment): the slice is being recreated. Just run it without the fix; it passes

It needs to wait for the add event to finish processing; I will try to update the integration test to reproduce it

@aojea
Member

aojea commented Sep 2, 2022

It needs to wait for the add event to finish processing; I will try to update the integration test to reproduce it

With this patch #112197 (comment) it waits until it confirms the original has been deleted and asserts that it has been recreated... I think the controller does recreate it

@Dingshujie Dingshujie force-pushed the fix_endpointslice branch 2 times, most recently from 6460737 to c4c56a6 Compare September 2, 2022 08:37
@Dingshujie
Member Author

@aojea PTAL, I ran this test without these commits and it failed

@aojea
Member

aojea commented Sep 2, 2022

@aojea PTAL, I ran this test without these commits and it failed

the test is racy https://github.com/kubernetes/kubernetes/pull/112197/files#r961381285

Once you Delete the slice, it is not immediately removed, and the List can return the slice that is being deleted, giving a false positive. We have to assert that the slice is deleted and a new one is created; maybe you can store the UID from the old one and compare it to the new one?

…ntSliceMirroringControllerRestart

Signed-off-by: DingShujie <dingshujie@huawei.com>
@Dingshujie
Member Author

Once you Delete the slice, it is not immediately removed, and the List can return the slice that is being deleted, giving a false positive. We have to assert that the slice is deleted and a new one is created; maybe you can store the UID from the old one and compare it to the new one?

Good point, I changed it to compare UIDs. PTAL, thanks

@Dingshujie
Member Author

/test pull-kubernetes-e2e-kind

@Dingshujie
Member Author

@robscott @aojea @thockin could you please take a look at this PR?

@@ -228,6 +228,12 @@ func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
<-stopCh
}

// Queue returns the Controller's rate-limiting queue.
// Only for testing.
func (c *Controller) Queue() workqueue.RateLimitingInterface {
Member


Why do we want to export the queue and check the queue length in the tests if we can assert on the slices created?

Member


this comment still stands

@aojea
Member

aojea commented Sep 5, 2022

I will not have much time this week, but I still wonder if this scenario should be part of the reconcile loop:

func (r *reconciler) reconcile(endpoints *corev1.Endpoints, existingSlices []*discovery.EndpointSlice) error {
...
if endpoint matches endpointslices and no additional action is required
   add slice to the tracker
   return nil
...

but better wait for @robscott , he will return from vacation soon

@Dingshujie
Member Author

Dingshujie commented Sep 6, 2022

I will not have much time this week, but I still wonder if this scenario should be part of the reconcile loop:

func (r *reconciler) reconcile(endpoints *corev1.Endpoints, existingSlices []*discovery.EndpointSlice) error {
...
if endpoint matches endpointslices and no additional action is required
   add slice to the tracker
   return nil
...

but better wait for @robscott , he will return from vacation soon

OK, waiting for @robscott's reply

@Dingshujie
Member Author

@robscott do you have time to review this bugfix?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2023
@Dingshujie
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
