Informer Lifecycle Improvements #97214

kevindelgado · 2020-12-11T00:17:03Z

What type of PR is this?

/kind feature

What this PR does / why we need it:
Implementation of the Informer Lifecycle Management design doc.
Discussed at the Nov 4 sig-api-machinery meeting (video).

It does a few things that are broken up by commit:

1. Adds the `OnError` method to the `ResourceEventHandler` interface so that the handler is notified of ListAndWatch errors.
2. Adds the ability to define stop conditions when running individual informers via the new `RunWithStopOptions()` method on the `SharedInformer` interface. Running informers with stop options is also exposed on the various informer factories (currently only global stop options are supported, which apply to all informers of an informer factory).
3. Modifies the metadata informer factory and dynamic informer factory to start informers via `RunWithStopOptions()` and provides a way to retrieve the stop channel for a given informer to determine when an individual informer is stopped.
4. Refactors some CRD-specific integration testing helpers out of the GC integration test into a shared util package so that the RQ integration test can also use them without duplicating code.
5. Modifies the GC controller to recognize when a CRD informer has stopped and to remove the monitor for it.
6. Modifies the RQ controller to recognize when a CRD informer has stopped and to remove the monitor for it.
7. Modifies controller-manager to run the GC and RQ controllers with `RunWithStopOptions()` (meaning it always stops an informer upon a reflector ListAndWatch error).
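
To make the first two items concrete, here is a minimal, self-contained sketch of the idea. It does not use client-go: `EventHandler`, `StopOptions` (and its fields), `toyInformer`, and `demo` are hypothetical names invented for illustration, and the real `ResourceEventHandler` and `RunWithStopOptions()` signatures in this PR may differ. The sketch only shows a handler being told about a list/watch error and the informer stopping (or not) depending on the stop options.

```go
package main

import (
	"errors"
	"fmt"
)

// EventHandler is a hypothetical, trimmed-down stand-in for
// ResourceEventHandler with an OnError method added.
type EventHandler interface {
	OnAdd(obj string)
	OnError(err error)
}

// StopOptions is a hypothetical shape for the stop conditions; the PR
// description does not spell out the real fields.
type StopOptions struct {
	StopOnError  bool            // stop when list/watch reports an error
	ExternalStop <-chan struct{} // caller-driven stop, like Run(stopCh)
}

// result is one scripted outcome standing in for ListAndWatch activity.
type result struct {
	obj string
	err error
}

// toyInformer replays a fixed script instead of talking to an apiserver.
type toyInformer struct {
	handlers []EventHandler
	script   []result
}

// RunWithStopOptions delivers scripted results to the handlers. It reports
// true if it stopped because of an error and opts.StopOnError was set.
func (i *toyInformer) RunWithStopOptions(opts StopOptions) bool {
	for _, r := range i.script {
		select {
		case <-opts.ExternalStop: // a nil channel never fires, so default runs
			return false
		default:
		}
		if r.err != nil {
			for _, h := range i.handlers {
				h.OnError(r.err) // handlers are notified of the error
			}
			if opts.StopOnError {
				return true
			}
			continue // otherwise keep going, as an always-retrying informer would
		}
		for _, h := range i.handlers {
			h.OnAdd(r.obj)
		}
	}
	return false
}

// recorder is a handler that just remembers what it saw.
type recorder struct {
	added []string
	errs  []error
}

func (r *recorder) OnAdd(obj string)  { r.added = append(r.added, obj) }
func (r *recorder) OnError(err error) { r.errs = append(r.errs, err) }

// demo runs the same script (object, error, object) with or without
// stop-on-error and returns what the handler observed.
func demo(stopOnError bool) (*recorder, bool) {
	rec := &recorder{}
	inf := &toyInformer{
		handlers: []EventHandler{rec},
		script:   []result{{obj: "a"}, {err: errors.New("watch failed")}, {obj: "b"}},
	}
	stopped := inf.RunWithStopOptions(StopOptions{StopOnError: stopOnError})
	return rec, stopped
}

func main() {
	for _, stopOnError := range []bool{true, false} {
		rec, stopped := demo(stopOnError)
		fmt.Println("stopOnError:", stopOnError, "added:", rec.added, "errors:", len(rec.errs), "stopped:", stopped)
	}
}
```

With stop-on-error set, the handler sees the error and nothing after it; without it, the error is still reported but delivery continues.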

This was tested manually by modifying controller-runtime to use the new interfaces, and by running the resource quota controller and GC controller with the new interfaces to verify that the informer shuts down as expected.

Unit tests have been added to the relevant pieces of tools/cache and metadatainformer/dynamicinformer. Integration tests for both the garbage collector and quota controller verify that the install CRD/uninstall CRD/reinstall CRD flow continues to work while the informer gets stopped and restarted as expected.
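
The install/uninstall/reinstall flow can be pictured with a small stand-alone sketch. Everything here is hypothetical (`factory`, `Start`, `Fail`, `controller`, `Sync` are illustrative names, not the client-go or controller APIs): a factory hands out one stop channel per resource, and a controller drops the monitor for any resource whose channel has been closed, then recreates it when the resource is discovered again.

```go
package main

import "fmt"

// factory is a hypothetical stand-in for a dynamic or metadata informer
// factory that exposes one stop channel per resource.
type factory struct {
	stopped map[string]chan struct{}
}

func newFactory() *factory {
	return &factory{stopped: map[string]chan struct{}{}}
}

// Start begins (or restarts) the informer for a resource and returns a
// channel that is closed once that informer has stopped.
func (f *factory) Start(resource string) <-chan struct{} {
	ch := make(chan struct{})
	f.stopped[resource] = ch
	return ch
}

// Fail simulates a list/watch error that stops the informer, for example
// because the CRD backing the resource was uninstalled.
func (f *factory) Fail(resource string) {
	if ch, ok := f.stopped[resource]; ok {
		close(ch)
		delete(f.stopped, resource)
	}
}

// controller keeps one monitor per resource, loosely like the GC and
// quota controllers do.
type controller struct {
	f        *factory
	monitors map[string]<-chan struct{}
}

// Sync removes monitors whose informer has stopped, then adds monitors
// for any discovered resource that does not have one.
func (c *controller) Sync(discovered []string) {
	for res, done := range c.monitors {
		select {
		case <-done:
			delete(c.monitors, res) // informer stopped: drop its monitor
		default:
		}
	}
	for _, res := range discovered {
		if _, ok := c.monitors[res]; !ok {
			c.monitors[res] = c.f.Start(res)
		}
	}
}

func main() {
	c := &controller{f: newFactory(), monitors: map[string]<-chan struct{}{}}

	c.Sync([]string{"pods", "widgets"}) // CRD "widgets" installed
	fmt.Println("after install:", len(c.monitors))

	c.f.Fail("widgets") // CRD uninstalled, informer stops
	c.Sync([]string{"pods"})
	fmt.Println("after uninstall:", len(c.monitors))

	c.Sync([]string{"pods", "widgets"}) // CRD reinstalled
	fmt.Println("after reinstall:", len(c.monitors))
}
```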

Which issue(s) this PR fixes:

Fixes #79610
Unblocks work relating to kubernetes-sigs/controller-runtime#1192

based on #98657

k8s-ci-robot · 2020-12-11T00:17:14Z

Hi @kevindelgado. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kevindelgado · 2020-12-11T00:25:53Z

/assign @caesarxuchao
/assign @DirectXMan12

fedebongio · 2020-12-15T21:07:13Z

/triage accepted
/ok-to-test

fedebongio · 2020-12-15T21:07:23Z

/assign @yliaog

kevindelgado · 2021-04-29T15:56:34Z

/retest

kevindelgado · 2021-04-29T17:02:52Z

/retest

kevindelgado · 2021-04-29T19:18:07Z

/retest

caesarxuchao · 2021-05-06T23:46:13Z

/unassign

kevindelgado · 2021-05-17T15:54:18Z

friendly ping @yliaog @liggitt

yliaog · 2021-05-17T17:03:37Z

staging/src/k8s.io/client-go/tools/cache/shared_informer.go

+	// results in an appropriate error.
+	// Please note: If, for some reason, the same handler has been added multiple
+	// times, all registrations will be removed.
+	RemoveEventHandler(handler ResourceEventHandler) error


The interface has changed; please sync with the latest.

kevindelgado · 2021-05-17

Hmm, OK. It looks like it will be changing back to returning an error, so I will wait until #98657 is updated to sync with the latest.

k8s-ci-robot · 2021-05-18T16:16:19Z

@kevindelgado: PR needs rebase.


atoato88 · 2021-07-26T02:17:44Z

FYI
I confirmed in my local environment that this PR clears the controller-manager log errors described in #79610.
Thank you for creating this PR.
I'll wait for it to be merged to master.

kevindelgado · 2021-07-26T17:05:41Z

> FYI
> I confirmed in my local environment that this PR clears the controller-manager log errors described in #79610.
> Thank you for creating this PR.
> I'll wait for it to be merged to master.

Yeah, sorry for how slowly this is moving. This PR is blocked on #98657, which is still in review, and I have been trying to push it forward without much luck. Feel free to ping that PR, as I have been doing, to see if progress can be made there.

249043822 · 2021-08-27T02:25:56Z

staging/src/k8s.io/client-go/tools/cache/shared_informer_test.go

+		t.Errorf("informer reports not to be stopped although stop channel closed")
+		return
+	}
+	err := informer.RemoveEventHandler(listener)


RemoveEventHandler may access the handler object via fmt.Errorf, so this line should be wrapped in listener.lock; otherwise it may cause a data race.
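
A stand-alone sketch of the concern, under assumptions: `listener`, `describe`, `removeHandler`, and `run` are hypothetical and are not the test code or the client-go implementation. Note that this is a variant of the suggestion rather than the exact fix proposed above: instead of wrapping the call site in `listener.lock`, the fields are snapshotted under the lock before being formatted, so `fmt` never reads them while another goroutine is writing.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical test handler whose fields are mutated by
// delivery goroutines while another goroutine may want to format it.
type listener struct {
	lock  sync.Mutex
	name  string
	count int
}

// handle simulates an event delivery that mutates the listener.
func (l *listener) handle() {
	l.lock.Lock()
	defer l.lock.Unlock()
	l.count++
}

// describe snapshots the fields under the lock, so formatting never reads
// them concurrently with handle.
func (l *listener) describe() string {
	l.lock.Lock()
	defer l.lock.Unlock()
	return fmt.Sprintf("listener %q (count=%d)", l.name, l.count)
}

// removeHandler is a hypothetical stand-in for an error path that mentions
// the handler. Passing l itself to fmt with %v would read its fields via
// reflection without holding the lock.
func removeHandler(l *listener, registered bool) error {
	if !registered {
		return fmt.Errorf("handler %s is not registered", l.describe())
	}
	return nil
}

// run delivers n events concurrently while the error path formats the
// listener, then returns the final count and the error.
func run(n int) (int, error) {
	l := &listener{name: "test"}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.handle()
		}()
	}
	err := removeHandler(l, false) // formats while handlers may be running
	wg.Wait()

	l.lock.Lock()
	defer l.lock.Unlock()
	return l.count, err
}

func main() {
	count, err := run(50)
	fmt.Println(count, err)
}
```

Running a sketch like this under `go run -race` is one way to check that the formatting path and the delivery path do not touch the fields concurrently.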

k8s-triage-robot · 2021-11-25T02:48:03Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

atoato88 · 2021-11-29T21:41:02Z

/remove-lifecycle stale

dims · 2022-01-10T17:05:17Z

Is this PR still needed? If so, please rebase (or we can close it).

atoato88 · 2022-01-12T20:53:58Z

I think this PR is still needed because the error from #79610 still occurs on a v1.23.0 cluster created by kind.

@kevindelgado
Any comments? Could you rebase?

kevindelgado · 2022-01-13T00:01:10Z

This PR is blocked on #98657

The issues from #79610 are certainly still present and will need to be addressed, but this PR has been deprioritized given the current state of #98657

dims · 2022-02-11T14:44:48Z

This PR is older than 4 weeks and needs a rebase.

Please rebase the PR against latest master and reopen if still needed.

/close

k8s-ci-robot · 2022-02-11T14:50:12Z

@dims: Closed this PR.

In response to this:

> This PR is older than 4 weeks and needs a rebase.
>
> Please rebase the PR against latest master and reopen if still needed.
>
> /close


Pingan2017 · 2022-02-28T01:45:42Z

@kevindelgado
We encountered a bug after merging this PR.
When a deployment is deleted, the GC controller sets the deletionTimestamp on the rs and pod, and the kubelet then deletes the pod from etcd, but the GC controller does not delete the rs and deployment from etcd.
When we analyzed the problem, we found that the apiserver had restarted while the deployment was being deleted, and that the pod still existed in the GC controller cache even though it had been deleted from etcd. The GC controller missed the pod delete event.
By reading the code, we found that this bug is related to this PR.
Let's imagine a scenario:

1. The GC controller starts the informer and adds an event handler.
2. A deployment is created; the GC controller receives the deployment, rs, and pod events and adds them to its cache.
3. The deployment is deleted; the GC controller sets the deletionTimestamp on the rs and pod.
4. The apiserver restarts.
5. The GC controller syncs the deletableResources, but fetching some resources (e.g. pod) fails because the apiserver restarted:

kubernetes/pkg/controller/garbagecollector/garbagecollector.go

Lines 173 to 178 in 5b1e538

    func (gc *GarbageCollector) Sync(discoveryClient discovery.ServerResourcesInterface, period time.Duration, stopCh <-chan struct{}) {
    	oldResources := make(map[schema.GroupVersionResource]struct{})
    	wait.Until(func() {
    		// Get the current resource list from discovery.
    		newResources := GetDeletableResources(discoveryClient)

6. The GC controller removes the pod event handler from the informer.
7. The kubelet deletes the pod from etcd.
8. The informer receives the pod delete event, but there is no event handler for the GC controller, so the GC controller does not delete the pod from its cache and will never delete the rs and deployment.
9. The GC controller gets the pod resource again, but too late: the pod has already been deleted, so the GC controller will not receive the pod delete event.
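
The scenario above can be reproduced in miniature. This is only a toy model under assumptions: `informer`, `gcCache`, and `scenario` are hypothetical names, and the code does not use the real garbage collector or client-go. It shows how a handler that is removed and later re-added can end up with a cache entry for an object that no longer exists in the backing store.

```go
package main

import "fmt"

// informer is a toy stand-in: it fans events out to whichever handlers
// are registered at the moment the event is emitted.
type informer struct {
	handlers map[string]func(event, key string)
}

func (i *informer) emit(event, key string) {
	for _, h := range i.handlers {
		h(event, key)
	}
}

// gcCache mimics a controller-side view of objects that is updated only
// through delivered events.
type gcCache map[string]bool

func (c gcCache) handle(event, key string) {
	switch event {
	case "add":
		c[key] = true
	case "delete":
		delete(c, key)
	}
}

// scenario replays the sequence described in the comment and returns the
// final state of the backing store and of the controller cache.
func scenario() (etcd map[string]bool, cache gcCache) {
	etcd = map[string]bool{}
	cache = gcCache{}
	inf := &informer{handlers: map[string]func(event, key string){}}

	// The controller registers its handler and observes the pod.
	inf.handlers["gc"] = cache.handle
	etcd["pod"] = true
	inf.emit("add", "pod")

	// Discovery fails during the apiserver restart, so the controller
	// removes its handler for pods.
	delete(inf.handlers, "gc")

	// The pod is deleted while no controller handler is registered.
	delete(etcd, "pod")
	inf.emit("delete", "pod")

	// The handler is registered again, but the delete event is gone.
	inf.handlers["gc"] = cache.handle
	return etcd, cache
}

func main() {
	etcd, cache := scenario()
	fmt.Println("pod in etcd:", etcd["pod"], "pod in controller cache:", cache["pod"])
}
```

In this model the controller cache keeps the stale pod entry indefinitely, because nothing ever redelivers the missed delete event.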

oomichi · 2022-12-23T03:16:20Z

> This PR is blocked on #98657
>
> The issues from #79610 are certainly still present and will need to be addressed, but this PR has been deprioritized given the current state of #98657

Hi @kevindelgado

#111122 has been merged instead of #98657.
Can we move forward with this again?

k8s-ci-robot added area/dependency Issues or PRs related to dependency changes area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 11, 2020

k8s-ci-robot requested review from andrewsykim, caesarxuchao and a team December 11, 2020 00:17

kevindelgado changed the title ~~Exp/controller lifecycle mgmt~~ Informer Lifecycle Improvements Dec 11, 2020

k8s-ci-robot assigned caesarxuchao and DirectXMan12 Dec 11, 2020

k8s-ci-robot assigned yliaog Dec 15, 2020

kevindelgado added 4 commits April 29, 2021 04:54

Factor out common CRD integration testing utils

519ce13

Modify GC controller to stop informers on error

db7877c

Modify quota controller to stop informers on error

d96bd4b

Controller Manager runs stoppable informers

1f87ccb

kevindelgado force-pushed the exp/controller-lifecycle-mgmt branch from 82c0bbf to 1f87ccb Compare April 29, 2021 04:54

k8s-ci-robot unassigned caesarxuchao May 6, 2021

yliaog reviewed May 17, 2021


k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2021

249043822 reviewed Aug 27, 2021


k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2021

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2021

k8s-ci-robot closed this Feb 11, 2022

Pingan2017 mentioned this pull request Mar 2, 2022

Dynamic informers do not stop when custom resource definition is removed #79610

Open
