Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Informer Lifecycle Improvements #97214

Conversation

kevindelgado
Copy link
Contributor

@kevindelgado kevindelgado commented Dec 11, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:
Implementation of Informer Lifecycle Managemt design doc
Discussed at the Nov 4 sig-api-machinery meeting:video

It does a few things that are broken up by commit:

  1. Adds the OnError method to the ResourceEventHandler interface so that the handler is notified of ListAndWatch errors.
  2. Adds the ability to define stop conditions when running individual informers via the new RunWithStopOptions() method on the SharedInformer interface. It exposes running informers with stop options on the various informer factories (currently only supports global stop options that apply to all informers of an informer factory)
  3. Modifies the metadata informer factory and dynamic informer factory to start informers via RunWithStopOptions() and provides a way to retrieve the stop channel for given informer to determine when an individual informer is stopped.
  4. Refactors some CRD specific integration testing helpers out of the GC integration test into a shared util package so that RQ integration test can also use them without duplicating code.
  5. Modifies GC controller to recognize when a CRD informer has stopped and removes the monitor for it.
  6. Modifies RQ controller to recognize when a CRD informer has stopped and removes the monitor for it.
  7. Modifies controller-manager to run the GC and RQ controllers with RunWithStopOptions (meaning it always stops an informer upon reflector ListAndWatch error).

This is tested manually by modifying controller-runtime to use the new interfaces. Also by running the resource quota controller and GC controller with the new interfaces to verify that the informer shuts down as expected.

Unit testing has been added to the relevant pieces of tools/cache and metadatainformer/dynamicinformer. Integration tests for both the garbage collector and quota controller test that the install CRD/uninstall CRD/reinstall CRD flow continues to work while the informer gets stopped and restarted as expected.

Which issue(s) this PR fixes:

Fixes #79610
Unblocks work relating to kubernetes-sigs/controller-runtime#1192

1. Adds the OnError method to the ResourceEventHandler interface so that the handler is notified of ListAndWatch errors. 
2. Adds the ability to define stop conditions when running individual informers via the new `RunWithStopOptions()` method on the `SharedInformer` interface. It exposes running informers with stop options on the various informer factories (currently only supports global stop options that apply to all informers of an informer factory)
3. Modifies the metadata informer factory and dynamic informer factory to start informers via `RunWithStopOptions()` and provides a way to retrieve the stop channel for given informer to determine when an individual informer is stopped. 
4. Refactors some CRD specific integration testing helpers out of the GC integration test into a shared util package so that RQ integration test can also use them without duplicating code.
5. Modifies GC controller to recognize when a CRD informer has stopped and removes the monitor for it.
6. Modifies RQ controller to recognize when a CRD informer has stopped and removes the monitor for it.
7. Modifies controller-manager to run the GC and RQ controllers with RunWithStopOptions (meaning it always stops an informer upon reflector ListAndWatch error).

based on #98657

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 11, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @kevindelgado. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/dependency Issues or PRs related to dependency changes area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 11, 2020
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 11, 2020
@kevindelgado kevindelgado changed the title Exp/controller lifecycle mgmt Informer Lifecycle Improvements Dec 11, 2020
@kevindelgado
Copy link
Contributor Author

/assign @caesarxuchao
/assign @DirectXMan12

@fedebongio
Copy link
Contributor

/triage accepted
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2020
@fedebongio
Copy link
Contributor

/assign @yliaog

@kevindelgado
Copy link
Contributor Author

/retest

2 similar comments
@kevindelgado
Copy link
Contributor Author

/retest

@kevindelgado
Copy link
Contributor Author

/retest

@caesarxuchao
Copy link
Member

/unassign

@kevindelgado
Copy link
Contributor Author

friendly ping @yliaog @liggitt

// results in an appropriate error.
// Please note: If, for some reason, the same handler has been added multiple
// times, all registrations will be removed.
RemoveEventHandler(handler ResourceEventHandler) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the interface has changed, please sync with the latest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok, it looks like it will be changing back to returning an error, I will wait until #98657 is updated to sync with latest

@k8s-ci-robot
Copy link
Contributor

@kevindelgado: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2021
@atoato88
Copy link
Contributor

FYI
I confirmed that this PR clears off controller manager logs said in #79610 on my local env.
Thank you to create this PR.
I'll wait for merging to master.

@kevindelgado
Copy link
Contributor Author

FYI
I confirmed that this PR clears off controller manager logs said in #79610 on my local env.
Thank you to create this PR.
I'll wait for merging to master.

Yea sorry for how slow this is moving. This PR is blocked on #98657 which is still in review and I have been trying to push it forward without much luck. Feel free to ping that PR like I have been doing to try to see if progress can be made there.

t.Errorf("informer reports not to be stopped although stop channel closed")
return
}
err := informer.RemoveEventHandler(listener)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RemoveEventHandler may access handler obj by fmt.Errorf, so this line should be wrapped by listener.lock. or may casuse DATA RACE

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2021
@atoato88
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2021
@dims
Copy link
Member

dims commented Jan 10, 2022

Is this PR still needed, please rebase if so (or we can close it?)

@atoato88
Copy link
Contributor

I think this PR is still needed because #79610 error occurs on v1.23.0 cluster created by kind.

@kevindelgado
Any comments and could you rebase?

@kevindelgado
Copy link
Contributor Author

This PR is blocked on #98657

The issues from #79610 are certainly still present and will need to be addressed, but this PR has been deprioritized given the current state of #98657

@dims
Copy link
Member

dims commented Feb 11, 2022

This PR is older than 4 weeks and needs as rebase.

Please rebase the PR against latest master and reopen if still needed.

/close

@k8s-ci-robot
Copy link
Contributor

@dims: Closed this PR.

In response to this:

This PR is older than 4 weeks and needs as rebase.

Please rebase the PR against latest master and reopen if still needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Pingan2017
Copy link
Member

Pingan2017 commented Feb 28, 2022

@kevindelgado
we encountered a bug when we merge this PR .
when delete a deployment, the gc controller set the deletionTimestamp for the rs and pod, then pod been deleted from etcd by kubelet, but the gc controller not delete rs and deploy form etcd.
When we analyzed the problem, we found apiserver restarted when delete deployment and the pod still exist in gc controller cache even pod deleted from etcd. the gc controller miss the pod deleted event .
By reading the code, we found that this bug is related to this PR.
Let's imagine a scenario:

  1. gc controller start informer and add event handle
  2. deploy a deployment , gc controller will received deployment, rs , pod event , and add them to cache.
  3. delete this deployment, gc controler set the deletionTimestamp for rs , pod
  4. apiserver restart
  5. gc controller sync the deletableResources , but get some resource(eg. pod) failed,because the apiserver restarted
    func (gc *GarbageCollector) Sync(discoveryClient discovery.ServerResourcesInterface, period time.Duration, stopCh <-chan struct{}) {
    oldResources := make(map[schema.GroupVersionResource]struct{})
    wait.Until(func() {
    // Get the current resource list from discovery.
    newResources := GetDeletableResources(discoveryClient)
  6. gc controller will remove the pod evnet handler from informer
  7. kubelet delete pod from etcd
  8. informer received the pod delete evnet ,but no event handler for gc controller, so gc controller not delete pod form cache, then , gc controller won't delete rs and deployment forever
  9. gc controller get the pod resource again, but too late, pod has been deleted, gc won't received the pod deleted event

@oomichi
Copy link
Member

oomichi commented Dec 23, 2022

This PR is blocked on #98657

The issues from #79610 are certainly still present and will need to be addressed, but this PR has been deprioritized given the current state of #98657

Hi @kevindelgado

#111122 has been merged instead of #98657
Can we move forward for this again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver area/cloudprovider area/code-generation area/dependency Issues or PRs related to dependency changes area/kubectl area/provider/gcp Issues or PRs related to gcp provider area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
SIG Node PR Triage
Needs Reviewer
Development

Successfully merging this pull request may close these issues.

Dynamic informers do not stop when custom resource definition is removed