
Creating a CRD with broken converter webhook prevents GC controller from initialization #101078

Open
jprzychodzen opened this issue Apr 13, 2021 · 22 comments · May be fixed by #120164
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jprzychodzen
Contributor

What happened:

Creating a CRD with a broken converter webhook prevents the GC controller from initializing, because initialization blocks on informer sync. Additionally, the issue is not visible until the GC controller restarts: CRD resources with a non-working converter webhook that are added dynamically do not break an already-running GC.

What you expected to happen:

The GC controller should initialize with whatever informers are available. CRDs with a broken converter webhook should not prevent the GC controller from working on other resources.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Create a CRD (1_crd.yaml)
  3. Create a CR (2_crd.yaml)
  4. Change the CRD to add another version and a webhook converter (3_crd.yaml; a sketch of such a change follows below)
  5. Restart kube-controller-manager
  6. Add a deployment
  7. Delete the deployment
  8. Check that the pods from the deployment are not GC'ed

gc-bug.zip
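For illustration only, the kind of change step 4 makes looks roughly like the following (a hedged sketch, not the actual 3_crd.yaml from the attachment; the group, kind, and service names are hypothetical). Adding a second version plus a Webhook conversion strategy whose clientConfig points at a missing or unhealthy service is what can make reads of this resource fail whenever conversion is required:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: examples.example.com        # hypothetical
spec:
  group: example.com
  names:
    kind: Example
    plural: examples
    singular: example
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v2                        # newly added version
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: missing-converter   # points at a Service that is down or missing
          namespace: default
          path: /convert
```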

@jprzychodzen jprzychodzen added the kind/bug Categorizes issue or PR as related to a bug. label Apr 13, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 13, 2021
@neolit123
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 13, 2021
@neolit123
Member

please follow the issue template correctly:
https://github.com/kubernetes/kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug-report.md

Including the k8s version is important.

@jprzychodzen
Contributor Author

jprzychodzen commented Apr 14, 2021

Sure, it happens on the current K8s master branch; the exact commit is b0abe89ae259d5e891887414cb0e5f81c969c697

  • Kubernetes version (use kubectl version):
    kubectl version
    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", 
    GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-alpha.0.30+b0abe89ae259d5-dirty", GitCommit:"b0abe89ae259d5e891887414cb0e5f81c969c697", GitTreeState:"dirty", 
    BuildDate:"2021-04-13T16:11:56Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    
  • Cloud provider or hardware configuration:
    K8s cluster running on GCE started with kubetest. 10 nodes with --gcp-node-size=n1-standard-1 and with preset-e2e-scalability-common env variables
  • OS (e.g: cat /etc/os-release):
    cat /etc/os-release
    NAME="Container-Optimized OS"
    ID=cos
    PRETTY_NAME="Container-Optimized OS from Google"
    HOME_URL="https://cloud.google.com/container-optimized-os/docs"
    BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
    GOOGLE_METRICS_PRODUCT_ID=26
    GOOGLE_CRASH_ID=Lakitu
    KERNEL_COMMIT_ID=9ca830b4d7ae9ff76f64f4f9f78a0a0b88dfcda4
    VERSION=85
    VERSION_ID=85
    BUILD_ID=13310.1041.9
    
  • Kernel (e.g. uname -a):
    Linux e2e-test-jprzychodzen-master 5.4.49+ #1 SMP Wed Sep 23 19:45:38 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.00GHz GenuineIntel GNU/Linux
    
  • Install tools:
    kubetest

@fedebongio
Contributor

/assign @yliaog
/cc @caesarxuchao @leilajal
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2021
@yliaog
Contributor

yliaog commented Apr 15, 2021

I think this is the same issue as reported in #90597

@jprzychodzen
Contributor Author

It might share the root cause - the GC's informer sync should not block on CRDs (and possibly on other resources?)

I guess we would need some metrics about unsynced informers to handle this properly.
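For context, here is a minimal sketch of the generic client-go startup pattern that produces this behavior (an illustration under assumptions, not the actual garbage collector code; waitForAllSynced and the stand-in sync functions are made up for this example). If any single informer never reports HasSynced, for instance because LISTs of a CR keep failing while its conversion webhook is down, cache.WaitForCacheSync never returns true and the controller never gets past startup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForAllSynced mirrors the common controller-startup pattern: block until
// every informer reports HasSynced. One informer that can never sync keeps the
// whole controller from starting.
func waitForAllSynced(ctx context.Context, synced ...cache.InformerSynced) error {
	if !cache.WaitForCacheSync(ctx.Done(), synced...) {
		return fmt.Errorf("timed out waiting for caches to sync")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	healthy := func() bool { return true }      // an informer that synced normally
	neverSynced := func() bool { return false } // stand-in for a CR informer behind a broken webhook

	if err := waitForAllSynced(ctx, healthy, neverSynced); err != nil {
		fmt.Println("controller cannot start:", err)
	}
}
```

Metrics on which informers are still unsynced would make this state much easier to spot.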

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@spencer-p
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2021
@aojea
Member

aojea commented Sep 6, 2021

This seems to be this way by design; see this duplicate: #96066 (comment)

@willdeuschle

This is problematic for our environments as well. Unstable user-defined conversion webhooks break GC for unrelated resources; those unrelated resources then accumulate, eventually hit quota limits, and render the environment unusable.

Is there a recommended approach to this from the community? One naive solution that comes to mind is a config option for marking a CRD as non-blocking for GC. Then GC would only respect blockOwnerDeletion in a best effort fashion, for example. Admission webhooks could then block CRD creation that specified a conversion webhook without making the resource non-blocking.

Without this, it's hard to allow users to specify conversion webhooks, because k8s then takes a dependency on those services (which, in our case, already take a dependency on k8s).
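For reference, blockOwnerDeletion (mentioned above) is a per-ownerReference field on the dependent object; a hedged illustration with hypothetical names and UID:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                             # hypothetical
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-rs                            # hypothetical owner
    uid: d9607e19-f88f-11e6-a518-42010a800195   # hypothetical UID
    controller: true
    blockOwnerDeletion: true                    # foreground deletion of the owner waits on this dependent
```

Relaxing how strictly GC honors this field is part of what the later comments propose.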

@liggitt
Member

liggitt commented Nov 22, 2021

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

@deads2k
Contributor

deads2k commented Nov 22, 2021

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

I'd like to stop honoring blockOwnerDeletion. :)

@tkashem
Contributor

tkashem commented Feb 16, 2022

cc @tkashem

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 16, 2022
@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 24, 2022
@haorenfsa

Is there any ongoing work for this?

@liggitt
Member

liggitt commented Jun 25, 2022

Is there any ongoing work for this?

None that I know of. At first glance, removing the requirement that all informers be fully synced before GC starts/resumes seems reasonable to me, and would resolve this issue.

@haorenfsa

OK, I'll try to work out a patch

@tossmilestone
Member

tossmilestone commented Jun 27, 2022

I am now working on this, will fix it soon.

@rauferna

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

@tossmilestone
Member

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

Sorry, I don't have the time right now to continue fixing this issue. If you're willing, you can help continue this work. Thank you!

@haorenfsa

@rauferna not likely in the short term. A quick workaround is to delete the converter webhook when you find your CRD controller is not working, and add it back once it recovers (one way to do that is sketched below). Or deploy multiple replicas of the webhook to minimize the downtime.
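If it helps, one possible form of that workaround (a hedged sketch; the CRD name is hypothetical, and switching the strategy to None makes conversion between versions a no-op, so check that this is acceptable for your schemas before applying it):

```sh
# Stop the apiserver from calling the unreachable conversion webhook.
# --type=merge uses JSON merge patch, so "webhook": null removes the stale
# webhook clientConfig along with switching the strategy.
kubectl patch crd examples.example.com --type=merge \
  -p '{"spec":{"conversion":{"strategy":"None","webhook":null}}}'
```

Re-applying the original conversion stanza restores the webhook once its backend is healthy again.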
