
Creating a CRD with broken converter webhook prevents GC controller from initialization #101078

Open
jprzychodzen opened this issue Apr 13, 2021 · 22 comments · May be fixed by #120164
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jprzychodzen
Contributor

What happened:

Creating a CRD with a broken converter webhook prevents the GC controller from initializing, because initialization blocks on informer sync. Additionally, the issue is not visible until the GC controller restarts: CRD resources with a non-working converter webhook that are added dynamically do not break an already-running GC.

What you expected to happen:

The GC controller should initialize with whatever informers are available. CRDs with a broken converter webhook should not prevent the GC controller from working on other resources.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Create a CRD (1_crd.yaml)
  3. Create a CR (2_crd.yaml)
  4. Change the CRD to add another version and a webhook converter (3_crd.yaml; a sketch of such a change follows below)
  5. Restart kube-controller-manager
  6. Add a deployment
  7. Delete the deployment
  8. Check that the pods from the deployment are not GC'ed

gc-bug.zip
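For illustration only, the kind of change step 4 makes looks roughly like the following (a hedged sketch, not the actual 3_crd.yaml from the attachment; the group, kind, and service names are hypothetical). Adding a second version plus a Webhook conversion strategy whose clientConfig points at a missing or unhealthy service is what can make reads of this resource fail whenever conversion is required:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: examples.example.com        # hypothetical
spec:
  group: example.com
  names:
    kind: Example
    plural: examples
    singular: example
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v2                        # newly added version
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: missing-converter   # points at a Service that is down or missing
          namespace: default
          path: /convert
```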

@jprzychodzen jprzychodzen added the kind/bug Categorizes issue or PR as related to a bug. label Apr 13, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 13, 2021
@neolit123
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 13, 2021
@neolit123
Member

please follow the issue template correctly:
https://github.com/kubernetes/kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug-report.md

Including the k8s version is important.

@jprzychodzen
Contributor Author

jprzychodzen commented Apr 14, 2021

Sure, it happens on the current K8s master branch; the exact commit is b0abe89ae259d5e891887414cb0e5f81c969c697

  • Kubernetes version (use kubectl version):
    kubectl version
    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", 
    GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-alpha.0.30+b0abe89ae259d5-dirty", GitCommit:"b0abe89ae259d5e891887414cb0e5f81c969c697", GitTreeState:"dirty", 
    BuildDate:"2021-04-13T16:11:56Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    
  • Cloud provider or hardware configuration:
    K8s cluster running on GCE started with kubetest. 10 nodes with --gcp-node-size=n1-standard-1 and with preset-e2e-scalability-common env variables
  • OS (e.g: cat /etc/os-release):
    cat /etc/os-release
    NAME="Container-Optimized OS"
    ID=cos
    PRETTY_NAME="Container-Optimized OS from Google"
    HOME_URL="https://cloud.google.com/container-optimized-os/docs"
    BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
    GOOGLE_METRICS_PRODUCT_ID=26
    GOOGLE_CRASH_ID=Lakitu
    KERNEL_COMMIT_ID=9ca830b4d7ae9ff76f64f4f9f78a0a0b88dfcda4
    VERSION=85
    VERSION_ID=85
    BUILD_ID=13310.1041.9
    
  • Kernel (e.g. uname -a):
    Linux e2e-test-jprzychodzen-master 5.4.49+ #1 SMP Wed Sep 23 19:45:38 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.00GHz GenuineIntel GNU/Linux
    
  • Install tools:
    kubetest

@fedebongio
Contributor

/assign @yliaog
/cc @caesarxuchao @leilajal
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2021
@yliaog
Contributor

yliaog commented Apr 15, 2021

I think this is the same issue as reported in #90597

@jprzychodzen
Contributor Author

It might share the root cause - the GC's informer sync should not block on CRDs (and possibly on other resources?)

I guess we would need some metrics about unsynced informers to handle this properly.
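For context, here is a minimal sketch of the generic client-go startup pattern that produces this behavior (an illustration under assumptions, not the actual garbage collector code; waitForAllSynced and the stand-in sync functions are made up for this example). If any single informer never reports HasSynced, for instance because LISTs of a CR keep failing while its conversion webhook is down, cache.WaitForCacheSync never returns true and the controller never gets past startup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForAllSynced mirrors the common controller-startup pattern: block until
// every informer reports HasSynced. One informer that can never sync keeps the
// whole controller from starting.
func waitForAllSynced(ctx context.Context, synced ...cache.InformerSynced) error {
	if !cache.WaitForCacheSync(ctx.Done(), synced...) {
		return fmt.Errorf("timed out waiting for caches to sync")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	healthy := func() bool { return true }      // an informer that synced normally
	neverSynced := func() bool { return false } // stand-in for a CR informer behind a broken webhook

	if err := waitForAllSynced(ctx, healthy, neverSynced); err != nil {
		fmt.Println("controller cannot start:", err)
	}
}
```

Metrics on which informers are still unsynced would make this state much easier to spot.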

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2021
@spencer-p
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 5, 2021
@aojea
Member

aojea commented Sep 6, 2021

This seems to be this way by design; see this duplicate: #96066 (comment)

@willdeuschle

This is problematic for our environments as well. Unstable user-defined conversion webhooks break GC for unrelated resources; those unrelated resources then accumulate, eventually hit quota limits, and render the environment unusable.

Is there a recommended approach to this from the community? One naive solution that comes to mind is a config option for marking a CRD as non-blocking for GC. Then GC would only respect blockOwnerDeletion in a best effort fashion, for example. Admission webhooks could then block CRD creation that specified a conversion webhook without making the resource non-blocking.

Without this, it's hard to allow users to specify conversion webhooks, because k8s then takes a dependency on those services (which, in our case, already take a dependency on k8s).
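For reference, blockOwnerDeletion (mentioned above) is a per-ownerReference field on the dependent object; a hedged illustration with hypothetical names and UID:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                             # hypothetical
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-rs                            # hypothetical owner
    uid: d9607e19-f88f-11e6-a518-42010a800195   # hypothetical UID
    controller: true
    blockOwnerDeletion: true                    # foreground deletion of the owner waits on this dependent
```

Relaxing how strictly GC honors this field is part of what the later comments propose.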

@liggitt
Member

liggitt commented Nov 22, 2021

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

@deads2k
Contributor

deads2k commented Nov 22, 2021

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

I'd like to stop honoring blockOwnerDeletion. :)

@tkashem
Contributor

tkashem commented Feb 16, 2022

cc @tkashem

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 16, 2022
@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 24, 2022
@haorenfsa

Is there any ongoing work for this?

@liggitt
Member

liggitt commented Jun 25, 2022

Is there any ongoing work for this?

None that I know of. At first glance, removing the requirement that all informers be fully synced before GC starts/resumes seems reasonable to me, and would resolve this issue.

@haorenfsa

OK, I'll try to work out a patch

@tossmilestone
Member

tossmilestone commented Jun 27, 2022

I am now working on this, will fix it soon.

@rauferna

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

@tossmilestone
Member

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

Sorry, I don't have the time right now to continue fixing this issue. If you're willing, you can help continue this work. Thank you!

@haorenfsa

@rauferna not likely in the short term. A quick workaround is to delete the converter webhook when you find your CRD controller is not working, and add it back once it recovers (one way to do that is sketched below). Or deploy multiple replicas of the webhook to minimize the downtime.
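If it helps, one possible form of that workaround (a hedged sketch; the CRD name is hypothetical, and switching the strategy to None makes conversion between versions a no-op, so check that this is acceptable for your schemas before applying it):

```sh
# Stop the apiserver from calling the unreachable conversion webhook.
# --type=merge uses JSON merge patch, so "webhook": null removes the stale
# webhook clientConfig along with switching the strategy.
kubectl patch crd examples.example.com --type=merge \
  -p '{"spec":{"conversion":{"strategy":"None","webhook":null}}}'
```

Re-applying the original conversion stanza restores the webhook once its backend is healthy again.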
