🐛 Revert: move health probes to runnable #2321

sbueringer · 2023-05-11T14:52:23Z

This PR reverts part of: #2275

On main we have the following sequence in controllerManager.Start():

...
add health probe server
...
Start webhook runnables
Start cache runnables
Start other runnables (including health probe server)

Usually this works except under the following circumstances (concrete example):

The controller implements a reconciler for a CRD called MachineDeployment
MachineDeployment has multiple apiVersions and thus the controller also implements a conversion webhook
There are MachineDeployments in the older apiVersion stored in etcd

Now we have the following problem:

Start cache runnables fails because caches are not getting ready
Caches are not getting ready because MachineDeployment list does not work
MachineDeployment list does not work because the conversion webhook does not work (Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.142.135.97:443: connect: connection refused)
Conversion webhook does not work because the service/pod does not get ready
Pod does not get ready because health probes are not started
Health probes are not started because cache does not get ready

tl;dr: there is a cyclic dependency between starting the caches and starting the other runnables (which include the health probe server).
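To make the cycle explicit, here is a minimal sketch in Go that writes the chain above as "is blocked on" edges and walks them until a step repeats. The step labels and the `findCycle` helper are illustrative assumptions for this comment; they are not controller-runtime identifiers.

```go
package main

import "fmt"

// waitsFor models the "X is blocked on Y" chain described above.
// The labels are illustrative only, not controller-runtime names.
var waitsFor = map[string]string{
	"cache sync":             "MachineDeployment list",
	"MachineDeployment list": "conversion webhook",
	"conversion webhook":     "pod readiness",
	"pod readiness":          "health probes",
	"health probes":          "cache sync",
}

// findCycle follows the edges from start. It returns the visited path and
// true if it arrives at a step it has already seen, or the path and false
// if the chain ends at a step that is blocked on nothing.
func findCycle(deps map[string]string, start string) ([]string, bool) {
	seen := map[string]bool{}
	path := []string{}
	cur := start
	for {
		if seen[cur] {
			return append(path, cur), true
		}
		seen[cur] = true
		path = append(path, cur)
		next, ok := deps[cur]
		if !ok {
			return path, false
		}
		cur = next
	}
}

func main() {
	path, cyclic := findCycle(waitsFor, "cache sync")
	fmt.Println(cyclic, path)
}
```

Removing the "health probes" edge from the map, which is what starting the probes independently of the caches amounts to, makes the walk terminate without a cycle.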

This PR solves this by starting the health probes immediately again, instead of starting them together with the other runnables.

sbueringer · 2023-05-11T14:54:19Z

/assign @vincepri @alvaroaleman

/hold
I'll verify with Cluster API that the deadlock is gone before merging.

sbueringer · 2023-05-11T15:09:53Z

/hold cancel

Ran the Cluster API test locally with the controller-runtime (CR) version from this PR, and everything works as expected again.

sbueringer · 2023-05-11T15:12:44Z

(cc just fyi @zqzten, we're reverting this part for now to ensure CR main works again, we can work on a new implementation independent of that)

k8s-ci-robot · 2023-05-11T15:27:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer, vincepri

Needs approval from an approver in each of these files:

~~OWNERS~~ [sbueringer,vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zqzten · 2023-05-12T07:30:21Z

I think we can introduce a new "basics" runnable group that includes the previous internal servers and start it before any other runnable group; this seems to solve the deadlock issue here. wdyt @sbueringer

sbueringer · 2023-05-15T13:38:50Z

> I think we can introduce a new "basics" runnable group that includes the previous internal servers and start it before any other runnable group; this seems to solve the deadlock issue here. wdyt @sbueringer

Yup, something like this would work.

Revert: move health probes to runnable

dd0cb45

k8s-ci-robot requested review from FillZpp and joelanford May 11, 2023 14:52

k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), approved (Indicates a PR has been approved by an approver from all required OWNERS files.), and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels May 11, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2023

k8s-ci-robot assigned alvaroaleman and vincepri May 11, 2023

sbueringer mentioned this pull request May 11, 2023

Improve test coverage for controller startup with conversion #2322

Closed

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2023

vincepri approved these changes May 11, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2023

k8s-ci-robot merged commit 92646a5 into kubernetes-sigs:main May 11, 2023
15 checks passed

sbueringer deleted the pr-fix-deadlock branch May 11, 2023 16:26

zqzten mentioned this pull request May 22, 2023

🌱 Introduce a new runnable group for basic servers of the manager #2337

Merged

sbueringer mentioned this pull request Jul 23, 2023

🌱 Add integration test to avoid manager.Start deadlocks #2418

Merged
