kube-master-installation: improve systemd cross-unit robustness. #101176

jkh52 · 2021-04-16T01:12:03Z

Also some minor reliability tweaks.

What type of PR is this?

/kind bug

What this PR does / why we need it:

We have occasionally seen that when the metadata service is flaky and the kube-master-installation unit fails, downstream units (such as kube-master-configuration) may still attempt to run and then fail. Our normal health-signal-based repair waits a very long time, so a master node can sit failed but idle for a while.

This adds clearer dependencies between units and immediately reboots the machine if kube-master-installation.service fails, which is much faster than waiting for the health-signal-based repair.
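
For illustration only, here is a minimal sketch of how this kind of cross-unit wiring can be expressed in systemd. It is not the diff from this PR. `Requires=`, `After=`, and `FailureAction=` are standard systemd options, but where they are placed and what they are set to here are assumptions.

```ini
# Hypothetical sketch, not the actual contents of cluster/gce/gci/master.yaml.

# --- kube-master-installation.service (upstream unit) ---
[Unit]
# Assumption: reboot the machine once this unit enters the failed state.
FailureAction=reboot

# --- kube-master-configuration.service (downstream unit) ---
[Unit]
# Requires= combined with After= means this unit is started only after the
# installation unit has started successfully. If the installation unit
# fails, this unit is not started at all.
Requires=kube-master-installation.service
After=kube-master-installation.service
```

With `After=` alone, systemd only orders the two units. `Requires=` is what stops the downstream unit from running after an upstream failure.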

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

k8s-ci-robot · 2021-04-16T01:12:11Z

Hi @jkh52. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jkh52 · 2021-04-16T01:20:22Z

/assign @wojtek-t

(similar to #101015)

jkh52 · 2021-04-16T01:36:14Z

/retest

k8s-ci-robot · 2021-04-16T01:36:27Z

@jkh52: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

zmerlynn · 2021-04-16T01:49:33Z

/hold

I have concerns (raised internally).

Also some minor reliability tweaks.

jkh52 · 2021-04-20T00:17:40Z

Latest snapshot removes the reboot (there are some cons) and vastly increases the retry interval.

cluster/gce/gci/master.yaml
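
As a rough sketch of the "retry instead of reboot" approach described in the comment above, a unit can be restarted on failure with a longer delay between attempts. The value below is a placeholder. The interval actually chosen in this PR is not stated in the thread.

```ini
# Hypothetical sketch; the RestartSec value is a placeholder, not the PR's value.
[Service]
# Re-run the unit when it exits with a failure, instead of rebooting.
Restart=on-failure
# Seconds to wait between a failed attempt and the next retry.
RestartSec=60
```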

zmerlynn · 2021-04-20T00:20:34Z

/unhold

zmerlynn · 2021-04-20T00:31:18Z

/ok-to-test
/lgtm

wojtek-t · 2021-04-20T06:00:02Z

/approve

/retest

/triage accepted

k8s-ci-robot · 2021-04-20T06:00:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkh52, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster/gce/gci/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from karan and roycaihw April 16, 2021 01:12

k8s-ci-robot added area/provider/gcp (Issues or PRs related to gcp provider) and sig/cloud-provider (Categorizes an issue or PR as relevant to SIG Cloud Provider.) and removed do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.) labels Apr 16, 2021

k8s-ci-robot assigned wojtek-t Apr 16, 2021

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2021

jkh52 force-pushed the master branch from a161433 to 7e81a30 Compare April 16, 2021 05:44

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 16, 2021

kube-master-installation: reboot on failure.

05bcc72

Also some minor reliability tweaks.

jkh52 force-pushed the master branch from 7e81a30 to 05bcc72 Compare April 20, 2021 00:16

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 20, 2021

zmerlynn reviewed Apr 20, 2021

cluster/gce/gci/master.yaml (resolved)

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2021

jkh52 changed the title ~~kube-master-installation: reboot on failure.~~ kube-master-installation: improve systemd cross-unit robustness. Apr 20, 2021

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2021

k8s-ci-robot assigned zmerlynn Apr 20, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2021

k8s-ci-robot merged commit 41505f7 into kubernetes:master Apr 20, 2021

k8s-ci-robot added this to the v1.22 milestone Apr 20, 2021

mm4tt mentioned this pull request May 7, 2021

Retry hostname->IP: [Errno -2] Name or service not known #101781

Merged


jkh52 commented Apr 16, 2021 •

edited by wojtek-t
