Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubeadm upgrade stuck in "Static pod: etcd-xxxx hash: xxxx" occasionally #2877

Closed
ImNotAFrog opened this issue May 16, 2023 · 6 comments · Fixed by kubernetes/kubernetes#118069
Labels
area/upgrades help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.
Milestone

Comments

@ImNotAFrog
Copy link

What happened?

We encountered this error when running kubeadm upgrade apply, the logs of kubeadm printed "Static pod: etcd-xxxx hash: xxxx" repeatedly till timeout.

What did you expect to happen?

We were trying to keep some of the modifications in static pods' yaml while upgrading k8s. Our strategy is:

  1. Keep a copy of original static pods' yaml in cluster using configmap when deploy the cluster.
  2. When upgrading k8s, make a diff of current static pods' yaml and the saved copies in cluster, store them in yaml files.
  3. Call "kubeadm upgrade apply" with param "--patches" to upgrade k8s components.
    We suppose that these operations should work.

How can we reproduce it (as minimally and precisely as possible)?

If the order of a map's fields are different between current static pods' yaml and the diff patch files, the problem will occur. e.g. if current etcd yaml has some annotations like:
metadata:
annotations:
field1: foo1
field2: foo2
and the patch file we created is:
metadata:
annotations:
field2: foo2
field1: foo1
the upgrade will stuck and failed.

Anything else we need to know?

Since map is unordered in golang, we didn't expected this behavior of kubeadm.
By looking into the source code of kubeadm, we found the possible cause of this behavior, the current procedures of kubeadm upgrading static pods are as below(using etcd as an example):

  1. Make a hash of current etcd yaml
  2. Make a new yaml and patch it if "--patches" exists
  3. Compare if new yaml and old yaml are the same, if not, move new yaml to manifest and wait.
  4. Repeatedly get pod hash from cluster, if the hash has changed, means the pod has upgraded successfully.
    The problem of this process is, kubeadm uses DeepHashObject in pkg/util/hash to generate hash code, this hash method will sort the fileds, which means the two yaml files like above will leads to a same hash code. Meanwhile, when compare yaml files, kubeadm simply comprares the files' bytes instead, see in: https://github.com/kubernetes/kubernetes/blob/bc41f37e85dd67ddb69eaa9e4da530c3c72101d7/cmd/kubeadm/app/util/staticpod/utils.go#L352
    In this case, kubeadm will replace the new yaml file to /etc/kubernetes/manifests, but the hash code of the pod won't make any difference.

Kubernetes version

$ kubeadm version
# v1.22.1

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@ImNotAFrog ImNotAFrog added the kind/bug Categorizes issue or PR as related to a bug. label May 16, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 16, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ImNotAFrog
Copy link
Author

/sig Cluster-Lifecycle

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 16, 2023
@pacoxu
Copy link
Member

pacoxu commented May 16, 2023

/area kubeadm
/transfer kubeadm

At my first glance, to patch etcd manifest, you may try the #2046 --patch feature. I need to dig into this later.

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/kubernetes May 16, 2023
@neolit123
Copy link
Member

neolit123 commented May 16, 2023

Repeatedly get pod hash from cluster, if the hash has changed, means the pod has upgraded successfully.
The problem of this process is, kubeadm uses DeepHashObject in pkg/util/hash to generate hash code, this hash method will sort the fileds, which means the two yaml files like above will leads to a same hash code. Meanwhile, when compare yaml files, kubeadm simply comprares the files' bytes instead, see in: https://github.com/kubernetes/kubernetes/blob/bc41f37e85dd67ddb69eaa9e4da530c3c72101d7/cmd/kubeadm/app/util/staticpod/utils.go#L352
In this case, kubeadm will replace the new yaml file to /etc/kubernetes/manifests, but the hash code of the pod won't make any difference.

so when comparing yaml, we should also use DeepHashObject?

this seems like a bug.

occasionally

i don't understand why it would happen occasionally and not always thought?

@neolit123 neolit123 added area/upgrades triage/needs-information Indicates an issue needs more information in order to work on it. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. area/kubeadm triage/needs-information Indicates an issue needs more information in order to work on it. labels May 16, 2023
@ImNotAFrog
Copy link
Author

ImNotAFrog commented May 17, 2023

so when comparing yaml, we should also use DeepHashObject?

Totally agree with this, seems to be the most reasonable solution.

i don't understand why it would happen occasionally and not always thought?

In our case, if the new yaml file has the same order with the old one, this process would be skipped since ManifestFilesAreEqual returns true. And the annotation fileds order is random since its a map.

@neolit123 neolit123 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels May 17, 2023
@neolit123
Copy link
Member

Totally agree with this, seems to be the most reasonable solution.

ok, PRs welcome with a fix. we can backport this bugfix as blocking for upgrades in older versions too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/upgrades help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.
Projects
None yet
4 participants