
kubelet and kube-proxy fail to reload certificates when they are updated #46287

Closed

Spindel opened this issue May 23, 2017 · 69 comments
Assignees
Labels
  • area/kube-proxy
  • area/kubelet
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
  • sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/network: Categorizes an issue or PR as relevant to SIG Network.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@Spindel

Spindel commented May 23, 2017

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): tls certificate reload


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-25T14:48:12Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration:
    Digital Ocean / custom setup

  • OS (e.g. from /etc/os-release):
    coreos VERSION=1353.7.0

  • Kernel (e.g. uname -a):
    Linux coreos01.kub.do.modio.se 4.9.24-coreos #1 SMP Wed Apr 26 21:44:23 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz GenuineIntel GNU/Linux

  • Install tools:
    Ansible / CoreOS getting started guide

  • Others:

What happened:
We reached our scheduled update of the TLS client certificates; the certs were updated properly on disk, but kubelet and kube-proxy keep the old certs in memory. This causes them to fail when communicating with the API server.

What you expected to happen:
kubelet & kube proxy should reload the certificates from disk.

How to reproduce it (as minimally and precisely as possible):
Generate a cert with a short lifetime (a minimal sketch follows below), set up your cluster, wait a while, and then replace the cert with a longer-lived one.
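
For a quick repro without a full PKI, something like the following Go sketch can emit a deliberately short-lived self-signed client cert; the CN and the five-minute lifetime are illustrative assumptions, not our actual setup:

package main

// Minimal repro helper: print a short-lived self-signed client cert
// and its key as PEM. The CN and the five-minute lifetime are
// illustrative only.
import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "kubelet-client-test"}, // hypothetical CN
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(5 * time.Minute), // deliberately short-lived
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		panic(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
}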

Anything else we need to know:
We're attempting to run with short-lived client certificates. This has surfaced some issues with how Kubernetes handles them, and will likely cause hard-to-debug problems for others in the future.
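
For context on why the processes hold stale certs: a certificate loaded once into tls.Config.Certificates is never re-read by crypto/tls. Below is a minimal sketch of the callback-based alternative that picks up replaced files on each handshake. This is the generic Go pattern, not kubelet's actual code, and the paths are hypothetical:

package main

// Generic Go pattern, not kubelet's actual code: re-read the client
// cert/key pair from disk on every TLS handshake, so replacing the
// files on disk takes effect without a process restart.
import (
	"crypto/tls"
	"net/http"
)

func clientFor(certFile, keyFile string) *http.Client {
	cfg := &tls.Config{
		// Called on each handshake in which the server requests a client cert.
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
	return &http.Client{Transport: &http.Transport{TLSClientConfig: cfg}}
}

func main() {
	// Hypothetical paths, for illustration only.
	c := clientFor("/etc/kubernetes/client.pem", "/etc/kubernetes/client-key.pem")
	_ = c // use c.Get(...) against the API server
}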

@cmluciano

/sig cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label May 24, 2017
@alkar

alkar commented Nov 3, 2017

Experiencing the same issue.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.1", GitCommit:"f38e43b221d08850172a9a4ea785a86a3ffa3b3a", GitTreeState:"clean", BuildDate:"2017-10-12T00:44:36Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.1+coreos.0", GitCommit:"59359d9fdce74738ac9a672d2f31e9a346c5cece", GitTreeState:"clean", BuildDate:"2017-10-12T21:53:13Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

@FarhadF

FarhadF commented Nov 12, 2017

Experiencing the same issue. I just tried to replace the certificates with new ones.

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:38:10Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

@FarhadF

FarhadF commented Nov 12, 2017

My workaround:

  1. Replace the certificates with new ones.
  2. Change the systemd service ExecStart to a one-liner (I was using \ for multi-line formatting; you can also run the one-liner directly in a terminal for quick verification):

ExecStart=/usr/bin/kubelet   --kubeconfig=/etc/kubelet/kubeconfig   --allow-privileged=true   --cluster-dns=10.96.0.10   --cluster-domain=cluster.local   --container-runtime=docker   --docker=unix:///var/run/docker.sock   --network-plugin=cni   --serialize-image-pulls=false   --tls-cert-file=/etc/kubernetes/k2.pem   --tls-private-key-file=/etc/kubernetes/k2-key.pem   --cni-conf-dir=/etc/cni/net.d   --cni-bin-dir=/opt/cni/bin   --v=2

  3. systemctl daemon-reload (followed by a restart of the service so the new ExecStart takes effect)
  4. openssl s_client -connect <nodename>:10250 -showcerts

You should see the cert chain with no self-signed errors. I also verified the new certificate against the respective file:

Certificate chain
 0 s:/CN=k2
   i:/CN=kube-ca
Server certificate
subject=/CN=k2
issuer=/CN=kube-ca
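
The same check can be scripted from Go instead of openssl; here is a small sketch. The host name is a placeholder, port 10250 matches the check above, and InsecureSkipVerify is used only because we just want to print the chain:

package main

// Print the subject/issuer of each certificate a TLS server presents,
// analogous to `openssl s_client -connect <nodename>:10250 -showcerts`.
import (
	"crypto/tls"
	"fmt"
)

func main() {
	conn, err := tls.Dial("tcp", "nodename:10250", &tls.Config{ // hypothetical host
		InsecureSkipVerify: true, // inspection only; do not use for real traffic
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	for i, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Printf("%d s:%s\n   i:%s\n", i, cert.Subject, cert.Issuer)
	}
}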

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 12, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@george-angel
Contributor

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@george-angel: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 11, 2018
@george-angel
Contributor

Can someone please re-open this issue? It's quite a significant one for us. We currently need Trello cards with a due date set a year ahead as a reminder to restart kube-proxy.

@george-angel
Contributor

Ping

@dims
Member

dims commented Aug 1, 2019

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 1, 2019
@k8s-ci-robot
Contributor

@dims: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@riking

riking commented Aug 13, 2019

Is this a duplicate of #4672? Or possibly a subset (this one calls out kubelet/kube-proxy specifically).

@george-angel
Contributor

It can be considered a subset depending on how #4672 unfolds; currently it's very broad.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2019
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 9, 2020
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2020
@kfox1111

Semi-related: I'm also interested in seeing whether either this mechanism or direct integration with SPIRE is possible for the certs.

@shaneutt
Member

Gotcha! I don't know of anyone right now with the bandwidth/priority to take this one on (which is why it keeps going stale), but if you think you might have some time to spare, I'd encourage you not to worry about being a kubelet or kube-proxy expert. Reach out to the community (Slack, Zoom calls) and let them know you're trying to learn how to get this fixed: I expect you'll find the community can provide some level of assistance (even if they themselves don't have the bandwidth/priority to take the whole thing on).

@shaneutt
Member

While we're still open to someone with capacity taking this one on, it's been some time without anyone who can, so it seems the previous lifecycle was accurate:

/lifecycle rotten

If you're interested in picking this one up, let us know and we will support you!

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 22, 2024
@kfox1111

It hasn't risen to the top of the to-do list yet, especially since it requires so much searching around for help, and it's not clear to me who can provide it (sig-auth, sig-node, sig-cluster-lifecycle, other?).

If we could identify some developers with knowledge of roughly what needs to be done, and where, that would go a long way.

@shaneutt
Member

Is this something you'd like to put on the agenda for our next SIG Network meeting, so you can come talk it through with us and we can start figuring it out?

@aojea
Member

aojea commented Mar 23, 2024

@kfox1111 can you expand on how you update the certificates in kube-proxy today? Also, how do you deploy kube-proxy? Are you using a DaemonSet?
It would be useful to understand your workflow better so we can come up with the best solution.

@kfox1111

@shaneutt https://docs.google.com/document/d/1_w77-zG_Xj0zYvEMfQZTQ-wPP4kXkpGD8smVtW_qqWM/edit Mar 28 at 9am PST? I think I can make that, if so.

@aojea I'm not yet doing anything beyond deploying with kubeadm. Ideally, though, I'd like to be able to use a SPIRE chain of trust for the cluster, attesting with either TPMs on bare metal or a cloud-based node attestor for VMs. They have already done a lot of the heavy lifting; we just need a mechanism to get the certs from SPIRE to Kubernetes.

This would have some big benefits:

  • Much shorter-lived certificates, rotated automatically (hours or days).
  • Node attestation. Rather than an initial bootstrap join-token-like mechanism, stronger mechanisms like TPMs can be used to prove identity. No need to SSH in (how do you validate the SSH host signature?) and copy join material from the control plane to the node. It can be automatic and safe.
  • Periodic re-attestation. On certificate refresh, continued proof of identity can be performed, for example by handshaking with the TPM again to ensure it is still on the same node.
  • kubeadm has had a long-standing issue with kubelet server certs being self-signed. The same mechanism could be used to give kubelet (and maybe some other services) proper certificates in a verifiable chain of trust.

@kfox1111

Was referred to sig-auth from sig-network on this issue.

@aojea
Member

aojea commented Mar 28, 2024

@enj we were discussing this issue today during the SIG Network meeting, and our understanding is that it is related to how client-go loads the certificates, so we are moving it to SIG Auth.

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Mar 28, 2024
@kfox1111

client-go is half of the issue.

The server being able to use updated certificates without a restart is also important, and that path does not use client-go.
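
For the server half, crypto/tls has an analogous hook, GetCertificate, which resolves the serving cert per handshake. A minimal sketch of that generic pattern follows; it is not the actual kubelet implementation, and the paths simply reuse the example from earlier in the thread:

package main

// Generic server-side pattern: resolve the serving certificate on each
// handshake instead of loading it once at startup, so rotated files on
// disk are picked up without a restart. Not kubelet's actual code.
import (
	"crypto/tls"
	"net/http"
)

func main() {
	cfg := &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			// A real implementation would cache and re-read only on change.
			cert, err := tls.LoadX509KeyPair("/etc/kubernetes/k2.pem", "/etc/kubernetes/k2-key.pem")
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
	srv := &http.Server{Addr: ":10250", TLSConfig: cfg}
	// Empty cert/key arguments: the certificate comes from GetCertificate.
	panic(srv.ListenAndServeTLS("", ""))
}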

@stlaz
Member

stlaz commented Apr 8, 2024

per triage:
Someone from the auth team should have a look and see what the issue is, based on the comments here. The original report does not explain which exact certificates are not being reloaded.

@george-angel
Contributor

  1. tlsCertFile, as part of KubeletConfiguration. We set this to a 7d TTL and refresh it daily. Kubelet needs to be restarted to use the new certificate.
  2. clientCAFile, as part of:
authentication:
  x509:
    clientCAFile: "/etc/kubernetes/ssl/ca.pem"

config in the same KubeletConfiguration file. This has 1-2 yr validity for us, and again kubelet (and kube-proxy) need a restart to use it.
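
Both of those settings end up in a tls.Config built once at startup, which is why a restart is needed. Here is a simplified sketch of that startup-time pattern; it is illustrative only, not kubelet's actual code, and the serving-cert paths are hypothetical:

package main

// Simplified sketch of why a restart is needed: both the serving
// keypair and the client CA bundle are read once, at startup, and
// frozen into the tls.Config. Illustrative, not kubelet's code.
import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

func buildTLSConfig(certFile, keyFile, clientCAFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(clientCAFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // frozen until restart
		ClientCAs:    pool,                    // frozen until restart
		ClientAuth:   tls.VerifyClientCertIfGiven,
	}, nil
}

func main() {
	// Hypothetical serving-cert paths; the CA path is from the comment above.
	cfg, err := buildTLSConfig("/etc/kubernetes/ssl/server.pem",
		"/etc/kubernetes/ssl/server-key.pem",
		"/etc/kubernetes/ssl/ca.pem")
	if err != nil {
		panic(err)
	}
	_ = cfg
}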

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) May 9, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kfox1111

Still an issue. Please reopen

@thockin thockin reopened this May 10, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 10, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shaneutt
Member

/assign @aroradaman

@zhangweikop
Contributor

zhangweikop commented Jun 6, 2024

client-go is half of the issue.

The server being able to use updated certificates without a restart is also important, and that path does not use client-go.

Yes.
The recent kubelet change #124574 resolves the second half.

@kfox1111

kfox1111 commented Jun 8, 2024

Awesome. :)

Does it do the CAs too, or just the client/server certs?

@zhangweikop
Contributor

Awesome. :)

Does it do the CAs too, or just the client/server certs?

Not the CA, only the certs.

From what I know, for kubelet:
neither the server TLS config nor the client-go TLS config does dynamic reloading of the CA file as of today.
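
For the CA half, crypto/tls does offer a hook that could support it: GetConfigForClient can return a fresh config, with a re-read CA pool, for each incoming handshake. The sketch below shows what that could look like; it is a generic pattern, not something kubelet does today per the comment above:

package main

// Generic pattern for dynamic client-CA reload: return a fresh
// tls.Config, with a re-read CA pool, for every incoming handshake.
// Per the comment above, kubelet does NOT do this today; this is a
// sketch of what such support could look like.
import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"
)

func configWithLiveCA(base *tls.Config, caFile string) *tls.Config {
	cfg := base.Clone()
	cfg.GetConfigForClient = func(*tls.ClientHelloInfo) (*tls.Config, error) {
		caPEM, err := os.ReadFile(caFile)
		if err != nil {
			return nil, err
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("no certificates found in " + caFile)
		}
		fresh := base.Clone()
		fresh.ClientCAs = pool
		fresh.ClientAuth = tls.VerifyClientCertIfGiven
		return fresh, nil
	}
	return cfg
}

func main() {
	// Illustrative path taken from the KubeletConfiguration example above.
	cfg := configWithLiveCA(&tls.Config{}, "/etc/kubernetes/ssl/ca.pem")
	_ = cfg
}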

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 9, 2024