
Can't join existing cluster JWS not found? #668

Closed
brendandburns opened this issue Jan 23, 2018 · 37 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@brendandburns

brendandburns commented Jan 23, 2018

Trying to join an existing ~20 day old cluster:

sudo kubeadm join --token <redacted>
[preflight] Running pre-flight checks.
	[WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.01.0-ce. Max validated version: 17.03
	[WARNING FileExisting-crictl]: crictl not found in system path
[discovery] Trying to connect to API Server "10.0.0.1:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.0.0.1:6443"
[discovery] Failed to connect to API Server "10.0.0.1:6443": there is no JWS signed token in the cluster-info ConfigMap. This token id "aec65f" is invalid for this cluster, can't connect
[discovery] Trying to connect to API Server "10.0.0.1:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.0.0.1:6443"
[discovery] Failed to connect to API Server "10.0.0.1:6443": there is no JWS signed token in the cluster-info ConfigMap. This token id "aec65f" is invalid for this cluster, can't connect

Any ideas?

Thanks

Cluster was created by kubeadm 1.9.0, current kubeadm is 1.9.2

@brendandburns brendandburns changed the title Can't join existing cluster... Can't join existing cluster JWS not found? Jan 23, 2018
@stewart-yu
Contributor

stewart-yu commented Jan 23, 2018

It seems the token has expired. Use kubeadm token create on the master node to create a new valid token.
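
For example, a minimal sketch run on the master (--print-join-command may not exist on older kubeadm releases; if not, pass the new token to kubeadm join by hand):

# mint a fresh bootstrap token and print the full join command for the nodes
sudo kubeadm token create --print-join-command
# confirm the token exists and has not expired
sudo kubeadm token list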

@brendandburns
Author

I tried that. The token in the error message changes, but the error persists.

I can see the tokens when I run kubeadm token list on the master. Anything more I can do to help debug?

@dixudx
Member

dixudx commented Jan 24, 2018

@brendandburns Can you connect to the cluster now?

Please check the cluster-info ConfigMap in the kube-public namespace. It seems this ConfigMap is damaged: the key jws-kubeconfig-aec65f is missing from cluster-info.
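
For example, something along these lines should show whether the signature is there:

kubectl get configmap cluster-info --namespace=kube-public -o yaml
# a healthy ConfigMap has a jws-kubeconfig-<token-id> key next to "kubeconfig" under data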

@brendandburns
Author

Sorry, I nuked the cluster and rebuilt it from scratch. I will try to reproduce this later today and see if it re-occurs.

Thanks
--brendan

@arunmk

arunmk commented Jan 27, 2018

I saw this happen on a node with Kubernetes 1.6.8 today. This doesn't seem to be a TTL issue, and the workaround mentioned in #335 does not help. The cluster-info ConfigMap does not have jws-kubeconfig set. At what step is this value set? We use kubeadm to set up the cluster.

@dixudx
Member

dixudx commented Jan 28, 2018

@arunmk You can try to run kubeadm token create to create a new token.

@arunmk

arunmk commented Jan 28, 2018

@dixudx creating a new token did not help; the new ConfigMap has the same issue (no jws-kubeconfig). I think there is something unique to this node, and I am debugging along those lines.

Do you know if there are any packages etc. needed for this token to be created on the machine?

EDIT: Apologies @dixudx, this command does work on a test node. I mistook the token create command to mean recreation of the cluster with the new token. I'll try on the other node of interest and update the thread.

@timothysc timothysc added kind/bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Jan 30, 2018
@timothysc
Member

/cc @mattmoyer

@arunmk

arunmk commented Jan 30, 2018

@dixudx on a machine where I don't have much access, the jws-kubeconfig does not get created. Here is what I do:
kubeadm token create --ttl=0
kubectl -n kube-public get configmap cluster-info -o json

The second command yields only the kubeconfig child under the data field; the jws-kubeconfig-* field is missing. kubeadm token list does show the newly created tokens, but the kube-public ConfigMap does not have them.
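
A compact way to check the same thing, assuming jq is installed, is to list only the data keys:

kubectl -n kube-public get configmap cluster-info -o json | jq -r '.data | keys[]'
# expected: "kubeconfig" plus one jws-kubeconfig-<token-id> entry per signed token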

@binarybana

I just encountered this problem, and kubeadm token create <existing token> DID seem to populate the jws-kubeconfig-<string> field of the cluster-info ConfigMap for me.

This was with kubeadm 1.9.2 on the master, using an existing token as an argument. Interestingly, kubeadm token list was empty prior to running the token creation.

@dixudx
Member

dixudx commented Feb 1, 2018

@arunmk Works well on my env.

root@server-01:~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
root@server-01:~# kubeadm token create --ttl 0
aacff5.cb1a195970ddba98
root@server-01:~# kubectl get configmap cluster-info --namespace=kube-public -o json
{
    "apiVersion": "v1",
    "data": {
        "jws-kubeconfig-aacff5": "eyJhbGciOiJIUzI1NiIsImtpZCI6ImFhY2ZmNSJ9..Xgn2YRa_SM5qHq04vw_8SF-5-6nztGBi-4euSCIz_6Q",
        "kubeconfig": "apiVersion: v1\nclusters:\n- cluster:\n    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRFNE1ERXlOVEF6TVRrME5sb1hEVEk0TURFeU16QXpNVGswTmxvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTHNmCks4b2R0bmExdDdRSlpxUjBCYnA4aVJqMWQ5OHQwRHRoUVBPOUlMTUt6M3F3c09qWWtXL0ZQc2R0QU1BYmw0Q24KM2lTajJqWW9TTTNLTUJpNUg1QjdhVHM1OEN6Rk85Q3FVVDkyUHFiMkhmVnZYNjdPZ1poZFo3ak9vVUNXM0pjYwprNnFDYjZxWWZSVkdXUkl4cldoQXJOTllyeEttZE01L1ErM3hFclJwdGtDaFVyTi9ETk05ZGhTQlBpRFBocEVoCkgyNy9Xa2JnUm95TThZQ1F5bTZaa204eGR2Zk1DVEV0WHgvdko4U0lERHYyS1orZnRPQWNoRCtsVmxpK2xJZ1MKVmttdVAvU0lNRU5sMDVMeTBqQVlyM004QkNMeUx6bWRiZU1zMlpCWTltMk1kckJ4S2lLRXg3RkZlTG9odGw3NwpEZGVxakhqaklaMW1oTUxqMXY4Q0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFMSjYxY1FzY3FPeDZxeXJrU0lkSmRKbWs4WmsKcVk1TldUa2RBRFFQTFBLSEo0eSt0TEpYVE81dGRKQ0NYT1JpeHJCMk9SQ0k5M2dGMWJwcWhOamlUNE40Rmg0YQpwSUl1RmZkNURRaWVQTWlFdEl2cmVKNEY3ZVZWWENOMitNZ0l6ZFRCZTRHQmczcXIrVk43TGtndEdmZllubEFuClRheVE3MkhnZG40YXlYcUdSQm1lTzBYeExSTTZUck1PeWtPckhkdVdtNVBCbXNmdENzM3IxdGczTmVHNEwyYzAKS1J6TkI5cVVneW9hOTlEMWFZcDNaWVBacDhORFBiR2Rsay9GaTRXOWZFZkIrem5BVXkrVGdqK1VQUjY3SzZpMApvbEkvWG1YemVNaHBEQ2F1Y2RsK2d4emNJeFhSQmxkQkFWREE5a0RNa0xuQ2VEL1FpUEpOZmxuZHBJbz0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=\n    server: https://192.168.31.100:6443\n  name: \"\"\ncontexts: []\ncurrent-context: \"\"\nkind: Config\npreferences: {}\nusers: []\n"
    },
    "kind": "ConfigMap",
    "metadata": {
        "creationTimestamp": "2018-01-25T03:20:16Z",
        "name": "cluster-info",
        "namespace": "kube-public",
        "resourceVersion": "125266",
        "selfLink": "/api/v1/namespaces/kube-public/configmaps/cluster-info",
        "uid": "aa1498ff-017e-11e8-abe2-023e88328eba"
    }
}
root@server-01:~# kubeadm token list
TOKEN                     TTL         EXPIRES   USAGES                   DESCRIPTION   EXTRA GROUPS
aacff5.cb1a195970ddba98   <forever>   <never>   authentication,signing   <none>        system:bootstrappers:kubeadm:default-node-token

@bart0sh

bart0sh commented Feb 14, 2018

Can you share the controller-manager log from your master? It looks like the cluster-info ConfigMap was not updated for some reason. Hopefully there are errors in the controller-manager log that will help us figure out why that happened.

@bart0sh

bart0sh commented Feb 17, 2018

@brendandburns @binarybana Can you share the controller-manager log from your master?

@arunmk

arunmk commented Feb 20, 2018

@bart0sh how do we get the controller-manager logs? I looked at the etcd database using the etcdctl utility and can see that on a working machine the token is present, while on the failing machine it isn't. This is the command used:
ETCDCTL_API=3 ./etcdctl get "/registry/configmaps/kube-public/cluster-info"

Working machine:
cluster-info kube-public"6/api/v1/namespaces/kube-public/configmaps/cluster-info*$db43c18c-fbe3-11e7-99fe-001e67f85cfc28B ������܉zn jws-kubeconfig-266b02UeyJhbGciOiJIUzI1NiIsImtpZCI6IjI2NmIwMiJ9..3Fz-FLir0GSBPeKGS5u5FHsm69YA6MULOIKvJUO4CUc�

Failing machine:
cluster-info kube-public"*$78557677-0a90-11e8-b710-94188268af7028B ÈâÓ¶æÑùz¤

@arunmk

arunmk commented Feb 20, 2018

There are also numerous errors in the journal logs as follows:
k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.Secret: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list secrets in the namespace "kube-system".: "role.rbac.authorization.k8s.io \"system:controller:bootstrap-signer\" not found" (get secrets)

@bart0sh

bart0sh commented Feb 20, 2018

@arunmk you can get controller manager logs this way:

kubectl logs <controller-manager pod> --namespace=kube-system

You can find the name of its pod using kubectl get pods:

kubectl get pods --all-namespaces | grep controller-manager

It would be interesting to see its logs when a new token is created and the ConfigMap is requested:

kubeadm token create && kubectl get configmap cluster-info --namespace=kube-public -o json

@arunmk

arunmk commented Feb 21, 2018

@bart0sh ah, OK, you meant the pod logs. I was thinking of process-related logs in journald. I'll get these out.

@marranz

marranz commented Feb 22, 2018

Hi,

I'm experiencing this same problem ONLY when creating the master with the --cloud=aws flag or when I enable it in the config file.

this is a config file example:

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
  advertiseAddress: 10.1.11.22
networking:
  podSubnet: 192.168.0.0/16
cloudProvider: aws

When I remove 'cloudProvider: aws' on new nodes, or after running 'kubeadm reset', I can join the nodes without errors.

Should I open another issue?

@arunmk

arunmk commented Feb 27, 2018

@bart0sh looks like the problem got fixed. We suspected that system:serviceaccount:kube-system:bootstrap-signer could not read the secret it needs to create a signed token. Hence we applied the attached file, which allowed system:serviceaccount:kube-system:bootstrap-signer to read the secret and sign the token.

k8s-workaround-cr.txt
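
For context, a hedged sketch of the kind of RBAC objects such a workaround could contain; the actual k8s-workaround-cr.txt is not reproduced here, and the rules below are an assumption modeled on the default bootstrap-signer Role that the error message says is missing:

# hypothetical reconstruction - the real workaround file may differ
# (older clusters such as 1.6 may need rbac.authorization.k8s.io/v1beta1)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: system:controller:bootstrap-signer
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: system:controller:bootstrap-signer
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: system:controller:bootstrap-signer
subjects:
- kind: ServiceAccount
  name: bootstrap-signer
  namespace: kube-system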

@bart0sh

bart0sh commented Feb 27, 2018

@arunmk Thank you for the info. Can you tell how you discovered the reason? Which logs did you look at, etc.?

@bart0sh

bart0sh commented Feb 27, 2018

@marranz Can you check whether @arunmk's solution works for you? Do you see anything suspicious in the controller-manager log?

@arunmk

arunmk commented Feb 27, 2018

@bart0sh I saw numerous errors in journalctl logs of the form:
k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.Secret: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list secrets in the namespace "kube-system".: "role.rbac.authorization.k8s.io \"system:controller:bootstrap-signer\" not found" (get secrets)

That, and the dump of etcd, led us to suspect that the token was not present because it was probably not getting signed (due to the error above). Hence I tried granting the rights needed for the signer to access the secret.

@timothysc
Member

/cc @liztio

@liztio

liztio commented Mar 30, 2018

@marranz hey, I repro'd your bug and am working on tests / a fix

@timothysc timothysc added this to the v1.11 milestone Apr 3, 2018
@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 3, 2018
@liztio

liztio commented Apr 3, 2018

update:
I tracked down the actual signing action to the "controller-manager" pod. And lo and behold, it's crash-looping with cloudProvider: aws:

ubuntu@ip-172-31-80-217:~$ kubectl logs -n kube-system kube-controller-manager-ip-172-31-80-217
I0403 21:11:54.344806       1 controllermanager.go:108] Version: v1.9.6
I0403 21:11:54.348686       1 leaderelection.go:174] attempting to acquire leader lease...
I0403 21:11:54.362171       1 leaderelection.go:184] successfully acquired lease kube-system/kube-controller-manager
I0403 21:11:54.362616       1 event.go:218] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"e275f07f-3389-11e8-b7b8-12eedf22aef4", APIVersion:"v1", ResourceVersion:"272947", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-172-31-80-217 became leader
I0403 21:11:54.389127       1 aws.go:1000] Building AWS cloudprovider
I0403 21:11:54.389173       1 aws.go:963] Zone not specified in configuration file; querying AWS metadata service
E0403 21:11:55.040927       1 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
W0403 21:11:55.040956       1 tags.go:78] AWS cloud - no clusterID filtering applied for shared resources; do not run multiple clusters in this AZ.
F0403 21:11:55.041029       1 controllermanager.go:150] error building controller context: no ClusterID Found.  A ClusterID is required for the cloud provider to function properly.  This check can be bypassed by setting the allow-untagged-cloud option

That clusterID problem looks a lot like #53538. I'll investigate the solutions mentioned there tomorrow.
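
For reference, a hedged sketch of how that bypass could be wired through the v1alpha1 kubeadm config (the controllerManagerExtraArgs field is an assumption here; as the next comment explains, tagging the AWS resources is the proper fix):

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
cloudProvider: aws
controllerManagerExtraArgs:
  # hypothetical: silences the "no ClusterID Found" fatal error
  allow-untagged-cloud: "true"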

@timothysc timothysc removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Apr 4, 2018
@liztio

liztio commented Apr 4, 2018

As I speculated yesterday, this is not a bug related to signing. Rather, it's a documentation and error-exposure failure.

Using cloudProvider: aws imposes a number of additional requirements on an environment that aren't present for non-cloud-provider clusters. Failures are hidden away in secret places: the kubelet logs, the controller-manager logs (as seen above), and the API server pod logs.

Here's how I got everything to work:

  • I had to set each node's hostname to a FQDN. On Ubuntu this meant: sudo hostname $(curl 169.254.169.254/latest/meta-data/hostname).
  • Each node needs an AWS tag of the form kubernetes.io/cluster/<cluster-name>, set to either owned or shared (see the sketch below this list).
  • I gave the master an IAM role that lets it manage clusters.

There's a good writeup by Nate Baker on these requirements.
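
A hedged sketch of the first two steps (the instance ID and cluster name are placeholders, and aws ec2 create-tags is just one way to apply the tag; the writeup may do it differently):

# set the hostname to the EC2-internal FQDN
sudo hostname "$(curl -s 169.254.169.254/latest/meta-data/hostname)"

# tag the instance so the AWS cloud provider can identify its cluster
aws ec2 create-tags \
  --resources <instance-id> \
  --tags Key=kubernetes.io/cluster/<cluster-name>,Value=owned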

@marranz hope this helps!

@liztio liztio closed this as completed Apr 4, 2018
@arunmk

arunmk commented Apr 4, 2018

@liztio I don't think this should be closed. The AWS issue is only one facet of this, and the issue we hit had no relation to the cloud (an on-prem scenario).

@liztio

liztio commented Apr 4, 2018

@arunmk ah, my mistake, I thought the other issue was solved (from this comment).

@liztio liztio reopened this Apr 4, 2018
@arunmk

arunmk commented Apr 4, 2018

@liztio no worries. I have a workaround, but it's quite messy to automate, and we don't have a root cause yet. Hence I don't want to close this issue; I'm hoping for a cleaner fix. If the issue reported by @brendandburns is fixed by the commit, please do mention it and I'll open a new bug for my issue.

@timothysc
Member

@arunmk - can you distill the actionable requirements you are looking for? Most of these issues I have seen have to do with the ability to change the hostname.

@arunmk

arunmk commented Apr 9, 2018

@timothysc I saw errors of the form mentioned in this comment: #668 (comment)
Also, when I dumped the config from etcd and used 'kubectl get configmap' on the master, the tokens weren't present. However, kubeadm token list did show the tokens. (Do they pull from different locations in etcd?)

To get around this issue, we used the script mentioned in #668 (comment)

I don't have an RCA for that, as it was on a machine where I don't have much access. What I am looking for is:

  • does anyone know of a root-cause for this issue?
  • is there a better workaround / some other fix related to system settings?

We can detect this issue and automate the workaround, but we don't want to do it until we understand the root cause. Why does the 'kube-system:bootstrap-signer' not have access to the secret?
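
One way to check that last question directly (a minimal sketch; kubectl auth can-i impersonates the service account):

# does the default Role for the signer exist in kube-system?
kubectl -n kube-system get role system:controller:bootstrap-signer

# can the bootstrap-signer service account actually list secrets there?
kubectl auth can-i list secrets -n kube-system \
  --as=system:serviceaccount:kube-system:bootstrap-signer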

@liztio

liztio commented Apr 10, 2018

@arunmk have you noticed anything about the environments the issue occurs in? I'm happy to dig in on this, but I need a starting point.

@timothysc timothysc added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed triaged priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 26, 2018
@timothysc timothysc assigned timothysc and unassigned liztio May 15, 2018
@luxas
Member

luxas commented May 16, 2018

I think the first issue was a race condition in the controller-manager that has since been fixed, and the other issue conflated here is unrelated to the first comment. I'm closing this now; please file a new issue for the AWS-related bug, and reopen if you can reproduce the JWS-signing bug in a bare-metal env with no cloud provider, running kubeadm v1.10.

@luxas luxas closed this as completed May 16, 2018
@arunmk

arunmk commented May 16, 2018

@liztio I was away for a while and could not respond. Let me create a new bug with the required information if this reoccurs. Thanks for the fixes!

@devtech0101

devtech0101 commented Aug 23, 2020

It depends on what version of Kubernetes you are running. For example, on 1.18 you can generate a new token as follows (--v=5 just adds verbose output).

step 1:
sudo kubeadm --v=5 token create --print-join-command (this will update the cluster-info ConfigMap with a JWS entry in its data section)

step 2:
Run the printed join command on the node; it should work:
kubeadm join 172.16.26.136:6443 --token 0l27fp.tegcha916hiwn4lv --discovery-token-ca-cert-hash sha256:058073bb05c1d15ec802288c815e2f1d5fa12f912e6e7da9086f4b7c2e2aa850
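
To double-check that step 1 actually populated the ConfigMap, a quick sanity check along these lines should show the new entry:

kubectl -n kube-public get configmap cluster-info -o yaml | grep jws-kubeconfig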

@arunmk

arunmk commented Aug 23, 2020

@liztio @timothysc @devtech0101 this bug has fallen off my radar and I have not heard any new reports of it. Kubernetes has also moved quite a ways from 1.6.8, when this issue was seen, to 1.18 today. So I am fine with closing this issue.

@m33m33k

m33m33k commented Sep 7, 2020

This issue happened for me when kubeadm init / kubeadm token create was not creating the jws-kubeconfig entry. I was using Fedora CoreOS, which is restrictive in terms of write rights.

The part which says "/usr is mounted read-only on nodes" in this doc
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm/ helped me fix the issue.

So basically kubeadm wasn't able to write because of insufficient write access.
