
Upgrading cluster running cri-o from 1.13.5 to 1.14.1 yields errors on runtimeclasses access #77801

Closed
sbskas opened this issue May 13, 2019 · 21 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question. sig/auth Categorizes an issue or PR as relevant to SIG Auth. triage/not-reproducible Indicates an issue can not be reproduced as described.

Comments

@sbskas

sbskas commented May 13, 2019

What happened:
After upgrading a working cluster from 1.13.5 to 1.14.1 using the kubeadm upgrade apply command, the kubelet started emitting the following error messages:
kubelet[2635221]: E0513 09:55:54.814239 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:node:node" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope
kubelet[2635221]: E0513 09:55:55.537722 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:node:node" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope

What you expected to happen:
The kubelet should have been able to access runtimeclasses in node.k8s.io and csidrivers in storage.k8s.io.

How to reproduce it (as minimally and precisely as possible):
We did not try to reproduce the bug. Basically: create a 1.13.5 cluster with the ceph-csi driver and upgrade it to 1.14.1.

Anything else we need to know?:
We were able to fix the bug by adding a subjects entry to the system:node ClusterRoleBinding:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
```
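
For reference, a sketch of how that subjects entry could be added from the command line; this assumes the default kubeadm binding name system:node and only reproduces the workaround described above, not a recommendation:

```sh
# Sketch of the workaround above: add the system:nodes group as a subject on
# the system:node ClusterRoleBinding (assumes the binding has no subjects yet).
kubectl patch clusterrolebinding system:node --type='json' \
  -p='[{"op":"add","path":"/subjects","value":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:nodes"}]}]'
```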

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g: cat /etc/os-release):
    NAME=Fedora
    VERSION="30 (Thirty)"
    ID=fedora
    VERSION_ID=30
    VERSION_CODENAME=""
    PLATFORM_ID="platform:f30"
    PRETTY_NAME="Fedora 30 (Thirty)"
    ANSI_COLOR="0;34"
    LOGO=fedora-logo-icon
    CPE_NAME="cpe:/o:fedoraproject:fedora:30"
    HOME_URL="https://fedoraproject.org/"
    DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
    SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="Fedora"
    REDHAT_BUGZILLA_PRODUCT_VERSION=30
    REDHAT_SUPPORT_PRODUCT="Fedora"
    REDHAT_SUPPORT_PRODUCT_VERSION=30
    PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"

  • Kernel (e.g. uname -a):
    Linux core1 5.0.13-300.fc30.x86_64 #1 SMP Mon May 6 00:39:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@sbskas sbskas added the kind/bug Categorizes issue or PR as related to a bug. label May 13, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 13, 2019
@sbskas
Author

sbskas commented May 13, 2019

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 13, 2019
@mattjmcnaughton
Contributor

Thanks for reporting, @sbskas!

I think (but could definitely be wrong) that @kubernetes/sig-auth-bugs will be best able to assist you with this.

Please feel free to undo my tagging if I'm incorrect :)

/remove-sig node
/sig auth

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 13, 2019
@liggitt
Member

liggitt commented May 13, 2019

/cc @tallclair

do kubelets list runtimeclasses?

@sbskas
Author

sbskas commented May 13, 2019

I don't know. They print those errors every second unless the system:node clusterrolebinding is patched.

One important point I forgot to mention in the description is that I'm running cri-o as the runtime.

@sbskas sbskas changed the title Upgrading cluster to 1.14.1 yields errors on runtimeclasses access Upgrading cluster running cri-o from 1.13.5 to 1.14.1 yields errors on runtimeclasses access May 13, 2019
@liggitt
Member

liggitt commented May 14, 2019

```go
if utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) {
    klet.runtimeClassManager = runtimeclass.NewManager(kubeDeps.KubeClient)
}
```

this sets up a runtimeclass informer in the kubelet, which needs list/watch

@tallclair, what is the runtimeclass used for in the kubelet? is it required? if so, how are we testing the functionality this enables? can we add test coverage along with the required permissions? if not, can we remove the runtimeclassmanager from the kubelet?
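
A quick way to confirm whether those list/watch permissions are in place is to impersonate the node identity from the error message above (the node name "node" comes from that log line; adjust to a real node name):

```sh
# Check the permissions the runtimeclass informer needs, impersonating the node.
kubectl auth can-i list runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:nodes
kubectl auth can-i watch runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:nodes
```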

@mrtylerzhou

I also came across a similar problem. Somehow, after a node was upgraded to 1.14.1 (the other nodes are 1.13.1), that node shifted back and forth between Ready and NotReady. When checking the kubelet log, I got similar log messages.

@tallclair
Member

what is the runtimeclass used for in the kubelet?

The Kubelet needs to look up the runtime handler for the runtimeclass associated with a pod.

is it required?

Yes, if pods are using RuntimeClasses.
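
For illustration, a minimal sketch of the case being described: a RuntimeClass plus a pod that references it, which forces the kubelet to resolve the handler. The handler name myhandler is a placeholder for whatever the CRI runtime (cri-o here) actually has configured:

```sh
# Illustrative only; "myhandler" must match a handler configured in the runtime.
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: example
handler: myhandler
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  runtimeClassName: example
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
EOF
```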

if so, how are we testing the functionality this enables?

Cluster E2Es only test the failure cases, and they don't distinguish a failure to look up the runtimeclass from a failure to run with the handler (I'll fix this): https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/runtimeclass.go

Node E2Es test the end-to-end configured handler:

```go
// This test requires that the PreconfiguredRuntimeHandler has already been set up on nodes.
It("should run a Pod requesting a RuntimeClass with a configured handler [NodeFeature:RuntimeHandler]", func() {
    // The built-in docker runtime does not support configuring runtime handlers.
    framework.SkipIfContainerRuntimeIs("docker")
    rcName := createRuntimeClass(f, "preconfigured-handler", PreconfiguredRuntimeHandler)
    pod := createRuntimeClassPod(f, rcName)
    expectPodSuccess(f, pod)
})
```
(https://testgrid.k8s.io/sig-node-containerd#node-e2e-features&include-filter-by-regex=RuntimeClass)

can we add test coverage along with the required permissions?

Do you mean fixing the tests to fail when the permissions are missing? Yes.

Permissions are added here as part of RBAC bootstrap - any idea why they wouldn't be included in this case?

```go
// RuntimeClass
if utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) {
    nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("get", "list", "watch").Groups("node.k8s.io").Resources("runtimeclasses").RuleOrDie())
}
```
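
One way to verify that this bootstrapped rule actually landed in the cluster is to inspect the system:node ClusterRole served by the apiserver being queried (a sketch, assuming the standard bootstrap role name):

```sh
# Confirm the runtimeclasses rule is present in the bootstrapped system:node role.
kubectl get clusterrole system:node -o yaml | grep -B 3 -A 6 runtimeclasses
```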

@tallclair
Member

@sbskas and @mrtylerzhou - Did you update the apiserver before updating the nodes? Are you running the node authorizer?

@mrtylerzhou

@tallclair In fact, I did not update the apiserver; I don't even know why the node got updated. I downgraded the node's kubelet version and that fixed the problem.

@sbskas
Author

sbskas commented May 15, 2019

@tallclair We did a normal upgrade of the cluster by upgrading the control plane first and then the kubelets. The node authorizer is configured in the apiserver (--authorization-mode=Node,RBAC).

@liggitt
Member

liggitt commented May 15, 2019

As @tallclair mentioned, runtimeclass read permissions are already included in the system:node role.

Is the User "system:node:node" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope error message in this issue's description exactly what you are seeing?

Do you have your full kube-apiserver invocation?

@liggitt
Member

liggitt commented May 15, 2019

We were able to fix the bug by adding a subjects entry to the system:node ClusterRoleBinding.

That is definitely not recommended. That lets any node modify any other node.

@sbskas
Author

sbskas commented May 16, 2019

@liggitt: At the time, we found this workaround to keep our cluster working. If I read the Kubernetes docs correctly, the system:nodes group binding to the system:node ClusterRole is not necessary either.
The apiserver is run like this:
```
kube-apiserver
--advertise-address=192.168.0.50
--allow-privileged=true
--authorization-mode=Node,RBAC
--client-ca-file=/etc/kubernetes/pki/ca.crt
--enable-admission-plugins=NodeRestriction
--enable-bootstrap-token-auth=true
--etcd-cafile=/etc/pki/tls/certs/etcd-ca.pem
--etcd-certfile=/etc/pki/tls/certs/server.pem
--etcd-keyfile=/etc/pki/tls/private/server.pem
--etcd-servers=https://server1:2379,https://server2:2379,https://server3:2379
--feature-gates=NodeLease=true
--insecure-port=0
--kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
--kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
--proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
--proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
--requestheader-allowed-names=front-proxy-client
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--secure-port=6443
--service-account-key-file=/etc/kubernetes/pki/sa.pub
--service-cluster-ip-range=10.160.0.0/16
--tls-cert-file=/etc/kubernetes/pki/apiserver.crt
--tls-private-key-file=/etc/kubernetes/pki/apiserver.key
```

@sbskas
Author

sbskas commented May 20, 2019

@liggitt Do you have all the pieces of information needed to make a diagnosis, or do you need more information? I'm not entirely confident about my answer to the kube-apiserver invocation question...

@liggitt
Member

liggitt commented May 21, 2019

@liggitt Do you have all the pieces of information needed to make a diagnosis, or do you need more information?

I'm not able to reproduce.

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-05-21T16:11:39Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-05-21T16:11:39Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}

$ kubectl auth can-i list csidrivers.storage.k8s.io --as=system:node:node --as-group=system:authenticated --as-group=system:nodes --all-namespaces
yes
$ kubectl auth can-i list runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:authenticated --as-group=system:nodes --all-namespaces
yes
```

that is consistent with the authorization rules for those resources granted to nodes.

@liggitt liggitt added triage/not-reproducible Indicates an issue can not be reproduced as described. kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 21, 2019
@seh
Contributor

seh commented May 21, 2019

I saw this happen in my cluster. I think that what was happening was that the kubelet running on an upgraded master was talking through our API server load balancer to older API servers on other nodes, and whatever was responsible for registering those new RBAC rules hadn't yet written them to etcd.

@liggitt
Member

liggitt commented May 21, 2019

the node authorizer works off a fixed permission set, not something persisted to etcd. a new node talking to an old apiserver would definitely hit this
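
A sketch of one way to spot such a mixed-version control plane, assuming kubeadm-style static-pod apiservers labelled component=kube-apiserver:

```sh
# List each kube-apiserver pod together with its image (and therefore version).
kubectl get pods -n kube-system -l component=kube-apiserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```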

@seh
Contributor

seh commented May 21, 2019

Well, scratch the etcd part, but does the rest of the scenario match what you saw, @sbskas?

@sbskas
Author

sbskas commented May 22, 2019

@seh Indeed. I'm confused: only one of our apiservers got upgraded, not the other.
I completed the upgrade of the second apiserver and removed the subjects entry. Everything went OK after this.
I'm closing the issue. Thanks for hitting the nail on the head. (I'll double-check my work next time before opening an issue.)
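
For completeness, a sketch of reverting the workaround once every apiserver is upgraded; this assumes the subjects list on the binding contains only the entry added earlier:

```sh
# Drop the subjects list added as a workaround (only safe once all apiservers
# run 1.14.x and no other subjects were added to the binding).
kubectl patch clusterrolebinding system:node --type='json' -p='[{"op":"remove","path":"/subjects"}]'
```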

@sbskas sbskas closed this as completed May 22, 2019
@seh
Contributor

seh commented May 22, 2019

One thing I can't remember: while this new kubelet is complaining like this, is it able to start up a new API server successfully? That is, is the new node stuck until the old API servers are no longer answering behind the load balancer?

@pacoxu
Member

pacoxu commented Jun 22, 2020

--authorization-mode=Node,RBAC

This works for me.
