
Upgrading cluster running cri-o from 1.13.5 to 1.14.1 yields errors on runtimeclasses access #77801

Closed
sbskas opened this issue May 13, 2019 · 21 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question. sig/auth Categorizes an issue or PR as relevant to SIG Auth. triage/not-reproducible Indicates an issue can not be reproduced as described.

Comments

@sbskas

sbskas commented May 13, 2019

What happened:
After upgrading a working cluster from 1.13.5 to 1.14.1 using the kubeadm upgrade apply command, the kubelet started emitting the following error messages:
kubelet[2635221]: E0513 09:55:54.814239 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:node:node" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope
kubelet[2635221]: E0513 09:55:55.537722 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:node:node" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope

What you expected to happen:
The kubelet should have been able to access runtimeclasses in node.k8s.io and csidrivers in storage.k8s.io.

How to reproduce it (as minimally and precisely as possible):
We did not try to reproduce the bug. Basically: create a 1.13.5 cluster with the ceph-csi driver and upgrade it to 1.14.1.

Anything else we need to know?:
We were able to fix the bug by adding a subjects entry to the system:node ClusterRoleBinding:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
```
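
For reference, a sketch of how that subjects entry could be added from the command line; this assumes the default kubeadm binding name system:node and only reproduces the workaround described above, not a recommendation:

```sh
# Sketch of the workaround above: add the system:nodes group as a subject on
# the system:node ClusterRoleBinding (assumes the binding has no subjects yet).
kubectl patch clusterrolebinding system:node --type='json' \
  -p='[{"op":"add","path":"/subjects","value":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:nodes"}]}]'
```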

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g: cat /etc/os-release):
    NAME=Fedora
    VERSION="30 (Thirty)"
    ID=fedora
    VERSION_ID=30
    VERSION_CODENAME=""
    PLATFORM_ID="platform:f30"
    PRETTY_NAME="Fedora 30 (Thirty)"
    ANSI_COLOR="0;34"
    LOGO=fedora-logo-icon
    CPE_NAME="cpe:/o:fedoraproject:fedora:30"
    HOME_URL="https://fedoraproject.org/"
    DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
    SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="Fedora"
    REDHAT_BUGZILLA_PRODUCT_VERSION=30
    REDHAT_SUPPORT_PRODUCT="Fedora"
    REDHAT_SUPPORT_PRODUCT_VERSION=30
    PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"

  • Kernel (e.g. uname -a):
    Linux core1 5.0.13-300.fc30.x86_64 #1 SMP Mon May 6 00:39:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@sbskas sbskas added the kind/bug Categorizes issue or PR as related to a bug. label May 13, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 13, 2019
@sbskas
Author

sbskas commented May 13, 2019

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 13, 2019
@mattjmcnaughton
Contributor

Thanks for reporting, @sbskas!

I think (but could definitely be wrong) that @kubernetes/sig-auth-bugs will be best able to assist you with this.

Please feel free to undo my tagging if I'm incorrect :)

/remove-sig node
/sig auth

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 13, 2019
@liggitt
Member

liggitt commented May 13, 2019

/cc @tallclair

do kubelets list runtimeclasses?

@sbskas
Author

sbskas commented May 13, 2019

I don't know. They print those errors every second unless the system:node clusterrolebinding is patched.

One important point I forgot to mention in the description is that I'm running cri-o as the runtime.

@sbskas sbskas changed the title Upgrading cluster to 1.14.1 yields errors on runtimeclasses access Upgrading cluster running cri-o from 1.13.5 to 1.14.1 yields errors on runtimeclasses access May 13, 2019
@liggitt
Member

liggitt commented May 14, 2019

```go
if utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) {
    klet.runtimeClassManager = runtimeclass.NewManager(kubeDeps.KubeClient)
}
```

this sets up a runtimeclass informer in the kubelet, which needs list/watch

@tallclair, what is the runtimeclass used for in the kubelet? is it required? if so, how are we testing the functionality this enables? can we add test coverage along with the required permissions? if not, can we remove the runtimeclassmanager from the kubelet?
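
A quick way to confirm whether those list/watch permissions are in place is to impersonate the node identity from the error message above (the node name "node" comes from that log line; adjust to a real node name):

```sh
# Check the permissions the runtimeclass informer needs, impersonating the node.
kubectl auth can-i list runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:nodes
kubectl auth can-i watch runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:nodes
```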

@mrtylerzhou

I also came across a similar problem. Somehow, after a node was upgraded to 1.14.1 (the other nodes are 1.13.1), that node shifted back and forth between Ready and NotReady. When checking the kubelet log, I got similar log messages.

@tallclair
Member

what is the runtimeclass used for in the kubelet?

The Kubelet needs to look up the runtime handler for the runtimeclass associated with a pod.

is it required?

Yes, if pods are using RuntimeClasses.
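
For illustration, a minimal sketch of the case being described: a RuntimeClass plus a pod that references it, which forces the kubelet to resolve the handler. The handler name myhandler is a placeholder for whatever the CRI runtime (cri-o here) actually has configured:

```sh
# Illustrative only; "myhandler" must match a handler configured in the runtime.
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: example
handler: myhandler
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  runtimeClassName: example
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
EOF
```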

if so, how are we testing the functionality this enables?

Cluster E2Es only test the failure cases, and they don't distinguish a failure to look up the runtimeclass from a failure to run with the handler (I'll fix this): https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/runtimeclass.go

Node E2Es test the end-to-end configured handler:

```go
// This test requires that the PreconfiguredRuntimeHandler has already been set up on nodes.
It("should run a Pod requesting a RuntimeClass with a configured handler [NodeFeature:RuntimeHandler]", func() {
    // The built-in docker runtime does not support configuring runtime handlers.
    framework.SkipIfContainerRuntimeIs("docker")
    rcName := createRuntimeClass(f, "preconfigured-handler", PreconfiguredRuntimeHandler)
    pod := createRuntimeClassPod(f, rcName)
    expectPodSuccess(f, pod)
})
```
(https://testgrid.k8s.io/sig-node-containerd#node-e2e-features&include-filter-by-regex=RuntimeClass)

can we add test coverage along with the required permissions?

Do you mean fixing the tests to fail when the permissions are missing? Yes.

Permissions are added here as part of RBAC bootstrap - any idea why they wouldn't be included in this case?

```go
// RuntimeClass
if utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) {
    nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("get", "list", "watch").Groups("node.k8s.io").Resources("runtimeclasses").RuleOrDie())
}
```
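
One way to verify that this bootstrapped rule actually landed in the cluster is to inspect the system:node ClusterRole served by the apiserver being queried (a sketch, assuming the standard bootstrap role name):

```sh
# Confirm the runtimeclasses rule is present in the bootstrapped system:node role.
kubectl get clusterrole system:node -o yaml | grep -B 3 -A 6 runtimeclasses
```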

@tallclair
Member

@sbskas and @mrtylerzhou - Did you update the apiserver before updating the nodes? Are you running the node authorizer?

@mrtylerzhou

@tallclair In fact, I did not update the apiserver; I don't even know why the node got updated. I downgraded the node's kubelet version and that fixed the problem.

@sbskas
Author

sbskas commented May 15, 2019

@tallclair We did a normal upgrade of the cluster by upgrading the control plane first and then the kubelets. The node authorizer is configured in the apiserver (--authorization-mode=Node,RBAC).

@liggitt
Member

liggitt commented May 15, 2019

As @tallclair mentioned, runtimeclass read permissions are already included in the system:node role.

Is the User "system:node:node" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope error message in this issue's description exactly what you are seeing?

Do you have your full kube-apiserver invocation?

@liggitt
Member

liggitt commented May 15, 2019

We were able to fix the bug by adding a subjects entry to the system:node ClusterRoleBinding.

That is definitely not recommended. That lets any node modify any other node.

@sbskas
Author

sbskas commented May 16, 2019

@liggitt: At the time, we found this workaround to keep our cluster working. If I read the Kubernetes docs correctly, the system:nodes group binding to the system:node ClusterRole is not necessary either.
The apiserver is run like this:
```
kube-apiserver
--advertise-address=192.168.0.50
--allow-privileged=true
--authorization-mode=Node,RBAC
--client-ca-file=/etc/kubernetes/pki/ca.crt
--enable-admission-plugins=NodeRestriction
--enable-bootstrap-token-auth=true
--etcd-cafile=/etc/pki/tls/certs/etcd-ca.pem
--etcd-certfile=/etc/pki/tls/certs/server.pem
--etcd-keyfile=/etc/pki/tls/private/server.pem
--etcd-servers=https://server1:2379,https://server2:2379,https://server3:2379
--feature-gates=NodeLease=true
--insecure-port=0
--kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
--kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
--proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
--proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
--requestheader-allowed-names=front-proxy-client
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--secure-port=6443
--service-account-key-file=/etc/kubernetes/pki/sa.pub
--service-cluster-ip-range=10.160.0.0/16
--tls-cert-file=/etc/kubernetes/pki/apiserver.crt
--tls-private-key-file=/etc/kubernetes/pki/apiserver.key
```

@sbskas
Author

sbskas commented May 20, 2019

@liggitt Do you have all the pieces of information needed to make a diagnosis, or do you need more information? I'm not entirely confident about my answer to the kube-apiserver invocation question...

@liggitt
Member

liggitt commented May 21, 2019

@liggitt Do you have all the pieces of information needed to make a diagnosis, or do you need more information?

I'm not able to reproduce.

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-05-21T16:11:39Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-05-21T16:11:39Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}

$ kubectl auth can-i list csidrivers.storage.k8s.io --as=system:node:node --as-group=system:authenticated --as-group=system:nodes --all-namespaces
yes
$ kubectl auth can-i list runtimeclasses.node.k8s.io --as=system:node:node --as-group=system:authenticated --as-group=system:nodes --all-namespaces
yes
```

that is consistent with the authorization rules for those resources granted to nodes.

@liggitt liggitt added triage/not-reproducible Indicates an issue can not be reproduced as described. kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 21, 2019
@seh
Contributor

seh commented May 21, 2019

I saw this happen in my cluster. I think that what was happening was that the kubelet running on an upgraded master was talking through our API server load balancer to older API servers on other nodes, and whatever was responsible for registering those new RBAC rules hadn't yet written them to etcd.

@liggitt
Member

liggitt commented May 21, 2019

the node authorizer works off a fixed permission set, not something persisted to etcd. a new node talking to an old apiserver would definitely hit this
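
A sketch of one way to spot such a mixed-version control plane, assuming kubeadm-style static-pod apiservers labelled component=kube-apiserver:

```sh
# List each kube-apiserver pod together with its image (and therefore version).
kubectl get pods -n kube-system -l component=kube-apiserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```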

@seh
Contributor

seh commented May 21, 2019

Well, scratch the etcd part, but does the rest of the scenario match what you saw, @sbskas?

@sbskas
Author

sbskas commented May 22, 2019

@seh Indeed. I'm confused: only one of our apiservers got upgraded, not the other.
I completed the upgrade of the second apiserver and removed the subjects entry. Everything went OK after this.
I'm closing the issue. Thanks for hitting the nail on the head. (I'll double-check my work next time before opening an issue.)
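
For completeness, a sketch of reverting the workaround once every apiserver is upgraded; this assumes the subjects list on the binding contains only the entry added earlier:

```sh
# Drop the subjects list added as a workaround (only safe once all apiservers
# run 1.14.x and no other subjects were added to the binding).
kubectl patch clusterrolebinding system:node --type='json' -p='[{"op":"remove","path":"/subjects"}]'
```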

@sbskas sbskas closed this as completed May 22, 2019
@seh
Contributor

seh commented May 22, 2019

One thing I can't remember: while this new kubelet is complaining like this, is it able to start up a new API server successfully? That is, is the new node stuck until the old API servers are no longer answering behind the load balancer?

@pacoxu
Member

pacoxu commented Jun 22, 2020

--authorization-mode=Node,RBAC

This works for me.
