Upgrading cluster running cri-o from 1.13.5 to 1.14.1 yields errors on runtimeclasses access #77801
Comments
/sig node
Thanks for reporting @sbskas! I think (but could definitely be wrong) that @kubernetes/sig-auth-bugs will be best able to assist you with this. Please feel free to undo my tagging if I'm incorrect :) /remove-sig node
/cc @tallclair do kubelets list runtimeclasses?
I don't know. They print those errors every second without patching the clusterrolebinding/system-nodes. One important point I forgot to mention in the description is that I'm running cri-o as the runtime.
kubernetes/pkg/kubelet/kubelet.go Lines 659 to 661 in 5f72845
This sets up a RuntimeClass informer in the kubelet, which needs list/watch. @tallclair, what is the RuntimeClass used for in the kubelet? Is it required? If so, how are we testing the functionality this enables? Can we add test coverage along with the required permissions? If not, can we remove the RuntimeClassManager from the kubelet?
I also came across a similar problem. Somehow, after a node upgrade to 1.14.1 (the other nodes are 1.13.1), the node shifts its state from Ready to NotReady and back. Checking the kubelet log shows similar log messages.
The Kubelet needs to look up the runtime handler for the runtimeclass associated with a pod.
Yes, if pods are using RuntimeClasses.
Cluster E2Es only test the failure cases, and they don't distinguish failures to look up the RuntimeClass from failures to run with the handler (I'll fix this): https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/runtimeclass.go
Node E2Es test the end-to-end configured handler: kubernetes/test/e2e/common/runtimeclass.go Lines 59 to 67 in 0e224ad
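For context on what the kubelet is looking up: a pod opts into a RuntimeClass by name, and the kubelet resolves it to a CRI runtime handler. A minimal sketch of the objects involved, using the v1beta1 API from the error logs (the class name, handler, and image here are illustrative, not from this cluster):

```yaml
# Illustrative RuntimeClass mapping to a handler configured in the CRI runtime.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: example-class      # name is an assumption for this sketch
handler: runc              # must match a runtime handler known to CRI-O
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  runtimeClassName: example-class   # the kubelet resolves this via its informer
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.1
```

If the kubelet cannot list/watch runtimeclasses, that resolution is what fails.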
Do you mean fixing the tests to fail when the permissions are missing? Yes. Permissions are added here as part of RBAC bootstrap; any idea why they wouldn't be included in this case?
kubernetes/plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go Lines 178 to 181 in 0e224ad
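Expressed as standalone RBAC rules, the read access nodes need for the two failing list/watch calls would look roughly like this (a sketch of the equivalent, not the bootstrap Go code itself):

```yaml
# Sketch: read access matching the resources in the kubelet's error messages.
rules:
- apiGroups: ["node.k8s.io"]
  resources: ["runtimeclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["csidrivers"]
  verbs: ["get", "list", "watch"]
```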
@sbskas and @mrtylerzhou - Did you update the apiserver before updating the nodes? Are you running the node authorizer?
@tallclair In fact, I did not update the apiserver; I don't even know why the node got updated. I downgraded the node's kubelet version, and that fixed the problem.
@tallclair we did a normal upgrade of the cluster by upgrading the control plane first and then the kubelets. The node authorizer is configured in the apiserver (--authorization-mode=Node,RBAC).
As @tallclair mentioned, runtimeclass read permissions are already included in the default node permissions. Do you have your full kube-apiserver invocation?
That is definitely not recommended. That lets any node modify any other node.
@liggitt: At this time, we found this workaround to keep our cluster working. If I read the Kubernetes docs correctly, having the clusterrole system:nodes is not necessary either.
--advertise-address=192.168.0.50
@liggitt Do you have all the information needed to make a diagnosis, or do you need more? I'm not exactly confident I understand the kube-apiserver invocation question...
I'm not able to reproduce:

    $ kubectl version
    $ kubectl auth can-i list csidrivers.storage.k8s.io --as=system:node:node --as-group=system:authenticated --as-group=system:nodes --all-namespaces

That is consistent with the authorization rules for those resources granted to nodes.
I saw this happen in my cluster. I think that what was happening was that the kubelet running on an upgraded master was talking through our API server load balancer to older API servers on other nodes, and whatever was responsible for registering those new RBAC rules hadn't yet written them to etcd.
The node authorizer works off a fixed permission set, not something persisted to etcd. A new node talking to an old apiserver would definitely hit this.
Well, scratch the etcd part, but does the rest of the scenario match what you saw, @sbskas?
@seh indeed. I'm confused; only one of our apiservers got upgraded, not the other one.
One thing I can't remember: while this new kubelet is complaining like this, is it able to start up a new API server successfully? That is, is the new node stuck until the old API servers are no longer answering behind the load balancer?
This works for me. |
What happened:
After upgrading a working cluster from 1.13.5 to 1.14.1 using the kubeadm upgrade apply command, the kubelet started emitting the following error messages:
kubelet[2635221]: E0513 09:55:54.814239 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:node:node" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope
kubelet[2635221]: E0513 09:55:55.537722 2635221 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: csidrivers.storage.k8s.io is forbidden: User "system:node:node" cannot list resource "csidrivers" in API group "storage.k8s.io" at the cluster scope
What you expected to happen:
kubelet should have been able to access the runtimeclasses in node.k8s.io and csidrivers in storage.k8s.io.
How to reproduce it (as minimally and precisely as possible):
We did not try to reproduce the bug. Basically: create a 1.13.5 cluster with the ceph-csi driver and upgrade it to 1.14.1.
Anything else we need to know?:
We were able to fix the bug by adding a subjects entry to the system:node ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
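With the subjects entry in place, the grant can be verified with the same kind of check used earlier in the thread (this requires a running cluster; the node name `node` is illustrative):

```shell
# Confirm a node identity can list RuntimeClasses after the fix.
kubectl auth can-i list runtimeclasses.node.k8s.io \
  --as=system:node:node \
  --as-group=system:authenticated --as-group=system:nodes \
  --all-namespaces
```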
Environment:
Kubernetes version (use
kubectl version
):Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
OS (e.g:
cat /etc/os-release
):NAME=Fedora
VERSION="30 (Thirty)"
ID=fedora
VERSION_ID=30
VERSION_CODENAME=""
PLATFORM_ID="platform:f30"
PRETTY_NAME="Fedora 30 (Thirty)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:30"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f30/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=30
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=30
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
Kernel (e.g.
uname -a
):Linux core1 5.0.13-300.fc30.x86_64 #1 SMP Mon May 6 00:39:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux