Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression - etcd datadir permissions not set on etcd grow #2256

Closed
dannysauer opened this issue Aug 18, 2020 · 26 comments
Closed

Regression - etcd datadir permissions not set on etcd grow #2256

dannysauer opened this issue Aug 18, 2020 · 26 comments
Assignees
Labels
area/etcd area/security kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/backlog Higher priority than priority/awaiting-more-evidence.
Milestone

Comments

@dannysauer
Copy link

dannysauer commented Aug 18, 2020

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.14+

Environment:
N/A

What happened?

With the release of etcd 3.4.10, the datadir permissions now need to be 0700 or etcd won't start. There was an issue (#1308) where perms were set on join before starting the etcd container as a security control, overriding the default behavior of creating a non-existant directory mode 0755. However, in a cleanup, that necessary os.mkdirall was removed. This was transparently ignored for several releases since etcd didn't complain, but with etcd-io/etcd#11798 (in 3.4.10), the new etcd cluster on the second node does not start.

I'm pretty sure this will break anyone on k8s 1.14 or newer who upgrades to etcd 3.4.10 or newer without first fixing the /var/lib/etcd perms.

What you expected to happen?

/var/lib/etcd (or whatever the var is set to) should be set to 0700. :)

How to reproduce it (as minimally and precisely as possible)?

Join a second master node, then ls -ld /var/lib/etcd on the node. With an etcd 3.4.10 or newer runtime

Anything else we need to know?

It's worth explicitly noting that the first control plane node added works fine. It's just the second and subsequent nodes which were handled in a separate location in the code which exhibit the problem.

@neolit123
Copy link
Member

/var/lib/etcd (or whatever the var is set to) should be set to 0700. :)

hm, isn't this what kubeadm is doing by default?
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/init/etcd.go#L90

@neolit123 neolit123 added area/etcd priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Aug 18, 2020
@neolit123
Copy link
Member

However, in a cleanup, that necessary os.mkdirall was removed

do you mean this line?
kubernetes/kubernetes@6bbed9f#diff-0960dc0bb7c3e7113a0daa027856b8f9L467

it is still here https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/init/etcd.go#L90

here is the version mapping between kubeadm(k8s) version and etcd:
https://github.com/kubernetes/kubernetes/blob/3f585835d0a38462bb928454245ec01e834e53a0/cmd/kubeadm/app/constants/constants.go#L430-L437

so i guess this will happen if one tries to use kubeadm <1.14 with etcd server > 3.4.10?
this is not recommended by the kubeadm team, but also 1.14 is out of support at this point (1.16 to go out of support soon too).

@dannysauer
Copy link
Author

There were two mkdirs - the one at bootstrap time (which is still there) and a second one which ran when additional nodes were run. The referenced cleanup SR removed the second one. Which makes sense in the context of the observed behavior; additional control-plane nodes have the /var/lib/etcd directory created with mode 0755 (the kubernetes default), and looking at some older clusters I have around, have been doing that for quite some time. Assuming I identified the right PR, it's been happening since 1.14, and still happening on the 1.18 cluster I stood up this morning to test the new etcd. It would've also been happening before #1308, but those systems are solidly out of support by now.

so i guess this will happen if one tries to use kubeadm <1.14 with etcd server > 3.4.10?
Possibly, but the greater concern is seeing it happen with >= 1.14 and etcd server >= 3.4.10

@neolit123
Copy link
Member

neolit123 commented Aug 19, 2020

ok, so originally a similar fix was added here in the function that creates the static pods for 1.14-pre:
kubernetes/kubernetes@836f413#diff-c4574f3918f016aeb3b32f5d9cb62ed6

later that code was moved (by ereslibre):
kubernetes/kubernetes@981bf19

and then this refactor that you linked indeed omitted it.
kubernetes/kubernetes#73452
that PR was very noisy and had 140+ comments and we might have missed this.

so yes, we should not let the kubelet create the path with 755, and include the following:

// pre-create the etcd data directory with the right permissions
if err := os.MkdirAll(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to create etcd directory %q", cfg.Etcd.Local.DataDir)
}
// if the directory already existed and the above call was a no-op, ensure it has the right permissions
// or otherwise etcd 3.4.10+ will fail:
// https://github.com/etcd-io/etcd/pull/11798
if err := os.Chmod(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to chmod etcd directory %q", cfg.Etcd.Local.DataDir)
}

at this line:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go#L133

thanks for reporting it.

@neolit123
Copy link
Member

/help
/area security
/milestone v1.20
/priority important-longterm

i don't see this as a critical bug for 1.19 during code freeze.

given there is a chance an older version of kubeadm could be used to run etcd 3.4.10 we might as well backport the change to the support skew (1.19, 1.18, 1.17) after 1.19 releases.

@k8s-ci-robot
Copy link
Contributor

@neolit123:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/area security
/milestone v1.20
/priority important-longterm

i don't see this as a critical bug for 1.19 during code freeze.

given there is a chance an older version of kubeadm could be used to run etcd 3.4.10 we might as well backport the change to the support skew (1.19, 1.18, 1.17) after 1.19 releases.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Aug 19, 2020
@k8s-ci-robot k8s-ci-robot added area/security help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 19, 2020
@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Aug 19, 2020
@neolit123
Copy link
Member

neolit123 commented Aug 19, 2020

one problem though is if the user is upgrading an existing cluster that has the 755 data dir.
in that case we should also ensure to chmod it due to etcd-io/etcd#11798

EDIT: ... but i guess this also has to be done on upgrade. removing "help" as this requires more changes.
/remove-help

@neolit123 neolit123 removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 19, 2020
@frittenlab
Copy link

I also just run into this problem trying to upgrade an existing etcd cluster from 3.4.9 to 3.4.11. The etcd node refuses to start with:

Aug 19 11:03:43 etcd0-dev etcd[26085]: cannot access data directory: directory "/var/lib/etcd","drwxr-xr-x" exist without desired file permissions

@neolit123
Copy link
Member

@frittenlab hi, what k8s/kubeadm version are you using?

looks like we might be picking the latest etcd in 1.19, which means we must patch kubeadm too:
kubernetes/kubernetes#93450
https://groups.google.com/d/msgid/kubernetes-sig-api-machinery/CAOXj30tEQzy0uSq9Oc4dX2UfSR%3DsCaqWwwUzymLrDxLirvHJ%2BQ%40mail.gmail.com?utm_medium=email&utm_source=footer

also:

  • backport the change to older versions (which ones?)
  • add a note in the kubeadm troubleshooting guide

/milestone v1.19

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.20, v1.19 Aug 19, 2020
@neolit123 neolit123 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 19, 2020
@neolit123 neolit123 self-assigned this Aug 19, 2020
@frittenlab
Copy link

@neolit123 We are currently using kubeadm / k8s 1.18.8. We use an external etcd cluster which we provisioned ourselves on dedicated localstorage vm's. We added the etcd cluster to kubeadm via the kubeadm-config configmap. I was able to upgrade the etcd cluster after changing the directory permissions of

/var/lib/etcd

to 700

@neolit123 neolit123 added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Aug 19, 2020
@neolit123
Copy link
Member

@frittenlab
just to double check, i'm assuming you did not follow this guide:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/
IIUC, if using that guide (kubeadm init) the directory should have been created with 0700.

@dannysauer dannysauer changed the title Reversion - etcd datadir permissions not set on etcd grow Regression - etcd datadir permissions not set on etcd grow Aug 19, 2020
@neolit123
Copy link
Member

PR for master is here:
kubernetes/kubernetes#94102

@frittenlab
Copy link

@neolit123 We did follow this guide. We created the cluster over 2 years ago. I can't recall having to change the mode of the data dir then. I added the option to out ansible role for the etcd cluster. And everything works fine now. This was not necessary with previous etc versions.

@neolit123
Copy link
Member

ok, in the above PR i've added a chmod 700 in kubeadm init even if the directory already exists, just in case.

@frittenlab
Copy link

Thanks that is a good idea :)

@liggitt
Copy link
Member

liggitt commented Aug 19, 2020

looks like we might be picking the latest etcd in 1.19, which means we must patch kubeadm too:

that's only updating the client library, not the server version

@neolit123
Copy link
Member

that's only updating the client library, not the server version

ok, i still think that including the fix in 1.19 is a good idea to avoid the case where 1.19 kubeadm is used to deploy the newer (problematic) etcd.

@liggitt
Copy link
Member

liggitt commented Aug 19, 2020

ok, i still think that including the fix in 1.19 is a good idea to avoid the case where 1.19 kubeadm is used to deploy the newer (problematic) etcd.

agree

@liggitt
Copy link
Member

liggitt commented Aug 19, 2020

actually, I think @jingyih is working on resolving the regression in etcd ... not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved

@neolit123
Copy link
Member

neolit123 commented Aug 19, 2020

@jingyih in which etcd version are you planning to resolve this?
would you recommend that kubeadm pre-creates this directory with 0700 for all OSes?

if a fix in etcd is coming, we can just add a note about this in the kubeadm troubleshooting guide and move the pending PR to 1.20.
the PR itself can continue pre-creating the directory with 0700 (similar is done for /var/lib/kubelet) but possibly omit the chmod, as it overrides custom permissions by the user.

@dannysauer
Copy link
Author

dannysauer commented Aug 19, 2020

The etcd behavior simply illuminated a security regression in kubeadm, which is why #1308 was referenced in the report. There's no fix to etcd which undoes that regression (unless etcd changes the datadir permissions itself, which is even more sketchy than kubeadm updating the old incorrect perms).

Admittedly, world-readable perms at the top level of the datadir isn't a huge deal in practice, as etcd creates the member subdir mode 0700 and thus there shouldn't be any information disclosure risk. But leaving the current behavior is going to cause people following the CIS kubernetes benchmark and similar hardening guides with a bit of heartburn when they have to fix it on about every control plane node. Never mind how many people are doing so, given this has gone unnoticed for years. 🤦

I think kubernetes manifests will stay on 3.4.9 until this is resolved

FWIW, staying on etcd 3.4.9 means retaining a security hole fixed in 3.4.10 - CVE-2020-15106. That's why I was updating an etcd package to begin with. :) Also a difficult-to-exploit security hole, but a published CVE none the less.

@neolit123
Copy link
Member

neolit123 commented Aug 19, 2020

The etcd behavior simply illuminated a security regression in kubeadm, which is why #1308 was referenced in the report. There's no fix to etcd which undoes that regression (unless etcd changes the datadir permissions itself, which is even more sketchy than kubeadm updating the old incorrect perms).

yes, that is why i think kubernetes/kubernetes#94102 is still viable. but while enforcing 700 on existing folders makes sense, it also overrides customized permissions such as 0770.

Admittedly, world-readable perms at the top level of the datadir isn't a huge deal in practice, as etcd creates the member subdir mode 0700 and thus there shouldn't be any information disclosure risk.

with the member sub-directory being protected with 0700, it is indeed not that much of security issue.

FWIW, staying on etcd 3.4.9 means retaining a security hole fixed in 3.4.10 - CVE-2020-15106. That's why I was updating an etcd package to begin with. :) Also a difficult-to-exploit security hole, but a published CVE none the less.

we rarely update the default etcd server version (constant) in prior kubeadm releases but it has happened before.
yet, for 1.19 we might be out of time to release a new etcd server version at this point.

kubeadm makes it possible to deploy a custom etcd image/version, but in later kubeadm versions this comes with the trade off that etcd upgrades will be skipped when doing "kubeadm upgrade".

@dannysauer
Copy link
Author

while enforcing 700 on existing folders makes sense, it also overrides customized permissions such as 0770.

Yeah, I'd either leave that out, or check for root/root:0755 and only change in that case. Surely no one's consciously choosing that. :)

@jingyih
Copy link

jingyih commented Aug 21, 2020

actually, I think @jingyih is working on resolving the regression in etcd ... not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved

I discussed with @spzala about this. We are thinking about providing a warning message instead of enforcing the file permission:
etcd-io/etcd#12242

@neolit123
Copy link
Member

updated the PR to only create the directory if it does not exist on init/join-control-plane, but not chmod it.
kubernetes/kubernetes#94102

/remove-priority important-soon
/priority backlog
lowering priority since the fix in etcd was applied and 1.20 will include 3.4.13+.
(there is also discussion to backport the etcd version to 1.19)

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 3, 2020
@neolit123
Copy link
Member

kubernetes/kubernetes#94102 merged
newer etcd was backported to 1.19 too:
kubernetes/kubernetes#94536

closing, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/etcd area/security kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/backlog Higher priority than priority/awaiting-more-evidence.
Projects
None yet
Development

No branches or pull requests

6 participants