Set IMDSv2 on by default for nodes and apiservers #11329
Conversation
I recall we initially defaulted this to enabled but reverted it in #10655. Can we address the issues that led to the revert?
The revert was because it caused any pod that piggybacked on the instance's IAM role to fail. This should not happen on API server or worker nodes, though. I hope it is rare that people add extra privileges to normal nodes rather than relying on KIAM (whose servers should run on the control plane). It also caused breakage on EBS CSI nodes, because they rely on talking to the metadata service. That breakage was technically caused by a removal on my part. I.e. we should have addressed the issues now.
Typically we only enable changes like this for new clusters (and sometimes only for new clusters with k8s minor versions >= the kops minor version). Any reason not to follow that same pattern here?
I want to play the security card here, but I am fine if we constrain this to a given k8s version too. I do think we really need this on by default on normal nodes, though. @johngmyers what do you think?
After thinking about this for some time, I think it would be better to set this only when a warm pool is involved.
What makes Flatcar break? Does it handle routing differently than other distros?
We should have secure defaults, at least for new clusters.
Flatcar breaks because of the outdated AWS SDK: flatcar/ignition#19.
Honestly, to me it looks like in this case we should use IMDSv2 by default and Flatcar users should explicitly disable it. That this hasn't been fixed in Flatcar sounds odd, and I would prefer this not holding every other distro back from being secure by default. We can add an additional callout here: https://kops.sigs.k8s.io/operations/images/#flatcar
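For Flatcar users, the explicit opt-out would presumably be a small addition to the InstanceGroup spec; a sketch based on the `instanceMetadata.httpTokens` field discussed in this thread:

```yaml
# InstanceGroup spec fragment (sketch): allow IMDSv1 again by
# making session tokens optional instead of required.
spec:
  instanceMetadata:
    httpTokens: optional
```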
Same story with Debian 9. It will most likely break many obscure workflows or setups where people have been lazy to update.
I agree that we should create secure clusters by default. Does Debian 10 work? I'm fine with documenting that certain distros (particularly older ones) on new clusters will need an InstanceGroup field overridden. We would need to do something similar for userdata generation for Bottlerocket and Windows anyway, so it seems like a reasonable pattern to follow.
Corner cases don't detract from changing the default for new clusters to something more secure. They do detract from changing the default for existing clusters. If there is an upcoming change, such as ServiceAccountIAM, which would significantly reduce the number of affected corner cases, that can argue for deferring the change of defaults.
docs/releases/1.21-NOTES.md (outdated)

```
httpTokens: optional
```

This change does not affect control plane nodes.
Maybe best moved into the main message.
Debian 10 does work. I can't say that I really agree that these are corner cases, but I accept that the vote is to go forward like this.
I could accept exempting new clusters that are on old k8s versions. Do we need to resurrect #8531?
In office hours we decided that we can enable this for new k8s 1.22 clusters. This addresses the Debian 9 issue because Debian 9 is already EOL, leaving only Flatcar compatibility unknown. The decision: all node roles, token required, hop limit 1.
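Expressed as InstanceGroup settings, that office-hours decision would presumably amount to something like the following (a sketch of the effective values, not literal generated output):

```yaml
# Sketch of the agreed default for new k8s >= 1.22 clusters:
# require IMDSv2 session tokens and limit responses to one hop,
# so non-host-network pods cannot reach the metadata service.
spec:
  instanceMetadata:
    httpTokens: required
    httpPutResponseHopLimit: 1
```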
Implemented accordingly. Do we have any tests that use k8s 1.22 yet?
We have a k8s-latest job, but apparently its history is too large for testgrid and we need to reduce the retention on it. We also have an arm64 job that you could observe.
Both of those cilium tests use latest. Should be good enough to see if this breaks anything.
docs/releases/1.22-NOTES.md (outdated)

```
spec:
  instanceMetadata:
    httpPutResponseHopLimit: 1
```
Is this line necessary?
No. Removed.
docs/releases/1.22-NOTES.md (outdated)

```
httpTokens: optional
```

This change does not affect control plane nodes.
I think this change will affect control plane nodes also.
Changed the wording so it is more clear.
The problem is not the wording but the intention.
As @rifelpet was saying above, conclusion from office hours was:
all node roles, token required, hop limit 1
We cannot implement this on master roles though. Not until IRSA is used for all components and enabled by default.
upup/pkg/fi/cloudup/new_cluster.go (outdated)

```go
if cluster.IsKubernetesGTE("1.22") {
	bastionGroup.Spec.InstanceMetadata = &api.InstanceMetadataOptions{
		HTTPPutResponseHopLimit: fi.Int64(3),
```
Any special reason for using 3 as hop limit for Bastions?
No. Seems like a mistake.
Bastion, nodes, and api servers get limit of 1

API servers tend to run pods requiring metadata access. The hop limit depends on the CNI, but all should work with a limit of 3.
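Under that scheme, an API server instance group would presumably carry the higher hop limit so that pods behind the CNI's extra network hop can still reach the metadata service. A sketch of what such a spec might look like:

```yaml
# Sketch: API server instance group keeps IMDSv2 required but
# allows up to 3 hops, so pods (which sit behind the CNI's
# veth/bridge hop) can still reach 169.254.169.254.
spec:
  instanceMetadata:
    httpTokens: required
    httpPutResponseHopLimit: 3
```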
7eea8f8
to
b3a60d3
Compare
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hakman. The full list of commands accepted by this bot can be found here. The pull request process is described here.
On AWS, kOps will enable [Instance Metadata Service Version 2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html) by default with a max hop limit of 1 on new clusters that use Kubernetes 1.22. This means that any container running in the cluster will be unable to connect to the instance metadata service _unless_ the container is running with `hostNetwork: true`. This will increase security by default, but may break some types of workloads. In order to revert to the old behavior, add the following to the InstanceGroup:

```
spec:
  instanceMetadata:
    httpTokens: optional
```

This change only affects dedicated API server nodes and worker nodes. It does not affect control plane nodes.
I think this release note needs updating to match the final behavior in this PR. It does affect control plane nodes but with a hop limit of 3 (along with apiserver nodes)
This should mitigate security concerns around things like enabling lifecycle hooks by default.