New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revision to IAM Policies created by Kops #3343
Revision to IAM Policies created by Kops #3343
Conversation
Ping @gambol99 @chrislovecnm This will need some thorough review to make sure I've not accidentally dropped any permissions for the legacy behaviour (I think I've captured it all), also maybe some SID renaming and refactoring to make it more readable. I've tested on a couple local clusters (creation, updates, rolling-updates, both with and without flag) and it seems fine so far. Maybe the Jenkins E2E test will pick something up! |
Not sure why the E2E tests failed, there's no log output. Will kick off again. /test pull-kops-e2e-kubernetes-aws |
E2E wth kops is not happy currently. Known issue that @justinsb is workin on |
/test pull-kops-e2e-kubernetes-aws |
d2d669f
to
89ea4d7
Compare
This is a biggy. Need to set aside some time to review. The conversation between @justinsb and I was to initially get this under a feature flag, or do you think the API settings that I think you are using are sufficient? |
/assign |
Hey @chrislovecnm! Yeah this will need some careful review, it's easy for a permission to slip by and potentially break something. I think when we initially discussed about this back in PR #3210 the plan was to have any future policy revisions wrapped around the Cluster Spec IAM Legacy flag, so I've tried to re-use that (it was already in use for the S3 and EC2 permissions). Do you think a different flag should be used instead? |
@KashifSaadat I think we can use your API element just fine :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start. This is going to need a lot of testing.
ELB
EBS
KMS
S3
What else does k-c-m play with?
pkg/model/iam/iam_builder.go
Outdated
@@ -14,6 +14,13 @@ See the License for the specific language governing permissions and | |||
limitations under the License. | |||
*/ | |||
|
|||
// TODO: We have a couple different code paths until with do lifecycles, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are these TODOs updated??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a little unsure about the reference to using lifecycles and refactor of s3 code, so kept your comment as-is from the previous PR. I'll update the second TODO as we're now doing some more restrictive behaviour with use of resource paths and condition keys (but still room for improvement).
pkg/model/iam/iam_builder.go
Outdated
@@ -48,6 +55,7 @@ func (p *IAMPolicy) AsJSON() (string, error) { | |||
} | |||
|
|||
type IAMStatementEffect string | |||
type IAMSid string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type SID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. All of the types / constants are being picked up by golint for stuttering (iam.IAMSid
etc) so I'll clean it up.
pkg/model/iam/iam_builder.go
Outdated
case kops.InstanceGroupRoleBastion: | ||
p, err = b.BuildAWSIAMPolicyBastion() | ||
if err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt
please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I've done that in a few places. Will update.
pkg/model/iam/iam_builder.go
Outdated
HostedZoneID string | ||
S3Bucket string | ||
ResourceARN *string | ||
CreateECRPerms bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we wiring this in? What is setting this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled this in from #2497 but I don't believe anything is actually making use of it yet (aside from me, setting KMSKeys
in the unit tests). Shall I take the extra fields out for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've cleaned up the unused fields, leaving in KMSKeys
, ResourceARN
& CreateECRPerms
(although those aren't set elsewhere on use of the struct yet) as we try to make use of these within the builder.
pkg/model/iam/iam_builder.go
Outdated
"ecr:DescribeRepositories", | ||
"ecr:ListImages", | ||
"ecr:BatchGetImage", | ||
"ec2:AllocateAddress", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to prune these. The kops installer needs a lot of these, but the master and nodes to do not. Take a look at https://gist.github.com/chrislovecnm/0e23d11903cc36b99ccf73a013d5ae56 and https://gist.github.com/chrislovecnm/a93941b4afbbd5afe746354af0cfddcc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take this function out based on the other comment below, we can raise as a separate PR and review it independently.
"autoscaling:GetAsgForInstance" | ||
], | ||
"Resource": [ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we limit on ASGs that are tagged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is read metadata for ASGs, so perhaps could have one statement with Sid "ReadAWSMetadataAllResources")
{ | ||
"Effect": "Allow", | ||
"Action": [ | ||
"s3:List*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we tighten and remove the *?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I'll update and test (I think I can remove this completely, will validate).
"kms:ReEncrypt*" | ||
], | ||
"Resource": [ | ||
"key-id-1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test by adding a different key for a ebs volume? This may be too restricting. I am not familiar with kms perms, we need to double check if we have too many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@faraazkhan do you know IAM kms perms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand this is a complete set to work with (and delegate use of ) a specific key. There are some other conditions that delegate use only to aws resources (ie to EC2 so that it can mount the volumes) but I have never got good information out of AWS for exactly what is required for a minimal set here.
{ | ||
"Effect": "Allow", | ||
"Action": [ | ||
"ec2:DescribeInstances" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DescribeRegion? Why do we need DescribeInstances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was introduced in #2497 with a comment that Protokube makes a DescribeInstances call. If it's not needed at all we can drop the Node EC2 policy completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check. We may only need for gossip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This permission is still required, I got the below error when dropping it for compute nodes:
kubelet[8706]: error: failed to run Kubelet: could not init cloud provider "aws": error finding instance i-12345678: error listing AWS instances: UnauthorizedOperation: You are not authorized to perform this operation
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/cluster.spec", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/config", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/instancegroup/*", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/pki/issued/*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which pki keys do we actually need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there are limits on policy sizes, I would suggest you consider this role to own the whole iam-builder-test.k8s.local/* prefix - particularly for read actions. If there are specific prefixes under here that might hold secrets that a node should not be able to read, then you could use a DENY on those, or move that to a different prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in PR #3158 I initially trialled this with open access and then explicit DENY rules for certain resources, however it became a bit more complex to manage. The /pki/private/
path can differ on content depending on your deployment configuration and there would be more rules to add in as the nodes shouldn't have access to most of these files, hence why I switched to this method.
The open access to /pki/issued/
was also brought up in #3158, but deemed acceptable (#3158 (review)).
@KashifSaadat thanks for your patience with this PR, it is a bear. How are u testing this? We really need to figure out if we can run e2e against this so that we can get better testing coverage. Otherwise, we can install a bunch of example apps. |
@KashifSaadat let me know if you want to go over these changes on a video call. This is a complicated enough PR that we may want to review on a call. |
"ec2:DescribeVolumes" | ||
], | ||
"Resource": [ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Chris, Grant Gardner here as requested. I would not worry too much about the "read metadata" permissions such as ec2:Describe* in a first pass at a stricter policy. Would concentrate on permissions that can read end user data (eg S3) or can create/update/delete AWS resources.
I have just commented on the *strict permissions json, if there is anything else you'd like me to look at please let me know.
"ec2:ModifyInstanceAttribute" | ||
], | ||
"Resource": [ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases you can use the aws:RequestTags condition to force the creation of a resource (Volume/Security group) to contain a specific tag value (which you use in the next statement to ensure that you are only working on your own resources)
and then move the Delete* actions into the "with tagged resources" below.
with Create/ModifyTags you can use a ForAllValues:StringEquals condition on Aws::TagKeys to limit which tags this role allows you to play with (ie only the tags associated with Kubernetes)
"autoscaling:GetAsgForInstance" | ||
], | ||
"Resource": [ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is read metadata for ASGs, so perhaps could have one statement with Sid "ReadAWSMetadataAllResources")
"autoscaling:UpdateAutoScalingGroup" | ||
], | ||
"Resource": [ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AutoScaling now has resource level permissions so instead of tags you also have the option of using a prefix on your AutoScalingGroup and LaunchConfiguration names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be good to add in. Would you be able to point me in the right direction of retrieving the autoscaling & launch configuration names and account id to populate the resources list?
"kms:ReEncrypt*" | ||
], | ||
"Resource": [ | ||
"key-id-1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand this is a complete set to work with (and delegate use of ) a specific key. There are some other conditions that delegate use only to aws resources (ie to EC2 so that it can mount the volumes) but I have never got good information out of AWS for exactly what is required for a minimal set here.
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/cluster.spec", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/config", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/instancegroup/*", | ||
"arn:aws:s3:::kops-tests/iam-builder-test.k8s.local/pki/issued/*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there are limits on policy sizes, I would suggest you consider this role to own the whole iam-builder-test.k8s.local/* prefix - particularly for read actions. If there are specific prefixes under here that might hold secrets that a node should not be able to read, then you could use a DENY on those, or move that to a different prefix.
@KashifSaadat PR needs rebase |
Thanks for the detailed feedback guys! I'll go through it later and try to address as much as I can :) |
89ea4d7
to
6a99c76
Compare
I've made some changes based on the feedback and updated tests accordingly (hopefully the E2E will still be clean). A call would be good to discuss! Maybe something to chat about on the next Friday kops call? |
5395b84
to
8c9d1af
Compare
409ee5f
to
7d99013
Compare
/retest |
@chrislovecnm do you know if there's any known issues with the E2E tests at the moment? My latest update didn't include any policy changes (just unused code cleanup) and still works locally, but getting failures in the E2E tests now. It looks like the E2E tests are missing some log files ( |
/test pull-kops-e2e-kubernetes-aws |
Looks like it passed |
@KashifSaadat how do we want to proceed with this? File some issues for the things we did not address and iterate? With these changes I assume this is opt in changes. We will not be changing the default behavior? |
7d99013
to
2e6b7ee
Compare
Hi @chrislovecnm, yeah it would be good if we could get this one in and then separately pick up the remaining tasks, such as optional ASG permissions, IAM SSL certs, validation around ECR permissions. I've aimed to keep the existing permission set / behaviour under the legacy flag, which will be enabled by default for existing clusters so users shouldn't be impacted (unless they explicitly set the flag to false in their cluster spec). |
Can you open some issues on those remaining items? We will need release notes and probably more documentation ;) /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chrislovecnm The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue |
Based off of the work done by @chrislovecnm in PR #2497.
This PR tightens down the IAM policies created for Master & Node instance groups. The Cluster Spec
IAMSpec.Legacy
flag is used to control application of stricter policy rules, which is defaulted to true for existing clusters (to limit potential regression impact), and false for new cluster creation.