
Master IAMRolePolicy too long with long cluster names. #12606

Closed
BenWolstencroft opened this issue Oct 25, 2021 · 14 comments · Fixed by #12700
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@BenWolstencroft

BenWolstencroft commented Oct 25, 2021

/kind bug

1. What kops version are you running?
1.21.2

2. What Kubernetes version are you running?
1.21.4

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes

5. What happened after the commands executed?
IAMRolePolicy/. Example error: error reading actual policy document: policy size was 11655. Policy cannot exceed 10240 bytes.

6. What did you expect to happen?
Update to succeed

7. Please provide your cluster manifest.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-05-22T10:30:56Z"
  name: redacted
spec:
  api:
    dns: {}
    loadBalancer:
      class: Classic
      type: Internal
  authorization:
    rbac: {}
  awsLoadBalancerController:
    enabled: true
  certManager:
    defaultIssuer: redacted
    enabled: true
  channel: stable
  cloudLabels:
    BudgetCode: Kube
    ProjectCode: Kube-Testing-Standalone-Cluster
  cloudProvider: aws
  clusterAutoscaler:
    balanceSimilarNodeGroups: false
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
    newPodScaleUpDelay: 0s
    scaleDownDelayAfterAdd: 10m0s
    scaleDownUtilizationThreshold: "0.5"
    skipNodesWithLocalStorage: true
    skipNodesWithSystemPods: true
  configBase: s3://redacted/redacted
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::redacted:policy/Kubernetes-Cluster-Systems-MasterNodePolicy-1DXUGOJ7A8T6E
    node:
    - arn:aws:iam::redacted:policy/Kubernetes-Cluster-Systems-NodePolicy-K9TNB85U5ELB
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    oidcClientID: systems-kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://redacted/
    oidcUsernameClaim: email
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.21.4
  masterInternalName: api.internal.redacted
  masterPublicName: api.redacted
  metricsServer:
    enabled: true
  networkCIDR: 10.30.4.0/22
  networkID: redacted
  networking:
    calico:
      crossSubnet: true
  nodeTerminationHandler:
    enableSQSTerminationDraining: true
    enabled: true
    managedASGTag: aws-node-termination-handler/managed
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  sshKeyName: redacted
  subnets:
  - cidr: 10.30.4.0/23
    egress: External
    id: redacted
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.30.6.0/23
    egress: External
    id: redacted
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

9. Anything else do we need to know?

I believe this is due to a combination of the number of addons we have enabled and the length of the cluster name: the cluster name is included in many of the policy statements to limit the resources that the permissions are granted against.

We have an identical cluster with a shorter name (40 characters long) which works, versus 43 characters for this one, which fails.

Maybe one way to prevent addons from growing the base policy would be to separate the addon permissions out into their own policy?
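
For illustration only (a rough sketch, not how kops or AWS actually measures the limit): once you have a rendered policy document locally, counting how often the cluster name is repeated gives a feel for how much each extra character in the name costs. The file and cluster names below are placeholders.

grep -oF 'my.fortythree.character.cluster.example.com' masters-policy.json | wc -l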

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 25, 2021
@BenWolstencroft BenWolstencroft changed the title Kops Master IAMRolePolicy too long with long cluster names. Master IAMRolePolicy too long with long cluster names. Oct 25, 2021
@BenWolstencroft
Author

Note: while this gives the same result as #12558, it's not the same problem. We are not using additional inline permissions for our own policy additions; we use managed policies for that. Our configuration is failing purely because of the addons that are enabled combined with a long cluster name.

@rifelpet
Member

I agree we should put addons in their own policy (or policies). There is a limit of 10 (increasable to 20) attached IAM policies per IAM role, so we'll need to be cognizant of that. We could start with all addons in one separate policy, which should be sufficient for now given that the control plane policy itself is fairly large.
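
For anyone wondering how close they already are to that limit, listing the managed policies attached to the control plane role is a quick check (a sketch; the role name assumes the kops default of masters.<cluster-name>):

aws iam list-attached-role-policies --role-name masters.<cluster-name>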

IRSA is another valid workaround and IMO the solution we should be encouraging here, given that each (addon) service account has its own IAM role and policy.
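
For reference, a minimal sketch of what enabling IRSA can look like in the cluster spec, assuming the serviceAccountIssuerDiscovery and useServiceAccountExternalPermissions fields available in recent kops releases (the S3 location is a placeholder and must be publicly readable):

spec:
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://some-publicly-readable-bucket/oidc
    enableAWSOIDCProvider: true
  iam:
    useServiceAccountExternalPermissions: true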

@olemarkus
Member

This bug was filed against kops 1.21. Can you try 1.22.1?

@BenWolstencroft
Author

@olemarkus - same with 1.22.1

@olemarkus
Member

1.22 has a test that specifically covers this. The cluster name in that test is not that long, but the margin is fairly large.
The cluster spec used can be seen here: https://github.com/kubernetes/kops/blob/v1.22.1/tests/integration/update_cluster/many-addons/in-v1alpha2.yaml
and the resulting policy for master nodes can be seen here: https://github.com/kubernetes/kops/blob/v1.22.1/tests/integration/update_cluster/many-addons/data/aws_iam_role_policy_masters.minimal.example.com_policy

Using 1.22.1, it would be interesting to see how your policy differs from the one above. The one above is just shy of 8k, and the maximum policy size is 10k, so with regard to the cluster name there should indeed be a decent margin.
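
As a rough comparison, you can check the size of that reference policy yourself (a sketch; the raw URL is just the file linked above served via raw.githubusercontent.com):

curl -sL https://raw.githubusercontent.com/kubernetes/kops/v1.22.1/tests/integration/update_cluster/many-addons/data/aws_iam_role_policy_masters.minimal.example.com_policy | wc -c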

@BenWolstencroft
Author

@olemarkus - is there a way for me to get the resultant policy? The dry run shows me the diff of the changes it's trying to make, but not the complete document.

@rifelpet
Member

Your best bet might be to run kops update cluster --yes -v 9 and find the PutRolePolicy API call in the output. That should include the full document.
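
Something along these lines should do it (a sketch; the log file name is arbitrary):

kops update cluster --yes -v 9 2>&1 | tee kops-update.log
grep -n PutRolePolicy kops-update.log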

@olemarkus
Member

Or run with Terraform output, which leaves the policy document locally with a similar location and name.
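
For example (a sketch; the output directory is arbitrary, and the data/ file naming matches the integration test output linked above):

kops update cluster --target terraform --out ./tf-out
ls ./tf-out/data/aws_iam_role_policy_masters.*_policy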

@mattoz0

mattoz0 commented Nov 4, 2021

I am also getting hit by this issue. The worst part is that editing the cluster configuration doesn't change the error message: error reading actual policy document: policy size was 11224. Policy cannot exceed 10240 bytes.

It always seems to be 11224 bytes. The issue happens regardless of whether I have an inline policy or not.

@olemarkus
Member

Hey. Same as above: we'd need the generated policy to be able to investigate this further.

@mattoz0

mattoz0 commented Nov 9, 2021

Not sure if this is the correct policy; it doesn't seem to be 11224 bytes. I changed all the details in the policy so they don't reflect my actual setup, but I made sure to keep the same character count.

It seems like the most appropriate one based on the error message:

error running task "IAMRolePolicy/masters.kubernetes.example1234.dev" (9m58s remaining to succeed): error reading actual policy document: policy size was 11224. Policy cannot exceed 10240 bytes.

This was exported using kops update cluster ${cluster} --yes --target terraform

aws_iam_role_policy_masters.kubernetes.example1234.dev_policy

{
  "Statement": [
    {
      "Action": "ec2:AttachVolume",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/KubernetesCluster": "kubernetes.example1234.dev",
          "aws:ResourceTag/k8s.io/role/master": "1"
        }
      },
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": [
        "s3:Get*"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::kops-state-store-test/kubernetes.example1234.dev/*"
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::kops-state-store-test/kubernetes.example1234.dev/backups/etcd/main/*"
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::kops-state-store-test/kubernetes.example1234.dev/backups/etcd/events/*"
    },
    {
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:ListBucket",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::kops-state-store-test"
      ]
    },
    {
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets",
        "route53:GetHostedZone"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:route53:::hostedzone/Z00118143D7LEO5A8IZU4"
      ]
    },
    {
      "Action": [
        "route53:GetChange"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:route53:::change/*"
      ]
    },
    {
      "Action": [
        "route53:ListHostedZones",
        "route53:ListTagsForResource"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    },
    {
      "Action": "ec2:CreateTags",
      "Condition": {
        "StringEquals": {
          "ec2:CreateAction": [
            "CreateVolume",
            "CreateSnapshot"
          ]
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:snapshot/*"
      ]
    },
    {
      "Action": "ec2:CreateTags",
      "Condition": {
        "StringEquals": {
          "ec2:CreateAction": [
            "CreateVolume",
            "CreateSnapshot"
          ]
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:snapshot/*"
      ]
    },
    {
      "Action": "ec2:DeleteTags",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/KubernetesCluster": "kubernetes.example1234.dev"
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:snapshot/*"
      ]
    },
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:DeleteSecurityGroup",
        "ec2:RevokeSecurityGroupIngress",
        "elasticloadbalancing:ModifyTargetGroupAttributes",
        "elasticloadbalancing:ModifyRule",
        "elasticloadbalancing:DeleteRule",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:RemoveTags"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/elbv2.k8s.aws/cluster": "kubernetes.example1234.dev"
        }
      },
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DescribeAccountAttributes",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstances",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRegions",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumesModifications",
        "ec2:DescribeVpcs",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:DescribeRepositories",
        "ecr:GetAuthorizationToken",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:ListImages",
        "elasticloadbalancing:CreateRule",
        "elasticloadbalancing:DescribeListenerCertificates",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancerAttributes",
        "elasticloadbalancing:DescribeLoadBalancerPolicies",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeRules",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetGroupAttributes",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "iam:GetServerCertificate",
        "iam:ListServerCertificates",
        "kms:DescribeKey",
        "kms:GenerateRandom"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:DeleteRoute",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteVolume",
        "ec2:DetachVolume",
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyVolume",
        "ec2:RevokeSecurityGroupIngress",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
        "elasticloadbalancing:AttachLoadBalancerToSubnets",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DetachLoadBalancerFromSubnets",
        "elasticloadbalancing:ModifyListener",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:ModifyTargetGroup",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
        "elasticloadbalancing:SetLoadBalancerPoliciesOfListener"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/KubernetesCluster": "kubernetes.example1234.dev"
        }
      },
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "ec2:CreateSecurityGroup",
        "ec2:CreateVolume",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:CreateLoadBalancerPolicy",
        "elasticloadbalancing:CreateTargetGroup"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/KubernetesCluster": "kubernetes.example1234.dev"
        }
      },
      "Effect": "Allow",
      "Resource": "*"
    }
  ],
  "Version": "2012-10-17"
}

@BenWolstencroft
Author

BenWolstencroft commented Nov 9, 2021

Hi, apologies for the slow response here. My kops update cluster --yes -v 9 output did not contain the term PutRolePolicy. I have uploaded the entire log output to the following gist (it's long):

https://gist.github.com/BenWolstencroft/bb15bc888c92893facd97006fad49c53

I've redacted as much sensitive information as I could find in the log.

@BenWolstencroft
Author

BenWolstencroft commented Nov 9, 2021

@olemarkus @rifelpet @mattoz0 - I've had some success here. Looking through the logs, it appears the issue is not when trying to write a new IAMRolePolicy, but when trying to read back the current one to establish the current state and generate a change!

I modified the contents of the current inline policy via the AWS console to have just a single Action *, Resource * statement (dangerous, I know, but I needed a policy that would work and was short), then reran kops update cluster --yes, and it succeeded and overwrote it with the new, updated, correct policy (the same policy I get when I export for Terraform).
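
For anyone else hitting the same read-back failure, the CLI equivalent of that workaround looks roughly like this (a hedged sketch; role and policy names assume the kops defaults, and the wide-open policy only exists long enough for kops to overwrite it):

cat > /tmp/tiny-policy.json <<'EOF'
{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
EOF
aws iam put-role-policy --role-name masters.<cluster-name> --policy-name masters.<cluster-name> --policy-document file:///tmp/tiny-policy.json
kops update cluster --yes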

@rifelpet
Member

rifelpet commented Nov 9, 2021

Yes, that lines up with the originally reported error message error reading actual policy document .... I have a theory that this is due to kOps not ignoring whitespace when evaluating a policy document's size. I have a fix in #12700; if you're able to test that and confirm it works, that would be great.
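
If you want to sanity-check that theory yourself, AWS documents that whitespace is not counted against the inline policy quota, so comparing the raw and whitespace-stripped sizes of the current policy should show the gap (a rough sketch; role and policy names assume the kops defaults, and stripping whitespace this way also removes spaces inside strings, so it slightly undercounts):

aws iam get-role-policy --role-name masters.<cluster-name> --policy-name masters.<cluster-name> --query PolicyDocument --output json | wc -c
aws iam get-role-policy --role-name masters.<cluster-name> --policy-name masters.<cluster-name> --query PolicyDocument --output json | tr -d '[:space:]' | wc -c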
