
Overly restrictive permissions in v2.4.3 policy AWSLoadBalancerControllerIAMPolicy #2785

Closed
timharsch opened this issue Aug 30, 2022 · 20 comments
Labels
lifecycle/rotten, triage/needs-investigation

Comments

@timharsch

timharsch commented Aug 30, 2022

Describe the bug

When attempting to build my ingress resource with the aws-load-balancer-controller I saw the following error when describing the resource:

Warning  FailedDeployModel  54m   ingress  Failed deploy model due to UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message:  cfUle6L---REDACTED---cF9vVm9Nq6XZICy8Glpi 

Which I then decoded like so:

aws sts decode-authorization-message --encoded-message cfUle6L---REDACTED---cF9vVm9Nq6XZICy8Glpi | jq | sed 's/[\\]"/"/g'

and then copied the DecodedMessage into an editor and formatted it for reading. Once I could read the message, I deduced that AmazonEKSLoadBalancerControllerRole was being denied on a long set of ec2:CreateTags operations it was trying to perform.
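(As an aside, the same pretty-printed output can be produced without the sed step by extracting just the DecodedMessage field and piping it through jq; the shell variable below is only a placeholder.)

# Decode and pretty-print the authorization failure message
ENCODED="cfUle6L---REDACTED---cF9vVm9Nq6XZICy8Glpi"
aws sts decode-authorization-message --encoded-message "$ENCODED" \
    --query DecodedMessage --output text | jq .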

I solved this by updating the AWSLoadBalancerControllerIAMPolicy and changing the overly restrictive permissions section, which looks like this:

        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:security-group/*",
            "Condition": {
                "StringEquals": {
                    "ec2:CreateAction": "CreateSecurityGroup"
                },
                "Null": {
                    "aws:RequestTag/elbv2.k8s.aws/cluster": "false"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Resource": "arn:aws:ec2:*:*:security-group/*",
            "Condition": {
                "Null": {
                    "aws:RequestTag/elbv2.k8s.aws/cluster": "true",
                    "aws:ResourceTag/elbv2.k8s.aws/cluster": "false"
                }
            }
        },

After noticing that those sections somewhat duplicate each other, I simplified them to:

        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Resource": "*"
        },

and waited for the next k8s reconcile loop to occur (every 15 minutes). I then got the next permissions issue in the ingress resource ingress.networking.k8s.io/alb-ingress:

Warning  FailedDeployModel  2m36s  ingress  Failed deploy model due to AccessDenied: User: arn:aws:sts::0123456789:assumed-role/AmazonEKSLoadBalancerControllerRole/1661879092752206807 is not authorized to perform: elasticloadbalancing:AddTags on resource: arn:aws:elasticloadbalancing:us-east-1:0123456789:targetgroup/ebc2b01a-42a75cf07e6b68b008e/8eac39c029d24cb2 because no identity-based policy allows the elasticloadbalancing:AddTags action:

which I solved by changing the following portion of the policy file from:

        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:SetWebAcl",
                "elasticloadbalancing:ModifyListener",
                "elasticloadbalancing:AddListenerCertificates",
                "elasticloadbalancing:RemoveListenerCertificates",
                "elasticloadbalancing:ModifyRule"
            ],
            "Resource": "*"
        }

TO:

        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:SetWebAcl",
                "elasticloadbalancing:ModifyListener",
                "elasticloadbalancing:AddListenerCertificates",
                "elasticloadbalancing:RemoveListenerCertificates",
                "elasticloadbalancing:ModifyRule",
                "elasticloadbalancing:AddTags"
            ],
            "Resource": "*"
        }

adding the necessary elasticloadbalancing:AddTags permission, and then waited the 15 minutes to see the controller set up a working load balancer.
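(Side note: rather than waiting out the ~15-minute reconcile interval, restarting the controller deployment should trigger an immediate reconcile; the deployment name below assumes the default Helm install in kube-system.)

# Restart the controller pods to force an immediate reconcile
kubectl -n kube-system rollout restart deployment aws-load-balancer-controller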

My suggestion for the policy file would be to simplify the CreateTags permissions as I did; it's just tags, after all, and there is precedent in the file for wildcarding other permissions whose operations are more of a security concern than tagging.
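For anyone who wants to apply a similar edit, one way to push a modified document as the new default version of the managed policy is sketched below (account ID and file name are placeholders):

# Upload the edited policy document as the new default version.
# A managed policy can hold at most 5 versions, so an old version may
# need to be removed first (aws iam delete-policy-version).
aws iam create-policy-version \
    --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam_policy.json \
    --set-as-default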

Environment

  • AWS Load Balancer controller version === 2.4.3
  • Kubernetes version === 1.22
  • Using EKS (yes/no), if so version? === 0.109.0

Additional Context:

@M00nF1sh
Collaborator

M00nF1sh commented Aug 31, 2022

@timharsch
by default the permissions should be sufficient for the LB controller to create AWS API objects.
Did you have any non-trivial setup, such as reusing resources that already exist or upgrading from a version prior to v2.0.0? It would be good if you could share the CloudTrail events for the denied requests.

We want to provide minimal permissions by default; tags on AWS resources are indeed a security concern, as AWS supports tag-based authorization.
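For context, tag-based authorization means other policies in an account may gate actions on a resource tag, along the lines of the illustrative statement below (the action, tag key, and value are made up for the example); a principal that can freely create or delete tags could therefore change which resources such policies match.

        {
            "Effect": "Allow",
            "Action": "ec2:TerminateInstances",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/team": "platform"
                }
            }
        }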

@kishorj
Collaborator

kishorj commented Sep 14, 2022

@timharsch, I'm closing the issue. Feel free to reach out to us if you have further concerns.

@kishorj kishorj closed this as completed Sep 14, 2022
@micksabox

I also experienced a similar error to @timharsch and tried the posted solution.

My setup was a migration from a previous version of the ALB controller (v1.x). I migrated straight to v2.4.3 using the installation instructions.

The failure occurred right after this:
{"level":"info","ts":1663256153.7654002,"logger":"controllers.ingress","msg":"adding resource tags","resourceID":"sg-09cbf32ace9d38570","change":{"elbv2.k8s.aws/cluster":"osd-staging"}}
which, if you notice, is attempting to add the very tag that the CreateTags Condition checks to be non-null.

I had an existing load balancer in use. I also got stuck on one additional step: I had to remove this Condition

"Condition": {
                "Null": {
                    "aws:ResourceTag/elbv2.k8s.aws/cluster": "false"
                }
            }

from this statement.

{
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:ModifyLoadBalancerAttributes",
                "elasticloadbalancing:SetIpAddressType",
                "elasticloadbalancing:SetSecurityGroups",
                "elasticloadbalancing:SetSubnets",
                "elasticloadbalancing:DeleteLoadBalancer",
                "elasticloadbalancing:ModifyTargetGroup",
                "elasticloadbalancing:ModifyTargetGroupAttributes",
                "elasticloadbalancing:DeleteTargetGroup"
            ],
            "Resource": "*"
        },

@kishorj
Collaborator

kishorj commented Sep 15, 2022

@micksabox, which 1.x version were you on previously?

@micksabox

@kishorj I was using the latest 1.x version, v1.1.9.

@timharsch
Author

Sorry for the delay. Here is a redacted version of the decoded message. I shortened it to the pertinent parts and removed IDs specific to my environment. You can see the load balancer controller is tagging resources. I did not do a careful comparison of the resources here against those in the conditions, but I can see it is attempting to tag resources not covered by the permissions.

{
    "allowed": false,
    "explicitDeny": false,
    "matchedStatements":
    {
        "items":
        []
    },
    "failures":
    {
        "items":
        []
    },
    "context":
    {
        "principal":
        {
            "id": "AROA2C4-REDACTED-66374036",
            "arn": "arn:aws:sts::0123456789:assumed-role/AmazonEKSLoadBalancerControllerRole/166187REDACTED74036"
        },
        "action": "ec2:CreateTags",
        "resource": "arn:aws:ec2:us-east-1:0123456789:security-group/sg-094REDACTED0243a",
        "conditions":
        {
            "items":
            [
                { "key": "ec2:Vpc" },
                { "key": "0123456789:ingress.k8s.aws/cluster" },
                { "key": "aws:Resource" },
                { "key": "ec2:ResourceTag/kubernetes.io/ingress-name" },
                { "key": "ec2:ResourceTag/kubernetes.io/cluster-name" },
                { "key": "aws:Account" },
                { "key": "ec2:ResourceTag/kubernetes.io/namespace" },
                { "key": "ec2:ResourceTag/ingress.k8s.aws/cluster" },
                { "key": "ec2:SecurityGroupID" },
                { "key": "0123456789:ingress.k8s.aws/stack" },
                { "key": "aws:Region" },
                { "key": "aws:Service" },
                { "key": "0123456789:kubernetes.io/ingress-name" },
                { "key": "ec2:ResourceTag/ingress.k8s.aws/stack" },
                { "key": "aws:Type" },
                { "key": "ec2:Region" },
                { "key": "ec2:ResourceTag/ingress.k8s.aws/resource" },
                { "key": "0123456789:kubernetes.io/namespace" },
                { "key": "aws:ARN" },
                { "key": "0123456789:ingress.k8s.aws/resource" },
                { "key": "0123456789:kubernetes.io/cluster-name" },

I was doing a from-scratch build, not an upgrade. I build my clusters using the following eksctl template:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: CLUSTER_NAME
  region: REGION

vpc:
  id: "VPCID"
  cidr: "VPC_CIDR"
  subnets:
    public:
      AZA:
        id: "SUBA_ID"
        cidr: "SUBA_CIDR"
      AZB:
        id: "SUBB_ID"
        cidr: "SUBB_CIDR"

nodeGroups:
  - name: ng-1
    instanceType: INSTANCE_TYPE
    desiredCapacity: 2
    ssh: # use existing EC2 key
      publicKeyName: KEYNAME
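(For completeness, a template like the one above gets applied with roughly the following command; the file name is just an example.)

eksctl create cluster -f cluster.yaml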

@timharsch
Author

@micksabox can you reopen this issue? Or should I file another?

@micksabox

@micksabox can you reopen this issue? Or should I file another?

I'm not able to re-open, maybe you meant @kishorj

@kishorj
Collaborator

kishorj commented Sep 22, 2022

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Sep 22, 2022
@k8s-ci-robot
Contributor

@kishorj: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2022
@timharsch
Author

As far as I know, this is still an issue and should remain open until it can be addressed.

@agconti

agconti commented Jan 5, 2023

I'm encountering the same issue as @timharsch. Like him, I'm doing a fresh install on a new cluster.

My error message:

{
  "allowed": false,
  "explicitDeny": false,
  "matchedStatements": {
    "items": []
  },
  "failures": {
    "items": []
  },
  "context": {
    "principal": {
      "id": "AROA2WTHL5ZLXXGJ4XOEO:REDACTED:security",
      "arn": "arn:aws:sts::REDACTED:security:assumed-role/load-balancer-controller-qa/REDACTED:security"
    },
    "action": "ec2:CreateTags",
    "resource": "arn:aws:ec2:us-east-1:REDACTED:security-group/sg-0908332267840f3de",

   "//": "More omitted",
}

My terraform:

resource "helm_release" "aws_load_balancer_controller" {
  depends_on = [
    var.deployment_dependency,
  ]
  name       = "aws-load-balancer-controller"
  namespace  = "kube-system"
  chart      = "aws-load-balancer-controller"
  version          = "1.4.6"
  repository       = "https://aws.github.io/eks-charts"
  create_namespace = false

  set {
    name  = "clusterName"
    value = module.config.cluster_name
  }

  set {
    name  = "serviceAccount.create"
    value = true
  }

  set {
    name  = "serviceAccount.name"
    value = local.load_balancer_controller_irsa_service_account_name
  }

  set {
    name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.load_balancer_controller_irsa_role.iam_role_arn
  }
}

resource "kubernetes_ingress_v1" "main" {
  depends_on = [
    var.deployment_dependency,
    helm_release.aws_load_balancer_controller,
    module.load_balancer_controller_irsa_role
  ]
  wait_for_load_balancer = true

  metadata {
    name = "main"
    annotations = {
      "kubernetes.io/ingress.class"                                  = "alb"
      "alb.ingress.kubernetes.io/scheme"                             = "internet-facing"
      "alb.ingress.kubernetes.io/target-type"                        = "ip"
      "alb.ingress.kubernetes.io/tags"                               = "Environment=${var.environment}"
      "alb.ingress.kubernetes.io/certificate-arn"                    = join(",", var.ssl_cert_arns)
      "external-dns.alpha.kubernetes.io/hostname"                    = "www.${module.config.tech_domain_name}"
      "alb.ingress.kubernetes.io/listen-ports"                       = jsonencode([{ HTTP = 80 }, { HTTPS = 443 }])
      "alb.ingress.kubernetes.io/actions.ssl-redirect"               = "443"
      "alb.ingress.kubernetes.io/load-balancer-attributes"           = "idle_timeout.timeout_seconds=4000,routing.http2.enabled=true"
    }
  }

  spec {
    dynamic "rule" {
      for_each = toset(var.ingress_services)
      content {
        host = "${rule.value}.${module.config.tech_domain_name}"
        http {
          path {
            path = "/*"
            backend {
              service {
                name = rule.value
                port {
                  number = 80
                }
              }
            }
          }
        }
      }
    }
  }
}

module "load_balancer_controller_irsa_role" {
  source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.9.2"
  role_name                              = local.load_balancer_controller_irsa_role_name
  attach_load_balancer_controller_policy = true

  oidc_providers = {
    main = {
      provider_arn               = var.oidc_provider_arn
      namespace_service_accounts = [
        "kube-system:${local.load_balancer_controller_irsa_service_account_name}",
      ]
    }
  }
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 19.0"
  cluster_version = local.cluster_version
  cluster_name    = var.cluster_name
  subnet_ids      = var.private_subnets
  vpc_id          = var.vpc_id
  enable_irsa     = true
  tags            = local.tags
  cluster_endpoint_public_access = true
  cluster_endpoint_private_access = false
  node_security_group_enable_recommended_rules = true # <-- implements the correct 9443 sg

  # More omitted

}

Opening up the permissions like @timharsch suggested solved my issue. For others encountering this, here's how I did that.

# Temporary fix until this is solved: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2785
resource "aws_iam_policy" "aws_load_balancer_controller_temp_policy" {
  name        = "aws_load_balancer_controller_temp_policy"
  description = "Reduces overly restrictive policy the controller can operate effectively"

  policy = <<EOF
{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:SetWebAcl",
                "elasticloadbalancing:ModifyListener",
                "elasticloadbalancing:AddListenerCertificates",
                "elasticloadbalancing:RemoveListenerCertificates",
                "elasticloadbalancing:ModifyRule",
                "elasticloadbalancing:AddTags"
            ],
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
EOF
}

resource "aws_iam_role_policy_attachment" "aws_load_balancer_controller_temp_policy" {
  role       = module.load_balancer_controller_irsa_role.iam_role_name
  policy_arn = aws_iam_policy.aws_load_balancer_controller_temp_policy.arn
}

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2023
@kishorj
Collaborator

kishorj commented Jan 6, 2023

@agconti, @timharsch could you check whether your security groups have the following tags?

ingress.k8s.aws/resource: ManagedLBSecurityGroup
elbv2.k8s.aws/cluster: <cluster_name>
ingress.k8s.aws/stack: <namespace/name>

If these tags are not present on the security groups, then the SG was not created by the v2 release of this controller. The cluster tag gets added during SG creation, and the reference IAM policy allows tagging operations only on security groups carrying it.
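A quick way to check (the security group ID below is a placeholder):

# List the tags currently on the controller-managed security group
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
    --query 'SecurityGroups[0].Tags'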

If you had resources created by the v1 version of the controller, you need to grant the additional IAM permissions mentioned in the upgrade instructions (https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/install/iam_policy_v1_to_v2_additional.json). The v1 version of the controller uses the ingress.k8s.aws/cluster tag while the v2 version uses elbv2.k8s.aws/cluster, hence the additional permissions.
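For reference, a rough sketch of granting those additional permissions from the CLI (the policy and role names here are examples; adjust to your setup):

# Download the additional policy document from the upgrade instructions
curl -fsSL -o iam_policy_v1_to_v2_additional.json \
    https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/install/iam_policy_v1_to_v2_additional.json

# Create the policy and attach it to the controller's IAM role
aws iam create-policy \
    --policy-name AWSLoadBalancerControllerAdditionalIAMPolicy \
    --policy-document file://iam_policy_v1_to_v2_additional.json

aws iam attach-role-policy \
    --role-name AmazonEKSLoadBalancerControllerRole \
    --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/AWSLoadBalancerControllerAdditionalIAMPolicy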

@agconti

agconti commented Jan 9, 2023

@kishorj Yes, my security group doesn't have those tags. I'm deleting the old v1 controller and creating the v2 controller from scratch, so I'm not sure why it would be missing them. I'll try adding the additional IAM perms as you suggested.

@agconti

agconti commented Jan 9, 2023

@kishorj Thanks for your help! I tried adding the additional perms you linked. I'm able to create the ingress now, and it has the correct security group tags:

ingress.k8s.aws/resource: ManagedLBSecurityGroup
elbv2.k8s.aws/cluster: <cluster_name>
ingress.k8s.aws/stack: <namespace/name>

However, I'm still running into IAM permissions issues, specifically with elasticloadbalancing:SetSecurityGroups:

Failed deploy model due to AccessDenied: User: arn:aws:sts::REDACTED:assumed-role/load-balancer-controller-qa/REDACTED is not authorized to perform: elasticloadbalancing:SetSecurityGroups on resource: arn:aws:elasticloadbalancing:us-east-1:REDACTED:loadbalancer/app/9ec9d36b-default-main-ebd4/80b73ad69b0b69e7 because no identity-based policy allows the elasticloadbalancing:SetSecurityGroups action.

I'm surprised this is happening, given that my SG now has the tags needed by the IAM policy to use elasticloadbalancing:SetSecurityGroups. Is this a consequence of the Null condition on the policy? That is, when the role is first assumed by the controller, the tags are missing from the security groups until it adds them, so the condition resolves such that the role does not have permission to modify the security group. If so, my guess is that re-assuming the role would solve the issue. I tried this by restarting the deployment, but the pods got stuck in the terminating state.

This all seems to stem from an initially incorrectly tagged SG, so I tried adding the expected SG rule description, per the suggestion in the v1-to-v2 migration instructions, before deleting the v1 controller and creating the v2 controller from scratch.

aws --region $REGION ec2 update-security-group-rule-descriptions-ingress --cli-input-json "$(aws --region $REGION ec2 describe-security-groups --group-ids $SG_ID | jq '.SecurityGroups[0] | {DryRun: false, GroupId: .GroupId ,IpPermissions: (.IpPermissions | map(select(.FromPort==0 and .ToPort==65535) | .UserIdGroupPairs |= map(.Description="elbv2.k8s.aws/targetGroupBinding=shared"))) }' -M)"

But the command fails with:

An error occurred (MissingParameter) when calling the UpdateSecurityGroupRuleDescriptionsIngress operation: Either 'ipPermissions' or 'securityGroupRuleDescriptions' should be provided.

For reference, the output of the subcommand does contain 'IpPermissions'; it's just an empty array:

aws --region $REGION ec2 describe-security-groups --group-ids $SG_ID | jq '.SecurityGroups[0] | {DryRun: false, GroupId: .GroupId ,IpPermissions: (.IpPermissions | map(select(.FromPort==0 and .ToPort==65535) | .UserIdGroupPairs |= map(.Description="elbv2.k8s.aws/targetGroupBinding=shared"))) }' 
{
  "DryRun": false,
  "GroupId": "sg-0073604c5e64ea780",
  "IpPermissions": []
}

@kishorj
Collaborator

kishorj commented Jan 9, 2023

@agconti, the AWS CLI command is meant to update the existing SG ingress rules that the v1 controller added to your EC2 security group. If the permissions list is empty, either the SG is not the one attached to your EC2 instance, or it does not contain rules added by the v1 version of the controller.

The SetSecurityGroups errors imply the underlying AWS ALB resource doesn't have the expected tags. If you use the reference IAM policy, the ALB must have the tag elbv2.k8s.aws/cluster.

If you used the v1 controller, you must be on v1.1.3 or later before upgrading to the v2 controller; otherwise the AWS resource tags will not be updated accordingly.

In your case, you either need to update the tags on the underlying AWS resources OR provide controller permissions to access your existing resources.
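For example, to update the cluster tag on an existing ALB or target group (the resource ARN and cluster name below are placeholders):

# Add the cluster tag the reference IAM policy conditions on
aws elbv2 add-tags \
    --resource-arns arn:aws:elasticloadbalancing:us-east-1:<ACCOUNT_ID>:loadbalancer/app/<name>/<id> \
    --tags Key=elbv2.k8s.aws/cluster,Value=<cluster_name>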

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2023
@timharsch
Author

Just a quick note to say that I recently upgraded the controller in our environment to v2.5.1. I didn't see any changes to the IAM policy that would address the first problem I described, but I did not have the problem again after the upgrade. I noticed that the v2.5.1 policy file did contain fixes that addressed the second problem I described.

I think it is safe to close this ticket.
