ecs service creation fails when using newly created iam policy #2869

Closed
mvandiest opened this issue Jul 28, 2015 · 21 comments

@mvandiest

I have what appears to be a timing issue when attempting to create an IAM role/security policy and immediately use it as the iam_role of a new ECS service.

I get the following AWS error in Terraform:

* InvalidParameterException: Unable to assume role and validate the listeners configured on your load balancer.  Please verify the role being passed has the proper permissions.
    status code: 400, request id: []

If I specify a pre-existing iam role with an identical policy everything works fine.

I am using the following config:

provider "aws" {
  region = "${var.aws_region}"
}

resource "aws_ecs_cluster" "cluster" {
  name = "${var.exp_name}-${var.exp_version}"
}

resource "aws_ecs_service" "publicapi" {
  name = "publicapi"
  cluster = "${aws_ecs_cluster.cluster.id}"
  task_definition = "${aws_ecs_task_definition.publicapi.arn}"
  desired_count = 3
  iam_role = "${aws_iam_role.ecs_servicerole.arn}"

  load_balancer {
    elb_name = "${aws_elb.adminapi_elb.id}"
    container_name = "publicapi"
    container_port = 8081
  }
}

resource "template_file" "publicapi_task_definition" {
    filename = "${path.module}/task-definitions/publicapi.json.tpl"

    vars {
        version = "${var.exp_version}"
    }
}

resource "aws_ecs_task_definition" "publicapi" {
  family = "publicapi"
  container_definitions = "${template_file.publicapi_task_definition.rendered}"
}

resource "aws_iam_role_policy" "policy" {
    name = "policy"
    role = "${aws_iam_role.ecs_servicerole.id}"
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "ec2:Describe*",
        "ec2:AuthorizeSecurityGroupIngress"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
EOF
}

resource "aws_iam_role" "ecs_servicerole" {
    name = "ecs_servicerole"
    assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Create a new load balancer
resource "aws_elb" "adminapi_elb" {
  name = "adminapielb"
  availability_zones = ["us-east-1b", "us-east-1c"]

  listener {
    instance_port = 8081
    instance_protocol = "http"
    lb_port = 80
    lb_protocol = "http"
  }

  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 2
    timeout = 3
    target = "HTTP:8081/"
    interval = 30
  }

  cross_zone_load_balancing = true
  idle_timeout = 400
  connection_draining = true
  connection_draining_timeout = 400

  tags {
    Name = "adminapi_elb"
  }
}
@philp

philp commented Jul 29, 2015

I'm experiencing almost exactly the same problem detailed here.

@catsby
Member

catsby commented Aug 20, 2015

Hello – I believe you are correct, this is a timing issue. It takes a few seconds for permissions to propagate through AWS:

Important
After you create an IAM role, it may take several seconds for the permissions to propagate. If your first attempt to launch an instance with a role fails, wait a few seconds before trying again. For more information, see Troubleshooting Working with Roles in the Using IAM guide.

source: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html

Unfortunately the API doesn't give us any kind of status to base this on.
Can you confirm for me that a follow-up plan & apply (or just apply) is sufficient for things to go through?

@catsby added the waiting-response label Aug 20, 2015
@philp

philp commented Aug 21, 2015

I can confirm that in my scenario, a new apply does get things working as expected.

@mvandiest
Author

I know it's hacky, but would it make sense to add an artificial delay in execution for the policy dependent step? Seeing that the API gives you no feedback I don't see another option.

This is a pretty big deal for automated deployments. Running a failed step twice is not really something that should be happening in a CI system.
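
For anyone who wants to try the artificial-delay idea in configuration rather than in the provider, here is a minimal sketch. It assumes a `local-exec` provisioner running `sleep` on the machine that runs Terraform. The 10-second value is an arbitrary guess, not a documented propagation time, and the policy file path is hypothetical. The delay only helps if the ECS service is made to depend on this policy resource (for example with an explicit `depends_on`), because a resource is not considered created until its provisioners finish.

```hcl
# Sketch only: pause after the role policy is created so IAM has time to propagate.
resource "aws_iam_role_policy" "policy" {
  name   = "policy"
  role   = "${aws_iam_role.ecs_servicerole.id}"
  # Hypothetical path; the JSON would be the same policy document as in the report.
  policy = "${file("policies/ecs_service.json")}"

  # Assumption: 10 seconds is arbitrary; tune it or remove it as needed.
  provisioner "local-exec" {
    command = "sleep 10"
  }
}
```

This is a workaround, not a fix: a fixed sleep can still be too short, and it slows down every run in which the policy is created.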

@philp

philp commented Aug 21, 2015

Or would it be possible to poll the API after the role/policy has been created, until it can be described, at which point move on to the next step of the plan?

@catsby
Member

catsby commented Aug 21, 2015

> Or would it be possible to poll the API after the role/policy has been created, until it can be described, at which point move on to the next step of the plan?

I don't think an immediate follow-up describe would result in a failure; IIRC it's in the API, just not propagated to all the other AWS parts. But of course I'll try it out....

> I know it's hacky, but would it make sense to add an artificial delay in execution for the policy dependent step?

Might have to do this... I'll try the describe thing first though

@radeksimko removed the waiting-response label Aug 23, 2015
@radeksimko
Member

I did try to call Describe immediately after creating the IAM policy and the API replied as I expected (unfortunately), i.e. the IAM policy exists. It was not fully propagated at that time though.

Therefore I submitted #3061 which just retries ECS service create calls. It took about 2 secs when I was testing it (effectively 3 retries after 500ms).

@mvandiest
Copy link
Author

Thx @radeksimko

@cordoval

This problem still occurs for me.

InvalidParameterException: Unable to assume role ...

@radeksimko
Member

@cordoval This could be caused either by a naively low timeout (2 mins atm) or by strong inconsistency as described here: #3928

Would you mind creating a new issue & attaching a debug log (minus any secrets, of course)? Then we would at least know at which point the error occurred. The outcome can be either the solution described in the linked issue #3928 or a timeout increase.

@cordoval

I added a depends_on like in another ticket and it seems to behave better now; I get fewer errors. I actually forget the details now, once I get past a certain point. I think I will start associating errors with commits so that I can go back and reproduce them.

Thanks though for now.

@tiyberius

@cordoval When you added a depends_on, I'm assuming that you added it to the IAM role that gets assigned to the load balancer?

@cordoval

depends_on = ["aws_iam_role_policy.ecs_service_role_policy"]

on the aws_ecs_service resource block
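
Applied to the configuration from the original report, that would look roughly like the sketch below. The resource names come from the first comment. The reasoning in the code comment is an inference about Terraform's dependency graph, not something the maintainers confirmed in this thread: the service references only the role's ARN, so nothing orders it after the inline role policy unless the dependency is stated explicitly.

```hcl
resource "aws_ecs_service" "publicapi" {
  name            = "publicapi"
  cluster         = "${aws_ecs_cluster.cluster.id}"
  task_definition = "${aws_ecs_task_definition.publicapi.arn}"
  desired_count   = 3
  iam_role        = "${aws_iam_role.ecs_servicerole.arn}"

  # iam_role points at the role only, so without this line Terraform may
  # create the service before aws_iam_role_policy.policy has been attached.
  depends_on = ["aws_iam_role_policy.policy"]

  load_balancer {
    elb_name       = "${aws_elb.adminapi_elb.id}"
    container_name = "publicapi"
    container_port = 8081
  }
}
```

This only fixes the ordering. It does not remove the IAM propagation delay itself, so an occasional failure right after the policy is created is still possible.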

@tiyberius

Cool, thanks! And you said you were still getting errors, but just less frequently?

@cordoval

not anymore, not of this type at least.

@sheeley

sheeley commented Jan 14, 2016

I've been running terraform apply with this setup:

resource "template_file" "iam_elb_role" {
  template = "${file("policies/iam_elb_role.json")}"
  vars = {
    elb1 = "arn:aws:elasticloadbalancing:${var.region}:${var.account_id}:loadbalancer/${aws_elb.api.name}"
    elb2 = "arn:aws:elasticloadbalancing:${var.region}:${var.account_id}:loadbalancer/${aws_elb.ui.name}"
  }
}

resource "aws_iam_role_policy" "elb" {
    name = "test_policy"
    role = "${aws_iam_role.ecs_role.id}"
    policy = "${template_file.iam_elb_role.rendered}"
}

resource "aws_ecs_service" "api" {
  name = "lumen-api"
  cluster = "${aws_ecs_cluster.api.id}"
  task_definition = "${aws_ecs_task_definition.api.arn}"
  desired_count = 3
  iam_role = "${aws_iam_role.ecs_role.arn}"
  /*
  if this says "Unable to assume role and validate the listeners", it is likely a timeout:
  https://github.com/hashicorp/terraform/issues/2869
  not sure why it isn't actually fixed, given the bug was closed so long ago.
  */
  depends_on = ["aws_iam_role_policy.elb", "aws_s3_bucket_object.config"]

  load_balancer {
    elb_name = "${aws_elb.api.id}"
    container_name = "lumen-api"
    container_port = 80
  }
}

I'm running into these timeouts regularly. This file has been creating other resources that sometimes manage to increase the timeout to a point where it works, but often I see failures like the one mentioned above:

* aws_ecs_service.api: InvalidParameterException: Unable to assume role and validate the listeners configured on your load balancer.  Please verify the role being passed has the proper permissions.
    status code: 400, request id: ... 

Let me know how I can help continue to debug!

@radeksimko
Member

@sheeley this is (unfortunately) a known issue related to eventually-consistent IAM.
What you're describing is described already in #4375 I believe.

Hopefully #4447 and following PRs will address this.

@sheeley

sheeley commented Jan 25, 2016

@radeksimko thanks for the info!

@sheeley

sheeley commented Jan 25, 2016

@radeksimko Is it possible there's a separate issue? I've gone ahead and created my IAM role in a previous terraform run, so it already exists. When I run terraform plan, I see it is only trying to create 2 ECS services:

+ aws_ecs_service.api
    cluster:                                 "" => "arn:aws:ecs:us-east-1:{acct-id}:cluster/lumen-api"
    desired_count:                           "" => "3"
    iam_role:                                "" => "arn:aws:iam::{acct-id}:role/lumen_ecs_role"
    load_balancer.#:                         "" => "1"
    load_balancer.3516934612.container_name: "" => "lumen-api"
    load_balancer.3516934612.container_port: "" => "80"
    load_balancer.3516934612.elb_name:       "" => "lumen-api-elb"
    name:                                    "" => "lumen-api"
    task_definition:                         "" => "arn:aws:ecs:us-east-1:{acct-id}:task-definition/lumen-api:5"

+ aws_ecs_service.ui
    cluster:                                 "" => "arn:aws:ecs:us-east-1:{acct-id}:cluster/lumen-ui"
    desired_count:                           "" => "3"
    iam_role:                                "" => "arn:aws:iam::{acct-id}:role/lumen_ecs_role"
    load_balancer.#:                         "" => "1"
    load_balancer.2643330267.container_name: "" => "lumen-ui"
    load_balancer.2643330267.container_port: "" => "80"
    load_balancer.2643330267.elb_name:       "" => "lumen-ui-elb"
    name:                                    "" => "lumen-ui"
    task_definition:                         "" => "arn:aws:ecs:us-east-1:{acct-id}:task-definition/lumen-ui:1"

I continue to get the InvalidParameterException. However, when I simulate the role (lumen_elb_role_policy) through the AWS UI, I see all passing. Could there be some additional issue? Should I follow up with a new GH issue?

@sheeley

sheeley commented Jan 26, 2016

totally my fault. figured out a policy issue.

@ghost

ghost commented Apr 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 28, 2020