runners-scale-up fails with 'AuthFailure.ServiceLinkedRoleCreationNotPermitted' #104

compiaffe · 2020-07-30T15:33:48Z

Summary

When following the README with the example configuration, and adjusting the GitHub App permissions as per #100 (comment), the scale-up lambda fails to create the EC2 instance due to AuthFailure.ServiceLinkedRoleCreationNotPermitted.

Steps to reproduce

Do step 1 of the GitHub App setup
Check out the terraform-aws-github-runner repo, cd into the example folder
Download the lambda zips
Create a terraform.tfvars file with the GitHub App credentials
Run terraform init && terraform apply
Trigger a build on GitHub

What is the current bug behavior?

The GitHub App sends the webhook, the webhook lambda forwards it, and the scale-up lambda throws this error:

...
ERROR	AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.
    at Request.extractError (/var/task/index.js:41424:35)
    at Request.callListeners (/var/task/index.js:47771:20)
    at Request.emit (/var/task/index.js:47743:10)
    at Request.emit (/var/task/index.js:18467:14)
    at Request.transition (/var/task/index.js:17801:10)
    at AcceptorStateMachine.runTo (/var/task/index.js:26145:12)
    at /var/task/index.js:26157:10
    at Request.<anonymous> (/var/task/index.js:17817:9)
    at Request.<anonymous> (/var/task/index.js:18469:12)
    at Request.callListeners (/var/task/index.js:47781:18) {
  code: 'AuthFailure.ServiceLinkedRoleCreationNotPermitted',
  time: 2020-07-30T15:03:24.631Z,
  requestId: 'c7bab39e-b75c-4e7d-bc29-6622b3d4ddb1',
  statusCode: 403,
  retryable: false,
  retryDelay: 68.19342592727871
}

What is the expected correct behavior?

Scale up lambda should create EC2 instance

Possible fixes

I'm sure this is an IAM permissions issue. I am rather new to both AWS and Terraform, and I am not sure in which of them this needs to be solved or how to go about it.
It would be great to get some pointers.


cmcconnell1 · 2020-07-30T21:26:50Z

Hey @compiaffe ,

I assume you are using the v0.2.0 tag?
Also, I'm curious whether you always see this error now.

I.e., can you run terraform destroy, delete the ./.terraform dir, and then run terraform init/plan/apply again a few times to validate that you always get this error?
I don't see the error you noted, although the effect is the same for both of us: no EC2 spot instances are deployed (ref: dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign" #100 (comment)). And I have definitely run terraform destroy/apply probably ten or so times in the past couple of days.
For the record, I'm using Terraform 0.12.28 on macOS. Which Terraform version are you using?

I ended up importing the project, as I needed to make changes to some submodule code (hard-coded tags, etc.). In case it's helpful, here is my internal module structure, based on the default example from this project:

tree
.
├── README.md
├── lambdas-download
│   ├── main.tf
│   ├── runner-binaries-syncer.zip
│   ├── runners.zip
│   ├── terraform.tfstate
│   └── webhook.zip
├── main.tf
├── outputs.tf
├── providers.tf
├── terraform-aws-github-runner.tfvars
├── terraform.secret.auto.tfvars
├── terraform.tf
└── variables.tf

1 directory, 13 files
cmcc@cmcc:default $ cat ../../.terraform-version
0.12.28

I found this curious seeing that this has apparently been an intermittent issue for folks in the past with other projects:
LeanerCloud/AutoSpotting#187

This issue notes how they apparently took their fix and made it less permissive and were able to get it resolved:
LeanerCloud/AutoSpotting#416 (comment)
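
For illustration, a narrowly scoped permission of the kind discussed there could look like the sketch below. This is an assumption about what "less permissive" might mean in practice, not what either project actually shipped: it allows only the iam:CreateServiceLinkedRole action, and only for the EC2 Spot service. The data source name create_spot_service_linked_role is hypothetical.

```hcl
# Sketch only: policy document that permits creating the EC2 Spot
# service-linked role and nothing else. The name below is hypothetical.
data "aws_iam_policy_document" "create_spot_service_linked_role" {
  statement {
    effect  = "Allow"
    actions = ["iam:CreateServiceLinkedRole"]

    # Service-linked roles for Spot live under this role path.
    resources = ["arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/*"]

    # Restrict the action to the Spot service principal.
    condition {
      test     = "StringEquals"
      variable = "iam:AWSServiceName"
      values   = ["spot.amazonaws.com"]
    }
  }
}
```

The rendered JSON (the data source's json attribute) would still need to be attached to whichever role makes the EC2 call, for example through an aws_iam_role_policy resource.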

npalm · 2020-07-31T14:14:42Z

I am quite sure it is an IAM issue. The terraform-aws-gitlab-runner repo has a bit more info on how to create the required service-linked roles; you will only need the one for spot: https://github.com/npalm/terraform-aws-gitlab-runner#service-linked-roles

When I have time I will add them to this repo as well.

Via Terraform you create the service-linked role as follows:

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

Another quick fix would be to first manually create a spot instance via the AWS console and then remove it. You will see that AWS creates the required role for you.
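
One caveat when mixing the two approaches: as far as I know, the Spot service-linked role can exist only once per account, so if it was already created (for example through the console route), a Terraform apply that also declares the resource would be expected to fail because the role name is taken. A minimal sketch of making the creation optional, assuming Terraform 0.12 or later; the variable name create_spot_service_linked_role is hypothetical and not part of this module:

```hcl
# Hypothetical toggle: set to true only in accounts where the
# Spot service-linked role does not exist yet.
variable "create_spot_service_linked_role" {
  type    = bool
  default = false
}

# Created zero or one times depending on the toggle above.
resource "aws_iam_service_linked_role" "spot" {
  count            = var.create_spot_service_linked_role ? 1 : 0
  aws_service_name = "spot.amazonaws.com"
}
```

With the default of false, accounts that already have the role apply cleanly; a fresh account would set the variable to true in its tfvars file.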

compiaffe · 2020-09-02T12:16:33Z

@npalm Thanks for the info, I will try it out. Due to upcoming holidays I might not report back before October.

toots · 2020-09-12T23:56:10Z

I was having the same issue and adding

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

did indeed fix it. Thanks!

npalm · 2020-11-04T08:38:20Z

Docs are updated via #229

cmcconnell1 mentioned this issue Aug 11, 2020

dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign" #100

Closed

theogravity mentioned this issue Sep 12, 2020

Some additional info needs to be added to readme #203

Closed

toots mentioned this issue Sep 24, 2020

Add create role policy to scale-up.tf #229

Closed

npalm closed this as completed Nov 4, 2020
