runners-scale-up fails with 'AuthFailure.ServiceLinkedRoleCreationNotPermitted' #104

Closed
compiaffe opened this issue Jul 30, 2020 · 5 comments

Comments

@compiaffe

Summary

When following the README, using the example configuration, and adjusting the GitHub app permissions as per #100 (comment), the scale-up lambda fails to create the EC2 instance because of AuthFailure.ServiceLinkedRoleCreationNotPermitted.

Steps to reproduce

  • Do step 1 of the GitHub app setup
  • Check out the terraform-aws-github-runner repo and cd into the example folder
  • Download the lambda zips
  • Create a terraform.tfvars file with the GitHub App credentials
  • Run terraform init && terraform apply
  • Trigger a build on GitHub

What is the current bug behavior?

The GitHub app sends the webhook, the webhook lambda forwards it, and the scale-up lambda throws this error:

...
ERROR	AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.
    at Request.extractError (/var/task/index.js:41424:35)
    at Request.callListeners (/var/task/index.js:47771:20)
    at Request.emit (/var/task/index.js:47743:10)
    at Request.emit (/var/task/index.js:18467:14)
    at Request.transition (/var/task/index.js:17801:10)
    at AcceptorStateMachine.runTo (/var/task/index.js:26145:12)
    at /var/task/index.js:26157:10
    at Request.<anonymous> (/var/task/index.js:17817:9)
    at Request.<anonymous> (/var/task/index.js:18469:12)
    at Request.callListeners (/var/task/index.js:47781:18) {
  code: 'AuthFailure.ServiceLinkedRoleCreationNotPermitted',
  time: 2020-07-30T15:03:24.631Z,
  requestId: 'c7bab39e-b75c-4e7d-bc29-6622b3d4ddb1',
  statusCode: 403,
  retryable: false,
  retryDelay: 68.19342592727871
}

What is the expected correct behavior?

The scale-up lambda should create the EC2 instance.

Possible fixes

I'm sure this is an IAM permissions issue. I am rather new to both AWS and Terraform, and I am not sure in which of them this needs to be solved or how to go about it.
It would be great to get some pointers.

@cmcconnell1

Hey @compiaffe ,

I assume you are using the v0.2.0 tag?
Also, I'm curious whether you always see this error now.

  • i.e.: can you terraform destroy, delete ./.terraform dir, and then terraform init/plan/apply again a few times and validate that you always get this error?

  • I don't see the error you noted, although the effect is the same for both of us: no EC2 spot instances are deployed (ref: #100 (comment), dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign"). And I have definitely run terraform destroy/apply probably ten or so times in the past couple of days.

  • For the record, I'm using Terraform 0.12.28 on macOS. Which Terraform version are you using?

I ended up importing the project because I needed to change some of the submodule code (hard-coded tags, etc.). In case it's helpful, here is my internal module structure, based on the default/example from this project:

tree
.
├── README.md
├── lambdas-download
│   ├── main.tf
│   ├── runner-binaries-syncer.zip
│   ├── runners.zip
│   ├── terraform.tfstate
│   └── webhook.zip
├── main.tf
├── outputs.tf
├── providers.tf
├── terraform-aws-github-runner.tfvars
├── terraform.secret.auto.tfvars
├── terraform.tf
└── variables.tf

1 directory, 13 files
cmcc@cmcc:default $ cat ../../.terraform-version
0.12.28

I found it curious that this has apparently been an intermittent issue for people on other projects in the past:
LeanerCloud/AutoSpotting#187

This comment notes that they apparently made their fix less permissive and still got the issue resolved:
LeanerCloud/AutoSpotting#416 (comment)
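
For anyone landing here later, a narrowly scoped permission of that kind could be sketched in Terraform roughly as below. This is an assumption about the general shape of such a fix, not something copied from the linked AutoSpotting issue or from this module. The variable `scale_up_lambda_role_name` and the policy name are hypothetical placeholders. The action, resource ARN pattern and `iam:AWSServiceName` condition key follow the pattern AWS documents for allowing a principal to create a service-linked role.

```hcl
# Hypothetical input: name of the IAM role used by the scale-up lambda.
variable "scale_up_lambda_role_name" {
  type = string
}

# Allow creating a service-linked role, but only for EC2 Spot.
data "aws_iam_policy_document" "create_spot_slr" {
  statement {
    actions   = ["iam:CreateServiceLinkedRole"]
    resources = ["arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/*"]

    condition {
      test     = "StringLike"
      variable = "iam:AWSServiceName"
      values   = ["spot.amazonaws.com"]
    }
  }
}

# Attach the statement to the lambda role as an inline policy.
resource "aws_iam_role_policy" "create_spot_slr" {
  name   = "create-spot-service-linked-role" # hypothetical name
  role   = var.scale_up_lambda_role_name
  policy = data.aws_iam_policy_document.create_spot_slr.json
}
```

Restricting the statement to spot.amazonaws.com keeps the role from creating service-linked roles for any other service.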

@npalm
Member

npalm commented Jul 31, 2020

I am quite sure it is an IAM issue. The repo linked below has a bit more info on how to create the required service-linked roles. When I have time, I will add it to this repo as well.

You will only need the one for spot; see https://github.com/npalm/terraform-aws-gitlab-runner#service-linked-roles

Via Terraform you create the service-linked role as follows:

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

Another quick fix would be to first create a spot instance manually via the AWS console and then remove it. You will see that AWS creates the required role for you.
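
One caveat, as far as I know: a service-linked role can exist only once per AWS account, so the Terraform resource above fails to create if AWSServiceRoleForEC2Spot is already there (for example after going the console route). A minimal sketch of making the resource optional follows. The variable name `create_service_linked_role_spot` is a hypothetical choice for illustration, not an existing input of this module.

```hcl
# Hypothetical toggle: set to false if AWSServiceRoleForEC2Spot
# already exists in the account.
variable "create_service_linked_role_spot" {
  type    = bool
  default = true
}

# Created only when the toggle is true.
resource "aws_iam_service_linked_role" "spot" {
  count            = var.create_service_linked_role_spot ? 1 : 0
  aws_service_name = "spot.amazonaws.com"
}
```

With the toggle set to false, Terraform plans no role at all and relies on the one that already exists in the account.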

@compiaffe
Author

@npalm Thanks for the info, I will try it out. Due to upcoming holidays, I might not report back before October.

@toots
Contributor

toots commented Sep 12, 2020

I was having the same issue and adding

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

did indeed fix it. Thanks!

@npalm
Member

npalm commented Nov 4, 2020

Docs are updated via #229

@npalm npalm closed this as completed Nov 4, 2020