AWS metadata and STS API calls are slow #873

Closed
joeduffy opened this issue Feb 11, 2020 · 10 comments · Fixed by #1288

@joeduffy
Member

joeduffy commented Feb 11, 2020

During the configuration of the AWS provider, it calls into multiple AWS APIs to check various endpoints and identity metadata. Times vary quite a bit depending on the speed of your network; however, it's not uncommon for these to add up to 10-15 seconds of lag before an update even begins running.

Here are the specific calls:

Note that you can set config variables to skip this logic:

  • pulumi config set aws:skipCredentialsValidation true
  • pulumi config set aws:skipMetadataApiCheck true
  • pulumi config set aws:skipRequestingAccountId true
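
When stacks are created from a script, the three settings above could be generated rather than typed by hand. A minimal TypeScript sketch that only builds the command strings and does not invoke the CLI; the helper name `skipFlagCommands` is hypothetical, and the flag names are the ones listed above:

```typescript
// The three provider settings named in the list above.
const skipFlags = [
  "skipCredentialsValidation",
  "skipMetadataApiCheck",
  "skipRequestingAccountId",
] as const;

// Hypothetical helper: builds the CLI invocations as strings; it does not run them.
function skipFlagCommands(flags: readonly string[] = skipFlags): string[] {
  return flags.map((flag) => `pulumi config set aws:${flag} true`);
}

// Print one command per line so the output can be reviewed before running it.
for (const cmd of skipFlagCommands()) {
  console.log(cmd);
}
```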

For some reason, the provider seems to call these APIs twice, and one of those calls ignores the config. @lukehoban posited that this could be due to the way we do a prepass over configuration to validate and check for defaults. If so, it seems like a bug that we do it without first applying the configuration.

Note also that you can set AWS_METADATA_TIMEOUT=0, which shortens the timeouts of the AWS metadata API calls and has a small but noticeable effect.

I don't know precisely what to do here, but we could consider setting our own defaults differently than the underlying Terraform provider. I don't know enough about what those APIs are doing -- it appears, for instance, that the metadata API check is determining whether the update is happening from within an AWS data center (though why the code needs to know that, I'm not quite sure).

@uLan08

uLan08 commented Feb 12, 2020

I encountered an issue somewhat related to this. I was trying to spin up the basic example from pulumi new aws-javascript but pulumi up was just hanging indefinitely. The most I waited was probably 5 mins.

Setting pulumi config set aws:skipCredentialsValidation true fixed it for me.

Here is some of my environment info in case you need it.

▶ node -v
v10.16.3

▶ pulumi version
v1.10.1

"@pulumi/pulumi": "^1.0.0",
"@pulumi/aws": "^1.0.0",
"@pulumi/awsx": "^0.18.10"


config:
  aws:profile: foo
  aws:region: ap-southeast-1
  pulumi:template: aws-javascript

It was stuck at

▶ pulumi up --logtostderr -v=9 2> out.txt
Previewing update (dev):
     Type                 Name             Plan        
     pulumi:pulumi:Stack  pulumi-demo-dev  create..

Last entries from logs:

I0212 22:33:41.285702   73852 eventsink.go:60] AWS Auth provider used: "SharedCredentialsProvider"
I0212 22:33:41.285742   73852 eventsink.go:63] eventSink::Debug(<{%reset%}>AWS Auth provider used: "SharedCredentialsProvider"<{%reset%}>)
I0212 22:33:41.288928   73852 eventsink.go:60] Trying to get account information via sts:GetCallerIdentity
I0212 22:33:41.288964   73852 eventsink.go:63] eventSink::Debug(<{%reset%}>Trying to get account information via sts:GetCallerIdentity<{%reset%}>)

@joeduffy
Member Author

Interesting, @stack72 is @uLan08's issue the same as #814? And pulumi/pulumi#3604?

@joeduffy joeduffy added this to the 0.32 milestone Feb 12, 2020
@joeduffy
Member Author

Assigning to @pgavlin during M32 to triage as part of the overall performance push. On slow networks, this is by far the dominant performance issue (when deploying to AWS), as far as I can tell.

@pgavlin
Member

pgavlin commented Feb 12, 2020

I'm already full up for M32--@leezen I'm going to bounce this to you and we can figure out where to put it.

@pgavlin pgavlin assigned leezen and unassigned pgavlin Feb 12, 2020
@lukehoban
Member

lukehoban commented Feb 26, 2020

The commentary in pulumi/pulumi#3671 (comment) is really mostly about the specific issue tracked here. Copying below as well:

A few more observations looking at some detailed logs of the initialization sequence:

  • On first ever update of a stack, we Configure the AWS provider once for preview and once for update. On every future update, we Configure the AWS provider twice for preview and once for update (!).
  • During the Configure for AWS we hit the EC2 metadata endpoint twice (once due to our own PreConfigureCallback, the other part of the upstream provider Configure). Despite being set to timeout after 100ms, each call reliably takes 1-1.5s.
  • After those two calls, we then call sts/GetCallerIdentity twice (the upstream provider just happens to do this by default) and ec2/DescribeAccountAttributes once - each call taking .5-1s.

In total, Configure takes 3.5-6s, and we call it twice during preview.
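
The 3.5-6s figure follows from adding up the per-call ranges in the bullets above. A minimal TypeScript sketch of that arithmetic, using only the numbers from this comment; the names `Range`, `Call`, and `sumRanges` are made up for illustration:

```typescript
// Latency range in seconds.
interface Range { min: number; max: number; }

// One kind of API call made during Configure, with how often it happens.
interface Call { name: string; count: number; each: Range; }

// Figures as reported in the comment above.
const calls: Call[] = [
  { name: "EC2 metadata endpoint", count: 2, each: { min: 1.0, max: 1.5 } },
  { name: "sts/GetCallerIdentity", count: 2, each: { min: 0.5, max: 1.0 } },
  { name: "ec2/DescribeAccountAttributes", count: 1, each: { min: 0.5, max: 1.0 } },
];

// Sum count * range over all calls to get the cost of one Configure.
function sumRanges(items: Call[]): Range {
  return items.reduce(
    (acc, c) => ({
      min: acc.min + c.count * c.each.min,
      max: acc.max + c.count * c.each.max,
    }),
    { min: 0, max: 0 },
  );
}

const perConfigure = sumRanges(calls);

// Configure runs twice during a preview of an existing stack.
const perPreview: Range = { min: 2 * perConfigure.min, max: 2 * perConfigure.max };

console.log(`Configure: ${perConfigure.min}-${perConfigure.max}s`);
console.log(`Preview (2x Configure): ${perPreview.min}-${perPreview.max}s`);
```

This prints a Configure range of 3.5-6s and a preview range of 7-12s, which is in the same ballpark as the 10-15 seconds of lag described in the original report.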

@lukehoban
Member

Given the above - I see a few things we could consider changing:

  1. Removing the second Configure we do during a preview. This should be unnecessary. Would save 3.5-6s during preview.
  2. Changing our defaults on skipCredentialsValidation (the source of the second GetCallerIdentity call which would save 0.5-1s) and skipMetadataApiCheck (the source of the two EC2 metadata calls which would save 2-3s).
  3. Avoiding doing our PreConfigureCallback check (and accepting worse error messages - or string matching error messages to overwrite them) which wouldn't help much if we also did (2) above, but if we didn't do it, would save 1-1.5s.
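
To compare options 1 and 2, the estimated savings can be applied to the 3.5-6s Configure cost measured earlier. This is a rough TypeScript sketch, not a measurement: it assumes the fast case of each range lines up with the fast case of the others (and likewise for the slow case), and the helper names `sub` and `scale` are made up for illustration:

```typescript
// Latency range in seconds.
interface Range { min: number; max: number; }

// Subtract a saving from a cost, pairing fast with fast and slow with slow (assumption).
const sub = (a: Range, b: Range): Range => ({ min: a.min - b.min, max: a.max - b.max });

// Multiply a range by the number of Configure calls.
const scale = (a: Range, k: number): Range => ({ min: a.min * k, max: a.max * k });

const configureToday: Range = { min: 3.5, max: 6 };  // one Configure, from the logs
const credsSaving: Range = { min: 0.5, max: 1 };     // skipCredentialsValidation
const metadataSaving: Range = { min: 2, max: 3 };    // skipMetadataApiCheck

// One Configure with both skip settings enabled (option 2).
const configureWithSkips = sub(sub(configureToday, credsSaving), metadataSaving);

const previewToday = scale(configureToday, 2);        // two Configures per preview
const previewOption1 = scale(configureToday, 1);      // second Configure removed
const previewOption2 = scale(configureWithSkips, 2);  // new defaults, still two Configures
const previewBoth = scale(configureWithSkips, 1);     // options 1 and 2 combined

const fmt = (r: Range): string => `${r.min}-${r.max}s`;
console.log(`today:    ${fmt(previewToday)}`);
console.log(`option 1: ${fmt(previewOption1)}`);
console.log(`option 2: ${fmt(previewOption2)}`);
console.log(`1 + 2:    ${fmt(previewBoth)}`);
```

Under those assumptions, a preview drops from 7-12s today to 3.5-6s with option 1, 2-4s with option 2, and 1-2s with both.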

I don't love changing our defaults on these things. It will no doubt lead to confusion in less common cases if we do. But it does feel like defaults that don't penalize 100% of usage in favor of relatively corner case needs are sensible?

Thoughts?

@stack72
Contributor

stack72 commented Feb 26, 2020

@lukehoban FWIW, there are well-known people in the Terraform ecosystem who override these defaults, as this is known to be slow upstream as well

I would suggest we set these and just document that it's the case

@shousper

shousper commented Jan 5, 2021

Seeing as this issue is coming up on 12 months, would it be worth considering some action on Pulumi's side? I'm running into this issue pretty frequently myself with a fairly basic (standard?) cross-account setup using assume role everywhere with AWS profiles.

Setting aws:skipCredentialsValidation to true in every pulumi stack I create is getting old pretty fast. Is it time to bite the bullet and enable this behaviour by default?

@stack72
Contributor

stack72 commented Jan 5, 2021

Hi @shousper

So I've actually been looking at this issue tonight to see what we can set the default variables to.

Do you happen to know how long your pulumi run was before setting that value? I am trying to gauge if you are getting the same behaviour as me

Paul

stack72 added a commit that referenced this issue Jan 5, 2021
Fixes: #873

* `skipCredentialsValidation` now defaults to `true`.
* `skipGetEc2Platforms` now defaults to `true`.
* `skipMetadataApiCheck` now defaults to `true`.
* `skipRegionValidation` now defaults to `true`.
@stack72 stack72 assigned stack72 and unassigned lukehoban Jan 5, 2021
@stack72 stack72 added this to the current milestone Jan 5, 2021
@shousper

shousper commented Jan 6, 2021

Sorry @stack72, doesn't look like this is my issue. It ramped up, so I dug deeper.. looks like I'm in this trap: hashicorp/terraform#27350 golang/go#42700

😞

@leezen leezen removed this from the current milestone Jan 12, 2021
@leezen leezen modified the milestones: 0.50, 0.51 Jan 12, 2021
@leezen leezen modified the milestones: 0.51, 0.52 Feb 2, 2021
stack72 added a commit that referenced this issue Feb 7, 2021
#1288)

Fixes: #873

* `skipCredentialsValidation` now defaults to `true`.
* `skipGetEc2Platforms` now defaults to `true`.
* `skipMetadataApiCheck` now defaults to `true`.
* `skipRegionValidation` now defaults to `true`.
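
With these defaults, a stack that still wants one of the checks would have to opt back in explicitly, for example with pulumi config set aws:skipCredentialsValidation false. The TypeScript sketch below illustrates that precedence only; it is not the provider's implementation, and the `SkipSettings` type and `effectiveSettings` helper are hypothetical:

```typescript
// Hypothetical shape for the four settings named in the commit message.
interface SkipSettings {
  skipCredentialsValidation: boolean;
  skipGetEc2Platforms: boolean;
  skipMetadataApiCheck: boolean;
  skipRegionValidation: boolean;
}

// New defaults as listed in the commit message.
const newDefaults: SkipSettings = {
  skipCredentialsValidation: true,
  skipGetEc2Platforms: true,
  skipMetadataApiCheck: true,
  skipRegionValidation: true,
};

// Explicit stack config wins over the defaults (assumed precedence for this sketch).
function effectiveSettings(stackConfig: Partial<SkipSettings>): SkipSettings {
  return { ...newDefaults, ...stackConfig };
}

// A stack that opts back in to credentials validation keeps the other new defaults.
const optedBackIn = effectiveSettings({ skipCredentialsValidation: false });
console.log(optedBackIn);
```
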
t0yv0 added a commit that referenced this issue May 22, 2024
This PR explores reverting the default to `aws:skipMetadataApiCheck=false`
to enable the provider to seamlessly authenticate against IMDS(v2)
endpoints in the AWS environment. It appears that doing so no longer
slows down the provider startup time perceptibly. I tested the speed
delta by measuring a local empty preview of an AWS S3 Bucket using
AWS_PROFILE authentication, with local <-> us-east-1; there is no
perceptible difference.

Fixes: #1692

An integration test is added that exercises `pulumi preview` on an EC2
instance with IMDSv2 and asserts that the provider can authenticate
successfully.

Background:

- #873
- #1288
7 participants