
Replacing kops state fails with network error #10043

Closed
gs11 opened this issue Oct 12, 2020 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gs11

gs11 commented Oct 12, 2020

1. What kops version are you running? The command kops version will display this information.

1.18.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.16.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops -v 10 replace --force -f - --state <s3 statestore>

5. What happened after the commands executed?

Error / retry loop:

I1012 12:01:21.888585    2093 aws_cloud.go:1340] Querying EC2 for all valid zones in region "eu-west-1"
I1012 12:01:22.898841    2093 logging_retryer.go:60] Retryable error (RequestError: send request failed
caused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 59.09752ms
I1012 12:01:22.899065    2093 aws_cloud.go:217] aws request sleeping for 59.09752ms
I1012 12:01:23.958751    2093 logging_retryer.go:60] Retryable error (RequestError: send request failed
caused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 86.178348ms
I1012 12:01:23.958799    2093 aws_cloud.go:217] aws request sleeping for 86.178348ms

6. What did you expect to happen?

State store getting updated

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

n/a in this case?

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

See above

9. Anything else we need to know?

Using kops in WSL 2

@gs11
Author

gs11 commented Oct 12, 2020

What isn't clear to me is what is supposed to respond on 169.254.169.254, as that appears to be a metadata API that is only reachable from EC2 instances(?)
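For context, 169.254.169.254 is the link-local address of the EC2 Instance Metadata Service, which is normally only reachable from inside an EC2 instance. As a rough sketch, this is the IMDSv2 handshake the SDK is attempting in the log above; it succeeds on an actual EC2 instance, but nothing answers on that address from WSL 2 on a workstation:

# IMDSv2: first request a session token...
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# ...then use it to read instance metadata, e.g. the region kops is trying to discover
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region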

@gs11
Author

gs11 commented Oct 12, 2020

I referred to the AWS credentials using AWS_PROFILE, which the aws cli happily uses, but when switching to the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars I get further, just with other errors.

It seems the AWS SDK / kops (?) didn't accept AWS_PROFILE and tried to fall back to EC2 authentication via the metadata API(?)
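Roughly the two configurations being compared (the profile name and credential values are placeholders):

# Named profile: accepted by the aws cli, but kops falls back to the metadata API
export AWS_PROFILE=myprofile
kops -v 10 get cluster --state <s3 statestore>

# Static credentials: kops gets further
unset AWS_PROFILE
export AWS_ACCESS_KEY_ID=<access key id>
export AWS_SECRET_ACCESS_KEY=<secret access key>
kops -v 10 get cluster --state <s3 statestore>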

@olemarkus
Member

There is not much information in this ticket to go on. It is safe to say that in a working environment, the kops client should not try to use the metadata API. As far as I know, WSL should not cause this either, although I don't have any way of confirming that one way or the other.

@Nuru

Nuru commented Nov 6, 2020

I am seeing similar behavior with kops 1.18.2 on Alpine 3.11.

$ kops -v 3 validate cluster 
I1106 21:52:47.867316    3632 factory.go:68] state store s3://example-kops-state
I1106 21:52:48.577037    3632 s3context.go:213] found bucket in region "us-east-1"
I1106 21:52:49.110614    3632 aws_cloud.go:1340] Querying EC2 for all valid zones in region "us-east-1"
I1106 21:52:49.111922    3632 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 404, request id: 
caused by: 404 page not found
) from ec2metadata/GetToken - will retry after delay of 36.765921ms

It keeps retrying the failed metadata query but of course never succeeds.

I am using aws-vault to provide AWS credentials via a local EC2 metadata server, but it only provides credentials, not the full EC2 Metadata. If the metadata server fails, it should fall back to using a standard API call.
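One way to take the local metadata server out of the picture in this setup (a sketch; the profile name is a placeholder) is to have aws-vault export the credentials as environment variables for the kops invocation instead:

# aws-vault resolves the profile and exports AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN for the child process
aws-vault exec my-profile -- kops -v 3 validate cluster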

@Nuru

Nuru commented Nov 6, 2020

May be related to aws/aws-sdk-go#3066

@jsonmp-k8

jsonmp-k8 commented Dec 9, 2020

I get this error while trying to do a rolling update:

I1209 14:20:10.754910     153 aws_cloud.go:1340] Querying EC2 for all valid zones in region "us-east-1"
I1209 14:20:10.756018     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 39.887478ms
I1209 14:20:10.756034     153 aws_cloud.go:217] aws request sleeping for 39.887478ms
I1209 14:20:10.796946     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 63.53531ms
I1209 14:20:10.796961     153 aws_cloud.go:217] aws request sleeping for 63.53531ms
I1209 14:20:10.861459     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 157.697544ms
I1209 14:20:10.861476     153 aws_cloud.go:217] aws request sleeping for 157.697544ms
I1209 14:20:11.020327     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:

@gs11
Author

gs11 commented Jan 5, 2021

The error manifests when getting a cluster as well.
export AWS_PROFILE=myprofile

The AWS CLI happily queries e.g. the kops state store bucket, but kops -v 10 get cluster yields:

I0105 16:43:05.508243     678 factory.go:68] state store s3://devtest.imkube.kops.mycompany.internal
I0105 16:43:05.508331     678 s3context.go:328] unable to read /sys/devices/virtual/dmi/id/product_uuid, assuming not running on EC2: open /sys/devices/virtual/dmi/id/product_uuid: no such file or directory
I0105 16:43:06.872304     678 s3context.go:163] unable to get region from metadata:unable to get region from metadata: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I0105 16:43:06.872349     678 s3context.go:173] defaulting region to "us-east-1"
I0105 16:43:13.142187     678 s3context.go:194] unable to get bucket location from region "us-east-1"; scanning all regions: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, myprofile.
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/iam/security-credentials/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

error reading state store: Unable to list AWS regions: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, myprofile.
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/iam/security-credentials/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
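A quick sanity check here (a sketch, not something from the thread) is to confirm that the same profile resolves through the plain SDK credential chain, since the error above shows kops falling through env vars, the shared credentials file, and the instance role without finding it:

# Verify the profile itself resolves outside of kops
aws sts get-caller-identity --profile myprofile
# Check where the profile is actually defined; the Go SDK reads ~/.aws/credentials by default
cat ~/.aws/credentials ~/.aws/config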

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 5, 2021
@johngmyers
Member

Are you using an assumed role? You might need AWS_SDK_LOAD_CONFIG=1 in your environment.
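A minimal sketch of that suggestion (the profile name is a placeholder; the state store placeholder is the one from the issue template):

# AWS_SDK_LOAD_CONFIG=1 tells the Go SDK to also read ~/.aws/config,
# which is where assumed-role profiles (role_arn / source_profile) usually live
export AWS_SDK_LOAD_CONFIG=1
export AWS_PROFILE=myprofile
kops -v 10 get cluster --state <s3 statestore>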

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@anoopwebs

/reopen

@k8s-ci-robot
Contributor

@anoopwebs: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@anoopwebs

I'm trying to run kops commands as part of a GitLab CI pipeline, as in https://github.com/kubernetes/kops/blob/master/docs/continuous_integration.md, with AWS calls authenticated through the Kube2IAM service.
I see the EC2MetadataError: failed to make EC2Metadata request error multiple times when the kops update cluster command runs; each time it retries for up to 3m31s, after which the request expires, is re-signed, and then succeeds.
It then works again until kops calls some other AWS service.

So in a way it looks like:
AWS service 1 call --> EC2MetadataError: failed to make EC2Metadata request -> wait for timeout -> re-sign -> AWS service 1 call works -->
AWS service 2 call --> EC2MetadataError: failed to make EC2Metadata request -> wait for timeout -> re-sign -> AWS service 2 call works -->
AWS service 3 call --> ....

Kops version 1.17.2

Note:

  1. I tried with AWS_SDK_LOAD_CONFIG=1, with no luck.
  2. If I assume the same IAM role locally and set the AWS env variables, kops commands work without any issues.

@Silvanoc

Silvanoc commented Aug 4, 2021

@anoopwebs were you able to resolve the issue?

I'm having the same issue when running kOps from a bastion inside my AWS account that assumes a role with all the needed policies.

@anoopwebs

@Silvanoc I've learned that kOps works well with the AWS environment variables, so I used a workaround: fetch temporary AWS credentials from the AWS metadata API, set those environment variables, and then run the kOps commands. Something like below:

# Fetch the temporary credentials for the instance role from the metadata API
# (the path takes the instance-profile role name)
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/{Replace with your role name} > cred.json
AWS_ACCESS_KEY_ID=$(jq -r '.AccessKeyId' cred.json)
AWS_SECRET_ACCESS_KEY=$(jq -r '.SecretAccessKey' cred.json)
AWS_SESSION_TOKEN=$(jq -r '.Token' cred.json)
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

Hope it helps!

@Silvanoc

Silvanoc commented Aug 4, 2021

@anoopwebs in the meantime I have found the root cause, a workaround, and the kOps solution.

This comment on an eksctl issue explains the root cause pretty well and also provides an easy workaround for my use case (running from a Docker container): run on the host network stack (--network=host).

The kOps solution is to enable IMDSv2 using the InstanceGroup resource. Release 1.22 will have it enabled by default.
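A sketch of that container workaround, with the image name and cluster/state placeholders assumed:

# Run the kops container on the host network stack so the link-local
# metadata address 169.254.169.254 is reachable from inside the container
docker run --rm --network=host my-kops-image:latest kops update cluster --name <cluster> --state <s3 statestore>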

@anoopwebs

Thanks for sharing!
I'm not in favor of host-network containers due to CVE concerns, but it's good to know kOps has something in 1.22.
