
Replacing kops state fails with network error #10043

Closed
gs11 opened this issue Oct 12, 2020 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gs11

gs11 commented Oct 12, 2020

1. What kops version are you running? The command kops version will display this information.

1.18.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.16.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops -v 10 replace --force -f - --state <s3 statestore>

5. What happened after the commands executed?

Error / retry loop:

I1012 12:01:21.888585    2093 aws_cloud.go:1340] Querying EC2 for all valid zones in region "eu-west-1"
I1012 12:01:22.898841    2093 logging_retryer.go:60] Retryable error (RequestError: send request failed
caused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 59.09752ms
I1012 12:01:22.899065    2093 aws_cloud.go:217] aws request sleeping for 59.09752ms
I1012 12:01:23.958751    2093 logging_retryer.go:60] Retryable error (RequestError: send request failed
caused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 86.178348ms
I1012 12:01:23.958799    2093 aws_cloud.go:217] aws request sleeping for 86.178348ms

6. What did you expect to happen?

State store getting updated

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

n/a in this case?

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

See above

9. Anything else we need to know?

Using kops in WSL 2

@gs11
Author

gs11 commented Oct 12, 2020

What isn't clear to me is what is supposed to respond on 169.254.169.254, as that appears to be a metadata API that is only reachable from EC2 instances(?)
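For context, 169.254.169.254 is the link-local address of the EC2 Instance Metadata Service, which is normally only reachable from inside an EC2 instance. As a rough sketch, this is the IMDSv2 handshake the SDK is attempting in the log above; it succeeds on an actual EC2 instance, but nothing answers on that address from WSL 2 on a workstation:

# IMDSv2: first request a session token...
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# ...then use it to read instance metadata, e.g. the region kops is trying to discover
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region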

@gs11
Author

gs11 commented Oct 12, 2020

I referred to the AWS credentials using AWS_PROFILE, which the aws cli happily uses, but when switching to the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars I get further, just with other errors.

It seems the AWS SDK / kops (?) didn't accept AWS_PROFILE and tried to fall back to EC2 authentication via the metadata API(?)
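Roughly the two configurations being compared (the profile name and credential values are placeholders):

# Named profile: accepted by the aws cli, but kops falls back to the metadata API
export AWS_PROFILE=myprofile
kops -v 10 get cluster --state <s3 statestore>

# Static credentials: kops gets further
unset AWS_PROFILE
export AWS_ACCESS_KEY_ID=<access key id>
export AWS_SECRET_ACCESS_KEY=<secret access key>
kops -v 10 get cluster --state <s3 statestore>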

@olemarkus
Member

There is not much information in this ticket to go on. It is safe to say that in a working environment, the kops client should not try to use the metadata API. As far as I know, WSL should not cause this either, although I don't have any way of confirming that one way or the other.

@Nuru

Nuru commented Nov 6, 2020

I am seeing similar behavior with kops 1.18.2 on Alpine 3.11.

$ kops -v 3 validate cluster 
I1106 21:52:47.867316    3632 factory.go:68] state store s3://example-kops-state
I1106 21:52:48.577037    3632 s3context.go:213] found bucket in region "us-east-1"
I1106 21:52:49.110614    3632 aws_cloud.go:1340] Querying EC2 for all valid zones in region "us-east-1"
I1106 21:52:49.111922    3632 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 404, request id: 
caused by: 404 page not found
) from ec2metadata/GetToken - will retry after delay of 36.765921ms

It keeps retrying the failed metadata query but of course never succeeds.

I am using aws-vault to provide AWS credentials via a local EC2 metadata server, but it only provides credentials, not the full EC2 Metadata. If the metadata server fails, it should fall back to using a standard API call.
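One way to take the local metadata server out of the picture in this setup (a sketch; the profile name is a placeholder) is to have aws-vault export the credentials as environment variables for the kops invocation instead:

# aws-vault resolves the profile and exports AWS_ACCESS_KEY_ID,
# AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN for the child process
aws-vault exec my-profile -- kops -v 3 validate cluster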

@Nuru

Nuru commented Nov 6, 2020

May be related to aws/aws-sdk-go#3066

@jsonmp-k8

jsonmp-k8 commented Dec 9, 2020

I get this error while trying to do a rolling update:

I1209 14:20:10.754910     153 aws_cloud.go:1340] Querying EC2 for all valid zones in region "us-east-1"
I1209 14:20:10.756018     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 39.887478ms
I1209 14:20:10.756034     153 aws_cloud.go:217] aws request sleeping for 39.887478ms
I1209 14:20:10.796946     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 63.53531ms
I1209 14:20:10.796961     153 aws_cloud.go:217] aws request sleeping for 63.53531ms
I1209 14:20:10.861459     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:
caused by: ) from ec2metadata/GetToken - will retry after delay of 157.697544ms
I1209 14:20:10.861476     153 aws_cloud.go:217] aws request sleeping for 157.697544ms
I1209 14:20:11.020327     153 logging_retryer.go:60] Retryable error (EC2MetadataError: failed to make EC2Metadata request
	status code: 403, request id:

@gs11
Author

gs11 commented Jan 5, 2021

The error manifests when getting a cluster as well.
export AWS_PROFILE=myprofile

The AWS CLI happily queries e.g. the kops state store bucket, but kops -v 10 get cluster yields:

I0105 16:43:05.508243     678 factory.go:68] state store s3://devtest.imkube.kops.mycompany.internal
I0105 16:43:05.508331     678 s3context.go:328] unable to read /sys/devices/virtual/dmi/id/product_uuid, assuming not running on EC2: open /sys/devices/virtual/dmi/id/product_uuid: no such file or directory
I0105 16:43:06.872304     678 s3context.go:163] unable to get region from metadata:unable to get region from metadata: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I0105 16:43:06.872349     678 s3context.go:173] defaulting region to "us-east-1"
I0105 16:43:13.142187     678 s3context.go:194] unable to get bucket location from region "us-east-1"; scanning all regions: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, myprofile.
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/iam/security-credentials/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

error reading state store: Unable to list AWS regions: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
SharedCredsLoad: failed to load profile, myprofile.
EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/iam/security-credentials/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
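A quick sanity check here (a sketch, not something from the thread) is to confirm that the same profile resolves through the plain SDK credential chain, since the error above shows kops falling through env vars, the shared credentials file, and the instance role without finding it:

# Verify the profile itself resolves outside of kops
aws sts get-caller-identity --profile myprofile
# Check where the profile is actually defined; the Go SDK reads ~/.aws/credentials by default
cat ~/.aws/credentials ~/.aws/config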

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 5, 2021
@johngmyers
Member

Are you using an assumed role? You might need AWS_SDK_LOAD_CONFIG=1 in your environment.
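A minimal sketch of that suggestion (the profile name is a placeholder; the state store placeholder is the one from the issue template):

# AWS_SDK_LOAD_CONFIG=1 tells the Go SDK to also read ~/.aws/config,
# which is where assumed-role profiles (role_arn / source_profile) usually live
export AWS_SDK_LOAD_CONFIG=1
export AWS_PROFILE=myprofile
kops -v 10 get cluster --state <s3 statestore>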

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@anoopwebs

/reopen

@k8s-ci-robot
Contributor

@anoopwebs: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@anoopwebs

I'm trying to run kops commands as part of a GitLab CI pipeline, as in https://github.com/kubernetes/kops/blob/master/docs/continuous_integration.md, with AWS calls authenticated through the Kube2IAM service.
I see the EC2MetadataError: failed to make EC2Metadata request error multiple times when the kops update cluster command runs; each time it retries for up to 3m31s, after which the request expires, is re-signed, and then succeeds.
It then works again until kops calls some other AWS service.

So in a way it looks like:
AWS service 1 call --> EC2MetadataError: failed to make EC2Metadata request -> wait for timeout -> re-sign -> AWS service 1 call works -->
AWS service 2 call --> EC2MetadataError: failed to make EC2Metadata request -> wait for timeout -> re-sign -> AWS service 2 call works -->
AWS service 3 call --> ....

Kops version 1.17.2

Note:

  1. I tried with AWS_SDK_LOAD_CONFIG=1, with no luck.
  2. If I assume the same IAM role locally and set the AWS env variables, kops commands work without any issues.

@Silvanoc

Silvanoc commented Aug 4, 2021

@anoopwebs were you able to resolve the issue?

I'm having the same issue when running kOps from a bastion inside my AWS account that assumes a role with all the needed policies.

@anoopwebs

@Silvanoc I've learned that kOps works well with the AWS environment variables, so I used a workaround: fetch temporary AWS credentials from the AWS metadata API, set those environment variables, and then run the kOps commands. Something like below:

# Fetch the temporary credentials for the instance role from the metadata API
# (the path takes the instance-profile role name)
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/{Replace with your role name} > cred.json
AWS_ACCESS_KEY_ID=$(jq -r '.AccessKeyId' cred.json)
AWS_SECRET_ACCESS_KEY=$(jq -r '.SecretAccessKey' cred.json)
AWS_SESSION_TOKEN=$(jq -r '.Token' cred.json)
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

Hope it helps!

@Silvanoc

Silvanoc commented Aug 4, 2021

@anoopwebs in the meantime I have found the root cause, a workaround, and the kOps solution.

This comment on an eksctl issue explains the root cause pretty well and also provides an easy workaround for my use case (running from a Docker container): run on the host network stack (--network=host).

The kOps solution is to enable IMDSv2 using the InstanceGroup resource. Release 1.22 will have it enabled by default.
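A sketch of that container workaround, with the image name and cluster/state placeholders assumed:

# Run the kops container on the host network stack so the link-local
# metadata address 169.254.169.254 is reachable from inside the container
docker run --rm --network=host my-kops-image:latest kops update cluster --name <cluster> --state <s3 statestore>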

@anoopwebs

Thanks for sharing!
I'm not in favor of host-network containers due to CVE concerns, but it's good to know kOps has something in 1.22.
