1.6.1 becomes unresponsive when fails to describe EC2 instances when using auto discovery #2876

Closed
damienmarshall opened this issue Jun 26, 2017 · 9 comments

damienmarshall commented Jun 26, 2017

What did you do?
I'm running Prometheus in a Docker container on Ubuntu 16.04.02 on AWS, using the EC2 service discovery feature. Every few days I see the following errors in the logs, after which CPU usage appears to spike and Prometheus appears to crash, requiring a restart.

Environment
AWS

  • System information:

Linux 4.4.0-1020-aws x86_64

  • Prometheus version:

1.6.1 (have since upgraded to 1.7.1 to see if this helps)

  • Prometheus configuration file:
global:
  scrape_interval:     15s
  evaluation_interval: 15s
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "targets.rules"
  - "host.rules"
  - "containers.rules"
scrape_configs:
  - job_name: node_exporter
    scrape_interval: 20s
    ec2_sd_configs:
      - region: us-east-1
        access_key: <key>
        secret_key: <key>
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
        action: replace
      - source_labels: [__meta_ec2_tag_monitored]
        target_label: instance
        regex: '(fals.*)'
        replacement: '${1}'
        action: drop
  • Logs:
time="2017-06-24T03:43:02Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:42:59Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:43:20Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:43:39Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:43:26Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:45:43Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:46:03Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
time="2017-06-24T03:45:22Z" level=error msg="could not describe instances: RequestError: send request failed\ncaused by: Post https://ec2.us-east-1.amazonaws.com/: dial tcp: i/o timeout" source="ec2.go:116" 
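
A minimal sketch of how the `drop` rule in the configuration above would behave, assuming Prometheus anchors relabel regexes at both ends (emulated here with Python's `re.fullmatch`). The function name `is_dropped` and the sample tag values are hypothetical; only the regex `(fals.*)` comes from the config. As far as I know the `drop` action ignores `target_label` and `replacement`, so they are left out.

```python
import re

# Regex taken from the relabel rule on __meta_ec2_tag_monitored.
DROP_REGEX = "(fals.*)"


def is_dropped(tag_value: str, regex: str = DROP_REGEX) -> bool:
    """Return True if a target with this tag value would be dropped.

    Assumption: the regex must match the whole label value, which
    re.fullmatch emulates for a simple pattern like this one.
    """
    return re.fullmatch(regex, tag_value) is not None


# Hypothetical tag values, for illustration only.
for value in ["false", "true", "", "not-false"]:
    print(repr(value), "dropped" if is_dropped(value) else "kept")
```

Under these assumptions, an instance without the `monitored` tag (empty value) is kept, and a value that merely contains `fals` somewhere in the middle is kept as well.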

Author

damienmarshall commented Jun 27, 2017

This happened again with 1.7.1, around the same time last night, so I'm wondering whether this might be a problem with the AWS credentials:

	status code: 401, request id: 0f318696-b68d-40d7-8609-b36b27aaf3e6" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 0ae0044f-5681-47ba-b178-0270a9c9b844" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: ea18b62d-660b-49bb-b7d4-01b3da7dd016" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 33fae258-6f45-4ae2-af65-0b82805141da" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 6869b259-4ede-43b6-a958-cc6081513fea" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: f68ce53a-5d2f-44fe-8081-51487b2c0fd6" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 919c5cb4-4ada-43d3-859f-eb128d84f21e" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 385a697c-05c8-49de-b8d9-7ae90fd2bb7b" source="ec2.go:118" 
time="2017-06-27T07:27:06Z" level=error msg="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials
	status code: 401, request id: 4a6d3db4-3b7f-4e3c-9125-1c297bc41a1f" source="ec2.go:118" 

From researching online, it appears this can happen if the system clock is off, but the date and time look fine when I check them, so I will keep looking into that. However, it still appears that Prometheus doesn't recover well from this situation (or it could be that AWS access keeps failing at that time and provides no info to scrape).
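
One way to check the clock-skew theory is to compare the local clock with the `Date` header that the AWS endpoint returns. The sketch below is only an illustration: the function names are hypothetical, the 300-second tolerance is an assumed default rather than a documented EC2 limit, and the endpoint URL is the one that appears in the logs above.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Local time minus server time, in seconds (positive = local clock ahead)."""
    server_time = parsedate_to_datetime(date_header)
    if server_time.tzinfo is None:
        # Assumption: treat a header without zone information as UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (local_now - server_time).total_seconds()


def skew_is_suspicious(skew: float, tolerance: float = 300.0) -> bool:
    """Flag skews beyond the tolerance (300 s is an assumed value)."""
    return abs(skew) > tolerance


def fetch_date_header(url: str = "https://ec2.us-east-1.amazonaws.com/") -> str:
    """Fetch the Date header; error responses carry one too."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.headers["Date"]
    except urllib.error.HTTPError as err:
        return err.headers["Date"]
```

A possible usage on the affected host would be `clock_skew_seconds(fetch_date_header(), datetime.now(timezone.utc))`, run around the time the `AuthFailure` errors appear.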

Sleashe commented Nov 7, 2017

Experiencing the exact same issue with the latest tag (from Docker Hub). I assume the Prometheus version is the latest stable one (1.8.2).
My logs are quite similar, and I'm seeing the same symptoms:
1- High CPU spike
2- The Prometheus container makes the host unresponsive.

could not describe instances: RequestError: send request failed
caused by: Post https://ec2.eu-west-1.amazonaws.com/: dial tcp: i/o timeout  source="ec2.go:121"

Member

beorn7 commented Nov 7, 2017

The startup message in the logs tells you the exact version.

But indeed, I believe the EC2 fix in 1.8.2 addresses a different problem from the one described here.

kanand2 commented Nov 7, 2017

I got the same error too and was able to fix it by passing proxy settings when running the Prometheus Docker image:
-e "http_proxy=<proxy_host>:<proxy_port>" -e "https_proxy=<proxy_host>:<proxy_port>" -e "no_proxy=169.254.169.254"
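
To make the effect of these variables more concrete, here is a rough sketch of which hosts would bypass the proxy. It uses Python's standard-library interpretation of `no_proxy`, which is an assumption: Prometheus is written in Go and its matching rules may differ in detail. The proxy address and the helper name `goes_direct` are hypothetical placeholders.

```python
from urllib.request import proxy_bypass_environment

# Mirrors the -e flags above; the proxy address is a made-up placeholder.
PROXIES = {
    "http": "http://proxy.example.internal:3128",
    "https": "http://proxy.example.internal:3128",
    "no": "169.254.169.254",  # value of no_proxy
}


def goes_direct(host: str, proxies: dict = PROXIES) -> bool:
    """Return True if requests to this host would skip the proxy."""
    return bool(proxy_bypass_environment(host, proxies))


for host in ["169.254.169.254", "ec2.us-east-1.amazonaws.com"]:
    print(host, "direct" if goes_direct(host) else "via proxy")
```

With these settings, requests to the `169.254.169.254` address stay direct, while calls to the regional EC2 API endpoint go through the proxy.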

dbonatto commented Dec 12, 2017

Got the same issue described here while using 1.8.2.

Member

simonpasquier commented Aug 1, 2018

Anyone still seeing the issue with the latest v2.x Prometheus?

Member

simonpasquier commented Oct 1, 2018

Closing as it relates to Prometheus 1.x. Feel free to (re)open the issue if it happens with the latest stable version too.

vietthang207 commented Dec 28, 2018

@simonpasquier I am using the 2.3.1 binary on Ubuntu 16.04 and I've experienced this error:

level=error ts=2018-12-28T06:20:25.252856805Z caller=ec2.go:180 component="discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: c43d1de4-b68a-4953-92ea-dae04b421f0b"

After that, the whole server crashed because it ran out of memory and was under high load.

Here is part of my config:

  - job_name: 'node_exporter'
    ec2_sd_configs:
      - region: ap-southeast-1
        access_key: XXX
        secret_key: XXX
        port: 9100

Thanks in advance!

Member

simonpasquier commented Jan 2, 2019

@vietthang207 please use our user mailing list, which you can also search. The message AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401 indicates that the AWS API didn't authorize the request. It might be worth upgrading to Prometheus v2.6.0.
