EC2 Discovery dropped all EC2 targets when AWS EC2 API responds with 503 #2335

Closed
WeiBanjo opened this Issue Jan 10, 2017 · 10 comments

WeiBanjo commented Jan 10, 2017

What did you do?
When the AWS EC2 API has issues, the Prometheus server loses its EC2 targets.

What did you expect to see?
Prometheus should keep the stale EC2 targets when the EC2 API call fails.

What did you see instead? Under which circumstances?
Prometheus dropped all EC2 targets when the EC2 API responded with 503.
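
For illustration, here is a minimal Go sketch of the behaviour expected above, assuming a simplified refresh loop with placeholder names (Target, describeInstances); it is not the real Prometheus EC2 discovery code. The point is that a failed EC2 API call keeps the last-known targets instead of replacing them with an empty set.

// Hypothetical sketch only: not the actual Prometheus EC2 discovery code.
// It shows the requested behaviour of keeping the last successfully
// discovered targets when the EC2 API call fails (e.g. returns 503).
package main

import (
	"errors"
	"log"
	"time"
)

// Target stands in for a discovered scrape target.
type Target struct{ Address string }

// describeInstances is a placeholder for the EC2 DescribeInstances call;
// here it always fails, simulating a 503 from the AWS API.
func describeInstances() ([]Target, error) {
	return nil, errors.New("Unavailable: Tags could not be retrieved, status code: 503")
}

func main() {
	var cached []Target // last successfully discovered targets

	ticker := time.NewTicker(10 * time.Second) // stands in for refresh_interval
	defer ticker.Stop()

	for {
		targets, err := describeInstances()
		if err != nil {
			// Keep serving the stale targets instead of emitting an
			// empty target set, which would drop every EC2 target.
			log.Printf("could not describe instances: %v (keeping %d cached targets)", err, len(cached))
		} else {
			cached = targets
			// hand `cached` to the scrape manager here
		}
		<-ticker.C
	}
}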

Environment

  • System information:
$ uname -srm
Linux 3.10.0-327.4.4.el7.x86_64 x86_64
  • Prometheus version:
$ prometheus -version
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
  build user:       root@e685d23d8809
  build date:       20161128-09:59:22
  go version:       go1.7.3
  • Prometheus configuration file (the hashmod sharding in the relabel_configs below is illustrated in the sketch after the logs):
global:
  scrape_interval: 600s

scrape_configs:
  - job_name: 'prod-exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    scheme: 'https'
    ec2_sd_configs:
      - region: us-east-1
        port: 8100
        refresh_interval: 600s
    relabel_configs:
      - source_labels: [__meta_ec2_public_ip]
        regex: .+
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        regex: prod-.*
        action: keep
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ^1$
        action: keep
      - source_labels: [__meta_ec2_private_ip]
        regex: (.*)
        replacement: ${1}:8100
        action: replace
        target_label: __address__
      - source_labels: [__meta_ec2_tag_Name]
        action: replace
        target_label: hostname
  • Logs:
prometheus[15492]: time="2017-01-10T20:41:48Z" level=error msg="could not describe instances: Unavailable: Tags could not be retrieved.\n\tstatus code: 503, request id: REQUEST_ID" source="ec2.go:116"
prometheus[15492]: time="2017-01-10T20:41:49Z" level=error msg="could not describe instances: Unavailable: Tags could not be retrieved.\n\tstatus code: 503, request id: REQUEST_ID" source="ec2.go:116"
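
As a side note on the configuration above: the hashmod/keep pair in relabel_configs shards the discovered targets, so this server only keeps addresses whose hash modulo 2 is 1 (a second server would keep the rest). Below is a rough, hypothetical Go sketch of that idea; the hash function and the addresses are purely illustrative and may not match what Prometheus actually uses internally.

// Illustrative sketch of hashmod-based target sharding, as used in the
// relabel_configs above. The hash function Prometheus actually uses for the
// hashmod action may differ; FNV-1a is used here purely for demonstration,
// and the addresses are made-up examples.
package main

import (
	"fmt"
	"hash/fnv"
)

// shard returns hash(address) % modulus, mirroring the idea of the hashmod
// relabel action with `modulus: 2`.
func shard(address string, modulus uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(address))
	return h.Sum64() % modulus
}

func main() {
	addresses := []string{"10.0.1.10:8100", "10.0.1.11:8100", "10.0.1.12:8100"}
	for _, addr := range addresses {
		// With `regex: ^1$` and `action: keep`, this Prometheus server only
		// scrapes targets whose shard is 1; a second server keeps shard 0.
		fmt.Printf("%s -> shard %d\n", addr, shard(addr, 2))
	}
}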

@WeiBanjo WeiBanjo changed the title EC2 Discovery dropped all EC2 targets when AWS API respond 503. EC2 Discovery dropped all EC2 targets when AWS EC2 API respond 503. Jan 10, 2017

brian-brazil (Member) commented Jul 14, 2017

This needs confirmation.

krasi-georgiev (Member) commented Dec 1, 2017

@WeiBanjo is this still the case even with the latest version?

I did a big refactoring of the SD; if you want, give it a try and let me know whether it still exhibits the same bug.
Here is a link to download a Linux 64-bit executable:
https://github.com/krasi-georgiev/prometheus/releases/download/v2.0.0-beta.x/prometheus

WeiBanjo (Author) commented Dec 1, 2017

@krasi-georgiev Thanks. I will test that version in the next couple of days.

krasi-georgiev (Member) commented Dec 2, 2017

We need some proper testing before the merge, so I am on IRC if you encounter any more bugs.

WeiBanjo (Author) commented Dec 2, 2017

@krasi-georgiev Your build fails on startup; see the following logs. My configuration works with the Prometheus v2 binary.

Are there any EC2-discovery-related breaking changes I should be aware of?

Dec 02 18:39:18 prometheus[2540]: level=error ts=2017-12-02T18:39:18.720043373Z caller=main.go:503 err="Error loading config one or more errors occurred while applying the new configuration (--config.file=/opt/prometheus/prometheus.yaml)"

And my configuration passes the config check:

$ /opt/go/bin/promtool --version
promtool, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
  build user:       root@615b82cb36b6
  build date:       20171108-07:11:59
  go version:       go1.9.2

$ /opt/go/bin/promtool check config /opt/prometheus/prometheus.yaml
Checking /opt/prometheus/prometheus.yaml
  SUCCESS: 1 rule files found

Checking /opt/prometheus/eng-alerts/prometheus/recording_rules/collector_v2/fleet/common.rules.yml
  SUCCESS: 4 rules found

Some system info:

$ uname -a
Linux ip-172-31-28-9.ec2.internal 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

krasi-georgiev (Member) commented Dec 3, 2017

@WeiBanjo I tested using the config from your first comment, but it seems that you are using a different one with some recording rules in it.

@gouthamve mentioned that this might be caused by #3524, so I rebased and uploaded a new version.
Can you please download it again from the same link and try again?
If it still doesn't load, can you please post your new config so I can trace and kill this little bug?

WeiBanjo (Author) commented Dec 3, 2017

@krasi-georgiev The new build works.

I tested the EC2 API failure scenario; Prometheus keeps using the cached targets after a failure.

Thanks for the fix.

krasi-georgiev (Member) commented Dec 4, 2017

@WeiBanjo glad to hear that. Let me know if you find any other bugs, as I want to resolve them before the merge.

krasi-georgiev (Member) commented Feb 17, 2018

@brancz, @brian-brazil you can close this one.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 22, 2019
