Failures to refresh targets list using ec2_sd_configs #3664
Comments
Is this a lab environment? Can I get temporary access to replicate this and find the culprit?
@krasi-georgiev - I apologize, I can't really give you access, as this environment includes company IP.
Sure, that is understandable.
Great, thanks. Please let me know if anything else is required, will be happy to help!
SD went through some refactoring after 2.0. Master has some bugs, but you can try this branch, which I think should work as expected. The refactoring was about SD in general and not EC2-specific, but it would be good to test it.
Let me know if you want to test it and I can build an executable for this branch.
Yes, a binary of this branch can help, we will deploy and test. Any major bug in master we should be aware of?
The only one that I am aware of is with the k8s discovery. When I did the refactoring I used Consul SD, which sends all targets on every update, so I only discovered later that not all Discoverers do that. Here is the link to the binary for 64-bit Linux.
BTW: do you use a proxy or something else in between?
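The difference between discoverers that resend everything and those that send only changes can be sketched in Go. This is an illustration only, not the Prometheus code: the `Group` type and the `replaceAll`, `mergeBySource` and `allTargets` helpers are hypothetical names, and the assumption is that each update carries a source key identifying the groups it affects.

```go
package main

import (
	"fmt"
	"sort"
)

// Group is a simplified, hypothetical stand-in for a discovery target group.
type Group struct {
	Source  string   // identifies which part of the discoverer produced the group
	Targets []string // an empty list means the source has no targets any more
}

// replaceAll treats every update as the complete state. That is only safe
// for discoverers that resend all groups on every update.
func replaceAll(_ map[string][]string, update []Group) map[string][]string {
	next := make(map[string][]string, len(update))
	for _, g := range update {
		next[g.Source] = g.Targets
	}
	return next
}

// mergeBySource treats every update as a delta: only the sources named in
// the update change, everything else is carried over.
func mergeBySource(state map[string][]string, update []Group) map[string][]string {
	next := make(map[string][]string, len(state)+len(update))
	for src, targets := range state {
		next[src] = targets
	}
	for _, g := range update {
		if len(g.Targets) == 0 {
			delete(next, g.Source)
			continue
		}
		next[g.Source] = g.Targets
	}
	return next
}

// allTargets flattens the state into a sorted list for easy comparison.
func allTargets(state map[string][]string) []string {
	out := []string{}
	for _, targets := range state {
		out = append(out, targets...)
	}
	sort.Strings(out)
	return out
}

func main() {
	state := map[string][]string{
		"pod/a": {"10.0.0.1:80"},
		"pod/b": {"10.0.0.2:80"},
	}
	// An incremental update that only mentions pod/b.
	delta := []Group{{Source: "pod/b", Targets: []string{"10.0.0.3:80"}}}

	fmt.Println("replaceAll:   ", allTargets(replaceAll(state, delta)))
	fmt.Println("mergeBySource:", allTargets(mergeBySource(state, delta)))
}
```

With an incremental update, `replaceAll` silently drops the untouched `pod/a` targets, while `mergeBySource` keeps them.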
Thanks, @krasi-georgiev, we will try the fix. No, I do not use any proxy between Prometheus and EC2 endpoint.
Thanks. And is there also no AWS load balancer or anything else that you think might intercept the calls?
Prometheus itself is behind a load balancer, but is it relevant? If I understand correctly, the issue is with Prometheus calling AWS EC2 APIs, right?
I am having issues running the executable you supplied. I am trying to run it in Docker which is based on prom/prometheus:v2.0.0, and I simply replace the executable. I get a strange "not found" error although the file is there and I made sure permissions are correct. Anything you know of that can prevent me from running this exec in Docker this way?
Yes, you are right, the load balancer shouldn't matter. Can you try it without Docker, just to see what happens?
OK, managed to run on Ubuntu outside Docker, currently looks OK. It will be difficult to really test it this way because we usually deploy on Docker, so I need to deploy it separately, but I will keep it up and running for a while in parallel with v2.0.0.
Tests so far are useless, since this is not a production environment, and I can't reproduce the issue in a non-production environment. Any chance to get a Docker image in DockerHub that includes the fix, so I can push it to the production environment?
This fix is already in master, so the latest Docker image should include it.
Cool - we will try!
Good news - I believe the issue is indeed resolved. We are using 2.1.0 and we have not seen the issue with this version.
moshebs closed this Jan 28, 2018
Unfortunately we have seen the issue again with version 2.1.0. This is the error message we got:
moshebs reopened this Feb 13, 2018
Not sure how we can replicate this.
If EC2 is sending incorrectly formatted output, that's something you'll need to take up with Amazon.
I noticed you are using v1.5.1 of the AWS Go SDK. This version was released in Nov 2016. Current version is 1.13.11. I believe AWS will ask you to upgrade first to a newer version. I came across another issue with this version - it does not support assuming EC2 roles and ECS task roles, which is the recommended way to go. And I suspect we will see more issues with time. Any chance to get this upgraded?
Sure, an upgrade might help.
moshebs referenced this issue Mar 20, 2018: Prometheus 2.0 prioritizes EC2 instance role over credentials in environment variables #3545 (Closed)
cristian-marin-hs commented Apr 26, 2018
Hello,
Same problem in 2.3.0, but without the error messages. Prometheus EC2 discovery tries to scrape terminated processes and fails to discover new ones. The way to recreate it is very simple: launch a few instances, configure Prometheus to find them by tag (example below), then start killing them and creating new ones. Prometheus will eventually fall behind.
brian-brazil added the kind/bug and component/service discovery labels Jun 20, 2018
What do the prometheus_sd_ec2_* metrics look like?
That looks similar to what @beorn7 reported for the k8s SD.
@brian-brazil We are waiting for this to happen again and will report on the ec2 metrics.
Here is the other issue for the k8s updates.
prometheus/discovery/manager.go, lines 157 to 174 in 566c80b
@moshebs I just looked at the code again and we have a 5-second throttle there, so even if there is a delay it should catch up after a few seconds. Are you still able to replicate?
Yes, this is replicated with 2.3.0.
@brian-brazil the ec2 metrics look like this:
prometheus_sd_ec2_refresh_duration_seconds{quantile="0.5"} 1.493494796
In the Prometheus logs I see errors that look like this:
level=error ts=2018-07-22T00:17:13.631398205Z caller=ec2.go:180 component="discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: InvalidInstanceID.NotFound: The instance ID 'i-000257a4fb46f7d47' does not exist\n\tstatus code: 400, request id: 29a35ed6-6ab2-425c-b5df-cd2bae3b5842"
Our configuration for EC2 discovery is a series of jobs that looks more or less like this:
To give you some context, the scenario is as follows: the Prometheus server is up and running, fully synced with EC2. Then machines go down, new ones are created, and we see that Prometheus tries to scrape old, non-existent VMs. Thanks!
@moshebs thanks for the details. AFAICT the error is returned by the EC2 API client and bubbles up to the SD instance. Not sure what can be done here from the Prometheus POV. As a possible workaround, you can try to filter the EC2 instances in the service discovery itself instead of relying on relabel_configs:
- region: us-east-1
  access_key: AKIAFAKEKEY
  secret_key:
  refresh_interval: 1m
  port: 80
  filters:
    - name: tag:Name
      values:
        - vas-admin
Not sure about the syntax, but you should get the idea.
@simonpasquier Thanks for looking into it, we will try using filters and report back. As a side note, I noticed that you upgraded the AWS Go SDK about 5 days ago - perhaps with the new version we will not see the problem any more even when using relabel_config. Looking forward to your next release to try it out!
simonpasquier added the kind/more-info-needed label Aug 7, 2018
@moshebs any update?
@simonpasquier no, not yet, but I hope I will have an update soon.
I can confirm that using filters resolved the issue for me; I haven't seen it since I started using filters. I am leaving the issue open since it might still happen if you use relabel config, but for me the issue is resolved.
Thanks for the heads-up!
moshebs commented Jan 8, 2018
After the upgrade to Prometheus 2.0.0 we started seeing issues with refreshing of EC2 targets when using ec2_sd_configs. What we noticed is that Prometheus keeps trying to scrape terminated instances, while new instances are not scraped.
When we looked at the Prometheus logs we noticed a few error messages. Here are a few examples:
Here is an example of one job configuration we use, they are all very similar:
This can be worked around by restarting Prometheus: the list is then refreshed, but after a while we see the issue again.
Thanks!