
Failures to refresh targets list using ec2_sd_configs #3664

Closed
moshebs opened this Issue Jan 8, 2018 · 36 comments

moshebs commented Jan 8, 2018

After upgrading to Prometheus 2.0.0 we started seeing issues with the refresh of EC2 targets when using ec2_sd_configs: Prometheus keeps trying to scrape terminated instances, while new instances are not scraped.

When we looked at the Prometheus logs we noticed a few error messages. Here are a few examples:

level=error ts=2018-01-05T08:36:24.688393401Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: SerializationError: failed decoding EC2 Query response\ncaused by: strconv.ParseBool: parsing \"\": invalid syntax"
level=error ts=2018-01-01T08:37:49.818761561Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: SerializationError: failed decoding EC2 Query response\ncaused by: parsing time \"2017-12\" as \"2006-01-02T15:04:05Z\": cannot parse \"\" as \"-\""
level=error ts=2018-01-02T03:52:55.101081904Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: SerializationError: failed decoding EC2 Query response\ncaused by: parsing time \"2018-01-01T21:10:00.\" as \"2006-01-02T15:04:05Z\": cannot parse \".\" as \"Z\""
level=error ts=2018-01-02T17:34:49.733028269Z caller=ec2.go:127 component="target manager" discovery=ec2 msg="Refresh failed" err="could not describe instances: InvalidInstanceID.NotFound: The instance IDs 'i-004f91a9661f13faa, i-0790ba434c660d598' do not exist\n\tstatus code: 400, request id: 4a91ee45-ae96-4f12-bbed-f5ca3a9a83f3

Here is an example of one of the job configurations we use; they are all very similar:

- job_name: ecs-task-1
  scrape_interval: 40s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - region: us-east-1
    access_key: <secret>
    secret_key: <secret>
    refresh_interval: 1m
    port: 9999
  relabel_configs:
  - source_labels: [__meta_ec2_tag_tagkey]
    separator: ;
    regex: tagvalue
    replacement: $1
    action: keep

This can be worked around by restarting Prometheus, after which the list is refreshed, but after a while we see the problem again.

Thanks!

krasi-georgiev commented Jan 9, 2018

Is this some lab environment? Could I get some temporary access to replicate and find the culprit?
I am on #prometheus-dev if you want to ping me there.

moshebs commented Jan 9, 2018

@krasi-georgiev - I apologize, I can't really give you access, as this environment includes company IP.
I can try to help in two ways:

  • describe the way to reproduce (and since it happened to us more than once I suspect it should not be too difficult), or
  • collect any additional info required for debugging.
krasi-georgiev commented Jan 9, 2018

Sure, that is understandable.
This will definitely need some replication and testing, so I will try to find some time to deploy a test AWS instance.

moshebs commented Jan 9, 2018

Great, thanks.

Please let me know if anything else is required; I will be happy to help!

krasi-georgiev commented Jan 9, 2018

SD went through some refactoring after 2.0. Master has some bugs, but you can try this branch, which I think should work as expected:
https://github.com/krasi-georgiev/prometheus/tree/discovery-handle-discoverer-updates

The refactoring was about SD in general and not EC2-specific, but it would be good to test it.

krasi-georgiev commented Jan 9, 2018

Let me know if you want to test it, and I can build an executable for this branch.

moshebs commented Jan 9, 2018

Yes, a binary of this branch would help; we will deploy and test it.

Are there any major bugs in master we should be aware of?

krasi-georgiev commented Jan 9, 2018

The only one that I am aware of is with the k8s discovery. When I did the refactoring I used the Consul SD, which sends all targets on every update, and only later discovered that not all discoverers do that.
k8s, for example, sends only the updated target groups, so the branch I pointed to fixes this.
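To illustrate the difference being described, here is a minimal, self-contained sketch; it is not the actual Prometheus code, and the group type, sources, and targets are invented for the example. A manager that merges updates by group source handles both styles of discoverer:

package main

import "fmt"

// group is a simplified stand-in for Prometheus' targetgroup.Group.
type group struct {
	Source  string
	Targets []string
}

func main() {
	state := map[string][]string{} // the manager's view, keyed by group Source

	apply := func(update []*group) {
		for _, g := range update {
			state[g.Source] = g.Targets // an empty Targets slice clears that group
		}
	}

	// Consul-style discoverer: every update carries all known groups.
	apply([]*group{{"svc-a", []string{"10.0.0.1:80"}}, {"svc-b", []string{"10.0.0.2:80"}}})
	// k8s-style discoverer: only the changed group is sent, so the manager
	// must keep the rest of its state instead of replacing it wholesale.
	apply([]*group{{"svc-b", []string{"10.0.0.3:80"}}})

	fmt.Println(state) // map[svc-a:[10.0.0.1:80] svc-b:[10.0.0.3:80]]
}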

Here is the link to the binary for 64-bit Linux:
https://github.com/krasi-georgiev/prometheus/releases/download/discovery-handle-discoverer-updates/prometheus

BTW: do you use a proxy or anything else in between?
The error messages indicate a malformed response, which is usually caused by something altering the responses.

moshebs commented Jan 10, 2018

Thanks, @krasi-georgiev, we will try the fix.

No, I do not use any proxy between Prometheus and the EC2 endpoint.

krasi-georgiev commented Jan 10, 2018

Thanks. And there is also no AWS load balancer or anything else that you think might intercept the calls?

moshebs commented Jan 10, 2018

Prometheus itself is behind a load balancer, but is that relevant? If I understand correctly, the issue is with Prometheus calling the AWS EC2 APIs, right?

I am having issues running the executable you supplied. I am trying to run it in a Docker image based on prom/prometheus:v2.0.0, where I simply replace the executable. I get a strange "not found" error although the file is there and I made sure the permissions are correct. Do you know of anything that could prevent me from running this executable in Docker this way?

krasi-georgiev commented Jan 10, 2018

Yes, you are right, the load balancer shouldn't matter.

Can you try it without Docker, just to see what happens?

moshebs commented Jan 10, 2018

OK, I managed to run it on Ubuntu outside Docker, and it currently looks OK. It will be difficult to really test it this way because we usually deploy on Docker, so I need to deploy it separately, but I will keep it up and running for a while in parallel to v2.0.0.

moshebs commented Jan 16, 2018

Tests so far have not been useful, since this is not a production environment and I cannot reproduce the issue outside production. Any chance of getting a Docker image on Docker Hub that includes the fix, so I can push it to the production environment?

krasi-georgiev commented Jan 16, 2018

This fix is already in master, so the latest Docker image should include it.

moshebs commented Jan 16, 2018

cool - we will try!

moshebs commented Jan 28, 2018

Good news - I believe the issue is indeed resolved. We are using 2.1.0 and we have not seen the issue with this version.

moshebs closed this Jan 28, 2018

moshebs commented Feb 13, 2018

Unfortunately we have seen the issue again with version 2.1.0.

This is the error message we got:

level=error ts=2018-02-13T07:45:04.09857512Z caller=ec2.go:174 component="discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: SerializationError: failed decoding EC2 Query response\ncaused by: parsing time \"2016-09-27T08:10:2\" as \"2006-01-02T15:04:05Z\": cannot parse \"2\" as \"05\""

moshebs reopened this Feb 13, 2018

krasi-georgiev commented Feb 15, 2018

Not sure how we can replicate this.
@brian-brazil do you have any other ideas?

brian-brazil commented Feb 15, 2018

If EC2 is sending incorrectly formatted output, that's something you'll need to take up with Amazon.

moshebs commented Mar 12, 2018

I noticed you are using v1.5.1 of the AWS Go SDK, which was released in November 2016; the current version is 1.13.11. I believe AWS will ask you to upgrade to a newer version first.

I came across another issue with this version: it does not support assuming EC2 instance roles and ECS task roles, which is the recommended approach. I suspect we will see more issues over time.

Any chance of getting this upgraded?
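For reference, newer aws-sdk-go releases resolve credentials through a default chain (environment variables, shared credentials file, then EC2 instance-profile or, in newer versions, ECS task-role credentials) when no static keys are supplied. A minimal sketch of that pattern, with an illustrative region and no claim about how Prometheus itself wires this up:

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// No access_key/secret_key given: the SDK falls back to its default
	// credential chain, which includes the EC2 instance profile and, in
	// newer SDK versions, ECS task-role credentials.
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"),
	}))

	out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{})
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	fmt.Println("reservations:", len(out.Reservations))
}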

krasi-georgiev commented Mar 12, 2018

Sure, an upgrade might help.

cristian-marin-hs commented Apr 26, 2018

Hello,
I noticed the same issue on version 2.2.1: Prometheus keeps trying to scrape an old group of instances after blue/green deployments. We handle instance termination through the EC2 auto-scaling group, so there might be something odd happening when the EC2 service discovery fetches the list of running instances.

moshebs commented Jun 20, 2018

Same problem in 2.3.0, but without the error messages. Prometheus EC2 discovery tries to scrape terminated instances and fails to discover new ones. The way to recreate it is very simple: just launch a few instances, configure Prometheus to find them by tag (example below), and start killing them and creating new ones. Prometheus will eventually fall behind.

- job_name: my_ec2_job
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /prometheus
  scheme: http
  ec2_sd_configs:
  - region: us-east-1
    access_key: WAHTEVER
    secret_key: <secret>
    refresh_interval: 1m
    port: 80
    filters: []
  basic_auth:
    username: ActuatorUser
    password: <secret>
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: my_tag_name
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_project]
    separator: ;
    regex: .*
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_project]
    separator: ;
    regex: (.*)
    target_label: project
    replacement: $1
    action: replace
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: (.*)
    target_label: name
    replacement: $1
    action: replace
  - source_labels: [__meta_ec2_instance_id]
    separator: ;
    regex: (.*)
    target_label: instance_id
    replacement: $1
    action: replace

brian-brazil commented Jun 20, 2018

What do the prometheus_sd_ec2_* metrics look like?

krasi-georgiev commented Jun 21, 2018

That looks similar to what @beorn7 reported for the k8s SD.

moshebs commented Jun 21, 2018

@brian-brazil We are waiting for this to happen again and will report on the ec2 metrics.

krasi-georgiev commented Jun 21, 2018

Here is the other issue, for the k8s updates:
#4124

krasi-georgiev commented Jul 18, 2018

func (m *Manager) runUpdater(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Throttle: at most every 5 seconds, and only if something
			// changed since the last tick, push the full set of groups
			// downstream.
			m.recentlyUpdatedMtx.Lock()
			if m.recentlyUpdated {
				m.syncCh <- m.allGroups()
				m.recentlyUpdated = false
			}
			m.recentlyUpdatedMtx.Unlock()
		}
	}
}

@moshebs I just looked at the code again and we have a 5-second throttle there, so even if there is a delay it should catch up after a few seconds.

Are you still able to replicate?

moshebs commented Jul 22, 2018

Yes, this is replicated with 2.3.0.

@brian-brazil the ec2 metrics look like this:

prometheus_sd_ec2_refresh_duration_seconds{quantile="0.5"} 1.493494796
prometheus_sd_ec2_refresh_duration_seconds{quantile="0.9"} 1.6152914680000001
prometheus_sd_ec2_refresh_duration_seconds{quantile="0.99"} 114.388959524
prometheus_sd_ec2_refresh_duration_seconds_sum 448639.2018253023
prometheus_sd_ec2_refresh_duration_seconds_count 163553
prometheus_sd_ec2_refresh_failures_total 53

In the Prometheus logs I see errors that look like this:

level=error ts=2018-07-22T00:17:13.631398205Z caller=ec2.go:180 component="discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: InvalidInstanceID.NotFound: The instance ID 'i-000257a4fb46f7d47' does not exist\n\tstatus code: 400, request id: 29a35ed6-6ab2-425c-b5df-cd2bae3b5842"

Our configuration for EC2 discovery is a series of jobs that looks more or less like this:

- job_name: ecs-task-vas-admin
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /prometheus
  scheme: http
  ec2_sd_configs:
  - region: us-east-1
    access_key: AKIAFAKEKEY
    secret_key:
    refresh_interval: 1m
    port: 80
    filters: []
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: vas-admin(.*)
    replacement: $1
    action: keep

To give you some context, the scenario is as follows: the Prometheus server is up and running, fully synced with EC2. Then machines go down, new ones are created, and we see that Prometheus tries to scrape the old, no-longer-existing VMs.

Thanks!

simonpasquier commented Jul 23, 2018

@moshebs thanks for the details. AFAICT the error is returned by the EC2 API client and bubbles up to the SD instance. Not sure what can be done here from the Prometheus POV.

As a possible workaround, you could try to filter the EC2 instances in the service discovery itself instead of relying on relabel_configs:

ec2_sd_configs:
  - region: us-east-1
    access_key: AKIAFAKEKEY
    secret_key:
    refresh_interval: 1m
    port: 80
    filters: ["tag:Name": ["vas-admin"]]

Not sure about the exact syntax, but you should get the idea.
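For context, that kind of filter corresponds on the EC2 API side to passing Filters to DescribeInstances, so only matching (and, if wanted, only running) instances are returned at all. A rough aws-sdk-go sketch of the same idea; the tag value, wildcard, and region are illustrative:

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

	// Let EC2 do the filtering instead of discovering everything and
	// dropping targets later with relabel_configs.
	out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			// "*" is an EC2 filter wildcard, roughly matching the vas-admin(.*) regex above.
			{Name: aws.String("tag:Name"), Values: []*string{aws.String("vas-admin*")}},
			{Name: aws.String("instance-state-name"), Values: []*string{aws.String("running")}},
		},
	})
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			fmt.Println(aws.StringValue(i.InstanceId))
		}
	}
}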

moshebs commented Jul 23, 2018

@simonpasquier Thanks for looking into it, we will try using filters and report back.

As a side note, I noticed that you upgraded the AWS Go SDK about 5 days ago - perhaps with the new version we will not see the problem any more, even when using relabel_configs. Looking forward to your next release to try it out!

simonpasquier commented Oct 1, 2018

@moshebs any update?

moshebs commented Oct 2, 2018

@simonpasquier no, not yet, but I hope I will have an update soon.

moshebs commented Oct 25, 2018

I can confirm that using filters resolved the issue for me; I haven't seen it since I started using them.

I am leaving the issue open since it might still happen if you use relabel_configs, but for me the issue is resolved.

simonpasquier commented Oct 25, 2018

Thanks for the heads-up!
