Discovery for ECS #9310
Thank you for this proposal. Overall the proposal is interesting, and I recognize the need for this additional AWS integration. I have a few comments and questions just from reading the proposal; I am not currently familiar with ECS. Is there any additional metadata? Since you use tags to filter the targets, we can probably expose the tags as additional metadata.
Is this the same thing we use for EC2/Lightsail/Sigv4? Or should we align them all to this new technique as an intermediate step? (We'll have to be careful to stay backwards compatible.)
port_path as explained here might be confusing. Prometheus is generally explicit in its configuration. Could we have:
It's unclear why we would use port 9090 by default; should we simply ask the user to set at least one port? Is there a way to also filter the ports by a portName? I'd note that metrics_path is probably not very useful here since it can be set at the scrape_config level and via relabeling. It is also unclear to me why port_path is a list. Do you plan to verify it against the exposed ports of the containers, or add it anyway for every target? You also plan to filter on ACTIVE tasks. Does this state also cover containers that are starting and terminating?
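For illustration, the more explicit shape the reviewer is asking about might look like the sketch below. This is a hedged guess at the discussion, not a finalized schema: `ecs_sd_configs` and every field name here are assumptions.

```yaml
scrape_configs:
  - job_name: "ecs-tasks"
    # metrics_path stays at the scrape_config level, as the reviewer suggests
    metrics_path: /metrics
    ecs_sd_configs:
      - region: us-east-1            # assumed field name
        cluster: example-cluster     # assumed field name
        ports: [8080, 9100]          # explicit ports instead of port_path
```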
The proposal didn't go into much detail, but additional metadata could include the metadata of the EC2 instance that runs the containers. This requires an additional request to fetch the details of an EC2 instance for every ECS task container, and instance metadata can be useful to identify the internal and external IP addresses of the task. These IPs don't change until the task is killed, so we can cache them in memory rather than querying them again and again.
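The caching idea can be sketched in a few lines. This is illustrative Python, not part of the proposal; `TaskIPCache` and `fetch_fn` are hypothetical names, and the expensive lookup it wraps would in practice be an EC2 DescribeInstances call.

```python
class TaskIPCache:
    """In-memory cache of task IPs, keyed by task ARN.

    IPs are fetched once per task and reused until the task disappears
    from the discovery results (illustrative sketch).
    """

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn      # expensive call, e.g. an EC2 instance lookup
        self._cache = {}            # task ARN -> (internal_ip, external_ip)
        self.fetch_count = 0        # exposed for observability in this sketch

    def get(self, task_arn):
        if task_arn not in self._cache:
            self.fetch_count += 1
            self._cache[task_arn] = self._fetch(task_arn)
        return self._cache[task_arn]

    def prune(self, live_task_arns):
        """Drop entries for tasks no longer returned by ListTasks."""
        self._cache = {arn: ips for arn, ips in self._cache.items()
                       if arn in live_task_arns}
```

Pruning against each discovery run keeps the cache from growing without bound while still avoiding repeated lookups for long-lived tasks.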
This is what they use, nothing new here. It's the standard best-practice mechanism for auth.
I wanted this to allow containers to publish at whatever paths they prefer, but I have no objections to your suggestion; my initial version included a port and a metrics_path just like yours.
Good question: 9090 came from the sidecar I'm writing that will publish ECS infra metrics in the Prometheus format, but it's not a good port to default to. Ports don't have names, so it's not possible to query ports by name. I think I overoptimized this for the sidecar, and expecting users to set at least one port sounds reasonable.
I agree, let me move them to the ecs_sd_configs level.
This means starting and terminating tasks won't be discovered. Starting tasks will only be discovered on the next discovery run, if they have started by the time we query the tasks again.
I updated the configuration in the original proposal. I also turned the port into an int array because a task contains multiple containers that may publish multiple metric handlers.
We know some users use one Prometheus with multiple AWS accounts. Is it planned to support the current EC2 auth parameters?
Given that metrics_path is no longer per port, it would be easier to just keep it at the scrape_configs level and not repeat it at the service discovery level.
Updated the config to add the auth and removed the metrics_path. Not sure if I got the notation right; I'm not very familiar with it.
Updated the labels to be aligned with what we are doing for the ECS exporter, prometheus-community/ecs_exporter#2.
My experience has been with a fork of https://github.com/teralytics/prometheus-ecs-discovery, which is working well. We added rate limiting of AWS API calls and a cache of task definitions between discovery runs, which has pretty much solved our problems with hitting API rate limits. We run one Prometheus instance per AWS region per account, and it manages to discover and scrape a large number of clusters, tasks and containers.

The proposal here wouldn't replace it for our use case, sadly, as we use it to dynamically discover all ECS clusters (we run large multi-tenant accounts where clusters can come and go at any time), so the requirement to specify a cluster in the SD config won't cut it for us. I still think it's a useful addition to Prometheus, though, and I expect we will in future be looking to deploy a Prometheus within each cluster, at which point hardcoding the cluster name would be fine. I'm working on getting the changes we made cleared for contribution back to the original project, which may help inform this design.
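Client-side rate limiting of the kind this fork adds can be as simple as a token bucket placed in front of the AWS client. The sketch below is illustrative, not the fork's actual code; the injectable clock exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allow at most `rate` calls/second on average,
    with bursts up to `capacity` (illustrative sketch)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if a call may proceed now, consuming one token."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A discovery loop would check `allow()` before each ListTasks or DescribeTaskDefinition call and sleep (or serve cached data) when it returns False.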
As well as the AZ name, a label for the AZ ID would be useful, as a parallel to the corresponding label in EC2 service discovery.
Hello, we have the same need. I'm thinking of using https://github.com/teralytics/prometheus-ecs-discovery for the moment, but I would have liked native support. Do you have more information on the progress of this feature? Thank you
We've been considering a few options for how we can support this more natively on ECS, so that the autodiscovery problem becomes a non-problem. If existing solutions are sufficient for now, I'd highly recommend using them. I'll update the proposal once we have something more concrete.
Is anyone using any updated forks of the prometheus-ecs-discovery project? Or any idea if we are to see an
In an earlier evaluation, ECS discovery was rejected due to the API rate limiting issues described in the Discovery section. As of today, there are ECS users publishing Prometheus metrics and using the CloudWatch Agent's Prometheus scraping capabilities. They configure the agent with a task selection mechanism to shard the load among multiple clusters. Influenced by what users already do, we think we can tackle the problem in a couple of ways:
Given that we have this functionality in the CW Agent, not having a similar capability in Prometheus is confusing for ECS users. We would like to fill this gap by contributing an ECS discovery mechanism to Prometheus, and we want to switch to the discovery mechanism proposed here in all our other collection agents (CW Agent, OpenTelemetry Prometheus Receiver, etc.).
Goals
Config
Once implemented, ECS discovery will be supported in the Prometheus config. The example below will query the cluster to discover ECS tasks/containers matching the given task selectors.
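The example config does not survive in this capture of the thread. A plausible sketch, pieced together from the discussion (ports as an int array, EC2-style auth parameters, no metrics_path at the SD level), is below; every field name and value here is an assumption, not the finalized design.

```yaml
scrape_configs:
  - job_name: "ecs"
    ecs_sd_configs:
      - region: eu-west-1
        cluster: example-cluster         # assumed: one cluster per SD config
        # auth fields assumed to mirror ec2_sd_configs
        access_key: <secret>
        secret_key: <secret>
        role_arn: arn:aws:iam::000000000000:role/example-role   # placeholder
        refresh_interval: 60s
        ports: [9090, 9404]              # containers are matched on these ports
```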
Discovery
Discovery is done by periodically polling the ListTasks API. Discovery will only return ACTIVE tasks.
As an improvement, in the future we will switch to a model where we listen to ECS events to be notified about task starts and terminations. This will allow us to call ListTasks once and then rely on the events for changes, as an optimization.
Labels
Prometheus discovery can automatically add ECS task/container labels to the scraped metrics. The discovery will add the following labels:
Authentication & IAM
We will use the default credential provider chain. The following permissions are required:
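The permission list itself is not preserved in this capture of the thread. A minimal IAM policy covering the API calls mentioned in the discussion (ListTasks for discovery, task and container-instance description, and EC2 instance lookups for the instance metadata labels) might look like the sketch below; treat the exact action set as an assumption, not the proposal's final list.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:ListTasks",
        "ecs:DescribeTasks",
        "ecs:DescribeContainerInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```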