
Discovery for ECS #9310

Open

rakyll opened this issue Sep 7, 2021 · 11 comments

@rakyll

rakyll commented Sep 7, 2021

In an earlier evaluation, ECS discovery was rejected due to the API rate limiting issues described in the discovery section. As of today, there are ECS users who publish Prometheus metrics and use the CloudWatch Agent's Prometheus scraping capabilities. They configure the agent with a task selection mechanism to shard the load among multiple clusters. Influenced by what these users already do, we think we can tackle the problem in a few ways:

  • Asking users to configure the discovery to find a set of matching tasks from a cluster, caching metadata in memory where possible.
  • Querying the initial data with the ECS API and then relying on ECS events to identify new and terminated tasks.
  • Asking users to run Prometheus as a sidecar in their ECS tasks as a last resort.

Given that we already have this functionality in the CW Agent, not having a similar capability in Prometheus is confusing for ECS users. We would like to fill this gap by contributing ECS discovery to Prometheus, and we want to switch to the discovery mechanism provided here in all our other collection agents (CW Agent, OpenTelemetry Prometheus Receiver, etc.).

Goals

  • Discovery will only discover metric endpoints from a single cluster.
  • We will allow users to filter the tasks by the ECS Cluster Query Language and by ECS tags.
  • Users should be able to specify the ports and metrics path on which a task publishes Prometheus metrics. (See the config below for more.)
  • ECS discovery will support both ECS on EC2 and ECS on Fargate.

Config

Once implemented, ECS discovery will be supported in the Prometheus config. The example below will query the cluster to discover ECS tasks/containers matching the given task selectors.

scrape_configs:
  - job_name: ecs-job
    [ metrics_path: <string> ]
    ecs_sd_configs:
      - [ refresh_interval: <string> | default = 720s ]
        [ region: <string> ]
        cluster: <string>
        [ access_key: <string> ]
        [ secret_key: <secret> ]
        [ profile: <string> ]
        [ role_arn: <string> ]
        ports:
          - <int>
        task_selectors:
          - [ service: <string> ]
            [ family: <string> ]
            [ revisions: <int> ]
            [ launch_type: <string> ]
            [ query: <string> ]
            [ tags:
                - <string>: <string> ]
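
For illustration, a filled-in config under the proposed schema might look like the following. The region, cluster name, role ARN, ports, service name, and tag are all hypothetical values; only the field names follow the template above.

scrape_configs:
  - job_name: ecs-job
    metrics_path: /metrics
    ecs_sd_configs:
      - refresh_interval: 720s
        region: us-west-2
        cluster: my-ecs-cluster
        role_arn: arn:aws:iam::123456789012:role/prometheus-ecs-discovery
        ports:
          - 9404
          - 8080
        task_selectors:
          - service: checkout-service
            launch_type: fargate
            tags:
              - team: payments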

Discovery

Discovery is done by periodically polling the ListTasks API. Discovery will only return ACTIVE tasks.

As a future improvement, we will switch to a model where we listen to ECS events to be notified about task starts and terminations. This will allow us to call ListTasks once and rely on events for subsequent changes as an optimization.
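
As a rough illustration of the polling model (not the actual Prometheus implementation), a minimal sketch using the AWS SDK for Go v2 could look like the following. The cluster name, the refresh interval, and the choice to treat the RUNNING desired status as "active" are assumptions for the example.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
	ctx := context.Background()
	// Default credential provider chain, as described in the Authentication & IAM section.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)
	cluster := aws.String("my-cluster") // hypothetical cluster name

	for {
		// List all matching task ARNs in the cluster, paging through results.
		var taskArns []string
		var nextToken *string
		for {
			out, err := client.ListTasks(ctx, &ecs.ListTasksInput{
				Cluster:       cluster,
				DesiredStatus: types.DesiredStatusRunning, // only tasks expected to be running
				NextToken:     nextToken,
			})
			if err != nil {
				log.Fatal(err)
			}
			taskArns = append(taskArns, out.TaskArns...)
			if out.NextToken == nil {
				break
			}
			nextToken = out.NextToken
		}

		// DescribeTasks accepts at most 100 ARNs per call, so batch the lookups.
		for i := 0; i < len(taskArns); i += 100 {
			end := i + 100
			if end > len(taskArns) {
				end = len(taskArns)
			}
			desc, err := client.DescribeTasks(ctx, &ecs.DescribeTasksInput{
				Cluster: cluster,
				Tasks:   taskArns[i:end],
			})
			if err != nil {
				log.Fatal(err)
			}
			for _, t := range desc.Tasks {
				// A real discoverer would turn these into targets with __meta_ecs_* labels.
				fmt.Println(aws.ToString(t.TaskDefinitionArn), t.LaunchType, aws.ToString(t.AvailabilityZone))
			}
		}

		time.Sleep(720 * time.Second) // matches the default refresh_interval above
	}
}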

Labels

Prometheus discovery can automatically add ECS task/container labels to the scraped metrics. The discovery will add the following labels:

Label                            Source       Type    Description
__meta_ecs_cluster               ECS Cluster  string  ECS cluster name.
__meta_ecs_task_launch_type      ECS Task     string  "ec2" or "fargate".
__meta_ecs_task_family           ECS Task     string  ECS task family.
__meta_ecs_task_family_revision  ECS Task     string  ECS task family revision.
__meta_ecs_task_az               ECS Task     string  Availability zone.
__meta_ecs_ec2_instance_id       EC2          string  EC2 instance ID for the EC2 launch type; "fargate" otherwise.
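
As a sketch of how these labels could be used once available, standard relabel_configs would apply as usual. Only the __meta_ecs_* label names come from this proposal; the rest is ordinary Prometheus relabeling.

relabel_configs:
  # Keep only targets from tasks launched on Fargate.
  - source_labels: [__meta_ecs_task_launch_type]
    regex: fargate
    action: keep
  # Attach the cluster and task family as regular labels on every scraped series.
  - source_labels: [__meta_ecs_cluster]
    target_label: ecs_cluster
  - source_labels: [__meta_ecs_task_family]
    target_label: ecs_task_family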

Authentication & IAM

We will use the default credential provider chain. The following permissions are required:

  • ec2:DescribeInstances
  • ecs:ListTasks
  • ecs:DescribeContainerInstances
  • ecs:DescribeTasks
@roidelapluie
Member

Thank you for this proposal.

Overall the proposal is interesting and I recognize the need for this additional AWS integration. I am currently not familiar with ECS, so I have a few comments and questions just from reading your proposal.

Is there any additional metadata? The fact that you use tags to filter the targets means that we can probably expose the tags as additional metadata.

We will use the default credential provider chain.

Is it the same thing we use for EC2/Lightsail/SigV4? Or should we align them all to this new technique as an intermediate step? (We'll have to be careful and stay backwards compatible.)

port_path

port_path as explained here might be confusing. Prometheus is generally explicit in its configuration. Could we have:

- port: <int> | default 80
  metrics_path: <string> | default /metrics

It's unclear why we would use port 9090 by default; should we simply ask the user to set at least one port? Is there a way to also filter the ports by a portName?

I'd note that metrics_path is probably not really useful here since it can be set at the scrape_config level and via relabeling.

It is also unclear to me why port_path is a list. Do you plan to verify it against the exposed ports of the containers or add it anyway for every target?

You also plan to filter on ACTIVE tasks. Does this state also cover the containers that are starting and terminating?

@rakyll
Author

rakyll commented Sep 7, 2021

The proposal didn't go into much detail, but additional metadata could be the metadata of the EC2 instance that runs the containers. It requires an additional request to get the details of an EC2 instance for every ECS task container, and the instance metadata can be useful to identify the internal and external IP addresses of the task. These IPs don't change until the task is killed, so we can cache them in memory rather than having to query them again and again.
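
A minimal sketch of that caching idea, assuming the AWS SDK for Go v2; the cache type, its field names, and the lack of any eviction are illustrative only.

package ecsdiscovery

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// instanceCache memoizes DescribeInstances results so the EC2 instance behind
// each ECS task only needs to be described once while the task is alive.
type instanceCache struct {
	mu     sync.Mutex
	client *ec2.Client
	byID   map[string]types.Instance
}

func (c *instanceCache) get(ctx context.Context, instanceID string) (types.Instance, error) {
	c.mu.Lock()
	if inst, ok := c.byID[instanceID]; ok {
		c.mu.Unlock()
		return inst, nil
	}
	c.mu.Unlock()

	out, err := c.client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return types.Instance{}, err
	}
	inst := out.Reservations[0].Instances[0] // assumes the instance exists

	c.mu.Lock()
	c.byID[instanceID] = inst
	c.mu.Unlock()
	return inst, nil
}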

Is it the same thing we use for EC2/Lightsail/SigV4? Or should we align them all to this new technique as an intermediate step? (We'll have to be careful and stay backwards compatible.)

This is what they already use; nothing new here. It's the standard best-practice mechanism for auth.

port_path as explained here might be confusing. Prometheus is generally explicit in its configuration. Could we have:

I wanted this to allow containers to publish at whatever paths they prefer, but I have no objections to your suggestion, and my initial version included a port and a metrics_path just like yours.

It's unclear why we would use port 9090 by default; should we simply ask the user to set at least one port? Is there a way to also filter the ports by a portName?

Good question. 9090 came from the sidecar I'm writing that will publish ECS infra metrics in the Prometheus format, but it's not a good port to default to. Ports don't have names, so it's not possible to query ports by name. I think I over-optimized this for the sidecar; expecting users to set at least one port sounds reasonable.

I'd note that metrics_path is probably not really useful here since it can be set at the scrape_config level and via relabeling.

I agree, let me move them to the ecs_sd_configs level.

You also plan to filter on ACTIVE tasks. Does this state also cover the containers that are starting and terminating?

This means starting and terminating tasks won't be discovered. Starting tasks will only be discovered on the next discovery run, provided they have started by the time we query the tasks again.

@rakyll
Author

rakyll commented Sep 7, 2021

I updated the configuration in the original proposal. I also turned the port into an int array, because a task may contain multiple containers, each of which may expose its own metrics handler.

@roidelapluie
Member

We know some users use one Prometheus with multiple AWS accounts. Is it planned to support the current EC2 auth parameters?

# The AWS API keys. If blank, the environment variables `AWS_ACCESS_KEY_ID`
# and `AWS_SECRET_ACCESS_KEY` are used.
[ access_key: <string> ]
[ secret_key: <secret> ]
# Named AWS profile used to connect to the API.
[ profile: <string> ]

# AWS Role ARN, an alternative to using AWS API keys.
[ role_arn: <string> ]

Given that metrics_path is no longer per port, it would be easier to just keep it at the scrape_configs level and not repeat it at the service discovery level.

@rakyll
Author

rakyll commented Sep 8, 2021

Updated the config to add the auth options and removed the metrics_path. Not sure if I got the notation right; I'm not very familiar with it.

@roidelapluie
Member

prometheus_scrape: "true" should probably be string: string, otherwise LGTM.

@rakyll
Author

rakyll commented Sep 14, 2021

Updated the labels to be aligned with what we are doing for the ECS exporter, prometheus-community/ecs_exporter#2.

@rhowe

rhowe commented Sep 22, 2021

My experience has been with a fork of https://github.com/teralytics/prometheus-ecs-discovery, which is working well. We added rate limiting of AWS API calls and a cache of task definitions between discovery runs, which has pretty much solved the problems with hitting API rate limits. We run one Prometheus instance per AWS region per account, and it manages to discover and scrape a large number of clusters, tasks and containers.
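
For illustration, a minimal sketch of the rate-limiting part, assuming golang.org/x/time/rate wrapped around the AWS SDK for Go v2; the limit values and the function name are illustrative only.

package ecsdiscovery

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"golang.org/x/time/rate"
)

// Allow at most 5 ECS API calls per second with a burst of 10 (illustrative numbers).
var ecsLimiter = rate.NewLimiter(rate.Limit(5), 10)

// throttledListTasks blocks until the limiter grants a token before issuing the
// call, so repeated discovery runs cannot exhaust the ECS API quota.
func throttledListTasks(ctx context.Context, client *ecs.Client, in *ecs.ListTasksInput) (*ecs.ListTasksOutput, error) {
	if err := ecsLimiter.Wait(ctx); err != nil {
		return nil, err
	}
	return client.ListTasks(ctx, in)
}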

The proposal here wouldn't replace it for our use case, sadly, as we are using it to dynamically discover all ECS clusters (we run large multi-tenant accounts where clusters can come and go at any time), so the requirement to specify a cluster in the SD config won't cut it for us. I still think it's a useful addition to Prometheus, though, and I expect we will in future be looking to deploy a Prometheus within each cluster, at which point hardcoding the cluster name would be fine.

I'm working on getting the changes we made cleared for contribution back to the original project, which may be helpful to inform this design?

prometheus-ecs-sd works by expecting containers within tasks to specify their scrape config via their dockerLabels map. Exposing a container's dockerLabels as, e.g., __meta_ecs_container_dockerlabel_<name> labels would allow the same behaviour via a relabel config, e.g. allowing a per-container scrape path, port or scheme, as well as selection of targets via their docker labels. This would match up with what Prometheus's Kubernetes service discovery does too.
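
A sketch of what that could look like in a scrape config, assuming the hypothetical __meta_ecs_container_dockerlabel_<name> labels suggested above were exposed; the docker label names prometheus_scrape, prometheus_path and prometheus_port are made up for the example, mirroring the common Kubernetes annotation pattern.

relabel_configs:
  # Only scrape containers that opt in via a docker label.
  - source_labels: [__meta_ecs_container_dockerlabel_prometheus_scrape]
    regex: "true"
    action: keep
  # Let each container override its metrics path via a docker label.
  - source_labels: [__meta_ecs_container_dockerlabel_prometheus_path]
    regex: (.+)
    target_label: __metrics_path__
  # Let each container override its scrape port via a docker label.
  - source_labels: [__address__, __meta_ecs_container_dockerlabel_prometheus_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__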

As well as the AZ name, a label for the AZ ID would be useful (as a parallel to EC2 service discovery's __meta_ec2_availability_zone_id label). This is not available in the ECS ListTasks API response though - that only has availabilityZone: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Task.html#ECS-Type-Task-availabilityZone. An opportunity for an API enhancement request perhaps?

@euskadi31

Hello, we have the same need. I'm thinking of going with https://github.com/teralytics/prometheus-ecs-discovery for the moment, but I would have liked native support. Do you have more information on the progress of this feature?

Thank You

@rakyll
Author

rakyll commented Feb 16, 2022

We've been considering a few options for how we can support this more natively on ECS, so that the autodiscovery problem becomes a non-problem. If the existing solutions are sufficient for now, I'd highly recommend using them. I'll update the proposal once we have something more concrete.

@bmariesan

Is anyone using any updated forks of the prometheus-ecs-discovery project? Or any idea if we are to see an ecs_sd_configs anytime soon? Asking this as I've recently seen in the operator that we have the equivalent for EC2.
