
Discovery for ECS #9310

Open

rakyll opened this issue Sep 7, 2021 · 11 comments

@rakyll

rakyll commented Sep 7, 2021

In an earlier evaluation, ECS discovery was rejected due to the API rate limiting issues described in the discovery section. As of today, there are ECS users who publish Prometheus metrics and use the CloudWatch Agent's Prometheus scraping capabilities. They configure the agent with a task selection mechanism to shard the load among multiple clusters. Influenced by what these users already do, we think we can tackle the problem in a few ways:

  • Asking users to configure the discovery to find a set of matching tasks from a cluster, caching metadata in memory where possible.
  • Querying the initial data with the ECS API and then relying on ECS events to identify new and terminated tasks.
  • Asking users to run Prometheus as a sidecar in their ECS tasks as a last resort.

Given that we already have this functionality in the CW Agent, not having a similar capability in Prometheus is confusing for ECS users. We would like to fill this gap by contributing ECS discovery to Prometheus, and we want to switch to the discovery mechanism provided here in all our other collection agents (CW Agent, OpenTelemetry Prometheus Receiver, etc.).

Goals

  • Discovery will only discover metric endpoints from a single cluster.
  • We will allow users to filter the tasks by the ECS Cluster Query Language and by ECS tags.
  • Users should be able to specify the ports and metrics path on which a task publishes Prometheus metrics. (See the config below for more.)
  • ECS discovery will support both ECS on EC2 and ECS on Fargate.

Config

Once implemented, ECS discovery will be supported in the Prometheus config. The example below will query the cluster to discover ECS tasks/containers matching the given task selectors.

scrape_configs:
  - job_name: ecs-job
    [ metrics_path: <string> ]
    ecs_sd_configs:
      - [ refresh_interval: <string> | default = 720s ]
        [ region: <string> ]
        cluster: <string>
        [ access_key: <string> ]
        [ secret_key: <secret> ]
        [ profile: <string> ]
        [ role_arn: <string> ]
        ports:
          - <int>
        task_selectors:
          - [ service: <string> ]
            [ family: <string> ]
            [ revisions: <int> ]
            [ launch_type: <string> ]
            [ query: <string> ]
            [ tags:
                - <string>: <string> ]
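
For illustration, a filled-in config under the proposed schema might look like the following. The region, cluster name, role ARN, ports, service name, and tag are all hypothetical values; only the field names follow the template above.

scrape_configs:
  - job_name: ecs-job
    metrics_path: /metrics
    ecs_sd_configs:
      - refresh_interval: 720s
        region: us-west-2
        cluster: my-ecs-cluster
        role_arn: arn:aws:iam::123456789012:role/prometheus-ecs-discovery
        ports:
          - 9404
          - 8080
        task_selectors:
          - service: checkout-service
            launch_type: fargate
            tags:
              - team: payments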

Discovery

Discovery is done by periodically polling the ListTasks API. Discovery will only return ACTIVE tasks.

As a future improvement, we will switch to a model where we listen to ECS events to be notified about task starts and terminations. This will allow us to call ListTasks once and rely on events for subsequent changes as an optimization.
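
As a rough illustration of the polling model (not the actual Prometheus implementation), a minimal sketch using the AWS SDK for Go v2 could look like the following. The cluster name, the refresh interval, and the choice to treat the RUNNING desired status as "active" are assumptions for the example.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
	ctx := context.Background()
	// Default credential provider chain, as described in the Authentication & IAM section.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)
	cluster := aws.String("my-cluster") // hypothetical cluster name

	for {
		// List all matching task ARNs in the cluster, paging through results.
		var taskArns []string
		var nextToken *string
		for {
			out, err := client.ListTasks(ctx, &ecs.ListTasksInput{
				Cluster:       cluster,
				DesiredStatus: types.DesiredStatusRunning, // only tasks expected to be running
				NextToken:     nextToken,
			})
			if err != nil {
				log.Fatal(err)
			}
			taskArns = append(taskArns, out.TaskArns...)
			if out.NextToken == nil {
				break
			}
			nextToken = out.NextToken
		}

		// DescribeTasks accepts at most 100 ARNs per call, so batch the lookups.
		for i := 0; i < len(taskArns); i += 100 {
			end := i + 100
			if end > len(taskArns) {
				end = len(taskArns)
			}
			desc, err := client.DescribeTasks(ctx, &ecs.DescribeTasksInput{
				Cluster: cluster,
				Tasks:   taskArns[i:end],
			})
			if err != nil {
				log.Fatal(err)
			}
			for _, t := range desc.Tasks {
				// A real discoverer would turn these into targets with __meta_ecs_* labels.
				fmt.Println(aws.ToString(t.TaskDefinitionArn), t.LaunchType, aws.ToString(t.AvailabilityZone))
			}
		}

		time.Sleep(720 * time.Second) // matches the default refresh_interval above
	}
}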

Labels

Prometheus discovery can automatically add ECS task/container labels to the scraped metrics. The discovery will add the following labels:

Label                            Source       Type    Description
__meta_ecs_cluster               ECS Cluster  string  ECS cluster name.
__meta_ecs_task_launch_type      ECS Task     string  "ec2" or "fargate".
__meta_ecs_task_family           ECS Task     string  ECS task family.
__meta_ecs_task_family_revision  ECS Task     string  ECS task family revision.
__meta_ecs_task_az               ECS Task     string  Availability zone.
__meta_ecs_ec2_instance_id       EC2          string  EC2 instance ID for the EC2 launch type; "fargate" otherwise.
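
As a sketch of how these labels could be used once available, standard relabel_configs would apply as usual. Only the __meta_ecs_* label names come from this proposal; the rest is ordinary Prometheus relabeling.

relabel_configs:
  # Keep only targets from tasks launched on Fargate.
  - source_labels: [__meta_ecs_task_launch_type]
    regex: fargate
    action: keep
  # Attach the cluster and task family as regular labels on every scraped series.
  - source_labels: [__meta_ecs_cluster]
    target_label: ecs_cluster
  - source_labels: [__meta_ecs_task_family]
    target_label: ecs_task_family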

Authentication & IAM

We will use the default credential provider chain. The following permissions are required:

  • ec2:DescribeInstances
  • ecs:ListTasks
  • ecs:DescribeContainerInstances
  • ecs:DescribeTasks
@roidelapluie
Member

Thank you for this proposal.

Overall the proposal is interesting and I recognize the need for this additional AWS integration. I am currently not familiar with ECS, so I have a few comments and questions just from reading your proposal.

Is there any additional metadata? The fact that you use tags to filter the targets means that we can probably expose the tags as additional metadata.

We will use the default credential provider chain.

Is it the same thing we use for EC2/Lightsail/SigV4? Or should we align them all to this new technique as an intermediate step? (We'll have to be careful and stay backwards compatible.)

port_path

port_path as explained here might be confusing. Prometheus is generally explicit in its configuration. Could we have:

- port: <int> | default 80
  metrics_path: <string> | default /metrics

It's unclear why we would use port 9090 by default; should we simply ask the user to set at least one port? Is there a way to also filter the ports by a portName?

I'd note that metrics_path is probably not really useful here since it can be set at the scrape_config level and via relabeling.

It is also unclear to me why port_path is a list. Do you plan to verify it against the exposed ports of the containers or add it anyway for every target?

You also plan to filter on ACTIVE tasks. Does this state also cover the containers that are starting and terminating?

@rakyll
Author

rakyll commented Sep 7, 2021

The proposal didn't go into much detail, but additional metadata could be the metadata of the EC2 instance that runs the containers. It requires an additional request to get the details of an EC2 instance for every ECS task container, and the instance metadata can be useful to identify the internal and external IP addresses of the task. These IPs don't change until the task is killed, so we can cache them in memory rather than having to query them again and again.
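
A minimal sketch of that caching idea, assuming the AWS SDK for Go v2; the cache type, its field names, and the lack of any eviction are illustrative only.

package ecsdiscovery

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// instanceCache memoizes DescribeInstances results so the EC2 instance behind
// each ECS task only needs to be described once while the task is alive.
type instanceCache struct {
	mu     sync.Mutex
	client *ec2.Client
	byID   map[string]types.Instance
}

func (c *instanceCache) get(ctx context.Context, instanceID string) (types.Instance, error) {
	c.mu.Lock()
	if inst, ok := c.byID[instanceID]; ok {
		c.mu.Unlock()
		return inst, nil
	}
	c.mu.Unlock()

	out, err := c.client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return types.Instance{}, err
	}
	inst := out.Reservations[0].Instances[0] // assumes the instance exists

	c.mu.Lock()
	c.byID[instanceID] = inst
	c.mu.Unlock()
	return inst, nil
}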

Is it the same thing we use for EC2/Lightsail/SigV4? Or should we align them all to this new technique as an intermediate step? (We'll have to be careful and stay backwards compatible.)

This is what they already use; nothing new here. It's the standard best-practice mechanism for auth.

port_path as explained here might be confusing. Prometheus is generally explicit in its configuration. Could we have:

I wanted this to allow containers to publish at whatever paths they prefer, but I have no objections to your suggestion, and my initial version included a port and a metrics_path just like yours.

It's unclear why we would use port 9090 by default; should we simply ask the user to set at least one port? Is there a way to also filter the ports by a portName?

Good question. 9090 came from the sidecar I'm writing that will publish ECS infra metrics in the Prometheus format, but it's not a good port to default to. Ports don't have names, so it's not possible to query ports by name. I think I over-optimized this for the sidecar; expecting users to set at least one port sounds reasonable.

I'd note that metrics_path is probably not really useful here since it can be set at the scrape_config level and via relabeling.

I agree, let me move them to the ecs_sd_configs level.

You also plan to filter on ACTIVE tasks. Does this state also cover the containers that are starting and terminating?

This means starting and terminating tasks won't be discovered. Starting tasks will only be discovered on the next discovery run, provided they have started by the time we query the tasks again.

@rakyll
Author

rakyll commented Sep 7, 2021

I updated the configuration in the original proposal. I also turned the port into an int array, because a task may contain multiple containers, each of which may expose its own metrics handler.

@roidelapluie
Member

We know some users use one Prometheus with multiple AWS accounts. Is it planned to support the current EC2 auth parameters?

# The AWS API keys. If blank, the environment variables `AWS_ACCESS_KEY_ID`
# and `AWS_SECRET_ACCESS_KEY` are used.
[ access_key: <string> ]
[ secret_key: <secret> ]
# Named AWS profile used to connect to the API.
[ profile: <string> ]

# AWS Role ARN, an alternative to using AWS API keys.
[ role_arn: <string> ]

Given that metrics_path is no longer per port, it would be easier to just keep it at the scrape_configs level and not repeat it at the service discovery level.

@rakyll
Author

rakyll commented Sep 8, 2021

Updated the config to add the auth options and removed the metrics_path. Not sure if I got the notation right; I'm not very familiar with it.

@roidelapluie
Member

prometheus_scrape: "true" should probably be string: string, otherwise LGTM.

@rakyll
Author

rakyll commented Sep 14, 2021

Updated the labels to be aligned with what we are doing for the ECS exporter, prometheus-community/ecs_exporter#2.

@rhowe

rhowe commented Sep 22, 2021

My experience has been with a fork of https://github.com/teralytics/prometheus-ecs-discovery, which is working well. We added rate limiting of AWS API calls and a cache of task definitions between discovery runs, which has pretty much solved the problems with hitting API rate limits. We run one Prometheus instance per AWS region per account, and it manages to discover and scrape a large number of clusters, tasks and containers.
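
For illustration, a minimal sketch of the rate-limiting part, assuming golang.org/x/time/rate wrapped around the AWS SDK for Go v2; the limit values and the function name are illustrative only.

package ecsdiscovery

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"golang.org/x/time/rate"
)

// Allow at most 5 ECS API calls per second with a burst of 10 (illustrative numbers).
var ecsLimiter = rate.NewLimiter(rate.Limit(5), 10)

// throttledListTasks blocks until the limiter grants a token before issuing the
// call, so repeated discovery runs cannot exhaust the ECS API quota.
func throttledListTasks(ctx context.Context, client *ecs.Client, in *ecs.ListTasksInput) (*ecs.ListTasksOutput, error) {
	if err := ecsLimiter.Wait(ctx); err != nil {
		return nil, err
	}
	return client.ListTasks(ctx, in)
}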

The proposal here wouldn't replace it for our use case, sadly, as we are using it to dynamically discover all ECS clusters (we run large multi-tenant accounts where clusters can come and go at any time), so the requirement to specify a cluster in the SD config won't cut it for us. I still think it's a useful addition to Prometheus, though, and I expect we will in future be looking to deploy a Prometheus within each cluster, at which point hardcoding the cluster name would be fine.

I'm working on getting the changes we made cleared for contribution back to the original project, which may be helpful to inform this design?

prometheus-ecs-sd works by expecting containers within tasks to specify their scrape config via their dockerLabels map. Exposing a container's dockerLabels as, e.g., __meta_ecs_container_dockerlabel_<name> labels would allow the same behaviour via a relabel config, e.g. allowing a per-container scrape path, port or scheme, as well as selection of targets via their docker labels. This would match up with what Prometheus's Kubernetes service discovery does too.
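
A sketch of what that could look like in a scrape config, assuming the hypothetical __meta_ecs_container_dockerlabel_<name> labels suggested above were exposed; the docker label names prometheus_scrape, prometheus_path and prometheus_port are made up for the example, mirroring the common Kubernetes annotation pattern.

relabel_configs:
  # Only scrape containers that opt in via a docker label.
  - source_labels: [__meta_ecs_container_dockerlabel_prometheus_scrape]
    regex: "true"
    action: keep
  # Let each container override its metrics path via a docker label.
  - source_labels: [__meta_ecs_container_dockerlabel_prometheus_path]
    regex: (.+)
    target_label: __metrics_path__
  # Let each container override its scrape port via a docker label.
  - source_labels: [__address__, __meta_ecs_container_dockerlabel_prometheus_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__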

As well as the AZ name, a label for the AZ ID would be useful (as a parallel to EC2 service discovery's __meta_ec2_availability_zone_id label). This is not available in the ECS ListTasks API response though - that only has availabilityZone: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Task.html#ECS-Type-Task-availabilityZone. An opportunity for an API enhancement request perhaps?

@euskadi31

Hello, we have the same need. I'm thinking of going with https://github.com/teralytics/prometheus-ecs-discovery for the moment, but I would have liked native support. Do you have more information on the progress of this feature?

Thank You

@rakyll
Author

rakyll commented Feb 16, 2022

We've been considering a few options for how we can support this more natively on ECS, so that the autodiscovery problem becomes a non-problem. If the existing solutions are sufficient for now, I'd highly recommend using them. I'll update the proposal once we have something more concrete.

@bmariesan

Is anyone using any updated forks of the prometheus-ecs-discovery project? Or any idea if we are to see an ecs_sd_configs anytime soon? Asking this as I've recently seen in the operator that we have the equivalent for EC2.
