agent-deployment-pattern-sidecar.md

File metadata and controls

116 lines (80 loc) · 7.82 KB

Agent Deployment Pattern: Sidecar Injection

Summary

With Prometheus-Operator now supporting Prometheus in Agent mode, we can start thinking about different deployment patterns that this minimal container enables. This document continues the work started by the original Prometheus Agent design document, focusing on how Prometheus-Operator can deploy PrometheusAgents as sidecars running alongside the pods a user wants to monitor.

Background

At the time of writing, Prometheus-Operator can deploy Prometheus in Agent mode, but only using a pattern similar to the original implementation of Prometheus Server: StatefulSets. The original design document for Prometheus Agent already mentions that different deployment patterns are desired; however, to speed up the initial implementation it was decided to re-use the existing logic and start with the Agent running as StatefulSets.

Also in the interest of speed, this document won't cover several new deployment patterns, but only one: sidecar injection.

Looking at the traditional deployment model, we have a single Prometheus (or an HA pair) per cluster or namespace, responsible for scraping all containers under its scope. Prometheus-Operator relies on ServiceMonitor, PodMonitor, and Probe CRs to configure Prometheus, which in turn uses Kubernetes service discovery to find the endpoints that need to be scraped.

Depending on the cluster's scale and how often Prometheus queries the Kubernetes API, Prometheus service discovery can increase the load on the API significantly and degrade the overall functionality of the cluster.

Another problem is that one or more containers can be updated to a problematic version that causes a cardinality spike. Depending on the size of the spike, a single container could crash the monitoring system of the whole cluster.

Traditional Deployment Pattern.

Proposal

This document proposes a new deployment model where Prometheus-Operator injects a Prometheus Agent (plus the Prometheus config reloader) as sidecar containers into pods that need to be scraped. With a sidecar, we tackle both problems mentioned above:

  • Service-discovery load on the Kubernetes API disappears, since discovery is no longer needed. Prometheus scrapes containers from the same pod through their shared network namespace, and scrape configuration can be declared via pod annotations.
  • A sudden cardinality spike will not affect the whole monitoring system. In the worst case, it will take down a single pod.

A common pattern used with Prometheus's Kubernetes service discovery is the use of annotations to declaratively tell Prometheus which endpoints need to be scraped. From a code search on GitHub for prometheus.io/scrape: "true", we can tell that this approach already has good adoption. To avoid conflicting with that commonly used annotation, we can start with our own set, following a very similar approach.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    prometheus.operator.io/scrape: "true"
    prometheus.operator.io/path: "/metrics"
    prometheus.operator.io/port: "8080"
    prometheus.operator.io/scrape-interval: "60s"
spec:
...
```

The existing PrometheusAgent CRD would be extended with a new field called mode, which can (for now) take one of two values: [statefulset, sidecar], with statefulset as the default. If mode is set to sidecar, Prometheus-Operator won't deploy any Prometheus agents initially. Instead, it will watch for Pod updates and inject the Prometheus Agent as a sidecar into pods that carry the pre-determined annotations.
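As a sketch of the decision the operator would make on each Pod update (the annotation keys come from this proposal, but the helper functions and default values below are hypothetical, not actual Prometheus-Operator code):

```go
package main

import "fmt"

// Annotation keys proposed in this document.
const (
	scrapeAnnotation   = "prometheus.operator.io/scrape"
	pathAnnotation     = "prometheus.operator.io/path"
	portAnnotation     = "prometheus.operator.io/port"
	intervalAnnotation = "prometheus.operator.io/scrape-interval"
)

// ScrapeTarget holds the per-pod scrape settings derived from annotations.
type ScrapeTarget struct {
	Path     string
	Port     string
	Interval string
}

// shouldInject reports whether the pod opted into sidecar injection.
func shouldInject(annotations map[string]string) bool {
	return annotations[scrapeAnnotation] == "true"
}

// targetFromAnnotations reads the scrape settings, falling back to
// defaults (the default values here are assumptions, not spec).
func targetFromAnnotations(annotations map[string]string) ScrapeTarget {
	t := ScrapeTarget{Path: "/metrics", Port: "8080", Interval: "30s"}
	if v, ok := annotations[pathAnnotation]; ok {
		t.Path = v
	}
	if v, ok := annotations[portAnnotation]; ok {
		t.Port = v
	}
	if v, ok := annotations[intervalAnnotation]; ok {
		t.Interval = v
	}
	return t
}

func main() {
	ann := map[string]string{scrapeAnnotation: "true", portAnnotation: "9100"}
	if shouldInject(ann) {
		fmt.Printf("inject sidecar scraping localhost:%s\n", targetFromAnnotations(ann).Port)
	}
}
```

In the real operator this logic would live in a mutating webhook or reconcile loop operating on corev1.Pod objects; the map-based version above only shows the annotation contract.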

In addition to selecting the deployment mode, the Agent CR will be the source of truth for remote-write configuration, such as URL and authentication. A change to the remote-write configuration would still require a hot reload of potentially millions of agent sidecar containers, but by keeping the remote-write configuration out of pod annotations we at least avoid requiring that every Pod manifest also be updated.

If different sets of pods require different remote-write configurations, then multiple PrometheusAgent CRs are needed. This means the pod also needs to specify which Agent CR is responsible for injecting its sidecar:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    prometheus.operator.io/scrape: "true"
    prometheus.operator.io/path: "/metrics"
    prometheus.operator.io/port: "8080"
    prometheus.operator.io/scrape-interval: "60s"
    prometheus.operator.io/agent-selector: "monitoring/agent-example"
spec:
...
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: PrometheusAgent
metadata:
  name: agent-example
  namespace: monitoring
spec:
  mode: sidecar
  remoteWrite:
    - url: https://example.com
```

With a visualization:

Sidecar Deployment Pattern

What to do with ServiceMonitor, PodMonitor, and Probe selectors?

With the sidecar approach, our goal is to scale Prometheus horizontally while avoiding impact on the Kubernetes API. It wouldn't make sense for a sidecar to also scrape metrics from other pods.

If mode is set to sidecar, a validating webhook would reject PrometheusAgent CRs that are created or updated with any of the following fields set:

  • serviceMonitorSelector
  • serviceMonitorNamespaceSelector
  • podMonitorSelector
  • podMonitorNamespaceSelector
  • probeSelector
  • probeNamespaceSelector
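The webhook check could look roughly like the sketch below. The AgentSpec type here is a simplified, hypothetical stand-in for the real CRD (the actual spec has many more fields and uses label-selector types rather than strings); only the validation rule itself comes from this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// AgentSpec is a simplified, hypothetical stand-in for the PrometheusAgent
// spec fields relevant to this check.
type AgentSpec struct {
	Mode                            string
	ServiceMonitorSelector          *string
	ServiceMonitorNamespaceSelector *string
	PodMonitorSelector              *string
	PodMonitorNamespaceSelector     *string
	ProbeSelector                   *string
	ProbeNamespaceSelector          *string
}

// validateSidecarSpec enforces the proposed rule: in sidecar mode, all
// monitor/probe selectors must be unset.
func validateSidecarSpec(spec AgentSpec) error {
	if spec.Mode != "sidecar" {
		return nil // selectors remain valid in statefulset mode
	}
	selectors := []*string{
		spec.ServiceMonitorSelector, spec.ServiceMonitorNamespaceSelector,
		spec.PodMonitorSelector, spec.PodMonitorNamespaceSelector,
		spec.ProbeSelector, spec.ProbeNamespaceSelector,
	}
	for _, s := range selectors {
		if s != nil {
			return errors.New("monitor/probe selectors are not allowed when mode is sidecar")
		}
	}
	return nil
}

func main() {
	sel := "app=web"
	// A sidecar-mode CR with a podMonitorSelector would be rejected.
	fmt.Println(validateSidecarSpec(AgentSpec{Mode: "sidecar", PodMonitorSelector: &sel}))
}
```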

Caveats

Config Hot Reload

There are now two ways to change the Prometheus configuration: 1) changing annotations on the pod, and 2) changing the remote-write field in the PrometheusAgent CR. The first triggers a hot reload only for the pod involved, but the latter has the potential to trigger millions of hot reloads, depending on the scale of the cluster.

While there is no research yet on the config-reloader's efficiency, this particular container might become problematic in very large environments.

WAL not optimized for small environments

Prometheus's Write-Ahead Log (WAL) is stored as a sequence of numbered segment files, 128MiB each by default. This means that, by default, at least 128MiB is needed to run Prometheus Agent, even ignoring every other part of Prometheus. With a sidecar we're optimizing for horizontal scale, and 128MiB may be far more than necessary to store metrics from a single pod.

Lack of High-Availability setup

Given that Prometheus is not optimized for very small environments, injecting two sidecars per pod would be a significant waste of resources. With only one sidecar, however, an HA Prometheus setup is not an option.

That said, HA Prometheus seems more critical in the traditional deployment pattern than in the sidecar approach: if Prometheus fails in the former, we lose the monitoring stack for the whole cluster, while in the latter we only lose metrics from a single pod.

References