(feat) extend health checks by adding metric support by gianlucam76 · Pull Request #1816 · projectsveltos/addon-controller

gianlucam76 · 2026-05-29T13:39:22Z

ValidateHealth now supports metric-based checks via a new MetricSource field. The addon-controller (push mode) or the sveltos-applier agent running inside the managed cluster (pull mode) queries a Prometheus-compatible endpoint, collects named scalar values, and exposes them to a Lua script through a global metrics table.

In push mode the MetricSource URL must be reachable from the management cluster, where the addon-controller runs. In pull mode the sveltos-applier agent reaches the metric service directly via its in-cluster DNS name.
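
To make the query step concrete, here is a minimal Go sketch of how a client could turn one query into one named scalar using the standard Prometheus HTTP API (`/api/v1/query`). It is not taken from the addon-controller source: `parseScalar`, `queryMetric`, and the canned response in `main` are hypothetical names and data chosen for illustration, and the choice to reject results with more or fewer than one sample is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
)

// promResponse mirrors the envelope of a Prometheus instant-query response.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string          `json:"resultType"`
		Result     json.RawMessage `json:"result"`
	} `json:"data"`
}

// sampleValue decodes a [timestamp, "value"] pair; the value is a JSON string.
func sampleValue(pair []json.RawMessage) (float64, error) {
	if len(pair) != 2 {
		return 0, fmt.Errorf("expected [timestamp, value], got %d elements", len(pair))
	}
	var s string
	if err := json.Unmarshal(pair[1], &s); err != nil {
		return 0, err
	}
	return strconv.ParseFloat(s, 64)
}

// parseScalar (hypothetical helper) reduces a response body to a single number.
// Assumption: anything other than exactly one sample is treated as an error.
func parseScalar(body []byte) (float64, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	if r.Status != "success" {
		return 0, fmt.Errorf("query failed with status %q", r.Status)
	}
	switch r.Data.ResultType {
	case "scalar":
		var pair []json.RawMessage
		if err := json.Unmarshal(r.Data.Result, &pair); err != nil {
			return 0, err
		}
		return sampleValue(pair)
	case "vector":
		var samples []struct {
			Value []json.RawMessage `json:"value"`
		}
		if err := json.Unmarshal(r.Data.Result, &samples); err != nil {
			return 0, err
		}
		if len(samples) != 1 {
			return 0, fmt.Errorf("expected exactly one sample, got %d", len(samples))
		}
		return sampleValue(samples[0].Value)
	default:
		return 0, fmt.Errorf("unsupported result type %q", r.Data.ResultType)
	}
}

// queryMetric (hypothetical helper) runs one instant query against baseURL.
func queryMetric(baseURL, query string) (float64, error) {
	resp, err := http.Get(baseURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return parseScalar(body)
}

func main() {
	// Canned body in the shape Prometheus returns for an instant query,
	// used here so the sketch runs without a live endpoint.
	body := []byte(`{"status":"success","data":{"resultType":"vector",` +
		`"result":[{"metric":{},"value":[1748525962,"0.02"]}]}}`)
	v, err := parseScalar(body)
	if err != nil {
		panic(err)
	}
	// The named values end up in a map keyed by the query name.
	metrics := map[string]float64{"errorRate": v}
	fmt.Println(metrics)
}
```

In push mode a call like `queryMetric` would originate from the management cluster, and in pull mode from inside the managed cluster, which is why the reachability requirement for the URL differs between the two.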

Example:

  validateHealths:
  - name: error-rate-low
    featureID: Helm
    metricSource:
      url: http://prometheus-server.monitoring.svc:9090
    metricQueries:
    - name: errorRate
      query: >-
        sum(rate(http_requests_errors_total{namespace="my-app"}[5m]))
        /
        sum(rate(http_requests_total{namespace="my-app"}[5m]))
    script: |
      function evaluate()
        if metrics["errorRate"] > 0.05 then
          return {healthy=false, message="error rate above 5%: " .. metrics["errorRate"]}
        end
        return {healthy=true, message=""}
      end

ValidateHealths are skipped during ClusterSummary deletion so that health checks do not block resource teardown when the managed workloads are already gone.


gianlucam76 merged commit ad8e92d into projectsveltos:main May 29, 2026
9 checks passed

gianlucam76 deleted the metrics branch May 29, 2026 14:54

gianlucam76 merged 1 commit into projectsveltos:main from gianlucam76:metrics