
(feat) extend health checks by adding metric support#1816

Merged
gianlucam76 merged 1 commit into
projectsveltos:mainfrom
gianlucam76:metrics
May 29, 2026
@gianlucam76
Member

ValidateHealth now supports metric-based checks via a new MetricSource
field. The addon-controller (push mode) or the sveltos-applier agent
running inside the managed cluster (pull mode) queries a
Prometheus-compatible endpoint, collects named scalar values, and
exposes them to a Lua script through a global `metrics` table.
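
To make "collects named scalar values" concrete, here is a minimal Go sketch of how a reply from the standard Prometheus instant-query endpoint (`/api/v1/query`) could be reduced to one number and stored under the query's name. This is not the code added by this PR: `promResponse`, `sampleValue`, and `scalarFromResponse` are hypothetical names, and treating a single-series vector as a scalar is an assumption about how such a result might be handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse mirrors the parts of a Prometheus /api/v1/query reply used here.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string          `json:"resultType"`
		Result     json.RawMessage `json:"result"`
	} `json:"data"`
}

// sampleValue decodes a [timestamp, "value"] pair into a float64.
func sampleValue(pair []interface{}) (float64, error) {
	if len(pair) != 2 {
		return 0, fmt.Errorf("expected [timestamp, value], got %d elements", len(pair))
	}
	s, ok := pair[1].(string)
	if !ok {
		return 0, fmt.Errorf("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}

// scalarFromResponse reduces an instant-query reply to a single number.
// Assumption: a vector is accepted only when it holds exactly one series.
func scalarFromResponse(body []byte) (float64, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query failed with status %q", resp.Status)
	}
	switch resp.Data.ResultType {
	case "scalar":
		var pair []interface{}
		if err := json.Unmarshal(resp.Data.Result, &pair); err != nil {
			return 0, err
		}
		return sampleValue(pair)
	case "vector":
		var series []struct {
			Value []interface{} `json:"value"`
		}
		if err := json.Unmarshal(resp.Data.Result, &series); err != nil {
			return 0, err
		}
		if len(series) != 1 {
			return 0, fmt.Errorf("expected exactly one series, got %d", len(series))
		}
		return sampleValue(series[0].Value)
	default:
		return 0, fmt.Errorf("unsupported result type %q", resp.Data.ResultType)
	}
}

func main() {
	// Canned reply in the documented instant-query format.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1717000000,"0.5"]}]}}`)
	v, err := scalarFromResponse(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The named value a Lua script would read as metrics["errorRate"].
	metrics := map[string]float64{"errorRate": v}
	fmt.Println(metrics["errorRate"])
}
```

Rejecting empty or multi-series vectors keeps the sketch strict: a script comparing `metrics["errorRate"]` against a threshold needs exactly one number per query name.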

In push mode the MetricSource URL must be reachable from the management
cluster. In pull mode the agent running inside the managed cluster
reaches the metric service directly via its in-cluster DNS name.
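
In either mode the request is a plain HTTP call against the configured base URL; only the component resolving the hostname differs. The sketch below shows one way the request URL could be built in Go. It assumes the standard Prometheus instant-query path `/api/v1/query` is appended to the MetricSource URL; `instantQueryURL` is a hypothetical helper, not a function from this PR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// instantQueryURL appends the Prometheus instant-query path to a base URL
// and attaches the PromQL expression as the "query" parameter.
func instantQueryURL(base, promQL string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("metric source URL %q needs a scheme and host", base)
	}
	// Tolerate a trailing slash on the configured base URL.
	u.Path = strings.TrimSuffix(u.Path, "/") + "/api/v1/query"
	q := url.Values{}
	q.Set("query", promQL)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// In pull mode this in-cluster name resolves inside the managed cluster;
	// in push mode the management cluster must be able to reach the same URL.
	u, err := instantQueryURL("http://prometheus-server.monitoring.svc:9090", "up")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```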

Example:

```yaml
  validateHealths:
  - name: error-rate-low
    featureID: Helm
    # Prometheus-compatible endpoint to query.
    metricSource:
      url: http://prometheus-server.monitoring.svc:9090
    # Each query result is exposed to the script as metrics[<name>].
    metricQueries:
    - name: errorRate
      query: >-
        sum(rate(http_requests_errors_total{namespace="my-app"}[5m]))
        /
        sum(rate(http_requests_total{namespace="my-app"}[5m]))
    script: |
      function evaluate()
        -- metrics is the global table populated from metricQueries.
        if metrics["errorRate"] > 0.05 then
          return {healthy=false, message="error rate above 5%: " .. metrics["errorRate"]}
        end
        return {healthy=true, message=""}
      end
```

ValidateHealths are skipped during ClusterSummary deletion so that
health checks do not block resource teardown when the managed workloads
are already gone.
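
The skip-on-deletion behavior can be pictured with the following Go sketch. It is an illustration, not the controller's code: `clusterSummary`, `healthCheck`, `deletingSummary`, and `runHealthChecks` are hypothetical stand-ins, and the only assumption carried over from Kubernetes is that an object being deleted has a non-nil deletion timestamp.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// clusterSummary is a hypothetical stand-in for the real object;
// only the deletion marker matters for this sketch.
type clusterSummary struct {
	DeletionTimestamp *time.Time
}

// healthCheck returns nil when the checked resources are healthy.
type healthCheck func() error

var errUnhealthy = errors.New("error rate above 5%")

// deletingSummary returns a summary that is marked for deletion.
func deletingSummary() clusterSummary {
	now := time.Now()
	return clusterSummary{DeletionTimestamp: &now}
}

// runHealthChecks runs checks in order and stops at the first failure.
// When the summary is being deleted it runs nothing and reports success,
// so a failing check cannot block teardown.
func runHealthChecks(cs clusterSummary, checks []healthCheck) (int, error) {
	if cs.DeletionTimestamp != nil {
		return 0, nil
	}
	ran := 0
	for _, check := range checks {
		ran++
		if err := check(); err != nil {
			return ran, err
		}
	}
	return ran, nil
}

func main() {
	failing := []healthCheck{func() error { return errUnhealthy }}

	// Live summary: the failing check runs and its error is returned.
	ran, err := runHealthChecks(clusterSummary{}, failing)
	fmt.Println(ran, err)

	// Summary being deleted: no check runs, so nothing blocks cleanup.
	ran, err = runHealthChecks(deletingSummary(), failing)
	fmt.Println(ran, err)
}
```
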
@gianlucam76 gianlucam76 merged commit ad8e92d into projectsveltos:main May 29, 2026
9 checks passed
@gianlucam76 gianlucam76 deleted the metrics branch May 29, 2026 14:54