monitoring gmp

Kadyapam edited this page Jun 19, 2026 · 1 revision

Production monitoring — Google Managed Prometheus (GMP)

The production GKE cluster (gke_noetl-demo-19700101_us-central1_noetl-cluster) runs Google Managed Prometheus, not the VictoriaMetrics stack the kind dev cluster uses. This page is the operator reference for the prod monitoring surface and the CQRS materializer-lag guardrail.

kind vs prod — two different monitoring stacks. On kind-noetl the scrape + alert objects are VictoriaMetrics CRDs (operator.victoriametrics.com/v1beta1: VMServiceScrape, VMRule, evaluated by VMAlert). On prod they are GMP CRDs (monitoring.googleapis.com/v1: PodMonitoring, Rules, evaluated by the GMP managed rule-evaluator). The PromQL and thresholds are kept identical; only the object kind differs. When you change one, change both.

What's running

  • GMP operator + collectors: gke-gmp-system namespace (gmp-operator, collector-* DaemonSet). The collectors scrape targets named by PodMonitoring / ClusterPodMonitoring.
  • OperatorConfig config: gmp-public namespace. Sets the cluster external labels (cluster=noetl-cluster, location=us-central1, project_id=noetl-demo-19700101), enables managed rule evaluation (rules.alerting), and points the managedAlertmanager at secret alertmanager (key alertmanager.yaml) in gmp-public.

GMP does NOT honor prometheus.io/scrape pod annotations. Targets must be named by a PodMonitoring/ClusterPodMonitoring. The noetl workload pods carry prometheus.io/scrape annotations (meaningful under VM on kind) — on prod they are inert. Until the objects below landed (2026-06-19), the noetl namespace had no PodMonitoring, so neither the server nor the worker /metrics reached Managed Prometheus at all.

The noetl monitoring objects

Manifests live in ci/manifests/noetl/gmp/. Applying them is observability-only and non-traffic-affecting.

| Object | Kind | Scrapes / does |
| --- | --- | --- |
| noetl-workers | PodMonitoring | every worker pool's metrics port (9090), /metrics @15s (system-pool, shared rust, subscription pool, legacy) |
| noetl-server | PodMonitoring | the server's http port (8082), /metrics @30s (incl. noetl_event_ingest_published_total) |
| noetl-materializer-lag | Rules | the materializer-lag recording + alert rules (CQRS flip guardrail, noetl/ai-meta#103) |

GMP auto-attaches target labels (cluster, location, namespace, pod, container, job); the materializer rules key off metric labels (stream, consumer), so no relabeling is needed.
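
For orientation, a PodMonitoring of the shape described above can be sketched as follows. This is not a copy of ci/manifests/noetl/gmp/podmonitoring-noetl.yaml. The render_podmonitoring helper is hypothetical, and the app=noetl-server-rust pod label is an assumption borrowed from the Service selector noted in the prep status on this page. The script only prints YAML; it applies nothing.

```shell
#!/bin/sh
# Hypothetical helper: print a GMP PodMonitoring manifest to stdout.
# Args: object name, pod label value (app=...), port name, scrape interval.
render_podmonitoring() {
  cat <<EOF
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: $1
  namespace: noetl
spec:
  selector:
    matchLabels:
      app: $2   # assumption: the pod label; check the real manifest
  endpoints:
    - port: $3
      path: /metrics
      interval: $4
EOF
}

# Server example using the port name and interval from the table above.
render_podmonitoring noetl-server noetl-server-rust http 30s
```

To validate a rendered manifest against the cluster without creating anything, the output could be piped to `kubectl --context $PROD apply --dry-run=server -f -`.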

Apply / verify

PROD=gke_noetl-demo-19700101_us-central1_noetl-cluster
kubectl --context $PROD apply -f ci/manifests/noetl/gmp/podmonitoring-noetl.yaml
kubectl --context $PROD apply -f ci/manifests/noetl/gmp/rules-materializer-lag.yaml
kubectl --context $PROD -n noetl get podmonitoring,rules

# Prove scraping via the Managed Prometheus query API:
TOKEN=$(gcloud auth print-access-token)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://monitoring.googleapis.com/v1/projects/noetl-demo-19700101/location/global/prometheus/api/v1/query" \
  --data-urlencode 'query=up{namespace="noetl"}'
# expect: server pod + every worker pod, value 1
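
The "every pod, value 1" expectation can be checked mechanically instead of by eye. The sketch below counts the series whose value is 1 in a query response. The count_up helper and the sample response are illustrative; the pattern assumes the API returns compact JSON with no spaces inside the value array, which should be confirmed against a real response.

```shell
#!/bin/sh
# Count series whose instant value is "1" in a Prometheus query response
# read from stdin. Assumes compact JSON (no spaces inside the value array).
count_up() {
  grep -o '"value":\[[^]]*,"1"]' | wc -l | tr -d ' '
}

# Illustrative response: two healthy targets and one down target.
sample='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"pod":"a"},"value":[1718800000.1,"1"]},{"metric":{"pod":"b"},"value":[1718800000.1,"1"]},{"metric":{"pod":"c"},"value":[1718800000.1,"0"]}]}}'

# Prints the number of targets reporting up == 1 (2 for the sample above).
printf '%s\n' "$sample" | count_up
```

Against the live cluster, the curl output from the command above would be piped into count_up and the result compared with the number of server and worker pods in the noetl namespace.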

The materializer-lag guardrail (CQRS PUBLISH_ONLY flip)

The Rules object reproduces the kind VMRule exactly: a noetl:materializer_backlog recording rule plus MaterializerBacklogWarning / BacklogCritical / BacklogGrowing / StalledUnderGate / ProjectErrors / AbsentUnderGate alerts. They watch the materializer consumer backlog (noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} + _ack_pending, reported by the worker lag poller on an independent task) and the server publish-rate. Pre-flip (gate off) the publish-rate is 0, so the stall/absent alerts cannot fire and the backlog rule doubles as the green-baseline check (≈0).
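
The green-baseline check (backlog ≈ 0 before the flip) can be scripted against the result of querying noetl:materializer_backlog. The sketch below is an assumption-laden illustration. The sample_values and baseline_is_green helpers, the tolerance argument, and the sample response (including its labels) are hypothetical and are not thresholds or output taken from the Rules object.

```shell
#!/bin/sh
# Pull the sample values out of a compact Prometheus instant-vector response.
sample_values() {
  grep -o '"value":\[[^]]*]' | sed 's/.*,"\([^"]*\)"]$/\1/'
}

# Exit 0 only if at least one sample exists and none exceeds the tolerance.
# Exit 2 if there are no samples at all (an absent series is not "green").
# The tolerance is a hypothetical knob, not a threshold from the Rules object.
baseline_is_green() {
  awk -v tol="$1" 'BEGIN { n = 0; bad = 0 }
       { n++; if ($1 + 0 > tol) bad = 1 }
       END { if (n == 0) exit 2; exit bad }'
}

# Illustrative response for query=noetl:materializer_backlog with backlog 0.
resp='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"consumer":"noetl_materializer"},"value":[1718800000.1,"0"]}]}}'

if printf '%s\n' "$resp" | sample_values | baseline_is_green 0; then
  echo "green"
else
  echo "not green"
fi
```

Treating an empty result as a failure is deliberate: with no samples, the check cannot distinguish a drained backlog from a metric that is not being scraped.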

Full operator procedure — including the prod-specific image-roll prerequisite, the staged flip sequence, and pager wiring — is in runbooks/noetl-cqrs-publish-only-flip.md "Production (GKE) — environment specifics".

Alert delivery (pager)

Alerts route to the GMP managedAlertmanager (OperatorConfig → secret alertmanager in gmp-public), NOT the vmstack Alertmanager. The receiver config is the operator's to provide (it needs the Slack webhook / PagerDuty routing key). Until wired, alerts still evaluate and fire (visible in Cloud Monitoring) — only delivery is missing. Wiring recipe + a templated receiver stub are in the flip runbook's "Alert routing → Production (GKE)" section.

GMP rewrites characters in label names that are not letters, digits, or underscores (such as /, ., and -) to _. The alert label noetl.io/flip-guardrail is therefore matched as noetl_io_flip_guardrail in the Alertmanager route. Confirm the rewritten name on a live firing alert before relying on it.
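
The name to expect in the Alertmanager route can be predicted with a small helper. This sketch assumes every character outside [a-zA-Z0-9_] becomes _, which is what the noetl_io_flip_guardrail example implies (the hyphen changes too). The rewrite_label function is hypothetical and does not replace checking a live alert.

```shell
#!/bin/sh
# Predict the rewritten label name: every character that is not a letter,
# digit, or underscore becomes "_". Assumption inferred from the example above.
rewrite_label() {
  printf '%s\n' "$1" | sed 's/[^a-zA-Z0-9_]/_/g'
}

rewrite_label noetl.io/flip-guardrail   # prints noetl_io_flip_guardrail
```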

Prod CQRS-flip prep status (2026-06-19)

A read-only prep pass reconciled the live cluster:

  • Prod already runs the full Rust stack (the noetl/ai-meta#49 Python→Rust cutover is done; the noetl Service selector is app=noetl-server-rust). The live images are pre-#103 (server-rust:batch-dispatch-v1, noetl-worker-rust:cursor-100).
  • Both flip secrets (NOETL_ENCRYPTION_KEY, noetl-internal-api-token) already exist.
  • The post-#103 images (server v3.29.3 @sha256:6d2de32…, worker v5.35.0) are pushed to the prod Artifact Registry.
  • The GMP monitoring above is applied and verified live.
  • The roll-forward manifests (server-rust-deployment-prod.yaml → v3.29.3, PUBLISH_ONLY=false; worker-system-pool-deployment-prod.yaml → v5.35.0, MATERIALIZER_ENABLED=false) are staged but not applied — they roll live workloads, so the operator applies them per the runbook.
