monitoring gmp
The production GKE cluster (gke_noetl-demo-19700101_us-central1_noetl-cluster)
runs Google Managed Prometheus, not the VictoriaMetrics stack the kind dev
cluster uses. This page is the operator reference for the prod monitoring
surface and the CQRS materializer-lag guardrail.
kind vs prod: two different monitoring stacks. On kind-noetl the scrape + alert objects are VictoriaMetrics CRDs (operator.victoriametrics.com/v1beta1: VMServiceScrape, VMRule, evaluated by VMAlert). On prod they are GMP CRDs (monitoring.googleapis.com/v1: PodMonitoring, Rules, evaluated by the GMP managed rule-evaluator). The PromQL and thresholds are kept identical; only the object kind differs. When you change one, change both.
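The "change one, change both" rule can be checked mechanically. The following is a minimal sketch, assuming every rule expression sits on a single `expr:` line (multi-line `expr: |` blocks are not handled). The function names are hypothetical and not part of the repo, and the path to the kind VMRule manifest is up to the caller.

```shell
# extract_exprs FILE: print the value of every single-line `expr:` key, sorted.
# Assumption: expressions are not written as multi-line YAML block scalars.
extract_exprs() {
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*expr:[[:space:]]*//p' "$1" | sort
}

# rules_in_sync VM_FILE GMP_FILE: succeed only if both manifests carry
# exactly the same set of expressions, regardless of object kind.
rules_in_sync() {
  [ "$(extract_exprs "$1")" = "$(extract_exprs "$2")" ]
}
```

Calling rules_in_sync with the kind VMRule manifest and ci/manifests/noetl/gmp/rules-materializer-lag.yaml exits non-zero when the two have drifted, which makes it usable as a pre-merge check.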
- GMP operator + collectors: gke-gmp-system namespace (gmp-operator, collector-* DaemonSet). The collectors scrape targets named by PodMonitoring / ClusterPodMonitoring.
- OperatorConfig config: gmp-public namespace. Sets the cluster external labels (cluster=noetl-cluster, location=us-central1, project_id=noetl-demo-19700101), enables managed rule evaluation (rules.alerting), and points the managed Alertmanager at secret alertmanager (key alertmanager.yaml) in gmp-public.
GMP does NOT honor prometheus.io/scrape pod annotations. Targets must be named by a PodMonitoring / ClusterPodMonitoring. The noetl workload pods carry prometheus.io/scrape annotations (meaningful under VM on kind); on prod they are inert. Until the objects below landed (2026-06-19), the noetl namespace had no PodMonitoring, so neither the server nor the worker /metrics reached Managed Prometheus at all.
Manifests live in ci/manifests/noetl/gmp/.
Applying them is observability-only and non-traffic-affecting.
| Object | Kind | Scrapes / does |
|---|---|---|
| noetl-workers | PodMonitoring | every worker pool's metrics port (9090) /metrics @15s: system-pool, shared rust, subscription pool, legacy |
| noetl-server | PodMonitoring | the server's http port (8082) /metrics @30s, incl. noetl_event_ingest_published_total |
| noetl-materializer-lag | Rules | the materializer-lag recording + alert rules (CQRS flip guardrail, noetl/ai-meta#103) |
GMP auto-attaches target labels (cluster, location, namespace, pod,
container, job); the materializer rules key off metric labels
(stream, consumer), so no relabeling is needed.
PROD=gke_noetl-demo-19700101_us-central1_noetl-cluster
kubectl --context $PROD apply -f ci/manifests/noetl/gmp/podmonitoring-noetl.yaml
kubectl --context $PROD apply -f ci/manifests/noetl/gmp/rules-materializer-lag.yaml
kubectl --context $PROD -n noetl get podmonitoring,rules
# Prove scraping via the Managed Prometheus query API:
TOKEN=$(gcloud auth print-access-token)
curl -s -H "Authorization: Bearer $TOKEN" \
"https://monitoring.googleapis.com/v1/projects/noetl-demo-19700101/location/global/prometheus/api/v1/query" \
--data-urlencode 'query=up{namespace="noetl"}'
# expect: server pod + every worker pod, value 1

The Rules object reproduces the kind VMRule exactly: a noetl:materializer_backlog
recording rule plus MaterializerBacklogWarning / BacklogCritical /
BacklogGrowing / StalledUnderGate / ProjectErrors / AbsentUnderGate
alerts. They watch the materializer consumer backlog
(noetl_worker_nats_consumer_pending{consumer="noetl_materializer"} +
_ack_pending, reported by the worker lag poller on an independent task)
and the server publish-rate. Pre-flip (gate off) the publish-rate is 0, so the
stall/absent alerts cannot fire and the backlog rule doubles as the
green-baseline check (≈0).
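The green-baseline check can be scripted against the same query API used in the verification step above. This is a sketch only: the helper names gmp_query_url and gmp_query are hypothetical, and it assumes the noetl:materializer_backlog recording rule is queryable by name once the managed rule-evaluator has written it back.

```shell
# gmp_query_url PROJECT_ID: print the Managed Prometheus instant-query endpoint.
gmp_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# gmp_query PROJECT_ID PROMQL: run one instant query and print the raw JSON.
# Requires an authenticated gcloud session; nothing runs until it is called.
gmp_query() {
  curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "$(gmp_query_url "$1")" \
    --data-urlencode "query=$2"
}

# Example (not executed here):
#   gmp_query noetl-demo-19700101 'noetl:materializer_backlog'
```

Pre-flip, the returned samples should sit at roughly 0, matching the baseline described above.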
Full operator procedure — including the prod-specific image-roll prerequisite,
the staged flip sequence, and pager wiring — is in
runbooks/noetl-cqrs-publish-only-flip.md
"Production (GKE) — environment specifics".
Alerts route to the GMP managed Alertmanager (OperatorConfig → secret
alertmanager in gmp-public), NOT the vmstack Alertmanager. The receiver
config is the operator's to provide (it needs the Slack webhook / PagerDuty
routing key). Until it is wired, alerts still evaluate and fire (visible in Cloud
Monitoring); only delivery is missing. The wiring recipe and a templated receiver
stub are in the flip runbook's "Alert routing → Production (GKE)" section.
GMP rewrites label names containing / or . to _. The alert label noetl.io/flip-guardrail is matched as noetl_io_flip_guardrail in the Alertmanager route; confirm the rewritten name on a live firing alert.
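To predict the rewritten name before an alert fires, the mapping can be approximated locally. This sketch assumes that every character outside [A-Za-z0-9_] becomes an underscore. The text above only names / and ., but the hyphen in the example label is evidently rewritten too. The function name sanitize_label is hypothetical, and the live alert remains the authority.

```shell
# sanitize_label NAME: replace every character outside [A-Za-z0-9_] with "_".
# Assumption: this approximates the GMP rewrite; verify against a firing alert.
sanitize_label() {
  printf '%s\n' "$1" | sed 's/[^A-Za-z0-9_]/_/g'
}

sanitize_label 'noetl.io/flip-guardrail'
# prints: noetl_io_flip_guardrail
```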
A read-only prep pass reconciled the live cluster:
- Prod already runs the full Rust stack (the noetl/ai-meta#49 Python→Rust cutover is done; the noetl Service selector is app=noetl-server-rust). The live images are pre-#103 (server-rust:batch-dispatch-v1, noetl-worker-rust:cursor-100).
- Both flip secrets (NOETL_ENCRYPTION_KEY, noetl-internal-api-token) already exist.
- The post-#103 images (server v3.29.3 @sha256:6d2de32…, worker v5.35.0) are pushed to the prod Artifact Registry.
- The GMP monitoring above is applied and verified live.
- The roll-forward manifests (server-rust-deployment-prod.yaml → v3.29.3, PUBLISH_ONLY=false; worker-system-pool-deployment-prod.yaml → v5.35.0, MATERIALIZER_ENABLED=false) are staged but not applied: they roll live workloads, so the operator applies them per the runbook.
- System worker pool — runs the materializer loop.
- GKE Helm install — the prod deploy path.
- Flip runbook: noetl-cqrs-publish-only-flip.md
- noetl/ai-meta#103 (CQRS event-log cutover).