GKE Helm install

Operational guide for installing NoETL on Google Kubernetes Engine (GKE) via the automation/gcp_gke/noetl_gke_fresh_stack.yaml playbook and the automation/helm/noetl/ chart.

This is the GKE deploy path the project supports. The kind-cluster path uses automation/development/noetl.yaml with the in-cluster deployment/postgres and ops manifests — see manifests-keda and manifests-nats-supercluster for the kind-side artifacts.

For the design rationale (why Helm + Cloud SQL on GKE and not the ops-manifest topology), see the GKE Postgres topology decision in noetl/ai-meta.

Topology

Layer	Component	Provisioning
Database	Cloud SQL Postgres (private IP)	Terraform + gcloud, driven by the playbook
Connection pool	PgBouncer (in-cluster Deployment)	Helm sub-chart applied by the playbook
Message bus	NATS JetStream (chart `nats-headless`)	Helm chart
API	`noetl-server` Deployment	Helm chart
Worker	`noetl-worker` Deployment	Helm chart
Projector	`noetl-projector` StatefulSet	Helm chart (off by default)
Outbox publisher	`noetl-outbox-publisher` Deployment	Helm chart (off by default)
Worker autoscaler	KEDA `ScaledObject` (NATS JetStream trigger)	Helm chart template, see manifests-keda for the kind variant
Object store	RustFS / SeaweedFS	Helm chart (off by default)

Workers scale on NATS JetStream consumer lag, not CPU. The chart's built-in CPU HorizontalPodAutoscaler is mutually exclusive with the KEDA ScaledObject: the two chart templates guard on the combination of worker.autoscaling.enabled and worker.autoscaling.keda.enabled, so only one autoscaler ever owns the worker Deployment.
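The guard can be pictured as a small truth table. The sketch below is an assumption based on the description above and on the workload variable reference later in this guide, not a copy of the chart templates. The function name worker_autoscaler and the printed labels are hypothetical, and the case where the umbrella flag is off is assumed to render neither autoscaler.

```shell
#!/bin/sh
# Hypothetical helper: which autoscaler the chart is assumed to render
# for a given pair of flag values. Only the flag names come from the chart.
worker_autoscaler() {
  enabled="$1"   # worker.autoscaling.enabled
  keda="$2"      # worker.autoscaling.keda.enabled
  if [ "$enabled" != "true" ]; then
    echo "none"                # assumption: umbrella off disables both
  elif [ "$keda" = "true" ]; then
    echo "keda-scaledobject"   # KEDA NATS JetStream trigger
  else
    echo "cpu-hpa"             # chart's built-in CPU HPA
  fi
}
```

For example, worker_autoscaler true true prints keda-scaledobject, which is the default GKE profile.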

Prerequisites

Tools

gcloud, authenticated as a principal with either the project Owner role or the Kubernetes Engine Admin, Cloud SQL Admin, and Compute Network Admin roles.
kubectl and helm (Helm 3.x).
gh if you intend to track changes via PRs.
noetl CLI for running the provision playbook.

GCP setup

A GCP project with the GKE, Cloud SQL Admin, Service Networking, and Artifact Registry APIs enabled.
A VPC with a private-services-access range reserved for Cloud SQL (the playbook can create one if missing — see cloud_sql_private_service_range_name / cloud_sql_private_service_range_prefix_length in the playbook workload defaults).
A GKE Autopilot cluster, or a GKE Standard cluster with at least one node pool sized for your worker concurrency target.
An Artifact Registry repository for the NoETL image (us-central1-docker.pkg.dev/<project>/noetl/noetl:<tag>).

One-time cluster installs

KEDA operator (the chart ScaledObject requires it):

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.15.0

kubectl rollout status deployment/keda-operator -n keda

cert-manager (if you plan to use the chart's ingress template):

helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

Install

The playbook orchestrates everything from a single entry point. Inspect the workload defaults under workload: before running — the most-edited fields are listed below.

Frequently-edited workload variables

Variable	Default	What it controls
`project_id`	(none)	GCP project to install into. Required.
`region`	`us-central1`	GKE region.
`cluster_name`	`noetl-cluster`	GKE cluster name.
`namespace`	`noetl`	Helm release namespace.
`image_repository`	(derived from project)	Artifact Registry repo URL.
`image_tag`	`latest`	Container tag. Pin to a release tag in production.
`use_cloud_sql`	`true`	Use Cloud SQL + PgBouncer (true) vs in-cluster Postgres (false). On GKE keep `true`.
`cloud_sql_instance_name`	`noetl-shared-pg`	Cloud SQL instance ID.
`cloud_sql_tier`	`db-g1-small`	Cloud SQL machine type. Resize for production.
`cloud_sql_availability_type`	`ZONAL`	`ZONAL` or `REGIONAL` (HA).
`pgbouncer_enabled`	`true`	Run PgBouncer between the workload pods and Cloud SQL. Keep on.
`noetl_worker_autoscaling_enabled`	`true`	Enables the autoscaling umbrella in the chart.
`noetl_worker_autoscaling_keda_enabled`	`true`	Selects KEDA NATS-JetStream over the chart CPU HPA.
`noetl_worker_autoscaling_min_replicas`	`1`	Minimum worker pods.
`noetl_worker_autoscaling_max_replicas`	`20`	Cap. Respect downstream limits (PgBouncer pool, NATS throughput).

Run the playbook

cd repos/ops
noetl run automation/gcp_gke/noetl_gke_fresh_stack.yaml \
  --set action=provision-deploy \
  --set workload.project_id=<your-project> \
  --set workload.image_tag=v2.100.5

The playbook supports these action= modes:

`action`	What runs
`help`	Print the workload-variable reference and exit.
`provision`	Provision GCP infra only (Cloud SQL, networking, GKE prep). Does not deploy NoETL.
`deploy`	`helm upgrade --install` only, against an existing cluster.
`provision-deploy`	Both phases end-to-end.
`reset`	Destroy GKE workloads + Cloud SQL (gated; respects `delete_cloud_sql_on_destroy`).

provision-deploy is the typical fresh-install path. deploy is the day-to-day mode once infra is up. See the next section for ongoing upgrades.

Upgrade

Pin a new image tag and re-deploy via Helm directly:

helm upgrade --install noetl ./automation/helm/noetl \
  --namespace noetl \
  --reuse-values \
  --set image.tag=v2.100.6

This rolls out the new image without touching the GCP infrastructure.

`--reuse-values` and new chart defaults

helm upgrade --reuse-values reuses the values the live release was rendered with — it does not merge in new defaults that the chart adds in later versions. If a chart upgrade introduces a new key the templates dereference, the upgrade fails with:

Error: UPGRADE FAILED: ...: nil pointer evaluating interface {}.<new_key>

When this happens, pass the new keys explicitly on the upgrade command, or use helm upgrade --reset-values --values <merged.yaml> with a values file you constructed from helm get values noetl plus the new defaults.
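One way to script the second option is sketched below, assuming the release's user-supplied overrides are all you need to carry forward. With --reset-values the chart's current defaults (including any new keys) are loaded first and the saved overrides are layered on top. The function name upgrade_with_new_defaults and the HELM variable are hypothetical conveniences; the helm subcommands and flags are standard Helm 3.

```shell
#!/bin/sh
# HELM can be overridden (for example with "echo") for a dry run.
HELM="${HELM:-helm}"

# Usage: upgrade_with_new_defaults <release> <namespace> <chart-path> <image-tag>
upgrade_with_new_defaults() {
  release="$1"; namespace="$2"; chart="$3"; tag="$4"
  live_values="$(mktemp)" || return 1
  # Save only the user-supplied overrides of the live release.
  "$HELM" get values "$release" -n "$namespace" -o yaml > "$live_values" \
    || { rm -f "$live_values"; return 1; }
  # Start from the new chart defaults, then re-apply the saved overrides.
  "$HELM" upgrade --install "$release" "$chart" \
    --namespace "$namespace" \
    --reset-values \
    --values "$live_values" \
    --set "image.tag=$tag"
  status=$?
  rm -f "$live_values"
  return "$status"
}
```

A call for this stack would look like upgrade_with_new_defaults noetl noetl ./automation/helm/noetl v2.100.6. Review the saved overrides before relying on this in production, since any value you previously set stays pinned even if the chart default has since changed.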

Roll back

helm history noetl -n noetl
helm rollback noetl <REVISION> -n noetl

Helm re-applies the chart and values recorded for <REVISION> and records the result as a new release revision. Verify with helm status noetl -n noetl.

Verify

After install or upgrade:

# Release state
helm status noetl -n noetl
helm get values noetl -n noetl | head -60

# Workload health
kubectl get deploy,statefulset,pod -n noetl
kubectl rollout status deployment/noetl-server -n noetl --timeout=180s
kubectl rollout status deployment/noetl-worker -n noetl --timeout=180s

# Autoscaler: expect exactly one ScaledObject plus the one HPA that KEDA
# generates for it (keda-hpa-noetl-worker) on noetl-worker
kubectl get scaledobject,hpa -n noetl
# scaledobject.keda.sh/noetl-worker   apps/v1.Deployment   noetl-worker   1   20   nats-jetstream   Ready=True
# horizontalpodautoscaler.autoscaling/keda-hpa-noetl-worker   Deployment/noetl-worker   ...   1   20

# NATS durable consumer (created lazily by the worker on startup;
# self-heals if deleted — see noetl/noetl#600).
kubectl exec -n nats nats-0 -- nats consumer ls NOETL_COMMANDS

# Cloud SQL connectivity from a server pod through PgBouncer
kubectl exec -n noetl deploy/noetl-server -- \
  psql "host=pgbouncer.postgres.svc.cluster.local port=5432 dbname=noetl user=noetl" \
  -c "select current_database(), current_user;"

Smoke run

# Port-forward the API
kubectl port-forward -n noetl deploy/noetl-server 8082:8082 &

# Execute the standard smoke playbook
curl -s -X POST http://localhost:8082/api/v2/execute \
  -H 'content-type: application/json' \
  -d '{"playbook": "test/simple_python"}'
# Expect: completed=true, failed=false within a few seconds.

Tuning

Worker autoscaling

The chart renders a KEDA ScaledObject that scales workers on NATS JetStream consumer lag. The defaults in values.yaml sit under worker.autoscaling and worker.autoscaling.keda:

worker:
  autoscaling:
    minReplicas: 1
    maxReplicas: 20
    keda:
      enabled: true
      pollingInterval: 10        # seconds; lower = more responsive
      cooldownPeriod: 30         # seconds before scaling down
      nats:
        account: "$G"
        monitoringEndpoint: nats-headless.nats.svc.cluster.local:8222
        stream: NOETL_COMMANDS
        consumer: noetl_worker_pool
        lagThreshold: "10"
        activationLagThreshold: "1"

account: $G is the Helm NATS chart's default JetStream account. This differs from the kind topology (which uses manifests-keda with account: NOETL). The two profiles are separate by design.

Tuning guidance is identical to the kind profile — see the KEDA Scaler — Tuning table.

PgBouncer connection budget

PgBouncer sits between the workload pods and Cloud SQL. Every NoETL component talks to pgbouncer.postgres.svc.cluster.local:5432, and PgBouncer multiplexes those client sessions into a much smaller pool of real Postgres backends.

Three layers of connections matter when sizing the cluster:

┌──────────────────────────────┐
│  noetl-server / worker /     │  app-side client sessions
│  projector / outbox pods     │  (one per HTTP/NATS in-flight)
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────────┐
│  PgBouncer (pool_mode=       │  multiplex into a fixed pool
│  transaction)                │  of server connections
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────────┐
│  Cloud SQL Postgres          │  max_connections is the hard cap
└──────────────────────────────┘

Knobs and where they live

Knob	Default	Where it lives
`pgbouncer_pool_size`	`25`	playbook `workload.pgbouncer_pool_size` (workload defaults block)
`pgbouncer_max_client_conn`	`200`	playbook `workload.pgbouncer_max_client_conn`
`pgbouncer_replicas`	`1`	playbook `workload.pgbouncer_replicas`
`pgbouncer_pool_mode`	`transaction`	playbook `workload.pgbouncer_pool_mode`
Cloud SQL `max_connections`	tier-dependent	Cloud SQL instance flag (set via gcloud or console)
Per-pod app inflight	`inflight=6, db_inflight=32` (worker)	configmap `noetl-worker-config`

The math

The hard cap is Cloud SQL max_connections. Cloud SQL tier defaults:

Cloud SQL tier	Default `max_connections`
`db-g1-small`	~50
`db-custom-1-3840`	~100
`db-custom-2-7680`	~200
`db-n1-standard-1`	100
`db-n1-standard-2`	200
`db-n1-standard-4`	400

The Postgres-side budget is then:

cloud_sql_max_connections   ≥   pgbouncer_replicas × pgbouncer_pool_size + reserved_admin

reserved_admin covers Cloud SQL itself (cloudsqladmin, cloudsqlimportexport, cloudsqlreplica, ad-hoc psql sessions from operators). Reserve ~10 connections.

So with the default db-g1-small (50) and pgbouncer_replicas=1, pgbouncer_pool_size=25 works: 25 + 10 = 35 ≤ 50. Bumping pgbouncer_replicas to 2 (50 backends) does not work on db-g1-small, because 50 + 10 = 60 > 50; you need at least db-custom-1-3840 (~100).
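The Postgres-side inequality is easy to check mechanically before changing a knob. The following is a minimal sketch: the function name backend_budget_ok is hypothetical, and the default of 10 reserved connections is the rule of thumb stated above, not a value read from Cloud SQL.

```shell
#!/bin/sh
# Usage: backend_budget_ok <max_connections> <pgbouncer_replicas> <pgbouncer_pool_size> [reserved_admin]
# Prints the computed demand and returns 0 when the budget holds, 1 otherwise.
backend_budget_ok() {
  max_connections="$1"; replicas="$2"; pool_size="$3"; reserved="${4:-10}"
  needed=$(( replicas * pool_size + reserved ))
  echo "needed=$needed max_connections=$max_connections"
  [ "$needed" -le "$max_connections" ]
}
```

With the numbers from this section, backend_budget_ok 50 1 25 succeeds (35 ≤ 50), backend_budget_ok 50 2 25 fails (60 > 50), and backend_budget_ok 100 2 25 succeeds again.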

The PgBouncer-side budget governs how many app sessions can wait in front of PgBouncer:

pgbouncer_replicas × pgbouncer_max_client_conn   ≥   Σ (worker_pods × db_inflight) + server_pods × server_pool + ...

For the default profile (worker db_inflight=32, max_replicas=20, pgbouncer_max_client_conn=200, pgbouncer_replicas=1):

Worst case worker demand: 20 × 32 = 640 sessions
Available at PgBouncer:   1 × 200 = 200

That's over-subscribed by 3.2×. In practice it's fine in transaction pool mode because worker sessions are short-lived (each tx checks out a backend for a few ms), but a burst that holds transactions open will queue at PgBouncer and surface as server is busy errors on the client.
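The same worst-case arithmetic can be recomputed whenever max_replicas or db_inflight changes. This sketch only covers the worker term of the sum (server, projector, and outbox sessions are left out), and the function name oversubscription is hypothetical.

```shell
#!/bin/sh
# Usage: oversubscription <worker_pods> <db_inflight> <pgbouncer_replicas> <pgbouncer_max_client_conn>
# Prints worst-case worker demand, PgBouncer client capacity, and their ratio.
oversubscription() {
  worker_pods="$1"; db_inflight="$2"; replicas="$3"; max_client_conn="$4"
  demand=$(( worker_pods * db_inflight ))
  capacity=$(( replicas * max_client_conn ))
  # awk handles the fractional ratio; shell arithmetic is integer-only.
  awk -v d="$demand" -v c="$capacity" \
    'BEGIN { printf "demand=%d capacity=%d ratio=%.1f\n", d, c, d / c }'
}
```

For the default profile, oversubscription 20 32 1 200 prints demand=640 capacity=200 ratio=3.2.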

Sizing checklist

When changing any of noetl_worker_autoscaling_max_replicas, worker db_inflight, server replicas, or projector/outbox concurrency, redo the math and bump the smallest insufficient layer:

Cloud SQL backends: pgbouncer_replicas × pgbouncer_pool_size stays under cloud_sql_max_connections - reserved_admin.
PgBouncer client capacity: for transaction pool mode, a 2–4× over-subscription is normal; for session pool mode, sessions tie up backends 1:1 and you cannot over-subscribe.
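Bump PgBouncer first. Adding a PgBouncer pod is cheap (a few hundred MiB of RAM, no Cloud SQL bill change), but each extra replica adds another pgbouncer_pool_size backends, so recheck the Cloud SQL backend budget from the first item. Bumping the Cloud SQL tier doubles the bill.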
Cloud SQL max_connections is a flag, not a tier limit. You can raise it independently of tier, but the underlying instance has to have RAM to back the extra processes. Rule of thumb: each Postgres connection costs 5–10 MiB. Leave headroom.
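To put the memory rule of thumb into numbers before raising the flag, a rough estimate such as the one below may help. It simply multiplies the connection count by the 5 to 10 MiB range quoted above; the function name connection_ram_mib is hypothetical and the result is an approximation, not a Cloud SQL guarantee.

```shell
#!/bin/sh
# Usage: connection_ram_mib <max_connections>
# Prints the approximate RAM range the backends may need at full use.
connection_ram_mib() {
  connections="$1"
  echo "$(( connections * 5 ))-$(( connections * 10 )) MiB"
}
```

For instance, connection_ram_mib 200 prints 1000-2000 MiB, which is worth comparing against the instance's memory before setting max_connections to 200 on a small tier.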

Diagnostics

# Cloud SQL active connections (run from a server pod)
kubectl exec -n noetl deploy/noetl-server -- \
  psql "host=pgbouncer.postgres.svc.cluster.local port=5432 dbname=noetl user=noetl" \
  -c "select count(*) from pg_stat_activity where datname='noetl';"

# PgBouncer pool stats (connect to the admin DB)
kubectl exec -n postgres deploy/pgbouncer -- \
  psql -h 127.0.0.1 -p 5432 -U pgbouncer pgbouncer -c "show pools;"
# Watch cl_active, cl_waiting, sv_active, sv_idle, maxwait

# PgBouncer per-database traffic statistics (transaction and query
# counts, bytes, cumulative wait time)
kubectl exec -n postgres deploy/pgbouncer -- \
  psql -h 127.0.0.1 -p 5432 -U pgbouncer pgbouncer -c "show stats;"

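cl_waiting > 0 consistently means clients are queueing because PgBouncer has run out of server connections under its pool_size. maxwait > 0 for more than a few seconds means the oldest queued client has been waiting that long and clients are at risk of timing out; bump pgbouncer_pool_size (and the Cloud SQL backend budget if needed).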
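The two rules above can be folded into a quick triage helper for use in a monitoring script. This is only a sketch: the function name pool_verdict, the output labels, and the default 3-second threshold for "a few seconds" are assumptions, and the caller is expected to supply cl_waiting and maxwait values read from show pools.

```shell
#!/bin/sh
# Usage: pool_verdict <cl_waiting> <maxwait_seconds> [maxwait_limit_seconds]
pool_verdict() {
  cl_waiting="$1"; maxwait="$2"; maxwait_limit="${3:-3}"
  if [ "$maxwait" -gt "$maxwait_limit" ]; then
    # Oldest queued client has waited longer than the limit.
    echo "saturated: raise pgbouncer_pool_size and recheck the Cloud SQL backend budget"
  elif [ "$cl_waiting" -gt 0 ]; then
    # Clients are queueing, but not for long yet.
    echo "queueing: watch whether cl_waiting stays above zero"
  else
    echo "ok"
  fi
}
```

A single sample is not conclusive; the "consistently" qualifier means the verdict should hold across several polls before you resize anything.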

Cloud SQL HA

Default cloud_sql_availability_type: ZONAL is acceptable for demo / staging clusters. For production set REGIONAL for HA across two zones — at roughly 2× the Cloud SQL bill.

Common pitfalls

Two autoscalers fighting

If a prior install set worker.autoscaling.enabled=true while an external KEDA ScaledObject was also applied manually, you can end up with two HPAs on the same noetl-worker Deployment, each fighting the other on replica count. Symptoms: replicas oscillate, many pending pods during scale-up bursts.

Fix:

# Drop the externally-applied scaler (kind-profile manifest does
# not belong on GKE)
kubectl delete scaledobject noetl-worker-scaler-worker-cpu-01 -n noetl

# Re-apply the chart so the chart-rendered ScaledObject takes over
helm upgrade --install noetl ./automation/helm/noetl \
  --namespace noetl \
  --reuse-values \
  --set worker.autoscaling.enabled=true \
  --set worker.autoscaling.keda.enabled=true

KEDA's admission webhook refuses to create a second ScaledObject targeting a Deployment that already has one — delete the old one first.

Live-patching the autoscaler

Pre-2026-05-24 installs applied ci/manifests/keda/scaledobject-worker-cpu-01.yaml to GKE and then edited the live object with kubectl edit, swapping account: NOETL for account: $G and pointing the monitoring endpoint at nats-headless. Those live patches do not survive a re-apply. The chart-templated ScaledObject in current chart versions carries the correct GKE-shaped defaults in values.yaml. Do not kubectl apply the kind-profile manifest on GKE.

Worker missing the durable consumer

If the worker reports it cannot subscribe to NOETL_COMMANDS, check that the durable consumer noetl_worker_pool exists:

kubectl exec -n nats nats-0 -- nats consumer info NOETL_COMMANDS noetl_worker_pool

The worker creates this on startup and self-heals every 30s if the consumer goes missing (see noetl/noetl#600). No manual nats consumer add is needed.

Stale `Pending` PVCs

Older deploy attempts that used static hostPath PVs left orphan PVCs in the noetl and postgres namespaces. They stay Pending because no matching PV exists on GKE (the static PVs are kind-only artifacts). They do not affect the running stack — Cloud SQL data lives outside the cluster — but they show up in kubectl get pvc and look alarming.

Safe-cleanup recipe

This recipe is cosmetic. It deletes only PVCs that are confirmed Pending (never bound) and confirmed not referenced by any live pod. Run it from the operator workstation against the GKE context.

1. Inventory

# All Pending PVCs across the noetl + postgres namespaces
for ns in noetl postgres; do
  echo "=== namespace: $ns ==="
  kubectl get pvc -n "$ns" --field-selector=status.phase=Pending
done

Expected output: zero or more PVCs with STATUS=Pending and VOLUME=<empty>. If VOLUME is non-empty, the PVC was bound at some point — do not delete blindly; investigate first.

2. Confirm nothing references them

For each candidate PVC, confirm no pod still references it:

PVC=<name>
NS=<namespace>
kubectl get pod -n "$NS" -o json | \
  jq -r --arg p "$PVC" '
    .items[]
    | select(.spec.volumes[]? | .persistentVolumeClaim?.claimName == $p)
    | .metadata.name'

Expected output: empty. If any pod is listed, the PVC is in use — do not delete it.

Also confirm no StatefulSet, Deployment, or Job template mounts the PVC by name:

kubectl get deploy,sts,job -n "$NS" -o json | \
  jq -r --arg p "$PVC" '
    .items[]
    | select(.spec.template.spec.volumes[]? | .persistentVolumeClaim?.claimName == $p)
    | "\(.kind)/\(.metadata.name)"'

3. Snapshot before deletion

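# Compute the directory once so a run that crosses midnight UTC
# still writes every file to the same place.
SNAP_DIR="/tmp/pvc-cleanup-$(date -u +%Y%m%d)"
mkdir -p "$SNAP_DIR"
for ns in noetl postgres; do
  kubectl get pvc -n "$ns" -o yaml > "$SNAP_DIR/$ns-pvcs.yaml"
done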

If a delete turns out to be wrong, the snapshot lets you re-create the PVC manifest. (It will still be Pending afterwards — but it'll exist again.)

4. Delete

# One PVC at a time. A plain delete (no --force, no finalizer patch)
# leaves finalizers in place, so any PVC actually in use will block
# instead of being force-deleted.
kubectl delete pvc -n "$NS" "$PVC"

If kubectl delete hangs for more than ~30 seconds, the PVC has a finalizer that points at something real. Stop. Run kubectl get pvc -n "$NS" "$PVC" -o yaml and check metadata.finalizers and spec.volumeName. Do not strip finalizers — investigate why something still owns the PVC.

5. Verify

kubectl get pvc -A | grep Pending
# Expected: no rows (or only PVCs you explicitly chose not to clean up)

Preventing recurrence

Stale Pending PVCs come from kubectl apply -f against manifests that reference static hostPath PVs (the kind profile). The Live-patching the autoscaler pitfall and this one share a root cause: applying kind-profile manifests on GKE. Use the Helm chart for GKE workload manifests; the chart uses dynamic-provisioning PersistentVolumeClaim templates that GKE's default standard-rwo storage class binds automatically.

Related

manifests-keda: KEDA scaler for the kind profile.
manifests-nats-supercluster: multi-cluster JetStream topology.
GKE Postgres topology decision: why this stack uses Helm + Cloud SQL + PgBouncer.
Chart values reference: every knob the chart exposes.
GKE fresh-stack playbook: provision + deploy orchestration.
