automation gcp gke

Kadyapam edited this page May 24, 2026 · 2 revisions

GKE Helm install

Operational guide for installing NoETL on Google Kubernetes Engine (GKE) via the automation/gcp_gke/noetl_gke_fresh_stack.yaml playbook and the automation/helm/noetl/ chart.

This is the GKE deploy path the project supports. The kind-cluster path uses automation/development/noetl.yaml with the in-cluster deployment/postgres and ops manifests — see manifests-keda and manifests-nats-supercluster for the kind-side artifacts.

For the design rationale (why Helm + Cloud SQL on GKE and not the ops-manifest topology), see the GKE Postgres topology decision in noetl/ai-meta.

Topology

Layer | Component | Provisioning
Database | Cloud SQL Postgres (private IP) | Terraform + gcloud, driven by the playbook
Connection pool | PgBouncer (in-cluster Deployment) | Helm sub-chart applied by the playbook
Message bus | NATS JetStream (chart nats-headless) | Helm chart
API | noetl-server Deployment | Helm chart
Worker | noetl-worker Deployment | Helm chart
Projector | noetl-projector StatefulSet | Helm chart (off by default)
Outbox publisher | noetl-outbox-publisher Deployment | Helm chart (off by default)
Worker autoscaler | KEDA ScaledObject (NATS JetStream trigger) | Helm chart template, see manifests-keda for the kind variant
Object store | RustFS / SeaweedFS | Helm chart (off by default)

Workers scale on NATS JetStream consumer lag, not CPU. The chart's built-in CPU HorizontalPodAutoscaler is mutually exclusive with the KEDA ScaledObject (the two chart templates guard on worker.autoscaling.enabled × worker.autoscaling.keda.enabled), so only one autoscaler ever owns the worker Deployment.
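
The guard described above can be read as a small truth table. The sketch below is illustrative shell, not the chart's template code: the function name worker_autoscaler is hypothetical, and it assumes the CPU HPA renders only when the umbrella flag is on and the KEDA flag is off.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the two chart guards as a truth table.
# $1 = worker.autoscaling.enabled, $2 = worker.autoscaling.keda.enabled
worker_autoscaler() {
  local enabled="$1" keda="$2"
  if [ "$enabled" != "true" ]; then
    echo "none"      # umbrella off: no autoscaler owns the Deployment (assumption)
  elif [ "$keda" = "true" ]; then
    echo "keda"      # ScaledObject renders, CPU HPA template is skipped
  else
    echo "cpu-hpa"   # CPU HPA renders, ScaledObject template is skipped
  fi
}

worker_autoscaler true true    # prints: keda
worker_autoscaler true false   # prints: cpu-hpa
worker_autoscaler false true   # prints: none
```

Whatever the combination, at most one branch is taken, which is the property that keeps two autoscalers off the same Deployment.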

Prerequisites

Tools

  • gcloud authenticated with project owner / Kubernetes Engine Admin / Cloud SQL Admin / Compute Network Admin roles.
  • kubectl and helm (Helm 3.x).
  • gh if you intend to track changes via PRs.
  • noetl CLI for running the provision playbook.

GCP setup

  • A GCP project with the GKE, Cloud SQL Admin, Service Networking, and Artifact Registry APIs enabled.
  • A VPC with a private-services-access range reserved for Cloud SQL (the playbook can create one if missing — see cloud_sql_private_service_range_name / cloud_sql_private_service_range_prefix_length in the playbook workload defaults).
  • A GKE Autopilot cluster, or a GKE Standard cluster with at least one node pool sized for your worker concurrency target.
  • Artifact Registry repository for the NoETL image (us-central1-docker.pkg.dev/<project>/noetl/noetl:<tag>).

One-time cluster installs

KEDA operator (the chart ScaledObject requires it):

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.15.0

kubectl rollout status deployment/keda-operator -n keda

cert-manager (if you plan to use the chart's ingress template):

helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

Install

The playbook orchestrates everything from a single entry point. Inspect the workload defaults under workload: before running — the most-edited fields are listed below.

Frequently-edited workload variables

Variable | Default | What it controls
project_id | (none) | GCP project to install into. Required.
region | us-central1 | GKE region.
cluster_name | noetl-cluster | GKE cluster name.
namespace | noetl | Helm release namespace.
image_repository | (derived from project) | Artifact Registry repo URL.
image_tag | latest | Container tag. Pin to a release tag in production.
use_cloud_sql | true | Use Cloud SQL + PgBouncer (true) vs in-cluster Postgres (false). On GKE keep true.
cloud_sql_instance_name | noetl-shared-pg | Cloud SQL instance ID.
cloud_sql_tier | db-g1-small | Cloud SQL machine type. Resize for production.
cloud_sql_availability_type | ZONAL | ZONAL or REGIONAL (HA).
pgbouncer_enabled | true | Run PgBouncer between the workload pods and Cloud SQL. Keep on.
noetl_worker_autoscaling_enabled | true | Enables the autoscaling umbrella in the chart.
noetl_worker_autoscaling_keda_enabled | true | Selects KEDA NATS-JetStream over the chart CPU HPA.
noetl_worker_autoscaling_min_replicas | 1 | Minimum worker pods.
noetl_worker_autoscaling_max_replicas | 20 | Cap. Respect downstream limits (PgBouncer pool, NATS throughput).

Run the playbook

cd repos/ops
noetl run automation/gcp_gke/noetl_gke_fresh_stack.yaml \
  --set action=provision-deploy \
  --set workload.project_id=<your-project> \
  --set workload.image_tag=v2.100.5

The playbook supports these action= modes:

action | What runs
help | Print the workload-variable reference and exit.
provision | Provision GCP infra only (Cloud SQL, networking, GKE prep). Does not deploy NoETL.
deploy | helm upgrade --install only, against an existing cluster.
provision-deploy | Both phases end-to-end.
reset | Destroy GKE workloads + Cloud SQL (gated; respects delete_cloud_sql_on_destroy).

provision-deploy is the typical fresh-install path. deploy is the day-to-day mode once infra is up. See the next section for ongoing upgrades.

Upgrade

Pin a new image tag and re-deploy via Helm directly:

helm upgrade --install noetl ./automation/helm/noetl \
  --namespace noetl \
  --reuse-values \
  --set image.tag=v2.100.6

This rolls out the new image tag without touching the GCP infrastructure.

--reuse-values and new chart defaults

helm upgrade --reuse-values reuses the values the live release was rendered with — it does not merge in new defaults that the chart adds in later versions. If a chart upgrade introduces a new key the templates dereference, the upgrade fails with:

Error: UPGRADE FAILED: ...: nil pointer evaluating interface {}.<new_key>

When this happens, pass the new keys explicitly on the upgrade command, or use helm upgrade --reset-values --values <merged.yaml> with a values file you constructed from helm get values noetl plus the new defaults.
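
The second option can be wrapped in a small function. This is a minimal sketch, not part of the playbook: the name upgrade_with_live_values is hypothetical, and it assumes the live release's user-supplied values (what helm get values prints without --all) are the only overrides you need to carry forward. Because --reset-values starts from the new chart's values.yaml, newly added defaults are picked up without being listed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: upgrade on top of the new chart defaults while
# re-applying the overrides stored in the live release.
# Usage: upgrade_with_live_values RELEASE NAMESPACE CHART [extra helm args...]
upgrade_with_live_values() {
  local release="$1" namespace="$2" chart="$3"
  shift 3
  local live rc
  live="$(mktemp)" || return 1
  # Export only the user-supplied values of the current revision.
  if ! helm get values "$release" --namespace "$namespace" -o yaml > "$live"; then
    rm -f "$live"
    return 1
  fi
  # Chart defaults first, then the exported overrides, then any extra flags.
  helm upgrade --install "$release" "$chart" \
    --namespace "$namespace" \
    --reset-values \
    --values "$live" \
    "$@"
  rc=$?
  rm -f "$live"
  return "$rc"
}

# Example call (not executed here):
#   upgrade_with_live_values noetl noetl ./automation/helm/noetl --set image.tag=v2.100.6
```

Review the exported values before relying on this in production; an override that the new chart renamed or removed will still be passed through as-is.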

Roll back

helm history noetl -n noetl
helm rollback noetl <REVISION> -n noetl

Helm re-applies the chart and values from the chosen revision and records the result as a new revision. Verify with helm status noetl -n noetl.

Verify

After install or upgrade:

# Release state
helm status noetl -n noetl
helm get values noetl -n noetl | head -60

# Workload health
kubectl get deploy,statefulset,pod -n noetl
kubectl rollout status deployment/noetl-server -n noetl --timeout=180s
kubectl rollout status deployment/noetl-worker -n noetl --timeout=180s

# Autoscaler: expect exactly one ScaledObject + one HPA on noetl-worker
kubectl get scaledobject,hpa -n noetl
# scaledobject.keda.sh/noetl-worker   apps/v1.Deployment   noetl-worker   1   20   nats-jetstream   Ready=True
# horizontalpodautoscaler.autoscaling/keda-hpa-noetl-worker   Deployment/noetl-worker   ...   1   20

# NATS durable consumer (created lazily by the worker on startup;
# self-heals if deleted — see noetl/noetl#600).
kubectl exec -n nats nats-0 -- nats consumer ls NOETL_COMMANDS

# Cloud SQL connectivity from a server pod through PgBouncer
kubectl exec -n noetl deploy/noetl-server -- \
  psql "host=pgbouncer.postgres.svc.cluster.local port=5432 dbname=noetl user=noetl" \
  -c "select current_database(), current_user;"

Smoke run

# Port-forward the API
kubectl port-forward -n noetl deploy/noetl-server 8082:8082 &

# Execute the standard smoke playbook
curl -s -X POST http://localhost:8082/api/v2/execute \
  -H 'content-type: application/json' \
  -d '{"playbook": "test/simple_python"}'
# Expect: completed=true, failed=false within a few seconds.

Tuning

Worker autoscaling

The chart templates KEDA against NATS JetStream consumer lag. The defaults in values.yaml sit under worker.autoscaling.keda:

worker:
  autoscaling:
    minReplicas: 1
    maxReplicas: 20
    keda:
      enabled: true
      pollingInterval: 10        # seconds; lower = more responsive
      cooldownPeriod: 30         # seconds before scaling down
      nats:
        account: "$G"
        monitoringEndpoint: nats-headless.nats.svc.cluster.local:8222
        stream: NOETL_COMMANDS
        consumer: noetl_worker_pool
        lagThreshold: "10"
        activationLagThreshold: "1"

account: $G is the Helm NATS chart's default JetStream account. This differs from the kind topology (which uses manifests-keda with account: NOETL). The two profiles are separate by design.

Tuning guidance is identical to the kind profile — see the KEDA Scaler — Tuning table.

PgBouncer connection budget

PgBouncer sits between the workload pods and Cloud SQL. Every NoETL component talks to pgbouncer.postgres.svc.cluster.local:5432, and PgBouncer multiplexes those client sessions into a much smaller pool of real Postgres backends.

Three layers of connections matter when sizing the cluster:

┌──────────────────────────────┐
│  noetl-server / worker /     │  app-side client sessions
│  projector / outbox pods     │  (one per HTTP/NATS in-flight)
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────────┐
│  PgBouncer                   │  multiplex into a fixed pool
│  (pool_mode=transaction)     │  of server connections
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────────┐
│  Cloud SQL Postgres          │  max_connections is the hard cap
└──────────────────────────────┘

Knobs and where they live

Knob | Default | Where it lives
pgbouncer_pool_size | 25 | playbook workload.pgbouncer_pool_size (workload defaults block)
pgbouncer_max_client_conn | 200 | playbook workload.pgbouncer_max_client_conn
pgbouncer_replicas | 1 | playbook workload.pgbouncer_replicas
pgbouncer_pool_mode | transaction | playbook workload.pgbouncer_pool_mode
Cloud SQL max_connections | tier-dependent | Cloud SQL instance flag (set via gcloud or console)
Per-pod app inflight | inflight=6, db_inflight=32 (worker) | configmap noetl-worker-config

The math

The hard cap is Cloud SQL max_connections. Cloud SQL tier defaults:

Cloud SQL tier | Default max_connections
db-g1-small | ~50
db-custom-1-3840 | ~100
db-custom-2-7680 | ~200
db-n1-standard-1 | 100
db-n1-standard-2 | 200
db-n1-standard-4 | 400

The Postgres-side budget is then:

cloud_sql_max_connections   ≥   pgbouncer_replicas × pgbouncer_pool_size + reserved_admin

reserved_admin covers Cloud SQL itself (cloudsqladmin, cloudsqlimportexport, cloudsqlreplica, ad-hoc psql sessions from operators). Reserve ~10 connections.

So with the default db-g1-small (50) and pgbouncer_replicas=1, pgbouncer_pool_size=25 works: 25 + 10 = 35 ≤ 50. Bumping pgbouncer_replicas to 2 (50 backends) does not work on db-g1-small — you need at least db-custom-1-3840.
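
The inequality is easy to re-check whenever a knob changes. The function below is a minimal sketch; check_backend_budget is a hypothetical name, and the reserved-admin figure defaults to the ~10 connections suggested above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks
#   max_connections >= pgbouncer_replicas * pgbouncer_pool_size + reserved_admin
# Usage: check_backend_budget MAX_CONNECTIONS REPLICAS POOL_SIZE [RESERVED_ADMIN]
check_backend_budget() {
  local max_conn="$1" replicas="$2" pool_size="$3" reserved="${4:-10}"
  local needed=$(( replicas * pool_size + reserved ))
  if [ "$needed" -le "$max_conn" ]; then
    echo "ok: need $needed of $max_conn"
  else
    echo "over: need $needed of $max_conn"
    return 1
  fi
}

check_backend_budget 50 1 25           # prints: ok: need 35 of 50
check_backend_budget 50 2 25 || true   # prints: over: need 60 of 50
check_backend_budget 100 2 25          # prints: ok: need 60 of 100
```

The three calls reproduce the cases in the text: the default profile fits db-g1-small, two PgBouncer replicas do not, and the same two replicas fit a tier with 100 connections.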

The PgBouncer-side budget governs how many app sessions can wait in front of PgBouncer:

pgbouncer_replicas × pgbouncer_max_client_conn   ≥   Σ (worker_pods × db_inflight) + server_pods × server_pool + ...

For the default profile (worker db_inflight=32, max_replicas=20, pgbouncer_max_client_conn=200, pgbouncer_replicas=1):

Worst case worker demand: 20 × 32 = 640 sessions
Available at PgBouncer:   1 × 200 = 200

That's over-subscribed by 3.2×. In practice it's fine in transaction pool mode because worker sessions are short-lived (each tx checks out a backend for a few ms), but a burst that holds transactions open will queue at PgBouncer and surface as server is busy errors on the client.
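
The same arithmetic as a reusable sketch. pgbouncer_oversubscription is a hypothetical name, and the function counts worker demand only; server, projector, and outbox sessions would add to the numerator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case worker demand vs PgBouncer client capacity.
# Usage: pgbouncer_oversubscription WORKER_PODS DB_INFLIGHT PGB_REPLICAS MAX_CLIENT_CONN
pgbouncer_oversubscription() {
  local pods="$1" inflight="$2" replicas="$3" max_client="$4"
  # Integer products in the shell, floating-point ratio in awk.
  awk -v d="$(( pods * inflight ))" -v c="$(( replicas * max_client ))" \
    'BEGIN { printf "demand=%d capacity=%d ratio=%.1f\n", d, c, d / c }'
}

pgbouncer_oversubscription 20 32 1 200   # prints: demand=640 capacity=200 ratio=3.2
pgbouncer_oversubscription 20 32 2 200   # prints: demand=640 capacity=400 ratio=1.6
```

The second call shows the effect of adding one PgBouncer replica: the ratio halves, but the Postgres-side budget must then cover twice the pool_size.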

Sizing checklist

When changing any of noetl_worker_autoscaling_max_replicas, worker db_inflight, server replicas, or projector/outbox concurrency, redo the math and bump the smallest insufficient layer:

  1. Cloud SQL backends: pgbouncer_replicas × pgbouncer_pool_size stays under cloud_sql_max_connections - reserved_admin.
  2. PgBouncer client capacity: for transaction pool mode, a 2–4× over-subscription is normal; for session pool mode, sessions tie up backends 1:1 and you cannot over-subscribe.
  3. Bump PgBouncer first. Adding a PgBouncer pod is cheap (a few hundred MiB of RAM, no Cloud SQL bill change). Bumping the Cloud SQL tier doubles the bill.
  4. Cloud SQL max_connections is a flag, not a tier limit. You can raise it independently of tier, but the underlying instance has to have RAM to back the extra processes. Rule of thumb: each Postgres connection costs 5–10 MiB. Leave headroom.
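
For item 4, the rule of thumb translates into a quick range estimate before raising the flag. This is a rough sketch using only the 5–10 MiB per connection figure above; conn_memory_range_mib is a hypothetical name, and real per-connection memory depends on the workload.

```shell
#!/usr/bin/env bash
# Hypothetical helper: RAM range implied by the 5-10 MiB per connection rule of thumb.
# Usage: conn_memory_range_mib MAX_CONNECTIONS
conn_memory_range_mib() {
  local conns="$1"
  echo "$(( conns * 5 ))-$(( conns * 10 )) MiB"
}

conn_memory_range_mib 50    # prints: 250-500 MiB
conn_memory_range_mib 200   # prints: 1000-2000 MiB
```

Compare the upper bound with the instance's RAM and leave room for shared buffers and the operating system.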

Diagnostics

# Cloud SQL active connections (run from a server pod)
kubectl exec -n noetl deploy/noetl-server -- \
  psql "host=pgbouncer.postgres.svc.cluster.local port=5432 dbname=noetl user=noetl" \
  -c "select count(*) from pg_stat_activity where datname='noetl';"

# PgBouncer pool stats (connect to the admin DB)
kubectl exec -n postgres deploy/pgbouncer -- \
  psql -h 127.0.0.1 -p 5432 -U pgbouncer pgbouncer -c "show pools;"
# Watch cl_active, cl_waiting, sv_active, sv_idle, maxwait

# PgBouncer client + server connection counts
kubectl exec -n postgres deploy/pgbouncer -- \
  psql -h 127.0.0.1 -p 5432 -U pgbouncer pgbouncer -c "show stats;"

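cl_waiting > 0 consistently means clients are queuing because every server connection allowed by pool_size is busy. maxwait is the time, in seconds, that the oldest queued client has been waiting; if it stays above zero for more than a few seconds, clients are at risk of timing out. In either case, bump pgbouncer_pool_size (and the Cloud SQL backend budget if needed).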
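
To watch those two columns without reading the whole table, the pool stats can be filtered. This is a minimal sketch: flag_waiting_pools is a hypothetical name, it expects psql's unaligned output with a header row (psql -A -F'|'), and the sample rows are made up with a reduced column set. Columns are looked up by name because the exact column list differs between PgBouncer versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads unaligned, pipe-separated `show pools;` output
# (header row first) on stdin and prints pools that have queued clients.
flag_waiting_pools() {
  awk -F'|' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # column name -> index
    /^\(/   { next }                                          # skip the psql row-count footer
    $col["cl_waiting"] > 0 || $col["maxwait"] > 0 {
      printf "%s/%s cl_waiting=%s maxwait=%ss\n",
        $col["database"], $col["user"], $col["cl_waiting"], $col["maxwait"]
    }'
}

# Made-up sample rows, for illustration only.
printf '%s\n' \
  'database|user|cl_active|cl_waiting|sv_active|sv_idle|maxwait' \
  'noetl|noetl|40|12|25|0|3' \
  'pgbouncer|pgbouncer|1|0|0|0|0' \
  '(2 rows)' | flag_waiting_pools
# prints: noetl/noetl cl_waiting=12 maxwait=3s

# Against the live pooler (same psql call as above, plus -A -F'|'):
#   kubectl exec -n postgres deploy/pgbouncer -- \
#     psql -A -F'|' -h 127.0.0.1 -p 5432 -U pgbouncer pgbouncer -c "show pools;" \
#     | flag_waiting_pools
```

No output means no pool currently has waiting clients.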

Cloud SQL HA

Default cloud_sql_availability_type: ZONAL is acceptable for demo / staging clusters. For production set REGIONAL for HA across two zones — at roughly 2× the Cloud SQL bill.

Common pitfalls

Two autoscalers fighting

If a prior install set worker.autoscaling.enabled=true while an external KEDA ScaledObject was also applied manually, you can end up with two HPAs on the same noetl-worker Deployment, each fighting the other on replica count. Symptoms: replicas oscillate, many pending pods during scale-up bursts.

Fix:

# Drop the externally-applied scaler (kind-profile manifest does
# not belong on GKE)
kubectl delete scaledobject noetl-worker-scaler-worker-cpu-01 -n noetl

# Re-apply the chart so the chart-rendered ScaledObject takes over
helm upgrade --install noetl ./automation/helm/noetl \
  --namespace noetl \
  --reuse-values \
  --set worker.autoscaling.enabled=true \
  --set worker.autoscaling.keda.enabled=true

KEDA's admission webhook refuses to create a second ScaledObject targeting a Deployment that already has one — delete the old one first.

Live-patching the autoscaler

Pre-2026-05-24 installs applied ci/manifests/keda/scaledobject-worker-cpu-01.yaml to GKE and then kubectl edit-ed it to swap account: NOETL for account: $G and the monitoring endpoint for nats-headless. Those live patches do not survive a re-apply. The chart-templated ScaledObject in current chart versions carries the correct GKE-shaped defaults in values.yaml. Do not kubectl apply the kind-profile manifest on GKE.

Worker missing the durable consumer

If the worker reports it cannot subscribe to NOETL_COMMANDS, check that the durable consumer noetl_worker_pool exists:

kubectl exec -n nats nats-0 -- nats consumer info NOETL_COMMANDS noetl_worker_pool

The worker creates this on startup and self-heals every 30s if the consumer goes missing (see noetl/noetl#600). No manual nats consumer add is needed.

Stale Pending PVCs

Older deploy attempts that used static hostPath PVs left orphan PVCs in the noetl and postgres namespaces. They stay Pending because no matching PV exists on GKE (the static PVs are kind-only artifacts). They do not affect the running stack — Cloud SQL data lives outside the cluster — but they show up in kubectl get pvc and look alarming.

Safe-cleanup recipe

This recipe is cosmetic. It deletes only PVCs that are confirmed Pending (never bound) and confirmed not referenced by any live pod. Run it from the operator workstation against the GKE context.

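1. Inventory

# All Pending PVCs across the noetl + postgres namespaces.
# PVCs do not support a status.phase field selector, so filter client-side
# on the STATUS column and keep the header row.
for ns in noetl postgres; do
  echo "=== namespace: $ns ==="
  kubectl get pvc -n "$ns" | awk 'NR == 1 || $2 == "Pending"'
done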

Expected output: zero or more PVCs with STATUS=Pending and VOLUME=<empty>. If VOLUME is non-empty, the PVC was bound at some point — do not delete blindly; investigate first.

2. Confirm nothing references them

For each candidate PVC, confirm no pod still references it:

PVC=<name>
NS=<namespace>
kubectl get pod -n "$NS" -o json | \
  jq -r --arg p "$PVC" '
    .items[]
    | select(.spec.volumes[]? | .persistentVolumeClaim?.claimName == $p)
    | .metadata.name'

Expected output: empty. If any pod is listed, the PVC is in use — do not delete it.

Also confirm no StatefulSet, Deployment, or Job template mounts the PVC by name:

kubectl get deploy,sts,job -n "$NS" -o json | \
  jq -r --arg p "$PVC" '
    .items[]
    | select(.spec.template.spec.volumes[]? | .persistentVolumeClaim?.claimName == $p)
    | "\(.kind)/\(.metadata.name)"'

3. Snapshot before deletion

# Compute the directory once so every file lands in the same place.
SNAPSHOT_DIR="/tmp/pvc-cleanup-$(date -u +%Y%m%d)"
mkdir -p "$SNAPSHOT_DIR"
for ns in noetl postgres; do
  kubectl get pvc -n "$ns" -o yaml > "$SNAPSHOT_DIR/$ns-pvcs.yaml"
done

If a delete turns out to be wrong, the snapshot lets you re-create the PVC manifest. (It will still be Pending afterwards — but it'll exist again.)

4. Delete

# One PVC at a time. This is a plain delete: nothing here patches or removes
# finalizers, so a PVC that is actually in use blocks on its protection
# finalizer instead of disappearing.
kubectl delete pvc -n "$NS" "$PVC"

If kubectl delete hangs for more than ~30 seconds, the PVC has a finalizer that points at something real. Stop. Run kubectl get pvc -n "$NS" "$PVC" -o yaml and check metadata.finalizers and spec.volumeName. Do not strip finalizers — investigate why something still owns the PVC.

5. Verify

kubectl get pvc -A | grep Pending
# Expected: no rows (or only PVCs you explicitly chose not to clean up)

Preventing recurrence

Stale Pending PVCs come from kubectl apply -f against manifests that reference static hostPath PVs (the kind profile). The Live-patching the autoscaler pitfall and this one share a root cause: applying kind-profile manifests on GKE. Use the Helm chart for GKE workload manifests; the chart uses dynamic-provisioning PersistentVolumeClaim templates that GKE's default standard-rwo storage class binds automatically.
