History / automation gcp gke

Revisions

  • wiki(automation-gcp-gke): expand PgBouncer budget + Pending PVC cleanup

    Two operational sections fleshed out beyond the install-page brief:

    PgBouncer connection budget (replaces the prior 2-paragraph stub):
    - Layer diagram showing app sessions -> PgBouncer -> Cloud SQL.
    - Knobs table listing pgbouncer_pool_size, pgbouncer_max_client_conn, pgbouncer_replicas, pgbouncer_pool_mode, Cloud SQL max_connections, and per-pod inflight, plus where each one lives.
    - Cloud SQL tier defaults table (~50 conns on db-g1-small, etc.).
    - Math:
        cloud_sql_max_connections >= pgbouncer_replicas * pool_size + reserved_admin
        pgbouncer_replicas * pgbouncer_max_client_conn >= sum(app inflight)
    - Sizing checklist: size backends first, then client capacity; bump PgBouncer before the Cloud SQL tier; max_connections is a flag, not a tier limit.
    - Diagnostics: pg_stat_activity from a server pod, pgbouncer SHOW POOLS and SHOW STATS via the admin port; how to read cl_waiting / maxwait to spot throttling.

    Stale Pending PVCs cleanup (replaces the prior 2-command snippet):
    - 5-step safe-cleanup recipe: inventory, confirm no pod / deploy / sts / job references the PVC, snapshot before delete, delete one at a time without force-removing finalizers, verify.
    - jq filters to find references in pod specs and workload templates.
    - Explanation of why a non-empty VOLUME field means the PVC was bound at some point and should not be deleted blindly.
    - Root-cause callout: stale Pendings come from kubectl apply of kind-profile manifests on GKE; the Helm chart's dynamic-provisioning PVCs avoid the issue.

    kadyapam committed May 24, 2026
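The two inequalities in the Math bullet can be checked mechanically before changing any knob. A minimal Python sketch, assuming one database/user pair per PgBouncer replica (PgBouncer's pool size applies per user/database pair, so extra pairs multiply backend use) and that `app_inflight` lists the in-flight session count of each app pod. `PoolBudget` and `check_budget` are hypothetical names, not part of the playbook or chart.

```python
from dataclasses import dataclass


@dataclass
class PoolBudget:
    # Hypothetical container for the knobs named on the wiki page.
    cloud_sql_max_connections: int
    pgbouncer_replicas: int
    pool_size: int                  # pgbouncer_pool_size; assumes a single db/user pair
    reserved_admin: int             # backend connections kept free for admin work
    pgbouncer_max_client_conn: int
    app_inflight: list[int]         # in-flight sessions per app pod


def check_budget(b: PoolBudget) -> list[str]:
    """Return a list of violated constraints (empty list means the budget fits)."""
    problems = []
    # Backend side: every PgBouncer replica may open pool_size server connections.
    server_side = b.pgbouncer_replicas * b.pool_size + b.reserved_admin
    if b.cloud_sql_max_connections < server_side:
        problems.append(
            f"backend: need max_connections >= {server_side}, "
            f"have {b.cloud_sql_max_connections}"
        )
    # Client side: PgBouncer must accept every session the apps can hold open.
    client_capacity = b.pgbouncer_replicas * b.pgbouncer_max_client_conn
    demand = sum(b.app_inflight)
    if client_capacity < demand:
        problems.append(
            f"client: PgBouncer accepts {client_capacity}, apps may open {demand}"
        )
    return problems
```

For example, 2 replicas with pool_size 20 plus 5 reserved needs 45 backend connections, which fits the ~50-connection db-g1-small default; a third replica raises that to 65 and the first check fails.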
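The jq filters themselves are not reproduced here. As a rough Python equivalent of step 2 of the cleanup recipe (confirm nothing references the PVC), assuming the input is the parsed JSON from `kubectl get pod,deploy,sts,job -n <namespace> -o json`, the check could look like this. `pvc_references` and `was_bound` are hypothetical helper names, and CronJobs are deliberately not covered since the recipe lists only pod / deploy / sts / job.

```python
def _from_template(claim: str, template: str, sts: str) -> bool:
    # StatefulSet volumeClaimTemplates produce PVCs named <template>-<statefulset>-<ordinal>.
    prefix = f"{template}-{sts}-"
    return claim.startswith(prefix) and claim[len(prefix):].isdigit()


def pvc_references(objects: dict, namespace: str, claim: str) -> list[str]:
    """Return 'Kind/name' for each object in `objects` that references PVC `claim`."""
    refs: list[str] = []
    for item in objects.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("namespace") != namespace:
            continue
        kind, name = item.get("kind", ""), meta.get("name", "")
        spec = item.get("spec", {})
        # Pods list volumes directly; Deployments, StatefulSets and Jobs list them in the pod template.
        pod_spec = spec if kind == "Pod" else spec.get("template", {}).get("spec", {})
        mounted = any(
            (vol.get("persistentVolumeClaim") or {}).get("claimName") == claim
            for vol in pod_spec.get("volumes") or []
        )
        templated = kind == "StatefulSet" and any(
            _from_template(claim, t["metadata"]["name"], name)
            for t in spec.get("volumeClaimTemplates") or []
        )
        if mounted or templated:
            refs.append(f"{kind}/{name}")
    return refs


def was_bound(pvc: dict) -> bool:
    # The VOLUME column of `kubectl get pvc` shows spec.volumeName;
    # non-empty means the PVC was bound at some point, so do not delete it blindly.
    return bool(pvc.get("spec", {}).get("volumeName"))
```

An empty result from `pvc_references` together with `was_bound(...) == False` is the case the recipe treats as safe to delete; anything else calls for the snapshot and one-at-a-time steps.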
  • wiki(automation-gcp-gke): add GKE Helm install + clarify kind/GKE KEDA split

    New page automation-gcp-gke.md documenting the GKE install path:
    - Topology (Cloud SQL + PgBouncer + Helm NATS + chart-templated KEDA).
    - Prerequisites: gcloud / kubectl / helm, GCP project setup, one-time KEDA + cert-manager installs.
    - Install via the noetl_gke_fresh_stack.yaml playbook with the frequently-edited workload variables called out.
    - Upgrade flow (helm upgrade --reuse-values), including the gotcha where --reuse-values does not merge new chart defaults (PR #116 migration hit this).
    - Verify checklist including KEDA scaledobject, NATS durable consumer, Cloud SQL connectivity through PgBouncer, and a smoke run.
    - Tuning section for KEDA, PgBouncer connection budget, and Cloud SQL HA.
    - Common pitfalls: two autoscalers fighting (HPA conflict that drove ops #115), live-patching the autoscaler (the anti-pattern that drove ops #116), worker durable consumer drift (noetl #600), and stale Pending PVCs.

    Update Home.md to add an Automation playbooks row for the new page.
    Update _Sidebar.md with an Automation section.
    Update manifests-keda.md to be honest about the kind/GKE split: the existing page sample (account: NOETL, nats.nats.svc:8222) is the kind-cluster artifact; the GKE artifact is chart-rendered with account: $G and nats-headless. Added a profile-note callout near the top and a cross-link in Related.

    Cross-references:
    - ai-meta decision doc 2026-05-24-gke-postgres-topology (Option A).
    - ops PRs #115, #116 (HPA conflict + KEDA chart promotion).
    - noetl PR #600 (worker consumer self-heal).

    kadyapam committed May 24, 2026