diff --git a/docs/.vitepress/config.mts b/docs/.vitepress/config.mts
index 6a58cdd..6790828 100644
--- a/docs/.vitepress/config.mts
+++ b/docs/.vitepress/config.mts
@@ -252,7 +252,12 @@ export default withMermaid(
{ text: "Download config files", link: "/self-hosting/methods/download-config" },
],
},
- { text: "Kubernetes", link: "/self-hosting/methods/kubernetes" },
+ {
+ text: "Kubernetes",
+ link: "/self-hosting/methods/kubernetes",
+ collapsed: true,
+ items: [{ text: "High availability", link: "/self-hosting/govern/high-availability" }],
+ },
{ text: "Podman Quadlets", link: "/self-hosting/methods/podman-quadlets" },
{
text: "Airgapped Edition",
diff --git a/docs/self-hosting/govern/high-availability.md b/docs/self-hosting/govern/high-availability.md
new file mode 100644
index 0000000..1e32969
--- /dev/null
+++ b/docs/self-hosting/govern/high-availability.md
@@ -0,0 +1,706 @@
+---
+title: High Availability Deployment
+description: How to deploy Plane Commercial Edition on Kubernetes with high availability using the plane-enterprise Helm chart.
+keywords: plane high availability, kubernetes ha, multi-az deployment, plane-enterprise helm chart, karpenter, pod disruption budget, hpa, self-hosting, plane kubernetes
+---
+
+# High Availability on Kubernetes
+
+This guide covers what high availability means, how the `plane-enterprise` Helm chart workloads behave under failure, and exactly what to configure so your deployment survives the loss of a single availability zone or node without manual recovery. The setup is cloud-agnostic. If you're deploying on AWS with Karpenter, there's a dedicated section for you.
+
+Read this alongside the chart's [README](https://github.com/makeplane/helm-charts/blob/master/charts/plane-enterprise/README.md) and [values.yaml](https://github.com/makeplane/helm-charts/blob/master/charts/plane-enterprise/values.yaml).
+
+## What HA means here
+
+Plane Commercial Edition is a single-region application. There's one primary Postgres, one Redis, one message queue, one search cluster. High availability here means Plane keeps serving traffic when **one AZ or one node disappears**, not that you can run two independent active-active regions.
+
+That's an important distinction for how you plan your infrastructure. You're engineering for node and AZ fault tolerance, not geographic redundancy. The playbook: run stateless workloads with multiple replicas spread across AZs, and replace every in-chart stateful service with a managed, multi-AZ equivalent.
+
+## Workload tiers
+
+Every workload in the chart falls into one of three tiers. The tier determines how you scale it, how it recovers from failure, and what HA configuration it needs.
+
+### Tier 1 - Stateless, scale horizontally
+
+These run as `Deployment`s with no local state. Scale them freely across nodes and AZs.
+
+`api`, `web`, `space`, `admin`, `live`, `worker`, `silo`, `email_service`, `outbox_poller`, `automation_consumer`, `pi`, `pi_worker`, `runner`, `iframely`
+
+Run at least `replicas: 2` per service. Use `replicas >= 3` for `api`, `worker`, `web`, and `live` - they carry the most traffic, and three replicas let you place one in each AZ.
+
+### Tier 2 - Singletons (replicas: 1 only)
+
+These do scheduled or coordinator work. **Do not scale any of them past `replicas: 1`** - running two copies doubles job execution.
+
+| Workload | Kind | Why it stays at 1 |
+| ---------------- | ----------- | -------------------------------------------- |
+| `monitor` | StatefulSet | Coordinator role; owns a `ReadWriteOnce` PVC |
+| `beatworker` | Deployment | Celery beat - schedules periodic Plane jobs |
+| `pi_beat_worker` | Deployment | PI beat - schedules periodic PI jobs |
+| `migrator` | Job | DB migration; runs once per release |
+| `pi-migrator` | Job | PI DB migration; runs once per release |
+
+The stateless singletons (`beatworker`, `pi_beat_worker`) reschedule onto a healthy node within seconds when their node fails.
+
+`monitor` is different: it owns an AZ-bound `ReadWriteOnce` PVC. On AZ failure, Kubernetes has to reschedule it onto a node in a live AZ and reattach the volume - expect a **60–120 second** recovery window. That's acceptable because `monitor` is an internal component, not user-facing.
+
+`migrator` and `pi-migrator` are run-once-per-release Jobs. They aren't long-running, but they still must not run in parallel.
+
+### Tier 3 - Local stateful (not HA)
+
+The chart ships optional in-cluster StatefulSets for development and small deployments:
+
+`postgres`, `redis`, `rabbitmq`, `opensearch`, `minio`
+
+These use single-replica `ReadWriteOnce` PVCs. They're **not HA.** Their data is pinned to one disk in one AZ, and the chart doesn't configure replication, failover, or quorum.
+
+**For every HA deployment, set `local_setup: false` for every Tier-3 service** and point Plane at managed, multi-AZ equivalents. The [External managed services](#external-managed-services) section has the exact value keys.
+
+## Cluster prerequisites
+
+Your cluster needs the following before installing in HA mode.
+
+**1. Worker nodes in at least three AZs.** Three is the minimum for any quorum service (etcd, Postgres synchronous replicas, OpenSearch master quorum). Two AZs survive single-AZ loss for stateless workloads but can't maintain quorum.
+
+**2. A default `StorageClass` with `volumeBindingMode: WaitForFirstConsumer`.** This is non-negotiable when Tier-2 singletons run on nodes provisioned just-in-time (Karpenter, Cluster Autoscaler). Without it, a PVC can bind to a zone before the pod schedules, leaving the pod unable to find a matching node.
+
+Example for AWS EBS gp2:
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: gp2
+parameters:
+  type: gp2
+  fsType: ext4
+provisioner: ebs.csi.aws.com
+volumeBindingMode: WaitForFirstConsumer
+reclaimPolicy: Retain
+allowVolumeExpansion: true
+```
+
+Then set this in `values.yaml`:
+
+```yaml
+env:
+  storageClass: gp2
+```
+
+**3. A cross-zone load balancer.** Traffic must reach pods in any AZ.
+
+| Cloud | Recommendation |
+| ------- | ------------------------------------------------- |
+| AWS | NLB or ALB with cross-zone load balancing enabled |
+| GCP | Default global LB |
+| Azure | Standard Load Balancer with zones `[1,2,3]` |
+| On-prem | MetalLB in BGP mode, or an external LB |
+
+**4. A working `IngressClass`.** The chart supports `traefik` (default) or `nginx`. Deploy the ingress controller with `replicas >= 2` spread across AZs.
+
+**5. AZ-aware node labels.** Kubernetes uses `topology.kubernetes.io/zone` for AZ awareness. Managed clusters populate this automatically. Verify your nodes carry this label if you're on a self-managed cluster.
+
+## Recommended topology
+
+```text
+              ┌──────────────────────────┐
+              │  External Load Balancer  │
+              │   (cross-zone enabled)   │
+              └────────────┬─────────────┘
+                           │
+       ┌───────────────────┼───────────────────┐
+       │                   │                   │
+  ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
+  │  AZ-a   │         │  AZ-b   │         │  AZ-c   │
+  │         │         │         │         │         │
+  │ ingress │         │ ingress │         │ ingress │
+  │ api x N │         │ api x N │         │ api x N │
+  │ web x N │         │ web x N │         │ web x N │
+  │ worker  │         │ worker  │         │ worker  │
+  │    …    │         │    …    │         │    …    │
+  └─────────┘         └─────────┘         └─────────┘
+                           │
+         ┌─────────────────┼─────────────────┐
+         │                 │                 │
+ ┌───────▼──────┐  ┌───────▼──────┐  ┌───────▼──────┐
+ │   Managed    │  │   Managed    │  │    Object    │
+ │   Postgres   │  │    Redis     │  │   Storage    │
+ │  (multi-AZ)  │  │  (multi-AZ)  │  │  (S3-class)  │
+ └──────────────┘  └──────────────┘  └──────────────┘
+ ┌──────────────┐  ┌──────────────┐
+ │   Managed    │  │   Managed    │
+ │   RabbitMQ   │  │  OpenSearch  │
+ │  (cluster)   │  │  (multi-AZ)  │
+ └──────────────┘  └──────────────┘
+```
+
+Tier-1 pods spread across AZs. All Tier-3 state lives in managed services that handle their own replication and failover.
+
+## External managed services
+
+### Value keys
+
+The chart supports pointing each stateful component at a remote managed service. Use these value keys.
+
+| Component | Disable local | External URL / credentials |
+| ------------ | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Postgres | `services.postgres.local_setup: false` | `env.pgdb_remote_url`, `env.pg_pi_db_remote_url`; optional read replica via `services.postgres.read_replica.enabled` + `services.postgres.read_replica.remote_url` |
+| Redis | `services.redis.local_setup: false` | `env.remote_redis_url` |
+| RabbitMQ | `services.rabbitmq.local_setup: false` | `services.rabbitmq.external_rabbitmq_url` |
+| OpenSearch | `services.opensearch.local_setup: false` | `env.opensearch_remote_url`, `env.opensearch_remote_username`, `env.opensearch_remote_password`; optional `env.opensearch_index_prefix` for multi-tenant clusters |
+| Object store | `services.minio.local_setup: false` | `env.aws_access_key`, `env.aws_secret_access_key`, `env.aws_region`, `env.aws_s3_endpoint_url`, `env.docstore_bucket` |
+
+### What HA looks like for each service
+
+Setting `local_setup: false` doesn't make your data tier HA on its own. The managed service you point Plane at must also be HA. Here's what each one needs.
+
+- **Postgres** - Multi-AZ primary with synchronous replication and automated failover. Use RDS Multi-AZ, Cloud SQL HA, Azure Flexible Server zone-redundant, or self-managed Patroni.
+
+- **Redis** - A replica group with automatic failover. Use ElastiCache Multi-AZ, Memorystore HA, or Redis Sentinel/Cluster. Redis failover drops in-flight connections; Plane reconnects automatically.
+
+- **RabbitMQ** - A true cluster with quorum queues across ≥3 nodes in ≥3 AZs. CloudAMQP and Amazon MQ for RabbitMQ in cluster mode both work. A single-node managed RabbitMQ is **not** HA.
+
+- **OpenSearch** - ≥3 master-eligible nodes across 3 AZs, plus data nodes spread across AZs.
+
+- **Object storage** - S3, GCS, and Azure Blob are multi-AZ by design.
+
+## Spreading pods across availability zones
+
+### How the chart exposes scheduling controls
+
+The chart exposes `nodeSelector`, `tolerations`, and `affinity` on every service (see `templates/_helpers.tpl` → `plane.podScheduling`). Use these to spread Tier-1 pods across AZs.
+
+:::info
+The chart doesn't natively support `topologySpreadConstraints` - that's on the roadmap. Use `podAntiAffinity` in the meantime. It achieves comparable AZ spreading, though it can't enforce a maximum skew between zones.
+:::
+
+### Recommended pattern: soft AZ anti-affinity + hard node anti-affinity
+
+Use a hard rule to prevent two replicas landing on the same node, and a soft rule to prefer spreading across AZs. The soft AZ rule means the scheduler can still place pods if one AZ is under pressure.
+
+```yaml
+services:
+  api:
+    replicas: 3
+    affinity:
+      podAntiAffinity:
+        # Hard: never put two api pods on the same node
+        requiredDuringSchedulingIgnoredDuringExecution:
+          - labelSelector:
+              matchExpressions:
+                - key: app.name
+                  operator: In
+                  values:
+                    # Substitute your namespace and release name
+                    - NAMESPACE-RELEASE-api
+            topologyKey: kubernetes.io/hostname
+        # Soft: prefer spreading api pods across AZs
+        preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            podAffinityTerm:
+              labelSelector:
+                matchExpressions:
+                  - key: app.name
+                    operator: In
+                    values:
+                      - NAMESPACE-RELEASE-api
+              topologyKey: topology.kubernetes.io/zone
+```
+
+The chart labels every workload with `app.name` set to `{{ .Release.Namespace }}-{{ .Release.Name }}-<svc>`. For a release named `plane` in namespace `plane`, that's `plane-plane-api` for the API.
+
+:::warning
+**Watch for this**
+The hard hostname anti-affinity rule requires at least as many schedulable nodes as the workload's replica count. Three `api` replicas need three nodes available, or pods sit `Pending`. If you can't guarantee that (small cluster, dedicated taints), relax the hostname rule to `preferredDuringSchedulingIgnoredDuringExecution`.
+:::
+
+Apply this pattern to every Tier-1 service: `web`, `space`, `admin`, `live`, `worker`, `silo`, `email_service`, `outbox_poller`, `automation_consumer`, `pi`, `pi_worker`, `runner`, `iframely`.
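+
+As a sketch of how the pattern carries over, here is the same block adapted for `web`. It assumes the `app.name` label follows the namespace-release-service form described above; `NAMESPACE` and `RELEASE` are placeholders for your own values, and the replica count is illustrative.
+
+```yaml
+services:
+  web:
+    replicas: 3
+    affinity:
+      podAntiAffinity:
+        # Hard: never put two web pods on the same node
+        requiredDuringSchedulingIgnoredDuringExecution:
+          - labelSelector:
+              matchExpressions:
+                - key: app.name
+                  operator: In
+                  values:
+                    - NAMESPACE-RELEASE-web # placeholder label value
+            topologyKey: kubernetes.io/hostname
+        # Soft: prefer spreading web pods across AZs
+        preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            podAffinityTerm:
+              labelSelector:
+                matchExpressions:
+                  - key: app.name
+                    operator: In
+                    values:
+                      - NAMESPACE-RELEASE-web # placeholder label value
+              topologyKey: topology.kubernetes.io/zone
+```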
+
+### Pinning workloads to specific node pools
+
+Use `nodeSelector` and `tolerations` to route a workload to a specific pool - for example, spot instances for batch workers:
+
+```yaml
+services:
+  worker:
+    replicas: 6
+    nodeSelector:
+      workload-class: batch
+    tolerations:
+      - key: workload-class
+        operator: Equal
+        value: batch
+        effect: NoSchedule
+```
+
+## PodDisruptionBudgets
+
+:::info
+Native PDB rendering is planned for a future release. Apply the manifests below yourself until then.
+:::
+
+PDBs protect Tier-1 deployments from voluntary disruption - a node drain or cluster upgrade - taking a service down entirely. Without them, Kubernetes can evict all pods of a deployment simultaneously.
+
+Apply this manifest in the same namespace as your release. Replace `RELEASE` and `NAMESPACE` with your values.
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-api-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-api
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-web-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-web
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-space-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-space
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-admin-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-admin
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-live-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-live
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-worker-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-worker
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-silo-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-silo
+```
+
+Add similar PDBs for `pi`, `pi_worker`, `outbox_poller`, `automation_consumer`, `email_service`, `runner`, and `iframely` if you have enabled them.
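+
+For instance, a PDB for `pi` might look like the sketch below. The label value assumes `pi` follows the same `app.name` convention as the services above; confirm the actual label with `kubectl get pods --show-labels` before applying.
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: plane-pi-pdb
+  namespace: NAMESPACE
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app.name: NAMESPACE-RELEASE-pi # assumed label; verify on your cluster
+```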
+
+:::warning
+**Don't create PDBs for Tier-2 singletons** (`beatworker`, `pi_beat_worker`, `monitor`, `migrator`). A `minAvailable: 1` PDB on a `replicas: 1` workload blocks node drains entirely.
+:::
+
+## HorizontalPodAutoscalers
+
+:::info
+Native HPA rendering is planned for a future release. Apply the manifests below yourself until then.
+:::
+
+HPAs scale Tier-1 services automatically under load. The thresholds below match the default resource requests in `values.yaml`. Tune `averageUtilization` and `maxReplicas` based on observed production load.
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: plane-api-hpa
+  namespace: NAMESPACE
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: RELEASE-api-wl
+  minReplicas: 3
+  maxReplicas: 12
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+    - type: Resource
+      resource:
+        name: memory
+        target:
+          type: Utilization
+          averageUtilization: 80
+---
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: plane-worker-hpa
+  namespace: NAMESPACE
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: RELEASE-worker-wl
+  minReplicas: 3
+  maxReplicas: 20
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+---
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: plane-web-hpa
+  namespace: NAMESPACE
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: RELEASE-web-wl
+  minReplicas: 2
+  maxReplicas: 8
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+```
+
+:::warning
+**Never create an HPA for `beatworker`, `pi_beat_worker`, `monitor`, or any migration Job.** Scheduled jobs would fire multiple times.
+:::
+
+## Karpenter on AWS
+
+If you're on EKS, Karpenter is the recommended node provisioner for Plane Commercial Edition. It's AZ-aware, provisions nodes in seconds, and lets you mix on-demand and spot capacity per workload type.
+
+### Minimum versions
+
+- Karpenter ≥ v1.0
+- Kubernetes ≥ 1.29
+- AWS Load Balancer Controller ≥ v2.7
+- AWS EBS CSI driver installed
+
+### EC2NodeClass
+
+One `EC2NodeClass` covers most installs. Use AL2023, IMDSv2-only, and gp2 root volumes.
+
+```yaml
+apiVersion: karpenter.k8s.aws/v1
+kind: EC2NodeClass
+metadata:
+  name: plane-default
+spec:
+  amiFamily: AL2023
+  amiSelectorTerms:
+    - alias: al2023@latest
+  role: KarpenterNodeRole-CLUSTER_NAME
+  subnetSelectorTerms:
+    - tags:
+        karpenter.sh/discovery: CLUSTER_NAME
+  securityGroupSelectorTerms:
+    - tags:
+        karpenter.sh/discovery: CLUSTER_NAME
+  blockDeviceMappings:
+    - deviceName: /dev/xvda
+      ebs:
+        volumeType: gp2
+        volumeSize: 100Gi
+        encrypted: true
+        deleteOnTermination: true
+  metadataOptions:
+    httpEndpoint: enabled
+    httpTokens: required
+    httpPutResponseHopLimit: 1
+```
+
+### NodePools
+
+Two NodePools cover most deployments: an on-demand pool for general Tier-1 workloads, and a spot pool for batch workers (`worker`, `pi_worker`, `runner`, `outbox_poller`, `automation_consumer`).
+
+```yaml
+apiVersion: karpenter.sh/v1
+kind: NodePool
+metadata:
+  name: plane-general
+spec:
+  template:
+    spec:
+      nodeClassRef:
+        group: karpenter.k8s.aws
+        kind: EC2NodeClass
+        name: plane-default
+      requirements:
+        - key: kubernetes.io/arch
+          operator: In
+          values: [amd64]
+        - key: karpenter.sh/capacity-type
+          operator: In
+          values: [on-demand]
+        - key: karpenter.k8s.aws/instance-category
+          operator: In
+          values: [c, m]
+        - key: karpenter.k8s.aws/instance-generation
+          operator: Gt
+          values: ["5"]
+        - key: topology.kubernetes.io/zone
+          operator: In
+          values: [REGION-a, REGION-b, REGION-c]
+      expireAfter: 720h
+  limits:
+    cpu: "200"
+    memory: 400Gi
+  disruption:
+    consolidationPolicy: WhenEmptyOrUnderutilized
+    consolidateAfter: 1m
+---
+apiVersion: karpenter.sh/v1
+kind: NodePool
+metadata:
+  name: plane-spot
+spec:
+  template:
+    spec:
+      nodeClassRef:
+        group: karpenter.k8s.aws
+        kind: EC2NodeClass
+        name: plane-default
+      taints:
+        - key: workload-class
+          value: batch
+          effect: NoSchedule
+      requirements:
+        - key: kubernetes.io/arch
+          operator: In
+          values: [amd64]
+        - key: karpenter.sh/capacity-type
+          operator: In
+          values: [spot]
+        - key: karpenter.k8s.aws/instance-category
+          operator: In
+          values: [c, m, r]
+        - key: karpenter.k8s.aws/instance-generation
+          operator: Gt
+          values: ["5"]
+        - key: topology.kubernetes.io/zone
+          operator: In
+          values: [REGION-a, REGION-b, REGION-c]
+      expireAfter: 24h
+  limits:
+    cpu: "400"
+    memory: 800Gi
+  disruption:
+    consolidationPolicy: WhenEmptyOrUnderutilized
+    consolidateAfter: 5m
+```
+
+Match the spot NodePool taint with tolerations in your values:
+
+```yaml
+services:
+  worker:
+    tolerations:
+      - key: workload-class
+        operator: Equal
+        value: batch
+        effect: NoSchedule
+    nodeSelector:
+      karpenter.sh/nodepool: plane-spot
+```
+
+### How Karpenter interacts with AZ spread
+
+- Karpenter respects `podAntiAffinity` when deciding which AZ to provision a node in. The affinity patterns from the previous section are sufficient to drive Karpenter's AZ distribution - no extra configuration needed.
+
+- Don't add `karpenter.sh/do-not-disrupt: "true"` to Tier-1 pods. They're stateless. Let Karpenter consolidate them freely.
+
+- Do add it to Tier-2 singletons (`beatworker`, `pi_beat_worker`, `monitor`) and to in-flight long-running Jobs (`migrator`). They tolerate rescheduling, but you don't want Karpenter bouncing them during a deployment:
+
+```yaml
+services:
+  beatworker:
+    annotations:
+      karpenter.sh/do-not-disrupt: "true"
+```
+
+- `consolidateAfter: 1m` on the on-demand pool keeps the cluster cost-efficient. Raise it to `5m` or `10m` if you see churn during normal scaling. The spot pool's `expireAfter: 24h` forces daily node recycling, spreading the impact of spot interruptions across time rather than concentrating them.
+
+## Ingress and load balancer
+
+- Deploy the ingress controller (`traefik` or `nginx`) with `replicas >= 2` spread across AZs using the same `podAntiAffinity` pattern.
+
+- Enable cross-zone load balancing on the cloud LB. On AWS:
+
+  ```yaml
+  service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
+  ```
+
+- The `live` service uses WebSockets. Make sure your ingress controller and LB don't have idle-timeout values that drop long-lived connections. The default AWS NLB idle timeout is 350s - that's usually fine. ALB defaults to 60s and needs raising for WebSocket connections.
+
+- The chart configures request-body size limits via `ingress.traefik.maxRequestBodyBytes` (Traefik) and `nginx.ingress.kubernetes.io/proxy-body-size` (nginx). Tune these to your expected file upload size.
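+
+If an ALB managed by the AWS Load Balancer Controller sits in front of Plane, the idle timeout mentioned above can be raised with a load balancer attribute on the Ingress that creates the ALB. A minimal sketch, assuming that setup; the 3600-second value is an illustrative choice, not a Plane recommendation:
+
+```yaml
+metadata:
+  annotations:
+    # Raises the ALB idle timeout from its 60s default (value is illustrative)
+    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600
+```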
+
+## Backup and disaster recovery
+
+HA protects against AZ and node failure. Backups protect against logical corruption, accidental deletion, and ransomware. You need both.
+
+| Component | Backup mechanism | Recommended retention |
+| ------------------ | --------------------------------------------------------------------------------------------------------------------------- | ---------------------- |
+| Postgres | Managed-service automated backups + PITR | 30 days, PITR ≥ 7 days |
+| Object storage | Bucket versioning + lifecycle to a different bucket/region | 90 days |
+| OpenSearch | Snapshots to object storage | 7 days |
+| Redis | Optional; treat as cache + queue. Document what your team loses on a full Redis failure (sessions, in-flight Celery tasks). | - |
+| RabbitMQ | Definitions export (users, queues, bindings) on a schedule; messages are transient | - |
+| Kubernetes objects | Velero, namespace-scoped, daily | 30 days |
+
+**Run a restore drill** before go-live and at least once per quarter. A backup that's never been restored is an assumption, not a guarantee.
+
+## Pre-go-live checklist
+
+Work through every item before sending real traffic.
+
+- [ ] Cluster has worker nodes in ≥3 AZs
+- [ ] Default `StorageClass` is `WaitForFirstConsumer`
+- [ ] `env.storageClass` is set to that class
+- [ ] All Tier-3 `local_setup` flags are `false`
+- [ ] Managed Postgres is multi-AZ with synchronous replica
+- [ ] Managed Redis has replica + auto-failover
+- [ ] Managed RabbitMQ is a true cluster across ≥3 AZs
+- [ ] Managed OpenSearch has ≥3 masters across ≥3 AZs
+- [ ] Object storage is multi-AZ (S3/GCS/Blob) with versioning enabled
+- [ ] Every Tier-1 service has `replicas >= 2` (3 for `api`, `worker`, `web`)
+- [ ] Every Tier-1 service has a `podAntiAffinity` block (hostname + zone)
+- [ ] Every Tier-1 service has a PDB
+- [ ] HPAs applied for `api`, `worker`, `web` at minimum
+- [ ] No HPA or PDB on `beatworker`, `pi_beat_worker`, `monitor`, `migrator`
+- [ ] Ingress controller runs with `replicas >= 2` spread across AZs
+- [ ] LB has cross-zone load balancing enabled
+- [ ] Backups configured and a restore drill has succeeded
+- [ ] Failure drill: cordon and drain every node in one AZ; Plane stays up
+- [ ] Failure drill: kill the active Postgres node; Plane recovers
+
+## Known chart gaps
+
+The following capabilities aren't natively provided by the chart and need to be applied separately.
+
+| Gap | Workaround |
+| -------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
+| No native `topologySpreadConstraints` in `plane.podScheduling` | Use `podAntiAffinity` as shown in the spreading section - comparable AZ spread, without a maximum-skew guarantee |
+| No PDBs rendered by the chart | Apply the PDB manifests from the PodDisruptionBudgets section |
+| No HPAs rendered by the chart | Apply the HPA manifests from the HorizontalPodAutoscalers section |
+| In-chart Tier-3 StatefulSets are single-replica, RWO | Set `local_setup: false` and use managed services |
+| `monitor` is a singleton StatefulSet | Accept the 60–120s reschedule window on AZ failure - it's internal and non-user-facing |
+
+## Reference values.yaml for HA
+
+A minimal example that disables every local stateful service and runs each listed Tier-1 workload with at least two replicas. AZ anti-affinity is shown for `api` only. Adapt names to your release.
+
+```yaml
+planeVersion: v2.6.0
+
+license:
+  licenseServer: https://prime.plane.so
+  licenseDomain: plane.example.com
+
+ingress:
+  enabled: true
+  ingressClass: traefik
+
+env:
+  storageClass: gp2
+  pgdb_remote_url: "postgres://plane:***@pg-primary.example.internal:5432/plane?sslmode=require"
+  pg_pi_db_remote_url: "postgres://plane:***@pg-primary.example.internal:5432/plane_pi?sslmode=require"
+  remote_redis_url: "redis://:***@redis.example.internal:6379/0"
+  opensearch_remote_url: "https://opensearch.example.internal:9200"
+  opensearch_remote_username: plane
+  opensearch_remote_password: "***"
+  aws_access_key: "***"
+  aws_secret_access_key: "***"
+  aws_region: us-east-1
+  aws_s3_endpoint_url: https://s3.us-east-1.amazonaws.com
+  docstore_bucket: plane-uploads-prod
+  web_url: https://plane.example.com
+  instance_admin_email: admin@example.com
+  cors_allowed_origins: https://plane.example.com
+
+services:
+  postgres:
+    local_setup: false
+    read_replica:
+      enabled: true
+      remote_url: "postgres://plane:***@pg-reader.example.internal:5432/plane?sslmode=require"
+  redis:
+    local_setup: false
+  rabbitmq:
+    local_setup: false
+    external_rabbitmq_url: "amqps://plane:***@rabbitmq.example.internal:5671/plane"
+  opensearch:
+    local_setup: false
+  minio:
+    local_setup: false
+
+  api:
+    replicas: 3
+    affinity: &spread-api
+      podAntiAffinity:
+        requiredDuringSchedulingIgnoredDuringExecution:
+          - labelSelector:
+              matchExpressions:
+                - { key: app.name, operator: In, values: [plane-plane-api] }
+            topologyKey: kubernetes.io/hostname
+        preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 100
+            podAffinityTerm:
+              labelSelector:
+                matchExpressions:
+                  - { key: app.name, operator: In, values: [plane-plane-api] }
+              topologyKey: topology.kubernetes.io/zone
+
+  web: { replicas: 3 }
+  space: { replicas: 2 }
+  admin: { replicas: 2 }
+  live: { replicas: 3 }
+  worker: { replicas: 4 }
+  silo: { enabled: true, replicas: 2 }
+
+  beatworker: { replicas: 1 } # singleton - do not scale
+  pi_beat_worker: { replicas: 1 } # singleton - do not scale
+```
+
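+Repeat the `affinity` block for every Tier-1 service, changing the `app.name` value to match each one. A YAML alias (`*spread-api`) copies the anchored block verbatim, label selector included, so it only helps for parts that are identical across services.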