Add Cloud Best Practices guide, add guide on Release Channels #3152

Merged
merged 2 commits on May 19, 2023
21 changes: 1 addition & 20 deletions site/content/en/docs/Advanced/controlling-disruption.md
@@ -74,26 +74,7 @@ GKE Autopilot supports only `Never` and `Always`, not `OnUpgrade`.

## Considerations for long sessions

Outside of Cluster Autoscaler, the main source of disruption for long sessions is node upgrade. On some cloud products, such as GKE Standard, node upgrades are entirely within your control. On others, such as GKE Autopilot, node upgrade is automatic. Typical node upgrades use an eviction based, rolling recreate strategy, and may not honor `PodDisruptionBudget` for longer than an hour. Here we document strategies you can use for your cloud product to support long sessions.

### On GKE

On GKE, there are currently two possible approaches to manage disruption for session lengths longer than an hour:

* (GKE Standard/Autopilot) [Blue/green deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html) at the cluster level: If you are using an automated deployment process, you can:
* create a new, `green` cluster within a release channel e.g. every week,
* use [maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#exclusions) to prevent node upgrades for 30d, and
* scale the `Fleet` on the old, `blue` cluster down to 0, and
* use [multi-cluster allocation]({{< relref "multi-cluster-allocation.md" >}}) on Agones, which will then direct new allocations to the new `green` cluster (since `blue` has 0 desired), then
* delete the old, `blue` cluster when the `Fleet` successfully scales down.

* (GKE Standard only) Use [node pool blue/green upgrades](https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#blue-green-upgrade-strategy)

### Other cloud products

The blue/green cluster strategy described for GKE is likely applicable to your cloud product.

We welcome contributions to this section for other products!
Outside of Cluster Autoscaler, the main source of disruption for long sessions is node upgrade. On some cloud products, such as GKE Standard, node upgrades are entirely within your control. On others, such as GKE Autopilot, node upgrade is automatic. Typical node upgrades use an eviction based, rolling recreate strategy, and may not honor `PodDisruptionBudget` for longer than an hour. See [Best Practices]({{< relref "Best Practices" >}}) for information specific to your cloud product.

## Implementation / Under the hood

45 changes: 45 additions & 0 deletions site/content/en/docs/Guides/Best Practices/_index.md
@@ -0,0 +1,45 @@
---
title: "Best Practices"
linkTitle: "Best Practices"
date: 2023-05-12T00:00:00Z
weight: 9
description: "Best practices for running Agones in production."
---

## Overview

Running Agones in production requires careful consideration, from planning your launch to choosing
the best course of action for cluster and Agones upgrades. On this page, we've collected
some general best practices. We also have cloud specific pages for:

* [Google Kubernetes Engine (GKE)]({{< relref "gke.md" >}})

If you are interested in submitting best practices for your cloud provider / on-prem, [please contribute!]({{< relref "/Contribute" >}})

## Separation of Agones from GameServer nodes

When running in production, Agones should be scheduled on a dedicated pool of nodes, distinct from the nodes where
game servers are scheduled, for better isolation and resiliency. By default, Agones prefers to be scheduled on nodes labeled with
`agones.dev/agones-system=true` and tolerates the node taint `agones.dev/agones-system=true:NoExecute`.
If no dedicated nodes are available, Agones will run on regular nodes. See [taints and tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
for more information about Kubernetes taints and tolerations.
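A dedicated node pool with this label and taint can be created when setting up your cluster. As a sketch, on GKE Standard this might look like the following (the cluster name, pool name, region, and node count are placeholders — adjust for your environment):

```shell
# Sketch: create a dedicated node pool for Agones system components.
# Cluster name, pool name, region, and node count are placeholders.
gcloud container node-pools create agones-system \
    --cluster=my-cluster \
    --region=us-west1 \
    --node-labels=agones.dev/agones-system=true \
    --node-taints=agones.dev/agones-system=true:NoExecute \
    --num-nodes=1
```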

If you are collecting [Metrics]({{< relref "metrics" >}}) using our standard Prometheus installation, see
[the installation guide]({{< relref "metrics#prometheus-installation" >}}) for instructions on configuring a separate node pool for the `agones.dev/agones-metrics=true` taint.

See [Creating a Cluster]({{< relref "Creating Cluster" >}}) for initial set up on your cloud provider.

## Redundant Clusters

### Allocate Across Clusters

Agones supports [Multi-cluster Allocation]({{< relref "multi-cluster-allocation" >}}), allowing you to allocate from a set of clusters rather than relying on a single cluster as a potential point of failure. There are several other options for multi-cluster allocation:
* [Anthos Service Mesh](https://cloud.google.com/anthos/service-mesh) can be used to route allocation traffic to different clusters based on arbitrary criteria. See [Global Multiplayer Demo](https://github.com/googleforgames/global-multiplayer-demo) for an example where the matchmaker influences which cluster the allocation is routed to.
* [Allocation Endpoint](https://github.com/googleforgames/agones/tree/main/examples/allocation-endpoint) can be used in Cloud Run to proxy allocation requests.
* Or peruse the [Third Party Examples]({{< relref "../../Third Party Content/libraries-tools.md/#allocation" >}}).

### Spread

You should consider spreading your game servers in two ways:
* **Across geographic fault domains** ([GCP regions](https://cloud.google.com/compute/docs/regions-zones), [AWS availability zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html), separate datacenters, etc.): This is desirable for geographic fault isolation, but also for optimizing client latency to the game server.
* **Within a fault domain**: A Kubernetes cluster is a single point of failure. A single misconfigured RBAC rule, an overloaded Kubernetes Control Plane, etc. can prevent new game server allocations, or worse, disrupt existing sessions. Running multiple clusters within a fault domain also allows for [easier upgrades]({{< relref "Upgrading#upgrading-agones-multiple-clusters" >}}).
60 changes: 60 additions & 0 deletions site/content/en/docs/Guides/Best Practices/gke.md
@@ -0,0 +1,60 @@
---
title: "Google Kubernetes Engine Best Practices"
linkTitle: "Google Cloud"
date: 2023-05-12T00:00:00Z
description: "Best practices for running Agones on Google Kubernetes Engine (GKE)."
---

## Overview

On this page, we've collected several [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/) best practices.

## Release Channels

### Why?

We recommend using [Release Channels](https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels) for all GKE clusters. Using Release Channels has several advantages:
* Google automatically manages the version and upgrade cadence for your Kubernetes Control Plane and its nodes.
* Clusters on a Release Channel are allowed to use the `No minor upgrades` and `No minor or node upgrades` [scope of maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#limitations-maint-exclusions) - in other words, enrolling a cluster in a Release Channel gives you _more control_ over node upgrades.
* Clusters enrolled in `rapid` channel have access to the newest Kubernetes version first. Agones strives to [support the newest release in `rapid` channel]({{< relref "Installation#agones-and-kubernetes-supported-versions" >}}) to allow you to test the newest Kubernetes soon after it's available in GKE.

{{< alert title="Note" color="info" >}}
GKE Autopilot clusters must be on Release Channels.
{{< /alert >}}
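
An existing GKE Standard cluster can be enrolled in a channel with `gcloud`; as a sketch (the cluster name and region are placeholders):

```shell
# Sketch: enroll an existing cluster in the regular release channel.
# Cluster name and region are placeholders.
gcloud container clusters update my-cluster \
    --region=us-west1 \
    --release-channel=regular
```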

### What channel should I use?

We recommend the `regular` channel, which offers a balance between stability and freshness. See [this guide](https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels#what_channel_should_i_use) for more discussion.

If you need to disallow minor version upgrades for more than 6 months, consider choosing the freshest Kubernetes version possible: Choosing the freshest version on `rapid` or `regular` will extend the amount of time before your cluster reaches [end of life](https://cloud.google.com/kubernetes-engine/docs/release-schedule#schedule-for-release-channels).

### What versions are available on a given channel?

You can query the versions available across different channels using `gcloud`:

```
gcloud container get-server-config \
--region=[COMPUTE_REGION] \
--flatten="channels" \
--format="yaml(channels)"
```
Replace the following:

* **COMPUTE_REGION**: the
[Google Cloud region](https://cloud.google.com/compute/docs/regions-zones#available)
where you will create the cluster.

## Managing Game Server Disruption on GKE

If your game session length is less than an hour, use the `eviction` API to configure your game servers appropriately - see [Controlling Disruption]({{< relref "controlling-disruption" >}}).
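
As a sketch, opting a game server into safe eviction might look like the following, using the `eviction.safe` field described in Controlling Disruption (the `GameServer` name, port, and container image here are placeholders, not a recommended configuration):

```shell
# Sketch: a GameServer that opts into safe eviction for short sessions.
# The name, port, and image below are placeholders.
kubectl apply -f - <<EOF
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game-server
spec:
  eviction:
    safe: Always
  ports:
  - name: default
    containerPort: 7654
  template:
    spec:
      containers:
      - name: game-server
        image: gcr.io/agones-images/simple-game-server:0.14
EOF
```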

For sessions longer than an hour, there are currently two possible approaches to manage disruption:

* (GKE Standard/Autopilot) [Blue/green deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html) at the cluster level: If you are using an automated deployment process, you can:
* create a new, `green` cluster within a release channel e.g. every week,
* use [maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#exclusions) to prevent node upgrades for 30d, and
* scale the `Fleet` on the old, `blue` cluster down to 0, and
* use [multi-cluster allocation]({{< relref "multi-cluster-allocation.md" >}}) on Agones, which will then direct new allocations to the new `green` cluster (since `blue` has 0 desired), then
* delete the old, `blue` cluster when the `Fleet` successfully scales down.

* (GKE Standard only) Use [node pool blue/green upgrades](https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#blue-green-upgrade-strategy)
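
The blue/green rotation above can be sketched as a sequence of `gcloud` and `kubectl` commands. The cluster names, `Fleet` name, region, and exclusion dates are all placeholders — in practice these steps would be driven by your deployment automation:

```shell
# 1. Create the new "green" cluster in a release channel.
gcloud container clusters create green \
    --region=us-west1 \
    --release-channel=regular

# 2. Exclude node upgrades on "green" for ~30 days.
gcloud container clusters update green --region=us-west1 \
    --add-maintenance-exclusion-name=rollout-window \
    --add-maintenance-exclusion-start=2023-05-19T00:00:00Z \
    --add-maintenance-exclusion-end=2023-06-18T00:00:00Z \
    --add-maintenance-exclusion-scope=no_minor_or_node_upgrades

# 3. Scale the Fleet on the old "blue" cluster down to 0; multi-cluster
#    allocation then directs new allocations to "green".
kubectl --context=blue scale fleet my-fleet --replicas=0

# 4. Once the Fleet on "blue" has scaled down, delete the cluster.
gcloud container clusters delete blue --region=us-west1
```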