Add Cloud Best Practices guide, add guide on Release Channels #3152

Merged
merged 2 commits on May 19, 2023
21 changes: 1 addition & 20 deletions site/content/en/docs/Advanced/controlling-disruption.md
@@ -74,26 +74,7 @@ GKE Autopilot supports only `Never` and `Always`, not `OnUpgrade`.

## Considerations for long sessions

Outside of Cluster Autoscaler, the main source of disruption for long sessions is node upgrade. On some cloud products, such as GKE Standard, node upgrades are entirely within your control. On others, such as GKE Autopilot, node upgrade is automatic. Typical node upgrades use an eviction based, rolling recreate strategy, and may not honor `PodDisruptionBudget` for longer than an hour. Here we document strategies you can use for your cloud product to support long sessions.

### On GKE

On GKE, there are currently two possible approaches to manage disruption for session lengths longer than an hour:

* (GKE Standard/Autopilot) [Blue/green deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html) at the cluster level: If you are using an automated deployment process, you can:
* create a new, `green` cluster within a release channel e.g. every week,
* use [maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#exclusions) to prevent node upgrades for 30d, and
* scale the `Fleet` on the old, `blue` cluster down to 0, and
* use [multi-cluster allocation]({{< relref "multi-cluster-allocation.md" >}}) on Agones, which will then direct new allocations to the new `green` cluster (since `blue` has 0 desired), then
* delete the old, `blue` cluster when the `Fleet` successfully scales down.

* (GKE Standard only) Use [node pool blue/green upgrades](https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#blue-green-upgrade-strategy)

### Other cloud products

The blue/green cluster strategy described for GKE is likely applicable to your cloud product.

We welcome contributions to this section for other products!
Outside of Cluster Autoscaler, the main source of disruption for long sessions is node upgrade. On some cloud products, such as GKE Standard, node upgrades are entirely within your control. On others, such as GKE Autopilot, node upgrade is automatic. Typical node upgrades use an eviction based, rolling recreate strategy, and may not honor `PodDisruptionBudget` for longer than an hour. See [Best Practices]({{< relref "Best Practices" >}}) for information specific to your cloud product.

## Implementation / Under the hood

45 changes: 45 additions & 0 deletions site/content/en/docs/Guides/Best Practices/_index.md
@@ -0,0 +1,45 @@
---
title: "Best Practices"
linkTitle: "Best Practices"
date: 2023-05-12T00:00:00Z
weight: 9
description: "Best practices for running Agones in production."
---

## Overview

Running Agones in production requires careful consideration, from planning your launch to choosing
the best course of action for cluster and Agones upgrades. On this page, we've collected
some general best practices. We also have cloud specific pages for:

* [Google Kubernetes Engine (GKE)]({{< relref "gke.md" >}})

If you are interested in submitting best practices for your cloud provider / on-prem, [please contribute!]({{< relref "/Contribute" >}})

## Separation of Agones from GameServer nodes

When running in production, Agones should be scheduled on a dedicated pool of nodes, distinct from the nodes where
game servers are scheduled, for better isolation and resiliency. By default, Agones prefers to be scheduled on nodes labeled with
`agones.dev/agones-system=true` and tolerates the node taint `agones.dev/agones-system=true:NoExecute`.
If no dedicated nodes are available, Agones will run on regular nodes. See [taints and tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
for more information about Kubernetes taints and tolerations.
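A dedicated node pool with this label and taint can be created when setting up your cluster. As a sketch, on GKE Standard this might look like the following (the cluster name, pool name, region, and node count are placeholders — adjust for your environment):

```shell
# Sketch: create a dedicated node pool for Agones system components.
# Cluster name, pool name, region, and node count are placeholders.
gcloud container node-pools create agones-system \
    --cluster=my-cluster \
    --region=us-west1 \
    --node-labels=agones.dev/agones-system=true \
    --node-taints=agones.dev/agones-system=true:NoExecute \
    --num-nodes=1
```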

If you are collecting [Metrics]({{< relref "metrics" >}}) using our standard Prometheus installation, see
[the installation guide]({{< relref "metrics#prometheus-installation" >}}) for instructions on configuring a separate node pool for the `agones.dev/agones-metrics=true` taint.

See [Creating a Cluster]({{< relref "Creating Cluster" >}}) for initial set up on your cloud provider.

## Redundant Clusters

### Allocate Across Clusters

Agones supports [Multi-cluster Allocation]({{< relref "multi-cluster-allocation" >}}), allowing you to allocate from a set of clusters rather than relying on a single cluster as a potential point of failure. There are several other options for multi-cluster allocation:
* [Anthos Service Mesh](https://cloud.google.com/anthos/service-mesh) can be used to route allocation traffic to different clusters based on arbitrary criteria. See [Global Multiplayer Demo](https://github.com/googleforgames/global-multiplayer-demo) for an example where the matchmaker influences which cluster the allocation is routed to.
* [Allocation Endpoint](https://github.com/googleforgames/agones/tree/main/examples/allocation-endpoint) can be used in Cloud Run to proxy allocation requests.
* Or peruse the [Third Party Examples]({{< relref "../../Third Party Content/libraries-tools.md/#allocation" >}}).

### Spread

You should consider spreading your game servers in two ways:
* **Across geographic fault domains** ([GCP regions](https://cloud.google.com/compute/docs/regions-zones), [AWS availability zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html), separate datacenters, etc.): This is desirable for geographic fault isolation, but also for optimizing client latency to the game server.
* **Within a fault domain**: A Kubernetes cluster is a single point of failure. A single misconfigured RBAC rule, an overloaded Kubernetes Control Plane, etc. can prevent new game server allocations, or worse, disrupt existing sessions. Running multiple clusters within a fault domain also allows for [easier upgrades]({{< relref "Upgrading#upgrading-agones-multiple-clusters" >}}).
60 changes: 60 additions & 0 deletions site/content/en/docs/Guides/Best Practices/gke.md
@@ -0,0 +1,60 @@
---
title: "Google Kubernetes Engine Best Practices"
linkTitle: "Google Cloud"
date: 2023-05-12T00:00:00Z
description: "Best practices for running Agones on Google Kubernetes Engine (GKE)."
---

## Overview

On this page, we've collected several [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/) best practices.

## Release Channels

### Why?

We recommend using [Release Channels](https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels) for all GKE clusters. Using Release Channels has several advantages:
* Google automatically manages the version and upgrade cadence for your Kubernetes Control Plane and its nodes.
* Clusters on a Release Channel are allowed to use the `No minor upgrades` and `No minor or node upgrades` [scope of maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#limitations-maint-exclusions) - in other words, enrolling a cluster in a Release Channel gives you _more control_ over node upgrades.
* Clusters enrolled in `rapid` channel have access to the newest Kubernetes version first. Agones strives to [support the newest release in `rapid` channel]({{< relref "Installation#agones-and-kubernetes-supported-versions" >}}) to allow you to test the newest Kubernetes soon after it's available in GKE.

{{< alert title="Note" color="info" >}}
GKE Autopilot clusters must be on Release Channels.
{{< /alert >}}
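
An existing GKE Standard cluster can be enrolled in a channel with `gcloud`; as a sketch (the cluster name and region are placeholders):

```shell
# Sketch: enroll an existing cluster in the regular release channel.
# Cluster name and region are placeholders.
gcloud container clusters update my-cluster \
    --region=us-west1 \
    --release-channel=regular
```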

### What channel should I use?

We recommend the `regular` channel, which offers a balance between stability and freshness. See [this guide](https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels#what_channel_should_i_use) for more discussion.

If you need to disallow minor version upgrades for more than 6 months, consider choosing the freshest Kubernetes version possible: Choosing the freshest version on `rapid` or `regular` will extend the amount of time before your cluster reaches [end of life](https://cloud.google.com/kubernetes-engine/docs/release-schedule#schedule-for-release-channels).

### What versions are available on a given channel?

You can query the versions available across different channels using `gcloud`:

```
gcloud container get-server-config \
--region=[COMPUTE_REGION] \
--flatten="channels" \
--format="yaml(channels)"
```
Replace the following:

* **COMPUTE_REGION**: the
[Google Cloud region](https://cloud.google.com/compute/docs/regions-zones#available)
where you will create the cluster.

## Managing Game Server Disruption on GKE

If your game session length is less than an hour, use the `eviction` API to configure your game servers appropriately - see [Controlling Disruption]({{< relref "controlling-disruption" >}}).
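
As a sketch, opting a game server into safe eviction might look like the following, using the `eviction.safe` field described in Controlling Disruption (the `GameServer` name, port, and container image here are placeholders, not a recommended configuration):

```shell
# Sketch: a GameServer that opts into safe eviction for short sessions.
# The name, port, and image below are placeholders.
kubectl apply -f - <<EOF
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: my-game-server
spec:
  eviction:
    safe: Always
  ports:
  - name: default
    containerPort: 7654
  template:
    spec:
      containers:
      - name: game-server
        image: gcr.io/agones-images/simple-game-server:0.14
EOF
```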

For sessions longer than an hour, there are currently two possible approaches to manage disruption:

* (GKE Standard/Autopilot) [Blue/green deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html) at the cluster level: If you are using an automated deployment process, you can:
* create a new, `green` cluster within a release channel e.g. every week,
* use [maintenance exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#exclusions) to prevent node upgrades for 30d, and
* scale the `Fleet` on the old, `blue` cluster down to 0, and
* use [multi-cluster allocation]({{< relref "multi-cluster-allocation.md" >}}) on Agones, which will then direct new allocations to the new `green` cluster (since `blue` has 0 desired), then
* delete the old, `blue` cluster when the `Fleet` successfully scales down.

* (GKE Standard only) Use [node pool blue/green upgrades](https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#blue-green-upgrade-strategy)
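
The blue/green rotation above can be sketched as a sequence of `gcloud` and `kubectl` commands. The cluster names, `Fleet` name, region, and exclusion dates are all placeholders — in practice these steps would be driven by your deployment automation:

```shell
# 1. Create the new "green" cluster in a release channel.
gcloud container clusters create green \
    --region=us-west1 \
    --release-channel=regular

# 2. Exclude node upgrades on "green" for ~30 days.
gcloud container clusters update green --region=us-west1 \
    --add-maintenance-exclusion-name=rollout-window \
    --add-maintenance-exclusion-start=2023-05-19T00:00:00Z \
    --add-maintenance-exclusion-end=2023-06-18T00:00:00Z \
    --add-maintenance-exclusion-scope=no_minor_or_node_upgrades

# 3. Scale the Fleet on the old "blue" cluster down to 0; multi-cluster
#    allocation then directs new allocations to "green".
kubectl --context=blue scale fleet my-fleet --replicas=0

# 4. Once the Fleet on "blue" has scaled down, delete the cluster.
gcloud container clusters delete blue --region=us-west1
```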