Multicluster: Add gRPC dial timeout #1700

Closed
adamhosier opened this issue Jul 20, 2020 · 7 comments · Fixed by #1815
Labels
area/operations (Installation, updating, metrics etc) · kind/feature (New features for Agones)
Milestone
1.10.0

Comments

@adamhosier

What happened:

When using the multicluster allocator, dialling an unhealthy remote cluster can cause allocation requests to block for over 2 minutes.

What you expected to happen:

A sensible default, or a configuration option, to control the dial timeout of multicluster allocation gRPC requests.

How to reproduce it (as minimally and precisely as possible):

Deploy a gameserverallocationpolicy that allocates to some blocking endpoint. We have observed this issue occurring with a DNS name that resolves to the IP of a Google LB that has no configured upstreams (because the cluster has been taken down for maintenance, for example).

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: test
spec:
  connectionInfo:
    allocationEndpoints:
    - my-blackhole-server.com
    clusterName: test
    secretName: test
  priority: 10
  weight: 100

Observe that allocation requests following this policy will block for ~150 seconds before responding.

Anything else we need to know?:

Our use case is slightly more involved than the minimal repro scenario: we'd like to have a second, low-priority cluster configured as a failover for when the first cluster is unhealthy or down for maintenance. The behaviour we've observed is that when the high-priority cluster blackholes requests (as is the default behaviour when a GKE cluster with GCP load balancers is taken down), allocation requests take ~150 seconds to succeed, as they block waiting for the high-priority cluster to respond.
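For illustration, that failover setup would look roughly like the following pair of policies (endpoints, cluster names and secret names are placeholders, and this assumes the lower priority value is tried first):

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: primary
spec:
  connectionInfo:
    allocationEndpoints:
    - primary-allocator.example.com
    clusterName: primary
    secretName: primary-allocator-secret
  priority: 10
  weight: 100
---
apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: failover
spec:
  connectionInfo:
    allocationEndpoints:
    - failover-allocator.example.com
    clusterName: failover
    secretName: failover-allocator-secret
  priority: 20
  weight: 100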

Environment:

  • Agones version: 1.7.0
  • Kubernetes version: 1.16.10
  • Cloud provider or hardware configuration: GKE + bare-metal
  • Install method (yaml/helm): yaml
adamhosier added the kind/bug label on Jul 20, 2020
@pooneh-m
Collaborator

pooneh-m commented Jul 20, 2020

There is retry logic that retries up to 7 times to send the request to the high-priority cluster.
It retries every 100ms * 2^try, which should take around 25s if the request fails immediately.

Can you share your logs showing the type of error the service is encountering? It could be that the request is failing with DEADLINE_EXCEEDED, which suggests the default gRPC timeout is being reached. DEADLINE_EXCEEDED would add the 10s default gRPC deadline per retry, so in this case it could take 70s + 25s, which is close to what you are experiencing.

Some potential solutions:

  1. Tune retries for this scenario (e.g. on DEADLINE_EXCEEDED, fail over to the secondary cluster).
  2. If a cluster fails after all the retries, take it out of remote allocation for 10m before trying again.
  3. Immediate failover, i.e. don't retry.
  4. Reduce the gRPC timeout.
  5. As you suggested, have a configurable cap on the retry timeout.
  6. Verify the health of the service using healthz.

@adamhosier
Author

Hi @pooneh-m, I see the retry logic taking effect from the logs. I've redacted some information, but the logs are as follows:

2020-07-21T13:57:52.07 "allocation request received"
2020-07-21T13:58:12.07 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:32.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:52.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:12.77 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:33.57 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:55.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote-allocation failed"

It seems like the gRPC dial timeout here is 20 seconds, with ~6 seconds added by the backoff, for ~146 seconds blocked in total. Sadly, I imagine there are situations where we would want to continue retrying on "Unavailable" codes, but the other solutions you offered seem reasonable to me.

@pooneh-m
Collaborator

pooneh-m commented Jul 21, 2020

This is very helpful, thanks! The formula for the retry backoff is sum(100ms * 2^try) for try = 1..7, i.e. 100*2 + 100*2^2 + ... + 100*2^7 = 25.4s.

It seems that it took 14:00:18 - 13:57:52 = 146s, which is roughly 7 * 20s (timeout) + ~6s (backoff).

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?

In the meantime, I highly recommend updating the GameServerAllocationPolicy resource during the maintenance window so traffic is not directed to the cluster that is taken down for maintenance. This cloud solution also provides a management layer for GameServerAllocationPolicy: if your cluster is down for maintenance, it removes the GameServerAllocationPolicy from the other clusters in your realm to stop forwarding multicluster allocation requests to your cluster.

@adamhosier
Author

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?

Very reasonable, yes - introducing a total cap of 30s would be a great start, with configuration for this option being a bonus for us.

I agree with your interim solution of adjusting the allocation policies manually where possible; we'll definitely be doing that where we can. But it's those unplanned "maintenance windows" that worry us a bit here 😉

@aLekSer
Collaborator

aLekSer commented Sep 15, 2020

There could also be a solution that adds new fields to the CRD: backoffCap and timeout. I think it will be ready soon; a colleague is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.
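For example, the new fields might be exposed on the policy's connectionInfo, along the lines of the sketch below (the field names come from this comment; their exact placement, format and values are illustrative until the PR lands):

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: test
spec:
  connectionInfo:
    allocationEndpoints:
    - my-blackhole-server.com
    clusterName: test
    secretName: test
    timeout: 10s      # per-attempt gRPC timeout (illustrative value)
    backoffCap: 30s   # upper bound on the retry backoff (illustrative value)
  priority: 10
  weight: 100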

@robbieheywood
Contributor

@aLekSer - that sounds fantastic - looking forward to it!

aLekSer pushed a commit that referenced this issue Sep 24, 2020
When using the multicluster allocator, dialling remote clusters can cause
allocation requests to block for 2+ minutes.

In order to control the dial timeout of allocation requests, Timeout and BackoffCap properties
were added to the game server allocation policy.
@pooneh-m
Collaborator

There could also be a solution that adds new fields to the CRD: backoffCap and timeout. I think it will be ready soon; a colleague is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.

Why do we need to add a per-cluster setting for the timeout and backoffCap? Do we expect every cluster to have a different timeout setting? If not, an environment setting for agones-allocator should suffice.

Since the API is stable, every change to the fields should go through a feature stage, starting with alpha.

markmandel added this to the 1.10.0 milestone on Nov 3, 2020
markmandel added the area/operations and kind/feature labels and removed the kind/bug label on Nov 3, 2020