Multicluster: Add gRPC dial timeout #1700

Closed
adamhosier opened this issue Jul 20, 2020 · 7 comments · Fixed by #1815
Labels
area/operations (Installation, updating, metrics etc) · kind/feature (New features for Agones)
Milestone
1.10.0

Comments

@adamhosier

What happened:

When using the multicluster allocator, dialling an unhealthy remote cluster can cause allocation requests to block for over 2 minutes.

What you expected to happen:

A sensible default, or a configuration option, to control the dial timeout of multicluster allocation gRPC requests.

How to reproduce it (as minimally and precisely as possible):

Deploy a gameserverallocationpolicy that allocates to some blocking endpoint. We have observed this issue occurring with a DNS name that resolves to the IP of a Google LB that has no configured upstreams (because the cluster has been taken down for maintenance, for example).

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: test
spec:
  connectionInfo:
    allocationEndpoints:
    - my-blackhole-server.com
    clusterName: test
    secretName: test
  priority: 10
  weight: 100

Observe that allocation requests following this policy will block for ~150 seconds before responding.

Anything else we need to know?:

Our use case is slightly more involved than the minimal repro scenario: we'd like to have a second, low-priority cluster configured as a failover for when the first cluster is unhealthy or down for maintenance. The behaviour we've observed is that when the high-priority cluster blackholes requests (as is the default behaviour when a GKE cluster with GCP load balancers is taken down), allocation requests take ~150 seconds to succeed, as they block waiting for the high-priority cluster to respond.
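For illustration, that failover setup would look roughly like the following pair of policies (endpoints, cluster names and secret names are placeholders, and this assumes the lower priority value is tried first):

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: primary
spec:
  connectionInfo:
    allocationEndpoints:
    - primary-allocator.example.com
    clusterName: primary
    secretName: primary-allocator-secret
  priority: 10
  weight: 100
---
apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: failover
spec:
  connectionInfo:
    allocationEndpoints:
    - failover-allocator.example.com
    clusterName: failover
    secretName: failover-allocator-secret
  priority: 20
  weight: 100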

Environment:

  • Agones version: 1.7.0
  • Kubernetes version: 1.16.10
  • Cloud provider or hardware configuration: GKE + bare-metal
  • Install method (yaml/helm): yaml
adamhosier added the kind/bug label on Jul 20, 2020
@pooneh-m
Collaborator

pooneh-m commented Jul 20, 2020

There is retry logic that retries up to 7 times to send the request to the high-priority cluster.
It retries every 100ms * 2^try, which should take around 25s if the request fails immediately.

Can you share your logs showing the type of error the service is encountering? It could be that the request is failing with DEADLINE_EXCEEDED, which suggests the default gRPC timeout is being reached. DEADLINE_EXCEEDED would add the 10s default gRPC deadline per retry, so in this case it could take 70s + 25s, which is close to what you are experiencing.

Some potential solutions:

  1. Tune retries for this scenario (e.g. on DEADLINE_EXCEEDED, fail over to the secondary cluster).
  2. If a cluster fails after all the retries, take it out of remote allocation for 10m before trying again.
  3. Immediate failover, i.e. don't retry.
  4. Reduce the gRPC timeout.
  5. As you suggested, have a configurable cap on the retry timeout.
  6. Verify the health of the service using healthz.

@adamhosier
Author

Hi @pooneh-m, I see the retry logic taking effect from the logs. I've redacted some information, but the logs are as follows:

2020-07-21T13:57:52.07 "allocation request received"
2020-07-21T13:58:12.07 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:32.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:52.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:12.77 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:33.57 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:55.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote-allocation failed"

It seems like the gRPC dial timeout here is 20 seconds, with ~6 seconds added by the backoff, for ~146 seconds blocked in total. Sadly, I imagine there are situations where we would want to continue retrying on "Unavailable" codes, but the other solutions you offered seem reasonable to me.

@pooneh-m
Collaborator

pooneh-m commented Jul 21, 2020

This is very helpful, thanks! The formula for the retry backoff is sum(100ms * 2^try) for try = 1..7, i.e. 100*2 + 100*2^2 + ... + 100*2^7 = 25.4s.

It seems that it took 14:00:18 - 13:57:52 = 146s, which is roughly 7 * 20s (timeout) + ~6s (backoff).

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?

In the meantime, I highly recommend updating the GameServerAllocationPolicy resource during the maintenance window so traffic is not directed to the cluster that is taken down for maintenance. This cloud solution also provides a management layer for GameServerAllocationPolicy: if your cluster is down for maintenance, it removes the GameServerAllocationPolicy from the other clusters in your realm to stop forwarding multicluster allocation requests to your cluster.

@adamhosier
Author

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?

Very reasonable, yes - introducing a total cap of 30s would be a great start, with configuration for this option being a bonus for us.

I agree with your interim solution of adjusting the allocation policies manually where possible; we'll definitely be doing that where we can. But it's those unplanned "maintenance windows" that worry us a bit here 😉

@aLekSer
Collaborator

aLekSer commented Sep 15, 2020

There could also be a solution that adds new fields to the CRD: backoffCap and timeout. I think it will be ready soon; a colleague is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.
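For example, the new fields might be exposed on the policy's connectionInfo, along the lines of the sketch below (the field names come from this comment; their exact placement, format and values are illustrative until the PR lands):

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: test
spec:
  connectionInfo:
    allocationEndpoints:
    - my-blackhole-server.com
    clusterName: test
    secretName: test
    timeout: 10s      # per-attempt gRPC timeout (illustrative value)
    backoffCap: 30s   # upper bound on the retry backoff (illustrative value)
  priority: 10
  weight: 100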

@robbieheywood
Contributor

@aLekSer - that sounds fantastic - looking forward to it!

aLekSer pushed a commit that referenced this issue Sep 24, 2020
When using the multicluster allocator, dialling remote clusters can cause
allocation requests to block for 2+ minutes.

In order to control the dial timeout of allocation requests, Timeout and BackoffCap properties
were added to the game server allocation policy.
@pooneh-m
Collaborator

There could also be a solution that adds new fields to the CRD: backoffCap and timeout. I think it will be ready soon; a colleague is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.

Why do we need to add a per-cluster setting for the timeout and backoffCap? Do we expect every cluster to have a different timeout setting? If not, an environment setting for agones-allocator should suffice.

Since the API is stable, every change to the fields should go through a feature stage, starting with alpha.

markmandel added this to the 1.10.0 milestone on Nov 3, 2020
markmandel added the area/operations and kind/feature labels and removed the kind/bug label on Nov 3, 2020