Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding timeout to remote cluster allocate call and adding total timeout to allocate #1815

Merged
merged 29 commits into from Oct 20, 2020

Conversation

kdima
Copy link
Contributor

@kdima kdima commented Sep 15, 2020

What type of PR is this?
/kind feature

What this PR does / Why we need it:
This PR adds an explicit timeout to the Allocate call. This timeout includes both the RPC itself and creating the connection. I also added the total timeout for Allocate retries.
I think total timeout should be configurable as a command line parameter but I am not sure where to add that. Is cmd/allocator/metrics.go:43 struct the right place?

Which issue(s) this PR fixes:
Closes #1700

Special notes for your reviewer:

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: d4d76177-4c85-4ceb-9bae-61d687f262cb

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: c1b9fed2-5c34-4a8c-a8e5-89d79e1128be

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: ec29894a-f815-4afa-b8fc-574372fddcd1

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1815/head:pr_1815 && git checkout pr_1815
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-464d177

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 4f63122b-89ec-4597-9df3-87ae4ee1d4e2

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1815/head:pr_1815 && git checkout pr_1815
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-da989a1

@markmandel markmandel requested review from pooneh-m and removed request for dzlier-gcp September 15, 2020 15:54
@markmandel
Copy link
Member

Thanks for your contribution! Should this timeout be documented somewhere?

Copy link
Collaborator

@pooneh-m pooneh-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdima thanks for the change.

// Timeout for an individual Allocate RPC call can take
remoteAllocateTimeout = 10 * time.Second
// Total timeout for allocation including retries
totalRemoteAllocationTimeout = 30 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is set, but not used?
For the total timeout you can check the time passed in the retry: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L340

Fail with DEADLINE_EXCEEDED when the request does not complete by 30s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. Added.

@@ -82,6 +82,13 @@ const (
batchWaitTime = 500 * time.Millisecond
)

const (
// Timeout for an individual Allocate RPC call can take
remoteAllocateTimeout = 10 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the client timeout needs to be set on the request context: https://grpc.io/blog/deadlines/#go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am setting it

return grpcClient.Allocate(allocationCtx, request)

Unless you mean something else?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@robbieheywood
Copy link
Contributor

Looks really good Dima - +1 to making this a CLI argument if possible (we'd probably want to set this lower than the defaults here - maybe something like 1s). Also, is there any way to add a metric for this (i.e. number of allocation timeouts)?

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 6aa7e7ee-1237-4466-835d-0903410fe31d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: b43f3dc5-5582-401d-aed4-f5270bcdb154

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@kdima kdima requested a review from pooneh-m September 16, 2020 10:04
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 93d2f848-f37e-40dc-977c-efb3a39934fe

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1815/head:pr_1815 && git checkout pr_1815
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-c25c5cf

Copy link
Collaborator

@pooneh-m pooneh-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Can we add testing for it?
For the env flag and Helm config, here is an example of plumbing through a config value: https://github.com/googleforgames/agones/pull/1777/files

@@ -82,6 +82,13 @@ const (
batchWaitTime = 500 * time.Millisecond
)

const (
// Timeout for an individual Allocate RPC call can take
remoteAllocateTimeout = 10 * time.Second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

select {
case <-ctx.Done():
return status.Errorf(codes.DeadlineExceeded, "remote allocation retry timeout exceeded")
default:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be after defer cancel()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the code a fair bit, see if it makes sense now

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 5d773430-7522-462e-afd4-f69c441c5235

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@@ -62,6 +62,8 @@ var (
ErrNoGameServerReady = errors.New("Could not find a Ready GameServer")
// ErrConflictInGameServerSelection is returned when the candidate gameserver already allocated
ErrConflictInGameServerSelection = errors.New("The Gameserver was already allocated")
// ErrTotalTimeoutExceeded is used to signal that total retry timeout has been exceeded and no additional retries should be made
ErrTotalTimeoutExceeded = status.Errorf(codes.DeadlineExceeded, "remote allocation retry timeout exceeded")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/retry/total

@pooneh-m
Copy link
Collaborator

I added custom ErrTotalTimeoutExceeded that follow the pattern similar to the one from ErrNoGameServerReady this way error is reported as DeadlineExceeded which I agree is much better for the user. And we still have the test that verifies that no allocation calls are made after the total timeout is exceeded.

Looks great, thanks for the change! You are missing a test coverage for per request timeout. Please add a test that validates that retries happen when the allocation request has exceeds the remoteAllocationTimeout timeout limit.

Dmytro Kislov added 2 commits October 19, 2020 10:44
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: ec4289f1-f5cb-4812-9e82-108aad3869e4

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 43703567-32bc-4385-89f3-7cb5b598b60c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: a2d86d19-43f6-4e23-9b21-9b0a75a5cc9e

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: e5d5436e-893c-4a4c-bc05-4871e9ac08de

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: b42f23ee-135b-46d3-85a7-83d6c420d4f1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Copy link
Collaborator

@pooneh-m pooneh-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition. Thank you!

assert.Equal(t, st.Code(), codes.DeadlineExceeded)
assert.Equal(t, 0, calls)
})
t.Run("No allocations called after total timeout", func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of the test should be updated?

assert.Equal(t, 0, calls)
})
t.Run("No allocations called after total timeout", func(t *testing.T) {
c, m := newFakeControllerWithTimeout(10*time.Second, 10*time.Second)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use newFakeController if the values are not used?

calls := 0
c.allocator.remoteAllocationCallback = func(ctx context.Context, endpoint string, dialOpt grpc.DialOption, request *pb.AllocationRequest) (*pb.AllocationResponse, error) {
if calls == 0 {
calls++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move calls to the top of the func and condition of calls == 1. Please also leave a comment on what the function is doing.


// Allocation policy reactor
secretName := clusterName + "secret"
m.AgonesClient.AddReactor("list", "gameserverallocationpolicies", func(action k8stesting.Action) (bool, k8sruntime.Object, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you refactor the reactors to a method call?

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kdima, pooneh-m

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pooneh-m
Copy link
Collaborator

It looks like your changes is causing TestAllocatorCrossNamespace to fail. Please take a look.

@google-oss-robot
Copy link

New changes are detected. LGTM label has been removed.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 568eeb41-017e-44c1-aacd-1e8148534530

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1815/head:pr_1815 && git checkout pr_1815
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.10.0-8bdf18c

@pooneh-m pooneh-m merged commit 926e277 into googleforgames:master Oct 20, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
…ut to allocate (googleforgames#1815)

* adding timeout to allocate call and adding total timeout to allocate

* adding explicit context check to retry

* added part of plumbing for timeout flags

* allocation

* generated install.yaml

* retry test

* Added test

* cleanup of fake controller

* test improvement

* Added docs

* linting

* lint

* fixing yamls

* fixes

* testcase for illustration

* custom error used to exit Retry

* comment

* test

* fixes

* fixes

* fixes

* fixes

* potential e2e test
@markmandel markmandel added this to the 1.10.0 milestone Nov 3, 2020
@markmandel markmandel added area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/feature New features for Agones labels Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc cla: yes kind/feature New features for Agones size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multicluster: Add gRPC dial timeout
7 participants