New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding timeout to remote cluster allocate call and adding total timeout to allocate #1815
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Build Failed 😱 Build Id: d4d76177-4c85-4ceb-9bae-61d687f262cb To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Build Failed 😱 Build Id: c1b9fed2-5c34-4a8c-a8e5-89d79e1128be To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: ec29894a-f815-4afa-b8fc-574372fddcd1 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Succeeded 👏 Build Id: 4f63122b-89ec-4597-9df3-87ae4ee1d4e2 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Thanks for your contribution! Should this timeout be documented somewhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kdima thanks for the change.
// Timeout for an individual Allocate RPC call can take | ||
remoteAllocateTimeout = 10 * time.Second | ||
// Total timeout for allocation including retries | ||
totalRemoteAllocationTimeout = 30 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is set, but not used?
For the total timeout you can check the time passed in the retry: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L340
Fail with DEADLINE_EXCEEDED when the request does not complete by 30s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Added.
@@ -82,6 +82,13 @@ const ( | |||
batchWaitTime = 500 * time.Millisecond | |||
) | |||
|
|||
const ( | |||
// Timeout for an individual Allocate RPC call can take | |||
remoteAllocateTimeout = 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the client timeout needs to be set on the request context: https://grpc.io/blog/deadlines/#go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am setting it
agones/pkg/gameserverallocations/allocator.go
Line 153 in cd28618
return grpcClient.Allocate(allocationCtx, request) |
Unless you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Looks really good Dima - +1 to making this a CLI argument if possible (we'd probably want to set this lower than the defaults here - maybe something like |
Build Failed 😱 Build Id: 6aa7e7ee-1237-4466-835d-0903410fe31d To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: b43f3dc5-5582-401d-aed4-f5270bcdb154 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: 93d2f848-f37e-40dc-977c-efb3a39934fe The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. Can we add testing for it?
For the env flag and Helm config, here is an example of plumbing through a config value: https://github.com/googleforgames/agones/pull/1777/files
@@ -82,6 +82,13 @@ const ( | |||
batchWaitTime = 500 * time.Millisecond | |||
) | |||
|
|||
const ( | |||
// Timeout for an individual Allocate RPC call can take | |||
remoteAllocateTimeout = 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
select { | ||
case <-ctx.Done(): | ||
return status.Errorf(codes.DeadlineExceeded, "remote allocation retry timeout exceeded") | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be after defer cancel()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the code a fair bit, see if it makes sense now
Build Failed 😱 Build Id: 5d773430-7522-462e-afd4-f69c441c5235 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
@@ -62,6 +62,8 @@ var ( | |||
ErrNoGameServerReady = errors.New("Could not find a Ready GameServer") | |||
// ErrConflictInGameServerSelection is returned when the candidate gameserver already allocated | |||
ErrConflictInGameServerSelection = errors.New("The Gameserver was already allocated") | |||
// ErrTotalTimeoutExceeded is used to signal that total retry timeout has been exceeded and no additional retries should be made | |||
ErrTotalTimeoutExceeded = status.Errorf(codes.DeadlineExceeded, "remote allocation retry timeout exceeded") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/retry/total
Looks great, thanks for the change! You are missing a test coverage for per request timeout. Please add a test that validates that retries happen when the allocation request has exceeds the |
Build Failed 😱 Build Id: ec4289f1-f5cb-4812-9e82-108aad3869e4 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: 43703567-32bc-4385-89f3-7cb5b598b60c To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: a2d86d19-43f6-4e23-9b21-9b0a75a5cc9e To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: e5d5436e-893c-4a4c-bc05-4871e9ac08de To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Failed 😱 Build Id: b42f23ee-135b-46d3-85a7-83d6c420d4f1 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great addition. Thank you!
assert.Equal(t, st.Code(), codes.DeadlineExceeded) | ||
assert.Equal(t, 0, calls) | ||
}) | ||
t.Run("No allocations called after total timeout", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description of the test should be updated?
assert.Equal(t, 0, calls) | ||
}) | ||
t.Run("No allocations called after total timeout", func(t *testing.T) { | ||
c, m := newFakeControllerWithTimeout(10*time.Second, 10*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use newFakeController
if the values are not used?
calls := 0 | ||
c.allocator.remoteAllocationCallback = func(ctx context.Context, endpoint string, dialOpt grpc.DialOption, request *pb.AllocationRequest) (*pb.AllocationResponse, error) { | ||
if calls == 0 { | ||
calls++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can move calls to the top of the func and condition of calls == 1. Please also leave a comment on what the function is doing.
|
||
// Allocation policy reactor | ||
secretName := clusterName + "secret" | ||
m.AgonesClient.AddReactor("list", "gameserverallocationpolicies", func(action k8stesting.Action) (bool, k8sruntime.Object, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you refactor the reactors to a method call?
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kdima, pooneh-m The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
It looks like your changes is causing |
New changes are detected. LGTM label has been removed. |
Build Succeeded 👏 Build Id: 568eeb41-017e-44c1-aacd-1e8148534530 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
…ut to allocate (googleforgames#1815) * adding timeout to allocate call and adding total timeout to allocate * adding explicit context check to retry * added part of plumbing for timeout flags * allocation * generated install.yaml * retry test * Added test * cleanup of fake controller * test improvement * Added docs * linting * lint * fixing yamls * fixes * testcase for illustration * custom error used to exit Retry * comment * test * fixes * fixes * fixes * fixes * potential e2e test
What type of PR is this?
/kind feature
What this PR does / Why we need it:
This PR adds an explicit timeout to the Allocate call. This timeout includes both the RPC itself and creating the connection. I also added the total timeout for Allocate retries.
I think total timeout should be configurable as a command line parameter but I am not sure where to add that. Is cmd/allocator/metrics.go:43 struct the right place?
Which issue(s) this PR fixes:
Closes #1700
Special notes for your reviewer: