Add timeout to Client RPC calls #11500

Merged · 24 commits · Apr 21, 2022

Conversation

Contributor

@wjordan commented Nov 5, 2021

Fixes #8504. Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
See further discussion of this issue in #6616 (comment):

The problem is that we hold the write lock while performing requests over the network, which as #8504 shows, can last forever (we don't have read timeouts because of blocking queries and connection multiplexing, although arguably we should, even if they are very long timeouts).

This PR reuses the timeout values calculated by the retry-timeout logic in #10299 / #8978, adding an extra Timeout function to the RPCInfo interface that returns the timeout duration, to supplement the HasTimedOut function that returns a boolean.
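
For reference, a minimal sketch of the interface shape this adds (simplified; the real RPCInfo carries more methods and the exact signatures may differ):

```go
package structs

import "time"

// Sketch only: RPCInfo as it relates to this change.
type RPCInfo interface {
	// HasTimedOut (existing) reports whether the request has already
	// exceeded its deadline; used by the retry loop.
	HasTimedOut(start time.Time, rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) bool
	// Timeout (added here) returns how long the client should wait on the
	// stream before giving up, derived from the same inputs.
	Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration
}
```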


hashicorp-cla commented Nov 5, 2021

CLA assistant check
All committers have signed the CLA.

@Amier3 requested a review from dnephin November 10, 2021
Log error only when failure can no longer be retried.
When an RPC with a blocking query is retried the deadline should include
the max query time, not just the remaining time in the initial query.
A query may block up to twice as long as requested if any retries occur.
Contributor

dnephin commented Nov 26, 2021

Thank you for working on this problem! Unfortunately I think this lack of read timeouts is going to be a very challenging thing to solve. I looked to see if we have an issue that describes the problem, but I haven't found one.

I have not yet looked at this PR in much detail, but from a quick glance I think rpcHoldTimeout is the wrong value. It is only 7 seconds, but a blocking query needs to stay open for much longer, often 5 to 10 minutes.

There are a number of challenges in fixing this problem because we have many paths for client connections. I believe the current one is Yamux, so we'd need support to timeout a multiplexed connection. There are a few older PRs open on the yamux repo that aim to add that support, but I'm not sure what state those are in. Adding something like this will require a lot of testing.

I would still like to take the time to write up a better issue but that will take some more time to research.

Contributor Author

wjordan commented Nov 26, 2021

@dnephin thanks for taking a look!

I have not yet looked at this PR in much detail, but from a quick glance I think rpcHoldTimeout is the wrong value. It is only 7 seconds, but a blocking query needs to stay open for much longer, often 5 to 10 minutes.

Yes, the implementation refactors/reuses the logic previously added in #10299 / #8978 which already takes this into account. For blocking queries, the timeout is derived from the provided MaxQueryTime:

```go
func (q QueryOptions) Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration {
	// Match logic in Server.blockingQuery.
	if q.MinQueryIndex > 0 {
		if q.MaxQueryTime > maxQueryTime {
			q.MaxQueryTime = maxQueryTime
		} else if q.MaxQueryTime <= 0 {
			q.MaxQueryTime = defaultQueryTime
		}
		// Timeout after maximum jitter has elapsed.
		q.MaxQueryTime += q.MaxQueryTime / JitterFraction
		return q.MaxQueryTime + rpcHoldTimeout
	}
	return rpcHoldTimeout
}
```
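
To make the arithmetic concrete, here is a small worked example under assumed values (a 10-minute blocking query, the default 7-second rpcHoldTimeout, and the JitterFraction of 16 used by Consul's blocking queries); assuming the server's configured max query time allows it, this is roughly what the Timeout function above returns for such a query:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed values: a 10-minute blocking query, the default 7s
	// rpcHoldTimeout, and JitterFraction == 16.
	maxQueryTime := 10 * time.Minute
	rpcHoldTimeout := 7 * time.Second
	jitter := maxQueryTime / 16

	// 600s + 37.5s + 7s = 644.5s: the client waits about 10m44.5s
	// before abandoning the stream.
	fmt.Println(maxQueryTime + jitter + rpcHoldTimeout)
}
```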

we'd need support [in Yamux] to timeout a multiplexed connection.

Yamux already supports a global timeout of a multiplexed connection through its keepalive-ping feature (which immediately fails all in-flight RPCs going over the connection). More relevant here, Yamux also already supports timing out an individual stream through SetDeadline:

consul/agent/pool/pool.go (lines 593 to 597 in d6e2cbb):

```go
if !deadline.IsZero() {
	if err = sc.stream.SetDeadline(deadline); err != nil {
		return fmt.Errorf("rpc error setting client deadline: %w", err)
	}
}
```

I didn't notice anything else that would be needed for this to work.
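
For context, a rough sketch of the keepalive side mentioned above, using the upstream yamux package (the dial address and interval are illustrative, not Consul's actual configuration):

```go
package example

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialMuxed opens a multiplexed session whose keepalive pings detect a dead
// peer and fail every in-flight stream on the session, i.e. the
// connection-wide timeout behaviour described above.
func dialMuxed(addr string) (*yamux.Session, error) {
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 30 * time.Second
	return yamux.Client(conn, cfg)
}
```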

I would still like to take the time to write up a better issue but that will take some more time to research.

If it's helpful, see TestClient_RPC_Timeout added by this PR (4f54651), which provides a concise reproduction of the issue this PR addresses: the test fails without this PR and succeeds with it.

Contributor Author

wjordan commented Dec 9, 2021

After running this patch in production for a while, I observed two issues:

  1. RPCs for very large, non-blocking queries were timing out because their total transfer time was greater than rpcHoldTimeout. This makes sense: the timeout only bounds how long an RPC call is expected to take to complete, not how long the largest possible response takes to transfer across the connection.

    To fix this, I wrapped the stream with configurable timeouts, where each read/write operation extends the stream deadline by a timeout value. This fix is in 2f2857e.

  2. After applying the rpcHoldTimeout to all client RPC calls, I noticed one type of blocking query (KVS.Get / KVS.List) wasn't having its timeout set properly based on the query parameters. Removing an ampersand from the args parameter fixed this issue (92525c4); a generic illustration of the failure mode follows this list. This probably also affects the existing RPC retry logic for this RPC method, so it could be extracted into a separate fix PR.
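
For item 2, I can't say this is exactly what 92525c4 changed, but the general failure mode with a stray ampersand looks like this: one extra level of indirection means the argument no longer satisfies the RPCInfo-style interface, so the timeout logic silently falls back to the default (hypothetical names throughout):

```go
package main

import "fmt"

// rpcInfo stands in for an RPCInfo-style interface checked on the RPC path.
type rpcInfo interface{ Blocking() bool }

type keyRequest struct{ MinQueryIndex uint64 }

func (k *keyRequest) Blocking() bool { return k.MinQueryIndex > 0 }

// timeoutFor falls back to a default when args doesn't implement rpcInfo.
func timeoutFor(args interface{}) string {
	if info, ok := args.(rpcInfo); ok && info.Blocking() {
		return "query-derived timeout"
	}
	return "default timeout"
}

func main() {
	args := &keyRequest{MinQueryIndex: 10}
	fmt.Println(timeoutFor(args))  // query-derived timeout
	fmt.Println(timeoutFor(&args)) // **keyRequest: default timeout
}
```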

Contributor

dnephin commented Dec 14, 2021

@wjordan thanks for working on this! I'll need to dedicate some time to look into all of the links you've shared. It is quite possible that my understanding is stale and maybe we can improve the timeout support. I'm hoping to look into this in more detail in the next couple of weeks.

@kisunji self-assigned this Jan 11, 2022
Contributor

@kisunji left a comment

Thank you for this PR and apologies for the delay in review. This PR touches critical code paths, so we'll try to get input from more members of our team.

In the meantime, I've left a couple of questions I'd like to see clarified.

Comment on lines 58 to 59:

```go
// TimeoutConn wraps net.Conn with a default timeout.
// readTimeout and writeTimeout apply to the subsequent operation.
```

Contributor

Is this comment accurate? The Read and Write implementations look like ReadTimeout and WriteTimeout are applied first, zeroed out, and then DefaultTimeout is applied on subsequent operations.

Contributor Author

The comment is intentional but could perhaps be worded more clearly. What I meant was that setting a read/write timeout only affects the very next operation, and then the timeout reverts to the default after that.
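
To illustrate the one-shot semantics, a hedged sketch of what the read path could look like (field names follow the PR's TimeoutConn, but this is a simplification, not the exact code):

```go
package example

import (
	"net"
	"time"
)

// TimeoutConn-like wrapper: DefaultTimeout covers every operation unless a
// one-shot ReadTimeout has been set for the next Read.
type TimeoutConn struct {
	net.Conn
	DefaultTimeout time.Duration
	ReadTimeout    time.Duration // applies only to the next Read, then cleared
}

func (c *TimeoutConn) Read(b []byte) (int, error) {
	timeout := c.DefaultTimeout
	// A pending ReadTimeout overrides the default for this single Read and
	// is then zeroed, so subsequent reads revert to DefaultTimeout.
	if c.ReadTimeout != 0 {
		timeout = c.ReadTimeout
		c.ReadTimeout = 0
	}
	if timeout > 0 {
		if err := c.Conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(b)
}
```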

```diff
@@ -177,7 +177,7 @@ func testServerConfig(t *testing.T) (string, *Config) {

 	// TODO (slackpad) - We should be able to run all tests w/o this, but it
 	// looks like several depend on it.
-	config.RPCHoldTimeout = 5 * time.Second
+	config.RPCHoldTimeout = 10 * time.Second
```

Contributor

Is this necessary for tests to pass?

Contributor Author

Yes. I don't recall exactly (it's been a few months), but according to the commit message I left in cb4de4d, the 5-second RPCHoldTimeout was causing some timeouts when running many tests concurrently.

Comment on lines +178 to +180:

```go
	// The default timeout for stream reads/writes
	Timeout time.Duration
```

Contributor

Could you help me understand the difference between ConnPool.Timeout and the timeout args that have been added to all the rpc methods?

I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to rpc.

Contributor Author

ConnPool.Timeout ('The default timeout for stream reads/writes') is set from RPCHoldTimeout and is the default timeout for any stream read/write when no explicit timeout is provided for a specific call. RPCHoldTimeout is an existing duration that server round-trips are already expected to stay under, so it seemed like a reasonable default for most reads/writes going through this connection pool.

The ConnPool.rpc method sets a specific timeout for the first stream read after the query is sent. This is because the expected wait time for a (potentially blocking) query response can be much longer than the default round-trip timeout configuration.

I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to rpc.

Thanks for the suggestion. Yes, dropping the timeout argument by moving the RPC-timeout calculation from Client.RPC to ConnPool.rpc requires MaxQueryTime and DefaultQueryTime to be passed to ConnPool, but I think it does end up a bit cleaner that way.
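
A rough sketch of that direction, with the query bounds owned by the pool so no timeout argument needs to be threaded through every call (names are illustrative and simplified, not the PR's exact code):

```go
package example

import "time"

// rpcInfo stands in for structs.RPCInfo as quoted earlier in the thread.
type rpcInfo interface {
	Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration
}

// ConnPool owns the default bounds, so callers of rpc() stop passing timeouts.
type ConnPool struct {
	Timeout          time.Duration // default for every stream read/write (from RPCHoldTimeout)
	MaxQueryTime     time.Duration
	DefaultQueryTime time.Duration
}

// firstReadTimeout derives the one-shot timeout for the first response read
// from the request itself; all other reads/writes fall back to p.Timeout.
func (p *ConnPool) firstReadTimeout(args interface{}) time.Duration {
	if info, ok := args.(rpcInfo); ok {
		return info.Timeout(p.Timeout, p.MaxQueryTime, p.DefaultQueryTime)
	}
	return p.Timeout
}
```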

```diff
@@ -78,12 +119,14 @@ func (c *Conn) getClient() (*StreamClient, error) {
 		return nil, err
 	}

+	timeoutStream := &TimeoutConn{stream, c.pool.Timeout, 0, 0}
```

Contributor

It would be more readable to explicitly define struct fields here

Contributor

Do we ever set WriteTimeout anywhere else in this PR? Wondering if it's valuable to keep read/write timeouts separate (my naive assumption would be that the upper bound on size of reads and writes wouldn't vary enough to benefit from tuning them separately)

Contributor Author

WriteTimeout isn't used anywhere, so I can remove WriteTimeout and just make TimeoutConn.Write pass through to c.Conn.Write.

The read timeout needs to be kept separate from writes, though, because the RPC timeout is read-specific: it applies to the first stream Read following the RPC request.

agent/pool/pool.go (outdated review thread, resolved)
@kisunji added the waiting-reply label Jan 26, 2022
Contributor Author

wjordan commented Mar 1, 2022

@kisunji thanks for the detailed review! Sorry it took a while to respond. I've pushed a couple of commits that reduce the size of the PR diff based on your feedback and should hopefully address the helpful questions you raised.

@github-actions bot removed the waiting-reply label Mar 1, 2022
@hc-github-team-consul-core
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/651287.

@hc-github-team-consul-core
Collaborator

🍒✅ Cherry pick of commit c48120d onto release/1.12.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Apr 25, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
@hc-github-team-consul-core
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/651305.

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.12.x failed! Build Log

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.11.x failed! Build Log

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.10.x failed! Build Log

dhiaayachi pushed a commit that referenced this pull request Sep 6, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
dhiaayachi pushed a commit that referenced this pull request Sep 6, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
dhiaayachi added a commit that referenced this pull request Sep 13, 2022
* Add timeout to Client RPC calls (#11500)

Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>

* Use the maximum jitter when calculating the timeout

The timeout should include the maximum possible jitter, since the server randomly adds jitter to its own timeout. If the server's jittered timeout ends up longer than the client's timeout, the client returns an i/o deadline reached error.

Before:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
rpc error making call: i/o deadline reached
real    10m11.469s
user    0m0.018s
sys     0m0.023s
```

After:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
[...]
real    10m35.835s
user    0m0.021s
sys     0m0.021s
```

Co-authored-by: Will Jordan <wjordan@users.noreply.github.com>
Co-authored-by: kisunji <ckim@hashicorp.com>
Co-authored-by: James Hartig <me@jameshartig.com>
@kisunji mentioned this pull request Oct 13, 2022

Successfully merging this pull request may close these issues.

Hanging RPC caused agent's state lock to be held for hours
6 participants