Add timeout to Client RPC calls #11500

Merged · 24 commits · Apr 21, 2022

Conversation

Contributor

@wjordan commented Nov 5, 2021

Fixes #8504. Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
See further discussion of this issue in #6616 (comment):

The problem is that we hold the write lock while performing requests over the network, which as #8504 shows, can last forever (we don't have read timeouts because of blocking queries and connection multiplexing, although arguably we should, even if they are very long timeouts).

This PR reuses the timeout values calculated by the retry-timeout logic in #10299 / #8978, adding an extra Timeout function to the RPCInfo interface that returns the timeout duration, to supplement the HasTimedOut function that returns a boolean.
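
For reference, a minimal sketch of the interface shape this adds (simplified; the real RPCInfo carries more methods and the exact signatures may differ):

```go
package structs

import "time"

// Sketch only: RPCInfo as it relates to this change.
type RPCInfo interface {
	// HasTimedOut (existing) reports whether the request has already
	// exceeded its deadline; used by the retry loop.
	HasTimedOut(start time.Time, rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) bool
	// Timeout (added here) returns how long the client should wait on the
	// stream before giving up, derived from the same inputs.
	Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration
}
```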


hashicorp-cla commented Nov 5, 2021

CLA assistant check
All committers have signed the CLA.

@Amier3 requested a review from dnephin November 10, 2021
Log error only when failure can no longer be retried.
When an RPC with a blocking query is retried the deadline should include
the max query time, not just the remaining time in the initial query.
A query may block up to twice as long as requested if any retries occur.
Contributor

dnephin commented Nov 26, 2021

Thank you for working on this problem! Unfortunately I think this lack of read timeouts is going to be a very challenging thing to solve. I looked to see if we have an issue that describes the problem, but I haven't found one.

I have not yet looked at this PR in much detail, but from a quick glance I think rpcHoldTimeout is the wrong value. It is only 7 seconds, but a blocking query needs to stay open for much longer, often 5 to 10 minutes.

There are a number of challenges in fixing this problem because we have many paths for client connections. I believe the current one is Yamux, so we'd need support to timeout a multiplexed connection. There are a few older PRs open on the yamux repo that aim to add that support, but I'm not sure what state those are in. Adding something like this will require a lot of testing.

I would still like to take the time to write up a better issue but that will take some more time to research.

Contributor Author

wjordan commented Nov 26, 2021

@dnephin thanks for taking a look!

I have not yet looked at this PR in much detail, but from a quick glance I think rpcHoldTimeout is the wrong value. It is only 7 seconds, but a blocking query needs to stay open for much longer, often 5 to 10 minutes.

Yes, the implementation refactors/reuses the logic previously added in #10299 / #8978 which already takes this into account. For blocking queries, the timeout is derived from the provided MaxQueryTime:

```go
func (q QueryOptions) Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration {
	// Match logic in Server.blockingQuery.
	if q.MinQueryIndex > 0 {
		if q.MaxQueryTime > maxQueryTime {
			q.MaxQueryTime = maxQueryTime
		} else if q.MaxQueryTime <= 0 {
			q.MaxQueryTime = defaultQueryTime
		}
		// Timeout after maximum jitter has elapsed.
		q.MaxQueryTime += q.MaxQueryTime / JitterFraction
		return q.MaxQueryTime + rpcHoldTimeout
	}
	return rpcHoldTimeout
}
```
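
To make the arithmetic concrete, here is a small worked example under assumed values (a 10-minute blocking query, the default 7-second rpcHoldTimeout, and the JitterFraction of 16 used by Consul's blocking queries); assuming the server's configured max query time allows it, this is roughly what the Timeout function above returns for such a query:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed values: a 10-minute blocking query, the default 7s
	// rpcHoldTimeout, and JitterFraction == 16.
	maxQueryTime := 10 * time.Minute
	rpcHoldTimeout := 7 * time.Second
	jitter := maxQueryTime / 16

	// 600s + 37.5s + 7s = 644.5s: the client waits about 10m44.5s
	// before abandoning the stream.
	fmt.Println(maxQueryTime + jitter + rpcHoldTimeout)
}
```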

we'd need support [in Yamux] to timeout a multiplexed connection.

Yamux already supports a global timeout of a multiplexed connection through its keepalive-ping feature (which immediately fails all in-flight RPCs going over the connection). More relevant here, Yamux also already supports timing out an individual stream through SetDeadline:

consul/agent/pool/pool.go (lines 593 to 597 in d6e2cbb):

```go
if !deadline.IsZero() {
	if err = sc.stream.SetDeadline(deadline); err != nil {
		return fmt.Errorf("rpc error setting client deadline: %w", err)
	}
}
```

I didn't notice anything else that would be needed for this to work.
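
For context, a rough sketch of the keepalive side mentioned above, using the upstream yamux package (the dial address and interval are illustrative, not Consul's actual configuration):

```go
package example

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialMuxed opens a multiplexed session whose keepalive pings detect a dead
// peer and fail every in-flight stream on the session, i.e. the
// connection-wide timeout behaviour described above.
func dialMuxed(addr string) (*yamux.Session, error) {
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return nil, err
	}
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 30 * time.Second
	return yamux.Client(conn, cfg)
}
```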

I would still like to take the time to write up a better issue but that will take some more time to research.

If it's helpful, see TestClient_RPC_Timeout added by this PR (4f54651), which provides a concise reproduction of the issue this PR addresses: the test fails without this PR and succeeds with it.

Contributor Author

wjordan commented Dec 9, 2021

After running this patch in production for a while, I observed two issues:

  1. RPCs for very large, non-blocking queries were timing out because their total transfer time was greater than rpcHoldTimeout. This makes sense: the timeout only bounds how long an RPC call is expected to take to complete, not how long the largest possible response takes to transfer across the connection.

    To fix this, I wrapped the stream with configurable timeouts, where each read/write operation extends the stream deadline by a timeout value. This fix is in 2f2857e.

  2. After applying the rpcHoldTimeout to all client RPC calls, I noticed one type of blocking query (KVS.Get / KVS.List) wasn't having its timeout set properly based on the query parameters. Removing an ampersand from the args parameter fixed this issue (92525c4); a generic illustration of the failure mode follows this list. This probably also affects the existing RPC retry logic for this RPC method, so it could be extracted into a separate fix PR.
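
For item 2, I can't say this is exactly what 92525c4 changed, but the general failure mode with a stray ampersand looks like this: one extra level of indirection means the argument no longer satisfies the RPCInfo-style interface, so the timeout logic silently falls back to the default (hypothetical names throughout):

```go
package main

import "fmt"

// rpcInfo stands in for an RPCInfo-style interface checked on the RPC path.
type rpcInfo interface{ Blocking() bool }

type keyRequest struct{ MinQueryIndex uint64 }

func (k *keyRequest) Blocking() bool { return k.MinQueryIndex > 0 }

// timeoutFor falls back to a default when args doesn't implement rpcInfo.
func timeoutFor(args interface{}) string {
	if info, ok := args.(rpcInfo); ok && info.Blocking() {
		return "query-derived timeout"
	}
	return "default timeout"
}

func main() {
	args := &keyRequest{MinQueryIndex: 10}
	fmt.Println(timeoutFor(args))  // query-derived timeout
	fmt.Println(timeoutFor(&args)) // **keyRequest: default timeout
}
```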

Contributor

dnephin commented Dec 14, 2021

@wjordan thanks for working on this! I'll need to dedicate some time to look into all of the links you've shared. It is quite possible that my understanding is stale and maybe we can improve the timeout support. I'm hoping to look into this in more detail in the next couple of weeks.

@kisunji self-assigned this Jan 11, 2022
Contributor

@kisunji left a comment

Thank you for this PR and apologies for the delay in review. This PR touches critical code paths, so we'll try to get input from more members of our team.

In the meantime, I've left a couple of questions I'd like to see clarified.

Comment on lines 58 to 59:

```go
// TimeoutConn wraps net.Conn with a default timeout.
// readTimeout and writeTimeout apply to the subsequent operation.
```

Contributor

Is this comment accurate? The Read and Write implementations look like ReadTimeout and WriteTimeout are applied first, zeroed out, and then DefaultTimeout is applied on subsequent operations.

Contributor Author

The comment is intentional but could perhaps be worded more clearly. What I meant was that setting a read/write timeout only affects the very next operation, and then the timeout reverts to the default after that.
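
To illustrate the one-shot semantics, a hedged sketch of what the read path could look like (field names follow the PR's TimeoutConn, but this is a simplification, not the exact code):

```go
package example

import (
	"net"
	"time"
)

// TimeoutConn-like wrapper: DefaultTimeout covers every operation unless a
// one-shot ReadTimeout has been set for the next Read.
type TimeoutConn struct {
	net.Conn
	DefaultTimeout time.Duration
	ReadTimeout    time.Duration // applies only to the next Read, then cleared
}

func (c *TimeoutConn) Read(b []byte) (int, error) {
	timeout := c.DefaultTimeout
	// A pending ReadTimeout overrides the default for this single Read and
	// is then zeroed, so subsequent reads revert to DefaultTimeout.
	if c.ReadTimeout != 0 {
		timeout = c.ReadTimeout
		c.ReadTimeout = 0
	}
	if timeout > 0 {
		if err := c.Conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(b)
}
```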

```diff
@@ -177,7 +177,7 @@ func testServerConfig(t *testing.T) (string, *Config) {

 	// TODO (slackpad) - We should be able to run all tests w/o this, but it
 	// looks like several depend on it.
-	config.RPCHoldTimeout = 5 * time.Second
+	config.RPCHoldTimeout = 10 * time.Second
```

Contributor

Is this necessary for tests to pass?

Contributor Author

Yes. I don't recall exactly (it's been a few months), but according to the commit message I left in cb4de4d, the 5-second RPCHoldTimeout was causing some timeouts when running many tests concurrently.

Comment on lines +178 to +180:

```go
	// The default timeout for stream reads/writes
	Timeout time.Duration
```

Contributor

Could you help me understand the difference between ConnPool.Timeout and the timeout args that have been added to all the rpc methods?

I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to rpc.

Contributor Author

ConnPool.Timeout ('The default timeout for stream reads/writes') is set from RPCHoldTimeout and is the default timeout for any stream read/write when no explicit timeout is provided for a specific call. RPCHoldTimeout is an existing duration that server round-trips are already expected to stay under, so it seemed like a reasonable default for most reads/writes going through this connection pool.

The ConnPool.rpc method sets a specific timeout for the first stream read after the query is sent. This is because the expected wait time for a (potentially blocking) query response can be much longer than the default round-trip timeout configuration.

I'm wondering if it's possible to achieve the same behaviour without adding an arg (which is defaulted to 0 in most cases) to an already long list of arguments to rpc.

Thanks for the suggestion. Yes, dropping the timeout argument by moving the RPC-timeout calculation from Client.RPC to ConnPool.rpc requires MaxQueryTime and DefaultQueryTime to be passed to ConnPool, but I think it does end up a bit cleaner that way.
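
A rough sketch of that direction, with the query bounds owned by the pool so no timeout argument needs to be threaded through every call (names are illustrative and simplified, not the PR's exact code):

```go
package example

import "time"

// rpcInfo stands in for structs.RPCInfo as quoted earlier in the thread.
type rpcInfo interface {
	Timeout(rpcHoldTimeout, maxQueryTime, defaultQueryTime time.Duration) time.Duration
}

// ConnPool owns the default bounds, so callers of rpc() stop passing timeouts.
type ConnPool struct {
	Timeout          time.Duration // default for every stream read/write (from RPCHoldTimeout)
	MaxQueryTime     time.Duration
	DefaultQueryTime time.Duration
}

// firstReadTimeout derives the one-shot timeout for the first response read
// from the request itself; all other reads/writes fall back to p.Timeout.
func (p *ConnPool) firstReadTimeout(args interface{}) time.Duration {
	if info, ok := args.(rpcInfo); ok {
		return info.Timeout(p.Timeout, p.MaxQueryTime, p.DefaultQueryTime)
	}
	return p.Timeout
}
```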

```diff
@@ -78,12 +119,14 @@ func (c *Conn) getClient() (*StreamClient, error) {
 		return nil, err
 	}

+	timeoutStream := &TimeoutConn{stream, c.pool.Timeout, 0, 0}
```

Contributor

It would be more readable to explicitly define struct fields here

Contributor

Do we ever set WriteTimeout anywhere else in this PR? Wondering if it's valuable to keep read/write timeouts separate (my naive assumption would be that the upper bound on size of reads and writes wouldn't vary enough to benefit from tuning them separately)

Contributor Author

WriteTimeout isn't used anywhere, so I can remove WriteTimeout and just make TimeoutConn.Write pass through to c.Conn.Write.

The read timeout needs to be kept separate from writes, though, because the RPC timeout is read-specific: it applies to the first stream Read following the RPC request.

agent/pool/pool.go (outdated review thread, resolved)
@kisunji added the waiting-reply label Jan 26, 2022
Contributor Author

wjordan commented Mar 1, 2022

@kisunji thanks for the detailed review! Sorry it took a while to respond. I've pushed a couple of commits that reduce the size of the PR diff based on your feedback and should hopefully address the helpful questions you raised.

@github-actions bot removed the waiting-reply label Mar 1, 2022
@hc-github-team-consul-core
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/651287.

@hc-github-team-consul-core
Collaborator

🍒✅ Cherry pick of commit c48120d onto release/1.12.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Apr 25, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
@hc-github-team-consul-core
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/651305.

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.12.x failed! Build Log

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.11.x failed! Build Log

@hc-github-team-consul-core
Collaborator

🍒❌ Cherry pick of commit c48120d onto release/1.10.x failed! Build Log

dhiaayachi pushed a commit that referenced this pull request Sep 6, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
dhiaayachi pushed a commit that referenced this pull request Sep 6, 2022
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
dhiaayachi added a commit that referenced this pull request Sep 13, 2022
* Add timeout to Client RPC calls (#11500)

Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>

* Use the maximum jitter when calculating the timeout

The timeout should include the maximum possible jitter, since the server randomly adds jitter to its own timeout. If the server's jittered timeout ends up longer than the client's timeout, the client returns an i/o deadline reached error.

Before:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
rpc error making call: i/o deadline reached
real    10m11.469s
user    0m0.018s
sys     0m0.023s
```

After:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
[...]
real    10m35.835s
user    0m0.021s
sys     0m0.021s
```

Co-authored-by: Will Jordan <wjordan@users.noreply.github.com>
Co-authored-by: kisunji <ckim@hashicorp.com>
Co-authored-by: James Hartig <me@jameshartig.com>
@kisunji mentioned this pull request Oct 13, 2022

Successfully merging this pull request may close these issues.

Hanging RPC caused agent's state lock to be held for hours
6 participants