Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor client RPC timeouts #14965

Merged
merged 14 commits into from Oct 18, 2022
Merged

Refactor client RPC timeouts #14965

merged 14 commits into from Oct 18, 2022

Conversation

kisunji
Copy link
Contributor

@kisunji kisunji commented Oct 12, 2022

Description

This PR has two objectives:

  1. Fix an issue identified in rpc_hold_timeout has been reused for client RPC call timeout #14732 where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration rpc_client_timeout was created.
  2. Refactor some implementation from the original PR Add timeout to Client RPC calls #11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

Testing & Reproduction steps

  • Added test cases
  • Manually tested with a client-server setup where I ran the following queries against the client concurrently and confirmed deadlines were respected:
    • Long non-blocking query (KVS.Get modified to sleep on a certain key)
    • Short non-blocking query (KVS.Get)
    • Blocking query (KVS.Get)

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern

@@ -364,7 +364,7 @@ func (p *ConnPool) dial(
tlsRPCType RPCType,
) (net.Conn, HalfCloser, error) {
// Try to dial the conn
d := &net.Dialer{LocalAddr: p.SrcAddr, Timeout: p.Timeout}
d := &net.Dialer{LocalAddr: p.SrcAddr, Timeout: DefaultDialTimeout}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed in #11500 but I reverted it back to DefaultDialTimeout to be consistent with other usages.

Comment on lines -68 to -83
func (c *TimeoutConn) Read(b []byte) (int, error) {
timeout := c.DefaultTimeout
// Apply timeout to first read then zero it out
if c.FirstReadTimeout > 0 {
timeout = c.FirstReadTimeout
c.FirstReadTimeout = 0
}
var deadline time.Time
if timeout > 0 {
deadline = time.Now().Add(timeout)
}
if err := c.Conn.SetReadDeadline(deadline); err != nil {
return 0, err
}
return c.Conn.Read(b)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type was introduced in #11500 but while debugging timeout issues I found this difficult to understand.

Now, instead of a wrapped net.Conn which sets a deadline per Read call, we simply set the deadline during the rpc call

type BlockableQuery interface {
// BlockingTimeout returns duration > 0 if the query is blocking.
// Otherwise returns 0 for non-blocking queries.
BlockingTimeout(maxQueryTime, defaultQueryTime time.Duration) time.Duration
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that narrowly defining this interface was better than giving RPCInfo a Timeout method and forcing every type to return a timeout (in most cases is rpc_hold_timeout)

@kisunji kisunji requested review from a team, DanStough and dhiaayachi and removed request for a team October 13, 2022 21:30
Comment on lines 916 to 922
var out struct{}
err := c1.RPC("Long.Wait", &structs.NodeSpecificRequest{}, &out)
require.Error(t, err)
require.Contains(t, err.Error(), "rpc error making call: i/o deadline reached")
// We use structs.KVSRequest, which does not implement pool.BlockableQuery
// and should have no timeouts defined.
require.NoError(t, c1.RPC("Long.Wait", &structs.KVSRequest{}, &out))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to sanity check that the setting the deadline on the stream connection did not persist when the client was cached in the connection pool.

Copy link
Collaborator

@huikang huikang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add the new field to api?

@kisunji
Copy link
Contributor Author

kisunji commented Oct 14, 2022

Do we need to add the new field to api?

No, the new fields are config and don't need to be in api

@@ -31,7 +32,7 @@ type muxSession interface {

// streamClient is used to wrap a stream with an RPC client
type StreamClient struct {
stream *TimeoutConn
stream net.Conn
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simply reverting a change from #11500

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good.

Couple of small fixes noted inline but should be quick!

agent/consul/client.go Show resolved Hide resolved
agent/pool/pool.go Outdated Show resolved Hide resolved
agent/pool/pool.go Outdated Show resolved Hide resolved
agent/pool/pool.go Outdated Show resolved Hide resolved
@kisunji kisunji requested a review from banks October 18, 2022 15:54
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing job @kisunji 🥳

Thanks for patiently going through those extra bits to add reloading etc. This seems solid now!

agent/pool/pool.go Outdated Show resolved Hide resolved
agent/consul/server.go Show resolved Hide resolved
kisunji and others added 3 commits October 18, 2022 13:36
Co-authored-by: Paul Banks <banks@banksco.de>
@kisunji kisunji merged commit 29a297d into main Oct 18, 2022
@kisunji kisunji deleted the kisunji/NET-1092 branch October 18, 2022 19:05
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)

Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
kisunji added a commit that referenced this pull request Oct 18, 2022
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
@freddygv
Copy link
Contributor

@kisunji can #14732 be closed now?

@kisunji
Copy link
Contributor Author

kisunji commented Oct 18, 2022

@kisunji can #14732 be closed now?

Yeah I'm handling the backporting. Will close when everything is completely merged

mbrulatout pushed a commit to criteo-forks/consul that referenced this pull request Mar 31, 2023
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.

(cherry picked from commit 29a297d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-inactive/1.12 This release series is longer active backport-inactive/1.13 This release series is longer active backport-inactive/1.14 This release series is longer active
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants