Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: always enable TCP keepalives with OS defaults #6834

Merged
merged 4 commits into from Dec 7, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Dec 6, 2023

In #6672, we attempted to configure OS defaults for TCP keepalive parameters (keepalive time and interval). But this had the unintended consequence of disabling TCP keepalives completely because of the way things are implemented in the Go stdlib. In this PR, we attempt to do the following:

Summary of changes:

  • Always enable TCP keepalives on client connections
    • This has been the behavior for a while because the Go stdlib enables TCP keepalives as long as the net.Dialer.KeepAlive field is not set to a negative value.
    • Explicitly setting the socket option ensures the old behavior persists.
  • Use OS defaults for TCP keepalive interval and time
    • We do this by setting the net.Dialer.KeepAlive field to a negative value, thereby disabling the Go stdlib's override of these values to 15s.

There will be a follow-up PR for the server side.

Addresses #6250

RELEASE NOTES:

  • client: always enable TCP keepalives with OS defaults

@easwars
Copy link
Contributor Author

easwars commented Dec 6, 2023

@atollena : If you are interested in this change, please let me know and I'll mark you as a reviewer. Thanks.

@easwars easwars added this to the 1.60 Release milestone Dec 6, 2023
@easwars easwars requested a review from dfawley December 6, 2023 02:33
Copy link

codecov bot commented Dec 6, 2023

Codecov Report

Merging #6834 (8675d7d) into master (0866ce0) will increase coverage by 0.08%.
Report is 1 commits behind head on master.
The diff coverage is 95.83%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6834      +/-   ##
==========================================
+ Coverage   83.53%   83.61%   +0.08%     
==========================================
  Files         285      286       +1     
  Lines       30754    30770      +16     
==========================================
+ Hits        25690    25729      +39     
+ Misses       3998     3981      -17     
+ Partials     1066     1060       -6     
Files Coverage Δ
dialoptions.go 88.84% <ø> (ø)
internal/tcp_keepalive.go 100.00% <100.00%> (ø)
internal/transport/proxy.go 63.88% <100.00%> (ø)
server.go 79.00% <ø> (+0.23%) ⬆️
internal/transport/http2_client.go 90.23% <50.00%> (+0.15%) ⬆️

... and 13 files with indirect coverage changes

Copy link
Collaborator

@atollena atollena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I had in mind, thanks!

// disabling the overriding of TCP keepalive parameters by setting the
// KeepAlive field to a negative value above, results in OS defaults for
// the TCP keealive interval and time parameters.
ControlContext: func(_ context.Context, _, _ string, c syscall.RawConn) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to ignore: you can use Control instead of ControlContext for one less unused parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a minor improvement to me, but I don't feel strongly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with ControlContext over Control (even though the context parameter is unused) since the former is preferred if both are set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the ControlContext field was only added in Go1.20, and currently the least supported Go version is still Go1.19. So I'm changing this back to Control.

dialoptions.go Outdated
// TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a
// negative value.
// Note: As of Go 1.21, the standard library overrides the OS defaults for TCP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this wording is confusing, and have been trying to think of something better... Maybe:

Note: All supported releases of Go (as of December 2023) override the OS defaults for...

Or "as of the Go 1.21 release" or "up to and including at least Go 1.21" for the parenthetical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

@@ -813,15 +813,6 @@ func (l *listenSocket) Close() error {
// Serve returns when lis.Accept fails with fatal errors. lis will be closed when
// this method returns.
// Serve will return a non-nil error unless Stop or GracefulStop is called.
//
// Note: As of Go 1.21, the standard library overrides the OS defaults for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I will be sending a follow-up PR to fix things on the server-side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this with the latest commit, based on our discussion on the issue.

dialoptions.go Outdated
Comment on lines 406 to 407
// for keepalive time and interval, use a net.Dialer that sets the KeepAlive
// field set to a negative value, and sets the SO_KEEPALIVE socket option to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sets (the KeepAlive field) set to a negative". Remove the 2nd "set".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// disabling the overriding of TCP keepalive parameters by setting the
// KeepAlive field to a negative value above, results in OS defaults for
// the TCP keealive interval and time parameters.
ControlContext: func(_ context.Context, _, _ string, c syscall.RawConn) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a minor improvement to me, but I don't feel strongly.

@dfawley dfawley assigned easwars and unassigned dfawley Dec 6, 2023
@easwars easwars assigned dfawley and unassigned easwars Dec 6, 2023
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but one question about something else we might want to fix.

@@ -264,7 +263,7 @@ func newHTTP2Client(connectCtx, ctx context.Context, addr resolver.Address, opts
}
keepaliveEnabled := false
if kp.Time != infinity {
if err = syscall.SetTCPUserTimeout(conn, kp.Timeout); err != nil {
if err = isyscall.SetTCPUserTimeout(conn, kp.Timeout); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function sets the TCP user timeout but does not enable keepalives.. should(n't) we update it to do that, too, in case they are disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TCP_USER_TIMEOUT specifies the maximum amount of time that transmitted data may remain unacknowledged before TCP will forcibly close the corresponding connection and return ETIMEDOUT to the application. If the option value is specified as 0, TCP will to use the system default.

I think it is totally a reasonable thing to use it without TCP keepalives.

As per A18, the reason we set TCP_USER_TIMEOUT when gRPC keepalives are configured is to ensure that if things are stuck at the TCP layer, setting the TCP_USER_TIMEOUT will ensure that the connection is closed anyways (even if gRPC keepalives are not able to do the same).

Are you concerned about the case where a user specifies a custom net.Dialer to disable TCP keepalives, and configures gRPC keepalives, and therefore, we end up setting TCP_USER_TIMEOUT, but TCP keepalives are disabled? Even in this case, I feel it should be fine, since TCP_USER_TIMEOUT can work irrespective of whether TCP keepalives are configured on the connection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no, sorry that makes sense, I'm not sure what I was thinking yesterday.. :)

@dfawley dfawley assigned easwars and unassigned dfawley Dec 7, 2023
@easwars easwars assigned dfawley and unassigned easwars Dec 7, 2023
@dfawley dfawley assigned easwars and unassigned dfawley Dec 7, 2023
@easwars easwars merged commit a03c7f1 into grpc:master Dec 7, 2023
14 checks passed
@easwars easwars deleted the tcp_keepalive_fix branch December 7, 2023 22:04
easwars added a commit to easwars/grpc-go that referenced this pull request Dec 7, 2023
easwars added a commit that referenced this pull request Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants