net: default TCP Keep-Alive interval causes significant power usage #48622

Closed
ValdikSS opened this issue Sep 25, 2021 · 12 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done.

@ValdikSS

ValdikSS commented Sep 25, 2021

Description

Golang's default TCP Keep-Alive is 15 seconds for both listening and connecting sockets.
Every time you use golang software, or connect to a website that runs golang and uses long-polling/websockets, your cell phone battery drains a lot quicker than it should.
The change was originally introduced by:
https://go-review.googlesource.com/c/go/+/107196

There's a modern proxy application called V2ray, and it's available on Android as well. It's written in Go.
I noticed that my phone sends keep-alive packets every 3-5 seconds while keeping only 7 TCP sockets open. The battery died rather quickly.

The current Golang version has three issues with the TCP Keep-Alive interval:

  1. It is enabled by default on both listening and connecting sockets (net.Dialer / net.ListenConfig)
  2. It is very short (15 seconds), which creates unnecessary network load and makes the cellphone's radio module wake up much more frequently than it should
  3. The net.Dialer / net.ListenConfig KeepAlive option sets both the Keep-Alive time (TCP_KEEPIDLE) and the Keep-Alive interval (TCP_KEEPINTVL) to the same value (they can't be configured separately).

The behavior in the last item is totally incorrect in my opinion. Linux sends 9 keep-alive probes, TCP_KEEPINTVL apart, before closing the socket, so setting the net.Dialer KeepAlive to 300 seconds means a hung socket is detected only after 50 minutes (300 + 300*9 seconds).
If golang set only TCP_KEEPIDLE and did not touch TCP_KEEPINTVL, a 300-second KeepAlive with the default Linux behavior (TCP_KEEPINTVL=75) would close the socket after ≈16 minutes (300 + 75*9 seconds), which is correct and expected. The latter behavior is widely used elsewhere.

Please note that golang also sets TCP_KEEPIDLE and TCP_KEEPINTVL by default for all listening and accepted sockets: not only golang clients, but also any clients connecting to golang servers are affected by short timeout.

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

Reproducible on any architecture and any OS.

What did you do?

Use net.Dialer / net.ListenConfig with default settings.

What did you expect to see?

Sane Keep-Alive values

What did you see instead?

Very short Keep-Alive period and inability to tune TCP_KEEPIDLE and TCP_KEEPINTVL separately.

@beoran

beoran commented Sep 26, 2021

As a workaround, it should be possible to call syscall.SetsockoptInt on the FD of the socket. There is an example here for different socket options: https://stackoverflow.com/questions/40544096/how-to-set-socket-option-ip-tos-for-http-client-in-go-language#40549614

@mknyszek mknyszek changed the title from "Default TCP Keep-Alive interval is very short (15s), drains cell phone battery" to "net: default TCP Keep-Alive interval causes significant power usage" Oct 4, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 4, 2021
@mknyszek mknyszek added this to the Backlog milestone Oct 4, 2021
@mknyszek
Contributor

mknyszek commented Oct 4, 2021

CC @neild

@ValdikSS
Author

Hello, any updates, discussions, ideas?

@ianlancetaylor
Contributor

The argument for why there is a default keep-alive value is #23459.

It is of course possible for a program to change the default, as I think you know.

Are you suggesting that we change the default behavior on android and ios? If so, what should we change it to?

@ValdikSS
Author

ValdikSS commented Jan 30, 2022

The original idea was to have about 3 minutes for dead peer detection, but the current implementation does that suboptimally.

Right now it is achieved by sending up to 9 keep-alive probes at 15-second intervals (on Linux); if the peer does not respond to any of them, the connection is considered broken after 2 minutes 30 seconds (15 + 15*9).

TCP_KEEPIDLE=15
TCP_KEEPINTVL=15
TCP_KEEPCNT=9

However, the same could be achieved without spending too much battery life, by configuring a larger initial timeout (TCP_KEEPIDLE) and a smaller number of probes (TCP_KEEPCNT). Like this:

TCP_KEEPIDLE=180
TCP_KEEPINTVL=15
TCP_KEEPCNT=2

This configuration begins probing the client only after 3 minutes (180 seconds) of inactivity on the socket. After 2 unanswered keep-alive packets sent 15 seconds apart, that is, after 3 minutes 30 seconds in total, the connection would be considered broken.

> Are you suggesting that we change the default behavior on android and ios?

This should be changed globally, because changing it only on Android/iOS would still affect the devices connecting to the servers with low timeouts.

Please also take a look at the arguments here: #23459 (comment)
It states the problems I'm describing in this ticket, and it's from 2018.

I also believe that a 3-minute timeout is still low; consider setting it to at least 5 minutes.

@ianlancetaylor
Contributor

Note that I think there are some portability concerns here. For example, as far as I know OpenBSD does not permit setting these values individually for each socket.

@ianlancetaylor ianlancetaylor modified the milestones: Backlog, Go1.19 Jan 31, 2022
@ValdikSS
Author

ValdikSS commented Jun 6, 2022

Regarding "why 5 minutes":

The mean travel time between two metro stations in Saint Petersburg, Russia is about 2 minutes. My operator's cellular coverage is available only at the stations, not during the trip between them. Since the train waits at a station for only about 20-30 seconds, my cellphone does not always manage to find the network and connect to it, and an additional 2 minutes are needed to get to the next station and connect there.
Despite this being a very selfish calculation method, I think it's a pretty realistic scenario of cellular connectivity loss on a global scale. It usually takes 2-3 minutes to move between connectivity spots on a train, in the metro, or under a bridge, at least in Europe. It usually takes 2-3 minutes to walk from your desk to the office kitchen, make a tea/coffee, and come back. I hope you get the idea.

@ianlancetaylor
Contributor

Nothing happened for 1.19. Moving to 1.20.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.19, Go1.20 Jun 24, 2022
@gopherbot gopherbot modified the milestones: Go1.20, Go1.21 Feb 1, 2023
@Gr33nbl00d

Gr33nbl00d commented Apr 26, 2023

Personally I think the main problem here is not the count but the fact that we are not able to set the keep-alive interval and the idle timeout separately.

We have a similar problem with the TCP keep-alive mechanism in golang. Because of this issue, which has existed since 2014, it is practically useless for us. Please have a look here too: #8328

Also my comment here:
#8328 (comment)

Being able to set a high idle time would prevent flooding the network and draining batteries, while a low interval would still help to detect failed connections fast. That would be enough for us. Of course, being able to control the keep-alive count would be good too, but it might not even be necessary, because you could adjust the interval according to your needs.

@easwars

easwars commented Aug 1, 2023

gRPC-Go is running into an issue with how TCP keepalives are configured by Go.

gRPC supports HTTP/2 level keepalives and these are configurable by the user by specifying a keepalive.Time and keepalive.Timeout value.

  • keepalive.Time is analogous to tcp_keepalive_time, and
  • keepalive.Timeout can be considered analogous to tcp_keepalive_intvl * tcp_keepalive_probes

gRPC also sets the TCP_USER_TIMEOUT to the value of keepalive.Timeout or to a default value of 20s if this field is unspecified by the user.

With the defaults set by Go, after a period of inactivity of 15s, a TCP keepalive probe is sent (irrespective of the value configured for keepalive.Time). And if TCP_USER_TIMEOUT is set to the default 20s, and the keepalive probe is not ACKed within 20s, the connection gets closed.

This is not the case with other language implementations of gRPC since they get the default values configured by the OS.

What is your recommendation to gRPC-Go for working around this behavior of Go?

cc: @dfawley @ejona86

@gopherbot gopherbot modified the milestones: Go1.21, Go1.22 Aug 8, 2023
geekman added a commit to geekman/hapz2m that referenced this issue Aug 8, 2023
It looks like Go has adopted 15s TCP keepalives as a default for _all_
TCP connections, which is quite dumb if you ask me.
golang/go#48622

For the HAP server's side, it degrades iOS battery life significantly by
waking the device every 15s to respond to these packets. In the case as
a normal MQTT client, it increases traffic on top of the 60s keepalive
we've already set at the application layer. In both cases, the solution
is to just explicitly disable TCP keepalives.

Upgrade hap to the latest version that contains the fix brutella/hap#36.
@neild
Contributor

neild commented Aug 23, 2023

#62254 proposes changing the net package APIs which set the keep-alive period to set only TCP_KEEPIDLE, not TCP_KEEPINTVL. It also proposes a new set of APIs which would permit setting TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT independently.

@gopherbot
Contributor

Change https://go.dev/cl/542275 mentions this issue: net: add KeepAliveConfig and implement SetKeepAliveConfig

Asutorufa added a commit to yuhaiin/yuhaiin that referenced this issue Dec 13, 2023
Signed-off-by: Asutorufa <16442314+Asutorufa@users.noreply.github.com>
Asutorufa added a commit to yuhaiin/yuhaiin that referenced this issue Dec 13, 2023
Signed-off-by: Asutorufa <16442314+Asutorufa@users.noreply.github.com>
Asutorufa added a commit to yuhaiin/yuhaiin that referenced this issue Dec 14, 2023
Signed-off-by: Asutorufa <16442314+Asutorufa@users.noreply.github.com>
@panjf2000 panjf2000 modified the milestones: Go1.22, Go1.23 Jan 23, 2024
@panjf2000 panjf2000 added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 23, 2024