runtime: timer doesn't scale on multi-CPU systems with a lot of timers #15133

Closed
valyala opened this issue Apr 5, 2016 · 7 comments
Labels: early-in-cycle (a change that should be done early in the 3 month dev cycle), FrozenDueToAge, Performance

Comments

@valyala
Contributor

valyala commented Apr 5, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version devel +dfad3f8 Tue Apr 5 16:10:33 2016 +0300 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GORACE=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build040466407=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

I run an http server serving more than a million concurrent TCP keep-alive connections on a system with 32 CPU cores. Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.
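For context, the per-connection serving loop looks roughly like the sketch below (a fragment, not the actual fasthttp code; it assumes "net" and "time" imports, and the 90s/5s values are the ones used in the repro later in this thread). Each iteration re-arms both deadlines, i.e. two deltimer+addtimer pairs per request, which is what dominates the profiles below.

// Rough shape of the per-connection loop; not the real server code.
func serveConn(conn net.Conn) {
	buf := make([]byte, 4096)
	for {
		// Re-arm the read deadline before waiting for the next request.
		conn.SetReadDeadline(time.Now().Add(90 * time.Second))
		if _, err := conn.Read(buf); err != nil {
			return
		}
		// Re-arm the write deadline before sending the response.
		conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
		if _, err := conn.Write([]byte("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK")); err != nil {
			return
		}
	}
}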

What did you expect to see?

Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead?

  • addtimer and deltimer functions from runtime/time.go are at the top of CPU profile.
  • iowait reaches 20% due to memory ping-pong across CPU cores inside siftupTimer and siftdownTimer functions in runtime/time.go.

@dvyukov , could you look into this?

@dvyukov
Member

dvyukov commented Apr 6, 2016

Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead? How does performance depend on the number of cores?

iowait reaches 20% due to memory ping-pong across CPU cores inside siftupTimer and siftdownTimer functions in runtime/time.go.

I don't understand the relation between iowait and memory ping-pong. iowait means waiting for I/O (e.g. a hard drive); memory accesses are not I/O.

What exactly does the profile look like?

Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.

Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.
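For illustration, a minimal sketch of the two options (conn is a net.Conn; the timeout values are arbitrary, not from this issue):

// Two calls, two deadline timers armed per request:
conn.SetReadDeadline(time.Now().Add(90 * time.Second))
conn.SetWriteDeadline(time.Now().Add(5 * time.Second))

// One call that arms the read and write deadlines together:
conn.SetDeadline(time.Now().Add(90 * time.Second))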

@dvyukov
Member

dvyukov commented Apr 6, 2016

We probably can merge this into #6239

@valyala
Contributor Author

valyala commented Apr 6, 2016

What did you see instead? How does performance depend on the number of cores?

iowait and system CPU shares grow as the number of CPU cores increases.

I don't understand the relation between iowait and memory ping-pong. iowait means waiting for I/O (e.g. a hard drive); memory accesses are not I/O.

I'm not an expert in iowait, but the fact is that iowait drops from 20% to 0% when deadlines on net.TCPConn connections are disabled. See the following image:
(image: CPU usage graph, "cpuusage")

Connection deadlines are enabled during 15:43-15:56
Connection deadlines are disabled starting from 15:56

What exactly does the profile look like?

Here is the CPU profile for the process with deadlines enabled:

(pprof) top20
9.85mins of 15.65mins total (62.97%)
Dropped 623 nodes (cum <= 0.08mins)
Showing top 20 nodes out of 187 (cum >= 4.98mins)
      flat  flat%   sum%        cum   cum%
  3.27mins 20.92% 20.92%   3.27mins 20.92%  runtime.futex
  1.98mins 12.66% 33.57%   2.20mins 14.04%  syscall.Syscall
  0.66mins  4.25% 37.82%   0.66mins  4.25%  runtime.memmove
  0.54mins  3.48% 41.30%   0.54mins  3.48%  runtime.procyield
  0.52mins  3.31% 44.61%   3.18mins 20.29%  runtime.lock
  0.41mins  2.61% 47.22%   0.41mins  2.61%  runtime/internal/atomic.Xchg
  0.34mins  2.18% 49.40%   0.34mins  2.18%  runtime.epollwait
  0.25mins  1.61% 51.02%   0.25mins  1.61%  runtime.epollctl
  0.25mins  1.60% 52.61%   0.25mins  1.60%  runtime.siftdownTimer
  0.24mins  1.55% 54.16%   0.24mins  1.55%  runtime.siftupTimer
  0.24mins  1.54% 55.70%   0.24mins  1.54%  runtime.osyield
  0.19mins  1.19% 56.89%   0.19mins  1.19%  runtime/internal/atomic.Cas
  0.15mins  0.96% 57.85%   3.03mins 19.36%  runtime.deltimer
  0.14mins  0.91% 58.76%  11.56mins 73.90%  github.com/valyala/fasthttp.(*Server).serveConn
  0.12mins  0.76% 59.52%      2mins 12.78%  runtime.unlock
  0.12mins  0.74% 60.26%   0.16mins     1%  runtime.deferreturn
  0.11mins  0.71% 60.97%   0.12mins  0.77%  runtime.gopark
  0.11mins  0.71% 61.68%   0.42mins  2.68%  runtime.gentraceback
  0.10mins  0.66% 62.35%   0.23mins  1.47%  lib/metric.(*sample).Update
  0.10mins  0.62% 62.97%   4.98mins 31.86%  net.runtime_pollSetDeadline

(pprof) top20 -cum
387.31s of 938.84s total (41.25%)
Dropped 623 nodes (cum <= 4.69s)
Showing top 20 nodes out of 187 (cum >= 112.06s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    864.20s 92.05%  runtime.goexit
     0.01s 0.0011% 0.0011%    733.84s 78.16%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
     0.39s 0.042% 0.043%    733.75s 78.15%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.10s 0.011% 0.053%    693.90s 73.91%  github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
     8.53s  0.91%  0.96%    693.80s 73.90%  github.com/valyala/fasthttp.(*Server).serveConn
     1.28s  0.14%  1.10%    306.32s 32.63%  net.setDeadlineImpl
     5.86s  0.62%  1.72%    299.09s 31.86%  net.runtime_pollSetDeadline
   196.36s 20.92% 22.64%    196.36s 20.92%  runtime.futex
    31.04s  3.31% 25.94%    190.52s 20.29%  runtime.lock
        9s  0.96% 26.90%    181.79s 19.36%  runtime.deltimer
     0.80s 0.085% 26.99%    159.09s 16.95%  lib/httpserver.(*customConn).SetReadDeadline
     0.72s 0.077% 27.06%    158.10s 16.84%  net.(*conn).SetReadDeadline
     0.36s 0.038% 27.10%    157.38s 16.76%  net.(*netFD).setReadDeadline
     2.63s  0.28% 27.38%    154.28s 16.43%  lib/httpserver.(*customConn).SetWriteDeadline
     0.84s 0.089% 27.47%    150.37s 16.02%  net.(*conn).SetWriteDeadline
     0.23s 0.024% 27.50%    149.53s 15.93%  net.(*netFD).setWriteDeadline
     0.80s 0.085% 27.58%    139.24s 14.83%  runtime.addtimer
   118.83s 12.66% 40.24%    131.79s 14.04%  syscall.Syscall
     7.14s  0.76% 41.00%    120.01s 12.78%  runtime.unlock
     2.39s  0.25% 41.25%    112.06s 11.94%  github.com/valyala/fasthttp.acquireByteReader

(pprof) list addtimer$
Total: 15.65mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     800ms   2.32mins (flat, cum) 14.83% of Total
         .          .     80:// Ready the goroutine arg.
         .          .     81:func goroutineReady(arg interface{}, seq uintptr) {
         .          .     82:   goready(arg.(*g), 0)
         .          .     83:}
         .          .     84:
     360ms      360ms     85:func addtimer(t *timer) {
      50ms   1.35mins     86:   lock(&timers.lock)
     320ms      7.42s     87:   addtimerLocked(t)
      30ms     50.47s     88:   unlock(&timers.lock)
      40ms       40ms     89:}
         .          .     90:
         .          .     91:// Add a timer to the heap and start or kick the timer proc.
         .          .     92:// If the new timer is earlier than any of the others.
         .          .     93:// Timers are locked.
         .          .     94:func addtimerLocked(t *timer) {


(pprof) list deltimer$
Total: 15.65mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
        9s   3.03mins (flat, cum) 19.36% of Total
         .          .    117:   }
         .          .    118:}
         .          .    119:
         .          .    120:// Delete timer t from the heap.
         .          .    121:// Do not need to update the timerproc: if it wakes up early, no big deal.
     200ms      200ms    122:func deltimer(t *timer) bool {
         .          .    123:   // Dereference t so that any panic happens before the lock is held.
         .          .    124:   // Discard result, because t might be moving in the heap.
      50ms       50ms    125:   _ = t.i
         .          .    126:
     180ms   1.61mins    127:   lock(&timers.lock)
         .          .    128:   // t may not be registered anymore and may have
         .          .    129:   // a bogus i (typically 0, if generated by Go).
         .          .    130:   // Verify it before proceeding.
     570ms      570ms    131:   i := t.i
     1.64s      1.64s    132:   last := len(timers.t) - 1
     2.64s      2.64s    133:   if i < 0 || i > last || timers.t[i] != t {
         .          .    134:       unlock(&timers.lock)
         .          .    135:       return false
         .          .    136:   }
         .          .    137:   if i != last {
     1.47s      1.47s    138:       timers.t[i] = timers.t[last]
     330ms      330ms    139:       timers.t[i].i = i
         .          .    140:   }
     1.71s      1.71s    141:   timers.t[last] = nil
      30ms       30ms    142:   timers.t = timers.t[:last]
         .          .    143:   if i != last {
      30ms     11.32s    144:       siftupTimer(i)
      30ms      6.23s    145:       siftdownTimer(i)
         .          .    146:   }
      30ms     58.89s    147:   unlock(&timers.lock)
      90ms       90ms    148:   return true
         .          .    149:}
         .          .    150:
         .          .    151:// Timerproc runs the time-driven events.
         .          .    152:// It sleeps until the next event in the timers heap.
         .          .    153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.

Here is the CPU profile for the app with connection deadlines disabled:

(pprof) top20
465.38s of 1002.60s total (46.42%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 15.87s)
      flat  flat%   sum%        cum   cum%
    54.74s  5.46%  5.46%     54.74s  5.46%  runtime.memmove
    54.54s  5.44% 10.90%    150.32s 14.99%  runtime.lock
    48.96s  4.88% 15.78%    330.74s 32.99%  runtime.findrunnable
    34.12s  3.40% 19.19%     34.12s  3.40%  runtime.futex
    32.93s  3.28% 22.47%     32.93s  3.28%  runtime/internal/atomic.Xchg
    27.41s  2.73% 25.20%     27.41s  2.73%  runtime.procyield
    26.82s  2.68% 27.88%     33.59s  3.35%  runtime.runqgrab
    20.20s  2.01% 29.89%     35.12s  3.50%  runtime.selectgoImpl
    17.45s  1.74% 31.63%     17.45s  1.74%  runtime/internal/atomic.Cas
    16.69s  1.66% 33.30%     21.79s  2.17%  github.com/valyala/fasthttp.(*headerScanner).next
    16.56s  1.65% 34.95%     16.56s  1.65%  math/big.addMulVVW
    16.38s  1.63% 36.58%     25.14s  2.51%  github.com/valyala/fasthttp.(*RequestHeader).AppendBytes
    14.24s  1.42% 38.01%    446.04s 44.49%  github.com/valyala/fasthttp.(*Server).serveConn
    13.88s  1.38% 39.39%     22.81s  2.28%  runtime.netpollready
    13.72s  1.37% 40.76%     13.72s  1.37%  runtime/internal/atomic.Load
    12.19s  1.22% 41.97%     32.18s  3.21%  runtime.unlock
    11.54s  1.15% 43.12%    351.66s 35.07%  runtime.schedule
    11.42s  1.14% 44.26%     11.42s  1.14%  runtime.indexbytebody
    11.02s  1.10% 45.36%     11.02s  1.10%  runtime.osyield
    10.57s  1.05% 46.42%     15.87s  1.58%  github.com/valyala/fasthttp.appendArgBytes

(pprof) top20 -cum
155.46s of 1002.60s total (15.51%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 61.52s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    636.02s 63.44%  runtime.goexit
         0     0%     0%    466.84s 46.56%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
     2.85s  0.28%  0.28%    466.84s 46.56%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.28s 0.028%  0.31%    446.32s 44.52%  github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
    14.24s  1.42%  1.73%    446.04s 44.49%  github.com/valyala/fasthttp.(*Server).serveConn
     3.38s  0.34%  2.07%    361.68s 36.07%  runtime.mcall
     1.30s  0.13%  2.20%    354.98s 35.41%  runtime.park_m
    11.54s  1.15%  3.35%    351.66s 35.07%  runtime.schedule
    48.96s  4.88%  8.23%    330.74s 32.99%  runtime.findrunnable
     1.74s  0.17%  8.41%    206.95s 20.64%  lib/httpserver.getHTTPHandler.func1
     1.21s  0.12%  8.53%    199.45s 19.89%  lib/httpserver.httpHandler
     0.43s 0.043%  8.57%    161.15s 16.07%  main.requestHandler
     0.14s 0.014%  8.58%    158.64s 15.82%  main.proxyHandler
    54.54s  5.44% 14.02%    150.32s 14.99%  runtime.lock
     2.46s  0.25% 14.27%    149.65s 14.93%  main.proxyRequest
         0     0% 14.27%    112.92s 11.26%  github.com/valyala/fasthttp.(*PipelineClient).worker.func1
     4.15s  0.41% 14.68%    112.92s 11.26%  github.com/valyala/fasthttp.(*PipelineClient).writer
     5.54s  0.55% 15.24%    106.39s 10.61%  github.com/valyala/fasthttp.acquireByteReader
     0.73s 0.073% 15.31%     67.26s  6.71%  github.com/valyala/fasthttp.(*RequestHeader).SetBytesV
     1.97s   0.2% 15.51%     61.52s  6.14%  github.com/valyala/fasthttp.(*RequestHeader).SetCanonical

(pprof) list addtimer$
Total: 16.71mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     150ms      1.60s (flat, cum)  0.16% of Total
         .          .     80:// Ready the goroutine arg.
         .          .     81:func goroutineReady(arg interface{}, seq uintptr) {
         .          .     82:   goready(arg.(*g), 0)
         .          .     83:}
         .          .     84:
     100ms      100ms     85:func addtimer(t *timer) {
      40ms      780ms     86:   lock(&timers.lock)
      10ms      310ms     87:   addtimerLocked(t)
         .      410ms     88:   unlock(&timers.lock)
         .          .     89:}
         .          .     90:
         .          .     91:// Add a timer to the heap and start or kick the timer proc.
         .          .     92:// If the new timer is earlier than any of the others.
         .          .     93:// Timers are locked.


(pprof) list deltimer$
Total: 16.71mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     2.21s      4.53s (flat, cum)  0.45% of Total
         .          .    117:   }
         .          .    118:}
         .          .    119:
         .          .    120:// Delete timer t from the heap.
         .          .    121:// Do not need to update the timerproc: if it wakes up early, no big deal.
     370ms      370ms    122:func deltimer(t *timer) bool {
         .          .    123:   // Dereference t so that any panic happens before the lock is held.
         .          .    124:   // Discard result, because t might be moving in the heap.
      10ms       10ms    125:   _ = t.i
         .          .    126:
      60ms      1.03s    127:   lock(&timers.lock)
         .          .    128:   // t may not be registered anymore and may have
         .          .    129:   // a bogus i (typically 0, if generated by Go).
         .          .    130:   // Verify it before proceeding.
      40ms       40ms    131:   i := t.i
      30ms       30ms    132:   last := len(timers.t) - 1
     1.06s      1.06s    133:   if i < 0 || i > last || timers.t[i] != t {
         .          .    134:       unlock(&timers.lock)
         .          .    135:       return false
         .          .    136:   }
         .          .    137:   if i != last {
     230ms      230ms    138:       timers.t[i] = timers.t[last]
      70ms       70ms    139:       timers.t[i].i = i
         .          .    140:   }
     250ms      250ms    141:   timers.t[last] = nil
         .          .    142:   timers.t = timers.t[:last]
         .          .    143:   if i != last {
      20ms      180ms    144:       siftupTimer(i)
      10ms      110ms    145:       siftdownTimer(i)
         .          .    146:   }
      20ms      1.11s    147:   unlock(&timers.lock)
      40ms       40ms    148:   return true
         .          .    149:}
         .          .    150:
         .          .    151:// Timerproc runs the time-driven events.
         .          .    152:// It sleeps until the next event in the timers heap.
         .          .    153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.

Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.

No, this won't work, since read and write deadlines are configured independently in our application.

@valyala
Contributor Author

valyala commented Apr 6, 2016

FYI, this is not an OS scalability problem: the part of the CPU usage graph above until 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode because of the timer scalability issue in the Go runtime.

@dvyukov
Member

dvyukov commented Apr 6, 2016

Thanks for the detailed info!

This confirms that the main issue is the global timers mutex. siftdown/siftup consume only ~3% of the time. If we distribute the timers, siftdown/siftup should become faster as well (smaller per-P heaps + better locality).
Unfortunately this is not trivial to do. There is #6239 for this.
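For illustration only, here is a user-space sketch of the distribution idea: several timer heaps, each behind its own lock, so concurrent add/delete operations rarely fight over a single mutex. The shard count and round-robin shard choice are assumptions of the sketch, not the runtime's design (a real implementation would more likely key off the current P).

package timersketch

import (
	"container/heap"
	"sync"
	"sync/atomic"
	"time"
)

type timer struct {
	when time.Time
	f    func()
}

// timerHeap is a min-heap ordered by expiration time.
type timerHeap []*timer

func (h timerHeap) Len() int           { return len(h) }
func (h timerHeap) Less(i, j int) bool { return h[i].when.Before(h[j].when) }
func (h timerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *timerHeap) Push(x interface{}) { *h = append(*h, x.(*timer)) }
func (h *timerHeap) Pop() interface{} {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

type timerShard struct {
	mu sync.Mutex
	h  timerHeap
}

type shardedTimers struct {
	shards []timerShard
	next   uint64 // round-robin counter; the runtime would use the current P instead
}

func newShardedTimers(n int) *shardedTimers {
	return &shardedTimers{shards: make([]timerShard, n)}
}

func (s *shardedTimers) add(t *timer) {
	// Pick a shard; only that shard's lock is taken, not a global one.
	sh := &s.shards[atomic.AddUint64(&s.next, 1)%uint64(len(s.shards))]
	sh.mu.Lock()
	heap.Push(&sh.h, t)
	sh.mu.Unlock()
}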

FYI, this is not an OS scalability problem: the part of the CPU usage graph above until 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode because of the timer scalability issue in the Go runtime.

Sounds pretty bad. @rsc @aclements

@bradfitz bradfitz added this to the Unplanned milestone Apr 7, 2016
valyala added a commit to Vertamedia/fasthttp that referenced this issue Apr 15, 2016
…dline calls for keep-alive connections, since they don't scale properly. See golang/go#15133 for details
valyala added a commit to valyala/fasthttp that referenced this issue Apr 15, 2016
…dline calls for keep-alive connections, since they don't scale properly. See golang/go#15133 for details
valyala added a commit to valyala/fasthttp that referenced this issue Apr 15, 2016
…dline calls for keep-alive connections, since they don't scale properly. See golang/go#15133 for details.
@gopherbot
Contributor

CL https://golang.org/cl/34784 mentions this issue.

@valyala
Contributor Author

valyala commented Jan 10, 2017

The CL fixes the timer scalability issue for us. Simple repro:

Hammer the following http server with more than 100K concurrent http keep-alive connections and send several requests per second on each connection. The server must run on a multi-core machine with GOMAXPROCS=NumCPU.

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"time"
)

var addr = flag.String("addr", ":80", "TCP address to listen to")

func main() {
	flag.Parse()
	s := &http.Server{
		Addr: *addr,

		// These timeouts trigger high iowait without the CL 34784
		// if many requests are sent over more than 100K
		// keep-alive http connections.
		ReadTimeout:  90 * time.Second,
		WriteTimeout: 5 * time.Second,

		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, "OK")
		}),
	}
	if err := s.ListenAndServe(); err != nil {
		log.Fatalf("error in ListenAndServe: %s", err)
	}
}
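A hypothetical load generator for this repro is sketched below; the flag names, defaults, and request rate are mine, not from the issue. Reaching 100K+ connections also requires raised file-descriptor limits and possibly several client machines.

package main

import (
	"bufio"
	"flag"
	"io"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	addr := flag.String("addr", "127.0.0.1:80", "server address")
	conns := flag.Int("conns", 1000, "number of keep-alive connections to open")
	flag.Parse()

	for i := 0; i < *conns; i++ {
		go hammer(*addr)
	}
	select {} // run until killed
}

// hammer opens one keep-alive connection and sends a couple of requests per second on it.
func hammer(addr string) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		log.Print(err)
		return
	}
	defer conn.Close()
	br := bufio.NewReader(conn)
	req := []byte("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
	for {
		if _, err := conn.Write(req); err != nil {
			return
		}
		resp, err := http.ReadResponse(br, nil)
		if err != nil {
			return
		}
		io.Copy(io.Discard, resp.Body) // drain so the next response starts cleanly
		resp.Body.Close()
		time.Sleep(500 * time.Millisecond) // ~2 requests/second per connection
	}
}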

CPU load should look like this:
(image: CPU load graph)

  • 12:37–12:42: the server without the CL; note the high iowait share.
  • 12:43–12:50: the server with the CL.

@aclements aclements modified the milestones: Go1.10, Unplanned Jul 7, 2017
@aclements aclements added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Jul 7, 2017
@aclements aclements self-assigned this Jul 7, 2017
ivankorobkov added a commit to mdspinc/aerospike-client-go that referenced this issue Jun 27, 2018
Update the deadline only if more than 25% of the last deadline has passed.
See golang/go#15133 for details.
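The trick behind that commit, sketched in Go (a fragment assuming "net" and "time" imports; the type and field names are made up for illustration): remember the deadline that was last handed to the runtime and only re-arm it once at least 25% of the timeout has elapsed, so most requests skip the deltimer/addtimer pair entirely.

type lazyDeadlineConn struct {
	net.Conn
	timeout  time.Duration
	deadline time.Time // deadline most recently passed to the runtime
}

// armReadDeadline re-arms the read deadline only when more than 25% of the
// timeout has already elapsed since the previous deadline was set.
func (c *lazyDeadlineConn) armReadDeadline() error {
	now := time.Now()
	if c.deadline.Sub(now) > 3*c.timeout/4 {
		return nil // the previous deadline is still fresh enough
	}
	c.deadline = now.Add(c.timeout)
	return c.Conn.SetReadDeadline(c.deadline)
}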
@golang golang locked and limited conversation to collaborators Sep 12, 2018