runtime: os.File 1.2x slower, no longer parallel, in tip compared to go18 (regression) #19586

Closed
navytux opened this Issue Mar 17, 2017 · 17 comments

Comments

Projects
None yet
6 participants
@navytux
Contributor

navytux commented Mar 17, 2017

Please answer these questions before submitting your issue. Thanks!

What did you do?

Hello up there. During discussion in #19563 it was found that in go tip e.g. os.File.ReadAt became ~ 1.2x slower compared to go18. Let me quote #19563 (comment):

name \ time/op  go14.txt    go15.txt    go16.txt    go17.txt    go18.txt    gotip.txt
ReadAt1         268ns ± 1%
Pread1          259ns ± 1%
ReadAt1-4                   274ns ± 1%  277ns ± 1%  274ns ± 1%  263ns ± 1%  323ns ± 0%
Pread1-4                    262ns ± 1%  266ns ± 1%  262ns ± 2%  252ns ± 1%  251ns ± 1%

(updated benchmark, diff)

...

The differencein between os.File.ReadAt(len(buf)=1) and syscall.Pread(len(buf)=1) was ~ 10ns during go14 - go18. With go tip it raises to ~70ns, so it is indeed 60ns regression for os.File.ReadAt.

What did you expect to see?

os.File.ReadAt and friends works the same speed or faster compared to go18.

What did you see instead?

os.File.ReadAt became slower by 60ns compared to go18.

Does this issue reproduce with the latest release (go1.8)?

No.

This is probably related to recently introduced runtime poller.

Thanks beforehand,
Kirill

/cc @ianlancetaylor

System details

go version devel +88e47187c1 Fri Mar 17 05:21:53 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/src/tools/go/g.env"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build094114164=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +88e47187c1 Fri Mar 17 05:21:53 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +88e47187c1 Fri Mar 17 05:21:53 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.0 (stretch)
Release:	9.0
Codename:	stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
@philhofer

This comment has been minimized.

Show comment
Hide comment
@philhofer

philhofer Mar 20, 2017

Contributor

This is almost certainly due to the poller.

I'd also add that the new poller integration implicitly serializes access to the file descriptor, which has pretty dramatic performance effects on some applications. Unlike sockets, there are legitimate reasons to have multiple reads and/or writes execute concurrently.

Here's some output I saw from perf record -g recently on one of our test harnesses:

-   29.43%     0.48%  kvfill   kvfill            [.] os.(*File).WriteAt 
     28.95% os.(*File).WriteAt
      - os.(*File).pwrite
         - 28.67% internal/poll.(*FD).Pwrite
            + 22.74% internal/poll.(*FD).writeLock
            + 3.77% internal/poll.(*FD).writeUnlock
            + 1.26% syscall.Syscall6

For concurrent calls to (*os.File).WriteAt, more than 95% of the time is spent dealing with an unnecessary lock. (The application is already managing locking to file regions.)

Contributor

philhofer commented Mar 20, 2017

This is almost certainly due to the poller.

I'd also add that the new poller integration implicitly serializes access to the file descriptor, which has pretty dramatic performance effects on some applications. Unlike sockets, there are legitimate reasons to have multiple reads and/or writes execute concurrently.

Here's some output I saw from perf record -g recently on one of our test harnesses:

-   29.43%     0.48%  kvfill   kvfill            [.] os.(*File).WriteAt 
     28.95% os.(*File).WriteAt
      - os.(*File).pwrite
         - 28.67% internal/poll.(*FD).Pwrite
            + 22.74% internal/poll.(*FD).writeLock
            + 3.77% internal/poll.(*FD).writeUnlock
            + 1.26% syscall.Syscall6

For concurrent calls to (*os.File).WriteAt, more than 95% of the time is spent dealing with an unnecessary lock. (The application is already managing locking to file regions.)

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Mar 20, 2017

Contributor

@philhofer thanks for your feedback.

Let me also quote relevant bits about new poller locking here eventhough it was already described in #19563:

---- 8< ----

  1. the defer usage in IO poller is quite noticable. Either making defer faster (#14939, #19466, #18923, ...) or not using defer could be a solution.

  2. internal/poll locking: at least on Linux it is ok to issue several preads simultaneously (please correct me if I'm wrong). From pread man page:

    The  pread()  and  pwrite()  system calls are especially useful in multithreaded applications.  They allow multiple
    threads to perform I/O on the same file descriptor without being affected by changes to the file  offset  
    by  other threads.
    

    (http://man7.org/linux/man-pages/man2/pread.2.html)

    Thus no locking is needed to serialize pread vs pread or vs read. Maybe some locking is still needed for internal bookkeeping, but may I wonder it can be made lighter?

---- 8< ----

P.S. Someone please add "Performance" label to this issue. Thanks.

Contributor

navytux commented Mar 20, 2017

@philhofer thanks for your feedback.

Let me also quote relevant bits about new poller locking here eventhough it was already described in #19563:

---- 8< ----

  1. the defer usage in IO poller is quite noticable. Either making defer faster (#14939, #19466, #18923, ...) or not using defer could be a solution.

  2. internal/poll locking: at least on Linux it is ok to issue several preads simultaneously (please correct me if I'm wrong). From pread man page:

    The  pread()  and  pwrite()  system calls are especially useful in multithreaded applications.  They allow multiple
    threads to perform I/O on the same file descriptor without being affected by changes to the file  offset  
    by  other threads.
    

    (http://man7.org/linux/man-pages/man2/pread.2.html)

    Thus no locking is needed to serialize pread vs pread or vs read. Maybe some locking is still needed for internal bookkeeping, but may I wonder it can be made lighter?

---- 8< ----

P.S. Someone please add "Performance" label to this issue. Thanks.

@bradfitz bradfitz added this to the Go1.9Maybe milestone Mar 20, 2017

@navytux navytux changed the title from runtime: os.File 1.2x slower in tip compared to go18 (regression) to runtime: os.File 1.2x slower, non parallelizable, in tip compared to go18 (regression) Mar 22, 2017

@navytux navytux changed the title from runtime: os.File 1.2x slower, non parallelizable, in tip compared to go18 (regression) to runtime: os.File 1.2x slower, non concurrent, in tip compared to go18 (regression) Mar 22, 2017

@navytux navytux changed the title from runtime: os.File 1.2x slower, non concurrent, in tip compared to go18 (regression) to runtime: os.File 1.2x slower, no longer parallel, in tip compared to go18 (regression) Mar 24, 2017

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Apr 7, 2017

Contributor

Related to "no longer parallel" for UDP sockets: #17520

Contributor

navytux commented Apr 7, 2017

Related to "no longer parallel" for UDP sockets: #17520

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Apr 24, 2017

Contributor

@bradfitz added this to the Go1.9Maybe milestone on 20 Mar

Voice from poor people: it would be a pity to let this regression go into Go1.9

Contributor

navytux commented Apr 24, 2017

@bradfitz added this to the Go1.9Maybe milestone on 20 Mar

Voice from poor people: it would be a pity to let this regression go into Go1.9

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 25, 2017

Contributor

I can speed up ReadAt, which gets the slowdown on the benchmark above down to about 15ns or 4%, but I haven't figured out a way to do better without losing the useful property of being able to fix #7970.

Contributor

ianlancetaylor commented Apr 25, 2017

I can speed up ReadAt, which gets the slowdown on the benchmark above down to about 15ns or 4%, but I haven't figured out a way to do better without losing the useful property of being able to fix #7970.

@gopherbot

This comment has been minimized.

Show comment
Hide comment

CL https://golang.org/cl/41670 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 25, 2017

internal/poll: don't use r/w lock for Pread/Pwrite
Since Pread/Pwrite specify a file offset, using incref is sufficient.
This permits multiple Pread/Pwrite calls in parallel.

Since Pread/Pwrite specify a file offset, it doesn't seem to make
sense to use the poller for them, so don't.

Updates #19586

Change-Id: I676be16bf519b9a45f8e6b1d991c44f10848bc11
Reviewed-on: https://go-review.googlesource.com/41670
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Apr 25, 2017

Contributor

@ianlancetaylor thanks for feedback, patch and heads up on #7970.

I'm not so handy with the poller code, but after briefly looking around there is some feedback on my side. I appologize in advance for maybe some mistakes:

  1. as the new FD.Pread unconditionally calls syscall.Pread without poller, we have to make sure fd.Sysfd is not in nonblocking mode - else EAGAIN could be returned to actual ReadAt user. Currently this property is implicitly satisfied to be so, at least on Linux, where epoll refuses to register fd of regular files. Maybe it is worth at least a comment.

  2. on Windows Pread emulation uses syscall.Seek internally. Is it correct there to have only incref instead of full read lock?

  3. I've also noticed that when there is syscall.Pread error we reset n to 0? To me this is legitimate situation when there is e.g. 4K read request and IO error is there after 1K has been read. So why on error we are not telling the caller 1K still has been successfully read?

  4. In the context of #7970 I've noticed that the actual syscall.Close happens in FD.destroy which itself is called from under last decref. This is ok when poller is used - there awaiting for IO goroutines are first deregistered by pollDesc.evict but in case of blocking IO without poller (ReadAt case) FD.Close will only mark fd mutex as closed (so issuing new read/write requests will not be possible) but e.g. pread request that is already in flight won't be interrupted.

    In case where pread could be blocking potentially indefinitely long (networked filesystems, slow IO with retries, ...) that could lead to deadlocks when client code expects Close to interrupt IO, then waits for goroutine doing IO to complete.

    So what is the plan here? (offhand I would say that at least for fds not using poller syscall.Close should be called by FD.Close itself with protection from #7970-style races)

Thanks beforehand,
Kirill

P.S. updated timings on my side (i7-6600U; Linux 4.9.18-1 (2017-03-30) x86_64) are:

name \ time/op  go18.txt    go+8ae60dc1bb.txt  go+34ee8ec193.txt
                            (tip before patch)   (tip with patch)
ReadAt1-4       260ns ± 1%         376ns ± 2%         286ns ± 1%
Pread1-4        254ns ± 1%         251ns ± 0%         255ns ± 1%

so compared to go18 slowdown for ReadAt is ~ 10% (286ns over 260ns). Note also that before patch tip got even more slower compared to original 1.2x regression (ReadAt timing then was 323ns and now it is 376ns) - probably something else slowed down in the poller more.

For the reference here are go18 and tip profiles for ReadAt benchmark:

---- 8< ---- (go18)

30.91s of 31.72s total (97.45%)
Dropped 30 nodes (cum <= 0.16s)
Showing top 10 nodes out of 17 (cum >= 0.29s)
      flat  flat%   sum%        cum   cum%
    23.64s 74.53% 74.53%     29.88s 94.20%  syscall.Syscall6
     2.20s  6.94% 81.46%      2.20s  6.94%  runtime.casgstatus
     1.34s  4.22% 85.69%      2.85s  8.98%  runtime.reentersyscall
     1.12s  3.53% 89.22%      1.35s  4.26%  runtime.exitsyscallfast
     0.69s  2.18% 91.39%      2.91s  9.17%  runtime.exitsyscall
     0.58s  1.83% 93.22%     31.18s 98.30%  os.(*File).ReadAt
     0.38s  1.20% 94.42%     30.60s 96.47%  os.(*File).pread
     0.34s  1.07% 95.49%     30.22s 95.27%  syscall.Pread
     0.33s  1.04% 96.53%     31.51s 99.34%  command-line-arguments.xreadat
     0.29s  0.91% 97.45%      0.29s  0.91%  runtime.getcallerpc

---- 8< ---- (tip with patch: 34ee8ec)

Showing nodes accounting for 32.82s, 94.99% of 34.55s total
Dropped 23 nodes (cum <= 0.17s)
Showing top 10 nodes out of 20
      flat  flat%   sum%        cum   cum%
    23.93s 69.26% 69.26%     30.11s 87.15%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
     2.43s  7.03% 76.30%      2.43s  7.03%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.23s  3.56% 79.86%      2.74s  7.93%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.20s  3.47% 83.33%      1.44s  4.17%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.18s  3.42% 86.74%      1.18s  3.42%  internal/poll.(*fdMutex).incref /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     1.04s  3.01% 89.75%      1.04s  3.01%  internal/poll.(*fdMutex).decref /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.67s  1.94% 91.69%      3.21s  9.29%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     0.46s  1.33% 93.02%     34.04s 98.52%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
     0.35s  1.01% 94.04%     30.46s 88.16%  syscall.Pread /home/kirr/src/tools/go/go/src/syscall/zsyscall_linux_amd64.go
     0.33s  0.96% 94.99%     34.41s 99.59%  command-line-arguments.xreadat /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go

---- 8< ---- (tip without patch: 8ae60dc)

Showing nodes accounting for 43.67s, 96.06% of 45.46s total
Dropped 45 nodes (cum <= 0.23s)
Showing top 20 nodes out of 26
      flat  flat%   sum%        cum   cum%
    26.83s 59.02% 59.02%     33.83s 74.42%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
     2.22s  4.88% 63.90%      2.22s  4.88%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.73s  3.81% 67.71%      3.09s  6.80%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.57s  3.45% 71.16%      1.57s  3.45%  runtime.newdefer /home/kirr/src/tools/go/go/src/runtime/panic.go
     1.19s  2.62% 73.78%      3.33s  7.33%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.15s  2.53% 76.31%     43.76s 96.26%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
     1.13s  2.49% 78.79%     45.11s 99.23%  command-line-arguments.xreadat /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go
     1.05s  2.31% 81.10%      1.05s  2.31%  internal/poll.(*fdMutex).rwlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.98s  2.16% 83.26%      0.98s  2.16%  internal/poll.(*fdMutex).rwunlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.92s  2.02% 85.28%      1.16s  2.55%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
     0.84s  1.85% 87.13%     42.08s 92.56%  internal/poll.(*FD).Pread /home/kirr/src/tools/go/go/src/internal/poll/fd_unix.go
     0.62s  1.36% 88.50%      2.59s  5.70%  runtime.deferproc /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.57s  1.25% 89.75%      0.57s  1.25%  runtime.getcallerpc /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
     0.52s  1.14% 90.89%     42.61s 93.73%  os.(*File).pread /home/kirr/src/tools/go/go/src/os/file_unix.go
     0.52s  1.14% 92.04%      0.53s  1.17%  runtime.freedefer /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.46s  1.01% 93.05%     34.29s 75.43%  syscall.Pread /home/kirr/src/tools/go/go/src/syscall/zsyscall_linux_amd64.go
     0.44s  0.97% 94.02%      1.06s  2.33%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.38s  0.84% 94.85%      1.43s  3.15%  internal/poll.(*FD).readLock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.28s  0.62% 95.47%      0.55s  1.21%  internal/poll.(*pollDesc).prepareRead /home/kirr/src/tools/go/go/src/internal/poll/fd_poll_runtime.go
     0.27s  0.59% 96.06%     45.39s 99.85%  command-line-arguments.BenchmarkReadAt1 /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go
Contributor

navytux commented Apr 25, 2017

@ianlancetaylor thanks for feedback, patch and heads up on #7970.

I'm not so handy with the poller code, but after briefly looking around there is some feedback on my side. I appologize in advance for maybe some mistakes:

  1. as the new FD.Pread unconditionally calls syscall.Pread without poller, we have to make sure fd.Sysfd is not in nonblocking mode - else EAGAIN could be returned to actual ReadAt user. Currently this property is implicitly satisfied to be so, at least on Linux, where epoll refuses to register fd of regular files. Maybe it is worth at least a comment.

  2. on Windows Pread emulation uses syscall.Seek internally. Is it correct there to have only incref instead of full read lock?

  3. I've also noticed that when there is syscall.Pread error we reset n to 0? To me this is legitimate situation when there is e.g. 4K read request and IO error is there after 1K has been read. So why on error we are not telling the caller 1K still has been successfully read?

  4. In the context of #7970 I've noticed that the actual syscall.Close happens in FD.destroy which itself is called from under last decref. This is ok when poller is used - there awaiting for IO goroutines are first deregistered by pollDesc.evict but in case of blocking IO without poller (ReadAt case) FD.Close will only mark fd mutex as closed (so issuing new read/write requests will not be possible) but e.g. pread request that is already in flight won't be interrupted.

    In case where pread could be blocking potentially indefinitely long (networked filesystems, slow IO with retries, ...) that could lead to deadlocks when client code expects Close to interrupt IO, then waits for goroutine doing IO to complete.

    So what is the plan here? (offhand I would say that at least for fds not using poller syscall.Close should be called by FD.Close itself with protection from #7970-style races)

Thanks beforehand,
Kirill

P.S. updated timings on my side (i7-6600U; Linux 4.9.18-1 (2017-03-30) x86_64) are:

name \ time/op  go18.txt    go+8ae60dc1bb.txt  go+34ee8ec193.txt
                            (tip before patch)   (tip with patch)
ReadAt1-4       260ns ± 1%         376ns ± 2%         286ns ± 1%
Pread1-4        254ns ± 1%         251ns ± 0%         255ns ± 1%

so compared to go18 slowdown for ReadAt is ~ 10% (286ns over 260ns). Note also that before patch tip got even more slower compared to original 1.2x regression (ReadAt timing then was 323ns and now it is 376ns) - probably something else slowed down in the poller more.

For the reference here are go18 and tip profiles for ReadAt benchmark:

---- 8< ---- (go18)

30.91s of 31.72s total (97.45%)
Dropped 30 nodes (cum <= 0.16s)
Showing top 10 nodes out of 17 (cum >= 0.29s)
      flat  flat%   sum%        cum   cum%
    23.64s 74.53% 74.53%     29.88s 94.20%  syscall.Syscall6
     2.20s  6.94% 81.46%      2.20s  6.94%  runtime.casgstatus
     1.34s  4.22% 85.69%      2.85s  8.98%  runtime.reentersyscall
     1.12s  3.53% 89.22%      1.35s  4.26%  runtime.exitsyscallfast
     0.69s  2.18% 91.39%      2.91s  9.17%  runtime.exitsyscall
     0.58s  1.83% 93.22%     31.18s 98.30%  os.(*File).ReadAt
     0.38s  1.20% 94.42%     30.60s 96.47%  os.(*File).pread
     0.34s  1.07% 95.49%     30.22s 95.27%  syscall.Pread
     0.33s  1.04% 96.53%     31.51s 99.34%  command-line-arguments.xreadat
     0.29s  0.91% 97.45%      0.29s  0.91%  runtime.getcallerpc

---- 8< ---- (tip with patch: 34ee8ec)

Showing nodes accounting for 32.82s, 94.99% of 34.55s total
Dropped 23 nodes (cum <= 0.17s)
Showing top 10 nodes out of 20
      flat  flat%   sum%        cum   cum%
    23.93s 69.26% 69.26%     30.11s 87.15%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
     2.43s  7.03% 76.30%      2.43s  7.03%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.23s  3.56% 79.86%      2.74s  7.93%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.20s  3.47% 83.33%      1.44s  4.17%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.18s  3.42% 86.74%      1.18s  3.42%  internal/poll.(*fdMutex).incref /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     1.04s  3.01% 89.75%      1.04s  3.01%  internal/poll.(*fdMutex).decref /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.67s  1.94% 91.69%      3.21s  9.29%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     0.46s  1.33% 93.02%     34.04s 98.52%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
     0.35s  1.01% 94.04%     30.46s 88.16%  syscall.Pread /home/kirr/src/tools/go/go/src/syscall/zsyscall_linux_amd64.go
     0.33s  0.96% 94.99%     34.41s 99.59%  command-line-arguments.xreadat /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go

---- 8< ---- (tip without patch: 8ae60dc)

Showing nodes accounting for 43.67s, 96.06% of 45.46s total
Dropped 45 nodes (cum <= 0.23s)
Showing top 20 nodes out of 26
      flat  flat%   sum%        cum   cum%
    26.83s 59.02% 59.02%     33.83s 74.42%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
     2.22s  4.88% 63.90%      2.22s  4.88%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.73s  3.81% 67.71%      3.09s  6.80%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.57s  3.45% 71.16%      1.57s  3.45%  runtime.newdefer /home/kirr/src/tools/go/go/src/runtime/panic.go
     1.19s  2.62% 73.78%      3.33s  7.33%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
     1.15s  2.53% 76.31%     43.76s 96.26%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
     1.13s  2.49% 78.79%     45.11s 99.23%  command-line-arguments.xreadat /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go
     1.05s  2.31% 81.10%      1.05s  2.31%  internal/poll.(*fdMutex).rwlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.98s  2.16% 83.26%      0.98s  2.16%  internal/poll.(*fdMutex).rwunlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.92s  2.02% 85.28%      1.16s  2.55%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
     0.84s  1.85% 87.13%     42.08s 92.56%  internal/poll.(*FD).Pread /home/kirr/src/tools/go/go/src/internal/poll/fd_unix.go
     0.62s  1.36% 88.50%      2.59s  5.70%  runtime.deferproc /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.57s  1.25% 89.75%      0.57s  1.25%  runtime.getcallerpc /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
     0.52s  1.14% 90.89%     42.61s 93.73%  os.(*File).pread /home/kirr/src/tools/go/go/src/os/file_unix.go
     0.52s  1.14% 92.04%      0.53s  1.17%  runtime.freedefer /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.46s  1.01% 93.05%     34.29s 75.43%  syscall.Pread /home/kirr/src/tools/go/go/src/syscall/zsyscall_linux_amd64.go
     0.44s  0.97% 94.02%      1.06s  2.33%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go
     0.38s  0.84% 94.85%      1.43s  3.15%  internal/poll.(*FD).readLock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
     0.28s  0.62% 95.47%      0.55s  1.21%  internal/poll.(*pollDesc).prepareRead /home/kirr/src/tools/go/go/src/internal/poll/fd_poll_runtime.go
     0.27s  0.59% 96.06%     45.39s 99.85%  command-line-arguments.BenchmarkReadAt1 /home/kirr/src/wendelin/z/rust/bench/X/syspread_test.go
@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 25, 2017

Contributor

as the new FD.Pread unconditionally calls syscall.Pread without poller, we have to make sure fd.Sysfd is not in nonblocking mode - else EAGAIN could be returned to actual ReadAt user. Currently this property is implicitly satisfied to be so, at least on Linux, where epoll refuses to register fd of regular files. Maybe it is worth at least a comment.

There is a comment saying that using the poller doesn't make sense for pread. I thought about adding another comment, but I can't think of what it would say. If a file is seekable, I don't see how it could be pollable. We also don't check for EAGAIN in the Seek method. Happy to hear a suggestion.

on Windows Pread emulation uses syscall.Seek internally. Is it correct there to have only incref instead of full read lock?

The existing code works because it acquires the fd.l lock. This CL was a minimal code change to make it more likely to be correct. There may be some way on Windows to reduce the two locks to one, though simply falling back to readLock and writeLock doesn't work since both are permitted simultaneously.

I've also noticed that when there is syscall.Pread error we reset n to 0? To me this is legitimate situation when there is e.g. 4K read request and IO error is there after 1K has been read. So why on error we are not telling the caller 1K still has been successfully read?

The syscall.Pread function simply calls the system call pread, which only has a single return value, either -1 with an error or a non-negative count. It never returns both a non-negative count and an error.

In the context of #7970 I've noticed that the actual syscall.Close happens in FD.destroy which itself is called from under last decref. This is ok when poller is used - there awaiting for IO goroutines are first deregistered by pollDesc.evict but in case of blocking IO without poller (ReadAt case) FD.Close will only mark fd mutex as closed (so issuing new read/write requests will not be possible) but e.g. pread request that is already in flight won't be interrupted.

In case where pread could be blocking potentially indefinitely long (networked filesystems, slow IO with retries, ...) that could lead to deadlocks when client code expects Close to interrupt IO, then waits for goroutine doing IO to complete.

So what is the plan here? (offhand I would say that at least for fds not using poller syscall.Close should be called by FD.Close itself with protection from #7970-style races)

I don't know how much that is a real problem. Can you open a separate issue for it? Thanks.

Contributor

ianlancetaylor commented Apr 25, 2017

as the new FD.Pread unconditionally calls syscall.Pread without poller, we have to make sure fd.Sysfd is not in nonblocking mode - else EAGAIN could be returned to actual ReadAt user. Currently this property is implicitly satisfied to be so, at least on Linux, where epoll refuses to register fd of regular files. Maybe it is worth at least a comment.

There is a comment saying that using the poller doesn't make sense for pread. I thought about adding another comment, but I can't think of what it would say. If a file is seekable, I don't see how it could be pollable. We also don't check for EAGAIN in the Seek method. Happy to hear a suggestion.

on Windows Pread emulation uses syscall.Seek internally. Is it correct there to have only incref instead of full read lock?

The existing code works because it acquires the fd.l lock. This CL was a minimal code change to make it more likely to be correct. There may be some way on Windows to reduce the two locks to one, though simply falling back to readLock and writeLock doesn't work since both are permitted simultaneously.

I've also noticed that when there is syscall.Pread error we reset n to 0? To me this is legitimate situation when there is e.g. 4K read request and IO error is there after 1K has been read. So why on error we are not telling the caller 1K still has been successfully read?

The syscall.Pread function simply calls the system call pread, which only has a single return value, either -1 with an error or a non-negative count. It never returns both a non-negative count and an error.

In the context of #7970 I've noticed that the actual syscall.Close happens in FD.destroy which itself is called from under last decref. This is ok when poller is used - there awaiting for IO goroutines are first deregistered by pollDesc.evict but in case of blocking IO without poller (ReadAt case) FD.Close will only mark fd mutex as closed (so issuing new read/write requests will not be possible) but e.g. pread request that is already in flight won't be interrupted.

In case where pread could be blocking potentially indefinitely long (networked filesystems, slow IO with retries, ...) that could lead to deadlocks when client code expects Close to interrupt IO, then waits for goroutine doing IO to complete.

So what is the plan here? (offhand I would say that at least for fds not using poller syscall.Close should be called by FD.Close itself with protection from #7970-style races)

I don't know how much that is a real problem. Can you open a separate issue for it? Thanks.

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 25, 2017

Contributor

I'm not sure there is anything else to be done here. I'll leave this open for a few days for comments.

Contributor

ianlancetaylor commented Apr 25, 2017

I'm not sure there is anything else to be done here. I'll leave this open for a few days for comments.

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Apr 25, 2017

Contributor

Thanks for feedback.

  • I've spawned #20110 about Close vs ReadAt
  • thanks for clarifying about Windows lock & syscall.Pread returns which I missed
  • for me an example of both practically seekable and pollable file would be file located on network or otherwise active filesystem. From this point of view seekable and pollable properties are not mutually exclusive. If we grep linux filesystems sources for being pollable support we see:
kirr@deco:~/src/linux/linux/fs$ git grep '\.poll'
cachefiles/daemon.c:    .poll           = cachefiles_daemon_poll,
coda/psdev.c:   .poll           = coda_psdev_poll,
dlm/plock.c:    .poll    = dev_poll,
dlm/user.c:     .poll    = device_poll,
ecryptfs/miscdev.c:     .poll    = ecryptfs_miscdev_poll,
eventfd.c:      .poll           = eventfd_poll,
eventpoll.c:    .poll           = ep_eventpoll_poll,
fuse/cuse.c:    .poll                   = fuse_file_poll,
fuse/dev.c:     .poll           = fuse_dev_poll,
fuse/file.c:    .poll           = fuse_file_poll,
fuse/file.c:    .poll           = fuse_file_poll,
kernfs/file.c:  .poll           = kernfs_fop_poll,
notify/fanotify/fanotify_user.c:        .poll           = fanotify_poll,
notify/inotify/inotify_user.c:  .poll           = inotify_poll,
ocfs2/dlmfs/dlmfs.c:    .poll           = dlmfs_file_poll,
orangefs/devorangefs-req.c:     .poll = orangefs_devreq_poll
pipe.c: .poll           = pipe_poll,
proc/inode.c:   .poll           = proc_reg_poll,
proc/inode.c:   .poll           = proc_reg_poll,
proc/kmsg.c:    .poll           = kmsg_poll,
proc/proc_sysctl.c:     .poll           = proc_sys_poll,
proc_namespace.c:       .poll           = mounts_poll,
proc_namespace.c:       .poll           = mounts_poll,
signalfd.c:     .poll           = signalfd_poll,
timerfd.c:      .poll           = timerfd_poll,
userfaultfd.c:  .poll           = userfaultfd_poll,

so eventhough NFS is not there

  • notably FUSE is there (and this way any fuse-based filesystem which support polling, because e.g. it is synthethic with data depending on network or something else).
  • some files on /proc are there too - e.g. /proc/kmsg. Below is a demo showing /proc/kmsg is both seekable and pollable:

---- 8< ----

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>


int main() {
        int fd, epollfd;
        off_t off;
        struct epoll_event ev;

        fd = open("/proc/kmsg", O_RDWR);
        if (fd == -1) {
                perror("open");
                abort();
        }

        off = lseek(fd, 1, SEEK_CUR);
        if (off == -1) {
                perror("lseek");
                abort();
        }


        epollfd = epoll_create1(0);
        if (epollfd == -1) {
            perror("epoll_create1");
            abort();
        }

        ev.events = EPOLLIN;
        ev.data.fd = fd;
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
            perror("epoll_ctl: add");
            abort();
        }
}
root@deco:/home/kirr/tmp/trashme/t# strace ./a.out 
...
open("/proc/kmsg", O_RDWR)              = 3
lseek(3, 1, SEEK_CUR)                   = 1
epoll_create1(0)                        = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=11215853338799112195}}) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

---- 8< ----

  • the same way as /proc/kmsg is pollable and seekable many other files in /proc are pollable and seekable too (see proc/inode.c entry in above grep + below):
static const struct file_operations proc_reg_file_ops = {
        .llseek         = proc_reg_llseek,
        .read           = proc_reg_read,
        .write          = proc_reg_write,
        .poll           = proc_reg_poll,
        .unlocked_ioctl = proc_reg_unlocked_ioctl,
#ifdef CONFIG_COMPAT
        .compat_ioctl   = proc_reg_compat_ioctl,
#endif
        .mmap           = proc_reg_mmap,
        .get_unmapped_area = proc_reg_get_unmapped_area,
        .open           = proc_reg_open,
        .release        = proc_reg_release,
};
  • OCFS2 also supports both poll and llseek
  • NFS may add support for polling at some point.

So once again this shows that being seekable and being pollable is not generally mutually exclusive. From this point of view I would check the return from epoll_ctl when we initially try to register fd to the poller

https://github.com/golang/go/blob/516e6f6d/src/os/file_unix.go#L173
https://github.com/golang/go/blob/516e6f6d/src/os/file_unix.go#L91

and put that information into some flag to later know whether file descriptor was set to be non-blocking / registered to epoll.

Thanks again,
Kirill

Contributor

navytux commented Apr 25, 2017

Thanks for feedback.

  • I've spawned #20110 about Close vs ReadAt
  • thanks for clarifying about Windows lock & syscall.Pread returns which I missed
  • for me an example of both practically seekable and pollable file would be file located on network or otherwise active filesystem. From this point of view seekable and pollable properties are not mutually exclusive. If we grep linux filesystems sources for being pollable support we see:
kirr@deco:~/src/linux/linux/fs$ git grep '\.poll'
cachefiles/daemon.c:    .poll           = cachefiles_daemon_poll,
coda/psdev.c:   .poll           = coda_psdev_poll,
dlm/plock.c:    .poll    = dev_poll,
dlm/user.c:     .poll    = device_poll,
ecryptfs/miscdev.c:     .poll    = ecryptfs_miscdev_poll,
eventfd.c:      .poll           = eventfd_poll,
eventpoll.c:    .poll           = ep_eventpoll_poll,
fuse/cuse.c:    .poll                   = fuse_file_poll,
fuse/dev.c:     .poll           = fuse_dev_poll,
fuse/file.c:    .poll           = fuse_file_poll,
fuse/file.c:    .poll           = fuse_file_poll,
kernfs/file.c:  .poll           = kernfs_fop_poll,
notify/fanotify/fanotify_user.c:        .poll           = fanotify_poll,
notify/inotify/inotify_user.c:  .poll           = inotify_poll,
ocfs2/dlmfs/dlmfs.c:    .poll           = dlmfs_file_poll,
orangefs/devorangefs-req.c:     .poll = orangefs_devreq_poll
pipe.c: .poll           = pipe_poll,
proc/inode.c:   .poll           = proc_reg_poll,
proc/inode.c:   .poll           = proc_reg_poll,
proc/kmsg.c:    .poll           = kmsg_poll,
proc/proc_sysctl.c:     .poll           = proc_sys_poll,
proc_namespace.c:       .poll           = mounts_poll,
proc_namespace.c:       .poll           = mounts_poll,
signalfd.c:     .poll           = signalfd_poll,
timerfd.c:      .poll           = timerfd_poll,
userfaultfd.c:  .poll           = userfaultfd_poll,

so eventhough NFS is not there

  • notably FUSE is there (and this way any fuse-based filesystem which support polling, because e.g. it is synthethic with data depending on network or something else).
  • some files on /proc are there too - e.g. /proc/kmsg. Below is a demo showing /proc/kmsg is both seekable and pollable:

---- 8< ----

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>


int main() {
        int fd, epollfd;
        off_t off;
        struct epoll_event ev;

        fd = open("/proc/kmsg", O_RDWR);
        if (fd == -1) {
                perror("open");
                abort();
        }

        off = lseek(fd, 1, SEEK_CUR);
        if (off == -1) {
                perror("lseek");
                abort();
        }


        epollfd = epoll_create1(0);
        if (epollfd == -1) {
            perror("epoll_create1");
            abort();
        }

        ev.events = EPOLLIN;
        ev.data.fd = fd;
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
            perror("epoll_ctl: add");
            abort();
        }
}
root@deco:/home/kirr/tmp/trashme/t# strace ./a.out 
...
open("/proc/kmsg", O_RDWR)              = 3
lseek(3, 1, SEEK_CUR)                   = 1
epoll_create1(0)                        = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=11215853338799112195}}) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

---- 8< ----

  • the same way as /proc/kmsg is pollable and seekable many other files in /proc are pollable and seekable too (see proc/inode.c entry in above grep + below):
static const struct file_operations proc_reg_file_ops = {
        .llseek         = proc_reg_llseek,
        .read           = proc_reg_read,
        .write          = proc_reg_write,
        .poll           = proc_reg_poll,
        .unlocked_ioctl = proc_reg_unlocked_ioctl,
#ifdef CONFIG_COMPAT
        .compat_ioctl   = proc_reg_compat_ioctl,
#endif
        .mmap           = proc_reg_mmap,
        .get_unmapped_area = proc_reg_get_unmapped_area,
        .open           = proc_reg_open,
        .release        = proc_reg_release,
};
  • OCFS2 also supports both poll and llseek
  • NFS may add support for polling at some point.

So once again this shows that being seekable and being pollable is not generally mutually exclusive. From this point of view I would check the return from epoll_ctl when we initially try to register fd to the poller

https://github.com/golang/go/blob/516e6f6d/src/os/file_unix.go#L173
https://github.com/golang/go/blob/516e6f6d/src/os/file_unix.go#L91

and put that information into some flag to later know whether file descriptor was set to be non-blocking / registered to epoll.

Thanks again,
Kirill

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Apr 25, 2017

Contributor

For the reference here is what happens with Go on /proc/kmsg currently:

package main

import "os"

func main() {
        f, err := os.Open("/proc/kmsg")
        if err != nil {
                panic(err)
        }

        buf := make([]byte, 128)
        _, err = f.ReadAt(buf, 0)
        if err != nil {
                panic(err)
        }
}
root@deco:/home/kirr/tmp/trashme/t# ~kirr/src/tools/go/go/bin/go run q.go 
panic: read /proc/kmsg: resource temporarily unavailable

goroutine 1 [running]:
main.main()
        /home/kirr/tmp/trashme/t/q.go:14 +0xf3
exit status 2
root@deco:/home/kirr/tmp/trashme/t# strace ./q
...
openat(AT_FDCWD, "/proc/kmsg", O_RDONLY|O_CLOEXEC) = 3
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2744639232, u64=140229131886336}}) = 0
fcntl(3, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
pread64(3, 0xc42003bef8, 128, 0)        = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
write(2, "panic: ", 7panic: )                  = 7
...
Contributor

navytux commented Apr 25, 2017

For the reference here is what happens with Go on /proc/kmsg currently:

package main

import "os"

func main() {
        f, err := os.Open("/proc/kmsg")
        if err != nil {
                panic(err)
        }

        buf := make([]byte, 128)
        _, err = f.ReadAt(buf, 0)
        if err != nil {
                panic(err)
        }
}
root@deco:/home/kirr/tmp/trashme/t# ~kirr/src/tools/go/go/bin/go run q.go 
panic: read /proc/kmsg: resource temporarily unavailable

goroutine 1 [running]:
main.main()
        /home/kirr/tmp/trashme/t/q.go:14 +0xf3
exit status 2
root@deco:/home/kirr/tmp/trashme/t# strace ./q
...
openat(AT_FDCWD, "/proc/kmsg", O_RDONLY|O_CLOEXEC) = 3
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2744639232, u64=140229131886336}}) = 0
fcntl(3, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
pread64(3, 0xc42003bef8, 128, 0)        = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
write(2, "panic: ", 7panic: )                  = 7
...
@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 29, 2017

Contributor

Thanks for the example. /proc/kmsg is not a great file to use for a test since it can only be opened by root. Do you know if there is a similar example that can be opened by an ordinary user? The ones I've checked either do not support pread or always succeed.

Contributor

ianlancetaylor commented Apr 29, 2017

Thanks for the example. /proc/kmsg is not a great file to use for a test since it can only be opened by root. Do you know if there is a similar example that can be opened by an ordinary user? The ones I've checked either do not support pread or always succeed.

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 29, 2017

Contributor

By the way, as far as I can tell /proc/kmsg is a poor example anyhow. Yes, you can use pread on it, but the file position you pass seems to be ignored. The kernel really ought to return an ESPIPE error.

Contributor

ianlancetaylor commented Apr 29, 2017

By the way, as far as I can tell /proc/kmsg is a poor example anyhow. Yes, you can use pread on it, but the file position you pass seems to be ignored. The kernel really ought to return an ESPIPE error.

@gopherbot

This comment has been minimized.

Show comment
Hide comment

CL https://golang.org/cl/42144 mentions this issue.

@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux May 1, 2017

Contributor

@ianlancetaylor thanks for feedback. I did not noticed /proc/kmsg ignores position on read. Anyway e.g. /proc/self/mounts is an example of user-readable pollable & seekable file:

#include <sys/stat.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>


#define PATH    "/proc/self/mounts"

off_t xlseek(int fd, off_t off, int whence);
int readall(int fd, char *buf, int size);
void preadall(int fd, char *buf, int size);

int main() {
        int fd, epollfd, n;
        struct epoll_event ev;

        fd = open(PATH, O_RDONLY);
        if (fd == -1) {
                perror("open");
                abort();
        }

        xlseek(fd, 0, SEEK_CUR);
        fprintf(stderr, "%s: seekable\n", PATH);


        epollfd = epoll_create1(0);
        if (epollfd == -1) {
            perror("epoll_create1");
            abort();
        }

        ev.events = EPOLLIN;
        ev.data.fd = fd;
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
            perror("epoll_ctl: add");
            abort();
        }
        fprintf(stderr, "%s: pollable\n", PATH);

#define N 80    // bytes ro read
        char buf1[N], buf2[N];

        n = readall(fd, buf1, N);
        // XXX vvv ignores \0 in buf1
        printf("---- 8< ---- (%d bytes via read(2))\n%.*s\n---- 8< ----\n", n, n, buf1);

        preadall(fd, buf2, n);
        printf("---- 8< ---- (%d bytes via pread(2))\n%.*s\n---- 8< ----\n", n, n, buf2);
}

off_t xlseek(int fd, off_t off, int whence) {
        off = lseek(fd, off, whence);
        if (off == -1) {
                perror("lseek");
                abort();
        }
        return off;
}

// readall reads max size bytes of fd file via read(2) untill either all is read or eof
int readall(int fd, char *buf, int size) {
        int haveread, n;

        xlseek(fd, 0, SEEK_SET);
        haveread = 0;
        while (size > 0) {
                n = read(fd, buf, size);
                if (n == -1) {
                        perror("read");
                        abort();
                }
                if (n == 0) {
                        break;  // eof
                }

                haveread += n;
                size -= n;
                buf += n;
        }

        return haveread;
}

// preadall reads size bytes of fd file via pread(2) one byte at a time backwards
void preadall(int fd, char *buf, int size) {
        off_t off;
        int n;

        for (off = size - 1; off >= 0; off--) {
                n = pread(fd, &buf[off], 1, off);
                if (n == -1) {
                        perror("pread");
                        abort();
                }
                if (n == 0) {
                        fprintf(stderr, "@%ld: unexpected EOF", off);
                }
        }
}
kirr@deco:~/tmp/trashme/t$ strace -o x.log ./a.out 
/proc/self/mounts: seekable
/proc/self/mounts: pollable
---- 8< ---- (80 bytes via read(2))
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,n
---- 8< ----
---- 8< ---- (80 bytes via pread(2))
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,n
---- 8< ----

x.log:

open("/proc/self/mounts", O_RDONLY)     = 3
lseek(3, 0, SEEK_CUR)                   = 0
write(2, "/proc/self/mounts: seekable\n", 28) = 28
epoll_create1(0)                        = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=14460279055900999683}}) = 0
write(2, "/proc/self/mounts: pollable\n", 28) = 28
lseek(3, 0, SEEK_SET)                   = 0
read(3, "sysfs /sys sysfs rw,nosuid,nodev"..., 80) = 80
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
brk(NULL)                               = 0x56166cda3000
brk(0x56166cdc4000)                     = 0x56166cdc4000
write(1, "---- 8< ---- (80 bytes via read("..., 36) = 36
write(1, "sysfs /sys sysfs rw,nosuid,nodev"..., 53) = 53
write(1, "proc /proc proc rw,nosuid,n\n----"..., 41) = 41
pread64(3, "n", 1, 79)                  = 1
pread64(3, ",", 1, 78)                  = 1
pread64(3, "d", 1, 77)                  = 1
pread64(3, "i", 1, 76)                  = 1
pread64(3, "u", 1, 75)                  = 1
pread64(3, "s", 1, 74)                  = 1
pread64(3, "o", 1, 73)                  = 1
pread64(3, "n", 1, 72)                  = 1
pread64(3, ",", 1, 71)                  = 1
...

however as you say it "always succeed" in a sense that EAGAIN is never returned - the polling is there hooked to triger "readable" only when mounting table changes - e.g. on mount/umount/etc.


I've grasped through whole linux/fs/ to check where/how O_NONBLOCK is handled and what I can say is that no regular filesystem there supports non-blocking read or write - whenever O_NONBLOCK is handled the file has llseek either disabled or set to noop_llseek and pos in actually not handled in read which means the file in question is stream-like, not random-access. So this indeed seems to be kernel misbehaving.

I think part of the story here is that "ready for read" and "ready for write" semantics of poll is tied to streams - the "where" in questions above is implicitly "at the end". That's probably why networked filesystems such as NFS has no good way to tell callers that previously-reported EAGAIN region is now ready to e.g. read - the opration can return EAGAIN if O_NONBLOCK is set but with poll API there is no way to subscribe to it. So they this way they actually never return EAGAIN.

I guess the way to handle blocking "regular" IO would be to have some kind of IO requests and after submitting it to kernel be able to subscribe for completion - either waiting directly or via passing it somehow to epoll. I'm not sure this functionality currently exists on Linux though...


So about our original topic - EAGAIN being returned to ReadAt caller I think:

  • the bug is on kernel side (/proc/kmsg and similar cases)
  • it is thus ok to pass it through without workarounds
  • I would say all the above conversations deserve to have a comment inside poller code why even though if fd was set non-blocking, we can expect pread not to return EAGAIN. Probably in that comment it make sense to reference this issue too.

Sorry maybe for the noise and not coming up with the doc patch,
Kirill

Contributor

navytux commented May 1, 2017

@ianlancetaylor thanks for feedback. I did not noticed /proc/kmsg ignores position on read. Anyway e.g. /proc/self/mounts is an example of user-readable pollable & seekable file:

#include <sys/stat.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>

#include <stdio.h>
#include <stdlib.h>


#define PATH    "/proc/self/mounts"

off_t xlseek(int fd, off_t off, int whence);
int readall(int fd, char *buf, int size);
void preadall(int fd, char *buf, int size);

int main() {
        int fd, epollfd, n;
        struct epoll_event ev;

        fd = open(PATH, O_RDONLY);
        if (fd == -1) {
                perror("open");
                abort();
        }

        xlseek(fd, 0, SEEK_CUR);
        fprintf(stderr, "%s: seekable\n", PATH);


        epollfd = epoll_create1(0);
        if (epollfd == -1) {
            perror("epoll_create1");
            abort();
        }

        ev.events = EPOLLIN;
        ev.data.fd = fd;
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
            perror("epoll_ctl: add");
            abort();
        }
        fprintf(stderr, "%s: pollable\n", PATH);

#define N 80    // bytes ro read
        char buf1[N], buf2[N];

        n = readall(fd, buf1, N);
        // XXX vvv ignores \0 in buf1
        printf("---- 8< ---- (%d bytes via read(2))\n%.*s\n---- 8< ----\n", n, n, buf1);

        preadall(fd, buf2, n);
        printf("---- 8< ---- (%d bytes via pread(2))\n%.*s\n---- 8< ----\n", n, n, buf2);
}

off_t xlseek(int fd, off_t off, int whence) {
        off = lseek(fd, off, whence);
        if (off == -1) {
                perror("lseek");
                abort();
        }
        return off;
}

// readall reads max size bytes of fd file via read(2) untill either all is read or eof
int readall(int fd, char *buf, int size) {
        int haveread, n;

        xlseek(fd, 0, SEEK_SET);
        haveread = 0;
        while (size > 0) {
                n = read(fd, buf, size);
                if (n == -1) {
                        perror("read");
                        abort();
                }
                if (n == 0) {
                        break;  // eof
                }

                haveread += n;
                size -= n;
                buf += n;
        }

        return haveread;
}

// preadall reads size bytes of fd file via pread(2) one byte at a time backwards
void preadall(int fd, char *buf, int size) {
        off_t off;
        int n;

        for (off = size - 1; off >= 0; off--) {
                n = pread(fd, &buf[off], 1, off);
                if (n == -1) {
                        perror("pread");
                        abort();
                }
                if (n == 0) {
                        fprintf(stderr, "@%ld: unexpected EOF", off);
                }
        }
}
kirr@deco:~/tmp/trashme/t$ strace -o x.log ./a.out 
/proc/self/mounts: seekable
/proc/self/mounts: pollable
---- 8< ---- (80 bytes via read(2))
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,n
---- 8< ----
---- 8< ---- (80 bytes via pread(2))
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,n
---- 8< ----

x.log:

open("/proc/self/mounts", O_RDONLY)     = 3
lseek(3, 0, SEEK_CUR)                   = 0
write(2, "/proc/self/mounts: seekable\n", 28) = 28
epoll_create1(0)                        = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=14460279055900999683}}) = 0
write(2, "/proc/self/mounts: pollable\n", 28) = 28
lseek(3, 0, SEEK_SET)                   = 0
read(3, "sysfs /sys sysfs rw,nosuid,nodev"..., 80) = 80
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
brk(NULL)                               = 0x56166cda3000
brk(0x56166cdc4000)                     = 0x56166cdc4000
write(1, "---- 8< ---- (80 bytes via read("..., 36) = 36
write(1, "sysfs /sys sysfs rw,nosuid,nodev"..., 53) = 53
write(1, "proc /proc proc rw,nosuid,n\n----"..., 41) = 41
pread64(3, "n", 1, 79)                  = 1
pread64(3, ",", 1, 78)                  = 1
pread64(3, "d", 1, 77)                  = 1
pread64(3, "i", 1, 76)                  = 1
pread64(3, "u", 1, 75)                  = 1
pread64(3, "s", 1, 74)                  = 1
pread64(3, "o", 1, 73)                  = 1
pread64(3, "n", 1, 72)                  = 1
pread64(3, ",", 1, 71)                  = 1
...

however as you say it "always succeed" in a sense that EAGAIN is never returned - the polling is there hooked to triger "readable" only when mounting table changes - e.g. on mount/umount/etc.


I've grasped through whole linux/fs/ to check where/how O_NONBLOCK is handled and what I can say is that no regular filesystem there supports non-blocking read or write - whenever O_NONBLOCK is handled the file has llseek either disabled or set to noop_llseek and pos in actually not handled in read which means the file in question is stream-like, not random-access. So this indeed seems to be kernel misbehaving.

I think part of the story here is that "ready for read" and "ready for write" semantics of poll is tied to streams - the "where" in questions above is implicitly "at the end". That's probably why networked filesystems such as NFS has no good way to tell callers that previously-reported EAGAIN region is now ready to e.g. read - the opration can return EAGAIN if O_NONBLOCK is set but with poll API there is no way to subscribe to it. So they this way they actually never return EAGAIN.

I guess the way to handle blocking "regular" IO would be to have some kind of IO requests and after submitting it to kernel be able to subscribe for completion - either waiting directly or via passing it somehow to epoll. I'm not sure this functionality currently exists on Linux though...


So about our original topic - EAGAIN being returned to ReadAt caller I think:

  • the bug is on kernel side (/proc/kmsg and similar cases)
  • it is thus ok to pass it through without workarounds
  • I would say all the above conversations deserve to have a comment inside poller code why even though if fd was set non-blocking, we can expect pread not to return EAGAIN. Probably in that comment it make sense to reference this issue too.

Sorry maybe for the noise and not coming up with the doc patch,
Kirill

@aclements

This comment has been minimized.

Show comment
Hide comment
Member

aclements commented Jun 21, 2017

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Jun 21, 2017

Contributor

Closing this issue per comment on April 25. My conclusion from the follow-on discussion about /proc/kmsg is that nothing need be done.

Contributor

ianlancetaylor commented Jun 21, 2017

Closing this issue per comment on April 25. My conclusion from the follow-on discussion about /proc/kmsg is that nothing need be done.

@golang golang locked and limited conversation to collaborators Jun 21, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.