
net: add mechanism to wait for readability on a TCPConn #15735

Open
bradfitz opened this issue May 18, 2016 · 85 comments
Labels
Go2 NeedsDecision Thinking

@bradfitz (Contributor) commented May 18, 2016

EDIT: this proposal has shifted. See #15735 (comment) below.

Old:

The net/http package needs a way to wait for readability on a TCPConn without actually reading from it. (See #15224)

http://golang.org/cl/22031 added such a mechanism, making Read(0 bytes) do a wait for readability, followed by returning (0, nil). But maybe that is strange. Windows already works like that, though. (See new tests in that CL)

Reconsider this for Go 1.8.

Maybe we could add a new method to TCPConn instead, like WaitRead.

@bradfitz bradfitz added this to the Go1.8 milestone May 18, 2016
@bradfitz bradfitz self-assigned this May 18, 2016
@bradfitz (Contributor, Author) commented May 18, 2016

@gopherbot commented May 18, 2016

CL https://golang.org/cl/23227 mentions this issue.

gopherbot pushed a commit that referenced this issue May 19, 2016
Updates #15735

Change-Id: I42ab2345443bbaeaf935d683460fc2c941b7679c
Reviewed-on: https://go-review.googlesource.com/23227
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue May 19, 2016
Updates #15735.
Fixes #15741.

Change-Id: Ic4ad7e948e8c3ab5feffef89d7a37417f82722a1
Reviewed-on: https://go-review.googlesource.com/23199
Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@RalphCorderoy commented May 20, 2016

read(2) with a count of zero may be used to detect errors; the Linux man page confirms this, as does POSIX's read(3p). Mentioning it in case it bears on having a Read of 0 bytes wait for readability without ever calling syscall.Read.

@quentinmit quentinmit added the NeedsDecision label Oct 7, 2016
@bradfitz (Contributor, Author) commented Oct 21, 2016

I found a way to do without this in net/http, so punting to Go 1.9.

@bradfitz bradfitz added this to the Go1.9 milestone Oct 21, 2016
@bradfitz bradfitz removed this from the Go1.8 milestone Oct 21, 2016
@bradfitz (Contributor, Author) commented Dec 12, 2016

Actually, the more I think about this, the less I want my idle HTTP/RPC goroutines to stick around blocked in a read call. In addition to the array memory backing the slice given to Read, the goroutine itself is ~4KB of wasted memory.

What I'd really like is a way to register a func() to run when my *net.TCPConn is readable (when a Read call wouldn't block). By analogy, I want the time.AfterFunc efficiency of running a func in a goroutine later, rather than running a goroutine just to block in a time.Sleep.

My new proposal is more like:

package net

// OnReadable runs f in a new goroutine when c is readable;
// that is, when a call to c.Read will not block.
func (c *TCPConn) OnReadable(f func()) {
   // ...
}

Yes, maybe this is getting dangerously into event-based programming land.

Or maybe just the name ("OnWhatever") is offensive. Maybe there's something better.

I would use this in http, http2, and grpc.

/cc @ianlancetaylor @rsc

@ianlancetaylor (Contributor) commented Dec 12, 2016

Sounds like you are getting close to #15021.

I'm worried that the existence of such a method will encourage people to start writing their code as callbacks rather than as straightforward goroutines.

@bradfitz (Contributor, Author) commented Dec 12, 2016

Yeah. I'm conflicted. I see the benefits and the opportunity for overuse.

@dvyukov (Member) commented Jan 6, 2017

If we do OnReadable(f func()), won't we need to fork half of the standard library for async style? The compress, io, tls, etc. readers all assume blocking style and require a blocked goroutine.
I don't see any way to push data asynchronously into e.g. a gzip.Reader. Does this mean I have to choose between no blocked goroutine + my own gzip implementation, and a blocked goroutine + the standard library?

@dvyukov (Member) commented Jan 6, 2017

Re 0-sized reads.
It should work with level-triggered notifications, but netpoll uses epoll in edge-triggered mode (and kqueue too, iirc). I am concerned whether cl/22031 works in more complex cases: waiting for already-ready I/O, a double wait, waiting without completely draining the read buffer first, etc.

@bradfitz (Contributor, Author) commented Jan 6, 2017

@dvyukov, no, we would only use OnReadable in very high-level places, like the http1 and http2 servers where we know the conn is expected to be idle for long periods of time. The rest of the code underneath would remain in the blocking style.

@dvyukov (Member) commented Jan 6, 2017

This looks like a half-measure. An HTTP connection can stall in the middle of a request...

@bradfitz (Contributor, Author) commented Jan 6, 2017

@dvyukov, but not commonly. This would be an optimization for the common case.

@dvyukov (Member) commented Jan 7, 2017

An alternative interface would be to register a channel that receives readiness notifications. The other camp wants this for packet-processing servers, where starting a goroutine for every packet would be too expensive. However, if at the end you want a goroutine anyway, the channel introduces unnecessary overhead.
A channel also has a problem with overflow handling (netpoll can't block on send; on the other hand, it is not OK to lose notifications).
For completeness, this API should also handle writes.

@DemiMarie commented Jan 10, 2017

We need to make sure that this works with Windows IOCP as well.

@rsc (Contributor) commented Jan 10, 2017

Not obvious to me why the API has to handle writes. The thing about reads is that until the data is ready for reading, you can use the memory for other work. If you're waiting to write data, that memory is not reusable (otherwise you'd lose the data you are waiting to write).

@dvyukov (Member) commented Jan 11, 2017

@rsc If we do just 0-sized reads, then write support is not necessary. However, if we do Brad's "My new proposal is more like": func (c *TCPConn) OnReadable(f func()), then this equally applies to writes as well -- to avoid 2 blocked goroutines per connection.

@noblehng commented Feb 21, 2017

If memory usage is the concern, is it possible to make a long-parked G use less memory instead of changing the programming style? One main selling point of Go, for me, is highly efficient network servers without resorting to callbacks.

Something like shrinking the stack, or moving the stack to the heap, done by the GC using some heuristics: memory-wise that would be little different from spinning up a new goroutine per callback, and scheduling-wise a callback is not much different from goready(). I also assume the liveness change in Go 1.8 could help here.

As for the backing array: if it is a preallocated buffer, then a callback doesn't make much difference compared to Read(); it might make some difference if the buffer is allocated per callback and drawn from a pool.

Edit:
Actually we could keep a GC deadline or gopark time in runtime.pollDesc, so we could get a list of long-parked Gs from the poller; then the GC can kick in, but more work is still needed to avoid races and keep it fast.

@noblehng commented Feb 22, 2017

How about an epoll-like interface for net.Listener:

type PollableListener interface {
   net.Listener
   // Poll blocks until at least one connection is ready for reading or writing.
   // reads and writes are special net.Conns that will not block on EAGAIN.
   Poll() (reads []net.Conn, writes []net.Conn)
}

Then the caller of Poll() can have a small number of goroutines polling for readiness and handling the reads and writes. This should also work well for packet-processing servers.

Note that this only needs to be implemented in the runtime for those Listeners that are multiplexed in the kernel, like net.TCPListener. Other protocols that multiplex in userspace and aren't attached to the runtime poller directly, such as a UDP listener or streams multiplexed over a TCP connection, can be implemented outside the runtime. For example, for multiplexing over a TCP connection, we can implement the epoll-like behavior by reading from/writing to buffers and then polling them, or by registering callbacks on buffer-size changes.

Edit:
To implement this, we can let users of the runtime poller, like socket and os.File, provide a callback function pointer when opening the poller for an fd, to be notified of I/O readiness. The callback would look like:

type IOReadyNotify func(mode int32)

We store this in runtime.pollDesc, and the runtime.netpollready() function then also calls this callback, if non-nil, in addition to handing out the pending goroutine(s).

@aajtodd commented Feb 27, 2017

I'm fairly new to Go, but the callback interface is a little grating given the blocking API exposed everywhere else. Why not expose a public API to the netpoll interfaces?

Go provides no standard public-facing event loop (correct me if I'm wrong, please). I need to wait for readability on external FFI sockets (obtained through cgo). It would be nice to reuse the existing netpoll abstraction for those FFI sockets rather than having to wrap epoll/IOCP/select myself. Also, I'm guessing that wrapping e.g. epoll from the sys package does not integrate with the scheduler, which would also be a bummer.

@mjgarton (Contributor) commented Mar 15, 2017

For a number of my use cases, something like this :

package net

// Readable returns a channel which can be read from whenever a call to c.Read
// would not block.
func (c *TCPConn) Readable() <-chan struct{} {
        // ...
}

...would be nice because I can select on it. I have no idea whether it's practical to implement this, though.

Another alternative (for some of my cases at least) might be somehow enabling reads to be canceled by using a context.

@szuecs commented Jul 10, 2019

@ianlancetaylor + @bradfitz: a typical problem I have in an HTTP proxy is that connection spikes can create spikes in memory usage. I think this can be fixed by using epoll, and I hope your approach will cover the problem. We would need to be able to set the maximum concurrency level for the goroutine calls that read from and write to the sockets.
We have an internal protection mechanism to avoid this problem, but you can see the memory spike in bufio:

[screenshot: memory-usage spike in bufio]

@xtaci commented Nov 27, 2019

I badly need readability notification in my project, as I have to pre-allocate a 4k buffer per connection before conn.Read([]byte), just like io.Copy does:

https://golang.org/src/io/io.go?s=12796:12856#L399

UPDATE:
solved this by RawConn:
https://github.com/xtaci/kcptun/blob/v20191219/generic/rawcopy_unix.go

@eloff commented Dec 22, 2019

So there is RawConn now which has an interface very much like what @bradfitz was proposing here. However, it calls the read callback before calling wait for read. It must do this, as the net poller uses edge-triggered events - they won't fire if there is already data on the socket.

One workaround is to use a small stack buffer for the initial Read, and then when that reads some data, allocate and copy it to a real buffer, and then call Read again. That helps, but you'll still have the goroutine's 4KB stack overhead.

Another option is to use RawConn.Control with unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_DEFER_ACCEPT, 1) (note: TCP_DEFER_ACCEPT is a TCP-level option, set on the listening socket). For protocols like HTTP where the client must send data first, no goroutine is created until there's data buffered, but this only helps for that initial Read. It can be combined with the stack-buffer approach for long-lived connections.

I'd still like to see an approach (maybe an async OnReadable callback method on RawConn, like bradfitz is proposing here?) to avoid having the goroutine overhead. Mail.ru avoids the net poller entirely and manually manages epoll for precisely this reason, as 4KB per WebSocket connection with millions of open but mostly idle WebSockets waiting for new mail is just too much overhead.

@xtaci commented Dec 23, 2019

I'd still like to see an approach (maybe an async OnReadable callback method on RawConn, like bradfitz is proposing here?) to avoid having the goroutine overhead. Mail.ru avoids the net poller entirely and manually manages epoll for precisely this reason, as 4KB per WebSocket connection with millions of open but mostly idle WebSockets waiting for new mail is just too much overhead.

I don't think an OnReadable callback is a good idea. If I have two handlers to toggle based on incoming data, it's nearly impossible to code that kind of nested callbacks which reference each other.

For that reason, in order to write comprehensive logic we have to copy the buffer from the callback out to another goroutine; in that case memory usage will be out of control, as we lose the ability to push the congestion signal back to the senders. (Before, we wouldn't start the next net.Read() until we had processed the data.)

Even for a callback like func() { Read(); Process(); Write() }, if Write() blocks, the callback still holds a 4KB buffer waiting to be sent.

In all these cases, we still have inactive 4KB buffers somewhere.

@xtaci commented Jan 13, 2020

I wrote a library based on the ideas above, a Go async-I/O library:

https://github.com/xtaci/gaio

@ortuman commented Dec 29, 2020

For my particular use case, having a non-blocking Read is highly desirable: a chat service aiming to manage millions of connections per instance. The blocking interface forces an alive goroutine per connection, which makes this goal totally unrealistic.

@pmgexpo17 commented Mar 30, 2021

I wrote a library based on ideas above, golang async-io

https://github.com/xtaci/gaio

And it works great: it uses an epoll event handler and callbacks to solve the readability and writability issues discussed here (if I'm not mistaken).
I use it instead of a multiplexer for a message-proxy component which handles one-to-many client connections for inbound and outbound traffic.
It has a batch style of message processing which requires a message reader and a frame-size-delimited message protocol. That is, the message reader assembles a complete message from one or more partial messages provided in each event's result buffer.

@lesismal commented May 6, 2021

I wrote another async-I/O lib, nbio, to avoid using one or more goroutines per connection and to reduce memory usage.
It supports HTTP/1.x and is basically compatible with net/http, so many net/http-based web frameworks can easily be converted to use the async-I/O network layer.
It also supports TLS and WebSocket.
I am trying to increase the load capacity of a single machine; 1M (1000k) connections is no longer that hard for Go.

https://github.com/lesismal/nbio

@bcmills (Member) commented May 12, 2021

In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.

If the problem is the size of the idle goroutine stack, would it make sense to have the runtime recognize goroutines that are likely to block ~immediately and allocate a much smaller initial stack for them?

@networkimprov commented May 12, 2021

@bcmills, maybe that could be a new proposal?

@funny-falcon (Contributor) commented May 12, 2021

In addition to the array memory backed by the slice given to Read, the goroutine itself is ~4KB of wasted memory.

If the problem is the size of the idle goroutine stack, would it make sense to have the runtime recognize goroutines that are likely to block ~immediately and allocate a much smaller initial stack for them?

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

@ortuman commented Jun 17, 2021

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

At least, if we had goroutine memory pressure under control, the read-buffer issue might be tackled by extending the net.Conn API:

conn := getTLSConn()

_, err := conn.WaitRead() // block until there's available content
if err != nil {
    return err
}
rdBuff := grabReadBuffer()
defer releaseReadBuffer(rdBuff)

n, err := conn.ReadAvailable(rdBuff)
if err != nil {
    return err
}
handleContent(rdBuff)
return nil

@lesismal commented Sep 11, 2021

Problems are both goroutine stack and buffer for Read. Especially when TLS is involved.

> [quoting @ortuman's WaitRead / ReadAvailable sketch above]

I wrote an async-I/O lib, nbio. I also forked the std tls package and rewrote it to support non-blocking use.
Here are examples that save more memory:
lesismal/nbio#62 (comment)

In my simple echo test with 100k websocket TLS connections, compared with std-based frameworks:

| solution | qps | cpu | memory |
| --- | --- | --- | --- |
| nbio | avg 110-120k, running fine | around 300% | 1.3G |
| std | avg 60-80k, with obvious STW | around 300% | 3.3G |

@CAFxX (Contributor) commented Sep 18, 2021

If I may voice a contrary opinion, I would argue that there are other ways to obtain the same benefits (minimize space wasted on parked goroutine stacks and on buffers that are waiting for data to be written into them) without unleashing callbacks and/or readability signals.

One such way could be to have buffers be decommitted from memory before submitting them for a blocking read (I haven't tested this, but AFAIK it should work). Right now this would require an additional syscall (e.g. madvise) before the read, but once/if #31908 comes around that won't be required anymore (as the madvise and read could be submitted in a single syscall). Given page-aligned buffers of size greater than a page, this should transparently achieve the goal at least for the buffers.

For the goroutine stack, something similarly transparent could also be achieved: as part of GC we already shrink the stacks of goroutines, if they are too big. It is not impossible to imagine extending this mechanism to detect goroutines that have been parked for a while waiting for I/O and "freeze" their stacks, aggressively packing together their contents (without maintaining the minimum 4KB size), and freeing the stacks themselves. When the goroutine needs to be unparked a new stack of the appropriate size would be allocated, the frozen stack contents would be copied into the newly allocated stack and the goroutine would resume. This should transparently achieve the goal for the stacks of goroutines blocked on I/O.

I won't deny that there is a lot of handwaving in this comment: in my defense, I am not trying to provide a fully fleshed-out design. I'm just trying to point out that there may be other possibilities to achieve the same goals besides the mechanisms being discussed in the rest of this issue. Because these mechanisms, as has been pointed out already, are pretty non-go-like I would hope that any alternative that may achieve transparently the desirable goals set out in this issue is fully exhausted before more extreme options are implemented.

@dvyukov (Member) commented Sep 18, 2021

There are several potential issues with madvise:

  1. It conflicts with large 2M pages.
  2. The buffers are not necessarily page-aligned.
  3. The contents are not necessarily discardable (for a short read, the rest of the buffer must be left intact).

@CAFxX (Contributor) commented Sep 18, 2021

(sorry for being off topic, I will not add further comments in this issue; if needed I will update this comment to address further replies to avoid polluting the discussion with design considerations of a potential counterproposal)


IIUC, 1 and 2 have multiple potential solutions (document that only page-aligned buffers benefit from this optimization, teach the memory allocator to handle page-sized allocations specially, provide an allocator function specialized for this purpose, suggest using off-Go-heap memory, ...), so possibly these are not showstoppers (unless I'm missing something). (Update: furthermore, it seems that in practice page-sized allocations are already page-aligned.)

3 OTOH could be a showstopper (limited to the buffer part), but I'm confused. The io.Reader docs seem to indicate that it is not the case that the rest of the buffer must be left intact after a short read:

type Reader interface {
	Read(p []byte) (n int, err error)
}

[...] Even if Read returns n < len(p), it may use all of p as scratch space during the call. [...]

It only says that the space beyond n can be used as scratch space, but it does not say whether the contents of p[n:] are restored after the area has been used as scratch space. I always interpreted that to mean that p[n:] may not contain the original contents when Read returns; otherwise, to be able to use all of p as scratch space, Read would always require yet another scratch buffer to temporarily store a copy of p just in case of a short read (but then what would be the point of using p as the scratch space?). Furthermore, the caller knows n only once Read returns, so the "during" in the doc cannot just be a means to avoid data races in which Read and some other goroutine access the same slice. Did I get this wrong all this time?

@dvyukov (Member) commented Sep 18, 2021

[...] Even if Read returns n < len(p), it may use all of p as scratch space during the call. [...]

I think you are right. I just assumed it should not change what's not returned, following the principle of least surprise.
