proposal: net: TCPConn supports Writev #13451

Closed
winlinvip opened this Issue Dec 2, 2015 · 64 comments

winlinvip commented Dec 2, 2015

I have searched go-nuts and Googled for writev support on Unix; it seems Go does not support it yet. I did find a project, vectorio, which claims to support writev, but it does not work.

I am rewriting SRS in Go as go-oryx. SRS can support 7.5k clients per CPU, while the Go version only supports 2k. A media streaming server assembles each video or audio packet from several parts, and writev lets it send them to the client without copying bytes.

I tried to use reflect to implement writev, but it is not feasible because pollDesc is not exported; the commit is here.

Does Go plan to support writev (netFD.Writev)?

winlinvip commented Dec 2, 2015

I want to explain why writev is important for a media streaming server:

  1. A media streaming server continuously delivers audio and video packets over each connection.
  2. These packets can be sent with a single writev to avoid making too many syscalls.
  3. writev also avoids copying the video payload into an intermediate buffer, which matters because payloads are large.

I have used CPU and memory profiles to attack the bottleneck, but the memory copies and syscalls from using netFD.Write still hurt performance.
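
For readers unfamiliar with the syscall, here is a minimal sketch of what writev(2) provides, using golang.org/x/sys/unix directly rather than the net package (the gap this proposal is about). The buffers and the use of stdout are illustrative:

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    header := []byte("HDR:")
    payload := []byte("payload bytes\n")

    // One writev(2) syscall sends both buffers, with no copy into an
    // intermediate big buffer. Unix-only; this bypasses the net package.
    n, err := unix.Writev(int(os.Stdout.Fd()), [][]byte{header, payload})
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("wrote %d bytes in one syscall", n)
}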

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Dec 2, 2015

Terry-Mao commented Dec 2, 2015

If Go had writev, it would be easy for us to implement a zero-copy service.

Terry-Mao commented Dec 12, 2015

@ianlancetaylor
As a network proxy: many network protocols have TWO parts, a HEADER part and a DATA part.
If netFD had writev, also supported by bufio, there would be no need to do this:

// reader & writer goroutines call Write concurrently, so we must lock.
lock.Lock()
w.Write(head)
w.Write(data)
lock.Unlock()

But netFD already takes fd.writeLock() internally, so the caller's lock.Lock() is wasted.
Something like this would be very nice:

// Write is atomic and goroutine-safe
w.Write(head, data)

func (w *Writer) Write(data ...[]byte) (n int, err error)
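
A sketch of such a wrapper, written against the net.Buffers API that Go eventually shipped in 1.8 (it did not exist when this comment was made). The single lock covers one vectored write, so header and data cannot interleave with writes from other goroutines:

package main

import (
    "net"
    "sync"
)

// Writer serializes vectored writes to a shared connection.
type Writer struct {
    mu   sync.Mutex
    conn net.Conn
}

// Write sends all buffers atomically with respect to other Write calls.
func (w *Writer) Write(data ...[]byte) (n int, err error) {
    w.mu.Lock()
    defer w.mu.Unlock()
    bufs := net.Buffers(data) // may become a single writev on supported systems
    nn, err := bufs.WriteTo(w.conn)
    return int(nn), err
}
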
winlinvip commented Dec 13, 2015

For a streaming service, for example, sending a video packet to an RTMP (over TCP) client, the video might be:

video=make([]byte, 156 * 1024)

Then we need to insert a small header every so many video bytes, for instance every 10k:

h0 = make([]byte, 12)
p0 = video[0:10240]

h1 = make([]byte, 5)
p1 = video[10240:20480]

......

// ... and so on until hN and pN
hN = make([]byte, 5)
pN = video[x:]

Right now, we send the data in a very, very slow way:

// merge everything into one big buffer
var bigBuffer bytes.Buffer
for _, b := range [][]byte{h0, p0, h1, p1 /* ..., hN, pN */} {
    bigBuffer.Write(b)
}

// send with one syscall
netFD.Write(bigBuffer.Bytes())

We do this because a syscall costs more than copying the buffer; the obvious alternative is far slower:

// very, very slow: one syscall per buffer
for _, b := range [][]byte{h0, p0, h1, p1 /* ..., hN, pN */} {
    netFD.Write(b)
}

When Go supports writev, we will be able to send everything in one syscall without copying into a big buffer:

// high-efficiency writev for a streaming server
Write(h0, p0, h1, p1, ......, hN, pN)

davecheney (Contributor) commented Dec 13, 2015

// merge everything into one big buffer
var bigBuffer bytes.Buffer
for _, b := range [][]byte{h0, p0, h1, p1 /* ..., hN, pN */} {
    bigBuffer.Write(b)
}

h0, p0, etc. all derive from a single []byte slice; why slice them up, then copy them back together? Sending the original slice would avoid the copy, and avoid extending the net.Conn interface.

winlinvip commented Dec 14, 2015

@davecheney No, only p0, p1, ..., pN are slices of video; h0, h1, ..., hN come from a separate buffer. That is, we slice the video payload:

// slicing the payload is fine; no copy here.
video = p0 + p1 + ... + pN

But the headers live in a separate buffer:

headers = h0 + h1 + ... + hN

We must interleave headers and payloads into one big buffer:

big-buffer = h0 + p0 + h1 + p1 + ... + hN + pN

because each header belongs to its own payload; for instance, hN is only for pN. In C/C++, when writing to a socket, we can use writev:

struct iovec iovs[(N + 1) * 2];
for (int i = 0; i <= N; i++) {
    iovs[2*i].iov_base   = h[i];   /* header */
    iovs[2*i].iov_len    = hlen[i];
    iovs[2*i+1].iov_base = p[i];   /* payload */
    iovs[2*i+1].iov_len  = plen[i];
}
writev(fd, iovs, (N + 1) * 2);

Does that make sense?

ianlancetaylor (Contributor) commented Dec 14, 2015

I'm fine with adding a Writev call to the net package. I'm not aware of anybody actually working on it.

winlinvip commented Dec 14, 2015

@ianlancetaylor Great~

I created SRS, am rewriting it in Go as go-oryx, and wrote the benchmark tool srs-bench for streaming servers.

SRS can serve 7.5k clients on 1 CPU (about 80% usage) at a bandwidth of about 3.75 Gbps. SRS is very efficient because it uses writev to avoid copies and keep syscalls rare. But SRS is a single-process model, because a multi-process streaming server is too complex.

Nginx-RTMP, a plugin for nginx, only supports 2k clients per CPU, but it does support multiple processes.

I created go-oryx because I want Go's multi-core support. Right now, after lots of performance refinement, go-oryx can support 8k clients on 4 CPUs. If Go supported writev, I think performance could improve 200% to 400%, that is, about 16k to 32k clients on 4 CPUs. That would be really awesome.

winlinvip commented Dec 16, 2015

@davecheney @ianlancetaylor What is the state of this issue now? Accepted or postponed? Is there any plan?

bradfitz (Member) commented Dec 16, 2015

@winlinvip, are you sure your problem is Writev and not, say, allocations? Some of your examples above look very allocation-happy.

Before accepting this proposal, I'd want to see before & after programs with numbers and profiles to show that it helps enough. There will definitely be pushback about expanding the net package's API, which many feel is already too large.

That is, convince us with data perhaps. Implement this in the net package and then show us numbers with how much it helped. We have to be able to reproduce the numbers.
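
A sketch of the kind of before/after comparison being requested. net.Pipe avoids real sockets, so this isolates the copying cost rather than the syscall cost (a real TCP connection would be needed for that); all names and sizes here are illustrative:

package writev_test

import (
    "bytes"
    "io"
    "net"
    "testing"
)

// benchConn returns a connection whose peer discards everything written.
func benchConn(b *testing.B) net.Conn {
    c1, c2 := net.Pipe()
    go io.Copy(io.Discard, c2)
    b.Cleanup(func() { c1.Close(); c2.Close() })
    return c1
}

func BenchmarkWritePerBuffer(b *testing.B) {
    conn := benchConn(b)
    header, payload := make([]byte, 12), make([]byte, 10240)
    for i := 0; i < b.N; i++ {
        conn.Write(header) // one Write call per buffer
        conn.Write(payload)
    }
}

func BenchmarkWriteBigBuffer(b *testing.B) {
    conn := benchConn(b)
    header, payload := make([]byte, 12), make([]byte, 10240)
    var buf bytes.Buffer
    for i := 0; i < b.N; i++ {
        buf.Reset()
        buf.Write(header) // copy both buffers first...
        buf.Write(payload)
        conn.Write(buf.Bytes()) // ...then a single Write call
    }
}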

winlinvip commented Dec 16, 2015

@bradfitz I tried the no-copy version, sending each []byte one by one, but that causes lots of syscalls, which hurts performance even more. So I copy the data into a big buffer and send it with one syscall. The big-buffer solution is better than one-by-one writes; I have profiled it and tested it with mock clients. Please read ossrs/go-oryx#20

winlinvip commented Dec 16, 2015

@bradfitz Also, writev is not a new API or language feature; it exists on Linux, Unix, and every Unix-like OS. It is very useful for high-performance servers; I found that nginx uses writev too:

find . -name "*.c"|xargs grep -in "= writev("
./os/unix/ngx_darwin_sendfile_chain.c:285:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_files.c:229:        n = writev(file->fd, vec.elts, vec.nelts);
./os/unix/ngx_freebsd_sendfile_chain.c:336:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_linux_sendfile_chain.c:294:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_writev_chain.c:113:        n = writev(c->fd, vec.elts, vec.nelts);
winlinvip commented Dec 16, 2015

It seems writev was standardized in 2001: https://en.wikipedia.org/wiki/Vectored_I/O

Standards bodies document the applicable functions readv[1] and writev[2] 
in POSIX 1003.1-2001 and the Single UNIX Specification version 2.

kostya-sh (Contributor) commented Dec 16, 2015

@winlinvip, can you try modifying variant 6 to avoid allocating the group buffers, by populating buf (https://github.com/winlinvip/go-writev/blob/master/golang/server.go#L123) directly? I think this will be roughly equivalent to calling writev on multiple buffers. What does your benchmark show for this program?

winlinvip commented Dec 17, 2015

@kostya-sh Did you compare with the C++ version? Allocation is not the bottleneck; the memory copies and syscalls are.

winlinvip commented Dec 17, 2015

This is a very basic problem for servers. Even though Go is a modern language, it compiles to native code and runs on Linux, where memory copies and syscalls are always the server bottleneck; I have profiled it.

bradfitz (Member) commented Dec 17, 2015

Please post a profile here.

kostya-sh (Contributor) commented Dec 17, 2015

@winlinvip, by removing allocations from your test program (variant 6) you will end up with a single syscall per write. That lets you estimate how fast your Go program would be if it used writev. If the improvement is significant, that is an argument for adding writev.

If you can implement writev in your local copy of the Go standard library, test the performance, and post numbers here, that would be even better.

winlinvip commented Dec 17, 2015

@kostya-sh I will try adding my writev version to the Go standard library and report the results.

winlinvip commented Dec 17, 2015

I will test the real streaming server go-oryx later.

ggaaooppeenngg commented Dec 17, 2015

@winlinvip It seems the syscall count grows; I guess other syscalls arise as a side effect. From the diff, maybe some of these calls?

    18.22s  5.47% 44.58%        38s 11.41%  runtime.scanobject
    12.18s  3.66% 48.24%     28.46s  8.55%  runtime.selectgoImpl
     9.93s  2.98% 51.22%      9.93s  2.98%  runtime.heapBitsForObject
     8.89s  2.67% 53.89%     13.78s  4.14%  runtime.greyobject
     8.17s  2.45% 56.35%     37.90s 11.38%  runtime.mallocgc

bradfitz (Member) commented Dec 17, 2015

@ggaaooppeenngg, selectgoImpl is about select. The rest is garbage collection: paying the cost of allocating memory.

winlinvip commented Dec 18, 2015

A similar implementation of writev from a coroutine library:

ssize_t st_writev(_st_netfd_t *fd, const struct iovec *iov, int iov_size, st_utime_t timeout)
{
    ssize_t n, rv;
    size_t nleft, nbyte;
    int index, iov_cnt;
    struct iovec *tmp_iov;
    struct iovec local_iov[_LOCAL_MAXIOV];

    /* Calculate the total number of bytes to be sent */
    nbyte = 0;
    for (index = 0; index < iov_size; index++) {
        nbyte += iov[index].iov_len;
    }

    rv = (ssize_t)nbyte;
    nleft = nbyte;
    tmp_iov = (struct iovec *) iov; /* we promise not to modify iov */
    iov_cnt = iov_size;

    while (nleft > 0) {
        if (iov_cnt == 1) {
            if (st_write(fd, tmp_iov[0].iov_base, nleft, timeout) != (ssize_t) nleft) {
                rv = -1;
            }
            break;
        }

        if ((n = writev(fd->osfd, tmp_iov, iov_cnt)) < 0) {
            if (errno == EINTR) {
                continue;
            }
            if (!_IO_NOT_READY_ERROR) {
                rv = -1;
                break;
            }
        } else {
            if ((size_t) n == nleft) {
                break;
            }
            nleft -= n;
            /* Find the next unwritten vector */
            n = (ssize_t)(nbyte - nleft);
            for (index = 0; (size_t) n >= iov[index].iov_len; index++) {
                n -= iov[index].iov_len;
            }

            if (tmp_iov == iov) {
                /* Must copy iov's around */
                if (iov_size - index <= _LOCAL_MAXIOV) {
                    tmp_iov = local_iov;
                } else {
                    tmp_iov = calloc(1, (iov_size - index) * sizeof(struct iovec));
                    if (tmp_iov == NULL) {
                        return -1;
                    }
                }
            }

            /* Fill in the first partial read */
            tmp_iov[0].iov_base = &(((char *)iov[index].iov_base)[n]);
            tmp_iov[0].iov_len = iov[index].iov_len - n;
            index++;
            /* Copy the remaining vectors */
            for (iov_cnt = 1; index < iov_size; iov_cnt++, index++) {
                tmp_iov[iov_cnt].iov_base = iov[index].iov_base;
                tmp_iov[iov_cnt].iov_len = iov[index].iov_len;
            }
        }

        /* Wait until the socket becomes writable */
        if (st_netfd_poll(fd, POLLOUT, timeout) < 0) {
            rv = -1;
            break;
        }
    }

    if (tmp_iov != iov && tmp_iov != local_iov) {
        free(tmp_iov);
    }

    return rv;
}
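
The essential bookkeeping in the loop above (after a short write, drop the fully sent vectors, trim the first partially sent one, and retry) looks like this in Go terms; this is a sketch of what the eventual net.Buffers implementation does internally:

// consume advances the buffer list past n written bytes: fully written
// buffers are dropped, and the first partially written one is trimmed.
func consume(v *[][]byte, n int64) {
    for len(*v) > 0 {
        b := (*v)[0]
        if n < int64(len(b)) {
            (*v)[0] = b[n:] // partial write: keep the unwritten tail
            return
        }
        n -= int64(len(b))
        *v = (*v)[1:] // fully written: drop this buffer
    }
}
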
Terry-Mao commented Dec 18, 2015

@winlinvip In your C version, struct iovec local_iov[_LOCAL_MAXIOV]; is allocated on the stack; your Go version allocates on the heap.

winlinvip commented Dec 18, 2015

@bradfitz Below are my research results with go-oryx:

OS     API     CPU   MEM   GC    Connections  Bitrate
Linux  Write   160%  5.4G  40ms  10k          300kbps
Linux  Writev  140%  1.5G  30ms  10k          300kbps

Conclusions:

  1. The writev CPU usage (140%) is almost the same as the big-buffer write (160%).
  2. writev (1.5G) uses much less memory than write (5.4G), because writev avoids copying into a big buffer.
  3. writev GC time (30ms) is shorter than write (40ms).

winlinvip commented Dec 18, 2015

Will Go accept the proposal to support netFD.Writev?

bradfitz (Member) commented Dec 18, 2015

@winlinvip, thanks for prototyping.

Can you make a small stand-alone program for profiling that can do either multiple writes or writev (controlled by a flag) and link to the code so others can reproduce your results and investigate more?

I also have an API idea. Instead of adding API like (*TCPConn).Writev([][]byte), we could have a special type that implements io.WriterTo that represents the multiple buffers (the [][]byte) and TCPConn's ReadFrom can special-case it into a writev internally, just like it does sendfile.

Then we can add the writev optimization gradually to different OS/conn types over time.
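
In usage terms the idea is roughly this sketch (the type name and buffers are hypothetical at this point in the thread):

header := []byte("h")          // illustrative buffers
payload := make([]byte, 10240)

// The multi-buffer type implements io.WriterTo, so io.Copy (or the conn's
// ReadFrom, special-casing it the way sendfile is special-cased) can issue
// a single writev.
bufs := net.Buffers{header, payload}
if _, err := io.Copy(conn, &bufs); err != nil {
    log.Fatal(err)
}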

winlinvip commented Dec 24, 2015

@Terry-Mao @davecheney @ianlancetaylor @bradfitz @kostya-sh @ggaaooppeenngg @mikioh @extemporalgenome @minux Seems writev is attractive to all of us; please update this post when writev is discussed in February 😸

extemporalgenome commented Dec 25, 2015

@minux ah. I misread it as writing one buffer to multiple fds (and I wasn't suggesting modifying io.MultiWriter itself, but rather adding a specialized function [which is now moot] to the net package).

spance commented Feb 4, 2016

Also, a new API method should be added to support sending multiple WSABufs with WSASend/WSASendTo (similar to writev() and struct iovec) on Windows.

// src/net/fd_windows.go:L502 Single WSABuf sending

func (fd *netFD) Write(buf []byte) (int, error) {
    if err := fd.writeLock(); err != nil {
        return 0, err
    }
    defer fd.writeUnlock()
    if raceenabled {
        raceReleaseMerge(unsafe.Pointer(&ioSync))
    }
    o := &fd.wop
    o.InitBuf(buf) // would need to initialize multiple WSABufs
    n, err := wsrv.ExecIO(o, "WSASend", func(o *operation) error {
        // here args 2 and 3 would become a *WSABuf array and a buffer count
        return syscall.WSASend(o.fd.sysfd, &o.buf, 1, &o.qty, 0, &o.o, nil)
    })
    })
    if _, ok := err.(syscall.Errno); ok {
        err = os.NewSyscallError("wsasend", err)
    }
    return n, err
}
mbeacom commented Apr 21, 2016

@bradfitz ping.
Would it be possible for this to be revisited, now that Go 1.6 has been released?

Thank you!

bradfitz (Member) commented Apr 21, 2016

@mbeacom, sorry, this also will not make Go 1.7. The 1.7 development freeze is May 1st and this isn't ready.

@bradfitz bradfitz self-assigned this Aug 22, 2016

adg (Contributor) commented Sep 26, 2016

ping @bradfitz

mbeacom commented Sep 27, 2016

@bradfitz What is the likelihood of this moving into the Go 1.8 Maybe milestone?

bradfitz (Member) commented Sep 27, 2016

@ianlancetaylor and I are happy with this.

API-wise, though, I'm thinking:

package net

// Buffers contains zero or more runs of bytes to write.
//
// On certain machines, for certain types of connections, this is
// optimized into an OS-specific batch write operation.
type Buffers [][]byte

func (v Buffers) WriteTo(w io.Writer) (n int64, err error)
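
Caller-side usage would look like this sketch (the endpoint and buffers are illustrative):

header := []byte("h")          // illustrative buffers
payload := make([]byte, 10240)

conn, err := net.Dial("tcp", "example.com:1935") // illustrative RTMP endpoint
if err != nil {
    log.Fatal(err)
}
defer conn.Close()

bufs := net.Buffers{header, payload} // no copy into one big buffer
// On supported systems this becomes a single writev; elsewhere it falls
// back to sequential writes.
if _, err := bufs.WriteTo(conn); err != nil {
    log.Fatal(err)
}
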
cespare (Contributor) commented Sep 27, 2016

@bradfitz would we do func (v Buffers) ReadFrom(r io.Reader) (n int64, err error) for readv as well?

(Your proposed docstring says that Buffers is "bytes to write".)

bradfitz (Member) commented Sep 27, 2016

Perhaps. But let's leave readv for another day.

quentinmit (Contributor) commented Sep 27, 2016

@bradfitz: What is the expected behavior if WriteTo is called with zero runs of bytes to write? I would prefer that it be documented. (Both "does nothing" and "calls write with a zero-length buffer" are acceptable responses, I think.)

bradfitz (Member) commented Sep 27, 2016

Why would a user care? I'd prefer the future implementation freedom and documentation clarity from leaving it unspecified.

Also, do you mean zero overall or an entry with zero length? Because currently (for ease of implementation), I do nothing if the overall vector is zero, but if one of the vector entries is empty, I still include it in the set, but zero length. I'd prefer to not get into that level of detail in docs.

bradfitz (Member) commented Sep 27, 2016

I sent out https://golang.org/cl/29951

It works, but the thing I don't like about it is that net.Buffers, despite being an io.WriterTo, is not an io.Reader, so it can't be used as the src of an io.Copy call, which is kinda weird. The whole point of io.ReaderFrom and io.WriterTo was to optimize io.Copy, so it feels weird to have any WriterTo which is not a Reader.

But I can't make net.Buffers a Reader because then it would need a pointer receiver and record state, which is kinda disgusting, especially if the writev EAGAIN loop already needs to maintain that state (see the CL). I'd prefer that state be local to the loop and not required to be allocated by the user in every write call, since the whole point of this is to avoid allocations.

gopherbot commented Sep 27, 2016

CL https://golang.org/cl/29951 mentions this issue.

quentinmit (Contributor) commented Sep 27, 2016

I think a user would care because we might have the behavior that a Conn that is closed or otherwise unusable might or might not return an error if you try to write zero bytes. That seems confusing to me. I guess it looks like we currently also don't promise what will happen if Write is called with a zero-length slice?

bradfitz (Member) commented Sep 27, 2016

unusable might or might not return an error if you try to write zero bytes.

The initial validity check in the CL (if !c.ok() {) would still catch that. That's like all the methods which operate on the netFD.

I guess it looks like we currently also don't promise what will happen if Write is called with a zero-length slice?

Exactly. Zero-length reads and writes have historically been very ill-defined in Go. In general, we promise nothing.

winlinvip commented Sep 30, 2016

👍 Nice writev~

gopherbot commented Oct 2, 2016

CL https://golang.org/cl/30102 mentions this issue.

gopherbot commented Oct 29, 2016

CL https://golang.org/cl/32371 mentions this issue.

gopherbot pushed a commit that referenced this issue Nov 2, 2016

net: implement Buffers on windows
Updates #13451

Change-Id: I2c3c66d9532c16e616c476e2afe31b3ddc0a8d79
Reviewed-on: https://go-review.googlesource.com/32371
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>

@golang golang locked and limited conversation to collaborators Oct 31, 2017
