proposal: bufio: add Reader.DiscardRunes(n int) and utf8: add RuneByteLen  #47621

@cafra

What version of Go are you using (go version)?

$ go version
go version go1.17.1 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/cz/Library/Caches/go-build"
GOENV="/Users/cz/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/cz/go/pkg/mod"
GONOPROXY="gitlab.alipay-inc.com/*"
GONOSUMDB="gitlab.alipay-inc.com/*"
GOOS="darwin"
GOPATH="/Users/cz/go"
GOPRIVATE="gitlab.alipay-inc.com/*"
GOPROXY="https://goproxy.io,direct"
GOROOT="/Users/cz/sdk/go1.17"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/cz/sdk/go1.17/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/cz/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/3g/6n_rr7kx61d3dbbjbsn394fr0000gp/T/go-build180543292=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

In service-mesh traffic management, we often need to parse request messages, and protocol parsing is the performance bottleneck.
To avoid parsing every request message in full, we added rule-based value extraction to the decoder of the hessian-go library.
In the Hessian protocol, the length recorded for a value counts characters rather than bytes; that is, it is a rune count.
For data that does not match any rule, we call bufio.Reader.DiscardRunes to skip it without decoding it, which speeds up parameter parsing.
The existing Discard can only skip a given number of bytes, which does not work for multi-byte UTF-8 data,
so this proposal adds bufio.Reader.DiscardRunes and utf8.RuneByteLen to meet that need.

What did you expect to see?

bufio.Reader gains DiscardRunes(n int), which skips the bytes that make up the next n characters (runes).
utf8 gains RuneByteLen, which returns the number of bytes in the first UTF-8 encoding in a byte slice.

  • bufio code
// DiscardRunes skips the next n runes, returning the number of bytes discarded.
//
// If DiscardRunes skips fewer than n runes, it also returns an error.
// If the next n runes are already fully buffered, DiscardRunes is guaranteed
// to succeed without reading from the underlying io.Reader.
func (b *Reader) DiscardRunes(n int) (discardedBytes int, err error) {
	if n < 0 {
		return 0, ErrNegativeCount
	}
	if n == 0 {
		return
	}
	// Discarding invalidates UnreadByte and UnreadRune, as in Discard.
	b.lastByte = -1
	b.lastRuneSize = -1
	for i := 0; i < n; i++ {
		for b.r+utf8.UTFMax > b.w && !utf8.FullRune(b.buf[b.r:b.w]) && b.err == nil && b.w-b.r < len(b.buf) {
			b.fill() // b.w-b.r < len(buf) => buffer is not full
		}
		if b.r == b.w {
			// Nothing left to skip: report the pending read error (for example io.EOF).
			return discardedBytes, b.readErr()
		}

		r, size := rune(b.buf[b.r]), 1
		if r >= utf8.RuneSelf {
			// Multi-byte lead byte: ask for the encoded length without decoding the rune.
			size = utf8.RuneByteLen(b.buf[b.r:b.w])
		}
		discardedBytes += size
		b.r += size
	}

	return discardedBytes, nil
}
  • utf8 code
// RuneByteLen returns the number of bytes of the first UTF-8 encoding in p.
// If p is empty it returns 0. Otherwise, if the encoding is invalid, it
// returns 1. A valid single-byte (ASCII) encoding also has length 1, so a
// result of 1 does not by itself indicate an invalid encoding.
//
// An encoding is invalid if it is incorrect UTF-8, encodes a rune that is
// out of range, or is not the shortest possible UTF-8 encoding for the
// value. No other validation is performed.
func RuneByteLen(p []byte) (size int) {
	n := len(p)
	if n < 1 {
		return 0
	}
	p0 := p[0]
	x := first[p0]
	if x >= as {
		return 1
	}
	sz := int(x & 7)
	accept := acceptRanges[x>>4]
	if n < sz {
		return 1
	}
	b1 := p[1]
	if b1 < accept.lo || accept.hi < b1 {
		return 1
	}
	if sz <= 2 { // <= instead of == to help the compiler eliminate some bounds checks
		return 2
	}
	b2 := p[2]
	if b2 < locb || hicb < b2 {
		return 1
	}
	if sz <= 3 {
		return 3
	}
	b3 := p[3]
	if b3 < locb || hicb < b3 {
		return 1
	}
	return 4
}
  • performance code
// BenchmarkDiscardVsRead skips 4097 three-byte runes per iteration.
// Constructing the buffer and Reader is part of each timed iteration.
func BenchmarkDiscardVsRead(b *testing.B) {
	const runes = 4097
	data := strings.Repeat("中", runes)

	b.Run("DiscardRunes", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buf := bytes.NewBufferString(data)
			r := NewReader(buf) // named r so it does not shadow b
			r.DiscardRunes(runes)
		}
	})

	b.Run("DiscardRunesCompare(NoRuneByteLen)", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buf := bytes.NewBufferString(data)
			r := NewReader(buf)
			r.DiscardRunesCompare(runes)
		}
	})

	b.Run("readRuneForDiscard", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buf := bytes.NewBufferString(data)
			r := NewReader(buf)
			for j := 0; j < runes; j++ {
				r.ReadRune()
			}
		}
	})
}


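// DiscardRunesCompare is the same as DiscardRunes but uses utf8.DecodeRune
// instead of utf8.RuneByteLen. It exists only for the benchmark comparison.
func (b *Reader) DiscardRunesCompare(n int) (discardedBytes int, err error) {
	if n < 0 {
		return 0, ErrNegativeCount
	}
	if n == 0 {
		return
	}
	for i := 0; i < n; i++ {
		for b.r+utf8.UTFMax > b.w && !utf8.FullRune(b.buf[b.r:b.w]) && b.err == nil && b.w-b.r < len(b.buf) {
			b.fill() // b.w-b.r < len(buf) => buffer is not full
		}
		if b.r == b.w {
			// Nothing left to skip: report the pending read error (for example io.EOF).
			return discardedBytes, b.readErr()
		}

		r, size := rune(b.buf[b.r]), 1
		if r >= utf8.RuneSelf {
			_, size = utf8.DecodeRune(b.buf[b.r:b.w])
		}
		discardedBytes += size
		b.r += size
	}

	return discardedBytes, nil
}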
  • performance data
goos: darwin
goarch: amd64
pkg: bufio
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkDiscardVsRead
BenchmarkDiscardVsRead/DiscardRunes-12           49599     23714 ns/op
BenchmarkDiscardVsRead/DiscardRunesCompare-12    42514     27172 ns/op
BenchmarkDiscardVsRead/readRuneForDiscard-12     37992     31682 ns/op
PASS

What did you see instead?

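Compared with calling ReadRune in a loop (31682 ns/op in the run above):
DiscardRunes without RuneByteLen (DiscardRunesCompare, 27172 ns/op) takes about 14% less time per operation.
DiscardRunes with RuneByteLen (23714 ns/op) takes about 25% less time per operation.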
