Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: emit optimized assembly for checking the first chars before `{strings|bytes}.IndexByte` call #25844

Closed
valyala opened this issue Jun 12, 2018 · 5 comments

Comments

@valyala
Copy link
Contributor

commented Jun 12, 2018

The issue

Currently strings.IndexByte (and bytes.IndexByte) call overhead makes these functions slower comparing to a plain check for the first few bytes on amd64. The following benchmark shows the issue:

package indexbyte_test

import (
        "strings"
        "testing"
)

func stdFunc(s string, ch byte) int {
        // Just call strings.IndexByte
        return strings.IndexByte(s, ch)
}

func prefixCheckFunc(s string, ch byte) int {
        if len(s) >= 4 {
                // Search for ch in the first few bytes of s before resorting
                // to strings.IndexByte call.
                // The compiler may substitute this by compact optimized assembly in order
                // to reduce overhead when ch isn't found in the first few bytes.
                if s[0] == ch {
                        return 0
                }
                if s[1] == ch {
                        return 1
                }
                if s[2] == ch {
                        return 2
                }
                if s[3] == ch {
                        return 3
                }
                return strings.IndexByte(s, ch)
        }

        // The code for short s completely skips strings.IndexByte call overhead.
        if len(s) == 0 {
                return -1
        }
        if len(s) == 1 {
                if s[0] == ch {
                        return 0
                }
                return -1
        }
        if len(s) == 2 {
                if s[0] == ch {
                        return 0
                }
                if s[1] == ch {
                        return 1
                }
                return -1
        }
        if len(s) == 3 {
                if s[0] == ch {
                        return 0
                }
                if s[1] == ch {
                        return 1
                }
                if s[2] == ch {
                        return 2
                }
                return -1
        }
        return -1
}

func BenchmarkCommaSearch(b *testing.B) {
        for _, s := range []string{
                        "",
                        "1",
                        "12",
                        "123",
                        "1234",
                        "12345",
                        "123456",
                        ",",
                        "1,",
                        "12,",
                        "123,",
                        "1234,",
                        "12345,",
                        "123456,",
                        ", 123, 4567890 qwertyuiop",
                        "1, 23, 4567890 qwertyuiop",
                        "12, 34, 567890 qwertyuiop",
                        "123, 4, 567890 qwertyuiop",
                        "1234, 5, 67890 qwertyuiop",
                        "12345, 6, 7890 qwertyuiop",
                        "123456, 7, 890 qwertyuiop",
                } {
                b.Run(s, func(b *testing.B) {
                        b.Run("std", func(b *testing.B) {
                                for i := 0; i < b.N; i++ {
                                        Sink += stdFunc(s, ',')
                                }
                        })
                        b.Run("prefixCheck", func(b *testing.B) {
                                for i := 0; i < b.N; i++ {
                                        Sink += prefixCheckFunc(s, ',')
                                }
                        })
                })
        }
}

var Sink int

Benchmark results:

$ go test -bench=CommaSearch
goos: linux
goarch: amd64
BenchmarkCommaSearch/#00/std-4     	300000000	         5.39 ns/op
BenchmarkCommaSearch/#00/prefixCheck-4         	500000000	         3.04 ns/op
BenchmarkCommaSearch/1/std-4                   	300000000	         5.72 ns/op
BenchmarkCommaSearch/1/prefixCheck-4           	500000000	         3.24 ns/op
BenchmarkCommaSearch/12/std-4                  	300000000	         5.88 ns/op
BenchmarkCommaSearch/12/prefixCheck-4          	500000000	         3.46 ns/op
BenchmarkCommaSearch/123/std-4                 	300000000	         5.52 ns/op
BenchmarkCommaSearch/123/prefixCheck-4         	500000000	         3.94 ns/op
BenchmarkCommaSearch/1234/std-4                	300000000	         5.92 ns/op
BenchmarkCommaSearch/1234/prefixCheck-4        	200000000	         6.22 ns/op
BenchmarkCommaSearch/12345/std-4               	200000000	         5.68 ns/op
BenchmarkCommaSearch/12345/prefixCheck-4       	200000000	         6.32 ns/op
BenchmarkCommaSearch/123456/std-4              	300000000	         5.97 ns/op
BenchmarkCommaSearch/123456/prefixCheck-4      	200000000	         6.24 ns/op
BenchmarkCommaSearch/,/std-4                   	300000000	         6.17 ns/op
BenchmarkCommaSearch/,/prefixCheck-4           	500000000	         3.30 ns/op
BenchmarkCommaSearch/1,/std-4                  	200000000	         6.49 ns/op
BenchmarkCommaSearch/1,/prefixCheck-4          	500000000	         3.76 ns/op
BenchmarkCommaSearch/12,/std-4                 	300000000	         6.23 ns/op
BenchmarkCommaSearch/12,/prefixCheck-4         	500000000	         3.74 ns/op
BenchmarkCommaSearch/123,/std-4                	200000000	         6.08 ns/op
BenchmarkCommaSearch/123,/prefixCheck-4        	300000000	         4.04 ns/op
BenchmarkCommaSearch/1234,/std-4               	200000000	         6.23 ns/op
BenchmarkCommaSearch/1234,/prefixCheck-4       	200000000	         6.55 ns/op
BenchmarkCommaSearch/12345,/std-4              	300000000	         5.95 ns/op
BenchmarkCommaSearch/12345,/prefixCheck-4      	200000000	         6.55 ns/op
BenchmarkCommaSearch/123456,/std-4             	200000000	         6.47 ns/op
BenchmarkCommaSearch/123456,/prefixCheck-4     	200000000	         7.08 ns/op
BenchmarkCommaSearch/,_123,_4567890_qwertyuiop/std-4     	200000000	         6.28 ns/op
BenchmarkCommaSearch/,_123,_4567890_qwertyuiop/prefixCheck-4         	500000000	         3.35 ns/op
BenchmarkCommaSearch/1,_23,_4567890_qwertyuiop/std-4                 	200000000	         6.68 ns/op
BenchmarkCommaSearch/1,_23,_4567890_qwertyuiop/prefixCheck-4         	500000000	         3.60 ns/op
BenchmarkCommaSearch/12,_34,_567890_qwertyuiop/std-4                 	200000000	         6.87 ns/op
BenchmarkCommaSearch/12,_34,_567890_qwertyuiop/prefixCheck-4         	300000000	         3.73 ns/op
BenchmarkCommaSearch/123,_4,_567890_qwertyuiop/std-4                 	200000000	         6.26 ns/op
BenchmarkCommaSearch/123,_4,_567890_qwertyuiop/prefixCheck-4         	300000000	         4.32 ns/op
BenchmarkCommaSearch/1234,_5,_67890_qwertyuiop/std-4                 	200000000	         6.26 ns/op
BenchmarkCommaSearch/1234,_5,_67890_qwertyuiop/prefixCheck-4         	200000000	         6.96 ns/op
BenchmarkCommaSearch/12345,_6,_7890_qwertyuiop/std-4                 	200000000	         6.25 ns/op
BenchmarkCommaSearch/12345,_6,_7890_qwertyuiop/prefixCheck-4         	200000000	         7.47 ns/op
BenchmarkCommaSearch/123456,_7,_890_qwertyuiop/std-4                 	200000000	         6.76 ns/op
BenchmarkCommaSearch/123456,_7,_890_qwertyuiop/prefixCheck-4         	200000000	         6.94 ns/op

The benchmark shows that the prefixCheckFunc outperforms strings.IndexByte for short strings and when the required char is located in the short string prefix.

Solution

Emit optimized assembly for checking the first chars before {strings|bytes}.IndexByte call.

This may speed up parsers extracting small tokens such as short numbers, ids and strings.

cc'ing @TocarIP and @randall77 , who may be interested in the implementation.

@valyala

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2018

cc'ing @quasilyte and @josharian , since it looks like this may help CLs like https://go-review.googlesource.com/c/go/+/115616 .

@bcmills bcmills added the Performance label Jun 12, 2018

@bcmills bcmills added this to the Unplanned milestone Jun 12, 2018

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2018

We could also do something similar for HasPrefix and HasSuffix, particularly when the sought byte/prefix/suffix is a constant. However, it seems to me these are perfect candidates for mid-stack or partial inlining, rather than hand-coded compiler optimizations.

@valyala

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2018

However, it seems to me these are perfect candidates for mid-stack or partial inlining, rather than hand-coded compiler optimizations.

But as I know mid-stack inlining cannot inline yet functions written in assembly

@agnivade

This comment has been minimized.

Copy link
Member

commented Mar 26, 2019

It is not clear to me what this issue is about. {strings,bytes}.IndexByte already emits handcoded assembly for various platforms.

Is this about checking the initial chars before going with the generic algorithm ? What is the heuristic to have 4 as the number ? Why not 3 or 5 ?

Or is this about inlining the function calls ?

If it is the previous, the difference is very minimal now(smaller than a ns), for the cases where prefix is faster than std. And we already have a special-case assembly when the length of the string is less than 16. So I am not sure if it is worth pursuing to optimize for more corner cases when the attained gain is so less.

$benchstat std.txt prefix.txt 
name                   old time/op  new time/op  delta
CommaSearch/#00-4      4.16ns ± 1%  3.77ns ± 1%   -9.19%  (p=0.008 n=5+5)
CommaSearch/1-4        4.55ns ± 1%  4.16ns ± 1%   -8.48%  (p=0.008 n=5+5)
CommaSearch/12-4       4.53ns ± 0%  4.91ns ± 1%   +8.44%  (p=0.008 n=5+5)
CommaSearch/123-4      4.80ns ± 2%  4.94ns ± 1%   +2.75%  (p=0.008 n=5+5)
CommaSearch/1234-4     4.76ns ± 1%  8.42ns ± 1%  +77.04%  (p=0.008 n=5+5)
CommaSearch/12345-4    4.76ns ± 1%  8.42ns ± 0%  +76.83%  (p=0.008 n=5+5)
CommaSearch/123456-4   4.55ns ± 1%  8.32ns ± 1%  +82.97%  (p=0.008 n=5+5)
CommaSearch/,-4        4.72ns ± 1%  3.99ns ± 2%  -15.47%  (p=0.008 n=5+5)
CommaSearch/1,-4       4.69ns ± 1%  4.34ns ± 1%   -7.43%  (p=0.008 n=5+5)
CommaSearch/12,-4      4.99ns ± 1%  4.65ns ± 1%   -6.85%  (p=0.008 n=5+5)
CommaSearch/123,-4     4.98ns ± 0%  4.66ns ± 1%   -6.58%  (p=0.008 n=5+5)
CommaSearch/1234,-4    4.70ns ± 1%  8.73ns ± 1%  +85.78%  (p=0.008 n=5+5)
CommaSearch/12345,-4   4.73ns ± 2%  8.74ns ± 1%  +84.85%  (p=0.008 n=5+5)
CommaSearch/123456,-4  4.72ns ± 1%  8.72ns ± 3%  +84.99%  (p=0.008 n=5+5)

@valyala ?

@gopherbot

This comment has been minimized.

Copy link

commented Apr 26, 2019

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Apr 26, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.