Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/idna: we are almost out of room for new mapped runes #49371

Open
TimothyGu opened this issue Nov 5, 2021 · 3 comments
Open

x/net/idna: we are almost out of room for new mapped runes #49371

TimothyGu opened this issue Nov 5, 2021 · 3 comments
Labels
NeedsDecision

Comments

@TimothyGu
Copy link
Contributor

@TimothyGu TimothyGu commented Nov 5, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/timothy-gu/.cache/go-build"
GOENV="/home/timothy-gu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/timothy-gu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/timothy-gu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/timothy-gu/go/src/golang.org/x/text/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build93394766=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Not a bug… yet.

The x/net/idna package (aka x/text/internal/export/idna) stores information about each character in a 16-bit info structure. Of the 16 bits, 13 bits can be used for an index into the mappings table, which defines replacements of deviation and mapped characters.

The 13 bits necessarily mean that the index can only be in the range [0, 8191]. However, we see that in the latest version of Unicode supported by Go (13.0.0), the mappings table already has 8188 bytes of data. Each new mapped rune adds at least two bytes to the table (one for a header, one for the actual mapping), so at most only two more runes can be added in future versions of Unicode for the encoding scheme to continue working!

What did you expect to see?

The encoding scheme should be future-proof for the foreseeable future.

What did you see instead?

The encoding scheme will stop working imminently, perhaps as soon as Unicode 14.0.0 (#48621)!


I propose two ways to solve this problem, each of which would suffice by itself.

  1. We could adjust the encoding to give more room to the mapping table index. In particular, we can merge the xorBit (bit 2) into the indices, since there are not as many XOR table indices (xorData has 4862 bytes for Unicode 13). Here's one possible scheme:

    bits 15 14 13 12 11 ... 2       possibilities     meaning
          0  X  X  X  X     X       8192              mapping table index
          1  0  0  X  X     X       2048              mapping table index
          1  0  1  X  X     X       2048              XOR table index
          1  1  0  X  X     X       2048              XOR table index
          1  1  1  0  X     X       1024              XOR table index
          1  1  1  1  X     X       1024              inline XOR
    
    mapping table index = 10240 possibilities (need ≥8188)
        XOR table index =  5120 possibilities (need ≥4862)
                              can expand to up to 5888
                              by taking more prefixes from inline XOR
             inline XOR =  1024 possibilities (need exactly 256)
    

    This scheme is implemented in https://golang.org/cl/361496.

  2. We could align each mapping table index to be a multiple of two, and stop explicitly storing the least significant bit. This allows the current layout of the info type to be kept. After this change, the size of mappings is 9012, which means that the allowed stored indices are in the range [0, 4505] – well within the range of 13 bits.

    In terms of binary size impacts: this change causes the total size of tables to grow from 43370 bytes to 44314 bytes (944 bytes) – a modest increase on par with two Unicode version upgrades (Unicode 11.0.0 → 13.0.0 caused an increase of 904 bytes).

    This scheme is implemented in https://golang.org/cl/361497.

Finally, I have benchmarked both approaches on Go 1.17.1:

❯ go test -bench BenchmarkProfile -benchtime 20s -run '^$' -v golang.org/x/text/internal/export/idna
goos: linux
goarch: amd64
pkg: golang.org/x/text/internal/export/idna
cpu: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz

origin/master
BenchmarkProfile-8      84007281               266.4 ns/op

approach 1
BenchmarkProfile-8      86914498               267.2 ns/op   (0.3% slower)

approach 2
BenchmarkProfile-8      86389702               267.8 ns/op   (0.5% slower)

Notice, though that BenchmarkProfile is not an entirely fair benchmark as it hammers one ASCII-only string specifically. The real-world performance impact for both approaches is likely to be more pronounced especially when characters in the mapping table are exercised – but probably not overly so.

@TimothyGu
Copy link
Contributor Author

@TimothyGu TimothyGu commented Nov 5, 2021

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 5, 2021

Change https://golang.org/cl/361497 mentions this issue: internal/export/idna: change representation of trie (approach 2)

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 5, 2021

Change https://golang.org/cl/361496 mentions this issue: internal/export/idna: change representation of trie (approach 1)

@thanm thanm added the NeedsDecision label Nov 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsDecision
Projects
None yet
Development

No branches or pull requests

3 participants