Description
What version of Go are you using (go version)?
$ go version
go version go1.16.8 linux/amd64
Does this issue reproduce with the latest release?
It reproduces with the Go Playground, which I assume is the latest version.
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"
What did you do?
I tried to validate a user-provided string, which might contain invalid UTF-8, using a regex.
See https://play.golang.org/p/j-jsteknY0M for a concise example.
What did you expect to see?
The regexp package docs say:
All characters are UTF-8-encoded code points.
So I expected matching against strings that are not valid UTF-8 to either:
- always give an error, or
- always return false.
What did you see instead?
For some regexes it returns false, for some it returns true. It never returns an error.
In particular, regexes that reference or contain the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true, but only if there are other characters in the same bracket. See the playground example.
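A possible explanation, offered as an assumption rather than something the quoted docs state: unicode/utf8 decodes each invalid byte as utf8.RuneError (U+FFFD) with width 1, so if regexp decodes its input the same way, an invalid byte would look like a literal U+FFFD to a bracket expression. The sketch below shows the decoding and then prints what a few matches return. The patterns and the decodeFirst helper are illustrative and are not taken from the playground example.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// decodeFirst reports how unicode/utf8 decodes the first byte sequence of s.
// It is a hypothetical helper used only for this illustration.
func decodeFirst(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	invalid := "\xff" // a single byte that is never valid in UTF-8

	r, size := decodeFirst(invalid)
	fmt.Printf("decoded %U, width %d, is RuneError: %v\n", r, size, r == utf8.RuneError)

	// Illustrative patterns; the results are printed rather than assumed.
	patterns := []string{`^[\x{fffd}]$`, `^[\x{fffd}a]$`, `^[a-z]$`}
	for _, pattern := range patterns {
		re := regexp.MustCompile(pattern)
		fmt.Printf("%-16s matches invalid input: %v\n", pattern, re.MatchString(invalid))
	}
}
```

Running this against the Go version in question shows directly which of the patterns treat the invalid byte as U+FFFD.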
I understand that the immediate solution for me is to check for invalid UTF-8 first, before applying the regex. However, the actual behaviour was unexpected enough, even if it is technically undefined according to the docs, that it would be a good idea to at least document it.
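A minimal sketch of that workaround, assuming rejection with an error is the desired outcome for invalid input. The names matchValidUTF8, errInvalidUTF8 and lowercaseWord are hypothetical; only utf8.ValidString and (*regexp.Regexp).MatchString come from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"unicode/utf8"
)

// errInvalidUTF8 is returned when the input is not valid UTF-8.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// lowercaseWord is an example validation pattern, chosen for illustration.
var lowercaseWord = regexp.MustCompile(`^[a-z]+$`)

// matchValidUTF8 rejects invalid UTF-8 up front, so the regex only ever
// sees input for which its behaviour is documented.
func matchValidUTF8(re *regexp.Regexp, s string) (bool, error) {
	if !utf8.ValidString(s) {
		return false, errInvalidUTF8
	}
	return re.MatchString(s), nil
}

func main() {
	for _, input := range []string{"hello", "Hello", "he\xffllo"} {
		ok, err := matchValidUTF8(lowercaseWord, input)
		fmt.Printf("%q: match=%v err=%v\n", input, ok, err)
	}
}
```

Returning a distinct error keeps "did not match" and "could not be checked" separate, which is the distinction the regexp API does not make on its own.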