regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

Closed
@ComaVN

What version of Go are you using (go version)?

$ go version
go version go1.16.8 linux/amd64

Does this issue reproduce with the latest release?

It reproduces with the Go Playground, which I assume is the latest version.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I tried to validate a user-provided string, which might contain non-UTF-8 data, using a regular expression.

See https://play.golang.org/p/j-jsteknY0M for a concise example.

What did you expect to see?

The regexp package docs say:

All characters are UTF-8-encoded code points.

So I expected matching on non-UTF-8 strings to either:

  • always give an error, or
  • always return false.

What did you see instead?

For some regexes it returns false; for others it returns true. It never returns an error.

In particular, regexes referencing or containing the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true (but only if there are other characters in the same bracket). See the playground example.
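A minimal sketch of how this can be probed locally. It relies on one fact about the standard library: `utf8.DecodeRuneInString` reports an invalid byte as `utf8.RuneError` (U+FFFD) with width 1. Whether the regexp engine takes exactly this decoding path for every pattern is an assumption here, so the match results are printed rather than stated. The helper name `decodeFirst` and the three patterns are illustrative and are not taken from the playground example.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// decodeFirst (hypothetical helper) returns the first rune of s and its
// width in bytes, as reported by the standard unicode/utf8 decoder.
func decodeFirst(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	invalid := "\xff" // a single byte that is not valid UTF-8

	r, size := decodeFirst(invalid)
	fmt.Printf("decoded %U, width %d, is RuneError: %t\n", r, size, r == utf8.RuneError)

	// Illustrative patterns: a bare U+FFFD, U+FFFD alone in a bracket
	// expression, and U+FFFD sharing a bracket with another character.
	// Results are printed, not assumed, since they may differ between
	// Go versions.
	for _, pattern := range []string{"\ufffd", "[\ufffd]", "[\ufffdx]"} {
		re := regexp.MustCompile(pattern)
		fmt.Printf("%q matches invalid input: %t\n", pattern, re.MatchString(invalid))
	}
}
```

Running this on the Go version in question shows which of the three forms treat the invalid byte as U+FFFD.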

I understand that the immediate solution for me is to just check for invalid UTF-8 first, before regexing. However, the actual behaviour was so unexpected to me, even if it's technically undefined when reading the docs, that it might be a good idea to at least document this.

Labels: Documentation, FrozenDueToAge, NeedsInvestigation