regexp: document and implement invalid UTF-8 treated as U+FFFD

### What version of Go are you using (`go version`)?

<pre>
$ go version
go version go1.16.8 linux/amd64
</pre>

### Does this issue reproduce with the latest release?
It reproduces with the [Go Playground](https://play.golang.org), which I assume is the latest version.

### What operating system and processor architecture are you using (`go env`)?

<details><summary><code>go env</code> Output</summary><br><pre>
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"
</pre></details>

### What did you do?

I tried to use a regex to validate a user-provided string that might contain invalid UTF-8.

See https://play.golang.org/p/j-jsteknY0M for a concise example.

### What did you expect to see?

The `regexp` package docs say:
> All characters are UTF-8-encoded code points.

So, I expected matching on strings that are not valid UTF-8 to either:
- always give an error, or
- always return `false`

### What did you see instead?
For some regexes, it returns `false`, for some it returns `true`. It never returns an error.

In particular, regexes that reference or contain the Unicode `REPLACEMENT CHARACTER` (`\ufffd`, �) inside a bracket expression return `true`, but only if the bracket contains other characters as well. See the playground example.

I understand that the immediate fix for me is to check for invalid UTF-8 before matching. However, the actual behaviour was so unexpected to me, even if it is technically undefined according to the docs, that it seems worth at least documenting.

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749
