regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006
$ go version
go version go1.14.1 linux/amd64
The documentation for the regexp package says that "All characters are UTF-8-encoded code points". However, it does not say what happens when the input is not valid UTF-8.
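A small probe can help observe what the package actually does with such input. The sketch below is only illustrative: the input `"\xff"`, the three patterns, and the helper names `isValid` and `probe` are assumptions chosen for this example, not taken from the report. It prints the results without asserting what they should be, since that is exactly what the documentation leaves open.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// isValid reports whether s is well-formed UTF-8.
// (Hypothetical helper, a thin wrapper around utf8.ValidString.)
func isValid(s string) bool {
	return utf8.ValidString(s)
}

// probe compiles each pattern and records whether it matches input.
// (Hypothetical helper; it panics if a pattern does not compile.)
func probe(input string, patterns []string) map[string]bool {
	results := make(map[string]bool, len(patterns))
	for _, p := range patterns {
		results[p] = regexp.MustCompile(p).MatchString(input)
	}
	return results
}

func main() {
	// Assumed sample input: a single byte that is not valid UTF-8.
	input := "\xff"

	// Assumed sample patterns: any character, a literal U+FFFD,
	// and a character class containing only U+FFFD.
	patterns := []string{`.`, `\x{FFFD}`, `[\x{FFFD}]`}

	fmt.Printf("input %q is valid UTF-8: %v\n", input, isValid(input))

	results := probe(input, patterns)
	for _, p := range patterns {
		fmt.Printf("pattern %-12q matches: %v\n", p, results[p])
	}
}
```

Running the same probe on different Go releases, or with additional patterns, is one way to check whether the observed behavior is stable.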
The current behavior on invalid UTF-8 is not consistent. For example, when matching against the string
The documentation should be updated to specify the behavior of the regexp package on invalid UTF-8 input. Also, the behavior should be made more consistent (for example, always convert every undecodable byte into