regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006

abacabadabacaba opened this issue Mar 22, 2020 · 2 comments

@abacabadabacaba abacabadabacaba commented Mar 22, 2020

$ go version
go version go1.14.1 linux/amd64

The documentation for the regexp package says that "All characters are UTF-8-encoded code points". However, it says nothing about what happens when the input is not valid UTF-8.

The current behavior on invalid UTF-8 is not consistent. For example, when matching against the string "\xff", the pattern `a` doesn't match and the pattern `\x{fffd}` doesn't match either, but the alternation of the two, `a|\x{fffd}`, surprisingly does match.

The documentation should be updated to specify how the regexp package behaves on invalid UTF-8 input. The behavior should also be made more consistent (for example, by always treating every undecodable byte as \ufffd).

@andig andig commented Jun 28, 2020

Different issue but similar context: it would also be interesting to use regexes with "binary" patterns that match raw bytes irrespective of UTF-8 code points. It seems this is not supported at all (while it is in Python).

@davecheney davecheney commented Jun 29, 2020

@andig please open a new issue for binary regex. Thank you.
