regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006

Open
abacabadabacaba opened this issue Mar 22, 2020 · 2 comments
@abacabadabacaba abacabadabacaba commented Mar 22, 2020

$ go version
go version go1.14.1 linux/amd64

The documentation for the regexp package says that "All characters are UTF-8-encoded code points". However, it says nothing about what happens when the input is not valid UTF-8.

The current behavior on invalid UTF-8 is inconsistent. For example, when matching against the string `"\xff"`, the pattern `a` doesn't match, the pattern `\x{fffd}` doesn't match either, but the pattern `a|\x{fffd}` surprisingly does match.

The documentation should be updated to specify how the regexp package behaves on invalid UTF-8 input. The behavior should also be made more consistent (for example, by always converting every undecodable byte into `\ufffd`).

Contributor

@andig andig commented Jun 28, 2020

Different issue but similar context: it would also be interesting to use regexes with "binary" patterns, irrespective of UTF-8 code points. It seems this is not supported at all (while it is in Python).

Contributor

@davecheney davecheney commented Jun 29, 2020

@andig please open a new issue for binary regex. Thank you.
