Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The dot (".") meta-character should exclude surrogates #544

Merged
merged 4 commits into from Nov 18, 2019

Conversation

sarowe
Copy link
Contributor

@sarowe sarowe commented Jan 25, 2019

Followup from #538:

Notably, the dot metachar . includes these [surrogate] non-code-point chars, which should not be present unpaired in a well-formed UTF-16 sequence. Maybe the . definition should exclude them?

I'd be happy to change the definition of . to remove non-code-point chars. It's intended to match actual characters, so that makes sense to me.

Definitions from section 3.9 of the Unicode v11.0 standard:

  • D76: Unicode scalar value: Any Unicode code point except high-surrogate and low-surro- gate code points.
    • As a result of this definition, the set of Unicode scalar values consists of the ranges U+0 to U+D7FF and U+E000 to U+10FFFF, inclusive.
  • D77: Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. [....] In the Unicode Standard, specific values of some code units cannot be used to represent an encoded character in isolation. This restriction applies to isolated surrogate code units in UTF-16 and to the bytes 80–FF in UTF-8. [...]

@sarowe sarowe requested a review from lsf37 as a code owner Jan 25, 2019
@sarowe
Copy link
Contributor Author

sarowe commented Jan 25, 2019

The branch is WIP: existing unit tests succeed, but:

  • at least one testsuite case ("bol") fails because the jflex generation output differs
  • new unit and/or testsuite test(s) should be added
  • a note about this change should be added to the changelog and the manual

@sarowe
Copy link
Contributor Author

sarowe commented Jan 27, 2019

I adjusted the dot-newline testsuite case's spec to directly handle surrogate chars; I think that is sufficient testing. I documented unpaired surrogates' exclusion from . matches. So the outlined work is done.

But I wonder if excluding unpaired surrogates from . matches goes far enough? I think we should consider extending this idea to also not match unpaired surrogates in negated character classes ([^...]) and negated Unicode properties (\P{...} - note the capital P); currently in both cases unpaired surrogates will match. @lsf37 what do you think?

lsf37
lsf37 approved these changes Nov 18, 2019
Copy link
Member

@lsf37 lsf37 left a comment

This looks good to me for now and can be merged. There is still the question if we should do more, which I'm not sure about. Let's keep discussing, but add that in a different PR.

@lsf37 lsf37 added the bug label Nov 18, 2019
@lsf37 lsf37 added this to the 1.8.0 milestone Nov 18, 2019
@lsf37
Copy link
Member

lsf37 commented Nov 18, 2019

Leaving out those unpaired surrogates from negated classes does make sense. My main concern is [^], which should match any single character, and would then also not match those, I guess. In a way, that is correct, because they are not valid characters.

On the other hand, we would then get that [^]* no longer matches any input, which is something we'd want, I think. Similarly, we'd expect the expression [a]|[^a] to match any single character for any a. Maybe we shouldn't expect that, but currently I would.

The basic problem is that there exist inputs that contain sequences that do not resolve to characters, and one of the basic assumptions for the scanner from the dark ages of Java 1.1 is that every input is a sequence of characters.

Would it make sense to make [^] a special case, or is that even worse?

@lsf37 lsf37 force-pushed the dot-metachar-excludes-surrogates branch from 93c5c82 to 46281f9 Compare Nov 18, 2019
@lsf37
Copy link
Member

lsf37 commented Nov 18, 2019

(rebased on master)

@lsf37 lsf37 merged commit e789e93 into master Nov 18, 2019
5 checks passed
@regisd regisd deleted the dot-metachar-excludes-surrogates branch Nov 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants