-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for accented characters in data type "Name" #56
Comments
Why did you not include the note which is meant to deal with this? From Section 7.2.4.1:
There are other ways to include support for Unicode characters in other programming languages, for example in perl (the API Definition document uses perl's notation for regular expressions) you should use Please also note that https://regex101.com/ is not an official definition of how regular expressions works. |
Because it doesn't deal with it. As the bug report says, it's independent of the implementation, and of the programming languages that might be used. As you rightly say, regex101 isn't an official definition of how regular expressions work. But it is an implementation-independent way of representing the problem. Are you saying that this isn't a problem? Because Mowali definitely think it is, and that it is unaffected by the flag you propose as a solution. |
The flag mentioned in the note only deals with Java as an example, not other programming languages. The Switch (which I assume you mean with Mowali) is implemented in Javascript. Have you looked at the link that I provided? |
OK, let me put this another way. Did you test with accented characters? and are you therefore confident that it's an implementation-dependent issue? |
Yes, I have tested the regular expression in Java with accented characters using the flag mentioned in the note. |
You can try the following in Java which should work fine (at least in Java 7 and 8, which are the only versions that I currently have installed on this machine):
|
I looked at the link you provided. As far as I can see, it contains various suggestions for work-arounds in the content of the regex expression. These are equivalent, as far as I can see, to the form of solution that I proposed; but perhaps I have mistaken your intention in making the reference. Can you confirm? As a general principle, I think that our aim should be that the content of the regex expression should not require language-specific work-arounds, if only because there is, at least in principle, a very large number of language instances for which we might need to provide specific suggestions, and I would very much prefer that the API should remain independent of the languages which might be used to implement it. I would rather see a regex which we are confident will work in any language context - which is both the reason why I proposed the solution I did and the thinking behind using a language-independent way of verifying the regex. Would you agree, or not? |
HI Michael,
This is a good catch!
However, I think your recommended replacement pattern is incorrect.
You wrote:
^(?!\s*$)[(\p{L}|\p{Nd}) .,'-]{1,128}$
The sequence “[…]” denotes a character class, with each character matched literally, after the \p{…} sequences are parsed. Yet you appear to use “(…)” for grouping and “|” for alternation, within this character class. But alternation doesn’t work inside a character class. The characters “)”, “(", and “|” will be matched literally, which I think is not what was intended.
As constructed, this regular expression happens to work with the test string, but will also permit a name to begin with a number or special symbol, which should not be allowed.
The sequence should be “at least one international letter” followed by between 0 and 127 “international letters, numbers, or the characters . , ‘ - and space," so it requires two character classes in order and no alternation…
^(?!\s*$)([\p{L}][\p{L}\p{Nd} .,'-]{0,127})$
You might verify that I did not misunderstand your intentions. This RE parses at regex101.com <http://regex101.com/> to what appears to be the intention in the specification.
— Miller
… On Mar 3, 2020, at 1:54 AM, mjbrichards ***@***.***> wrote:
Describe the bug
In Section 7.4.2.1 of the API specification, the definition of the regular expression to parse a variable of the Name type states: "all Unicode32 characters are allowed". In fact, accented and non-Roman characters are rejected by the example regular expression given in Listing 14.
To Reproduce
Steps to reproduce the behavior:
Go to https://regex101.com/ <https://regex101.com/>
Paste the regular expression in Listing 14 into the REGULAR EXPRESSION box
Paste any text with accented characters into the TEST STRING box (e.g. "Côte d'Ivoire")
MATCH INFORMATION box displays "Your regular expression does not match the subject string."
Expected behavior
MATCH INFORMATION box displays:
Match 1
Full match | 0-13 | Côte d'Ivoire
Desktop (please complete the following information):
Windows 10
Chrome
Version 80.0.3987.122 (Official Build) (64-bit)
Additional information:
Replace the regular expression in Listing 14 with:
^(?!\s*$)[(\p{L}|\p{Nd}) .,'-]{1,128}$
This uses the Unicode character groups for "any letter" and "any numeric digit"
Now the accented characters (and non-Roman scripts) are parsed correctly.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub <#56?email_source=notifications&email_token=AB6OJ6GXO347FONI4ZKYCWDRFTHWJA5CNFSM4LAGVIX2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IR63TSA>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB6OJ6FU5SMTCZKTOTDSKBTRFTHWJANCNFSM4LAGVIXQ>.
|
@MichaelJBRichards, the problem is that the suggested regex will not work in every language context either. As an example for Javascript, support for Unicode property escapes was first added to ECMAScript 2018. I can happily agree that it would be great if we could find some regex that would work in every language out of the box without having the note to say that Unicode support should be enabled, but to my knowledge that is not possible. @millerabel, your comment regarding the errors in the regex are correct, but not allowing a leading number would mean a breaking change for existing implementations (that have correctly understood the note in 7.2.4.1 that says that you need to enable support for Unicode), as leading numbers are currently allowed. As such, |
All,
I’m not concerned about an error in the specification being corrected. Systems that implemented an ambiguous requirement would not likely have interoperated. We saw adequate evidence of this when testing the four platforms in London two years ago.
The API spec has many ambiguities that must be iteratively driven out. Some may look like breaking changes, but are not. The behavior allowed by the previous RE specification was never correct to begin with!
There is no requirement to support ancient language versions like pre-2018 ECMAScript that don’t support the 30-year old Unicode character set in a complete way. Different environments may support Unicode RE differently and adaptation of the platform and of the recommended regular expression syntax on those systems will be required by implementers of the specification.
So we want both,
A canonical RE in the specification that (correctly!) demonstrates what we mean implementations must adhere to;
A reminder that the RE is a specification, not source code, and might need adjustment to apply to a particular runtime environment or programming language or that flags for an executive environment might need to be enabled.
The RE in the specification is not source code to be typed into an implementation. It is a (thankfully) machine readable specification of what is permitted by the specification for particular data fields. An implementer is free to implement that requirement in any way that matches the behavior of the machine-readable and verifiable specification and in any language and environment they choose.
I’m noting another error in both my and your RE: They both permit trailing blanks which should not be allowed.
Let’s get the RE spec right. It should correctly describe the canonical requirement. It’s not source code. And it’s not a “breaking change” to get this right in the specification.
— Miller
… On Mar 3, 2020, at 10:26 AM, Henrik Karlsson ***@***.***> wrote:
@MichaelJBRichards <https://github.com/MichaelJBRichards>, the problem is that the suggested regex will not work in every language context either. As an example for Javascript, support for Unicode property escapes was first added to ECMAScript 2018. I can happily agree that it would be great if we could find some regex that would work in every language out of the box without having the note to say that Unicode support should be enabled, but to my knowledge that is not possible.
@millerabel <https://github.com/millerabel>, your comment regarding the errors in the regex are correct, but not allowing a leading number would mean a breaking change for existing implementations (that have correctly understood the note in 7.2.4.1 that says that you need to enable support for Unicode), as leading numbers are currently allowed. As such, ^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$ would be the replacement if we should use Unicode property escapes instead of the existing note. See also comment above to Michael regarding support (or lack of support for Javascript before ES2018) for Unicode property escapes.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#56?email_source=notifications&email_token=AB6OJ6BEGOSLXTIQ3QTOESLRFVDVVA5CNFSM4LAGVIX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENUTJ5A#issuecomment-594097396>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB6OJ6E57S22AZEXHGLHJZDRFVDVVANCNFSM4LAGVIXQ>.
|
I'm further noting that 7.4.2 incorrectly describes the character space for Name format.
7.2.4.1 Regular Expression
The regular expression for restricting the Name type appears in Listing 14 <https://github.com/mojaloop/mojaloop-specification/blob/master/documents/API%20Definition%20v1.0.md#listing-14>. The restriction does not allow a string consisting of whitespace only, all Unicode32 characters are allowed, as well as the period (.) (apostrophe ('), dash (-), comma (,) and space characters ( ). The maximum number of characters in the Name is 128.
Note: In some programming languages, Unicode support must be specifically enabled. For example, if Java is used the flag UNICODE_CHARACTER_CLASS must be enabled to allow Unicode characters.
<https://github.com/mojaloop/mojaloop-specification/blob/master/documents/API%20Definition%20v1.0.md#listing-14>Listing 14
^(?!\s*$)[\w .,'-]{1,128}$
Listing 14 -- Regular expression for data type Name
<https://github.com/mojaloop/mojaloop-specification/blob/master/documents/API%20Definition%20v1.0.md#725-integer>
From RFC 5198, Network Unicode:
Fortunately, however, we are
converging on Unicode [Unicode <https://tools.ietf.org/html/rfc5198#ref-Unicode>] [ISO10646 <https://tools.ietf.org/html/rfc5198#ref-ISO10646>] as a single international
interchange character coding and no longer need to deal with per-
script standards for character sets (e.g., one standard for each of
Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
languages that are usually considered to share a script, such as
French, German, or Swedish). Unfortunately, though, while it is
certainly time to define a Unicode-based text type for use as a
common text interchange format, "use Unicode" involves even more
ambiguity than "use ASCII" did decades ago.
By “UNICODE32” did you mean Unicode 3.2, a deprecated prior version of the specification, or did you mean UTF-32, which is an encoding format for Unicode (I hope you don’t mean this!). The current reference version of Unicode for Network Unicode, since 2008, is Unicode 5.0. As is mentioned here in the RFC, just saying "all Unicode characters are allowed" is worse than saying "use ASCII." And in this case, is just wrong. All Unicode characters are not allowed.
We should specify the specific minimum version of Unicode that implementers are expected to support. Support for Burmese (the Myanmar script) entered the Unicode specification at version 5.1.0, e.g.
http://www.unicode.org/versions/ <http://www.unicode.org/versions/>
We need a much tighter RE specification for what we mean by Name and a tighter English specification. Something like this:
“Letters, both accented and unaccented, being chosen from the “alpha" character class configured to select code points from the complete Unicode 5.0 specification representing letters, digits, the period (.), apostrophe ('), dash (-), comma (,) and space character. Interior spaces are allowed, but no leading or trailing spaces.
For the avoidance of doubt, Names may include leading digits.
When comparing Names, receivers should first prepare Name values by replacing all punctuation with spaces. Then remove any leading and trailing spaces and replace each interior span of multiple spaces with a single space. This canonical form of the name should be comparable in a case-insensitive manner for equivalence with other instances of the same Name. This will render similar Names with different accented and unaccented letters as different Names."
This is certainly not perfect, but is directional of what implementers should expect in precision of this important aspect of the specification. Let’s please iterate to get to a better definition.
— Miller
… On Mar 3, 2020, at 5:51 PM, Miller Abel ***@***.***> wrote:
All,
I’m not concerned about an error in the specification being corrected. Systems that implemented an ambiguous requirement would not likely have interoperated. We saw adequate evidence of this when testing the four platforms in London two years ago.
The API spec has many ambiguities that must be iteratively driven out. Some may look like breaking changes, but are not. The behavior allowed by the previous RE specification was never correct to begin with!
There is no requirement to support ancient language versions like pre-2018 ECMAScript that don’t support the 30-year old Unicode character set in a complete way. Different environments may support Unicode RE differently and adaptation of the platform and of the recommended regular expression syntax on those systems will be required by implementers of the specification.
So we want both,
A canonical RE in the specification that (correctly!) demonstrates what we mean implementations must adhere to;
A reminder that the RE is a specification, not source code, and might need adjustment to apply to a particular runtime environment or programming language or that flags for an executive environment might need to be enabled.
The RE in the specification is not source code to be typed into an implementation. It is a (thankfully) machine readable specification of what is permitted by the specification for particular data fields. An implementer is free to implement that requirement in any way that matches the behavior of the machine-readable and verifiable specification and in any language and environment they choose.
I’m noting another error in both my and your RE: They both permit trailing blanks which should not be allowed.
Let’s get the RE spec right. It should correctly describe the canonical requirement. It’s not source code. And it’s not a “breaking change” to get this right in the specification.
— Miller
> On Mar 3, 2020, at 10:26 AM, Henrik Karlsson ***@***.*** ***@***.***>> wrote:
>
> @MichaelJBRichards <https://github.com/MichaelJBRichards>, the problem is that the suggested regex will not work in every language context either. As an example for Javascript, support for Unicode property escapes was first added to ECMAScript 2018. I can happily agree that it would be great if we could find some regex that would work in every language out of the box without having the note to say that Unicode support should be enabled, but to my knowledge that is not possible.
>
> @millerabel <https://github.com/millerabel>, your comment regarding the errors in the regex are correct, but not allowing a leading number would mean a breaking change for existing implementations (that have correctly understood the note in 7.2.4.1 that says that you need to enable support for Unicode), as leading numbers are currently allowed. As such, ^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$ would be the replacement if we should use Unicode property escapes instead of the existing note. See also comment above to Michael regarding support (or lack of support for Javascript before ES2018) for Unicode property escapes.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub <#56?email_source=notifications&email_token=AB6OJ6BEGOSLXTIQ3QTOESLRFVDVVA5CNFSM4LAGVIX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENUTJ5A#issuecomment-594097396>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB6OJ6E57S22AZEXHGLHJZDRFVDVVANCNFSM4LAGVIXQ>.
>
|
Hi @millerabel, First let me just say that I'm very positive regarding your suggestions for making the regex stricter, I just don't want a stricter regex in the current major version (1) as it would introduce a change in what is allowed to be sent in the This is why I proposed As using the modifier
I can't agree more. The more the API is used, the more ambiguities we will find that needs to be corrected.
It was correct if you followed the note saying that you need to enable support for Unicode. This can be achieved in different ways depending on the programming language (and version), see above. The regex itself can be improved to make it stricter, but that is for another major version.
Trailing blanks are also allowed in the current regex. Not permitting trailing blanks would mean a breaking change, requiring a new major version.
If you read the original spec you will find that the 32 is a footnote. It is just a copy-paste error from the PDF-file to markdown which makes it look like it says Unicode32.
You are entirely correct.
Sounds good!
Your suggestion sounds good on a quick read. The "Then remove any leading and trailing spaces.." part can then be removed when we introduce the stricter regex. I hope I managed to answer on most of your comments.. Thank you for the detailed review of the type! |
I agree with Miller. Since it seems clear that the use of specific regex expressions will lead us into potentially unwise assumptions about the technology used to implement the specification, would it not be simpler to give a clear English-language statement of the rules to be followed and the minimum standards (e.g. Unicode) that implementers must meet? So I think I'd like to suggest that there should be a reference somewhere in the API specification (preferably in a table of such versioning references) to the Unicode release level. Within that, we should use references to the Unicode General Categories that we allow or prohibit in a field of a specific type. So we might rewrite Miller's proposal to say: "Letters, both accented and unaccented, being chosen from all code points belonging to the Letter and Decimal_Number general categories as defined in the reference version of the Unicode specification (with link to reference.) In addition, the period (.), apostrophe ('), dash (-), comma (,) and space character are permitted. Interior spaces are allowed, but no leading or trailing spaces. For the avoidance of doubt, Names may include leading digits." We can then allow implementers to decide how best to meet these requirements. With regard to Miller's suggestion on the canonical form of a name, I wonder whether it might not be better to have a section which explains in general how to obtain the canonical form of any data item, and how to compare two instances of canonical names for equality. I don't have any quarrel with his statement of the method, but I suspect that we may find ourselves duplicating information if we make this part of the definition of the individual type. |
Are you suggesting that we exclude the regular expression completely? I would very much like to have a good English-language statement of the rules, but I would also like to keep at least one of the regular expression versions that have been discussed in this issue in the specification as an example to follow the rules. Otherwise I think we risk ending up in different versions of the regular expressions that are not necessarily entirely compatible.
Some comments:
|
I should clarify my position as it aligns, more or less, with both Henrik and Michael...
I believe we should lead in the specification with a clear English description of the requirements. But we should also include a crisp RE the describes the requirements in machine-verifiable form, where that is practical.
Henrik had proposed a refinement of the original RE that appears to be closer, and with one improvement, I think it gets at the core requirements:
/^(?!\s*$)[\p{L}\p{Nd} .,'-]{1,128}$/u
However, this expression would admit other forms of “Name Format” values that were never intended by the V1 specification and must be treated as an error in the specification.
Leading and trailing blanks are not part of a name.
Spaces can’t be a prefix or suffix and will be stripped by any reasonable implementation. We should require that in the specification to remove the ambiguity that this RE otherwise implies. “Don’t send; tolerate but ignore on receive.” That’s the language of the resilient Internet.
And in all of this debate, we should understand that this Name Format is applicable only in Latin-centric language countries. It won’t work in Asia.
— Miller
|
Please note that you do not need the modifier
Sounds good to me. |
Henrik’s correct that the “/u” Unicode flag is implicit when a Unicode pattern is compiled. It hurts nothing to include it and it makes it then quite explicit, which is my preference (and habit). But I’d be happy either way as long as we leave the English text that says you may need to adjust your environment to process Unicode (or something like that).
The Unicode category questioned earlier is 'Number, Decimal Digit’ and there are hundreds of code points covering Asian and Arabic languages. Please never use [0-9] when a Unicode human readable text is being parsed: Use \d and include the ‘/u’ flag (another reason to be explicit when constructing an RE pattern by including the Unicode flag). Or use the Unicode Number, Decimal Digit category which will activate Unicode flag implicitly.
An important consideration is that when parsing machine-readable text that is assumed to be ASCII (such as in communication protocols, e.g. an IP address), you should explicitly use the “/a” flag to ensure your meaning is clear and that Unicode code points outside of the narrow ASCII set are not recognized.
— Miller
|
Is this comment directed some earlier comment to this issue, to the specification, or just in general? As far as I know there are no instances of [0-9] in the specification meant for Unicode human readable text? Otherwise please point to where in the specification that you think this needs to be updated.
Same question as above. |
It’s important that the specification materials provide guidance to implementers. The English language of the specification that deals with these generalities should include some guidance on how we use the REs in the specifications. You’ve already suggested some language on the activation of Unicode mode which you claim is different on various platforms (though I’ve found no evidence of that so far).
That the REs are specification and not code should be made explicit. By showing explicit RE flag settings in the specification rather than relying on implicit flags we avoid potential misunderstanding and we make the spec clearer. This won’t always be true, but it would have been true here.
New implementers of the specification should be made aware of hazards in assuming the REs are code to be used directly, which they are not. As for positive and negative test cases, the specification doesn’t provide test suites, but Mojaloop does. We might test some of these assumptions directly with negative test cases in our suites.
There have been discussions on this subject on various threads and GitHub issues on the use of [0-9]. I’m not aware of any place within the specification that uses symbol ranges, please flag any you are aware of so we can fix them.
— Miller
|
Describe the bug
In Section 7.2.4.1 of the API specification, the definition of the regular expression to parse a variable of the Name type states: "all Unicode32 characters are allowed". In fact, accented and non-Roman characters are rejected by the example regular expression given in Listing 14.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
MATCH INFORMATION box displays:
Match 1
Full match | 0-13 | Côte d'Ivoire
Desktop (please complete the following information):
Additional information:
Replace the regular expression in Listing 14 with:
^(?!\s*$)[(\p{L}|\p{Nd}) .,'-]{1,128}$
This uses the Unicode character groups for "any letter" and "any numeric digit"
Now the accented characters (and non-Roman scripts) are parsed correctly.
The text was updated successfully, but these errors were encountered: