Species Type: Character Set after "other:" #258

Open
s9105947 opened this issue Jan 4, 2022 · 2 comments
Labels
EXT: SpeciesType physical particle species extension

s9105947 commented Jan 4, 2022

To specify custom species, the species Type extension allows: "user are free to append a free text after a colon".
Is really any text allowed? (What even is text?)

IMO "free text" should be severely restricted, at least to "printable characters without semicolon (;)".
notable cases:

  • semicolons ";" -> already is species list separator
  • colons ":" -> is separator after "other" (though probably harmless)
  • apostrophe (') and quotation marks (")
    • are generally annoying to escape
    • might be re-written by editors to use their UTF8-versions (“ ” ‘ ’)
  • backslash \
    • is equally annoying to escape
    • is sometimes used to write control characters, e.g. \n or \a
  • whitespaces -> sometimes annoying to handle, but generally safe
    • in order of [difficulty|annoyingness] to handle: spaces < tabs < newlines
  • control characters -> influence display
    • ASCII? (e.g. nul, form feed, carriage return, backspace, bell)
    • UTF8? (e.g. byte order mark, right to left text indicator)
  • non-ascii in general
    • e.g. anything non-english (e.g. german, russian, arabic, chinese)
    • (UTF8 is endless, see also: emojis, skin-tone modifiers, single flag characters, greek question mark etc.)
  • pedantic: is a character set even specified? Are obscure ISO charsets allowed?
  • empty string (entire name would be "other:")
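
A minimal sketch of why the semicolon case matters, assuming a reader that naively splits the attribute on ";". The value below is a hypothetical example, not taken from any real file:

```python
# Hypothetical attribute value: the free text after "other:" contains a semicolon.
value = "electron;other:my;species"

# A naive list parser splits on every semicolon, including the one that
# was meant to be part of the free text.
items = value.split(";")
print(items)  # ['electron', 'other:my', 'species']
```

The free text "my;species" can no longer be told apart from two separate list items.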

By gut instinct I'd suggest allowing only "universally" safe and unambiguous characters after "other:":

  • only ASCII encoding (possibly already fixed by type "string", I have not checked)
  • any number of characters from any number of these classes:
    • letters a-z (either case)
    • digits 0-9
    • one of: dash "-", plus "+", underscore "_", period ".", number sign "#", caret "^"
  • excluded by design:
    • ";" b/c it is the list delimiter
    • "," b/c it could be confused for a list delimiter
    • whitespaces (space, tab, newline) b/c they tend to be difficult to handle or form special cases
    • backslash, apostrophe, quotation mark b/c they are annoying to escape
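
The restricted character set suggested above could be checked roughly like this. This is only a sketch: the pattern and the function name `is_safe_other` are assumptions made for illustration, and the empty payload is accepted here only because the list says "any number of characters".

```python
import re

# Assumed pattern for the proposal: ASCII letters, digits, and - + _ . # ^
# "*" also accepts an empty payload; whether that should be allowed is an
# open question in this issue.
SAFE_OTHER = re.compile(r"other:[A-Za-z0-9+_.#^-]*")


def is_safe_other(species_type: str) -> bool:
    """Return True if the whole string is "other:" plus allowed characters."""
    return SAFE_OTHER.fullmatch(species_type) is not None


print(is_safe_other("other:my-species_1"))  # True
print(is_safe_other("other:a;b"))           # False, ";" is excluded
print(is_safe_other("other:a b"))           # False, whitespace is excluded
```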

ax3l commented Jan 6, 2022

To specify custom species, the species Type extension allows: "user are free to append a free text after a colon".
Is really any text allowed? (What even is text?)
IMO "free text" should be severely restricted, at least to "printable characters without semicolon (;)".

Correct. Derived from the base standard, we always mean pure ASCII when we write text.
Yes, generally we leave other: unspecified for cases we have not yet thought about and that get standardized in later versions.
A semicolon is also fine in that case; maybe someone thinks of another compound or list and wants to use the same convention. When it gets standardized later on, we only drop the other:, making file conversions easy. Thus, ";" are fine here.

I would not pro-actively allow <a-type-that-we-already-define>;other:<someStuff> yet, since I think we have no use case for this yet and it just complicates the syntax & conventions. (Unless you have a concrete case you need to achieve right now, of course.)

@ax3l ax3l added the EXT: SpeciesType physical particle species extension label Jan 6, 2022

s9105947 commented Jan 7, 2022

Thank you for the clarification on ASCII, I overlooked that ._.
I like the concept of reserving other: for entirely custom types and forbidding it from lists

So if I understand you correctly, a speciesType is:

  1. one of the pre-defined species (fundamental particle, atom, maybe ion/molecule)
  2. a list formed by multiple of those conforming to 1., separated by a single ";"
    • empty lists are forbidden
    • empty list items (containing ;;) are forbidden
    • trailing semicolons must be ignored
  3. a string beginning with other:, followed by free text (see below)

Correct, so far?
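
The three cases as I understand them, as a rough sketch. `KNOWN_SPECIES` is a hypothetical stand-in for the real list of pre-defined names, `parse_species_type` is a made-up helper, and treating exactly one trailing semicolon as ignorable is my assumption:

```python
# Hypothetical subset of pre-defined names; the real list lives in the extension.
KNOWN_SPECIES = {"electron", "proton", "hydrogen"}


def parse_species_type(value: str):
    """Return ("other", text) or ("list", [names]); raise ValueError otherwise."""
    if value.startswith("other:"):
        # Case 3: everything after the prefix is free text, semicolons included.
        return ("other", value[len("other:"):])
    # Assumption: exactly one trailing semicolon is ignored.
    if value.endswith(";"):
        value = value[:-1]
    # Cases 1 and 2: a single species is treated as a list with one item.
    items = value.split(";")
    for item in items:
        if item == "":
            raise ValueError("empty list or empty list item")
        if item not in KNOWN_SPECIES:
            raise ValueError(f"unknown species: {item!r}")
    return ("list", items)


print(parse_species_type("electron;proton;"))  # ('list', ['electron', 'proton'])
print(parse_species_type("other:a;b"))         # ('other', 'a;b')
```

With this reading, "electron;other:x" is rejected, because other: is only recognized at the very start of the string.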

This would leave these questions:

  1. should empty strings after other: be permitted?
  2. can a newline ever follow after other:?
  3. which other characters are allowed?

I'd suggest:

  1. yes
  2. no
  3. class "print" of POSIX locale (IEEE 1003.1-2008, s. 7.3.1, l. 4187) 123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}~ (+backtick ` +space +horizontal tab)
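
A sketch of what that check could look like, assuming pure ASCII as clarified above. In the POSIX locale the "print" class corresponds to the ASCII range 0x20 (space) through 0x7E (~); the horizontal tab is added on top, as suggested. The function name `is_valid_other_payload` is made up for illustration:

```python
def is_valid_other_payload(text: str) -> bool:
    """Check the free text that follows "other:" (prefix already removed)."""
    # Printable ASCII is 0x20 (space) through 0x7E (~); tab is 0x09.
    # An empty string passes, which matches answering "yes" to question 1.
    return all(0x20 <= ord(ch) <= 0x7E or ch == "\t" for ch in text)


print(is_valid_other_payload(""))             # True
print(is_valid_other_payload("my;species"))   # True, ";" is printable
print(is_valid_other_payload("line\nbreak"))  # False, newline is rejected
```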
