Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Place restrictions on String content in relevant sections #3068

Closed

Conversation

henrikt-ma
Copy link
Collaborator

This addresses the problem that the rather important restriction of what data a String may contain is hidden away in an appendix.

@HansOlsson
Copy link
Collaborator

I agree that this would make sense, and better present the current status.

However, this triggered that we should also consider allowing Unicode-characters (using UTF-8 that everyone seems to have converged on) - and in that case it would be slightly preferable to first switch to UTF-8 in one place, and then spread it like this.
We might still do this first.

The reason is that Unicode is already sneaking into code:
In particular for external tables, not only for the contents - but even the file name themselves for some users, as they are often found using loadResource that for files in non-ASCII directories need to return Unicode-strings. Telling users that they cannot load a table from their home-directory because they have non-ASCII character in that name isn't good.
modelica/ModelicaStandardLibrary#3789

The practical implications of allowing UTF-8 instead of ASCII is minimal for the language itself; as we don't even have string-subscripting in the language and thus we don't have to standarize byte/code-point/grapheme-cluster
The only tricky part is string-comparison where I see the possibilities:

  • Say that only ASCII string comparison is standarized
  • Standarize lexicographic comparison of UTF-8, which by design is the same as lexicgraphic comparison of code-points.
  • Standarize using Unicode-comparison in alphabetic order according to locale in some specific way
    The second option seems straightforward (well except for CESU-8 mess).

@henrikt-ma
Copy link
Collaborator Author

It is exactly because of the kind of things that you comment on that I wanted to make it clear where the specification stands at the moment. I thought that clearly expressing the current status would bring more light on these matters, and I can't see that it would be better to have the description of current situation tucked away in an appendix.

Let's open another issue about the current limitations.

@sjoelund
Copy link
Member

Why would the specification need to standardize how to sort strings? There is no way to compare strings in the language itself.
Instead it is up to e.g. MSL to define how to compare / sort strings. And the function could be extended to take an enumeration to change behaviour to allow all of the options above (making the default check if strings are ASCII).

@HansOlsson
Copy link
Collaborator

Why would the specification need to standardize how to sort strings?

Compare strings: "A" < "B" is allowed and defined in Modelica; https://specification.modelica.org/master/operators-and-expressions.html#equality-relational-and-logical-operators (using strcmp which preserves code-point order for UTF-8).

@sjoelund
Copy link
Member

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

@henrikt-ma
Copy link
Collaborator Author

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

But not here. This PR is meant to not introduce any changes to the language.

@HansOlsson
Copy link
Collaborator

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

But not here. This PR is meant to not introduce any changes to the language.

Yes, I understand. It was just that if we soon might add UTF-8 ( #3079 ) I think it is best to wait with this one.
Normally a clean-up before a change wouldn't be a problem, but in this case the change will sort of undo the clean-up.

@henrikt-ma
Copy link
Collaborator Author

But would any of the restructuring introduced in this PR be bad if we switched to UTF-8?

I like the initiative in #3079, but we don't know yet how long the discussions will be, and I think that merging this first would help us see the effect of #3079 more clearly.

@HansOlsson
Copy link
Collaborator

But would any of the restructuring introduced in this PR be bad if we switched to UTF-8?

Not bad, but almost entirely removed (see below).

The change in classes.tex is unnecessary if String-variables can contain any String from the language - and in syntax.tex the corresponding texts would just be removed (old and new).
In functions.tex it would be replaced by UTF-8 mapping and note about "valid string" removed.

Note that I skipped the String-mapping for Fortran in #3079. We probably might add it even though Fortran-Strings are normally for Lapack/Blas arguments that should be ASCII. However, I don't fully understand the Fortran-support for Unicode and searches gave confusing results; e.g., https://fortranwiki.org/fortran/files/character_handling_in_Fortran.html (at the end).

@henrikt-ma
Copy link
Collaborator Author

Not bad, but almost entirely removed (see below).

OK, but then the benefits of #3079 will shine even more. :)

@henrikt-ma
Copy link
Collaborator Author

Considering the amount of discussion we had in the MSL issue modelica/ModelicaStandardLibrary#3789 without realizing that non-ASCII String values don't even exist in the language today, and also that the amount of discussion in that issue gives an indication of how complicated it will be to finish #3079, I'd like to stress once more that we should merge this one to be in better shape while working on #3079.

@HansOlsson
Copy link
Collaborator

Since #3079 is accepted I believe we can close this one.

@henrikt-ma
Copy link
Collaborator Author

Since #3079 is accepted I believe we can close this one.

Yes. Closing.

@henrikt-ma henrikt-ma closed this Feb 15, 2022
@henrikt-ma henrikt-ma deleted the cleanup/string-character-set branch February 15, 2022 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants