Place restrictions on String content in relevant sections #3068

henrikt-ma · 2021-12-09T11:28:08Z

This addresses the problem that the rather important restriction of what data a String may contain is hidden away in an appendix.

HansOlsson · 2021-12-10T09:12:22Z

I agree that this would make sense, and better present the current status.

However, this triggered that we should also consider allowing Unicode-characters (using UTF-8 that everyone seems to have converged on) - and in that case it would be slightly preferable to first switch to UTF-8 in one place, and then spread it like this.
We might still do this first.

The reason is that Unicode is already sneaking into code:
In particular for external tables, not only for the contents - but even the file name themselves for some users, as they are often found using loadResource that for files in non-ASCII directories need to return Unicode-strings. Telling users that they cannot load a table from their home-directory because they have non-ASCII character in that name isn't good.
modelica/ModelicaStandardLibrary#3789

The practical implications of allowing UTF-8 instead of ASCII is minimal for the language itself; as we don't even have string-subscripting in the language and thus we don't have to standarize byte/code-point/grapheme-cluster
The only tricky part is string-comparison where I see the possibilities:

Say that only ASCII string comparison is standarized
Standarize lexicographic comparison of UTF-8, which by design is the same as lexicgraphic comparison of code-points.
Standarize using Unicode-comparison in alphabetic order according to locale in some specific way
The second option seems straightforward (well except for CESU-8 mess).

henrikt-ma · 2021-12-10T11:06:41Z

It is exactly because of the kind of things that you comment on that I wanted to make it clear where the specification stands at the moment. I thought that clearly expressing the current status would bring more light on these matters, and I can't see that it would be better to have the description of current situation tucked away in an appendix.

Let's open another issue about the current limitations.

sjoelund · 2021-12-10T11:14:05Z

Why would the specification need to standardize how to sort strings? There is no way to compare strings in the language itself.
Instead it is up to e.g. MSL to define how to compare / sort strings. And the function could be extended to take an enumeration to change behaviour to allow all of the options above (making the default check if strings are ASCII).

HansOlsson · 2021-12-10T11:34:19Z

Why would the specification need to standardize how to sort strings?

Compare strings: "A" < "B" is allowed and defined in Modelica; https://specification.modelica.org/master/operators-and-expressions.html#equality-relational-and-logical-operators (using strcmp which preserves code-point order for UTF-8).

sjoelund · 2021-12-10T11:40:57Z

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

henrikt-ma · 2021-12-10T11:59:58Z

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

But not here. This PR is meant to not introduce any changes to the language.

HansOlsson · 2021-12-17T12:22:57Z

I didn't even realize that. I think I always see Modelica_stringCompare and think that exists for a reason... Then yes, I guess we need to decide how to compare UTF-8.

But not here. This PR is meant to not introduce any changes to the language.

Yes, I understand. It was just that if we soon might add UTF-8 ( #3079 ) I think it is best to wait with this one.
Normally a clean-up before a change wouldn't be a problem, but in this case the change will sort of undo the clean-up.

henrikt-ma · 2021-12-17T13:18:31Z

But would any of the restructuring introduced in this PR be bad if we switched to UTF-8?

I like the initiative in #3079, but we don't know yet how long the discussions will be, and I think that merging this first would help us see the effect of #3079 more clearly.

HansOlsson · 2021-12-17T14:05:09Z

But would any of the restructuring introduced in this PR be bad if we switched to UTF-8?

Not bad, but almost entirely removed (see below).

The change in classes.tex is unnecessary if String-variables can contain any String from the language - and in syntax.tex the corresponding texts would just be removed (old and new).
In functions.tex it would be replaced by UTF-8 mapping and note about "valid string" removed.

Note that I skipped the String-mapping for Fortran in #3079. We probably might add it even though Fortran-Strings are normally for Lapack/Blas arguments that should be ASCII. However, I don't fully understand the Fortran-support for Unicode and searches gave confusing results; e.g., https://fortranwiki.org/fortran/files/character_handling_in_Fortran.html (at the end).

henrikt-ma · 2021-12-17T15:01:59Z

Not bad, but almost entirely removed (see below).

OK, but then the benefits of #3079 will shine even more. :)

henrikt-ma · 2022-01-13T06:25:02Z

Considering the amount of discussion we had in the MSL issue modelica/ModelicaStandardLibrary#3789 without realizing that non-ASCII String values don't even exist in the language today, and also that the amount of discussion in that issue gives an indication of how complicated it will be to finish #3079, I'd like to stress once more that we should merge this one to be in better shape while working on #3079.

HansOlsson · 2022-02-14T14:42:17Z

Since #3079 is accepted I believe we can close this one.

henrikt-ma · 2022-02-15T21:34:33Z

Since #3079 is accepted I believe we can close this one.

Yes. Closing.

Place restrictions on String content in relevant sections

e303fa9

henrikt-ma requested a review from HansOlsson December 9, 2021 11:28

HansOlsson mentioned this pull request Dec 10, 2021

Support Unicode in all Strings. #3070

Closed

This was referenced Jan 10, 2022

Specify that all strings can be Unicode. #3079

Merged

String length of UTF-8 stings modelica/ModelicaStandardLibrary#3789

Open

henrikt-ma closed this Feb 15, 2022

henrikt-ma deleted the cleanup/string-character-set branch February 15, 2022 21:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Place restrictions on String content in relevant sections #3068

Place restrictions on String content in relevant sections #3068

henrikt-ma commented Dec 9, 2021

HansOlsson commented Dec 10, 2021

henrikt-ma commented Dec 10, 2021

sjoelund commented Dec 10, 2021

HansOlsson commented Dec 10, 2021

sjoelund commented Dec 10, 2021

henrikt-ma commented Dec 10, 2021

HansOlsson commented Dec 17, 2021

henrikt-ma commented Dec 17, 2021

HansOlsson commented Dec 17, 2021

henrikt-ma commented Dec 17, 2021

henrikt-ma commented Jan 13, 2022

HansOlsson commented Feb 14, 2022

henrikt-ma commented Feb 15, 2022

Place restrictions on String content in relevant sections #3068

Place restrictions on String content in relevant sections #3068

Conversation

henrikt-ma commented Dec 9, 2021

HansOlsson commented Dec 10, 2021

henrikt-ma commented Dec 10, 2021

sjoelund commented Dec 10, 2021

HansOlsson commented Dec 10, 2021

sjoelund commented Dec 10, 2021

henrikt-ma commented Dec 10, 2021

HansOlsson commented Dec 17, 2021

henrikt-ma commented Dec 17, 2021

HansOlsson commented Dec 17, 2021

henrikt-ma commented Dec 17, 2021

henrikt-ma commented Jan 13, 2022

HansOlsson commented Feb 14, 2022

henrikt-ma commented Feb 15, 2022