Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check file encoding and language standard #2587

Merged
merged 4 commits into from
Jun 6, 2018

Conversation

sjoelund
Copy link
Member

@sjoelund sjoelund commented Jun 5, 2018

This makes Travis verify that the Modelica files conform to the 3.2
version of the language standard, and that files are encoded in 7-bit
ASCII format (no unicode) to make the library consistent.

At least MSL was released as ASCII in the past; has this changed? (There are some unicode files checked in right now; the CI job should reflect this) If it has changed, I would change this to verify that all files are UTF-8 instead.

@sjoelund sjoelund force-pushed the travis-ascii branch 2 times, most recently from 5272355 to 7da10e8 Compare June 5, 2018 06:04
@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

Non-ASCII characters in current master (09def33):

"Drum boiler example, see Franke, Rode, Krüger: On-line Optimization of Drum Boiler Startup, 3rd International Modelica Conference, Linköping, 2003"

You are right, this was not the case for MSL v3.2.2.

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

Travis said:

./Modelica/ComplexBlocks.mo: iconv: illegal input sequence at position 0
./Modelica/Magnetic/QuasiStatic/FluxTubes.mo: iconv: illegal input sequence at position 0
./Modelica/Fluid/Examples/DrumBoiler.mo: iconv: illegal input sequence at position 95

So there are also UTF8 byte-order marks in there.

@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

So there are also UTF8 byte-order marks in there.

Can confirm Modelica/ComplexBlocks.mo and Modelica/Magnetic/QuasiStatic/FluxTubes.mo

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

Can only confirm Modelica/ComplexBlocks.mo and Modelica/Magnetic/QuasiStatic/FluxTubes.mo

Position 0 means a byte-order mark, so that's to be expected.

The question is how to resolve this: Should I change the files to ASCII using html-tags? Does this even work for textString="μ"?

@christiankral
Copy link
Contributor

In the MLS version 3.2, revision 2 (July 30, 2013) it is stated on page 161:

"Each Modelica file in the file-system is stored in UTF-8 format (defined by The Unicode Consortium;
http://www.unicode.org ) and may start with the UTF-8 encoded byte order mark ( 0xef 0xbb 0xbf ); this is
treated as white-space in the grammar."

If a UTF-8 encoded Modelica file does not contain a specific UTF-8 character, then it may appear as ASCII file, as the UTF-8 encoded byte order mark is optional. However, I suppose we shall release all Modelica files in UTF-8 format.

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

@christiankral Yes, UTF-8 is allowed in MLS. But at least in the past MSL has decided to not use it (all HTML-documentation was changed to use HTML entities instead of unicode). I believe part of the reason is that tool support was lacking at the time (with some using latin1), so if you had a UTF8 version of MSL it got messed up in those tools.

I would be fine with either keeping it as ASCII or changing it to UTF8. But we should check the files to conform to this regardless (and if UTF8 is used, I believe to strip the BOM from all files at least).

@sjoelund sjoelund added this to the MSL3.2.3 milestone Jun 5, 2018
@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

Looks all good in Dymola, MWorks.Sysplorer and SimulationX. Not sure if MapleSim (@svorkoetter), OMEdit or Wolfram.SystemModelica (@otronarp) support UTF-8 strings in their GUI.

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

OpenModelica has supported UTF-8 since before we had OMEdit, so I don't think that's an issue anymore. And indeed many third-party libraries use UTF-8 strings in the GUI.

sjoelund and others added 2 commits June 5, 2018 08:43
This makes Travis verify that the Modelica files conform to the 3.2
version of the language standard, and that files are encoded in 7-bit
ASCII format (no unicode) to make the library consistent.
dietmarw
dietmarw previously approved these changes Jun 5, 2018
@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

If we move to UTF-8 the check should pass at least before merge.

@HansOlsson
Copy link
Contributor

I would prefer to keep BOM - if the files contain any other non-ASCII character.

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

If we move to UTF-8 the check should pass at least before merge.

That's easy: change iconv -f ascii -t ascii to iconv -f utf8 -t utf8.

I would prefer to keep BOM - if the files contain any other non-ASCII character.

If we decide on that, we should add a check for it (I don't think it's all too complicated).

@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

Looks all good in Dymola, MWorks.Sysplorer and SimulationX.

I checked the no-BOM files only.

@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

This is a quick poll for

  • UTF-8 w/o BOM 👍
  • UTF-8 with BOM 😆
  • ASCII ❤️

@HansOlsson
Copy link
Contributor

Pressed return to early:

The reason to keep BOM is that it makes it clear that the file isn't iso-latin-1 (there are still people using iso-latin-1 for some Modelica files (outside of MSL); we can almost always detect that - but it still feels iffy whereas a BOM is an unambiguous statement that the file isn't iso-latin-1).

@dietmarw
Copy link
Member

dietmarw commented Jun 5, 2018

Leaving BOM aside, to fix the current UTF-8 occurrences in the HTML the MSL should always use HTML entities. I'll can push a fix for that. As for text string, the two occurrences can be replaced by images just like we do for math in some places.

@HansOlsson
Copy link
Contributor

For the poll I would also like UTF-8 with BOM but I couldn't figure out which emoji it was 😖

@sjoelund
Copy link
Member Author

sjoelund commented Jun 5, 2018

@HansOlsson The code for that is :laughing:, so use the laugh one I guess even though it looks different.

@dietmarw dietmarw force-pushed the travis-ascii branch 2 times, most recently from d2c43c8 to 86a736b Compare June 5, 2018 16:13
@dietmarw
Copy link
Member

dietmarw commented Jun 5, 2018

@christiankral In order to allow extend the CI checks we needed to decide on an encoding. For this minor release I propose to keep MSL as pure ASCII like all previous ones.
This meant I need to replace μ by mu and ∠ by angle. I will create a PR that adds them back again for the next major release and hope that is fine by you.

MSL always used to be ASCII and although the spec allows UTF8 it seems strange to change the encoding now with the next minor.

I will prepare a PR to change those characters back to UTF8 for the next MAJOR release.
@christiankral
Copy link
Contributor

@dietmarw I'm OK with keeping this MSL release as pure ASCII version

@beutlich
Copy link
Member

beutlich commented Jun 5, 2018

I'm OK with keeping this MSL release as pure ASCII version

We should add a PR targeting the next major milestone adding the UTF-8 characters again.

@dietmarw dietmarw merged commit 7ee87a9 into modelica:master Jun 6, 2018
@beutlich beutlich added the CI Issue that addresses continuous integration label Jun 6, 2018
@sjoelund sjoelund deleted the travis-ascii branch June 7, 2018 06:18
@beutlich beutlich modified the milestone: MSL3.2.3 Jun 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Issue that addresses continuous integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants