New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃崉 [[ Builder ]] Further work on encoding/decoding module #1754

Open
wants to merge 18 commits into
base: develop
from

Conversation

Projects
None yet
1 participant
@peter-b
Contributor

peter-b commented Jan 30, 2015

This pull request is for continued work on encoding/decoding support for Builder, based on @livecodeali's progress in #1652.

In particular, I have:

  • Added support for multi-byte replacements when encoding (only ASCII and native encodings affected)
  • Added support for multi-character replacements when decoding ASCII

Not done yet

  • Support for multi-character replacements when decoding (but I can't deal with this without resolving the issues listed below).

Unresolved issues

There are still some unresolved issues when decoding UTF-8, UTF-16 and UTF-32.

  1. When decoding UTF-8, we still use UTF8ToUnicode(), which detects and ignores errors with helpful comments like:

    foundation-legacy.cpp:401: // This is an error
    

    Should there be a MCStringCreateWithUTF8CharsAndReplacement() function? My current feeling is that MCStringCreateWithBytesAndReplacement() should be able to detect UTF-8 encoding errors and thus a replacement for UTF8ToUnicode() is needed.

  2. When decoding UTF-16 and UTF-32, what should be done with trailing bytes? Should they be ignored, should they cause a replacement string to be inserted, or should they always cause an error? @livecodeali and I have discussed it, and have had difficulty coming to a solid conclusion.

  3. When decoding UTF-16, what should be done about unpaired surrogates? Given that we can safely round-trip an unpaired surrogate through LiveCode's internal representation, I don't think that it makes sense to consider them to be a encoding error.

  4. What should be done about UTF-16 encoded as UTF-8? @runrevmark brought this up in the previous pull request.

Several of these issues could be solved by allowing flags to be passed to the codec functions, at the cost of making the API even more complicated.

runrevali and others added some commits Jan 9, 2015

runrevali
[[ LC Builder ]] Implement various things in encoding module
Conflicts:
	engine/src/modules.cpp
	libscript/libscript.xcodeproj/project.pbxproj
	libscript/libstdscript-modules.list
	toolchain/lc-compile/lc-compile.xcodeproj/project.pbxproj
	toolchain/lc-compile/src/module-helper.cpp
runrevali
[[ LCB StdLib ]] Allow replacement byte (resp. char) to be specified 鈥
鈥hen encoding (resp. decoding)

[[ LCB StdLib ]] Return undefined when replacement is unspecified and encoding or decoding fails
libfoundation: Remove MCStringGetNativeChars() from public API.
Remove both MCStringGetNativeChars() and
MCStringGetNativeCharsWithReplacement() from the public libfoundation
API.
libfoundation: Refactor MCStringGetNativeChars() API.
* Return true on success/false on failure from
  MCStringGetNativeChars() and
  MCStringGetNativeCharsWithReplacement().

* Add an in parameter for size of character buffer provided and an out
  parameter for number of characters used/needed.

* Allow multi-byte replacement sequences for characters that can't be
  represented in the native encoding, and add an explicit in parameter
  for the replacement sequence length.
libfoundation: Add MCStringGetAsciiCharsWithReplacement().
Add a new function that allows more efficient conversion to ASCII,
without going via the native encoding.
[[ LCB stdlib ]] Block Builder programs from requesting "native" enco鈥
鈥ing.

The "native" encoding available to script programs is rarely correct
for modern systems, e.g.:

* Almost all Linux systems use UTF-8, not ISO-8859-1.
* All Macs use UTF-8, not MacRoman.

Block all Builder programs from using "native" to specify an encoding.
If someone's absolutely determined to get "native encoding exactly as
in script", they can use "-native" to bypass the check.
[[Bug 12204]] libfoundation: Refactor MCStringConvertToNativeWithRepl鈥
鈥cement().

Allow zero-byte and multibyte replacement sequences.  Note that it is
*not* an error for the replacement array to contain non-native (or
non-ASCII) values.

**NOTE**: This patch means that "ASCII" encoding **really is** ASCII
encoding, not the native encoding discarding bytes > 127.
libfoundation: Refactor MCStringCreateWithAsciiCharsAndReplacement().
Allow multi-character substitutions for non-ASCII chars, including
allowing non-ASCII characters in the replacement.

@peter-b peter-b added this to the 8.0.0-rc-1 milestone Jan 30, 2015

@peter-b peter-b added the enhancement label Jan 30, 2015

@peter-b

This comment has been minimized.

Show comment
Hide comment
@peter-b

peter-b Feb 2, 2015

Contributor

Discussion with @runrevmark today (offline).

About the unresolved issues: the thing that determines a decoding error is if we encounter something while decoding that we can't represent internally. That means:

  • For UTF-8, incomplete or too-long sequences get replaced. But we allow short sequences (as in Modified UTF-8) or surrogates (as in UTF-16-encoded-as-UTF-8). We'll have to add a new e.g. MCStringCreateWithUtf8BytesAndReplacement() function.
  • For UTF-16 & UTF-32, trailing bytes get replaced.
  • Unpaired surrogates get accepted silently.

A couple of additional things were brought up:

  • We should convert by grapheme rather than by unichar_t. This may require changes to the way we do ASCII and "native" encoding. Unfortunately this is likely to be a massive amount of work.
  • We need to clean up the MCString* API and hide as many of the encoding-specific functions as possible. Also, get rid of the external_rep flag wherever feasible.

We need to decide how far we want to go with all of the above before the next release.

Contributor

peter-b commented Feb 2, 2015

Discussion with @runrevmark today (offline).

About the unresolved issues: the thing that determines a decoding error is if we encounter something while decoding that we can't represent internally. That means:

  • For UTF-8, incomplete or too-long sequences get replaced. But we allow short sequences (as in Modified UTF-8) or surrogates (as in UTF-16-encoded-as-UTF-8). We'll have to add a new e.g. MCStringCreateWithUtf8BytesAndReplacement() function.
  • For UTF-16 & UTF-32, trailing bytes get replaced.
  • Unpaired surrogates get accepted silently.

A couple of additional things were brought up:

  • We should convert by grapheme rather than by unichar_t. This may require changes to the way we do ASCII and "native" encoding. Unfortunately this is likely to be a massive amount of work.
  • We need to clean up the MCString* API and hide as many of the encoding-specific functions as possible. Also, get rid of the external_rep flag wherever feasible.

We need to decide how far we want to go with all of the above before the next release.

@peter-b peter-b added the WIP label Sep 1, 2015

@peter-b peter-b modified the milestone: 8.0.0-rc-1 Sep 28, 2015

@peter-b peter-b changed the title from [[ Builder ]] Further work on encoding/decoding module to [DEAD][[ Builder ]] Further work on encoding/decoding module Apr 13, 2017

@peter-b peter-b changed the title from [DEAD][[ Builder ]] Further work on encoding/decoding module to 馃崉 [[ Builder ]] Further work on encoding/decoding module Apr 13, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment