"\U" Unicode escape sequence for strings accepts invalid value instead of raising error #15456

Open
srutzky opened this issue Jun 28, 2019 · 1 comment

commented Jun 28, 2019

The \Unnnnnnnn Unicode escape sequence for strings accepts a Unicode code point (UTF-32 value) for the "nnnnnnnn". The highest code point is "0x10FFFF". Passing a value of "0x110000" (just add 1 to that highest value) into \U should raise an error. And, as expected, it does:

using System;
 
public class Test
{
	public static void Main()
	{
		// Highest / last Code Point = U+10FFFF
		// One value higher than the Highest Code Point / UTF-32 via \\U:
		Console.WriteLine("\U00110000");
	}
}

You can run that code directly on IDE One. It results in:

error CS1009: "Unrecognized escape sequence"
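
The range rule behind that error can be written as a small check. This is only a sketch: `EscapeCheck` and `IsInUnicodeRange` are hypothetical names introduced here and are not part of any compiler. It assumes the eight hex digits after `\U` are read as an unsigned 32-bit value and compared against the highest code point, U+10FFFF.

```csharp
using System;
using System.Globalization;

public static class EscapeCheck
{
    // Hypothetical helper: parses the eight hex digits that follow "\U"
    // as an unsigned 32-bit value and tests it against U+10FFFF.
    public static bool IsInUnicodeRange(string eightHexDigits)
    {
        uint value = uint.Parse(eightHexDigits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
        return value <= 0x10FFFF;
    }

    public static void Main()
    {
        foreach (string digits in new[] { "0001F47E", "0010FFFF", "00110000", "80000041", "FFFF0041" })
        {
            string verdict = IsInUnicodeRange(digits) ? "in range" : "out of range";
            Console.WriteLine("\\U" + digits + ": " + verdict);
        }
    }
}
```

Only the first two inputs fall within U+0000 to U+10FFFF, so the other three would be expected to produce CS1009.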

HOWEVER, I noticed that values starting at "0x80000000" do not raise an error. This is odd: I can't even compile such code using either Visual Studio 2015 (Update 3) with .NET Framework version 4.7.2, or LINQPad 5. Yet the following code compiles and runs using what is identified as being gmcs (4.6.2):

using System;
 
public class Test
{
	public static void Main()
	{
		// Code Point U+1F47E
		Console.WriteLine("\\U0001F47E = \U0001F47E");
		Console.WriteLine("");
		
		// Specifying the UTF-16 Surrogate Pair raises a "CS1009
		// Unrecognized escape sequence" error in LINQPad and Visual
		// Studio 2015.
		//
		// Here it returns the default replacement character due to ignoring
		// the first 4 hex digits and only seeing 0xDC7E, which is a
		// surrogate code unit.
		Console.WriteLine("\\UD83DDC7E = \UD83DDC7E");
		Console.WriteLine("");

		// The following is invalid, and raises an error in LINQPad and in
		// Visual Studio 2015. But here it returns an empty string due to
		// ignoring the first 4 hex digits and only seeing 0x0000.
		Console.WriteLine("\\UD83D0000 = \UD83D0000");
		Console.WriteLine("");
		
		// The following are invalid, and raise an error in LINQPad and in
		// Visual Studio 2015. But here they return an "A" due to ignoring
		// the first 4 hex digits and only seeing 0x0041.
		Console.WriteLine("\\UD83D0041 = \UD83D0041");
		Console.WriteLine("\\U80000041 = \U80000041");
		Console.WriteLine("\\UFFFF0041 = \UFFFF0041");
		Console.WriteLine("");

		// The following is invalid, and raises an error HERE (as expected),
		// due to the first 4 hex digits being below the 0x8000 mark.
		//Console.WriteLine("\\U7FFF0041 = \U7FFF0041");
	}
}

You can run the code above directly on IDE One. It results in:

\U0001F47E = 👾

\UD83DDC7E = �

\UD83D0000 =

\UD83D0041 = A
\U80000041 = A
\UFFFF0041 = A

So, somehow, if the first four hex digits are between 0x0000 and 0x7FFF, then all 8 digits are evaluated correctly as being either valid or invalid. But if the first four hex digits are between 0x8000 and 0xFFFF, then those 4 digits are ignored and a character is generated using the last 4 hex digits as the code point.
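
One possible explanation, offered purely as a guess, is that the digits are stored in a signed 32-bit integer before the range check. Values of 0x80000000 and above would then be negative, would pass a "greater than 0x10FFFF" test, and would be narrowed to a 16-bit `char`. The sketch below models that hypothesis. It is not taken from the Mono sources, and `SignedParseModel` and `Decode` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class SignedParseModel
{
    // Hypothetical model of the observed behavior; NOT the actual compiler code.
    // Returns null where the model would report CS1009.
    public static string Decode(string eightHexDigits)
    {
        // Assumption: the digits end up in a signed int, so any value with
        // the top bit set (0x80000000 and above) becomes negative.
        uint raw = uint.Parse(eightHexDigits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
        int value = unchecked((int)raw);

        if (value > 0x10FFFF)
            return null; // positive and too large: rejected, as with \U7FFF0041

        if (value < 0x10000)
            return unchecked((char)value).ToString(); // negative values land here; only the low 16 bits survive

        return char.ConvertFromUtf32(value); // supplementary code point: encoded as a surrogate pair
    }

    public static void Main()
    {
        foreach (string digits in new[] { "0001F47E", "D83DDC7E", "D83D0000", "80000041", "7FFF0041" })
        {
            string result = Decode(digits);
            string shown = result == null ? "error" : "first code unit 0x" + ((int)result[0]).ToString("X4");
            Console.WriteLine("\\U" + digits + " -> " + shown);
        }
    }
}
```

Under this model, `\UD83DDC7E` yields the lone surrogate 0xDC7E, `\UD83D0000` yields 0x0000, and `\U80000041` yields 0x0041 ("A"), while `\U7FFF0041` is rejected. That matches the output above, but it does not prove this is what the compiler actually does.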

srutzky added a commit to srutzky/docs that referenced this issue Jun 28, 2019

Fix and improve Unicode escape sequence info
1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair raises an exception, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
    * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
    * Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/deoylQ)
    * In creating the test noted above, I found a bug in the Mono C\# compiler, so I submitted that here:  
      ["\U" Unicode escape sequence for strings accepts invalid value instead of raising error #15456](mono/mono#15456)
    * Runnable example code showing that invalid code point (U+110000) raises an exception, on [IDE One](https://ideone.com/jpVxL4)

2. Correctly indicated that `\U` is for a 4-byte UTF-32 value, and `\u` is for a 2-byte UTF-16 value.

3. Show the pattern _and_ an example to be more readable / helpful. Please note that `\U00nnnnnn` has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion.

4. Properly formatted escape sequences as being inline-code

5. Added warning about using `\x` escape with fewer than 4 hex digits. For more info on this, please see:
     [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.wordpress.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)
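
Items 2 and 5 in the list above can be illustrated with a short snippet. This is a sketch only: the class and constant names are hypothetical and were chosen for this illustration. It shows that `\U` with a code point and `\u` with the two UTF-16 code units produce the same string, and that `\x` consumes up to four hex digits, which changes the result when fewer than four are intended.

```csharp
using System;

public static class EscapeForms
{
    // The same supplementary character, written two ways.
    public const string ViaCodePoint = "\U0001F47E";    // one UTF-32 code point
    public const string ViaCodeUnits = "\uD83D\uDC7E";  // two UTF-16 code units

    // \x takes one to four hex digits and consumes as many as it can.
    public const string Greedy = "\x41B";    // a single character, U+041B
    public const string Padded = "\x0041B";  // "A" (U+0041) followed by "B"

    public static void Main()
    {
        Console.WriteLine(ViaCodePoint == ViaCodeUnits); // True
        Console.WriteLine(ViaCodePoint.Length);          // 2
        Console.WriteLine(Greedy.Length);                // 1
        Console.WriteLine(Padded);                       // AB
    }
}
```

Padding `\x` to four digits, or using `\u` instead, avoids the case where a following character happens to be a valid hex digit.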
@srutzky


commented Jun 28, 2019

If it helps, this issue seems to be affecting C# (gmcs) but not F#:

open System

printfn "Code Point / UTF-32 via \\U: \U0001F47E";

// The line below (using value that works in gmcs 4.6.2) raises error (as it should):
// "FS1245: \U80000041 is not a valid Unicode character escape sequence"
//printfn "Code Point / UTF-32 via \\U: \U80000041";


// The line below (using the surrogate pair for U+1F47E) raises error (as it should):
// "FS1245: \UD83DDC7E is not a valid Unicode character escape sequence"
printfn "Code Point / UTF-32 via \\U: \UD83DDC7E";

as demonstrated on IDE One.


FWIW, I stumbled upon this bug while coming up with a test case to support my edits of the escape sequences section of the C# Strings documentation: Fix and improve Unicode escape sequence info #13162

Unicode escape sequences for both C# and F# (and others) are covered in my post:

Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

BillWagner added a commit to dotnet/docs that referenced this issue Jul 1, 2019

Fix and improve Unicode escape sequence info (#13162)