Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generates invalid utf-8 (Surrogate pairs) #483

Closed
nixypanda opened this issue Jan 29, 2017 · 24 comments
Closed

Generates invalid utf-8 (Surrogate pairs) #483

nixypanda opened this issue Jan 29, 2017 · 24 comments

Comments

@nixypanda
Copy link

Hi I was using tidy-html5 with utf-8 with the following params

		-indent
		-quiet
		-utf8
		--doctype html5
		--drop-empty-elements yes
		--drop-empty-paras yes
		--fix-bad-comments yes
		--output-html yes
		--repeated-attributes keep-last
		--tidy-mark no

and when I try to pass the input: ��
I get back HTML with header etc... But this particular input is encoded as
[237 160 189 237 184 141] byte array.

Which to me is not valid utf-8.

this in hex is -> ed a0 bd ed b8 8d which translates to code points U+D83D and U+DE0D. Both of which lie in the surrogate pairs range U+D800 to U+DFFF.

Can someone point to me if there is something I am doing wrong or is it a bug?

@geoffmcl
Copy link
Contributor

@jckdrpr thanks for the issue...

I am afraid I get very confused by character encoding, so am perhaps not the best person to comment, but given an input of �� I can indeed confirm tidy will output, in hex ED A0 BD ED B8 8D, which my utf-8 checker advises is not valid utf-8. But what is the right or wrong thing in this case, I do not know!

I can see in the debugger, on the input decoding, after seeing the &# start, and ; end, it will store these in its lexer first as D83D and DE0D resp., then will generate two 3-byte, supposed to be utf-8, ED A0 BD for the first, and ED B8 8D for the second...

Using those entities as inputs, my browsers displays them as two black diamonds with a white ?, which I understand as invalid characters, but...

But what exactly do you expect to be the utf-8 output given those two numeric entities as input?

Maybe if I understand this, I could comment more... and maybe others can offer more insight...

Meantime marking this as Technical Support, until more information, understanding, comes to light...

@geoffmcl geoffmcl added this to the 5.3 milestone Jan 29, 2017
@nixypanda
Copy link
Author

nixypanda commented Jan 29, 2017

Right now I have a wrapper around tidy-html which post-processes the generated output. Right now the way I am handling it is by converting the generated surrogate pair to unicode code point and then encode it.

Here is a detailed description.

UTF-8 description

I check a given utf-8 (with aforementioned invalidities) byte sequence (generate by tidy and detect if it contains a surrogate pair which by UTF-8 standard makes the byte sequence invalid UTF-8.

According to utf-8

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

NOTE: The definition of UTF-8 prohibits encoding character numbers betweenU+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

Detecting invalid utf-8 outputs
given that if we encode the code points from U+D800 to U+DFFF with utf-8 encoding scheme (without checking for surrogates) we get.
U+D800 -> \xED\xA0\x80
U+DFFF -> \xED\xBF\xBF

i.e. this regex \xED[\xA0-\xBF][\x80-\xBF] should detect invalid utf-8 sequence which contain surrogates.

Fixing the problem

  1. Detect the surrogate pair.
  2. Get the unicode code points (d83d de0d) from invalid utf-8 (ed a0 bd ed b8 8d). Using
    int(c1&mask3)<<12 | int(c2&maskx)<<6 | int(c3&maskx)
    where maskx = 0x3F // 0011 1111 and mask3 = 0x0F // 0000 1111
  3. Convert the unicode surrogate pair to a scalar value
    N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000
    in this case 0x1F60D
  4. Now just encode it using utf-8 scheme. Which turns out to be. F0 9F 98 8D or 😍.

Hopefully you found this description informative. Here are a few links that I found helpful while I was looking into it.

http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm
https://tools.ietf.org/html/rfc3629

P.S. I hate Emojis BTW! lol.

@geoffmcl
Copy link
Contributor

@jckdrpr thanks for the full explanation, which I sort of understood, maybe... but still stuck on fixing the problem!

As mentioned, during the decoding of the numeric entity, begins &# and ends ;, tidy starts out with 26 23 35 35 33 35 37 in its lexer buffer. Then as part of the post processing of the collected text data, it ends up replacing that with the invalid utf-8 ed a0 bd...

So this starts in ParseEntity() - https://github.com/htacg/tidy-html5/blob/master/src/lexer.c#L1063 - so it decodes the first and has 26 23 35 35 33 35 37 in its buffer and calls EntityInfo - https://github.com/htacg/tidy-html5/blob/master/src/entities.c#L352 - it sees the second char is #, to denote a numeric entity, check if it is hexadecimal or not, and uses sscanf to get its value - the lexer buffer is always zero terminated -

        if ( name[2] == 'x' || (!isXml && name[2] == 'X') )
            res = sscanf( name+3, "%x", &c );
        else
            res = sscanf( name+2, "%u", &c );

Now it backs up the lexer, and replaces it with the sscanf value, D83D, converted to utf-8 by - https://github.com/htacg/tidy-html5/blob/master/src/utf8.c#L337 - then does the same with the second entity... and so on if more text to decode from the file...

Now I give all this explanation to try to understand at what point should tidy start looking for these surrogates? That is Detect the surrogate pair?

Is it just sufficient to range check the sscanf value is in the invalid range U+D800 and U+DFFF? But at that point we only have the first value! Are we always looking for a pair?

Now I can see that if I set the value, c above, to 0x01f60d, tidy will generate 4 byte f0 9f 98 8d, so tidy's value to utf-8 seems ok... But in this case that value is derived from combining two values... How to know to look at two values, and combine them with N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000?

The problem seems tidy operates in a linear like fashion. Yes, after adding 3 utf-8 bytes to the lexer buffer, could check (a) is this 3 byte utf-8 - it would always be 3, in this surrogate case, right?, (b) see this is in the bad range with v = (c1 & mask3) << 12 | (c2 & maskx) << 6 | (c3 & maskx), (c) Is it preceded by 3 bytes in the lexer, (d) if yes, and it is in the same bad range, deal with changing the 6 bytes to something else - WOW, that might be possible...

As you can see I know tidy's code very well, but it is just getting my head around looking for surrogate pairs... When? At what point?

Sort of out of time tonight... will sleep on this... but really look forward to your help on this... it is for sure feeling like a big bad BUG!

@nixypanda
Copy link
Author

nixypanda commented Jan 30, 2017

Where to look for surrogates?
Probably where we are going through Lexer output.

Are we always looking for a pair?
Yes. AFAIK.

Would always be 3, in this surrogate case, right?
Yes. As surrogates are in range U+D800 to U+DBFF (low) and U+DC00 to U+DFFF (high) making it 16 bits and if we look at the number of x below we see there are 16, when encoding it as utf-8

0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx

EDIT: I have confused low and high so to clarify.
High means first (Leading surrogate) and is in range U+D800 to U+DBFF
Low means second surrogate (trailing surrogate) and lies in range U+DC00 to U+DFFF.

@geoffmcl
Copy link
Contributor

@jckdrpr ok, seemed a good idea to look for surrogates when doing the output. That is when pretty printing the document, specifically in this case PPrintText, ie a text node - https://github.com/htacg/tidy-html5/blob/master/src/pprint.c#L1052 - and dealing with utf-8 output...

        /* look for UTF-8 multibyte character */
        if ( c > 0x7F )
             ix += TY_(GetUTF8)( doc->lexer->lexbuf + ix, &c );

Now in that service, GetUTF8, it reads the ed a0 bd back into the 0xD83D code point - this is done so tidy can have a different output encoding - and bingo found this had been partially done before, but now commented out, #if 0, with a comment that it breaks Big5 decoding -

#if 0 /* Breaks Big5 D8 - DF */
    if (!hasError && (n >= kUTF16LowSurrogateBegin) && (n <= kUTF16HighSurrogateEnd))
        /* unpaired surrogates not allowed */
        hasError = yes;
#endif

I say partially, because as you can see it is only checking the first of the pair! The comment unpaired surrogates not allowed seems to imply that the conversion of a pair into a different utf-8 sequence, a different code point, should/would have been done before this...

Now this code point is stored in an output print line buffer, a uint *, and only when it is actually flushed to out, is the output encoding applied - see WriteChar https://github.com/htacg/tidy-html5/blob/master/src/streamio.c#L588 - where the output encoding will determine the output, and again there is a commented out section, for utf-8 output, if 0, with the same comment -

    else if (c <= 0xFFFF)  /* 1110 XXXX  three bytes */
    {
        buf[0] = (tmbchar) (0xE0 | (c >> 12));
        buf[1] = (tmbchar) (0x80 | ((c >> 6) & 0x3F));
        buf[2] = (tmbchar) (0x80 | (c & 0x3F));
        bytes = 3;
        if ( c == kUTF8ByteSwapNotAChar || c == kUTF8NotAChar )
            hasError = yes;
#if 0 /* Breaks Big5 D8 - DF */
        else if ( c >= kUTF16LowSurrogateBegin && c <= kUTF16HighSurrogateEnd )
            /* unpaired surrogates not allowed */
            hasError = yes;
#endif
    }

So no, I am starting to believe these surrogate pairs must be handled on input of entities! It seems only at that moment can the idea of a pair can be applied...

The idea would be, when we reach the ;, fetch one more char from the stream... if it is not another &, put it back - tidy has no problem with that - and continue, else range check the value, and if in the low/high range of surrogate, continue and collect the second entity, etc, etc...

This looks like a good place - https://github.com/htacg/tidy-html5/blob/master/src/lexer.c#L1162 - We already have ch == 0xd83d... falls into the pair range... can enter another while ( (c = TY_(ReadChar)(doc->docIn)) != EndOfStream ) { } loop fetching the second... can apply the N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000, and get another value, etc, etc...

From your comments, and here I think I read that these code points are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF). Assume that means we should range check the first and second differently? And then there is deciding what to do if one or the other fail...

This might be workable, without too much change... need to experiment with this... probably in a branch surrogates...

But am now convinced it is a bug, and marking this so... Tidy should try hard to not output invalid utf-8...

Naturally would appreciate any help with this... comments, patches, PR...

And really thanks for my ongoing education in this weird charset world... always fun...

@nixypanda
Copy link
Author

@geoffmcl Would really love to help on this issue but I haven't done any C for a (long) while now. :(. Would appreciate if you update this issue with more details on what you are working on. I will try help as best as I can and will do some experiments on my end too, but can't promise you anything.

@geoffmcl
Copy link
Contributor

@jckdrpr started to try to code this but ran into my first big bump ;=() naming conventions!

The pair you gave &#55357;&#56845;, or D83D DE0D, seems correct, but then some naming conventions seem reversed?

The wiki I pointed to says "These code points are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF)."

And then you stated "surrogates are in range U+D800 to U+DBFF (low) and U+DC00 to U+DFFF (high)", which seems correct in order, but opposite in conventional naming low/high!!!

So the wiki states the leading value, called the "high surrogates", is in fact the lower value range, D800-DBFF, and the trailing value, called the "low surrogates", is in fact the 'highervalue range,DC00-DFFF`...

It seems a sort of reversal in naming convention... that is not named according to the range, but according to the position, in the pair...

Who, which is right? Specifically which range should be the first found? Which the second? Assume the wiki is correct...

Then tidy source has a utf8.c, which has 2 services, which like you, seemed named after the range values, rather than the position - quite confusing...

/* UTF-16 surrogate pair areas */
#define kUTF16LowSurrogateBegin  0xD800
#define kUTF16LowSurrogateEnd    0xDBFF
#define kUTF16HighSurrogateBegin 0xDC00
#define kUTF16HighSurrogateEnd   0xDFFF

Bool    TY_(IsHighSurrogate)( tchar ch )
{
    return ( ch >= kUTF16HighSurrogateBegin && ch <= kUTF16HighSurrogateEnd );
}
Bool    TY_(IsLowSurrogate)( tchar ch )
{
    return ( ch >= kUTF16LowSurrogateBegin && ch <= kUTF16LowSurrogateEnd );
}

If we go with the wiki positional naming, ie first == high, but the lower range, and second == low, but the higher range, then maybe these service should be renamed -

IsHighSurrogatet -> IsTrailingSurrogate or IsSecondSurrogate
IsLowSurrogatte -> IsLeadingSurrogate or IsFirstSurrogate

Or could keep these names so long as one understands the first should be an IsLowSurrogate, followed by the second, an IsHighSurrogate...

And that naming works for the other important services offered -

tchar   TY_(CombineSurrogatePair)( tchar high, tchar low )
{
    assert( TY_(IsHighSurrogate)(high) && TY_(IsLowSurrogate)(low) );
    return ( ((low - kUTF16LowSurrogateBegin) * 0x400) + 
             high - kUTF16HighSurrogateBegin + 0x10000 );
}

Bool   TY_(SplitSurrogatePair)( tchar utf16, tchar* low, tchar* high )
{
    Bool status = ( TY_(IsValidCombinedChar)( utf16 ) && high && low );
    if ( status )
    {
        *low  = (utf16 - kUTF16SurrogatesBegin) / 0x400 + kUTF16LowSurrogateBegin;
        *high = (utf16 - kUTF16SurrogatesBegin) % 0x400 + kUTF16HighSurrogateBegin;
    }
    return status;
}

Note, in that combine service, seems should pass in (tchar second, tchar first), which is counter intuitive...

But somehow still feel should go for Leading/Trailing, or First/Second, of a pair, and avoid the confusing High/Low...

Still undecided, but would appreciate comments from others who maybe understand this more...

Coding stalled until I get all this completely clear in my mind... help... thanks...

PS: Just read yours on sending this... Thanks for your offer to help...

This is a patch I started - not complete - WIP - but shows the idea, if we have the first of a surrogate, plough on to get the second...

diff --git a/src/lexer.c b/src/lexer.c
index 0fea386..fc01a63 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -1159,6 +1159,62 @@ static void ParseEntity( TidyDocImpl* doc, GetTokenMode mode )
         found = TY_(EntityInfo)( lexer->lexbuf+start, isXml, &ch, &entver );
     }
 
+#if 0 /* 0000000000000000000000000000000000000000 */
+    /* Issue #483 - Deal with 'surrogate pairs' */
+    if (!preserveEntities && found && TY_(IsLowSurrogate)(ch))
+    {
+        uint c1;
+        if ((c1 = TY_(ReadChar)(doc->docIn)) == '&') 
+        {
+            /* have a following entity */
+            uint start2 = lexer->lexsize;  /* to start at "&" */
+            TY_(AddCharToLexer)(lexer, c1);
+            entState = ENT_default;
+            charRead = 0;
+            while ((c1 = TY_(ReadChar)(doc->docIn)) != EndOfStream)
+            {
+                if (c1 == ';')
+                {
+                    semicolon = yes;
+                    break;
+                }
+                ++charRead;
+
+                if (charRead == 1 && c1 == '#')
+                {
+                    TY_(AddCharToLexer)(lexer, c1);
+                    entState = ENT_numdec;
+                    continue;
+                }
+                else if (charRead == 2 && entState == ENT_numdec
+                    && (c1 == 'x' || (!isXml && c1 == 'X')))
+                {
+                    TY_(AddCharToLexer)(lexer, c1);
+                    entState = ENT_numhex;
+                    continue;
+                }
+
+                if (entFn[entState](c1))
+                {
+                    TY_(AddCharToLexer)(lexer, c1);
+                    continue;
+                }
+
+                /* otherwise put it back */
+                TY_(UngetChar)(c1, doc->docIn);
+                break;
+            }
+
+        }
+        else 
+        {
+            /* otherwise put it back */
+            TY_(UngetChar)(c1, doc->docIn);
+        }
+
+    }
+#endif /* 000000000000000000000000000000000 */
+
     /* deal with unrecognized or invalid entities */
     /* #433012 - fix by Randy Waki 17 Feb 01 */
     /* report invalid NCR's - Terry Teague 01 Sep 01 */

But stopped after getting confused about high/low, low/high, and decided this should be pulled out as a new service, like GetSurrogatePair, probably not storing the collected second in the lexer at all... too confusing...

Every thing will be fixed if we can get the combined value into the ch variable, 0x1f60d, then tidy will store fo 9f 98 8d in the lexer, having eaten the two entities... WIP...

@nixypanda
Copy link
Author

nixypanda commented Jan 31, 2017

The wiki one is correct. Seems like I confused low and high. For reference UTF-16 rfc which confirms what wiki says.

   -  Characters with values between 0x10000 and 0x10FFFF are
      represented by a 16-bit integer with a value between 0xD800 and
      0xDBFF (within the so-called high-half zone or high surrogate
      area) followed by a 16-bit integer with a value between 0xDC00 and
      0xDFFF (within the so-called low-half zone or low surrogate area).

I like you would prefer leading-trailing over high-low.
High or Leading -> 0xD800 to 0xDBFF
Low or trailing -> 0xDC00 to 0xDFFF

P.S. Also updated the comment.
UTF-16 RFC for reference https://www.ietf.org/rfc/rfc2781.txt

geoffmcl added a commit that referenced this issue Feb 1, 2017
Only deals with a successful case.

TODO: Maybe add a warning/error if the trailing surrogate not found, and
maybe consider substituting to avoid invalid utf-8 output.
@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 1, 2017

@jckdrpr thanks for clarifying...

Yes, while maybe changing the Tidy function names might be a good idea, I decided to press on using the current slightly confusing names...

Have now pushed what I think is a fix to a surrogates branch - to try it out -

$ cd tidy-html5
$ git pull
$ git checkout surrogates
$ cd build, and build, test...

Now loading the output into a browser I see the funny emoticon ;=))

At this stage I have done nothing about when a leading entity is found, and the trailing entity fails, but left a TODO: note in the code...

Hope you get a chance to test this surrogates branch and look forward to suggestions on dealing with an error condition... thanks...

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 1, 2017

@jckdrpr just for fun I wrote a short perl script to generate all 1,048,576 surrogate pairs entities, in a big table... then ran tidy 5.3.16I483, the surrogates branch, and found Tidy rejected 32 of them... not sure why yet... they seem the last two at the end of a row... and I note the Tidy service, IsValidCombinedChar says no to FFFE and FFFF...

# range
# Leading: U+D800 to U+DBFF (High - low range) and
# Trailing U+DC00 to U+DFFF (Low - high range) 
# (1,024 × 1,024 = 1,048,576 code points
sub gen_surrogates() {
    my ($x,$y,$e1,$e2);
    my $count = 0;
    my $width = 32; # 64;
    my $htm = "<table>\n";
    my $wrap = 0;
    for ($x = 0xd800; $x <= 0xdbff; $x++) {
        for ($y = 0xdc00; $y <= 0xdfff; $y++) {
            $count++;
            $e1 = sprintf("&#%u;",$x);
            $e2 = sprintf("&#%u;",$y);
            if ($wrap == 0) {
                $htm .= "<tr>\n";
            }
            $htm .= "<td>$e1$e2</td>\n";
            $wrap++;
            if ($wrap >= $width) {
                $wrap = 0;
                $htm .= "</tr>\n";
            }
        }
    }
    $htm .= "</table>\n";
    $x = get_nn($count);    # just add the comas
    prt("Generated $x surrogate pairs...\n");
    write2file($htm,$out_file);
    prt("html written to $out_file\n");
}

Then when I loaded the tidied file into a browser - takes a long time to load - many minutes! Is a 16MB html file - found zillions of them that do not correspond to a glyph in my Windows 10 system... just get an open squarish box... but also found they produce some very, VERY, interesting and complicated glyphs... some very ccolorful... seems all the emoticons, including your test sample, many I had never seen before... of course lots of what looks like Chinese characters...

I have copied the tidied file to - http://geoffair.org/tmp/surrogates.html - but don't blame me if it blows up your browser ;=()

And seem to get many more glyphs shown in linux, than in windows... in fact there seems no blanks like a see in Windows, but they do not look right... each is multiple??? Something wrong there... seems not being interpretated as utf-8 at all???

Anyway, interesting, but all just for fun...

@nixypanda
Copy link
Author

nixypanda commented Feb 1, 2017

@geoffmcl Awesome. I was doing something similar

chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (ys, zs) = splitAt n xs in  ys : chunks n zs

allPairs :: [String]
allPairs =
  let
    showOne x = "&#" ++ show x ++ ";"
  in
    (\x y -> showOne x ++ showOne y) <$> [0xd800..0xdbff] <*> [0xdc00..0xdfff]

main :: IO ()
main = putStr . unlines . map (unwords) . chunks 100 $ allPairs

Before this I was trying to input some random values and logging tidy's output and checking the generated utf-8, seemed correct for those cases. Will need to check more. will try to validate output with some other langs.

P.S. My browser is having a hard time displaying all the pairs. haha....

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 2, 2017

@jckdrpr, wow, had to read up on "What is Haskell?" ;=)) Looks interesting, and may give it a try...

However, in dealing with the error cases, found several bugs in my implementation, and have now pushed a fix to the surrogates branch.

Now, with this fix, if Tidy encounters an invalid surrogate pair, Tidy just silently generates the invalid utf-8, as before... no warning/error yet issued!

So, the first question is what error message should be output? And then should this be a warning, or an error? At this moment I tend to think error! As you may know tidy will not produce an output if an error condition, unless you add --force-output yes. Tidy should not output invalid utf-8!

And then what to do about it? I note in a utf-8 decode error - https://github.com/htacg/tidy-html5/blob/master/src/utf8.c#L447 - Tidy will substitute code point 0xFFFD. Is that also a good idea for this bad surrogate pairs case, together with a warning message of course?

Really seek your ideas on this, and comments from other, so we can settle on a direction... thanks...

@nixypanda
Copy link
Author

@geoffmcl Apologies for the late reply. I feel like a lot of systems depend on tidy so directly creating an error would not be ideal in my opinion. Probably creating a warning and ignoring the invalid pair is one approach that can be considered.

P.S. You are going to try Haskell. Get ready to meet your favourite language.... :)

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 8, 2017

@jckdrpr thanks for the feedback, and no apology needed! A late reply is much better than no reply...

... directly creating an error would not be ideal ...

But that is exactly what tidy does now! It generates invalid utf-8! That should not happen!

And looking more into the code point U+FFFD, bf bf bd, the so called replacement character, is exactly what I see browsers do when trying to display an invalid utf-8 - they display a black diamond with a white question mark - ie U+FFFD!

Remember, my fixed surrogates branch tidy now correctly generates over 1 million surrogate paris perfectly correctly, at least generates the valid 4-byte utf-8 sequences, so this replacement character, basically a question mark, is only in the following limited circumstances -

  1. have a valid leading surrogate entity, and a valid trailing entity, but is one of the 32 out-of-range cases - add 2 U+FFFD, replacing both entities.
  2. have a valid leading surrogate entity, but an invalid trailing entity - just add 1 U+FFFD, replacing the leading entity, and let tidy deal with what ever the second is.
  3. have a trailing surrogate entity, with no leading - just add 1 U+FFFD, replacing that entity.

And in each case, output a warning message, advising the problem encountered, 1, 2, 3. Still to decide if that should be a warning, or an error.

The warning/error messages could be something like -

  1. Have out-of-range surrogate pair U+XXXX:U+XXXX, replaced with two U+FFFD values.
  2. Leading (High) surrogate pair U+XXXX, with no trailing (Low) entity, replaced with U+FFFD
  3. Trailing (Low) surrogate pair U+XXXX, with no leading (High) entity, replaced with U+FFFD

So then this issue as written - Generates invalid utf-8 (Surrogate pairs) - would be solved... and tidy would never output invalid utf-8... at least in this surrogate pair case...

Reconsidering, what do you, and others think? Thanks...

P.S. Unfortunately meeting my favorite language, Haskell will have to wait a while ;=))

@balthisar
Copy link
Member

I'd recommend marking them as warnings, because Tidy is generating valid HTML with the U+FFFD character.

@nixypanda
Copy link
Author

@geoffmcl I completely agree with your approach. I really appreciate that you kept me updated on all the changes.

@geoffmcl
Copy link
Contributor

geoffmcl commented Feb 9, 2017

@balthisar, @jckdrpr thanks for the feedback, and agree, if we are going to use the substitute character, U+FFFD, then yes Tidy will be outputing valid utf-8, so this should be a warning messages only, advising of the problem and substitution done...

I have modified my new GetSurrogatePair() service to return an enumerated SPStatus value...

  1. SP_ok means all done and combined character is good to go - combine char added - works for over 1 million valid surrogate pairs.
  2. SP_failed means found two entities, but is one of the 32 out-of-range pairs. Will eat the two value, warn, and add 1 (or 2) substitute chars.
  3. SP_error means could not get the second, so only eat first, use sub. char, and warn failed to get the second of a pair.

And added the case where a trailing pair value found with no leading. Warn and use the substitute U+FFFD. So this now covers the 3 warnings given above.

Then add 3 new tidyErrorCodes enumerations -

  1. BAD_SURROGATE_PAIR, = got lead, tail, but have out of range pair
  2. BAD_SURROGATE_TAIL, = got lead, but failed to get valid trailing
  3. BAD_SURROGATE_LEAD, = got tail, but no lead

Now came unstuck when adding the warning messages. @balthisar seems I have not been following too well all your language work. Started a README/MESSAGE.md, and understood some things, but then saw new stuff that do not yet understand.

I hope you can help fill out my MESSAGE.md, describing, if possible each change that is needed to establish these 3 new messages. Don't worry about the code to do the actual message formatting yet. I will handle that...

So I have pushed my WIP to the surrogates branch. It should all work, but without the actual warning messages being output. Need to test and review all my code logic, especially for the error cases... that is 1. good pair, but out of range, 2. good lead, but some rubbish following, and 3. trailing without a leading...

Well warnings are output, but only using fprintf(stderr,"format"...); mechanism... search for SP WARNING:... would really appreciate if the error conditions could all be tested... realising this is WIP... thanks...

Also have not solved how to output 2 sub. chars in case 1.

My cases:

  1. in_483.html - &#55357;&#56845; - valid pair - no warning
  2. in_483-1.html - &#55359;&#57342; - out of range pair
  3. in_483-2.html - &#55357;no tail - no trailing pair
  4. in_483-3.html - no lead&#56845; - no leading pair

PS: These test files are (temporarily) in my repo - https://github.com/geoffmcl/tidy-test, in the test/input5 folder... each run with -o out_483-1.html --show-body-only yes --show-info no in_483-1.html, view the messages, and load the out in a browser for viewing, and check that the out contains valid utf-8 using the messy tools from this repo - https://github.com/geoffmcl/utf8-test

@balthisar
Copy link
Member

Started a README/MESSAGE.md, and understood some things, but then saw new stuff that do not yet understand.

I'll have a look at fleshing it out.

Don't worry about the code to do the actual message formatting yet. I will handle that...

I may pull your branch and add something anyway, just to be sure that my instructions work! I'm assuming you're going to use of the existing message output functions.

@geoffmcl
Copy link
Contributor

@balthisar thanks for the comment...

I'll have a look at fleshing it out.

That would be greatly appreciated...

...use of the existing message output functions.

Yes, absolutely!

I have started to look a little deeper. We are in decoding an entity, and there is an existing TY_(ReportEntityError)(). but this does not seem very suitable. This has things like UNKNOWN_ENTITY, which will report like unescaped & or unknown entity "&rubbish;", and a few others like this, but as I say, do not seem particulary good for these surrogate pairs!

As I have already added, and envisage 3 messages like, although would appreciate any thoughts on the exact wording -

{ BAD_SURROGATE_PAIR,  0,  "Have out-of-range surrogate pair U+%s:U+%s, replaced with 2 U+FFFD values."}, /* warning */
{ BAD_SURROGATE_TAIL,  0,  "Leading (High) surrogate pair U+%s, with no trailing (Low) entity, replaced with U+FFFD." }, /* warning */
{ BAD_SURROGATE_LEAD,  0,  "Trailing (Low) surrogate pair U+%s, with no leading (High) entity, replaced with U+FFFD." }, /* warning */

And as already reported, the replaced with 2 is not yet a reality...

You can already see I want to be quite explicit and show what the problem with the pair really is - 1. bad pair, 2. bad trailing, 3. bad leading - and show the substitution. Maybe I am being too, just TOO ambitious?

And maybe actually reporting using code points, as the above suggests, is also too much? Up until I started doing this issue, I too was quite confused by code points - what is that? But ok, I now have a much better idea, and they do describe the situation very precisely... once you understand...

Maybe we would need a new service like say TY_(ReportSurrogatePairError)(), but again was hoping to avoid that, and use one of the existing report error services, like TY_(ReportWarning)()...

But on the other hand, we do already have more that a dozen ReportXXXX, so maybe yet another, special for surrogate pairs, is not really a bad thing...

Starting to sink in the sea of options, and thus stalled! ;=() Any comments appreciated...

@balthisar
Copy link
Member

@geoffmcl, go ahead and add another ReportXXXX if you like. I actually have hopes of completely overhauling the messaging system in 5.5.x (more discussion on that later) in order to address the current inflexibility of having to write a new output function, and to address some of the feature requests like #476 and #477 in a flexible way, and especially to address the horrible kludge that's tidyErrorFilterKeysStruct (my kludge, so I take responsibility for it).

@geoffmcl
Copy link
Contributor

@balthisar ok, added a new ReportSurrogateError service... and replaced all my previous fprintf messages with this, and pushed... seems to be working fine...

If someone can test and confirm, will merge this surrogates branch, and close this is as fixed... thanks...

@geoffmcl geoffmcl mentioned this issue Feb 11, 2017
@geoffmcl
Copy link
Contributor

Added PR #490...

@geoffmcl
Copy link
Contributor

@balthisar thanks for updating the MESSAGES.md... this looks good...

Now that you are back I must get used to doing a git pull, a lot, otherwise you get a nasty message when you try to push ;=))

You will note that in the new service, using %04X in the message format string easily handled outputting the U+XXXX code point, so no special effort needed here...

Hope you get a chance to test it... I have, using my 4 test files, and it seems to work fine...

@balthisar
Copy link
Member

Testing on macOS was perfectly fine, so I've merge your PR, and will be happy to close this one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants