Generates invalid utf-8 (Surrogate pairs) #483
Comments
@jckdrpr thanks for the issue... I am afraid I get very confused by character encoding, so am perhaps not the best person to comment, but given an input of I can see in the debugger, on the input decoding, after seeing the Using those entities as inputs, my browser displays them as two black diamonds with a white But what exactly do you expect to be the utf-8 output given those two numeric entities as input? Maybe if I understand this, I could comment more... and maybe others can offer more insight... Meantime marking this as |
Right now I have a wrapper around Here is a detailed description. UTF-8 description I check a given utf-8 (with aforementioned invalidities) byte sequence (generated by According to utf-8
NOTE: The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. Detecting invalid utf-8 outputs i.e. this regex Fixing the problem
Hopefully you found this description informative. Here are a few links that I found helpful while I was looking into it. http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm P.S. I hate Emojis BTW! lol. |
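The regex mentioned in this comment did not survive in the thread, so the following is only a sketch of the detection idea, not the original pattern. It assumes the check runs on raw output bytes: a surrogate in U+D800 to U+DFFF, if wrongly encoded, always appears as the byte 0xED followed by a byte in 0xA0..0xBF and one more continuation byte, and well-formed UTF-8 never contains that sequence. The names `SURROGATE_BYTES` and `find_encoded_surrogates` are made up for illustration.

```python
import re

# Assumption: we scan raw bytes. A 3-byte "encoded surrogate" is
# ED, then A0..BF, then 80..BF. In well-formed UTF-8 the byte ED is
# only ever followed by 80..9F, so this pattern flags invalid output.
SURROGATE_BYTES = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_encoded_surrogates(data):
    """Return the byte offsets where an encoded surrogate starts."""
    return [m.start() for m in SURROGATE_BYTES.finditer(data)]


# The byte array from the original report.
reported = bytes([237, 160, 189, 237, 184, 141])
print(find_encoded_surrogates(reported))

# U+D7FF is the last code point before the surrogate range and is fine.
print(find_encoded_surrogates("\ud7ff".encode("utf-8")))
```

A check like this could sit in a wrapper around tidy's output while the underlying bug is open.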
@jckdrpr thanks for the full explanation, which I sort of understood, maybe... but still stuck on fixing the problem! As mentioned, during the decoding of the numeric entity, begins So this starts in ParseEntity() - https://github.com/htacg/tidy-html5/blob/master/src/lexer.c#L1063 - so it decodes the first and has
Now it backs up the lexer, and replaces it with the sscanf value, Now I give all this explanation to try to understand at what point should tidy start looking for these Is it just sufficient to range check the Now I can see that if I set the value, c above, to The problem seems tidy operates in a linear like fashion. Yes, after adding 3 utf-8 bytes to the lexer buffer, could check (a) is this 3 byte utf-8 - it would always be 3, in this surrogate case, right?, (b) see this is in the bad range with As you can see I know tidy's code very well, but it is just getting my head around looking for Sort of out of time tonight... will sleep on this... but really look forward to your help on this... it is for sure feeling like a big bad BUG! |
Where to look for surrogates? Are we always looking for a pair? Would always be 3, in this surrogate case, right?
EDIT: I have confused low and high so to clarify. |
@jckdrpr ok, seemed a good idea to look for
Now in that service,
I say Now this code point is stored in an output print line buffer, a
So no, I am starting to believe these The idea would be, when we reach the This looks like a good place - https://github.com/htacg/tidy-html5/blob/master/src/lexer.c#L1162 - We already have From your comments, and here I think I read that these code points are divided into leading or "high surrogates" This might be workable, without too much change... need to experiment with this... probably in a branch But am now convinced it is a bug, and marking this so... Tidy should try hard to not output invalid utf-8... Naturally would appreciate any help with this... comments, patches, PR... And really thanks for my ongoing education in this weird charset world... always fun... |
@geoffmcl Would really love to help on this issue but I haven't done any C for a (long) while now. :(. Would appreciate if you update this issue with more details on what you are working on. I will try help as best as I can and will do some experiments on my end too, but can't promise you anything. |
@jckdrpr started to try to code this but ran into my first big bump ;=() naming conventions! The pair you gave The wiki I pointed to says "These code points are divided into leading or "high surrogates" (D800–DBFF) and trailing or "low surrogates" (DC00–DFFF)." And then you stated "surrogates are in range U+D800 to U+DBFF (low) and U+DC00 to U+DFFF (high)", which seems correct in order, but opposite in conventional naming low/high!!! So the wiki states the leading value, called the "high surrogates", is in fact the It seems a sort of reversal in naming convention... that is not named according to the range, but according to the position, in the pair... Who, which is right? Specifically which range should be the first found? Which the second? Assume the wiki is correct... Then tidy source has a utf8.c, which has 2 services, which like you, seemed named after the range values, rather than the position - quite confusing...
If we go with the wiki positional naming, ie
Or could keep these names so long as one understands the And that naming works for the other important services offered -
Note, in that But somehow still feel should go for Still undecided, but would appreciate comments from others who maybe understand this more... Coding stalled until I get all this completely clear in my mind... help... thanks... PS: Just read yours on sending this... Thanks for your offer to help... This is a patch I started - not complete - WIP - but shows the idea, if we have the first of a surrogate, plough on to get the second...
But stopped after getting confused about high/low, low/high, and decided this should be pulled out as a new service, like Everything will be fixed if we can get the combined value into the |
The wiki one is correct. Seems like I confused low and high. For reference UTF-16 rfc which confirms what wiki says.
I like you would prefer leading-trailing over high-low. P.S. Also updated the comment. |
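With the naming settled (leading, or "high", surrogates in U+D800 to U+DBFF; trailing, or "low", surrogates in U+DC00 to U+DFFF), the combination step can be sketched as below. This is a minimal illustration of the standard UTF-16 arithmetic, not tidy's code; the function names are made up for the example.

```python
def is_leading(cp):
    """True for a leading ("high") surrogate, U+D800..U+DBFF."""
    return 0xD800 <= cp <= 0xDBFF


def is_trailing(cp):
    """True for a trailing ("low") surrogate, U+DC00..U+DFFF."""
    return 0xDC00 <= cp <= 0xDFFF


def combine_surrogates(leading, trailing):
    """Combine a leading/trailing pair into one supplementary code point."""
    if not (is_leading(leading) and is_trailing(trailing)):
        raise ValueError("not a leading/trailing surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)


# The pair from the original report.
cp = combine_surrogates(0xD83D, 0xDE0D)
print(hex(cp), chr(cp).encode("utf-8").hex(" "))
```

For the reported pair this gives U+1F60D, whose valid UTF-8 form is the four bytes f0 9f 98 8d, which answers the earlier question about what output to expect.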
Only deals with a successful case. TODO: Maybe add a warning/error if the trailing surrogate not found, and maybe consider substituting to avoid invalid utf-8 output.
@jckdrpr thanks for clarifying... Yes, while maybe changing the Tidy function names might be a good idea, I decided to press on using the current slightly confusing names... Have now pushed what I think is a fix to a
Now loading the output into a browser I see the funny emoticon ;=)) At this stage I have done nothing about when a leading entity is found, and the trailing entity fails, but left a TODO: note in the code... Hope you get a chance to test this |
@jckdrpr just for fun I wrote a short perl script to generate all 1,048,576
# range
# Leading: U+D800 to U+DBFF (High - low range) and
# Trailing U+DC00 to U+DFFF (Low - high range)
# (1,024 × 1,024 = 1,048,576 code points)
sub gen_surrogates() {
    my ($x,$y,$e1,$e2);
    my $count = 0;
    my $width = 32; # 64;
    my $htm = "<table>\n";
    my $wrap = 0;
    for ($x = 0xd800; $x <= 0xdbff; $x++) {
        for ($y = 0xdc00; $y <= 0xdfff; $y++) {
            $count++;
            $e1 = sprintf("&#%u;",$x);
            $e2 = sprintf("&#%u;",$y);
            if ($wrap == 0) {
                $htm .= "<tr>\n";
            }
            $htm .= "<td>$e1$e2</td>\n";
            $wrap++;
            if ($wrap >= $width) {
                $wrap = 0;
                $htm .= "</tr>\n";
            }
        }
    }
    $htm .= "</table>\n";
    $x = get_nn($count); # just add the commas
    prt("Generated $x surrogate pairs...\n");
    write2file($htm,$out_file);
    prt("html written to $out_file\n");
}
Then when I loaded the tidied file into a browser - takes a long time to load - many minutes! Is a 16MB html file - found zillions of them that do not correspond to a glyph in my Windows 10 system... just get an open squarish box... but also found they produce some very, VERY, interesting and complicated glyphs... some very colorful... seems all the emoticons, including your test sample, many I had never seen before... of course lots of what looks like Chinese characters... I have copied the tidied file to - http://geoffair.org/tmp/surrogates.html - but don't blame me if it blows up your browser ;=() And seem to get many more glyphs shown in linux, than in windows... in fact there seems no Anyway, interesting, but all just for fun... |
@geoffmcl Awesome. I was doing something similar
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (ys, zs) = splitAt n xs in ys : chunks n zs

allPairs :: [String]
allPairs =
  let
    showOne x = "&#" ++ show x ++ ";"
  in
    (\x y -> showOne x ++ showOne y) <$> [0xd800..0xdbff] <*> [0xdc00..0xdfff]

main :: IO ()
main = putStr . unlines . map (unwords) . chunks 100 $ allPairs
Before this I was trying to input some random values and logging tidy's output and checking the generated utf-8, seemed correct for those cases. Will need to check more. will try to validate output with some other langs. P.S. My browser is having a hard time displaying all the pairs. haha.... |
@jckdrpr, wow, had to read up on "What is Haskell?" ;=)) Looks interesting, and may give it a try... However, in dealing with the error cases, found several bugs in my implementation, and have now pushed a fix to the Now, with this fix, if Tidy encounters an invalid So, the first question is what error message should be output? And then should this be a And then what to do about it? I note in a utf-8 decode error - https://github.com/htacg/tidy-html5/blob/master/src/utf8.c#L447 - Tidy will substitute code point Really seek your ideas on this, and comments from others, so we can settle on a direction... thanks... |
@geoffmcl Apologies for the late reply. I feel like a lot of systems depend on tidy so directly creating an error would not be ideal in my opinion. Probably creating a warning and ignoring the invalid pair is one approach that can be considered. P.S. You are going to try Haskell. Get ready to meet your favourite language.... :) |
@jckdrpr thanks for the feedback, and no apology needed! A late reply is much better than no reply...
But that is exactly what tidy does now! It generates invalid utf-8! That should not happen! And looking more into the code point Remember, my fixed
And in each case, output a warning message, advising the problem encountered, 1, 2, 3. Still to decide if that should be a warning, or an error. The warning/error messages could be something like -
So then this issue as written - Reconsidering, what do you, and others think? Thanks... P.S. Unfortunately meeting my favorite language, |
I'd recommend marking them as warnings, because Tidy is generating valid HTML with the U+FFFD character. |
@geoffmcl I completely agree with your approach. I really appreciate that you kept me updated on all the changes. |
@balthisar, @jckdrpr thanks for the feedback, and agree, if we are going to use the substitute character, U+FFFD, then yes Tidy will be outputting valid utf-8, so this should be a warning message only, advising of the problem and substitution done... I have modified my new
And added the case where a trailing pair value found with no leading. Warn and use the substitute U+FFFD. So this now covers the 3 warnings given above. Then add 3 new
Now came unstuck when adding the warning messages. @balthisar seems I have not been following too well all your language work. Started a README/MESSAGE.md, and understood some things, but then saw new stuff that I do not yet understand. I hope you can help fill out my MESSAGE.md, describing, if possible each change that is needed to establish these 3 new messages. Don't worry about the code to do the actual message formatting yet. I will handle that... So I have pushed my WIP to the Well warnings are output, but only using Also have not solved how to output 2 sub. chars in case 1. My cases:
PS: These test files are (temporarily) in my repo - https://github.com/geoffmcl/tidy-test, in the test/input5 folder... each run with |
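The warn-and-substitute policy discussed here can be sketched roughly as follows. This is an illustration only, not tidy's implementation: it works on a list of already-decoded numeric entity values, the function name `resolve_surrogates` and the warning wording are made up, and how the outcomes map onto the three warning cases in the thread is my own reading.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, the substitute character


def resolve_surrogates(code_points):
    """Combine valid pairs; replace stray surrogates with U+FFFD.

    Returns (output code points, list of warning strings).
    """
    out, warnings = [], []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        if 0xD800 <= cp <= 0xDBFF:  # leading surrogate
            nxt = code_points[i + 1] if i + 1 < len(code_points) else None
            if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                # Valid pair: emit the single combined code point.
                out.append(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00))
                i += 2
                continue
            warnings.append(f"leading surrogate U+{cp:04X} has no trailing surrogate")
            out.append(REPLACEMENT)
        elif 0xDC00 <= cp <= 0xDFFF:  # trailing surrogate with no leading one
            warnings.append(f"trailing surrogate U+{cp:04X} has no leading surrogate")
            out.append(REPLACEMENT)
        else:
            out.append(cp)
        i += 1
    return out, warnings


out, warns = resolve_surrogates([0xD83D, 0x41, 0xDE0D, 0xD83D, 0xDE0D])
print([f"U+{cp:04X}" for cp in out])
print(warns)
```

Because every stray surrogate becomes U+FFFD, the output list can always be encoded as valid utf-8, which is the argument made above for treating these as warnings rather than errors.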
I'll have a look at fleshing it out.
I may pull your branch and add something anyway, just to be sure that my instructions work! I'm assuming you're going to use one of the existing message output functions. |
@balthisar thanks for the comment...
That would be greatly appreciated...
Yes, absolutely! I have started to look a little deeper. We are in decoding an entity, and there is an existing As I have already added, and envisage 3 messages like, although would appreciate any thoughts on the exact wording -
And as already reported, the You can already see I want to be quite explicit and show what the problem with the pair really is - 1. bad pair, 2. bad trailing, 3. bad leading - and show the substitution. Maybe I am being too, just TOO ambitious? And maybe actually reporting using Maybe we would need a new service like say But on the other hand, we do already have more than a dozen Starting to sink in the sea of options, and thus stalled! ;=() Any comments appreciated... |
@geoffmcl, go ahead and add another |
@balthisar ok, added a new If someone can test and confirm, will merge this |
Added PR #490... |
@balthisar thanks for updating the MESSAGES.md... this looks good... Now that you are back I must get used to doing a You will note that in the new service, using Hope you get a chance to test it... I have, using my 4 test files, and it seems to work fine... |
Testing on macOS was perfectly fine, so I've merged your PR, and will be happy to close this one! |
Hi, I was using tidy-html5 with utf-8 with the following params
and when I try to pass the input:
��
I get back HTML with header etc... But this particular input is encoded as the byte array [237 160 189 237 184 141], which to me is not valid utf-8. In hex this is ed a0 bd ed b8 8d, which translates to code points U+D83D and U+DE0D, both of which lie in the surrogate pairs range U+D800 to U+DFFF.
Can someone point out whether there is something I am doing wrong, or is it a bug?
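The byte arithmetic in this report can be checked with a few lines. The sketch below decodes each 3-byte group by hand, with no validity checks (the helper `decode_3byte` is made up for the example), and then asks Python's strict UTF-8 decoder whether it accepts the same bytes. Python's decoder is used here only as an independent reference, not as a statement about tidy.

```python
def decode_3byte(seq):
    """Decode one 3-byte UTF-8-style sequence without validity checks."""
    b0, b1, b2 = seq
    # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_strict_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = bytes([237, 160, 189, 237, 184, 141])
print(raw.hex(" "))
print([f"U+{decode_3byte(raw[i:i + 3]):04X}" for i in (0, 3)])
print("strictly valid utf-8:", is_strict_utf8(raw))
```

Both groups decode into the surrogate range, and a strict decoder rejects the sequence, which supports reading this as a bug rather than a usage error.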