Umlauts/special characters not converted to correct html entities #936

Closed
Feathered-Serpent opened this issue Apr 5, 2021 · 4 comments

@Feathered-Serpent

Hi everyone,

Let's say I have a very easy HTML file with this content:

<html>
<head>
<title>Test</title>
</head>
<body>
ä ö ü ß á é í ó ú
</body>
</html>

When I save this sample HTML code as an ANSI-encoded file with Notepad++, my browser displays these characters correctly.

But how do I tell Tidy to convert that to &auml; &ouml; &uuml; &szlig; &aacute; &eacute; &iacute; &oacute; &uacute; and output a utf8 file? I've tried multiple char encoding options, but only when I use --char-encoding win1252 (so for both input and output) did it give me what I wanted (with the risk that some characters in the input file cannot be represented in win1252?). If I run tidy without any additional options, it gives me this:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 6 column 1 - Warning: replacing invalid UTF-8 bytes (char. code U+0004)
line 6 column 3 - Warning: replacing invalid UTF-8 bytes (char. code U+0006)
line 6 column 5 - Warning: replacing invalid UTF-8 bytes (char. code U+0000)
line 6 column 7 - Warning: replacing invalid UTF-8 bytes (char. code U+001F)
line 6 column 9 - Warning: replacing invalid UTF-8 bytes (char. code U+0001)
line 6 column 11 - Warning: replacing invalid UTF-8 bytes (char. code U+0009)
line 6 column 13 - Warning: replacing invalid UTF-8 bytes (char. code U+000D)
line 6 column 15 - Warning: replacing invalid UTF-8 bytes (char. code U+0003)
line 6 column 17 - Warning: replacing invalid UTF-8 bytes (char. code U+0002)
Info: Document content looks like HTML5
Tidy found 10 warnings and 0 errors!

Within the browser, the characters then are displayed like this: � � � � � � � � �

Do I have to use win1252 character encoding for all my HTML files?

@geoffmcl
Contributor

geoffmcl commented Apr 5, 2021

@Feathered-Serpent, thank you for the issue... but what is it exactly?

Yes, character encoding can be a difficult topic to easily explain ;=(( And I do not put myself in the expert range, but I know some...

It seems what we have here is a mixture of -

  1. latin1, win1252, ISO 8859-1 - a single byte, i.e. 8-bit encoding
  2. utf-8 - the current, very common, multibyte character encoding - 1 to 4 bytes.

For the first 128 characters, that is, code points U+0000 to U+007F, they are the same... can be referred to as 7-bit ASCII... After that, all bets are off ;=))

Some brief history...

Sorry to bore you, if you already know all this...

Computers, being bits-and-bytes machines, adopted latin1, sometimes referred to as extended ASCII, early in their development, and it got widespread use on the fledgling internet... which meant basically only western European languages could be displayed... just 256 chars... 1 byte... ugh!

With the advent of so-called Unicode, circa the 1990s, utf-8 became today's de facto standard, and, as I wiki'ed recently, it accounts for 97% of all web pages... it is certainly the way to go...

Interestingly enough, tidy's default in/out encoding was latin1, back in the 2000 release. It is now, for sure, utf-8... not sure of the exact change-over date - maybe someone remembers, or can research it, and can remind us...

History over... back to the issue at hand ;=))

When you saved the file as ANSI, it converted the first character, ä, from its 2-byte utf-8 encoding, 0xC3 0xA4, to its 1-byte latin1 encoding, 0xE4... and likewise for the other 8 - 0xC3 0xB6 became 0xF6, etc, etc...
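To make the byte values above concrete, here is a minimal Python sketch that prints both encodings for each of the 9 characters from the sample. The helper name `byte_forms` is made up for this illustration; it only uses Python's built-in codecs and has nothing to do with tidy's own code.

```python
def byte_forms(ch: str) -> tuple[bytes, bytes]:
    """Return (utf-8 bytes, latin1 bytes) for a single character."""
    return ch.encode("utf-8"), ch.encode("latin-1")


if __name__ == "__main__":
    for ch in "ä ö ü ß á é í ó ú".split():
        utf8, latin1 = byte_forms(ch)
        # Show the bytes as hex so they can be compared with a hex dump of the file.
        print(ch, "utf-8:", utf8.hex(), "latin1:", latin1.hex())
```

Each character is two bytes in utf-8 and one byte in latin1, which is why a file saved in the single-byte form is no longer valid utf-8.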

And when tidy encounters this char, in its default utf-8 mode, it correctly outputs a warning, saying what it is doing, with what it found - replacing invalid UTF-8 bytes (char. code U+0004) - namely it replaced the 0xE4, with a 3-byte utf-8 error code, 0xEF 0xBF 0xBD, which is usually displayed as a black diamond, with a white ? inside... or �, or square outline, with ?, or ... depends...
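The same kind of replacement can be reproduced outside tidy. The sketch below uses Python's own utf-8 decoder with `errors="replace"`, which is an analogy and not tidy's implementation; the function name `decode_as_utf8` is an assumption for this example.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as utf-8, substituting U+FFFD for anything invalid."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = b"\xe4 \xf6"  # latin1 bytes for "ä ö", which are not valid utf-8
    text = decode_as_utf8(raw)
    print(text)                        # two replacement characters with a space between
    print(text.encode("utf-8").hex())  # each U+FFFD is written as ef bf bd
```

Once the replacement character has been written, the original byte is gone, so the damage cannot be undone from the output file alone.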

So, to answer your last question, Do I have to use win1252...?, yes, you must adjust, at least, tidy's input encoding to match what is in the file... using --input-encoding latin1, tidy had no warnings about these 9 1-byte characters, and output each in their utf-8 equivalent...
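The effect of declaring the right input encoding can be sketched in Python as a plain transcode: decode the bytes with the encoding the file was really saved in, then encode as utf-8. This only illustrates the idea; `transcode` is a hypothetical helper, and the default of cp1252 (Python's name for win1252) is an assumption about the input.

```python
def transcode(data: bytes, src: str = "cp1252", dst: str = "utf-8") -> bytes:
    """Re-encode bytes from the source encoding to the destination encoding."""
    return data.decode(src).encode(dst)


if __name__ == "__main__":
    ansi_bytes = b"\xe4 \xf6 \xfc \xdf"  # "ä ö ü ß" as saved in a single-byte encoding
    print(transcode(ansi_bytes).decode("utf-8"))
```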

Tidy does not have an option to convert characters to known entities... sorry... maybe there are other tools for this...
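As one possible "other tool", the conversion asked about can be sketched in a few lines of Python using the standard `html.entities.codepoint2name` table. The function name `to_named_entities` is made up for this sketch, and falling back to a numeric reference for characters without a named entity is a choice made here, not something taken from tidy.

```python
from html.entities import codepoint2name


def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with HTML entities, leaving ASCII untouched."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                         # plain ASCII stays as is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &auml;
        else:
            out.append(f"&#{cp};")                 # numeric fallback
    return "".join(out)


if __name__ == "__main__":
    print(to_named_entities("ä ö ü ß á é í ó ú"))
```

The result contains only 7-bit characters, so it can be saved in any ASCII-compatible encoding, utf-8 included.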

Tidy will preserve valid input entities, like &auml;, in the output, with --preserve-entities yes, in the config...

Due to the unknowns introduced by cut & paste, I added 2 samples to my test repo - this character encoding issue comes up now and then - in_936.html - a utf-8 example, and in_936-1.html - a latin1 equivalent. They should look equivalent when displayed in a browser...

Have I missed some point here? If yes, please explain... thanks...

At this moment can not see a problem in tidy... look forward to further feedback, comments, etc... thanks...

@Feathered-Serpent
Author

Thank you for your detailed explanation. One thing is wrong, though: when I use the option --char-encoding win1252, the example from my opening post is converted to this:

<html>
<head>
<title>Test</title>
</head>
<body>
&auml; &ouml; &uuml; &szlig; &aacute; &eacute; &iacute; &oacute; &uacute;
</body>
</html>

so Tidy does convert the single byte umlauts into the html entities automatically. And according to Notepad++ the resulting file has a UTF-8 encoding *scratches his head*

@geoffmcl
Contributor

geoffmcl commented Apr 6, 2021

@Feathered-Serpent, thank you for the further testing, and feedback...

Yes, using --char-encoding win1252, is quite different to using --char-encoding latin1, and I do not exactly know why! They share most of the same code points... but...
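For reference, the two encodings themselves only differ in a small range. The sketch below lists the byte values that Python's `latin-1` and `cp1252` codecs decode differently. It does not explain why tidy emits entities for one and not the other; it only shows where the code points diverge. The helper name `differing_bytes` is an assumption for this example.

```python
def differing_bytes() -> list[int]:
    """Return byte values that decode differently under latin-1 and cp1252."""
    result = []
    for b in range(256):
        raw = bytes([b])
        as_latin1 = raw.decode("latin-1")
        # cp1252 leaves a few bytes undefined; map those to U+FFFD for comparison.
        as_cp1252 = raw.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            result.append(b)
    return result


if __name__ == "__main__":
    diff = differing_bytes()
    print(f"{len(diff)} bytes differ, from 0x{diff[0]:02X} to 0x{diff[-1]:02X}")
```

All 9 characters in the sample lie above that range, so both encodings read them the same way.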

Wonders never cease! ;=))

Using my in_936-1.html sample... with one paragraph of 9 entities, and then a paragraph with 9 hi-bit, single byte, chars...

As you point out, with win1252 i/o, the single byte, hi-bit set, chars, are indeed output as entities! Tidy can also add <meta charset="us-ascii">, to the output <head>...

I think the UTF-8 encoding, reported by Notepad++, on the output, is also not wrong, in that the file only contains 7-bit, US ASCII, and, as advised, the first 128 characters are the same in the two very different encodings...
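That point can be checked directly: a file holding only 7-bit bytes decodes to the same text whichever of these encodings an editor guesses. A minimal sketch, with `is_pure_ascii` as a hypothetical helper name:

```python
def is_pure_ascii(data: bytes) -> bool:
    """True when every byte is in the 7-bit range 0x00 to 0x7F."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    output = b"&auml; &ouml; &uuml; &szlig;"  # entity output contains only ASCII bytes
    print(is_pure_ascii(output))
    # The same bytes give identical text under all three encodings.
    print(output.decode("utf-8") == output.decode("latin-1") == output.decode("cp1252"))
```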

And if latin1 i/o is used, the single byte, hi-bit chars are left exactly as is, and <meta charset="iso-8859-1"> can be added to the head... if you add --preserve-entities yes, then the first paragraph of entities is preserved...

Loading the output in Notepad++, it cannot display these 9 chars - displays an open square instead, and suggests the file is ANSI encoded... which in Notepad++ means the system's 8-bit Windows code page (typically win1252), rather than plain ASCII

I too scratch my head over this... ;=))

But we have still not arrived at a problem with tidy... strange behavior perhaps, maybe confusing, but nothing particularly wrong... that needs to be fixed... that I can see...

If not agreed, please explain what, where, why, when, etc... else, maybe close this issue... thanks...

@Feathered-Serpent
Author

Hmm, well, basically there is a workaround: using win1252 as the character encoding. So I had better use that one for HTML files where I know there are umlauts inside.
