Umlauts/special characters not converted to correct html entities #936
Comments
@Feathered-Serpent, thank you for the issue... but what is it exactly? Yes, It seems what we have here is a mixture of -
For the first 128 characters, that is, code points

Some brief history... Sorry to bore you if you already know all this...

Computers, being bits-and-bytes machines, adopted latin1, sometimes referred to as extended ASCII, early in their development, and it got widespread use on the fledgling internet... which meant basically only western European language displays were available... just 256 chars... 1 byte... ugh!

With the advent of the so called

Interesting enough,

History over... back to the issue at hand ;=))

When you saved the files as ASCII, it converted the first character,

And when

So, to answer your last question, Tidy does not have an option to convert characters to known

Tidy will preserve valid

Due to the unknowns introduced by

Have I missed some point here? If yes, please explain... thanks...

At this moment I cannot see a problem in
Thank you for your detailed explanation. One thing is not quite right, though: when I use the option --char-encoding win1252, the example from my opening post is converted to this:
@Feathered-Serpent, thank you for the further testing, and feedback...

Yes, using

Wonders never cease! ;=))

Using my in_936-1.html sample... with one paragraph of 9 entities, and then a paragraph with 9 hi-bit, single-byte chars...

As you point out, with

I think the UTF-8 encoding, reported by

And if

Loading the output in Notepad++, it cannot display these 9 chars (it displays an open square instead) and suggests the file is

I too
Hmm, well, basically there is a workaround: use win1252 as the character encoding. So I had better use that one for HTML files where I know there are umlauts inside.
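For anyone who wants the same result outside Tidy, here is a minimal Python sketch of the conversion asked about in this issue: single-byte win1252 input in, named entities out, written as UTF-8. It is not Tidy's implementation. The helper name `to_named_entities` is hypothetical, and the sample bytes assume the file really is Windows-1252.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace every non-ASCII character with a named HTML entity.

    Falls back to a numeric character reference when the standard
    library has no name for the code point.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 127:
            name = codepoint2name.get(cp)
            out.append(f"&{name};" if name else f"&#{cp};")
        else:
            out.append(ch)
    return "".join(out)


# Assumption: these are the bytes of an "ANSI" (Windows-1252) file.
raw = "ä ö ü ß á é í ó ú".encode("cp1252")

decoded = raw.decode("cp1252")           # interpret the bytes with the right encoding
entities = to_named_entities(decoded)    # now pure ASCII
utf8_bytes = entities.encode("utf-8")    # safe to save as UTF-8

print(entities)
```

Because the result contains only ASCII characters, the saved file is byte-for-byte identical whether it is labelled ASCII, win1252 or UTF-8, which sidesteps the encoding mismatch altogether.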
Hi everyone,
Let's say I have a very simple HTML file with this content:
When I save this sample HTML code as an ANSI-encoded file with Notepad++, my browser also displays these characters correctly.
But how do I tell Tidy to convert that to
&auml; &ouml; &uuml; &szlig; &aacute; &eacute; &iacute; &oacute; &uacute;
and output a UTF-8 file? I've tried multiple character encoding options, but only when I used --char-encoding win1252 (so for both input and output) did it give me what I wanted (with the risk that some characters in the input file cannot be represented in win1252?). If I run Tidy without any additional options, it gives me this:
Within the browser, the characters are then displayed like this:
� � � � � � � � �
Do I have to use win1252 character encoding for all my HTML files?
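The replacement characters can be reproduced without Tidy. The sketch below assumes that "ANSI" in Notepad++ means Windows-1252 (it is the system code page, so this depends on the machine). It shows that each of the nine characters is stored as a single byte above 0x7F, and that none of those bytes is valid on its own when read as UTF-8.

```python
text = "ä ö ü ß á é í ó ú"

# Assumption: this is what an "ANSI" save produces on a Western European Windows setup.
ansi_bytes = text.encode("cp1252")
print(ansi_bytes.hex(" "))

# Reading those bytes as UTF-8 fails for every non-ASCII byte, so each one
# is replaced by U+FFFD, the replacement character a browser shows.
as_utf8 = ansi_bytes.decode("utf-8", errors="replace")
print(as_utf8)

# Reading them with the encoding they were written in recovers the text.
as_cp1252 = ansi_bytes.decode("cp1252")
print(as_cp1252)
```

So the bytes themselves are fine; the damage comes from interpreting them with an encoding other than the one used to save the file.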