Word filtered html doesn't convert accents to utf8 #512

pdo2641 opened this issue Mar 11, 2017 · 2 comments


pdo2641 commented Mar 11, 2017

Saved a Word 2016 document as filtered HTML. Saved it as UTF-8 and also tried Latin-1.

Using the following tidy config:
anchor-as-name: yes
clean: YES
bare: YES

drop-proprietary-attributes: YES
word-2000: No
wrap: 144
vertical-space: yes
input-encoding: latin1 (tried 1252 and utf8 as well)

In every case, it would not convert the Word HTML ö, ñ, and various other accented characters typically used in German and Spanish into anything other than garbage or a boxed question mark. I've tried variations of input-encoding, output-encoding, clean, and bare without improvement.

Please advise


@pdo2641 thanks for the issue, but I am not sure exactly what you want... for sure I am no expert on character encoding issues, but I have picked up a few things along the way...

You have used an o umlaut, ö, 0xf6, 246, and ñ, 0xf1, 241, and if I use a config of --input-encoding latin1 --output-encoding utf8, those two will be converted to utf-8, namely 0xc3 0xb6 and 0xc3 0xb1, resp., and in a browser they are again correctly displayed as ö and ñ, so where is the problem?
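As a quick cross-check of those byte values, here is a minimal Python sketch. It is not part of tidy and does not call it; the helper name `latin1_to_utf8` is made up for illustration, and it only shows the plain Latin-1 to UTF-8 re-encoding described above.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that are assumed to be Latin-1 as UTF-8."""
    # Every byte value is a valid Latin-1 code point, so this decode cannot fail.
    return data.decode("latin-1").encode("utf-8")


# 0xf6 is the Latin-1 o umlaut and 0xf1 is the Latin-1 n tilde.
for raw in (b"\xf6", b"\xf1"):
    # Print hex only, so this also runs on consoles that cannot show the characters.
    print(raw.hex(), "->", latin1_to_utf8(raw).hex())
```

If the input bytes are not really Latin-1 (for example, the file was already saved as UTF-8), the same call will still succeed but produce double-encoded output, so the declared input encoding has to match the file.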

Sure, in my code page 437 console, they are only shown as ?, or some sort of garbage (a sequence of high-bit characters), since my console does not support utf-8, even if I run chcp 65001. But they are correctly displayed in good editors and browsers, even in the very dumb Notepad, as the characters they are...
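The console garbage can be modelled the same way. This sketch assumes the console simply interprets the output bytes as code page 437; `shown_in_console` is a hypothetical helper that only mimics that decoding step, not the real Windows console behaviour.

```python
def shown_in_console(data: bytes, console_codec: str = "cp437") -> str:
    """Return the text seen if `data` is decoded with the console's codec."""
    # errors="replace" is only a safety net for codecs with unmapped bytes.
    return data.decode(console_codec, errors="replace")


utf8_o_umlaut = "\u00f6".encode("utf-8")            # the two bytes 0xc3 0xb6
garbled = shown_in_console(utf8_o_umlaut)           # two unrelated cp437 glyphs
intact = shown_in_console(utf8_o_umlaut, "utf-8")   # decoded with the matching codec
```

The bytes are identical in both calls; only the codec used to display them differs, which is why the same file can look broken in a console and fine in a browser.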

This seems to have nothing to do with a Word 2016 document, or Word filtered html... especially since your config shows word-2000: No... This would seem true for any html containing latin1 characters... like the following French accented characters -

<li>cédille Ç,</li>
<li>accent aigu é,</li>
<li>accent circonflexe â, ê, î, ô, û,</li>
<li>accent grave à, è, ù</li>
<li>accent tréma ë, ï, ü.</li>

Processing this with the above config, --input-encoding latin1 --output-encoding utf8, the output document is displayed the same in a browser, but each character has been converted to utf-8. The latin1 Ç, 0xC7, has been converted to utf-8, 0xC3 0x87, and so on for each of the others...
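To check this kind of conversion on real bytes rather than by eye, one option is to dump the non-ASCII byte runs before and after. The following is a rough sketch under that assumption; `non_ascii_runs` is an illustrative name, and the sample string stands in for the list above rather than for actual tidy output.

```python
import re


def non_ascii_runs(data: bytes) -> list[str]:
    """Return each run of bytes >= 0x80 in `data` as a hex string."""
    return [run.hex() for run in re.findall(rb"[\x80-\xff]+", data)]


sample = "c\u00e9dille \u00c7"                      # e acute, then C cedilla
before = non_ascii_runs(sample.encode("latin-1"))   # one byte per accented character
after = non_ascii_runs(sample.encode("utf-8"))      # two bytes per accented character
print(before, after)
```

Single high bytes such as `c7` suggest a Latin-1 style file, while pairs starting with `c3` suggest UTF-8, which can help decide what to pass as the input encoding.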

Maybe I am misunderstanding something here... please explain more... thanks...

@geoffmcl geoffmcl added this to the 5.5 milestone Mar 12, 2017

No comments for a long time... maybe question asked and answered... so closing this...

Please feel free to re-open, or file a new issue... thanks...
