Wrong character encoding #863
Consider this page, specifically this line from its html:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
From my testing, tidy doesn't seem to respect that encoding. Instead, if I read the code correctly, src/clean.c:2316 forcibly replaces that "windows-1252" value with "utf-8".
The problem is that when tidy processes the above html page, the output is not valid utf-8. For example, the accented character near the string "des Mondes" gets destroyed (grep for it and you should see it).
tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252);
fixes the issue and I get valid utf-8 out with correct accents and all, but I can't hardcode that because now my code is incorrect for every other html page out there. I also can't get that information from the cleaned html anymore because tidy overwrites it.
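For anyone wanting to check this programmatically rather than by grepping for the broken accent, a small validator can be run over the buffer tidy hands back. This is only a sketch: the function name is_valid_utf8 is made up for illustration and is not part of tidy's API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of libtidy): returns true if buf[0..len)
   is well-formed UTF-8. Rejects overlong forms, surrogates, and code
   points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;        /* continuation bytes expected after c */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point legal for this length */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;   /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra)
            return false;    /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A lone windows-1252 byte such as 0xE9 or 0x96 fails this check, which matches the symptom described above when the input encoding is not set.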
I think one of these two things should happen:
I don't know if I'm doing something wrong in the way I call tidy, but after trying several options I can't get it to give me a correctly converted utf8 html string.
@KingDuckZ thank you for your issue...
First some simple
As you have shown, in your own code, you need to set the input encoding, but you must also set the output encoding, else it will default to utf8... which can have conversion problems... as you maybe point out...
See how tidy internally handles this, AdjustCharEncoding... note the default chosen pairs for
I had no problems with the online page you linked to, using a config of
With the latter got entities like
Does this solve your problem?
If not, please advise the character code(s) that are causing a problem, and the config used... maybe there is a bug here, but not seeing it yet...
Look forward to further feedback... thanks...
Hi @geoffmcl, thanks for replying. I'm a bit confused about what I should do. What I want to do is normalise my input so that my code only ever deals with utf8. It's ok if I have to use iconv for that, but in that case the problem is I need to tell iconv about the source charset and the destination charset. Destination is always utf8, so that's easy, but what about input? In my test, the meta charset tag from tidy always says utf8, even when that's not the case, because as you pointed out, tidy does not convert characters unless you explicitly ask it to do so. But it will change the meta tag always.
This is how I set up tidy, then if you look at line 108 if the user specified a charset explicitly then I apply it, otherwise I just go with tidy's default. The page linked in my first post only produces a correct utf8 output if that
I'm trying to figure out if I'm doing something wrong myself, it might be this is indeed the expected behaviour, in which case my question is how can I tell what the original encoding was so I can invoke iconv() correctly? Thanks.
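For reference, once the source charset is known, the iconv() step itself is short. The sketch below assumes a POSIX iconv and that the platform knows the charset name passed in; the helper name to_utf8 is invented for this example.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts in[0..in_len) from from_charset to UTF-8.
   Returns a malloc'd NUL-terminated string (caller frees), or NULL if the
   charset is unknown or the input cannot be converted. */
char *to_utf8(const char *from_charset, const char *in, size_t in_len)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t out_cap = in_len * 4 + 1;   /* generous bound on UTF-8 growth */
    char *out = malloc(out_cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)in;         /* iconv takes char **, input is not modified */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap - 1;     /* keep one byte for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

The open question in this thread is only where the from_charset argument comes from, since the tidied document no longer carries the original value.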
how can I tell what the original encoding was so I can invoke iconv() correctly?
It seems the more correct way is to first look for the encoding when downloading the file:
https://www.w3.org/International/questions/qa-headers-charset.en
In particular, it is important to note that the encoding declared in the HTTP header overrides all in-document encoding declarations in HTML and CSS files.
If there's no content-type: text/html; charset=xxx header, then look in the file for <meta charset="xxx"> or <meta http-equiv="Content-Type" content="text/html; charset=xxx">.
Geoff, the '--show-meta-change yes' option seems not quite right:
$ tidy --show-meta-change yes x.htm >| x.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 119 column 33 - Warning: replacing invalid UTF-8 bytes (char. code U+0096)
line 125 column 33 - Warning: replacing invalid UTF-8 bytes (char. code U+0096)
line 151 column 36 - Warning: replacing invalid UTF-8 bytes (char. code U+0009)
line 151 column 81 - Warning: replacing invalid UTF-8 bytes (char. code U+0009)
line 1254 column 61 - Warning: replacing invalid UTF-8 bytes (char. code U+000E)
line 2422 column 31 - Warning: replacing invalid UTF-8 bytes (char. code U+0001)
line 2422 column 53 - Warning: replacing invalid UTF-8 bytes (char. code U+0001)
line 4404 column 31 - Warning: replacing invalid UTF-8 bytes (char. code U+0002)
line 5738 column 43 - Warning: replacing invalid UTF-8 bytes (char. code U+001F)
line 17 column 1 - Info: <meta> attribute "content", incorrect value "text/html; charset=windows-1252" replaced
This is nice - the user has a pretty good hint why tidy complains about invalid utf-8 bytes.
$ tidy --show-meta-change yes --input-encoding win1252 --output-encoding utf8 x.htm >| x.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 17 column 1 - Info: <meta> attribute "content", incorrect value "text/html; charset=windows-1252" replaced
This is confusing - charset=windows-1252 in the input file is correct but tidy is saying it's incorrect.
Lee
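The header-first, meta-second lookup described above could be sketched roughly as follows. This is a deliberately naive substring search, not a real HTML parse, so it will be fooled by a "charset=" that appears in body text or a comment; all function names here are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for needle in hay[0..hay_len); NULL if absent. */
static const char *find_nocase(const char *hay, size_t hay_len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; n <= hay_len && i <= hay_len - n; i++) {
        size_t k = 0;
        while (k < n && tolower((unsigned char)hay[i + k]) ==
                        tolower((unsigned char)needle[k]))
            k++;
        if (k == n)
            return hay + i;
    }
    return NULL;
}

/* Copies the token following the first "charset=" in text into out.
   Returns false if no non-empty value is found. */
bool extract_charset(const char *text, size_t len, char *out, size_t out_cap)
{
    const char *p = find_nocase(text, len, "charset=");
    if (p == NULL || out_cap == 0)
        return false;
    p += strlen("charset=");
    const char *end = text + len;
    if (p < end && (*p == '"' || *p == '\''))
        p++;                               /* skip an optional opening quote */
    size_t n = 0;
    while (p < end && n + 1 < out_cap &&
           (isalnum((unsigned char)*p) || *p == '-' || *p == '_' ||
            *p == '.' || *p == ':'))
        out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}

/* The HTTP Content-Type value wins; the document is only consulted when
   the header is missing (NULL) or carries no charset parameter. */
bool detect_charset(const char *content_type, const char *html, size_t html_len,
                    char *out, size_t out_cap)
{
    if (content_type != NULL &&
        extract_charset(content_type, strlen(content_type), out, out_cap))
        return true;
    return extract_charset(html, html_len, out, out_cap);
}
```

Because the same search handles both <meta charset="xxx"> and the http-equiv form, it has to run on the raw downloaded bytes, before tidy rewrites the meta tag.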
@ler762 that sounds like a good solution, and to be honest that's what I set out to do at the start. In this case the server just reports
If I didn't want to use xpath then I'd have to manually parse the html before tidying, deal with invalid input etc, so in short duplicate what tidy already does. Anything simpler, like a regex match for example, is just doomed to fail sooner or later.
So what I'd need is one of the following:
Instinctively, I'd say option 3 is the one that makes the most sense to me since, if tidy is changing the charset value, then I'd expect the output to be self-consistent and be encoded in the way the new meta charset claims it should. But I understand the desire to keep dependencies to a minimum (there could be a build option to enable/disable this).
I don't like option 1 because although I can grab the charset value through xpath, I'd still be left with the old value after iconv(), resulting once again in an inconsistent html.
Option 2 sounds like a "good enough" solution, in that it's probably quick to implement for you guys, it gets me past my issue and no extra dependencies or work on the build system are required.
Edit You can play around with my project if you want, maybe the problem will be more clear:
Output is produced right after tidying the input html, so save for bugs it should be exactly what tidy returned to me.
Edit2 I'm in a bit of a hurry and I will double check this later, but it looks to me like the output is not even windows-1252; it's just broken, as if I run this command
I still don't get accented characters, more like garbage data:
Option 2 is probably quicker for you to implement. Call tidy with the "--show-meta-change yes" option and look for an Info: <meta> attribute "content", incorrect value "text/html; charset=xxx" replaced message. If you see it, call tidy again with "--char-encoding <tidy xxx equivalent>". Maybe that'll get you close enough?
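A rough sketch of that second step, pulling the charset out of the Info line and mapping it to a tidy encoding name, might look like this. The function names are invented, the mapping table is an assumption to be checked against tidy's documentation of its encoding names, and matching on message text only works while the wording stays exactly as quoted in this thread.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical mapping from charset labels to tidy encoding names.
   The table is an assumption and far from complete. */
const char *tidy_encoding_for(const char *charset)
{
    static const struct { const char *label; const char *tidy; } map[] = {
        { "windows-1252", "win1252"  },
        { "iso-8859-1",   "latin1"   },
        { "iso-8859-15",  "latin0"   },
        { "utf-8",        "utf8"     },
        { "us-ascii",     "ascii"    },
        { "shift_jis",    "shiftjis" },
        { "big5",         "big5"     },
    };
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
        if (strcasecmp(charset, map[i].label) == 0)
            return map[i].tidy;
    return NULL;
}

/* Given one line of tidy's diagnostics, returns the tidy encoding name for
   the charset mentioned in a "... charset=xxx" replaced message, or NULL if
   the line is not such a message or the charset is not in the table. */
const char *tidy_encoding_from_message(const char *line)
{
    const char *p = strstr(line, "charset=");
    if (p == NULL || strstr(line, "replaced") == NULL)
        return NULL;
    p += strlen("charset=");

    char name[32];
    size_t n = strcspn(p, "\" ;");     /* value ends at quote, space or ';' */
    if (n == 0 || n >= sizeof name)
        return NULL;
    memcpy(name, p, n);
    name[n] = '\0';
    return tidy_encoding_for(name);
}
```

With the message from the run above, this yields win1252, which is the value the second tidy invocation would need.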
@KingDuckZ I had started this reply, offline, before two/three more arrived... but what I had started seems still relevant...
Well I am certainly not an expert on character encodings, but...
But the fact that the document contains the character code 0x96, decimal 150, certainly indicates that it is
That suggests, at a minimum, tidy must be given a config of
@KingDuckZ, I can see your catch-22 - download the html from the wild, feed it to tidy, then check...
Oops, the doc is in an esoteric character set, like
To your points...
@ler762, yes perhaps the meta info message is a little confusing, in that, in this case, it is not incorrect, for the input, but it is for the output tidy created...
Maybe there could be a better wording of the message... suggestions welcome... but perhaps that should be a new, different issue, if someone wants to pursue it... thanks...
I guess the bottom line is
I tried building
I guess if I was writing a download app, using
This is as @ler762 suggests... sort of...
But at this moment, do not yet clearly see a
HTH, and look forward to further feedback... thanks...
@ler762 I meant, ask programmatically. This task shouldn't require duckscraper user to read through the logs and then re-run their queries.
Let me clarify, users of duckscraper are obviously expected to be able to read html and write xpath, and in fact the
As I understand it, figuring out the correct encoding is done this way:
Point 1 is fine, I can ask curl and it will kindly tell me.
@geoffmcl I suppose I could preprocess the
As for building duckscraper, if you're still interested please check out the scraplang branch, not main. You won't need Pugi anymore, but you will need XQilla and Xerces-c instead, which on my system I compiled manually and installed to my /usr/local. Unfortunately I don't know if any of this will work on Windows, I work on Linux only.
@KingDuckZ, thank you for your further feedback...
As you suggested, my build of
But I guess you get a buffer from curl, it seems a simple thing to search for
I am afraid you are asking too much from
So, again, at this moment, do not yet clearly see a
HTH, and look forward to further feedback... thanks...
Hi geoffmcl, thanks for looking into this.
Correct me if I'm wrong, but I think all of the above problems are already taken care of in your library? They must be, since iirc the encoding gets replaced. At that point, right when the old encoding is replaced with the new one, is it very hard for you to make a copy of the substring that got replaced and make it available through some accessor?
@KingDuckZ thanks for the further feedback, but, sorry, I think you are, or seem, mistaken...
Tidy does not search for
It does write, or modify, the
Yes, it searches, using the given charset, byte by byte, for
This means, if the configured input is windows-1252, iso-8859-1, latin1, the char
Conversely, if the configured input is
I also tried searching the header, using
And I hope checking whether the http response specifies a charset was of some success also...
One simple additional thought is, you have the original text in
But as indicated, I do not think
As always, HTH, and look forward to further feedback... thanks...