
Wrong character encoding #863

Closed
KingDuckZ opened this issue Feb 19, 2020 · 10 comments

Comments

@KingDuckZ (Contributor)

Consider this page, specifically this line from its html:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

From my testing, tidy doesn't respect that encoding; instead, if I read the code correctly, src/clean.c:2316 forcibly replaces that "windows-1252" value with "utf-8".

The problem is that when processing the above html page, the output is not valid utf-8. For example, there is an accented character near the string "des Mondes" (grep for it and you should see it) that gets destroyed.
In my own code, this line

tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252);

fixes the issue and I get valid utf-8 out, with correct accents and all, but I can't hardcode it because then my code would be incorrect for every other html page out there. I also can't recover that information from the cleaned html, because tidy overwrites it.

I think one of these two things should happen:

  1. tidy leaves the encoding as-is, so I can find the relevant meta tag through xpath and convert to utf8 manually using iconv
  2. tidy reads and respects that encoding when converting to utf8

I don't know if I'm doing something wrong in the way I call tidy, but after trying several options I can't get it to give me a correctly converted utf8 html string.

@geoffmcl (Contributor)

@KingDuckZ thank you for your issue...

First, some simple Tidy facts...

  1. Tidy's `default` encoding, both in and out, is UTF-8... see char-encoding, and the other I/O options...
  2. Tidy `ignores` the meta charset in the doc on input... except that it may re-set it on output, as you point out in src/clean.c:2316... also see show-meta-change & add-meta-charset...
  3. Tidy is `not` an encoding converter... use other tools for that...

As you have shown in your own code, you need to set the input encoding, but you must also set the output encoding, else it will default to utf8... which can have conversion problems... as you maybe point out...

See how tidy internally handles this, AdjustCharEncoding... note the default chosen pairs for in and out... and both are set...

I had no problems with the online page you linked to, using a config of -raw, or -win1252... all seemed ok...

With the latter I got entities like &ndash; for 0x96, and &eacute; for 0xE9... etc...
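For the conversion being discussed here (windows-1252 in, utf-8 out), a tidy config file sketch, assuming the documented input-encoding/output-encoding option names, would be:

```
input-encoding: win1252
output-encoding: utf8
```

This pairing is what the `tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252)` call above sets on the input side only; the output side defaults to utf8 unless configured.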

Does this solve your problem?

If not, please advise the character code(s) that are causing a problem, and the config used... maybe there is a bug here, but not seeing it yet...

Look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author)

Hi @geoffmcl, thanks for replying. I'm a bit confused about what I should do. What I want is to normalise my input so that my code only ever deals with utf8. It's ok if I have to use iconv for that, but then the problem is that I need to tell iconv the source charset and the destination charset. The destination is always utf8, so that's easy, but what about the input? In my test, the meta charset tag from tidy always says utf8, even when that's not the case, because, as you pointed out, tidy does not convert characters unless you explicitly ask it to. But it always changes the meta tag.

This is how I set up tidy; if you look at line 108, when the user specifies a charset explicitly I apply it, otherwise I just go with tidy's default. The page linked in my first post only produces correct utf8 output if that src_charset in my code is true and the if branch is taken. If it's false, I get an html file that claims to be utf8 but contains non-utf8 sequences for the accented characters.

I'm trying to figure out if I'm doing something wrong myself, it might be this is indeed the expected behaviour, in which case my question is how can I tell what the original encoding was so I can invoke iconv() correctly? Thanks.

@ler762 (Contributor) commented Feb 22, 2020 via email

@KingDuckZ (Contributor, Author) commented Feb 22, 2020

@ler762 that sounds like a good solution, and to be honest that's what I set out to do at the start. In this case the server just reports Content-Type: text/html, so what matters is the meta tag. I can easily retrieve the first meta tag using xpath (after all, that's what my program does), but before I can do that the html needs to be cleaned and become valid xml. See my problem? After I clean the html it's ready to be fed to the xpath library, but because tidy overwrites the charset field, my xpath query always returns utf-8, no matter what.

If I didn't want to use xpath I'd have to manually parse the html before tidying, deal with invalid input, etc. - in short, duplicate what tidy already does. Anything simpler, like a regex match for example, is doomed to fail sooner or later.

So what I'd need is one of the following:

  1. tell tidy I want it to preserve the original meta charset value, so I can run xpath over the clean xml (I don't like this idea much)
  2. ask tidy what was the charset value before the replacement happened
  3. have tidy invoke iconv() automatically

Instinctively, I'd say option 3 is the one that makes the most sense to me since, if tidy is changing the charset value, I'd expect the output to be self-consistent and encoded the way the new meta charset claims it is. But I understand the desire to keep dependencies to a minimum (there could be a build option to enable/disable this).

I don't like option 1 because although I can grab the charset value through xpath, I'd still be left with the old value after iconv(), resulting once again in an inconsistent html.

Option 2 sounds like a "good enough" solution, in that it's probably quick to implement for you guys, it gets me past my issue and no extra dependencies or work on the build system are required.

Edit You can play around with my project if you want, maybe the problem will be more clear:

# correct output, requires user to manually inspect source html
./duckscraper -f windows-1252 --dump - https://psxdatacenter.com/psp/plist.html "/test"

# wrong output, html claims to be utf-8 but accented characters won't show up
./duckscraper --dump - https://psxdatacenter.com/psp/plist.html "/test"

Output is produced right after tidying the input html, so save for bugs it should be exactly what tidy returned to me.

Edit2 I'm in a bit of a hurry and will double check this later, but it looks to me like the output is not even windows-1252, it's just broken: if I run this command

iconv --from-code=windows-1252 --to-code=utf-8 --output=utf8plist.html plist_clean.html

I still don't get accented characters, more like garbage data: Crois�e

@ler762 (Contributor) commented Feb 22, 2020 via email

@geoffmcl (Contributor)

@KingDuckZ I had started this reply, offline, before two/three more arrived... but what I had started seems still relevant...

Well I am certainly not an expert on character encodings, but the fact that the document contains the character code 0x96, decimal 150, certainly indicates that it is WINDOWS-1252, as the meta charset indicates...

That suggests, at a minimum, tidy must be given a config of --input-encoding win1252, ie tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252); MUST be used on this document...

Now @KingDuckZ, and @ler762, thank you for the further feedback...

@KingDuckZ, I can see your catch-22 - download the html from the wild, feed it to tidy, then check...

Oops, the doc is in an esoteric character set, like WINDOWS-1252 or others - perhaps quite rare these days, with the ubiquity of utf-8, but they exist - which tidy assumes is utf-8, unless told otherwise, and tries its best, making a big mess in the process...

To your points...

  1. Tidy can not leave the original meta charset... since that too is wrong for the utf-8 output...
  2. With show-meta-change set, tidy does tell you what it is replacing...
  3. Tidy can not invoke the likes of iconv... adding a dependency... no...

@ler762, yes perhaps the meta info message is a little confusing, in that, in this case, it is not incorrect, for the input, but it is for the output tidy created...

Maybe there could be a better wording of the message... suggestions welcome... but perhaps that should be a new, different issue, if someone wants to pursue it... thanks...

I guess the bottom line is libtidy might not be the best tool, for html, where you do not yet know the encoding character set... sorry...

I tried building duckscraper, but am missing the PugiXML dependency... in Windows... probably can be solved...

I guess if I was writing a download app using libtidy, like it seems yours is, I would have to pre-process the <head>, as say ascii, or latin1, like say a browser, or validators, etc., have to do... to find the charset, to configure libTidy appropriately... or not use libTidy...

This is as @ler762 suggests... sort of...

But at this moment, do not yet clearly see a Feature Request for libTidy emerging... but...

HTH, and look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author)

Option 2 is probably quicker for you to implement.

@ler762 I meant, ask programmatically. This task shouldn't require a duckscraper user to read through the logs and then re-run their queries.

Let me clarify, users of duckscraper are obviously expected to be able to read html and write xpath, and in fact the -f flag I added recently is a reasonable workaround to the current problem, however I don't think figuring out the charset encoding should fall onto their shoulders by default. Ideally the -f flag should only be needed in some rare circumstances.
Please keep in mind that figuring out the encoding is not straightforward because, as @ler762 correctly pointed out, the http response header might specify a different encoding that overrides the one given in the html, or there might be none at all. Leaving this burden on my users is not acceptable for me.

As I understand it, figuring out the correct encoding is done this way:

  1. if http response specifies a charset, use that one
  2. if html has a meta charset, use that one (maybe use the first if more than one is given?)
  3. choose a default encoding depending on the Content-Type

Point 1 is fine, I can ask curl and it will kindly tell me.
Point 2 is, like I'm saying, problematic. Once again, it would be good if, from code, I could ask tidy what the original charset was, if present at all. Then I could discard the cleaned html and re-clean the raw one, either passing the correct encoding in or invoking iconv() first.
Point 3 is also fine as it involves choosing between some hardcoded values that I already have.

@geoffmcl I suppose I could preprocess the <head> as you suggest, but wouldn't tidy already have all the primitives for doing that? Would it be possible for you to provide some guess_charset() function?
To recap, these are my questions to you:

  1. could you provide a function or an out parameter to retrieve the old encoding after a call to tidyCleanAndRepair()?
  2. if not, could you provide a function that extracts the encoding if present? I'm not trying to offload my work to you, it's just that I think tidy already has all the primitives needed and it's better equipped than my code for parsing non-xml input files

As for building duckscraper, if you're still interested please check out the scraplang branch, not main. You won't need Pugi anymore, but you will need XQilla and Xerces-c instead, which on my system I compiled manually and installed to my /usr/local. Unfortunately I don't know if any of this will work on Windows, I work on Linux only.

@geoffmcl (Contributor)

@KingDuckZ, thank you for your further feedback...

Following your suggestion, I tried building duckscraper/scraplang, but it failed, missing XQilla, Xerces-c, ... I was trying to look at your use of libTidy... but that must wait until I download and build these dependent libs...

But I guess you get a buffer from curl, so it seems a simple thing to search for <meta ...> in the limited head of the buffer, a char *guess_charset(buf,size) function... stop at </head>... returning a ptr to 'utf8', or whatever it is, to pass to tidy's config as a known --input-encoding string...

I am afraid you are asking too much from libTidy to supply this service...

So, again, at this moment, do not yet clearly see a Feature Request for libTidy emerging... but...

HTH, and look forward to further feedback... thanks...

@KingDuckZ (Contributor, Author) commented Sep 29, 2020

Hi geoffmcl, thanks for looking into this.
I can do a quick search, possibly with regex, but that's problematic in the case of a <meta> inside a comment, for example. There's the case of broken html too (there might be no </head>). Newlines and weird indentation might cause problems as well.

Correct me if I'm wrong, but I think all of the above problems are already taken care of in your library? They must be, since iirc the encoding gets replaced. At the point where the old encoding is replaced with the new one, would it be very hard for you to make a copy of the substring that got replaced and expose it through some accessor?

@geoffmcl (Contributor) commented Oct 3, 2020

@KingDuckZ thanks for the further feedback, but, sorry, I think you are, or seem, mistaken...

... I think all of the above problems are already taken care of in your library?

No, libTidy HAS to be told what the input charset is! Like I understand iconv requires... If none given, it defaults to utf-8...

Tidy does not search for <meta...> at all, except as just another html tag... it does not decode it... or use any value it might contain... whether it is right or wrong...

It does write, or modify, the <meta charset=...> on output... it will be set to the output-encoding value... which also defaults to utf-8, unless otherwise instructed...

Yes, it searches, using the given charset, byte by byte, for html code, text, etc... and internally stores all text in utf-8, in the lexer buffer...

This means, if the configured input is windows-1252, iso-8859-1, or latin1, the char 0xEB, namely an &euml;, will be stored as 0xC3,0xAB, U+00EB... and tidy does not remember it was a single 0xEB... so there cannot be an accessor to it...

Conversely, if the configured input is utf-8, on encountering 0xEB, it will flag a warning, and replace it with EF BF BD, the unicode REPLACEMENT CHARACTER, usually shown as a white question mark on a black diamond... and again there is no way to get back what it was originally...

So, libTidy input-encoding must be configured before it reads the data, and nothing in that data will alter the input-encoding... sorry if you got some other idea...

I also tried searching the header, using perl, and that sort of worked... but, yes, you do need to avoid comments, <!-- any thing -->, maybe <style> ... stuff ... </style>, or <script> ... script ... </script>, and maybe others... not exactly easy...

And I hope using the http response specifies a charset was some success also...

One simple additional thought: you have the original text in std::string html, in clean_html()... and are able to run that multiple times through libTidy, so, if you get a tidy warning about replacing invalid UTF-8 bytes, or maybe other invalid char warnings, you change the input-encoding and present it again... and again... just an idea...

But as indicated, I do not think libTidy can help you more... sorry...

No libTidy feature request emerging, so closing this...

As always, HTH, and look forward to further feedback... thanks...

@geoffmcl geoffmcl closed this as completed Oct 3, 2020