Auto charset conversion for getPlainText()? #16

Closed
GoogleCodeExporter opened this issue Feb 17, 2016 · 4 comments

Comments

@GoogleCodeExporter

I'm scraping a web page encoded in iso-8859-1, but my scripts work in UTF-8
(PHP code, MySQL databases, etc.), so when I get the text of a node,
getPlainText() returns it in iso-8859-1 (the charset of the loaded HTML) and I
can't make equality comparisons in my code.

I solved this (for this particular case) by converting to UTF-8 in the
getPlainText implementation:

function getPlainText() {
    return preg_replace('`\s+`', ' ', utf8_encode( html_entity_decode($this->toString(true, true, true), ENT_QUOTES) ));
}

But I'm wondering: what about automatic detection of the loaded HTML's
encoding, plus an option to set the charset of the strings returned by
getPlainText()?

It's just an idea O:)

Original issue reported on code.google.com by Radika...@gmail.com on 6 Sep 2012 at 7:50
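
A minimal sketch of the workaround above as a standalone function, for cases where the source charset is known and is not necessarily iso-8859-1. utf8_encode() only converts from ISO-8859-1 and is deprecated as of PHP 8.2, so this version uses mb_convert_encoding() instead (requires the mbstring extension). The function name plainTextToUtf8 and the $sourceCharset parameter are hypothetical and not part of ganon; the input is assumed to be the raw string that toString() would return.

```php
<?php
// Hypothetical helper, not part of ganon: converts extracted markup text
// from a known source charset to UTF-8 and normalises it like getPlainText().
function plainTextToUtf8(string $rawText, string $sourceCharset): string
{
    // Convert the bytes to UTF-8 first, so every later step sees one charset.
    $utf8 = mb_convert_encoding($rawText, 'UTF-8', $sourceCharset);

    // Decode entities with an explicit charset instead of relying on PHP's default.
    $decoded = html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');

    // Collapse whitespace runs; the /u modifier makes the pattern UTF-8 aware.
    return preg_replace('/\s+/u', ' ', $decoded);
}

// Usage: "café" in ISO-8859-1 (0xE9 is "é"), plus an entity and a line break.
$text = plainTextToUtf8("caf\xE9 &amp;\n  bar", 'ISO-8859-1');
var_dump($text === "caf\xC3\xA9 & bar");
```

Converting before decoding entities keeps the result from mixing two encodings in one string, which is what breaks equality comparisons.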

@GoogleCodeExporter

I also noticed ganon has trouble handling GB2312 (Simplified Chinese). I ended 
up having to use iconv to convert to GBK before parsing, which is pretty slow 
for larger DOMs. Rules for charset conversions can be tricky.

Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:52

@GoogleCodeExporter

I don't think it's a good idea to alter getPlainText, but maybe an extra method
called getPlainTextUTF8? It might be better to just use a local solution,
though.

Original comment by niels....@gmail.com on 19 Oct 2012 at 4:34

  • Added labels: Priority-Low, Type-Enhancement
  • Removed labels: Priority-Medium, Type-Defect

@GoogleCodeExporter

The problem is that sometimes I don't know (or don't want to know) which
charset the input web page uses, so any kind of autodetection would be great:
it would let my code always work in the same charset.

Original comment by Radika...@gmail.com on 19 Oct 2012 at 4:39
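
One possible shape for the autodetection requested here, as a rough sketch only. It is not how ganon works; guessHtmlCharset and its $fallback parameter are hypothetical names. The approach is an assumption: trust a charset declared in a meta tag, otherwise check whether the bytes are valid UTF-8, otherwise fall back to a caller-supplied default, because single-byte charsets generally cannot be told apart from the bytes alone.

```php
<?php
// Hypothetical helper, not part of ganon: guesses the charset of an HTML string.
function guessHtmlCharset(string $html, string $fallback = 'ISO-8859-1'): string
{
    // 1. Prefer a charset declared in a <meta> tag (both the short form and
    //    the http-equiv="Content-Type" form contain "charset=...").
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_\-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Bytes that form valid UTF-8 are treated as UTF-8 (this includes plain ASCII).
    if (mb_check_encoding($html, 'UTF-8')) {
        return 'UTF-8';
    }

    // 3. Nothing declared and not valid UTF-8: use the caller's fallback.
    return $fallback;
}

// Usage: the result can be passed as the source charset to mb_convert_encoding().
$html = "<p>caf\xE9</p>"; // ISO-8859-1 bytes, no declaration
$utf8Html = mb_convert_encoding($html, 'UTF-8', guessHtmlCharset($html));
```

A declared charset can still be wrong, and an HTTP Content-Type header, when available, is not consulted here, so this should be read as a starting point rather than a reliable detector.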

@GoogleCodeExporter

Added a simple version of getPlainTextUTF8 in rev #76.

Original comment by niels....@gmail.com on 20 Oct 2012 at 10:45

  • Changed state: Done
