Auto charset conversion for getPlainText()? #16

GoogleCodeExporter · 2016-02-17T14:31:36Z

I'm scraping a web page in ISO-8859-1, but my scripts work in UTF-8 (PHP 
code, MySQL databases, etc.), so if I get the text of a node, getPlainText() 
returns the text in ISO-8859-1 (the charset of the loaded HTML) and I can't make 
equality comparisons in my code.

I solved this (for this particular case) by converting to UTF-8 in the 
getPlainText implementation:

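function getPlainText() {
    // Decode entities as ISO-8859-1 (the page's charset) so utf8_encode() only sees single-byte input.
    $text = html_entity_decode($this->toString(true, true, true), ENT_QUOTES, 'ISO-8859-1');
    return preg_replace('`\s+`', ' ', utf8_encode($text));
}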
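
A more general variant of this workaround, as a minimal sketch only: utf8_encode() handles ISO-8859-1 input exclusively (and is deprecated as of PHP 8.2), whereas mb_convert_encoding() takes an explicit source charset. The helper name toUtf8PlainText and its parameters are hypothetical and not part of ganon, and the sketch assumes the mbstring extension is loaded and that the caller knows the page's charset.

```php
<?php
// Hypothetical helper, not part of ganon: normalizes already-extracted node
// text to UTF-8. Assumes the mbstring extension is available.
function toUtf8PlainText(string $text, string $sourceCharset = 'ISO-8859-1'): string
{
    // 1. Convert the raw bytes from the page's charset to UTF-8.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $sourceCharset);

    // 2. Decode entities in UTF-8, so entities outside the source charset
    //    (for example &euro; on an ISO-8859-1 page) are still representable.
    $decoded = html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');

    // 3. Collapse whitespace runs; the "u" modifier keeps the regex UTF-8 safe.
    return preg_replace('/\s+/u', ' ', $decoded);
}
```

Called on the string returned by toString(true, true, true), this produces text that can be compared directly with UTF-8 strings elsewhere in the script.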

But I'm thinking: what about automatic detection of the loaded HTML's 
encoding, plus an option to set the charset of the strings returned by 
getPlainText()?

It's just an idea O:)

Original issue reported on code.google.com by Radika...@gmail.com on 6 Sep 2012 at 7:50

GoogleCodeExporter · 2016-02-17T14:31:36Z

I also noticed ganon has trouble handling GB2312 (Simplified Chinese). I ended 
up having to use iconv to convert to GBK before parsing, which is pretty slow 
for larger DOMs. Rules for charset conversions can be tricky.

Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:52

GoogleCodeExporter · 2016-02-17T14:31:36Z

I don't think it's a good idea to alter getPlainText, but maybe an extra method 
called getPlainTextUTF8? It might be better to just use a local solution, 
though.

Original comment by niels....@gmail.com on 19 Oct 2012 at 4:34

Added labels: Priority-Low, Type-Enhancement
Removed labels: Priority-Medium, Type-Defect

GoogleCodeExporter · 2016-02-17T14:31:36Z

The problem is that sometimes I don't know (or don't want to know) which 
charset the input web page is in, so any kind of autodetection would be great: 
my code could then always work in the same charset.
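
The kind of autodetection meant here could look roughly like the sketch below. It is not ganon's API: detectHtmlCharset and htmlToUtf8 are hypothetical names, the candidate list and fallback charset are assumptions, and the mbstring extension is assumed to be loaded. It prefers a charset declared in a meta tag and only then falls back to byte-level guessing, which is unreliable for single-byte charsets.

```php
<?php
// Hypothetical helpers, not part of ganon.

// Guess the charset of raw HTML: prefer a <meta> declaration, then fall back
// to mb_detect_encoding() over a caller-supplied candidate list.
function detectHtmlCharset(string $html, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Matches <meta charset="x"> and <meta http-equiv=... content="...; charset=x">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    // Strict mode rejects candidates for which the bytes are invalid.
    // Single-byte charsets accept almost any input, so this is only a guess.
    $guess = mb_detect_encoding($html, $candidates, true);
    return $guess === false ? null : $guess;
}

// Convert raw HTML to UTF-8 before handing it to the parser.
function htmlToUtf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    $charset = detectHtmlCharset($html) ?? $fallback;
    try {
        $converted = mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        // PHP 8 throws when the declared charset name is unknown to mbstring.
        $converted = false;
    }
    if ($converted === false) {
        // Unknown or unusable declared charset: use the fallback instead.
        $converted = mb_convert_encoding($html, 'UTF-8', $fallback);
    }
    return $converted;
}
```

Converting the whole document once, before parsing, means every string returned afterwards is already in UTF-8, at the cost of an extra pass over large inputs.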

Original comment by Radika...@gmail.com on 19 Oct 2012 at 4:39

GoogleCodeExporter · 2016-02-17T14:31:36Z

Added a simple version of getPlainTextUTF8 in rev #76.

Original comment by niels....@gmail.com on 20 Oct 2012 at 10:45

Changed state: Done

GoogleCodeExporter added auto-migrated Type-Enhancement Priority-Low labels Feb 17, 2016

GoogleCodeExporter closed this as completed Feb 17, 2016
