I'm scraping a web page in ISO-8859-1, but my scripts work in UTF-8 (PHP code, MySQL databases, etc.), so if I get the text of a node, getPlainText() returns the text in ISO-8859-1 (the charset of the loaded HTML) and I can't make equality comparisons in my code.
I solved this (for this particular case) by converting to UTF-8 in the getPlainText() implementation:
function getPlainText() {
    return preg_replace('`\s+`', ' ', utf8_encode(html_entity_decode($this->toString(true, true, true), ENT_QUOTES)));
}
But I'm thinking: what about automatic detection of the loaded HTML's encoding, plus an option to set the charset of the strings returned by getPlainText()?
It's just an idea O:)
Original issue reported on code.google.com by Radika...@gmail.com on 6 Sep 2012 at 7:50
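A minimal sketch of that charset option, based on the snippet above: the only change is swapping utf8_encode() (which only handles ISO-8859-1 input) for mb_convert_encoding() with configurable source and target charsets. The parameter names are illustrative and not part of ganon's API.

// Hypothetical variant of the override above: same entity/whitespace cleanup,
// but the output charset is an option instead of being hard-coded to UTF-8.
function getPlainText($toCharset = 'UTF-8', $fromCharset = 'ISO-8859-1') {
    $text = html_entity_decode($this->toString(true, true, true), ENT_QUOTES);
    if ($toCharset !== $fromCharset) {
        // mb_convert_encoding covers more source charsets than utf8_encode()
        $text = mb_convert_encoding($text, $toCharset, $fromCharset);
    }
    return preg_replace('`\s+`', ' ', $text);
}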
I also noticed ganon has trouble handling GB2312 (Simplified Chinese). I ended
up having to use iconv to convert to GBK before parsing, which is pretty slow
for larger DOMs. Rules for charset conversions can be tricky.
Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:52
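A rough sketch of that workaround, assuming str_get_dom() is ganon's string loader; the URL is just a placeholder:

// Convert the raw bytes with iconv before parsing, as described above.
// //IGNORE drops characters that have no mapping instead of failing.
$raw  = file_get_contents('http://example.com/page-in-gb2312.html');
$html = str_get_dom(iconv('GB2312', 'GBK//IGNORE', $raw));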
I don't think it's a good idea to alter getPlainText(), but maybe an extra method called getPlainTextUTF8()? Perhaps it might be better to just use a local solution, though.
Original comment by niels....@gmail.com on 19 Oct 2012 at 4:34
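One way that could look without touching ganon itself, as a local wrapper that only relies on the existing getPlainText(); the source-charset parameter is an assumption:

// Hypothetical helper: return a node's plain text converted to UTF-8.
function getPlainTextUTF8($node, $fromCharset = 'ISO-8859-1') {
    return mb_convert_encoding($node->getPlainText(), 'UTF-8', $fromCharset);
}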
The problem is that sometimes I don't know (or don't want to know) which charset the input webpage is in, so any kind of autodetection would be great; then I could always work in the same charset in my code.
Original comment by Radika...@gmail.com on 19 Oct 2012 at 4:39
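A sketch of what such autodetection could look like as a pre-parse step, again assuming str_get_dom() is ganon's string loader; note that mb_detect_encoding() is only a heuristic and cannot reliably distinguish some single-byte charsets:

// Hypothetical pre-parse step: guess the page charset and normalise to UTF-8.
function loadAsUTF8($rawHtml) {
    // Prefer an explicit charset declaration (<meta charset=...> or http-equiv).
    if (preg_match('/charset\s*=\s*["\']?([\w-]+)/i', $rawHtml, $m)) {
        $charset = strtoupper($m[1]);
    } else {
        // Fall back to a best-effort guess over a few likely candidates.
        $charset = mb_detect_encoding($rawHtml, array('UTF-8', 'GB2312', 'ISO-8859-1', 'Windows-1252'), true);
    }
    if ($charset && $charset !== 'UTF-8') {
        $rawHtml = mb_convert_encoding($rawHtml, 'UTF-8', $charset);
    }
    return str_get_dom($rawHtml);
}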