I am using perl-HTML-Tree-3.18 and have run into the following problem:
When I use HTML::TreeBuilder to parse a tree that contains text like
"Gebühr vor Ort von € 30,- pro Woche" (without quotes), I get the string back in a strange encoding: ü is encoded as one char, € is encoded as two chars. I think that is incorrect.
From on 2005-08-17 14:42:14
:
In the above post the string should be read as:
"Gebühr vor Ort von € 30,- pro Woche"
Tree builder seems to decode the string entities via HTML::Entities. Is
it possible to extend the tree builder with an option that skips
decoding the HTML entities into chars? The only way out seems to be
to call encode again, but that is not pretty.
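For reference, the decode-then-encode round trip described above can be sketched with HTML::Entities directly. This is only an illustration: it assumes the original markup spelled the two characters as the named entities `&uuml;` and `&euro;`, which the report does not actually show.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Assumed input: the report does not show the raw markup.
my $source = 'Geb&uuml;hr vor Ort von &euro; 30,- pro Woche';

# Decoding turns the entities into single characters (U+00FC and U+20AC).
my $decoded = decode_entities($source);

# Encoding again turns those characters back into named entities.
my $reencoded = encode_entities($decoded);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n$reencoded\n";
```

Both characters are one character each after decoding. Whether they later show up as one or two "chars" depends on how the string is encoded on output, which is outside what HTML::Entities does.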
From on 2005-09-08 19:32:13
:
The problem seems to be solved when upgrading from Perl v5.8.3 to v5.8.6.
From on 2005-10-06 18:37:33
:
The Debian stable folk have 5.8.4, and this bug is affecting their programs.
How could they work around this problem? I'm looking at the source code
and I'm tempted to comment out the HTML::Entities::encode line... but
would that then create other problems?
Can't reproduce with 3.18 and up. Please resubmit with a test case if
you are still having this issue.
As an aside, I have added this case as a test in HTML-Tree 3.22, which
will be released as part of the Chicago Hackathon this weekend.
Hello! Thanks for paying attention to the (possible) problem.
Finally, as I said above, some of perl installations work, some -- not,
and I've come to the conclusion, it's a core Perl bug with unicode
chars. What version of Perl do you use for testing?
Can you please, define more precisely the return value for
"HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
text? While investigating the problem, I read
http://jerakeen.org/files/2005/perl-utf8.slides.pdf, which has a very
nice chart. It is worth reading!
Actually, I found this problem while implementing an HTML parser
that stores data in a MySQL DB, and this data is supposed to be
displayed as HTML again. So, in my case I used the following flow:
my $html_root = HTML::TreeBuilder->new_from_content($contents);
foreach ($html_root->guts())
{
    ...
    $dbh->prepare("insert into my_table (id, contents) values ($id, ?)")
        ->execute(HTML::Entities::encode_entities($_->as_trimmed_text()));
}
so I used the "reverse" conversion for chars. Unfortunately, I still
don't have any working example of storing Unicode strings into MySQL 4.0.x
from Perl so that they can be read correctly from Java later :( but that's
outside the scope of the problem being discussed.
On Mon Nov 13 04:50:27 2006, dma_k@mail.ru wrote:
> Finally, as I said above, some of perl installations work, some -- not,
> and I've come to the conclusion, it's a core Perl bug with unicode
> chars. What version of Perl do you use for testing?
I use Apple's Perl (5.8.6 on OSX), Debian sarge's Perl (5.8.4), and a
custom Perl (5.8.2) for release testing. I do have a 5.6 install
sitting around, and t/body.t fails on unicode escape tests. (I should
skip those on that platform.)
> Can you please, define more precisely the return value for
> "HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
> text?
It returns the text exactly as it's contained in each HTML::Element (not
HTML::Entity) and children. If that's UTF-8, Unicode, ISO-8859-1, or
whatever, that's been decided by HTML::Parser. HTML::Element is just
the middleman, doing simple concatenation.
If you could give a test case that shows the broken behavior on your
platform, I would appreciate it.
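To make the as_text() behaviour described above concrete, here is a minimal sketch. The input markup is made up for illustration, and the result assumes a Perl and HTML::Parser built with Unicode support (the failing 5.6 install mentioned above would not qualify).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input, not taken from the original report.
my $tree = HTML::TreeBuilder->new_from_content(
    '<p>Geb&uuml;hr &euro; 30</p>'
);

# as_text concatenates the text nodes as the parser produced them:
# entities are already decoded into characters, not into UTF-8 bytes.
my $text = $tree->as_text;

printf "%d characters\n", length $text;
print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";

$tree->delete;
```

Printing code points instead of the string itself avoids any dependence on the terminal or output layer, which is where the "one char versus two chars" confusion usually comes from.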
I'm not sure if this module is still being actively maintained but I am
experiencing the same issues on perl 5.10. I don't know if the issue is
with HTML::Element or an underlying module.
Here is a test case which fails on Fedora 9 platform:
==============================
#!/usr/bin/perl
use HTML::Element;
use Test::More tests => 2;
my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );
my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');
like( $h->as_HTML, qr/漢語/xms, 'Found chinese chars in html output' );
========================
Running this on Fedora 9 produces the following output:
1..2
ok 1 - Found chinese chars input string
not ok 2 - Found chinese chars in html output
# Failed test 'Found chinese chars in html output'
# at ./test2.pl line 13.
# '<p>This is a test &#x6F22;&#x8A9E;
# '
# doesn't match '(?msx-i:漢語)'
# Looks like you failed 1 test of 2.
Sorry, not sure if I was experiencing the same issue as described above,
but it seemed the same. Just realized that passing empty string to
as_HTML solves this issue. Updated test case, which passes:
===================================
#!/usr/bin/perl
use HTML::Element;
use Test::More tests => 2;
my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );
my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');
like( $h->as_HTML( '' ), qr/漢語/xms, 'Found chinese chars in html output' );
===================================
1..2
ok 1 - Found chinese chars input string
ok 2 - Found chinese chars in html output
Using as_HTML('') is funny, because in this case you tell HTML::Element
not to encode entities at all (the default should be '<>&').
Why do you expect as_HTML() to return a non-HTML-encoded string
back? I would use as_text() for this case. Or do you mean that as_HTML()
basically does incorrect HTML-encoding for Chinese characters? Try plain
with first argument; it seems to be a bug, but of a different nature.
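The three ways of calling as_HTML() discussed here can be compared side by side. This is a sketch based on the documented meaning of the first argument in HTML::Element (undef encodes every "unsafe" character, a string limits encoding to the listed characters, an empty string encodes nothing). The sample text is made up.

```perl
use strict;
use warnings;
use utf8;    # the literal below is a character string, not bytes
use HTML::Element;

my $h = HTML::Element->new('p');
$h->push_content('a < b, 漢語');

my $default = $h->as_HTML;           # every "unsafe" character is encoded
my $minimal = $h->as_HTML('<>&');    # only the three markup characters
my $raw     = $h->as_HTML('');       # nothing is encoded, "<" stays bare

binmode STDOUT, ':encoding(UTF-8)';
print for $default, $minimal, $raw;

$h->delete;
```

With '<>&' the Chinese characters pass through untouched while the markup stays valid, which is probably what the test case wanted. The empty string also leaves "<" unescaped.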
This is a bug in HTML::Entities, line 479 is encoding the Chinese
characters. Adding the following debug code to HTML/Entities.pm reveals
this:
print(STDERR "1: ref = $$ref\n");
$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
print(STDERR "2: ref = $$ref\n");
1: ref = This is a test 漢語
2: ref = This is a test &#x6F22;&#x8A9E;
Cheers, Jeff.
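The effect of that substitution can be reproduced outside the module. The sketch below copies the character class from the line quoted above and, as a simplification, always falls back to a numeric entity instead of consulting %char2entity; the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Character class copied from the substitution quoted above.
my $unsafe = qr/[^\n\r\t !\#\$%\(-;=?-~]/;

my $text = 'This is a test 漢語';

# Which characters does the default class treat as unsafe?
my @hits = $text =~ /($unsafe)/g;

# Simplified stand-in for the real substitution: numeric entities only.
(my $by_hand = $text) =~ s/($unsafe)/sprintf('&#x%X;', ord($1))/ge;

# What the module itself returns with its default unsafe set.
my $by_module = encode_entities($text);

print scalar(@hits), " unsafe characters\n";
print "$by_hand\n$by_module\n";
```

Everything outside a small set of ASCII characters falls into the negated class, so encoding the two Chinese characters is what the default unsafe set is written to do.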
From your example I can't tell if the string you passed to HTML::Entities::encode() was a Unicode
string or the undecoded UTF-8 bytes.
Please try the attached test program. It prints:
# encode-test.pl:4: "This is a test \x{6F22}\x{8A9E}"
# encode-test.pl:5: "This is a test 漢語"
for me, so it seems correct. If I comment out the 'use utf8;' line then the output becomes:
# encode-test.pl:4: "This is a test \xE6\xBC\xA2\xE8\xAA\x9E"
# encode-test.pl:5: "This is a test 漢語"
If you get different results, please tell me what version of perl and HTML::Parser you are using.
If you get the result above then I don't consider this a bug.
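The distinction between a Unicode string and UTF-8 bytes can be shown without the attached program, which is not reproduced in this ticket. The following is a separate sketch that builds both forms explicitly with Encode, so it does not depend on 'use utf8' or on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The six UTF-8 bytes of the two Chinese characters, as raw bytes.
my $bytes = "This is a test \xE6\xBC\xA2\xE8\xAA\x9E";

# The same text as a character string: two characters, U+6F22 and U+8A9E.
my $chars = decode('UTF-8', $bytes);

my $from_chars = encode_entities($chars);    # one entity per character
my $from_bytes = encode_entities($bytes);    # one entity per byte

print length($bytes), " bytes, ", length($chars), " characters\n";
print "$from_chars\n$from_bytes\n";
```

If the input has not been decoded, each byte is treated as a Latin-1 character and encoded on its own. The resulting entities are not the ones for the intended characters.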
Migrated from rt.cpan.org#14212 (status was 'open')
Requestors:
Attachments:
From on 2005-08-17 14:31:39
:
From on 2005-08-17 14:42:14
:
From on 2005-09-08 19:32:13
:
From on 2005-10-06 18:37:33
:
From petek@cpan.org on 2006-11-11 23:13:43
:
From dma_k@mail.ru on 2006-11-13 09:50:27
:
From petek@cpan.org on 2006-11-13 16:29:30
:
From stocks@cpan.org on 2009-11-23 23:19:09
:
From stocks@cpan.org on 2009-11-23 23:28:17
:
From dma_k@mail.ru on 2009-11-24 10:38:26
:
From jeff.fearn@gmail.com on 2010-04-24 04:17:21
:
From gaas@cpan.org on 2010-07-09 13:16:30
:
From gaas@cpan.org on 2010-07-09 13:17:49
: