HTML::TreeBuilder generates text nodes in a strange encoding [rt.cpan.org #14212] #10

oalders · 2020-08-24T18:36:31Z

Migrated from rt.cpan.org#14212 (status was 'open')

Requestors:

dma_k@mail.ru

Attachments:

encode-test.pl

From on 2005-08-17 14:31:39
:

I am using perl-HTML-Tree-3.18 and have run into the following problem.
When I use HTML::TreeBuilder to parse a tree that contains text like
"Geb&uuml;hr vor Ort von &euro; 30,- pro Woche" (without quotes), I get the string in a strange encoding: &uuml; is encoded as one char, while &euro; is encoded as two chars. I think that is incorrect.

From on 2005-08-17 14:42:14
:

In the above post the string should be read as:
"Geb&amp;uuml;hr vor Ort von &amp;euro; 30,- pro Woche"
Tree builder seems to decode the string entities via HTML::Entities. Is
it possible to extend the tree builder with an option that skips
decoding the HTML entities into chars? The only way out seems to be
calling encode again, but that is not pretty.
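
A minimal sketch of what such an option looks like, assuming a newer HTML-Tree release than the 3.18 discussed here: later versions of HTML::TreeBuilder document a no_expand_entities attribute that leaves entities undecoded during parsing. Whether it is available depends on the installed version, so treat this as an illustration rather than a fix for 3.18.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup built from the string in the report above.
my $html = '<p>Geb&uuml;hr vor Ort von &euro; 30,- pro Woche</p>';

my $tree = HTML::TreeBuilder->new;

# Ask the parser to leave entities such as &uuml; untouched
# (assumes an HTML-Tree release that provides this attribute).
$tree->no_expand_entities(1);

$tree->parse($html);
$tree->eof;

# With the attribute enabled, the text keeps its entity references.
my $text = $tree->as_text;
print "$text\n";

# Free the tree explicitly, as the HTML::TreeBuilder documentation recommends.
$tree->delete;
```

With the attribute left at its default, the same call to as_text would return decoded characters instead.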

From on 2005-09-08 19:32:13
:

The problem seems to be solved after upgrading from Perl v5.8.3 to v5.8.6.

From on 2005-10-06 18:37:33
:

The Debian stable folk have 5.8.4, and this bug is affecting their programs.

How could they work around this problem? I'm looking at the source code
and I'm tempted to comment out the HTML::Entities::encode line... but
would that then create other problems?

From petek@cpan.org on 2006-11-11 23:13:43
:

Can't reproduce with 3.18 and up. Please resubmit with a test case if
you are still having this issue.

As an aside, I have added this case as a test in HTML-Tree 3.22, which
will be released as part of the Chicago Hackathon this weekend.

From dma_k@mail.ru on 2006-11-13 09:50:27
:

Hello! Thanks for paying attention to the (possible) problem.

Finally, as I said above, some of perl installations work, some -- not,
and I've come to the conclusion, it's a core Perl bug with unicode
chars. What version of Perl do you use for testing?

Can you please, define more precisely the return value for
"HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
text? While investigating the problem, I've read
http://jerakeen.org/files/2005/perl-utf8.slides.pdf -- it has a very
nice chart. It is worth reading!

Actually, I found this problem while implementing an HTML parser
that stores the data in a MySQL DB, and this data is supposed to be
displayed as HTML again. So, in my case I used the following flow:

my $html_root = HTML::TreeBuilder->new_from_content($contents);

foreach ($html_root->guts())
{
  ...
  # Bind both values as placeholders instead of interpolating $id into the SQL.
  $dbh->prepare("insert into my_table (id, contents) values (?, ?)")
      ->execute($id, HTML::Entities::encode_entities($_->as_trimmed_text()));
}

so I used the "reverse" conversion for chars. Unfortunately, I still
don't have any working example of storing Unicode strings into MySQL 4.0.x
from Perl so that they can be read back correctly from Java :( but that's
outside the scope of the problem being discussed.

From petek@cpan.org on 2006-11-13 16:29:30
:

On Mon Nov 13 04:50:27 2006, dma_k@mail.ru wrote:
> Finally, as I said above, some of perl installations work, some -- not,
> and I've come to the conclusion, it's a core Perl bug with unicode
> chars. What version of Perl do you use for testing?

I use Apple's Perl (5.8.6 on OSX), Debian sarge's Perl (5.8.4), and a
custom Perl (5.8.2) for release testing.  I do have a 5.6 install
sitting around, and t/body.t fails on unicode escape tests.  (I should
skip those on that platform.)

> Can you please, define more precisely the return value for
> "HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
> text?

It returns the text exactly as it's contained in each HTML::Element (not
HTML::Entity) and children.  If that's UTF-8, Unicode, ISO-8859-1, or
whatever, that's been decided by HTML::Parser.  HTML::Element is just
the middleman, doing simple concatenation.

If you could give a test case that shows the broken behavior on your
platform, I would appreciate it.

From stocks@cpan.org on 2009-11-23 23:19:09
:

I'm not sure if this module is still being actively maintained but I am 
experiencing the same issues on perl 5.10. I don't know if the issue is 
with HTML::Element or an underlying module.

Here is a test case which fails on Fedora 9 platform:
==============================
#!/usr/bin/perl

use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';

like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');

like( $h->as_HTML, qr/漢語/xms, 'Found chinese chars in html output' );
========================

Running this on Fedora 9 produces the following output:

1..2
ok 1 - Found chinese chars input string
not ok 2 - Found chinese chars in html output
#   Failed test 'Found chinese chars in html output'
#   at ./test2.pl line 13.
#                   '<p>This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;
# '
#     doesn't match '(?msx-i:漢語)'
# Looks like you failed 1 test of 2.

From stocks@cpan.org on 2009-11-23 23:28:17
:

Sorry, not sure if I was experiencing the same issue as described above, 
but it seemed the same. Just realized that passing empty string to 
as_HTML solves this issue. Updated test case, which passes:

===================================
#!/usr/bin/perl

use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';

like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');

like( $h->as_HTML( '' ), qr/漢語/xms, 'Found chinese chars in html output' );
===================================
1..2
ok 1 - Found chinese chars input string
ok 2 - Found chinese chars in html output

From dma_k@mail.ru on 2009-11-24 10:38:26
:

Using as_HTML('') is funny, because in this case you tell HTML::Element
not to encode entities at all (the default should be '<>&').
Why do you expect as_HTML() to return a string that is not HTML-encoded?
I would use as_text() for this case. Or do you mean that as_HTML()
HTML-encodes Chinese characters incorrectly? Try plain
with first argument, seems to be a bug but of the different nature.
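
To make the difference between the as_HTML() variants concrete, here is a small sketch. It assumes the script is saved as UTF-8 and uses 'use utf8' so that the literal is a character string; the sample text and variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 source text
use HTML::Element;

# Encode wide characters as UTF-8 when printing.
binmode STDOUT, ':encoding(UTF-8)';

my $h = HTML::Element->new('p');
$h->push_content('漢語: a < b & c');

# No argument: every character HTML::Entities considers unsafe is encoded.
my $default = $h->as_HTML;

# '<>&': only the three markup-significant characters are encoded.
my $minimal = $h->as_HTML('<>&');

# Empty string: no entity encoding at all, so the result may not be valid HTML.
my $raw = $h->as_HTML('');

print $default, $minimal, $raw;
```

Passing '<>&' keeps the markup well-formed while leaving non-ASCII characters alone, which is usually what is wanted when the output is served as UTF-8.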

From jeff.fearn@gmail.com on 2010-04-24 04:17:21
:

This is a bug in HTML::Entities, line 479 is encoding the Chinese
characters. Adding the following debug code to HTML/Entities.pm reveals
this:

print(STDERR "1: ref = $$ref\n");
	$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
print(STDERR "2: ref = $$ref\n");

1: ref = This is a test 漢語
2: ref = This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;

Cheers, Jeff.

From gaas@cpan.org on 2010-07-09 13:16:30
:

From your example I can't tell if the string you passed to HTML::Entities::encode() was a Unicode 
string or the raw (undecoded) UTF-8 bytes.

Please try the attached test program.  It prints:

# encode-test.pl:4: "This is a test \x{6F22}\x{8A9E}"
# encode-test.pl:5: "This is a test &#x6F22;&#x8A9E;"

for me, so it seems correct.  If I comment out the 'use utf8;' line then the output becomes:

# encode-test.pl:4: "This is a test \xE6\xBC\xA2\xE8\xAA\x9E"
# encode-test.pl:5: "This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;"

If you get different results, please tell me what version of perl and HTML::Parser you are using.  
If you get the result above then I don't consider this a bug.
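
The distinction between the two cases can be sketched without relying on 'use utf8' or on the file's encoding. This is not the attached encode-test.pl, only an illustration with hypothetical variable names: the bytes are written as escapes and then decoded explicitly with the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The UTF-8 bytes of U+6F22 U+8A9E, written as escapes so the result does
# not depend on 'use utf8' or on the encoding of this file.
my $bytes = "This is a test \xE6\xBC\xA2\xE8\xAA\x9E";

# Decoding yields a character string: two characters instead of six bytes.
my $chars = decode('UTF-8', $bytes);

printf "byte string:      length %d\n", length $bytes;
printf "character string: length %d\n", length $chars;

# On the byte string, each byte is treated as a Latin-1 character
# and gets its own entity.
my $from_bytes = encode_entities($bytes);

# On the character string, each code point becomes one numeric entity.
my $from_chars = encode_entities($chars);

print "$from_bytes\n$from_chars\n";
```

Decoding input at the boundary of the program, before it reaches HTML::Element or HTML::Entities, is what keeps the two cases from being mixed up.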

From gaas@cpan.org on 2010-07-09 13:17:49
:

On Fri Jul 09 09:16:30 2010, GAAS wrote:
> Please try the attached test program.  It prints:

Of course, I forgot to attach the file :-(
