Punycode issue reported against RDF-RDFa-Generator, appears to involve RDF-Trine. #85

Closed
tobyink opened this Issue May 15, 2013 · 15 comments

Projects

None yet

3 participants

@kasei
Owner
kasei commented May 15, 2013

I don't have time to dig into this right now (am traveling in very internet sparse areas), but this seems to pass right through RDF::Trine code and is caused by URI (afaict). In RDF::Trine::Node::Resource->new, the call to URI->new_abs is the point at which the punycode URI is turned into what looks like proper unicode string, but I'm not sure what that means for RDF::Trine, since in many cases you have unicode strings that originate as unicode (IRI values) that you wouldn't want to emit as punycode URIs on the other end.

@kasei
Owner
kasei commented Jun 8, 2013

After talking to Jonas a bit more about the issue here, I've found out that (at least part of) the problem relates to URI's handling of punycode. It seems that the URI string being returned after the punycode is decoded is encoded in latin-1 (or something similar) and (obviously) doesn't have the utf8 flag set. That's where the "invalid utf-8" error messages are coming from. I don't know enough yet about the specifics of the punycode algorithm or of URI's implementation to know what the actual problem is or how to fix it yet.

@kasei
Owner
kasei commented Jun 9, 2013

It does appear to be latin-1. The URI "http://www.xn--hestebedgrd-58a.dk/" gets decoded in URI::_punycode::decode_punycode to a set of character numbers (like codepoints, but apparently not unicode):

[ 104, 101, 115, 116, 101, 98, 101, 100, 103, 229, 114, 100 ]

These are then turned into a string with join '', map chr, @output, with chr(229) turning into "å" in latin-1.

@kasei
Owner
kasei commented Jun 9, 2013

Hmmm... I missed the fact that U+00e5 (229) is å, also. So maybe the problem is simply that URI::_punycode is producing the correct byte sequence, but the SV its in isn't marked as UTF8.

@kasei
Owner
kasei commented Jun 9, 2013

I think maybe this fixes the problem:

--- URI/_punycode.pm.orig   2013-06-09 09:14:14.000000000 +0300
+++ URI/_punycode.pm    2013-06-09 09:14:17.000000000 +0300
@@ -86,7 +86,11 @@
    warn join " ", map sprintf('%04x', $_), @output if $DEBUG;
    $i++;
     }
-    return join '', map chr, @output;
+    my $uri    = join '', map chr, @output;
+    use Encode;
+   my $octets     = encode('UTF-8', $uri, Encode::FB_CROAK);
+   $uri = decode('UTF-8', $octets,     Encode::FB_CROAK);
+    return $uri;
 }

 sub encode_punycode {

This isn't ideal, but if it does fix the problem, I'll try to push this upstream and ask GAAS about it.

@jonassmedegaard

Doesn't look like it makes any difference to the test at https://rt.cpan.org/Public/Bug/Display.html?id=85297#txn-1212286 :-(

@jonassmedegaard

...but I confirm that it fixes my original testcase at https://rt.cpan.org/Public/Bug/Display.html?id=85297#txn-1212116

@kasei
Owner
kasei commented Jun 10, 2013

This should fix the related serializer bug that was mangling punycode URIs on output:

--- a/RDF-Trine/lib/RDF/Trine/Node/Resource.pm
+++ b/RDF-Trine/lib/RDF/Trine/Node/Resource.pm
@@ -176,7 +176,7 @@ sub as_ntriples {
        if ($ntriples{ $ra }) {
                return $ntriples{ $ra };
        } else {
-               my $string              = URI->new( encode_utf8($self->uri_value) )->canonical;
+               my $string              = URI->new( $self->uri_value )->canonical;
                my $ntriples    = '<' . $string . '>';
                $ntriples{ $ra }        = $ntriples;
                return $ntriples;
@kasei
Owner
kasei commented Jun 10, 2013

Commit 342c3db is the fix to RDF::Trine::Node::Resource.

@jonassmedegaard

I mentioned on irc that only when also dropping encode_utf8() from sse() it helped my (real, non-testcase) code, and you wondered why.

My code uses RDF::Query and feeds resulting URI $nodes to RDF::RDFa::Builder.

Seems different bugs are revealed depending on how non-ascii URLs are stringified:

If I pass $node (i.e. let RDF::Trine::Node::Resource auto-stringify) then I get garbled punycode.
Fix: encode_utf8() dropped from sse().

If I pass $node->as_string then I get "Redland error: XML tree error: string is not in UTF-8".
Fix: URI patch.

If I pass $node->uri_value then I get "Redland error: XML tree error: string is not in UTF-8".
Fix: URI patch.

In other words, none of the combinations have any need for the already-applied patch to as_ntriples().

What I did as workaround until above fixes came along was to pass URI->new($node->uri_value)->as_string - which effectively passes punycode so avoid problems with utf8 chars not flagged as such.

@kasei
Owner
kasei commented Jun 11, 2013

I think it will depend on the patch to as_ntriples as soon as I fix Toby's RDF::Trine::Serializer::NTriples::Canonical to use as_ntriples() instead of sse() (as it should have been doing all along).

@jonassmedegaard

Ah - sounds reasonable indeed.

@kasei
Owner
kasei commented Jun 11, 2013
@kasei
Owner
kasei commented Jun 15, 2013

Should be fixed in commit 85276ea.

@kasei
Owner
kasei commented Jun 17, 2013

Commit e20f560 (on the unicode-bugfix branch) should have now resolved all of the issues associated with this bug. Punycode is no longer decoded by parsers, and URI base resolution works with full IRIs. The only thing to note now is that the N-Triples serializer will decode punycode URIs and emit them as unicode escaped IRIs. For example, the URI 'http://www.xn--hestebedgrd-58a.dk/' will be emitted as the N-Triples <http://www.hestebedg\u00E5rd.dk/>. This is probably a good thing as it encourages the use of proper IRIs instead of letting punycode into the wild.

@kasei kasei closed this Jun 17, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment