Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Description and implementation of YAML::PP::Schema::Binary does not make sense #28

Open
pali opened this issue Jan 23, 2020 · 4 comments

Comments

@pali
Copy link

pali commented Jan 23, 2020

First description:

By prepending a base64 encoded binary string with the C<!!binary> tag, it can
be automatically decoded when loading.
If you are using this schema, any string containing C<[\x{7F}-\x{10FFFF}]>
will be dumped as binary. That also includes encoded utf8.

encode_base64() is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹ
So YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.

Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like encode_base64() function takes it.

And next implementation:

my $binary = $node->{value};
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
# ASCII
return;
}
if (utf8::is_utf8($binary)) {
# utf8
return;
}
# everything else must be base64 encoded
my $base64 = encode_base64($binary);

unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)is equivalent to if ($binary =~ m/[\x{00}-\x{7E}]/) checks for all 7-bit ASCII characters except 7-bit ASCII \x{7F}. Comment says that this code is for ASCII which is not truth as it is ASCII ∖ {0x7F}.

Next there is check if (utf8::is_utf8($binary)) which in our case is basically:
if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/))
So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for utf8, but it is not truth, see documentation and my above simplified implementation.

And finally it calls encode_base64 function. When this function is called? Strings with only {0x00..0x7E} are ignored by first ASCII check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. So encode_base64 is called when string contains at least one value 0x7F or when there is at least one value from set {0x80..0xFF} and is_utf8 (pseudo)-randomly returned false.

Suggested fix

  1. Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.

  2. Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.

  3. Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.

CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.

@perlpunk
Copy link
Owner

So I tried fixing the code by using this check:

if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
    return;
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
    return;
}

However, this produces unwanted output:

use YAML::PP;
use Encode;
my $input = decode_utf8("ä");
my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
say $dump;
__END__
--- !!binary |
  5A==

Maybe I didn't understand your suggestions correctly?

@perlpunk
Copy link
Owner

Maybe it's simply not possible to detect if data is binary.
My tests in t/45.binary.t have been working like I expected, but maybe my assumptions in there are wrong.

I will document that only loading binary data this way is recommended.
For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.

@pali
Copy link
Author

pali commented Jan 30, 2020

Maybe it's simply not possible to detect if data is binary.

Yes, this is not possible. You must declare in API if input should be treated as binary (and therefore only qr/^[\x00-\xff]*$/ is accepted, otherwise function should die) or input should be treated as (Unicode) string and then any non-undef input is accepted.

However, this produces unwanted output:

use YAML::PP;
use Encode;
my $input = decode_utf8("ä");

Lets stop at above line as it does not have to be obvious and simple what is doing here.

Unless you specify use utf8 pragma, Perl parses source code/file in Latin1 encoding. As Latin1 is defined at whole 8-bit domain, you can represent with it also byte sequence of characters in UTF-8. It is ugly but I see it (mis)used in Perl lot of times. So when use utf8 is not specified (which is your case) above line is equivalent to (I'm assuming that you have wrote that file in UTF-8 encoding):

my $input = decode_utf8("\xC3\xA4");

Encode::decode_utf8 expects on its input sequence of bytes (which should represent UTF-8 sequence) and returns Unicode string, decoded from UTF-8.

So in $input would be stored Unicode string "ä" (now written in UTF-8, not in Latin1) which is equivalent to string "\xE4" and to string "\N{U+E4}".

my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
say $dump;
__END__
--- !!binary |
  5A==

Maybe I didn't understand your suggestions correctly?

And what should Binary schema/encoding do? I would say that it should expects on its input binary buffer and produce ouput in YAML marked as binary.

And because on input was one byte 0xE4 its representation in base64 is 5A==. Which seems to be correct.

So, main problem is there how is (or rather how should be) defined API of Binary schema.

If you define API in way that input is expected and must be in binary 8bit, then code should look like:

if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
    die "Input is not 8-bit";
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
    # Only 7-bit ascii, base64 encoding is not needed
    return;
}
# now apply base64 encoding
...

Plus there is a still small problem, 64bit perl supports also characters above 0x10FFFF up to the 2^64-1. They are marked as Extended Perl Unicode and are not portable. But in Perl you can create them and use them. So to be fully precise, you should check for non-8-bit as:

unless ($binary =~ m/^[\x00-\xFF]*$/) {
    die "Input is not 8-bit";
}

For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.

This should work fine.

Alternative solution how to handle types when dumping perl structure to typed-formats (YAML/JSON/...) is to provide types explicitly via additional argument / interface. This is for example implemented in Cpanel::JSON::XS, you can look at examples in documentation: https://metacpan.org/pod/Cpanel::JSON::XS::Type

@pali
Copy link
Author

pali commented Feb 13, 2020

Based on above description, here is my proposed change with API that binary schema expects 8-bit data:

diff --git a/lib/YAML/PP/Schema/Binary.pm b/lib/YAML/PP/Schema/Binary.pm
index 30b4491..a63fa64 100644
--- a/lib/YAML/PP/Schema/Binary.pm
+++ b/lib/YAML/PP/Schema/Binary.pm
@@ -27,15 +27,15 @@ sub register {
         code => sub {
             my ($rep, $node) = @_;
             my $binary = $node->{value};
-            unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
-                # ASCII
-                return;
+            if ($binary =~ m/[^\x{00}-\x{FF}]/) {
+                # non 8-bit
+                die "Input is not 8-bit binary\n";
             }
-            if (utf8::is_utf8($binary)) {
-                # utf8
+            if ($binary =~ m/^[\x{00}-\x{7F}]*$/) {
+                # 7-bit ASCII
                 return;
             }
-            # everything else must be base64 encoded
+            # 8-bit must be base64 encoded
             my $base64 = encode_base64($binary);
             $node->{style} = YAML_ANY_SCALAR_STYLE;
             $node->{data} = $base64;
@@ -84,10 +84,10 @@ See <https://yaml.org/type/binary.html>
 By prepending a base64 encoded binary string with the C<!!binary> tag, it can
 be automatically decoded when loading.
 
-Note that the logic for dumping is probably broken, see
-L<https://github.com/perlpunk/YAML-PP-p5/issues/28>.
+If you are using this schema, any string containing C<[\x{80}-\x{FF}]>
+(non-7-bit) will be dumped as binary.
 
-Suggestions welcome.
+This schema cannot be used for non-8-bit (non-binary) data.
 
 =head1 METHODS
 
diff --git a/t/45.binary.t b/t/45.binary.t
index caed569..24e4b58 100644
--- a/t/45.binary.t
+++ b/t/45.binary.t
@@ -1,6 +1,7 @@
 #!/usr/bin/env perl
 use strict;
 use warnings;
+use utf8;
 use Test::More tests => 3;
 
 use YAML::PP;
@@ -41,16 +42,15 @@ EOM
 
 };
 
-my $latin1_a_umlaut = encode(latin1 => (decode_utf8 "ä"));
+my $latin1_a_umlaut = encode(latin1 => "ä");
 my @tests = (
-    [utf8 => "a"],
+    [ascii => "a"],
+    [ascii => "test"],
+    [ascii => "euro"],
     [binary => $latin1_a_umlaut],
-    [binary =>  "\304\244",],
-    [utf8 =>  decode_utf8("\304\244"),],
-    [binary => "a umlaut ä",],
-    [utf8 => decode_utf8("a umlaut ä"),],
-    [binary => "euro €",],
-    [utf8 => decode_utf8("euro €"),],
+    [binary => "\304\244",],
+    [binary => encode_utf8("a umlaut ä"),],
+    [binary => encode_utf8("euro €"),],
     [binary => "\303\274 \374",],
     [binary => "\xC0\x80"],
     [binary => "\xC0\xAF"],
@@ -59,7 +59,7 @@ my @tests = (
     [binary => "\xE0\x83\xBF"],
     [binary => "\xF0\x80\x83\xBF"],
     [binary => "\xF0\x80\xA3\x80"],
-    [binary => [$gif, decode_utf8("ä")],],
+    [binary => [$gif, encode_utf8("ä")],],
     [binary => [$gif, 'foo'],],
 );
 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants