Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dump_string must not care about Perl's internal representation of a variable #16

Open
2shortplanks opened this issue May 20, 2019 · 26 comments
Labels

Comments

@2shortplanks
Copy link

2shortplanks commented May 20, 2019

As of 0.0016 dump string actually cares what the internal UTF8 flag of a scalar is set to and behaves differently depending on what state it is in.

Perl's current internal representation should have no user visible effects. A library should be able to return a scalar containing the bytes \x{C3}\x{A9} with the UTF8 flag off, or a scalar containing \x{C3}\x{83}\x{C2}\x{A9} with the UTF8 flag on, and both be considered the byte sequence for é when decoded as UTF-8.

@perlpunk
Copy link
Owner

perlpunk commented May 20, 2019

Could you give me a small example code, and what you expect? Are you talking only about the result from dump or also about the reloaded data?

YAML::PP is now behaving like YAML::XS (with the exception that YAML::XS::Dump() returns utf8 encoded data).

YAML 1.2 says that YAML stream consists of unicode characters, and the contents of scalar nodes too. That means that if I get input that does not have the utf8 flag, and has the content \x{C3}\x{A9} for example, it is considered binary data, which YAML does not support, so I need to upgrade it to utf8.

Consider the input string \303\300 (\x{c3}\x{c0}). This is binary data, and it is not valid unicode.

my $yaml = $yp->dump_string("\303\300");

If I didn't upgrade the input string, the resulting YAML would contain: --- \303\300\n which is invalid unicode and therefor invalid YAML.

Note that I'm not a unicode expert, so please correct me if I'm saying something wrong.

I can think of adding options for YAML::PP to control behaviour, but first I would like to understand your use case.

Edit: removed YAML.pm and YAML::Tiny
Edit: fix typo

@perlpunk
Copy link
Owner

Here is a comparison. The behaviour for YAML.pm and YAML::Tiny changes, if the input also contains a decode_utf8("ü"):

% perl -wE'
use Encode;
use Devel::Peek;
use YAML ();
use YAML::XS ();
use YAML::PP ();
use YAML::Tiny ();
my $utf8 = decode_utf8("ü");
my $binary = "\303";
Dump(YAML::Dump($binary, $utf8));
Dump(YAML::PP::Dump($binary, $utf8));
Dump(YAML::Tiny::Dump($binary, $utf8));
Dump(YAML::XS::Dump($binary, $utf8));
Dump(YAML::Dump($binary));
Dump(YAML::PP::Dump($binary));
Dump(YAML::Tiny::Dump($binary));
Dump(YAML::XS::Dump($binary));'
SV = PV(0x562fa0b76b70) at 0x562fa0baac88
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0568280 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 24
  COW_REFCNT = 1
SV = PV(0x562fa05675f0) at 0x562fa0bb4770
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0bb58f0 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 24
  COW_REFCNT = 0
SV = PV(0x562fa0567670) at 0x562fa09f03f8
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0bb8b50 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 17
  COW_REFCNT = 1
SV = PV(0x562fa0567670) at 0x562fa09f03f8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb55b0 "--- \303\203\n--- \303\274\n"\0
  CUR = 14
  LEN = 24
SV = PV(0x562fa0567670) at 0x562fa0bb4320
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK)
  PV = 0x562fa0aae840 "--- \303\n"\0
  CUR = 6
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x562fa05676a0) at 0x562fa0a9ed70
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0aaeef0 "--- \303\203\n"\0 [UTF8 "--- \x{c3}\n"]
  CUR = 7
  LEN = 10
  COW_REFCNT = 0
SV = PV(0x562fa0b76ab0) at 0x562fa0bb4338
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb57f0 "--- \303\n"\0
  CUR = 6
  LEN = 10
SV = PV(0x562fa0b76ab0) at 0x562fa0bb4338
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb57f0 "--- \303\203\n"\0
  CUR = 7
  LEN = 10

So to me YAML::PP and YAML::XS behave more consistent.

@pali
Copy link

pali commented Jan 22, 2020

@2shortplanks is fully right!

UTF8 flag is relevant only for XS code and says if returned C char* buffer is encoded in UTF-8 or in Latin1. UTF8 flag does not (or rather should not) expose to pure Perl code, which YAML::PP is.

About YAML::XS, it has bugs in handling Unicode. And YAML::PP should not try to emulate these bugs, but be rather correct Unicode implementation.

I have somewhere tests for YAML::Syck (not for YAML::XS) where it shows that YAML::Syck is broken in handling of Unicode. If you want these tests I could try to find them...

@pali
Copy link

pali commented Jan 22, 2020

https://metacpan.org/pod/Encode#is_utf8

  • Typically only necessary for debugging and testing
  • Don't use this flag as a marker to distinguish character and binary data
  • If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa
  • utf8 also has the utf8::is_utf8 function

So usage of is_utf8 function is in YAML::PP wrong. It does not say anything about binary data or UTF-8 data.

As @2shortplanks said, YAML::PP must not care about UTF8 flag exposed by is_utf8 function.

@perlpunk
Copy link
Owner

perlpunk commented Jan 22, 2020

I'm not saying anyone of you is wrong.
But I asked for a use case. A code example.

I would like to do it right, and as far as I can see for detecting invalid characters or characters that need to be quoted I need a string with unicode characters. I have two choices: do a decode_utf8 or utf8::upgrade. Right?

I can leave out the check for is_utf8 and call utf8::upgrade in every case. But I guess that results in the same behaviour anyway.

@choroba
Copy link

choroba commented Jan 22, 2020

Another option would be to fail with invalid input.

@pali
Copy link

pali commented Jan 22, 2020

utf8::upgrade in pure perl code basically do nothing (it is just optimization for following string operations) and can be fully avoided. It does not do any validation nor check.

decode_utf8 takes sequence of octets (scalar with values from range 0x00...0xFF), treats them as UTF-8 sequence and returns sequence of Unicode code points (scalar value with values from range 0x000000..0x10FFFF). So it is doing conversion from UTF-8 to Unicode.

And now question is what you need to. It is needed to know:

  1. What is the user input? UTF-8 (octets) or Unicode (code points)?
  2. What is expected on output? UTF-8 (octets) or Unicode (code points)?

To make it more easier, I would take an example from Cpanel::JSON::XS module which is JSON encoder / decoder.

Cpanel::JSON::XS's decode_json function is: expecting on its input UTF-8 (octets) string and returns structure (hash/array/...) with string values in Unicode. encode_json function takes as it input perl structure with string values in Unicode and returns one scalar in UTF-8.

So I think that YAML encoder/decoder could do same thing as JSON encoder/decoder.

@perlpunk
Copy link
Owner

Another option would be to fail with invalid input.

Well, what do you mean with invalid input? How can I detect it? :)

@perlpunk
Copy link
Owner

@pali YAML::PP dump is supossed to take a data structure with Unicode characters and returns a string with Unicode characters.
load works the same the other way around.

If I get latin1 and operate on it, checking for characters that need to be quoted, that can get wrong results.
That's why I do utf8::upgrade.

Like I said in my first comment, I could add an option to control behaviour.
But I would like to know the use case(s) first.

@2shortplanks
Copy link
Author

2shortplanks commented Jan 22, 2020 via email

@pali
Copy link

pali commented Jan 22, 2020

@pali YAML::PP dump is supossed to take a data structure with Unicode characters and returns a string with Unicode characters.
load works the same the other way around.

In this case it is simple: You do not have to (or rather should not) use any utf8 function.

Everything in pure perl is character related, one character is one Unicode code point.

Latin1/UTF-8 is needed to handle only in XS/C code which works with char* C buffers.

@perlpunk
Copy link
Owner

perlpunk commented Jan 22, 2020

@2shortplanks ok, I think I understand now what result you expect.
I just have to clarify a couple of things.
So if you encode the perl data ["L\x{e9}on"] (L\351on), you want to get back the exact same value "---\n- L\351on\n".

Module Result
Cpanel::JSON::XS ["L\303\251on"]
JSON::PP ["L\303\251on"]
YAML::PP (with utf8::upgrade) ---\n- L\303\251on\n ([UTF8 "---\n- L\x{e9}on\n)
YAML::PP (without utf8::upgrade) ---\n- L\351on\n
YAML::XS ---\n- L\303\251on\n
YAML ---\n- L\351on\n
YAML::Syck --- \n- L\351on\n

So currently the YAML::PP (and YAML::XS) result is the same as Cpanel::JSON::XS & JSON::PP, but YAML::PP returns unicode characters.

It gets interesting as soon as I add another value to encode:
["L\x{e9}on", decode_utf8("ä")]
then I get the following:

Module Result
Cpanel::JSON::XS ["L\303\251on","\303\244"]
JSON::PP ["L\303\251on","\303\244"]
YAML::PP (with utf8::upgrade) ---\n- L\303\251on\n- \303\244\n" ([UTF8 "---\n- L\x{e9}on\n- \x{e4}\n"])
YAML::PP (without utf8::upgrade) ---\n- L\303\251on\n- \303\244\n" ([UTF8 "---\n- L\x{e9}on\n- \x{e4}\n"])
YAML::XS ---\n- L\303\251on\n- \303\244\n
YAML ---\n- L\303\251on\n- \303\244\n"
YAML::Syck --- \n- L\351on\n- \303\244\n

(YAML::Syck is a bit weird, and YAML::XS::Load doesn't accept it.)

That means that as soon as I add other items to the data, the first value also changes.
That's because due to the decoded "ä" the utf8::upgrade happens implicitly.

What I don't like about it, is that adding other data changes the kind of the result.

But like I said, I'm still not very good in unicode things. So my documentation says that Dump returns unicode characters.
I wonder if that is still true if I remove the upgrade. Because if I Dump the string "\344", I get back "\344", but this is a latin1 character?
On the other hand, comparing the result to decode_utf8("---\n- ä\n") returns 1.

What I'm also wondering, why is the current behaviour a problem for you?
The internal result string might look different, but the string is the same.

Can someone help removing my confusion? :)

@pali
Copy link

pali commented Jan 22, 2020

Instead of Devel::Peek::Dump, you should look at output from:
print join ' ', map { sprintf "U+%04x", ord($_) } split //, $string
This will print sequence of ordinals stored in $string. And this is important.

Cpanel::JSON::XS::encode_json is automatically doing conversion from Unicode to UTF-8 -- equivalent of passing $output = Encode::encode("UTF-8", $input). If you want to prevent doing conversion from Unicode to UTF-8 in Cpanel::JSON::XS, you need to use its decorator oop API: Cpanel::JSON::XS->new->encode($input) (instead of API Cpanel::JSON::XS::encode_json($input)).

@pali
Copy link

pali commented Jan 22, 2020

Reading output from Dump is sometimes hard as you need to know what to read: You should always look at UTF8 part if is available and ignores other. Therefore I suggest to read output from my above form from previous post.

Can someone help removing my confusion? :)

I will try it. In string perl scalar you have always stored sequence of ordinals (numbers from 0 to 0x10FFFF). When comparing two strings via eq Perl is doing comparision on ordinals.

There are two internal representations of ordinals in scalar, but in pure perl code you do not have (easy) access to it (and you should not care about it). When needed Perl itself automatically convert from one representation to another if something requires it (explicit conversion can be done by those utf8::upgrade and downgrade function; but it is not necessary as Perl do it automatically when needed). So you can call eq without any problem on two scalars which are in different internal representation. Perl always correctly compares them (e.g. convert them to same representation and then do comparision).

I would suggest you to look at output from code which I posted in previous post on your tested modules, so see what is there really stored.

@pali
Copy link

pali commented Jan 23, 2020

@perlpunk let me know if it is more clear for you, or you need to explain some specific part of Unicode. I will try if there are still some unclear parts.

@perlpunk
Copy link
Owner

perlpunk commented Jan 24, 2020

@pali thanks!

From what I see as the output from Devel::Peek::Dump and your snippet, it looks to me that both results actually return the correct string, no matter if I use upgrade or not, just with a different internal representation.

So if I leave it out, it won't be upgraded, unless necessary if it gets concatenated with another string.

If I remeber correctly, I made this change because someone wanted to dump binary data and there was a problem. I can't find the issue, though.
But the test suceeds without the upgrade.

I'm wondering about the documentation

Input strings should be Unicode characters. If not, they will be upgraded with utf8::upgrade.
Output will return Unicode characters (decoded).

What would be the best description, if I take out the upgrade?

I'm also wondering if the current code is really broken, in a way that the result would behave differently? When the internal representation is important for the result?
Or it just about removing an unnecessary upgrade?

While looking at the Emitter code, I also think I just fixed some control-character escaping.
I will also look in to your issue about YAML::PP::Schema::Binary!

@pali
Copy link

pali commented Jan 27, 2020

From what I see as the output from Devel::Peek::Dump and your snippet, it looks to me that both results actually return the correct string, no matter if I use upgrade or not, just with a different internal representation.

Well, utf8::upgrade does not change logical value in string, only internal representation. So calling utf8::upgrade does not change result and you always get correct output (with and also utf8::upgrade call).

Normally you need to call utf8::upgrade just for debbuging or testing purposes. E.g. test that your C/XS module works correctly when it get string scalar with any internal representation. Or another purpose of using utf8::upgrade is when using buggy/broken C/XS module which does not work correctly with one of internal string representation.

But for pure perl module there is basically no need to use it.

@pali
Copy link

pali commented Jan 27, 2020

What @2shortplanks wrote in his report is to not look at UTF8 flag. In pure perl it can be accessed by Encode::is_utf8 or utf8::is_utf8 functions. And on more places in YAML::PP code is is_utf8 used. This is what should be eliminated. utf8::upgrade is harmless, but not needed.

@pali
Copy link

pali commented Jan 27, 2020

What would be the best description, if I take out the upgrade?

Just say: Input must be Unicode. And output will be in Unicode.

I'm also wondering if the current code is really broken, in a way that the result would behave differently?

Looking at code and is_utf8 is called on these places:

YAML::PP::Emitter::scalar_event

    unless (utf8::is_utf8($value)) {
        utf8::upgrade($value);
    }

utf8::upgrade do nothing on already-upgraded-string, so upgrade can be called unconditionally.

t/45.binary.t

            if (utf8::is_utf8($reload)) {
                utf8::downgrade($reload);
            }

utf8::downgrade do nothing on already-downgraded-string, so downgrade can be called unconditionally.

So both these cases does not lead to broken code, but it is suspicious that is_utf8 is used.

Last usage of is_utf8 is in YAML::PP::Schema::Binary which is problematic, but I for it I created separate issue.

@perlpunk
Copy link
Owner

Thanks @pali!
I have pushed my changes so far to the escape branch (I started with escaping control character issues, see #17 )

perlpunk added a commit that referenced this issue Jan 28, 2020
See
* #16 dump_string must not care about Perl's internal representation of a variable
* #17 Dump() can create invalid YAML 1.1
perlpunk added a commit that referenced this issue Jan 30, 2020
See
* #16 dump_string must not care about Perl's internal representation of a variable
* #17 Dump() can create invalid YAML 1.1
@perlpunk
Copy link
Owner

I just uploaded version 0.018_001 with my fixes

@pali
Copy link

pali commented Feb 13, 2020

In documentation is utf8::upgrade still mentioned, even it was removed:
https://metacpan.org/pod/release/TINITA/YAML-PP-0.019/lib/YAML/PP.pm#dump_string

@pali
Copy link

pali commented Feb 13, 2020

So if you read from a file, you should decode it, for example with Encode::decode_utf8($bytes).

Encode::decode_utf8 documentations says that this function should not be used for data exchange: https://metacpan.org/pod/Encode#decode_utf8

So I think it is not a good idea to suggest using this function for decoding external data.

@pali
Copy link

pali commented Feb 13, 2020

Anyway, original problem about internal representation as described in this ticket seems to be already fixed. Thanks!

@perlpunk
Copy link
Owner

@pali thanks, fixed in git

@pali
Copy link

pali commented Feb 17, 2020

Thank you! I think that this issue is fixed now and can be closed.

Btw, Encode::encode_utf8 has same warning: https://metacpan.org/pod/Encode#encode_utf8

@perlpunk perlpunk added the bug label Feb 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants