dump_string must not care about Perl's internal representation of a variable #16
Comments
Could you give me a small example code, and what you expect? Are you talking only about the result from dump or also about the reloaded data? YAML::PP is now behaving like YAML::XS (with the exception that YAML::XS::Dump() returns utf8 encoded data). YAML 1.2 says that YAML stream consists of unicode characters, and the contents of scalar nodes too. That means that if I get input that does not have the utf8 flag, and has the content Consider the input string
If I didn't upgrade the input string, the resulting YAML would contain: Note that I'm not a unicode expert, so please correct me if I'm saying something wrong. I can think of adding options for YAML::PP to control behaviour, but first I would like to understand your use case. Edit: removed YAML.pm and YAML::Tiny |
Here is a comparison. The behaviour for YAML.pm and YAML::Tiny changes, if the input also contains a
So to me YAML::PP and YAML::XS behave more consistently. |
@2shortplanks is fully right! The UTF8 flag is relevant only for XS code and says whether the returned C char* buffer is encoded in UTF-8 or in Latin1. The UTF8 flag is not (or rather should not be) exposed to pure Perl code, which YAML::PP is. As for YAML::XS, it has bugs in handling Unicode. YAML::PP should not try to emulate these bugs, but should rather be a correct Unicode implementation. I have tests somewhere for YAML::Syck (not for YAML::XS) which show that YAML::Syck is broken in its handling of Unicode. If you want these tests I could try to find them... |
https://metacpan.org/pod/Encode#is_utf8
So usage of As @2shortplanks said, YAML::PP must not care about UTF8 flag exposed by |
I'm not saying any of you is wrong. I would like to do it right, and as far as I can see, for detecting invalid characters or characters that need to be quoted I need a string with Unicode characters. I have two choices: do a I can leave out the check for |
Another option would be to fail with invalid input. |
And now question is what you need to. It is needed to know:
To make it easier, I would take an example from the Cpanel::JSON::XS module, which is a JSON encoder / decoder. Cpanel::JSON::XS's decode_json function expects a UTF-8 (octets) string as its input and returns a structure (hash/array/...) with string values in Unicode. The encode_json function takes as its input a Perl structure with string values in Unicode and returns one scalar in UTF-8. So I think that a YAML encoder/decoder could do the same thing as a JSON encoder/decoder. |
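The encode/decode contract described above can be sketched as follows. This is a minimal illustration that uses the core JSON::PP module as a stand-in for Cpanel::JSON::XS (an assumption made only so the snippet runs on a stock Perl); the variable names are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);

# A character string: "L", U+00E9, "o", "n".
my $name = "L\x{e9}on";

# The same characters in each of Perl's two internal representations.
my $downgraded = $name;
utf8::downgrade($downgraded);
my $upgraded = $name;
utf8::upgrade($upgraded);

# encode_json takes character strings and returns UTF-8 octets.
my $json_a = encode_json([$downgraded]);
my $json_b = encode_json([$upgraded]);

# decode_json takes UTF-8 octets and returns character strings.
my $back = decode_json($json_a)->[0];

print "same octets for both inputs: ", ($json_a eq $json_b ? "yes" : "no"), "\n";
print "round trip preserves the characters: ", ($back eq $name ? "yes" : "no"), "\n";
```

The point of the sketch is that the encoder's output depends only on the characters in the input, not on how Perl happens to store them.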
Well, what do you mean with invalid input? How can I detect it? :) |
@pali YAML::PP dump is supposed to take a data structure with Unicode characters and return a string with Unicode characters. If I get latin1 and operate on it, checking for characters that need to be quoted, that can give wrong results. Like I said in my first comment, I could add an option to control behaviour. |
I'm sorry, Tina, I don’t understand what you’re saying.
It’s my understanding that YAML::PP is meant to take in a data structure
that can contain strings and return a scalar that contains bytes.
In *either* case you don’t need to know how Perl is storing the data
internally. It could store the characters as one character per byte. It
could use multiple bytes to represent a character. It could paint symbols
on the side of elephants. None of it matters. Please pretend the utf8 flag
doesn’t exist.
I do not understand what you mean when you say “If I get latin1 and
operate on it, checking for characters that need to be quoted, that can get
wrong results.”
The characters that Perl has in its data structure aren’t stored in Latin-1
or UTF-8. They’re *characters* not bytes. You don’t have to “check” for
characters to be encoded - all characters need to be encoded with
Encode::encode (because they’re characters not bytes).
If I try and encode [“L\x{e9}on”] then I expect the result to work.
“L\x{e9}on” Is a perfectly good string that can be represented in UTF-8.
Mark
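Mark's point about Encode::encode can be checked with a small sketch like the one below. It assumes only the core Encode module; the variable names are hypothetical, and the two copies are forced into different internal representations purely for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "L\x{e9}on";

# Same characters, stored one byte per character internally.
my $stored_as_bytes = $chars;
utf8::downgrade($stored_as_bytes);

# Same characters, stored in Perl's variable-width internal form.
my $stored_as_utf8 = $chars;
utf8::upgrade($stored_as_utf8);

# Encoding works on characters, so both copies yield the same octets.
my $octets_a = encode('UTF-8', $stored_as_bytes);
my $octets_b = encode('UTF-8', $stored_as_utf8);

print join(' ', map { sprintf '%02X', ord } split //, $octets_a), "\n";
print $octets_a eq $octets_b ? "identical\n" : "different\n";
```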
…On Wed, Jan 22, 2020 at 3:25 PM Tina Müller (tinita) < ***@***.***> wrote:
@pali <https://github.com/pali> YAML::PP dump is supposed to take a data
structure with Unicode characters and returns a string with Unicode
characters.
load works the same the other way around.
If I get latin1 and operate on it, checking for characters that need to be
quoted, that can get wrong results.
That's why I do utf8::upgrade.
Like I said in my first comment, I could add an option to control
behaviour.
But I would like to know the use case(s) first.
|
In this case it is simple: You do not have to (or rather should not) use any utf8 function. Everything in pure Perl is character related; one character is one Unicode code point. Latin1/UTF-8 needs to be handled only in XS/C code, which works with char* C buffers. |
@2shortplanks ok, I think I understand now what result you expect.
So currently the YAML::PP (and YAML::XS) result is the same as Cpanel::JSON::XS & JSON::PP, but YAML::PP returns unicode characters. It gets interesting as soon as I add another value to encode:
(YAML::Syck is a bit weird, and That means that as soon as I add other items to the data, the first value also changes. What I don't like about it, is that adding other data changes the kind of the result. But like I said, I'm still not very good in unicode things. So my documentation says that What I'm also wondering, why is the current behaviour a problem for you? Can someone help removing my confusion? :) |
Instead of Devel::Peek::Dump, you should look at output from: Cpanel::JSON::XS::encode_json is automatically doing conversion from Unicode to UTF-8 -- equivalent of passing |
Reading output from Dump is sometimes hard as you need to know what to read: You should always look at
I will try it. In a Perl string scalar you always have a sequence of ordinals stored (numbers from 0 to 0x10FFFF). When comparing two strings via There are two internal representations of ordinals in a scalar, but in pure Perl code you do not have (easy) access to them (and you should not care about them). When needed, Perl itself automatically converts from one representation to the other if something requires it (explicit conversion can be done with the utf8::upgrade and utf8::downgrade functions, but it is not necessary as Perl does it automatically when needed). So you can call I would suggest you run the code which I posted in my previous post on your tested modules, to see what is really stored there. |
@perlpunk let me know if it is clearer for you now, or if you need some specific part of Unicode explained. I will try if there are still some unclear parts. |
@pali thanks! From what I see as the output from Devel::Peek::Dump and your snippet, it looks to me that both results actually return the correct string, no matter if I use So if I leave it out, it won't be upgraded, unless necessary if it gets concatenated with another string. If I remember correctly, I made this change because someone wanted to dump binary data and there was a problem. I can't find the issue, though. I'm wondering about the documentation
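A minimal sketch of the "sequence of ordinals" view described above, using only built-in functions (the variable names are hypothetical). Two scalars are forced into different internal representations and then compared the way pure Perl code would see them.

```perl
use strict;
use warnings;

my $down = "L\x{e9}on";
my $up   = "L\x{e9}on";
utf8::downgrade($down);   # one byte per character internally
utf8::upgrade($up);       # variable-width internally

# The internal flag differs between the two scalars ...
my $flag_differs = (utf8::is_utf8($up) && !utf8::is_utf8($down)) ? 1 : 0;

# ... but everything visible at the character level is the same.
my @ords_down = map { ord } split //, $down;
my @ords_up   = map { ord } split //, $up;

print "flag differs\n"  if $flag_differs;
print "strings are eq\n" if $down eq $up;
print "same length\n"    if length($down) == length($up);
print "same ordinals\n"  if "@ords_down" eq "@ords_up";

# Concatenating with a character above 0xFF makes Perl convert the
# representation on its own; the existing characters are unchanged.
my $joined = $down . "\x{263A}";
print "character 2 is still U+00E9\n" if ord(substr($joined, 1, 1)) == 0xE9;
```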
What would be the best description, if I take out the I'm also wondering if the current code is really broken, in a way that the result would behave differently? When the internal representation is important for the result? While looking at the Emitter code, I also think I just fixed some control-character escaping. |
Well, Normally you need to call But for pure perl module there is basically no need to use it. |
What @2shortplanks wrote in his report is to not look at UTF8 flag. In pure perl it can be accessed by |
Just say: Input must be Unicode. And output will be in Unicode.
Looking at code and
YAML::PP::Emitter::scalar_event:
    unless (utf8::is_utf8($value)) {
        utf8::upgrade($value);
    }
t/45.binary.t:
    if (utf8::is_utf8($reload)) {
        utf8::downgrade($reload);
    }
So both these cases do not lead to broken code, but it is suspicious that Last usage of is_utf8 is in YAML::PP::Schema::Binary, which is problematic, but for it I created a separate issue. |
I just uploaded version 0.018_001 with my fixes |
In documentation is |
So I think it is not a good idea to suggest using this function for decoding external data. |
Anyway, original problem about internal representation as described in this ticket seems to be already fixed. Thanks! |
@pali thanks, fixed in git |
Thank you! I think that this issue is fixed now and can be closed. Btw, |
As of 0.0016 dump_string actually cares what the internal UTF8 flag of a scalar is set to and behaves differently depending on what state it is in.
Perl's current internal representation should have no user visible effects. A library should be able to return a scalar containing the bytes
\x{C3}\x{A9}
with the UTF8 flag off, or a scalar containing
\x{C3}\x{83}\x{C2}\x{A9}
with the UTF8 flag on, and both be considered the byte sequence for
é
when decoded as UTF-8.
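The two scalars described in the report can be reproduced with a short sketch, assuming only the core Encode module (variable names are hypothetical). Both scalars hold the same two characters, 0xC3 and 0xA9; only the internal buffer differs, and decoding either one as UTF-8 gives the single character U+00E9.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets that form U+00E9 when decoded as UTF-8, UTF8 flag off.
my $flag_off = "\x{C3}\x{A9}";
utf8::downgrade($flag_off);

# The same two characters with the UTF8 flag on.
my $flag_on = "\x{C3}\x{A9}";
utf8::upgrade($flag_on);

# Show what the upgraded buffer looks like by UTF-8 encoding a copy
# of the characters (this copy is only for display).
my $internal = $flag_on;
utf8::encode($internal);
print join(' ', map { sprintf '%02X', ord } split //, $internal), "\n";

# Decoding treats both scalars as the same octet sequence.
my $decoded_off = decode('UTF-8', $flag_off);
my $decoded_on  = decode('UTF-8', $flag_on);

print "both decode to U+00E9\n"
    if $decoded_off eq "\x{e9}" && $decoded_on eq "\x{e9}";
```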