-
Notifications
You must be signed in to change notification settings - Fork 5
Add initial draft syrup specification #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misc. comments in-line.
Separately, I'd like to gently suggest that it might be better to just use the standard binary form of preserves, and drop syrup entirely -- as far as I can tell the only advantage of syrup is that the use of mnemonic characters makes it a bit easier to eyeball the meaning of data without tools, but it has several intrinsic disadvantages (many of them related to efficiency in one way or another), and would require a bunch of work to formally specify, which Tony has already done for us if we use the existing format.
draft-specification.md
Outdated
|
||
Syrup is a lightweight and easy-to-implement data serialization format. A serialized Syrup structure is canonicalized; the same data thus always serializes to the same set of octets (8 bit values). | ||
|
||
Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup also provides an easy way to canonicalized as unordered data types are sorted by the octet which is simple to implement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awkwardly worded. Maybe:
Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup also provides an easy way to canonicalized as unordered data types are sorted by the octet which is simple to implement. | |
Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup is also easy to canonicalize; unordered data types are sorted by the octet, which is simple to implement. |
Also: while not something that should be dealt with in this PR (which is just getting the current spec down, rather than actually doing design work on the format), I actually think the way syrup does canonicalization is clumsy and should be changed: you end up needing to do canonicalization recursively, which means you can't actually decide on the order of elements in the set map until you have decoded the elements for canonicalization and then re-encoded them. I think it would conceptually cleaner, easier to implement, and more efficient to just use the ordering described in the preserves spec and do the sorting with parsed values rather than encoded ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you disagree with it, but it's simpler to do from an implementation standpoint, even if from a conceptual/CS standpoint, it's a bit silly potentially. It's trivially easy to implement in a recursive descent writer. That's part of the goal of Syrup, even if it's a bit silly. Nice from an engineering perspective, dissatisfying from a CS perspective.
(I understand that ASN.1 did the same thing, and was one of the few parts of that ecosystem that maybe wasn't so complicated.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how it's simpler from an implementation standpoint, and I don't think you've understood my critique, since the bulk of my complaint was that the opposite was true (in addition to being inefficient).
...frankly though, it seems like the data model we're converging on in e.g. #5 kindof obviates this; if we're roughly agreed upon things like no maps & sets, only strings as labels for Tagged, and other changes that are clearly incompatible with preserves, then does it even make sense for us to merge a spec for syrup at this point? It seems clear we won't use it in the end.
draft-specification.md
Outdated
|
||
### Negative Integers | ||
|
||
Integers are serialized as a base 10 string format beginning with the most significant digit until the least significant digit. The integer is then followed by a `-` character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 0-
valid? If not, what should a decoder do with it?
|
||
Binary Data is serialized with the size of the data (number of octets), followed by a `:` and then the octets representing the underlying data. The size is a base 10 string format beginning with the most significant digit until the least significant digit. | ||
|
||
The size is a base 10 string format beginning with the most significant digit until the least significant digit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is duplicated.
|
||
Symbols are serialized with the size (number of octets used to represent the symbol), followed by a `'` and then the octets representing the symbol. | ||
|
||
The size is a base 10 string format beginning with the most significant digit until the least significant digit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it might make the document a bit clearer to actually group binary data, strings, and symbols as "length delimited" and explicitly call out the fact that the only difference is what character terminates the length.
|
||
Sorting is used by certain Syrup data types to ensure values are in their canonicalized form. Sorting is done by first each value is serialized to their respective Syrup binary representations and then sorted from lowest to highest in value. | ||
|
||
Below is an algorithm that takes two sequences of octets and says if the first sequence of octets is smaller than the second sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems excessively detailed just to describe standard lexical ordering on bytes.
draft-specification.md
Outdated
{3:age30+4:name5:Alice7:isAlivet} | ||
``` | ||
|
||
Note that the order of the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awkward:
Note that the order of the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting. | |
Note that the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting. |
Also: is canonicalization actually required? this seems like a major source of inefficiency given that canonicalization is not needed for the bulk of our use, and it's not clear what problem it solves. Note that to avoid ambiguity, we would still need to separately decide what decoders should do with unsorted data or data with duplicate keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonicalization is required. It's true it introduces a bit of inefficiency if that property isn't used, however a lot of data serialization formats have no canonicalization or opt-in canonicalization which causes issues where it actually is needed. I don't think this should add too much overhead in most situations.
It's been here a long time so I'm going to merge this despite the few comments from @zenhack and take another look at what those comments are to see if they could be addressed. |
No description provided.