Skip to content

Add initial draft syrup specification #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 26, 2023
Merged

Add initial draft syrup specification #9

merged 4 commits into from
Dec 26, 2023

Conversation

tsyesika
Copy link
Contributor

No description provided.

Copy link

@zenhack zenhack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misc. comments in-line.

Separately, I'd like to gently suggest that it might be better to just use the standard binary form of preserves, and drop syrup entirely -- as far as I can tell the only advantage of syrup is that the use of mnemonic characters makes it a bit easier to eyeball the meaning of data without tools, but it has several intrinsic disadvantages (many of them related to efficiency in one way or another), and would require a bunch of work to formally specify, which Tony has already done for us if we use the existing format.


Syrup is a lightweight and easy-to-implement data serialization format. A serialized Syrup structure is canonicalized; the same data thus always serializes to the same set of octets (8 bit values).

Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup also provides an easy way to canonicalized as unordered data types are sorted by the octet which is simple to implement.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awkwardly worded. Maybe:

Suggested change
Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup also provides an easy way to canonicalized as unordered data types are sorted by the octet which is simple to implement.
Syrup is a *binary* serialization format; data is not encoded within text, but as the raw underlying data, preventing common escaping bugs. Syrup is also easy to canonicalize; unordered data types are sorted by the octet, which is simple to implement.

Also: while not something that should be dealt with in this PR (which is just getting the current spec down, rather than actually doing design work on the format), I actually think the way syrup does canonicalization is clumsy and should be changed: you end up needing to do canonicalization recursively, which means you can't actually decide on the order of elements in the set map until you have decoded the elements for canonicalization and then re-encoded them. I think it would conceptually cleaner, easier to implement, and more efficient to just use the ordering described in the preserves spec and do the sorting with parsed values rather than encoded ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you disagree with it, but it's simpler to do from an implementation standpoint, even if from a conceptual/CS standpoint, it's a bit silly potentially. It's trivially easy to implement in a recursive descent writer. That's part of the goal of Syrup, even if it's a bit silly. Nice from an engineering perspective, dissatisfying from a CS perspective.

(I understand that ASN.1 did the same thing, and was one of the few parts of that ecosystem that maybe wasn't so complicated.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it's simpler from an implementation standpoint, and I don't think you've understood my critique, since the bulk of my complaint was that the opposite was true (in addition to being inefficient).

...frankly though, it seems like the data model we're converging on in e.g. #5 kindof obviates this; if we're roughly agreed upon things like no maps & sets, only strings as labels for Tagged, and other changes that are clearly incompatible with preserves, then does it even make sense for us to merge a spec for syrup at this point? It seems clear we won't use it in the end.


### Negative Integers

Integers are serialized as a base 10 string format beginning with the most significant digit until the least significant digit. The integer is then followed by a `-` character.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 0- valid? If not, what should a decoder do with it?


Binary Data is serialized with the size of the data (number of octets), followed by a `:` and then the octets representing the underlying data. The size is a base 10 string format beginning with the most significant digit until the least significant digit.

The size is a base 10 string format beginning with the most significant digit until the least significant digit.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is duplicated.


Symbols are serialized with the size (number of octets used to represent the symbol), followed by a `'` and then the octets representing the symbol.

The size is a base 10 string format beginning with the most significant digit until the least significant digit.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it might make the document a bit clearer to actually group binary data, strings, and symbols as "length delimited" and explicitly call out the fact that the only difference is what character terminates the length.


Sorting is used by certain Syrup data types to ensure values are in their canonicalized form. Sorting is done by first each value is serialized to their respective Syrup binary representations and then sorted from lowest to highest in value.

Below is an algorithm that takes two sequences of octets and says if the first sequence of octets is smaller than the second sequence.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems excessively detailed just to describe standard lexical ordering on bytes.

{3:age30+4:name5:Alice7:isAlivet}
```

Note that the order of the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awkward:

Suggested change
Note that the order of the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting.
Note that the keys occur in the following order: `age`, `name`, and `isAlive` due to sorting.

Also: is canonicalization actually required? this seems like a major source of inefficiency given that canonicalization is not needed for the bulk of our use, and it's not clear what problem it solves. Note that to avoid ambiguity, we would still need to separately decide what decoders should do with unsorted data or data with duplicate keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonicalization is required. It's true it introduces a bit of inefficiency if that property isn't used, however a lot of data serialization formats have no canonicalization or opt-in canonicalization which causes issues where it actually is needed. I don't think this should add too much overhead in most situations.

@tsyesika
Copy link
Contributor Author

It's been here a long time so I'm going to merge this despite the few comments from @zenhack and take another look at what those comments are to see if they could be addressed.

@tsyesika tsyesika merged commit 2214cbb into master Dec 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants