
CapTP AST / data representation and serialization #3

Open
cwebber opened this issue Apr 16, 2021 · 22 comments
cwebber commented Apr 16, 2021

There are really two questions:

  1. What is the on-the-wire representation of CapTP?
  2. What is the abstracted representation (dare I say, the AST) of CapTP?

Right now, (1) is handled by Syrup, (2) is handled by the abstract types in Preserves. Technically, (1) is just a very simple (but canonicalized) encoding of (2), simple enough to implement in about 3 hours, but there is also a (lossy, due to floats) textual representation and an alternate binary representation (which @tonyg and I have considered replacing with Syrup).

I propose we stick to representing CapTP's AST in terms of the abstract datatypes of Preserves, no matter which encoding we end up ultimately using. The core datatypes that then are used to compose this AST are:

                      Value = Atom
                            | Compound

                       Atom = Boolean
                            | Float
                            | Double
                            | SignedInteger
                            | String
                            | ByteString
                            | Symbol

                   Compound = Record
                            | Sequence
                            | Set
                            | Dictionary

I propose that we stick to this as the foundational abstracted set of types on which we build the AST. One advantage of Preserves' abstract types is that they do not on their own specify an encoding; they are a language-oriented representation. Thus it is easy to switch to a different encoding later.

(EDIT: I removed "Pointer", which is on the Preserves page but I don't think it should be there, and it wasn't when I wrote Syrup I think. @tonyg we should talk!)
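For concreteness, the grammar above can be sketched in Python types (an illustrative mapping, not any official Preserves binding; atoms reuse native types, with a wrapper for symbols):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Symbol:
    """Symbol atom; distinct from String even when the text matches."""
    name: str

@dataclass(frozen=True)
class Record:
    """Compound with a label plus zero or more fields."""
    label: "Value"
    fields: tuple

# Atoms map onto native types: Boolean, Float/Double, SignedInteger,
# String, ByteString, Symbol.
Atom = Union[bool, float, int, str, bytes, Symbol]
# Compounds: Record, Sequence, Set, Dictionary.
Compound = Union[Record, tuple, frozenset, dict]
Value = Union[Atom, Compound]
```

A record like `<point 1 2>` would then be `Record(Symbol("point"), (1, 2))`.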


cwebber commented Apr 16, 2021

FWIW, here are the full encoding rules (well, in example form) for Syrup's on-the-wire representation, taken from a comment in the Racket implementation:

;; Booleans: t or f
;; Single flonum: F<ieee-single-float> (big endian)
;; Double flonum: D<ieee-double-float> (big endian)
;; (Signed) integers: i<maybe-sign><int>e
;; Bytestrings: 3:cat
;; Strings: 3"cat
;; Symbols: 3'cat
;; Dictionary: {<key1><val1><key2><val2>}
;; Lists: [<item1><item2><item3>]
;; Records: <<label><val1><val2><val3>> (the outer <> for realsies tho)
;; Sets: #<item1><item2><item3>$
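As a sanity check of the rules above, here is a minimal Python sketch of a Syrup encoder (illustrative only, not the reference implementation; symbols, sets, and records are omitted, and dictionary keys are sorted by their encoded bytes, which approximates what canonicalization requires):

```python
import struct

def syrup_encode(value) -> bytes:
    """Encode a value per the Syrup rules quoted above (a sketch)."""
    if isinstance(value, bool):            # t or f (check before int!)
        return b"t" if value else b"f"
    if isinstance(value, int):             # i<maybe-sign><int>e
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, float):           # D<ieee-double-float>, big endian
        return b"D" + struct.pack(">d", value)
    if isinstance(value, bytes):           # 3:cat
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, str):             # 3"cat (length of the UTF-8 bytes)
        data = value.encode("utf-8")
        return str(len(data)).encode("ascii") + b'"' + data
    if isinstance(value, list):            # [<item1><item2><item3>]
        return b"[" + b"".join(syrup_encode(v) for v in value) + b"]"
    if isinstance(value, dict):            # {<key1><val1><key2><val2>}
        pairs = sorted((syrup_encode(k), syrup_encode(v))
                       for k, v in value.items())
        return b"{" + b"".join(k + v for k, v in pairs) + b"}"
    raise TypeError(f"no Syrup encoding sketched for {type(value).__name__}")

# e.g. syrup_encode(["cat", 42]) == b'[3"cati42e]'
```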


cwebber commented Apr 16, 2021

Integers could be simplified to remove a character, btw. @tonyg and I have discussed that:

-;; (Signed) integers: i<maybe-sign><int>e
+;; (Signed) integers: <maybe-sign><int>i

Hasn't been implemented though. That's one small change worth considering however.

@zarutian

Integers could be simplified to remove a character, btw. @tonyg and I have discussed that:

-;; (Signed) integers: i<maybe-sign><int>e
+;; (Signed) integers: <maybe-sign><int>i

Hasn't been implemented though. That's one small change worth considering however.

syrup.js already decodes this correctly, though I have yet to add a marshaller that encodes (big)ints that way.


zenhack commented Apr 29, 2021

Note that on IRC we discussed yet another approach to this, which people liked: ocapn/syrup#2

@jar398 jar398 added this to the First working drafts milestone Mar 24, 2023

dckc commented Mar 27, 2023

@jar398 this has substantial overlap with #5. I somewhat prefer the title / description of this one, but recent discussion is happening more in #5, so I suggest closing this as a dup of #5.


jar398 commented Mar 27, 2023

Interestingly the two issues were posted by the same person on the same day, so there was some intention behind keeping them separate. The title suggests #5 was nominally requirements driven (related to abstract syntax e.g. Preserves), or "what do we need and not need to transmit", while this one is nominally implementation driven ('bindings' of abstract syntax to concrete syntax e.g. syrup or "how do we render it"). This seems like a natural division to me, if it can be observed, but I haven't checked to see how well the distinction is reflected in the actual issue comments. TBD me: check the #5 comments to see whether keeping concrete syntax discussion out of #5 is feasible.

Comments @cwebber or @tsyesika ?

By the way I don't like 'AST' - too suggestive of a data structure. 'Abstract syntax' is fine.


zenhack commented May 17, 2023

We seem to be close on #5, and I think we're close enough that we can probably unblock talking about concrete serialization.

Agoric folks have specified an embedding of their data model into JSON called "smallcaps," and it seems like we can relatively easily extend that to include the full data model. I like it, at least broadly, and they are apparently stuck supporting it, so I'd suggest having this be our "textual" representation, rather than defining Another Thing.

However, it probably isn't good enough for efficiency reasons, particularly when we look at adding ByteStrings. So we should also specify some binary encoding of the data model with better efficiency.

The bridge will likely include a version of the data model encoded in capnp, but that is a lot of machinery to include, so instead we should probably define the protocol to use something else and just let that be a bridge thing.

The data model has diverged somewhat from preserves, so using Syrup as-is is no longer an option.

My thinking is to build on top of cbor in the same way smallcaps builds on top of json. The situation is similar: the data model is close, but not exact -- we'd need to concoct ways of encoding e.g. Capabilities and Errors into CBOR, and we'd want to impose some restrictions when decoding. E.g. we could specify that cbor's map type is decoded as a struct, where non-string keys cause decoding to fail.

Thoughts on that general approach?
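To make the "build on CBOR" idea concrete, here is a hedged Python sketch of a hand-rolled encoder for the subset of the data model that maps directly onto CBOR. Capabilities and Errors are left out (they would need a tag assignment that doesn't exist yet), and the proposed string-key restriction is enforced at encode time; a decoder would mirror it:

```python
def _head(major: int, n: int) -> bytes:
    """CBOR initial byte(s): 3-bit major type plus unsigned argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + n.to_bytes(size, "big")
    raise OverflowError("argument too large for a CBOR head")

def ocapn_to_cbor(v) -> bytes:
    if v is None:                       # null -> simple value 22
        return b"\xf6"
    if isinstance(v, bool):             # false/true -> simple values 20/21
        return b"\xf5" if v else b"\xf4"
    if isinstance(v, int):              # major 0, or major 1 storing -1 - v
        return _head(0, v) if v >= 0 else _head(1, -1 - v)
    if isinstance(v, bytes):            # major 2: byte string
        return _head(2, len(v)) + v
    if isinstance(v, str):              # major 3: UTF-8 text string
        d = v.encode("utf-8")
        return _head(3, len(d)) + d
    if isinstance(v, list):             # major 4: array
        return _head(4, len(v)) + b"".join(map(ocapn_to_cbor, v))
    if isinstance(v, dict):             # major 5: map, decoded as a struct
        out = [_head(5, len(v))]
        for k, val in v.items():
            if not isinstance(k, str):  # the proposed restriction
                raise TypeError("struct keys must be strings")
            out.append(ocapn_to_cbor(k) + ocapn_to_cbor(val))
        return b"".join(out)
    raise TypeError(f"no CBOR mapping sketched for {type(v).__name__}")
```

One pleasant detail: CBOR's major type 1 already stores a negative integer as -1 - n, so that corner of the data model costs nothing.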


dckc commented May 17, 2023

I don't have any experience with CBOR; might be fine.

Agoric does a lot of Cosmos-SDK stuff, and Cosmos-SDK uses protobuf. So some protobuf tooling is sunk cost, for us. My work in #19 was sort of at the wrong level, but I'm inclined to try a protobuf rendition of the data model and check out the costs and benefits.

it's nice to get stubs roughly for free:

Now that I think about it, the existing JS protobuf tooling might not be a good fit here... we might want to use some of their low-level encoding APIs but do the high level tree-walking by writing something driven by the @endo/marshal types.

The capnproto schema language is roughly as expressive, so that's worth a try too.


zenhack commented May 17, 2023

If folks are open to protobufs maybe I have misjudged the appetite for capnp. It would be nice to use capnp for serialization insofar as in a bridged environment it's one less thing in the tech stack.

Here's a stab at modeling the current state of #5 using capnp schema:

# ocapn.capnp
@0xcd301da1d95b8242;

struct Value {
  union {
    ## Atoms ##

    undefined @0 :Void;
    null @1 :Void;
    bool @2 :Bool;
    float64 @3 :Float64;

    unsignedInt @4 :Data;
    # Non-negative integer, in big-endian format
    negativeInt @5 :Data;
    # Negative integer. value is `-1 - n`, where `n`
    # is the data interpreted as an unsigned big-endian integer.

    string @6 :Text;
    byteString @7 :Data;
    symbol @8 :Text; # Might be removed, pending #46.

    ## Containers ##

    list @9 :List(Value);

    struct @10 :List(StructField);
    # Duplicate keys are not allowed; will need to enforce this at a higher level
    # of abstraction.

    tagged :group {
      label @11 :Text;
      value @12 :Value;
    }

    capability @13 :Cap;

    error @14 :Error;
  }
}

struct StructField {
  key @0 :Text;
  value @1 :Value;
}

interface Cap {
  # TODO
}

struct Error {
  # TODO, pending #10.
}
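A quick worked example of the unsignedInt/negativeInt convention from the schema sketch (illustrative Python; the field names follow the sketch above):

```python
def encode_int(value: int):
    """Return (union_field, big_endian_bytes) per the sketch above:
    a negative value v is stored as n = -1 - v, so -1 encodes as 0."""
    if value >= 0:
        field, n = "unsignedInt", value
    else:
        field, n = "negativeInt", -1 - value
    length = max(1, (n.bit_length() + 7) // 8)
    return field, n.to_bytes(length, "big")

def decode_int(field: str, data: bytes) -> int:
    n = int.from_bytes(data, "big")
    return n if field == "unsignedInt" else -1 - n

# encode_int(-1)  == ("negativeInt", b"\x00")
# encode_int(256) == ("unsignedInt", b"\x01\x00")
```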

@zarutian

I am in favour of CBOR (or msgpack), but I am vehemently against both protobuf and capnproto for the following reason:

Neither is self-descriptive on the wire, and schemas always get lost and/or lose something in translation to stubs.*

Plus, both protobuf and capnproto assume a build step or build environment where their schema-language interpreter can run. ( @kentonv have you described capnproto's field packing algorithm anywhere other than in the C++ implementation yet? )

The ocapn protocol, described in an RFC-like document, should be sufficient for a programmer to implement it without having to download or run any executable or tool. That greatly decreases the ‘hacktivation energy’ required.

@dckc stubs for Remotables? that say like for an Agoric Issuer Purse?

(* Having to reverse-engineer MIPS32 firmware to figure out a binary protocol whose schema was lost to the sands of time was not fun.)


kentonv commented May 19, 2023

I am vehemently against both protobuf and capnproto for the following reason:

Neither is self-descriptive on the wire, and schemas always get lost and/or lose something in translation to stubs.*

(* Having to reverse-engineer MIPS32 firmware to figure out a binary protocol whose schema was lost to the sands of time was not fun.)

Self-description for the purpose of reverse engineering is not really an on-or-off thing; it is a spectrum. Protocols based on JSON, CBOR, or msgpack often (but not always) contain textual field names which might offer clues to a human trying to reverse engineer them; protobuf and Cap'n Proto do not. However, Protobuf and Cap'n Proto both still allow you to determine the "shape" of the message tree without knowing the schema; this is still much more information than a completely arbitrary binary encoding provides. Conversely, there are many JSON protocols which manage to be inscrutable. (E.g. some people intentionally encode objects as tuples to avoid wasting bytes on field names. Some others just choose really terrible field names.)

You could require that people using CBOR make sure to use intelligible field names. But similarly you could of course specify a protocol which uses Protobuf or Cap'n Proto, but requires every message to contain a copy of the schema. You could even go further and require each message to contain human-readable documentation explaining how to use it.

Of course, at some point the cost of sending this schema and documentation in every message outweighs the benefits it brings in terms of reverse engineerability. So this is really an argument about trade-offs: what amount of wasted bytes in every single message is "worth it" to make it easier in the case that someone needs to reverse-engineer the protocol? It sounds like you are arguing that field names are worth it. I might agree in some use cases but certainly not in all cases.

What if, instead, the RPC layer of the protocol defined a standard way to query a peer for their schemas, which implementations were expected to support by default? Then there's no waste in the common case but you can still get the info you need for reverse engineering. Both Protobuf and Cap'n Proto could easily support such a requirement, all the pieces are already in place to make schemas available automatically.

@kentonv have you yet described capnprotos field packing algorithm anywhere other than in the c++ implementation yet?

No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated.

(But I personally have no need or desire to standardize Cap'n Proto, so I haven't done it.)


zenhack commented May 19, 2023 via email


dckc commented May 19, 2023

The ocapn.capnp blurb above is pretty handy as a specification mechanism.

As to capnproto serialization...

stubs for Remotables? that say like for an Agoric Issuer Purse?

No; what I said about protobuf applies equally:

the existing JS protobuf tooling might not be a good fit here... we might want to use some of their low-level encoding APIs but do the high level tree-walking by writing something driven by the @endo/marshal types.

I just took the ocapn.capnp blurb above and rendered it in typescript, since that's the type system I spend most of my time in lately:

Then I spent enough time with CBOR to say it's probably fine:

I didn't find a JS API for capnproto at the same level... encoder._pushString() and such.

That is: the API that the code generated by protobuf tools uses. For example:

// writing
var buffer = protobuf.Writer.create()
    .uint32((1 << 3 | 2) >>> 0) // id 1, wireType 2
    .string("hello world!")
    .finish();

https://github.com/protobufjs/protobuf.js/blob/56b1e64979dae757b67a21d326e16acee39f2267/examples/reader-writer.js#LL7C1-L11C15

p.s. as to self-describing: what @kentonv said, especially:

However, Protobuf and Cap'n Proto both still do allow you to determine the "shape" of the message tree without knowing the schema

This leads to things like the online Protobuf Decoder, which I find indispensable from time to time.


zenhack commented May 21, 2023

Is the choice to skip past the code generator and write the marshaling code by hand just to avoid the dependency on the code generator? If not, what's the reason behind that?

@zarutian

Is the choice to skip past the code generator and write the marshaling code by hand just to avoid the dependency on the code generator? If not, what's the reason behind that?

Four reasons to avoid the dependency on the schema-to-code generator:

  1. Build environments are brittle, need setting up, make assumptions that might not hold, and raise the “hacktivation energy” required for interested parties to implement the ocapn protocol, at least compared to reading an RFC-style document. Requiring one just for a code generator is a bit too much of a burden.
  2. There is an assumption that there is a build step or similar where the code generator can run. This precludes programming environments that are mainly online/interactive, such as Squeak/Smalltalk-80, various Forths, various Schemes or Lisps, Luas, and so on.
  3. Schemas in separate IDL files go out of sync with versions of the software and/or the protocol, in either part or whole.
  4. High-assurance software implementation setups that do not allow any executables downloaded from the Internet, nor compiled from source code that does not follow simplicity and explainability idiom rules. This point is rather unlikely for us, but it puts its small weight on the side against such code generators.

More on the first point: One anecdotal experience I had with Corbin Simpson’s Monte was that its build environment assumed it had certain binaries available (which were legacy x86 64-bit only), among other things.

More on the second point: I want to allow this ocapn protocol to be implemented in weird places such as Minecraft ComputerCraft in-game networks, whatever Roblox and SecondLife support, and anywhere you can program a general-purpose computer.

More on the third point: You would be surprised how often this kind of thing has happened, leaving only the executable binary surviving. Knowing only the overall ‘shape’ of the data structure does not help if there is not even a hint of which binary bits belong to which field or what its datum type is.

More on the fourth point: I blame ‘Reflections on Trusting Trust’ and the whole Underhanded C Contest for this, but in a good way. Requiring people to set up, say, a Genode+seL4 system to hack on or implement the ocapn protocol in such situations might be too much of an ask.

@zarutian

@kentonv have you described capnproto's field packing algorithm anywhere other than in the C++ implementation yet?

No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated.

Indeed, if this is the only reason for not using it I will spend the time to sit down and document it.

Either way, please do, because when I tried I could not make heads or tails of that code within the motivation/gumption quota I was willing to spend on it.


dpwiz commented Aug 31, 2023

I see there's no Embedded part of Value which used to wrap descriptors/references/handles/etc specifically.

Wouldn't it be nice to have a special place in the AST that says "here, you really have to give this a thought", instead of dealing with the generic structure and guessing "is that a descriptor, or are you just happy to send me?" for each node (perhaps also considering its context, up and down)?


dckc commented Aug 31, 2023

You're referring to the grammar in the 1st comment in this issue? Right. It's missing.

It's present in several other sketches, such as the May 17 capnproto sketch above.

    capability @13 :Cap;


dpwiz commented Aug 31, 2023

It is notably missing from the messages in the spec. And also isn't used by the test suite.


dckc commented Aug 31, 2023

Which spec? I didn't know we had a spec covering this issue.

I think the test suite uses remote references; for example, object_to_greet in...

https://github.com/ocapn/ocapn-test-suite/blob/4242a7b096ac3a69adca8d8cf4a98c010b3fc694/tests/op_delivers.py#L23-L32

I suppose the OpDeliverOnly there corresponds to the op:deliver-only section of the captp draft, which shows:

to-desc is a desc:export descriptor which corresponds to the object the message is being sent to.

So desc:export is what references are called there. desc:export seems to have its own section.


dpwiz commented Aug 31, 2023

Yes, the test and <desc:export position> use plain records here. It could've been #!<export position> / Syrup.Embedded(Syrup.Record 'export' [position]) instead.


dckc commented Aug 31, 2023

Ok. I think I see your point now.

I don't have much of an opinion. I'm not sure how relevant the syrup structure will be in the end.
