
CapTP AST / data representation and serialization #3

Open
cwebber opened this issue Apr 16, 2021 · 22 comments
cwebber commented Apr 16, 2021

There are really two questions:

  1. What is the on-the-wire representation of CapTP?
  2. What is the abstracted representation (dare I say, the AST) of CapTP?

Right now, (1) is handled by Syrup, (2) is handled by the abstract types in Preserves. Technically, (1) is just a very simple (but canonicalized) encoding of (2), simple enough to implement in about 3 hours, but there is also a (lossy, due to floats) textual representation and an alternate binary representation (which @tonyg and I have considered replacing with Syrup).

I propose we stick to representing CapTP's AST in terms of the abstract datatypes of Preserves, no matter which encoding we end up ultimately using. The core datatypes that then are used to compose this AST are:

                      Value = Atom
                            | Compound

                       Atom = Boolean
                            | Float
                            | Double
                            | SignedInteger
                            | String
                            | ByteString
                            | Symbol

                   Compound = Record
                            | Sequence
                            | Set
                            | Dictionary

I propose that we stick to this as the foundational abstracted set of types on which we build the AST. One advantage of Preserves' abstract types is that they do not on their own specify an encoding; they are a language-oriented representation. Thus it is easy to switch to a different encoding later.

(EDIT: I removed "Pointer", which is on the Preserves page but I don't think it should be there, and it wasn't when I wrote Syrup I think. @tonyg we should talk!)
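For concreteness, the grammar above can be sketched in Python types (an illustrative mapping, not any official Preserves binding; atoms reuse native types, with a wrapper for symbols):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Symbol:
    """Symbol atom; distinct from String even when the text matches."""
    name: str

@dataclass(frozen=True)
class Record:
    """Compound with a label plus zero or more fields."""
    label: "Value"
    fields: tuple

# Atoms map onto native types: Boolean, Float/Double, SignedInteger,
# String, ByteString, Symbol.
Atom = Union[bool, float, int, str, bytes, Symbol]
# Compounds: Record, Sequence, Set, Dictionary.
Compound = Union[Record, tuple, frozenset, dict]
Value = Union[Atom, Compound]
```

A record like `<point 1 2>` would then be `Record(Symbol("point"), (1, 2))`.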


cwebber commented Apr 16, 2021

FWIW, here are the full encoding rules (well, in example form) for Syrup's on-the-wire representation, taken from a comment in the Racket implementation:

;; Booleans: t or f
;; Single flonum: F<ieee-single-float> (big endian)
;; Double flonum: D<ieee-double-float> (big endian)
;; (Signed) integers: i<maybe-sign><int>e
;; Bytestrings: 3:cat
;; Strings: 3"cat
;; Symbols: 3'cat
;; Dictionary: {<key1><val1><key2><val2>}
;; Lists: [<item1><item2><item3>]
;; Records: <<label><val1><val2><val3>> (the outer <> for realsies tho)
;; Sets: #<item1><item2><item3>$
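As a sanity check of the rules above, here is a minimal Python sketch of a Syrup encoder (illustrative only, not the reference implementation; symbols, sets, and records are omitted, and dictionary keys are sorted by their encoded bytes, which approximates what canonicalization requires):

```python
import struct

def syrup_encode(value) -> bytes:
    """Encode a value per the Syrup rules quoted above (a sketch)."""
    if isinstance(value, bool):            # t or f (check before int!)
        return b"t" if value else b"f"
    if isinstance(value, int):             # i<maybe-sign><int>e
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, float):           # D<ieee-double-float>, big endian
        return b"D" + struct.pack(">d", value)
    if isinstance(value, bytes):           # 3:cat
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, str):             # 3"cat (length of the UTF-8 bytes)
        data = value.encode("utf-8")
        return str(len(data)).encode("ascii") + b'"' + data
    if isinstance(value, list):            # [<item1><item2><item3>]
        return b"[" + b"".join(syrup_encode(v) for v in value) + b"]"
    if isinstance(value, dict):            # {<key1><val1><key2><val2>}
        pairs = sorted((syrup_encode(k), syrup_encode(v))
                       for k, v in value.items())
        return b"{" + b"".join(k + v for k, v in pairs) + b"}"
    raise TypeError(f"no Syrup encoding sketched for {type(value).__name__}")

# e.g. syrup_encode(["cat", 42]) == b'[3"cati42e]'
```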


cwebber commented Apr 16, 2021

Integers could be simplified to remove a character, btw. @tonyg and I have discussed that:

-;; (Signed) integers: i<maybe-sign><int>e
+;; (Signed) integers: <maybe-sign><int>i

Hasn't been implemented though. That's one small change worth considering however.

@zarutian

Integers could be simplified to remove a character, btw. @tonyg and I have discussed that:

-;; (Signed) integers: i<maybe-sign><int>e
+;; (Signed) integers: <maybe-sign><int>i

Hasn't been implemented though. That's one small change worth considering however.

syrup.js already decodes this correctly, though I have yet to add a marshaller that encodes (big)ints that way.


zenhack commented Apr 29, 2021

Note that on IRC we discussed yet another approach to this, which people liked: ocapn/syrup#2

@jar398 jar398 added this to the First working drafts milestone Mar 24, 2023

dckc commented Mar 27, 2023

@jar398 this has substantial overlap with #5. I somewhat prefer the title / description of this one, but recent discussion is happening more in #5, so I suggest closing this as a dup of #5.


jar398 commented Mar 27, 2023

Interestingly the two issues were posted by the same person on the same day, so there was some intention behind keeping them separate. The title suggests #5 was nominally requirements driven (related to abstract syntax e.g. Preserves), or "what do we need and not need to transmit", while this one is nominally implementation driven ('bindings' of abstract syntax to concrete syntax e.g. syrup or "how do we render it"). This seems like a natural division to me, if it can be observed, but I haven't checked to see how well the distinction is reflected in the actual issue comments. TBD me: check the #5 comments to see whether keeping concrete syntax discussion out of #5 is feasible.

Comments @cwebber or @tsyesika ?

By the way I don't like 'AST' - too suggestive of a data structure. 'Abstract syntax' is fine.


zenhack commented May 17, 2023

We seem to be close on #5, and I think we're close enough that we can probably unblock talking about concrete serialization.

Agoric folks have specified an embedding of their data model into JSON called "smallcaps," and it seems like we can relatively easily extend that to include the full data model. I like it, at least broadly, and they are apparently stuck supporting it, so I'd suggest having this be our "textual" representation, rather than defining Another Thing.

However, it probably isn't good enough for efficiency reasons, particularly when we look at adding ByteStrings. So we should also specify some binary encoding of the data model with better efficiency.

The bridge will likely include a version of the data model encoded in capnp, but that is a lot of machinery to include, so instead we should probably define the protocol to use something else and just let that be a bridge thing.

The data model has diverged somewhat from preserves, so using Syrup as-is is no longer an option.

My thinking is to build on top of cbor in the same way smallcaps builds on top of json. The situation is similar: the data model is close, but not exact -- we'd need to concoct ways of encoding e.g. Capabilities and Errors into CBOR, and we'd want to impose some restrictions when decoding. E.g. we could specify that cbor's map type is decoded as a struct, where non-string keys cause decoding to fail.

Thoughts on that general approach?
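To make the "build on CBOR" idea concrete, here is a hedged Python sketch of a hand-rolled encoder for the subset of the data model that maps directly onto CBOR. Capabilities and Errors are left out (they would need a tag assignment that doesn't exist yet), and the proposed string-key restriction is enforced at encode time; a decoder would mirror it:

```python
def _head(major: int, n: int) -> bytes:
    """CBOR initial byte(s): 3-bit major type plus unsigned argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + n.to_bytes(size, "big")
    raise OverflowError("argument too large for a CBOR head")

def ocapn_to_cbor(v) -> bytes:
    if v is None:                       # null -> simple value 22
        return b"\xf6"
    if isinstance(v, bool):             # false/true -> simple values 20/21
        return b"\xf5" if v else b"\xf4"
    if isinstance(v, int):              # major 0, or major 1 storing -1 - v
        return _head(0, v) if v >= 0 else _head(1, -1 - v)
    if isinstance(v, bytes):            # major 2: byte string
        return _head(2, len(v)) + v
    if isinstance(v, str):              # major 3: UTF-8 text string
        d = v.encode("utf-8")
        return _head(3, len(d)) + d
    if isinstance(v, list):             # major 4: array
        return _head(4, len(v)) + b"".join(map(ocapn_to_cbor, v))
    if isinstance(v, dict):             # major 5: map, decoded as a struct
        out = [_head(5, len(v))]
        for k, val in v.items():
            if not isinstance(k, str):  # the proposed restriction
                raise TypeError("struct keys must be strings")
            out.append(ocapn_to_cbor(k) + ocapn_to_cbor(val))
        return b"".join(out)
    raise TypeError(f"no CBOR mapping sketched for {type(v).__name__}")
```

One pleasant detail: CBOR's major type 1 already stores a negative integer as -1 - n, so that corner of the data model costs nothing.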


dckc commented May 17, 2023

I don't have any experience with CBOR; might be fine.

Agoric does a lot of Cosmos-SDK stuff, and Cosmos-SDK uses protobuf. So some protobuf tooling is sunk cost, for us. My work in #19 was sort of at the wrong level, but I'm inclined to try a protobuf rendition of the data model and check out the costs and benefits.

it's nice to get stubs roughly for free:

Now that I think about it, the existing JS protobuf tooling might not be a good fit here... we might want to use some of their low-level encoding APIs but do the high level tree-walking by writing something driven by the @endo/marshal types.

The capnproto schema language is roughly as expressive, so that's worth a try too.


zenhack commented May 17, 2023

If folks are open to protobufs maybe I have misjudged the appetite for capnp. It would be nice to use capnp for serialization insofar as in a bridged environment it's one less thing in the tech stack.

Here's a stab at modeling the current state of #5 using capnp schema:

# ocapn.capnp
@0xcd301da1d95b8242;

struct Value {
  union {
    ## Atoms ##

    undefined @0 :Void;
    null @1 :Void;
    bool @2 :Bool;
    float64 @3 :Float64;

    unsignedInt @4 :Data;
    # Non-negative integer, in big-endian format
    negativeInt @5 :Data;
    # Negative integer. value is `-1 - n`, where `n`
    # is the data interpreted as an unsigned big-endian integer.

    string @6 :Text;
    byteString @7 :Data;
    symbol @8 :Text; # Might be removed, pending #46.

    ## Containers ##

    list @9 :List(Value);

    struct @10 :List(StructField);
    # Duplicate keys are not allowed; will need to enforce this at a higher level
    # of abstraction.

    tagged :group {
      label @11 :Text;
      value @12 :Value;
    }

    capability @13 :Cap;

    error @14 :Error;
  }
}

struct StructField {
  key @0 :Text;
  value @1 :Value;
}

interface Cap {
  # TODO
}

struct Error {
  # TODO, pending #10.
}
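A quick worked example of the unsignedInt/negativeInt convention from the schema sketch (illustrative Python; the field names follow the sketch above):

```python
def encode_int(value: int):
    """Return (union_field, big_endian_bytes) per the sketch above:
    a negative value v is stored as n = -1 - v, so -1 encodes as 0."""
    if value >= 0:
        field, n = "unsignedInt", value
    else:
        field, n = "negativeInt", -1 - value
    length = max(1, (n.bit_length() + 7) // 8)
    return field, n.to_bytes(length, "big")

def decode_int(field: str, data: bytes) -> int:
    n = int.from_bytes(data, "big")
    return n if field == "unsignedInt" else -1 - n

# encode_int(-1)  == ("negativeInt", b"\x00")
# encode_int(256) == ("unsignedInt", b"\x01\x00")
```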

@zarutian

I am in favour of CBOR (or msgpack), but I am vehemently against both protobuf and capnproto for the following reason:

Neither is self-descriptive on the wire, and schemas always get lost and/or lose something in translation to stubs.*

Plus, both protobuf and capnproto assume a build step or build environment where their schema-language interpreter can run. ( @kentonv have you described capnproto's field packing algorithm anywhere other than in the C++ implementation yet? )

The ocapn protocol, described in an RFC-like document, should be sufficient for a programmer to implement it without having to download or run any executable or tool. That greatly decreases the ‘hacktivation energy’ required.

@dckc stubs for Remotables? that say like for an Agoric Issuer Purse?

(* Having to reverse-engineer MIPS32 firmware to figure out a binary protocol whose schema was lost to the sands of time was not fun.)


kentonv commented May 19, 2023

I am vehemently against both protobuf and capnproto for the following reason:

Neither is self-descriptive on the wire, and schemas always get lost and/or lose something in translation to stubs.*

(* Having to reverse-engineer MIPS32 firmware to figure out a binary protocol whose schema was lost to the sands of time was not fun.)

Self-description for the purpose of reverse engineering is not really an on-or-off thing; it is a spectrum. Protocols based on JSON, CBOR, or msgpack often (but not always) contain textual field names which might offer clues to a human trying to reverse engineer them; protobuf and Cap'n Proto do not. However, Protobuf and Cap'n Proto both still allow you to determine the "shape" of the message tree without knowing the schema; this is still much more information than a completely arbitrary binary encoding provides. Conversely, there are many JSON protocols which manage to be inscrutable. (E.g. some people intentionally encode objects as tuples to avoid wasting bytes on field names. Some others just choose really terrible field names.)

You could require that people using CBOR make sure to use intelligible field names. But similarly you could of course specify a protocol which uses Protobuf or Cap'n Proto, but requires every message to contain a copy of the schema. You could even go further and require each message to contain human-readable documentation explaining how to use it.

Of course, at some point the cost of sending this schema and documentation in every message outweighs the benefits it brings in terms of reverse engineerability. So this is really an argument about trade-offs: what amount of wasted bytes in every single message is "worth it" to make it easier in the case that someone needs to reverse-engineer the protocol? It sounds like you are arguing that field names are worth it. I might agree in some use cases but certainly not in all cases.

What if, instead, the RPC layer of the protocol defined a standard way to query a peer for their schemas, which implementations were expected to support by default? Then there's no waste in the common case but you can still get the info you need for reverse engineering. Both Protobuf and Cap'n Proto could easily support such a requirement, all the pieces are already in place to make schemas available automatically.

@kentonv have you yet described capnprotos field packing algorithm anywhere other than in the c++ implementation yet?

No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated.

(But I personally have no need or desire to standardize Cap'n Proto, so I haven't done it.)


zenhack commented May 19, 2023 via email


dckc commented May 19, 2023

The ocapn.capnp blurb above is pretty handy as a specification mechanism.

As to capnproto serialization...

stubs for Remotables? that say like for an Agoric Issuer Purse?

No; what I said about protobuf applies equally:

the existing JS protobuf tooling might not be a good fit here... we might want to use some of their low-level encoding APIs but do the high level tree-walking by writing something driven by the @endo/marshal types.

I just took the ocapn.capnp blurb above and rendered it in typescript, since that's the type system I spend most of my time in lately:

Then I spent enough time with CBOR to say it's probably fine:

I didn't find a JS API for capnproto at the same level... encoder._pushString() and such.

That is: the API that the code generated by protobuf tools uses. For example:

// writing
var buffer = protobuf.Writer.create()
    .uint32((1 << 3 | 2) >>> 0) // id 1, wireType 2
    .string("hello world!")
    .finish();

https://github.com/protobufjs/protobuf.js/blob/56b1e64979dae757b67a21d326e16acee39f2267/examples/reader-writer.js#LL7C1-L11C15

p.s. as to self-describing: what @kentonv said, especially:

However, Protobuf and Cap'n Proto both still do allow you to determine the "shape" of the message tree without knowing the schema

This leads to things like the online Protobuf Decoder, which I find indispensable from time to time.


zenhack commented May 21, 2023

Is the choice to skip past the code generator and write the marshaling code by hand just to avoid the dependency on the code generator? If not, what's the reason behind that?

@zarutian

Is the choice to skip past the code generator and write the marshaling code by hand just to avoid the dependency on the code generator? If not, what's the reason behind that?

Four reasons to avoid the dependency on the schema-to-code generator:

  1. Build environments are brittle, need setting up, make assumptions that might not hold, and raise the “hacktivation energy” required for interested parties to implement the ocapn protocol, at least compared to reading an RFC-style document. Requiring one just for a code generator is a bit too much of a burden.
  2. There is an assumption that there is a build step or similar where the code generator can run. This precludes programming environments that are mainly online/interactive, such as Squeak/Smalltalk-80, various Forths, various Schemes or Lisps, Luas, and so on.
  3. Schemas in separate IDL files go out of sync with versions of the software and/or the protocol, in either part or whole.
  4. High-assurance software implementation setups that do not allow any executables downloaded from the Internet, nor compiled from source code that does not follow simplicity and explainability idiom rules. This point is rather unlikely for us, but it puts its small weight on the side against such code generators.

More on the first point: One anecdotal experience I had with Corbin Simpson’s Monte was that its build environment assumed it had certain binaries available (which were legacy x86 64-bit only), among other things.

More on the second point: I want to allow this ocapn protocol to be implemented in weird places such as Minecraft ComputerCraft in-game networks, whatever Roblox and SecondLife support, and anywhere you can program a general-purpose computer.

More on the third point: You would be surprised how often this kind of thing has happened, leaving only the executable binary surviving. Knowing only the overall ‘shape’ of the data structure does not help if there is not even a hint of which binary bits belong to which field or what its datum type is.

More on the fourth point: I blame ‘Reflections on Trusting Trust’ and the whole Underhanded C Contest for this, but in a good way. Requiring people to set up, say, a Genode+seL4 system to hack on or implement the ocapn protocol in such situations might be too much of an ask.

@zarutian

@kentonv have you described capnproto's field packing algorithm anywhere other than in the C++ implementation yet?

No, but nothing is stopping someone from reading the code and translating it to prose if desired. It's really not that complicated.

Indeed, if this is the only reason for not using it I will spend the time to sit down and document it.

Either way, please do, because when I tried I could not make heads or tails of that code within the motivation/gumption quota I was willing to spend on it.


dpwiz commented Aug 31, 2023

I see there's no Embedded part of Value which used to wrap descriptors/references/handles/etc specifically.

Wouldn't it be nice to have a special place in the AST that says "here, you really have to give this a thought", instead of dealing with the generic structure and guessing "is that a descriptor, or are you just happy to send me?" for each node (perhaps also considering its context, up and down)?


dckc commented Aug 31, 2023

You're referring to the grammar in the 1st comment in this issue? Right. It's missing.

It's present in several other sketches, such as the May 17 capnproto sketch above.

    capability @13 :Cap;


dpwiz commented Aug 31, 2023

It is notably missing from the messages in the spec. And also isn't used by the test suite.


dckc commented Aug 31, 2023

Which spec? I didn't know we had a spec covering this issue.

I think the test suite uses remote references; for example, object_to_greet in...

https://github.com/ocapn/ocapn-test-suite/blob/4242a7b096ac3a69adca8d8cf4a98c010b3fc694/tests/op_delivers.py#L23-L32

I suppose the OpDeliverOnly there corresponds to the op:deliver-only section of the captp draft, which shows:

to-desc is a desc:export descriptor which corresponds to the object the message is being sent to.

So desc:export is what references are called there. desc:export seems to have its own section.


dpwiz commented Aug 31, 2023

Yes, the test and <desc:export position> use plain records here. It could've been #!<export position> / Syrup.Embedded(Syrup.Record 'export' [position]) instead.


dckc commented Aug 31, 2023

Ok. I think I see your point now.

I don't have much of an opinion. I'm not sure how relevant the syrup structure will be in the end.
