Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cbor tags? #7

Closed
whyrusleeping opened this issue Oct 6, 2017 · 12 comments
Closed

cbor tags? #7

whyrusleeping opened this issue Oct 6, 2017 · 12 comments

Comments

@whyrusleeping
Copy link
Collaborator

Is this a thing I can do?

@warpfork
Copy link
Member

You're holding it wrong! 😀

In refmt, the theory is that object mapping/marshalling/unmarshalling is completely decoupled from the serial format, so we don't have tags of "cbor" or "json". They're all just "refmt".

So a pretty mundane example struct with tags from another project looks like this:

type OutputSpec struct {
	PackType string `refmt:"packtype"`
	Filters  string `refmt:",omitempty"`
}

The format of tags is the same as the Go standard library json tags; they're just named "refmt".

(Additionally, there's technically support for custom tag names using atlas.AutogenerateStructMapEntryUsingTags(rt reflect.Type, tagName string) *AtlasEntry... but this is buried rather deep and I'm not sure it's a sensibly exposed API at the moment.)

Sidenote (if you didn't already notice): you may also want to note that the default field name mapping is slightly different than Go standard lib. refmt maps the first character of the name to lowercase by default. I'm of the opinion "it's probably what you wanted". So, a field called PackType will be named "packType" in serial form by default.

@warpfork
Copy link
Member

warpfork commented Oct 15, 2017

OH. <<...out-of-band epiphany...>> You mean CBOR Tags. Sorry; I got hung up on the name similarity with golang struct tags.

Okay, CBOR tags: Yep, not supported right now.

Tag support will be coming up soon though -- working on it now! I'll post progress updates as they occur. Here's the outline on what I currently expect to support:

  • Reading tags into the token stream format and serializing them from tokens: ✅ This will come out pretty quick (and I'm also adding a Pretty Printer feature in parallel, which will make that visible, if nothing else).
  • Handling recursive tags: ❌ The CBOR spec actually allows recursive tags -- there's a snippet in it about "if tag A is followed by tag B"... I'm not planning to support that. A) I think it's dumb. B) I've never seen nor heard of that feature being used. C) It would make a complete mess of performance: turning every situation with tokens into something that may require an unbounded number of memory allocations, because instead of a single slot, it may require a list; I'd rather not. (If someone really needs this someday, maybe I'll reconsider, but I just don't actually expect that to happen.)
  • Tag-triggered behaviors: ❓ I'll need your feedback on what behaviors you want from the library while handling tags. I don't use them, so I'm flying blind here.

More news soon.

@whyrusleeping
Copy link
Collaborator Author

Our primary usecase of tags is to denote where an ipld link is. Basically, when converting from cbor into an in memory object, anything with a certain tag needs to become an instantiation of a certain type, filled with the data the tag is wrapping.

warpfork added a commit that referenced this issue Oct 15, 2017
We use two fields in the Token struct.  This is to resolve ambiguity with cbor tags of the int value zero.  An earlier iteration of the design tried to get away with one field, and use -1 as a sentinel value, but this generated a *lot* of implementation pain, because that sentinel value would need initialization e v e r y w h e r e in the entire project's codebase, whereas the use of tags is confined to a very small area.  It's not worth optimizing for the extra bytes in struct layout; we reuse the bejezus out of the token struct already anyway.

The json decoder chooses *not* to bother to re-false the Token.Tagged field -- it just believes that you'll hand it a roughly zero'd token, and not inject screwyness.  I think this is sensible and fine given that all the actual user-facing methods for initializing full marshal/unmarshal systems do in fact allocate one plain Token, and it's never reused with completely different implementations of decoder logic which would actually give that field a way to become set.

Partially fixes #7.

Signed-off-by: Eric Myhre <hash@exultant.us>
warpfork added a commit that referenced this issue Oct 15, 2017
We should now be able to round-trip cbor tags losslessly in a cbor<->tokenstream<->cbor decode/encoder.  This is good; but also fairly useless, because nothing but cbor supports these tags; no other serializers, nor have we yet implemented any way to alter the behavior of obj unmarshal in response to them, nor yet obj marshal to emit them.

Partially fixes #7.

Signed-off-by: Eric Myhre <hash@exultant.us>
@warpfork
Copy link
Member

Oh, wow, thanks github, that's totally an auto-behavior I meant for you to have.

@warpfork warpfork reopened this Oct 15, 2017
@warpfork
Copy link
Member

CBOR tags, both deserializing and serializing, is now in on master. Tags should round-trip in cbor<->tokenstream<->cbor losslessly.

It's not yet wired to any marshaller/unmarshaller stuff.

@warpfork
Copy link
Member

So, the behavior you describe sounds defined if deserializing into a &interface{}, map[string]interface{}, or suchlike grabbag:

Let's say we have some cbor message roughly like: {"a": 1, "b":<tag:123>{"c":"d"}}.

If we set up the unmarshaller with a custom behavior to take tag=123 as a hint to produce TypeFoo, and give it that whole message and an &interface{}, it'll produce:

map[string]interface{}{
    "a": 1,
    "b": TypeFoo{"d"},
}

So far so good.

What should the unmarshaller do if you give it that same message and a handle to a &TypeBaz, where

type TypeBaz struct {
    A string
    B TypeQuux
}

The unmarshaller can't stuff a TypeFoo into a TypeQuux. What's the correct behavior?

@Stebalien
Copy link
Contributor

Stebalien commented Nov 8, 2017

If TypeQuux implements the unmarshal interface, I'd pass it to that and let it deal with it (is that possible in refmt?). If we're deserializing TypeQuux using reflection, I'd drop it. Basically, I'd treat tags hints unless we're unmarshaling into an interface (in which case we'd lookup the tag in type map).

However, it might be worth making it possible to specify levels of strictness for tags. That is, when registering a tag, one could say specify Hint (the above), NoReflect (error when using reflection with a type mismatch), and Force (the tag must use this type). Honestly, it may make sense to just make this a boolean and not have Force (that's a bit much).

(also, sorry for taking so long to get back to you on this, we've been a bit busy)

@warpfork
Copy link
Member

warpfork commented Nov 12, 2017

Ok, a bunch of this stuff should now be in and workable on master! You can see an example combined with some other advanced usage here in the test suite:

atlas.MustBuild(
	atlas.BuildEntry(tObjStr{}).UseTag(50).Transform().
		TransformMarshal(atlas.MakeMarshalTransformFunc(
			func(x tObjStr) (string, error) {
				return x.X, nil
			})).
		TransformUnmarshal(atlas.MakeUnmarshalTransformFunc(
			func(x string) (tObjStr, error) {
				return tObjStr{x}, nil
			})).
		Complete(),
)

This atlas building snippet will give you the power to have...

  • the serial format be a string (tagged)...
  • while the in-memory object format is struct{x string}...
  • and when deserializing into an interface{}, you'll still get the struct type, due to the tag hint.

So this is pretty cool.

There may still be paths where support for tag behavior is spotty, but if you uncover any, open an issue and we'll keep expanding the text fixture coverage and the features to match.


Aside: Despite supporting this, I also feel obliged to mention at least in passing that I think using CBOR tags is Probably A Bad Idea in most applications. You can do it. But I wouldn't do it without spending serious thought on the tradeoffs.

One of the major selling points of CBOR is its isomorphism to JSON -- it's easy to convert CBOR to JSON; and in most cases it's easy to convert the other way as well, and thus it's both simple and correct to consider JSON as an easy way for humans to author raw structures (that can then be canonicalized into CBOR). This breaks down with tags: there's no way to take a CBOR object with tags and convert it to a JSON object losslessly short of doing a giant schema expansion where every object is expanded into a tuple of {"tag":777, "real_obj": {...}} or {"tag": null, "real_obj": {...}}... and no one really wants to do that, because it's ugly as all sin. Similarly, there's no generic way to take a JSON object and somehow intuit which fields should end up with some kind of tag if it was converted into a CBOR object; either that massive unconditional object-depth-doubling approach, or some kind of external schema or other customized behaviors would be needed.

The ./refmt cbor=json converter utility I added this morning will currently silently drop the tags from the json output -- but that should almost certainly be considered a bug; what it should do is emit an error and reject the transformation since it cannot be done losslessly -- there's no way that piping the result back into ./refmt json=cbor will yield the original data again.

@Stebalien
Copy link
Contributor

TL;DR: We're only using one tag and we're using a "reserved" key in JSON to represent it. That is, "name": <our-one-tag>"data" in CBOR maps to "name": {"/": "data"} in JSON.

Despite supporting this, I also feel obliged to mention at least in passing that I think using CBOR tags is Probably A Bad Idea in most applications. You can do it. But I wouldn't do it without spending serious thought on the tradeoffs.

Don't worry, we have. We're only going to use one tag and we're only doing that because we don't want to bend over backwards and make our system worse just to support JSON.

For some background, we're using CBOR as an (well, the "default") encoding for a merkle-linked structured data system we call IPLD. IPLD is basically a meta-system for understanding (reading, writing, traversing, querying, etc.) any merkle-linked data structure like, e.g., git and ethereum. We've chosen CBOR as the "default" encoding for new applications built on top of this system as it's flexible, compact, and schema-less (i.e., JSON but better).

However, to do this, we need a way to efficiently represent merkle-links. In JSON, we're reserving the special key "/" and using {"/": "LinkData"} however, doing that in CBOR would be a waste of space and would force us to reserve that key, even when we don't technically need it. That's why we're using a tag.

One of the major selling points of CBOR is its isomorphism to JSON -- it's easy to convert CBOR to JSON;

Many of our applications need to efficiently store binary data so the ship has already sailed on that to some extent. We've been talking about adding a special syntax for representing binary data in JSON (e.g. {"/": "b64encoded data", "type": "binary"}) but, honestly, we're not too concerned with being able to represent all data in JSON (our main concern is going from JSON to other formats, not the other way around).

Really, in my view, the selling point isn't JSON compatibility, it's JavaScript/Python/etc. object compatibility. That is, CBOR maps cleanly to the standard "object" structures used in most dynamically typed languages.

@Stebalien
Copy link
Contributor

Also, this is awesome and this library is awesome (and will make our lives so much easier). Thanks!

@warpfork
Copy link
Member

Many of our applications need to efficiently store binary data so the ship has already sailed on that to some extent.

I see... Yeah, that's a fair cop.

(The default, generalized admonition stands for Anyone Else on the wide internets who ends up here by searching for "cbor tags" though...)


So... I will tentatively close this issue then? :D I think all the core path for the features you need is in place now. If you find stuff missing, more issues welcome!

@Stebalien
Copy link
Contributor

Yep. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants