Sort map entries marshalling dag-json #203

rvagg · 2021-07-14T03:41:43Z

Let me count the ways that you're going to hate this @warpfork ... but this is what everyone else has agreed the spec should do, it's what go-ipfs is doing with its existing encoding/json usage so existing test fixtures all make this assumption, and this opens the way to get some cross-impl test fixtures in place.

As for my impl, yeah .. that might take some finessing.

Builds on #202 so whitespace removal is in here already.

rvagg · 2021-07-14T03:42:24Z

codec/dagjson/marshal.go

@@ -31,21 +32,32 @@ func Marshal(n ipld.Node, sink shared.TokenSink, allowLinks bool) error {
 		if _, err := sink.Step(&tk); err != nil {
 			return err
 		}
-		// Emit map contents (and recurse).
+		// Collect map entries, then sort by key
+		type entry struct {


is declaring a struct in the middle of this naughty, or OK? can I get away without a struct somehow?

I don't think it's naughty :) certainly better than declaring it globally, unless we need methods on it in the future.
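
For illustration, here is a minimal sketch of the collect-then-sort approach using a struct type declared inside the function. It works on a plain Go map rather than an ipld.Node, and `sortedPairs` and its `entry` fields are hypothetical names chosen for this example, not the code from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPairs collects the entries of m, sorts them by key, and renders
// each one as "key=value". It stands in for the marshaller's map branch.
func sortedPairs(m map[string]int) []string {
	// The struct type is declared locally: nothing outside this function
	// can see it, which is fine as long as it needs no methods.
	type entry struct {
		key   string
		value int
	}
	entries := make([]entry, 0, len(m))
	for k, v := range m {
		entries = append(entries, entry{k, v})
	}
	// Go compares strings byte by byte, so this is a bytewise sort.
	sort.Slice(entries, func(i, j int) bool { return entries[i].key < entries[j].key })

	out := make([]string, 0, len(entries))
	for _, e := range entries {
		out = append(out, fmt.Sprintf("%s=%d", e.key, e.value))
	}
	return out
}

func main() {
	fmt.Println(sortedPairs(map[string]int{"b": 2, "a": 1, "c": 3})) // prints [a=1 b=2 c=3]
}
```

The struct keeps each key next to its value while sorting, which is the reason a bare slice of keys is not quite enough on its own.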

rvagg · 2021-07-14T03:43:59Z

codec/dagjson/marshal.go

 			if err != nil {
 				return err
 			}
+			entries = append(entries, entry{keyStr, v})


What are the implications of holding on to v while we collect and sort? Yeah, a perf hit, but what about memory and safety?

The interface doesn't say whether it's safe to use the value of a Next call after the Next call has happened. Many APIs like badgerdb or Go's own range will reuse the same memory for the following iterations, meaning that it's on you to make a copy if you want to keep it around for longer.

I think it should be Eric's call to say what the interface semantics should be here. Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.
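
To make the hazard concrete, the sketch below uses a hypothetical iterator, `reusingIter`, that hands out the same backing buffer on every `Next` call. It is not the ipld.MapIterator interface; it only shows what goes wrong if a caller holds on to values from an iterator that is allowed to reuse memory.

```go
package main

import "fmt"

// reusingIter is a hypothetical iterator that reuses one buffer for
// every value it returns.
type reusingIter struct {
	data [][]byte
	pos  int
	buf  []byte
}

func newIter() *reusingIter {
	return &reusingIter{data: [][]byte{[]byte("aa"), []byte("bb"), []byte("cc")}}
}

// Next returns the next value, or false when exhausted. The returned
// slice is only valid until the following call to Next.
func (it *reusingIter) Next() ([]byte, bool) {
	if it.pos >= len(it.data) {
		return nil, false
	}
	it.buf = append(it.buf[:0], it.data[it.pos]...)
	it.pos++
	return it.buf, true
}

// collect holds on to every value until iteration is done, copying each
// one first only when copyValues is true.
func collect(it *reusingIter, copyValues bool) []string {
	var held [][]byte
	for {
		v, ok := it.Next()
		if !ok {
			break
		}
		if copyValues {
			v = append([]byte(nil), v...)
		}
		held = append(held, v)
	}
	out := make([]string, len(held))
	for i, v := range held {
		out[i] = string(v)
	}
	return out
}

func main() {
	fmt.Println(collect(newIter(), false)) // prints [cc cc cc]
	fmt.Println(collect(newIter(), true))  // prints [aa bb cc]
}
```

Without the copy, every held slice aliases the same buffer and ends up showing the last value, which is exactly what a collect-then-sort marshaller would trip over.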

(Okay. Going to think-out-loud for a first pass here.)

Hrm. I don't think that's actually come up as a trade before, because I think we've always been able to answer with pointers to existing and already escaped-to-heap memory.

The immutability guarantee held by ~~all~~ most of the node implementations so far also made it easy to say it's no problem to hold onto things.

> Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.

So this is certainly true, we just happen to not have hit a case where the distinction has been possible yet.

I can imagine in the future we might have ADL implementations which can optimize things by reusing internal buffers. That would hit this. I'm not immediately certain that's going to be a predominating concern in that situation, though.

I think assuming it's safe to hold onto the values is the safer approach, anyway, as going the opposite route might well break some users.

In the future, we could always add opt-in iterator interfaces that take more aggressive shortcuts, like "may share memory", or "allows for more performant codec sorting", etc. But the basic iteration interface is... simple :)

Yep. Agree. Let's go that route.

Also worth noting here that we probably shouldn't optimise too heavily for the theoretical ADL case here because this is a block codec and if you're trying to serialize a massive HAMT into DAG-{JSON,CBOR} then you're in for a bad time regardless of what optimisations might apply here. I think it's safe-ish to assume we're dealing with maps of a maximum of ~1M, maybe a little more, in the 99.9% case.

For sorting in codecs -- yeah. It's not particularly important to optimize for someone feeding an ADL into a marshaller. (Even if that's technically allowed, it's... not a usual thing to want to actually do.)

The reason we ended up considering it is because it would have effect on the iterator design.

But all the same, the result here is: yeah, this is fine as written.

mvdan

I think this is a sane default behavior, even if it's a bit slow.

We could define an optional interface too, like:

type SortedMapNode interface {
    SortedMapIterator() MapIterator
}

Then, any ipld.Node implementation having that method would signal to us that its keys are already provided in a sorted manner, meaning we wouldn't need to buffer them all in memory and sort.

That should be a separate issue, though. Then we could look at implementing it appropriately for codegen and/or bindnode.
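
A rough sketch of how such feature detection could look from the codec's side. `MapNode`, `plainMap`, and `presortedMap` are simplified stand-ins invented for this example (they return key slices instead of iterators), so treat it as an outline of the type-assertion pattern rather than the real ipld interfaces.

```go
package main

import (
	"fmt"
	"sort"
)

// MapNode is a simplified stand-in for a map-kind node.
type MapNode interface {
	Keys() []string // keys in the node's natural order
}

// SortedMapNode is the optional interface: a node implementing it
// promises that SortedKeys returns keys already in sorted order.
type SortedMapNode interface {
	SortedKeys() []string
}

type plainMap struct{ keys []string }

func (m plainMap) Keys() []string { return m.keys }

// presortedMap assumes its keys were sorted when it was built.
type presortedMap struct{ keys []string }

func (m presortedMap) Keys() []string       { return m.keys }
func (m presortedMap) SortedKeys() []string { return m.keys }

// keysForEncoding returns the keys in sorted order and reports whether
// the fast path (no buffering and sorting) was taken.
func keysForEncoding(n MapNode) ([]string, bool) {
	if s, ok := n.(SortedMapNode); ok {
		return s.SortedKeys(), true
	}
	keys := append([]string(nil), n.Keys()...)
	sort.Strings(keys)
	return keys, false
}

func main() {
	fmt.Println(keysForEncoding(plainMap{[]string{"b", "a"}}))     // prints [a b] false
	fmt.Println(keysForEncoding(presortedMap{[]string{"a", "b"}})) // prints [a b] true
}
```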

rvagg · 2021-07-15T12:05:20Z

(Some rambling about how different codecs need different sorting and the options available; off-topic for this particular change.)

type SortedMapNode interface {
    SortedMapIterator() MapIterator
}

Sadly this won't cut it as we have different sorting rules for different codecs. DAG-JSON does standard bytewise sorting while DAG-CBOR does length-first bytewise-second sorting. Eric's ideal codec would not do sorting, and that's certainly an option for the future.
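
The difference between the two orderings is easy to see on a handful of keys. This is only a sketch over plain strings; the function names `lessBytewise`, `lessLengthFirst`, and `sortedCopy` are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// lessBytewise orders keys by plain byte comparison (the DAG-JSON rule
// as described in this thread).
func lessBytewise(a, b string) bool { return a < b }

// lessLengthFirst orders shorter keys first and falls back to byte
// comparison among keys of equal length (the DAG-CBOR rule as described
// in this thread).
func lessLengthFirst(a, b string) bool {
	if len(a) != len(b) {
		return len(a) < len(b)
	}
	return a < b
}

// sortedCopy returns a sorted copy of keys, leaving the input untouched.
func sortedCopy(keys []string, less func(a, b string) bool) []string {
	out := append([]string(nil), keys...)
	sort.Slice(out, func(i, j int) bool { return less(out[i], out[j]) })
	return out
}

func main() {
	keys := []string{"b", "aa", "c", "a", "ab"}
	fmt.Println(sortedCopy(keys, lessBytewise))    // prints [a aa ab b c]
	fmt.Println(sortedCopy(keys, lessLengthFirst)) // prints [a b c aa ab]
}
```

A node that is "already sorted" for one codec is therefore not necessarily sorted for the other, which is why a single parameterless SortedMapIterator is not enough.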

Something like this might work so you'd get something like a standard Less(), but in ipld.Node form, or plain string form would be even better I think:

type SortedMapNode interface {
    SortedMapIterator(less func(i, j ipld.Node) bool) MapIterator
}

But .. messy and complicated, so not just now.

mvdan · 2021-07-15T12:30:12Z

Codecs could also have an opt-in to disable the sorting, but the default should be as per the spec, IMO.

Good point about different codecs. Maybe one interface per codec is an option, too. I feel like we should discuss those opt-in optimizations in a separate thread.

warpfork · 2021-07-16T22:42:36Z

Okay, so yeah, holistically, I'm gonna concede, fine, let's make the codec sort.

Thing one, which seems important to address now:

I think we need to put the whole thing in a condition branch, even on this first pass. DAG-JSON should gain this sort. JSON shouldn't. (I guess we don't have tests which have made this obvious, but iirc, the json package depends on the dagjson package, so will be affected unless we make it not so.)

We don't have to worry about trying to standardize a codec config interface for this today, I don't think. (That would be nice-to-have, but probably requires more thought.) Just enough of a bool passing so the JSON codec can not exhibit this trait.

Thing two, that we're keeping an eye on for later: about the fastpath options:

I think I agree with the general idea of an outline for a feature detector for fastpath:

type SortedMapNode interface {
    SortedMapIterator(less func(i, j ipld.Node) bool) MapIterator
}

But I would like to figure out how we can get this to support memoization, if something is created with a sort, and later asked to yield an iterator for the same sort -- that should be as close to free as possible.

I made some speculations on this, but these are actually slightly broken, because they assume golang will let one compare function pointers, which is... not entirely the case. However, I'll share this train of thought anyway, and we can maybe figure out practical routes to it later:

We should probably expect to also then form a list of well-known sort funcs, a la:

func Sort_Natural(i, j ipld.Node) bool { /* ... */ }
func Sort_RFC7049(i, j ipld.Node) bool { /* ... */ }
// etc...

and then another option for builders:

// this gets a little more complicated:
// I think we'll want functions as a facade that do the thing,
// and a feature detect interface on nodeprototype or nodeassembler which the facade uses if it can.
// thinking about how to make this look "normal"/ergonomic within the overall system might need work.

func BeginMapSorted(sizeHint int64, sort func(i, j ipld.Node) bool) (MapAssembler, error) {
    // feature detect for fastpath, or, just do it on the fly.
    // may want variants of this function which sort
    // vs variants which just *check* the sort and hollers if it receives out-of-order input.
    // may also want a variant that says "trust me", but would be unsafe;
    // the "trust me" variant could be useful for when there's compile-time-known ordering.
}

type NodeAssemblerSupportingSortedMap interface {
    BeginMapSorted(sizeHint int64, sort func(i, j ipld.Node) bool) (MapAssembler, error)
}

and then the reason why having all that can pay off:

Map node implementations that implement both of these feature detection interfaces can store the sort func(i, j ipld.Node) bool function pointer that they were assembled with. This adds a word of memory to the size of the map, but this is not usually significant. And if asked to yield an iterator for the same sort, now the node can just do a quick pointer-compare ((amend: whoops, no, I just wish this were possible; it is not, so the idea breaks here!)) on the sort function requested vs the sort function that was the invariant when the map was assembled, and if the same, then an excellent shortcut is available: the sort has been memoized, and we just do the natural iterate with no sort at all!
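
In Go, function values can only be compared against nil, so `less1 == less2` does not compile. One possible workaround, which is not something proposed in this thread, is to give each well-known sort an identity that is comparable, for example a pointer to a small struct. The sketch below uses hypothetical names (`SortOrder`, `sortedMap`, `KeysIn`) and plain string keys.

```go
package main

import (
	"fmt"
	"sort"
)

// SortOrder pairs a comparison function with an identity. Pointers to
// SortOrder are comparable even though the functions are not.
type SortOrder struct {
	Name string
	Less func(a, b string) bool
}

// Two well-known orders, declared once and shared by pointer.
var (
	Bytewise    = &SortOrder{"bytewise", func(a, b string) bool { return a < b }}
	LengthFirst = &SortOrder{"length-first", func(a, b string) bool {
		if len(a) != len(b) {
			return len(a) < len(b)
		}
		return a < b
	}}
)

// sortedMap remembers which order its keys were assembled in.
type sortedMap struct {
	keys      []string
	builtWith *SortOrder
}

func sortKeys(keys []string, order *SortOrder) []string {
	out := append([]string(nil), keys...)
	sort.Slice(out, func(i, j int) bool { return order.Less(out[i], out[j]) })
	return out
}

func newSortedMap(keys []string, order *SortOrder) *sortedMap {
	return &sortedMap{keys: sortKeys(keys, order), builtWith: order}
}

// KeysIn returns the keys in the requested order and reports whether
// the memoized order could be reused without sorting again.
func (m *sortedMap) KeysIn(order *SortOrder) ([]string, bool) {
	if order == m.builtWith { // pointer comparison: legal and cheap
		return m.keys, true
	}
	return sortKeys(m.keys, order), false
}

func main() {
	m := newSortedMap([]string{"b", "aa", "a"}, Bytewise)
	fmt.Println(m.KeysIn(Bytewise))    // prints [a aa b] true
	fmt.Println(m.KeysIn(LengthFirst)) // prints [a b aa] false
}
```

The cost is the same extra word per map mentioned above, and callers have to pass the shared `*SortOrder` values rather than ad hoc closures for the shortcut to trigger.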

mvdan · 2021-07-16T22:56:25Z

All the ideas around future opt-in optimizations sound really interesting and I do think we should explore them further. But I also think this PR is probably not the place :) Perhaps my bad for starting that train of thought, but I only meant it as "we're not dropping streaming encodes on the floor forever".

+1 to Eric's comment that this should be turned off for the "plain json" encoding, in any case.

rvagg · 2021-07-19T04:12:49Z

Might have overdone it here? Need feedback. Instead of adding more boolean parameters, I've cleaned up both Marshal and Unmarshal to take MarshalOptions and UnmarshalOptions which have all of the various ways the modes can vary. This also means that instead of piggybacking parseLinks and allowLinks for bytes handling on top of link handling, we can be more explicit about it. Sorting included too.
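
As a rough picture of what an options struct buys over more boolean parameters, here is a self-contained sketch. The `MarshalOptions` type below and its `SortMapKeys` field are assumptions made for this example, not necessarily the names used in the PR, and `marshalMap` is a toy encoder over string pairs that uses Go's `%q` quoting as a simplification of JSON string escaping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// MarshalOptions is a hypothetical options struct; the field name is an
// assumption for illustration.
type MarshalOptions struct {
	// SortMapKeys, when true, emits map entries in bytewise key order.
	SortMapKeys bool
}

// pair keeps entries in insertion order, since Go maps have none.
type pair struct{ key, value string }

// marshalMap writes entries as a compact JSON-like object.
func marshalMap(entries []pair, opts MarshalOptions) string {
	ordered := append([]pair(nil), entries...)
	if opts.SortMapKeys {
		sort.SliceStable(ordered, func(i, j int) bool { return ordered[i].key < ordered[j].key })
	}
	var sb strings.Builder
	sb.WriteByte('{')
	for i, e := range ordered {
		if i > 0 {
			sb.WriteByte(',')
		}
		fmt.Fprintf(&sb, "%q:%q", e.key, e.value)
	}
	sb.WriteByte('}')
	return sb.String()
}

func main() {
	entries := []pair{{"b", "2"}, {"a", "1"}}
	// A DAG-JSON style caller asks for sorting.
	fmt.Println(marshalMap(entries, MarshalOptions{SortMapKeys: true})) // prints {"a":"1","b":"2"}
	// A plain JSON style caller keeps the node's own order.
	fmt.Println(marshalMap(entries, MarshalOptions{})) // prints {"b":"2","a":"1"}
}
```

Because the zero value leaves sorting off, each codec has to state its mode explicitly, and new modes can be added later without changing the function signature.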

warpfork · 2021-07-19T07:49:36Z

Beaut. Probably the way it always should've been.

Stylistic option: you can take what you've done here and go one tiny step further: one can attach a method to a struct, and use that rather than an argument to pass the config. I kinda like doing this for config, because it frees up the namespace to have a function of the same name at package scope, which can have default behavior. So, roughly:

type DecodeConfig struct {
    /* options, as you've done */
}

// Decode on DecodeConfig is the configured variant of this method.
func (cfg *DecodeConfig) Decode(na ipld.NodeAssembler, r io.Reader) error {
    /* does the thing, taking config into account */
}

// Decode (at package scope) is the default variant of the decoding operation.
// Customize a DecodeConfig struct and call the method of the same name on that if needing to customize options.
func Decode(na ipld.NodeAssembler, r io.Reader) error {
    cfg := DecodeConfig{ /* set up defaults */ }
    return cfg.Decode(na, r)
}

(I started moving in this direction in the dagjson2 package, but simultaneously conflated it with a bunch of memory reuse work and other work at the same time, so, it... got too big for one go, and didn't really fully land. Wompwomp.)

I think I like this pattern. That said, it's stylistic. It also might not matter much to be decisive at the moment; we still seem to have most consumers of the library going only through the zero-config multicodec access pattern, which means we might still have relative freedom to iterate on this config stuff for a while.

Sidenote on naming, in case it's been confusing: stuff is called "marshal"/"unmarshal" when... it's older, really. I've started dropping that nomenclature and using encode/decode exclusively. ("Marshal"/"unmarshal" is a convention in golang. But I'd rather have terminology that's consistent with our language-agnostic IPLD docs which revolve around "codec"... so, "encode"/"decode" is better from that viewpoint.)

I haven't gone on a rampage to remove the old nomenclature. But if you happen to be renovating stuff and it seems low-value to keep it around... feel free to drop functions by those names, IMO. (...presuming that they already have an alternative in the "decode"/"encode" naming convention, of course. But I'm pretty sure they all do.)

mvdan · 2021-07-19T09:13:44Z

> remove the old nomenclature

Please deprecate instead :) I tripped over Encoder vs Encode just yesterday, still.

> Stylistic option: you can take what you've done here and go one tiny step further: one can attach a method to a struct, and use that rather than an argument to pass the config. I kinda like doing this for config, because it frees up the namespace to have a function of the same name at package scope, which can have default behavior.

Fully agreed there. It's what the new encoding/json experiment has gone for, as well:

With options: https://pkg.go.dev/github.com/go-json-experiment/json#MarshalOptions.Marshal
With defaults: https://pkg.go.dev/github.com/go-json-experiment/json#Marshal

(Ignore the EncodeOptions parameter, as json wants to separate the semantic "marshal options" from the syntactic "encode options")

warpfork · 2021-07-19T15:45:07Z

> Please deprecate instead

Yes, I guess I should clarify to say: if you'd be in the position of making a breaking change anyway, then maybe consider dropping. Otherwise, it is often now worth maintaining some facades (since we're generally at the bottom of some deep dependency trees, frequently subject to diamond problems, etc).

> as json wants to separate the semantic "marshal options" from the syntactic "encode options"

Oh, nice. That sounds like the same distinction I was thinking to try to make with those words these days: "marshal" to regard object<->serial/token mapping (or, in IPLD's case, Data Model rather than jumping straight to token); "encode" to regard serial/token<->encodedbytes.

rvagg · 2021-07-19T23:24:22Z

ok, gross and weird Golang config/options handling patterns aside, how's this PR looking for merge as it stands? The way it's looking to me right now is that Encode() and Decode() are the external facades and Marshal() and Unmarshal() are their more complicated internal siblings that you could reach for if you wanted to do more than the default patterns on the codec. So the dagjson marshaller and unmarshaller are more like utilities while the encoder and decoder are more about the codec doing what it should do. Maybe that doesn't make sense, but for now it seems nice to hide some of the complexity behind simple Encode() and Decode() calls.

rvagg requested a review from warpfork July 14, 2021 03:41

mvdan reviewed Jul 14, 2021

This was referenced Jul 15, 2021

switch dag put cmd to directly use prime ipfs/kubo#7995

Merged

fix dag test fixtures for go-ipld-prime codecs ipfs/kubo#8257

Merged

rvagg added 2 commits July 16, 2021 16:11

Sort map entries marshalling dag-json

26a6fb7

Make tests pass with sorted dag-json output

019d85c

rvagg force-pushed the rvagg/dagjsonsort branch from 534e775 to 019d85c Compare July 16, 2021 06:13

warpfork mentioned this pull request Jul 18, 2021

Sort map entries marshalling dag-cbor #204

Merged

Add {Unm,M}arshalOptions for explicit mode switching for json vs dagjson

e481ffe

rvagg merged commit 333be0f into master Jul 21, 2021

rvagg deleted the rvagg/dagjsonsort branch July 21, 2021 02:30
