Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort map entries marshalling dag-json #203

Merged
merged 3 commits into from Jul 21, 2021
Merged

Sort map entries marshalling dag-json #203

merged 3 commits into from Jul 21, 2021

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jul 14, 2021

Let me count the ways that you're going to hate this @warpfork ... but this is what everyone else has agreed the spec should do, it's what go-ipfs is doing with its existing encoding/json usage so existing test fixtures all make this assumption, and this opens the way to get some cross-impl test fixtures in place.

As for my impl, yeah .. that might take some finessing.

Builds on #202 so whitespace removal is in here already.

@rvagg rvagg requested a review from warpfork July 14, 2021 03:41
@@ -31,21 +32,32 @@ func Marshal(n ipld.Node, sink shared.TokenSink, allowLinks bool) error {
if _, err := sink.Step(&tk); err != nil {
return err
}
// Emit map contents (and recurse).
// Collect map entries, then sort by key
type entry struct {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is declaring a struct in the middle of this naughty, or OK? can I get away without a struct somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's naughty :) certainly better than declaring it globally, unless we need methods on it in the future.

if err != nil {
return err
}
entries = append(entries, entry{keyStr, v})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications of holding on to v while we collect and sort? Yeah, a perf hit, but what about memory and safety?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface doesn't say whether it's safe to use the value of a Next call after the Next call has happened. Many APIs like badgerdb or Go's own range will reuse the same memory for the following iterations, meaning that it's on you to make a copy if you want to keep it around for longer.

I think it should be Eric's call to say what the interface semantics should be here. Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Okay. Going to think-out-loud for a first pass here.)

Hrm. I don't think that's actually come up as a trade before, because I think we've always been able to answer with pointers to existing and already escaped-to-heap memory.

The immutability guarantee held by all most of the node implementations so far also made it easy to say it's no problem to hold onto things.

Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.

So this is certainly true, we just happen to not have hit a case where the distinction has been possible yet.

I can imagine in the future we might have ADL implementations which can optimize things by reusing internal buffers. That would hit this. I'm not immediately certain that's going to be a predominating concern in that situation, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think assuming it's safe to hold onto the values is the safer approach, anyway, as going the opposite route might well break some users.

In the future, we could always add opt-in iterator interfaces that take more aggressive shortcuts, like "may share memory", or "allows for more performant codec sorting", etc. But the basic iteration interface is... simple :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Agree. Let's go that route.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth noting here that we probably shouldn't optimise too heavily for the theoretical ADL case here because this is a block codec and if you're trying to serialize a massive HAMT into DAG-{JSON,CBOR} then you're in for a bad time regardless of what optimisations might apply here. I think it's safe-ish to assume we're dealing with maps of a maximum of ~1M, maybe a little more, in the 99.9% case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sorting in codecs -- yeah. It's not particularly important to optimize for someone feeding an ADL into a marshaller. (Even if that's technically allowed, it's... not a usual thing to want to actually do.)

The reason we ended up considering it is because it would have effect on the iterator design.

But all the same, the result here is: yeah, this is fine as written.

Copy link
Contributor

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a sane default behavior, even if it's a bit slow.

We could define an optional interface too, like:

type SortedMapNode interface {
    SortedMapIterator() MapIterator
}

Then, any ipld.Node implementation having that method would signal to us that its keys are already provided in a sorted manner, meaning we wouldn't need to buffer them all in memory and sort.

That should be a separate issue, though. Then we could look at implementing it appropriately for codegen and/or bindnode.

@@ -31,21 +32,32 @@ func Marshal(n ipld.Node, sink shared.TokenSink, allowLinks bool) error {
if _, err := sink.Step(&tk); err != nil {
return err
}
// Emit map contents (and recurse).
// Collect map entries, then sort by key
type entry struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's naughty :) certainly better than declaring it globally, unless we need methods on it in the future.

if err != nil {
return err
}
entries = append(entries, entry{keyStr, v})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface doesn't say whether it's safe to use the value of a Next call after the Next call has happened. Many APIs like badgerdb or Go's own range will reuse the same memory for the following iterations, meaning that it's on you to make a copy if you want to keep it around for longer.

I think it should be Eric's call to say what the interface semantics should be here. Allowing implementations to reuse memory would certainly unlock more performance for the cases where one doesn't need to hold onto memory, and those will likely be more common.

@rvagg
Copy link
Member Author

rvagg commented Jul 15, 2021

Click for some rambling about how different codecs need different sorting and the options available .. off-topic for this particular change.
type SortedMapNode interface {
    SortedMapIterator() MapIterator
}

Sadly this won't cut it as we have different sorting rules for different codecs. DAG-JSON does standard bytewise sorting while DAG-CBOR does length-first bytewise-second sorting. Eric's ideal codec would not do sorting, and that's certainly an option for the future.

Something like this might work so you'd get something like a standard Less(), but in ipld.Node form, or plain string form would be even better I think:

type SortedMapNode interface {
    SortedMapIterator(less func(i, j ipld.Node) bool)) MapIterator
}

But .. messy and complicated, so not just now.

@mvdan
Copy link
Contributor

mvdan commented Jul 15, 2021

Codecs could also have an opt-in to disable the sorting, but the default should be as per the spec, IMO.

Good point about different codecs. Maybe one interface per codec is an option, too. I feel like we should discuss those opt-in optimizations in a separate thread.

@warpfork
Copy link
Collaborator

warpfork commented Jul 16, 2021

Okay, so yeah, holistically, I'm gonna concede, fine, let's make the codec sort.


Thing one, seems to import to address now:

I think we need to put the whole thing in a condition branch, even on this first pass. DAG-JSON should gain this sort. JSON shouldn't. (I guess we don't have tests which have made this obvious, but iirc, the json package depends on the dagjson package, so will be affected unless we make it not so.)

We don't have to worry about trying to standardize a codec config interface for this today, I don't think. (That would be nice-to-have, but probably requires more thought.) Just enough of a bool passing so the JSON codec can not exhibit this trait.


Thing two, that we're keeping an eye on for later: about the fastpath options:

I think agree with the general idea of an outline for a feature detector for fastpath:

type SortedMapNode interface {
    SortedMapIterator(less func(i, j ipld.Node) bool)) MapIterator
}

But I would like to figure out how we can get this to support memoization, if something is created with a sort, and later asked to yield an iterator for the same sort -- that should be as close to free as possible.

I made some speculations on this, but these are actually slightly broken, because they assume golang will let one compare function pointers, which is... not entirely the case. However, I'll share this train of thought anyway, and we can maybe figure out practical routes to it later:

We should probably expect to also then form a list of well-known sort funcs, a la:

func Sort_Natural(i, j ipld.Node) bool { /* ... */ }
func Sort_RFC7049(i, j ipld.Node) bool { /* ... */ }
// etc...

and then another option for builders:

// this gets a little more complicated:
// I think we'll want functions as a facade that do the thing,
// and a feature detect interface on nodeprototype or nodeassembler which the facade uses if it can.
// thinking about how to make this look "normal"/ergonomic within the overall system might need work.

func BeginMapSorted(sizeHint int64, sort func(i, j ipld.Node) bool) (MapAssembler, error) {
    // feature detect for fastpath, or, just do it on the fly.
    // may want variants of this function which sort
    // vs variants which just *check* the sort and hollers if it receives out-of-order input.
    // may also want a variant that says "trust me", but would be unsafe;
    // the "trust me" variant could be useful for when there's compile-time-known ordering.
}

type NodeAssemblerSupportingSortedMap interface {
    BeginMapSorted(sizeHint int64, sort func(i, j ipld.Node) bool) (MapAssembler, error)
}

and then the reason why having all that can pay off:

Map node implementations that implement both of these feature detection interfaces can store the sort func(i, j ipld.Node) bool function pointer that they were assembled with. This adds a word of memory to the size of the map, but this is not usually significant. And if asked to yield an iterator for the same sort, now the node can just do a quick pointer-compare ((amend: whoops, no, I just wish this is possible; it is not, the idea breaks here!)) on the sort function requested vs the sort function that was the invariant when the map was assembled, and if the same, then an excellent shortcut is available: the sort has been memoized, and we just do the natural iterate with no sort at all!

@mvdan
Copy link
Contributor

mvdan commented Jul 16, 2021

All the ideas around future opt-in optimizations sound really interesting and I do think we should explore them further. But I also think this PR is probably not the place :) Perhaps my bad for starting that train of thought, but I only meant it as "we're not dropping streaming encodes on the floor forever".

+1 to Eric's comment that this should be turned off for the "plain json" encoding, in any case.

@rvagg
Copy link
Member Author

rvagg commented Jul 19, 2021

Might have overdone it here? Need feedback. Instead of adding more boolean parameters, I've cleaned up both Marshal and Unmarshal to take MarshalOptions and UnmarshalOptions which have all of the various ways the modes can vary. This also means that instead of piggybacking parseLinks and allowLinks for bytes handling on top of link handling, we can be more explicit about it. Sorting included too.

@warpfork
Copy link
Collaborator

warpfork commented Jul 19, 2021

Beaut. Probably the way it always should've been.


Stylistic option: you can take what you've done here and go one tiny step further: one can attach a method to a struct, and use that rather than an argument to pass the config. I kinda like doing this for config, because it frees up the namespace to have a function of the same name at package scope, which can have default behavior. So, roughly:

type DecodeConfig struct {
    /* options, as you've done */
}

// Decode on DecodeConfig is the configured variant of this method.
func (cfg *DecodeConfig) Decode(na ipld.NodeAssembler, r io.Reader) error {
    /* does the thing, taking config into account */
}

// Decode (at package scope) is the default variant of the decoding operation.
// Customize a DecodeConfig struct and call the method of the same name on that if needing to customize options.
func Decode(na ipld.NodeAssembler, r io.Reader) error {
    cfg := DecodeConfig{ /* set up defaults */ }
    return cfg.Decode(na, r)
}

(I started moving in this direction in the dagjson2 package, but simultaneously conflated it with a bunch of memory reuse work and other work at the same time, so, it... got too big for one go, and didn't really fully land. Wompwomp.)

I think I like this pattern. That said, it's stylistic. It also might not matter much to be decisive at the moment; we still seem to have most consumers of the library going only through the zero-config multicodec access pattern, which means we might still have relative freedom to iterate on this config stuff for a while.


Sidenote on naming, in case it's been confusing: stuff is called "marshal"/"unmarshal" when... it's older, really. I've started dropping that nomenclature and using encode/decode exclusively. ("Marshal"/"unmarshal" is a convention in golang. But I'd rather have terminology that's consistent with our language-agnostic IPLD docs which revolve around "codec"... so, "encode"/"decode" is better from that viewpoint.)

I haven't gone on a rampage to remove the old nomenclature. But if you happen to be renovating stuff and it seems low-value to keep it around... feel free to drop functions by those names, IMO. (...presuming that they already have an alternative in the "decode"/"encode" naming convention, of course. But I'm pretty sure they all do.)

@mvdan
Copy link
Contributor

mvdan commented Jul 19, 2021

remove the old nomenclature

Please deprecate instead :) I tripped over Encoder vs Encode just yesterday, still.

Stylistic option: you can take what you've done here and go one tiny step further: one can attach a method to a struct, and use that rather than an argument to pass the config. I kinda like doing this for config, because it frees up the namespace to have a function of the same name at package scope, which can have default behavior.

Fully agreed there. It's what the new encoding/json experiment has gone for, as well:

(Ignore the EncodeOptions parameter, as json wants to separate the semantic "marshal options" from the syntactic "encode options")

@warpfork
Copy link
Collaborator

Please deprecate instead

Yes, I guess I should clarify to say: if you'd be in the position of making a breaking change anyway, then maybe consider dropping. Otherwise, it is often now worth maintaining some facades (since we're generally at the bottom of some deep dependency trees, frequently subject to diamond problems, etc).

as json wants to separate the semantic "marshal options" from the syntactic "encode options"

Oh, nice. That sounds like the same distinction I was thinking to try to make with those words these days: "marshal" to regard object<->serial/token mapping (or, in IPLD's case, Data Model rather than jumping straight to token); "encode" to regard serial/token<->encodedbytes.

@rvagg
Copy link
Member Author

rvagg commented Jul 19, 2021

ok, gross and weird Golang config/options handling patterns aside, how's this PR looking for merge as it stands? The way it's looking to me right now is that Encode() and Decode() are the external facades and Marshal() and Unmarshal() are their more complicated internal siblings that you could reach for if you wanted to do more than the default patterns on the codec. So the dagjson marshaller and unmarshaller are more like utilities while the encoder and decoder are more about the codec doing what it should do. Maybe that doesn't make sense, but for now it seems nice to hide some of the complexity behind simple Encode() and Decode() calls.

@rvagg rvagg merged commit 333be0f into master Jul 21, 2021
@rvagg rvagg deleted the rvagg/dagjsonsort branch July 21, 2021 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants