feat: ipld amend #445

Draft
wants to merge 12 commits into
base: master

smrz2001
Contributor

@smrz2001 smrz2001 commented Jun 20, 2022

This PR implements a possible solution for incrementally modifying IPLD nodes.

It uses the same interface as @warpfork's patch implementation but accumulates updates and internal state to provide a more optimized Node "lens" for copy-on-write during Encode.

The draft PR included some additional discussions and benchmarking results comparing this PR's implementation with the current patch implementation.

cc @rvagg @RangerMauve @warpfork @BigLep

@smrz2001
Contributor Author

smrz2001 commented Jun 20, 2022

Based on our discussion, @RangerMauve, @rvagg, and to your specific point, @RangerMauve, I can update patch/eval to use traversal/amend instead of traversal/focus. Once done, the PR should be in good shape to merge, albeit with some missing features like link traversal, budgets, etc. (as I also noted in the other PR).

I think a good subsequent PR might be one that integrates amend into walk and focus so that the latter can take advantage of amend's performance enhancements while continuing to provide link traversal, budgets, etc., and then switching patch/eval back to using the updated traversal/focus.

Alternatively, I could start with a clean slate and port these features over to amend, which is an option I don't dislike.

My aim is to maximize reuse and take advantage of all the work that's already gone into focus.

If merging without some or all of these features is a no-go, I don't mind working on adding them to the current PR instead of a subsequent one.

@RangerMauve
Contributor

I think a good subsequent PR might be one that integrates amend into walk and focus so that the latter can take advantage of amend's performance enhancements while continuing to provide link traversal, budgets, etc., and then switching patch/eval back to using the updated traversal/focus.

Could you elaborate a bit on what integrating amend into walk and focus would entail?

and to your specific point, @RangerMauve, I can update patch/eval to use traversal/amend instead of traversal/focus

Question, would it make sense to put all the amend stuff under the patch namespace and merge the two concepts? Or are they so different that they need to have different names? One reason I'm thinking about this direction is that the names seem very similar and it could be confusing to API consumers to know what the difference is. It'd be nice if we could say "Patch has a high level API for patch sets (current /patch), and a lower level API for modifications (current /amend)".

@smrz2001
Contributor Author

smrz2001 commented Jun 24, 2022

Could you elaborate a bit on what integrating amend into walk and focus would entail?

Yes. walk and focus involve operations similar to those I've implemented in amend (hence my thought of integrating them): traversing a node graph, trying to reach a particular child node, modifying such a child, etc. In addition, they traverse links as they encounter them and manage a budget for such link traversal.

To integrate amend, I would replace the code that walk and focus currently use for traversal, iteration, etc. with the code implemented in amend, which minimizes the operations and memory required. It won't be straightforward, but I feel that it's doable.

I'm also debating whether to just add missing features to amend and let it grow organically based on user requirements instead of causing a ton of code churn while trying to fit it into walk and focus. If the IPLD team can provide a bit of guidance around which features are the most important for amend to have as it evolves, I can start adding them in order of importance. I feel like link traversal should definitely be added next.

Question, would it make sense to put all the amend stuff under the patch namespace and merge the two concepts? Or are they so different that they need to have different names? One reason I'm thinking about this direction is that the names seem very similar and it could be confusing to API consumers to know what the difference is. It'd be nice if we could say "Patch has a high level API for patch sets (current /patch), and a lower level API for modifications (current /amend)".

Yes, that makes sense. The main reason I used different terminology and separate packages was to be able to refer to the two implementations individually while comparing them, but, at this point, merging them seems to be appropriate.

If I were to merge them, my thinking was to replace the body of patch/eval with what is currently amend/eval, i.e. an Amender would be used to apply patches instead of focus. That would have the effect of not (currently) allowing link traversal but, as I mentioned previously, I can add support for that soon.

Does that seem reasonable?

@smrz2001
Contributor Author

Actually, I think I'm going to experiment a bit with the amend/walk/focus integration this week and let you know how it goes. Now that I'm looking at it again, I think it might be less work than I was initially thinking and could potentially fit in the same PR.

Integrating it there would allow amend's efficiencies to be available to two different APIs - patch and traversal.

@rvagg
Member

rvagg commented Jun 28, 2022

@smrz2001 can you give me your thoughts on the performance gains from this impl over FocusedTransform? They're essentially having to perform the same operations by the time you get to final node realisation aren't they? What am I not seeing here? Are the transform operations having to double (and triple etc.) up on parent node assembling as they deal with batches of operations?

And yeah, now I look at this code, I can see the logic of going back and seeing if you can adapt the walk and focus transforms to use it. That would also help prove the concept.

I think that making sure this works across an ADL would be a good step in the process too. The HAMT ADL is actively getting work, Masih has been pushing new changes to it as they're using it for the Filecoin Indexer service I believe, so it's getting closer to feature-complete. Alternatively, the most complete ADL we have is go-unixfsnode. I'm not sure we want those pulled into the repo to test here but if you can figure it out then wiring up a transformation over a working ADL would be really interesting and confirm that the layering is working as expected. Although, it's possible that the pieces around finalisation are a bit too rough to make it all work, so don't spend too much time spinning wheels on that.

Also, is amending a non-list or map root node intended to work? Looking at the Any amender, I can't see how that would work. Maybe it doesn't make sense, but shouldn't I be able to amend using a / path and replace whatever node it finds with something new?

@smrz2001
Contributor Author

smrz2001 commented Jun 29, 2022

@smrz2001 can you give me your thoughts on the performance gains from this impl over FocusedTransform? They're essentially having to perform the same operations by the time you get to final node realisation aren't they? What am I not seeing here? Are the transform operations having to double (and triple etc.) up on parent node assembling as they deal with batches of operations?

(Edited for clarity)

Basically, FocusedTransform assembles and builds a new Node using the base Node for each update operation, which involves some amount of copying and processing, though not much and probably possible to optimize further.

The big advantage in the case of amend, however, isn't that it assembles a new Node more efficiently, it's that it never assembles a new Node at all. It achieves this by minimally storing the update instructions and then presenting API consumers with a Node view that is the cumulative result of one or more updates, without ever needing any assembly/build to prepare the final Node.

The two most important pieces of logic for this are the LookupBy* and iteration code. The modification metadata is the source of truth, with the base Node being used for lookup/iteration of unmodified content.

While this doesn't matter a ton for a small set of updates, the advantage of amend's amortization is more noticeable as the number of updates increases.
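
A minimal sketch of the idea described above, assuming a string-only map and hypothetical names (`baseMap`, `overlay`, `Put`, `Remove`, `Each`) that are not part of go-ipld-prime or this PR. It only illustrates the principle: updates are recorded next to an untouched base, and lookup and iteration consult the recorded updates first.

```go
package main

import (
	"fmt"
	"strings"
)

// baseMap stands in for an immutable, insertion-ordered base Node.
type baseMap struct {
	keys []string
	vals map[string]string
}

// overlay is a hypothetical, simplified "amender": it records updates
// next to the base instead of assembling a new map.
type overlay struct {
	base    *baseMap
	mods    map[string]string // added or replaced entries
	removed map[string]bool   // tombstones for removed base entries
	added   []string          // keys absent from base, in insertion order
}

func newOverlay(b *baseMap) *overlay {
	return &overlay{base: b, mods: map[string]string{}, removed: map[string]bool{}}
}

// Put records an add or replace; its cost does not depend on the base size.
func (o *overlay) Put(k, v string) {
	if _, inBase := o.base.vals[k]; inBase {
		delete(o.removed, k)
	} else if _, seen := o.mods[k]; !seen {
		o.added = append(o.added, k)
	}
	o.mods[k] = v
}

// Remove records a tombstone for base keys and forgets added keys.
func (o *overlay) Remove(k string) {
	if _, inBase := o.base.vals[k]; inBase {
		o.removed[k] = true
		delete(o.mods, k)
		return
	}
	if _, seen := o.mods[k]; seen {
		delete(o.mods, k)
		for i, a := range o.added {
			if a == k {
				o.added = append(o.added[:i], o.added[i+1:]...)
				break
			}
		}
	}
}

// Lookup treats the recorded updates as the source of truth and falls
// back to the base for unmodified content.
func (o *overlay) Lookup(k string) (string, bool) {
	if o.removed[k] {
		return "", false
	}
	if v, ok := o.mods[k]; ok {
		return v, true
	}
	v, ok := o.base.vals[k]
	return v, ok
}

// Each iterates as if the updates had already been applied.
func (o *overlay) Each(fn func(k, v string)) {
	for _, k := range o.base.keys {
		if v, ok := o.Lookup(k); ok {
			fn(k, v)
		}
	}
	for _, k := range o.added {
		fn(k, o.mods[k])
	}
}

// render joins the visible entries so the view is easy to inspect.
func render(o *overlay) string {
	var parts []string
	o.Each(func(k, v string) { parts = append(parts, k+"="+v) })
	return strings.Join(parts, " ")
}

func main() {
	b := &baseMap{keys: []string{"a", "b", "c"}, vals: map[string]string{"a": "1", "b": "2", "c": "3"}}
	o := newOverlay(b)
	o.Put("b", "20")
	o.Remove("a")
	o.Put("d", "4")
	fmt.Println(render(o)) // the base itself is never copied or rebuilt
}
```

In this sketch each update touches only the overlay's own bookkeeping, which is where the amortization over many updates would come from.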

And yeah, now I look at this code, I can see the logic of going back and seeing if you can adapt the walk and focus transforms to use it. That would also help prove the concept.

Ok, excellent. I already started looking at this and will make updates in the current PR. I did run into some blockers and might need guidance - will let you know.

I think that making sure this works across an ADL would be a good step in the process too. The HAMT ADL is actively getting work, Masih has been pushing new changes to it as they're using it for the Filecoin Indexer service I believe, so it's getting closer to feature-complete. Alternatively, the most complete ADL we have is go-unixfsnode. I'm not sure we want those pulled into the repo to test here but if you can figure it out then wiring up a transformation over a working ADL would be really interesting and confirm that the layering is working as expected. Although, it's possible that the pieces around finalisation are a bit too rough to make it all work, so don't spend too much time spinning wheels on that.

Great idea!

Coincidentally, being able to efficiently update a HAMT (plus simplifying some of the ugly logic I had to add in dag-jose) was my original impetus for this project 😄 I would love to see updating HAMTs through once it's in a good place to test. I'm sure amend will need some enhancements at that point as well.

I'll start playing around with go-unixfsnode and see if I can set up a good test.

Also, is amending a non-list or map root node intended to work? Looking at the Any amender, I can't see how that would work. Maybe it doesn't make sense, but shouldn't I be able to amend using a / path and replace whatever node it finds with something new?

Good question. I had the same question during my implementation and was of the opinion that really only recursive Nodes need updates. Any primitive Nodes can just be recreated.

anyAmender acts as a wrapper for primitive leaf Nodes (and possibly for other (recursive) Amenders) and simplifies some of the iteration and processing logic without really providing any intelligence of its own.

I do see what you're saying though. I could update its Replace() method to check whether path is / and apply the update. I'll go ahead and do that.

As a side thought, I'm wondering if I should store a pointer to the base Node inside anyAmender. Recursive Nodes are already pointers but not primitive Nodes. For most primitive Nodes this probably doesn't matter but String or Bytes types could be large and creating an anyAmender wrapper around them currently results in a copy. Using pointers would reduce the memory consumption even further and probably justify anyAmender's existence a bit more.

@smrz2001
Contributor Author

@rvagg, after rereading your question regarding perf, I reworded some of my explanation. Please let me know if that makes sense.

To summarize, the main reason behind the new implementation's better performance is that it completely bypasses Node (re)assembly.

An Amender masquerades as a fully updated Node by letting lookup and iteration work as they would on a Node with updates already applied, but without actually doing any reassembly, not even during Encode().

@smrz2001
Contributor Author

smrz2001 commented Jul 4, 2022

@rvagg @RangerMauve, I've completed the integration with focus. All missing features, such as link traversal, progress tracking, link/node budgets, etc. are now included, and all focus and patch tests are passing.

I'm still mulling over how to integrate with walk and will attempt to do that soon, as well as do some testing using an ADL, like you recommended, @rvagg.

EDIT: I'm actually realizing that it might not be worth integrating with walk after all. The amend implementation lends itself very well to targeted retrievals/mutations but there doesn't seem to be a significant advantage to using it for node traversal.

I've also implemented lazy link recomputation so that, if desired, any link(s) whose referred node was modified will be recomputed only when accessed via Node.AsLink(). Till then, the "lazy" implementation will hold state in memory without either link recomputation or storage of mutated nodes.
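
To make the lazy recomputation concrete, here is a rough sketch under simplifying assumptions: a hex-encoded SHA-256 digest stands in for a real link, and `lazyLink`, `Update`, and the `computes` counter are hypothetical names used only for illustration. The PR itself works with go-ipld-prime links and a LinkSystem, which this sketch does not model.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// lazyLink is a hypothetical stand-in for a link whose target may be mutated.
type lazyLink struct {
	content  []byte // current (possibly mutated) content of the target
	cached   string // last computed link
	dirty    bool   // true when content changed since cached was computed
	computes int    // how many times the digest was actually computed
}

func newLazyLink(content []byte) *lazyLink {
	return &lazyLink{content: content, dirty: true}
}

// Update mutates the target but does not touch the link.
func (l *lazyLink) Update(content []byte) {
	l.content = content
	l.dirty = true
}

// AsLink recomputes the link only if the target changed since the last call.
func (l *lazyLink) AsLink() string {
	if l.dirty {
		sum := sha256.Sum256(l.content)
		l.cached = hex.EncodeToString(sum[:])
		l.dirty = false
		l.computes++
	}
	return l.cached
}

func main() {
	l := newLazyLink([]byte("v1"))
	l.Update([]byte("v2"))
	l.Update([]byte("v3")) // still no hashing has happened
	first := l.AsLink()    // digest computed here
	second := l.AsLink()   // served from the cached value
	fmt.Println(first == second, l.computes)
}
```

Any number of updates between two `AsLink()` calls costs a single digest computation in this sketch.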

I'm adding more test cases for focus, patch, and some of the less exposed branches in my code.

@BigLep

BigLep commented Jul 19, 2022

@smrz2001 : is there anything missing before a final review/merge?

Can you please also provide updated benchmarks? We'd like to include those in the release notes.

2022-07-19 triage conversation:

  1. Wondering about the module for "amend". Maybe it should be in its own module or be under traversal/patch? This could be a good topic for the community call.

@smrz2001
Contributor Author

@BigLep, here are the latest benchmarking results. The numbers are slightly higher than the initial results due to (I believe) the integration with focus, Progress, etc. but still significantly lower than when using node assemblers/builders.

Patch_Bench.test.exe -test.v -test.paniconexit0 -test.bench . -test.run ^$ -test.benchmem -test.benchtime=10x #gosetup
goos: windows
goarch: amd64
pkg: github.com/ipld/go-ipld-prime/traversal/patch
cpu: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz

BenchmarkAmend_Map_Add/inputs:_{100_1}-16             10         51710 ns/op       18503 B/op         164 allocs/op
BenchmarkAmend_Map_Add/inputs:_{100_10}-16            10         52970 ns/op       29069 B/op         301 allocs/op
BenchmarkAmend_Map_Add/inputs:_{100_100}-16           10        157300 ns/op       99791 B/op        1660 allocs/op
BenchmarkAmend_Map_Add/inputs:_{1000_10}-16           10        258040 ns/op      212967 B/op        1654 allocs/op
BenchmarkAmend_Map_Add/inputs:_{1000_100}-16          10        413550 ns/op      274216 B/op        3012 allocs/op
BenchmarkAmend_Map_Add/inputs:_{1000_1000}-16         10       1690200 ns/op     1022784 B/op       18334 allocs/op
BenchmarkAmend_Map_Add/inputs:_{10000_100}-16         10       3224290 ns/op     1915428 B/op       16517 allocs/op
BenchmarkAmend_Map_Add/inputs:_{10000_1000}-16        10       4328510 ns/op     2588397 B/op       31833 allocs/op
BenchmarkAmend_Map_Add/inputs:_{10000_10000}-16       10      20015850 ns/op     9551964 B/op      185023 allocs/op

BenchmarkAmend_List_Add/inputs:_{100_1}-16            10                           24757 B/op         252 allocs/op
BenchmarkAmend_List_Add/inputs:_{100_10}-16           10                           31165 B/op         414 allocs/op
BenchmarkAmend_List_Add/inputs:_{100_100}-16          10        100040 ns/op      107245 B/op        2036 allocs/op
BenchmarkAmend_List_Add/inputs:_{1000_10}-16          10        199880 ns/op      234832 B/op        2487 allocs/op
BenchmarkAmend_List_Add/inputs:_{1000_100}-16         10        200000 ns/op      301824 B/op        4107 allocs/op
BenchmarkAmend_List_Add/inputs:_{1000_1000}-16        10       1606750 ns/op     1049811 B/op       21209 allocs/op
BenchmarkAmend_List_Add/inputs:_{10000_100}-16        10       2419070 ns/op     2290280 B/op       24810 allocs/op
BenchmarkAmend_List_Add/inputs:_{10000_1000}-16       10       5962570 ns/op     3185764 B/op       41911 allocs/op
BenchmarkAmend_List_Add/inputs:_{10000_10000}-16      10      63479500 ns/op    10390667 B/op      212913 allocs/op

BenchmarkAmend_Map_Remove/inputs:_{100_1}-16          10                           18058 B/op         157 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{100_10}-16         10                           20103 B/op         230 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{100_100}-16        10         94370 ns/op       39853 B/op         955 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{1000_10}-16        10        299900 ns/op      208808 B/op        1584 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{1000_100}-16       10        306310 ns/op      232733 B/op        2312 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{1000_1000}-16      10        700060 ns/op      462102 B/op       10425 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{10000_100}-16      10       2913270 ns/op     1881092 B/op       15815 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{10000_1000}-16     10       3518820 ns/op     2213765 B/op       23934 allocs/op
BenchmarkAmend_Map_Remove/inputs:_{10000_10000}-16    10       8937400 ns/op     4213445 B/op      105102 allocs/op

BenchmarkAmend_List_Remove/inputs:_{100_1}-16         10                          21791 B/op         236 allocs/op
BenchmarkAmend_List_Remove/inputs:_{100_10}-16        10                          22223 B/op         263 allocs/op
BenchmarkAmend_List_Remove/inputs:_{100_100}-16       10                          25519 B/op         532 allocs/op
BenchmarkAmend_List_Remove/inputs:_{1000_10}-16       10        199860 ns/op      228194 B/op        2337 allocs/op
BenchmarkAmend_List_Remove/inputs:_{1000_100}-16      10        199970 ns/op      232514 B/op        2607 allocs/op
BenchmarkAmend_List_Remove/inputs:_{1000_1000}-16     10        306580 ns/op      249362 B/op        5304 allocs/op
BenchmarkAmend_List_Remove/inputs:_{10000_100}-16     10       2517960 ns/op     2221005 B/op       23310 allocs/op
BenchmarkAmend_List_Remove/inputs:_{10000_1000}-16    10       4721820 ns/op     2239610 B/op       26010 allocs/op
BenchmarkAmend_List_Remove/inputs:_{10000_10000}-16   10      16425440 ns/op     2489872 B/op       53006 allocs/op

BenchmarkAmend_Map_Replace/inputs:_{100_1}-16         10                           18237 B/op         160 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{100_10}-16        10         99780 ns/op       22429 B/op         269 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{100_100}-16       10         99900 ns/op       69656 B/op        1318 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{1000_10}-16       10        300020 ns/op      211261 B/op        1632 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{1000_100}-16      10        399820 ns/op      257061 B/op        2799 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{1000_1000}-16     10       1211120 ns/op      687030 B/op       14957 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{10000_100}-16     10       3026720 ns/op     1897616 B/op       16312 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{10000_1000}-16    10       4535900 ns/op     2428236 B/op       28873 allocs/op
BenchmarkAmend_Map_Replace/inputs:_{10000_10000}-16   10      14162460 ns/op     6570669 B/op      151308 allocs/op

BenchmarkAmend_List_Replace/inputs:_{100_1}-16        10         94810 ns/op       22332 B/op         244 allocs/op
BenchmarkAmend_List_Replace/inputs:_{100_10}-16       10         98780 ns/op       26525 B/op         343 allocs/op
BenchmarkAmend_List_Replace/inputs:_{100_100}-16      10        199880 ns/op       60149 B/op        1334 allocs/op
BenchmarkAmend_List_Replace/inputs:_{1000_10}-16      10         99940 ns/op      231360 B/op        2425 allocs/op
BenchmarkAmend_List_Replace/inputs:_{1000_100}-16     10        299850 ns/op      264554 B/op        3498 allocs/op
BenchmarkAmend_List_Replace/inputs:_{1000_1000}-16    10        809540 ns/op      599063 B/op       15112 allocs/op
BenchmarkAmend_List_Replace/inputs:_{10000_100}-16    10       2332470 ns/op     2237191 B/op       24209 allocs/op
BenchmarkAmend_List_Replace/inputs:_{10000_1000}-16   10       3222160 ns/op     2578458 B/op       35901 allocs/op
BenchmarkAmend_List_Replace/inputs:_{10000_10000}-16  10      10153530 ns/op     6103261 B/op      152815 allocs/op

@smrz2001
Contributor Author

@smrz2001 : is there anything missing before a final review/merge?

I don't think so, @BigLep, unless @rvagg or @RangerMauve would like to see something else - I'd be happy to make more changes.

I haven't gotten a chance to add more tests for corner cases but don't think that's blocking. I'll add some with the next PR to this code. I'm planning to put up another one in a bit anyway for adding and exposing plain Map and List interfaces for better DX around recursive nodes (e.g. Map.Put(key string, value datamodel.Node) instead of MapBuilder.BeginMap()...[do stuff]...MapBuilder.Finish() or qp.BuildMap()...[do stuff] etc.), which has been one of my biggest pain points.

If any specific tests are needed for more confidence though, I can add them, no problem.

To @rvagg's question about integrating with go-unixfsnode, I'm definitely having some trouble figuring out how to wire things together correctly and will continue trying to figure that out because I do want to make sure amend works correctly when integrated with ADLs.

Once I have a Map interface in place, I'm actually also considering rejiggering go-ipld-adl-hamt to use that and the update logic so that it becomes more accessible for other, plain map-like uses, while also proving that the layering works correctly. I might just do that before looking at go-unixfsnode.

2022-07-19 triage conversation:

  1. Wondering about the module for "amend". Maybe it should be in its own module or be under traversal/patch? This could be a good topic for the community call.

I tried really hard to keep the packages clean and separate 😭 but had to give up because of circular dependencies while integrating amend with focus. More stuff will need to be shuffled around to get that to work (e.g. the things in traversal/common.go and traversal/fns.go are required in both traversal/focus and a separate amend package, which means the latter cannot be imported by the former). I'm not a Go expert by any means, so I'm not sure if there's some obvious, idiomatic solution I'm missing here.

I actually ended up going backwards and pulling all amend files out into the main traversal package in the current version in order to integrate them.

I feel that amend (i.e. updating IPLD nodes) is a more generic concept than patch (apply JSON patch sets), which is a specific type of update, but I'm open to suggestions. This makes me think that both traversal and patch should be consumers of the amend API. The integration with focus, Progress, etc. makes that difficult though.

cc @aschmahmann, since you expressed some interest in this PR.

@RangerMauve
Contributor

Once I have a Map interface in place, I'm actually also considering rejiggering go-ipld-adl-hamt to use that and the update logic so that it becomes more accessible for other, plain map-like uses, while also proving that the layering works correctly. I might just do that before looking at go-unixfsnode.

That would be cool as heck!

I feel that amend (i.e. updating IPLD nodes) is a more generic concept than patch (apply JSON patch sets), which is a specific type of update, but I'm open to suggestions. This makes me think that both traversal and patch should be consumers of the amend API. The integration with focus, Progress, etc. makes that difficult though.

Honestly, I agree with this reasoning. Would it make sense to put amend in the top level? Or would this even be something that falls under go-ipld-prime/node/amend? cc @rvagg

Might be useful to talk about this with higher bandwidth at the next IPLD community call. 🤷

@RangerMauve
Contributor

Didn't work under traversal due to circular dependencies.

@RangerMauve
Contributor

Would it make sense to have a call with @smrz2001 and @rvagg (maybe @aschmahmann) and just devote some time to finalizing what we need to merge this so that it doesn't sit in limbo for too long?

@RangerMauve
Contributor

The concept that this is a "writable ADL" is really curious though. 😅

@smrz2001
Contributor Author

smrz2001 commented Aug 1, 2022

Would it make sense to have a call with @smrz2001 and @rvagg (maybe @aschmahmann) and just devote some time to finalizing what we need to merge this so that it doesn't sit in limbo for too long?

This would be great, @RangerMauve!

@smrz2001
Contributor Author

smrz2001 commented Aug 2, 2022

Moving the "writable ADL" discussion here:

I also wanted to pass another thought by you, @RangerMauve.

I can't get the idea out of my head that what amend really is, is an updatable ADL. It takes a source Node, allows the accumulation of updates, then "presents" a new Node to consumers.

I could easily turn Amender.Build into Reify and add Substrate to expose the source Node.

The concept that this is a "writable ADL" is really curious though. 😅

Yeah, that thought's been coming to me for a while now, although it's still a bit nebulous.

If an ADL is a transformation applied to a "raw" Node in order to turn it into a new, "synthesized" Node, then amend does that as well. In fact, that's the reason this code is working with mostly surgical changes to the rest of the code. Each Amender implementation (mapAmender, listAmender, linkAmender, anyAmender) masquerades as a Node with all the accumulated updates applied.

Node contents are read via Node.LookupBy*, Node.As*, or iteration (for recursive nodes), so all I had to do was make sure that lookups and iteration incorporated the latest state of the node in a manner consistent with the underlying Node implementation. For example, mapAmender works seamlessly with plainMap because they're both essentially LinkedHashMap-based implementations (iterating over plainMap returns elements in their order of insertion).

What this also means is that amend will not work out-of-the-box with inconsistent "raw" data layouts. Using mapAmender with go-ipld-adl-hamt will "work", i.e. updates will be present in the "synthesized" Node, but encoded data will not always be in the right order.
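
The ordering mismatch can be illustrated with a toy example. This is only a sketch: sorted keys stand in for "a layout with its own ordering rule" (a HAMT would follow its hashing scheme instead), and `canonicalOrder` and `overlayOrder` are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// canonicalOrder stands in for the order a layout with its own ordering
// rule would produce (sorted keys here, purely for illustration).
func canonicalOrder(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Strings(out)
	return out
}

// overlayOrder mimics an insertion-ordered amender: base entries first,
// in the order the base reports them, then newly added keys.
func overlayOrder(base, added []string) []string {
	return append(append([]string(nil), base...), added...)
}

func main() {
	base := canonicalOrder([]string{"m", "c", "x"})
	got := overlayOrder(base, []string{"a"})
	want := canonicalOrder(got)
	// Same entries, different order: the update is visible in the view,
	// but encoding the view as-is would not match the layout's own order.
	fmt.Println(got, want)
}
```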

In fact, I think go-ipld-adl-hamt is a halfway Amender already. It starts with an empty base Node and accumulates updates that it reports during lookups and iteration.

@RangerMauve
Contributor

I was just off a call with the WNFS team and we were talking about how ADLs map from raw data into a complex Node, but how it wasn't obvious how the opposite could happen.

This made me think that mutability with ADLs is something we should talk about more. e.g. Should we have something in the raw datamodel for transforming a node? The stuff sketched up in #451 seems to allude to a need to be able to modify the keys in a Map or add/remove items in a List (and then get back a new node).

This also seems like an alternative to using a Builder node for interfacing with the data in a more high level way. Is this something we want within the repo or should we maybe look at having a higher level interface for IPLD?

@rvagg
Member

rvagg commented Aug 8, 2022

This also seems like an alternative to using a Builder node for interfacing with the data in a more high level way. Is this something we want within the repo or should we maybe look at having a higher level interface for IPLD?

Yes, I think that's fine. We have bindnode in here already and there are methods in the codegen that can mess with internals too (if they're local anyway: e.g. https://github.com/ipfs/go-ipld-git/blob/master/ipldsch_types.go#L61).

I think we want to be rigid with making sure that all the various layers and implementations can present both:

  1. a proper Node interface to the data; and
  2. an Assembler/Builder interface that works across the data structure concerned

As long as those things hold true then the stack should line up. But it doesn't preclude offering alternative interfaces to the same data for usability and/or performance purposes. Performance tends to constrain us a bit and has influenced some of the design decisions here. But I think experimentation with alternative interface mechanisms for the different levels is fine as long as we offer a coherent stack that can work transparently when a user/API doesn't know or care what's underneath.

@rvagg
Member

rvagg commented Aug 8, 2022

OK, I tried pulling this up into a node/amend package but have to concur with @smrz2001 that the interdependency is a bit too much. I started pulling some traversal pieces up into the root package, a TraversalProgress interface, the TraversalFn type and the ErrBudgetExceeded error but there's so much of Progress that's used inside amend that it essentially needs all of the fields exposed as functions on the interface, which is quite annoying.

So, if the question is "where does this belong" from a purely API-sense perspective, then I'd try to get it into node as its own package. But it's also true that this is essentially a form of transformation, which at the moment belongs under traversal.

@rvagg
Member

rvagg commented Aug 8, 2022

So the next problem is: does this work with the layering that traversal needs to? I think the answer is no, unfortunately. If you look at the current implementation of FocusedTransform(), you'll see a nb := n.Prototype().NewBuilder(), with the traversal operating on that builder to make a new Node. With amend, the original prototype is ignored because it's operating on a raw datamodel form and it returns its own forms of Node objects instead. WalkTransforming() is similar but it uses the prototype when traversing the complex types, otherwise it just relies on the caller's Node to do the transforming.

So, I think if you were to run a traversal using this current implementation on anything but a datamodel layer (e.g. basicnode) Node then you might get into trouble. Some examples:

  • A TypedNode which is a struct Foo map { ... } representation tuple - so as a Node it presents as if it's a map, but at the base it's really a list. When you pass that off to a codec then it'll get Representation() called on it so you encode the list.
  • An ADL - take the adl/rot13adl demo ADL as a very simple example - it gets and sets strings encoded as ROT13, so the user just sees the unencoded strings but underneath they're encrypted. Pass it off to a codec and you should get the encrypted form.

In both cases (and perhaps a third one where you layer one on the other), if you pass one of these to the current transform functions and do a simple transformation, it makes the changes through the NodeAssembler interface, which takes care of making sure the underlying thing gets updated properly. So you can transform a ROT13 ADL Node as if it were unencrypted but it'll still do the encrypting for you. You can mess with fields of a tuple-representation Map but it'll still be a List underneath; plus you get all of the other things that come along with a TypedNode like a strict list of fields that can be set for a struct and what types they can be.

Amend is hijacking that and returning an Amender, which can act like a Node but loses the layers. If I pass in a HAMT ADL and mess with some of the entries with a transform then I expect the operations to occur on the HAMT. But instead my HAMT is converted to an Amender, which has the HAMT underneath it but the transformations won't be occurring as they should via the NodeAssembler pattern. e.g. https://github.com/ipld/go-ipld-adl-hamt/blob/9004dbd839e0e73ec52d9b1b47ddbd0627d74b92/builder.go#L274-L285 is what happens when you add a new key to a HAMT - Node#insertEntry has all the ADL magic, including all of the operations to create and store new datamodel layer blocks in the LinkSystem.
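
A toy illustration of the layering concern, using ROT13 as in the example above. Everything here is a hypothetical reduction: `node`, `rot13Node`, `buildRot13`, and `wrapper` are not go-ipld-prime types, and `wrapper` is a deliberately naive overlay rather than the PR's actual Amender. The point is only that a write routed through the ADL's own builder keeps the substrate encoded, while a generic overlay that is unaware of the ADL does not.

```go
package main

import "fmt"

// rot13 rotates ASCII letters by 13 places and leaves other bytes alone.
func rot13(s string) string {
	b := []byte(s)
	for i, c := range b {
		switch {
		case c >= 'a' && c <= 'z':
			b[i] = 'a' + (c-'a'+13)%26
		case c >= 'A' && c <= 'Z':
			b[i] = 'A' + (c-'A'+13)%26
		}
	}
	return string(b)
}

// node is a hypothetical two-level view: View is what users read,
// Substrate is what a codec would serialize.
type node interface {
	View() string
	Substrate() string
}

// rot13Node stands in for an ADL node: it stores the encoded form.
type rot13Node struct{ substrate string }

func (n rot13Node) View() string      { return rot13(n.substrate) }
func (n rot13Node) Substrate() string { return n.substrate }

// buildRot13 plays the role of the ADL's assembler: writes go through
// the ADL's own logic, so the substrate stays encoded.
func buildRot13(view string) node { return rot13Node{substrate: rot13(view)} }

// wrapper plays the role of a generic overlay that replaces a value
// without knowing about the ADL underneath.
type wrapper struct {
	base        node
	replacement string
}

func (w wrapper) View() string      { return w.replacement }
func (w wrapper) Substrate() string { return w.replacement } // layer lost: not encoded

func main() {
	orig := buildRot13("hello")
	viaAssembler := buildRot13(orig.View() + "!")
	viaWrapper := wrapper{base: orig, replacement: orig.View() + "!"}
	fmt.Println(viaAssembler.View(), viaAssembler.Substrate())
	fmt.Println(viaWrapper.View(), viaWrapper.Substrate())
}
```

Both results read the same through `View()`, but only the one built through the ADL's builder would serialize in encoded form.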

I'm not sure exactly what to do about this though. Some kind of "commit" operation would be nice to add to the traversals but that would have to be implicit, leaving the user with no option to receive an Amender if they want one for the efficiency gains. But perhaps that's the key - allow the user to opt in. If it receives an Amender because the user made one with NewAmender then just work on that and return it. If not, then we'd need to commit operations before returning, and those committals would need to be recursive and operate on the prototypes of the base Nodes at each point.
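
One way to picture the opt-in idea is sketched below. All names (`Node`, `Builder`, `plainMap`, `amender`, `newAmender`, `commit`, `transform`) are hypothetical and reduced to string maps; `NewBuilder()` merely plays the role of `Prototype().NewBuilder()`. The sketch is flat, so it does not show the recursive committal mentioned above.

```go
package main

import "fmt"

// Node and Builder are hypothetical, heavily reduced stand-ins.
type Node interface {
	Lookup(k string) (string, bool)
	Keys() []string
	NewBuilder() Builder // plays the role of Prototype().NewBuilder()
}

type Builder interface {
	Put(k, v string)
	Build() Node
}

type plainMap struct {
	keys []string
	vals map[string]string
}

func (m *plainMap) Lookup(k string) (string, bool) { v, ok := m.vals[k]; return v, ok }
func (m *plainMap) Keys() []string                 { return m.keys }
func (m *plainMap) NewBuilder() Builder {
	return &plainBuilder{m: &plainMap{vals: map[string]string{}}}
}

type plainBuilder struct{ m *plainMap }

func (b *plainBuilder) Put(k, v string) {
	if _, ok := b.m.vals[k]; !ok {
		b.m.keys = append(b.m.keys, k)
	}
	b.m.vals[k] = v
}
func (b *plainBuilder) Build() Node { return b.m }

// amender accumulates updates on top of a base Node.
type amender struct {
	base  Node
	mods  map[string]string
	added []string
}

func newAmender(base Node) *amender { return &amender{base: base, mods: map[string]string{}} }

func (a *amender) Set(k, v string) {
	if _, ok := a.Lookup(k); !ok {
		a.added = append(a.added, k)
	}
	a.mods[k] = v
}

func (a *amender) Lookup(k string) (string, bool) {
	if v, ok := a.mods[k]; ok {
		return v, true
	}
	return a.base.Lookup(k)
}

func (a *amender) Keys() []string {
	return append(append([]string(nil), a.base.Keys()...), a.added...)
}
func (a *amender) NewBuilder() Builder { return a.base.NewBuilder() }

// commit replays the accumulated state through the base's own builder.
func (a *amender) commit() Node {
	b := a.base.NewBuilder()
	for _, k := range a.Keys() {
		v, _ := a.Lookup(k)
		b.Put(k, v)
	}
	return b.Build()
}

// transform keeps accumulating if the caller passed an amender (opt-in);
// otherwise it commits through the base's builder before returning.
func transform(n Node, k, v string) Node {
	if a, ok := n.(*amender); ok {
		a.Set(k, v)
		return a
	}
	a := newAmender(n)
	a.Set(k, v)
	return a.commit()
}

func main() {
	base := &plainMap{keys: []string{"a"}, vals: map[string]string{"a": "1"}}
	_, isPlain := transform(base, "b", "2").(*plainMap)
	_, isAmender := transform(newAmender(base), "b", "2").(*amender)
	fmt.Println(isPlain, isAmender)
}
```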

But then there's the question of whether accumulating all of this in memory is even what the user wants? With the current form, I could pass in a huge HAMT with strings in it and use the transformers to append some character to each entry and it'd update the nodes along the way, without needing to buffer it all in memory before doing it at once (one of the nice features of ADLs). So perhaps accumulating should be opt-in too, which means perhaps we should preserve some or all of the existing traversal infrastructure and allow the user to opt in to an amending traversal?

@smrz2001
Contributor Author

smrz2001 commented Aug 8, 2022

OK, I tried pulling this up into a node/amend package but have to concur with @smrz2001 that the interdependency is a bit too much. I started pulling some traversal pieces up into the root package, a TraversalProgress interface, the TraversalFn type and the ErrBudgetExceeded error but there's so much of Progress that's used inside amend that it essentially needs all of the fields exposed as functions on the interface, which is quite annoying.

So, if the question is "where does this belong" from a purely API perspective, then I'd try to get it into node as its own package. But it's also true that this is essentially a form of transformation, which at the moment belongs under traversal.

Thanks for experimenting with this, @rvagg. This is where I end up each time I try to move amend out, and I've tried and given up a few times.

So the next problem is: does this work with the layering that traversal needs to? I think the answer is no, unfortunately. If you look at the current implementation of FocusedTransform(), you'll see a nb := n.Prototype().NewBuilder(), with the traversal operating on that builder to make a new Node. With amend, the original prototype is ignored because it's operating on a raw datamodel form and it returns its own forms of Node objects instead. WalkTransforming() is similar but it uses the prototype when traversing the complex types, otherwise it just relies on the caller's Node to do the transforming.

Agreed, this is what I was trying to allude to in my earlier comment:

Node contents are read via Node.LookupBy*, Node.As*, or iteration (for recursive nodes), so all I had to do was make sure that lookups and iteration incorporated the latest state of the node in a manner consistent with the underlying Node implementation. For example, mapAmender works seamlessly with plainMap because they're both essentially LinkedHashMap-based implementations (iterating over plainMap returns elements in their order of insertion).

What this also means is that amend will not work out-of-the-box with inconsistent "raw" data layouts. Using mapAmender with go-ipld-adl-hamt will "work", i.e. updates will be present in the "synthesized" Node, but encoded data will not always be in the right order.

Any amendment logic must take into account the underlying Node implementation, e.g. the current mapAmender will work with plainMap but cannot be assumed to work with any other "map-type" Node (e.g. HAMT).
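To make the ordering point concrete, here is a toy overlay over an insertion-ordered map. `orderedMap` and this `mapAmender` are hypothetical simplifications, not the PR's types: lookups and iteration consult the overlay first, and new keys are appended after the base keys. That matches an insertion-ordered base, but it says nothing about where a HAMT would place the same key, which is why the encoded order can come out wrong there.

```go
package main

import "fmt"

// orderedMap is a hypothetical stand-in for an insertion-ordered map node.
type orderedMap struct {
	keys   []string
	values map[string]string
}

// mapAmender layers modifications over a base map without touching it.
type mapAmender struct {
	base  orderedMap
	mods  map[string]string // latest value for modified or added keys
	added []string          // keys absent from base, in insertion order
}

func newMapAmender(base orderedMap) *mapAmender {
	return &mapAmender{base: base, mods: map[string]string{}}
}

// Put records a new or updated entry in the overlay only.
func (a *mapAmender) Put(k, v string) {
	_, inBase := a.base.values[k]
	_, inMods := a.mods[k]
	if !inBase && !inMods {
		a.added = append(a.added, k)
	}
	a.mods[k] = v
}

// Get returns the latest value, preferring the overlay.
func (a *mapAmender) Get(k string) (string, bool) {
	if v, ok := a.mods[k]; ok {
		return v, true
	}
	v, ok := a.base.values[k]
	return v, ok
}

// Entries iterates base keys in their original order, then added keys.
func (a *mapAmender) Entries() []string {
	order := append(append([]string{}, a.base.keys...), a.added...)
	out := make([]string, 0, len(order))
	for _, k := range order {
		v, _ := a.Get(k)
		out = append(out, k+"="+v)
	}
	return out
}

func main() {
	base := orderedMap{
		keys:   []string{"b", "a"},
		values: map[string]string{"b": "1", "a": "2"},
	}
	am := newMapAmender(base)
	am.Put("a", "20") // modify in place: keeps its original position
	am.Put("c", "3")  // new key: appended after the base keys
	fmt.Println(am.Entries()) // prints [b=1 a=20 c=3]
}
```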

I have a few ideas on how to move forward, let me know what you (and @RangerMauve) think:

  • Revert the integration of amend and focus. The current code successfully validates that performant traversal and modification is possible using amend, but traversal can be amend-ified indirectly (see below).

  • amend becomes its own package inside node with its own (limited) version of Progress and Budget because those are definitely useful concepts. Though it'll be inside node, it'll be an alternative to traversal that can be opted in to and that is also closer to the data model, allowing for more performant operations.

  • Amender implementations are explicitly tied to the underlying Node implementation, e.g. mapAmender should be tied to plainMap (maybe even renamed to plainMapAmender?), HAMT will have its own Amender implementation, etc. This coupling could perhaps be implemented via an updated version of this existing interface:

// NodePrototypeSupportingAmend is a feature-detection interface that can be
// used on a NodePrototype to see if it's possible to build new nodes of this style
// while sharing some internal data in a copy-on-write way.
//
// For example, Nodes using an Advanced Data Layout will typically
// support this behavior, and since ADLs are often used for handling large
// volumes of data, detecting and using this feature can result in significant
// performance savings.
type NodePrototypeSupportingAmend interface {
	// Proposed change: return an Amender instead of a NodeBuilder.
	// Currently: AmendingBuilder(base Node) NodeBuilder
	AmendingBuilder(base Node) Amender
	// FUTURE: probably also needs a `AmendingWithout(base Node, filter func(k,v) bool) NodeBuilder`, or similar.
	//  ("deletion" based APIs are also possible but both more complicated in interfaces added, and prone to accidentally quadratic usage.)
	// FUTURE: there should be some stdlib `Copy` (?) methods that automatically look for this feature, and fallback if absent.
	//  Might include a wide range of point `Transform`, etc, methods.
	// FUTURE: consider putting this (and others like it) in a `feature` package, if there begin to be enough of them and docs get crowded.
}

traversal can then possibly check this feature flag and have an alternative path that takes advantage of any performance gains, but can also fallback to the more generic implementation if no Amender is available.
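A rough sketch of that check, using a plain Go type assertion. The interfaces below are deliberately tiny, hypothetical stand-ins that borrow the names from the discussion; they are not go-ipld-prime's definitions, and `builderFor` is an invented helper.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the interfaces under discussion.
type Node interface{ Describe() string }

type NodeBuilder interface{ Build() Node }

type NodePrototype interface{ NewBuilder() NodeBuilder }

type Amender interface{ NodeBuilder }

type NodePrototypeSupportingAmend interface {
	AmendingBuilder(base Node) Amender
}

type textNode string

func (n textNode) Describe() string { return string(n) }

type textBuilder struct{ how string }

func (b textBuilder) Build() Node { return textNode(b.how) }

// genericProto only knows how to build from scratch.
type genericProto struct{}

func (genericProto) NewBuilder() NodeBuilder {
	return textBuilder{"rebuilt from scratch"}
}

// amendableProto additionally advertises the amend feature.
type amendableProto struct{ genericProto }

func (amendableProto) AmendingBuilder(base Node) Amender {
	return textBuilder{"amended " + base.Describe()}
}

// builderFor takes the amending path when the prototype supports it and
// falls back to the generic builder otherwise.
func builderFor(p NodePrototype, base Node) NodeBuilder {
	if ap, ok := p.(NodePrototypeSupportingAmend); ok {
		return ap.AmendingBuilder(base)
	}
	return p.NewBuilder()
}

func main() {
	base := textNode("base")
	fmt.Println(builderFor(genericProto{}, base).Build().Describe())
	fmt.Println(builderFor(amendableProto{}, base).Build().Describe())
}
```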

@smrz2001
Contributor Author

smrz2001 commented Aug 8, 2022

But then there's the question of whether accumulating all of this in memory is even what the user wants? With the current form, I could pass in a huge HAMT with strings in it and use the transformers to append some character to each entry and it'd update the nodes along the way, without needing to buffer it all in memory before doing it at once (one of the nice features of ADLs). So perhaps accumulating should be opt-in too, which means perhaps we should preserve some or all of the existing traversal infrastructure and allow the user to opt in to an amending traversal?

With the approach I outlined above, updating nodes along the way could perhaps be implemented via hamtAmender specific configuration.

Amender would establish an API for modifying nodes, with reference implementations for existing complex types like plainMap, plainList, and plainLink. ADLs would be able to choose if/how their Amender (and its configuration) should be implemented (or if an existing implementation can be reused), and also be able to use it transparently and correctly with traversal.

@rvagg
Member

rvagg commented Aug 9, 2022

Yes! Mostly that all sounds good. Some thoughts and questions:

  • Instead of plain* in the language, maybe frame this as "basicnode" since that's become the common parlance for how we refer to these base level things, the basicnode Node implementations happen to be implemented as plain* but that's really just internal naming. What you're suggesting is building basicnode-level tooling. I'm not really sure what that means in API terms, though. I think you could put some amend-related interfaces in the root of datamodel and then do some stuff in a sub-package, or even consider working inside the basicnode package? Or basicnode/amend? Is that too much? I'll let you think on this and come up with ideas, I don't have solid ones at the moment.
  • You're getting quite close to the NodeBuilder interface here, and as you've suggested already you're almost (or maybe are) building an ADL. So maybe it's worth considering whether an Amender interface should just extend NodeBuilder? If that's possible it would provide some really interesting API possibilities when passing things around, and amender nodes can return proper prototypes that divert building code to use new amenders too. Since the intention is to have consistent build interfaces to a diversity of implementations, maybe we can just go the full way with that.
  • We could consider extending all of the basicnode Builders (e.g. plainMap__Builder) from an amender by default. So, for example, if I decoded a graph directly from some bytes, without a typed interface or an ADL in the way, just their basicnode components, then I end up with a graph that's got amenders underneath and I automatically get the amend behaviour when using traversal on it. We could even work up the stack and give the same treatment to the TypedNode codegen components. Just a thought, and I'm not sure whether it'd actually be worth it or not, or what user expectations might be when Build() is called (are they OK with a change-tracked object? We're already dealing with memory issues up through the go-graphsync->go-data-transfer->go-fil-markets stack because of stacking Node problems).
  • Can you tell me about trackProgress and what it's intended to do? I can't quite figure it out from the code and maybe it's not fully doing what it's intended to do?
  • Likewise, created doesn't ever seem to be set, it's only ever false from what I can see. Can you talk about the intention there? Or am I looking at it wrong?

@smrz2001
Contributor Author

smrz2001 commented Jun 11, 2023

traversal.Focus also has the notion of trackProgress. Adding something like that to traversal.Walk might be another way to limit the amount of state tracking.

Yes, I've been thinking that maybe a bool in Config to turn on and off progress. The complication is StartAtPath and Budget which need some amount of state. Maybe it could default to false but turn on when those features are needed, or when it's explicitly turned to true.

From what I can tell, trackProgress in Progress.get is mainly used to control updates to Progress.LastBlock and Progress.Path while allowing StartAtPath and Budget to always be used/updated. Maybe that's the way to do it elsewhere since Progress.Path is presumably the main source of allocations during traversals?
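As a sketch of that split: in the toy walker below, the budget is always enforced while path bookkeeping happens only when a flag is set. `Progress`, `walk`, and the field names are hypothetical and much simpler than the real traversal package; nested `map[string]interface{}` values stand in for nodes.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBudgetExceeded = errors.New("traversal budget exceeded")

// Progress is a hypothetical, pared-down analogue of traversal progress state.
type Progress struct {
	TrackProgress bool     // when false, Path is never updated
	Budget        int      // remaining node visits; enforced regardless of TrackProgress
	Path          []string // only maintained when TrackProgress is true
}

// walk visits n and, for maps, every descendant, depth first.
func walk(p *Progress, n interface{}, visit func(p *Progress, n interface{})) error {
	if p.Budget <= 0 {
		return ErrBudgetExceeded
	}
	p.Budget--
	visit(p, n)
	m, ok := n.(map[string]interface{})
	if !ok {
		return nil
	}
	for k, child := range m {
		if p.TrackProgress {
			p.Path = append(p.Path, k) // the allocation this flag avoids
		}
		err := walk(p, child, visit)
		if p.TrackProgress {
			p.Path = p.Path[:len(p.Path)-1]
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	tree := map[string]interface{}{
		"a": map[string]interface{}{"b": 1},
		"c": 2,
	}
	p := &Progress{TrackProgress: true, Budget: 10}
	err := walk(p, tree, func(p *Progress, n interface{}) {
		fmt.Println(p.Path, n)
	})
	fmt.Println(err)
}
```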

@smrz2001
Contributor Author

smrz2001 commented Jun 12, 2023

It had been a while since I'd looked back at my own code, and now that I have, I've realized that the changes to focus are incomplete. Even in the basicnode implementations, NodeAmender operations assume that children are one of the ("amendable") basicnode types, which is definitely not always going to be the case.

These changes were primarily meant to discuss and validate the current high-level approach, compared with the previous one where amend was at the level of traversal and patch, and not deep inside basicnode. Since you've had a chance to review the code, @rvagg, and feel the approach has enough merit to keep exploring its possibilities further, I'll work on fleshing out the implementation (by feature-detecting NodePrototypeSupportingAmend), and to see if/how it can then be integrated into existing flows cleanly.

Before that though, if we take a step back, does adding to a Node implementation the ability to recursively operate on its children seem like a layering violation? I think it is sufficient for plainMap or plainList to allow the easy addition/removal/replacement of immediate children but allowing operations on children at or beyond 2 levels of nesting seems like it doesn't belong in map or list-like objects.

NodeAmender could be simplified to just having a Transform function (over what it already gets from NodeBuilder) that can update the value of any type of Node (including replacing the root of a recursive Node) or update the immediate children of recursive Nodes.

What do you think?

I was in the process of adding more NodeAmender documentation but then paused and thought I'd get a little more clarity here first.

@smrz2001 smrz2001 force-pushed the patch-feature-with-amend branch 3 times, most recently from 437510b to ba38717 Compare July 3, 2023 23:32
@smrz2001
Contributor Author

smrz2001 commented Jul 3, 2023

Hey @rvagg, I just pushed some reworked code that simplifies things considerably. All tests are passing so there shouldn't be any regressions in the code, at least not any major ones.

There aren't any new test cases for NodeAmender.Transform or any benchmarking results, but I will look into both soon. I did remove two asserts from the map_test.go that don't quite apply in the new approach.

I'm going to restart the changes to focus and walk with this new foundation, which treats plainMap and plainList as amendable recursive types whose roots or immediate children can be transformed (added, modified, deleted).

Both focus and walk will remain complicated pieces of code (soon with additional amend feature checks) because not all node implementations support amendment, but we might be able to optimize the code considerably for the types that do.

I feel like recursive types have always been the source of complexity and inefficiency when attempting to modify nodes. The new code makes it easier to modify the basic map/list types while also improving the DX a little bit, i.e. calling Transform to modify nodes in-place instead of using BeginMap/BeginList -> Finish to create new nodes with transformed contents.

@smrz2001
Contributor Author

smrz2001 commented Jul 5, 2023

Just pushed a version of walk with amenders. All existing tests are passing on macOS, so I'm surprised they're failing like they are in CI 😕

I'll work on adding some more tests and benchmarks soon, though would love some feedback regarding the approach if you get a chance to take a look, @rvagg.

@smrz2001
Contributor Author

smrz2001 commented Jul 6, 2023

I went ahead and pulled in my map/list interface changes as well here, IMO further improving the DX for the amend API. I feel like now there's probably enough of an API around this to have a fruitful discussion on the path forward, @rvagg.

Since the API will very likely evolve some more before it settles, I feel like I should incorporate feedback before adding more unit/benchmarking tests.

*Amender remains an optional, no/low-cost abstraction over NodeBuilder, so I don't foresee a huge hit in performance, especially for critical paths. I will of course get some numbers and add correctness tests once we're further along.

@rvagg
Member

rvagg commented Jul 7, 2023

Before that though, if we take a step back, does adding to a Node implementation the ability to recursively operate on its children seem like a layering violation? I think it is sufficient for plainMap or plainList to allow the easy addition/removal/replacement of immediate children but allowing operations on children at or beyond 2 levels of nesting seems like it doesn't belong in map or list-like objects.

This is an interesting question. I gather we're talking specifically about this:

	Transform(path Path, transform AmendFn) (Node, error)

It might be reasonable to see a violation when you look at plainMap and plainList, but those are the most basic kind of recursive node. The Node interface itself doesn't have anything that limits it to a single level of complexity, so I would say that it's not a violation. plain* are private types so you don't have any opportunity to confuse the fact that most of the time everything should just be a Node and you shouldn't make too many assumptions beyond that.

It might fit easier into the TypedNode interface where you have a much clearer mapping to multi-layered Nodes, but the justification for limiting this to just typed forms seems weak.
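For illustration, a path-based transform over nested maps might behave like the sketch below. It borrows the `Transform` and `AmendFn` names but is otherwise hypothetical: nodes are plain `map[string]interface{}` values, the path is a string slice, and returning the previous value is an assumption, since the thread does not say what the returned Node represents.

```go
package main

import (
	"errors"
	"fmt"
)

// AmendFn receives the current value at a path (nil if absent) and returns
// its replacement. Hypothetical signature, loosely modelled on the thread.
type AmendFn func(prev interface{}) (interface{}, error)

// Transform walks path through nested maps and replaces the target in place.
// It returns the previous value at that path (an assumption of this sketch).
func Transform(root map[string]interface{}, path []string, fn AmendFn) (interface{}, error) {
	if len(path) == 0 {
		return nil, errors.New("empty path")
	}
	cur := root
	for _, seg := range path[:len(path)-1] {
		next, ok := cur[seg].(map[string]interface{})
		if !ok {
			return nil, fmt.Errorf("segment %q is not a map", seg)
		}
		cur = next
	}
	last := path[len(path)-1]
	prev := cur[last]
	repl, err := fn(prev)
	if err != nil {
		return nil, err
	}
	cur[last] = repl
	return prev, nil
}

func main() {
	root := map[string]interface{}{
		"a": map[string]interface{}{"b": "x"},
	}
	// Two levels deep: the root map reaches past its immediate children.
	prev, err := Transform(root, []string{"a", "b"}, func(p interface{}) (interface{}, error) {
		return p.(string) + "!", nil
	})
	fmt.Println(prev, err, root)
}
```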

@rvagg
Member

rvagg commented Jul 7, 2023

This is looking pretty clean, I like the way the traversal code feature detects and splits that out; much less scary than having it all tangled up.

Could you work on adding some basic tests to the map and/or list amenders to demonstrate usage? I feel like that would be a good path to describing what the purpose of this all is.

@smrz2001
Contributor Author

smrz2001 commented Jul 7, 2023

This is an interesting question. I gather we're talking specifically about this:

	Transform(path Path, transform AmendFn) (Node, error)

It might be reasonable to see a violation when you look at plainMap and plainList, but those are the most basic kind of recursive node. The Node interface itself doesn't have anything that limits it to a single level of complexity, so I would say that it's not a violation. plain* are private types so you don't have any opportunity to confuse the fact that most of the time everything should just be a Node and you shouldn't make too many assumptions beyond that.

It might fit easier into the TypedNode interface where you have a much clearer mapping to multi-layered Nodes, but the justification for limiting this to just typed forms seems weak.

That makes sense, thank you for clarifying.

This is looking pretty clean, I like the way the traversal code feature detects and splits that out; much less scary than having it all tangled up.

Could you work on adding some basic tests to the map and/or list amenders to demonstrate usage? I feel like that would be a good path to describing what the purpose of this all is.

Yes, since the code looks reasonable, I'll work on adding tests 👍🏼

Thanks!

@smrz2001
Contributor Author

Just pushed some tests demonstrating usage for MapAmender. The logic for ListAmender is trickier to keep optimized, so I'm still working on that and its tests.

Will also figure out why the other tests are failing.

@smrz2001
Contributor Author

Ok, just pushed some fixes and tests for ListAmender demonstrating its usage. Now there are tests for both MapAmender and ListAmender. Please let me know what you think when you get a chance to look at the latest code, @rvagg.

It's a little bizarre that all the tests are passing for macOS but not for Windows or Ubuntu for Go 1.19.x/1.20.x, though they pass for Ubuntu on Go 1.18.x. I'll see if I can figure out what's going on - is this something you've run into before?

The Go 1.17.x tests on Ubuntu are failing because I've included an experimental package for some list operations that I did not want to code by hand, though I can implement them to maintain compatibility with older Go versions.

@smrz2001
Contributor Author

smrz2001 commented Jul 16, 2023

Ok, so the tests were failing because transformations are now changing underlying nodes by default.

This is probably a good stage to discuss how to deal with im/mutability when transforming nodes. As I see it, we have a couple of options.

A.

By default, the code in its current state will now transform nodes being amended, even after they have been "built" and are presumably immutable. This is the most space/performance efficient because it needs no additional metadata regarding modifications. However, this could lead to broken invariants in other parts of the code that assume constructed nodes remain unchanged.

Maybe that's ok because this only happens if a user explicitly requests an amending builder for a node and knows what to expect. If they need a base node to remain immutable, they'll need to use datamodel.Copy to create a deep copy that can then be modified. This loses some of the space advantage of this approach but does retain the better DX of amending builders and allows unlimited modifications of the copied node.

B.

I can reuse some of the logic I previously wrote to maintain additional state for modifications while keeping the base node unmodified. This will definitely increase the complexity of the code and have an impact on space/performance because of the need to store metadata regarding modifications separately from the base node. This approach would have the advantage of ensuring that base nodes are never modified from under code expecting them to remain immutable.

This approach would have clearer behavior but would come with significant complexity/space/performance costs. It will, however, fulfill the original promise of copy-on-write.

EDIT 1:
We could also allow Node implementations to choose which approach to take. For example, an ADL could store additional modification-related metadata like I did in my earlier traversal.Amender implementations and present the right "lens" (with modifications applied but without changing the base data) to any consumers of its API.

EDIT 2:
I've updated the implementation to make Prototype.Map.AmendingBuilder and Prototype.List.AmendingBuilder internally make a copy of the base node so as not to break any contracts elsewhere. IMO this code is still more optimized than the previous way of modifying nodes - any additional memory required for the copy will be amortized over multiple modification calls without jeopardizing any contracts.
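The difference between mutating in place (option A) and copying once up front (the EDIT 2 behaviour) can be sketched as follows. `plainMap`, `mapAmender`, `amendingBuilder`, and `amendingBuilderInPlace` are hypothetical toys, not the PR's code, and the copy here is a single flat level rather than a deep copy.

```go
package main

import "fmt"

// plainMap is a hypothetical stand-in for a built, insertion-ordered map node.
type plainMap struct {
	keys   []string
	values map[string]string
}

func (m *plainMap) clone() *plainMap {
	c := &plainMap{
		keys:   append([]string(nil), m.keys...),
		values: make(map[string]string, len(m.values)),
	}
	for k, v := range m.values {
		c.values[k] = v
	}
	return c
}

// mapAmender modifies whichever map it holds, directly.
type mapAmender struct{ m *plainMap }

// amendingBuilder copies the base once, so later edits never reach it.
func amendingBuilder(base *plainMap) *mapAmender {
	return &mapAmender{m: base.clone()}
}

// amendingBuilderInPlace is option A: no copy, so edits hit the base itself.
func amendingBuilderInPlace(base *plainMap) *mapAmender {
	return &mapAmender{m: base}
}

func (a *mapAmender) Put(k, v string) {
	if _, ok := a.m.values[k]; !ok {
		a.m.keys = append(a.m.keys, k)
	}
	a.m.values[k] = v
}

func (a *mapAmender) Build() *plainMap { return a.m }

func main() {
	base := &plainMap{keys: []string{"a"}, values: map[string]string{"a": "1"}}

	am := amendingBuilder(base)
	am.Put("a", "2")
	am.Put("b", "3") // the one-time copy is amortized over repeated edits
	fmt.Println(base.values["a"], am.Build().values["a"])

	amendingBuilderInPlace(base).Put("a", "9")
	fmt.Println(base.values["a"]) // the base itself has changed
}
```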

@smrz2001
Contributor Author

Hey @rvagg, have you had a chance to look at some of the latest updates? There are now examples of how to use the new interfaces.

I think we're definitely closer to a clean API though there are still some usability details to iron out, some of which will have an impact on performance.

It would be good to understand the most common ways in which this code might be used in order to choose the right approach, or add some configuration options with sane defaults.
