importer: remove `UnixfsNode` from the balanced builder #5118

schomatis · 2018-06-14T13:11:20Z

The builder.go file has been completely rewritten, should be reviewed as an entirely new file.

The dagbuilder.go has mostly additions and can be reviewed as a diff.

Closes #5106.

importer: remove `UnixfsNode` from the balanced builder

The `UnixfsNode` structure has multiple pointers to many (non-complementary)
mutually exclusive node types, only some of them are active (not-`nil`) at a
given time in the code path which made the code too convoluted. Specifically,
the most important distinction between node types was being hidden: leaf nodes
vs internal (non-leaf) nodes.

Remove entirely the use of `UnixfsNode` from the `balanced` package replacing it
in turn with the newly created `FSNodeOverDag` structure that represents the
UnixFS node encoded inside the DAG node, primarily used for internal node
representations. Leaf nodes are handled exclusively in the `NewLeafDataNode`
encapsulating its multiple representations (that we're previously exposed in
`UnixfsNode` as conflicting pointers).

The `builder.go` file has been completely rewritten, although the basic DAG
creation algorithm has been preserved (extending a full DAG by creating a new
root and linking the old one as its child), the most significant modification
has been in the loop of `Layout` that now only handles internal nodes (i.e.,
nodes with `depth` bigger than zero) to be able to adapt `fillNodeRec` to only
that scenario (avoiding the replace logic of the zero `depth` case with the
defective `Set` function, now removed). The `fillNodeRec` now explicitly returns
the `ipld.Node` and the size of the file data it's storing to propagate it
upwards into the DAG.

The `DagBuilderHelper` was heavily extended to incorporate `ipld.Node` functions
that would replace the `UnixfsNode` ones used by the balanced builder:
`NewLeafNode()`, `NewLeafDataNode()` and `AddNodeAndClose()`. Also, the
`ProcessFileStore` function was incorporated to encapsulate all the logic
related to the Filestore support which was scattered throughout the builder
logic, the `offset` that was being passed through most functions is now a part
of the `DagBuilderHelper`.

This has turned out to be a rather big commit, it should have been split into
more smaller and logically cohesive commits, but the `UnixfsNode` was too
entangled inside the logic and that would have required a progressive
modification of the `UnixfsNode` structure as well, which wasn't possible as it
is still being used by the balanced builder (the same reason why most of the
`UnixfsNode`-related functions cannot yet be removed, leaving the `helpers.go`
file mostly untouched).

whyrusleeping · 2018-06-17T17:57:24Z

Do not review.

ACK

schomatis · 2018-06-17T18:04:08Z

@whyrusleeping I mean, take a look if you want, but right now I've basically expanded the UnixfsNode functions in the balanced builder and removed what was not necessary for an internal node, and similarly for the leaf node (internal vs leaf was the basic distinction that UnixfsNode was hiding and hence the code was losing an important dimensionality). The result is a lot of boilerplate code, but now, if this (hopefully) passes the tests it needs to be re-encapsulated in more IPLD-oriented functions.

schomatis · 2018-06-19T15:47:39Z

@Mr0grog Could you take a look at the comments please to check if they make sense? Especially the commit message, the package doc, the FSNodeOverDag doc, and the fillNodeRec doc.

Mr0grog · 2018-06-19T17:31:05Z

@schomatis I’ll try and take a look later today or tomorrow. I think it’s safe to say you know a lot more about all these parts than I do now, though!

schomatis · 2018-06-19T18:37:02Z

@Mr0grog Thanks, there's no rush, you don't need to check that what it says it's true but if actually some coherent meaning can be grasped from those lines.

Mr0grog

OK! I didn’t review everything in too much depth, but focused on the particular docs you pointed me to @schomatis.

With code covering a conceptually complicated area like this, I think diagrams can help a lot. They take up a lot of space (and a little time to create), but they are more than worth the effort for others trying to understand the code. (I used a similar technique extensively in a couple geometry libraries I’ve worked on in the past and had lots of folks say it was very helpful.)

Mr0grog · 2018-06-20T18:52:04Z

importer/balanced/builder.go

+// first child), which proceeds to be filled up (link) to more nodes. In
+// all cases, the data (chunk) is stored only at the leaves, with the
+// rest of (internal) nodes only storing links to their children and
+// sizes of the data stored under them.


I think diagrams could be enormously helpful here.

The way I read this, it’s saying something slightly different from the old text. I’m assuming functionality hasn’t changed here, though. Here’s what I’m reading:

Old:

until the maximum number of links is reached… Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.

Which I think means:

With maximum links = 2 +-------------+ | Root 1 | +-------------+ | +------+------+ | | +=========+ +=========+ + - - - - + | Chunk 1 | | Chunk 2 | | Chunk 3 | +=========+ +=========+ + - - - - + ↯ +-------------+ | Root 3 | +-------------+ | +-------------+-------------+ | | +-------------+ +-------------+ | Root 1 | | Root 2 | +-------------+ +-------------+ | | +------+------+ +------+ - - -+ | | | | +=========+ +=========+ +=========+ + - - - - + | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | +=========+ +=========+ +=========+ + - - - - +

While the new text:

until the maximum number of links is reached… Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up (link) to more nodes.

Sounds more like:

With maximum links = 2 +-------------+ | Root 1 | +-------------+ | +------+------+ | | +=========+ +=========+ + - - - - + | Chunk 1 | | Chunk 2 | | Chunk 3 | +=========+ +=========+ + - - - - + ↯ +-------------+ | Root 4 | +-------------+ | +-------------+ - - - -- - -+ | | +-------------+ +=============+ | Root 3 | | Chunk 4 | +-------------+ +=============+ | +-------------+-------------+ | | +-------------+ +=============+ | Root 1 | | Chunk 3 | +-------------+ +=============+ | +------+------+ | | +=========+ +=========+ | Chunk 1 | | Chunk 2 | +=========+ +=========+

I think the first interpretation is correct, so how about something like:

In a balanced DAG, nodes representing chunks of data are added to a single root until the maximum number of links is reached. Then, two new nodes are created: a second node like the first (to which we add new nodes representing chunks of data until it reaches the maximum number of links), and a new root that links to it and to the original root. In all cases, the data (chunk nodes) is stored only at the leaves, with the rest of the nodes (internal nodes) only storing links to their children and the size of the data stored under them.

[diagram]

That naturally leads to me on one more big question, though, which is: what happens when the maximum number of links in a root higher than the first level is reached? i.e., we add a 5th chunk to the diagram above?

Do we mirror the structure so the whole graph has the same depth: +-------------+ | Root 6 | +-------------+ | +--------------------------+----------------------------+ | | +-------------+ +-------------+ | Root 3 | | Root 5 | +-------------+ +-------------+ | | +-------------+-------------+ +-------------+ - - - - | | | +-------------+ +-------------+ +-------------+ | Root 1 | | Root 2 | | Root 4 | +-------------+ +-------------+ +-------------+ | | | +------+------+ +------+------+ +------+ - - - | | | | | +=========+ +=========+ +=========+ +=========+ +=========+ | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 | +=========+ +=========+ +=========+ +=========+ +=========+ Or do we only make it as deep as it needs to be for the data we have: +-------------+ | Root 5 | +-------------+ | +---------------------+--------------------+ | | +-------------+ +-------------+ | Root 3 | | Root 4 | +-------------+ +-------------+ | | | +------+ - - - | | | +=========+ | | Chunk 5 | | +=========+ | +-------------+-------------+ | | +-------------+ +-------------+ | Root 1 | | Root 2 | +-------------+ +-------------+ | | +------+------+ +------+------+ | | | | +=========+ +=========+ +=========+ +=========+ | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | +=========+ +=========+ +=========+ +=========+

Are leaves always at the same depth, or only as deep as they need to be? (And, in the second diagram above, when we add the 7th chunk, do we create a new root at the top of the whole graph, or do we insert a parent between root 4 and root 5?)

Do we mirror the structure so the whole graph has the same depth

Great question! You mirror what you already had, and that term definitely needs to go in the description (yes, one nice graph can clarify all of that), the balanced adjective already implies it (the 1st graph is balanced and the 2nd isn't) but may not be clear enough. The problem with the old explanation:

Then, a new root is created, and points to the old root, and incorporates a new child, which proceeds to be filled up (link) to more leaves.

Is that it is correct for the first iteration (depth of 1), but saying that a root's child (or any child for that matter) will be filled with leaves sound (to me) more like:

+-------------+ | Root 4 | +-------------+ | +-------------+ - - - -- - -+ | | +-------------+ +=============+ | Root 3 | | Chunk 4 | +-------------+ +=============+ | +-------------+-------------+ | | +-------------+ +=============+ | Root 1 | | Chunk 3 | +-------------+ +=============+ | +------+------+ | | +=========+ +=========+ | Chunk 1 | | Chunk 2 | +=========+ +=========+

because in deeper depths a root's grandchild won't be a chunk. This graph is actually the one you associated with the new explanation that talks about nodes (generic, not internal or leaves, because that depends on the depth), I'm interested in why you reached that conclusion. But the new explanation may also be too plain, let me iterate a bit on your version and ping you back. I think the mirror term is the key (and the graph, no question), that's why I talk about a B-tree in the package doc, but that may not be the most appropriate data structure.

This graph is actually the one you associated with the new explanation that talks about nodes… I'm interested in why you reached that conclusion.

Ah, sorry if I didn’t make that clear. Here’s the sticky bit:

Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up

So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.

(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)

I think the mirror term is the key

I do think “mirror” is useful, BUT I also wrote all of this comment before getting to the bottom where you wrote “Balanced DAGs are generalistic DAGs in which all leaves are at the same distance from the root,” which totally resolved the question for me. Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.

+-------------+ | Root 6 | +-------------+ | +--------------------------+----------------------------+ | | +-------------+ +-------------+ | Root 3 | | Root 5 | +-------------+ +-------------+ | | +-------------+-------------+ +-------------+ - - - - | | | +-------------+ +-------------+ +-------------+ | Root 1 | | Root 2 | | Root 4 | +-------------+ +-------------+ +-------------+ | | | +------+------+ +------+------+ +------+ - - - | | | | | +=========+ +=========+ +=========+ +=========+ +=========+ | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | | Chunk 5 | +=========+ +=========+ +=========+ +=========+ +=========+

I'm just realizing that most of the graphs have too many Root nodes, if by that we're meaning a node that was at some point a root node created by Layout (and has since become an internal node when the DAG grew in depth). Root 1 is correct but Root 3 is actually Root 2 (if we're numbering in chronological order), the one drawn as 2 was never actually a root node, just an internal node generated by the fillNodeRec function at the time that (what you're drawing as) Root 3 was the root node and Root 1 was just an internal node (fillNodeRec was mirroring that internal node that used to be a root). Is that clear? Maybe we could add a more descriptive legend, e.g.,

+-------------+ | Root 2 | +-------------+ | +-------------+---------+ | | +---------------+ +--------------+ | Internal 1 | | Internal 3 | | (ex Root 1) | | | +---------------+ +--------------+

(BTW, have you been doing all these graphs manually? I just did a partial redrew of one of yours and became exhausted, we definitely need a tool for this.)

most of the graphs have too many Root nodes, if by that we're meaning a node that was at some point a root node created by Layout… the one drawn as 2 was never actually a root node

Ah, I didn’t mean anything that specific by root; only that it is the root of some sub-graph. I was trying not to change the name of a node between two diagrams that represent the graphs at different points in time and also trying not to introduce too much complexity. If you found that confusing, though, maybe it was a bad approach.

Maybe just “node?” (Still keep the leaves as “chunk,” though.) Not really sure.

+-------------+ | Node 1 | +-------------+ | +------+------+ | | +=========+ +=========+ + - - - - + | Chunk 1 | | Chunk 2 | | Chunk 3 | +=========+ +=========+ + - - - - + ↯ +------------+ | Node 2 | +------------+ | +-------------+-------------+ | | +-------------+ +-------------+ | Node 1 | | Node 3 | +-------------+ +-------------+ | | +------+------+ +------+ - - -+ | | | | +=========+ +=========+ +=========+ + - - - - + | Chunk 1 | | Chunk 2 | | Chunk 3 | | Chunk 4 | +=========+ +=========+ +=========+ + - - - - +

Root 1 is correct but Root 3 is actually Root 2 (if we're numbering in chronological order)

Yeah, after reading the code, I realized this was the case. I was sort of numbering them chronologically, but it wasn’t clear from the description which was the right order (I don’t think the description actually needs to make it clear — it’s not important to a consumer of the package and would add unnecessary complex detail to a package-level explanation). It would be a good idea to renumber the chronologically anyway, though :)

have you been doing all these graphs manually?

Yes, I have probably drawn too many weird ASCII diagrams at this point in my life :P

For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.

Ah, sorry if I didn’t make that clear. Here’s the sticky bit:

Then, a new root is created to point to the old root (incorporating it as its first child), which proceeds to be filled up

So according to that sentence, the new root is the the thing that “gets filled up (link) to more nodes.” That’s technically true, but since it doesn’t say anything about creating intermediary parents/roots/internals/whatever, it sounds like we are filling the rest of the new root with leaf nodes.

(The old explanation was clearer for me on this point because it says “incorporates a new child, which proceeds to be filled up.” It’s explicit about it being the new child that is filled.)

Yes, I see it now, thanks. I think I will also reserve the term fill for the parent of the leaf nodes to make that extra clear (although any node can be filled, but just for the sake of the explanation).

Just moving that from the bottom to be directly below this introductory paragraph (or below a diagram which is below the intro paragraph) would also be plenty clear, I think.

Ok. I think the mirror term can be especially useful in the Layout explanation when we're talking about how DAGs are built, and leave the balanced term for the presentation of the full DAG as you said.

It would be a good idea to renumber the chronologically anyway, though :)

Yes, let's use correct numbering without explaining them (at least in the intro, maybe explain them in the Layout function, which might be read more by an interested developer than by the consumer of the package).

For diagrams like this, I’m not sure they’re so complex that finding and making a tool do what you want is really all that worthwhile (any more complex and it is, though). I will say that an editor with rectangular selections (TextMate, VS Code, etc.) is essential, though. You have to be able to pick up chunks of the graph and move them around, otherwise manipulating these graphs is insanely painful.

Yes, I use the Sublime editor for that. My worry is not complexity but time consumption, as of now I haven't seen any diagrams like these in the code base, and if we want to incentivize their incorporation we need to make it damn easy for the developer to do it, otherwise it won't be done; and we need to find a tool for that, so the developer can focus exclusively (or as much as possible) in what is being done (the layout of the DAG) instead how it will be done (the editing part, not everyone is an ascii ninja like you :)).

Mr0grog · 2018-06-20T18:59:05Z

importer/balanced/builder.go

+// raw nodes). In the case the entire file fits into just one node it
+// will be formatted as a (single) leaf node (without parent) with the
+// possible representations already mentioned, this is the only scenario
+// where the root can be of a type different that the UnixFS node.


I’m not sure what “Filestore nodes (supported by raw nodes)” means.

In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.

It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.

I’m not sure what “Filestore nodes (supported by raw nodes)” means.

I left a vague description on purpose because I don't know exactly how Filestore operates, just that it's stored on raw nodes, support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.

In “possible representations already mentioned, this is the only scenario”, replace the comma with a period or put the “this is the only scenario…” part in parentheses. That helps keep each sentence to one clear idea and makes them much more readable.

It also might be helpful to say “(see the go-ipfs/unixfs package for details of UnixFS.)” at the end. Or something like that to point to more details on UnixFS.

👍

I left a vague description on purpose because I don't know exactly how Filestore operates

Ha! That’s reasonable, but can you point to a package where the reader can dig more into it (like I suggested with UnixFS)?

support in communication protocols terminology is which type of lower layer packet carries your data, but I can just say that it's a format stored on a raw node.

This part matches what I thought, but I think your more plain language version (“format stored on a raw node”) is better. 👍

Mr0grog · 2018-06-20T19:02:11Z

importer/balanced/builder.go

 //
-// Balanced DAGs are generalistic DAGs in which all leaves
-// are at the same distance from the root.
+// The nodes are filled by recursion (in a DFS way) so the DAG is built


I’m not sure what DFS is (guessing “depth-first search,” but also this isn’t really a seach). Maybe spell this one out? (“DAG” is ok because it’s not ambiguous in this context and we’re repeating it a lot.)

Actually, on second thought, maybe you don’t need to say “(in a DFS way)” at all here. It doesn’t seem like it’s saying anything more useful than what “so the DAG is build from the bottom up” already makes clear.

In a manner similar to a depth-first search, I think that concept can be much more familiar to the developer than the DAG term (even if we repeated a lot, it's not that common compared to a basic tree search). But yes, the "bottom up" term might be more clear here, I'll put the depth-first search at the end of the sentence to not distract the reader from the main point.

I think that concept can be much more familiar to the developer than the DAG term

Totally! Sorry, I meant that I don’t think the abbreviation is as clear. I’d heard of “depth-first [search/traversal/construction/whatever]” long before I’d ever learned about “directed acyclic graphs,” but I’ve never seen “DFS” used to refer to it (at least I don’t recall it).

Mr0grog · 2018-06-20T19:17:27Z

importer/balanced/builder.go

+// root). In this way the DAG acts like a B-tree (a more appropriate data
+// structure may be in order for this analogy), with each internal node
+// using the file size of its children as an index to perform the search
+// (seek) operation.


How about this rewording:

The nodes are filled recursively, so the DAG is built from the bottom up. Leaf nodes are created first using the chunked file data and its size. The size is then bubbled up to the parent (internal) node, which aggregates all the sizes of its children and bubbles that combined size up to its parent, and so on up to the root. This way, a balanced DAG acts like a B-tree when seeking to a byte offset in the file the graph represents: each internal node uses the file size of its children as an index when seeking.

Then yes, I better lose the DFS term, because that will clash with the explanation of the actual search, which is not of a DFS type, I'll just put an emphasis on the recursive aspect of the construction (that doesn't belong to the search).

Mr0grog · 2018-06-20T19:17:40Z

importer/balanced/builder.go

+// (seek) operation.
+//
+// Balanced DAGs are generalistic DAGs in which all leaves are at the
+// same distance from the root.


Ah! I guess this answers my question from the top :)

Maybe a good idea to move this paragraph up there.

Mr0grog · 2018-06-20T20:35:38Z

importer/helpers/dagbuilder.go

+	return node, nil
+}
+
+// FSNodeOverDag encapsulates an `ft.FSNode` that will be stored in a


This might just be my general newness to Go, but I find the ft here confusing (I know I can look up the fact that it’s referring to the unixfs package at the top, but I keep having to do that when I’m new to files AND none of that shows up in the Godoc). What if we said:

FSNodeOverDag encapsulates a UnixFS fs.FSNode

? Is that any better, or is it just confusing in a different way?

It is definetely not you (see #5056).

? Is that any better, or is it just confusing in a different way?

I'd rather rename the package to its full name and have unixfs.FSNode (and yes, the FS in FSNode is also a confusing redundancy, that's what namespaces are for).

Mr0grog · 2018-06-20T20:38:50Z

importer/helpers/dagbuilder.go

+// internal `FSNode` in the process of creating a UnixFS DAG, this
+// structure stores an `FSNode` cache to manipulate it (add child nodes)
+// directly , and only when the node has reached its final (immutable)
+// state it is committed to a single (indivisible) `ipld.Node`.


“state it is committed” → “state is it committed”

I might add that moving to its final state is done by calling Commit:

…and only when the node has reached its final state (signaled by calling Commit()) is it committed…

Otherwise it sounds like that happens automatically as a result of doing some other routine action.

Mr0grog · 2018-06-20T20:42:01Z

importer/helpers/dagbuilder.go

+// representations of data leaf nodes (that don't use raw nodes or
+// Filestore).
+//
+// It aims at replacing the `UnixfsNode` structure which encapsulated too


“at replacing” → “to replace”

Mr0grog · 2018-06-20T20:48:04Z

importer/helpers/dagbuilder.go

+}
+
+// SetFileData stores the `fileData` in the `ft.FSNode`. It
+// is used only when `FSNodeOverDag` represents a leaf node


“is used” → “should be used” ?

This code doesn’t control how it’s called, so this should be a recommendation to the reader about how to use it, not an assertion about what some other part of the codebase might or might not do.

Mr0grog · 2018-06-20T20:49:32Z

importer/helpers/dagbuilder.go

+// that represents them: the `ft.FSNode` is encoded inside the
+// `dag.ProtoNode`.
+//
+// TODO: Evaluate making it read-only after committing.


I like this idea (or having a dirty flag) :)

schomatis · 2018-06-20T21:14:39Z

@Mr0grog 😍

schomatis · 2018-06-21T23:19:46Z

With code covering a conceptually complicated area like this, I think diagrams can help a lot. They take up a lot of space (and a little time to create), but they are more than worth the effort for others trying to understand the code. (I used a similar technique extensively in a couple geometry libraries I’ve worked on in the past and had lots of folks say it was very helpful.)

I agree 100% on this, that's why I asked about tools to programmatically construct graphs like the ones you did (ipfs-inactive/docs#78), graphmd didn't seem to be suited for this, I'll take a closer look at Graphviz to see what kind of text graphs it can generate.

schomatis · 2018-06-25T14:45:25Z

@Mr0grog Could you take a look at the documentation fixes in the last commit?

I moved most of it to the Layout and fillNodeRec functions, leaving only a basic description of the balanced DAG in the package doc. Thinking about what you said, the consumer of the package will only be interested in what layout its data will be organized in, not how that layout is built, that should be left for someone reading the actual functions in charge of that.

schomatis · 2018-06-25T14:51:51Z

I've made several changes to the comments of fillNodeRec you gave me, they weren't totally accurate with what the functions does, but my changes didn't end up being very convincing either, so take a closer look there please.

schomatis · 2018-06-26T01:04:35Z

@Stebalien @whyrusleeping While we give the final touches to the documentation, could you take a look at the code please? Especially the added FSNodeOverDag and ProcessFileStore.

whyrusleeping

This looks pretty clean to me, great job!

whyrusleeping · 2018-06-26T19:10:53Z

importer/helpers/dagbuilder.go

+	}
+
+	// Convert this leaf to a `FilestoreNode` if needed.
+	node = db.ProcessFileStore(node, dataSize)


the downside of doing things this way is that we hold the node in memory for longer than we really need to (technically just need to hash the data, could be done entirely on stack if go let us)

I'm not following, this way refers to this PR or in general? From what I understand, we normally hold work-in-progress nodes to avoid constantly (un)packing while working on them, reducing computation time at the cost of more memory, but I'm not sure if you're referring to that.

Yeah, i'm just referring to the fact here that this 'node' is a full-on dag RawNode that holds a reference to the 256k buffer where it technically doesnt need to (since its just a raw data buffer under the hood).

Yes, something that I'm seeing throughout the entire code is this mixture of node structures that are used interchangeably for (i) reading/writing and for (ii) building/cache. Ideally, if you are retrieving content (or storing what you received from another peer) the first type is just fine, but if you are building stuff yourself (and that is the very common operation of just adding content with ipfs add) then you need an entirely different object, that yes, will look like the first one (and the first one will even be its end product) but actually will behave very differently: it will cache many things while in the process of modifying them, it will have information in a mutable state, it will allow (what the first type would see as) invalid states.

Ideally again, I would expect to have different structures (with a common base or shared methods) to reflect that. And going back to the original raw node comment, yes, someone reading (or writing) the node needs the buffer, but someone building it may only need partial information like the hash.

Sorry, I'm not sure if you're original remark was going in this direction, but it made me think of this situation that I was encountering repeatedly while working on this PR.

schomatis · 2018-06-28T02:02:43Z

@Stebalien PTAL

schomatis · 2018-07-04T15:30:43Z

@Stebalien PTAL

Mr0grog · 2018-07-06T19:21:25Z

importer/balanced/builder.go

-// are at the same distance from the root.
+// Package balanced provides methods to build balanced DAGs, which are generalistic
+// DAGs in which all leaves (nodes representing chunks of data) are at the same
+// distance from the root. Nodes can have only a maximum number of children, to be


Really minor, but the comma after children should be a period or a semicolon.

Mr0grog · 2018-07-06T19:30:58Z

importer/balanced/builder.go

+//      |             |             |             |             |
+// +=========+   +=========+   +=========+   +=========+   +=========+
+// | Chunk 1 |   | Chunk 2 |   | Chunk 3 |   | Chunk 4 |   | Chunk 5 |
+// +=========+   +=========+   +=========+   +=========+   +=========+


If you indent this whole block, it renders much nicer in godoc. Current:

Indented:

Mr0grog · 2018-07-06T19:56:33Z

importer/balanced/builder.go

+// `fillNodeRec` will return and the loop will be repeated: the `newRoot` created
+// will become the old `root` and a new root will be created again to increase the
+// depth of the DAG. The process is repeated until there is no more data to add
+// (`Done()`).


This bit about Done() doesn’t seem totally clear. Maybe:

…no more data to add (i.e. the DagBuilderHelper’s Done() function returns true).

Mr0grog · 2018-07-06T19:59:59Z

importer/balanced/builder.go

-// and depth only increases when the tree is full, that is, when
-// the root node has reached the maximum number of links.
+// Layout builds a balanced DAG layout. In a balanced DAG of depth 1, leaf nodes
+// with data are added to a single `root` until the maximum number of links is


Looking at this, especially in the context of the generated godoc, I wonder if most of this should move to a comment inside the the function so it doesn’t read as public documentation (it’s definitely still helpful clarification for someone working on the balanced builder here). Otherwise, it reads as a lot of detail about how this function works that isn’t necessary (and might be confusing) for a user of the function.

Totally true, but I can't find any other place in the function for this information, Layout actually has only a few code lines, none of which could be easily associated with the two introductory paragraphs, the "magic" is happening in the recursion call, which is hiding a lot behind the scenes (something I'm not really a fan of). Anyway, I will have this advice in mind when writing documentation (that was actually my motivation when moving much of the package description inside the individual functions).

Mr0grog · 2018-07-06T20:18:24Z

importer/balanced/builder.go

+//
+// It returns the `ipld.Node` representation of the passed `node` filled with
+// children and the `nodeFileSize` with the total size of the file chunk (leaf)
+// nodes stored under this node (to propagate it upwards in the DAG to its parent).


Instead of:

(to propagate it upwards in the DAG to its parent)

I might say:

(parent nodes store this to enable efficient seeking through the DAG when reading data later)

I think it’s self-evident that the data is being propagated upward, and this describes a little more clearly why we want that information.

schomatis · 2018-07-08T14:23:42Z

Thanks @Mr0grog for the very thorough review and invaluable diagrams!! This is already shaping up to be a nice reading reference for a new user who wants to get a deeper understanding of how files are handled in IPFS.

schomatis · 2018-07-08T14:24:29Z

~~/cc @whyrusleeping RFM~~

(edit: strike that, I've just seen that 0.4.16 still hasn't been released, let's leave this PR for the next one.)

whyrusleeping · 2018-07-16T14:48:46Z

In need of a rebase now that other PRs have landed

schomatis · 2018-07-16T15:06:26Z

Fixed, waiting on the tests..

schomatis · 2018-07-16T15:15:20Z

Broke the build.

whyrusleeping · 2018-07-16T15:17:03Z

@schomatis whoops. You got a fix or should I handle?

schomatis · 2018-07-16T15:35:08Z

@whyrusleeping Fixed in #5239, should be merged first.

The `UnixfsNode` structure has multiple pointers to many (non-complementary) mutually exclusive node types, only some of them are active (not-`nil`) at a given time in the code path which made the code too convoluted. Specifically, the most important distinction between node types was being hidden: leaf nodes vs internal (non-leaf) nodes. Remove entirely the use of `UnixfsNode` from the `balanced` package replacing it in turn with the newly created `FSNodeOverDag` structure that represents the UnixFS node encoded inside the DAG node, primarily used for internal node representations. Leaf nodes are handled exclusively in the `NewLeafDataNode` encapsulating its multiple representations (that we're previously exposed in `UnixfsNode` as conflicting pointers). The `builder.go` file has been completely rewritten, although the basic DAG creation algorithm has been preserved (extending a full DAG by creating a new root and linking the old one as its child), the most significant modification has been in the loop of `Layout` that now only handles internal nodes (i.e., nodes with `depth` bigger than zero) to be able to adapt `fillNodeRec` to only that scenario (avoiding the replace logic of the zero `depth` case with the defective `Set` function, now removed). The `fillNodeRec` now explicitly returns the `ipld.Node` and the size of the file data it's storing to propagate it upwards into the DAG. The `DagBuilderHelper` was heavily extended to incorporate `ipld.Node` functions that would replace the `UnixfsNode` ones used by the balanced builder: `NewLeafNode()`, `NewLeafDataNode()` and `AddNodeAndClose()`. Also, the `ProcessFileStore` function was incorporated to encapsulate all the logic related to the Filestore support which was scattered throughout the builder logic, the `offset` that was being passed through most functions is now a part of the `DagBuilderHelper`. This has turned out to be a rather big commit, it should have been split into more smaller and logically cohesive commits, but the `UnixfsNode` was too entangled inside the logic and that would have required a progressive modification of the `UnixfsNode` structure as well, which wasn't possible as it is still being used by the balanced builder (the same reason why most of the `UnixfsNode`-related functions cannot yet be removed, leaving the `helpers.go` file mostly untouched). License: MIT Signed-off-by: Lucas Molas <schomatis@gmail.com>

schomatis · 2018-07-16T16:20:20Z

@whyrusleeping RFM now

whyrusleeping · 2018-07-16T16:34:35Z

@schomatis Woo! Thanks for all of these PRs. Making things easier to read and understand is so great

schomatis added the topic/files Topic files label Jun 14, 2018

schomatis added this to the Files API Documentation milestone Jun 14, 2018

schomatis self-assigned this Jun 14, 2018

schomatis requested a review from Kubuxu as a code owner June 14, 2018 13:11

ghost added the status/in-progress In progress label Jun 14, 2018

This was referenced Jun 18, 2018

[PoC] importer: UnixFS type Raw is never used #5120

Closed

[community contribution] Remove UnixfsNode from the trickle builder #5135

Closed

schomatis changed the title ~~[WIP] remove unixfsnode from balanced importer~~ importer: remove UnixfsNode from the balanced builder Jun 19, 2018

schomatis requested review from Stebalien and whyrusleeping June 19, 2018 15:47

schomatis removed the status/in-progress In progress label Jun 19, 2018

Mr0grog reviewed Jun 20, 2018

View reviewed changes

ghost added the status/in-progress In progress label Jun 25, 2018

whyrusleeping approved these changes Jun 26, 2018

View reviewed changes

whyrusleeping reviewed Jun 26, 2018

View reviewed changes

schomatis mentioned this pull request Jul 5, 2018

Analyzing the DAG reader #5192

Closed

Mr0grog reviewed Jul 6, 2018

View reviewed changes

schomatis added topic/UnixFS Topic UnixFS RFM and removed status/in-progress In progress labels Jul 8, 2018

ghost added the status/in-progress In progress label Jul 16, 2018

schomatis removed the status/in-progress In progress label Jul 16, 2018

ghost added the status/in-progress In progress label Jul 16, 2018

whyrusleeping merged commit 0ab36f0 into ipfs:master Jul 16, 2018

ghost removed the status/in-progress In progress label Jul 16, 2018

schomatis deleted the feat/importer/remove-unixfsnode branch July 16, 2018 16:37

schomatis mentioned this pull request Sep 26, 2018

How are files formatted inside nodes? #5059

Closed

importer: remove UnixfsNode from the balanced builder #5118

importer: remove UnixfsNode from the balanced builder #5118

Conversation

schomatis commented Jun 14, 2018 • edited Loading

whyrusleeping commented Jun 17, 2018

schomatis commented Jun 17, 2018

schomatis commented Jun 19, 2018

Mr0grog commented Jun 19, 2018

schomatis commented Jun 19, 2018

Mr0grog left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schomatis Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

Mr0grog Jun 22, 2018 • edited Loading

Choose a reason for hiding this comment

schomatis Jun 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schomatis Jun 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schomatis commented Jun 20, 2018

schomatis commented Jun 21, 2018

schomatis commented Jun 25, 2018 • edited Loading

schomatis commented Jun 25, 2018

schomatis commented Jun 26, 2018

whyrusleeping left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schomatis commented Jun 28, 2018

schomatis commented Jul 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mr0grog Jul 6, 2018 • edited Loading

Choose a reason for hiding this comment

Mr0grog Jul 6, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schomatis commented Jul 8, 2018

schomatis commented Jul 8, 2018 • edited Loading

whyrusleeping commented Jul 16, 2018

schomatis commented Jul 16, 2018

schomatis commented Jul 16, 2018

whyrusleeping commented Jul 16, 2018

schomatis commented Jul 16, 2018

schomatis commented Jul 16, 2018

whyrusleeping commented Jul 16, 2018

importer: remove `UnixfsNode` from the balanced builder #5118

importer: remove `UnixfsNode` from the balanced builder #5118

schomatis commented Jun 14, 2018 •

edited

Loading

schomatis Jun 21, 2018 •

edited

Loading

Mr0grog Jun 22, 2018 •

edited

Loading

schomatis Jun 22, 2018 •

edited

Loading

schomatis Jun 23, 2018 •

edited

Loading

schomatis commented Jun 25, 2018 •

edited

Loading

Mr0grog Jul 6, 2018 •

edited

Loading

Mr0grog Jul 6, 2018 •

edited

Loading

schomatis commented Jul 8, 2018 •

edited

Loading