Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipld/unixfs/hamt: Review and comment the HAMT package #387

Open
schomatis opened this issue Sep 28, 2018 · 4 comments
Open

ipld/unixfs/hamt: Review and comment the HAMT package #387

schomatis opened this issue Sep 28, 2018 · 4 comments
Assignees

Comments

@schomatis
Copy link
Contributor

schomatis commented Sep 28, 2018

HATM description

Brief description of what we have at the moment.

The Hash Array Mapped Trie (HAMT, superficially explained in https://en.wikipedia.org/wiki/Hash_array_mapped_trie) is at its heart a tree of DAG nodes that together represent a directory. Instead of storing all of the entries of a directory as DAG links in a single node (BasicDirectory implementation) the HAMTDirectory distributes the entries across the nodes of the tree.

The entries are indexed based on the hash of its name (to allow for an even distribution), this hash shouldn't be confused with the hash inside the CID that addresses the contents of the entry (node) we're pointing to, the HAMTDirectory cares only about the names of the links. Two entries could point to the same content (file) with different names and they would be saved in different positions in this tree.

This tree is actually a prefix tree (trie) so each node has only a part of the hash of the entry name, all of its children share the same prefix (sub-hash) of the parent node, with each children having a different hash continuation to split different entries.

From the DAG layer perspective a big graph representing a directory and the files it contains can actually be divided in first a sub-graph (starting at the root) that represent the directory and from that other subgraphs representing the files it points to.

* `[`: node belonging to the `HAMTDirectory`.
* `{`: node belonging to an entry (any UnixFS type of file,
       can be another `HAMTDirectory`) pointed to by the directory.

[ Root (no content) ]
   |
   |
   |  0xAF (prefix)
   |------------------> [0xAF]
   |                      |  
   |                      |  
   |                      |  0x14
   |                      |----------> [0xAF14] ---> {file with hashed name
   |                      |                            starting with 0xAF14}
   |                      |  
   |                      |  
   |                      |  0x65
   |                      |----------> [0xAF65] ---> {hashed name: 0xAF656E8F...}
   |
   |  0x3C
   |------------------> [0x3C] -----> {0x3C246AF3E...}
   |
   |
   |  0x12
   |------------------> [0x12]
   |                      |  
   |                      |  0x65
   |                      |----------> [0x1265] ---> {0x12653A5F1D....}
   |                      |  
   |                      |  
   |                      |  0x33
   |                      |----------> [0x1233]
   |                                      |  
   |                                      |  
   |                                      |  0xAF
   |                                      |----------> [0x1233AF] ----> {0x1233AF...}
   |                                      |  
   |                                      |  
   |                                      |  0x1D
   |                                      |----------> [0x12331D] ----> {0x12331D...}

Implementation

Notes about the implementation (focusing on the more obscure points that need to be refactored).

In a normal trie we could have values at any node in the tree (besides the root). In this trie only leaf nodes have values since we're using a hash function (of fixed length) to index them so an internal node with a value would mean an entry with a shorter hash (the node's prefix) which is not possible. In this particular implementation the leaf nodes (shardValue) are encoded inside the parent to avoid making an extra request at the DAG level (improving performance). The result is that a normal node (Shard) can have many values (entries in the directory) since it's encoding many leaf nodes internally. This is all hidden behind the child interface which obscures the code considerably, as a result we have a Label method only implemented for the shardValue (the Shard returns an empty string) and the Link method being used for very different purposes in the Shard and shardValue (one is the link to another node in the HAMTDirectory and the other is actually the value stored in the leaf node which extends beyond the directory).

Another ramification of this design decision is that to distinguish at the DAG level which link points to another HAMTDirectory node or which points to a value (an entry in the directory) the names of the links are overloaded with a hexadecimal string prefix (corresponding to the new part of the hash prefix that is added in the new edge of the trie). The maxpadlen attribute encodes the length of that string prefix in the tree and names bigger than that are links pointing to entries (where the actual name of the entry is the sub-string after the hash prefix, e.g., A3file with maxpadlen of 2) and links with names equal to maxpadlen just point to another internal node in the directory (that will have a prefix that includes this added sub-hash). Example: https://github.com/ipfs/go-unixfs/blob/master/hamt/hamt.go#L278. The result is the Shard holding an attribute (prefixPadStr) which just encodes a string used for a printf call which is not easy to interpret correctly (https://github.com/ipfs/go-unixfs/blob/master/hamt/hamt.go#L87).

Since this is already encoded in existing sharded directories already deployed
I don't think that it will be possible to be modified (extracting this information from the DAG into the UnixFS layer) but should be clearly documented and encapsulated in a separate function.

  • TODO: Distinguish between the full hash of an entry, the hash prefix represented by the position of a node, and the hash part added to the prefix while traversing down a new edge of the trie.

There are two layers interacting here, DAG and UnixFS. When a directory is being manipulated (adding/removing entries) the information is encoded in the UnixFS layer (through the children attribute) and only flushed to the DAG layer (the node containing all this information) when Node() is called, but at times there is a decoupling which is to be expected but not clearly documented, and without a clear API of how to interact with the Shard in that state, e.g., how to enumerate its entries: should the links of the DAG node be checked or the children slice? At times there is only one of those that have the accurate information, e.g., when loading a Shard from a DAG node children is empty (the DAG links should be checked) and when modifying the Shard the information is encoded in the children but not passed to the DAG node. As an example of what would be needed see the FSNodeOverDag which defines a specific API to interact with a mutable object between the two layers.

The internal ProtoNode of the Shard (nd) isn't really used except as a link slice so if the interaction previously described is clearly defined (and correctly decoupled) there is no need to keep this cache node (which isn't even updated when Node is called), only the links need to be kept structuring them as just a performance improvement: instead of loading all the child nodes immediately we keep the links and request them on an as-needed basis (but they shouldn't be considered to hold the actual state of the node). This may not be possible since the new hash part is encoded in the link name (instead of inside the Shard object of the UnixFS layer where it belongs).

  • TODO: Revisit the "shard" term, it seems too generic, is it really associated only with directories at this layer? I'd rather use the normal node term and delimiting it to which sub-DAG it belongs (i.e., to which layer), e.g., "nodes of the UnixFS HAMTDirectory".

===

Sub issues:

@schomatis schomatis self-assigned this Sep 28, 2018
@schomatis
Copy link
Contributor Author

@overbool you up for some real challenge? 😁 I may throw some tasks at you to help me understand what's going on there.

@overbool
Copy link

@schomatis 👌, I am willing to accept the challenge, come on.

@zot
Copy link

zot commented Oct 18, 2020

Is this experimental? I see UseHAMTSharding = false and sourcegraph can't find any references to the variable outside the project...

@aschmahmann
Copy link
Contributor

@zot I don't know how good sourcegraph is at finding things, but here are a few references https://grep.app/search?q=UseHAMTSharding

HAMT Sharding is experimental within go-ipfs https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt

@Jorropo Jorropo changed the title Review and comment the HAMT package ipld/unixfs: Review and comment the HAMT package Jun 26, 2023
@ipfs ipfs deleted a comment from welcome bot Jun 26, 2023
@Jorropo Jorropo transferred this issue from ipfs/go-unixfs Jun 26, 2023
@Jorropo Jorropo changed the title ipld/unixfs: Review and comment the HAMT package ipld/unixfs/hamt: Review and comment the HAMT package Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

4 participants