Explain how files are managed in IPFS. #251

Closed
schomatis opened this issue Nov 12, 2018 · 10 comments

Comments

@schomatis

ETA (Dec 2019)

In light of the new nav/IA structure on the docs beta site, let's use the content spec'd out in this issue to enhance the beta page currently located at https://github.com/ipfs/ipfs-docs-v2/edit/master/docs/concepts/file-systems.md .

Original issue text follows ...

This isn't the documentation itself (I'm not qualified to write that) but the main content that should be included in it along with some other pointers/questions as to how to approach the subject. Feel free to edit content directly on the issue (created as an issue for easy editing).

The objective of this document is to provide an easy read for both new users and developers who want to understand the high-level architecture of how files are managed in IPFS. This document should also work as the reference for the low-level code comments and mid-level usage examples (like ipfs/kubo#5052), to avoid repeating explanations and to have a clear source of truth.

The master document should actually be the specification of the Files API layers, but since those documents are missing at the moment this is the best second-choice reference and we should actually strive to write this document before working on the full specs (that is, if we can't finish this up let's not even bother trying to write the specs). Compared to a specification the concept document should be much more accessible and user-friendly, e.g., the specification is useful for someone already familiar with the stack who wants to maybe code a different implementation while the "concepts" site is the entry point for someone who doesn't know anything about the subject (it should be useful as a first read).

This is still a work in progress, I'll be developing this content in parallel to a mid-level document (about how to use the ipfs files command and how it works under the hood ipfs/kubo#5052) to assess what information would be needed here to follow that guide.

We need diagrams and (maybe) some basic usage examples (or a link to them).

There are at the moment two light-weight concept documents about UnixFS and MFS which should be absorbed here and we should present a single "Files" concept document as an entry point to the entire subject (as the names UnixFS and MFS are meaningless to a new reader).

[TODO: List issues in the milestone (https://github.com/ipfs/go-ipfs/milestone/38) that this document should close.]

Files abstraction layers (stack)

This is the central concept: how we divide work into different layers, each one abstracting some part of the IPFS model. Maybe we could add here (or link to somewhere else) a short preface on what it means to be in a content-addressed paradigm, where paths are just an abstraction and do not hold any hierarchy of their own.

We should have a simple diagram that would show how each layer builds on the one below it, e.g.,

     +---------+
     |   MFS   |   File system.
     +---------+
     |  UnixFS |   Files and directories.
     +---------+
     |   DAG   |   Nodes.
     +---------+
     |  Block  |   Stream of bits.
     +---------+

that could reflect the path from a block to a DAG ProtoNode, to a DAG of UnixFS nodes, and finally to a filesystem hierarchy in MFS.

The block and DAG layers could be extracted elsewhere (although I would first attempt to expand on them here to get a firm sense of how they fit in the stack and then consider putting them in a document of their own).

The TCP/IP stack example (shown in the UnixFS section) could be presented in this intro instead (but may be too much information since the user hasn't seen any format yet).

Block

This is the most basic data unit: a stream of raw bits. It's how we store and transmit information.

Explain how we address blocks, and differentiate between the CID and the hash. Let's provide a Qm... example here to later use that notation (avoiding CIDv1 for now). Show an /ipfs/CID example path (this may already exist in another document, just point to that), and explain that we prefix the CID with the type of interpretation or context for it (e.g., /ipfs/ or /ipns/). No need to talk about datastores, just know that IPFS stores blocks as key-value pairs where the key is the CID (hash) of the block.
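To make the "key is the CID of the block" idea concrete, here is a minimal sketch in Python. It is not the IPFS implementation: `BlockStore` is a hypothetical in-memory store, and the identifier is built CIDv0-style as base58 of a sha2-256 multihash over the raw bytes. Since `ipfs add` wraps content in nodes before hashing, the string printed here is not expected to match what a real node would report for the same bytes.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(raw: bytes) -> str:
    """Encode bytes with the Bitcoin base58 alphabet (used by CIDv0)."""
    number = int.from_bytes(raw, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by a leading '1'.
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def cid_v0(block: bytes) -> str:
    """Build a CIDv0-style string: base58(multihash of the block)."""
    digest = hashlib.sha256(block).digest()
    # Multihash prefix: 0x12 = sha2-256, 0x20 = 32-byte digest length.
    multihash = bytes([0x12, 0x20]) + digest
    return base58_encode(multihash)


class BlockStore:
    """Hypothetical in-memory key-value store: CID -> raw block bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, block: bytes) -> str:
        cid = cid_v0(block)
        self._blocks[cid] = block
        return cid

    def get(self, cid: str) -> bytes:
        return self._blocks[cid]


store = BlockStore()
cid = store.put(b"hello block")
print("/ipfs/" + cid)  # a path: namespace prefix plus the CID
```

The hash is only one part of the identifier: the two prefix bytes say which hash function and digest length were used, which is the distinction between "the hash" and "the CID" that the text should draw.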

DAG - IPLD Nodes

There's plenty of confusion (at least for me) in this subject (especially related to IPLD), but let's first explain the basic concept: we format those blocks adding the property of links allowing them to connect into a directed acyclic graph (DAG) and call them nodes.

What type of nodes should we mention here? Again, first give the clearest definition before going into details: a node is a block formatted with the link and data attributes, that is, it has its own data and points to other nodes with more data. From the Go implementation what I'm most interested in mentioning is the ProtoNode (since it's the most common one, no need to mention raw nodes), but maybe a more generic IPLD notion could be given here (with care not to over-complicate things at this early point).

We should emphasize at this point the basic property of the Merkle DAG (maybe point to the Merkle tree, which is more widely known and conceptually very similar for what we want to explain) that will have repercussions in the upper layers: content-addressing (comparing it to path/location-addressing). Content is fixed; mutating/editing means recreating the node. Mutation needs to propagate upwards in the DAG: upper nodes will still "point" to the old content, so we need to update/edit them as well, meaning we'll need to recreate them also (because changing a link in the node changes the block contents and therefore its hash). This "propagation of change" goes all the way up to the root itself. That should have a name, e.g., "mutation path": all the nodes that need to be replaced for a certain change to take effect across the DAG. (Important concept for MFS and UnixFS.)

[We definitely need a diagram here, e.g.,]

     +-------------+
     |    QmRoot   |
     +-------------+
            |                
            |           +---------------+        
            +---------> |    QmChild    | 
                        +---------------+        

Update `QmChild` contents, what's the result? `QmChild`!! We can't change that,
we can generate a new node based on `QmChild` but with our desired modifications,
the result is `QmNewChild`:

     +-------------+
     |    QmRoot   |
     +-------------+
            |                
            |           +---------------+        
            +---------> |    QmChild    | 
                        +---------------+        

                        +---------------+        
                        |  QmNewChild   | 
                        +---------------+     

What happened? Our root is still pointing to the old content, and we can't
change that, since the link (how it points) is embedded as part of its
contents, so actually we need a new root!

     +-------------+
     |  QmNewRoot  |
     +-------------+
            |                
            |           +---------------+        
            +---------> |  QmNewChild   | 
                        +---------------+     

     +-------------+
     |    QmRoot   |
     +-------------+
            |                
            |           +---------------+        
            +---------> |    QmChild    | 
                        +---------------+        

So we effectively end up with two different DAGs, we can't overwrite old content,
we just duplicate and modify it.

[and not an ugly diagram like this example but a nice one (with colors!!) like this. Also, not even sure if we want to be using the CID terminology here, we need to reflect the fact that we are talking about content (and not node positions in a diagram, since this isn't location-addressed) but the Qm notation is awful, and I would like to write names with spaces, not NewChild/OldChild.]
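The same sequence as in the diagrams can be sketched in a few lines of Python. This is only an illustration under assumptions: `Dag` is a hypothetical in-memory store, and the identifier is a plain hex SHA-256 over a JSON encoding of the node's data and links, standing in for a real CID.

```python
import hashlib
import json


def node_id(data: bytes, links: dict) -> str:
    """Stand-in for a CID: hex SHA-256 over the node's data AND its links."""
    payload = json.dumps({"data": data.hex(), "links": links}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class Dag:
    """Hypothetical store of immutable nodes keyed by their content hash."""

    def __init__(self):
        self.nodes = {}  # id -> (data, links)

    def put(self, data: bytes, links: dict = None) -> str:
        links = dict(links or {})
        nid = node_id(data, links)
        self.nodes[nid] = (data, links)
        return nid


dag = Dag()
child = dag.put(b"child contents")
root = dag.put(b"root contents", {"child": child})

# "Editing" the child really creates a new node with a new identifier.
new_child = dag.put(b"edited child contents")

# The old root still links to the old child, so a new root is needed too.
root_data, root_links = dag.nodes[root]
new_root = dag.put(root_data, {**root_links, "child": new_child})
```

After running this the store holds four nodes: the old root still links to the old child, and the new root links to the new child. Nothing was overwritten.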

Another concept (useful for the MFS layer) worth expanding on is the root. There is no actual root, it's a relative term that depends on the entity we are abstracting. That root itself may be the child node of another node in a bigger DAG. So, similarly, there is not a single DAG, it's just the sub-DAG we're referring to at the moment (which may be part of a bigger DAG, and we never know how big that DAG is, since there may always be other unknown -to us- nodes pointing to it). [We need to rephrase this but the general idea is that even though we may not always mention it root and DAG are relative terms and the reader should always stop and think relative to what.]

(If we end up extracting this section to other documents we'll need to be careful how we divide things like the previous paragraphs with concepts closely related to both the DAG and UnixFS/MFS layers.)

The name DAG (mostly in the code, not sure if in the high-level documents as well) is usually associated with the layer being described here and not with the generic graph concept (since our "DAG" term comes from MerkleDAG where the "Merkle" -content-addressed- part is crucial), but many parts of the architecture can be thought of as a DAG (just not of IPLD nodes but for example of other sub-graphs as is the case of MFS).

[TODO: Pending review of the following two sections.]

UnixFS

Most important attribute is the file offset.

We split files into chunks and distribute them in a DAG (not sure if we should explain why or just say "to make them easier to modify and distribute"). Each chunk is a node holding a part of the file's content, interconnected with other nodes. If you need to read offset X you'll need to traverse the graph to find the node that contains that offset. How do you know which one is the correct node? Because UnixFS nodes add more attributes, like the offset of the data they contain (let's not mention the attribute names, which in many cases do not clearly represent what they are).
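A small sketch of the offset lookup, under simplifying assumptions: the file is split into fixed-size chunks, there is a single flat level under the root, and the root records the size of each chunk so that offsets can be derived by adding sizes up. The names (`add_file`, `read_at`, `CHUNK_SIZE`) are made up for this example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; an assumption for the example only


def put(store: dict, data: bytes) -> str:
    """Store a chunk under the hash of its contents and return that key."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def add_file(store: dict, content: bytes) -> list:
    """Split content into chunks; return the root's (chunk_id, size) entries."""
    entries = []
    for start in range(0, len(content), CHUNK_SIZE):
        chunk = content[start:start + CHUNK_SIZE]
        entries.append((put(store, chunk), len(chunk)))
    return entries


def read_at(store: dict, root_entries: list, offset: int) -> int:
    """Return the byte at `offset` by finding the chunk that covers it."""
    chunk_start = 0
    for chunk_id, size in root_entries:
        if offset < chunk_start + size:
            return store[chunk_id][offset - chunk_start]
        chunk_start += size
    raise IndexError("offset past end of file")


store = {}
root = add_file(store, b"hello world")
byte = read_at(store, root, 6)  # falls in the second chunk, b"o wo"
```

Only the chunk that covers the requested offset is fetched from the store; the others are skipped using the sizes alone.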

At this point it would be useful to have a diagram to represent the different layers and the information encapsulated in each layer, e.g., this UDP/IP diagram, which shows what the "data" (a term largely misused in the code base) is for the different layers. (It may also be useful to borrow the header+data model and terminology.) Following that example we could say that the block's data (e.g., link layer in the diagram) comprises both the DAG node header information (with the links attribute) and the node's data itself. Then at the DAG layer (e.g., Internet layer) the DAG node's data is unpacked to another (also called) node of the UnixFS type (whether to call this a node, giving the impression that we may be unpacking a node from itself, has long been discussed and should be clearly documented). The header/metadata of this UnixFS node has, for example, the offset in the file that the UnixFS node's data belongs to (e.g., transport layer).

it's more general than the Files API but this is an excellent example
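The header+data layering described above can be sketched as nested encodings. This is an illustration only: JSON is used as a stand-in for the real wire format, and the field names (`links`, `type`, `offset`) and the `QmOtherChunk` link are placeholders chosen for readability.

```python
import json


def encode(header: dict, data: bytes) -> bytes:
    """Pack a header and a payload into bytes (JSON stand-in encoding)."""
    return json.dumps({"header": header, "data": data.hex()}).encode()


def decode(raw: bytes):
    """Unpack bytes into (header, payload)."""
    obj = json.loads(raw.decode())
    return obj["header"], bytes.fromhex(obj["data"])


# UnixFS layer: metadata about the file, plus a piece of the file's content.
unixfs_bytes = encode({"type": "file", "offset": 1024}, b"file content")

# DAG layer: links to other nodes, plus opaque data (the UnixFS node above).
dag_bytes = encode({"links": ["QmOtherChunk"]}, unixfs_bytes)

# Block layer: just the bytes that get stored and transmitted.
block = dag_bytes

# Reading goes the other way, peeling one layer at a time.
dag_header, dag_data = decode(block)
unixfs_header, file_data = decode(dag_data)
```

Each layer treats the layer above it as opaque "data", which is why the same word ends up meaning different things depending on where in the stack it is used.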

An implicit concept in the code that should be clarified here and given a name is the mechanism by which we refer to an entire file by only the root UnixFS node of that file DAG (another term worth introducing). Why? Because with the root node the UnixFS layer can decode the entire file (traversing the DAG), and since we are in a content-addressed system that node (through its hash) encodes all the nodes below it. So the result is that we refer to that root node as the entire file; we often just name it node in the code when it's actually representing much more than a single node: the entire file. To start changing that we should introduce the concept of a "root file node" (rename) to make this clear, since it's heavily used in the MFS layers where we manipulate files and directories just by managing a single UnixFS node.

Haven't mentioned directories up to now. A simple example can be given; the most important aspect is that directory entries in the UnixFS layer rely on the links of the DAG layer, which is an excellent example of how these layers interact. (Let's not mention HAMT directories here, or provide a link to another document at most.)
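A possible simple example, sketched under assumptions: nodes are plain dictionaries stored under the hex SHA-256 of their JSON encoding (a stand-in for CIDs), and a directory is just a node whose named links point at the root nodes of its entries. The helper names `put` and `resolve` are made up for this sketch.

```python
import hashlib
import json


def put(store: dict, node: dict) -> str:
    """Store a node under the hash of its encoded contents."""
    encoded = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(encoded).hexdigest()
    store[key] = node
    return key


def resolve(store: dict, root_id: str, path: str) -> str:
    """Follow named links from root_id along path; return the final node id."""
    node_id = root_id
    for name in (part for part in path.split("/") if part):
        node_id = store[node_id]["links"][name]
    return node_id


store = {}
bashrc = put(store, {"type": "file", "data": "alias ll='ls -l'", "links": {}})
user_dir = put(store, {"type": "dir", "links": {"bashrc": bashrc}})
home_dir = put(store, {"type": "dir", "links": {"user": user_dir}})

found = resolve(store, home_dir, "user/bashrc")
```

The directory holds no file content of its own: listing it means reading its link names, and opening an entry means following the link to that entry's root node.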

How are these DAGs formed? Not sure if worth explaining here, we can just mention that we have different layouts for different use cases, e.g., balanced and trickle.

TODO: Add discussions about node encapsulation, node specialization. Something to consider when writing this section (see ipfs/kubo#5166).

Mutable File System (MFS)

Sub-graph concept and how they can be seen in different layers (already discussed in DAG). This is a new dimension not present in telecommunication models like TCP/IP. We could see an entire UnixFS graph representing a file like a sub-graph at the MFS layer, because the MFS will only interact with the root file nodes. Very useful for HAMT directories and the general idea that a directory may be comprised of several nodes and again, I will normally hold a reference (hash) to the root one.

Why do we need this layer? As it says in the concept doc: "Because files in IPFS are content-addressed and immutable, they can be complicated to edit." I would like to expand on that with a simple example (that should interact with the concepts discussed earlier) like: if I have a directory /home/user/ represented by a UnixFS node QmHomeDir (let's use readable CIDs for the examples whenever possible) that has a link to a child file bashrc (avoiding the dot for simplification) with a CID QmBashrc, if I modify the bashrc file contents it will therefore have a new hash and CID, e.g., QmBashrcModified, and the entry of QmHomeDir will point to the old contents. (Diagram with both old/ new content and the link that will be updated). (This can be based, or merged with, the previous example diagram of the DAG section, since we're talking basically about the same issue.)

Once we are done editing, those changes will be flushed to the UnixFS layer: the final contents will be paired with the correct links to each other. (Elaborate on this, it's important to convey that we need to save/flush changes because hashes are immutable.)

In the same way that a "root file node" represents an entire file, a "root MFS node" can represent an entire file system hierarchy.

MFS provides path-addressable content, abstracting the UnixFS layer (to simplify the limits of the content-addressable system of the DAG layer, which should be clarified earlier). And follow up with the bashrc example.

(Merge with previous paragraph.) To keep this all in sync while working with files we create another layer that works with paths (that are actually meaningless in content-addressed, "meaningless" may be too strong) and keeps a reference of "what I mean when I talk about the bashrc file" without actually being concerned with the (ever-changing) hash of a file that is being edited.
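The bashrc example can be sketched as a tiny mutable layer over immutable nodes. This is a toy model, not the real MFS: the `Mfs` class and its methods are hypothetical, identifiers are hex SHA-256 strings standing in for CIDs, and every write rebuilds the nodes up to the root immediately instead of caching changes and flushing them later.

```python
import hashlib
import json


class Mfs:
    """Hypothetical mutable layer: paths on top, immutable nodes below."""

    def __init__(self):
        self.store = {}  # id -> node; nodes are never modified in place
        self.root = self._put({"type": "dir", "links": {}})

    def _put(self, node: dict) -> str:
        encoded = json.dumps(node, sort_keys=True).encode()
        node_id = hashlib.sha256(encoded).hexdigest()
        self.store[node_id] = node
        return node_id

    def _write(self, dir_id: str, parts: list, content: str) -> str:
        """Return the id of a NEW directory node reflecting the change."""
        links = dict(self.store[dir_id]["links"])  # copy, never mutate
        name = parts[0]
        if len(parts) == 1:
            links[name] = self._put({"type": "file", "data": content})
        else:
            child = links.get(name) or self._put({"type": "dir", "links": {}})
            links[name] = self._write(child, parts[1:], content)
        return self._put({"type": "dir", "links": links})

    def write(self, path: str, content: str) -> None:
        parts = [part for part in path.split("/") if part]
        # The only thing that mutates is this single reference to the root.
        self.root = self._write(self.root, parts, content)

    def read(self, path: str) -> str:
        node = self.store[self.root]
        for name in (part for part in path.split("/") if part):
            node = self.store[node["links"][name]]
        return node["data"]


mfs = Mfs()
mfs.write("/home/user/bashrc", "alias ll='ls -l'")
old_root = mfs.root
mfs.write("/home/user/bashrc", "alias ll='ls -la'")
```

After the second write, `mfs.root` differs from `old_root`, and both hierarchies exist side by side in the store: the old root still leads to the old bashrc contents, while the path `/home/user/bashrc` now leads to the new ones. The path is what stays stable while the hashes underneath keep changing.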

@meiqimichelle

This is great, we really, really need this -- just reading your initial outline/writeup here is helping me understand the concepts better. @schomatis is there anything that would be helpful to keep moving on this? It wasn't self-assigned, so wanted to check to see if it is on your task list, or if we should find someone to shepherd it forward -- or even if it is on your list, more brains would be helpful (or editors).

Re: illustrations, once the concepts are where you'd like them to be, the illustrations can be issues in protocol/design -- but should wait till the concepts are finished so we don't illustrate the wrong idea, if you know what I mean.

Thank you! I'm excited. You can probably tell, haha.

@schomatis

schomatis commented Nov 13, 2018

Hey @meiqimichelle, thanks for the feedback, it's really energizing :)

@schomatis is there anything that would be helpful to keep moving on this? It wasn't self-assigned,

You're right, I should self-assign this for now. My plan is to keep working on the content for a week or two but then we'll need to find a technical writer who will actually prepare the end product (I can help with the review process but I'm not qualified to write it myself). Since Rob is no longer in charge of the docs repo Why suggested to ping @mikeal about it (not sure if to write it himself or to point to the right person for the job).

@schomatis schomatis self-assigned this Nov 13, 2018
@nitishm

nitishm commented Dec 19, 2018

How is it that if I add a file through the WebGUI I can list it using ipfs files ls, but if I do the same with ipfs add --pin=true I cannot see it using ipfs files ls? Is there some concept that I might have overlooked?

Follow up from ipfs/kubo#5858 (comment)

@nitishm

nitishm commented Dec 20, 2018

Another concept that needs elaboration and a clear explanation:

What I don't understand is, in ipfs add we create an MFS root object in the Adder.addNode() method (albeit with unixfs.EmptyDirNode()), so why is it that ipfs files ls doesn't find it?

[source: https://github.com/ipfs/kubo/issues/5862#issuecomment-449123483]

Explanation found here -
ipfs/kubo#5862 (comment)

Emphasis on the MFS root in ipfs add vs ipfs files, where it could be ephemeral or local.

@schomatis

Sorry, I meant the documentation you're carrying forward. This is more like a high-level document to explain what an MFS root is, not how we use it in our commands.

@nitishm

nitishm commented Dec 21, 2018

if I have a directory /home/user/ represented by a UnixFS node QmHomeDir (let's use readable CIDs for the examples whenever possible) that has a link to a child file bashrc (avoiding the dot for simplification) with a CID QmBashrc, if I modify the bashrc file contents it will therefore have a new hash and CID, e.g., QmBashrcModified, and the entry of QmHomeDir will point to the old contents. (Diagram with both old/ new content and the link that will be updated).

Does this mean a QmNewHomeDir is created because of changing bashrc (QmNewBashrcModified) ? I am confused 🤔.

Point being, it's not exactly clear from the description.

@schomatis

Does this mean a QmNewHomeDir is created because of changing bashrc (QmNewBashrcModified) ?

Exactly.

I am confused.

Help me understand why. How could we rephrase this to make it clearer?

(Please keep these kinds of questions coming, they are very helpful.)

@nitishm

nitishm commented Dec 21, 2018

My advice is to err on the side of repeating yourself in the documentation.

(Diagram with both old/ new content and the link that will be updated). (This can be based, or merged with, the previous example diagram of the DAG section, since we're talking basically about the same issue.)

You have that fact captured in the document; it's just that when I read the bashrc example it felt incomplete without stating the fact that we now have the old [QmHomeDir]->[QmBashrc] DAG and the new [QmNewHomeDir]->[QmBashrcModified] DAG.
Diagrams would definitely be helpful, especially one showing a thousand-foot view with recursive tree diagrams, e.g., the MFS tree starting at its own root with its own children, each child having the root dagnode (aka root file node) and its linked children.

@jessicaschilling jessicaschilling changed the title Concept document about files in IPFS Concept Doc: Files in IPFS Jul 26, 2019
@jessicaschilling jessicaschilling changed the title Concept Doc: Files in IPFS Complete this Files in IPFS content brainstorm Sep 19, 2019
@jessicaschilling jessicaschilling changed the title Complete this Files in IPFS content brainstorm [NEW CONTENT] How files are managed in IPFS Dec 16, 2019
@jessicaschilling jessicaschilling changed the title [NEW CONTENT] How files are managed in IPFS [CONTENT ENHANCEMENT] How files are managed in IPFS Dec 16, 2019
@jessicaschilling jessicaschilling changed the title [CONTENT ENHANCEMENT] How files are managed in IPFS [CONTENT ENHANCEMENT] Augment "File systems and IPFS" page Dec 16, 2019
@jessicaschilling jessicaschilling changed the title [CONTENT ENHANCEMENT] Augment "File systems and IPFS" page [CONTENT IMPROVEMENT] Augment "File systems and IPFS" page Dec 17, 2019
@johnnymatthews johnnymatthews changed the title [CONTENT IMPROVEMENT] Augment "File systems and IPFS" page Augment "File systems and IPFS" page Apr 17, 2020
@hsanjuan hsanjuan transferred this issue from ipfs-inactive/docs May 22, 2020
@johnnymatthews johnnymatthews added dif/medium Prior experience is likely helpful effort/days Estimated to take multiple days, but less than a week kind/enhancement A net-new feature or an improvement to an existing feature need/analysis Needs further analysis before proceeding P3 Low: Not priority right now status/inactive No significant work in the previous month topic/docs Documentation labels Jun 18, 2020
@johnnymatthews johnnymatthews changed the title Augment "File systems and IPFS" page Explain how files are managed in IPFS. Jun 18, 2020
@BlocksOnAChain

Probably explained here: https://proto.school/merkle-dags/01
