This repository has been archived by the owner. It is now read-only.

WIP Add CAR spec #51

Merged
merged 1 commit into from May 12, 2018
@@ -0,0 +1,108 @@
# Certified ARchive

CARs (Certified ARchives) are archives for IPLD DAGs.

## Summary

CARs are archives of IPLD DAGs. They are:

1. Certified
2. Seekable
3. Compact
4. Reproducible
5. Simple and Stable

The actual format is just a (mostly) recursively-defined topological sort of the
desired DAG with some metadata for fast traversal.

```
CID
len(ROOT)
ROOT
[
CHILD-1-OFFSET
...
]
[
len(CHILD-1)
CHILD-1
[
CHILD-1/1-OFFSET
...
]
...
]
```

Offsets are relative; offsets for missing children use a sentinel value.
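As a concrete illustration, here is a minimal Python sketch of this layout for the simple case of a root with leaf children. Everything beyond the diagram is an assumption for illustration: base128 varints for lengths and offsets, offsets measured relative to the start of the children section, and no offset tables on leaves.

```python
def varint(n: int) -> bytes:
    """Base128 varint (protobuf-style), assumed here for lengths/offsets."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_car(root_cid: bytes, root: bytes, children: list[bytes]) -> bytes:
    # Children section: len(child) + child; leaves carry no offset table.
    blobs, offsets, pos = [], [], 0
    for child in children:
        blob = varint(len(child)) + child
        offsets.append(pos)  # offset relative to the children section
        blobs.append(blob)
        pos += len(blob)
    table = b"".join(varint(o) for o in offsets)
    return root_cid + varint(len(root)) + root + table + b"".join(blobs)
```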
@rklaehn rklaehn Nov 7, 2017

So how would you deal with a leaf node that is referenced from multiple branches, e.g. root -> a -> c, root -> b -> c?

I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level? Would c then be stored before a and b?

@Stebalien Stebalien (Member, Author) Nov 7, 2017

I assume that you would store c only once and just calculate the relative offset for both a and b? In that case a and b would just have offset arrays, but not value arrays, and c would be stored at the root level?

Yes. I'll try to make this diagram a bit less confusing (and add the example below).

Would c then be stored before a and b?

That's where the topological sort comes in. We'd store this as:

      root.Cid()
root: root.Length()
      root.Bytes()
      offset(a)
      offset(b)
a:    a.Length()
      a.Bytes()
      offset(c)
b:    b.Length()
      b.Bytes()
      offset(c) // 0
c:    c.Length()
      c.Bytes()
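The ordering above can be produced with a post-order DFS whose output is reversed; a sketch (the dict-based DAG representation and names are illustrative only):

```python
def topo_order(children: dict[str, list[str]], root: str) -> list[str]:
    """Reverse post-order DFS: root first, every shared child stored once,
    and always after all of its parents."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for c in reversed(children.get(n, [])):
            visit(c)
        order.append(n)
    visit(root)
    order.reverse()
    return order
```

For the DAG in question (root -> a, root -> b, a -> c, b -> c) this yields root, a, b, c, with c appearing exactly once.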

@kevina kevina Nov 8, 2017

I am not sure I see the benefit of this. This could force certain children to be very far from their first parent and will likely not stream well. It would seem far simpler to just store children in a breadth-first manner and allow negative offsets to keep the children close to their (first) parent. A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache and it will not be necessary to seek in order to get the contents.

@kevina kevina Nov 8, 2017

Well, breadth-first may not be the best method to keep children close to their parents; in fact, there probably is no best method to do so. Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

@Stebalien Stebalien (Member, Author) Nov 8, 2017

This could force certain children to be very far from their first parent and will likely not stream well.

We can't both stream (write) and support the offset index (unless we keep the index in memory and append it to the end but that would hurt reading significantly).

children close to their (first) parent

Regardless of what we do, we can't get this property (assuming a reasonable branching factor). For example, with a branching factor of 2, some children will already be ~1024 nodes away from their parents once we hit a depth of 10.

However, this would ensure that siblings are (usually) next to each other. On the other hand, we could probably get a similar property with a topological sort if we use the right algorithm (although we'd always end up grouping nodes with the deepest siblings).

A negative value can then be taken as a clear indicator of a duplicate, and with some luck the node may already be in the cache and it will not be necessary to seek in order to get the contents.

We'd have to have read in the entire CAR for this to be the case. It would be nice to be able to skip over portions of the CAR that we don't care about.


Basically, if we can't optimize for streaming while writing, we might as well optimize for streaming while reading. With a topologically sorted DAG, we can traverse a DAG in one pass through the file by:

  1. Looking at a node.
  2. Picking the child we want.
  3. Looking up the offset of that child.
  4. Seeking (forward) to the offset of that child.
  5. Recursing.

This is a useful property for media that works best with sequential access patterns.
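A sketch of that forward-only traversal, using a deliberately toy encoding (1-byte lengths, counts, and offsets; offsets relative to the end of the current node's record) rather than the real format:

```python
import io

def read_node(f):
    # Toy record: 1-byte payload length, payload, 1-byte child count,
    # then one 1-byte relative offset per child.
    length = f.read(1)[0]
    payload = f.read(length)
    count = f.read(1)[0]
    offsets = [f.read(1)[0] for _ in range(count)]
    return payload, offsets

def traverse(f, path):
    # path: child indices to follow; every seek moves strictly forward.
    payload, offsets = read_node(f)          # 1. look at a node
    for idx in path:
        base = f.tell()                      # offsets are relative to here
        f.seek(base + offsets[idx])          # 2-4. pick child, seek forward
        payload, offsets = read_node(f)      # 5. recurse
    return payload
```

With two leaves `a` and `b` stored after the root, `traverse(f, [1])` skips over `a` entirely and reads only `b`.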

@Stebalien Stebalien (Member, Author) Nov 8, 2017

Nevertheless, I still think it would be better to allow negative offsets so that children can be close to their first parent when possible.

Personally, I don't really see any benefit of putting them close to the first parent rather than the last. I don't see order as having anything to do with "importance". Is there any reason to do this?


We only bother including the root CID because all the other CIDs are embedded in
the objects themselves. This saves space and *forces* parsers to actually
traverse the DAG (hopefully validating it).
@kevina kevina Nov 8, 2017

I am not sure I see the benefit of "forcing a parser to traverse the DAG".

@Stebalien Stebalien (Member, Author) Nov 8, 2017

Good question. For one, it ensures that the CAR is actually one giant DAG rooted at that CID. However, that may not even be worth mentioning (space is, IMO, sufficient).


## Motivation

Use cases:

1. Reliably export/import a DAG to/from an external hard drive (backup, sneakernet).
2. Traverse a large DAG on an external hard drive without parsing the *entire* DAG.
3. Traverse a large DAG streamed over HTTP without downloading the entire thing.

The simple method is to copy the entire repo. However, for performance, we need
to be able to upgrade the repo format, so this isn't really a stable format.
Additionally, repos need to support insertions, deletions, and random lookups.
Supporting these efficiently necessarily complicates the format. We'd like
something simple and portable (backup).

The slightly more complex way is to download every object into a separate file
and then import each file. However, this isn't very convenient and *does not*
scale well to large directories (use case 2).

One could improve this multi-file approach by splitting up the DAG into multiple
directories and providing a set of tools to manage the files. However, we'd
rather not rely on the filesystem for anything, really. Filesystems:

* Don't always deal with names well (e.g., FAT16).
* Don't always handle many small files well.
* Aren't usually as space-efficient as possible (to support updates).
* Are complex (easy to corrupt metadata/structure).

Additionally, it's hard to download a directory structure over HTTP (motivated by
use-case 3). One can just TAR it up but that layers another (complex) file
format into the mix.

So, we'd like a new single-file format that, if necessary, we can just `dd` to a
drive in place of a filesystem.

TODO: Expand.

# Questions

There are a few open questions.

## Uint64/Varint

The advantage of using uint64s over varints is that we can leave the "jump
tables" blank and then fill them in on a second pass after we've written
everything. However, if we topologically sort the DAG, we may be able to compute
the jump tables up-front.

The advantages of varints over uint64 are space and flexibility (DAGs larger
than 16 Exbibytes).

Currently, I'm leaning toward varints as this will make storing lots of small
blocks significantly more efficient.
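For reference, the base128 varint encoding under discussion (the protobuf one), along with the space comparison against fixed uint64s; a sketch:

```python
def encode_varint(n: int) -> bytes:
    """Base128 varint: 7 payload bits per byte, MSB set on all but the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(buf: bytes, i: int = 0) -> tuple[int, int]:
    """Return (value, index just past the varint)."""
    n = shift = 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7
```

A small block length like 300 costs 2 bytes instead of 8, which is where the savings for many small blocks come from, and values beyond 2^64 still encode fine.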

@kevina kevina Nov 7, 2017

I would agree with this (varints).

@Stebalien Stebalien (Member, Author) Nov 7, 2017

Note: there are significant downsides:

  1. Can't leave them blank and fill them in (or change them after the fact).
  2. Slower/harder to parse.

Also, any suggestions on sentinel values?

@kevina kevina Nov 8, 2017

Also, any suggestions on sentinel values?

Not yet; something like this is best determined once the details are worked out. If we are using COBL we could just use the NULL value.

@Stebalien Stebalien (Member, Author) Nov 8, 2017

What is COBL? I was planning on using base128 varints (the one protobuf uses) which has no "empty" values. We could use 0 and say that 1 means the next byte, I guess.

@kevina kevina Nov 8, 2017

Sorry meant to say CBOR.

@Stebalien Stebalien (Member, Author) Nov 8, 2017

Ah, you mean structure this as a CBOR object? I hadn't considered that. My plan was to make a custom file format. Unfortunately, CBOR isn't very seekable. We could also go with some other existing format but I would like something very simple and compact.

@Stebalien Stebalien (Member, Author) Nov 8, 2017

Actually, one argument for uint64 is that we'd be able to skip directly to the correct offset in the jump table without iterating through it. However, given that we already have to parse the IPLD object, that's probably a non-issue. Also, there are some fancy bit-twiddling algorithms that can make this very fast by counting bytes with zero MSBs (I just need to remember to implement it...).
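The bit-twiddling idea alluded to: every varint ends with (and only with) a byte whose MSB is clear, so counting complete varints in a buffer reduces to counting zero-MSB bytes, which can be done a word at a time. A sketch:

```python
def count_varints(buf: bytes) -> int:
    # Scalar version: one complete varint per byte with a clear MSB.
    return sum(1 for b in buf if b < 0x80)

def count_varints_words(buf: bytes) -> int:
    # Word-at-a-time version: invert, mask the MSB of each byte, and
    # popcount -- eight bytes per iteration, tail handled byte-by-byte.
    mask = 0x8080808080808080
    cut = len(buf) - len(buf) % 8
    total = 0
    for i in range(0, cut, 8):
        w = int.from_bytes(buf[i:i + 8], "little")
        total += bin(~w & mask).count("1")
    return total + sum(1 for b in buf[cut:] if b < 0x80)
```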

## Inline Blocks

So, we can technically have inline blocks using the identity multihash. How do
we deal with them?

1. We *don't* want to duplicate the data.
2. We need to support inline blocks with children.

## Topological sort
@rklaehn rklaehn Nov 7, 2017

I think the purpose of the topological sort and how exactly it is going to sort needs to be fleshed out a bit more.

@Stebalien Stebalien (Member, Author) Nov 7, 2017

Totally agree. Basically, it allows us to traverse a DAG in one pass. This + offsets makes traversing a DAG on, e.g., tape really fast.

@rklaehn rklaehn Nov 7, 2017

So the lower nodes are stored after the higher nodes? Then the offset calculation will be tricky. I don't see how that can work with the varints. With int64 you could just backfill once you know the offsets.

If you store first the leaf nodes, and then the higher nodes and then the root, you always know the offset when you write a node. But then the access pattern is backwards.

@Stebalien Stebalien (Member, Author) Nov 7, 2017

So the lower nodes are stored after the higher nodes?

Yes.

With int64 you could just backfill once you know the offsets.

We (@whyrusleeping and I) planned on using uint64s at first. However:

  1. The best topological sort algorithm I could find (basically, just a DFS) actually does work backwards.
  2. If we want to provide an index, we can't make the CAR in one pass anyways (although, if we don't do a topological sort, we could dump the data in one pass and then write in the index afterwards).

My current plan is to do one pass to determine the structure of the CAR and a second pass to write it. Unfortunately, this will require a significant amount of scratch space.

@rklaehn rklaehn Nov 8, 2017

So it would be something like

  1. do the sorting so that leaves are last, create a sequence of nodes
  2. go backwards through the seq and determine serialised size for each node (to do this you need the offsets)
  3. go forwards through the seq and write to disk

Sounds good. Backfilling the offsets in case of int64 offsets would also need to be done in a clever way if you want linear access patterns. The alternative would be to write leaves first and calculate the offsets in one pass. That would be a single pass for writing, but would be a backwards access pattern for reading.
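Those three steps, sketched with fixed-width (1-byte) length/count/offset fields so a node's size doesn't depend on the width of the numbers it stores (with varints, step 2 gets harder because offset widths feed back into sizes):

```python
def plan(payloads, children, order):
    # order: topological, root first, leaves last (step 1 already done).
    size, pos = {}, {}
    # Step 2: walk backwards; each node's serialised size is its length
    # byte + payload + child-count byte + one offset byte per child.
    for n in reversed(order):
        size[n] = 1 + len(payloads[n]) + 1 + len(children.get(n, []))
    # Step 3: walk forwards, assigning the position each node lands at.
    cur = 0
    for n in order:
        pos[n] = cur
        cur += size[n]
    return size, pos
```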

@sesam sesam Nov 12, 2017

Camlistore, with care taken for data archeology/recoverability, stores data and metadata (like folder structures and references) the same way: as content-addressed blobs.

I understand CAR as IPFS's version of tar, to archive IPFS-held data in a (tape) streaming-friendly format. Maybe it could become a .torrent competitor. But maybe that functionality can be achieved in a simpler way? Just pushing blobs to/from storage automatically, where the automation can be built from a combination of Bloom filters and invertible Bloom filters over the content hashes. This is currently being implemented for bitcoin block propagation, based on fresh insights that surely can be recycled for IPFS as well.

Imagine this upgrade: an HTTP/2 server can stream with interleaving so that several "files" and "directory structures" could be handled at once.

Which supports this use case: you'd click a web link to your "junior year reception party" video, and your HTTP/2 server suggests you might also want the gag/for-fun subtitles or voice-over that other people who streamed that video also used. Your client, say VLC, makes it a single keypress to say yes to any of the suggested extras.


So, a topological sort makes it really easy to traverse the CAR, even when
streaming. However, producing a topologically sorted DAG is a bit trickier. Note:
whatever we choose, it won't have any effect on the asymptotic runtime (memory or time).