Kevin Atkinson edited this page Jul 12, 2016 · 2 revisions

Merging Into go-ipfs

The filestore is a way to add files to IPFS without duplication. It works by storing a modified protocol buffer without the unixfs Data field (i.e. the file data), along with information on how to reconstruct the original protocol buffer from the file, in a separate datastore. Currently, leaf nodes that contain file data, and interior merkle dag nodes that represent a file (or part of one), are stored in this alternative datastore. Directory entries are still stored in the normal datastore.

Right now the biggest area where I need help is working out some semantic issues with how the filestore interacts with the rest of IPFS.

I could also really use some testing help. Some filestore tests are currently failing on Mac OS X, and I need help debugging them.

The rest of this document will focus on the semantic issues that need to be worked out.

Need Some Sort of Support For Multiple Datastores

This is the most pressing issue, and the one I most need others' feedback on.

Related issue ipfs/go-ipfs #2747.

The filestore is a datastore, but it is only designed to handle a subset of the blocks used in IPFS. The main datastore is therefore still needed, and some form of support for multiple datastores is required so that the filestore and the main datastore can coexist.

(Note: IPFS has some infrastructure in place for using different datastores based on the key prefix. What I need for the filestore is to allow multiple datastores under the "/blocks" prefix; think of it as something like UnionFS for the IPFS datastore.)

There are several ways to handle this. What I believe makes the most sense, and will be the easiest to implement, is to support a "cache" plus any number of additional "aux" datastores with the following semantics:

  • When looking up a block, the "cache" is tried first; if the block is not found, each "aux" datastore is tried in turn. The order of the "aux" datastores is explicitly set by the user.

  • Any operation that modifies the datastore acts only on the "cache".

  • The "aux" datastores are allowed to read-only. When they are not additional specialized API calls will be required for adding or removing data from the "aux" datastores.

  • Each of these datastores is given a name and can be accessed by that name from the repo. The name "cache" is used for the repo's main datastore.

  • Duplicate data should be avoided if possible, but not completely disallowed.

These rules imply that the garbage collector should only attempt to remove data from the "cache" and leave the other datastores alone.
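The lookup and write semantics above can be sketched in Go. This is a minimal illustration only; the type and method names (`MultiDatastore`, `mapStore`, etc.) are hypothetical and much simpler than the real go-datastore interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when no datastore holds the key.
var ErrNotFound = errors.New("key not found")

// Datastore is a minimal read/write interface; the real go-ipfs
// datastore interfaces are considerably richer.
type Datastore interface {
	Get(key string) ([]byte, error)
	Put(key string, value []byte) error
}

// mapStore is an in-memory stand-in for a concrete datastore.
type mapStore map[string][]byte

func (m mapStore) Get(key string) ([]byte, error) {
	v, ok := m[key]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (m mapStore) Put(key string, value []byte) error {
	m[key] = value
	return nil
}

// MultiDatastore implements the proposed semantics: reads try the
// "cache" first and then each "aux" store in user-defined order;
// writes go only to the "cache".
type MultiDatastore struct {
	Cache Datastore
	Aux   []Datastore // e.g. the filestore; may be read-only
}

func (m *MultiDatastore) Get(key string) ([]byte, error) {
	if v, err := m.Cache.Get(key); err == nil {
		return v, nil
	}
	for _, aux := range m.Aux {
		if v, err := aux.Get(key); err == nil {
			return v, nil
		}
	}
	return nil, ErrNotFound
}

// Put only ever modifies the cache; adding to or removing from an
// "aux" datastore requires a specialized API call.
func (m *MultiDatastore) Put(key string, value []byte) error {
	return m.Cache.Put(key, value)
}

func main() {
	cache := mapStore{}
	filestore := mapStore{"block1": []byte("file-backed data")}
	multi := &MultiDatastore{Cache: cache, Aux: []Datastore{filestore}}

	multi.Put("block2", []byte("cached data")) // lands in the cache only
	v, _ := multi.Get("block1")                // falls through to the filestore
	fmt.Printf("block1=%q\n", v)
}
```

Note that because `Put` never touches the "aux" stores, the garbage-collection rule above falls out naturally: only the "cache" ever accumulates removable data.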

How multiple datastores are handled will influence how many of the other issues are resolved.

Garbage Collector and Invalid Blocks

Blocks stored in the filestore can become invalid at any time. If one of those blocks is pinned, this will cause the garbage collector to abort. As a concrete example: directories are not stored in the filestore, so if a user adds a directory and some of the files in it later change, the garbage collector will fail. There are several ways to handle this:

(1) Introduce the concept of best-effort pins, for which the garbage collector will simply not abort when it is unable to read a block. See go-ipfs issue #2698.

(2) Add a GetLinks() method to DAGService. The filestore can always satisfy this method, as links are stored in the metadata.

(3) If the multi-datastore is implemented as proposed, simply acknowledge that the block was found in the filestore and do not attempt to read it. Assume any children of the node are also in the filestore (a safe assumption based on how the filestore currently works).

(4) If the multi-datastore is implemented as proposed, disallow pins in the "cache" from referring to blocks outside the cache. This will involve moving directory nodes into the filestore.

Of all the solutions, I like (3) best. A general implementation of (2) would also likely speed up the garbage collector significantly on large datastores.
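To make the appeal of (2) concrete, here is a minimal sketch of a mark phase that walks the DAG using links alone, so it never needs to read (or validate) file data. The `LinkService` interface and `metaStore` type are hypothetical stand-ins, not the real go-ipfs API:

```go
package main

import "fmt"

// LinkService is a hypothetical, simplified version of the proposed
// GetLinks() addition to DAGService.
type LinkService interface {
	// GetLinks returns the child keys of a node without reading
	// (or validating) the node's file data.
	GetLinks(key string) ([]string, error)
}

// metaStore maps a key to its child keys, as the filestore could
// answer directly from its metadata even when the underlying file
// has changed.
type metaStore map[string][]string

func (m metaStore) GetLinks(key string) ([]string, error) {
	return m[key], nil
}

// markReachable walks the DAG from the pinned roots using GetLinks
// only; blocks with invalid file data can still be traversed.
func markReachable(ls LinkService, roots []string) map[string]bool {
	seen := map[string]bool{}
	stack := append([]string{}, roots...)
	for len(stack) > 0 {
		k := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[k] {
			continue
		}
		seen[k] = true
		links, _ := ls.GetLinks(k)
		stack = append(stack, links...)
	}
	return seen
}

func main() {
	meta := metaStore{
		"root": {"a", "b"},
		"a":    {"leaf1"},
		"b":    {"leaf2"},
	}
	reachable := markReachable(meta, []string{"root"})
	fmt.Println(len(reachable)) // root, a, b, leaf1, leaf2
}
```

Because the mark phase touches only metadata, it also avoids reading block bodies entirely, which is where the potential speedup on large datastores comes from.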

For completeness, here are some other possible solutions I ruled out:

(10) Have the garbage collector continue even if it can't read all the blocks. Ruled out because it is dangerous: an unreadable block could have links to other blocks that need to be kept.

(11) Disallow pinned blocks in the filestore from becoming invalid; when a block becomes invalid, automatically remove or repair the pin. Ruled out because it is overkill, especially since blocks can become invalid due to transient or easily correctable errors.

Invalid Blocks in general

In general, go-ipfs commands abort on the first error, even if it is non-fatal. With the filestore, invalid blocks may become commonplace, and IPFS commands such as ipfs pin ls should not abort on an invalid block; rather, they should report the block as invalid and continue to list all the valid blocks. This modified behavior is consistent with how most unix utilities handle system-related errors. For example, if find can't read a directory, it will output a warning and continue on to the next directory.

Side Issue: Convert the Filestore to a Blockstore

Added July 11, 2016

After thinking about it, implementing the filestore as a datastore, as opposed to a blockstore, does not seem to serve much purpose other than adding an extra layer of indirection.

If this is an idea worth considering, it could change some of the above issues. For example, instead of multiple datastores there would be multiple blockstores.

In addition, implementing multiple blockstores could help resolve some of the caching issues introduced by the new bloom filter, by providing a completely separate blockstore for the filestore that does not use caching. Since the filestore is implemented using leveldb, it may turn out that Has caching is not that important; or if it is, it may make sense to have separate bloom filters for each blockstore.