Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add --nocopy does not always add blocks to the filestore + proposal to make it do so #5988

Open
radfish opened this issue Feb 12, 2019 · 4 comments
Labels
help wanted Seeking public contribution on this issue topic/filestore Topic filestore

Comments

@radfish
Copy link

radfish commented Feb 12, 2019

Version information: 0.4.19-dev

Type: enhancement

Description:

add --nocopy should add files to the filestore instead of the blockstore, but it doesn't always do so. Blocks are not added to the filestore if the same blocks happen to be in the blockstore. In this case, the blocks won't end up in the filestore, and so won't be seen in filestore ls. This hinders the user from achieving non-duplicated storage of data: the user ends up with the data in the original files AND in the blockstore, despite having used --nocopy.

To avoid the duplication one might think to unpin the roots existing pior to the add and refering to the leaves that match the blocks that are about to be added, and then repo gc to clear these leaves from the blockstore, before running add --nocopy, but

  1. GC does not always cleanup the blocks in the filestore. I've seen block X in the filestore ls, pin ls X saying 'not pinned', then repo gc, and yet filestore ls X still finds it. I've also seen GC remove unpinned blocks from the filestore, so perhaps the former behavior is just a bug? (UPDATE: Blocks were held by MFS: ipfs files)
  2. This might not be a general solution, because the leaves might be shared with unrelated roots R. As long as R is pinned, this is fine (I'm not looking to re-point R's leaf pointers to the filestore.), but after R is unpinned, we have the problematic duplication: data in raw filesystem file and same data in the blockstore. In the extreme case, this problem can make nocopy a noop (save net zero space after counting the raw filesystem files; instead of adding the negligible cost of intermediate nodes for the benefit of publishing the data to IPFS).
  3. This approach cannot be applied retroactively after the add -- if you realize that the blocks existed in the blockstore after the add, then you have to clean them and then re-run add --nocopy again. This is less intuitive than having to clean up duplicates after finding out that there are duplicates.

Proposal: allow the add to duplicate between filestore and blockstore, but provide tools to manage this. If user requests something be added to the filestore with --nocopy, then add it unconditionally, but also give them a block rm option to remove from blockstore or filestore explicitly. The removal would only be allowed if the block is not referenced by a pinned parent. This check already happens, but with the proposed change the references need to be distinguished by the backend type of the block, not only by the block hash, because obviously the block hash is going to be referenced by the parent that was just added with --nocopy.

Then, to get into a state with only one copy of the data (in the original filesystem files): ipfs filestore dups | xargs -n 1 ipfs block rm --from-blockstore.

Both parts seem doable. The patch to the first part (I'm already using it):

--- a/filestore/filestore.go
+++ b/filestore/filestore.go
@@ -169,15 +169,6 @@ func (f *Filestore) GetSize(c cid.Cid) (int, error) {
 // Has returns true if the block with the given Cid is
 // stored in the Filestore.
 func (f *Filestore) Has(c cid.Cid) (bool, error) {
-       has, err := f.bs.Has(c)
-       if err != nil {
-               return false, err
-       }
-
-       if has {
-               return true, nil
-       }
-
        return f.fm.Has(c)
 }
 
@magik6k magik6k added help wanted Seeking public contribution on this issue topic/filestore Topic filestore labels Feb 12, 2019
@Stebalien
Copy link
Member

GC does not always cleanup the blocks in the filestore. I've seen block X in the filestore ls, pin ls X saying 'not pinned', then repo gc, and yet filestore ls X still finds it. I've also seen GC remove unpinned blocks from the filestore, so perhaps the former behavior is just a bug?

Is the file stored in MFS (ipfs files)?

@radfish
Copy link
Author

radfish commented Feb 18, 2019

Is the file stored in MFS (ipfs files)?

Almost certainly yes. I cleared MFS and GC collected a bunch of blocks; I didn't closely check, but safe to assume this explains Item 1. Thanks!

@momack2 momack2 added this to Inbox in ipfs/go-ipfs May 9, 2019
@Stebalien
Copy link
Member

Revisiting this issue for triage:

Both parts seem doable. The patch to the first part (I'm already using it):

That's probably not what you want to do. We wrap the blockstore with the filestore so that code will just cause us to think you don't have blocks that aren't in the filestore. I'm not sure if it'll break anything in practice (bitswap uses GetSize and most services use Get) but it's still a bad idea.

Blocks are not added to the filestore if the same blocks happen to be in the blockstore.

We can probably add a flag to the add command to handle this case but this isn't something the core team is likely to look into any time soon. It'll have to be an external contribution.

@radfish
Copy link
Author

radfish commented Jun 8, 2019

Both parts seem doable. The patch to the first part (I'm already using it):

That's probably not what you want to do.

Ok, then the consumer of the store API needs to be changed to simply call store.Add instead of if not store.Has then store.Add. The already-present check should be inside the store implementation, because as this case here shows, this logic is not generic and is really within the scope of a particular store implementation. A store may be implemented from composite stores, like in Filestore case here, and the semantics of which sub-store is modified on Add and which is looked up on Has etc should be entirely up to the implementation of the composite store.

Then, the Filestore will be able to implement the desired semantics: upon Add, it will check if the FS store contains the block and if not then add it to the FS store; upon Has, the logic will remain as it is now (check the non-filesystem-based blockstore, then check FS store).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Seeking public contribution on this issue topic/filestore Topic filestore
Projects
No open projects
Development

No branches or pull requests

3 participants