
Ipfs daemon hangs when MFS root is not available locally #7183

Closed
nlko opened this issue Apr 19, 2020 · 22 comments · Fixed by #8661
Assignees
schomatis
Labels
effort/hours (Estimated to take one or several hours) · exp/intermediate (Prior experience is likely helpful) · kind/bug (A bug in existing code (including security flaws)) · P1 (High: Likely tackled by core team if no one steps up) · topic/files (Topic files)

Comments

@nlko

nlko commented Apr 19, 2020

Version information:

go-ipfs version: 0.4.23-
Repo version: 7
System version: amd64/linux
Golang version: go1.13.7

Description:

Hello,

When I start the daemon, it now hangs after displaying this (I tried several times and also restarted the computer):

$ ipfs daemon
Initializing daemon...
go-ipfs version: 0.4.23-
Repo version: 7
System version: amd64/linux
Golang version: go1.13.7

The execution hangs there forever. I tried some ipfs commands, which failed with: Error: cannot acquire lock: Lock FcntlFlock of ~/.ipfs/repo.lock failed: resource temporarily unavailable

After an hour, if I press Ctrl+C, it stops and displays:

22:46:53.828 ERROR   cmd/ipfs: error from node construction:  could not build arguments for function "reflect".makeFuncStub (/usr/lib/go/src/reflect/asm_amd64.s:12): failed to build *mfs.Root: function "github.com/ipfs/go-ipfs/core/node".Files (pkg/mod/github.com/ipfs/go-ipfs@v0.4.23/core/node/core.go:74) returned a non-nil error: error loading filesroot from DAG: failed to get block for QmeXwawbVyYrFh6ZjjjbBzzY8gZAcnt7UfLyY2ZhG6MkYQ: context canceled daemon.go:337

Error: could not build arguments for function "reflect".makeFuncStub (/usr/lib/go/src/reflect/asm_amd64.s:12): failed to build *mfs.Root: function "github.com/ipfs/go-ipfs/core/node".Files (pkg/mod/github.com/ipfs/go-ipfs@v0.4.23/core/node/core.go:74) returned a non-nil error: error loading filesroot from DAG: failed to get block for QmeXwawbVyYrFh6ZjjjbBzzY8gZAcnt7UfLyY2ZhG6MkYQ: context canceled

Here is how I managed to reach this point:

  • Everything was working fine: the daemon had been running for more than 24h and I was doing normal MFS operations, as every day, using ipfs-desktop;
  • The disk where the repo was located ran out of space for a couple of minutes.
  • The reason the disk ran out of space has nothing to do with ipfs: I was using dd (for a big copy). In the meantime I was watching a video from the MFS.
  • From that point I wasn't able to perform ipfs add commands. They failed, a .LOG file stating that the disk was full (which it wasn't anymore; more than 180G were available).
  • I tried to restart the daemon, and it hung as described previously.

Any idea how I can at least recover the content of the repo?

@nlko nlko added the kind/bug label Apr 19, 2020
@hsanjuan
Contributor

hsanjuan commented Apr 20, 2020

If you don't start your daemon and run ipfs files ls /, what happens?

@nlko
Author

nlko commented Apr 20, 2020

I first checked with ps and grep that the daemon isn't running.

$ ipfs files ls /
Error: could not build arguments for function "reflect".makeFuncStub (/usr/lib/go/src/reflect/asm_amd64.s:12): failed to build *mfs.Root: function "github.com/ipfs/go-ipfs/core/node".Files (pkg/mod/github.com/ipfs/go-ipfs@v0.4.23/core/node/core.go:74) returned a non-nil error: error loading filesroot from DAG: merkledag: not found

@hsanjuan
Contributor

@Stebalien is there a way to replace the MFS root? I have no knowledge of it, but I could hack something together. It's not the first time we've seen this.

@Stebalien
Member

We should refuse to start, but we shouldn't hang. This error indicates that the repo has been corrupted somehow (or maybe wasn't initialized properly?). We should:

  1. Track down why this might be happening.
  2. Fetch the MFS root with an "offline" DAG service on start, instead of trying to fetch it from the network.
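A minimal sketch of what point 2 could look like, assuming access to the node's blockstore (the function name and wiring here are illustrative, not the actual core/node code):

package mfsutil

import (
	"context"

	blockservice "github.com/ipfs/go-blockservice"
	cid "github.com/ipfs/go-cid"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
	offline "github.com/ipfs/go-ipfs-exchange-offline"
	ipld "github.com/ipfs/go-ipld-format"
	merkledag "github.com/ipfs/go-merkledag"
)

// loadMFSRootOffline reads the MFS root from the local blockstore only.
// The offline exchange never goes to the network, so a missing block
// fails fast with "not found" instead of hanging daemon startup.
func loadMFSRootOffline(ctx context.Context, bs blockstore.Blockstore, root cid.Cid) (ipld.Node, error) {
	dserv := merkledag.NewDAGService(blockservice.New(bs, offline.Exchange(bs)))
	return dserv.Get(ctx, root)
}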

@nlko
Author

nlko commented Apr 20, 2020

It complains that the QmeXwawbVyYrFh6ZjjjbBzzY8gZAcnt7UfLyY2ZhG6MkYQ block is missing. If I understand correctly, it is looking for a file named L6/CIQPBH3WIZMW4EJVPZ4DDBCASE7JMK6F4RPT3BAYO2TEQPBIO4XBL6I.data, which is missing from the filesystem.

Is it possible that it somehow remembered the last hash of the MFS root even though it wasn't able to properly store the corresponding block? (Like I said, the disk was out of space for some time just before I started having trouble with ipfs.)
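
For what it's worth, the filename-to-CID relation described above can be checked by hand: in these repos the flatfs filename (minus the shard prefix and the .data suffix) is the unpadded base32 encoding of the block key, which for a CIDv0 block is just its sha2-256 multihash. A small sketch using go-cid and go-multihash (my own illustration, not one of the tools mentioned later in this thread):

package main

import (
	"encoding/base32"
	"fmt"

	cid "github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Flatfs filename without the "L6/" shard prefix and ".data" suffix.
	name := "CIQPBH3WIZMW4EJVPZ4DDBCASE7JMK6F4RPT3BAYO2TEQPBIO4XBL6I"
	raw, err := base32.StdEncoding.WithPadding(base32.NoPadding).DecodeString(name)
	if err != nil {
		panic(err)
	}
	h, err := mh.Cast(raw) // the decoded bytes are a sha2-256 multihash
	if err != nil {
		panic(err)
	}
	// Re-encoding as a CIDv0 should print the Qm... form from the error,
	// i.e. QmeXwawbVyYrFh6ZjjjbBzzY8gZAcnt7UfLyY2ZhG6MkYQ here.
	fmt.Println(cid.NewCidV0(h))
}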

@Stebalien
Member

Do you enable garbage collection (ipfs daemon --enable-gc or ipfs repo gc)? If so, I believe we have some races where the garbage collector may remove the root if run at the same time.

@nlko
Author

nlko commented Apr 21, 2020

When it complained that the disk was full (during an ipfs add) I manually ran the gc.
Then I restarted the daemon because ipfs add was still failing (even though, at that time, I could still access MFS files).
But restarting the daemon hung.

When I ran the gc it removed some blocks. Most of my data (at least 60 GB out of 70+GB) was in MFS. The size of the repo folder is still more than 70GB.

So my data is still there, but since I lost the MFS root block, I'm unable to access it.

It's not the first time I've ended up with an MFS problem. During the last month I ran into MFS corruption several times (https://discuss.ipfs.io/t/update-root-cid), I believe because I was using mfs write.

The root CID is quite a sensitive piece of information. Maybe keeping track of some older root CIDs would allow a roll-back later on (ideally they would be stored outside, in a human-readable rotating file, and the gc would respect them).
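
Such a history could even be kept from the outside today. A hedged sketch of that idea (a hypothetical watcher, not an existing tool) that shells out to the real ipfs files stat / --hash command and appends the current root CID to a human-readable log (rotation omitted for brevity):

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Append the current MFS root CID to a plain-text history file
	// every 10 minutes so a known-good root can be restored later.
	f, err := os.OpenFile("mfs-root-history.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	for {
		// `ipfs files stat / --hash` prints only the root CID.
		out, err := exec.Command("ipfs", "files", "stat", "/", "--hash").Output()
		if err == nil {
			fmt.Fprintf(f, "%s %s\n", time.Now().Format(time.RFC3339), strings.TrimSpace(string(out)))
		}
		time.Sleep(10 * time.Minute)
	}
}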

I consider myself new to IPFS, but the way I see it, MFS is a layer on top of IPFS. Maybe a missing MFS root should not prevent the ipfs daemon from starting (in a kind of reduced mode) and performing basic add/pin operations. It's also a feature that may not be useful in some contexts, such as on an ipfs-cluster server.

@nlko
Author

nlko commented Apr 21, 2020

I wrote a piece of code that looks for the biggest tree in the repo.

With it I finally found a DAG tree of 75GB.

The content of the DAG contains the links (folders and files) I had at the MFS root.

Now that I have a CID that looks valid, how can I change the MFS root CID that the ipfs daemon should look for?
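
(For readers without the linked repo: a rough reconstruction of the idea, not nlko's actual code, assuming a default flatfs repo under ~/.ipfs and dag-pb blocks, could look like this:)

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	merkledag "github.com/ipfs/go-merkledag"
)

func main() {
	// Walk the blocks directory, decode each block as dag-pb, and keep
	// the node whose links advertise the largest cumulative size. A big
	// directory node found this way is a candidate for the lost MFS root.
	var bestPath string
	var bestSize uint64

	filepath.Walk(os.ExpandEnv("$HOME/.ipfs/blocks"), func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(p, ".data") {
			return nil
		}
		data, err := os.ReadFile(p)
		if err != nil {
			return nil
		}
		nd, err := merkledag.DecodeProtobuf(data)
		if err != nil {
			return nil // not a dag-pb block; skip it
		}
		var total uint64
		for _, l := range nd.Links() {
			total += l.Size // cumulative size advertised by each child link
		}
		if total > bestSize {
			bestSize, bestPath = total, p
		}
		return nil
	})
	fmt.Printf("largest DAG node: %s (%d bytes linked)\n", bestPath, bestSize)
}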

@hsanjuan
Contributor

@nlko , can you try https://github.com/hsanjuan/mfs-replace-root ? Hopefully this allows you to move on with your stuff. Figuring out if this was due to a GC-related race or simply a bad error when trying to write on a full disk may take a while, and I understand resetting your repo is not an option.

  • Track down why this might be happening.

  • Fetch the MFS root with an "offline" DAG service on start, instead of trying to fetch it from the network.

It is important to do this, but from the user's point of view it would also be good to include a way to forcefully set the MFS root whenever something like this arises (as part of the ipfs files commands), imho.

@hsanjuan hsanjuan changed the title Ipfs daemon hangs Ipfs daemon hangs when MFS root is not available locally Apr 21, 2020
@hsanjuan hsanjuan added the exp/intermediate, effort/hours, P1 and topic/files labels Apr 21, 2020
@hsn10

hsn10 commented Apr 22, 2020

I had a similar problem. There should be a command-line action to set or reset the MFS root.

@nlko
Author

nlko commented Apr 22, 2020

@hsanjuan thanks for the tool; I compiled it and it works. I was able to set the MFS root CID and retrieve the files, but it required the repo to be v9 and therefore the rc2 of ipfs 0.5. It will be very useful when ipfs 0.5 is released.

I finally found another way to recover my repository (still a v7 repo):

  • Within the blocks folder I searched for the block containing the DAG with the largest size (https://github.com/nlko/find-biggest-dag);
  • I obtained the CID from the block filename itself (https://github.com/nlko/ipfs-block-to-cid);
  • I created a new empty ipfs repository from scratch (identical to the broken one but with a brand new datastore and blocks folder);
  • I replaced the blocks folder with the one from the broken repository;
  • then I declared the old MFS root to be a subfolder of the current MFS root: ipfs files cp /ipfs/<OLD_ROOT_CID> /old_root;
  • then it was possible to access it: ipfs files ls /old_root.

I still don't know why the MFS root reference was broken when the disk ran out of free space.

@nlko
Author

nlko commented Apr 22, 2020

@hsn10 I think if we need to reset the MFS to an empty one, we can do so using @hsanjuan's tool:

mfs-replace-root QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn

QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn is the CID of an empty folder.
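
That CID is the canonical empty unixfs directory; if you'd rather derive it than trust a pasted hash, it can be reproduced with go-unixfs (a minimal sketch):

package main

import (
	"fmt"

	unixfs "github.com/ipfs/go-unixfs"
)

func main() {
	// EmptyDirNode builds the canonical empty unixfs directory node;
	// this should print QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn.
	fmt.Println(unixfs.EmptyDirNode().Cid())
}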

@bqv

bqv commented Aug 24, 2020

I hit this recently. Also saved by mfs-replace-root.

@RubenKelevra
Contributor

I packaged my IPFS folder in a state which (probably) hit this bug, for debugging purposes:

#7844 (comment)

@bqv

bqv commented Jan 28, 2021

It might be useful to have a much smaller sample broken repo, for those who want to explore this but can't repro. E.g. my node is only single-digit gigabytes, but I hit this so often that I've added an mfs-replace-root call to the systemd script to prevent it from recurring.

Edit: fwiw that means mine won't work, as I no longer see this issue :p

@aschmahmann
Contributor

@bqv I agree. It's easy to create a repo with a broken state (just set the MFS root to a bogus CID); the question is whether there are any forensics we could do that might help us with the problem.

Does your script only replace on error? If so, maybe you can modify it to snapshot before replacing.

@bqv

bqv commented Jan 28, 2021

It doesn't; I now just treat MFS as volatile storage and let it be cleared on every reboot/restart.

@jjzazuet

Also having the same issue. Any updates? Thanks!

@schomatis schomatis self-assigned this Dec 23, 2021
@schomatis
Contributor

From #6935:

Currently, if we can't find the MFS root block, we hang on start trying to find it (possibly before the network is ready?).

  • We shouldn't block startup on this.
  • We should set a short timeout and log loudly.
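
A sketch of that proposed behavior (illustrative names and wiring; not the patch that eventually landed in #8661):

package mfsstart

import (
	"context"
	"time"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
	logging "github.com/ipfs/go-log"
	unixfs "github.com/ipfs/go-unixfs"
)

var log = logging.Logger("mfs-startup")

// loadFilesRoot bounds the MFS root lookup with a short timeout and, on
// failure, logs loudly and falls back to an empty root instead of
// blocking daemon startup forever.
func loadFilesRoot(ctx context.Context, dserv ipld.DAGService, root cid.Cid) (ipld.Node, error) {
	shortCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	nd, err := dserv.Get(shortCtx, root)
	if err != nil {
		log.Errorf("MFS root %s could not be loaded (%s); starting with an empty MFS root", root, err)
		return unixfs.EmptyDirNode(), nil
	}
	return nd, nil
}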

@RubenKelevra
Contributor

Well, the MFS is kind of crucial for many applications... so I'm not sure what the alternative to hanging is. I mean, we shouldn't delete it with a GC run. But if we've lost the root and cannot find it on the network, there's nothing we can usefully continue with.

Sure, the GC issue should be fixed. But if something else seems to have deleted the root, I think the safest thing would be to stop accepting local commands but keep searching for the root on the network (though the chances of finding it there are pretty slim).

@schomatis
Contributor

We're adding "replace MFS root" as an IPFS command in #8648. Any feedback welcome.

@EtDu

EtDu commented Feb 23, 2023

Adding to this: when running ipfs files ls /, I get

Error: could not build arguments for function "reflect".makeFuncStub (reflect/asm_amd64.s:14): failed to build *mfs.Root: received non-nil error from function "github.com/ipfs/go-ipfs/core/node".Files (github.com/ipfs/go-ipfs@v0.10.0/core/node/core.go:112): failure writing to dagstore: Forbidden: Forbidden
	status code: 403, request id: TQCHP91BZ63NR4JF, host id: lLlfnL8JgDxZHMe0Lq4/iYm74DofzTa2QX4w8zNJnMRHCq0+vphA5Q9plo3uqs0S6Ii0AR55eIg=
