Restarting node with invalid pin index takes excessive time #8149
Comments
Can't we just sync writes for the index? Alternatively, we could do some write-ahead logging, replaying on error, instead of setting a dirty bit and rebuilding on error.
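For illustration, here is a minimal sketch of what such write-ahead logging could look like for pin updates. The `pinLogEntry` type, file format, and replay logic are assumptions for the sketch, not the pinner's actual code:

```go
package pinwal

import (
	"bufio"
	"encoding/json"
	"os"
)

// pinLogEntry is a hypothetical WAL record: one line of JSON per pin/unpin.
type pinLogEntry struct {
	Op  string `json:"op"` // "pin" or "unpin"
	CID string `json:"cid"`
}

// pinLog appends entries to a log file and fsyncs before returning,
// so the on-disk log is never behind the in-memory index.
type pinLog struct {
	f *os.File
}

func openPinLog(path string) (*pinLog, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &pinLog{f: f}, nil
}

func (l *pinLog) Append(e pinLogEntry) error {
	b, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := l.f.Write(append(b, '\n')); err != nil {
		return err
	}
	return l.f.Sync()
}

// Replay re-applies logged entries after an unclean shutdown instead of
// rebuilding the entire index from scratch.
func Replay(path string, apply func(pinLogEntry) error) error {
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // no log yet, nothing to replay
		}
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e pinLogEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			break // truncated tail from a crash; stop replaying here
		}
		if err := apply(e); err != nil {
			return err
		}
	}
	return sc.Err()
}
```

The point of the sketch: each pin/unpin is appended and fsynced before the in-memory index changes, so after a crash the log can be replayed rather than setting a dirty bit and rebuilding the whole index.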
Three possible solutions:

(1) Shard keys into 100 independent instances of the pinner datastore. This addresses the wait issue, because index rebuilding is restricted to the one or few shards with corruption. Shards are smaller and can be reindexed in parallel, alleviating the long wait. This is a relatively quick hack. The downside is that re-indexing an entire shard is still suboptimal, and having to pre-configure the number of shards is awkward and may eventually require resharding facilities. (A rough sketch follows this list.)

(2) Use a relational database backend. This has the benefit of accommodating increasingly complex relational semantics, which are expected in upcoming projects/features (e.g. IPLD backlink/parent indexing). However, depending on a relational database is probably a distribution nightmare, so I am not sure this is an option. (I am not aware of good embedded relational databases for Go. Any ideas?)

(3) Rewrite the pinner to use a write-ahead log together with a state snapshot as its persistence strategy. This is the correct solution in the sense that it is lightweight (compared to a relational database) and still accommodates any relational semantics. It will essentially require rewriting the pinner datastore, but the upshot is that it becomes easy to add additional types of relational objects to the datastore, e.g. backlinks.
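For illustration, a rough sketch of how option (1) could shard pins across independent, namespaced sub-datastores by hashing the CID. The shard count, key layout, and helper names are assumptions, not an existing go-ipfs-pinner API:

```go
package pinshard

import (
	"fmt"
	"hash/fnv"

	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/namespace"
)

// numShards is fixed up front; changing it later would require resharding.
const numShards = 100

// shardFor picks a shard index from the pin's CID string, so corruption in
// one shard's index only forces reindexing of that shard.
func shardFor(cid string) int {
	h := fnv.New32a()
	h.Write([]byte(cid))
	return int(h.Sum32() % numShards)
}

// shards wraps a base datastore into numShards namespaced sub-datastores,
// e.g. /pins/shard-42/<cid>, each of which can be reindexed independently.
func shards(base ds.Datastore) []ds.Datastore {
	out := make([]ds.Datastore, numShards)
	for i := range out {
		out[i] = namespace.Wrap(base, ds.NewKey(fmt.Sprintf("/pins/shard-%d", i)))
	}
	return out
}
```

With this layout, a corrupted index only forces reindexing of the affected shard(s), and shards can be rebuilt in parallel; the fixed numShards is exactly the pre-configuration awkwardness noted above.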
In IPFS, the pinner is used with Badger, and 'ipfs add' invokes sync after pinning an entire IPLD tree. @Stebalien's suggestion reduces the chances that the process dies between a write and a sync (by orders of magnitude for most files and directories). This PR ipfs/go-ipfs-pinner#13 confirms that moving to a sync-on-every-pin strategy does not hurt happy-path performance much.
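A minimal sketch of the sync-on-every-pin idea using go-datastore's Sync method; the surrounding helper and key layout are hypothetical, and newer go-datastore releases also take a context.Context in Put and Sync:

```go
package pinsync

import (
	ds "github.com/ipfs/go-datastore"
)

// addPin is a hypothetical pinner helper: it writes the pin record and then
// syncs the datastore under the pins prefix, so at most the very last pin can
// be lost if the process dies, rather than a whole batch of them.
func addPin(dstore ds.Datastore, cid string, record []byte) error {
	key := ds.NewKey("/pins").ChildString(cid)
	if err := dstore.Put(key, record); err != nil {
		return err
	}
	// Flush immediately rather than once per whole IPLD tree.
	return dstore.Sync(ds.NewKey("/pins"))
}
```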
@petar how different is the approach in ipfs/go-ipfs-pinner#13 from what's happening in go-ipfs (i.e. IIRC we should already be calling flush externally after individual pin operations, such as https://github.com/ipfs/go-ipfs/blob/ef866a1400b3b2861e5e8b6cc9edc8633b890a0a/core/coreapi/dag.go#L29)? Doing this internally will help, but maybe not enough. An alternative idea that might work (thoughts @petar @Stebalien?): since my understanding is that the issues come from the non-atomic nature of updating the indices, we could insist that the pinner takes a datastore that understands transactions, which both LevelDB and Badger should support. What's been stopping us from doing this so far is that indirection, such as the mount datastore https://github.com/ipfs/go-datastore/blob/4ee0f58273906f63b05ea8957d9133a31586e881/mount/mount.go#L66, generally stops us from being able to make assertions like
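For illustration, a sketch of the transactional idea discussed above: assert that the pinner's datastore supports go-datastore's TxnDatastore interface and update the pin record together with its index entry in one transaction. The key names and helper are hypothetical, and the context-free signatures match older go-datastore releases:

```go
package pintxn

import (
	"errors"

	ds "github.com/ipfs/go-datastore"
)

// pinAtomically is a hypothetical helper: it requires a transaction-capable
// datastore and commits the pin record together with its index entry, so a
// crash can never leave the index out of step with the pins.
func pinAtomically(dstore ds.Datastore, cid string, record []byte) error {
	txnDs, ok := dstore.(ds.TxnDatastore)
	if !ok {
		return errors.New("pinner requires a datastore with transaction support")
	}
	txn, err := txnDs.NewTransaction(false)
	if err != nil {
		return err
	}
	defer txn.Discard()

	if err := txn.Put(ds.NewKey("/pins").ChildString(cid), record); err != nil {
		return err
	}
	if err := txn.Put(ds.NewKey("/index/recursive").ChildString(cid), []byte{1}); err != nil {
		return err
	}
	return txn.Commit()
}
```

Indirection layers like the mount datastore are exactly what make the type assertion at the top of this sketch unreliable in practice.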
2021-07-15 status:
Related: ipfs/go-ipfs-pinner#15
2021-07-20 status:
Exit criteria: as long as it's not breaking anything, we'll merge. Assuming no problems the week of 2021-07-26, we'll merge.
Version information:
Appears to affect all versions starting with 0.8.0
Description:
Restarting a node that has an invalid pin index takes excessive time before the node is operational.
On a node with a large number of pins (1.4M in the observed case), it took almost 90 minutes for the node to come online while it rebuilt its recursive pin indexes.
The logs contain this error message, after which everything begins working (to see this message, set the log level to info: GOLOG_LOG_LEVEL=info):
Notes
Likely approaches to fixing this include: