
Restarting node with invalid pin index takes excessive time #8149

Closed
gammazero opened this issue May 20, 2021 · 7 comments · Fixed by #8231
Assignees
gammazero
Labels
P1 High: Likely tackled by core team if no one steps up
Milestone
go-ipfs 0.10

Comments

@gammazero
Contributor

Version information:

  • 0.9.0-rc1-1c2fc6b
  • 0.8.0-48f94e2

This appears to affect all versions starting with 0.8.0.

Description:

Restarting a node that has an invalid pin index takes excessive time before the node becomes operational.

On a node with a large number of pins (1.4M in the observed case), it took almost 90 minutes for the node to come online because it first had to rebuild its recursive pin indexes.

The logs contain the following message, after which everything begins working:

21-05-19T23:38:22.749875+00:00 localhost ipfs[1756272]: 2021-05-19T23:38:22.749Z#011INFO#011pin#011dspinner/pin.go:709#011invalid recursive indexes detected - rebuilt

To see this message, set the log level to info: GOLOG_LOG_LEVEL=info

Notes

Likely approaches to fixing this include:

  • Improve the pinner to reduce the possibility of an invalid index (or shrink the part that is invalid), or to make repair much more efficient.
  • Use separate datastores for blocks and for non-blocks, so that sync-writes can be turned off for blocks but turned on for non-blocks (this mimics the FlatFS + LevelDB behavior).
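
As a rough illustration of the second note, a mount-style split could route block data and everything else to differently-configured datastores. The Go sketch below is an assumption-laden illustration (the paths, sub-datastore layout, and Badger option wiring are invented here), not go-ipfs's actual configuration code:

```go
package main

import (
	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/mount"
	badger "github.com/ipfs/go-ds-badger"
)

// openSplitDatastore is a hypothetical helper: it opens one Badger instance
// for block data with sync writes disabled (throughput over durability) and a
// second instance for everything else, including pin metadata, with sync
// writes enabled (durability over throughput), then mounts both under one
// datastore.
func openSplitDatastore(repoPath string) (ds.Datastore, error) {
	blockOpts := badger.DefaultOptions
	blockOpts.SyncWrites = false
	blocks, err := badger.NewDatastore(repoPath+"/blocks", &blockOpts)
	if err != nil {
		return nil, err
	}

	metaOpts := badger.DefaultOptions
	metaOpts.SyncWrites = true
	meta, err := badger.NewDatastore(repoPath+"/meta", &metaOpts)
	if err != nil {
		return nil, err
	}

	// Route /blocks to the throughput-oriented store and all other keys
	// (e.g. /pins) to the durable store, mimicking FlatFS + LevelDB.
	return mount.New([]mount.Mount{
		{Prefix: ds.NewKey("/blocks"), Datastore: blocks},
		{Prefix: ds.NewKey("/"), Datastore: meta},
	}), nil
}
```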
@gammazero gammazero added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization and removed kind/bug A bug in existing code (including security flaws) labels May 20, 2021
@BigLep BigLep modified the milestones: go-ipfs 0.9, go-ipfs 0.10 May 20, 2021
@Stebalien
Member

Stebalien commented May 24, 2021 via email

@petar
Contributor

petar commented Jun 23, 2021

Three possible solutions:

(1) Shard keys into 100 independent instances of the pinner datastore. This addresses the wait issue because index rebuilding is restricted to the one or few shards with corruption. Shards are smaller and can be reindexed in parallel, alleviating the long wait. This is a relatively quick hack. The downside is that re-indexing an entire shard is still suboptimal, and having to pre-configure the number of shards is awkward and may eventually require resharding facilities.

(2) Use a relational database backend. This has the benefit of accommodating increasingly complex relational semantics, which are expected in upcoming projects/features (e.g. ipld backlink/parent indexing). However, the dependence on a relational database is probably a distribution nightmare, so I am not sure this is an option. (I am not aware of good embedded relational databases for Go. Any ideas?)

(3) Rewrite the pinner to use a write-ahead log together with a state snapshot as its persistence strategy. This is the correct solution in the sense that it is lightweight (compared to a relational database) and still accommodates any relational semantics. It requires essentially rewriting the pinner datastore, but the upshot is that it becomes easy to add additional types of relational objects to the datastore, e.g. backlinks.
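
For concreteness, a minimal sketch of option (1) might look like the following. The shard count, key hashing, and the context-free go-datastore API are assumptions made for illustration, not a proposed implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"

	ds "github.com/ipfs/go-datastore"
)

const numShards = 100

// shardFor deterministically maps a pin key to one of the shards.
func shardFor(k ds.Key) int {
	h := fnv.New32a()
	h.Write([]byte(k.String()))
	return int(h.Sum32()) % numShards
}

// putPin writes a pin entry into the shard that owns its key; each shard can
// be verified and re-indexed independently (and in parallel) on startup, so a
// corrupted index only forces a rebuild of the affected shard.
func putPin(shards []ds.Datastore, k ds.Key, v []byte) error {
	if len(shards) != numShards {
		return fmt.Errorf("expected %d shards, got %d", numShards, len(shards))
	}
	return shards[shardFor(k)].Put(k, v)
}
```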

@petar
Contributor

petar commented Jun 29, 2021

In IPFS, the pinner is used with badger, and 'ipfs add' invokes sync after pinning an entire IPLD tree. @Stebalien's suggestion reduces the chances that the process dies between a write and a sync (by orders of magnitude for most files and directories). This PR ipfs/go-ipfs-pinner#13 confirms that moving to a sync-on-every-pin strategy does not hurt happy-path performance much.
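
A minimal sketch of what "sync on every pin" means at the datastore level, assuming a /pins key prefix and the context-free go-datastore API (this is illustrative only, not the actual ipfs/go-ipfs-pinner#13 change):

```go
package main

import (
	ds "github.com/ipfs/go-datastore"
)

// addPinSynced writes a pin entry and immediately flushes it, so a crash
// cannot leave the pin set and its indexes out of step.
func addPinSynced(dstore ds.Datastore, pinKey ds.Key, pinData []byte) error {
	if err := dstore.Put(pinKey, pinData); err != nil {
		return err
	}
	// Sync the whole /pins prefix so the pin entry and any index entries under
	// it are durable before the pin is reported as successful.
	return dstore.Sync(ds.NewKey("/pins"))
}
```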

@aschmahmann
Contributor

@petar how different is the approach in ipfs/go-ipfs-pinner#13 from what's happening in go-ipfs (i.e. IIRC we should be calling flush after individual pin operations externally such as https://github.com/ipfs/go-ipfs/blob/ef866a1400b3b2861e5e8b6cc9edc8633b890a0a/core/coreapi/dag.go#L29)? Doing this internally will help, but maybe not enough.

An alternative idea that might work (thoughts @petar @Stebalien?): since my understanding is that the issues come from the non-atomic nature of updating the indices, we could insist that the pinner takes a datastore that understands transactions, which both LevelDB and Badger should support.

What has been stopping us from doing this so far is that indirection such as the mount datastore https://github.com/ipfs/go-datastore/blob/4ee0f58273906f63b05ea8957d9133a31586e881/mount/mount.go#L66 generally stops us from being able to make assertions like datastore.(TxnDatastore). However, given that all pins are now under /pins (instead of in the blockstore), we should be able to add a function to the mount datastore like GetInnerDatastore(path), and then assert that either the Datastore is a TxnDatastore, or that it supports GetInnerDatastore(path) and the returned datastore supports transactions, before we pass it into the Pinner.
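
A hedged sketch of that assertion-plus-transaction flow, assuming the context-free go-datastore API. The GetInnerDatastore helper proposed above is hypothetical and not shown; the sketch only covers the direct TxnDatastore assertion and the atomic update it would enable:

```go
package main

import (
	"fmt"

	ds "github.com/ipfs/go-datastore"
)

// atomicPinUpdate asserts that the pin datastore supports transactions, then
// updates a pin entry and its index entry atomically.
func atomicPinUpdate(d ds.Datastore, pinKey, indexKey ds.Key, pin, index []byte) error {
	txnDs, ok := d.(ds.TxnDatastore)
	if !ok {
		return fmt.Errorf("pin datastore does not support transactions")
	}
	txn, err := txnDs.NewTransaction(false)
	if err != nil {
		return err
	}
	defer txn.Discard()

	if err := txn.Put(pinKey, pin); err != nil {
		return err
	}
	if err := txn.Put(indexKey, index); err != nil {
		return err
	}
	// Either both writes are committed or neither is, so the recursive index
	// can never refer to a pin that was not persisted (or vice versa).
	return txn.Commit()
}
```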

@BigLep BigLep linked a pull request Jul 1, 2021 that will close this issue
@gammazero gammazero self-assigned this Jul 13, 2021
@BigLep
Contributor

BigLep commented Jul 15, 2021

2021-07-15 status:

  1. feat: add 'ipfs multibase' commands #8180 has been deployed to PL infra. The next step is for @petar to do a release of https://github.com/ipfs/go-ipfs-pinner (we need to figure out the right version number: v0.1.2 or v0.2.0).
  2. We need to check in on the work @gammazero has been doing to see if we should still proceed, given that feat: add 'ipfs multibase' commands #8180 appears to be working.

@aschmahmann aschmahmann added P1 High: Likely tackled by core team if no one steps up and removed need/triage Needs initial labeling and prioritization labels Jul 19, 2021
@BigLep
Contributor

BigLep commented Jul 20, 2021

Related: ipfs/go-ipfs-pinner#15

@BigLep
Contributor

BigLep commented Jul 20, 2021

2021-07-20 status:

  1. We've given instructions to George to deploy Fix/minimize rebuild go-ipfs-pinner#15
  2. If pinner deployment goes well, deploy to all hosts in our cluster.

Exit criteria: as long as it's not breaking anything, we'll merge.

Assuming no problems the week of 2021-07-26, we'll merge.
