Skip to content

Latest commit

 

History

History
99 lines (88 loc) · 5.64 KB

pruning.md

File metadata and controls

99 lines (88 loc) · 5.64 KB

Pruning deployments

Subgraphs, by default, store a full version history for entities, allowing consumers to query the subgraph as of any historical block. Pruning is an operation that deletes entity versions from a deployment older than a certain block, so it is no longer possible to query the deployment as of prior blocks. In GraphQL, those are only queries with a constraint block { number: <n> } } or a similar constraint by block hash where n is before the block to which the deployment is pruned. Queries that are run at a block height greater than that are not affected by pruning, and there is no difference between running these queries against an unpruned and a pruned deployment.

Because pruning reduces the amount of data in a deployment, it reduces the amount of storage needed for that deployment, and is beneficial for both query performance and indexing speed. Especially compared to the default of keeping all history for a deployment, it can often reduce the amount of data for a deployment by a very large amount and speed up queries considerably. See caveats below for the downsides.

The block b to which a deployment is pruned is controlled by how many blocks history_blocks of history to retain; b is calculated internally using history_blocks and the latest block of the deployment when the prune operation is performed. When pruning finishes, it updates the earliest_block for the deployment. The earliest_block can be retrieved through the index-node status API, and graph-node will return an error for any query that tries to time-travel to a point before earliest_block. The value of history_blocks must be greater than ETHEREUM_REORG_THRESHOLD to make sure that reverts can never conflict with pruning.

Pruning is started by running graphman prune. That command will perform an initial prune of the deployment and set the subgraph's history_blocks setting which is used to periodically check whether the deployment has accumulated more history than that. Whenever the deployment does contain more history than that, the deployment is automatically repruned. If ongoing pruning is not desired, pass the --once flag to graphman prune. Ongoing pruning can be turned off by setting history_blocks to a very large value with the --history flag.

Repruning is performed whenever the deployment has more than history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR blocks of history. The environment variable GRAPH_STORE_HISTORY_SLACK_FACTOR therefore controls how often repruning is performed: with GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5 and history_blocks set to 10,000, a reprune will happen every 5,000 blocks. After the initial pruning, a reprune therefore happens every history_blocks * (1 - GRAPH_STORE_HISTORY_SLACK_FACTOR) blocks. This value should be set high enough so that repruning occurs relatively infrequently to not cause too much database work.

Pruning uses two different strategies for how to remove unneeded data: rebuilding tables and deleting old entity versions. Deleting old entity versions is straightforward: this strategy deletes rows from the underlying tables. Rebuilding tables will copy the data that should be kept from the existing tables into new tables and then replaces the existing tables with these much smaller tables. Which strategy to use is determined for each table individually, and governed by the settings for GRAPH_STORE_HISTORY_REBUILD_THRESHOLD and GRAPH_STORE_HISTORY_DELETE_THRESHOLD, both numbers between 0 and 1: if we estimate that we will remove more than REBUILD_THRESHOLD of the table, the table will be rebuilt. If we estimate that we will remove a fraction between REBUILD_THRESHOLD and DELETE_THRESHOLD of the table, unneeded entity versions will be deleted. If we estimate to remove less than DELETE_THRESHOLD, the table is not changed at all. With both strategies, operations are broken into batches that should each take GRAPH_STORE_BATCH_TARGET_DURATION seconds to avoid causing very long-running transactions.

Pruning, in most cases, runs in parallel with indexing and does not block it. When the rebuild strategy is used, pruning does block indexing while it copies non-final entities from the existing table to the new table.

The initial prune started by graphman prune prints a progress report on the console. For the ongoing prune runs that are periodically performed, the following information is logged: a message Start pruning historical entities which includes the earliest and latest block, a message Analyzed N tables, and a message Finished pruning entities with details about how much was deleted or copied and how long that took. Pruning analyzes tables, if that seems necessary, because its estimates of how much of a table is likely not needed are based on Postgres statistics.

Caveats

Pruning is a user-visible operation and does affect some of the things that can be done with a deployment:

  • because it removes history, it restricts how far back time-travel queries can be performed. This will only be an issue for entities that keep lifetime statistics about some object (e.g., a token) and are used to produce time series: after pruning, it is only possible to produce a time series that goes back no more than history_blocks. It is very beneficial though for entities that keep daily or similar statistics about some object as it removes data that is not needed once the time period is over, and does not affect how far back time series based on these objects can be retrieved.
  • it restricts how far back a graft can be performed. Because it removes history, it becomes impossible to graft more than history_blocks before the current deployment head.