Add usage-aware garbage collection #8870

Open · 3 tasks done
iand opened this issue Apr 11, 2022 · 19 comments

@iand (Contributor) commented Apr 11, 2022

Checklist

  • My issue is specific & actionable.
  • I am not suggesting a protocol enhancement.
  • I have searched on the issue tracker for my issue.

Description

When operating as an HTTP gateway, go-ipfs should adopt a garbage collection strategy that retains blocks that improve gateway request times.

The hypothesis is that blocks are being deleted from the blockstore by garbage collection only to be re-requested later, which is much slower. Garbage collection is required to keep within disk space limits, but the current implementation is unaware of the usage patterns of unpinned blocks.

A crude experiment demonstrates the effect. The same URL (/ipfs/QmVoe22tWS8JqfrMHj5WTHJ6zWrWqTL3YeG5uFQmAH6zhs/182.png) was requested every 30 seconds, with a manual GC run every few minutes (using `ipfs repo gc`).

Timings (elapsed time, connection time, time to start transfer, total time):
0,conn:0.092357,ttfb:9.629604,total:10.343198   
31,conn:0.092979,ttfb:0.190162,total:0.466788
62,conn:0.103032,ttfb:0.210427,total:0.517554
92,conn:0.103929,ttfb:0.212191,total:0.525963
123,conn:0.102814,ttfb:0.210029,total:0.519017
154,conn:0.093053,ttfb:2.338450,total:2.620145  <-- after GC
186,conn:0.093665,ttfb:0.191273,total:0.473441
217,conn:0.093083,ttfb:0.191259,total:0.468096
247,conn:0.103448,ttfb:0.211765,total:0.520322
278,conn:0.093548,ttfb:0.191549,total:0.469947
308,conn:0.103633,ttfb:0.212855,total:0.522967
339,conn:0.092336,ttfb:0.190009,total:0.466366
369,conn:0.093359,ttfb:0.191401,total:0.469183
400,conn:0.093163,ttfb:0.191280,total:0.473141
430,conn:0.093641,ttfb:0.200777,total:0.478957
461,conn:0.103439,ttfb:0.212652,total:0.520436
491,conn:0.102797,ttfb:0.210385,total:0.518253
522,conn:0.103191,ttfb:0.212262,total:0.522594
552,conn:0.092788,ttfb:0.190656,total:0.468532
583,conn:0.092756,ttfb:1.081157,total:1.358370   <-- after GC 
614,conn:0.104042,ttfb:0.217775,total:0.527639
645,conn:0.092954,ttfb:0.190958,total:0.472610
675,conn:0.094487,ttfb:0.193336,total:0.474542
706,conn:0.093254,ttfb:63.240915,total:63.519715  <-- after GC
799,conn:0.103447,ttfb:0.211192,total:0.519018
830,conn:0.102939,ttfb:2.021556,total:2.327899   <-- after GC 
862,conn:0.103380,ttfb:0.211327,total:0.520165
893,conn:0.103518,ttfb:0.211405,total:0.520424
923,conn:0.092240,ttfb:0.189054,total:0.467218
954,conn:0.093287,ttfb:0.190607,total:0.467807
984,conn:0.103094,ttfb:2.554306,total:2.862080    <-- after GC
1017,conn:0.103841,ttfb:0.212136,total:0.522242
1048,conn:0.092973,ttfb:0.190636,total:0.470584

The request times at 154, 583, 830 and 984, immediately after a garbage collection run, show roughly a tenfold increase in time to first byte.
The timing at 706 is much higher still, indicating that the serving peer had probably dropped from the node's peer list.
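
For context, a minimal sketch of how these timings could be reproduced in Go with net/http/httptrace. The CID and the 30-second interval come from the experiment above; the gateway address, port and output format are illustrative assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"time"
)

// timeRequest fetches url once and reports connection time, time to first byte
// and total transfer time, roughly matching the columns in the table above.
func timeRequest(url string) (conn, ttfb, total time.Duration, err error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return
	}
	start := time.Now()
	var connected, firstByte time.Time
	trace := &httptrace.ClientTrace{
		GotConn:              func(httptrace.GotConnInfo) { connected = time.Now() },
		GotFirstResponseByte: func() { firstByte = time.Now() },
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the body so "total" covers the whole transfer
	return connected.Sub(start), firstByte.Sub(start), time.Since(start), nil
}

func main() {
	// Hypothetical local gateway; the CID is the one used in the experiment.
	url := "http://127.0.0.1:8080/ipfs/QmVoe22tWS8JqfrMHj5WTHJ6zWrWqTL3YeG5uFQmAH6zhs/182.png"
	for start := time.Now(); ; time.Sleep(30 * time.Second) {
		conn, ttfb, total, err := timeRequest(url)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d,conn:%f,ttfb:%f,total:%f\n",
			int(time.Since(start).Seconds()), conn.Seconds(), ttfb.Seconds(), total.Seconds())
	}
}
```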

The default garbage collection strategy is mark and sweep (sketched below):

  • mark all:
    • recursively pinned blocks, plus all of their descendants
    • bestEffortRoots, plus all of their descendants
    • directly pinned blocks
    • blocks used internally by the pinner
  • then iterate over every block in the blockstore and delete any block not found in the marked set

bestEffortRoots, by default, only includes the MFS root.
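
For orientation, a minimal sketch of that strategy. The `Blockstore` and `Pinner` interfaces and the `walk` function here are simplified stand-ins, not the actual go-ipfs types:

```go
package gcsketch

import (
	"context"

	"github.com/ipfs/go-cid"
)

// Minimal stand-in interfaces for this sketch; the real go-ipfs types differ.
type Blockstore interface {
	AllKeysChan(ctx context.Context) (<-chan cid.Cid, error)
	DeleteBlock(ctx context.Context, c cid.Cid) error
}

type Pinner interface {
	RecursiveKeys() []cid.Cid
	DirectKeys() []cid.Cid
	InternalPins() []cid.Cid
}

// walkFn visits a root and all of its descendants; the actual link traversal
// via the DAG service is omitted here.
type walkFn func(ctx context.Context, root cid.Cid, visit func(cid.Cid)) error

func markAndSweep(ctx context.Context, bs Blockstore, pinner Pinner, bestEffortRoots []cid.Cid, walk walkFn) error {
	marked := cid.NewSet()
	visit := func(c cid.Cid) { marked.Add(c) }

	// Mark: recursively pinned roots and all of their descendants.
	for _, root := range pinner.RecursiveKeys() {
		if err := walk(ctx, root, visit); err != nil {
			return err
		}
	}
	// Mark: best-effort roots (by default only the MFS root), ignoring missing links.
	for _, root := range bestEffortRoots {
		_ = walk(ctx, root, visit)
	}
	// Mark: directly pinned blocks and blocks used internally by the pinner.
	for _, c := range pinner.DirectKeys() {
		marked.Add(c)
	}
	for _, c := range pinner.InternalPins() {
		marked.Add(c)
	}

	// Sweep: delete every block not found in the marked set.
	keys, err := bs.AllKeysChan(ctx)
	if err != nil {
		return err
	}
	for c := range keys {
		if !marked.Has(c) {
			if err := bs.DeleteBlock(ctx, c); err != nil {
				return err
			}
		}
	}
	return nil
}
```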

Proposal

A cache-aware garbage collection algorithm would assign a cost to blocks derived from the frequency of access and the time taken to fetch them from peers. The algorithm then deletes the blocks with the least cost until sufficient space has been freed.

For space efficiency we may want to record metrics only for the roots of DAGs, i.e. directories and files. In that case the garbage collector would treat all descendant blocks of a root as a unit: either all are deleted together or none are. The final scoring formula will probably want to factor in the size of the DAG.
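
To make the intent concrete, a small sketch of the eviction step. `RootInfo`, the cost value and `deleteDAG` are placeholders; the actual scoring function is discussed later in this thread:

```go
package gcsketch

import (
	"sort"

	"github.com/ipfs/go-cid"
)

// RootInfo is a hypothetical per-root record kept by a usage-aware GC.
type RootInfo struct {
	Root      cid.Cid
	Cost      float64 // derived from access frequency and fetch delay
	TotalSize uint64  // size of the root block plus all of its descendants
}

// evictLowestCost deletes whole DAGs, cheapest first, until at least `target`
// bytes have been freed. deleteDAG stands in for removing a root and all of
// its descendant blocks as a unit.
func evictLowestCost(roots []RootInfo, target uint64, deleteDAG func(cid.Cid) error) error {
	sort.Slice(roots, func(i, j int) bool { return roots[i].Cost < roots[j].Cost })
	var freed uint64
	for _, r := range roots {
		if freed >= target {
			break
		}
		if err := deleteDAG(r.Root); err != nil {
			return err
		}
		freed += r.TotalSize
	}
	return nil
}
```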

Outline of proposed approach:

  • Phase 1
    • Roughly instrument block system to record the mean request rate of a block and the mean time to request the block from the network
    • Evaluate metrics against request patterns derived from ipfs.io to determine whether to proceed to phase 2
  • Phase 2
    • Define an efficient representation for block request metrics
    • Extend go-ipfs to enable configurable garbage collection strategies
    • Implement cost-based garbage collection strategy
    • Enable cost-based garbage collection for go-ipfs being operated as a gateway

Related Work

iand added the kind/enhancement (a net-new feature or improvement to an existing feature) label on Apr 11, 2022
iand self-assigned this on Apr 11, 2022
@aschmahmann (Contributor):

> Roughly instrument block system to record the mean request rate of a block and the mean time to request the block from the network

Hopefully this should be easy enough to do using some combination of a custom blockstore wrapper and the contexts passed through requests.

> Extend go-ipfs to enable configurable garbage collection strategies

This might be tricky. In particular, most proposals for different GC strategies involve tracking more state. While this is fine, it likely means we have to do one of:

  1. Track and modify state for each GC type that requires it
  2. Have a single type of state generic enough to cover the various strategies
  3. Only allow users to change GC strategy with some data migration (i.e. similar in nature to changing backing datastores)

I don't know off-hand if any of these are particularly palatable and it may depend on the types of GC strategies that get built.

I'm highly supportive of making new GC strategies easier to work with at the library layer though so each application can choose its own strategy. My thinking is once we've got that we can figure out if/how to move that level of choice into the binary.

@RubenKelevra (Contributor) commented Apr 14, 2022

Yeah, you cannot make cache decisions without holding additional data to base them on. The current "flat" write-out of the key-value store isn't enough for that (short of very costly traversal of all data structures from the pin level down).

So the current mark-and-sweep approach is not only unsafe (it's not thread-safe), but also the worst in terms of performance.

The proposal I mentioned is a bit older, but I still think it's a viable choice going forward. It not only tracks usage by pins per chunk with a ref counter, but also allows a mixed time/usage eviction like ARC does (which is sadly patented). It would also eliminate the performance issues of some other approaches, as it is lock-free: it only requires receiving messages about the actions of the rest of the program.

@guseggert (Contributor) commented Apr 14, 2022

Could we address the use case mentioned by @iand by considering only access time? E.g. mount the flatfs datastore with the strictatime flag, then GC with `find ~/.ipfs/blocks -type f -name '*.data' -amin +60 | xargs rm`. It wouldn't consider pinned/MFS blocks, but it would work fine for read-only gateways. Some variations include using the utime/utimensat/futimens syscalls (or equivalents such as os.Chtimes()) when accessing blocks to avoid needing a special mount, adding a way to use a separate datastore for pinned blocks, using xattrs, etc. (these are a lot easier than tracking our own GC state).
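
For the os.Chtimes() variation, a tiny sketch of what bumping a block's timestamps on access could look like (the flatfs path in the comment is illustrative, not a real repo layout guarantee):

```go
import (
	"os"
	"time"
)

// touchBlock bumps the access and modification time of a block's backing file,
// so that a later find/rm sweep treats it as recently used. path would be the
// flatfs file for the block, e.g. something like ~/.ipfs/blocks/XX/....data.
func touchBlock(path string) error {
	now := time.Now()
	return os.Chtimes(path, now, now)
}
```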

@RubenKelevra (Contributor):

@guseggert well, this would be least recently used (LRU), and that's one of the worst choices.

It would also neither limit the maximum amount of storage used (as we stupidly drop everything not touched within an hour) nor work for "normal" IPFS usage, where files are stored.

But the main issue is that most content is probably not reused within an hour, but more like several hours to several days later. We need to be able to track accesses over longer periods of time to make a somewhat optimal decision.

@guseggert (Contributor):

Yes, it's not optimal for all use cases. But for some use cases it would work well enough, such as gateways, which is the use case described in this issue.

@RubenKelevra (Contributor):

I doubt that it would work for the use case. As I said, it might be able to catch some very frequently accessed blocks that way, but most content will be discarded before reuse.

That's also not what's described in this issue as a possible solution:

> A cache-aware garbage collection algorithm would assign a cost to blocks derived from the frequency of access and the time taken to fetch them from peers. The algorithm then deletes the blocks with the least cost until sufficient space has been freed.

@guseggert (Contributor) commented Apr 19, 2022

I have used LRU eviction for this kind of caching in the past, and it works. It is easy to extend the strategy to e.g. only evict the oldest entries until the cache no longer exceeds some size, and then tweak the cache size and eviction threshold to adjust the hit rate for your traffic patterns. Most importantly, it is easy to implement outside of go-ipfs with a script and a cron job, so it is a path forward now for a common use case, without having to build stateful GC into go-ipfs (which is unquestionably more optimal but has a much higher engineering cost).
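
A rough sketch of such an external eviction pass in Go. The blocks path, the size budget, and the use of ModTime as a proxy for access time are assumptions; a real script would read atime from the platform-specific Stat_t, or rely on the touch-on-access idea above:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// evict removes .data files, oldest first, until the blockstore fits within maxBytes.
func evict(blocksDir string, maxBytes int64) error {
	type entry struct {
		path string
		size int64
		mod  int64
	}
	var entries []entry
	var total int64
	err := filepath.Walk(blocksDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".data") {
			return err
		}
		entries = append(entries, entry{path, info.Size(), info.ModTime().UnixNano()})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}
	// Oldest (least recently touched) first.
	sort.Slice(entries, func(i, j int) bool { return entries[i].mod < entries[j].mod })
	for _, e := range entries {
		if total <= maxBytes {
			break
		}
		if err := os.Remove(e.path); err != nil {
			return err
		}
		total -= e.size
	}
	return nil
}

func main() {
	// Illustrative: 10 GiB budget for the default flatfs blocks directory.
	if err := evict(os.ExpandEnv("$HOME/.ipfs/blocks"), 10<<30); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```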

@hsn10 commented Apr 24, 2022

Tracking will add storage overhead, mostly write operations, and those are slow. It's probably viable to store the LRU tracking information in memory and do periodic batch dumps; then, during GC, load all the dumps, delete data, delete the tracking dumps and write new tracking information.

I don't think this feature is needed. Important data is pinned, and ad-hoc data is not important enough in the long term to justify the additional overhead.

@iand (Contributor, Author) commented Apr 28, 2022

@guseggert the problem with LRU is that it doesn't account for the cost of refilling the cache, which is usually a non-issue in filesystems and databases. In a distributed system like IPFS the fill cost is highly variable, so I want to factor that into the algorithm to meet my goal of reducing time to first byte for the gateway.

A good reference on some of the issues here is LNC-R-WS-U (citeseer or dweb.link). In that paper the authors define a delay savings ratio (DSR) which represents the fraction of communication delay saved by satisfying requests from the cache. Ignoring cache revalidation (not meaningful in an immutable system), it's defined as:

DSR = sum over i (d_i * h_i) / sum over i (d_i * f_i)

where:

  • d_i is the delay to fetch document i into the cache
  • h_i is the number of references to document i which were satisfied from the cache
  • f_i is the total number of references to document i

To maximise the DSR while also maintaining a fixed cache size, they assign a profit to each document and then retain the top N documents that fit within the defined cache size. The profit function is:

p_i = r_i * d_i / s_i

where:

  • r_i is the mean reference rate for document i
  • d_i is the mean delay to fetch document i into the cache
  • s_i is the size of document i

The reference rate and mean delay can be calculated from a sliding window of three values for the time last requested and the time taken to fetch. (The paper has a section analysing the size of the sliding window and finds that 3 is sufficient.)

At the moment I think it's best to limit this caching algorithm to roots rather than individual blocks. The accounting overhead is then six values per root (3 timestamps and 3 durations), and we need to persist this so it isn't lost between restarts.
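
A small sketch of how that per-root accounting could feed the profit function from the paper. The field names and the exact rate estimate are illustrative, not a committed design:

```go
import "time"

// rootStats is the per-root accounting described above: the last three request
// times, the last three fetch durations, and the total size of the DAG.
type rootStats struct {
	RequestTimes [3]time.Time     // most recent first
	FetchDelays  [3]time.Duration // most recent first
	Size         int64
}

// profit computes p_i = r_i * d_i / s_i, where r_i is the mean reference rate
// estimated from the sliding window, d_i the mean fetch delay, and s_i the size.
func profit(s rootStats, now time.Time) float64 {
	// Mean reference rate: 3 requests over the time spanned since the 3rd most
	// recent request (the paper's K/(T - T_K) estimate with K = 3).
	window := now.Sub(s.RequestTimes[2])
	if window <= 0 {
		window = time.Second
	}
	r := 3.0 / window.Seconds()

	// Mean fetch delay over the sliding window.
	var d float64
	for _, fd := range s.FetchDelays {
		d += fd.Seconds()
	}
	d /= 3

	return r * d / float64(s.Size)
}
```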

@yiannisbot (Member):

Two thoughts on this line of work:

  • With a long enough trace from the gateways, we should be able to find the request distribution. With that, the amount of cache storage we have available and the request rate, we should be able to estimate how long it makes sense for items to be stored before they're GC'ed.
  • It would be great, I believe, to be able to cross-check requests received between gateways. This would increase our sample of what is popular across the gateways and not just locally. I don't think it would be too much extra overhead to share the list of CIDs received with other gateways. Is that implicitly done through Bitswap requests/wantlists, by any chance? We could build several types of "collaborative cluster caching" mechanisms with that info.

@RubenKelevra (Contributor) commented Jun 16, 2022

@yiannisbot wrote:

> Two thoughts on this line of work:
>
> • With a long enough trace from the gateways, we should be able to find the request distribution. With that, the amount of cache storage we have available and the request rate, we should be able to estimate how long it makes sense for items to be stored before they're GC'ed.
>
> • It would be great, I believe, to be able to cross-check requests received between gateways. This would increase our sample of what is popular across the gateways and not just locally. I don't think it would be too much extra overhead to share the list of CIDs received with other gateways. Is that implicitly done through Bitswap requests/wantlists, by any chance? We could build several types of "collaborative cluster caching" mechanisms with that info.

Got an idea on that.

We don't need to store the blocks for each request on each node (gateway in this case). We can decide fully locally which blocks to keep and which not:

  • We store a metric of how often a block was used in the past, in the LRU+hit-count approach I laid out in [Draft] A cache sweeper for kubo (go-ipfs) notes#428. TL;DR: a number ticks down over time, while a request pushes it back up. The starting point is 50% of the range; if it reaches 0 we plan to drop the block (a minimal sketch follows this list).
  • If we need more storage (storage pressure from new operations) we can drop from the lower end of this list.
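
A minimal sketch of that counter, with placeholder values for the range, the bump per hit and the decay step; the actual parameters are in ipfs/notes#428:

```go
// hitScore is an illustrative version of the tick-down counter: it decays over
// time and is pushed up on every request; a block whose score reaches 0 becomes
// a candidate for dropping.
type hitScore struct {
	value uint8 // 0..255; starts at 128 (50% of the range)
}

func newHitScore() hitScore { return hitScore{value: 128} }

// tick is called once per decay interval for every tracked block.
func (h *hitScore) tick() {
	if h.value > 0 {
		h.value--
	}
}

// hit is called whenever the block is requested; it saturates at the top of the range.
func (h *hitScore) hit() {
	if h.value <= 255-8 {
		h.value += 8
	} else {
		h.value = 255
	}
}

// droppable reports whether the block has decayed to the bottom of the range.
func (h *hitScore) droppable() bool { return h.value == 0 }
```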

Distributed way

I planned a kind of "seldom" storage section for blocks which we wanted to drop but, after asking the DHT, found no other copies of. This way we don't drop the last copy of data in the network. While I still think that could be a good idea for low-throughput nodes, it's not a good idea for high-throughput nodes.

High-throughput nodes would use a different mode: DHT-like probabilistic caching.

  • Calculate the distance between the Node-ID and the CID (like we do for the DHT).
  • If the distance is below a high threshold (like 10%) we keep it.
  • If we reach the storage limit, we drop the CIDs with the greatest distance first.

This avoids any communication between nodes, but on the other hand makes it predictable where a block may be stored. It would allow sorting the neighbouring nodes we send Bitswap requests to by the distance between the CID and their Node-IDs, to minimise the number of queries we have to make on average to fetch a block before we resort to using the DHT.
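
A sketch of that selection rule using Kademlia-style XOR distance between the node's digest and the CID's digest. The 10% threshold is approximated by the most significant byte of the distance; all names and thresholds here are placeholders:

```go
package probcache

import (
	"bytes"
	"sort"
)

// xorDistance returns the Kademlia-style XOR distance between two equal-length
// digests (e.g. the SHA-256 digests underlying the peer ID and the CID).
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// keep reports whether a block is "close enough" to this node to be cached:
// roughly the nearest 10% of the keyspace, approximated by the most significant
// byte of the distance (26/256 ≈ 10%).
func keep(nodeDigest, cidDigest []byte) bool {
	return xorDistance(nodeDigest, cidDigest)[0] < 26
}

// evictFarthestFirst sorts candidate digests by descending distance from the
// node, so that under storage pressure the most distant blocks are dropped first.
func evictFarthestFirst(nodeDigest []byte, cidDigests [][]byte) {
	sort.Slice(cidDigests, func(i, j int) bool {
		di := xorDistance(nodeDigest, cidDigests[i])
		dj := xorDistance(nodeDigest, cidDigests[j])
		return bytes.Compare(di, dj) > 0
	})
}
```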

@iand (Contributor, Author) commented Jun 16, 2022

Updating this to note the in-progress implementation of phase 1 at master...iand:block-accounting

@RubenKelevra (Contributor):

@iand ah cool.

But it looks like a massive overhead to do this amount of accounting for each block.

How much memory does this need per block?

@iand (Contributor, Author) commented Jun 16, 2022

> @iand ah cool.
>
> But it looks like a massive overhead to do this amount of accounting for each block.
>
> How much memory does this need per block?

I think I will end up doing it only for the roots of DAGs, since the use case is the gateway, which predominantly serves complete DAGs rather than individual blocks.

The accounting is currently 3 values for the last 3 times the block was served, 3 values for the last 3 elapsed times when retrieving it, and a value representing the size of the block/DAG that would be freed if it were deleted. My current unoptimized version looks like this:

```go
type BlockAccounting struct {
	// Last 3 times this block was referenced
	ReferenceTimes [3]time.Time

	// Last 3 durations of time taken to request block from network
	FetchDelays [3]time.Duration

	// Size of the block
	Size int64
}
```

But I plan to define a smaller representation since our fetch durations are capped and we don't need to represent reference times in the medieval era 😄
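
For example, something along these lines (entirely illustrative; the field widths, units and the repo-local epoch are assumptions, not a committed design):

```go
import "time"

// compactAccounting packs the same information into roughly 22 bytes per root:
// reference times as 32-bit seconds since a repo-local epoch, fetch delays in
// 10 ms units saturating at about 655 s, and the size rounded to KiB.
type compactAccounting struct {
	RefSecs     [3]uint32 // seconds since repoEpoch
	FetchDelays [3]uint16 // units of 10 ms, saturating
	SizeKiB     uint32    // size rounded up to KiB
}

// repoEpoch is an arbitrary recent epoch so 32 bits of seconds last for decades.
var repoEpoch = time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)

func packTime(t time.Time) uint32 {
	s := t.Unix() - repoEpoch.Unix()
	if s < 0 {
		s = 0
	}
	return uint32(s)
}

func packDelay(d time.Duration) uint16 {
	units := d.Milliseconds() / 10
	if units > 0xFFFF {
		units = 0xFFFF
	}
	return uint16(units)
}
```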

@RubenKelevra (Contributor):

> But I plan to define a smaller representation since our fetch durations are capped and we don't need to represent reference times in the medieval era 😄

I've got a bit left in the bitfield (I still have no idea what to do with "precaching", so this bit will be dropped).

So my proposal could be expanded to have this bit signal very hard-to-find/fetch blocks, to maybe give them another round in the cache, for example.

@RubenKelevra (Contributor):

Btw, how is your locking going to work?

Example A: A pinning operation is running and your GC is running as well. Will it delete the blocks already fetched?

Example B: While your GC is running, a file gets added to the MFS and the pin for it gets deleted. Will it delete the MFS blocks when it has already marked the pins?

Or are you planning a "world hold" lock which waits for all current writing operations to complete but blocks everything from being pinned/modified in the MFS?

@iand (Contributor, Author) commented Jun 17, 2022

> Btw, how is your locking going to work?
>
> Example A: A pinning operation is running and your GC is running as well. Will it delete the blocks already fetched?
>
> Example B: While your GC is running, a file gets added to the MFS and the pin for it gets deleted. Will it delete the MFS blocks when it has already marked the pins?
>
> Or are you planning a "world hold" lock which waits for all current writing operations to complete but blocks everything from being pinned/modified in the MFS?

go-ipfs already uses the GCLocker type to hold a global lock while GC is in progress. This is checked fairly inconsistently in various places, such as when adding a unixfs file.

But we should treat the locking as orthogonal to the algorithm used, which should be interruptible. My sense is that locking needs to be improved throughout as a separate exercise.

@iand (Contributor, Author) commented Jun 21, 2022

Initial evaluation of the profit function (#8870 (comment)) against a version of go-ipfs instrumented to record block retrieval times and access patterns: protocol/prodeng#7 (comment)

@RubenKelevra (Contributor):

> @guseggert the problem with LRU is that it doesn't account for the cost of refilling the cache, which is usually a non-issue in filesystems and databases. In a distributed system like IPFS the fill cost is highly variable, so I want to factor that into the algorithm to meet my goal of reducing time to first byte for the gateway.

Overall, LRU is one of the worst approaches to caching. That's why ARC in ZFS uses both MRU and MFU, and the recently-used part performs pretty badly, as you can see from the ARC statistics of two of my IPFS nodes:

$ uptime
 08:19:14 up 3 days, 19:16
# arc_summary -s archits
[...]
ARC total accesses (hits + misses):                                45.7M
        Cache hit ratio:                               84.5 %      38.6M
        Cache miss ratio:                              15.5 %       7.1M
        Actual hit ratio (MFU + MRU hits):             84.2 %      38.5M
        Data demand efficiency:                         2.5 %       6.9M

Cache hits by cache type:
        Most frequently used (MFU):                    91.0 %      35.2M
        Most recently used (MRU):                       8.6 %       3.3M
[...]
# arc_summary -s arc
ARC size (current):                                   100.2 %   15.7 GiB
[...]
        Most Frequently Used (MFU) cache size:         81.6 %  480.4 MiB
        Most Recently Used (MRU) cache size:           18.4 %  108.2 MiB
[...]
$ uptime
 08:16:43 up 30 days, 13:46 [...]
# arc_summary -s archits
[...]
ARC total accesses (hits + misses):                                 7.4G
        Cache hit ratio:                               99.8 %       7.3G
        Cache miss ratio:                               0.2 %      14.7M
        Actual hit ratio (MFU + MRU hits):             99.8 %       7.3G
        Data demand efficiency:                        99.7 %       1.5G
        Data prefetch efficiency:                      14.4 %       6.4M

Cache hits by cache type:
        Most frequently used (MFU):                    98.5 %       7.2G
        Most recently used (MRU):                       1.5 %     108.0M
[...]
# arc_summary -s arc
ARC size (current):                                   100.2 %   15.7 GiB
[...]
        Most Frequently Used (MFU) cache size:         90.1 %   13.5 GiB
        Most Recently Used (MRU) cache size:            9.9 %    1.5 GiB
[...]

> go-ipfs already uses the GCLocker type to hold a global lock while GC is in progress. This is checked fairly inconsistently in various places, such as when adding a unixfs file.
>
> But we should treat the locking as orthogonal to the algorithm used, which should be interruptible. My sense is that locking needs to be improved throughout as a separate exercise.

Yeah, there are various ways the GC can currently lose data which should actually be retained while the node is running.

A world lock in a new design is a pretty bad approach tbh, as it's pretty expensive to stop all new writes while the cleanup is taking place.

That's why I designed ipfs/notes#428 to completely avoid blocking writes, via time-based locks on the newly written and touched blocks.
