[Draft] A cache sweeper for kubo (go-ipfs) #428
Status
Blockers (to do first):
- Article: options and tradeoffs around data import parameters ipfs-docs#1176
- buzhash: reduce target size and cutoff size go-ipfs-chunker#31
- Find a workaround or a fix for "Regression: Adding a lot of files to MFS will slow ipfs down significantly" (kubo#8694) to get pacman.store back up
Todo list
- Create a goal ticket in go-ipfs to track the progress(?) - I asked @Stebalien if that's a good idea
Update draft to include
- DHT-alike caching section for high throughput nodes, outlined here, and originally here.
- more details on database structure, planned hashing algorithms, hash collisions
- Make clear that the server profile in ipfs should turn off the seldom cache section – as it emits a lot of DHT requests
- the details mentioned in the update post
- Cleanup language, restructure a bit
- Evaluate how long each operation will take, to determine where it's beneficial/necessary to create lists/hashtables instead of relying on the flags alone, and whether the added complexity makes sense.
Introduction
Hey guys,
I'm pretty new to this project, but I see the opportunity to contribute a little and learn some Go on the way. I learned C a long time ago, and I'm currently only really firm in Python and shell scripting, so please make sure to thoroughly check the contribution.
Automatic cache cleanup
Current approach
There's a garbage collector that is designed to sweep the whole block storage and remove everything which is neither part of the MFS nor pinned. The garbage collector can be run manually offline, manually online, or enabled to run automatically once the repo reaches a certain percentage of its configured maximum size.
There's no record of how often a block has been requested, how old it is, or how often it is referenced. A garbage collector run is pretty slow, since it traverses the whole local database of pins and their recursive references, as well as the MFS, to make sure that only non-referenced objects are dropped.
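For context, the shape of that full sweep is roughly the following; this is a simplified sketch and not kubo's actual implementation, all names are made up:

```go
// Simplified mark-and-sweep sketch of the existing approach. This is not
// kubo's actual code; all names here are made up for illustration.
package gcsketch

type CID string

// walk collects a root and everything reachable from it into the live set.
func walk(root CID, children func(CID) []CID, live map[CID]bool) {
	if live[root] {
		return
	}
	live[root] = true
	for _, c := range children(root) {
		walk(c, children, live)
	}
}

// sweep deletes every stored block that is neither pinned nor reachable
// from the MFS root.
func sweep(stored []CID, pinRoots []CID, mfsRoot CID, children func(CID) []CID, del func(CID)) {
	live := map[CID]bool{}
	for _, p := range pinRoots {
		walk(p, children, live)
	}
	walk(mfsRoot, children, live)
	for _, c := range stored {
		if !live[c] {
			del(c) // not referenced anywhere: swept
		}
	}
}
```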
New approach
A background task will be added which uses a database to keep track of the references, the number of requests, the age, and the size of each block.
On first initialization (or when started with a flag) it will traverse the whole pin/MFS database to build its own database (or to make sure it's consistent).
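To make this a bit more concrete, here is a rough sketch of what the sweeper's bookkeeping could look like; all type and method names are hypothetical, nothing here exists in kubo today:

```go
// Hypothetical sketch of the cache sweeper's bookkeeping; none of these
// types or methods exist in kubo today.
package sweeper

import "time"

// BlockMeta is the per-block record kept in the sweeper's database; the
// packed 32-bit layout is described under "Potential Datastructure" below.
type BlockMeta struct {
	Packed uint32
}

// Sweeper is the background task maintaining the metadata database.
type Sweeper interface {
	// Rebuild traverses the pin set and the MFS to (re)create the database
	// or to verify its consistency.
	Rebuild() error
	// Sniffed IPFS operations.
	OnPin(cid string)
	OnUnpin(cid string)
	OnMFSWrite(path string, cid string)
	OnMFSDelete(path string)
	// Tick runs once per clock interval (default 10 s) and applies the
	// probabilistic counter decay plus any pending drops.
	Tick(now time.Time)
}
```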
Goals
(in priority order)
- The garbage collector remains untouched for full sweeps of the repo
- Timestamp-based concurrency control (won't block any operations of IPFS)
- Crash resistance
- Atomic writes (DB)
- Extremely low memory usage
- No scalability issues
- Low chance of cache pollution
- Fast startup
- Permanent storage of block metrics (no warmup needed)
- Avoids small writes to disk
- As many unsynced writes as possible
- Long term cache efficiency
- Avoid redundancies of information
- Low system load (background task)
- Sane and adaptive default config (no tweaking necessary)
Limitations
(in no particular order)
- Does not understand anything about the data structure of higher objects
- Can't predict access patterns
- Can't analyze access patterns
- Low level of concurrency on own operations
- Can't answer queries about a block's age or hit/miss rate
- Can't handle a fixed age
- Doesn't care about misses
- Size based operations (like drop rate, ingress rate, repo size) are just rough estimates
- New blocks cannot be squeezed
- Size estimates are limited to 2 MB; implementing larger blocks in the protocol would require a change in the data structures
- Maximum number of references is limited to 25 bits (33,554,432)
- could theoretically be extended to 27 bits (134,217,728) with somewhat more complexity, if needed
- Doesn't have any metrics on blocks which get unpinned (all blocks that are unpinned or removed from MFS get a standard value)
- Long lag between sniffed operations and actions on the cache and DB
- Running the GC deletes the cache sweeper DB (and stops the task - if it's running). The cache sweeper database needs to be rebuilt offline to start the task again.
- Maximum cache time of an item is limited to 2^25 / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24 = 706 days, assuming zero hits occur and the storage is large enough to hold it for this timeframe. Objects which exceed that limit will be placed in the stale section of the cache, which is by default capped at 10% relative size. Blocks in the stale section will be moved back to the main section when a request hits them, or dropped at any time under cache pressure. (The time limit can be raised by switching to 27 bits or by changing the clock interval; the default is 10 seconds.)
- Extremely long running pinning operations (lasting days, for example) will block processing of the new blocks section entirely. Only when the operation is completed will the background task catch up and move the blocks to either the cache side or the permanent storage side. This might be an issue when very little space is left for the cache while there are many read operations without pinning. In such a case a pinning operation might need to be forcefully stopped, if no cache space is left, to clear up this state. Obviously all cached data will be "trashed" this way.
- Aborted pinning operations will leave all their data in the cached section of the datastore. Rerunning the pinning operation will reuse the blocks and move them to the new blocks section and then to the permanent storage section. Under high memory pressure and with many "better" valued blocks, the data of an aborted pinning operation might be removed from the cache before the repeated pinning operation has a chance to claim it. In this case the data has to be fetched again from the network.
Proposed Features
(in no particular order)
- Permanent storage of the database
- Sniffs ipfs operations: Pin, Unpin, write/delete of MFS objects
- optional distributed prefetching
- Sniffs bitswap requests (optional: filtered through a NodeID whitelist) to fetch blocks other nodes are searching for
- seldom segment of the cache
- If enabled: when a block would enter the stale state, the DHT is queried for other providers. If there are none, the block will be added to the seldom cache for an extended stale state. If a request hits it, it will be moved back to the main cache and its metrics will be reset.
- Keeps track of free space left on device and drops blocks if necessary
- Periodically asks the OS for the free space; if there's a shortage, the ingress rate on the disk is estimated while roughly subtracting its own operations. This fill rate will be added to the ingress rate of the datastore and the drop rate adjusted accordingly (see the sketch after this list).
- Each cache segment can be squeezed on the fly with a block count or a size (roughly) (except new blocks)
- Keeps track of the 'new blocks' write speed, and adjusts the drop rate accordingly.
- Keeps track of blocks which can be immediately dropped on a pressure signal
- Keeps track of the block sizes
- Can be adjusted to care more about access requests or more about age
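A minimal sketch of the free-space poll and the ingress-rate guess from the list above, assuming a Linux system (syscall.Statfs); how the result feeds into the drop rate is only hinted at:

```go
// Free-space polling sketch (Linux; struct field names differ on other
// platforms). How the result feeds into the drop rate is left out.
package sweeper

import (
	"syscall"
	"time"
)

// freeBytes returns the free space on the filesystem holding the repo.
func freeBytes(repoPath string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(repoPath, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

// ingressEstimate guesses how fast the disk filled up between two polls;
// the sweeper would roughly subtract its own writes from this number.
func ingressEstimate(prevFree, curFree uint64, interval time.Duration) float64 {
	if curFree >= prevFree {
		return 0 // disk got emptier, no extra pressure
	}
	return float64(prevFree-curFree) / interval.Seconds() // bytes per second
}
```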
Potential configuration knobs
(all sizes are estimates)
(relative: excludes pinned/new blocks)
- seldom cache (relative) size (=10%)
- stale cache (relative) size (=5%)
- warn threshold minimal estimated cache size (=25%)
- log: too many pinned objects/too low disk space
- free cache size (=2%)
- warn threshold (total) new blocks (=10%)
- minimal free disk space (=1 GB)
- action: will drop blocks
- max estimated block storage size (=unlimited)
- max number of prefetched objects in the cache (=100,000)
- max total size of prefetched objects in the cache (=2%)
- max number of stale objects (=unlimited)
- max total size of stale objects (= 10% cache size)
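As a sketch, these knobs could map onto a config block like the following; the field names are made up and only the defaults are taken from the list above, nothing here exists in kubo's config:

```go
// Hypothetical configuration block with the defaults listed above; these
// knobs do not exist in kubo's config today.
package sweeper

type Config struct {
	SeldomCachePercent   float64 // relative size of the seldom cache (default 10)
	StaleCachePercent    float64 // relative size of the stale cache (default 5)
	WarnMinCachePercent  float64 // warn if the estimated cache shrinks below this (default 25)
	FreeCachePercent     float64 // free cache size (default 2)
	WarnNewBlocksPercent float64 // warn if new blocks exceed this share of the repo (default 10)
	MinFreeDiskBytes     uint64  // below this, blocks are dropped (default 1 GB)
	MaxBlockstoreBytes   uint64  // 0 = unlimited
	MaxPrefetchedObjects int     // default 100,000
	MaxPrefetchedPercent float64 // default 2
	MaxStaleObjects      int     // 0 = unlimited
	MaxStalePercent      float64 // default 10
}

// DefaultConfig mirrors the defaults given in the list above.
var DefaultConfig = Config{
	SeldomCachePercent:   10,
	StaleCachePercent:    5,
	WarnMinCachePercent:  25,
	FreeCachePercent:     2,
	WarnNewBlocksPercent: 10,
	MinFreeDiskBytes:     1 << 30,
	MaxPrefetchedObjects: 100000,
	MaxPrefetchedPercent: 2,
	MaxStalePercent:      10,
}
```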
Potential Datastructure
32-bit integer value:
- 3 bits for a size estimate (8 values)
- 1 bit: Dirty flag
- 1 bit: Seldom flag
- 1 bit: Precaching flag
- 1 bit: reference (=true) or cache
- 25 bits for the ref-counter / cache counter
A value of 0 means: new block (no reference information yet)
The dirty flag marks incomplete writes to recover from them without analyzing the whole data store.
The seldom flag marks whether a cache entry is part of the seldom cache. If the reference bit is set, this bit has no function.
The precaching flag marks whether a cache entry is part of the precaching (prefetch) section of the cache. If the reference bit is set, this bit has no function.
Size estimate values
The block size will be rounded to the nearest value in this list and stored in the size bits:
2,500 Bytes - 000
38,656 Bytes - 001
136,029 Bytes - 010
268,924 Bytes - 011
422,168 Bytes - 100
639,322 Bytes - 101
1,319,771 Bytes - 110
2,005,897 Bytes - 111
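A sketch of how the 32-bit value and the size rounding could be packed in Go; the bit positions follow the list above, the helper names are made up:

```go
// Sketch of the proposed 32-bit packing. Layout, from the most significant
// bits down: 3 bits size class | 1 bit dirty | 1 bit seldom |
// 1 bit precaching | 1 bit reference/cache | 25 bits counter.
package sweeper

const (
	counterBits = 25
	counterMask = 1<<counterBits - 1 // 0x1FFFFFF

	refBit      = 1 << 25
	precacheBit = 1 << 26
	seldomBit   = 1 << 27
	dirtyBit    = 1 << 28
	sizeShift   = 29 // top 3 bits hold the size class
)

// sizeClasses are the representative sizes for the 3-bit size estimate.
var sizeClasses = [8]uint32{
	2500, 38656, 136029, 268924, 422168, 639322, 1319771, 2005897,
}

// sizeClass returns the index of the nearest representative size.
func sizeClass(n uint32) uint32 {
	best, bestDiff := uint32(0), ^uint32(0)
	for i, s := range sizeClasses {
		d := s - n
		if n > s {
			d = n - s
		}
		if d < bestDiff {
			best, bestDiff = uint32(i), d
		}
	}
	return best
}

// pack builds the 32-bit record; a stored value of 0 means "new block".
func pack(size uint32, dirty, seldom, precache, ref bool, counter uint32) uint32 {
	v := sizeClass(size)<<sizeShift | (counter & counterMask)
	if dirty {
		v |= dirtyBit
	}
	if seldom {
		v |= seldomBit
	}
	if precache {
		v |= precacheBit
	}
	if ref {
		v |= refBit
	}
	return v
}
```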
Caching algorithm
- A clock interval is 10 seconds.
- Each access will raise the counter by 1.
- If 2^25 is reached, the value will not be changed.
- Each clock interval will decrease the counter by 1 with probability based on the probability-curve.
- The default value is 2^25 / 2 (16777216).
A newly added block is added to the new blocks list and gets a timestamp. If the timestamp is older than the latest completed pinning/writing operation and the block wasn't referenced by it, the block is moved to the cache with the default value and the timestamp is discarded.
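A minimal sketch of the access increment and the per-tick decay, building on the constants from the packing sketch above and a dropProbability curve like the one sketched in the "Probability curve" section below; everything is illustrative:

```go
// Per-entry update rules; uses counterMask from the packing sketch and
// dropProbability from the probability-curve sketch. Illustrative only.
package sweeper

import "math/rand"

// hit raises the 25-bit counter by 1 on an access, saturating at its maximum.
func hit(packed uint32) uint32 {
	counter := packed & counterMask
	if counter < counterMask {
		counter++
	}
	return packed&^uint32(counterMask) | counter
}

// tick is run once per clock interval (default 10 s) for every cache entry:
// the counter is decremented by 1 with a probability taken from the curve.
func tick(packed uint32) uint32 {
	counter := packed & counterMask
	if counter > 0 && rand.Float64() < dropProbability(counter) {
		counter--
	}
	return packed&^uint32(counterMask) | counter
}
```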
|+++++++++++++++++++++++++++++++ Cache ++++++++++++++++++++++++++++++++++++|++++++ New Blocks ++++++|+++++++ Storage +++++++|
||+ Stale +|+Seldom +|+++++++++++++++++ Main +++++++++++++++++++++|+ Pre +|| | |
|| | | | || | |
|| | | | || | |
||+++++++++|+++++++++|++++++++++++++++++++++++++++++++++++++++++++|+++++++|| | |
|++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|++++++++++++++++++++++++|+++++++++++++++++++++++|
Probability curve
- The default value 2^25 / 2 (16777216) has a probability of 0.1.
- The curve on the left is linear and on the right somewhat exponential.
- The left side drops off to 0.01 at 0.
- The right side will be at 0.2 at 2^25 / 2 + 2^25 / 4 (25165824). Exact curve TBD.
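One possible shape for this curve, using the fixed points above and a simple linear stand-in on the right until the exact curve is decided; illustrative only:

```go
// Possible drop-probability curve. The fixed points come from the list
// above; the right-hand side is a linear placeholder, while the proposal
// wants something "somewhat exponential" there.
package sweeper

const defaultValue = 1 << 24 // 2^25 / 2 = 16777216

// dropProbability maps a counter value to the per-tick chance of a decrement.
func dropProbability(counter uint32) float64 {
	c := float64(counter)
	if counter <= defaultValue {
		// linear: 0.01 at 0, rising to 0.1 at the default value
		return 0.01 + 0.09*c/float64(defaultValue)
	}
	// placeholder: 0.1 at the default value, 0.2 at 2^25/2 + 2^25/4 (25165824)
	return 0.1 + 0.1*(c-float64(defaultValue))/float64(1<<23)
}
```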