
API to allow explicit compaction/garbage collection #523

Open
gtaska opened this issue Jan 31, 2021 · 19 comments

@gtaska

gtaska commented Jan 31, 2021

From what I can tell, compaction/garbage collection only occurs as the file system is being used (written to).

On devices with longer erase times (I'm currently looking at a part that takes 1.6 seconds to erase 32kB), this will lead to poor performance during write function calls if compaction/garbage collection is needed.

It would be very useful to have a separate API that system integrators can call to explicitly initiate a round of compaction/garbage collection. An API like this would allow them to schedule expensive erase operations at times when they are less likely to impact the system's write performance.

An example of where this may be useful is when data is only written to flash intermittently/periodically, but when it is, it needs to be written as quickly as possible. Garbage collection can then be performed during the longer periods between write bursts, ensuring that the flash is in a readily writable state by the time the next series of write calls occurs.

While this may not eliminate the need for all compaction/garbage collection during write calls, it has the potential to improve the consistency of write call durations.

@geky
Member

geky commented Feb 3, 2021

This is a good idea for the garbage collection/block allocator, and wouldn't be too difficult to add. I am hoping to address the issues in #75 by moving away from garbage collection completely, but adding a function to eagerly populate the block allocator could be a good temporary solution.

It also wouldn't be too difficult to extend the block allocator with information about which blocks have been erased; currently it doesn't track this since it only erases on demand.

One issue is that unlike SPIFFS or Dhara, LittleFS doesn't persist the garbage collection state on disk. This limits the use of eager garbage collection if your device loses power often, and wouldn't change the cold-start time.

The metadata compaction is a different system and might be a bit more complex. You could iterate over all metadata-pairs and force a compaction if the metadata-pair is full past some threshold, but in doing so you are effectively losing potential writes from the last erase. Maybe this is acceptable if the threshold is sufficiently high, around ~75%? It could be a user-provided option.

The metadata compaction operates on the sub-block level, so it currently throws away the existing metadata when it compacts.

schedule expensive erase operations

One of the future challenges, even with a different block allocator, is tracking which blocks have been erased. LittleFS doesn't require erased blocks to be all 0xffs, and this allows some interesting use cases such as cheap SD/eMMC and encrypted block devices.

The downside is, in order to know which blocks have been erased, we would need to store that somewhere. One of the more promising allocator schemes is basically just a bitmap in a file, but in order to update such a file you likely end up needing to erase+write new blocks. So updating the allocator to indicate a block is erased would require an erase, which sort of defeats the purpose. Still need to explore this more, but it promises to be challenging.

@gtaska
Author

gtaska commented Feb 28, 2021

This appears to be a dupe of #493. Closing.

@gtaska gtaska closed this as completed Feb 28, 2021
@geky
Member

geky commented Jun 23, 2021

I'm going to reopen this since it is subtly different from #493 and worth noting: #493 is a signalling mechanism from littlefs to the disk, whereas this (explicit compaction/garbage collection) is a more general feature for littlefs that may be useful. Development-wise they involve different layers. They may both be solved by an allocator redesign (#75), but we'll have to see.

@victorallume

I've been experiencing some performance issues as well, and was looking for a garbage collection trigger.

In my testing I noticed the following behaviour: write (append) operations would gradually take longer as the file gets bigger, then a single write operation would take about double the previous write, then the next one would be considerably shorter, and the process would repeat. I figured the 'long' write was performing a garbage collection (is this assumption correct?).

Then I introduced another file, which gets overwritten every time I append to the original file (this is to simulate the real-world operation of our system). Instead of a 'long write' to the file I'm appending to (as happened before), the append op takes a similar time as before, but the op to the extra file takes a lot longer than the other writes to it would have taken. So I'm figuring that write op has taken over the garbage collection.

So I got to thinking: why not just have a dummy file I write to that effectively triggers the garbage collection? Could this be made deterministic? (I'm guessing I'm just lucky with my test setup that the gc is happening only on the extra file, while in the wild it might happen on either.) Any thoughts?
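
(A sketch of the dummy-file workaround being described, assuming a mounted littlefs instance; the file name is arbitrary, and as the reply below points out this isn't guaranteed to compact the metadata-pair of the file you actually care about:)

#include "lfs.h"

// Sketch: write to a throwaway file so that any pending metadata compaction
// happens during this call instead of during a time-critical write.
// "gc_dummy" is an arbitrary name, and this is not guaranteed to hit the
// right metadata-pair if the directory spans several of them.
static int poke_metadata(lfs_t *lfs) {
    lfs_file_t dummy;
    int err = lfs_file_open(lfs, &dummy, "gc_dummy",
            LFS_O_WRONLY | LFS_O_CREAT | LFS_O_TRUNC);
    if (err) {
        return err;
    }

    uint8_t byte = 0;
    lfs_ssize_t res = lfs_file_write(lfs, &dummy, &byte, sizeof(byte));

    // close even if the write failed, then report the first error
    err = lfs_file_close(lfs, &dummy);
    if (res < 0) {
        return (int)res;
    }
    return err;
}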

@geky
Member

geky commented Apr 6, 2022

Hi @victorallume, this does sound like metadata compaction, one of the two garbage-collection-like processes in littlefs, and one of the two main open problems for littlefs right now (#203).

Out of curiosity how large is your block size?

The key thing is that metadata compaction occurs per metadata-pair, with each directory being made out of one or more metadata-pairs. So if the files are in the same directory, they probably share a metadata-pair, which would show the behavior you're describing: one file write might compact the metadata-pair so the other file doesn't need to.

If you moved the other file into a different directory, then you wouldn't see the largest cost, but eventually the second directory would need to compact as well.

while in the wild it might happen on either

Yep, it could happen to either. If one file fits in an "inline" file or has a bigger file name, its metadata may take up more space, which could make it more likely to trigger a compaction. But this isn't guaranteed, so it shouldn't be relied on.

why not just have a dummy file I write to that effectively triggers the garbage collection?

This is interesting and I suppose it could work if you're able to instrument the block device to detect the compaction. But it probably wouldn't work in the general case. If you have multiple files, the directory could take up multiple metadata-pairs, which means you might not be compacting the right metadata-pair.


You might be interested in enabling the metadata_max configuration option. This is a workaround that lets you artificially limit the size of metadata in each metadata-pair to keep the cost of compaction usable. It wastes disk space, but is currently necessary for reasonable performance on devices with large block sizes:

littlefs/lfs.h, lines 260 to 264 in 9c7e232:

// Optional upper limit on total space given to metadata pairs in bytes. On
// devices with large blocks (e.g. 128kB) setting this to a low size (2-8kB)
// can help bound the metadata compaction time. Must be <= block_size.
// Defaults to block_size when zero.
lfs_size_t metadata_max;
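
For reference, a minimal sketch of setting this option; the bd_* callbacks and the geometry values are placeholders for your own block device, and the 2 KiB value is only illustrative:

#include "lfs.h"

// Sketch -- only metadata_max is the point of this example; everything else
// must match your actual block device.
extern int bd_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size);
extern int bd_prog(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, const void *buffer, lfs_size_t size);
extern int bd_erase(const struct lfs_config *c, lfs_block_t block);
extern int bd_sync(const struct lfs_config *c);

const struct lfs_config cfg = {
    .read  = bd_read,
    .prog  = bd_prog,
    .erase = bd_erase,
    .sync  = bd_sync,

    .read_size      = 16,
    .prog_size      = 256,
    .block_size     = 128*1024,  // large-block device
    .block_count    = 1024,
    .cache_size     = 512,
    .lookahead_size = 128,
    .block_cycles   = 500,

    // cap metadata per metadata-pair at 2 KiB to bound compaction time
    .metadata_max   = 2048,
};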

@victorallume

victorallume commented Apr 6, 2022

Hi @geky, thanks for the reply. The block size we're using is 4kB, and all of the files in question are in the same directory.

@kyrreaa

kyrreaa commented Oct 30, 2023

I'm using LittleFS on an 8Gbit NAND flash and would love to trigger compaction to speed up my directory listings later.
Since my particular NAND allows 4 partial page writes (1024 bytes at a time per 4096-byte page, instead of a full 4096-byte program) without erasing, and since NAND can only change 1's in memory to 0's, tracking or identifying used/free space is at least possible.

@geky
Member

geky commented Oct 30, 2023

Unfortunately, relying on the masking nature of flash is not a good general solution. It works for many (most?) flash devices, but a number of flash devices disallow it due to either write perturbation concerns or because of hardware-backed ECC.

The current design I'm working towards batches this up with the allocator redesign (#75), and adds an optional "block map" that tracks the status of every block in the filesystem.

Such a block map could also be used to track bad blocks and allow blocks to be reserved for non-filesystem use.

@kyrreaa

kyrreaa commented Oct 30, 2023

Agree completely on the abuse of NAND masking. I was more thinking of the ability to write partial blocks and combining this knowledge with the erase-block information. An erase wears flash too, and for example for my flash, if the first 1/4 has been written to but next 1/4 still shows FF's it can update the first block's information with a new revised version without requiring erasing and without having to go to next block. The read operation can still read the entire 4096-byte sector (in my case), and this is supported by the automatic ECC of the chip. It specifically mentions up to 4 partial writes are allowed. I am just thinking how this flexibility could be exploited after noticing some massive delays during testing, creating small(ish) files (120 sectors of 4096 bytes each) and measuring the directory listing performance for each addition. I noticed that once I reached enough files to compact the entries, the delay dropped significantly. I created a graph to test it. I understand of course that this mostly applies to my current implementation, and interface speed also affects it, but it should scale regardless for anyone's setup, depending on the size of buffers etc. (4096-byte access, 4096*64-byte blocks, a 4096-byte write buffer, and a total of 4096 blocks (of 4096*64 bytes) of storage, with a lookahead of 4096/8, hoping it would allow a full-chip overview...)
[graph: directory listing time per file added]
(Edit: Updated graph)
Compacting (when closing the file that triggered it) took a very very long time though... Is this due to the alphabetical sorting perhaps?

@kyrreaa

kyrreaa commented Feb 6, 2024

I ran my test again now that 2.9 is out!
[graph: listing-speed test re-run on v2.9]
Looks like there are small improvements in listing speed! (Edit: Nope, it was caused by other compiler changes. But performance is not worse so that is good :) )

I've also tested the manual garbage collection / compacting and it seems to work.
I am curious though if there are any metrics we can extract on how the metadata are organized before/after compaction. It would be good to be able to verify what the compaction did and maybe even analyse the metadata to determine how much wasted space/programming blocks a block of metadata contains.

@geky
Member

geky commented Feb 16, 2024

Oh yes, I forgot to update this thread. v2.9 now includes lfs_fs_gc (#913), which is the start of explicit garbage collection in littlefs.
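
For anyone landing here later, a minimal sketch of calling it from application code during a quiet period; the idle check is a hypothetical hook in your own system, not a littlefs API:

#include <stdbool.h>
#include "lfs.h"

// app_is_idle() is a hypothetical application hook.
extern bool app_is_idle(void);

// Sketch: run garbage collection while the system has time to spare, so the
// cost isn't paid during time-critical writes.
void background_gc(lfs_t *lfs) {
    if (app_is_idle()) {
        int err = lfs_fs_gc(lfs);
        if (err) {
            // handle/log the error as appropriate for your system
        }
    }
}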

@kyrreaa sorry about the late response, I'm trying to catch up to your comments

@geky
Member

geky commented Feb 16, 2024

if the first 1/4 has been written to but next 1/4 still shows FF's it can update the first block's information with a new revised version without requiring erasing and without having to go to next block.

It specifically mentions up to 4 partial writes are allowed. I am just thinking how this flexibility could be exploited

If I'm understanding this correctly, why not just set prog_size=1024? I think what you described is more-or-less what littlefs is doing when it appends to metadata logs.

LittleFS Performance (graph)

This graph is curious. Are you measuring lfs_stat, lfs_dir_read, or lfs_file_open? These have slightly different behaviors.

I would normally expect the performance to drop to ~1/2 after compaction, not ~0. I wonder if it's because you're operating with a very high prog_size/metadata ratio. It may be that most of your commits are filled with padding to align to the prog_size, and after compaction your metadata fits in only a couple prog units.

It depends on which of the above operations you're calling, but I think the bottleneck/cause for the ramp is the mdir fetch operation necessary to check the metadata block for consistency. mdir fetch needs to scan the metadata block to validate checksums, which ends up $O(n)$ over the number of commits in the block. It should skip padding, but if read_size == prog_size, that doesn't really help.

The mdir fetch cost has been a bit annoying as it's pretty fundamental. The best solution I've come up with is maintaining a cache of metadata trunks in RAM. Once fetched, an mdir doesn't really require all that much RAM to keep track of, only 8 words/32 bytes at the moment.

Compacting (when closing the file that triggered it) took a very very long time though... Is this due to the alphabetical sorting perhaps?

This is basically because the compaction algorithm is too naive.

I think this thread is the best discussion on the issue: #783. The open issues really need to get better categorized at some point...

There was a fundamental assumption in littlefs's design that blocks are "small", and we can ignore algorithmic complexity related to the block size. This assumption came from focusing on NOR flash/eMMC and turned out to be wrong. At ~256KiB blocks, you can't ignore the complexity of in-block operations.

The current metadata compaction algorithm is $O(n^2)$ over the number of tags in a metadata log. This is why the performance spike is so large. I have a redesign in the works that brings this down to $O(n \log n)$, but it is quite involved.

The current workaround for this is the metadata_max config, which artificially limits metadata logs. This ends up wasting storage, but can bring the compaction cost down to usable levels in some situations.

I am curious though if there are any metrics we can extract on how the metadata are organized before/after compaction. It would be good to be able to verify what the compaction did and maybe even analyse the metadata to determine how much wasted space/programming blocks a block of metadata contains.

This is an interesting idea, and may be worth opening another issue.

One could imagine a sort of lfs_fs_debugstat that collects more invasive info, average metadata util, etc (or a config flag to add this to the current lfs_fsinfo struct).

I considered adding LFS_DEBUG statements to mdir compaction, but was worried it would result in too much debug spam, since mdir compaction toes the line between a common/uncommon operation...

An erase wears flash too

Technically erase is the only operation that wears flash :)

@kyrreaa

kyrreaa commented Feb 17, 2024

@geky I am using timing routines (almost ns-accurate) built into my NAND (over SPIM on nRF53) to time the read portions that happen during file creation and writes. I also time the complete cycle of opening the directory (as a separate handle), listing all content, and closing it again to measure the list time. In my use-case I create 24-28 files that range from 20-30 Mbyte in size, one at a time, and at a later date read them all back once before deleting them. The total amount of data written is between 500-800 MB on a 1 GB (8 Gbit) NAND flash.

You are absolutely right about the flash write_size, and it was that which gave me the best benefits, combined with the new garbage collection. Since I know, after deleting files, that I will be creating a bunch of files on the next iteration, I can call the function ahead of time.
By not limiting metadata size, but combining read_size=4096, write_size=1024 and block_size=64*4096 (smallest available) with the lfs_fs_gc() call after deletion (when we have plenty of time), I ended up with a use-case cycle that does not compact during normal use (live data recording) and thus has great create/write times. Since I cannot know when a block_size worth of data has been filled, however, I cannot sync() at ideal times, so I avoid it until the file is closed. Since no data is reallocated or otherwise copied, this is also fast.
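
(For reference, a sketch of the configuration just described in lfs_config terms; cfg and lfs are placeholder names, littlefs calls the "write_size" above prog_size, and the other config fields are omitted.)

// Sketch of the setup described above; other lfs_config fields omitted.
cfg.read_size  = 4096;      // full 4096-byte page reads
cfg.prog_size  = 1024;      // the "write_size" above: 4 partial programs per page
cfg.block_size = 64*4096;   // smallest erase block of this NAND

// ...later, after deleting the previous batch of files and while idle:
int err = lfs_fs_gc(&lfs);  // pay the garbage-collection cost up front
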
[graph: file create times]
Create time here contains file creation (which makes the 0-length inline data tag), writing a unique file header with important metrics etc. (and syncing to make sure this ends up as an inline data tag), then writing 4000 bytes of data to simplify the test (not 20 MB, as I don't want to wait ages just for a higher number...) and closing the file. I do not time the metadata compaction or file deletion that happens to clear out the existing files, as that is not performance critical.
As you can see, it works wonders!

You were also right about the metadata being mostly padding due to write_size, and the 1/4-sector solution helps a lot.
Since the change is as easy as modifying one parameter of the config, I think this is a great solution, even though one may later change to a flash that does not support partial writes (mostly MLC).

To be able to analyse the structures on disk I wrote my own parser, which can also find the header data thanks to the metadata written and synced when a file is created. A sudden power loss (each file is 1 hour of data) will leave the filesystem with knowledge of only the initial sync, but once the parser scans and finds the header (488 bits of uniquely identifiable data) it has a block reference it can use to identify the sequentially allocated blocks pointing to this block, and so on. It can thus recreate the CTZ chain of blocks to find the last block. Since the data in the file is also formatted, it can detect how long the file is, and I added code to replace the file's metadata upon opening by a normal open command, so it will then use the recovered CTZ reference and file length. I have successfully recovered all files lost due to power-drop over a trial of 20 units, with the last file lost on each. It might be worth committing metadata about the potential file data location in the future?

The only issue with lfs_fs_gc() right now would be that if you have not filled 1/2 of the first block_size, it does nothing, even though it may be bytes away from this limit and the entire metadata set could fit in a single sector.

@geky
Member

geky commented Feb 17, 2024

To be able to analyse the structures on disk I wrote my own parser [...] It might be worth committing metadata about the potential file data location in the future?

This is an interesting direction to go. Though I'm not sure if it can be made to work well in the general case:

  • You need to make sure you don't pick up old files still on disk.

    Storage where erase is a noop (SD/FTL/RAM/etc) makes this harder.

  • You need to make sure you can detect partially written files.

    Sufficient checksumming probably solves this.

  • If you don't know the size of the file beforehand, scanning for a checksum may be unbounded.

    This is actually the same problem as mounting littlefs without knowing the block_size, which would be really nice to solve, but to my current knowledge is unsolvable (#349).

  • If you open multiple files for writing, the blocks can become interleaved, which may make finding the exact file you're looking for more difficult.


But there's nothing wrong with modifying littlefs locally for a specific use-case.

@kyrreaa

kyrreaa commented Feb 18, 2024

Well, there is nothing written to the first block allocated to a file right now except file data (and checksum?), but the next block starts with CTZ data pointing to the first block. There is nothing preventing the first data block from containing the filename etc., except that renaming files would be a nightmare. A 32-bit unique identifier linked to a metadata tag committed with the filename tag would work. It could also be attached to every block assigned to the file. I'm not sure how to avoid a discarded block being mistaken for belonging to a file though. Maybe erasing discarded blocks works for that (except, again, power loss may prevent this from being consistent).
It would be really nice though if powerloss resilience was so strong that written data could be recovered even if a file was closed incorrectly (aka never closed). Thanks to my limited use-case, however, I was lucky and could rebuild everything I needed.

@geky
Member

geky commented Feb 18, 2024

A 32-bit unique identifier linked to a metadata tag committed with the filename tag would work. It could also be attached to every block assigned to the file.

It sounds like this could work. Though discarded blocks (as you mentioned) and 32-bit overflow could present problems. It also starts to look a lot more like a logging filesystem.

It's interesting to note that this is more-or-less how littlefs works inside a metadata log. Each file gets a 10-bit id, and we scan for the most recent "block" (inline data). This is also where most of the mdir-related bottlenecks come from. It takes $O(n)$ to find the most recent block, resulting in $O(n^2)$ compaction cost.

It would be really nice though if powerloss resilience was so strong that written data could be recovered even if a file was closed incorrectly (aka never closed).

Isn't this solved by changing all lfs_file_writes to:

// lfs_file_write returns the number of bytes written, or a negative error code
lfs_ssize_t res = lfs_file_write(&lfs, &file, buffer, size);
if (res < 0) {
    return (int)res;
}
int err = lfs_file_sync(&lfs, &file);
if (err) {
    return err;
}

?

I think syncing on partial file write would be quite problematic for users.

@kyrreaa

kyrreaa commented Feb 19, 2024

The issue with syncing on partial file write or automatically all the time is the reallocation of data if it is to be appended afterwards. The issue is that you do not know when you should sync, as you do not know if you have reached a block boundary where there would be zero reallocation.
In fact, that may actually be a way to do it: a config option, API call or similar to let littlefs sync file metadata upon the next boundary where there is no data to reallocate upon the next append. For continuous logging this type of API could be called at an interval the user sees fit and would not trigger a sync until it hits the boundary. I'm not sure how ideal this is, but I wonder if it would be useful for those who write long logs or live data recordings. Every sync does add metadata overhead, but it would just be a single CTZ reference to the new block and a file length, if I understand correctly?

@geky
Member

geky commented Feb 28, 2024

The issue with syncing on partial file write or automatically all the time is the reallocation of data if it is to be appended afterwards.

I think we're approaching the same problem from two different directions.

From my perspective, waiting to update metadata until sync is the correct API in terms of making it easy for users to write power-safe applications. And performance issues (reallocation of data) are problems that need fixing in the filesystem's design.

To this end I've been working on improving sync performance by storing incomplete blocks in the metadata (like inline files). This should avoid most tail block reallocations.

In fact, that may actually be a way to do it: a config option, API call or similar to let littlefs sync file metadata upon the next boundary where there is no data to reallocate upon the next append.

This could be added, maybe LFS_O_EAGERSYNC? (open to better names).

But I would want to see the above improvements to sync performance land first. My hope is that we can push sync to be fast enough that additional APIs such as LFS_O_EAGERSYNC (name tbd) are not necessary.

Every sync does add metadata overhead, but it would just be a single CTZ reference to the new block and a file length, if I understand correctly?

Yes, this is just two words (~8 bytes) for the CTZ skip-list.

But, as I think you've found out in other comments, padding to the prog_size can be a real kicker.
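
To put rough numbers on that: with prog_size=1024, a sync that only needs to commit those ~8 bytes (plus tag and checksum overhead) still has to program a full 1024-byte unit, so nearly all of that program is padding; with a small prog_size like 16, the same commit wastes almost nothing.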

That's where I think the storing of partial blocks can work quite well with NAND's limitations. If we're writing a whole prog_size, might as well fill it with the tail of the current file.

@kyrreaa

kyrreaa commented Feb 28, 2024

To this end I've been working on improving sync performance by storing incomplete blocks in the metadata (like inline files). This should avoid most tail block reallocations.

Ooooh! I like that idea! Partial blocks could be treated differently, but for large blocks it may not work well inline.
If one could find a way to structure the data so that a block inside a file does not have to be full, and maybe does not even have to be "dense", it would allow appends with no copying. An append could continue in the next prog_size unit.
Syncing a file after appending a single byte would then have to write the data to the next prog_size unit in the block and somehow associate a new CRC with the data, then update the metadata CTZ with the size. The size would not directly indicate how the blocks are to be read, or how many blocks are part of the file, as the blocks can be very sparse, but that could be solved either by more structure in the block data or with metadata?
Copy-on-modify would be able to compact a sparse block even more during the copy, and if not truncated but modified it could even keep the next block intact if there were a way to update the CTZ sequence? Since it really needs to be a forward-back type list for more efficient use, it may be better stored in metadata as well? "Block 0x1337 allocated to file 0xb" metadata could work? A copy-on-modify (not needed for append) would be able to de-allocate 0x1337 and in the same commit allocate 0x1338? Alternatively, could the block allocations reside in the first block of a file?

I am just rambling here, but conceptually what I mean is a way for a file to be restructured during an append or partial overwrite that allows minimum copying of data while keeping the metadata commit after the change atomic, so it stays power-loss safe. This may require a way to make a complete block list for a file and commit a new version of it without changing the old one, before updating the metadata to point to the new block list.
