API to allow explicit compaction/garbage collection #523
This is a good idea for the garbage collection/block allocator, and wouldn't be too difficult to add. I am hoping to address the issues in #75 by moving away from garbage collection completely, but adding a function to eagerly populate the block allocator could be a good temporary solution. It also wouldn't be too difficult to extend the block allocator with information about which blocks have been erased; currently it doesn't track this since it only erases on demand.

One issue is that, unlike SPIFFS or Dhara, LittleFS doesn't persist the garbage collection state on disk. This limits the use of eager garbage collection if your device loses power often, and wouldn't change the cold-start time.

The metadata compaction is a different system and might be a bit more complex. You could iterate over all metadata-pairs and force a compaction if a metadata-pair is full past some threshold, but in doing so you are effectively losing potential writes from the last erase. Maybe this is acceptable if the threshold is sufficiently high, around ~75%? It could be a user-provided option. The metadata compaction operates on the sub-block level, so it currently throws away the existing metadata when it compacts.
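A sketch of what a threshold-driven, explicit call could look like from the user's side. lfs_fs_gc() is mentioned later in this thread, but the compact_thresh field and the exact semantics shown here are assumptions rather than settled API:

    // Sketch only: lfs_fs_gc() and a user-provided compaction threshold are
    // assumed here; field names and behavior are not settled API.
    #include "lfs.h"

    // In the configuration, ask gc to compact any metadata-pair that is more
    // than ~75% full (threshold given in bytes of a block):
    //     .compact_thresh = (3 * BLOCK_SIZE) / 4,

    // Called from application code at a convenient time, e.g. between write
    // bursts, to eagerly populate the block allocator and compact nearly-full
    // metadata-pairs so later writes don't pay for it.
    static int run_background_gc(lfs_t *lfs) {
        return lfs_fs_gc(lfs);
    }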
One of the future challenges, even with a different block allocator, is tracking which blocks have been erased. LittleFS doesn't require erased blocks to be all 0xffs, and this allows some interesting use cases such as cheap SD/eMMC and encrypted block devices. The downside is that, in order to know which blocks have been erased, we would need to store that somewhere. One of the more promising allocator schemes is basically just a bitmap in a file, but in order to update such a file you likely end up needing to erase+write new blocks. So updating the allocator to indicate a block is erased would require an erase, which sort of defeats the purpose. Still need to explore this more, but it promises to be challenging.
This appears to be a dupe of #493. Closing.
I'm going to reopen this since it is subtly different from #493 and worth noting. #493 is a signalling mechanism from littlefs to the disk, whereas this (explicit compaction/garbage collection) is a more general feature for littlefs that may be useful. Development-wise they involve different layers. They may both be solved by an allocator redesign (#75), but we'll have to see.
I've been experiencing some performance issues as well, and was looking for a garbage collection trigger. In my testing I noticed the following behaviour: write (append) operations would gradually take longer as a file gets bigger. Then a single write operation would take about double the previous write, the next one would be considerably shorter, and the process would repeat. I figured the 'long' write is performing a garbage collection (is this assumption correct?).

Then I introduced another file, which gets overwritten every time I append to the original file (this is to simulate the real-world operation of our system). Instead of a 'long write' to the file I'm appending to (as happened before), the append op takes a similar time to before, but the op to the extra file takes a lot longer than the other writes to it would have taken. So I'm figuring that write op has taken over the garbage collection.

So I got to thinking: why not just have a dummy file I write to that effectively triggers the garbage collection? Could this be made deterministic? (I'm guessing I'm just lucky with my test setup that the gc is happening only on the extra file, while in the wild it might happen on either.) Any thoughts?
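A rough sketch of the dummy-file workaround being described. The file name and size are illustrative, and, as the replies below note, which write actually absorbs the compaction is not guaranteed:

    // Sketch only: overwrite a small dummy file to append a commit to the
    // shared metadata-pair, in the hope that this write pays for compaction
    // instead of the latency-sensitive append elsewhere.
    #include "lfs.h"

    static int poke_metadata(lfs_t *lfs) {
        lfs_file_t f;
        uint8_t dummy = 0;

        int err = lfs_file_open(lfs, &f, "gc_dummy",
                LFS_O_WRONLY | LFS_O_CREAT | LFS_O_TRUNC);
        if (err) {
            return err;
        }

        lfs_ssize_t res = lfs_file_write(lfs, &f, &dummy, sizeof(dummy));
        if (res < 0) {
            lfs_file_close(lfs, &f);
            return (int)res;
        }

        return lfs_file_close(lfs, &f);
    }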
Hi @victorallume, this does sound like metadata compaction, one of the two garbage-collection-like processes, and one of the two main open problems for littlefs right now (#203). Out of curiosity, how large is your block size?

The key thing is that metadata compaction occurs per metadata-pair, with each directory being made out of one or more metadata-pairs. So if the files are in the same directory, they probably share a metadata-pair, which would show the behavior you're describing: one file write might compact the metadata-pair so the other file doesn't need to. If you moved the other file into a different directory, then you wouldn't see the largest cost, but eventually the second directory would need to compact as well.
Yep, it could happen to either. If one file fits in an "inline" file or has a bigger file name, its metadata may take up more space, which could make it more likely to trigger a compaction. But this isn't guaranteed, so it shouldn't be relied on.
This is interesting, and I suppose it could work if you're able to instrument the block device to detect the compaction. But it probably wouldn't work in the general case. If you have multiple files in the directory, it could take up multiple metadata-pairs, which means you might not be compacting the right metadata-pair.

You might be interested in enabling the code at lines 260 to 264 in 9c7e232.
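One way to do that instrumentation, sketched below under the assumption that you control the lfs_config block device callbacks (the wrapper and counter names are made up):

    // Wrap the erase callback from lfs_config so unexpected erases during a
    // small append can be counted or logged.
    #include "lfs.h"

    static int (*real_erase)(const struct lfs_config *c, lfs_block_t block);
    static unsigned erase_count;

    static int counting_erase(const struct lfs_config *c, lfs_block_t block) {
        erase_count++;  // a spike here during an append suggests compaction
        return real_erase(c, block);
    }

    // Setup, assuming cfg is the existing lfs_config passed to lfs_mount:
    //     real_erase = cfg.erase;
    //     cfg.erase = counting_erase;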
Hi @geky, thanks for the reply. The block size we're using is 4kB, and all of the files in question are in the same directory.
I'm using LittleFS on an 8Gbit NAND flash and would love to trigger compaction to speed up my directory listings later.
Unfortunately, relying on the masking nature of flash is not a good general solution. It works for many (most?) flash devices, but a number of flash devices disallow it due to either write-perturbation concerns or hardware-backed ECC.

The current design I'm working towards batches this up with the allocator redesign (#75), and adds an optional "block map" that tracks the status of every block in the filesystem. Such a block map could also be used to track bad blocks and allow blocks to be reserved for non-filesystem use.
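Purely as an illustration of the "block map" shape being described (nothing below exists in littlefs; the names and the 2-bits-per-block encoding are assumptions):

    #include <stdint.h>

    enum block_state {
        BLOCK_FREE   = 0,  // may hold stale data, needs erase before use
        BLOCK_ERASED = 1,  // known erased, safe to program directly
        BLOCK_IN_USE = 2,
        BLOCK_BAD    = 3,  // reserved or known-bad, never allocate
    };

    // 2 bits of state per block => block_count/4 bytes for the whole map.
    static inline unsigned blockmap_get(const uint8_t *map, uint32_t block) {
        return (map[block / 4] >> ((block % 4) * 2)) & 0x3;
    }

    static inline void blockmap_set(uint8_t *map, uint32_t block, unsigned state) {
        map[block / 4] = (uint8_t)((map[block / 4] & ~(0x3 << ((block % 4) * 2)))
                | ((state & 0x3) << ((block % 4) * 2)));
    }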
If I'm understanding this correctly, why not just set
This graph is curious. Are you measuring

I would normally expect the performance to drop to ~1/2 after compaction, not ~0. I wonder if it's because you're operating with a very high

It depends on which of the above operations you're calling, but I think the bottleneck/cause for the ramp is the mdir fetch operation necessary to check the metadata block for consistency. mdir fetch needs to scan the metadata block to validate checksums, which ends up

The mdir fetch cost has been a bit annoying as it's pretty fundamental. The best solution I've come up with is maintaining a cache of metadata trunks in RAM. Once fetched, an mdir doesn't really require all that much RAM to keep track of: only 8 words/32 bytes at the moment.
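As a rough illustration of the kind of per-mdir state such a cache would hold (the field names loosely mirror littlefs internals but are assumptions, as is the exact size):

    #include <stdint.h>
    #include <stdbool.h>

    // Hypothetical cache entry for a fetched mdir, roughly the
    // "8 words/32 bytes" mentioned above.
    typedef struct cached_mdir {
        uint32_t pair[2];  // the metadata-pair's two block addresses
        uint32_t rev;      // revision count of the active block
        uint32_t off;      // offset of the last valid commit
        uint32_t etag;     // tag state needed to append new commits
        uint16_t count;    // number of ids in the mdir
        bool erased;       // can we append without erasing?
        bool split;        // does the directory continue in a tail pair?
        uint32_t tail[2];  // tail metadata-pair, if split
    } cached_mdir_t;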
This is basically because the compaction algorithm is too naive. I think this thread is the best discussion on the issue: #783. The open issues really need to get better categorized at some point...

There was a fundamental assumption in littlefs's design that blocks are "small" and that we can ignore algorithmic complexity related to the block size. This assumption came from focusing on NOR flash/eMMC and turned out to be wrong. At ~256KiB blocks, you can't ignore the complexity of in-block operations. The current metadata compaction algorithm is

The current workaround for this is the
This is an interesting idea, and may be worth opening another issue. One could imagine a sort of

I considered adding LFS_DEBUG statements to mdir compaction, but was worried it would result in too much debug spam, since mdir compaction toes the line between a common and an uncommon operation...
Technically erase is the only operation that wears flash :)
@geky I am using timing routines (almost ns-accurate) built into my NAND (over SPIM on nRF53) to time the read portions that happen during file creation and writes. I also time the complete cycle of opening the directory (as a separate handle), listing all content, and closing it again to measure the list time.

In my use case I create 24-28 files that range from 20-30 MByte in size, one at a time, and at a later date read them all back once before deleting. The total amount of data written is between 500-800 MByte on a 1GByte (8Gbit) NAND flash. You are absolutely right about the flash write_size, and it was that which gave me the best benefits combined with the new garbage collection. Since I know, after deleting files, that I will be creating a bunch of files on the next iteration, I can call the function ahead of time. You were also right about the metadata being mostly padding due to write_size, and the 1/4 sector solution helps a lot.

To be able to analyse the structures on disk I wrote my own parser, which can also find the header data thanks to the metadata written and synced when a file is created. Sudden power loss (each file is 1 hour of data) will leave the filesystem with knowledge of only the initial sync, but once the parser scans and finds the header (488 bits of uniquely identifiable data) it has a block reference it can use to identify the sequentially allocated blocks pointing to this block, and so on. It can thus recreate the CTZ chain of blocks to find the last block. Since the data in the file is also formatted, it can detect how long the file is, and I added code to replace the file's metadata upon opening by a normal open command, so it will then use the CTZ ref instead with the recovered file length. I have successfully recovered all lost files due to power drop over a trial of 20 units, with the last file lost on each. It might be worth committing metadata about potential file data location in the future?

The only issue with lfs_fs_gc() right now would be that if you have not filled 1/2 of the first block_size it does nothing, even though it may be bytes away from this limit and the entire metadata set could fit in a single sector.
This is an interesting direction to go. Though I'm not sure if it can be made to work well in the general case:
But there's nothing wrong with modifying littlefs locally for a specific use-case.
Well, there is nothing written to the first block allocated to a file right now, except file data (and checksum?), but the next block starts with CTZ data pointing to the first block. There is nothing preventing the first data block from containing the filename etc., except that renaming files would be a nightmare. A 32-bit unique identifier linked to a metadata tag committed with the filename tag would work. It could also be attached to every block assigned to the file. Not sure how to avoid a discarded block being mistaken for belonging to a file, though. Maybe erasing discarded blocks works for that (except, again, power loss may prevent this from being consistent).
It sounds like this could work, though discarded blocks (as you mentioned) and 32-bit overflow could present problems. It also starts to look a lot more like a logging filesystem.

It's interesting to note that this is more-or-less how littlefs works inside a metadata log. Each file gets a 10-bit id, and we scan for the most recent "block" (inline data). This is also where most of the mdir-related bottlenecks come from. It takes
Isn't this solved by changing all lfs_file_write calls to:

    lfs_ssize_t res = lfs_file_write(&lfs, &file, buffer, size);
    if (res < 0) {
        return res;
    }
    int err = lfs_file_sync(&lfs, &file);
    if (err) {
        return err;
    }

? I think syncing on partial file write would be quite problematic for users.
The issue with syncing on partial file write, or automatically all the time, is the reallocation of data if it is to be appended to afterwards. The problem is you do not know when you should sync, as you do not know if you have reached a block boundary where there would be zero reallocation.
I think we're approaching the same problem from two different directions. From my perspective, waiting to update metadata until sync is the correct API in terms of making it easy for users to write power-safe applications. And performance issues (reallocation of data) are problems that need fixing in the filesystem's design. To this end I've been working on improving sync performance by storing incomplete blocks in the metadata (like inline files). This should avoid most tail block reallocations.
This could be added, maybe LFS_O_EAGERSYNC? (open to better names). But I would want to see the above improvements to sync performance land first. My hope is that we can push sync to be fast enough that additional APIs such as LFS_O_EAGERSYNC (name tbd) are not necessary.
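For illustration, usage of the proposed flag might look like the following (LFS_O_EAGERSYNC does not exist today; the name and behavior are only a suggestion):

    // Hypothetical: open a log file so every write also updates metadata,
    // instead of waiting for lfs_file_sync/lfs_file_close.
    lfs_file_t file;
    int err = lfs_file_open(&lfs, &file, "log.bin",
            LFS_O_WRONLY | LFS_O_CREAT | LFS_O_APPEND | LFS_O_EAGERSYNC);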
Yes, this is just two words (~8 bytes) for the CTZ skip-list. But, as I think you've found out in other comments, padding to the

That's where I think the storing of partial blocks can work quite well with NAND's limitations. If we're writing a whole
Ooooh! I like that idea! Partial blocks could be treated differently, but for large blocks it may not work well inline. I am just rambling here, but conceptually what I mean is a way for a file to be restructured during append or partial overwrite that allows minimum copying of data while at the same time keeping the metadata commit after the change atomic, to be power-loss safe. This may require a way to make a complete block list of a file and commit a new version of it without changing the old one, before updating the metadata to point to the new block list.
From what I can tell, compaction/garbage collection only occurs as the file system is being used (written to).
On devices with longer erase times (I'm currently looking at a part that takes 1.6 seconds to erase 32kB), this will lead to poor performance during write function calls if compaction/garbage collection is needed.
It would be very useful to have a separate API that system integrators can call to explicitly initiate a round of compaction/garbage collection. An API like this would allow them to schedule expensive erase operations at times that are less likely to impact their system's write performance.
An example of where this may be useful is when data is only intermittently/periodically written to flash, but when it is, it needs to be written as quickly as possible. Garbage collection can then be performed over the longer time periods between the write bursts, ensuring that the flash is in a readily writable state by the time the next series of write calls occurs.
While this may not eliminate the need for all compaction/garbage collection during write calls, it has the potential to improve the consistency of write call durations.
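As a sketch of the intended usage, assuming an explicit entry point along the lines of lfs_fs_gc() referenced elsewhere in this thread (the idle-hook name and scheduling policy are application-specific assumptions):

    #include "lfs.h"

    extern lfs_t lfs;  // filesystem mounted elsewhere

    // Called from the application's idle/background task, away from the
    // latency-sensitive write bursts.
    void app_idle_hook(void) {
        int err = lfs_fs_gc(&lfs);
        if (err) {
            // Non-fatal: the next write simply pays the erase/compaction
            // cost as it would have anyway.
        }
    }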