LFSv2: inline files performance drop #203

Open
pasi-ojala-tridonic opened this issue Jun 7, 2019 · 11 comments

@pasi-ojala-tridonic

Running on a device with very little flash available, I decided to give the inline files of v2 a go. I'm not sure I got everything configured correctly, but I tried to follow the instructions.

I create a couple of files each holding an integer, increment the values in a loop, and things basically seem to work. I see lots of reads towards my flash; I assume it's the inline files that cause the need to read a lot. I could live with that: the data seems valid, and the file contents keep changing as expected.

However, after some loops (starting from line 1246 and lasting until line 1446 in the log), the reads just explode when updating a value in file 8, with the update taking about 25 seconds. There's an ERASE happening in the middle, but that's not taking much time compared to all the reads happening.
Two minutes later (search for ERASE), this happens with file 6, then later for file 4, then 2 and then back to 8.

I was looking around a bit in the debugger: when these huge read blocks appear, the recursion in lfs_dir_traverse (https://github.com/ARMmbed/littlefs/blob/master/lfs.c#L700) seems to kick in, although I never saw the recursion go deeper than level 1.

In the test snippet, mountLittleFS() does the mount and printing out of directory contents.

lfstest.txt
lfsv2-8files.txt

@geky
Member

geky commented Jun 10, 2019

Hi @pasi-ojala-tridonic, thanks for creating an issue.

This is a part of how littlefs works. At its core, littlefs is built out of a bunch of small 2-block logs (metadata pairs, similar to journals). These logs are appended with new information as the filesystem is written to. Eventually they get full and need to be garbage collected to remove outdated entries. Unfortunately this garbage collection is expensive (O(erase size^2)).

The main approach littlefs takes to prevent this from becoming unusable is to keep the logs small, limited to 2 erase blocks, and to build more scalable data structures out of the 2-block logs. But there's still a risk that this does not scale as the erase size gets too large.

Inline files are files that are small enough that we can fit them directly in their parent directory's 2-block log, instead of just storing a pointer to the file's data structure. This can lead to more data being stored in the 2-block log. It's possible that limiting the size of inline files could improve the performance of garbage collection, but there's still the risk of long-running spikes, as the 2-block log will still have to garbage collect at some point.

More info here:
https://github.com/ARMmbed/littlefs/blob/master/DESIGN.md#metadata-pairs
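
For reference, nothing special is needed in the API to get an inline file; a file simply stays inline while it is below the inline threshold (by default roughly min(cache_size, block_size/8)). A minimal sketch of the kind of counter update described above, assuming a mounted lfs_t and a hypothetical helper name:

    #include <stdint.h>
    #include "lfs.h"

    // Sketch: bump a small counter file. Because the file is only a few bytes,
    // littlefs keeps it inline in its parent directory's 2-block log instead of
    // allocating separate data blocks. Each close appends a commit to that log,
    // which is what eventually forces the expensive compaction described above.
    int bump_counter(lfs_t *lfs, const char *path) {
        lfs_file_t file;
        uint32_t count = 0;

        int err = lfs_file_open(lfs, &file, path, LFS_O_RDWR | LFS_O_CREAT);
        if (err) {
            return err;
        }

        // read the previous value if the file already has one
        lfs_file_read(lfs, &file, &count, sizeof(count));
        count += 1;

        // rewind and write the new value back
        lfs_file_rewind(lfs, &file);
        lfs_file_write(lfs, &file, &count, sizeof(count));
        return lfs_file_close(lfs, &file);
    }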

There's an ERASE happening in the middle, but that's not taking much time compared to all the reads happening.

That's interesting; generally the opposite is true due to how time-consuming erase operations are. What chip is this?

Also what is your device's geometry (read size/prog size/cache size/erase size)? The spike from garbage collection gets worse as the erase size increases.
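
For comparison, those geometry values are the fields set in struct lfs_config. A sketch for a hypothetical 4 MB SPI NOR part with 4 KiB erase sectors might look like the following (the block device callbacks are assumed to exist elsewhere, and the numbers are illustrative only):

    // geometry-related fields of struct lfs_config (illustrative values)
    const struct lfs_config cfg = {
        .read  = my_flash_read,    // hypothetical block device callbacks
        .prog  = my_flash_prog,
        .erase = my_flash_erase,
        .sync  = my_flash_sync,

        .read_size      = 64,      // minimum read granularity
        .prog_size      = 256,     // minimum program (page) granularity
        .block_size     = 4096,    // erase block size, the "erase size" above
        .block_count    = 1024,    // 4 MB / 4 KiB
        .cache_size     = 256,     // multiple of read_size and prog_size
        .lookahead_size = 32,
        .block_cycles   = 500,
    };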

I never saw the recursion go deeper than level 1

littlefs does use recursion, but all recursion should be bounded. I have not measured the maximum stack usage, though that would be useful.


Also, there was a recent change that may have negatively impacted read performance:
#158

If you move back to v2.0.2, do you see an improvement?

@pasi-ojala-tridonic
Author

Hey @geky, thanks for the reply. Some answers below.

The chip is the SST25PF040C, with 4k erase blocks.

I didn't mean the erase is especially fast, but when that happens it's surrounded by hundreds of reads (caused by lfs_dir_traverse, I assume), so in comparison the one erase doesn't really matter when the system stalls for half a minute. I'm seeing a lot of the same 1k blocks read over and over again. (I had a read size of 64 previously; that just amounted to proportionally more reads.) The read size on line 1266 of the log obviously seems a bit suspicious too, but I didn't debug into that.

Also, it looks to me like the garbage collection time grows (linearly, based on a whopping 2 observations) with the number of files: I tested with about 100 (inlined) files, and the collection was in the order of several minutes. Is that expected?

This is a part of how littlefs works.

So just to be clear: when the inline files fill the block, garbage collection will slow down the system, OK. But is the tens to hundreds of seconds of downtime expected?

I'll need to check the change you mentioned at some point. (Un)fortunately I'll be starting my vacation tomorrow, but I have colleagues who'll be interested in this and monitoring the thread, so any info at any time will be appreciated.

@ksstms

ksstms commented Aug 5, 2019

Hi! I ran into the same problem.
Unfortunately, my flash has a 128K erase size. That's not a big problem for my ~16 MB files, but now I'm trying to figure out whether it's worth it to store some small files as well.

I have tried to create 100 files, each with 256 bytes of data.
Here's the full log of my emulated device: createfiles.txt (format: address, block, offset, size)

It works great for the most part, but sometimes there's a massive spike in flash access. For example when closing file61.dat, there are ~6000 reads, then a block erase, and then another 6000 reads and some writes. It looks like this:
close_file61

... and goes something like this:

  • read block 0 multiple times (6000 2K reads on a block that has 64 2K pages)
  • erase block 1
  • write 9 pages to block 1, but perform 800-1000 reads on block 0 between them

I get that sometimes the FS has to do some garbage collection and other things, but it seems strange that it needs this many operations. In this case, creating a 256B file has caused 24 MB of flash operations, which would take significant time on a real device.
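
(For what it's worth, the 24 MB figure is consistent with the trace above: roughly 2 passes x ~6000 reads x 2 KiB per read ≈ 24 MB, so almost all of it is the repeated re-reading of block 0 rather than the erase or the 9 page writes.)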

@geky
Member

geky commented Aug 7, 2019

Thanks to both of you for posting benchmarks. After running some ad hoc tests of my own, I'm getting a better understanding of the issue.

I think something we will need before we can say this and other performance issues are truly fixed is a benchmarking framework. It will take more time, but I think it will be worth it. One of the issues is that I was primarily testing on an MX25R over QSPI with a relatively high SPI frequency, so reads were very fast, making erase time dominate.

While I had theoretical numbers for different storage types, I haven't put them into any sort of practical context.


@ksstms, interesting graph. Is the x-axis access time?

Also do you know how long it takes to iterate over the directory when full? This should be equivalent to one of the vertical spikes in your graph.

To share more information, the culprit function responsible for compacting the metadata pairs (one of the two garbage collectors) is lfs_dir_compact:
https://github.com/ARMmbed/littlefs/blob/master/lfs.c#L1424

The two sweeps are to:

  1. Calculate the size of the compacted metadata:
    https://github.com/ARMmbed/littlefs/blob/master/lfs.c#L1436
  2. Copy the most recent tags over to the new block:
    https://github.com/ARMmbed/littlefs/blob/master/lfs.c#L1558

I was also considering going back to a previous solution that did only one pass; I changed it in a548ce6 (explanation in the commit message). That could possibly cut the cost in half, but I don't think it fully solves the problem.

The compaction is O(n^2). This is the root of the problem. The assumed solution was to keep n small, so the runtime is really O(block-count * block-size^2).
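
To put rough numbers on that quadratic term: going from a 4 KiB erase block (the SST25PF040C above) to a 128 KiB erase block (the NAND case above) makes each worst-case compaction roughly (128/4)^2 = 1024x more expensive, while the log only holds about 32x more entries before it fills. So the amortized cost per write grows roughly linearly with block size, but the latency spike when compaction finally hits grows quadratically.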

While putting together v2, this was chosen as a tradeoff to enable better erase usage and a lower minimum storage size without increasing RAM consumption. But it sounds like this tradeoff should be reconsidered.


If you have RAM to throw at the problem, increasing prog_size will decrease the runtime cost by a constant amount. But I realize that's really a non-answer for a filesystem that's supposed to use bounded RAM.

It may be possible to introduce an artificial commit_size option that reduces the runtime cost by limiting the number of commits in a metadata block. This would be the same as increasing prog_size but without the RAM cost. If commit_size == block_size it would behave the same as v1.

I'm also looking into a better long-term solution based on some sort of append-tree, but this may take quite a bit of time to implement.

@ksstms

ksstms commented Aug 7, 2019

is the x-axis access time?

It's just the number of transactions. Sorry for the uninformative graph; I just wanted to illustrate that there's an enormous number of read passes before anything happens.
I haven't tested this on a real device, but I know that reading a full 16 MB file on our system takes about 6 s. Considering that those are mostly big reads (cache_size is 2K), it's safe to say that the 24 MB total read of this experiment would take about 9 s.

Currently we are using QSPI with DMA. I'm working on a solution to use DMA for big transactions only, so I can lower my read_size to a more reasonable number. That way, if the FS needs 4 B from a page, the CPU doesn't have to bother setting up DMA. This also means that there will be more flash accesses because of the unused cache, so we have to find the right balance.
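
As a sketch of that tradeoff (illustrative values only, reusing the 2 KiB cache mentioned above): read_size only sets the minimum transfer, so it can be made small for the 4 B metadata reads while cache_size stays large so bulk file data still moves in DMA-friendly chunks:

    // geometry sketch: tiny read granularity, large cache for bulk transfers
    .read_size  = 4,      // 4 B metadata reads no longer cost a 2 KiB transfer
    .prog_size  = 512,    // illustrative page size
    .cache_size = 2048,   // must be a multiple of both read_size and prog_size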

Also do you know how long it takes to iterate over the directory when full? This should be equivalent to one of the vertical spikes in your graph.

For now all I know is that it's 64 2K reads (my flash has 64 2K pages per erasable block). I'm going to measure things on the real system and get back with the results.

@Johnxjj

Johnxjj commented Sep 18, 2019

I encountered this problem recently. To shorten the time, my only option was to load the superblock data into RAM beforehand and read the data directly from RAM. With this solution, the time dropped from 2 s to 200 ms.

@JojoS62

JojoS62 commented Jul 11, 2020

I have created a little test with Mbed OS 6.1.0. It creates a number of short files and prints some statistics with the ProfilingBlockDevice. There I also see the very large performance drop, in this case after 71 files.

LFS-v2-statistics.txt

Source code can be found here: https://github.com/JojoS62/testLFSv2/blob/master/source/main.cpp

The columns in the report are:
file number, time for creating the file, diff to the previous operation for read / write / erase blocks, and the absolute access count of read / write / erase blocks.
An increase of 720,000 reads does not look plausible. In general, v2 is faster, except after a certain number of files.

LfsV1-Writefile-12Byte
LfsV2-Writefile-12Byte

In the v2 diagram, the write times are 2x higher; there the RTOS was enabled, while for the later report I used the bare-metal setting.

@mon
Contributor

mon commented Dec 10, 2020

Would a good interim solution be the ability to disable or limit the number of inline files?

I'm using one of those 128kB-block NAND flashes (MT29F) and seeing terrible read performance.
With the default setup, I see a huge number of 4-byte reads with decreasing offset; I guess there's some sort of list traversal happening to find inline files?

Forcing LFSv2 to always think there is no room for an inlined file drastically increases performance:

  • Reading ~3 files in a traversed folder moves from 4.7 seconds to 0.5 seconds
  • Deleting those files moves from 7.8 seconds to 0.26 seconds

LFSv2 has provided a huge boost in write performance over LFSv1, so I don't want to ditch it, but hacking around the inline file logic feels bad.
Our files are few enough that we have effectively no chance of running out of blocks (2048)
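
For reference, one way to force that behaviour (a sketch only, not necessarily the exact hack used here) is to shrink the inline threshold in lfs_file_write so that the very first write pushes a file out of the metadata pair and into its own blocks:

    // Sketch only: the stock check compares the file size against
    // lfs_min(0x3fe, lfs_min(cache_size, block_size/8)). Dropping that
    // threshold to 0 makes every write "too big" to stay inline.
    if ((file->flags & LFS_F_INLINE) &&
            lfs_max(file->pos+nsize, file->ctz.size) > 0) {
        // littlefs moves the file out of the metadata pair here
        ...
    }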

@geky
Member

geky commented Dec 13, 2020

@mon I'm not sure, since littlefs still tries to pack as much data in a metadata-pair as possible. If you limited inline files, instead of filling up with file data, the metadata-pairs would fill up with metadata (these are logs, so outdated changes are not cleaned up until the expensive compact step). They would hit the same performance issue, just slower.


I haven't had time to experiment with this much, but as another short-term idea: has anyone tried artificially increasing the prog_size? This should reduce the number of commits in each metadata-pair, since each commit is padded to fill a prog_size.

The tradeoff would be that you would need RAM (cache_size) = prog_size, and that littlefs would never write less than a full prog_size.
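
A minimal sketch of that workaround (illustrative numbers; the flash's real page size is assumed to divide the inflated prog_size):

    // pad each commit to a larger prog_size so fewer commits fit in a
    // metadata pair before compaction kicks in; costs RAM via cache_size
    .prog_size  = 2048,   // artificially larger than the flash's real page
    .cache_size = 2048,   // must be at least prog_size (and a multiple of it)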

@mon
Contributor

mon commented Dec 14, 2020

@geky, excellent point on the metadata pairs becoming the limiter; it explains my immediate performance gains.

What about some sort of "metadata limit", where there is a maximum of, say, 4k allowed for inline files and metadata pairs, and the remaining bytes in the block are only used for files big enough to fill it? I think this would bound the metadata/inlines enough to make the compaction pass more like O(1) regardless of the actual block size.

Obviously that comes with drawbacks to the total number of storable files on these large NANDs, but I see it as a similar tradeoff to the existing cache sizes. At first glance it seems like it should be quite simple to implement: I'm thinking of modifying the lfs->cfg->block_size/2 in lfs_dir_compact and the lfs->cfg->block_size/8 in lfs_file_write. It doesn't appear to have broken anything so far, but I haven't done proper benchmarking like JojoS62.

I can't increase prog_size easily since we're almost near our RAM limit already.

@geky
Member

geky commented Dec 14, 2020

What about some sort of "metadata limit"?

Oh! That is a better idea. Forcing the metadata-pair to compact early should keep the amount of metadata manageable without a RAM cost. That might be the best short-term solution for this bottleneck.

You will also need to change the check in lfs_dir_commit: by default littlefs fills up the entire metadata block before compacting, and only during compaction does the lfs->cfg->block_size/2 kick in. But at that point you will already have to make it through one expensive metadata compact.

I believe these are the only three places that would need changing (a sketch of such a change follows the list below):

  • Overflow check during commits (littlefs/lfs.c, line 1887 at 1a59954):

        .end = lfs->cfg->block_size - 8,

  • Overflow check during compact (littlefs/lfs.c, lines 1587 to 1593 at 1a59954):

        // space is complicated, we need room for tail, crc, gstate,
        // cleanup delete, and we cap at half a block to give room
        // for metadata updates.
        if (end - begin < 0xff &&
                size <= lfs_min(lfs->cfg->block_size - 36,
                    lfs_alignup(lfs->cfg->block_size/2,
                        lfs->cfg->prog_size))) {

  • Inline file check during write (littlefs/lfs.c, lines 2966 to 2969 at 1a59954):

        if ((file->flags & LFS_F_INLINE) &&
                lfs_max(file->pos+nsize, file->ctz.size) >
                lfs_min(0x3fe, lfs_min(
                    lfs->cfg->cache_size, lfs->cfg->block_size/8))) {
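
For illustration only, here is a minimal sketch of what such a cap might look like, using a hypothetical compile-time constant LFS_METADATA_MAX applied at the three checks above (this is not what littlefs ships; it only fleshes out the idea discussed here):

    // hypothetical cap on how much of each block is used for metadata,
    // including inline files
    #define LFS_METADATA_MAX 4096

    // 1. overflow check during commits: stop appending once the cap is hit
    //    instead of filling the whole block
    .end = lfs_min(LFS_METADATA_MAX, lfs->cfg->block_size) - 8,

    // 2. overflow check during compact: cap the compacted size relative to
    //    LFS_METADATA_MAX rather than the full block
    if (end - begin < 0xff &&
            size <= lfs_min(lfs_min(LFS_METADATA_MAX, lfs->cfg->block_size) - 36,
                lfs_alignup(lfs_min(LFS_METADATA_MAX, lfs->cfg->block_size)/2,
                    lfs->cfg->prog_size))) {

    // 3. inline file check during write: inline files may only grow to a
    //    fraction of the cap instead of a fraction of the block
    if ((file->flags & LFS_F_INLINE) &&
            lfs_max(file->pos+nsize, file->ctz.size) >
            lfs_min(0x3fe, lfs_min(
                lfs->cfg->cache_size,
                lfs_min(LFS_METADATA_MAX, lfs->cfg->block_size)/8))) {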
