
Commit a29f891

boryas authored and gregkh committed
btrfs: use readahead_expand() on compressed extents
[ Upstream commit 9e9ff87 ]

We recently received a report of poor performance doing sequential buffered reads of a file with compressed extents. With bs=128k, a naive sequential dd ran as fast on a compressed file as on an uncompressed one (1.2GB/s on my reproducing system), while with bs<32k this performance tanked down to ~300MB/s.

i.e., slow:
  dd if=some-compressed-file of=/dev/null bs=4k count=X
vs fast:
  dd if=some-compressed-file of=/dev/null bs=128k count=Y

The cause of this slowness is overhead to do with looking up extent_maps to enable readahead pre-caching on compressed extents (add_ra_bio_pages()), as well as some overhead in the generic VFS readahead code we hit more in the slow case. Notably, the main difference between the two read sizes is that in the large sized request case, we call btrfs_readahead() relatively rarely while in the smaller request we call it for every compressed extent. So the fast case stays in the btrfs readahead loop:

  while ((folio = readahead_folio(rac)) != NULL)
          btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

where the slower one breaks out of that loop every time. This results in calling add_ra_bio_pages a lot, doing lots of extent_map lookups, extent_map locking, etc.

This happens because although add_ra_bio_pages() does add the appropriate un-compressed file pages to the cache, it does not communicate back to the ractl in any way. To solve this, we should be using readahead_expand() to signal to readahead to expand the readahead window.

This change passes the readahead_control into the btrfs_bio_ctrl and in the case of compressed reads sets the expansion to the size of the extent_map we already looked up anyway. It skips the subpage case as that one already doesn't do add_ra_bio_pages().

With this change, whether we use bs=4k or bs=128k, btrfs expands the readahead window up to the largest compressed extent we have seen so far (in the trivial example: 128k) and the call stacks of the two modes look identical. Notably, we barely call add_ra_bio_pages at all. And the performance becomes identical as well. So this change certainly "fixes" this performance problem.

Of course, it does seem to beg a few questions:

1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?

My stabs at some answers:

1. Hard to say. See attempts at generic performance testing below. Is there a "readahead_shrink" we should be using? Should we expand more slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from reading the file with a larger read size passed in, I don't see why one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like the pages would get re-used just as well in the readahead case. However, it is possible some reads that use page cache but not btrfs_readahead() could suffer. I will investigate this further as a follow up.
I tested the performance implications of this change in 3 ways (using compress-force=zstd:3 for compression):

Directly test the affected workload of small sequential reads on a compressed file (improved from ~250MB/s to ~1.2GB/s):

==========for-next==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s

dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s

dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s

dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s

==========ra-expand==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s

dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s

dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s

dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s

Built the linux kernel from clean (no change).

Ran fsperf. Mostly neutral results with some improvements and regressions here and there.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
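To make the window arithmetic concrete, here is a small user-space sketch (plain C, not kernel code; the offsets and sizes are illustrative, and the variable names only mirror the helpers used in the diff below) of what the new btrfs_readahead_expand() computes for the bs=4k case from the report: a 4 KiB readahead request that lands at the start of a compressed extent with 128 KiB of uncompressed data is expanded to cover the whole extent, so the following 4 KiB reads hit pages readahead already brought in.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/*
	 * Illustrative values for the slow bs=4k case: the readahead
	 * window starts as a single 4 KiB request at offset 0, and the
	 * extent_map describes a compressed extent whose uncompressed
	 * (ram_bytes) size is 128 KiB.
	 */
	const uint64_t ra_pos   = 0;             /* readahead_pos(ractl)    */
	const uint64_t ra_len   = 4 * 1024;      /* readahead_length(ractl) */
	const uint64_t em_start = 0;             /* em->start               */
	const uint64_t em_ram   = 128 * 1024;    /* em->ram_bytes           */

	const uint64_t ra_end = ra_pos + ra_len;
	const uint64_t em_end = em_start + em_ram;

	if (em_end > ra_end) {
		/* The kernel calls readahead_expand(ractl, ra_pos, em_end - ra_pos) here. */
		printf("expand readahead window from %llu to %llu bytes\n",
		       (unsigned long long)ra_len,
		       (unsigned long long)(em_end - ra_pos));
	}
	return 0;
}

Printed result: "expand readahead window from 4096 to 131072 bytes", which matches the commit message's description of the window growing to the 128k extent size regardless of the read size passed in by dd.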
1 parent 7bb675c · commit a29f891

File tree

1 file changed: +33 -1 lines

fs/btrfs/extent_io.c

Lines changed: 33 additions & 1 deletion
@@ -103,6 +103,7 @@ struct btrfs_bio_ctrl {
 	blk_opf_t opf;
 	btrfs_bio_end_io_t end_io_func;
 	struct writeback_control *wbc;
+	struct readahead_control *ractl;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -952,6 +953,23 @@ __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
 	}
 	return em;
 }
+
+static void btrfs_readahead_expand(struct readahead_control *ractl,
+				   const struct extent_map *em)
+{
+	const u64 ra_pos = readahead_pos(ractl);
+	const u64 ra_end = ra_pos + readahead_length(ractl);
+	const u64 em_end = em->start + em->ram_bytes;
+
+	/* No expansion for holes and inline extents. */
+	if (em->block_start > EXTENT_MAP_LAST_BYTE)
+		return;
+
+	ASSERT(em_end >= ra_pos);
+	if (em_end > ra_end)
+		readahead_expand(ractl, ra_pos, em_end - ra_pos);
+}
+
 /*
  * basic readpage implementation. Locked extent state structs are inserted
  * into the tree that are removed when the IO is done (by the end_io
@@ -1023,6 +1041,17 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 
 		iosize = min(extent_map_end(em) - cur, end - cur + 1);
 		iosize = ALIGN(iosize, blocksize);
+
+		/*
+		 * Only expand readahead for extents which are already creating
+		 * the pages anyway in add_ra_bio_pages, which is compressed
+		 * extents in the non subpage case.
+		 */
+		if (bio_ctrl->ractl &&
+		    !btrfs_is_subpage(fs_info, page) &&
+		    compress_type != BTRFS_COMPRESS_NONE)
+			btrfs_readahead_expand(bio_ctrl->ractl, em);
+
 		if (compress_type != BTRFS_COMPRESS_NONE)
 			disk_bytenr = em->block_start;
 		else
@@ -2224,7 +2253,10 @@ int extent_writepages(struct address_space *mapping,
 
 void extent_readahead(struct readahead_control *rac)
 {
-	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
+	struct btrfs_bio_ctrl bio_ctrl = {
+		.opf = REQ_OP_READ | REQ_RAHEAD,
+		.ractl = rac
+	};
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
