Skip to content

Commit 394547b

Browse files
adam900710gregkh
authored andcommitted
btrfs: fix corruption reading compressed range when block size is smaller than page size
[ Upstream commit 9786531 ] [BUG] With 64K page size (aarch64 with 64K page size config) and 4K btrfs block size, the following workload can easily lead to a corrupted read: mkfs.btrfs -f -s 4k $dev > /dev/null mount -o compress $dev $mnt xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null echo "correct result:" od -Ad -t x1 $mnt/base xfs_io -f -c "reflink $mnt/base 32k 0 32k" \ -c "reflink $mnt/base 0 32k 32k" \ -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null echo "incorrect result:" od -Ad -t x1 $mnt/new umount $mnt This shows the following result: correct result: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0065536 incorrect result: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0065536 Notice the zero in the range [32K, 60K), which is incorrect. [CAUSE] With extra trace printk, it shows the following events during od: (some unrelated info removed like CPU and context) od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000 The "r/i" is indicating the root and inode number. In our case the file "new" is using ino 258 from fs tree (root 5). Here notice the @prev_em_start pointer is NULL. This means the btrfs_do_readpage() is called from btrfs_read_folio(), not from btrfs_readahead(). od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768 These above 32K blocks will be read from the first half of the compressed data extent. od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768 Note here there is no btrfs_submit_compressed_read() call. Which is incorrect now. Although both extent maps at 0 and 32K are pointing to the same compressed data, their offsets are different thus can not be merged into the same read. So this means the compressed data read merge check is doing something wrong. od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440 The function btrfs_submit_compressed_read() is only called at the end of folio read. The compressed bio will only have an extent map of range [0, 32K), but the original bio passed in is for the whole 64K folio. This will cause the decompression part to only fill the first 32K, leaving the rest untouched (aka, filled with zero). This incorrect compressed read merge leads to the above data corruption. There were similar problems that happened in the past, commit 808f80b ("Btrfs: update fix for read corruption of compressed and shared extents") is doing pretty much the same fix for readahead. But that's back to 2015, where btrfs still only supports bs (block size) == ps (page size) cases. This means btrfs_do_readpage() only needs to handle a folio which contains exactly one block. Only btrfs_readahead() can lead to a read covering multiple blocks. Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer. With v5.15 kernel btrfs introduced bs < ps support. This breaks the above assumption that a folio can only contain one block. Now btrfs_read_folio() can also read multiple blocks in one go. But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the existing bio force submission check will never be triggered. In theory, this can also happen for btrfs with large folios, but since large folio is still experimental, we don't need to bother it, thus only bs < ps support is affected for now. [FIX] Instead of passing @prev_em_start to do the proper compressed extent check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that the existing bio force submission logic will always be triggered. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1 parent a29f891 commit 394547b

File tree

1 file changed

+32
-14
lines changed

1 file changed

+32
-14
lines changed

fs/btrfs/extent_io.c

Lines changed: 32 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,24 @@ struct btrfs_bio_ctrl {
104104
btrfs_bio_end_io_t end_io_func;
105105
struct writeback_control *wbc;
106106
struct readahead_control *ractl;
107+
108+
/*
109+
* The start offset of the last used extent map by a read operation.
110+
*
111+
* This is for proper compressed read merge.
112+
* U64_MAX means we are starting the read and have made no progress yet.
113+
*
114+
* The current btrfs_bio_is_contig() only uses disk_bytenr as
115+
* the condition to check if the read can be merged with previous
116+
* bio, which is not correct. E.g. two file extents pointing to the
117+
* same extent but with different offset.
118+
*
119+
* So here we need to do extra checks to only merge reads that are
120+
* covered by the same extent map.
121+
* Just extent_map::start will be enough, as they are unique
122+
* inside the same inode.
123+
*/
124+
u64 last_em_start;
107125
};
108126

109127
static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -978,7 +996,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
978996
* return 0 on success, otherwise return error
979997
*/
980998
static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
981-
struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
999+
struct btrfs_bio_ctrl *bio_ctrl)
9821000
{
9831001
struct inode *inode = page->mapping->host;
9841002
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -1095,12 +1113,11 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
10951113
* non-optimal behavior (submitting 2 bios for the same extent).
10961114
*/
10971115
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
1098-
prev_em_start && *prev_em_start != (u64)-1 &&
1099-
*prev_em_start != em->start)
1116+
bio_ctrl->last_em_start != (u64)-1 &&
1117+
bio_ctrl->last_em_start != em->start)
11001118
force_bio_submit = true;
11011119

1102-
if (prev_em_start)
1103-
*prev_em_start = em->start;
1120+
bio_ctrl->last_em_start = em->start;
11041121

11051122
free_extent_map(em);
11061123
em = NULL;
@@ -1146,12 +1163,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
11461163
struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
11471164
u64 start = page_offset(page);
11481165
u64 end = start + PAGE_SIZE - 1;
1149-
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
1166+
struct btrfs_bio_ctrl bio_ctrl = {
1167+
.opf = REQ_OP_READ,
1168+
.last_em_start = (u64)-1,
1169+
};
11501170
int ret;
11511171

11521172
btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
11531173

1154-
ret = btrfs_do_readpage(page, NULL, &bio_ctrl, NULL);
1174+
ret = btrfs_do_readpage(page, NULL, &bio_ctrl);
11551175
/*
11561176
* If btrfs_do_readpage() failed we will want to submit the assembled
11571177
* bio to do the cleanup.
@@ -1163,17 +1183,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
11631183
static inline void contiguous_readpages(struct page *pages[], int nr_pages,
11641184
u64 start, u64 end,
11651185
struct extent_map **em_cached,
1166-
struct btrfs_bio_ctrl *bio_ctrl,
1167-
u64 *prev_em_start)
1186+
struct btrfs_bio_ctrl *bio_ctrl)
11681187
{
11691188
struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
11701189
int index;
11711190

11721191
btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
11731192

11741193
for (index = 0; index < nr_pages; index++) {
1175-
btrfs_do_readpage(pages[index], em_cached, bio_ctrl,
1176-
prev_em_start);
1194+
btrfs_do_readpage(pages[index], em_cached, bio_ctrl);
11771195
put_page(pages[index]);
11781196
}
11791197
}
@@ -2255,19 +2273,19 @@ void extent_readahead(struct readahead_control *rac)
22552273
{
22562274
struct btrfs_bio_ctrl bio_ctrl = {
22572275
.opf = REQ_OP_READ | REQ_RAHEAD,
2258-
.ractl = rac
2276+
.ractl = rac,
2277+
.last_em_start = (u64)-1,
22592278
};
22602279
struct page *pagepool[16];
22612280
struct extent_map *em_cached = NULL;
2262-
u64 prev_em_start = (u64)-1;
22632281
int nr;
22642282

22652283
while ((nr = readahead_page_batch(rac, pagepool))) {
22662284
u64 contig_start = readahead_pos(rac);
22672285
u64 contig_end = contig_start + readahead_batch_length(rac) - 1;
22682286

22692287
contiguous_readpages(pagepool, nr, contig_start, contig_end,
2270-
&em_cached, &bio_ctrl, &prev_em_start);
2288+
&em_cached, &bio_ctrl);
22712289
}
22722290

22732291
if (em_cached)

0 commit comments

Comments
 (0)