Skip to content

Conversation

@boryas
Copy link
Contributor

@boryas boryas commented Apr 14, 2022

Don't want to erase it entirely from the doc, since it is the ideal way to use it ultimately. However, the bug is significant, so I think it's our responsibility to encourage users that can use umount/mount (e.g. in an initrd, not for root fs) to use it.

adam900710 and others added 11 commits April 8, 2022 23:33
…tree block read

We used to use read_whole_eb() to read super block, but it's no longer
the case (so long that I can not even find out which commit did the
conversion).

Thus there is no need for read_whole_eb() to handle super block read
anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For metadata restore, our logical address is mapped to a single device
with logical address 1:1 mapped to device physical address.

Move this part of code into a helper, this will make later extent buffer
read path refactoer much easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two call sites using write_extent_to_disk() directly:

- debug_corrupt_block() in btrfs-corrupt-block.c
- corrupt_keys() in btrfs-corrupt-block.c

The problem of write_extent_to_disk() is, it can only handle plain
profiles (All profiles except P/Q stripes of RAID56).

Calling it directly can corrupted RAID56 P/Q, and in the future we're
going to remove eb::fd/eb::dev_bytes, so remove such call sites with
write_and_map_eb().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_extent_to_disk() is just writing the content of a tree
block to disk.

It can not handle RAID56, and its work is the same as
write_data_to_disk().

Thus we can replace write_extent_to_disk() with write_data_to_disk()
easily.

There is only one special call site in write_raid56_with_parity(), which
can easily be replace with btrfs_pwrite() directly.

This reduce the write entrance, and make later eb::fd removal easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…sk() and replace read_extent_data()

The function read_extent_from_disk() is only a wrapper to read tree
block.

And read_extent_data() is just a while loop to eliminate short read
caused by stripe boundary.

In fact, a lot of call sites of read_extent_data() are either reading
metadata (thus no possible short read) or doing extra loop by
themselves.

This patch will replace those two functions with read_data_from_disk(),
making it the only entrance for data/metadata read.
And update read_data_from_disk() to return the read bytes, so caller can
do a simple while loop.

For the few callers of read_extent_data(), open-code a small while loop
for them.

This will allow later RAID56 read repair using P/Q much easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those two members are a shortcut for non-RAID56 profiles.

But we should not use such shortcut, and move all our logical address
read/write to the unified read_data_from_disk()/write_data_to_disk().

With previous refactors, now we're safe to remove them.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new ability is added by:

- Allow btrfs_map_block() to return the chunk type
  This makes later work much easier

- Only reset stripe offset inside btrfs_map_block() when needed
  Currently if @raid_map is not NULL, btrfs_map_block() will consider
  this call is for WRITE and will reset stripe offset.

  This is no longer the case, as for RAID56 read with mirror_num 1/0,
  we will still call btrfs_map_block() with non-NULL raid_map.

  Add a small check to make sure we won't reset stripe offset for
  mirror 1/0 read.

- Add new helper read_raid56() to handle rebuild
  We will read the full stripe (including all data and P/Q stripes)
  do the rebuild, then only copy the refered part to the caller.

  There is a catch for RAID6, we have no way to exhaust all combination,
  so the current repair will assume the mirror = 0 data is corrupted,
  then try to find a missing device.

  But if no missing device can be found, it will assume P is corrupted.
  This is just a guess, and can to totally wrong, but we have no better
  idea.

Now btrfs-progs have full read ability for RAID56.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously 'btrfs check --check-data-csum' will report tons of false
alerts for RAID56.

Add a test case to make sure with the new RAID56 rebuild ability, there
should be no false alerts.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/011 with subpage case, even with RAID56 support, it
still fails with the following error:

 QA output created by 011
 *** test btrfs replace
 mkfs failed
 (see /home/adam/xfstests-dev/results//btrfs/011.full for details)

The full log shows:

  ---------workout "-m single -d single -M" 1 no 64-----------
  ERROR: illegal nodesize 65536 (not equal to 4096 for mixed block group)
  mkfs failed

This is a critical error, making test case to be aborted, without
checking the rest profiles.

[CAUSE]
Mkfs.btrfs always uses the maximum value between sectorsize and page
size for its mixed profile nodesize.

For subpage case, it means we always go PAGE_SIZE, no matter whatever
the sectorsize is passed in.

[FIX]
Just get rid of the direct PAGE_SIZE usage when determining nodesize for
mixed profiles.
And use sectorsize directly (either passed in by the user, or
determined from page size).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, if user specifies value 'no' or 'none' on the command line,
it gets translated to an empty value that is passed to kernel. There was
a change in kernel 5.14 done by commit 5548c8c6f55b ("btrfs: props:
change how empty value is interpreted") that changes the behaviour
in that case.

The empty value is supposed to mean 'the default value' for any
property. For compression there is a need to distinguish resetting the
value and also setting the NOCOMPRESS property. The translation to empty
value makes that impossible.

The explanation and behaviour copied from the kernel patch:

    Old behaviour:

      $ lsattr file
      ---------------------- file
      # the NOCOMPRESS bit is set
      $ btrfs prop set file compression ''
      $ lsattr file
      ---------------------m file

    This is equivalent to 'btrfs prop set file compression no' in current
    btrfs-progs as the 'no' or 'none' values are translated to an empty
    string.

    This is where the new behaviour is different: empty string drops the
    compression flag (-c) and nocompress (-m):

      $ lsattr file
      ---------------------- file
      # No change
      $ btrfs prop set file compression ''
      $ lsattr file
      ---------------------- file
      $ btrfs prop set file compression lzo
      $ lsattr file
      --------c------------- file
      $ btrfs prop get file compression
      compression=lzo
      $ btrfs prop set file compression ''
      # Reset to the initial state
      $ lsattr file
      ---------------------- file
      # Set NOCOMPRESS bit
      $ btrfs prop set file compression no
      $ lsattr file
      ---------------------m file

    This obviously brings problems with backward compatibility, so this
    patch should not be backported without making sure the updated
    btrfs-progs are also used and that scripts have been updated to use the
    new semantics.

    Summary:

    - old kernel:
      no, none, "" - set NOCOMPRESS bit
    - new kernel:
      no, none - set NOCOMPRESS bit
      "" - drop all compression flags, ie. COMPRESS and NOCOMPRESS

Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Don't want to erase it entirely from the doc, since it is the ideal way to use it ultimately. However, the bug is significant, so I think it's our responsibility to encourage users that can use umount/mount (e.g. in an initrd, not for root fs) to use it.
@kdave kdave force-pushed the devel branch 2 times, most recently from e3c0fe2 to 588429c Compare August 16, 2022 13:23
@kdave kdave force-pushed the devel branch 8 times, most recently from 54326e2 to ad2a1bf Compare September 21, 2022 13:42
@kdave kdave force-pushed the devel branch 3 times, most recently from 1e791f0 to 8f51194 Compare September 23, 2022 11:18
@kdave kdave force-pushed the devel branch 2 times, most recently from f230ae6 to d2fa26b Compare October 4, 2022 21:50
@kdave kdave force-pushed the devel branch 5 times, most recently from 7bfedf8 to cda1f14 Compare February 22, 2023 02:06
@kdave kdave force-pushed the devel branch 4 times, most recently from 63a7417 to 4fc291a Compare March 1, 2023 14:20
@kdave kdave force-pushed the devel branch 4 times, most recently from e53f194 to 405f4c7 Compare April 26, 2023 22:51
@kdave kdave force-pushed the devel branch 3 times, most recently from af681f4 to 1628b2f Compare May 10, 2023 19:47
@kdave kdave force-pushed the devel branch 2 times, most recently from a8c6f08 to 34c5de3 Compare May 22, 2023 12:32
@kdave kdave force-pushed the devel branch 5 times, most recently from 5e02de0 to adfc8a9 Compare May 26, 2023 17:53
kdave pushed a commit that referenced this pull request Jun 9, 2023
The remount workflow could cause some problems so make a note about it.
Recommend the umount/mount step by default. A seeding device used for
e.g. a root filesystem that gets updated and has snapshots is a real
world example where the space consumed by unreclaimed deleted snapshots
would hurt.

Pull-request: #462
Author: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
@kdave kdave added this to the v6.3.2 milestone Jun 9, 2023
@kdave
Copy link
Owner

kdave commented Jun 9, 2023

Merged to devel with some minor edits. The description of the workflows can be still improved, separating clearly the use case without snapshots that the remount is fine.

@kdave kdave closed this Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants