misc-next

Commits on Jul 22, 2020

  1. btrfs: open-code remount flag setting in btrfs_remount

    When we're (re)mounting a btrfs filesystem we set the
    BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
    reclaim or defrags.
    
    This flag is set in btrfs_remount_prepare() called by btrfs_remount().
    As btrfs_remount_prepare() does nothing but set this flag and
    doesn't have a second caller, we can just open-code the flag setting in
    btrfs_remount().
    
    Do the same for clearing the flag: move it out of
    btrfs_remount_cleanup() into btrfs_remount() so the two are symmetrical.
    
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Johannes Thumshirn authored and kdave committed Jul 22, 2020
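
The open-coding described in the commit above can be sketched in userspace C. This is a minimal model, not the kernel code: `fs_info_model`, `remount_model()` and the bit helpers are hypothetical names, and C11 atomics stand in for the kernel's bit operations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the REMOUNTING bit in fs_info->fs_state. */
enum { FS_STATE_REMOUNTING = 1 };

struct fs_info_model {
	atomic_ulong fs_state;
};

void set_state_bit(struct fs_info_model *fs, int nr)
{
	atomic_fetch_or(&fs->fs_state, 1UL << nr);
}

void clear_state_bit(struct fs_info_model *fs, int nr)
{
	atomic_fetch_and(&fs->fs_state, ~(1UL << nr));
}

int test_state_bit(struct fs_info_model *fs, int nr)
{
	return (int)((atomic_load(&fs->fs_state) >> nr) & 1UL);
}

/*
 * Model of a remount path with the flag handling open-coded: the bit is
 * set on entry and cleared on every exit path of the same function.
 * Returns 0 on success, -1 when `fail` is non-zero.
 */
int remount_model(struct fs_info_model *fs, int fail)
{
	int ret = 0;

	/* Formerly hidden in a one-line "prepare" helper. */
	set_state_bit(fs, FS_STATE_REMOUNTING);

	/* Other code can observe the bit while the remount is in progress. */
	assert(test_state_bit(fs, FS_STATE_REMOUNTING));

	if (fail) {
		ret = -1;
		goto out;
	}
	/* ... the actual remount work would go here ... */
out:
	/* Formerly part of the "cleanup" helper. */
	clear_state_bit(fs, FS_STATE_REMOUNTING);
	return ret;
}
```

With the set and clear visible in one function, a reader can check the pairing without following two helpers.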
  2. btrfs: if we're restriping, use the target restripe profile

    Previously we depended on some weird behavior in our chunk allocator to
    force the allocation of new stripes, so by the time we got to doing the
    reduce we would usually already have a chunk with the proper target.
    
    However that behavior causes other problems and needs to be removed.
    First however we need to remove this check to only restripe if we
    already have those available profiles, because if we're allocating our
    first chunk it obviously will not be available.  Simply use the target
    as specified, and if that fails it'll be because we're out of space.
    
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 22, 2020
  3. btrfs: don't adjust bg flags and use default allocation profiles

    btrfs/061 has been failing consistently for me recently with a
    transaction abort.  We run out of space in the system chunk array, which
    means we've allocated far more system chunks than we need.
    
    Chris added this a long time ago for balance as a poor man's restriping.
    If you had a single disk and then added another disk and then did a
    balance, update_block_group_flags would then figure out which RAID level
    you needed.
    
    Fast forward to today and we have restriping behavior, so we can
    explicitly tell the fs that we're trying to change the raid level.  This
    is accomplished through the normal get_alloc_profile path.
    
    Furthermore this code actually causes btrfs/061 to fail, because we do
    things like mkfs -m dup -d single with multiple devices.  This trips
    this check:
    
    alloc_flags = update_block_group_flags(fs_info, cache->flags);
    if (alloc_flags != cache->flags) {
    	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
    
    in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
    not actually restriping, we keep forcing chunk allocation of RAID1
    chunks.  This eventually causes us to run out of system space and the
    file system aborts and flips read only.
    
    We don't need this poor man's restriping any more; simply use the normal
    get_alloc_profile helper, which will get the correct alloc_flags and
    thus make the right decision for chunk allocation.  This keeps us from
    allocating a billion system chunks and falling over.
    
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 22, 2020
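
The failure mode in the commit above can be illustrated with a toy model. Everything here is an assumption made for illustration: the profile values, `guess_flags()` (a simplified stand-in for the old flag adjustment), and `profile_flags()` (a stand-in for a profile lookup that only differs from the current flags while a restripe with an explicit target is running). The real helpers consider more state than this.

```c
#include <assert.h>

/* Hypothetical profile values for this model; not the kernel's constants. */
enum model_profile { MODEL_SINGLE, MODEL_DUP, MODEL_RAID1 };

/* Old-style guess: with more than one device, DUP is "upgraded" to RAID1. */
enum model_profile guess_flags(enum model_profile flags, int num_devices)
{
	if (num_devices > 1 && flags == MODEL_DUP)
		return MODEL_RAID1;
	return flags;
}

/* Simplified lookup: the target applies only while a restripe is running. */
enum model_profile profile_flags(enum model_profile flags, int restriping,
				 enum model_profile target)
{
	return restriping ? target : flags;
}

/* Forced allocations over `rounds` read-only transitions, old behaviour. */
int forced_allocs_guess(int rounds, enum model_profile flags, int num_devices)
{
	int forced = 0;

	assert(rounds >= 0);
	for (int i = 0; i < rounds; i++)
		if (guess_flags(flags, num_devices) != flags)
			forced++;	/* a forced chunk allocation */
	return forced;
}

/* Same loop, but driven by the profile lookup instead of the guess. */
int forced_allocs_profile(int rounds, enum model_profile flags, int restriping,
			  enum model_profile target)
{
	int forced = 0;

	assert(rounds >= 0);
	for (int i = 0; i < rounds; i++)
		if (profile_flags(flags, restriping, target) != flags)
			forced++;
	return forced;
}
```

In this model, a DUP block group on two devices triggers a forced allocation on every round under the guess, and none under the profile lookup unless a restripe is actually in progress.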
  4. btrfs: fix lockdep splat from btrfs_dump_space_info

    When running with -o enospc_debug you can get the following splat if one
    of the dump_space_info calls trips:
    
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc5+ #20 Tainted: G           OE
      ------------------------------------------------------
      dd/563090 is trying to acquire lock:
      ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
    
      but task is already holding lock:
      ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
    
      which lock already depends on the new lock.
    
      the existing dependency chain (in reverse order) is:
    
      -> #3 (&cache->lock){+.+.}-{2:2}:
    	 _raw_spin_lock+0x25/0x30
    	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
    	 find_free_extent+0x7ef/0x13b0 [btrfs]
    	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
    	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
    	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
    	 __btrfs_cow_block+0x122/0x530 [btrfs]
    	 btrfs_cow_block+0x106/0x210 [btrfs]
    	 commit_cowonly_roots+0x55/0x300 [btrfs]
    	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
    	 sync_filesystem+0x74/0x90
    	 generic_shutdown_super+0x22/0x100
    	 kill_anon_super+0x14/0x30
    	 btrfs_kill_super+0x12/0x20 [btrfs]
    	 deactivate_locked_super+0x36/0x70
    	 cleanup_mnt+0x104/0x160
    	 task_work_run+0x5f/0x90
    	 __prepare_exit_to_usermode+0x1bd/0x1c0
    	 do_syscall_64+0x5e/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      -> #2 (&space_info->lock){+.+.}-{2:2}:
    	 _raw_spin_lock+0x25/0x30
    	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
    	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
    	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
    	 clear_state_bit+0x81/0x1a0 [btrfs]
    	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
    	 clear_extent_bit+0x15/0x20 [btrfs]
    	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
    	 truncate_cleanup_page+0x47/0xe0
    	 truncate_inode_pages_range+0x238/0x840
    	 truncate_pagecache+0x44/0x60
    	 btrfs_setattr+0x202/0x5e0 [btrfs]
    	 notify_change+0x33b/0x490
    	 do_truncate+0x76/0xd0
    	 path_openat+0x687/0xa10
    	 do_filp_open+0x91/0x100
    	 do_sys_openat2+0x215/0x2d0
    	 do_sys_open+0x44/0x80
    	 do_syscall_64+0x52/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      -> #1 (&tree->lock#2){+.+.}-{2:2}:
    	 _raw_spin_lock+0x25/0x30
    	 find_first_extent_bit+0x32/0x150 [btrfs]
    	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
    	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
    	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
    	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
    	 commit_cowonly_roots+0x245/0x300 [btrfs]
    	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
    	 close_ctree+0xf9/0x2f5 [btrfs]
    	 generic_shutdown_super+0x6c/0x100
    	 kill_anon_super+0x14/0x30
    	 btrfs_kill_super+0x12/0x20 [btrfs]
    	 deactivate_locked_super+0x36/0x70
    	 cleanup_mnt+0x104/0x160
    	 task_work_run+0x5f/0x90
    	 __prepare_exit_to_usermode+0x1bd/0x1c0
    	 do_syscall_64+0x5e/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
    	 __lock_acquire+0x1240/0x2460
    	 lock_acquire+0xab/0x360
    	 _raw_spin_lock+0x25/0x30
    	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
    	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
    	 btrfs_reserve_extent+0x176/0x180 [btrfs]
    	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
    	 cache_save_setup+0x28d/0x3b0 [btrfs]
    	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
    	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
    	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
    	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
    	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
    	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
    	 new_sync_write+0x11e/0x1b0
    	 vfs_write+0x1c9/0x200
    	 ksys_write+0x68/0xe0
    	 do_syscall_64+0x52/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      other info that might help us debug this:
    
      Chain exists of:
        &ctl->tree_lock --> &space_info->lock --> &cache->lock
    
       Possible unsafe locking scenario:
    
    	 CPU0                    CPU1
    	 ----                    ----
        lock(&cache->lock);
    				 lock(&space_info->lock);
    				 lock(&cache->lock);
        lock(&ctl->tree_lock);
    
       *** DEADLOCK ***
    
      6 locks held by dd/563090:
       #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
       #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
       #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
       #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
       #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
       #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
    
      stack backtrace:
      CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
      Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
      Call Trace:
       dump_stack+0x96/0xd0
       check_noncircular+0x162/0x180
       __lock_acquire+0x1240/0x2460
       ? wake_up_klogd.part.0+0x30/0x40
       lock_acquire+0xab/0x360
       ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
       _raw_spin_lock+0x25/0x30
       ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
       btrfs_dump_free_space+0x2b/0xa0 [btrfs]
       btrfs_dump_space_info+0xf4/0x120 [btrfs]
       btrfs_reserve_extent+0x176/0x180 [btrfs]
       __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
       ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
       cache_save_setup+0x28d/0x3b0 [btrfs]
       btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
       btrfs_commit_transaction+0xcc/0xac0 [btrfs]
       ? start_transaction+0xe0/0x5b0 [btrfs]
       btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
       btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
       btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
       ? ktime_get_coarse_real_ts64+0xa8/0xd0
       ? trace_hardirqs_on+0x1c/0xe0
       btrfs_file_write_iter+0x3cf/0x610 [btrfs]
       new_sync_write+0x11e/0x1b0
       vfs_write+0x1c9/0x200
       ksys_write+0x68/0xe0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    This is because we're holding the block_group->lock while trying to dump
    the free space cache.  However, we don't need the lock for the dump; we
    only need it to read the values for the printk, so move the free space
    cache dumping outside of the block group lock.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 22, 2020
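
The shape of the fix in the commit above, sketched in userspace C with pthread mutexes. The struct, the function names and the `cache_lock_held` bookkeeping field are hypothetical and exist only to make the ordering observable; the kernel code uses spinlocks and its own data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical model: cache_lock stands in for the block group lock and
 * tree_lock for the free space ctl->tree_lock. */
struct block_group_model {
	pthread_mutex_t cache_lock;
	pthread_mutex_t tree_lock;
	int cache_lock_held;		/* bookkeeping for the demonstration only */
	int dump_ran_under_cache_lock;
	unsigned long long used;
	unsigned long long reserved;
	int nr_free_entries;
};

void block_group_init(struct block_group_model *bg, unsigned long long used,
		      unsigned long long reserved, int nr_free_entries)
{
	pthread_mutex_init(&bg->cache_lock, NULL);
	pthread_mutex_init(&bg->tree_lock, NULL);
	bg->cache_lock_held = 0;
	bg->dump_ran_under_cache_lock = 0;
	bg->used = used;
	bg->reserved = reserved;
	bg->nr_free_entries = nr_free_entries;
}

/* Takes tree_lock itself, as a free space dump would. */
int dump_free_space(struct block_group_model *bg)
{
	int n;

	/* Record whether the caller still held cache_lock at this point. */
	bg->dump_ran_under_cache_lock = bg->cache_lock_held;

	pthread_mutex_lock(&bg->tree_lock);
	n = bg->nr_free_entries;
	pthread_mutex_unlock(&bg->tree_lock);
	return n;
}

int dump_block_group(struct block_group_model *bg, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	/* cache_lock only covers reading the values that get printed. */
	pthread_mutex_lock(&bg->cache_lock);
	bg->cache_lock_held = 1;
	snprintf(buf, len, "used=%llu reserved=%llu", bg->used, bg->reserved);
	bg->cache_lock_held = 0;
	pthread_mutex_unlock(&bg->cache_lock);

	/* The nested tree_lock is taken only after cache_lock is dropped. */
	return dump_free_space(bg);
}
```

Because `tree_lock` is never acquired while `cache_lock` is held, this path no longer adds a `cache_lock` to `tree_lock` dependency.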
  5. btrfs: move the chunk_mutex in btrfs_read_chunk_tree

    We are currently getting this lockdep splat in btrfs/161:
    
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc5+ #20 Tainted: G            E
      ------------------------------------------------------
      mount/678048 is trying to acquire lock:
      ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
    
      but task is already holding lock:
      ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
    
      which lock already depends on the new lock.
    
      the existing dependency chain (in reverse order) is:
    
      -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    	 __mutex_lock+0x8b/0x8f0
    	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
    	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
    	 ksys_ioctl+0x87/0xc0
    	 __x64_sys_ioctl+0x16/0x20
    	 do_syscall_64+0x52/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    	 __lock_acquire+0x1240/0x2460
    	 lock_acquire+0xab/0x360
    	 __mutex_lock+0x8b/0x8f0
    	 clone_fs_devices+0x4d/0x170 [btrfs]
    	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
    	 open_ctree+0xb7c/0x18ce [btrfs]
    	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
    	 legacy_get_tree+0x30/0x50
    	 vfs_get_tree+0x28/0xc0
    	 fc_mount+0xe/0x40
    	 vfs_kern_mount.part.0+0x71/0x90
    	 btrfs_mount+0x13b/0x3e0 [btrfs]
    	 legacy_get_tree+0x30/0x50
    	 vfs_get_tree+0x28/0xc0
    	 do_mount+0x7de/0xb30
    	 __x64_sys_mount+0x8e/0xd0
    	 do_syscall_64+0x52/0xb0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      other info that might help us debug this:
    
       Possible unsafe locking scenario:
    
    	 CPU0                    CPU1
    	 ----                    ----
        lock(&fs_info->chunk_mutex);
    				 lock(&fs_devs->device_list_mutex);
    				 lock(&fs_info->chunk_mutex);
        lock(&fs_devs->device_list_mutex);
    
       *** DEADLOCK ***
    
      3 locks held by mount/678048:
       #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
       #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
       #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
    
      stack backtrace:
      CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
      Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
      Call Trace:
       dump_stack+0x96/0xd0
       check_noncircular+0x162/0x180
       __lock_acquire+0x1240/0x2460
       ? asm_sysvec_apic_timer_interrupt+0x12/0x20
       lock_acquire+0xab/0x360
       ? clone_fs_devices+0x4d/0x170 [btrfs]
       __mutex_lock+0x8b/0x8f0
       ? clone_fs_devices+0x4d/0x170 [btrfs]
       ? rcu_read_lock_sched_held+0x52/0x60
       ? cpumask_next+0x16/0x20
       ? module_assert_mutex_or_preempt+0x14/0x40
       ? __module_address+0x28/0xf0
       ? clone_fs_devices+0x4d/0x170 [btrfs]
       ? static_obj+0x4f/0x60
       ? lockdep_init_map_waits+0x43/0x200
       ? clone_fs_devices+0x4d/0x170 [btrfs]
       clone_fs_devices+0x4d/0x170 [btrfs]
       btrfs_read_chunk_tree+0x330/0x800 [btrfs]
       open_ctree+0xb7c/0x18ce [btrfs]
       ? super_setup_bdi_name+0x79/0xd0
       btrfs_mount_root.cold+0x13/0xfa [btrfs]
       ? vfs_parse_fs_string+0x84/0xb0
       ? rcu_read_lock_sched_held+0x52/0x60
       ? kfree+0x2b5/0x310
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       fc_mount+0xe/0x40
       vfs_kern_mount.part.0+0x71/0x90
       btrfs_mount+0x13b/0x3e0 [btrfs]
       ? cred_has_capability+0x7c/0x120
       ? rcu_read_lock_sched_held+0x52/0x60
       ? legacy_get_tree+0x30/0x50
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       do_mount+0x7de/0xb30
       ? memdup_user+0x4e/0x90
       __x64_sys_mount+0x8e/0xd0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
    then read the device, which takes the device_list_mutex.  The
    device_list_mutex needs to be taken before the chunk_mutex, so this is a
    problem.  We only really need the chunk mutex around adding the chunk,
    so take the mutex only around read_one_chunk().
    
    An argument could be made that we don't even need the chunk_mutex here
    as it's during mount, and we are protected by various other locks.
    However we already have special rules for ->device_list_mutex, and I'd
    rather not have another special case for ->chunk_mutex.
    
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 22, 2020
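
The ordering rule in the commit above (device_list_mutex before chunk_mutex) can be modelled with a small rank checker. This is a single-threaded sketch under assumed names: the ranks, `order_checker`, `walk_wide()` and `walk_narrow()` are illustrative and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ranks: a lock may only be taken while every held lock has
 * a lower rank, so the device list lock must come before the chunk lock. */
enum lock_rank { RANK_DEVICE_LIST = 1, RANK_CHUNK = 2 };

enum item_type { ITEM_DEV, ITEM_CHUNK };

struct order_checker {
	int held[4];		/* ranks of currently held locks, as a stack */
	int depth;
	int violations;		/* acquisitions that broke the rank rule */
};

void model_lock(struct order_checker *oc, enum lock_rank rank)
{
	for (int i = 0; i < oc->depth; i++) {
		if (oc->held[i] >= (int)rank) {
			oc->violations++;
			break;
		}
	}
	assert(oc->depth < 4);
	oc->held[oc->depth++] = (int)rank;
}

void model_unlock(struct order_checker *oc, enum lock_rank rank)
{
	assert(oc->depth > 0 && oc->held[oc->depth - 1] == (int)rank);
	oc->depth--;
}

/* Reading a device item needs the device list lock. */
static void read_dev_item(struct order_checker *oc)
{
	model_lock(oc, RANK_DEVICE_LIST);
	model_unlock(oc, RANK_DEVICE_LIST);
}

/* Old shape: the chunk lock is held across the whole tree walk. */
int walk_wide(const enum item_type *items, size_t n)
{
	struct order_checker oc = { .depth = 0, .violations = 0 };

	model_lock(&oc, RANK_CHUNK);
	for (size_t i = 0; i < n; i++)
		if (items[i] == ITEM_DEV)
			read_dev_item(&oc);
	model_unlock(&oc, RANK_CHUNK);
	return oc.violations;
}

/* New shape: the chunk lock is held only while a chunk is being added. */
int walk_narrow(const enum item_type *items, size_t n)
{
	struct order_checker oc = { .depth = 0, .violations = 0 };

	for (size_t i = 0; i < n; i++) {
		if (items[i] == ITEM_DEV) {
			read_dev_item(&oc);
		} else {
			model_lock(&oc, RANK_CHUNK);
			/* ... add the chunk mapping here ... */
			model_unlock(&oc, RANK_CHUNK);
		}
	}
	return oc.violations;
}
```

In the wide variant every device item is read with the higher-ranked lock already held, which is the inversion lockdep reports; the narrow variant never nests the two locks.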
  6. btrfs: open device without device_list_mutex

    A lockdep splat has long existed because we open our bdevs under
    the ->device_list_mutex at mount time, which acquires the bd_mutex.
    Usually this goes unnoticed, but if you do loopback devices at all
    suddenly the bd_mutex comes with a whole host of other dependencies,
    which results in the splat when you mount a btrfs file system.
    
    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
    ------------------------------------------------------
    systemd-journal/509 is trying to acquire lock:
    ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    
    but task is already holding lock:
    ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
     -> #6 (sb_pagefaults){.+.+}-{0:0}:
           __sb_start_write+0x13e/0x220
           btrfs_page_mkwrite+0x59/0x560 [btrfs]
           do_page_mkwrite+0x4f/0x130
           do_wp_page+0x3b0/0x4f0
           handle_mm_fault+0xf47/0x1850
           do_user_addr_fault+0x1fc/0x4b0
           exc_page_fault+0x88/0x300
           asm_exc_page_fault+0x1e/0x30
    
     -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
           __might_fault+0x60/0x80
           _copy_from_user+0x20/0xb0
           get_sg_io_hdr+0x9a/0xb0
           scsi_cmd_ioctl+0x1ea/0x2f0
           cdrom_ioctl+0x3c/0x12b4
           sr_block_ioctl+0xa4/0xd0
           block_ioctl+0x3f/0x50
           ksys_ioctl+0x82/0xc0
           __x64_sys_ioctl+0x16/0x20
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #4 (&cd->lock){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           sr_block_open+0xa2/0x180
           __blkdev_get+0xdd/0x550
           blkdev_get+0x38/0x150
           do_dentry_open+0x16b/0x3e0
           path_openat+0x3c9/0xa00
           do_filp_open+0x75/0x100
           do_sys_openat2+0x8a/0x140
           __x64_sys_openat+0x46/0x70
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           __blkdev_get+0x6a/0x550
           blkdev_get+0x85/0x150
           blkdev_get_by_path+0x2c/0x70
           btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
           open_fs_devices+0x88/0x240 [btrfs]
           btrfs_open_devices+0x92/0xa0 [btrfs]
           btrfs_mount_root+0x250/0x490 [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           vfs_kern_mount.part.0+0x71/0xb0
           btrfs_mount+0x119/0x380 [btrfs]
           legacy_get_tree+0x30/0x50
           vfs_get_tree+0x28/0xc0
           do_mount+0x8c6/0xca0
           __x64_sys_mount+0x8e/0xd0
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           btrfs_run_dev_stats+0x36/0x420 [btrfs]
           commit_cowonly_roots+0x91/0x2d0 [btrfs]
           btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
           btrfs_sync_file+0x38a/0x480 [btrfs]
           __x64_sys_fdatasync+0x47/0x80
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
           __mutex_lock+0x7b/0x820
           btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
           btrfs_sync_file+0x38a/0x480 [btrfs]
           __x64_sys_fdatasync+0x47/0x80
           do_syscall_64+0x52/0xb0
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
     -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
           __lock_acquire+0x1241/0x20c0
           lock_acquire+0xb0/0x400
           __mutex_lock+0x7b/0x820
           btrfs_record_root_in_trans+0x44/0x70 [btrfs]
           start_transaction+0xd2/0x500 [btrfs]
           btrfs_dirty_inode+0x44/0xd0 [btrfs]
           file_update_time+0xc6/0x120
           btrfs_page_mkwrite+0xda/0x560 [btrfs]
           do_page_mkwrite+0x4f/0x130
           do_wp_page+0x3b0/0x4f0
           handle_mm_fault+0xf47/0x1850
           do_user_addr_fault+0x1fc/0x4b0
           exc_page_fault+0x88/0x300
           asm_exc_page_fault+0x1e/0x30
    
    other info that might help us debug this:
    
    Chain exists of:
      &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
    
    Possible unsafe locking scenario:
    
         CPU0                    CPU1
         ----                    ----
     lock(sb_pagefaults);
                                 lock(&mm->mmap_lock#2);
                                 lock(sb_pagefaults);
     lock(&fs_info->reloc_mutex);
    
     *** DEADLOCK ***
    
    3 locks held by systemd-journal/509:
     #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
     #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
     #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
    
    stack backtrace:
    CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    Call Trace:
     dump_stack+0x92/0xc8
     check_noncircular+0x134/0x150
     __lock_acquire+0x1241/0x20c0
     lock_acquire+0xb0/0x400
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     ? lock_acquire+0xb0/0x400
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     __mutex_lock+0x7b/0x820
     ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     ? kvm_sched_clock_read+0x14/0x30
     ? sched_clock+0x5/0x10
     ? sched_clock_cpu+0xc/0xb0
     btrfs_record_root_in_trans+0x44/0x70 [btrfs]
     start_transaction+0xd2/0x500 [btrfs]
     btrfs_dirty_inode+0x44/0xd0 [btrfs]
     file_update_time+0xc6/0x120
     btrfs_page_mkwrite+0xda/0x560 [btrfs]
     ? sched_clock+0x5/0x10
     do_page_mkwrite+0x4f/0x130
     do_wp_page+0x3b0/0x4f0
     handle_mm_fault+0xf47/0x1850
     do_user_addr_fault+0x1fc/0x4b0
     exc_page_fault+0x88/0x300
     ? asm_exc_page_fault+0x8/0x30
     asm_exc_page_fault+0x1e/0x30
    RIP: 0033:0x7fa3972fdbfe
    Code: Bad RIP value.
    
    Fix this by not holding the ->device_list_mutex at this point.  The
    device_list_mutex exists to protect us from modifying the device list
    while the file system is running.
    
    However it can also be modified by doing a scan on a device.  But this
    action is specifically protected by the uuid_mutex, which we are holding
    here.  We cannot race with opening at this point because we have the
    ->s_umount lock held during the mount.  Not having the
    ->device_list_mutex here is perfectly safe as we're not going to change
    the devices at this point.
    
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ add some comments ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 22, 2020
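
The seven-lock chain in the splat above is easier to follow as a graph. The sketch below is a toy model of the idea behind the report, not lockdep's actual algorithm: an edge from A to B means "B was acquired while A was held", and a cycle means a possible deadlock. The enum names are shorthand for the locks named in the splat; the graph helpers are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCKS 8

/* edge[a][b] != 0 means "b was acquired while a was held". */
struct lock_graph {
	unsigned char edge[MAX_LOCKS][MAX_LOCKS];
};

void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

void graph_set_edge(struct lock_graph *g, int held, int acquired, int present)
{
	assert(held >= 0 && held < MAX_LOCKS);
	assert(acquired >= 0 && acquired < MAX_LOCKS);
	g->edge[held][acquired] = present ? 1 : 0;
}

/* Depth-first search: 0 = unvisited, 1 = on the current path, 2 = done. */
static int dfs(const struct lock_graph *g, int node, unsigned char *state)
{
	state[node] = 1;
	for (int next = 0; next < MAX_LOCKS; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return 1;	/* back edge: circular dependency */
		if (state[next] == 0 && dfs(g, next, state))
			return 1;
	}
	state[node] = 2;
	return 0;
}

int graph_has_cycle(const struct lock_graph *g)
{
	unsigned char state[MAX_LOCKS] = { 0 };

	for (int n = 0; n < MAX_LOCKS; n++)
		if (state[n] == 0 && dfs(g, n, state))
			return 1;
	return 0;
}

enum { RELOC, TREE_LOG, DEVICE_LIST, BD_MUTEX, CD_LOCK, MMAP_LOCK, SB_PAGEFAULTS };

/* Builds the dependency chain reported in the splat, entries #0 to #6. */
void build_splat_graph(struct lock_graph *g)
{
	graph_init(g);
	graph_set_edge(g, RELOC, TREE_LOG, 1);
	graph_set_edge(g, TREE_LOG, DEVICE_LIST, 1);
	graph_set_edge(g, DEVICE_LIST, BD_MUTEX, 1);	/* open under device_list_mutex */
	graph_set_edge(g, BD_MUTEX, CD_LOCK, 1);
	graph_set_edge(g, CD_LOCK, MMAP_LOCK, 1);
	graph_set_edge(g, MMAP_LOCK, SB_PAGEFAULTS, 1);
	graph_set_edge(g, SB_PAGEFAULTS, RELOC, 1);	/* the new acquisition */
}
```

Clearing the `DEVICE_LIST` to `BD_MUTEX` edge with `graph_set_edge(&g, DEVICE_LIST, BD_MUTEX, 0)` models opening the devices without the device_list_mutex; in this model the cycle then disappears.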

Commits on Jul 21, 2020

  1. btrfs: sysfs: use NOFS for device creation

    Dave hit this splat while testing btrfs/078:
    
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc6-default+ #1191 Not tainted
      ------------------------------------------------------
      kswapd0/75 is trying to acquire lock:
      ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
    
      but task is already holding lock:
      ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    
      which lock already depends on the new lock.
    
      the existing dependency chain (in reverse order) is:
    
      -> #2 (fs_reclaim){+.+.}-{0:0}:
    	 __lock_acquire+0x56f/0xaa0
    	 lock_acquire+0xa3/0x440
    	 fs_reclaim_acquire.part.0+0x25/0x30
    	 __kmalloc_track_caller+0x49/0x330
    	 kstrdup+0x2e/0x60
    	 __kernfs_new_node.constprop.0+0x44/0x250
    	 kernfs_new_node+0x25/0x50
    	 kernfs_create_link+0x34/0xa0
    	 sysfs_do_create_link_sd+0x5e/0xd0
    	 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs]
    	 btrfs_init_new_device+0x44c/0x12b0 [btrfs]
    	 btrfs_ioctl+0xc3c/0x25c0 [btrfs]
    	 ksys_ioctl+0x68/0xa0
    	 __x64_sys_ioctl+0x16/0x20
    	 do_syscall_64+0x50/0xe0
    	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
      -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    	 __lock_acquire+0x56f/0xaa0
    	 lock_acquire+0xa3/0x440
    	 __mutex_lock+0xa0/0xaf0
    	 btrfs_chunk_alloc+0x137/0x3e0 [btrfs]
    	 find_free_extent+0xb44/0xfb0 [btrfs]
    	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
    	 btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
    	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
    	 __btrfs_cow_block+0x143/0x7a0 [btrfs]
    	 btrfs_cow_block+0x15f/0x310 [btrfs]
    	 push_leaf_right+0x150/0x240 [btrfs]
    	 split_leaf+0x3cd/0x6d0 [btrfs]
    	 btrfs_search_slot+0xd14/0xf70 [btrfs]
    	 btrfs_insert_empty_items+0x64/0xc0 [btrfs]
    	 __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs]
    	 btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs]
    	 btrfs_work_helper+0x2f9/0x650 [btrfs]
    	 process_one_work+0x22c/0x600
    	 worker_thread+0x50/0x3b0
    	 kthread+0x137/0x150
    	 ret_from_fork+0x1f/0x30
    
      -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    	 check_prev_add+0x98/0xa20
    	 validate_chain+0xa8c/0x2a00
    	 __lock_acquire+0x56f/0xaa0
    	 lock_acquire+0xa3/0x440
    	 __mutex_lock+0xa0/0xaf0
    	 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
    	 btrfs_evict_inode+0x3bf/0x560 [btrfs]
    	 evict+0xd6/0x1c0
    	 dispose_list+0x48/0x70
    	 prune_icache_sb+0x54/0x80
    	 super_cache_scan+0x121/0x1a0
    	 do_shrink_slab+0x175/0x420
    	 shrink_slab+0xb1/0x2e0
    	 shrink_node+0x192/0x600
    	 balance_pgdat+0x31f/0x750
    	 kswapd+0x206/0x510
    	 kthread+0x137/0x150
    	 ret_from_fork+0x1f/0x30
    
      other info that might help us debug this:
    
      Chain exists of:
        &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
    
       Possible unsafe locking scenario:
    
    	 CPU0                    CPU1
    	 ----                    ----
        lock(fs_reclaim);
    				 lock(&fs_info->chunk_mutex);
    				 lock(fs_reclaim);
        lock(&delayed_node->mutex);
    
       *** DEADLOCK ***
    
      3 locks held by kswapd0/75:
       #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
       #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
       #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
    
      stack backtrace:
      CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x16f/0x190
       check_prev_add+0x98/0xa20
       validate_chain+0xa8c/0x2a00
       __lock_acquire+0x56f/0xaa0
       lock_acquire+0xa3/0x440
       ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
       __mutex_lock+0xa0/0xaf0
       ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
       ? __lock_acquire+0x56f/0xaa0
       ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
       ? lock_acquire+0xa3/0x440
       ? btrfs_evict_inode+0x138/0x560 [btrfs]
       ? btrfs_evict_inode+0x2fe/0x560 [btrfs]
       ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
       __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
       btrfs_evict_inode+0x3bf/0x560 [btrfs]
       evict+0xd6/0x1c0
       dispose_list+0x48/0x70
       prune_icache_sb+0x54/0x80
       super_cache_scan+0x121/0x1a0
       do_shrink_slab+0x175/0x420
       shrink_slab+0xb1/0x2e0
       shrink_node+0x192/0x600
       balance_pgdat+0x31f/0x750
       kswapd+0x206/0x510
       ? _raw_spin_unlock_irqrestore+0x3e/0x50
       ? finish_wait+0x90/0x90
       ? balance_pgdat+0x750/0x750
       kthread+0x137/0x150
       ? kthread_stop+0x2a0/0x2a0
       ret_from_fork+0x1f/0x30
    
    This is because we're holding the chunk_mutex while adding this device
    and adding its sysfs entries.  We actually hold different locks in
    different places when calling this function (the dev_replace semaphore
    in dev replace, for instance), so instead of moving this call around,
    simply wrap its operations in NOFS.
    
    CC: stable@vger.kernel.org # 4.14+
    Reported-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 21, 2020
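
The kernel provides a scoped API for this, memalloc_nofs_save() and memalloc_nofs_restore(). The sketch below is a userspace model of how such a scope behaves, not the kernel implementation: `model_nofs_save()`, `model_nofs_restore()`, the flag value and `add_devices_dir_model()` are hypothetical names, and a thread-local variable stands in for the per-task flags.

```c
#include <assert.h>

/* Hypothetical flag bit for the model; not the kernel's value. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

/* Stand-in for the per-task flags word. */
static _Thread_local unsigned int task_flags;

/* Enter a NOFS scope and return the previous state of the bit. */
unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOFS;

	task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope, restoring whatever state the matching save returned. */
void model_nofs_restore(unsigned int old)
{
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* An allocation may recurse into filesystem reclaim only outside a scope. */
int alloc_may_enter_fs_reclaim(void)
{
	return !(task_flags & MODEL_PF_MEMALLOC_NOFS);
}

/*
 * Stand-in for a helper that allocates while the caller holds filesystem
 * locks. Returns whether its allocation was allowed to enter FS reclaim.
 */
int add_devices_dir_model(void)
{
	unsigned int nofs_flag = model_nofs_save();
	int reclaim_allowed = alloc_may_enter_fs_reclaim();

	model_nofs_restore(nofs_flag);
	return reclaim_allowed;
}
```

Because the save call returns the previous state, scopes nest: an inner restore leaves an outer scope in force. Wrapping the callee this way covers every caller regardless of which locks each one holds.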
  2. btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases

    Eric reported seeing this message while running generic/475:
    
      BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
    
    Full stack trace:
    
      BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
      BTRFS info (device dm-0): forced readonly
      BTRFS warning (device dm-0): Skipping commit of aborted transaction.
      ------------[ cut here ]------------
      BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
      BTRFS: Transaction aborted (error -117)
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
      WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
      BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
      CPU: 3 PID: 23985 Comm: fsstress Tainted: G        W    L    5.8.0-rc4-default+ #1181
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
      RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
      RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
      RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
      RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
      R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
      FS:  00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
      Call Trace:
       ? finish_wait+0x90/0x90
       ? __mutex_unlock_slowpath+0x45/0x2a0
       ? lock_acquire+0xa3/0x440
       ? lockref_put_or_lock+0x9/0x30
       ? dput+0x20/0x4a0
       ? dput+0x20/0x4a0
       ? do_raw_spin_unlock+0x4b/0xc0
       ? _raw_spin_unlock+0x1f/0x30
       btrfs_sync_file+0x335/0x490 [btrfs]
       do_fsync+0x38/0x70
       __x64_sys_fsync+0x10/0x20
       do_syscall_64+0x50/0xe0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f6a6ef1b6e3
      Code: Bad RIP value.
      RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
      RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
      RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      irq event stamp: 0
      hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
      softirqs last  enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      ---[ end trace af146e0e38433456 ]---
      BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
    
    This ret came from btrfs_write_marked_extents().  If the transaction was
    already aborted because of an earlier EIO, we'll see that in
    btree_write_cache_pages() and return EUCLEAN, which gets printed as
    "Filesystem corrupted".
    
    Except we shouldn't be returning EUCLEAN here, we need to be returning
    EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
    
    We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
    elsewhere, but we want to use EROFS for this particular case.  The
    original transaction abort has the real error code for why we ended up
    with an aborted transaction, all subsequent actions just need to return
    EROFS because they may not have a trans handle and have no idea about
    the original cause of the abort.
    
    After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
    stacktrace will not be dumped either.
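    
    The policy described above can be modelled roughly as follows.  This is
    a toy Python sketch, not the kernel code: the FsInfo class and both
    function names are hypothetical, and the constants are the Linux errno
    numbers for EIO, EROFS and EUCLEAN.

```python
# Toy model of the error-code policy; names are illustrative, not kernel APIs.
EIO, EROFS, EUCLEAN = 5, 30, 117  # Linux errno values


class FsInfo:
    def __init__(self):
        self.error_state = False  # stands in for BTRFS_FS_STATE_ERROR
        self.abort_errno = 0      # the real cause, recorded by the first abort


def abort_transaction(fs, err):
    """The first abort records the real error; later ones do not overwrite it."""
    if not fs.error_state:
        fs.error_state = True
        fs.abort_errno = err


def write_cache_pages(fs):
    """A later writer only knows that the fs was flipped read-only."""
    if fs.error_state:
        return -EROFS  # not -EUCLEAN: nothing was found to be corrupted
    return 0


fs = FsInfo()
abort_transaction(fs, -EIO)      # the original failure
result = write_cache_pages(fs)   # a subsequent action on the same fs
```

    In this model the original abort keeps -EIO, and every later caller
    gets -EROFS.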
    
    Reported-by: Eric Sandeen <esandeen@redhat.com>
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ add full test stacktrace ]
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 21, 2020
  3. btrfs: document special case error codes for fs errors

    We've had some discussions about what to do in certain scenarios for
    error codes, specifically EUCLEAN and EROFS.  Document these near the
    error handling code so it's clear what their intentions are.
    
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 21, 2020
  4. btrfs: don't WARN if we abort a transaction with EROFS

    If we got some sort of corruption via a read and call
    btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
    complain.  If a subsequent trans handle trips over this it'll get EROFS
    and then abort.  However at that point we're not aborting for the
    original reason, we're aborting because we've been flipped read only.
    We do not need to WARN_ON() here.
    
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 21, 2020
  5. btrfs: fix mount failure caused by race with umount

    It is possible to cause a btrfs mount to fail by racing it with a slow
    umount. The crux of the sequence is generic_shutdown_super not yet
    calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
    If that occurs, btrfs_open_devices will decide the opened counter is
    non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
    0. From here, mount will call sget which will result in grab_super
    trying to take the super block umount semaphore. That semaphore will be
    held by the slow umount, so mount will block. Before up-ing the
    semaphore, umount will delete the super block, resulting in mount's sget
    reliably allocating a new one, which causes the mount path to dutifully
    fill it out, and increment total_rw_bytes a second time, which causes
    the mount to fail, as we see double the expected bytes.
    
    Here is the sequence laid out in greater detail:
    
    CPU0                                                    CPU1
    down_write sb->s_umount
    btrfs_kill_super
      kill_anon_super(sb)
        generic_shutdown_super(sb);
          shrink_dcache_for_umount(sb);
          sync_filesystem(sb);
          evict_inodes(sb); // SLOW
    
                                                  btrfs_mount_root
                                                    btrfs_scan_one_device
                                                    fs_devices = device->fs_devices
                                                    fs_info->fs_devices = fs_devices
                                                    // fs_devices->opened makes this a no-op
                                                    btrfs_open_devices(fs_devices, mode, fs_type)
                                                    s = sget(fs_type, test, set, flags, fs_info);
                                                      find sb in s_instances
                                                      grab_super(sb);
                                                        down_write(&s->s_umount); // blocks
    
          sop->put_super(sb)
            // sb->fs_devices->opened == 2; no-op
          spin_lock(&sb_lock);
          hlist_del_init(&sb->s_instances);
          spin_unlock(&sb_lock);
          up_write(&sb->s_umount);
                                                        return 0;
                                                      retry lookup
                                                      don't find sb in s_instances (deleted by CPU0)
                                                      s = alloc_super
                                                      return s;
                                                    btrfs_fill_super(s, fs_devices, data)
                                                      open_ctree // fs_devices total_rw_bytes improperly set!
                                                        btrfs_read_chunk_tree
                                                          read_one_dev // increment total_rw_bytes again!!
                                                          super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
    
    To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
    before the calls to read_one_dev, while holding the sb umount semaphore
    and the uuid mutex.
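    
    The double accounting and the effect of the fix can be illustrated with
    a toy model.  This is a Python sketch under simplifying assumptions: the
    FsDevices class and the helper functions are hypothetical stand-ins for
    the kernel structures and functions named above, and locking is left out
    entirely.

```python
# Toy model of the total_rw_bytes double accounting; all names are hypothetical.
class FsDevices:
    def __init__(self, device_bytes):
        self.device_bytes = device_bytes  # size of each writable device
        self.opened = 0
        self.total_rw_bytes = 0


def open_devices(fsd):
    if fsd.opened:              # already open: only bump the counter
        fsd.opened += 1
        return
    fsd.opened = 1
    fsd.total_rw_bytes = 0      # the reset only happens on the first open


def read_chunk_tree(fsd, super_total_bytes, fixed):
    if fixed:
        fsd.total_rw_bytes = 0  # the fix: clear before reading the devices
    for size in fsd.device_bytes:
        fsd.total_rw_bytes += size  # what read_one_dev does per device
    # False corresponds to the "super_total_bytes < total_rw_bytes" error.
    return super_total_bytes >= fsd.total_rw_bytes


def racy_mount(fixed):
    fsd = FsDevices([1 << 30])
    open_devices(fsd)                     # first mount
    read_chunk_tree(fsd, 1 << 30, fixed)
    open_devices(fsd)                     # second mount, umount not finished
    return read_chunk_tree(fsd, 1 << 30, fixed)


mount_ok_before = racy_mount(fixed=False)
mount_ok_after = racy_mount(fixed=True)
```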
    
    To reproduce, it is sufficient to dirty a decent number of inodes, then
    quickly umount and mount.
    
      for i in $(seq 0 500)
      do
        dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
      done
      umount /mnt/foo&
      mount /mnt/foo
    
    does the trick for me.
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Boris Burkov <boris@bur.io>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Boris Burkov authored and kdave committed Jul 21, 2020
  6. btrfs: fix page leaks after failure to lock page for delalloc

    When locking pages for delalloc, we check that each page is still dirty
    and that its mapping still matches. If not, we need to return -EAGAIN
    and release all pages. Only the current page was put, though, so iterate
    over all the remaining pages too.
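    
    A toy model of the reference handling shows where the leak comes from.
    This is a Python sketch, not the kernel code: Page, lock_batch and
    leaked_refs are hypothetical names, and each page is assumed to carry
    one reference taken by the lookup.

```python
# Toy model of the page reference leak; Page and lock_batch are hypothetical.
EAGAIN = 11


class Page:
    def __init__(self, dirty=True):
        self.dirty = dirty
        self.refs = 1        # reference taken when the page was looked up
        self.locked = False


def lock_batch(pages, fixed):
    for i, page in enumerate(pages):
        page.locked = True
        if not page.dirty:                 # the page changed under us
            page.locked = False
            if fixed:
                for rest in pages[i:]:     # current page and all remaining ones
                    rest.refs -= 1
            else:
                page.refs -= 1             # only the current page was put
            return -EAGAIN
        page.refs -= 1                     # done with this page, it stays locked
    return 0


def leaked_refs(fixed):
    pages = [Page(), Page(dirty=False), Page(), Page()]
    ret = lock_batch(pages, fixed)
    return ret, sum(p.refs for p in pages)


ret_before, leaked_before = leaked_refs(fixed=False)
ret_after, leaked_after = leaked_refs(fixed=True)
```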
    
    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Robbie Ko <robbieko@synology.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Robbie Ko authored and kdave committed Jul 21, 2020
  7. btrfs: qgroup: fix data leak caused by race between writeback and truncate

    [BUG]
    When running tests like generic/013 on a test device with btrfs quota
    enabled, qgroup reserved data space is often leaked, which is detected
    at unmount time:
    
      BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
      ------------[ cut here ]------------
      WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
      RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
      Call Trace:
       btrfs_put_super+0x15/0x17 [btrfs]
       generic_shutdown_super+0x72/0x110
       kill_anon_super+0x18/0x30
       btrfs_kill_super+0x17/0x30 [btrfs]
       deactivate_locked_super+0x3b/0xa0
       deactivate_super+0x40/0x50
       cleanup_mnt+0x135/0x190
       __cleanup_mnt+0x12/0x20
       task_work_run+0x64/0xb0
       __prepare_exit_to_usermode+0x1bc/0x1c0
       __syscall_return_slowpath+0x47/0x230
       do_syscall_64+0x64/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ---[ end trace caf08beafeca2392 ]---
      BTRFS error (device dm-3): qgroup reserved space leaked
    
    [CAUSE]
    In the offending case, the offending operations are:
    2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
    2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
    
    The following sequence of events could happen after the writev():
    	CPU1 (writeback)		|		CPU2 (truncate)
    -----------------------------------------------------------------
    btrfs_writepages()			|
    |- extent_write_cache_pages()		|
       |- Got page for 1003520		|
       |  1003520 is Dirty, no writeback	|
       |  So (!clear_page_dirty_for_io())   |
       |  gets called for it		|
       |- Now page 1003520 is Clean.	|
       |					| btrfs_setattr()
       |					| |- btrfs_setsize()
       |					|    |- truncate_setsize()
       |					|       New i_size is 18388
       |- __extent_writepage()		|
       |  |- page_offset() > i_size		|
          |- btrfs_invalidatepage()		|
    	 |- Page is clean, so no qgroup |
    	    callback executed
    
    This means, the qgroup reserved data space is not properly released in
    btrfs_invalidatepage() as the page is Clean.
    
    [FIX]
    Instead of checking the dirty bit of a page, call
    btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
    
    Since qgroup rsv is bound entirely to the QGROUP_RESERVED bit of the
    io_tree, and not to page status, the unconditional call cannot cause a
    double free.
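    
    Why the unconditional call is safe can be sketched with a toy model.
    This is illustrative Python only: the Inode class and the helpers are
    hypothetical, and a set of page offsets stands in for the
    QGROUP_RESERVED bit of the io_tree.

```python
# Toy model: the reservation is tracked by a "reserved" bit per range,
# not by the dirty state of the page. All names are hypothetical.
PAGE_SIZE = 4096


class Inode:
    def __init__(self):
        self.reserved = set()  # offsets that have the QGROUP_RESERVED bit set
        self.qgroup_rsv = 0    # bytes currently reserved


def reserve_data(inode, offset):
    if offset not in inode.reserved:
        inode.reserved.add(offset)
        inode.qgroup_rsv += PAGE_SIZE


def free_data(inode, offset):
    """Release only what the reserved bit says is still reserved."""
    if offset in inode.reserved:
        inode.reserved.remove(offset)
        inode.qgroup_rsv -= PAGE_SIZE


def invalidate_page(inode, offset, page_dirty, fixed):
    if fixed or page_dirty:    # before the fix only dirty pages were freed
        free_data(inode, offset)


old = Inode()
reserve_data(old, 1003520)
invalidate_page(old, 1003520, page_dirty=False, fixed=False)

new = Inode()
reserve_data(new, 1003520)
invalidate_page(new, 1003520, page_dirty=False, fixed=True)
invalidate_page(new, 1003520, page_dirty=False, fixed=True)  # repeat is harmless
```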
    
    Fixes: 0b34c26 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  8. btrfs: reduce contention on log trees when logging checksums

    The possibility of extents being shared (through clone and deduplication
    operations) requires special care when logging data checksums, to avoid
    having a log tree with different checksum items that cover ranges which
    overlap (which resulted in missing checksums after replaying a log tree).
    Such problems were fixed in the past by the following commits:
    
    commit 40e046a ("Btrfs: fix missing data checksums after replaying a
                          log tree")
    
    commit e289f03 ("btrfs: fix corrupt log due to concurrent fsync of
                          inodes with shared extents")
    
    Test case generic/588 exercises the scenario solved by the first commit
    (purely sequential and deterministic) while test case generic/457 often
    triggered the case fixed by the second commit (not deterministic, requires
    specific timings under concurrency).
    
    The problems were addressed by deleting, from the log tree, any existing
    checksums before logging the new ones. And also by doing the deletion and
    logging of the checksums while locking the checksum range in an extent io
    tree (root->log_csum_range), to deal with the case where we have concurrent
    fsyncs against files with shared extents.
    
    That however causes more contention on the leaves of a log tree where we
    store checksums (and all the nodes in the paths leading to them), even
    when we do not have shared extents, or all the shared extents were created
    by past transactions. It also adds a bit of contention on the spin lock of
    the log_csum_range extent io tree of the log root.
    
    This change adds a 'last_reflink_trans' field to the inode to keep track
    of the last transaction where a new extent was shared between inodes
    (through clone and deduplication operations). It is updated for both the
    source and destination inodes of reflink operations whenever a new extent
    (created in the current transaction) becomes shared by the inodes. This
    field is kept in memory only, not persisted in the inode item, similar
    to other existing fields (last_unlink_trans, logged_trans).
    
    When logging checksums for an extent, if the value of 'last_reflink_trans'
    is smaller than the current transaction's generation/id, we skip locking
    the extent range and deletion of checksums from the log tree, since we
    know we do not have new shared extents. This reduces contention on the
    log tree's leaves where checksums are stored.
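    
    The decision described above can be sketched as follows.  This is a
    minimal Python model, not the kernel implementation: the Inode class
    and both function names are hypothetical, and only the comparison
    against the current transaction id is modelled.

```python
# Toy model of the last_reflink_trans check; names are hypothetical.
class Inode:
    def __init__(self):
        self.last_reflink_trans = 0  # kept in memory only, never persisted


def note_reflink(src, dst, extent_generation, current_transid):
    """Update both inodes only when the shared extent is new in this transaction."""
    if extent_generation == current_transid:
        src.last_reflink_trans = current_transid
        dst.last_reflink_trans = current_transid


def must_lock_and_delete_csums(inode, current_transid):
    """The slow path is only needed if a new extent was shared in this transaction."""
    return inode.last_reflink_trans >= current_transid


a, b, c = Inode(), Inode(), Inode()
note_reflink(a, b, extent_generation=7, current_transid=7)  # new extent shared
note_reflink(c, a, extent_generation=3, current_transid=7)  # old extent shared

slow_same_trans = must_lock_and_delete_csums(a, 7)
slow_next_trans = must_lock_and_delete_csums(a, 8)
slow_old_extent = must_lock_and_delete_csums(c, 7)
```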
    
    The following script, which uses fio, was used to measure the impact of
    this change:
    
      $ cat test-fsync.sh
      #!/bin/bash
    
      DEV=/dev/sdk
      MNT=/mnt/sdk
      MOUNT_OPTIONS="-o ssd"
      MKFS_OPTIONS="-d single -m single"
    
      if [ $# -ne 3 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
          exit 1
      fi
    
      NUM_JOBS=$1
      FILE_SIZE=$2
      FSYNC_FREQ=$3
    
      cat <<EOF > /tmp/fio-job.ini
      [writers]
      rw=write
      fsync=$FSYNC_FREQ
      fallocate=none
      group_reporting=1
      direct=0
      bs=64k
      ioengine=sync
      size=$FILE_SIZE
      directory=$MNT
      numjobs=$NUM_JOBS
      EOF
    
      echo "Using config:"
      echo
      cat /tmp/fio-job.ini
      echo
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
      fio /tmp/fio-job.ini
      umount $MNT
    
    The tests were performed for different numbers of jobs, file sizes and
    fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
    12 cores, with the cpu governor set to performance mode on all cores), 16GiB
    of ram (the host has 64GiB) and using a NVMe device directly (without an
    intermediary filesystem in the host). While running the tests, the host
    was not used for anything else, to avoid disturbing the tests.
    
    The obtained results were the following (the last line of fio's output was
    pasted). Starting with 16 jobs is where a significant difference is
    observable in this particular setup and hardware (differences highlighted
    below). The very small differences for tests with fewer than 16 jobs are
    probably just noise.
    
        **** 1 job, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
    
    after this change:
    
    WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
    
        **** 2 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
    
    after this change:
    
    WRITE: bw=37.6MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
    
        **** 4 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
    
    after this change:
    
    WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
    
        **** 8 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
    
    after this change:
    
    WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
    
        **** 16 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
    
    after this change:
    
    WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
    (+40.1% throughput, -28.5% runtime)
    
        **** 32 jobs, file size 1G, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
    
    after this change:
    
    WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
    (+29.4% throughput, -22.7% runtime)
    
        **** 64 jobs, file size 512M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
    
    after this change:
    
    WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
    (+29.3% throughput, -22.7% runtime)
    
        **** 128 jobs, file size 256M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
    
    after this change:
    
    WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
    (+121.2% throughput, -54.6% runtime)
    
        **** 256 jobs, file size 256M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
    
    after this change:
    
    WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
    (+74.9% throughput, -42.8% runtime)
    
        **** 512 jobs, file size 128M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
    
    after this change:
    
    WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
    (+65.1% throughput, -39.3% runtime)
    
        **** 1024 jobs, file size 128M, fsync frequency 1 ****
    
    before this change:
    
    WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
    
    after this change:
    
    WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
    (+48.7% throughput, -33.1% runtime)
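    
    The highlighted percentages can be recomputed from the bandwidth and
    runtime figures in the fio output above.  A small Python sketch, with
    percent_change being a helper name chosen here for illustration:

```python
# Recompute the highlighted deltas from the fio numbers quoted above.
def percent_change(before, after):
    """Relative change from before to after, in percent, one decimal."""
    return round((after - before) / before * 100, 1)


# jobs -> ((bw MiB/s, runtime msec) before, (bw MiB/s, runtime msec) after)
results = {
    16: ((78.5, 208676), (110, 149295)),
    128: ((54.7, 599373), (121, 271907)),
}

deltas = {
    jobs: (percent_change(before[0], after[0]),   # throughput
           percent_change(before[1], after[1]))   # runtime
    for jobs, (before, after) in results.items()
}
```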
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Jul 21, 2020
  9. btrfs: remove done label in writepage_delalloc

    Since no common cleanup runs after the label, the label is redundant.
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 21, 2020
  10. btrfs: add comments for btrfs_reserve_flush_enum

    This enum is the interface exposed to developers.
    
    Although we have a detailed comment explaining the whole idea of space
    flushing at the beginning of space-info.c, the exposed enum interface
    doesn't have any comment.
    
    Some corner cases, such as the fact that BTRFS_RESERVE_FLUSH_ALL and
    BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
    not explained at all.
    
    So add some simple comments for these enums as a quick reference.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  11. btrfs: relocation: review the call sites which can be interrupted by signal

    Since most metadata reservation calls can return -EINTR when
    interrupted by a fatal signal, we need to review all the metadata
    reservation call sites.
    
    In relocation code, the metadata reservation happens in the following
    sites:
    
    - btrfs_block_rsv_refill() in merge_reloc_root()
      merge_reloc_root() is a pretty critical section and we don't want it
      to be interrupted by a signal, so change the flush status to
      BTRFS_RESERVE_FLUSH_LIMIT, which can't be interrupted by a signal.
      Since such a change can be ENOSPC-prone, also shrink the amount of
      metadata to reserve to the minimum, to avoid a deadly ENOSPC there.
    
    - btrfs_block_rsv_refill() in reserve_metadata_space()
      It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
      by signal.
    
    - btrfs_block_rsv_refill() in prepare_to_relocate()
    
    - btrfs_block_rsv_add() in prepare_to_relocate()
    
    - btrfs_block_rsv_refill() in relocate_block_group()
    
    - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
    
    - btrfs_start_transaction() in relocate_block_group()
    
    - btrfs_start_transaction() in create_reloc_inode()
      Can be interrupted by fatal signal and we can handle it easily.
      For these call sites, just catch the -EINTR value in btrfs_balance()
      and count them as canceled.
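    
    How the -EINTR value ends up being counted as a cancellation can be
    sketched like this.  It is a toy Python model, not the code in
    btrfs_balance(): balance_status and its arguments are hypothetical, and
    the constants are the Linux errno numbers.

```python
# Toy model of classifying a balance result; names are hypothetical.
EINTR, ENOSPC, ECANCELED = 4, 28, 125  # Linux errno values


def balance_status(ret, cancel_requested=False):
    """Treat an interrupted reservation like an explicit cancel request."""
    if ret in (-ECANCELED, -EINTR) or (ret and cancel_requested):
        return "canceled"
    return "error" if ret else "done"


interrupted = balance_status(-EINTR)
out_of_space = balance_status(-ENOSPC)
finished = balance_status(0)
```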
    
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  12. btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree

    [BUG]
    There is a bug report that bad signal timing can lead to a read-only
    fs during balance:
    
      BTRFS info (device xvdb): balance: start -d -m -s
      BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
      BTRFS info (device xvdb): found 12236 extents, stage: move data extents
      BTRFS info (device xvdb): relocating block group 71928119296 flags data
      BTRFS info (device xvdb): found 3 extents, stage: move data extents
      BTRFS info (device xvdb): found 3 extents, stage: update data pointers
      BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
      BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
      BTRFS info (device xvdb): forced readonly
      BTRFS info (device xvdb): balance: ended with status: -4
    
    [CAUSE]
    The direct cause is the -EINTR from the following call chain when a
    fatal signal is pending:
    
     relocate_block_group()
     |- clean_dirty_subvols()
        |- btrfs_drop_snapshot()
           |- btrfs_start_transaction()
              |- btrfs_delayed_refs_rsv_refill()
                 |- btrfs_reserve_metadata_bytes()
                    |- __reserve_metadata_bytes()
                       |- wait_reserve_ticket()
                          |- prepare_to_wait_event();
                          |- ticket->error = -EINTR;
    
    Normally this behavior is fine for most btrfs_start_transaction()
    callers, as they need to handle any error anyway, the signal included,
    and exit ASAP.
    
    However for balance, especially for the clean_dirty_subvols() case, we're
    already doing cleanup work, and getting -EINTR from
    btrfs_drop_snapshot() could cause a lot of unexpected problems, ranging
    from the forced read-only fs in the report above to later balance
    errors due to half dropped reloc trees.
    
    [FIX]
    Fix this problem by using btrfs_join_transaction() if
    btrfs_drop_snapshot() is called from relocation context.
    
    Since btrfs_join_transaction() won't get interrupted by signal, we can
    continue the cleanup.
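    
    The difference between the two entry points, as far as this fix is
    concerned, can be modelled as below.  This is a deliberately simplified
    Python sketch: the functions only mimic the return values discussed
    above, and the for_reloc parameter name is an assumption made for
    illustration.

```python
# Toy model of start vs. join in the drop path; behavior is simplified.
EINTR = 4


def start_transaction(fatal_signal_pending):
    # Reserves metadata space and may wait on a ticket; that wait can be
    # interrupted by a fatal signal.
    return -EINTR if fatal_signal_pending else 0


def join_transaction(fatal_signal_pending):
    # Joins the running transaction without waiting on a reservation.
    return 0


def drop_snapshot(for_reloc, fatal_signal_pending):
    if for_reloc:
        ret = join_transaction(fatal_signal_pending)
    else:
        ret = start_transaction(fatal_signal_pending)
    return ret  # a non-zero value here is what forced the fs read-only


reloc_ret = drop_snapshot(for_reloc=True, fatal_signal_pending=True)
normal_ret = drop_snapshot(for_reloc=False, fatal_signal_pending=True)
```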
    
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  13. btrfs: relocation: allow signal to cancel balance

    Although btrfs balance can be canceled with "btrfs balance cancel"
    command, it's still almost muscle memory to press Ctrl-C to cancel a
    long running btrfs balance.
    
    So allow btrfs balance to check for a pending signal to determine if it
    should exit. The cancellation points are in known locations and we're
    only adding one more reason, so this should be safe.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  14. btrfs: raid56: remove out label in __raid56_parity_recover

    There's no cleanup that occurs so we can simply return 0 directly.
    
    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 21, 2020
  15. btrfs: add missing check for nocow and compression inode flags

    User Forza reported on IRC that some invalid combinations of file
    attributes are accepted by chattr.
    
    The NODATACOW and compression file flags/attributes are mutually
    exclusive, but they could be set by 'chattr +c +C' on an empty file. The
    nodatacow will be in effect because it's checked first in
    btrfs_run_delalloc_range.
    
    Extend the flag validation to catch the following cases:
    
      - input flags are conflicting
      - old and new flags are conflicting
      - initialize the local variable with inode flags after inode is locked
    
    Inode attributes take precedence over mount options and are an
    independent setting.
    
    Nocompress would be a no-op with nodatacow, but we don't want to mix
    any compression-related options with nodatacow.
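    
    A check along these lines can be sketched as follows.  This is a Python
    illustration of the rules listed above, not the kernel function: the
    name check_fileattr is hypothetical, the flag values are the
    FS_COMPR_FL, FS_NOCOMP_FL and FS_NOCOW_FL bits from linux/fs.h, and
    new_flags is assumed to be the complete set of flags requested by the
    caller.

```python
# Sketch of the flag validation; check_fileattr is a hypothetical name.
FS_COMPR_FL = 0x00000004   # chattr +c
FS_NOCOMP_FL = 0x00000400
FS_NOCOW_FL = 0x00800000   # chattr +C
EINVAL = 22

COMPRESSION_FLAGS = FS_COMPR_FL | FS_NOCOMP_FL


def check_fileattr(old_flags, new_flags):
    # Input flags are conflicting.
    if (new_flags & FS_COMPR_FL) and (new_flags & FS_NOCOMP_FL):
        return -EINVAL
    if (new_flags & FS_NOCOW_FL) and (new_flags & COMPRESSION_FLAGS):
        return -EINVAL
    # Old and new flags are conflicting.
    if (old_flags & FS_NOCOW_FL) and (new_flags & COMPRESSION_FLAGS):
        return -EINVAL
    if (new_flags & FS_NOCOW_FL) and (old_flags & COMPRESSION_FLAGS):
        return -EINVAL
    return 0


both_at_once = check_fileattr(0, FS_COMPR_FL | FS_NOCOW_FL)   # chattr +c +C
compress_on_nocow = check_fileattr(FS_NOCOW_FL, FS_COMPR_FL)
only_nocow = check_fileattr(0, FS_NOCOW_FL)
```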
    
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jul 21, 2020
  16. btrfs: don't traverse into the seed devices in show_devname

    ->show_devname currently shows the lowest devid in the list. As the seed
    devices have the lowest devid in the sprouted filesystem, userland tools
    such as findmnt end up seeing the seed device instead of the device from
    the read-writable sprouted filesystem, as shown below.
    
     mount /dev/sda /btrfs
     mount: /btrfs: WARNING: device write-protected, mounted read-only.
    
     findmnt --output SOURCE,TARGET,UUID /btrfs
     SOURCE   TARGET UUID
     /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
    
     btrfs dev add -f /dev/sdb /btrfs
    
     umount /btrfs
     mount /dev/sdb /btrfs
    
     findmnt --output SOURCE,TARGET,UUID /btrfs
     SOURCE   TARGET UUID
     /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
    
    All sprouts from a single seed will show the same seed device and the
    same fsid. That's confusing.
    This is causing problems in our prototype as there isn't any reference
    to the sprout filesystem(s) actually being used for reads and writes.
    
    This behavior was added by the patch that implemented show_devname in
    btrfs, commit 9c5085c ("Btrfs: implement ->show_devname").
    I looked for any particular reason to show the seed device and found
    none.
    
    So instead, do not traverse through the seed devices, just show the
    lowest devid in the sprouted fsid.
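    
    The change in device selection can be illustrated with a toy model.
    This is Python for illustration only: the FsDevices class with its
    seed link and both function names are hypothetical stand-ins for the
    kernel structures.

```python
# Toy model of picking the device name to show; names are hypothetical.
class FsDevices:
    def __init__(self, devices, seed=None):
        self.devices = devices  # devid -> device path
        self.seed = seed        # seed fs_devices this filesystem sprouted from


def show_devname_old(fs_devices):
    """Old behavior: walk into the seeds and take the lowest devid overall."""
    best = None
    cur = fs_devices
    while cur is not None:
        for devid, path in cur.devices.items():
            if best is None or devid < best[0]:
                best = (devid, path)
        cur = cur.seed
    return best[1]


def show_devname_new(fs_devices):
    """New behavior: lowest devid of the sprouted filesystem only."""
    return fs_devices.devices[min(fs_devices.devices)]


seed = FsDevices({1: "/dev/sda"})
sprout = FsDevices({2: "/dev/sdb"}, seed=seed)

shown_before = show_devname_old(sprout)
shown_after = show_devname_new(sprout)
```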
    
    After the patch:
    
     mount /dev/sda /btrfs
     mount: /btrfs: WARNING: device write-protected, mounted read-only.
    
     findmnt --output SOURCE,TARGET,UUID /btrfs
     SOURCE   TARGET UUID
     /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
    
     btrfs dev add -f /dev/sdb /btrfs
     mount -o rw,remount /dev/sdb /btrfs
    
     findmnt --output SOURCE,TARGET,UUID /btrfs
     SOURCE   TARGET UUID
     /dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
    
     mount /dev/sda /btrfs1
     mount: /btrfs1: WARNING: device write-protected, mounted read-only.
    
     btrfs dev add -f /dev/sdc /btrfs1
    
     findmnt --output SOURCE,TARGET,UUID /btrfs1
     SOURCE   TARGET  UUID
     /dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
    
     cat /proc/self/mounts | grep btrfs
     /dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
     /dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
    
    Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
    CC: stable@vger.kernel.org # 4.19+
    Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    asj authored and kdave committed Jul 21, 2020
  17. btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped

    [BUG]
    Sometimes fsstress can trigger a qgroup warning in test cases like
    generic/013:
    
      BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920
      ------------[ cut here ]------------
      WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
      Call Trace:
       btrfs_put_super+0x15/0x17 [btrfs]
       generic_shutdown_super+0x72/0x110
       kill_anon_super+0x18/0x30
       btrfs_kill_super+0x17/0x30 [btrfs]
       deactivate_locked_super+0x3b/0xa0
       deactivate_super+0x40/0x50
       cleanup_mnt+0x135/0x190
       __cleanup_mnt+0x12/0x20
       task_work_run+0x64/0xb0
       __prepare_exit_to_usermode+0x1bc/0x1c0
       __syscall_return_slowpath+0x47/0x230
       do_syscall_64+0x64/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ---[ end trace 6c341cdf9b6cc3c1 ]---
      BTRFS error (device dm-3): qgroup reserved space leaked
    
    Yet subvolume 259 no longer exists in that filesystem.
    
    [CAUSE]
    Normally per-trans qgroup reserved space is freed when a transaction is
    committed, in commit_fs_roots().
    
    However, a completely dropped subvolume is no longer in the
    fs_roots_radix, so its per-trans reserved qgroup space will never be
    freed.
    
    Since the subvolume is already gone, leaked per-trans space won't cause
    any trouble for end users.
    
    [FIX]
    Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is
    completely dropped.
    
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 21, 2020
  18. btrfs: fix double free on ulist after backref resolution failure

    At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
    argument to point to it. However if later we fail due to an error returned
    by find_parent_nodes(), we free that ulist but leave a dangling pointer in
    the **roots argument. Upon receiving the error, a caller of this function
    can attempt to free the same ulist again, resulting in an invalid memory
    access.
    
    One such scenario is during qgroup accounting:
    
    btrfs_qgroup_account_extents()
    
     --> calls btrfs_find_all_roots(), passing &new_roots (a stack allocated
         pointer) to it
    
       --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
           passing &new_roots to it
    
         --> allocates ulist and assigns its address to **roots (which
             points to new_roots from btrfs_qgroup_account_extents())
    
         --> find_parent_nodes() returns an error, so we free the ulist
             and leave **roots pointing to it after returning
    
     --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
         an error and jumps to the label 'cleanup', which just tries to
         free again the same ulist
    
    Stack trace example:
    
     ------------[ cut here ]------------
     BTRFS: tree first key check failed
     WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
     Modules linked in: dm_snapshot dm_thin_pool (...)
     CPU: 1 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
     RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
     Code: 28 5b 5d (...)
     RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
     RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
     RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
     RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
     R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
     R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
     FS:  00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      read_block_for_search+0xf6/0x350 [btrfs]
      btrfs_next_old_leaf+0x242/0x650 [btrfs]
      resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
      find_parent_nodes+0x4ea/0x12c0 [btrfs]
      btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
      btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
      btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
      btrfs_sync_file+0x3d4/0x4d0 [btrfs]
      do_fsync+0x38/0x70
      __x64_sys_fdatasync+0x13/0x20
      do_syscall_64+0x5c/0xe0
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
     RIP: 0033:0x7fc47e2d72e3
     Code: Bad RIP value.
     RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
     RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
     RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
     RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
     R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
     R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
     irq event stamp: 0
     hardirqs last  enabled at (0): [<0000000000000000>] 0x0
     hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
     softirqs last  enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
     softirqs last disabled at (0): [<0000000000000000>] 0x0
     ---[ end trace 8639237550317b48 ]---
     BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
     general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
     CPU: 2 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
     RIP: 0010:ulist_release+0x14/0x60 [btrfs]
     Code: c7 07 00 (...)
     RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
     RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
     RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
     RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
     R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
     R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
     FS:  00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      ulist_free+0x13/0x20 [btrfs]
      btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
      btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
      btrfs_sync_file+0x3d4/0x4d0 [btrfs]
      do_fsync+0x38/0x70
      __x64_sys_fdatasync+0x13/0x20
      do_syscall_64+0x5c/0xe0
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
     RIP: 0033:0x7fc47e2d72e3
     Code: Bad RIP value.
     RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
     RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
     RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
     RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
     R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
     R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
     Modules linked in: dm_snapshot dm_thin_pool (...)
     ---[ end trace 8639237550317b49 ]---
     RIP: 0010:ulist_release+0x14/0x60 [btrfs]
     Code: c7 07 00 (...)
     RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
     RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
     RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
     RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
     R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
     R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
     FS:  00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    
    Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
    it frees the ulist.
    
    Fixes: 8da6d58 ("Btrfs: added btrfs_find_all_roots()")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    fdmanana authored and kdave committed Jul 21, 2020
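    
    As an illustration of the fix described in the commit above, here is a
    minimal user-space sketch of the pattern: the callee frees the list on
    error and resets the caller's pointer, so the caller's shared cleanup
    path stays safe. All toy_* names are hypothetical stand-ins, not kernel
    functions, and the error values are arbitrary.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ulist; not the kernel type. */
struct toy_list {
	int nnodes;
};

/* free(NULL) is a no-op in C, so this is safe to call on NULL. */
void toy_list_free(struct toy_list *l)
{
	free(l);
}

/*
 * Allocates a list into *roots. A nonzero 'fail' simulates an error
 * from the lookup step: the list is freed and *roots is reset to NULL
 * so the caller cannot free the same memory a second time.
 */
int toy_find_all_roots(struct toy_list **roots, int fail)
{
	*roots = calloc(1, sizeof(**roots));
	if (!*roots)
		return -1;

	if (fail) {
		toy_list_free(*roots);
		*roots = NULL;	/* the fix: no dangling pointer left behind */
		return -2;
	}
	return 0;
}

/* Caller with a single cleanup path, similar in spirit to a 'cleanup' label. */
int toy_account(int fail)
{
	struct toy_list *new_roots = NULL;
	int ret = toy_find_all_roots(&new_roots, fail);

	/* Safe on both paths: new_roots is either a live list or NULL. */
	toy_list_free(new_roots);
	return ret;
}
```

    Without the assignment to NULL in the error branch, toy_account(1)
    would pass an already freed pointer to free() a second time.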
  19. btrfs: ref-verify: fix memory leak in add_block_entry

    clang static analysis flags this error
    
    fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
                    kfree(be);
                    ^~~~~
    
    The problem is in this block of code:
    
    	if (root_objectid) {
    		struct root_entry *exist_re;
    
    		exist_re = insert_root_entry(&exist->roots, re);
    		if (exist_re)
    			kfree(re);
    	}
    
    There is no 'else' block freeing when root_objectid is 0. Add the
    missing kfree to the else branch.
    
    Fixes: fd708b8 ("Btrfs: add a extent ref verify tool")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Tom Rix <trix@redhat.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    trixirt authored and kdave committed Jul 21, 2020
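    
    The ownership rule behind the fix above can be sketched in user space
    as follows. This is an assumption-laden toy, not the ref-verify code:
    the toy_* names, the one-slot "set" and the allocation counter are all
    made up so that the leak becomes observable.

```c
#include <stdlib.h>

/* Hypothetical miniature of a root entry; not the kernel struct. */
struct toy_root_entry {
	unsigned long long root_objectid;
};

/* Number of entries currently allocated, so a leak shows up as a count. */
int toy_live_allocs;

struct toy_root_entry *toy_alloc_entry(unsigned long long objectid)
{
	struct toy_root_entry *re = malloc(sizeof(*re));

	if (re) {
		re->root_objectid = objectid;
		toy_live_allocs++;
	}
	return re;
}

void toy_free_entry(struct toy_root_entry *re)
{
	if (re) {
		toy_live_allocs--;
		free(re);
	}
}

/*
 * One-slot toy container. Returns NULL when 're' was stored, or the
 * entry already present (in this toy, any second insert is a duplicate).
 */
struct toy_root_entry *toy_insert(struct toy_root_entry **slot,
				  struct toy_root_entry *re)
{
	if (*slot)
		return *slot;
	*slot = re;
	return NULL;
}

/* Merge path: the parent entry already exists, so 're' must be consumed. */
void toy_merge_existing(struct toy_root_entry **slot,
			struct toy_root_entry *re,
			unsigned long long root_objectid)
{
	if (root_objectid) {
		struct toy_root_entry *exist_re = toy_insert(slot, re);

		if (exist_re)
			toy_free_entry(re);	/* duplicate: drop our copy */
	} else {
		toy_free_entry(re);	/* added branch: nobody else will own 're' */
	}
}
```

    With the else branch removed, a call with root_objectid == 0 would
    leave toy_live_allocs one higher than the number of reachable entries.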
  20. btrfs: prefetch chunk tree leaves at mount

    The whole chunk tree is read at mount time so we can utilize readahead
    to get the tree blocks to memory before we read the items. The idea is
    from Robbie, but instead of updating search slot readahead, this patch
    implements the chunk tree readahead manually from nodes on level 1.
    
    We've decided to do specific readahead optimizations and then unify them
    under a common API so we don't break everything by changing the search
    slot readahead logic.
    
    Higher chunk trees grow on large filesystems (many terabytes), and
    prefetching just level 1 seems to be sufficient. Provided example was
    from a 200TiB filesystem with chunk tree level 2.
    
    CC: Robbie Ko <robbieko@synology.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    kdave committed Jul 21, 2020
  21. btrfs: add metadata_uuid to FS_INFO ioctl

    Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
    This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
    btrfs_ioctl_fs_info_args::flags.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Johannes Thumshirn authored and kdave committed Jul 21, 2020
  22. btrfs: add filesystem generation to FS_INFO ioctl

    Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
    driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
    btrfs_ioctl_fs_info_args::flags.
    
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Johannes Thumshirn authored and kdave committed Jul 21, 2020
  23. btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl

    With the recent addition of filesystem checksum types other than CRC32c,
    it is no longer hard-coded which checksum type a btrfs filesystem uses.
    
    Up to now there is no good way to read the filesystem checksum, apart from
    reading the filesystem UUID and then querying sysfs for the checksum type.
    
    Add new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
    command which usually is used to query filesystem features. Also add a
    flags member indicating that the kernel responded with a set csum_type and
    csum_size field.
    
    For compatibility reasons, only return the csum_type and csum_size if
    the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
    clear any unknown flags so we don't pass false positives to user-space
    newer than the kernel.
    
    To simplify further additions to the ioctl, also switch the padding to a
    u8 array.
    
    The csum members are added before flags, which might look odd, but this
    is to keep the alignment requirements and not to introduce holes in the
    structure. Pahole was used to verify the result of this switch:
    
      $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
      struct btrfs_ioctl_fs_info_args {
    	  __u64                      max_id;               /*     0     8 */
    	  __u64                      num_devices;          /*     8     8 */
    	  __u8                       fsid[16];             /*    16    16 */
    	  __u32                      nodesize;             /*    32     4 */
    	  __u32                      sectorsize;           /*    36     4 */
    	  __u32                      clone_alignment;      /*    40     4 */
    	  __u16                      csum_type;            /*    44     2 */
    	  __u16                      csum_size;            /*    46     2 */
    	  __u64                      flags;                /*    48     8 */
    	  __u8                       reserved[968];        /*    56   968 */
    
    	  /* size: 1024, cachelines: 16, members: 10 */
      };
    
    Fixes: 3951e7f ("btrfs: add xxhash64 to checksumming algorithms")
    Fixes: 3831bf0 ("btrfs: add sha256 to checksumming algorithm")
    CC: stable@vger.kernel.org # 5.5+
    Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Johannes Thumshirn authored and kdave committed Jul 21, 2020
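    
    The layout and the flag handshake described in the commit above can be
    checked with a small user-space sketch. The structure below only mirrors
    the pahole output; the mirror_ and toy_ names and the flag bit value are
    assumptions made for this example, and real programs should use the
    definitions from <linux/btrfs.h> instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the layout shown in the pahole output. */
struct mirror_fs_info_args {
	uint64_t max_id;
	uint64_t num_devices;
	uint8_t  fsid[16];
	uint32_t nodesize;
	uint32_t sectorsize;
	uint32_t clone_alignment;
	uint16_t csum_type;
	uint16_t csum_size;
	uint64_t flags;
	uint8_t  reserved[968];
};

/* Placing the two u16 members before flags keeps flags 8-byte aligned. */
static_assert(offsetof(struct mirror_fs_info_args, csum_type) == 44,
	      "csum_type offset");
static_assert(offsetof(struct mirror_fs_info_args, csum_size) == 46,
	      "csum_size offset");
static_assert(offsetof(struct mirror_fs_info_args, flags) == 48,
	      "flags offset");
static_assert(offsetof(struct mirror_fs_info_args, reserved) == 56,
	      "reserved offset");
static_assert(sizeof(struct mirror_fs_info_args) == 1024, "total size");

/* Hypothetical bit value, chosen only for this sketch. */
#define TOY_FLAG_CSUM_INFO (1ULL << 0)

/*
 * Toy responder: fills the csum fields only when the caller asked for
 * them, and reports back only the flags it understood, so unknown
 * request bits never come back set.
 */
void toy_fill_fs_info(struct mirror_fs_info_args *args,
		      uint16_t csum_type, uint16_t csum_size)
{
	uint64_t flags_in = args->flags;
	uint64_t flags_out = 0;

	if (flags_in & TOY_FLAG_CSUM_INFO) {
		args->csum_type = csum_type;
		args->csum_size = csum_size;
		flags_out |= TOY_FLAG_CSUM_INFO;
	}
	args->flags = flags_out;
}
```

    A caller sets the flag bits it is interested in before the call and
    afterwards trusts only the fields whose flag is still set.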

Commits on Jul 20, 2020

  1. btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT
    
    Commit a514d63 ("btrfs: qgroup: Commit transaction in advance to
    reduce early EDQUOT") tries to reduce early EDQUOT problems by checking
    the qgroup free space against a threshold and waking up the commit
    kthread to free some space.
    
    The problem with that mechanism is that it can only free qgroup
    per-trans metadata space; it can do nothing for data or prealloc qgroup
    space.
    
    Now that we can flush qgroup space and have implemented the
    retry-after-EDQUOT behavior, that mechanism can be completely replaced.
    
    So this patch removes it in favor of retry-after-EDQUOT.
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 20, 2020
  2. btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

    [PROBLEM]
    There are known problems related to how btrfs handles qgroup reserved
    space.  One of the most obvious cases is the test case btrfs/153,
    which does fallocate, then writes into the preallocated range.
    
      btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
          --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
          +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
          @@ -1,2 +1,5 @@
           QA output created by 153
          +pwrite: Disk quota exceeded
          +/mnt/scratch/testfile2: Disk quota exceeded
          +/mnt/scratch/testfile2: Disk quota exceeded
           Silence is golden
          ...
          (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
    
    [CAUSE]
    Since commit c6887cd ("Btrfs: don't do nocow check unless we have to"),
    we always reserve space no matter if it's COW or not.
    
    Such behavior change is mostly for performance, and reverting it is not
    a good idea anyway.
    
    For a preallocated extent, we already reserve qgroup data space for it,
    and since we also reserve data space for qgroup at buffered write time,
    it needs twice the space for us to write into preallocated space.
    
    This leads to the -EDQUOT in buffered write routine.
    
    And we can't follow the same solution as the data/meta space check,
    because qgroup reserved space is shared between data and metadata.
    The EDQUOT can happen at the metadata reservation, so doing NODATACOW
    check after qgroup reservation failure is not a solution.
    
    [FIX]
    To solve the problem, we don't return -EDQUOT directly, but every time
    we get a -EDQUOT, we try to flush qgroup space:
    
    - Flush all inodes of the root
      NODATACOW writes will free the qgroup reserved at run_delalloc_range().
      However we don't have the infrastructure to only flush NODATACOW
      inodes, so here we flush all inodes anyway.
    
    - Wait for ordered extents
      This would convert the preallocated metadata space into per-trans
      metadata, which can be freed in later transaction commit.
    
    - Commit transaction
      This will free all per-trans metadata space.
    
    Also we don't want to trigger flush multiple times, so here we introduce
    a per-root wait list and a new root status, to ensure only one thread
    starts the flushing.
    
    Fixes: c6887cd ("Btrfs: don't do nocow check unless we have to")
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 20, 2020
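    
    The retry-after-EDQUOT idea from the commit above can be sketched with
    a toy quota model. This is only an illustration under simplifying
    assumptions: the toy_* names are hypothetical, the three flush steps are
    collapsed into one function, and the per-root wait list that keeps
    flushing single-threaded is not modeled.

```c
#include <errno.h>

/* Toy quota group; 'reclaimable' is the part of 'reserved' a flush frees. */
struct toy_qgroup {
	long limit;
	long reserved;
	long reclaimable;
	int flushes;
};

/* Plain reservation attempt: fails with -EDQUOT when over the limit. */
int toy_try_reserve(struct toy_qgroup *qg, long bytes)
{
	if (qg->reserved + bytes > qg->limit)
		return -EDQUOT;
	qg->reserved += bytes;
	return 0;
}

/* Stand-in for flushing inodes, waiting for ordered extents, committing. */
void toy_flush(struct toy_qgroup *qg)
{
	qg->reserved -= qg->reclaimable;
	qg->reclaimable = 0;
	qg->flushes++;
}

/* Reserve, and on -EDQUOT flush once and retry before giving up. */
int toy_reserve(struct toy_qgroup *qg, long bytes)
{
	int ret = toy_try_reserve(qg, bytes);

	if (ret != -EDQUOT)
		return ret;

	toy_flush(qg);
	return toy_try_reserve(qg, bytes);
}
```

    If the flush releases nothing, the second attempt fails as well and
    the caller still sees -EDQUOT, so the error is only delayed, never
    hidden.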
  3. btrfs: qgroup: allow to unreserve range without releasing other ranges

    [PROBLEM]
    Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
    reserved space of the changeset.
    
    For example:
    	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
    	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
    	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
    
    If the last btrfs_qgroup_reserve_data() failed, it will release the
    entire [0, 3M) range.
    
    This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
    go to error handling and need to release all reserved ranges anyway.
    
    But this also means the following call is not possible:
    
    	ret = btrfs_qgroup_reserve_data();
    	if (ret == -EDQUOT) {
    		/* Do something to free some qgroup space */
    		ret = btrfs_qgroup_reserve_data();
    	}
    
    This is because if the first btrfs_qgroup_reserve_data() fails, it
    frees all reserved qgroup space.
    
    [CAUSE]
    This is because we release all reserved ranges when
    btrfs_qgroup_reserve_data() fails.
    
    [FIX]
    This patch will implement a new function, qgroup_unreserve_range(), to
    iterate through the ulist nodes, to find any nodes in the failure range,
    and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
    decrease the extent_changeset::bytes_changed, so that we can revert to
    previous state.
    
    This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
    happens.
    
    Suggested-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    adam900710 authored and kdave committed Jul 20, 2020
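    
    A rough user-space model of the behavior introduced by the commit
    above: a failed reservation reverts only its own range and leaves
    earlier ranges of the same changeset intact. Everything here is a
    simplification with made-up toy_* names; each array slot stands for one
    1M range, and the sketch assumes callers use disjoint ranges.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_UNITS 8	/* each unit stands for one 1M range */

struct toy_state {
	bool reserved[TOY_UNITS];	/* stand-in for the reserved bits */
	bool in_changeset[TOY_UNITS];	/* stand-in for changeset nodes */
	int bytes_changed;		/* counted in units */
	int limit;			/* quota limit in units */
};

/* Revert only the changeset entries that fall inside [start, start+len). */
void toy_unreserve_range(struct toy_state *s, int start, int len)
{
	for (int i = start; i < start + len; i++) {
		if (!s->in_changeset[i])
			continue;
		s->in_changeset[i] = false;
		s->reserved[i] = false;
		s->bytes_changed--;
	}
}

/* Reserve a range; on quota failure undo this range and nothing else. */
int toy_reserve_data(struct toy_state *s, int start, int len)
{
	if (start < 0 || len <= 0 || start + len > TOY_UNITS)
		return -EINVAL;

	for (int i = start; i < start + len; i++) {
		if (s->reserved[i])
			continue;
		s->reserved[i] = true;
		s->in_changeset[i] = true;
		s->bytes_changed++;
	}

	if (s->bytes_changed > s->limit) {
		toy_unreserve_range(s, start, len);
		return -EDQUOT;
	}
	return 0;
}
```

    Because the earlier ranges survive the failure, the caller can free
    some quota and simply call toy_reserve_data() again for the failed
    range.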
  4. btrfs: convert block group refcount to refcount_t

    We have refcount_t now with the associated library to handle refcounts,
    which gives us extra debugging around reference count mistakes that may
    be made.  For example it'll warn on any transition from 0->1 or 0->-1,
    which is handy for noticing cases where we've messed up reference
    counting.  Convert the block group ref counting from an atomic_t to
    refcount_t and use the appropriate helpers.
    
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    josefbacik authored and kdave committed Jul 20, 2020
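    
    To illustrate the kind of mistakes mentioned in the commit above, here
    is a toy reference counter built on C11 atomics that records the 0->1
    and 0->-1 transitions. It is not the kernel's refcount_t: the toy_*
    names are hypothetical, a counter stands in for the kernel warning, and
    saturation behavior is not modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_refcount {
	atomic_int refs;
	atomic_int warnings;	/* stand-in for a kernel warning */
};

void toy_refcount_init(struct toy_refcount *r, int n)
{
	atomic_init(&r->refs, n);
	atomic_init(&r->warnings, 0);
}

/* Taking a reference on an object whose count is 0 is a likely use-after-free. */
void toy_refcount_inc(struct toy_refcount *r)
{
	int old = atomic_fetch_add(&r->refs, 1);

	if (old == 0)
		atomic_fetch_add(&r->warnings, 1);	/* 0 -> 1 */
}

/* Returns true when the last reference was dropped. */
bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
	int old = atomic_fetch_sub(&r->refs, 1);

	if (old == 0)
		atomic_fetch_add(&r->warnings, 1);	/* 0 -> -1 */
	return old == 1;
}
```

    A plain atomic counter would accept both transitions silently, which
    is why such bugs tend to go unnoticed until memory is corrupted.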
  5. btrfs: add multi-statement protection to btrfs_set/clear_and_info macros

    Multi-statement macros should be enclosed in do/while(0) block to make
    their use safe in single statement if conditions. All current uses of
    the macros are safe, so this change is for future protection.
    
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    marcosps authored and kdave committed Jul 20, 2020
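    
    The hazard that the do/while(0) wrapper from the commit above guards
    against can be shown with two generic macros. These are hypothetical
    examples written for this sketch, not the btrfs macros themselves.

```c
/* Two statements with no wrapper: only the first one is guarded by an if. */
#define TOY_SET_AND_COUNT_UNSAFE(flags, bit, count) \
	(flags) |= (bit); (count)++

/* Same statements wrapped so the macro behaves like a single statement. */
#define TOY_SET_AND_COUNT(flags, bit, count) \
	do { \
		(flags) |= (bit); \
		(count)++; \
	} while (0)

/* Encodes the result as flags * 10 + count so both effects are visible. */
int toy_unsafe(int cond)
{
	unsigned int flags = 0;
	int count = 0;

	if (cond)
		TOY_SET_AND_COUNT_UNSAFE(flags, 1u, count);
	/* count was incremented even when cond was false */
	return (int)flags * 10 + count;
}

int toy_safe(int cond)
{
	unsigned int flags = 0;
	int count = 0;

	if (cond)
		TOY_SET_AND_COUNT(flags, 1u, count);
	return (int)flags * 10 + count;
}
```

    The wrapped form also keeps working when the if is followed by an
    else, where an unwrapped or brace-only macro followed by a semicolon
    would fail to compile.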
  6. btrfs: remove fail label in check_compressed_csum

    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Nikolay Borisov authored and kdave committed Jul 20, 2020