Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to 3.14-rc7 #1

Closed
wants to merge 10,000 commits into from
Closed

Update to 3.14-rc7 #1

wants to merge 10,000 commits into from

Conversation

dizzy57
Copy link

@dizzy57 dizzy57 commented Jun 17, 2014

Update kasan to 3.14-rc7, fix slab.c after change of return type of kmem_getpages()

ralfbaechle and others added 30 commits March 6, 2014 20:52
  CC      arch/mips/kernel/ptrace.o
In file included from arch/mips/kernel/ptrace.c:42:0:
arch/mips/kernel/ptrace.c: In function ‘mips_get_syscall_arg’:
/home/ralf/src/linux/linux-mips/arch/mips/include/asm/syscall.h:60:1: error: control reaches end of non-void function [-Werror=return-type]
cc1: all warnings being treated as errors
make[2]: *** [arch/mips/kernel/ptrace.o] Error 1
make[1]: *** [arch/mips/kernel] Error 2
make: *** [arch/mips] Error 2

Fixed by marking the end of mips_get_syscall_arg() as unreachable.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
enabled __must_check logic triggers a build error for mtx1 and gpr
in the prom init code.  Fix by checking the kstrtoul() return value.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
Cc: Linux-MIPS <linux-mips@linux-mips.org>
Patchwork: https://patchwork.linux-mips.org/patch/6574/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Fixes the following build problen on allmodconfig:

arch/mips/bcm47xx/board.c: In function 'bcm47xx_board_detect':
arch/mips/bcm47xx/board.c:291:14: error: 'ENXIO' undeclared
(first use in this function)

Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/6571/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
There are 2 errors with the existing aprp_hook linking:
- The prefix CONFIG_ is missing;
- The hook should be linked exclusively in the cases of MT and CMP.
Signed-off-by: Deng-Cheng Zhu <dengcheng.zhu@imgtec.com>
Reviewed-by: Steven J. Hill <Steven.Hill@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: john@phrozen.org
Patchwork: https://patchwork.linux-mips.org/patch/6588/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
If the aprp_hook is not assigned back to NULL, it will still be called
after module exits. This is not wanted.

Reviewed-by: Steven J. Hill <Steven.Hill@imgtec.com>
Signed-off-by: Deng-Cheng Zhu <dengcheng.zhu@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: john@phrozen.org
Patchwork: https://patchwork.linux-mips.org/patch/6590/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Now we have CONFIG_MIPS_VPE_LOADER and CONFIG_MIPS_VPE_LOADER_[CMP|MT]. The
latter two are used by the 2 exclusive flavors. The vpe_run in malta-amon.c
is for CMP APRP. Without the fix, this vpe_run will be used in MT APRP.

Reviewed-by: Steven J. Hill <Steven.Hill@imgtec.com>
Signed-off-by: Deng-Cheng Zhu <dengcheng.zhu@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: john@phrozen.org
Patchwork: https://patchwork.linux-mips.org/patch/6589/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Quoting Alexander Aring:
  While fragmentation and unloading of 6lowpan module I got this kernel Oops
  after few seconds:

  BUG: unable to handle kernel paging request at f88bbc30
  [..]
  Modules linked in: ipv6 [last unloaded: 6lowpan]
  Call Trace:
   [<c012af4c>] ? call_timer_fn+0x54/0xb3
   [<c012aef8>] ? process_timeout+0xa/0xa
   [<c012b66b>] run_timer_softirq+0x140/0x15f

Problem is that incomplete frags are still around after unload; when
their frag expire timer fires, we get crash.

When a netns is removed (also done when unloading module), inet_frag
calls the evictor with 'force' argument to purge remaining frags.

The evictor loop terminates when accounted memory ('work') drops to 0
or the lru-list becomes empty.  However, the mem accounting is done
via percpu counters and may not be accurate, i.e. loop may terminate
prematurely.

Alter evictor to only stop once the lru list is empty when force is
requested.

Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Reported-by: Alexander Aring <alex.aring@gmail.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit bd97268
"firewire: ohci: Fix 'failed to read phy reg' on FW643 rev8",
there is a high chance that firewire-ohci fails to initialize LSI née
Agere controllers.
https://bugzilla.kernel.org/show_bug.cgi?id=65151

Peter Hurley points out the reason:  IEEE 1394a:2000 clause 5A.1 (or
IEEE 1394:2008 clause 17.2.1) say:  "The PHY shall insure that no more
than 10 ms elapse from the reassertion of LPS until the interface is
reset.  The link shall not assert LReq until the reset is complete."
In other words, the link needs to give the PHY at least 10 ms to get
the interface operational.

With just the msleep(1) in bd97268, the first read_phy_reg()
during ohci_enable() may happen before the phy-link interface reset was
finished, and fail.  Due to the high variability of msleep(n) with small
n, this failure was not fully reproducible, and not apparent at all with
low CONFIG_HZ setting.

On the other hand, Peter can no longer reproduce the issue with FW643
rev8.  The read phy reg failures that happened back then may have had an
unrelated cause.  So, just revert bd97268, except for the valid
comment on TSB82AA2 cards.

Reported-by: Mikhail Gavrilov
Reported-by: Jay Fenlason <fenlason@redhat.com>
Reported-by: Clemens Ladisch <clemens@ladisch.de>
Reported-by: Peter Hurley <peter@hurleysoftware.com>
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the
settings[] "array subscript is above array bounds", I think because idx is
a signed integer and if the caller supplied idx < 0, we pass the guard but
still reference out of bounds.

Fix this by making idx unsigned here and elsewhere.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original documentation was very unclear.

The code fix is presumably related to the formerly unclear
documentation: SOCK_TIMESTAMPING_RX_SOFTWARE has no effect on
__sock_recv_timestamp's behavior, so calling __sock_recv_ts_and_drops
from sock_recv_ts_and_drops if only SOCK_TIMESTAMPING_RX_SOFTWARE is
set is pointless.  This should have no user-observable effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The unix socket code is using the result of csum_partial to
hash into a lookup table:

	unix_hash_fold(csum_partial(sunaddr, len, 0));

csum_partial is only guaranteed to produce something that can be
folded into a checksum, as its prototype explains:

 * returns a 32-bit number suitable for feeding into itself
 * or csum_tcpudp_magic

The 32bit value should not be used directly.

Depending on the alignment, the ppc64 csum_partial will return
different 32bit partial checksums that will fold into the same
16bit checksum.

This difference causes the following testcase (courtesy of
Gustavo) to sometimes fail:

#include <sys/socket.h>
#include <stdio.h>

int main()
{
	int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

	int i = 1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4);

	struct sockaddr addr;
	addr.sa_family = AF_LOCAL;
	bind(fd, &addr, 2);

	listen(fd, 128);

	struct sockaddr_storage ss;
	socklen_t sslen = (socklen_t)sizeof(ss);
	getsockname(fd, (struct sockaddr*)&ss, &sslen);

	fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

	if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
		perror(NULL);
		return 1;
	}
	printf("OK\n");
	return 0;
}

As suggested by davem, fix this by using csum_fold to fold the
partial 32bit checksum into a 16bit checksum before using it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The code to load a MAC address into a u64 for passing to the
hypervisor via a register is broken on little endian.

Create a helper function called ibmveth_encode_mac_addr
which does the right thing in both big and little endian.

We were storing the MAC address in a long in struct ibmveth_adapter.
It's never used so remove it - we don't need another place in the
driver where we create endian issues with MAC addresses.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS
offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds).

This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS
header.

The issue detected and the fix verified by running MS HCK Offload LSO test on
top of QEMU Windows guests, as this test sends IPv6 packets with
IPPROTO_DSTOPTS.

Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this, a bo may get created in the cpu-inaccessible vram.
Before the CP engines get setup, all copies are done via cpu memcpy.

This means that the cpu tries to read from inaccessible memory, fails,
and the radeon module proceeds to disable acceleration.

Doing this has no downsides, as the real VRAM size gets set as soon as the
CP engines get init.

This is a candidate for 3.14 fixes.

v2: Add comment on why the function is used

Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Cc: stable@vger.kernel.org
Moving the pm resume up in the init order to fix
dpm seems to have regressed somes cases with the old
pm code.  Move it back to late resume.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Building radeon_ttm.o on 32 bit x86 triggers a warning:
    In file included from include/asm-generic/bug.h:13:0,
                     from [...]/arch/x86/include/asm/bug.h:38,
                     from include/linux/bug.h:4,
                     from include/drm/drm_mm.h:39,
                     from include/drm/drm_vma_manager.h:26,
                     from include/drm/ttm/ttm_bo_api.h:35,
                     from drivers/gpu/drm/radeon/radeon_ttm.c:32:
    drivers/gpu/drm/radeon/radeon_ttm.c: In function 'radeon_ttm_gtt_read':
    include/linux/kernel.h:712:17: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      (void) (&_min1 == &_min2);  \
                     ^
    drivers/gpu/drm/radeon/radeon_ttm.c:938:22: note: in expansion of macro 'min'
       ssize_t cur_size = min(size, PAGE_SIZE - off);
                          ^

Silence this warning by using min_t(). Since cur_size will never be
negative and its upper bound is PAGE_SIZE, we can change its type to
size_t and use min_t(size_t, [...]) here.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Copy-paste typo.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Should be at 0x8 rather than 0.

fixes:
https://bugzilla.kernel.org/show_bug.cgi?id=60523

Noticed by ArtForz on #radeon

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
We didn't correctly check cases where the value for lp_interval is not
within the legal range due to a missing table terminator.

This would let userspace trigger a kernel panic by specifying a value out
of range:

	echo -1 > /sys/devices/virtual/net/bond0/bonding/lp_interval

Introduced by commit 4325b37 ("bonding: convert lp_interval to use
the new option API").

Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a regression introduced by [1]. outbox was accessed instead of
outbox->buf. Typo was copy-pasted to [2] and [3].

[1] - cc1ade9 mlx4_core: Disable memory windows for virtual functions
[2] - 4de6580 mlx4_core: Add support for steerable IB UD QPs
[3] - 7ffdf72 net/mlx4_core: Add basic support for TCP/IP offloads under
      tunneling

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
… PF is ready

Currently, the PF call to pci_enable_sriov from the PF probe function
stalls for 10 seconds times the number of VFs probed on the host. This
happens because the way for such VFs to determine of the PF
initialization finished, is by attempting to issue reset on the
comm-channel and get timeout (after 10s).

The PF probe function is called from a kenernel workqueue, and therefore
during that time, rcu lock is being held and kernel's workqueue is
stalled. This blocks other processes that try to use the workqueue
or rcu lock.  For example, interface renaming which is calling
rcu_synchronize is blocked, and timedout by systemd.

Changed mlx4_init_slave() to allow VF probed on the host to immediatly
detect that the PF is not ready, and return EPROBE_DEFER instantly.

Only when the PF finishes the initialization, allow such VFs to
access the comm channel.

This issue and fix are relevant only for probed VFs on the hypervisor,
there is no way to pass this information to a VM until comm channel is
ready, so in a VM, if PF is not ready, the first command will be timedout
after 10 seconds and return EPROBE_DEFER.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DST_NOCOUNT should only be used if an authorized user adds routes
locally. In case of routes which are added on behalf of router
advertisments this flag must not get used as it allows an unlimited
number of routes getting added remotely.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
…linux into drm-fixes

a few more radeon fixes.

* 'drm-fixes-3.14' of git://people.freedesktop.org/~agd5f/linux:
  drm/radeon/dpm: fix typo in EVERGREEN_SMC_FIRMWARE_HEADER_softRegisters
  drm/radeon/cik: fix typo in documentation
  drm/radeon: silence GCC warning on 32 bit
  drm/radeon: resume old pm late
  drm/radeon: TTM must be init with cpu-visible VRAM, v2
…nux-cubox into drm-fixes

fix for kfifo api change.
* 'drm-armada-fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-cubox:
  DRM: armada: fix use of kfifo_put()
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Add a maintainers entry for the TDA998x driver.  Rob Clark has handed
this driver over to me to look after.

Acked-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Dave Airlie <airlied@redhat.com>
We need to check for DVI vs. HDMI when setting up duallink since
HDMI is single link only.  Fixes 4k modes on newer asics.

bug:
https://bugs.freedesktop.org/show_bug.cgi?id=75223

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
…linux into drm-fixes

one more radeon fix.

* 'drm-fixes-3.14' of git://people.freedesktop.org/~agd5f/linux:
  drm/radeon/atom: select the proper number of lanes in transmitter setup
When we fork/clone we currently don't copy any of the TM state to the new
thread.  This results in a TM bad thing (program check) when the new process is
switched in as the kernel does a tmrechkpt with TEXASR FS not set.  Also, since
R1 is from userspace, we trigger the bad kernel stack pointer detection.  So we
end up with something like this:

   Bad kernel stack pointer 0 at c0000000000404fc
   cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
       pc: c0000000000404fc: restore_gprs+0xc0/0x148
       lr: 0000000000000000
       sp: 0
      msr: 9000000100201030
     current = 0xc000001dd1417c30
     paca    = 0xc00000000fe00800   softe: 0        irq_happened: 0x01
       pid   = 0, comm = swapper/2
   WARNING: exception is not recoverable, can't continue

The below fixes this by flushing the TM state before we copy the task_struct to
the clone.  To do this we go through the tmreclaim patch, which removes the
checkpointed registers from the CPU and transitions the CPU out of TM suspend
mode.  Hence we need to call tmrechkpt after to restore the checkpointed state
and the TM mode for the current task.

To make this fail from userspace is simply:
	tbegin
	li	r0, 2
	sc
	<boom>

Kudos to Adhemerval Zanella Neto for finding this.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The 64bit relocation code places a few symbols in the text segment.
These symbols are only 4 byte aligned where they need to be 8 byte
aligned. Add an explicit alignment.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
melver pushed a commit that referenced this pull request Jul 28, 2020
The following backtrace is seen when running aspeed G5 kernels.

WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_fb_helper.c:2233 drm_fbdev_generic_setup+0x138/0x198
aspeed_gfx 1e6e6000.display: Device has not been registered.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc3 #1
Hardware name: Generic DT based system
Backtrace:
[<8010d6d0>] (dump_backtrace) from [<8010d9b8>] (show_stack+0x20/0x24)
r7:00000009 r6:60000153 r5:00000000 r4:8119fa94
[<8010d998>] (show_stack) from [<80b8cb98>] (dump_stack+0xcc/0xec)
[<80b8cacc>] (dump_stack) from [<80123ef0>] (__warn+0xd8/0xfc)
r7:00000009 r6:80e62ed0 r5:00000000 r4:974c3ccc
[<80123e18>] (__warn) from [<80123f98>] (warn_slowpath_fmt+0x84/0xc4)
r9:00000009 r8:806a0140 r7:000008b9 r6:80e62ed0 r5:80e631f8 r4:974c2000
[<80123f18>] (warn_slowpath_fmt) from [<806a0140>] (drm_fbdev_generic_setup+0x138/0x198)
r9:00000001 r8:9758fc10 r7:9758fc00 r6:00000000 r5:00000020 r4:9768a000
[<806a0008>] (drm_fbdev_generic_setup) from [<806d4558>] (aspeed_gfx_probe+0x204/0x32c)
r7:9758fc00 r6:00000000 r5:00000000 r4:9768a000
[<806d4354>] (aspeed_gfx_probe) from [<806dfca0>] (platform_drv_probe+0x58/0xa8)

Since commit 1aed950 ("drm/fb-helper: Remove return value from
drm_fbdev_generic_setup()"), drm_fbdev_generic_setup() must be called
after drm_dev_register() to avoid the warning. Do that.

Fixes: 1aed950 ("drm/fb-helper: Remove return value from drm_fbdev_generic_setup()")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20200701001002.74997-1-linux@roeck-us.net
melver pushed a commit that referenced this pull request Jul 28, 2020
In helper function is_flow_rule_duplicate_allowed() verify that rpviv
pointer is not NULL before dereferencing it. This can happen when device is
in NIC mode and leads to following crash:

[90444.046419] BUG: kernel NULL pointer dereference, address: 0000000000000000
[90444.048149] #PF: supervisor read access in kernel mode
[90444.049781] #PF: error_code(0x0000) - not-present page
[90444.051386] PGD 80000003d35a4067 P4D 80000003d35a4067 PUD 3d35a3067 PMD 0
[90444.053051] Oops: 0000 [#1] SMP PTI
[90444.054683] CPU: 16 PID: 31736 Comm: tc Not tainted 5.8.0-rc1+ #1157
[90444.056340] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[90444.058079] RIP: 0010:mlx5e_configure_flower+0x3aa/0x9b0 [mlx5_core]
[90444.059753] Code: 24 50 49 8b 95 08 02 00 00 48 b8 00 08 00 00 04 00 00 00 48 21 c2 48 39 c2 74 0a 41 f6 85 0d 02 00 00 20 74 16 48 8b 44 24 20 <48> 8b 00 66 83 78 20 ff 74 07 4d 89 aa e0 00 00 00 48 83 7d 28 00
[90444.063232] RSP: 0018:ffffabe9c61ff768 EFLAGS: 00010246
[90444.065014] RAX: 0000000000000000 RBX: ffff9b13c4c91e80 RCX: 00000000000093fa
[90444.066784] RDX: 0000000400000800 RSI: 0000000000000000 RDI: 000000000002d5e0
[90444.068533] RBP: ffff9b174d308468 R08: 0000000000000000 R09: ffff9b17d63003f0
[90444.070285] R10: ffff9b17ea288600 R11: 0000000000000000 R12: ffffabe9c61ff878
[90444.072032] R13: ffff9b174d300000 R14: ffffabe9c61ffbb8 R15: ffff9b174d300880
[90444.073760] FS:  00007f3c23775480(0000) GS:ffff9b13efc80000(0000) knlGS:0000000000000000
[90444.075492] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[90444.077266] CR2: 0000000000000000 CR3: 00000003e2a60002 CR4: 00000000001606e0
[90444.079024] Call Trace:
[90444.080753]  tc_setup_cb_add+0xca/0x1e0
[90444.082415]  fl_hw_replace_filter+0x15f/0x1f0 [cls_flower]
[90444.084119]  fl_change+0xa59/0x13dc [cls_flower]
[90444.085772]  ? wait_for_completion+0xa8/0xf0
[90444.087364]  tc_new_tfilter+0x3f5/0xa60
[90444.088960]  rtnetlink_rcv_msg+0xeb/0x360
[90444.090514]  ? __d_lookup_done+0x76/0xe0
[90444.092034]  ? proc_alloc_inode+0x16/0x70
[90444.093560]  ? prep_new_page+0x8c/0xf0
[90444.095048]  ? _cond_resched+0x15/0x30
[90444.096483]  ? rtnl_calcit.isra.0+0x110/0x110
[90444.097907]  netlink_rcv_skb+0x49/0x110
[90444.099289]  netlink_unicast+0x191/0x230
[90444.100629]  netlink_sendmsg+0x243/0x480
[90444.101984]  sock_sendmsg+0x5e/0x60
[90444.103305]  ____sys_sendmsg+0x1f3/0x260
[90444.104597]  ? copy_msghdr_from_user+0x5c/0x90
[90444.105916]  ? __mod_lruvec_state+0x3c/0xe0
[90444.107210]  ___sys_sendmsg+0x81/0xc0
[90444.108484]  ? do_filp_open+0xa5/0x100
[90444.109732]  ? handle_mm_fault+0x117b/0x1e00
[90444.110970]  ? __check_object_size+0x46/0x147
[90444.112205]  ? __check_object_size+0x136/0x147
[90444.113402]  __sys_sendmsg+0x59/0xa0
[90444.114587]  do_syscall_64+0x4d/0x90
[90444.115782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[90444.116953] RIP: 0033:0x7f3c2393b7b8
[90444.118101] Code: Bad RIP value.
[90444.119240] RSP: 002b:00007ffc6ad8e6c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[90444.120408] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3c2393b7b8
[90444.121583] RDX: 0000000000000000 RSI: 00007ffc6ad8e740 RDI: 0000000000000003
[90444.122750] RBP: 000000005eea0c3a R08: 0000000000000001 R09: 00007ffc6ad8e68c
[90444.123928] R10: 0000000000404fa8 R11: 0000000000000246 R12: 0000000000000001
[90444.125073] R13: 0000000000000000 R14: 00007ffc6ad92a00 R15: 00000000004866a0
[90444.126221] Modules linked in: act_skbedit act_tunnel_key act_mirred bonding vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nfsv3 nfs_acl nfs lockd grace fscache tun bridge stp llc sunrpc rdma_ucm rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5_core intel_r
apl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel mlxfw kvm act_ct nf_flow_table nf_nat nf_conntrack irqbypass crct10dif_pclmul nf_defrag_ipv6 igb ipmi_ssif libcrc32c crc32_pclmul crc32c_intel ipmi_si nf_defrag_ipv4 ptp ghash_clmulni_intel mei_me ses iTCO_wdt i2c_i801 pps_core
ioatdma iTCO_vendor_support joydev mei enclosure intel_cstate i2c_smbus wmi dca ipmi_devintf intel_uncore lpc_ich ipmi_msghandler pcspkr acpi_pad acpi_power_meter ast i2c_algo_bit drm_vram_helper drm_kms_helper drm_ttm_helper ttm drm mpt3sas raid_class scsi_transport_sas
[90444.136253] CR2: 0000000000000000
[90444.137621] ---[ end trace 924af62aa2b151bd ]---

Fixes: 553f932 ("net/mlx5e: Support tc block sharing for representors")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
melver pushed a commit that referenced this pull request Jul 28, 2020
After function reload, CPU mapping used by aRFS RX is broken, leading to
a kernel panic. Fix by moving initialization of rx_cpu_rmap from
netdev_init to netdev_attach. IRQ table is re-allocated on mlx5_load,
but netdev is not re-initialize.

Trace of the panic:
[ 22.055672] general protection fault, probably for non-canonical address 0x785634120000ff1c: 0000 [#1] SMP PTI
[ 22.065010] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.7.0-rc2-for-upstream-perf-2020-04-21_16-34-03-31 #1
[ 22.067967] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 22.071174] RIP: 0010:get_rps_cpu+0x267/0x300
[ 22.075692] RSP: 0018:ffffc90000244d60 EFLAGS: 00010202
[ 22.076888] RAX: ffff888459b0e400 RBX: 0000000000000000 RCX:0000000000000007
[ 22.078364] RDX: 0000000000008884 RSI: ffff888467cb5b00 RDI:0000000000000000
[ 22.079815] RBP: 00000000ff342b27 R08: 0000000000000007 R09:0000000000000003
[ 22.081289] R10: ffffffffffffffff R11: 00000000000070cc R12:ffff888454900000
[ 22.082767] R13: ffffc90000e5a950 R14: ffffc90000244dc0 R15:0000000000000007
[ 22.084190] FS: 0000000000000000(0000) GS:ffff88846fc80000(0000)knlGS:0000000000000000
[ 22.086161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.087427] CR2: ffffffffffffffff CR3: 0000000464426003 CR4:0000000000760ee0
[ 22.088888] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[ 22.090336] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[ 22.091764] PKRU: 55555554
[ 22.092618] Call Trace:
[ 22.093442] <IRQ>
[ 22.094211] ? kvm_clock_get_cycles+0xd/0x10
[ 22.095272] netif_receive_skb_list_internal+0x258/0x2a0
[ 22.096460] gro_normal_list.part.137+0x19/0x40
[ 22.097547] napi_complete_done+0xc6/0x110
[ 22.098685] mlx5e_napi_poll+0x190/0x670 [mlx5_core]
[ 22.099859] net_rx_action+0x2a0/0x400
[ 22.100848] __do_softirq+0xd8/0x2a8
[ 22.101829] irq_exit+0xa5/0xb0
[ 22.102750] do_IRQ+0x52/0xd0
[ 22.103654] common_interrupt+0xf/0xf
[ 22.104641] </IRQ>

Fixes: 4383cfc ("net/mlx5: Add devlink reload")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
melver pushed a commit that referenced this pull request Jul 28, 2020
we need to set 'active_vfs' back to 0, if something goes wrong during the
allocation of SR-IOV resources: otherwise, further VF configurations will
wrongly assume that bp->pf.vf[x] are valid memory locations, and commands
like the ones in the following sequence:

 # echo 2 >/sys/bus/pci/devices/${ADDR}/sriov_numvfs
 # ip link set dev ens1f0np0 up
 # ip link set dev ens1f0np0 vf 0 trust on

will cause a kernel crash similar to this:

 bnxt_en 0000:3b:00.0: not enough MMIO resources for SR-IOV
 BUG: kernel NULL pointer dereference, address: 0000000000000014
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 43 PID: 2059 Comm: ip Tainted: G          I       5.8.0-rc2.upstream+ #871
 Hardware name: Dell Inc. PowerEdge R740/08D89F, BIOS 2.2.11 06/13/2019
 RIP: 0010:bnxt_set_vf_trust+0x5b/0x110 [bnxt_en]
 Code: 44 24 58 31 c0 e8 f5 fb ff ff 85 c0 0f 85 b6 00 00 00 48 8d 1c 5b 41 89 c6 b9 0b 00 00 00 48 c1 e3 04 49 03 9c 24 f0 0e 00 00 <8b> 43 14 89 c2 83 c8 10 83 e2 ef 45 84 ed 49 89 e5 0f 44 c2 4c 89
 RSP: 0018:ffffac6246a1f570 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000000b
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff98b28f538900
 RBP: ffff98b28f538900 R08: 0000000000000000 R09: 0000000000000008
 R10: ffffffffb9515be0 R11: ffffac6246a1f678 R12: ffff98b28f538000
 R13: 0000000000000001 R14: 0000000000000000 R15: ffffffffc05451e0
 FS:  00007fde0f688800(0000) GS:ffff98baffd40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000014 CR3: 000000104bb0a003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  do_setlink+0x994/0xfe0
  __rtnl_newlink+0x544/0x8d0
  rtnl_newlink+0x47/0x70
  rtnetlink_rcv_msg+0x29f/0x350
  netlink_rcv_skb+0x4a/0x110
  netlink_unicast+0x21d/0x300
  netlink_sendmsg+0x329/0x450
  sock_sendmsg+0x5b/0x60
  ____sys_sendmsg+0x204/0x280
  ___sys_sendmsg+0x88/0xd0
  __sys_sendmsg+0x5e/0xa0
  do_syscall_64+0x47/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: c0c050c ("bnxt_en: New Broadcom ethernet driver.")
Reported-by: Fei Liu <feliu@redhat.com>
CC: Jonathan Toppins <jtoppins@redhat.com>
CC: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
melver pushed a commit that referenced this pull request Jul 28, 2020
Ido Schimmel says:

====================
mlxsw: Various fixes

Fix two issues found by syzkaller.

Patch #1 removes inappropriate usage of WARN_ON() following memory
allocation failure. Constantly triggered when syzkaller injects faults.

Patch #2 fixes a use-after-free that can be triggered by 'devlink dev
info' following a failed devlink reload.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
melver pushed a commit that referenced this pull request Jul 28, 2020
…evices

Since commit a85a6c8 ("driver core: platform: Clarify that IRQ 0 is
invalid"), the kernel is a bit touchy when it encounters interrupt 0.
As a result, there are lots of warnings such as the following when booting
systems such as 'kzm'.

WARNING: CPU: 0 PID: 1 at drivers/base/platform.c:224 platform_get_irq_optional+0x118/0x128
0 is an invalid IRQ number
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc3 #1
Hardware name: Kyoto Microcomputer Co., Ltd. KZM-ARM11-01
[<c01127d4>] (unwind_backtrace) from [<c010c620>] (show_stack+0x10/0x14)
[<c010c620>] (show_stack) from [<c06f5f54>] (dump_stack+0xe8/0x120)
[<c06f5f54>] (dump_stack) from [<c0128878>] (__warn+0xe4/0x108)
[<c0128878>] (__warn) from [<c0128910>] (warn_slowpath_fmt+0x74/0xbc)
[<c0128910>] (warn_slowpath_fmt) from [<c08b8e84>] (platform_get_irq_optional+0x118/0x128)
[<c08b8e84>] (platform_get_irq_optional) from [<c08b8eb4>] (platform_irq_count+0x20/0x3c)
[<c08b8eb4>] (platform_irq_count) from [<c0728660>] (mxc_gpio_probe+0x8c/0x494)
[<c0728660>] (mxc_gpio_probe) from [<c08b93cc>] (platform_drv_probe+0x48/0x98)
[<c08b93cc>] (platform_drv_probe) from [<c08b703c>] (really_probe+0x214/0x344)
[<c08b703c>] (really_probe) from [<c08b7274>] (driver_probe_device+0x58/0xb4)
[<c08b7274>] (driver_probe_device) from [<c08b7478>] (device_driver_attach+0x58/0x60)
[<c08b7478>] (device_driver_attach) from [<c08b7504>] (__driver_attach+0x84/0xc0)
[<c08b7504>] (__driver_attach) from [<c08b50f8>] (bus_for_each_dev+0x78/0xb8)
[<c08b50f8>] (bus_for_each_dev) from [<c08b62cc>] (bus_add_driver+0x154/0x1e0)
[<c08b62cc>] (bus_add_driver) from [<c08b82b8>] (driver_register+0x74/0x108)
[<c08b82b8>] (driver_register) from [<c0102320>] (do_one_initcall+0x80/0x3b4)
[<c0102320>] (do_one_initcall) from [<c1501008>] (kernel_init_freeable+0x170/0x208)
[<c1501008>] (kernel_init_freeable) from [<c0e178d4>] (kernel_init+0x8/0x11c)
[<c0e178d4>] (kernel_init) from [<c0100134>] (ret_from_fork+0x14/0x20)

As it turns out, mxc_register_gpio() is a bit lax when setting the
number of resources: it registers a resource with interrupt 0 when in
reality there is no such interrupt. Fix the problem by not declaring
the second interrupt resource if there is no second interrupt.

Fixes: a85a6c8 ("driver core: platform: Clarify that IRQ 0 is invalid")
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
melver pushed a commit that referenced this pull request Jul 28, 2020
Early secure guest boot hits the below crash while booting with
vcpus numbers aligned with page boundary for PAGE size of 64k
and LPPACA size of 1k i.e 64, 128 etc.

  Partition configured for 64 cpus.
  CPU maps initialized for 1 thread per core
  ------------[ cut here ]------------
  kernel BUG at arch/powerpc/kernel/paca.c:89!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

This is due to the BUG_ON() for shared_lppaca_total_size equal to
shared_lppaca_size. Instead the code should only BUG_ON() if we have
exceeded the total_size, which indicates we've overflowed the array.

Fixes: bd104e6 ("powerpc/pseries/svm: Use shared memory for LPPACA structures")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
[mpe: Reword change log to clarify we're fixing not removing the check]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200619070113.16696-1-sathnaga@linux.vnet.ibm.com
melver pushed a commit that referenced this pull request Jul 28, 2020
This reverts commit eb9aecd

The above patch is supposed to fix a register index error on mt2701. It
is not clear if the problem solved is a hang or just an invalid value
returned, my guess is the second. The patch introduces, though, a new
hang on MT8173 device making them unusable. So, seems reasonable, revert
the patch because introduces a worst issue.

The reason I send a revert instead of trying to fix the issue for MT8173
is because the information needed to fix the issue is in the datasheet
and is not public. So I am not really able to fix it.

Fixes the following bug when CONFIG_MTK_THERMAL is set on MT8173
devices.

[    2.222488] Unable to handle kernel paging request at virtual address ffff8000125f5001
[    2.230421] Mem abort info:
[    2.233207]   ESR = 0x96000021
[    2.236261]   EC = 0x25: DABT (current EL), IL = 32 bits
[    2.241571]   SET = 0, FnV = 0
[    2.244623]   EA = 0, S1PTW = 0
[    2.247762] Data abort info:
[    2.250640]   ISV = 0, ISS = 0x00000021
[    2.254473]   CM = 0, WnR = 0
[    2.257544] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041850000
[    2.264251] [ffff8000125f5001] pgd=000000013ffff003, pud=000000013fffe003, pmd=000000013fff9003, pte=006800001100b707
[    2.274867] Internal error: Oops: 96000021 [#1] PREEMPT SMP
[    2.280432] Modules linked in:
[    2.283483] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc6+ #162
[    2.289914] Hardware name: Google Elm (DT)
[    2.294003] pstate: 20000005 (nzCv daif -PAN -UAO)
[    2.298792] pc : mtk_read_temp+0xb8/0x1c8
[    2.302793] lr : mtk_read_temp+0x7c/0x1c8
[    2.306794] sp : ffff80001003b930
[    2.310100] x29: ffff80001003b930 x28: 0000000000000000
[    2.315404] x27: 0000000000000002 x26: ffff0000f9550b10
[    2.320709] x25: ffff0000f9550a80 x24: 0000000000000090
[    2.326014] x23: ffff80001003ba24 x22: 00000000610344c0
[    2.331318] x21: 0000000000002710 x20: 00000000000001f4
[    2.336622] x19: 0000000000030d40 x18: ffff800011742ec0
[    2.341926] x17: 0000000000000001 x16: 0000000000000001
[    2.347230] x15: ffffffffffffffff x14: ffffff0000000000
[    2.352535] x13: ffffffffffffffff x12: 0000000000000028
[    2.357839] x11: 0000000000000003 x10: ffff800011295ec8
[    2.363143] x9 : 000000000000291b x8 : 0000000000000002
[    2.368447] x7 : 00000000000000a8 x6 : 0000000000000004
[    2.373751] x5 : 0000000000000000 x4 : ffff800011295cb0
[    2.379055] x3 : 0000000000000002 x2 : ffff8000125f5001
[    2.384359] x1 : 0000000000000001 x0 : ffff0000f9550a80
[    2.389665] Call trace:
[    2.392105]  mtk_read_temp+0xb8/0x1c8
[    2.395760]  of_thermal_get_temp+0x2c/0x40
[    2.399849]  thermal_zone_get_temp+0x78/0x160
[    2.404198]  thermal_zone_device_update.part.0+0x3c/0x1f8
[    2.409589]  thermal_zone_device_update+0x34/0x48
[    2.414286]  of_thermal_set_mode+0x58/0x88
[    2.418375]  thermal_zone_of_sensor_register+0x1a8/0x1d8
[    2.423679]  devm_thermal_zone_of_sensor_register+0x64/0xb0
[    2.429242]  mtk_thermal_probe+0x690/0x7d0
[    2.433333]  platform_drv_probe+0x5c/0xb0
[    2.437335]  really_probe+0xe4/0x448
[    2.440901]  driver_probe_device+0xe8/0x140
[    2.445077]  device_driver_attach+0x7c/0x88
[    2.449252]  __driver_attach+0xac/0x178
[    2.453082]  bus_for_each_dev+0x78/0xc8
[    2.456909]  driver_attach+0x2c/0x38
[    2.460476]  bus_add_driver+0x14c/0x230
[    2.464304]  driver_register+0x6c/0x128
[    2.468131]  __platform_driver_register+0x50/0x60
[    2.472831]  mtk_thermal_driver_init+0x24/0x30
[    2.477268]  do_one_initcall+0x50/0x298
[    2.481098]  kernel_init_freeable+0x1ec/0x264
[    2.485450]  kernel_init+0x1c/0x110
[    2.488931]  ret_from_fork+0x10/0x1c
[    2.492502] Code: f9401081 f9400402 b8a67821 8b010042 (b9400042)
[    2.498599] ---[ end trace e43e3105ed27dc99 ]---
[    2.503367] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.511020] SMP: stopping secondary CPUs
[    2.514941] Kernel Offset: disabled
[    2.518421] CPU features: 0x090002,25006005
[    2.522595] Memory Limit: none
[    2.525644] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--

Cc: Michael Kao <michael.kao@mediatek.com>
Fixes: eb9aecd ("thermal: mediatek: fix register index error")
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20200707103412.1010823-1-enric.balletbo@collabora.com
melver pushed a commit that referenced this pull request Jul 28, 2020
Although mmiowb() is concerned only with serialising MMIO writes occuring
in contexts where a spinlock is held, the call to mmiowb_set_pending()
from the MMIO write accessors can occur in preemptible contexts, such
as during driver probe() functions where ordering between CPUs is not
usually a concern, assuming that the task migration path provides the
necessary ordering guarantees.

Unfortunately, the default implementation of mmiowb_set_pending() is not
preempt-safe, as it makes use of a a per-cpu variable to track its
internal state. This has been reported to generate the following splat
on riscv:

 | BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
 | caller is regmap_mmio_write32le+0x1c/0x46
 | CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc3-hfu+ #1
 | Call Trace:
 |  walk_stackframe+0x0/0x7a
 |  dump_stack+0x6e/0x88
 |  regmap_mmio_write32le+0x18/0x46
 |  check_preemption_disabled+0xa4/0xaa
 |  regmap_mmio_write32le+0x18/0x46
 |  regmap_mmio_write+0x26/0x44
 |  regmap_write+0x28/0x48
 |  sifive_gpio_probe+0xc0/0x1da

Although it's possible to fix the driver in this case, other splats have
been seen from other drivers, including the infamous 8250 UART, and so
it's better to address this problem in the mmiowb core itself.

Fix mmiowb_set_pending() by using the raw_cpu_ptr() to get at the mmiowb
state and then only updating the 'mmiowb_pending' field if we are not
preemptible (i.e. we have a non-zero nesting count).

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Palmer Dabbelt <palmer@dabbelt.com>
Reported-by: Emil Renner Berthing <kernel@esmil.dk>
Tested-by: Emil Renner Berthing <kernel@esmil.dk>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Link: https://lore.kernel.org/r/20200716112816.7356-1-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
When doing a fallocate() we have a short time window, after reserving an
extent and before starting a transaction, where if relocation for the block
group containing the reserved extent happens, we can end up missing the
extent in the data relocation inode causing relocation to fail later.

This only happens when we don't pass a transaction to the internal
fallocate function __btrfs_prealloc_file_range(), which is for all the
cases where fallocate() is called from user space (the internal use cases
include space cache extent allocation and relocation).

When the race triggers the relocation failure, it produces a trace like
the following:

  [200611.995995] ------------[ cut here ]------------
  [200611.997084] BTRFS: Transaction aborted (error -2)
  [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs]
  [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...)
  [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs]
  [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...)
  [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction.
  [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286
  [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem
  [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000
  [200612.011078] BTRFS info (device sdb): forced readonly
  [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff
  [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000
  [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88
  [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08
  [200612.016182] FS:  00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000
  [200612.017174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0
  [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [200612.021520] Call Trace:
  [200612.022434]  btrfs_cow_block+0x10b/0x250 [btrfs]
  [200612.023407]  do_relocation+0x54e/0x7b0 [btrfs]
  [200612.024343]  ? do_raw_spin_unlock+0x4b/0xc0
  [200612.025280]  ? _raw_spin_unlock+0x29/0x40
  [200612.026200]  relocate_tree_blocks+0x3bc/0x6d0 [btrfs]
  [200612.027088]  relocate_block_group+0x2f3/0x600 [btrfs]
  [200612.027961]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
  [200612.028896]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
  [200612.029772]  btrfs_balance+0xb22/0x1790 [btrfs]
  [200612.030601]  ? btrfs_ioctl_balance+0x253/0x380 [btrfs]
  [200612.031414]  btrfs_ioctl_balance+0x2cf/0x380 [btrfs]
  [200612.032279]  btrfs_ioctl+0x620/0x36f0 [btrfs]
  [200612.033077]  ? _raw_spin_unlock+0x29/0x40
  [200612.033948]  ? handle_mm_fault+0x116d/0x1ca0
  [200612.034749]  ? up_read+0x18/0x240
  [200612.035542]  ? __x64_sys_ioctl+0x83/0xb0
  [200612.036244]  __x64_sys_ioctl+0x83/0xb0
  [200612.037269]  do_syscall_64+0x33/0x80
  [200612.038190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [200612.038976] RIP: 0033:0x7fa55d07ed87
  [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...)
  [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87
  [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003
  [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0
  [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614
  [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470
  [200612.046361] irq event stamp: 0
  [200612.047040] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
  [200612.048387] softirqs last  enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
  [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [200612.049722] ---[ end trace 49006c6876e65227 ]---

The race happens like this:

1) Task A starts an fallocate() (plain or zero range) and it calls
   __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL;

2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to
   block group X;

3) Before task A gets into btrfs_replace_file_extents(), through the call
   to insert_prealloc_file_extent(), task B starts relocation of block
   group X;

4) Task B enters btrfs_relocate_block_group() and it sets block group X to
   RO mode;

5) Task B enters relocate_block_group(), it calls prepare_to_relocate()
   whichs joins/starts a transaction and then commits the transaction;

6) Task B then starts scanning the extent tree looking for extents that
   belong to block group X - it does not find yet the extent reserved by
   task A, since that extent was not yet added to the extent tree, as its
   delayed reference was not even yet created at this point;

7) The data relocation inode ends up not having the extent reserved by
   task A associated to it;

8) Task A then starts a transaction through btrfs_replace_file_extents(),
   inserts a file extent item in the subvolume tree pointing to the
   reserved extent and creates a delayed reference for it;

9) Task A finishes and returns success to user space;

10) Later on, while relocation is still in progress, the leaf where task A
    inserted the new file extent item is COWed, so we end up at
    __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in
    turn calls relocation.c:replace_file_extents();

11) At relocation.c:replace_file_extents() we iterate over all the items in
    the leaf and find the file extent item pointing to the extent that was
    allocated by task A, and then call relocation.c:get_new_location(), to
    find the new location for the extent;

12) However relocation.c:get_new_location() fails, returning -ENOENT,
    because it couldn't find a corresponding file extent item associated
    with the data relocation inode. This is because the extent was not seen
    in the extent tree at step 6). The -ENOENT error is propagated to
    __btrfs_cow_block(), which aborts the transaction.

So fix this simply by decrementing the block group's number of reservations
after calling insert_prealloc_file_extent(), as relocation waits for that
counter to go down to zero before calling prepare_to_relocate() and start
looking for extents in the extent tree.

This issue only started to happen recently as of commit 8fccebf
("btrfs: fix metadata reservation for fallocate that leads to transaction
aborts"), because now we can reserve an extent before starting/joining a
transaction, and previously we always did it after that, so relocation
ended up waiting for a concurrent fallocate() to finish because before
searching for the extents of the block group, it starts/joins a transaction
and then commits it (at prepare_to_relocate()), which made it wait for the
fallocate task to complete first.

Fixes: 8fccebf ("btrfs: fix metadata reservation for fallocate that leads to transaction aborts")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.

Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.

Fixes: 03f45c8 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
Dave reported a problem with my rwsem conversion patch where we got the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1297 Not tainted
  ------------------------------------------------------
  kswapd0/76 is trying to acquire lock:
  ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

  but task is already holding lock:
  ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (fs_reclaim){+.+.}-{0:0}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 fs_reclaim_acquire.part.0+0x25/0x30
	 kmem_cache_alloc+0x30/0x9c0
	 alloc_inode+0x81/0x90
	 iget_locked+0xcd/0x1a0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x210
	 sysfs_get_tree+0x1a/0x50
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (kernfs_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_dir_ns+0x58/0x80
	 sysfs_create_dir_ns+0x70/0xd0
	 kobject_add_internal+0xbb/0x2d0
	 kobject_add+0x7a/0xd0
	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
	 open_ctree+0x981/0x1108 [btrfs]
	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (btrfs-extent-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 check_committed_ref+0x69/0x200 [btrfs]
	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
	 writepage_delalloc+0xae/0x160 [btrfs]
	 __extent_writepage+0x262/0x420 [btrfs]
	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
	 extent_writepages+0x43/0x90 [btrfs]
	 do_writepages+0x40/0xe0
	 __writeback_single_inode+0x62/0x610
	 writeback_sb_inodes+0x20f/0x500
	 wb_writeback+0xef/0x4a0
	 wb_do_writeback+0x49/0x2e0
	 wb_workfn+0x81/0x340
	 process_one_work+0x233/0x5d0
	 worker_thread+0x50/0x3b0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-fs-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
	 btrfs_mksubvol+0x2db/0x470 [btrfs]
	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
	 __x64_sys_ioctl+0x6f/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 check_prev_add+0x91/0xc60
	 validate_chain+0xa6e/0x2a20
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
	 evict+0xd6/0x1c0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x54/0x80
	 super_cache_scan+0x121/0x1a0
	 do_shrink_slab+0x16d/0x3b0
	 shrink_slab+0xb1/0x2e0
	 shrink_node+0x230/0x6a0
	 balance_pgdat+0x325/0x750
	 kswapd+0x206/0x4d0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/76:
   #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
   #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xff/0x110
   ? save_trace+0x50/0x470
   check_prev_add+0x91/0xc60
   validate_chain+0xa6e/0x2a20
   ? save_trace+0x50/0x470
   __lock_acquire+0x582/0xac0
   lock_acquire+0xca/0x430
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __mutex_lock+0xa0/0xaf0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? __lock_acquire+0x582/0xac0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? btrfs_evict_inode+0x30b/0x560 [btrfs]
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   btrfs_evict_inode+0x3cc/0x560 [btrfs]
   evict+0xd6/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x121/0x1a0
   do_shrink_slab+0x16d/0x3b0
   shrink_slab+0xb1/0x2e0
   shrink_node+0x230/0x6a0
   balance_pgdat+0x325/0x750
   kswapd+0x206/0x4d0
   ? finish_wait+0x90/0x90
   ? balance_pgdat+0x750/0x750
   kthread+0x137/0x150
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x1f/0x30

This happens because we are still holding the path open when we start
adding the sysfs files for the block groups, which creates a dependency
on fs_reclaim via the tree lock.  Fix this by dropping the path before
we start doing anything with sysfs.

Reported-by: David Sterba <dsterba@suse.com>
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
If we fail to find suitable zones for a new readahead extent, we end up
leaving a stale pointer in the global readahead extents radix tree
(fs_info->reada_tree), which can trigger the following trace later on:

  [13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
  [13367.696802] #PF: supervisor read access in kernel mode
  [13367.697249] #PF: error_code(0x0000) - not-present page
  [13367.697721] PGD 0 P4D 0
  [13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G        W         5.9.0-rc6-btrfs-next-69 #1
  [13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
  [13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
  [13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
  [13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
  [13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
  [13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
  [13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
  [13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
  [13367.704880] FS:  00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
  [13367.705438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
  [13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [13367.707686] Call Trace:
  [13367.708246]  ? ___slab_alloc+0x395/0x740
  [13367.708820]  ? reada_add_block+0xae/0xee0 [btrfs]
  [13367.709383]  lock_acquire+0xb1/0x480
  [13367.709955]  ? reada_add_block+0xe0/0xee0 [btrfs]
  [13367.710537]  ? reada_add_block+0xae/0xee0 [btrfs]
  [13367.711097]  ? rcu_read_lock_sched_held+0x5d/0x90
  [13367.711659]  ? kmem_cache_alloc_trace+0x8d2/0x990
  [13367.712221]  ? lock_acquired+0x33b/0x470
  [13367.712784]  _raw_spin_lock+0x34/0x80
  [13367.713356]  ? reada_add_block+0xe0/0xee0 [btrfs]
  [13367.713966]  reada_add_block+0xe0/0xee0 [btrfs]
  [13367.714529]  ? btrfs_root_node+0x15/0x1f0 [btrfs]
  [13367.715077]  btrfs_reada_add+0x117/0x170 [btrfs]
  [13367.715620]  scrub_stripe+0x21e/0x10d0 [btrfs]
  [13367.716141]  ? kvm_sched_clock_read+0x5/0x10
  [13367.716657]  ? __lock_acquire+0x41e/0x3970
  [13367.717184]  ? scrub_chunk+0x60/0x140 [btrfs]
  [13367.717697]  ? find_held_lock+0x32/0x90
  [13367.718254]  ? scrub_chunk+0x60/0x140 [btrfs]
  [13367.718773]  ? lock_acquired+0x33b/0x470
  [13367.719278]  ? scrub_chunk+0xcd/0x140 [btrfs]
  [13367.719786]  scrub_chunk+0xcd/0x140 [btrfs]
  [13367.720291]  scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
  [13367.720787]  ? finish_wait+0x90/0x90
  [13367.721281]  btrfs_scrub_dev+0x1ee/0x620 [btrfs]
  [13367.721762]  ? rcu_read_lock_any_held+0x8e/0xb0
  [13367.722235]  ? preempt_count_add+0x49/0xa0
  [13367.722710]  ? __sb_start_write+0x19b/0x290
  [13367.723192]  btrfs_ioctl+0x7f5/0x36f0 [btrfs]
  [13367.723660]  ? __fget_files+0x101/0x1d0
  [13367.724118]  ? find_held_lock+0x32/0x90
  [13367.724559]  ? __fget_files+0x101/0x1d0
  [13367.724982]  ? __x64_sys_ioctl+0x83/0xb0
  [13367.725399]  __x64_sys_ioctl+0x83/0xb0
  [13367.725802]  do_syscall_64+0x33/0x80
  [13367.726188]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [13367.726574] RIP: 0033:0x7fe1df7add87
  [13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
  [13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
  [13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
  [13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
  [13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
  [13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
  [13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
  [13367.732638] CR2: 00000000000000b0
  [13367.733166] ---[ end trace d298b6805556acd9 ]---

What happens is the following:

1) At reada_find_extent() we don't find any existing readahead extent for
   the metadata extent starting at logical address X;

2) So we proceed to create a new one. We then call btrfs_map_block() to get
   information about which stripes contain extent X;

3) After that we iterate over the stripes and create only one zone for the
   readahead extent - only one because reada_find_zone() returned NULL for
   all iterations except for one, either because a memory allocation failed
   or it couldn't find the block group of the extent (it may have just been
   deleted);

4) We then add the new readahead extent to the readahead extents radix
   tree at fs_info->reada_tree;

5) Then we iterate over each zone of the new readahead extent, and find
   that the device used for that zone no longer exists, because it was
   removed or it was the source device of a device replace operation.
   Since this left 'have_zone' set to 0, after finishing the loop we jump
   to the 'error' label, call kfree() on the new readahead extent and
   return without removing it from the radix tree at fs_info->reada_tree;

6) Any future call to reada_find_extent() for the logical address X will
   find the stale pointer in the readahead extents radix tree, increment
   its reference counter, which can trigger the use-after-free right
   away or return it to the caller reada_add_block() that results in the
   use-after-free of the example trace above.

So fix this by making sure we delete the readahead extent from the radix
tree if we fail to setup zones for it (when 'have_zone = 0').

Fixes: 3194502 ("btrfs: reada: bypass adding extent when all zone failed")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

  [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
  [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
  [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
  [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
  [162513.514751] Call Trace:
  [162513.514761]  __schedule+0x5ce/0xd00
  [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.514771]  schedule+0x46/0xf0
  [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.514850]  ? finish_wait+0x90/0x90
  [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
  [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
  [162513.514894]  kthread+0x153/0x170
  [162513.514897]  ? kthread_stop+0x2c0/0x2c0
  [162513.514902]  ret_from_fork+0x22/0x30
  [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
  [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
  [162513.515682] Call Trace:
  [162513.515688]  __schedule+0x5ce/0xd00
  [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.515697]  schedule+0x46/0xf0
  [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.515716]  ? finish_wait+0x90/0x90
  [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.515761]  iterate_supers+0x87/0xf0
  [162513.515765]  ksys_sync+0x60/0xb0
  [162513.515768]  __do_sys_sync+0xa/0x10
  [162513.515771]  do_syscall_64+0x33/0x80
  [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.515781] RIP: 0033:0x7f5238f50bd7
  [162513.515782] Code: Bad RIP value.
  [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
  [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
  [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
  [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
  [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
  [162513.516620] Call Trace:
  [162513.516625]  __schedule+0x5ce/0xd00
  [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.516634]  schedule+0x46/0xf0
  [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.516650]  ? finish_wait+0x90/0x90
  [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
  [162513.516686]  __vfs_setxattr+0x66/0x80
  [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
  [162513.516697]  vfs_setxattr+0x6b/0x120
  [162513.516703]  setxattr+0x125/0x240
  [162513.516709]  ? lock_acquire+0xb1/0x480
  [162513.516712]  ? mnt_want_write+0x20/0x50
  [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
  [162513.516723]  ? preempt_count_add+0x49/0xa0
  [162513.516725]  ? __sb_start_write+0x19b/0x290
  [162513.516727]  ? preempt_count_add+0x49/0xa0
  [162513.516732]  path_setxattr+0xba/0xd0
  [162513.516739]  __x64_sys_setxattr+0x27/0x30
  [162513.516741]  do_syscall_64+0x33/0x80
  [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.516745] RIP: 0033:0x7f5238f56d5a
  [162513.516746] Code: Bad RIP value.
  [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
  [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
  [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
  [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
  [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
  [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
  [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
  [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
  [162513.517780] Call Trace:
  [162513.517786]  __schedule+0x5ce/0xd00
  [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.517796]  schedule+0x46/0xf0
  [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.517814]  ? finish_wait+0x90/0x90
  [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.517865]  iterate_supers+0x87/0xf0
  [162513.517869]  ksys_sync+0x60/0xb0
  [162513.517872]  __do_sys_sync+0xa/0x10
  [162513.517875]  do_syscall_64+0x33/0x80
  [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.517881] RIP: 0033:0x7f5238f50bd7
  [162513.517883] Code: Bad RIP value.
  [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
  [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
  [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
  [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
  [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
  [162513.519160] Call Trace:
  [162513.519165]  __schedule+0x5ce/0xd00
  [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.519174]  schedule+0x46/0xf0
  [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.519193]  ? finish_wait+0x90/0x90
  [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
  [162513.519230]  lookup_open+0x522/0x650
  [162513.519246]  path_openat+0x2b8/0xa50
  [162513.519270]  do_filp_open+0x91/0x100
  [162513.519275]  ? find_held_lock+0x32/0x90
  [162513.519280]  ? lock_acquired+0x33b/0x470
  [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
  [162513.519287]  ? _raw_spin_unlock+0x29/0x40
  [162513.519295]  do_sys_openat2+0x20d/0x2d0
  [162513.519300]  do_sys_open+0x44/0x80
  [162513.519304]  do_syscall_64+0x33/0x80
  [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.519309] RIP: 0033:0x7f5238f4a903
  [162513.519310] Code: Bad RIP value.
  [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
  [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
  [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
  [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
  [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
  [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
  [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
  [162513.520511] Call Trace:
  [162513.520516]  __schedule+0x5ce/0xd00
  [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.520525]  schedule+0x46/0xf0
  [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
  [162513.520548]  ? finish_wait+0x90/0x90
  [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
  [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
  [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
  [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
  [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
  [162513.520643]  ? do_sigaction+0xf3/0x240
  [162513.520645]  ? find_held_lock+0x32/0x90
  [162513.520648]  ? do_sigaction+0xf3/0x240
  [162513.520651]  ? lock_acquired+0x33b/0x470
  [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
  [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
  [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
  [162513.520662]  ? do_sigaction+0xf3/0x240
  [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
  [162513.520672]  __x64_sys_ioctl+0x83/0xb0
  [162513.520677]  do_syscall_64+0x33/0x80
  [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.520681] RIP: 0033:0x7fc3cd307d87
  [162513.520682] Code: Bad RIP value.
  [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
  [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
  [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
  [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
  [162513.520703]
		  Showing all locks held in the system:
  [162513.520712] 1 lock held by khungtaskd/54:
  [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
  [162513.520728] 1 lock held by in:imklog/596:
  [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
  [162513.520782] 1 lock held by btrfs-transacti/1356167:
  [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
  [162513.520798] 1 lock held by btrfs/1356190:
  [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
  [162513.520805] 1 lock held by fsstress/1356184:
  [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520811] 3 locks held by fsstress/1356185:
  [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
  [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520833] 1 lock held by fsstress/1356196:
  [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520838] 3 locks held by fsstress/1356197:
  [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
  [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520858] 2 locks held by btrfs/1356211:
  [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
  [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

  >>> t = find_task(prog, 1356190)
  >>> prog.stack_trace(t)
  #0  __schedule+0x5ce/0xcfc
  #1  schedule+0x46/0xe4
  #2  schedule_timeout+0x1df/0x475
  #3  btrfs_reada_wait+0xda/0x132
  #4  scrub_stripe+0x2a8/0x112f
  #5  scrub_chunk+0xcd/0x134
  #6  scrub_enumerate_chunks+0x29e/0x5ee
  #7  btrfs_scrub_dev+0x2d5/0x91b
  #8  btrfs_ioctl+0x7f5/0x36e7
  #9  __x64_sys_ioctl+0x83/0xb0
  #10 do_syscall_64+0x33/0x77
  #11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
        if (!atomic_read(&fs_info->reada_works_cnt))
            reada_start_machine(fs_info);
        wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                          (HZ + 9) / 10);
    }
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

  $ cat dump_reada.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  def dump_re(re):
      nzones = re.nzones.value_()
      print(f're at {hex(re.value_())}')
      print(f'\t logical {re.logical.value_()}')
      print(f'\t refcnt {re.refcnt.value_()}')
      print(f'\t nzones {nzones}')
      for i in range(nzones):
          dev = re.zones[i].device
          name = dev.name.str.string_()
          print(f'\t\t dev id {dev.devid.value_()} name {name}')
      print()

  for _, e in radix_tree_for_each(fs_info.reada_tree):
      re = cast('struct reada_extent *', e)
      dump_re(re)

  $ drgn dump_reada.py
  re at 0xffff8f3da9d25ad8
          logical 38928384
          refcnt 1
          nzones 1
                 dev id 0 name b'/dev/sdd'
  $

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

  $ cat dump_block_groups.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  BTRFS_BLOCK_GROUP_DATA = (1 << 0)
  BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
  BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
  BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
  BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
  BTRFS_BLOCK_GROUP_DUP = (1 << 5)
  BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
  BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
  BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
  BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
  BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

  def bg_flags_string(bg):
      flags = bg.flags.value_()
      ret = ''
      if flags & BTRFS_BLOCK_GROUP_DATA:
          ret = 'data'
      if flags & BTRFS_BLOCK_GROUP_METADATA:
          if len(ret) > 0:
              ret += '|'
          ret += 'meta'
      if flags & BTRFS_BLOCK_GROUP_SYSTEM:
          if len(ret) > 0:
              ret += '|'
          ret += 'system'
      if flags & BTRFS_BLOCK_GROUP_RAID0:
          ret += ' raid0'
      elif flags & BTRFS_BLOCK_GROUP_RAID1:
          ret += ' raid1'
      elif flags & BTRFS_BLOCK_GROUP_DUP:
          ret += ' dup'
      elif flags & BTRFS_BLOCK_GROUP_RAID10:
          ret += ' raid10'
      elif flags & BTRFS_BLOCK_GROUP_RAID5:
          ret += ' raid5'
      elif flags & BTRFS_BLOCK_GROUP_RAID6:
          ret += ' raid6'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
          ret += ' raid1c3'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
          ret += ' raid1c4'
      else:
          ret += ' single'

      return ret

  def dump_bg(bg):
      print()
      print(f'block group at {hex(bg.value_())}')
      print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
      print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

  bg_root = fs_info.block_group_cache_tree.address_of_()
  for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
      dump_bg(bg)

  $ drgn dump_block_groups.py

  block group at 0xffff8f3d673b0400
         start 22020096 length 16777216
         flags 258 - system raid6

  block group at 0xffff8f3d53ddb400
         start 38797312 length 536870912
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4d9c00
         start 575668224 length 2147483648
         flags 257 - data raid6

  block group at 0xffff8f3d08189000
         start 2723151872 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3db70ff000
         start 2790260736 length 1073741824
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4dd800
         start 3864002560 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3d67037000
         start 3931111424 length 2147483648
         flags 257 - data raid6
  $

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
   device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
   starting to scrub a stripe from block group X. This call to
   btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
   calls reada_add_block(), it passes the logical address of the extent
   tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
   It finds there isn't any existing readahead extent for the logical
   address 38928384, so it proceeds to the path of creating a new one.

   It calls btrfs_map_block() to find out which stripes exist for the block
   group X. On the first iteration of the for loop that iterates over the
   stripes, it finds the stripe for device /dev/sdd, so it creates one
   zone for that device and adds it to the readahead extent. Before getting
   into the second iteration of the loop, the cleanup kthread deletes block
   group X because it was empty. So in the iterations for the remaining
   stripes it does not add more zones to the readahead extent, because the
   calls to reada_find_zone() returned NULL because they couldn't find
   block group X anymore.

   As a result the new readahead extent has a single zone, corresponding to
   the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
   for the readahead work queue, task B finishes the device replace and at
   btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
   device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
   "->elems" of the reada_control structure allocated at btrfs_reada_add().

   Then it returns back to btrfs_reada_add() and calls
   reada_start_machine(). This queues a job in the readahead work queue to
   run the function reada_start_machine_worker(), which calls
   __reada_start_machine().

   At __reada_start_machine() we take the device list mutex and for each
   device found in the current device list, we call
   reada_start_machine_dev() to start the readahead work. However at this
   point the device /dev/sdd was already freed and is not in the device
   list anymore.

   This means the corresponding readahead for the extent at 38928384 is
   never started, and therefore the "->elems" counter of the reada_control
   structure allocated at btrfs_reada_add() never goes down to 0, causing
   the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
   and the corresponding btrfs_device structure is freed. However the
   readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
   current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
   device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
   __readahead_hook() which ends up dropping the last reference on the
   readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
   and attempts to delete an element from the device's 'reada_extents'
   radix tree, resulting in a use-after-free, as the device pointer of the
   zone for /dev/sdd is now stale. We can also access the device after
   dropping the last reference of a zone, through reada_zone_release(),
   also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
When enabling qgroups we walk the tree_root and then add a qgroup item
for every root that we have.  This creates a lock dependency on the
tree_root and qgroup_root, which results in the following lockdep splat
(with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1299 Not tainted
  ------------------------------------------------------
  btrfs/24552 is trying to acquire lock:
  ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  but task is already holding lock:
  ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_item+0x6e/0x140 [btrfs]
	 btrfs_create_tree+0x1cb/0x240 [btrfs]
	 btrfs_quota_enable+0xcd/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-quota-00){++++}-{3:3}:
	 check_prev_add+0x91/0xc30
	 validate_chain+0x491/0x750
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
	 add_qgroup_item.part.0+0x72/0x210 [btrfs]
	 btrfs_quota_enable+0x3bb/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(btrfs-quota-00);
				 lock(btrfs-root-00);
    lock(btrfs-quota-00);

   *** DEADLOCK ***

  5 locks held by btrfs/24552:
   #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
   #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
   #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
   #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
   #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  stack backtrace:
  CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xf3/0x110
   check_prev_add+0x91/0xc30
   validate_chain+0x491/0x750
   __lock_acquire+0x3fb/0x730
   lock_acquire.part.0+0x6a/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? lock_acquire+0xc4/0x140
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   down_read_nested+0x46/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? btrfs_root_node+0xd9/0x200 [btrfs]
   __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
   btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
   btrfs_search_slot+0xc3/0x9f0 [btrfs]
   btrfs_insert_empty_items+0x58/0xa0 [btrfs]
   add_qgroup_item.part.0+0x72/0x210 [btrfs]
   btrfs_quota_enable+0x3bb/0x790 [btrfs]
   btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
   __x64_sys_ioctl+0x83/0xa0
   do_syscall_64+0x2d/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by dropping the path whenever we find a root item, add the
qgroup item, and then re-lookup the root item we found and continue
processing roots.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
I got the following lockdep splat with tree locks converted to rwsem
patches on btrfs/104:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0+ #102 Not tainted
  ------------------------------------------------------
  btrfs-cleaner/903 is trying to acquire lock:
  ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

  but task is already holding lock:
  ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 caching_thread+0x53/0x5a0
	 btrfs_work_helper+0xfa/0x520
	 process_one_work+0x238/0x540
	 worker_thread+0x55/0x3c0
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7b0
	 btrfs_cache_block_group+0x1e0/0x510
	 find_free_extent+0xb6e/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&space_info->groups_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 find_free_extent+0x2ed/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0xb9/0x3d0
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x32/0x170
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x614/0x9d0
	 btrfs_find_root+0x35/0x1b0
	 btrfs_read_tree_root+0x61/0x120
	 btrfs_get_root_ref+0x14b/0x600
	 find_parent_nodes+0x3e6/0x1b30
	 btrfs_find_all_roots_safe+0xb4/0x130
	 btrfs_find_all_roots+0x60/0x80
	 btrfs_qgroup_trace_extent_post+0x27/0x40
	 btrfs_add_delayed_data_ref+0x3fd/0x460
	 btrfs_free_extent+0x42/0x100
	 __btrfs_mod_ref+0x1d7/0x2f0
	 walk_up_proc+0x11c/0x400
	 walk_up_tree+0xf0/0x180
	 btrfs_drop_snapshot+0x1c7/0x780
	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
	 cleaner_kthread+0xd4/0x140
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->commit_root_sem);
				 lock(&caching_ctl->mutex);
				 lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  3 locks held by btrfs-cleaner/903:
   #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
   #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
   #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  stack backtrace:
  CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   __lock_acquire+0x1167/0x2150
   ? __bfs+0x42/0x210
   lock_acquire+0xb9/0x3d0
   ? __btrfs_tree_read_lock+0x32/0x170
   down_read_nested+0x43/0x130
   ? __btrfs_tree_read_lock+0x32/0x170
   __btrfs_tree_read_lock+0x32/0x170
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x614/0x9d0
   ? find_held_lock+0x2b/0x80
   btrfs_find_root+0x35/0x1b0
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_read_tree_root+0x61/0x120
   btrfs_get_root_ref+0x14b/0x600
   find_parent_nodes+0x3e6/0x1b30
   btrfs_find_all_roots_safe+0xb4/0x130
   btrfs_find_all_roots+0x60/0x80
   btrfs_qgroup_trace_extent_post+0x27/0x40
   btrfs_add_delayed_data_ref+0x3fd/0x460
   btrfs_free_extent+0x42/0x100
   __btrfs_mod_ref+0x1d7/0x2f0
   walk_up_proc+0x11c/0x400
   walk_up_tree+0xf0/0x180
   btrfs_drop_snapshot+0x1c7/0x780
   ? btrfs_clean_one_deleted_snapshot+0x73/0x110
   btrfs_clean_one_deleted_snapshot+0xfb/0x110
   cleaner_kthread+0xd4/0x140
   ? btrfs_alloc_root+0x50/0x50
   kthread+0x13a/0x150
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30
  BTRFS info (device sdb): disk space caching is enabled
  BTRFS info (device sdb): has skinny extents

This happens because qgroups does a backref lookup when we create a
delayed ref.  From here it may have to look up a root from an indirect
ref, which does a normal lookup on the tree_root, which takes the read
lock on the tree_root nodes.

To fix this we need to add a variant for looking up roots that searches
the commit root of the tree_root.  Then when we do the backref search
using the commit root we are sure to not take any locks on the tree_root
nodes.  This gets rid of the lockdep splat when running btrfs/104.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
melver pushed a commit that referenced this pull request Nov 2, 2020
Ido Schimmel says:

====================
mlxsw: Various fixes

This patch set contains various fixes for mlxsw.

Patch #1 ensures that only link modes that are supported by both the
device and the driver are advertised. When a link mode that is not
supported by the driver is negotiated by the device, it will be
presented as an unknown speed by ethtool, causing the bond driver to
wrongly assume that the link is down.

Patch #2 fixes a trivial memory leak upon module removal.

Patch #3 fixes a use-after-free that syzkaller was able to trigger once
on a slow emulator after a few months of fuzzing.
====================

Link: https://lore.kernel.org/r/20201024133733.2107509-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
bnxt_open_nic() is called during configuration changes that require
the NIC to be closed and then opened.  This call is protected by
rtnl_lock.  Firmware reset can be happening at the same time.  Only
critical portions of the entire firmware reset sequence are protected
by the rtnl_lock.  It is possible that bnxt_open_nic() can be called
when the firmware reset sequence is aborting.  In that case,
bnxt_open_nic() needs to check if the ABORT_ERR flag is set and
abort if it is.  The configuration change that resulted in the
bnxt_open_nic() call will fail but the NIC will be brought to a
consistent IF_DOWN state.

Without this patch, if bnxt_open_nic() were to continue in this error
state, it may crash like this:

[ 1648.659736] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 1648.659768] IP: [<ffffffffc01e9b3a>] bnxt_alloc_mem+0x50a/0x1140 [bnxt_en]
[ 1648.659796] PGD 101e1b3067 PUD 101e1b2067 PMD 0
[ 1648.659813] Oops: 0000 [#1] SMP
[ 1648.659825] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dell_smbios dell_wmi_descriptor dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper vfat cryptd fat pcspkr ipmi_ssif sg k10temp i2c_piix4 wmi ipmi_si ipmi_devintf ipmi_msghandler tpm_crb acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci megaraid_sas crct10dif_pclmul crct10dif_common
[ 1648.660063]  tg3 libata crc32c_intel bnxt_en(OE) drm_panel_orientation_quirks devlink ptp pps_core dm_mirror dm_region_hash dm_log dm_mod fuse
[ 1648.660105] CPU: 13 PID: 3867 Comm: ethtool Kdump: loaded Tainted: G           OE  ------------   3.10.0-1152.el7.x86_64 #1
[ 1648.660911] Hardware name: Dell Inc. PowerEdge R7515/0R4CNN, BIOS 1.2.14 01/28/2020
[ 1648.661662] task: ffff94e64cbc9080 ti: ffff94f55df1c000 task.ti: ffff94f55df1c000
[ 1648.662409] RIP: 0010:[<ffffffffc01e9b3a>]  [<ffffffffc01e9b3a>] bnxt_alloc_mem+0x50a/0x1140 [bnxt_en]
[ 1648.663171] RSP: 0018:ffff94f55df1fba8  EFLAGS: 00010202
[ 1648.663927] RAX: 0000000000000000 RBX: ffff94e6827e0000 RCX: 0000000000000000
[ 1648.664684] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94e6827e08c0
[ 1648.665433] RBP: ffff94f55df1fc20 R08: 00000000000001ff R09: 0000000000000008
[ 1648.666184] R10: 0000000000000d53 R11: ffff94f55df1f7ce R12: ffff94e6827e08c0
[ 1648.666940] R13: ffff94e6827e08c0 R14: ffff94e6827e08c0 R15: ffffffffb9115e40
[ 1648.667695] FS:  00007f8aadba5740(0000) GS:ffff94f57eb40000(0000) knlGS:0000000000000000
[ 1648.668447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1648.669202] CR2: 0000000000000000 CR3: 0000001022772000 CR4: 0000000000340fe0
[ 1648.669966] Call Trace:
[ 1648.670730]  [<ffffffffc01f1d5d>] ? bnxt_need_reserve_rings+0x9d/0x170 [bnxt_en]
[ 1648.671496]  [<ffffffffc01fa7ea>] __bnxt_open_nic+0x8a/0x9a0 [bnxt_en]
[ 1648.672263]  [<ffffffffc01f7479>] ? bnxt_close_nic+0x59/0x1b0 [bnxt_en]
[ 1648.673031]  [<ffffffffc01fb11b>] bnxt_open_nic+0x1b/0x50 [bnxt_en]
[ 1648.673793]  [<ffffffffc020037c>] bnxt_set_ringparam+0x6c/0xa0 [bnxt_en]
[ 1648.674550]  [<ffffffffb8a5f564>] dev_ethtool+0x1334/0x21a0
[ 1648.675306]  [<ffffffffb8a719ff>] dev_ioctl+0x1ef/0x5f0
[ 1648.676061]  [<ffffffffb8a324bd>] sock_do_ioctl+0x4d/0x60
[ 1648.676810]  [<ffffffffb8a326bb>] sock_ioctl+0x1eb/0x2d0
[ 1648.677548]  [<ffffffffb8663230>] do_vfs_ioctl+0x3a0/0x5b0
[ 1648.678282]  [<ffffffffb8b8e678>] ? __do_page_fault+0x238/0x500
[ 1648.679016]  [<ffffffffb86634e1>] SyS_ioctl+0xa1/0xc0
[ 1648.679745]  [<ffffffffb8b93f92>] system_call_fastpath+0x25/0x2a
[ 1648.680461] Code: 9e 60 01 00 00 0f 1f 40 00 45 8b 8e 48 01 00 00 31 c9 45 85 c9 0f 8e 73 01 00 00 66 0f 1f 44 00 00 49 8b 86 a8 00 00 00 48 63 d1 <48> 8b 14 d0 48 85 d2 0f 84 46 01 00 00 41 8b 86 44 01 00 00 c7
[ 1648.681986] RIP  [<ffffffffc01e9b3a>] bnxt_alloc_mem+0x50a/0x1140 [bnxt_en]
[ 1648.682724]  RSP <ffff94f55df1fba8>
[ 1648.683451] CR2: 0000000000000000

Fixes: ec5d31e ("bnxt_en: Handle firmware reset status during IF_UP.")
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
When unbinding the hns3 driver with the HNS3 VF, I got the following
kernel panic:

[  265.709989] Unable to handle kernel paging request at virtual address ffff800054627000
[  265.717928] Mem abort info:
[  265.720740]   ESR = 0x96000047
[  265.723810]   EC = 0x25: DABT (current EL), IL = 32 bits
[  265.729126]   SET = 0, FnV = 0
[  265.732195]   EA = 0, S1PTW = 0
[  265.735351] Data abort info:
[  265.738227]   ISV = 0, ISS = 0x00000047
[  265.742071]   CM = 0, WnR = 1
[  265.745055] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000009b54000
[  265.751753] [ffff800054627000] pgd=0000202ffffff003, p4d=0000202ffffff003, pud=00002020020eb003, pmd=00000020a0dfc003, pte=0000000000000000
[  265.764314] Internal error: Oops: 96000047 [#1] SMP
[  265.830357] CPU: 61 PID: 20319 Comm: bash Not tainted 5.9.0+ #206
[  265.836423] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.05 09/18/2019
[  265.843873] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  265.843890] pc : hclgevf_cmd_uninit+0xbc/0x300
[  265.861988] lr : hclgevf_cmd_uninit+0xb0/0x300
[  265.861992] sp : ffff80004c983b50
[  265.881411] pmr_save: 000000e0
[  265.884453] x29: ffff80004c983b50 x28: ffff20280bbce500
[  265.889744] x27: 0000000000000000 x26: 0000000000000000
[  265.895034] x25: ffff800011a1f000 x24: ffff800011a1fe90
[  265.900325] x23: ffff0020ce9b00d8 x22: ffff0020ce9b0150
[  265.905616] x21: ffff800010d70e90 x20: ffff800010d70e90
[  265.910906] x19: ffff0020ce9b0080 x18: 0000000000000004
[  265.916198] x17: 0000000000000000 x16: ffff800011ae32e8
[  265.916201] x15: 0000000000000028 x14: 0000000000000002
[  265.916204] x13: ffff800011ae32e8 x12: 0000000000012ad8
[  265.946619] x11: ffff80004c983b50 x10: 0000000000000000
[  265.951911] x9 : ffff8000115d0888 x8 : 0000000000000000
[  265.951914] x7 : ffff800011890b20 x6 : c0000000ffff7fff
[  265.951917] x5 : ffff80004c983930 x4 : 0000000000000001
[  265.951919] x3 : ffffa027eec1b000 x2 : 2b78ccbbff369100
[  265.964487] x1 : 0000000000000000 x0 : ffff800054627000
[  265.964491] Call trace:
[  265.964494]  hclgevf_cmd_uninit+0xbc/0x300
[  265.964496]  hclgevf_uninit_ae_dev+0x9c/0xe8
[  265.964501]  hnae3_unregister_ae_dev+0xb0/0x130
[  265.964516]  hns3_remove+0x34/0x88 [hns3]
[  266.009683]  pci_device_remove+0x48/0xf0
[  266.009692]  device_release_driver_internal+0x114/0x1e8
[  266.030058]  device_driver_detach+0x28/0x38
[  266.034224]  unbind_store+0xd4/0x108
[  266.037784]  drv_attr_store+0x40/0x58
[  266.041435]  sysfs_kf_write+0x54/0x80
[  266.045081]  kernfs_fop_write+0x12c/0x250
[  266.049076]  vfs_write+0xc4/0x248
[  266.052378]  ksys_write+0x74/0xf8
[  266.055677]  __arm64_sys_write+0x24/0x30
[  266.059584]  el0_svc_common.constprop.3+0x84/0x270
[  266.064354]  do_el0_svc+0x34/0xa0
[  266.067658]  el0_svc+0x38/0x40
[  266.070700]  el0_sync_handler+0x8c/0xb0
[  266.074519]  el0_sync+0x140/0x180

It looks like the BAR memory region had already been unmapped before we
start clearing CMDQ registers in it, which is pretty bad and the kernel
happily kills itself because of a Current EL Data Abort (on arm64).

Moving the CMDQ uninitialization a bit early fixes the issue for me.

Fixes: 862d969 ("net: hns3: do VF's pci re-initialization while PF doing FLR")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20201023051550.793-1-yuzenghui@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
When target side trace in turned on and flush command is issued from the
host it results in the following Oops.

[  856.789724] BUG: kernel NULL pointer dereference, address: 0000000000000068
[  856.790686] #PF: supervisor read access in kernel mode
[  856.791262] #PF: error_code(0x0000) - not-present page
[  856.791863] PGD 6d7110067 P4D 6d7110067 PUD 66f0ad067 PMD 0
[  856.792527] Oops: 0000 [#1] SMP NOPTI
[  856.792950] CPU: 15 PID: 7034 Comm: nvme Tainted: G           OE     5.9.0nvme-5.9+ #71
[  856.793790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[  856.794956] RIP: 0010:trace_event_raw_event_nvmet_req_init+0x13e/0x170 [nvmet]
[  856.795734] Code: 41 5c 41 5d c3 31 d2 31 f6 e8 4e 9b b8 e0 e9 0e ff ff ff 49 8b 55 00 48 8b 38 8b 0
[  856.797740] RSP: 0018:ffffc90001be3a60 EFLAGS: 00010246
[  856.798375] RAX: 0000000000000000 RBX: ffff8887e7d2c01c RCX: 0000000000000000
[  856.799234] RDX: 0000000000000020 RSI: 0000000057e70ea2 RDI: ffff8887e7d2c034
[  856.800088] RBP: ffff88869f710578 R08: ffff888807500d40 R09: 00000000fffffffe
[  856.800951] R10: 0000000064c66670 R11: 00000000ef955201 R12: ffff8887e7d2c034
[  856.801807] R13: ffff88869f7105c8 R14: 0000000000000040 R15: ffff88869f710440
[  856.802667] FS:  00007f6a22bd8780(0000) GS:ffff888813a00000(0000) knlGS:0000000000000000
[  856.803635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  856.804367] CR2: 0000000000000068 CR3: 00000006d73e0000 CR4: 00000000003506e0
[  856.805283] Call Trace:
[  856.805613]  nvmet_req_init+0x27c/0x480 [nvmet]
[  856.806200]  nvme_loop_queue_rq+0xcb/0x1d0 [nvme_loop]
[  856.806862]  blk_mq_dispatch_rq_list+0x123/0x7b0
[  856.807459]  ? kvm_sched_clock_read+0x14/0x30
[  856.808025]  __blk_mq_sched_dispatch_requests+0xc7/0x170
[  856.808708]  blk_mq_sched_dispatch_requests+0x30/0x60
[  856.809372]  __blk_mq_run_hw_queue+0x70/0x100
[  856.809935]  __blk_mq_delay_run_hw_queue+0x156/0x170
[  856.810574]  blk_mq_run_hw_queue+0x86/0xe0
[  856.811104]  blk_mq_sched_insert_request+0xef/0x160
[  856.811733]  blk_execute_rq+0x69/0xc0
[  856.812212]  ? blk_mq_rq_ctx_init+0xd0/0x230
[  856.812784]  nvme_execute_passthru_rq+0x57/0x130 [nvme_core]
[  856.813461]  nvme_submit_user_cmd+0xeb/0x300 [nvme_core]
[  856.814099]  nvme_user_cmd.isra.82+0x11e/0x1a0 [nvme_core]
[  856.814752]  blkdev_ioctl+0x1dc/0x2c0
[  856.815197]  block_ioctl+0x3f/0x50
[  856.815606]  __x64_sys_ioctl+0x84/0xc0
[  856.816074]  do_syscall_64+0x33/0x40
[  856.816533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  856.817168] RIP: 0033:0x7f6a222ed107
[  856.817617] Code: 44 00 00 48 8b 05 81 cd 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 8
[  856.819901] RSP: 002b:00007ffca848f058 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  856.820846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f6a222ed107
[  856.821726] RDX: 00007ffca848f060 RSI: 00000000c0484e43 RDI: 0000000000000003
[  856.822603] RBP: 0000000000000003 R08: 000000000000003f R09: 0000000000000005
[  856.823478] R10: 00007ffca848ece0 R11: 0000000000000202 R12: 00007ffca84912d3
[  856.824359] R13: 00007ffca848f4d0 R14: 0000000000000002 R15: 000000000067e900
[  856.825236] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corel

Move the nvmet_req_init() tracepoint after we parse the command in
nvmet_req_init() so that we can get rid of the duplicate
nvmet_find_namespace() call.
Rename __assign_disk_name() ->  __assign_req_name(). Now that we call
tracepoint after parsing the command simplify the newly added
__assign_req_name() which fixes this bug.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
melver pushed a commit that referenced this pull request Nov 2, 2020
The tcf_block_unbind() expects that the caller will take block->cb_lock
before calling it, however the code took RTNL lock and dropped cb_lock
instead. This causes to the following kernel panic.

 WARNING: CPU: 1 PID: 13524 at net/sched/cls_api.c:1488 tcf_block_unbind+0x2db/0x420
 Modules linked in: mlx5_ib mlx5_core mlxfw ptp pps_core act_mirred act_tunnel_key cls_flower vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm ib_uverbs ib_core overlay [last unloaded: mlxfw]
 CPU: 1 PID: 13524 Comm: test-ecmp-add-v Tainted: G        W         5.9.0+ #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:tcf_block_unbind+0x2db/0x420
 Code: ff 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8d bc 24 30 01 00 00 be ff ff ff ff e8 7d 7f 70 00 85 c0 0f 85 7b fd ff ff <0f> 0b e9 74 fd ff ff 48 c7 c7 dc 6a 24 84 e8 02 ec fe fe e9 55 fd
 RSP: 0018:ffff888117d17968 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88812f713c00 RCX: 1ffffffff0848d5b
 RDX: 0000000000000001 RSI: ffff88814fbc8130 RDI: ffff888107f2b878
 RBP: 1ffff11022fa2f3f R08: 0000000000000000 R09: ffffffff84115a87
 R10: fffffbfff0822b50 R11: ffff888107f2b898 R12: ffff88814fbc8000
 R13: ffff88812f713c10 R14: ffff888117d17a38 R15: ffff88814fbc80c0
 FS:  00007f6593d36740(0000) GS:ffff8882a4f00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005607a00758f8 CR3: 0000000131aea006 CR4: 0000000000170ea0
 Call Trace:
  tc_block_indr_cleanup+0x3e0/0x5a0
  ? tcf_block_unbind+0x420/0x420
  ? __mutex_unlock_slowpath+0xe7/0x610
  flow_indr_dev_unregister+0x5e2/0x930
  ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core]
  ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core]
  ? flow_indr_block_cb_alloc+0x3c0/0x3c0
  ? mlx5_db_free+0x37c/0x4b0 [mlx5_core]
  mlx5e_cleanup_rep_tx+0x8b/0xc0 [mlx5_core]
  mlx5e_detach_netdev+0xe5/0x120 [mlx5_core]
  mlx5e_vport_rep_unload+0x155/0x260 [mlx5_core]
  esw_offloads_disable+0x227/0x2b0 [mlx5_core]
  mlx5_eswitch_disable_locked.cold+0x38e/0x699 [mlx5_core]
  mlx5_eswitch_disable+0x94/0xf0 [mlx5_core]
  mlx5_device_disable_sriov+0x183/0x1f0 [mlx5_core]
  mlx5_core_sriov_configure+0xfd/0x230 [mlx5_core]
  sriov_numvfs_store+0x261/0x2f0
  ? sriov_drivers_autoprobe_store+0x110/0x110
  ? sysfs_file_ops+0x170/0x170
  ? sysfs_file_ops+0x117/0x170
  ? sysfs_file_ops+0x170/0x170
  kernfs_fop_write+0x1ff/0x3f0
  ? rcu_read_lock_any_held+0x6e/0x90
  vfs_write+0x1f3/0x620
  ksys_write+0xf9/0x1d0
  ? __x64_sys_read+0xb0/0xb0
  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
  ? syscall_enter_from_user_mode+0x1d/0x50
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

<...>

 ---[ end trace bfdd028ada702879 ]---

Fixes: 0fdcf78 ("net: use flow_indr_dev_setup_offload()")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20201026123327.1141066-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 2, 2020
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.10, take #1

- Force PTE mapping on device pages provided via VFIO
- Fix detection of cacheable mapping at S2
- Fallback to PMD/PTE mappings for composite huge pages
- Fix accounting of Stage-2 PGD allocation
- Fix AArch32 handling of some of the debug registers
- Simplify host HYP entry
- Fix stray pointer conversion on nVHE TLB invalidation
- Fix initialization of the nVHE code
- Simplify handling of capabilities exposed to HYP
- Nuke VCPUs caught using a forbidden AArch32 EL0
melver pushed a commit that referenced this pull request Nov 10, 2020
This is a copy of commit 5c5f1ba ("ASoC: Intel:
kbl_rt5663_rt5514_max98927: Fix kabylake_ssp_fixup function") applied to
the kbl_rt5663_max98927 board file.

Original explanation of the change:

kabylake_ssp_fixup function uses snd_soc_dpcm to identify the
codecs DAIs. The HW parameters are changed based on the codec DAI of the
stream. The earlier approach to get snd_soc_dpcm was using container_of()
macro on snd_pcm_hw_params.

The structures have been modified over time and snd_soc_dpcm does not have
snd_pcm_hw_params as a reference but as a copy. This causes the current
driver to crash when used.

This patch changes the way snd_soc_dpcm is extracted. snd_soc_pcm_runtime
holds 2 dpcm instances (one for playback and one for capture). 2 codecs
on the SSP are dmic (capture) and speakers (playback). Based on the
stream direction, snd_soc_dpcm is extracted from snd_soc_pcm_runtime.

Fixes a boot crash on a HP Chromebook x2:

[   16.582225] BUG: kernel NULL pointer dereference, address: 0000000000000050
[   16.582231] #PF: supervisor read access in kernel mode
[   16.582233] #PF: error_code(0x0000) - not-present page
[   16.582234] PGD 0 P4D 0
[   16.582238] Oops: 0000 [#1] PREEMPT SMP PTI
[   16.582241] CPU: 0 PID: 1980 Comm: cras Tainted: G         C        5.4.58 #1
[   16.582243] Hardware name: HP Soraka/Soraka, BIOS Google_Soraka.10431.75.0 08/30/2018
[   16.582247] RIP: 0010:kabylake_ssp_fixup+0x19/0xbb [snd_soc_kbl_rt5663_max98927]
[   16.582250] Code: c6 6f c5 80 c0 44 89 f2 31 c0 e8 3e c9 4c d6 eb de 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 89 f3 48 8b 46 c8 48 8b 4e d0 <48> 8b 49 10 4c 8b 78 10 4c 8b 31 4c 89 f7 48 c7 c6 4b c2 80 c0 e8
[   16.582252] RSP: 0000:ffffaf7e81e0b958 EFLAGS: 00010282
[   16.582254] RAX: ffffffff96f13e0d RBX: ffffaf7e81e0ba00 RCX: 0000000000000040
[   16.582256] RDX: ffffaf7e81e0ba00 RSI: ffffaf7e81e0ba00 RDI: ffffa3b208558028
[   16.582258] RBP: ffffaf7e81e0b970 R08: ffffa3b203b54160 R09: ffffaf7e81e0ba00
[   16.582259] R10: 0000000000000000 R11: ffffffffc080b345 R12: ffffa3b209fb6e00
[   16.582261] R13: ffffa3b1b1a47838 R14: ffffa3b1e6197f28 R15: ffffaf7e81e0ba00
[   16.582263] FS:  00007eb3f25aaf80(0000) GS:ffffa3b236a00000(0000) knlGS:0000000000000000
[   16.582265] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.582267] CR2: 0000000000000050 CR3: 0000000246bc8006 CR4: 00000000003606f0
[   16.582269] Call Trace:
[   16.582275]  snd_soc_link_be_hw_params_fixup+0x21/0x68
[   16.582278]  snd_soc_dai_hw_params+0x25/0x94
[   16.582282]  soc_pcm_hw_params+0x2d8/0x583
[   16.582288]  dpcm_be_dai_hw_params+0x172/0x29e
[   16.582291]  dpcm_fe_dai_hw_params+0x9f/0x12f
[   16.582295]  snd_pcm_hw_params+0x137/0x41c
[   16.582298]  snd_pcm_hw_params_user+0x3c/0x71
[   16.582301]  snd_pcm_common_ioctl+0x2c6/0x565
[   16.582304]  snd_pcm_ioctl+0x32/0x36
[   16.582307]  do_vfs_ioctl+0x506/0x783
[   16.582311]  ksys_ioctl+0x58/0x83
[   16.582313]  __x64_sys_ioctl+0x1a/0x1e
[   16.582316]  do_syscall_64+0x54/0x7e
[   16.582319]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   16.582322] RIP: 0033:0x7eb3f1886157
[   16.582324] Code: 8a 66 90 48 8b 05 11 dd 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 dc 2b 00 f7 d8 64 89 01 48
[   16.582326] RSP: 002b:00007ffff7559818 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   16.582329] RAX: ffffffffffffffda RBX: 00005acc9188b140 RCX: 00007eb3f1886157
[   16.582330] RDX: 00007ffff7559940 RSI: 00000000c2604111 RDI: 000000000000001e
[   16.582332] RBP: 00007ffff7559840 R08: 0000000000000004 R09: 0000000000000000
[   16.582333] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000bb80
[   16.582335] R13: 00005acc91702e80 R14: 00007ffff7559940 R15: 00005acc91702e80
[   16.582337] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg uinput hid_google_hammer snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl_ssp_clk snd_soc_skl snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_hdac_hda snd_soc_acpi_intel_match snd_soc_acpi snd_hda_ext_core snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core ipu3_cio2 ipu3_imgu(C) videobuf2_v4l2 videobuf2_common videobuf2_dma_sg videobuf2_memops snd_soc_rt5663 snd_soc_max98927 snd_soc_rl6231 ov5670 ov13858 acpi_als v4l2_fwnode dw9714 fuse xt_MASQUERADE iio_trig_sysfs cros_ec_light_prox cros_ec_sensors cros_ec_sensors_core cros_ec_sensors_ring industrialio_triggered_buffer kfifo_buf industrialio cros_ec_sensorhub cdc_ether usbnet btusb btrtl btintel btbcm bluetooth ecdh_generic ecc lzo_rle lzo_compress iwlmvm zram iwl7000_mac80211 r8152 mii iwlwifi cfg80211 joydev
[   16.584243] gsmi: Log Shutdown Reason 0x03
[   16.584246] CR2: 0000000000000050
[   16.584248] ---[ end trace c8511d090c11edff ]---

Suggested-by: Łukasz Majczak <lmajczak@google.com>
Fixes: 2e5894d ("ASoC: pcm: Add support for DAI multicodec")
Signed-off-by: Tomasz Figa <tfiga@chromium.org>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20201014141624.4143453-1-tfiga@chromium.org
Signed-off-by: Mark Brown <broonie@kernel.org>
melver pushed a commit that referenced this pull request Nov 10, 2020
spi_nor_parse_sfdp() modifies the passed structure so that it points to
itself (params.erase_map.regions to params.erase_map.uniform_region). This
makes it impossible to copy the local struct anywhere else.

Therefore only use memcpy() in backup-restore scenario. The bug may show up
like below:

BUG: unable to handle page fault for address: ffffc90000b377f8
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 3500 Comm: flashcp Tainted: G           O      5.4.53-... #1
...
RIP: 0010:spi_nor_erase+0x8e/0x5c0
Code: 64 24 18 89 db 4d 8b b5 d0 04 00 00 4c 89 64 24 18 4c 89 64 24 20 eb 12 a8 10 0f 85 59 02 00 00 49 83 c6 10 0f 84 4f 02 00 00 <49> 8b 06 48 89 c2 48 83 e2 c0 48 89 d1 49 03 4e 08 48 39 cb 73 d8
RSP: 0018:ffffc9000217fc48 EFLAGS: 00010206
RAX: 0000000000740000 RBX: 0000000000000000 RCX: 0000000000740000
RDX: ffff8884550c9980 RSI: ffff88844f9c0bc0 RDI: ffff88844ede7bb8
RBP: 0000000000740000 R08: ffffffff815bfbe0 R09: ffff88844f9c0bc0
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000217fc60
R13: ffff88844ede7818 R14: ffffc90000b377f8 R15: 0000000000000000
FS:  00007f4699780500(0000) GS:ffff88846ff00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90000b377f8 CR3: 00000004538ee000 CR4: 0000000000340fe0
Call Trace:
 part_erase+0x27/0x50
 mtdchar_ioctl+0x831/0xba0
 ? filemap_map_pages+0x186/0x3d0
 ? do_filp_open+0xad/0x110
 ? _copy_to_user+0x22/0x30
 ? cp_new_stat+0x150/0x180
 mtdchar_unlocked_ioctl+0x2a/0x40
 do_vfs_ioctl+0xa0/0x630
 ? __do_sys_newfstat+0x3c/0x60
 ksys_ioctl+0x70/0x80
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x6a/0x200
 ? prepare_exit_to_usermode+0x50/0xd0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f46996b6817

Cc: stable@vger.kernel.org
Fixes: c468721 ("mtd: spi-nor: Move erase_map to 'struct spi_nor_flash_parameter'")
Co-developed-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Tested-by: Baurzhan Ismagulov <ibr@radix50.net>
Reviewed-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Link: https://lore.kernel.org/r/20201005084803.23460-1-alexander.sverdlin@nokia.com
melver pushed a commit that referenced this pull request Nov 10, 2020
In my test setup, I had a SAMA5D27 device configured with ip forwarding, and
second device with usb ethernet (r8152) sending ICMP packets.  If the packet
was larger than about 220 bytes, the SAMA5 device would "oops" with the
following trace:

kernel BUG at net/core/skbuff.c:1863!
Internal error: Oops - BUG: 0 [#1] ARM
Modules linked in: xt_MASQUERADE ppp_async ppp_generic slhc iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 can_raw can bridge stp llc ipt_REJECT nf_reject_ipv4 sd_mod cdc_ether usbnet usb_storage r8152 scsi_mod mii o
ption usb_wwan usbserial micrel macb at91_sama5d2_adc phylink gpio_sama5d2_piobu m_can_platform m_can industrialio_triggered_buffer kfifo_buf of_mdio can_dev fixed_phy sdhci_of_at91 sdhci_pltfm libphy sdhci mmc_core ohci_at91 ehci_atmel o
hci_hcd iio_rescale industrialio sch_fq_codel spidev prox2_hal(O)
CPU: 0 PID: 0 Comm: swapper Tainted: G           O      5.9.1-prox2+ #1
Hardware name: Atmel SAMA5
PC is at skb_put+0x3c/0x50
LR is at macb_start_xmit+0x134/0xad0 [macb]
pc : [<c05258cc>]    lr : [<bf0ea5b8>]    psr: 20070113
sp : c0d01a60  ip : c07232c0  fp : c4250000
r10: c0d03cc8  r9 : 00000000  r8 : c0d038c0
r7 : 00000000  r6 : 00000008  r5 : c59b66c0  r4 : 0000002a
r3 : 8f659eff  r2 : c59e9eea  r1 : 00000001  r0 : c59b66c0
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c53c7d  Table: 2640c059  DAC: 00000051
Process swapper (pid: 0, stack limit = 0x75002d81)

<snipped stack>

[<c05258cc>] (skb_put) from [<bf0ea5b8>] (macb_start_xmit+0x134/0xad0 [macb])
[<bf0ea5b8>] (macb_start_xmit [macb]) from [<c053e504>] (dev_hard_start_xmit+0x90/0x11c)
[<c053e504>] (dev_hard_start_xmit) from [<c0571180>] (sch_direct_xmit+0x124/0x260)
[<c0571180>] (sch_direct_xmit) from [<c053eae4>] (__dev_queue_xmit+0x4b0/0x6d0)
[<c053eae4>] (__dev_queue_xmit) from [<c05a5650>] (ip_finish_output2+0x350/0x580)
[<c05a5650>] (ip_finish_output2) from [<c05a7e24>] (ip_output+0xb4/0x13c)
[<c05a7e24>] (ip_output) from [<c05a39d0>] (ip_forward+0x474/0x500)
[<c05a39d0>] (ip_forward) from [<c05a13d8>] (ip_sublist_rcv_finish+0x3c/0x50)
[<c05a13d8>] (ip_sublist_rcv_finish) from [<c05a19b8>] (ip_sublist_rcv+0x11c/0x188)
[<c05a19b8>] (ip_sublist_rcv) from [<c05a2494>] (ip_list_rcv+0xf8/0x124)
[<c05a2494>] (ip_list_rcv) from [<c05403c4>] (__netif_receive_skb_list_core+0x1a0/0x20c)
[<c05403c4>] (__netif_receive_skb_list_core) from [<c05405c4>] (netif_receive_skb_list_internal+0x194/0x230)
[<c05405c4>] (netif_receive_skb_list_internal) from [<c0540684>] (gro_normal_list.part.0+0x14/0x28)
[<c0540684>] (gro_normal_list.part.0) from [<c0541280>] (napi_complete_done+0x16c/0x210)
[<c0541280>] (napi_complete_done) from [<bf14c1c0>] (r8152_poll+0x684/0x708 [r8152])
[<bf14c1c0>] (r8152_poll [r8152]) from [<c0541424>] (net_rx_action+0x100/0x328)
[<c0541424>] (net_rx_action) from [<c01012ec>] (__do_softirq+0xec/0x274)
[<c01012ec>] (__do_softirq) from [<c012d6d4>] (irq_exit+0xcc/0xd0)
[<c012d6d4>] (irq_exit) from [<c0160960>] (__handle_domain_irq+0x58/0xa4)
[<c0160960>] (__handle_domain_irq) from [<c0100b0c>] (__irq_svc+0x6c/0x90)
Exception stack(0xc0d01ef0 to 0xc0d01f38)
1ee0:                                     00000000 0000003d 0c31f383 c0d0fa00
1f00: c0d2eb80 00000000 c0d2e630 4dad8c49 4da967b0 0000003d 0000003d 00000000
1f20: fffffff5 c0d01f40 c04e0f88 c04e0f8c 30070013 ffffffff
[<c0100b0c>] (__irq_svc) from [<c04e0f8c>] (cpuidle_enter_state+0x7c/0x378)
[<c04e0f8c>] (cpuidle_enter_state) from [<c04e12c4>] (cpuidle_enter+0x28/0x38)
[<c04e12c4>] (cpuidle_enter) from [<c014f710>] (do_idle+0x194/0x214)
[<c014f710>] (do_idle) from [<c014fa50>] (cpu_startup_entry+0xc/0x14)
[<c014fa50>] (cpu_startup_entry) from [<c0a00dc8>] (start_kernel+0x46c/0x4a0)
Code: e580c054 8a000002 e1a00002 e8bd8070 (e7f001f2)
---[ end trace 146c8a334115490c ]---

The solution was to force nonlinear buffers to be cloned.  This was previously
reported by Klaus Doth (https://www.spinics.net/lists/netdev/msg556937.html)
but never formally submitted as a patch.

This is the third revision, hopefully the formatting is correct this time!

Suggested-by: Klaus Doth <krnl@doth.eu>
Fixes: 653e92a ("net: macb: add support for padding and fcs computation")
Signed-off-by: Mark Deneen <mdeneen@saucontech.com>
Link: https://lore.kernel.org/r/20201030155814.622831-1-mdeneen@saucontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
melver pushed a commit that referenced this pull request Nov 10, 2020
If calling find_domain() for a device which hasn't been probed by the
iommu core, below kernel NULL pointer dereference issue happens.

[  362.736947] BUG: kernel NULL pointer dereference, address: 0000000000000038
[  362.743953] #PF: supervisor read access in kernel mode
[  362.749115] #PF: error_code(0x0000) - not-present page
[  362.754278] PGD 0 P4D 0
[  362.756843] Oops: 0000 [#1] SMP NOPTI
[  362.760528] CPU: 0 PID: 844 Comm: cat Not tainted 5.9.0-rc4-intel-next+ #1
[  362.767428] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake
               U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3384.A02.1909200816
               09/20/2019
[  362.781109] RIP: 0010:find_domain+0xd/0x40
[  362.785234] Code: 48 81 fb 60 28 d9 b2 75 de 5b 41 5c 41 5d 5d c3 0f 1f 00 66
                     2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 e0 02 00
                     00 55 <48> 8b 40 38 48 89 e5 48 83 f8 fe 0f 94 c1 48 85 ff
                     0f 94 c2 08 d1
[  362.804041] RSP: 0018:ffffb09cc1f0bd38 EFLAGS: 00010046
[  362.809292] RAX: 0000000000000000 RBX: ffff905b98e4fac8 RCX: 0000000000000000
[  362.816452] RDX: 0000000000000001 RSI: ffff905b98e4fac8 RDI: ffff905b9ccd40d0
[  362.823617] RBP: ffffb09cc1f0bda0 R08: ffffb09cc1f0bd48 R09: 000000000000000f
[  362.830778] R10: ffffffffb266c080 R11: ffff905b9042602d R12: ffff905b98e4fac8
[  362.837944] R13: ffffb09cc1f0bd48 R14: ffff905b9ccd40d0 R15: ffff905b98e4fac8
[  362.845108] FS:  00007f8485460740(0000) GS:ffff905b9fc00000(0000)
               knlGS:0000000000000000
[  362.853227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.858996] CR2: 0000000000000038 CR3: 00000004627a6003 CR4: 0000000000770ef0
[  362.866161] PKRU: fffffffc
[  362.868890] Call Trace:
[  362.871363]  ? show_device_domain_translation+0x32/0x100
[  362.876700]  ? bind_store+0x110/0x110
[  362.880387]  ? klist_next+0x91/0x120
[  362.883987]  ? domain_translation_struct_show+0x50/0x50
[  362.889237]  bus_for_each_dev+0x79/0xc0
[  362.893121]  domain_translation_struct_show+0x36/0x50
[  362.898204]  seq_read+0x135/0x410
[  362.901545]  ? handle_mm_fault+0xeb8/0x1750
[  362.905755]  full_proxy_read+0x5c/0x90
[  362.909526]  vfs_read+0xa6/0x190
[  362.912782]  ksys_read+0x61/0xe0
[  362.916037]  __x64_sys_read+0x1a/0x20
[  362.919725]  do_syscall_64+0x37/0x80
[  362.923329]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  362.928405] RIP: 0033:0x7f84855c5e95

Filter out those devices to avoid such error.

Fixes: e2726da ("iommu/vt-d: debugfs: Add support to show page table internals")
Reported-and-tested-by: Xu Pengfei <pengfei.xu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: stable@vger.kernel.org#v5.6+
Link: https://lore.kernel.org/r/20201028070725.24979-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
melver pushed a commit that referenced this pull request Nov 10, 2020
The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
function to destroy the ACL content buffer.  It's an afs_acl struct, not
a yfs_acl struct - and the free function for latter may pass invalid
pointers to kfree().

Fix this by using the afs_acl_put() function.  The yfs_acl_put()
function is then no longer used and can be removed.

	general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
	...
	RIP: 0010:compound_head+0x0/0x11
	...
	Call Trace:
	 virt_to_cache+0x8/0x51
	 kfree+0x5d/0x79
	 yfs_free_opaque_acl+0x16/0x29
	 afs_put_operation+0x60/0x114
	 __vfs_setxattr+0x67/0x72
	 __vfs_setxattr_noperm+0x66/0xe9
	 vfs_setxattr+0x67/0xce
	 setxattr+0x14e/0x184
	 __do_sys_fsetxattr+0x66/0x8f
	 do_syscall_64+0x2d/0x3a
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e49c7b2 ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
melver pushed a commit that referenced this pull request Nov 10, 2020
…lone()

All user space generated SKBs are owned by a socket (unless injected into the
key via AF_PACKET). If a socket is closed, all associated skbs will be cleaned
up.

This leads to a problem when a CAN driver calls can_put_echo_skb() on a
unshared SKB. If the socket is closed prior to the TX complete handler,
can_get_echo_skb() and the subsequent delivering of the echo SKB to all
registered callbacks, a SKB with a refcount of 0 is delivered.

To avoid the problem, in can_get_echo_skb() the original SKB is now always
cloned, regardless of shared SKB or not. If the process exists it can now
safely discard its SKBs, without disturbing the delivery of the echo SKB.

The problem shows up in the j1939 stack, when it clones the incoming skb, which
detects the already 0 refcount.

We can easily reproduce this with following example:

testj1939 -B -r can0: &
cansend can0 1823ff40#0123

WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
refcount_t: addition on 0; use-after-free.
Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan
CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
Backtrace:
[<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24)
[<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
[<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
[<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc)
[<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174)
[<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210)
[<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268)
[<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
[<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
[<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88)
[<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94)
[<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8)
[<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c)
[<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4)

Fixes: 0ae89be ("can: add destructor for self generated skbs")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
melver pushed a commit that referenced this pull request Nov 10, 2020
The atomic check hooks must look up the encoder to be used with a
connector from the connector's atomic state, and not assume that it's
the connector's current attached encoder. The latter one can change
under the atomic check func, or can be unset yet as in the case of MST
connectors.

This fixes
[    7.940719] Oops: 0000 [#1] SMP NOPTI
[    7.944407] CPU: 2 PID: 143 Comm: kworker/2:2 Not tainted 5.6.0-1023-oem #23-Ubuntu
[    7.952102] Hardware name: Dell Inc. Latitude 7320/, BIOS 88.87.11 09/07/2020
[    7.959278] Workqueue: events output_poll_execute [drm_kms_helper]
[    7.965511] RIP: 0010:intel_psr_atomic_check+0x37/0xa0 [i915]
[    7.971327] Code: 80 2d 06 00 00 20 74 42 80 b8 34 71 00 00 00 74 39 48 8b 72 08 48 85 f6 74 30 80 b8 f8 71 00 00 00 74 27 4c 8b 87 80 04 00 00 <41> 8b 78 78 83 ff 08 77 19 31 c9 83 ff 05 77 19 48 81 c1 20 01 00
[    7.977541] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input5
[    7.990154] RSP: 0018:ffffb864c073fac8 EFLAGS: 00010202
[    7.990155] RAX: ffff8c5d55ce0000 RBX: ffff8c5d54519000 RCX: 0000000000000000
[    7.990155] RDX: ffff8c5d55cb30c0 RSI: ffff8c5d89a0c800 RDI: ffff8c5d55fcf800
[    7.990156] RBP: ffffb864c073fac8 R08: 0000000000000000 R09: ffff8c5d55d9f3a0
[    7.990156] R10: ffff8c5d55cb30c0 R11: 0000000000000009 R12: ffff8c5d55fcf800
[    7.990156] R13: ffff8c5d55cb30c0 R14: ffff8c5d56989cc0 R15: ffff8c5d56989cc0
[    7.990158] FS:  0000000000000000(0000) GS:ffff8c5d8e480000(0000) knlGS:0000000000000000
[    8.047193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.052970] CR2: 0000000000000078 CR3: 0000000856500005 CR4: 0000000000760ee0
[    8.060137] PKRU: 55555554
[    8.062867] Call Trace:
[    8.065361]  intel_digital_connector_atomic_check+0x53/0x130 [i915]
[    8.071703]  intel_dp_mst_atomic_check+0x5b/0x200 [i915]
[    8.077074]  drm_atomic_helper_check_modeset+0x1db/0x790 [drm_kms_helper]
[    8.083942]  intel_atomic_check+0x92/0xc50 [i915]
[    8.088705]  ? drm_plane_check_pixel_format+0x4f/0xb0 [drm]
[    8.094345]  ? drm_atomic_plane_check+0x7a/0x3a0 [drm]
[    8.099548]  drm_atomic_check_only+0x2b1/0x450 [drm]
[    8.104573]  drm_atomic_commit+0x18/0x50 [drm]
[    8.109070]  drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
[    8.115056]  drm_client_modeset_commit_force+0x55/0x160 [drm]
[    8.120866]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0 [drm_kms_helper]
[    8.128415]  drm_fb_helper_set_par+0x34/0x50 [drm_kms_helper]
[    8.134225]  drm_fb_helper_hotplug_event.part.0+0xb4/0xe0 [drm_kms_helper]
[    8.141150]  drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper]
[    8.147481]  intel_fbdev_output_poll_changed+0x6f/0xa0 [i915]
[    8.153287]  drm_kms_helper_hotplug_event+0x2c/0x40 [drm_kms_helper]
[    8.159709]  output_poll_execute+0x1aa/0x1c0 [drm_kms_helper]
[    8.165506]  process_one_work+0x1e8/0x3b0
[    8.169561]  worker_thread+0x4d/0x400
[    8.173249]  kthread+0x104/0x140
[    8.176515]  ? process_one_work+0x3b0/0x3b0
[    8.180726]  ? kthread_park+0x90/0x90
[    8.184416]  ret_from_fork+0x1f/0x40

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2361
References: https://gitlab.freedesktop.org/drm/intel/-/issues/2486
Reported-by: William Tseng <william.tseng@intel.com>
Reported-by: Cooper Chiou <cooper.chiou@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201027160928.3665377-1-imre.deak@intel.com
(cherry picked from commit 00e5deb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
melver pushed a commit that referenced this pull request Nov 10, 2020
Mimic the pre-existing ACPI and Device Tree event log behavior by not
creating the binary_bios_measurements file when the EFI TPM event log is
empty.

This fixes the following NULL pointer dereference that can occur when
reading /sys/kernel/security/tpm0/binary_bios_measurements after the
kernel received an empty event log from the firmware:

 BUG: kernel NULL pointer dereference, address: 000000000000002c
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 2 PID: 3932 Comm: fwupdtpmevlog Not tainted 5.9.0-00003-g629990edad62 #17
 Hardware name: LENOVO 20LCS03L00/20LCS03L00, BIOS N27ET38W (1.24 ) 11/28/2019
 RIP: 0010:tpm2_bios_measurements_start+0x3a/0x550
 Code: 54 53 48 83 ec 68 48 8b 57 70 48 8b 1e 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 82 c0 06 00 00 48 8b 8a c8 06 00 00 <44> 8b 60 1c 48 89 4d a0 4c 89 e2 49 83 c4 20 48 83 fb 00 75 2a 49
 RSP: 0018:ffffa9c901203db0 EFLAGS: 00010246
 RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000010
 RDX: ffff8ba1eb99c000 RSI: ffff8ba1e4ce8280 RDI: ffff8ba1e4ce8258
 RBP: ffffa9c901203e40 R08: ffffa9c901203dd8 R09: ffff8ba1ec443300
 R10: ffffa9c901203e50 R11: 0000000000000000 R12: ffff8ba1e4ce8280
 R13: ffffa9c901203ef0 R14: ffffa9c901203ef0 R15: ffff8ba1e4ce8258
 FS:  00007f6595460880(0000) GS:ffff8ba1ef880000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000002c CR3: 00000007d8d18003 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  ? __kmalloc_node+0x113/0x320
  ? kvmalloc_node+0x31/0x80
  seq_read+0x94/0x420
  vfs_read+0xa7/0x190
  ksys_read+0xa7/0xe0
  __x64_sys_read+0x1a/0x20
  do_syscall_64+0x37/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

In this situation, the bios_event_log pointer in the tpm_bios_log struct
was not NULL but was equal to the ZERO_SIZE_PTR (0x10) value. This was
due to the following kmemdup() in tpm_read_log_efi():

int tpm_read_log_efi(struct tpm_chip *chip)
{
...
	/* malloc EventLog space */
	log->bios_event_log = kmemdup(log_tbl->log, log_size, GFP_KERNEL);
	if (!log->bios_event_log) {
		ret = -ENOMEM;
		goto out;
	}
...
}

When log_size is zero, due to an empty event log from firmware,
ZERO_SIZE_PTR is returned from kmemdup(). Upon a read of the
binary_bios_measurements file, the tpm2_bios_measurements_start()
function does not perform a ZERO_OR_NULL_PTR() check on the
bios_event_log pointer before dereferencing it.

Rather than add a ZERO_OR_NULL_PTR() check in functions that make use of
the bios_event_log pointer, simply avoid creating the
binary_bios_measurements_file as is done in other event log retrieval
backends.

Explicitly ignore all of the events in the final event log when the main
event log is empty. The list of events in the final event log cannot be
accurately parsed without referring to the first event in the main event
log (the event log header) so the final event log is useless in such a
situation.

Fixes: 58cc1e4 ("tpm: parse TPM event logs based on EFI table")
Link: https://lore.kernel.org/linux-integrity/E1FDCCCB-CA51-4AEE-AC83-9CDE995EAE52@canonical.com/
Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Reported-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: Thiébaud Weksteen <tweek@google.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet