Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mwifiex: add fw reset quirks for Surface devices #58

Merged
merged 4 commits into from
Sep 30, 2020

Conversation

kitakar5525
Copy link
Member

@kitakar5525 kitakar5525 commented Sep 24, 2020

This PR adds

  1. DMI tables of Surface devices (first and second patch)
  2. fw reset quirk for Surface gen4+ devices (third patch)
  3. fw reset quirk for Surface 3 (fourth patch)

While here, created new files for quirks (mwifiex/pcie_quirks.c and mwifiex/pcie_quirks.h) because the changes are a little bit too big to add into pcie.c.

For Surface gen4+ devices, we can reset fw by putting wifi into D3cold then bring back into D0 (power-cycle).

For Surface 3, we can reset fw by calling _DSM method available in \_SB.WSID device (https://github.com/linux-surface/acpidumps/blob/05cba925f3a515f222acb5b3551a032ddde958fe/surface_3/dsdt.dsl#L11947-L12011).

Note that for Surface Pro 3, there is the same WSID device exists (https://github.com/linux-surface/acpidumps/blob/05cba925f3a515f222acb5b3551a032ddde958fe/surface_pro_3/dsdt.dsl#L12164-L12216). However, I'm not sure if it will work in the same way as Surface 3. So, I didn't add support for SP3 in this PR. Might look into it later.

@kitakar5525
Copy link
Member Author

You can trigger the card reset manually via sysfs entry:

$ echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

Copy link
Member Author

@kitakar5525 kitakar5525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mwifiex: pcie: add reset_wsid quirk for Surface 3

Forgot to mention that why this quirk is required.

To reset mwifiex on Surface 3, it seems that calling the _DSM method
exists in \_SB.WSID [1] device is required.

[1] https://github.com/linux-surface/acpidumps/blob/05cba925f3a515f222acb5b3551a032ddde958fe/surface_3/dsdt.dsl#L11947-L12011

EDIT: Added.

Copy link
Member Author

@kitakar5525 kitakar5525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mwifiex: pcie: add reset_wsid quirk for Surface 3

Hmm, some indentations look weird. I'll fix them later.

EDIT: Fixed. (Changed tab size setting to 8 in my editor)

@kitakar5525 kitakar5525 force-pushed the mwifiex/fw-reset-quirk branch 3 times, most recently from f0edf12 to 43662e8 Compare September 24, 2020 14:07
@kitakar5525
Copy link
Member Author

Addressed the following checkpatch warnings:
mwifiex: pcie: add Surface devices DMI table for quirks

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#33: FILE: drivers/net/wireless/marvell/mwifiex/pcie_quirks.h:1:
+/*

mwifiex: pcie: add reset_d3cold quirk for Surface gen4+ devices

WARNING: Block comments use a trailing */ on a separate line
#53: FILE: drivers/net/wireless/marvell/mwifiex/pcie.c:398:
+	 * before performing FLR */

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#66: FILE: drivers/net/wireless/marvell/mwifiex/pcie_quirks.c:1:
+/*

WARNING: Block comments use a trailing */ on a separate line
#109: FILE: drivers/net/wireless/marvell/mwifiex/pcie_quirks.c:44:
+	 * "Cannot transition to power state D0 for parent in D3hot". */

mwifiex: pcie: add reset_wsid quirk for Surface 3

WARNING: Block comments use a trailing */ on a separate line
#39: FILE: drivers/net/wireless/marvell/mwifiex/pcie.c:2830:
+	 * reset-related works. */

Ignoring WARNING: networking block comments don't use an empty /* line, use /* Comment... at top of files in accordance with the other upstream files:

WARNING: networking block comments don't use an empty /* line, use /* Comment...
#35: FILE: drivers/net/wireless/marvell/mwifiex/pcie_quirks.h:3:
+/*
+ * Header file for PCIe quirks.

drivers/net/wireless/marvell/mwifiex/pcie_quirks.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.h Outdated Show resolved Hide resolved
@kitakar5525 kitakar5525 changed the title mwifiex: add fw reset quirks for Surface devices WIP: mwifiex: add fw reset quirks for Surface devices Sep 25, 2020
@kitakar5525
Copy link
Member Author

Thanks for the review! I've pushed four new patches.

The first two patches addressed the review.
The rest is for simplification/upstreaming.

If they look OK, I'll squash the patches.

Copy link
Member

@qzed qzed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good, a couple of nitpicks.

drivers/net/wireless/marvell/mwifiex/pcie.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.h Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.c Outdated Show resolved Hide resolved
drivers/net/wireless/marvell/mwifiex/pcie_quirks.c Outdated Show resolved Hide resolved
@kitakar5525
Copy link
Member Author

Squashed patches with addressing the review feedbacks:

  • moved mwifiex_initialize_quirks() before mwifiex_add_card()
  • switched to using GUID_INIT()
  • changed acpi_evaluate_dsm() failure return value to -EFAULT
  • added ACPI_FREE(obj); after acpi_evaluate_dsm() success

Other changes:

  • do not use mwifiex_dbg() to avoid troubles in pcie_quirks.c
    (e.g. to avoid using mwifiex_adapter struct before init or wifi is powered down, or causes NULL ptr deref).
    Added comment regarding this to the top of pcie_quirks.c:
    /* The low-level PCI operations will be performed in this file. Therefore,
    * let's use dev_*() instead of mwifiex_dbg() here to avoid troubles (e.g.
    * to avoid using mwifiex_adapter struct before init or wifi is powered
    * down, or causes NULL ptr deref).
    */

@kitakar5525
Copy link
Member Author

I've rebased again.

Changes:

  • Fixed changes squashed into wrong patch (the print dev_info(&pdev->dev, "no quirks applied\n"); went into second patch by mistake, now in first patch)
  • Moved comment for (do not use mwifiex_dbg() to avoid troubles in pcie_quirks.c) higher (now above #include):
    // SPDX-License-Identifier: GPL-2.0
    /*
    * File for PCIe quirks.
    */
    /* The low-level PCI operations will be performed in this file. Therefore,
    * let's use dev_*() instead of mwifiex_dbg() here to avoid troubles (e.g.
    * to avoid using mwifiex_adapter struct before init or wifi is powered
    * down, or causes NULL ptr deref).
    */
    #include <linux/acpi.h>
    #include <linux/dmi.h>
    #include "pcie_quirks.h"
  • Use result of mwifiex_pcie_set_power_d0()
    ret = mwifiex_pcie_set_power_d0(parent_pdev);
    if (ret)
    return ret;
    ret = mwifiex_pcie_set_power_d0(pdev);
    if (ret)
    return ret;

@kitakar5525
Copy link
Member Author

kitakar5525 commented Sep 30, 2020

Rebased with addressing review feedbacks:

  • dropped QUIRK_NONE and used .driver_data = 0 for the skeleton
  • dropped path variable in mwifiex_pcie_reset_wsid_quirk()
  • changed return value of acpi_evaluate_dsm() failure to -EIO

Other changes:

  • added _DSM check for also wifi power on function
    There was only check for wifi power off function before, so also add for on accordingly
    if (!acpi_check_dsm(handle, &wsid_dsm_guid,
    WSID_REV, WSID_FUNC_WIFI_PWR_OFF)) {
    dev_err(&pdev->dev,
    "_DSM method doesn't support wifi power off func\n");
    return -ENODEV;
    }
    if (!acpi_check_dsm(handle, &wsid_dsm_guid,
    WSID_REV, WSID_FUNC_WIFI_PWR_ON)) {
    dev_err(&pdev->dev,
    "_DSM method doesn't support wifi power on func\n");
    return -ENODEV;
    }
  • change wording no quirks applied -> no quirks enabled in mwifiex_initialize_quirks() in accordance with the other print there
  • just use "NOT operator" instead of == in mwifiex_initialize_quirks()
    card->quirks == 0 -> !card->quirks
  • moved patch mwifiex: pcie: (OEMB) add quirk for Surface 3 with broken DMI table to last so that the first three patches can be sent upstream easily

@qzed
Copy link
Member

qzed commented Sep 30, 2020

Okay, looks good to me. Feel free to merge when you're ready.

@kitakar5525
Copy link
Member Author

Thanks!

There are still issues remaining:

  1. For the reset_wsid quirk (for S3 and SP3), I think that creating a new device driver for the WSID device is better, then export the reset function, and call it from mwifiex. As the comments in code states, on S3, calling _DSM methods immediately remove/re-probe mwifiex. So, resetting the wifi externally is a better idea. (Also in this way, hopefully supporting SP3 may become easier.)

  2. After reset, ASPM settings don't get restored completely.
    sudo lspci -nnvvv diff before/after fw reset on SB1:

#
# 03:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless [11ab:2b38]
#
@@ -574,9 +574,9 @@
        Capabilities: [168 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=70us PortTPowerOnTime=10us
-               L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
-                          T_CommonMode=0us LTR1.2_Threshold=163840ns
-               L1SubCtl2: T_PwrOn=44us
+               L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
+                          T_CommonMode=0us LTR1.2_Threshold=0ns
+               L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: mwifiex_pcie
        Kernel modules: mwifiex_pcie

#
# no changes on root port of wifi regarding ASPM
#

All of the L1 substates are disabled after fw reset. LTR value is also changed.

sudo lspci -nnvvv diff before/after fw reset on S3:

#
# root port of wifi:
# 00:1c.0 PCI bridge [0604]: Intel Corporation Atom/Celeron/Pentium Processor x5-E8000/J3xxx/N3xxx Series PCI Express Port #1 [8086:22c8] (rev 20) (prog-if 00 [Normal decode])
#
@@ -103,7 +103,7 @@
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
-               LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
+               LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-

#
# no changes on wifi regarding ASPM
#

It's worse in this case. The ASPM of wifi root port is disabled after fw reset (thus breaks S0ix).

Anyway, we can improve these later. This PR just works regarding fw reset. So, I'll go this way for now.
I'll merge this later when the other open PR becomes ready.

@qzed
Copy link
Member

qzed commented Sep 30, 2020

Keep in mind that with an external driver you might need to ensure correct suspend/resume ordering though, which might complicate things. I know too little about the whole situation to say what's better here though-

@kitakar5525
Copy link
Member Author

I think what we need to do is just exposing reset function via EXPORT_SYMBOL_GPL(). So, I hope there are no issues with suspend/resume ordering, but I may be wrong though.

@qzed
Copy link
Member

qzed commented Sep 30, 2020

You'd have to make sure that reset can be called both before and after mwifiex has resumed and also executed concurrently. At least if you plan on calling it during resume. Otherwise you'd have to enforce an ordering. You could probably also call it during the complete phase, assuming that mwifiex doesn't have a handler for that phase (then it's guaranteed to run after the basic resume steps have been executed). You probably still want to make sure that it can executed concurrently though.

@kitakar5525
Copy link
Member Author

Hmm, anyway I'll try something first.

This commit adds quirk implementation based on DMI matching with DMI
table for Surface devices.

This implementation can be used for quirks later.

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
To reset mwifiex on Surface gen4+ (Pro 4 or later gen) devices, it
seems that putting the wifi device into D3cold is required according
to errata.inf file on Windows installation (Windows/INF/errata.inf).

This patch adds a function that performs power-cycle (put into D3cold
then D0) and call the function at the end of reset_prepare().

Note: Need to also reset the parent device (bridge) of wifi on SB1;
it might be because the bridge of wifi always reports it's in D3hot.
When I tried to reset only the wifi device (not touching parent), it gave
the following error and the reset failed:

    acpi device:4b: Cannot transition to power state D0 for parent in D3hot
    mwifiex_pcie 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
This commit adds reset_wsid quirk and uses this quirk for Surface 3 on
card reset.

To reset mwifiex on Surface 3, it seems that calling the _DSM method
exists in \_SB.WSID [1] device is required.

On Surface 3, calling the _DSM method removes/re-probes the card by
itself. So, need to place the reset function before performing FLR and
skip performing any other reset-related works.

Note that Surface Pro 3 also has the WSID device [2], but it seems to need
more work. This commit only supports Surface 3 yet.

[1] https://github.com/linux-surface/acpidumps/blob/05cba925f3a515f222acb5b3551a032ddde958fe/surface_3/dsdt.dsl#L11947-L12011
[2] https://github.com/linux-surface/acpidumps/blob/05cba925f3a515f222acb5b3551a032ddde958fe/surface_pro_3/dsdt.dsl#L12164-L12216

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
(made referring to http://git.osdn.net/view?p=android-x86/kernel.git;a=commitdiff;h=18e2e857c57633b25b3b4120f212224a108cd883)

On some Surface 3, the DMI table gets corrupted for unknown reasons
and breaks existing DMI matching used for device-specific quirks.

This commit adds the (broken) DMI info for the affected Surface 3.

On affected systems, DMI info will look like this:
    $ grep . /sys/devices/virtual/dmi/id/{bios_vendor,board_name,board_vendor,\
    chassis_vendor,product_name,sys_vendor}
    /sys/devices/virtual/dmi/id/bios_vendor:American Megatrends Inc.
    /sys/devices/virtual/dmi/id/board_name:OEMB
    /sys/devices/virtual/dmi/id/board_vendor:OEMB
    /sys/devices/virtual/dmi/id/chassis_vendor:OEMB
    /sys/devices/virtual/dmi/id/product_name:OEMB
    /sys/devices/virtual/dmi/id/sys_vendor:OEMB

Expected:
    $ grep . /sys/devices/virtual/dmi/id/{bios_vendor,board_name,board_vendor,\
    chassis_vendor,product_name,sys_vendor}
    /sys/devices/virtual/dmi/id/bios_vendor:American Megatrends Inc.
    /sys/devices/virtual/dmi/id/board_name:Surface 3
    /sys/devices/virtual/dmi/id/board_vendor:Microsoft Corporation
    /sys/devices/virtual/dmi/id/chassis_vendor:Microsoft Corporation
    /sys/devices/virtual/dmi/id/product_name:Surface 3
    /sys/devices/virtual/dmi/id/sys_vendor:Microsoft Corporation

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
@kitakar5525
Copy link
Member Author

kitakar5525 commented Sep 30, 2020

Rebased on top of current v5.8-surface-devel due to conflict with #59 (with correcting grammar in commit messages)

@kitakar5525 kitakar5525 changed the title WIP: mwifiex: add fw reset quirks for Surface devices mwifiex: add fw reset quirks for Surface devices Sep 30, 2020
@kitakar5525 kitakar5525 merged commit 86944d6 into v5.8-surface-devel Sep 30, 2020
@kitakar5525 kitakar5525 deleted the mwifiex/fw-reset-quirk branch September 30, 2020 15:08
kitakar5525 pushed a commit to kitakar5525/linux-kernel that referenced this pull request Nov 23, 2020
fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:

 Unable to handle kernel paging request at virtual address c7ff0e00
  pc = d0009275, ra = 90009478
 Oops: sig: 9 [#1] PREEMPT
 CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty linux-surface#58
 Workqueue: xprtiod xs_stream_data_receive_workfn
 a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
 Call Trace:
   xs_stream_data_receive_workfn+0x43c/0x770
   process_one_work+0x1a1/0x324
   worker_thread+0x1cc/0x3c0
   kthread+0x10d/0x124
   ret_from_kernel_thread+0xc/0x18

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
qzed pushed a commit that referenced this pull request Nov 24, 2020
commit 481535c upstream.

fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:

 Unable to handle kernel paging request at virtual address c7ff0e00
  pc = d0009275, ra = 90009478
 Oops: sig: 9 [#1] PREEMPT
 CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58
 Workqueue: xprtiod xs_stream_data_receive_workfn
 a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
 Call Trace:
   xs_stream_data_receive_workfn+0x43c/0x770
   process_one_work+0x1a1/0x324
   worker_thread+0x1cc/0x3c0
   kthread+0x10d/0x124
   ret_from_kernel_thread+0xc/0x18

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
qzed pushed a commit that referenced this pull request Nov 24, 2020
commit 481535c upstream.

fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:

 Unable to handle kernel paging request at virtual address c7ff0e00
  pc = d0009275, ra = 90009478
 Oops: sig: 9 [#1] PREEMPT
 CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58
 Workqueue: xprtiod xs_stream_data_receive_workfn
 a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
 Call Trace:
   xs_stream_data_receive_workfn+0x43c/0x770
   process_one_work+0x1a1/0x324
   worker_thread+0x1cc/0x3c0
   kthread+0x10d/0x124
   ret_from_kernel_thread+0xc/0x18

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
qzed pushed a commit that referenced this pull request Apr 25, 2022
[ Upstream commit 8d3ea3d ]

GCC12 appears to be much smarter about its dependency tracking and is
aware that the relaxed variants are just normal loads and stores and
this is causing problems like:

[  210.074549] ------------[ cut here ]------------
[  210.079223] NETDEV WATCHDOG: enabcm6e4ei0 (bcmgenet): transmit queue 1 timed out
[  210.086717] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:529 dev_watchdog+0x234/0x240
[  210.095044] Modules linked in: genet(E) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat]
[  210.146561] ACPI CPPC: PCC check channel failed for ss: 0. ret=-110
[  210.146927] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E     5.17.0-rc7G12+ #58
[  210.153226] CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters
[  210.161349] Hardware name: Raspberry Pi Foundation Raspberry Pi 4 Model B/Raspberry Pi 4 Model B, BIOS EDK2-DEV 02/08/2022
[  210.161353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  210.161358] pc : dev_watchdog+0x234/0x240
[  210.161364] lr : dev_watchdog+0x234/0x240
[  210.161368] sp : ffff8000080a3a40
[  210.161370] x29: ffff8000080a3a40 x28: ffffcd425af87000 x27: ffff8000080a3b20
[  210.205150] x26: ffffcd425aa00000 x25: 0000000000000001 x24: ffffcd425af8ec08
[  210.212321] x23: 0000000000000100 x22: ffffcd425af87000 x21: ffff55b142688000
[  210.219491] x20: 0000000000000001 x19: ffff55b1426884c8 x18: ffffffffffffffff
[  210.226661] x17: 64656d6974203120 x16: 0000000000000001 x15: 6d736e617274203a
[  210.233831] x14: 2974656e65676d63 x13: ffffcd4259c300d8 x12: ffffcd425b07d5f0
[  210.241001] x11: 00000000ffffffff x10: ffffcd425b07d5f0 x9 : ffffcd4258bdad9c
[  210.248171] x8 : 00000000ffffdfff x7 : 000000000000003f x6 : 0000000000000000
[  210.255341] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000001000
[  210.262511] x2 : 0000000000001000 x1 : 0000000000000005 x0 : 0000000000000044
[  210.269682] Call trace:
[  210.272133]  dev_watchdog+0x234/0x240
[  210.275811]  call_timer_fn+0x3c/0x15c
[  210.279489]  __run_timers.part.0+0x288/0x310
[  210.283777]  run_timer_softirq+0x48/0x80
[  210.287716]  __do_softirq+0x128/0x360
[  210.291392]  __irq_exit_rcu+0x138/0x140
[  210.295243]  irq_exit_rcu+0x1c/0x30
[  210.298745]  el1_interrupt+0x38/0x54
[  210.302334]  el1h_64_irq_handler+0x18/0x24
[  210.306445]  el1h_64_irq+0x7c/0x80
[  210.309857]  arch_cpu_idle+0x18/0x2c
[  210.313445]  default_idle_call+0x4c/0x140
[  210.317470]  cpuidle_idle_call+0x14c/0x1a0
[  210.321584]  do_idle+0xb0/0x100
[  210.324737]  cpu_startup_entry+0x30/0x8c
[  210.328675]  secondary_start_kernel+0xe4/0x110
[  210.333138]  __secondary_switched+0x94/0x98

The assumption when these were relaxed seems to be that device memory
would be mapped non reordering, and that other constructs
(spinlocks/etc) would provide the barriers to assure that packet data
and in memory rings/queues were ordered with respect to device
register reads/writes. This itself seems a bit sketchy, but the real
problem with GCC12 is that it is moving the actual reads/writes around
at will as though they were independent operations when in truth they
are not, but the compiler can't know that. When looking at the
assembly dumps for many of these routines its possible to see very
clean, but not strictly in program order operations occurring as the
compiler would be free to do if these weren't actually register
reads/write operations.

Its possible to suppress the timeout with a liberal bit of dma_mb()'s
sprinkled around but the device still seems unable to reliably
send/receive data. A better plan is to use the safer readl/writel
everywhere.

Since this partially reverts an older commit, which notes the use of
the relaxed variants for performance reasons. I would suggest that
any performance problems with this commit are targeted at relaxing only
the performance critical code paths after assuring proper barriers.

Fixes: 69d2ea9 ("net: bcmgenet: Use correct I/O accessors")
Reported-by: Peter Robinson <pbrobinson@gmail.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Acked-by: Peter Robinson <pbrobinson@gmail.com>
Tested-by: Peter Robinson <pbrobinson@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220310045358.224350-1-jeremy.linton@arm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
iwanders pushed a commit to iwanders/linux-surface-kernel that referenced this pull request Mar 2, 2024
When trying to use copy_from_kernel_nofault() to read vsyscall page
through a bpf program, the following oops was reported:

  BUG: unable to handle page fault for address: ffffffffff600000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
  Oops: 0000 [linux-surface#1] PREEMPT SMP PTI
  CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ linux-surface#58
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
  ......
  Call Trace:
   <TASK>
   ? copy_from_kernel_nofault+0x6f/0x110
   bpf_probe_read_kernel+0x1d/0x50
   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
   trace_call_bpf+0xc5/0x1c0
   perf_call_bpf_enter.isra.0+0x69/0xb0
   perf_syscall_enter+0x13e/0x200
   syscall_trace_enter+0x188/0x1c0
   do_syscall_64+0xb5/0xe0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
  ......
  ---[ end trace 0000000000000000 ]---

The oops is triggered when:

1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
page and invokes copy_from_kernel_nofault() which in turn calls
__get_user_asm().

2) Because the vsyscall page address is not readable from kernel space,
a page fault exception is triggered accordingly.

3) handle_page_fault() considers the vsyscall page address as a user
space address instead of a kernel space address. This results in the
fix-up setup by bpf not being applied and a page_fault_oops() is invoked
due to SMAP.

Considering handle_page_fault() has already considered the vsyscall page
address as a userspace address, fix the problem by disallowing vsyscall
page read for copy_from_kernel_nofault().

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
@katamayko
Copy link

friends
i installed debian 12 for the first time this week and i'm having wifi problems on my surface pro 4
I don't understand how to overcome this, can you give me a step-by-step guide if there is a solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants