Has anyone fixed the wifi problem yet? #70

Closed
TimmyOl opened this issue Feb 3, 2020 · 88 comments

Comments

@TimmyOl

TimmyOl commented Feb 3, 2020

The problem that has existed since the jakeday kernel, and in all of your kernels, still persists, and no one seems to have found a fix.

The wifi stops after some amount of time. In Ubuntu, restarting the network service every 5 minutes or so works, but it's not a solution... On Pop OS a full reboot is required each time the wifi drops.

From what I've found in all the posts, it's probably the driver for the card, mwifiex_pcie, that crashes.

I've tried running ndiswrapper with drivers from Microsoft's Surface driver pack, but with no success.

Environment

I'm currently running kernel 5.3 on Pop OS, but I've also tried kernel 5.15 on Ubuntu.

  • Hardware model:
    Hardware: Surface Book 2
  • Kernel version:
    Kernel is 5.3, the last one with touch support.
  • Distribution:
    Running Pop OS; also tried Ubuntu 18 and 19.
`dmesg` output
please provide a copy of `dmesg` here if possible

@leonm1

leonm1 commented Feb 3, 2020

You should put the following in /etc/NetworkManager/conf.d/99-surface.conf if you use the default networking tool in Ubuntu and PopOS, called NetworkManager.

[connection]
wifi.powersave = 2

[device]
wifi.scan-rand-mac-address=false
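
A minimal sketch of creating that file from a shell, for anyone who prefers not to use an editor. The function name write_surface_nm_conf is made up for illustration; it takes the target path as an argument so it can be tried on a scratch file first. In NetworkManager, a wifi.powersave value of 2 means "disable".

```shell
# Hypothetical helper: write the NetworkManager drop-in shown above.
# $1 is the target file; writing under /etc needs root.
write_surface_nm_conf() {
    cat > "$1" <<'EOF'
[connection]
wifi.powersave = 2

[device]
wifi.scan-rand-mac-address=false
EOF
}

# Example (as root), followed by a NetworkManager restart:
#   write_surface_nm_conf /etc/NetworkManager/conf.d/99-surface.conf
#   systemctl restart NetworkManager
```

The restart is there because NetworkManager is assumed to read its conf.d files at startup.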

If you don't use NetworkManager, you can follow this wiki entry:
https://github.com/linux-surface/linux-surface/wiki/Known-Issues#mwifiex-you-may-need-to-disable-wifi-power_save-manually-if-you-dont-use-networkmanager

@TimmyOl
Author

TimmyOl commented Feb 4, 2020

Already done that.

So, some of the things that don't work:

  • Setting wifi power save to 2
  • Turning random MAC address off
  • Setting the BSSID in NetworkManager
  • Disabling NetworkManager and using wicd
  • Using ndiswrapper to use the Windows driver instead of mwifiex_pcie
  • Disabling Bluetooth in the BIOS

I've tried literally all of the solutions that Google can offer and none work, and it seems I'm far from the only one.

I love running Linux on this machine, which I've done since the jakeday kernel 3 or so, with the wifi problem always present. I'm getting tired of it, and I'm leaning towards reverting to Windows on this one if it's an unfixable problem 😔😔

@TimmyOl
Author

TimmyOl commented Feb 4, 2020

My dmesg when it's working:
[ 72.805710] usb 1-6: New USB device found, idVendor=1286, idProduct=204c, bcdDevice=32.01
[ 72.805715] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 72.805718] usb 1-6: Product: Bluetooth and Wireless LAN Composite Device
[ 72.805720] usb 1-6: Manufacturer: Marvell
[ 72.805723] usb 1-6: SerialNumber: 0000000000000000
[ 72.832316] alg: No test for fips(ansi_cprng) (fips_ansi_cprng)
[ 72.856040] Bluetooth: Core ver 2.22
[ 72.856061] NET: Registered protocol family 31
[ 72.856061] Bluetooth: HCI device and connection manager initialized
[ 72.856065] Bluetooth: HCI socket layer initialized
[ 72.856066] Bluetooth: L2CAP socket layer initialized
[ 72.856069] Bluetooth: SCO socket layer initialized
[ 72.860342] usbcore: registered new interface driver btusb
[ 72.868075] Bluetooth: hci0: unexpected event for opcode 0x0000
[ 72.868320] Bluetooth: hci0: unexpected event for opcode 0x0000
[ 72.884832] mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 72.896896] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 72.896897] Bluetooth: BNEP filters: protocol multicast
[ 72.896901] Bluetooth: BNEP socket layer initialized
[ 72.942023] mwifiex_pcie 0000:01:00.0: Unknown api_id: 4
[ 72.974971] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
[ 72.974978] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
[ 72.987254] mwifiex_pcie 0000:01:00.0 wlp1s0: renamed from mlan0
[ 73.026013] Bluetooth: RFCOMM TTY layer initialized
[ 73.026018] Bluetooth: RFCOMM socket layer initialized
[ 73.026022] Bluetooth: RFCOMM ver 1.11
[ 77.309481] mwifiex_pcie 0000:01:00.0: info: trying to associate to 'Nelly for pre5ident' bssid dc:53:7c:8e:01:de
[ 77.473228] mwifiex_pcie 0000:01:00.0: info: associated to bssid dc:53:7c:8e:01:de successfully
[ 77.523942] IPv6: ADDRCONF(NETDEV_CHANGE): wlp1s0: link becomes ready

@RangerMauve

Having this problem too. :) I'm using the Surface Pro 3, and I'm not sure if I got the kernel installed properly.

@TimmyOl
Author

TimmyOl commented Feb 5, 2020

Having this problem too. :) I'm using the Surface Pro 3, and I'm not sure if I got the kernel installed properly.

Run uname -a to see which kernel you are using. It should say something like "Linux <your hostname (in my case pop-os)> 5.3.18-surface", followed by the build date and whether it is 64 or 32 bit.
The important thing is that it shows your version followed by "surface", and not a version followed by "generic".
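
That check can be sketched as a tiny script. It only assumes what is said above, namely that a linux-surface kernel has "surface" in its release string; the function name is_surface_kernel is made up for illustration.

```shell
# Hypothetical helper: succeed if a kernel release string looks like a
# linux-surface build (contains "surface"), fail otherwise.
is_surface_kernel() {
    case "$1" in
        *surface*) return 0 ;;
        *) return 1 ;;
    esac
}

# uname -r prints only the release field of uname -a.
if is_surface_kernel "$(uname -r)"; then
    echo "surface kernel is running: $(uname -r)"
else
    echo "not a surface kernel: $(uname -r)"
fi
```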

@RangerMauve

@TimmyOl Yeah, doesn't seem like I did it right after all. 😅

Linux crapheap 5.3.0-7625-generic #27~1576774560~19.10~f432cd8-Ubuntu SMP Thu Dec 19 20:35:37 UTC  x86_64 x86_64 x86_64 GNU/Linux

@sebanc

sebanc commented Feb 5, 2020

@TimmyOl Could you try building the kernel with the below patch and confirm if it solves your issue ?

--- a/net/wireless/nl80211.c	2019-07-08 00:41:56.000000000 +0200
+++ b/net/wireless/nl80211.c	2020-02-05 19:30:26.352718504 +0100
@@ -10517,10 +10520,7 @@
 	if (!rdev->ops->set_power_mgmt)
 		return -EOPNOTSUPP;
 
-	state = (ps_state == NL80211_PS_ENABLED) ? true : false;
-
-	if (state == wdev->ps)
-		return 0;
+	state = false;
 
 	err = rdev_set_power_mgmt(rdev, dev, state, wdev->ps_timeout);
 	if (!err)

@TimmyOl
Author

TimmyOl commented Feb 5, 2020

@TimmyOl Could you try building the kernel with the below patch and confirm if it solves your issue ?

--- a/net/wireless/nl80211.c	2019-07-08 00:41:56.000000000 +0200
+++ b/net/wireless/nl80211.c	2020-02-05 19:30:26.352718504 +0100
@@ -10517,10 +10520,7 @@
 	if (!rdev->ops->set_power_mgmt)
 		return -EOPNOTSUPP;
 
-	state = (ps_state == NL80211_PS_ENABLED) ? true : false;
-
-	if (state == wdev->ps)
-		return 0;
+	state = false;
 
 	err = rdev_set_power_mgmt(rdev, dev, state, wdev->ps_timeout);
 	if (!err)

I will try as soon as I have the time, ill let you know when its done 😁👍

@mmalmeida

I am suffering from the same in my SB2 with Ubuntu. To make matters worse, sometimes the wifi just disappears after 5-10 minutes usage, which makes work rather unproductive as it requires a full restart.

@TimmyOl
Author

TimmyOl commented Feb 7, 2020

I am suffering from the same in my SB2 with Ubuntu. To make matters worse, sometimes the wifi just disappears after 5-10 minutes usage, which makes work rather unproductive as it requires a full restart.

It's the same for me.
Have you tried all the things I said I tried above? They seem to help for some.

Follow the guide leonm1 linked: https://github.com/linux-surface/linux-surface/wiki/Known-Issues#mwifiex-you-may-need-to-disable-wifi-power_save-manually-if-you-dont-use-networkmanager

I use network-manager, but this seems to have made it a little more stable.

I will try building the kernel with the patch from sebanc as soon as I have the time. It may be some days until I can try it, and I'm not 100% sure how to apply the patch, so I have to research that first too 😁

@mmalmeida

I am testing it while working today (plugged to a/c). What I did today was turn "dim screen when inactive" and "bluetooth" to off, and the wifi kept stable for about 3-4 hours, until I detached the screen. When I reattached the screen wifi had disappeared and had to restart.

@TimmyOl have you noticed anything similar in your tests?

@TimmyOl
Author

TimmyOl commented Feb 7, 2020

I am testing it while working today (plugged to a/c). What I did today was turn "dim screen when inactive" and "bluetooth" to off, and the wifi kept stable for about 3-4 hours, until I detached the screen. When I reattached the screen wifi had disappeared and had to restart.

@TimmyOl have you noticed anything similar in your tests?

If I have Bluetooth on, the speed of at least 5 GHz wifi is 0.6mb, and with Bluetooth off it's about 114mb, but it doesn't disconnect now for some reason. I set power save to 2 in all the configs there are in Linux, and in the settings for the network I go to the Identity tab, set all the boxes from the drop-down, and set the cloned address to permanent.
Also disable IPv6 and follow the guide in the link to set a permanent MAC.

@mmalmeida

mmalmeida commented Feb 7, 2020

syslog.txt
For reference (might be useful for debug/testing) I attach today's syslog. Most recent wifi vanishing occurrence was seconds after Feb 7 15:20:30

@mmalmeida

I am testing it while working today (plugged to a/c). What I did today was turn "dim screen when inactive" and "bluetooth" to off, and the wifi kept stable for about 3-4 hours, until I detached the screen. When I reattached the screen wifi had disappeared and had to restart.
@TimmyOl have you noticed anything similar in your tests?

If I have bluetooth on the speed of atleast 5ghz wifi is 0.6mb and with bluetooth off its about 114mb but it doesnt disconnect now for some reason, i set power save 2 in all configs there is in linux and in settings for the network i go to the identity tab and set all the boxes from the drop down and on cloned adress i set permanent.
Also disable ipv6 and follow the guide in the link to set permanent mac

Do you mean having the bluetooth on makes the wifi speed crap? Is there a ticket for this (if not, maybe you can create it as a separate issue?)

@TimmyOl
Author

TimmyOl commented Feb 7, 2020

I am testing it while working today (plugged to a/c). What I did today was turn "dim screen when inactive" and "bluetooth" to off, and the wifi kept stable for about 3-4 hours, until I detached the screen. When I reattached the screen wifi had disappeared and had to restart.
@TimmyOl have you noticed anything similar in your tests?

If I have bluetooth on the speed of atleast 5ghz wifi is 0.6mb and with bluetooth off its about 114mb but it doesnt disconnect now for some reason, i set power save 2 in all configs there is in linux and in settings for the network i go to the identity tab and set all the boxes from the drop down and on cloned adress i set permanent.
Also disable ipv6 and follow the guide in the link to set permanent mac

Do you mean having the bluetooth on makes the wifi speed crap? Is there a ticket for this (if not, maybe you can create it as a separate issue?)

Yes, that is correct. I will check whether there is an issue; otherwise I'll create one.

@kitakar5525
Member

It might be useful to check the power_save state whether it is really turned off:

iw dev mlan0 get power_save

(The device name mlan0 may vary depending on the environment. On my system the name is, for some reason, not renamed to wlp*s0, but usually it is wlp1s0, wlp2s0 or wlp3s0. Check the "Interface" name in the output of iw dev.)
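
To avoid guessing the name, the interface can be pulled out of the iw dev output and fed back into the power_save query. This is only a sketch: list_wifi_interfaces is a made-up helper, and it assumes the usual iw dev layout where each interface appears on a line starting with "Interface".

```shell
# Hypothetical helper: print interface names found in `iw dev` output,
# which is read from stdin.
list_wifi_interfaces() {
    awk '$1 == "Interface" { print $2 }'
}

# Example: query power_save for each interface that iw reports.
#   for ifc in $(iw dev | list_wifi_interfaces); do
#       iw dev "$ifc" get power_save
#   done
```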

@kitakar5525
Member

Regarding Bluetooth, if you paired Surface Pen (or any BLE devices), try unpairing it.
At least unpairing Surface Pen helped the wifi speed issue on SB1 even when Bluetooth is on.

@TimmyOl
Author

TimmyOl commented Feb 7, 2020

Regarding Bluetooth, if you paired Surface Pen (or any BLE devices), try unpairing it.
At least unpairing Surface Pen helped the wifi speed issue on SB1 even when Bluetooth is on.

It's off. I've heard of this and seen it before on the jakeday kernel: if I pair a mouse the speed is slow, but if I move the mouse it speeds up.

I have a Pen, but it's not connected yet.

@sebanc

sebanc commented Feb 7, 2020

I cannot reproduce the issue as I do not have Bluetooth LE devices, but does turning off autosuspend on the USB Bluetooth module help?

  1. Identify your bluetooth adapter by looking at the "product" file in /sys/bus/usb/devices/X-X/ (replace X-X by your different devices)
  2. Turn off autosuspend:
echo on | sudo tee /sys/bus/usb/devices/X-X/power/control
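
The two steps can be combined into one loop, sketched below. The helper name disable_bt_autosuspend is made up, and matching on the word "Bluetooth" in the product file is an assumption based on the product string in the dmesg output earlier in this thread ("Bluetooth and Wireless LAN Composite Device").

```shell
# Hypothetical helper: for every USB device under $1 whose "product" file
# mentions Bluetooth, write "on" to power/control (turns autosuspend off).
# $1 defaults to the real sysfs location; writing there needs root.
disable_bt_autosuspend() {
    root="${1:-/sys/bus/usb/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/product" ] || continue
        if grep -qi bluetooth "$dev/product"; then
            echo on > "$dev/power/control"
            echo "autosuspend off for ${dev##*/}"
        fi
    done
}
```

Run as root with no argument to act on the real devices. The setting is a sysfs write, so it is expected to be lost on reboot.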

@mmalmeida

@TimmyOl Could you try building the kernel with the below patch and confirm if it solves your issue ?

--- a/net/wireless/nl80211.c	2019-07-08 00:41:56.000000000 +0200
+++ b/net/wireless/nl80211.c	2020-02-05 19:30:26.352718504 +0100
@@ -10517,10 +10520,7 @@
 	if (!rdev->ops->set_power_mgmt)
 		return -EOPNOTSUPP;
 
-	state = (ps_state == NL80211_PS_ENABLED) ? true : false;
-
-	if (state == wdev->ps)
-		return 0;
+	state = false;
 
 	err = rdev_set_power_mgmt(rdev, dev, state, wdev->ps_timeout);
 	if (!err)

Hey, I can try this - have a couple of hours this afternoon, but have never done this.
Can you pinpoint some quick instructions on how to do this?
I am guessing:

  1. Clone the repo: git clone --depth 1 https://github.com/linux-surface/linux-surface.git
  2. switch to a branch
  3. Apply patch
  4. Make .deb file for the patched kernel (with a name different from 5.3.18-surface ?)
  5. install kernel (in a way it is added to the list of available kernels; perhaps not selected by default)

Can someone help validate this procedure and provide instructions for 3, 4 and 5?
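
For steps 3 to 5, a generic sketch follows. It is not the linux-surface project's documented build procedure: patch -p1 and the kernel's bindeb-pkg make target are standard tools, but the source directory, the patch file name and the LOCALVERSION suffix are placeholders, and apply_kernel_patch is a made-up helper.

```shell
# Hypothetical helper: test-apply a patch to a source tree, then apply it.
# $1 = kernel source directory, $2 = absolute path to the patch file.
apply_kernel_patch() {
    (
        cd "$1" || exit 1
        # Dry run first so a patch that does not fit leaves the tree untouched.
        patch -p1 --dry-run < "$2" > /dev/null || exit 1
        patch -p1 < "$2"
    )
}

# Example (paths and version suffix are placeholders):
#   apply_kernel_patch ~/src/linux /tmp/wifi-ps-off.patch
#   cd ~/src/linux
#   make olddefconfig
#   make -j"$(nproc)" bindeb-pkg LOCALVERSION=-pstest
#   sudo dpkg -i ../linux-image-*pstest*.deb
```

A distinct LOCALVERSION is meant to give the test kernel its own name so that it installs next to the existing one instead of replacing it.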

@kitakar5525
Member

@TimmyOl
Opening another issue regarding Bluetooth is a good idea. I'm not aware of this issue opened in this repo. (Some issues are opened in jakeday repo but not in this repo)

@sebanc Unfortunately, turning off autosuspend on the usb bluetooth did not help when Surface Pen is paired and not connected.

@TimmyOl
Author

TimmyOl commented Feb 9, 2020

@TimmyOl
Opening another issue regarding Bluetooth is a good idea. I'm not aware of this issue opened in this repo. (Some issues are opened in jakeday repo but not in this repo)

@sebanc Unfortunately, turning off autosuspend on the usb bluetooth did not help when Surface Pen is paired and not connected.

Ooh OK, then I can open more issues. Many of the issues from jakeday still persist, so the suspend and hibernate issue is still there as well, and Bluetooth speakers connect as an HSP headset every time, so I have to go in manually and set it to A2DP.

@TimmyOl
Author

TimmyOl commented Feb 9, 2020

@TimmyOl
Opening another issue regarding Bluetooth is a good idea. I'm not aware of this issue opened in this repo. (Some issues are opened in jakeday repo but not in this repo)

@sebanc Unfortunately, turning off autosuspend on the usb bluetooth did not help when Surface Pen is paired and not connected.

I made a new issue for the bluetooth issue here: #78

@TimmyOl
Author

TimmyOl commented Feb 9, 2020

@TimmyOl Could you try building the kernel with the below patch and confirm if it solves your issue ?

--- a/net/wireless/nl80211.c	2019-07-08 00:41:56.000000000 +0200
+++ b/net/wireless/nl80211.c	2020-02-05 19:30:26.352718504 +0100
@@ -10517,10 +10520,7 @@
 	if (!rdev->ops->set_power_mgmt)
 		return -EOPNOTSUPP;
 
-	state = (ps_state == NL80211_PS_ENABLED) ? true : false;
-
-	if (state == wdev->ps)
-		return 0;
+	state = false;
 
 	err = rdev_set_power_mgmt(rdev, dev, state, wdev->ps_timeout);
 	if (!err)

So I have a small update. I haven't tried rebuilding the kernel yet, since I don't really know how to apply the patch and I haven't had time to learn and try it.

I have, however, seen a significant improvement since I followed this: https://github.com/linux-surface/linux-surface/wiki/Known-Issues#mwifiex-you-may-need-to-disable-wifi-power_save-manually-if-you-dont-use-networkmanager
The thing I did differently is that, even though I use network-manager now, I added power_save off to every config file I could find, and it seems one of the files did the trick. I have only had one wifi crash with Bluetooth off in some days now, and I've used the Surface as my main machine in that time.

Turning power_save off seems to work, but it had to be added to some files that none of the guides, issues or Google answers mentioned. I don't know which one did it: I first tried all the files from all the guides I could find and none worked, so I went on a scavenger hunt through the system for wifi configs, and something worked, since it's now much more stable. I'll try to find everything I did and post it here to see if it helps.

@mmalmeida

@TimmyOl do you have any status update on this?

@TimmyOl
Author

TimmyOl commented Feb 15, 2020

@TimmyOl do you have any status update on this?

No, sorry. I'm totally overrun by work and school, so I don't know when I'll have the time to look at it. I have Bluetooth off and mostly use my other computer for now 😔

@tyalie

tyalie commented Feb 20, 2020

It would be nice to be able to reproduce this issue reliably. Otherwise one can't really test if your patch above works.

@RangerMauve

It seems to be super random, but I think it happens more when I move my laptop around. 🤷

Switching to the latest kernel seemed to help, but I'm not sure since it still happens. Not sure how to reproduce it consistently either.

@mmalmeida

I used the laptop intensively for 1 day and this happened about 8 times (so around once every 30 minutes).

Since then I have used the laptop at home for short periods and it hasn't happened again.
Only differences I can think of:

  • it was connected to power when it happened frequently. have been using on battery at home
  • I am connected to a different wifi network

Maybe you can test if you spot differences between battery vs a/c power usage?

@lviggiani

Looks like this workaround fixes the issue for me - Surface Book 2 15" no-gpu, Ubuntu 20.04

Bluetooth disabled in BIOS, not sure if this is required.

Prerequisites

Install acpi-call from https://github.com/nix-community/acpi_call and build against current kernel

Procedure

All console commands to be executed as root (sudo su)

  1. Power off. After cold-booting into OS, immediately disable wi-fi from menu.
  2. Insert the acpi_call module from acpi_call build directory
    insmod ./acpi_call.ko
  3. Remove the original driver releasing the device
    rmmod mwifiex_pcie
  4. Unregister from PCI
    echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
  5. Issue a full ACPI reset to the device
    echo '\_SB.PCI0.RP01.PXSX.PRWF._RST' > /proc/acpi/call
  6. Rescan PCI to get device back online (this also re-uploads the firmware as shown in dmesg log) -
    echo 1 > /sys/bus/pci/rescan
  7. Re-enable wi-fi with OS menu

Obviously you will need to do this every time you reboot.

Please tell me if this fixes your problems and I'll work on packaging everything

Hi, lspci to me gives:
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless
so do I have to change the command at step 4 to this? (02:00.0 instead of 01:00.0)
echo 1 > /sys/bus/pci/devices/0000\:02\:00.0/remove
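
One way to avoid hard-coding 01:00.0 or 02:00.0 is to read the address from the driver's sysfs directory, where bound devices show up as entries named after their PCI address. The sketch below assumes that layout; find_mwifiex_pci_addr is a made-up name.

```shell
# Hypothetical helper: print the PCI address(es) bound to mwifiex_pcie.
# $1 defaults to the driver's sysfs directory; bound devices appear
# there as entries named like 0000:01:00.0.
find_mwifiex_pci_addr() {
    drv="${1:-/sys/bus/pci/drivers/mwifiex_pcie}"
    for entry in "$drv"/[0-9a-f][0-9a-f][0-9a-f][0-9a-f]:*; do
        [ -e "$entry" ] && echo "${entry##*/}"
    done
    return 0
}

# Example: look the address up before `rmmod mwifiex_pcie`, because the
# driver directory goes away once the module is unloaded.
#   addr=$(find_mwifiex_pci_addr)
#   rmmod mwifiex_pcie
#   echo 1 > "/sys/bus/pci/devices/$addr/remove"
```

This only answers the question about the address in step 4. Whether the ACPI path in step 5 differs between machines is not something this lookup can tell.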

@lviggiani

@grandrew I've used your script (with my modification from my previous comment) and it worked. I can now use pacman at "full speed" without any crashes so far, thanks!

@grandrew

grandrew commented Oct 2, 2020

@grandrew I've used your script (with my modification from my previous comment) and it worked. I can now use pacman at "full speed" without any crashes so far, thanks!

@lviggiani Looks like you only fixed the bus unregister line and not the ACPI line. Are you saying that ACPI reset may not be required?

@lviggiani

lviggiani commented Oct 2, 2020

Well I'm not sure... this is the script I made and that I run just after booting (and having powered off my wifi from gnome menu):

# cat fix-wifi 
#!/bin/bash

insmod ./acpi_call.ko
rmmod mwifiex_pcie
echo 1 > /sys/bus/pci/devices/0000\:02\:00.0/remove
echo '\_SB.PCI0.RP01.PXSX.PRWF._RST' > /proc/acpi/call
echo 1 > /sys/bus/pci/rescan

and after that I re-enable the network as per your instructions.
So far I have not experienced any other wifi crash, even while downloading at full speed.
The script covers your steps 2 to 6, except that it uses 02:00.0 instead of 01:00.0.
But honestly, I changed that just based on my guess that the address shown by lspci is the one to use. Is that correct?

@lviggiani

@grandrew do you mean that I should also change this
echo '\_SB.PCI0.RP01.PXSX.PRWF._RST' > /proc/acpi/call
into this?
echo '\_SB.PCI0.RP02.PXSX.PRWF._RST' > /proc/acpi/call

@kitakar5525
Member

kitakar5525 commented Oct 10, 2020

Looks like this workaround fixes the issue for me - Surface Book 2 15" no-gpu, Ubuntu 20.04

Bluetooth disabled in BIOS, not sure if this is required.

Prerequisites

Install acpi-call from https://github.com/nix-community/acpi_call and build against current kernel

Procedure

All console commands to be executed as root (sudo su)

  1. Power off. After cold-booting into OS, immediately disable wi-fi from menu.
  2. Insert the acpi_call module from acpi_call build directory
    insmod ./acpi_call.ko
  3. Remove the original driver releasing the device
    rmmod mwifiex_pcie
  4. Unregister from PCI
    echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
  5. Issue a full ACPI reset to the device
    echo '\_SB.PCI0.RP01.PXSX.PRWF._RST' > /proc/acpi/call
  6. Rescan PCI to get device back online (this also re-uploads the firmware as shown in dmesg log) -
    echo 1 > /sys/bus/pci/rescan
  7. Re-enable wi-fi with OS menu

Obviously you will need to do this every time you reboot.

Please tell me if this fixes your problems and I'll work on packaging everything

These commands (firmware reset) happen to disable all of the ASPM L1 substates, and there are reports on SP5 that disabling the L1.2 substate fixed the wifi crash. So, if these commands work, disabling some ASPM L1 substates may also work.
For example (works with kernel v5.5 or later, does not work with v4.19):

Print current wifi ASPM setting:

grep . /sys/bus/pci/drivers/mwifiex_pcie/*/link/*

and try one of the following. If possible, please try each command (reboot when you try another one) and tell me what command worked.

Disable L1.2 substate:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/l1_2*

Disable L1.1 substate:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/l1_1*

Disable all of the L1 substates:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/{l1_1*,l1_2*}

EDIT: and again if possible, please check if S0ix is still working during suspend (https://github.com/linux-surface/linux-surface/wiki/Known-Issues-and-FAQ#general-info-about-s0ix).
So, the procedure:

  1. run one of the above commands
  2. check whether the wifi crash still happens
  3. check if S0ix during suspend is still working
    sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec should increase after suspend
  4. reboot and try another command above (return to 1.)
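
The three variants above can be wrapped in one function to make the test rounds less error-prone. This is a sketch, not a packaged fix: disable_l1_substates is a made-up name, and it uses the same l1_1* and l1_2* globs as the commands above.

```shell
# Hypothetical helper: write 0 to the L1 substate files of one device.
# $1 = device directory, e.g. /sys/bus/pci/drivers/mwifiex_pcie/0000:01:00.0
# $2 = which substates: "l1_1", "l1_2" or "all" (default).
disable_l1_substates() {
    dev="$1"
    case "${2:-all}" in
        l1_1) set -- "$dev"/link/l1_1* ;;
        l1_2) set -- "$dev"/link/l1_2* ;;
        all)  set -- "$dev"/link/l1_1* "$dev"/link/l1_2* ;;
        *)    echo "unknown substate: $2" >&2; return 1 ;;
    esac
    for f in "$@"; do
        [ -e "$f" ] || continue
        echo 0 > "$f"
    done
}

# Example (as root):
#   disable_l1_substates /sys/bus/pci/drivers/mwifiex_pcie/0000:01:00.0 l1_2
```

The PCI address in the example is the one from the dmesg output earlier in this thread and may differ on other machines.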

@kitakar5525
Member

Note that the broken wifi reset feature has been fixed in a recent kernel release. The wifi should now reset by itself when it crashes.
So, if necessary, check the dmesg log to see whether a wifi crash has occurred.
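
A small filter can help with reading the log. The exact wording of the driver's crash and reset messages is not given in this thread, so the sketch below only narrows the log down to lines from mwifiex_pcie and leaves the interpretation to the reader; mwifiex_log_lines is a made-up name.

```shell
# Hypothetical helper: keep only kernel log lines from the wifi driver.
# Reads log text on stdin; prints nothing if there are no such lines.
mwifiex_log_lines() {
    grep 'mwifiex_pcie' || true
}

# Example:
#   sudo dmesg | mwifiex_log_lines | tail -n 20
```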

@redd1ng

redd1ng commented Oct 31, 2020

Looks like this workaround fixes the issue for me - Surface Book 2 15" no-gpu, Ubuntu 20.04
Bluetooth disabled in BIOS, not sure if this is required.

Prerequisites

Install acpi-call from https://github.com/nix-community/acpi_call and build against current kernel

Procedure

All console commands to be executed as root (sudo su)

  1. Power off. After cold-booting into OS, immediately disable wi-fi from menu.
  2. Insert the acpi_call module from acpi_call build directory
    insmod ./acpi_call.ko
  3. Remove the original driver releasing the device
    rmmod mwifiex_pcie
  4. Unregister from PCI
    echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
  5. Issue a full ACPI reset to the device
    echo '\_SB.PCI0.RP01.PXSX.PRWF._RST' > /proc/acpi/call
  6. Rescan PCI to get device back online (this also re-uploads the firmware as shown in dmesg log) -
    echo 1 > /sys/bus/pci/rescan
  7. Re-enable wi-fi with OS menu

Obviously you will need to do this every time you reboot.
Please tell me if this fixes your problems and I'll work on packaging everything

These commands (firmware reset) happen to disable all of the ASPM L1 substates, and there are reports on SP5 that disabling the L1.2 substate fixed the wifi crash. So, if these commands work, disabling some ASPM L1 substates may also work.
For example (works with kernel v5.5 or later, does not work with v4.19):

Print current wifi ASPM setting:

grep . /sys/bus/pci/drivers/mwifiex_pcie/*/link/*

and try one of the following. If possible, please try each command (reboot when you try another one) and tell me what command worked.

Disable L1.2 substate:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/l1_2*

Disable L1.1 substate:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/l1_1*

Disable all of the L1 substates:

echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/{l1_1*,l1_2*}

EDIT: and again if possible, please check if S0ix is still working during suspend (https://github.com/linux-surface/linux-surface/wiki/Known-Issues-and-FAQ#general-info-about-s0ix).
So, the procedure:

1. run one of the above command

2. check if wifi crash won't happen now

3. check if S0ix during suspend is still working
   `sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec` should increase after suspend

4. reboot and try another command above (return to 1.)

I have a Surface Pro 3 and only had issues with the Wi-Fi interface falling into an error state when turning the display back on after suspend/sleep. It would not happen every time, but turning the display off and on 3-4 times would definitely reproduce the error.

My SP3 running Ubuntu 20.04.1 LTS:
Linux Surface-Pro-3 5.9.1-surface #1 SMP Thu Oct 22 17:00:07 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

After disabling all of the ASPM L1 substates the problem seems to be solved, and I have not experienced any Wi-Fi issues since:
echo 0 | sudo tee /sys/bus/pci/drivers/mwifiex_pcie/*/link/{l1_1*,l1_2*}

@kitakar5525
Disabling only the ASPM L1.1 or L1.2 substates did not work for me.

I hope this info helps other Surface users.

@kitakar5525
Member

@ReddingZH

Thanks for testing! Hmm, SP3 needs all of the ASPM L1 substates disabled... I've never seen this type. But fortunately, doing so is still acceptable because it should still not break S0ix.

@jonas2515

Edited the wiki page to include the results for SP3: https://github.com/linux-surface/linux-surface/wiki/Marvell-88W8897-quirks

@ReddingZH can you please confirm that S0ix during suspend is not broken with this?

Also would be great if you could paste the output of sudo lspci -vvv here so we can double-check that those substates are indeed disabled on both the bridge device and the wifi chip.

@redd1ng

redd1ng commented Nov 2, 2020

How can I confirm that S0ix is not broken during suspend?

Here is the verbose lspci output of my Wi-Fi adapter:

$ sudo lspci -vvv -s 01:00.0
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless
	Subsystem: SafeNet (wrong ID) 88W8897 [AVASTAR] 802.11ac Wireless
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 58
	Region 0: Memory at c0500000 (64-bit, prefetchable) [size=1M]
	Region 2: Memory at c0400000 (64-bit, prefetchable) [size=1M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/32 Maskable+ 64bit+
		Address: 00000000fee00358  Data: 0000
		Masking: fffffffe  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 unlimited
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [140 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [150 v1] Power Budgeting <?>
	Capabilities: [160 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [168 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=70us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=163840ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: mwifiex_pcie
	Kernel modules: mwifiex_pcie
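
For the double-check of the substates, the relevant line is L1SubCtl1, where a trailing "-" means disabled and "+" means enabled. A sketch of an automated check follows; l1_substates_disabled is a made-up name and it only inspects the lspci text it is given.

```shell
# Hypothetical helper: read `lspci -vvv` output on stdin and succeed only
# if an L1SubCtl1 line is present and shows no enabled ("+") substate.
l1_substates_disabled() {
    line=$(sed -n '/L1SubCtl1:/{p;q;}')
    [ -n "$line" ] || return 1
    case "$line" in
        *L1.[12]+*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example:
#   sudo lspci -vvv -s 01:00.0 | l1_substates_disabled && echo "all off"
```

In the output above all four L1SubCtl1 flags end in "-", so the wifi chip passes this check. The bridge device needs its own lspci run, and the LnkCtl line ("ASPM L0s L1 Enabled") is a separate setting that this does not look at.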

@jonas2515

How can I confirm that S0ix is not broken during suspend?

You can do that by checking the output of sudo cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec before and after suspending. If the value increased, S0ix is working.
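
The before/after comparison can be written down as a short sketch. The function name s0ix_residency_increased is made up; the counter path is the one given above.

```shell
# Hypothetical helper: succeed if the residency counter grew.
# $1 = value read before suspend, $2 = value read after resume.
s0ix_residency_increased() {
    [ "$2" -gt "$1" ]
}

# Example:
#   f=/sys/kernel/debug/pmc_core/slp_s0_residency_usec
#   before=$(sudo cat "$f")
#   ... suspend the machine and wake it up again ...
#   after=$(sudo cat "$f")
#   s0ix_residency_increased "$before" "$after" && echo "S0ix reached"
```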

@redd1ng

redd1ng commented Nov 2, 2020

I'm missing the folder /sys/kernel/debug/pmc_core, does that mean S0ix is not working?

@jonas2515

Oh, what's your CPU? Can you check if /sys/kernel/debug/telemetry/s0ix_residency_usec or /sys/kernel/debug/pmc_atom/sleep_state exists?
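
Checking the three candidate files by hand is easy to get wrong, so here is a sketch that reports the first one that exists. find_s0ix_counter is a made-up name, and the list of paths is limited to the ones mentioned in this thread.

```shell
# Hypothetical helper: print the first S0ix-related debugfs file that
# exists. $1 is the debugfs root (default /sys/kernel/debug, root-only).
find_s0ix_counter() {
    base="${1:-/sys/kernel/debug}"
    for rel in pmc_core/slp_s0_residency_usec \
               telemetry/s0ix_residency_usec \
               pmc_atom/sleep_state; do
        if [ -e "$base/$rel" ]; then
            echo "$base/$rel"
            return 0
        fi
    done
    return 1
}

# Example (as root):
#   find_s0ix_counter || echo "none of the known files exist"
```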

@redd1ng

redd1ng commented Nov 2, 2020

I have an Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz installed. Neither of the files exists on the filesystem.

@jonas2515

jonas2515 commented Nov 2, 2020

Hmm, what's the output of sudo turbostat --show Pkg%pc2,Pkg%pc3,Pkg%pc6,Pkg%pc7,Pkg%pc8,Pkg%pc9,Pk%pc10,SYS%LPI sleep 10? Maybe your cpu doesn't have support for reading those values.

@redd1ng

redd1ng commented Nov 2, 2020

Sorry, I can't get turbostat to work: no version is found for kernel 5.9.1, and I couldn't find one for 5.9.1 via apt either.

$ sudo turbostat 
WARNING: turbostat not found for kernel 5.9.1

  You may need to install the following packages for this specific kernel:
    linux-tools-5.9.1-surface
    linux-cloud-tools-5.9.1-surface

  You may also want to install one of the following packages to keep up to date:
    linux-tools-surface
    linux-cloud-tools-surface

@jonas2515

Hmm, then maybe just try booting an older kernel, maybe 5.8?

@redd1ng

redd1ng commented Nov 2, 2020

Cool, that worked. Here's the output:

$ sudo turbostat --show Pkg%pc2,Pkg%pc3,Pkg%pc6,Pkg%pc7,Pkg%pc8,Pkg%pc9,Pk%pc10,SYS%LPI sleep 10
turbostat version 19.08.31 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0xd CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:45:1 (6:69:1)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu1: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 17476 sec. Joule Counter Range, at 15 Watts
cpu1: MSR_PLATFORM_INFO: 0x8083df3011900
8 * 100.0 = 800.0 MHz max efficiency frequency
25 * 100.0 = 2500.0 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x0004005d (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x1a1a1a1d
26 * 100.0 = 2600.0 MHz max turbo 4 active cores
26 * 100.0 = 2600.0 MHz max turbo 3 active cores
26 * 100.0 = 2600.0 MHz max turbo 2 active cores
29 * 100.0 = 2900.0 MHz max turbo 1 active cores
cpu1: MSR_CONFIG_TDP_NOMINAL: 0x00000013 (base_ratio=19)
cpu1: MSR_CONFIG_TDP_LEVEL_1: 0x0008005c (PKG_MIN_PWR_LVL1=0 PKG_MAX_PWR_LVL1=0 LVL1_RATIO=8 PKG_TDP_LVL1=92)
cpu1: MSR_CONFIG_TDP_LEVEL_2: 0x001900c8 (PKG_MIN_PWR_LVL2=0 PKG_MAX_PWR_LVL2=0 LVL2_RATIO=25 PKG_TDP_LVL2=200)
cpu1: MSR_CONFIG_TDP_CONTROL: 0x00000000 ( lock=0)
cpu1: MSR_TURBO_ACTIVATION_RATIO: 0x00000012 (MAX_NON_TURBO_RATIO=18 lock=0)
cpu1: MSR_PKG_CST_CONFIG_CONTROL: 0x1e000408 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, UNlocked, pkg-cstate-limit=8 (unlimited))
cpu1: cpufreq driver: intel_pstate
cpu1: cpufreq governor: powersave
cpufreq intel_pstate no_turbo: 0
cpu1: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_CORE_PERF_LIMIT_REASONS, 0x15201000 (Active: MultiCoreTurbo, ) (Logged: MultiCoreTurbo, PkgPwrL1, Amps, Auto-HWP, )
cpu0: MSR_GFX_PERF_LIMIT_REASONS, 0x14000000 (Active: ) (Logged: PkgPwrL1, )
cpu0: MSR_RING_PERF_LIMIT_REASONS, 0x01000000 (Active: ) (Logged: Amps, )
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x00000078 (15 W TDP, RAPL 0 - 0 W, 0.000000 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x4200c800968098 (UNlocked)
cpu0: PKG Limit #1: ENabled (19.000000 Watts, 3.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (25.000000 Watts, 0.002441* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x0a640000 (100 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x882d0800 (55 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (100 C, 100 C)
cpu1: MSR_PKGC3_IRTL: 0x00008842 (valid, 67584 ns)
cpu1: MSR_PKGC6_IRTL: 0x00008873 (valid, 117760 ns)
cpu1: MSR_PKGC7_IRTL: 0x00008891 (valid, 148480 ns)
cpu1: MSR_PKGC8_IRTL: 0x000088e4 (valid, 233472 ns)
cpu1: MSR_PKGC9_IRTL: 0x00008945 (valid, 332800 ns)
cpu1: MSR_PKGC10_IRTL: 0x000089ef (valid, 506880 ns)
10.055896 sec
Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10
71.06	0.00	0.00	0.00	0.00	0.00	0.00
71.06	0.00	0.00	0.00	0.00	0.00	0.00

@jonas2515

Thanks! Seems like there's indeed no s0ix reporting on your device, but it's also not reaching any package C-states deeper than 2, which is quite bad. Can you do sudo powertop --auto-tune and then try running turbostat again?
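The reading of the table above (nothing deeper than package C2) can be reproduced with a short parsing sketch. The function names are hypothetical, and the parser assumes the table header starts with a Pkg%pc column, as it does in the output posted here. The sample data is the summary row from that output.

```python
from typing import Dict, Optional


def parse_turbostat_table(text: str) -> Dict[str, float]:
    """Parse the column header and the first data row of turbostat's table."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        # Assumption: the header row begins with a Pkg%pc column.
        if line.startswith("Pkg%pc") and i + 1 < len(lines):
            names = line.split()
            values = [float(v) for v in lines[i + 1].split()]
            return dict(zip(names, values))
    raise ValueError("no package C-state table found")


def deepest_package_cstate(residency: Dict[str, float]) -> Optional[str]:
    """Return the deepest package C-state column with non-zero residency."""
    deepest = None
    # Columns are listed from shallow to deep; dicts keep insertion order.
    for name, percent in residency.items():
        if "%pc" in name and percent > 0.0:
            deepest = name
    return deepest


# Summary rows copied from the turbostat output earlier in this thread.
SAMPLE = (
    "Pkg%pc2\tPkg%pc3\tPkg%pc6\tPkg%pc7\tPkg%pc8\tPkg%pc9\tPk%pc10\n"
    "71.06\t0.00\t0.00\t0.00\t0.00\t0.00\t0.00\n"
)

residency = parse_turbostat_table(SAMPLE)
print(deepest_package_cstate(residency))
```

For the sample this reports Pkg%pc2 as the deepest state with any residency, which is the observation made in the comment above.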

@redd1ng

redd1ng commented Nov 2, 2020

$ sudo powertop --auto-tune
modprobe cpufreq_stats failedCannot load from file /var/cache/powertop/saved_results.powertop
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask f
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask f
Devfreq not enabled
glob returned GLOB_ABORTED
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
 the port is sda
 the port is sdb
Leaving PowerTOP
$ sudo turbostat --show Pkg%pc2,Pkg%pc3,Pkg%pc6,Pkg%pc7,Pkg%pc8,Pkg%pc9,Pk%pc10,SYS%LPI sleep 10
turbostat version 19.08.31 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0xd CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:45:1 (6:69:1)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM
CPUID(6): APERF, TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MWAIT PREFETCH TURBO)
CPUID(7): No-SGX
cpu0: MSR_MISC_PWR_MGMT: 0x00400000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 17476 sec. Joule Counter Range, at 15 Watts
cpu0: MSR_PLATFORM_INFO: 0x8083df3011900
8 * 100.0 = 800.0 MHz max efficiency frequency
25 * 100.0 = 2500.0 MHz base frequency
cpu0: MSR_IA32_POWER_CTL: 0x0004005d (C1E auto-promotion: DISabled)
cpu0: MSR_TURBO_RATIO_LIMIT: 0x1a1a1a1d
26 * 100.0 = 2600.0 MHz max turbo 4 active cores
26 * 100.0 = 2600.0 MHz max turbo 3 active cores
26 * 100.0 = 2600.0 MHz max turbo 2 active cores
29 * 100.0 = 2900.0 MHz max turbo 1 active cores
cpu0: MSR_CONFIG_TDP_NOMINAL: 0x00000013 (base_ratio=19)
cpu0: MSR_CONFIG_TDP_LEVEL_1: 0x0008005c (PKG_MIN_PWR_LVL1=0 PKG_MAX_PWR_LVL1=0 LVL1_RATIO=8 PKG_TDP_LVL1=92)
cpu0: MSR_CONFIG_TDP_LEVEL_2: 0x001900c8 (PKG_MIN_PWR_LVL2=0 PKG_MAX_PWR_LVL2=0 LVL2_RATIO=25 PKG_TDP_LVL2=200)
cpu0: MSR_CONFIG_TDP_CONTROL: 0x00000000 ( lock=0)
cpu0: MSR_TURBO_ACTIVATION_RATIO: 0x00000012 (MAX_NON_TURBO_RATIO=18 lock=0)
cpu0: MSR_PKG_CST_CONFIG_CONTROL: 0x1e000408 (UNdemote-C3, UNdemote-C1, demote-C3, demote-C1, UNlocked, pkg-cstate-limit=8 (unlimited))
cpu0: cpufreq driver: intel_pstate
cpu0: cpufreq governor: powersave
cpufreq intel_pstate no_turbo: 0
cpu0: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_CORE_PERF_LIMIT_REASONS, 0x15200020 (Active: Auto-HWP, ) (Logged: MultiCoreTurbo, PkgPwrL1, Amps, Auto-HWP, )
cpu0: MSR_GFX_PERF_LIMIT_REASONS, 0x14000000 (Active: ) (Logged: PkgPwrL1, )
cpu0: MSR_RING_PERF_LIMIT_REASONS, 0x01000000 (Active: ) (Logged: Amps, )
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x00000078 (15 W TDP, RAPL 0 - 0 W, 0.000000 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x4200c800968098 (UNlocked)
cpu0: PKG Limit #1: ENabled (19.000000 Watts, 3.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (25.000000 Watts, 0.002441* sec, clamp DISabled)
cpu0: MSR_PP0_POLICY: 0
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_PP1_POLICY: 0
cpu0: MSR_PP1_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: GFX Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x0a640000 (100 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x882e0800 (54 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (100 C, 100 C)
cpu0: MSR_PKGC3_IRTL: 0x00008842 (valid, 67584 ns)
cpu0: MSR_PKGC6_IRTL: 0x00008873 (valid, 117760 ns)
cpu0: MSR_PKGC7_IRTL: 0x00008891 (valid, 148480 ns)
cpu0: MSR_PKGC8_IRTL: 0x000088e4 (valid, 233472 ns)
cpu0: MSR_PKGC9_IRTL: 0x00008945 (valid, 332800 ns)
cpu0: MSR_PKGC10_IRTL: 0x000089ef (valid, 506880 ns)
10.002633 sec
Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10
50.09	0.00	0.00	0.00	0.00	0.00	0.00
50.09	0.00	0.00	0.00	0.00	0.00	0.00

@jonas2515

Hmm, still nothing better than PC2. Anyway, that's a different issue from wifi stability. I guess if there's no counter for S0ix residency, we also can't confirm that it's working during suspend...

@lx07

lx07 commented Dec 8, 2020

For me, this seems to go away when I turn off Bluetooth. Any clues?

@jonas2515

Random firmware crashes should be fixed now thanks to linux-surface/kernel#70 and linux-surface/kernel#91, so I think we can close this issue.

@Lakeland97

I still have this issue on my Surface Book 1 with the latest 5.14 kernel.

@jonas2515

Can you be a bit more specific about the exact issues you are experiencing? Also, if you're talking about network speed drops, can you answer these questions?
