driver causes instability with kernel 4.7? #76

Open
speculatrix opened this Issue Jun 17, 2016 · 17 comments

@speculatrix
speculatrix commented Jun 17, 2016 edited

After testing the driver in different scenarios, I have concluded that it makes the kernel very prone to locking up with kernel 4.7 (rc1, rc2 or rc3); getting more than 5 minutes of uptime is very unlikely. Ping times vary wildly, between 2.4 ms and 1130 ms!

I am using the John Brodie patch set (he of the Ubuntu on T100TA group), which I believe includes the relevant patches from this git repo. I built a kernel from the sources at kernel.org, applied the patches and used this kernel config: http://www.zaurus.org.uk/download/toshiba_click_mini_l9w/config-4.7.0-rc3-jbpm0

I have eliminated as many other things as possible by running Linux from a USB card reader and unmounting any eMMC partitions, so the only active device is SDIO, which is in the Baytrail's Storage Hub.

I am not sure whether it's relevant, but I'm using dual-stack IPv4 and IPv6; there are IPv6 route announcements, not DHCPv6.
The USB Ethernet and the wifi are on the same LAN and get different IP addresses, with no fancy tricks, so the amount of network chatter on the two interfaces is much the same.

With kernel 4.4.6 and later, I found the wifi driver to be quite useable.

This is on a Toshiba Click Mini, with 5.20 UEFI firmware; Z3735F SoC, 2GB RAM, Debian 8.4.

thanks for your attention

http://cdn.arstechnica.net/wp-content/uploads/2013/09/Screen-Shot-2013-09-13-at-6.32.07-PM-640x423.jpg

@robert-john-small

Here's the dmesg from a network lockup I saw.

[37275.278075] RTL8723BS: nolinked power save leave
[37276.940013] RTL8723BS: nolinked power save enter
[37277.206103] RTL8723BS: nolinked power save leave
[37277.421546] RTL8723BS: rtw_set_802_11_connect(wlan2)  fw_state = 0x00000008
[37279.219935] RTL8723BS: start auth
[37279.223193] RTL8723BS: auth success, start assoc
[37279.226711] RTL8723BS: rtw_cfg80211_indicate_connect(wlan2) BSS not found !!
[37279.226743] RTL8723BS: assoc success
[37279.235546] RTL8723BS: send eapol packet
[37279.253229] RTL8723BS: send eapol packet
[37279.254480] RTL8723BS: set pairwise key camid:4, addr:14:dd:a9:ca:15:22, kid:0, type:AES
[37279.256273] RTL8723BS: set group key camid:5, addr:14:dd:a9:ca:15:22, kid:1, type:AES


[37529.721779] RTL8723BS: send eapol packet
[37529.722926] RTL8723BS: send eapol packet
[37529.723959] RTL8723BS: send eapol packet
[37529.724433] RTL8723BS: set group key camid:6, addr:14:dd:a9:ca:15:22, kid:2, type:AES
[37529.724999] RTL8723BS: send eapol packet
[37529.725236] RTL8723BS: set group key camid:6, addr:14:dd:a9:ca:15:22, kid:2, type:AES
[37529.726144] RTL8723BS: set group key camid:6, addr:14:dd:a9:ca:15:22, kid:2, type:AES
[37529.727644] RTL8723BS: set group key camid:6, addr:14:dd:a9:ca:15:22, kid:2, type:AES

[38351.204283] irq 187: nobody cared (try booting with the "irqpoll" option)
[38351.204303] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.7.0-rc3-jbpm0 #1
[38351.204311] Hardware name: TOSHIBA SATELLITE Click Mini L9W-B/0700, BIOS 5.20 11/02/2015
[38351.204319]  0000000000000000 ffffffff8127c895 ffff880078d9b400 0000000000000000
[38351.204335]  ffffffff810a2ce9 ffff880078d9b400 0000000000000000 0000000000000000
[38351.204347]  ffffffff810a3054 0000000000000000 00000000000000bb 0000000000000000
[38351.204360] Call Trace:
[38351.204367]  <IRQ>  [<ffffffff8127c895>] ? dump_stack+0x5c/0x77
[38351.204395]  [<ffffffff810a2ce9>] ? __report_bad_irq+0x29/0xc0
[38351.204407]  [<ffffffff810a3054>] ? note_interrupt+0x224/0x270
[38351.204418]  [<ffffffff810a08c9>] ? handle_irq_event_percpu+0x149/0x1b0
[38351.204428]  [<ffffffff810a0966>] ? handle_irq_event+0x36/0x60
[38351.204439]  [<ffffffff810a34fa>] ? handle_fasteoi_irq+0x8a/0x140
[38351.204451]  [<ffffffff8101a719>] ? handle_irq+0x19/0x30
[38351.204462]  [<ffffffff815a2c76>] ? do_IRQ+0x46/0xd0
[38351.204475]  [<ffffffff815a1242>] ? common_interrupt+0x82/0x82
[38351.204481]  <EOI>  [<ffffffff8144a193>] ? cpuidle_enter_state+0x113/0x250
[38351.204504]  [<ffffffff8108cbde>] ? cpu_startup_entry+0x2ae/0x350
[38351.204517]  [<ffffffff81b2aedb>] ? start_kernel+0x43d/0x445
[38351.204528]  [<ffffffff81b2a120>] ? early_idt_handler_array+0x120/0x120
[38351.204538]  [<ffffffff81b2a574>] ? x86_64_start_kernel+0x145/0x154
[38351.204545] handlers:
[38351.204554] [<ffffffff81466f70>] sdhci_irq threaded [<ffffffff81464630>] sdhci_thread_irq
[38351.204567] Disabling IRQ #187
@robert-john-small

If you use irqpoll as a kernel parameter, then instead of locking up just the network stack, it locks up the whole machine.
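
For anyone collecting logs from several machines, here is a minimal sketch of pulling the "nobody cared" events out of saved dmesg text. It assumes the two line formats seen in the traces in this thread (handler line with or without a module tag such as `[sdhci]`); the function name `find_unhandled_irqs` is made up for illustration and other kernel versions may print these lines differently.

```python
import re

# Matches lines such as:
#   [38351.204283] irq 187: nobody cared (try booting with the "irqpoll" option)
NOBODY_CARED = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+irq (?P<irq>\d+): nobody cared"
)

# Matches the line printed under "handlers:", e.g.
#   [...] [<ffffffff81466f70>] sdhci_irq threaded [<...>] sdhci_thread_irq
# The optional group covers the "[sdhci]" module tag seen when sdhci is a module.
HANDLER = re.compile(
    r"\]\s+\[<[0-9a-f]+>\]\s+(?P<name>\w+)(?:\s+\[\w+\])?\s+threaded"
)


def find_unhandled_irqs(dmesg_text):
    """Return a list of (timestamp, irq, handler) tuples found in dmesg text.

    handler is None when no handler line follows the "nobody cared" line.
    """
    events = []
    current = None
    for line in dmesg_text.splitlines():
        m = NOBODY_CARED.match(line)
        if m:
            current = [float(m.group("ts")), int(m.group("irq")), None]
            events.append(current)
            continue
        h = HANDLER.search(line)
        if h and current is not None and current[2] is None:
            current[2] = h.group("name")
    return [tuple(e) for e in events]
```

Feeding it the output of `dmesg` saved to a file should give one tuple per event, which makes it easier to compare how long different kernels ran before the IRQ was disabled.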

@speculatrix
speculatrix commented Jun 18, 2016 edited

I am fairly confident the driver per se is not the problem.

I left my Toshiba Click Mini (TCM) running overnight with kernel 4.4.13, 32-bit version, built from vanilla source with the patches from this repo and no others. I'm running with the root partition on a microSD card in the tablet itself (mmcblk0), plus a USB hub combo (2 USB ports, 1 gigabit Ethernet, 1 card slot).

I set up Zabbix monitoring so my Z server would probe the usb ether and the wifi every minute. Eleven hours later it's still happy. I note that wifi latency is pretty good, a ping from my laptop (on 802.11a) to the TCM is typically 2.3ms, occasionally spiking to 6ms.

I compared /proc/interrupts between the two kernels, cutting out the stats columns

$ diff interrupts-4.7.0-rc3-jbpm0.nostats interrupts-4.4.13-pm2.nostats
17,18c17,18
<  202:  PCI-MSI 327680-edge      xhci_hcd
<  203:  IO-APIC   68-fasteoi   SIS0817:00
---
>  202:  IO-APIC   68-fasteoi   SIS0817:00
>  204:  PCI-MSI 327680-edge      xhci_hcd
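
For reference, a rough sketch of how the per-CPU count columns can be cut out of /proc/interrupts so that two kernels can be diffed as above. It assumes the usual layout (a header row naming the CPUs, then one row per IRQ with one count per CPU); `strip_counts` is a hypothetical helper name, not the tool actually used for the diff in this comment.

```python
def strip_counts(interrupts_text):
    """Drop the per-CPU counters from /proc/interrupts text.

    Returns one string per IRQ row, e.g. "202: PCI-MSI 327680-edge xhci_hcd".
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    out = []
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an IRQ row
        fields = rest.split()
        # Skip at most ncpu leading numeric fields; rows such as "ERR:"
        # carry a single counter rather than one per CPU.
        dropped = 0
        while dropped < ncpu and dropped < len(fields) and fields[dropped].isdigit():
            dropped += 1
        out.append((label.strip() + ": " + " ".join(fields[dropped:])).rstrip())
    return out


# Usage on a live system (Linux only):
#   with open("/proc/interrupts") as f:
#       print("\n".join(strip_counts(f.read())))
```

Saving that output under each kernel and running `diff` on the two files gives a comparison that is not swamped by the changing counters.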

I don't know much about Linux kernel device drivers, so I can only speculate about whether it's the interaction of the driver with the kernel (interrupts not being passed through in some way), the SDIO layer being flaky in 4.7, or the kernel allocating clashing resources that causes the lock-up.

@speculatrix
speculatrix commented Jun 18, 2016 edited

A very odd thing: if I remove r8723bs.ko from /lib/modules/4.7.../kernel/drivers/net/wireless so that the module isn't loaded on boot, then boot normally and load the kernel module with insmod, the machine is more stable (it stays up for over 30 minutes); wifi ping latency is just as variable and bad.

I think this might be a symptom. If the wifi driver is not loaded until the system has fully booted and settled down, so that all caches are loaded and all the buses (internal and external) are quiescent, then there's a lower chance of a crash. Or, if the crash is a result of certain things happening at the same time and causing something bad to accumulate, perhaps it just takes longer to manifest?

@speculatrix

I tried with the latest 4.7-rc4 and got the same error about IRQ #187 (which /proc/interrupts says is mmc1), and pretty much the same dmesg as previously reported.
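
As a small illustration of that lookup, the sketch below finds which device a given IRQ number belongs to in /proc/interrupts text. It assumes the standard layout with one counter per CPU on each numbered row; `irq_owner` is a made-up name for this example.

```python
def irq_owner(interrupts_text, irq):
    """Return the controller/trigger/device description for an IRQ, or None."""
    lines = interrupts_text.splitlines()
    ncpu = len(lines[0].split()) if lines else 0  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if sep and label.strip() == str(irq):
            fields = rest.split()
            # Everything after the per-CPU counters describes the IRQ.
            return " ".join(fields[ncpu:])
    return None


# Usage on a live system (Linux only):
#   with open("/proc/interrupts") as f:
#       print(irq_owner(f.read(), 187))
```

On the machines in this thread the description for IRQ 187 would be expected to end in mmc1, matching the sdhci handlers named in the traces.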

@lwfinger
Collaborator

I know that kernel bisection is a pain with these devices, but it would be helpful if someone could determine which kernel commit causes the interrupt errors.

I have no idea why loading the driver after boot is complete would lead to more stability than allowing it to load as soon as user space is running. That is the point at which firmware can be read from disc.

@BzukTuk
BzukTuk commented Jun 21, 2016 edited

My r8723bs Wifi is stable without any problems using kernel 4.6 (also in 4.6.[1,2]) with these patches:
0001-PM-QoS-Add-pm_qos_cancel_request_lazy-that-doesn-t-s.patch
0002-mmc-sdhci-get-runtime-pm-when-sdio-irq-is-enabled.patch
0003-mmc-sdhci-Support-maximum-DMA-latency-request-via-PM.patch
0005-mmc-sdhci-pci-Fix-device-hang-on-Intel-BayTrail.patch
from patches_4.5 directory

In kernel 4.7-rc[1,2,3] I can't use 0002-mmc-sdhci-get-runtime-pm-when-sdio-irq-is-enabled.patch, because it no longer applies after commit 15e82076... mmc: sdhci: Remove redundant runtime PM calls. So with only patches 1, 3 and 5 applied, the kernel panics after some time. Here are dmesg messages similar to what robert-john-small posted:
[127.749348] r8723bs: module verification failed: signature and/or required key missing - tainting kernel
[ 127.753808] RTL8723BS: module init start
[ 127.753816] RTL8723BS: rtl8723bs v4.3.5.5_12290.20140916_BTCOEX20140507-4E40
[ 127.753820] RTL8723BS: rtl8723bs BT-Coex version = BTCOEX20140507-4E40
[ 127.910434] RTL8723BS: rtw_ndev_init(wlan0)
[ 127.911423] RTL8723BS: module init ret =0
[ 127.945925] rtl8723bs: accquire FW from file:rtlwifi/rtl8723bs_nic.bin
[ 131.327981] RTL8723BS: nolinked power save enter
[ 131.649807] RTL8723BS: nolinked power save leave
[ 131.871768] RTL8723BS: rtw_set_802_11_connect(wlan0) fw_state = 0x00000008
[ 133.667895] RTL8723BS: start auth
[ 133.672430] RTL8723BS: auth success, start assoc
[ 133.676842] RTL8723BS: rtw_cfg80211_indicate_connect(wlan0) BSS not found !!
[ 133.676877] RTL8723BS: assoc success
[ 133.698842] RTL8723BS: send eapol packet
[ 133.726052] RTL8723BS: send eapol packet
[ 133.726358] RTL8723BS: set pairwise key camid:4, addr:ab:cd:ef:20:df:7d, kid:0, type:AES
[ 133.731700] RTL8723BS: set group key camid:5, addr:ab:cd:ef:20:df:7d, kid:2, type:TKIP
[ 840.829692] irq 187: nobody cared (try booting with the "irqpoll" option)
[ 840.829710] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 4.7.0-rc2+makewifi+snd+turbo+mika123+pmmmc135+lpss+ #4
[ 840.829716] Hardware name: Acer Aspire SW5-012/Fendi2, BIOS V1.18 08/06/2015
[ 840.829723] 0000000000000086 8a9f2215d48ca820 ffff880078483e68 ffffffff813f6c03
[ 840.829736] ffff88007b8b0a00 ffff88007b8b0a9c ffff880078483e90 ffffffff810e1b75
[ 840.829745] ffff88007b8b0a00 0000000000000000 0000000000000000 ffff880078483ec8
[ 840.829755] Call Trace:
[ 840.829760] <IRQ> [<ffffffff813f6c03>] dump_stack+0x63/0x90
[ 840.829786] [<ffffffff810e1b75>] __report_bad_irq+0x35/0xd0
[ 840.829795] [<ffffffff810e1efc>] note_interrupt+0x22c/0x270
[ 840.829806] [<ffffffff810df0f6>] handle_irq_event_percpu+0x156/0x1c0
[ 840.829815] [<ffffffff810df19b>] handle_irq_event+0x3b/0x60
[ 840.829823] [<ffffffff810e24cf>] handle_fasteoi_irq+0x8f/0x140
[ 840.829833] [<ffffffff8103032d>] handle_irq+0x1d/0x30
[ 840.829844] [<ffffffff8184e6eb>] do_IRQ+0x4b/0xd0
[ 840.829853] [<ffffffff8184c802>] common_interrupt+0x82/0x82
[ 840.829858] <EOI> [<ffffffff816d7725>] ? cpuidle_enter_state+0x115/0x250
[ 840.829875] [<ffffffff816d7701>] ? cpuidle_enter_state+0xf1/0x250
[ 840.829883] [<ffffffff816d7897>] cpuidle_enter+0x17/0x20
[ 840.829892] [<ffffffff810c7013>] call_cpuidle+0x33/0x50
[ 840.829900] [<ffffffff810c7416>] cpu_startup_entry+0x2c6/0x370
[ 840.829910] [<ffffffff81050d38>] start_secondary+0x158/0x190
[ 840.829917] handlers:
[ 840.829931] [<ffffffffc0004a70>] sdhci_irq [sdhci] threaded [<ffffffffc0001f70>] sdhci_thread_irq [sdhci]
[ 840.829945] Disabling IRQ #187

Edit: I just tried 4.7-rc2 without any PM and mmc patches and got a total lockup and freeze after roughly 30 minutes (with YouTube loading every 2-4 minutes).

@speculatrix

I built and installed kernel 4.5.7 with the John Brodie patch set and achieved 10.5 hours of uptime with the wifi and sound drivers loaded. Ping times are as good as with 4.4.
I'll try 4.6.x with the JB patches.

@Laszlo-Fiat
Laszlo-Fiat commented Jul 23, 2016 edited

0001-My-changes-for-4.7.0-rc7-for-Baytrail-T.patch.txt
For the 4.7-rcs and linux-next, I partly reverted commit 15e82076... mmc: sdhci: Remove redundant runtime PM calls. That makes the "irq 187 nobody cared" problem go away. I am still struggling with the wifi, as I have terrible packet loss and sometimes lose the connection.

The old patches 1, 3 and 5 from Adrian Hunter are no longer needed (in theory), as Adrian mainlined a patch which is supposed to fix the same problems: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/mmc?id=6e1c7d6103fe7031035cec321307c6356809adf4

Here is my latest patch, with the "Remove redundant..." patch partly reverted, Bluetooth added to rfkill-gpio, and a sound IRQ fix. This is what I run now.
I still add intel_idle.max_cstate=1 to the cmdline, as it makes a graphics-related hang go away.
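
To confirm that a parameter such as intel_idle.max_cstate=1 really reached the running kernel, the contents of /proc/cmdline can be checked. A minimal sketch, assuming simple whitespace-separated parameters (quoted values containing spaces are not handled); `parse_cmdline` is a hypothetical helper name.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "quiet" map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Usage on a live system (Linux only):
#   with open("/proc/cmdline") as f:
#       print(parse_cmdline(f.read()).get("intel_idle.max_cstate"))
```

If the lookup returns None, the parameter was not passed by the bootloader and the workaround is not in effect.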

Does anyone run a community somewhere for Baytrail-T linux users where we can talk, share patches, etc?

@speculatrix

The most active place I know of working with Linux on Baytrail-T devices is the Google+ group for Ubuntu on the Asus T100 series
https://plus.google.com/communities/117853703024346186936

@BzukTuk
BzukTuk commented Jul 25, 2016 edited

Laszlo's patch is working for me - thank you. Wifi works without any problem for hours (4.7 kernel on Acer Aspire Switch 10). I don't need to use any other PM-QoS or mmc patch for wifi to work.

@jharrison022

Laszlo's patch is also working for me on kernel 4.8-rc1.

@marc-wx
marc-wx commented Oct 15, 2016

Laszlo's patch is also working for me on 4.8 - thank you very much. Without it I was struggling with different combinations of patches from the 4.5 directory and getting system freezes in under 5 minutes. Wifi is now stable with Laszlo's patch and none of the others.

@AndyLavr
AndyLavr commented Nov 4, 2016

Ubuntu kernel 4.8.6 for Acer Aspire Switch 10 SW5-012
https://github.com/AndyLavr/Aspire-SW5-012_Kernel_4.8/wiki

@jprvita jprvita added a commit to endlessm/linux that referenced this issue Dec 9, 2016
@jprvita jprvita mmc: sdhci: Revive sdhci_runtime_pm_{get,put}
These functions are needed to get runtime pm when sdio irq is enabled.

They were originally removed by the following commit:
15e8207 mmc: sdhci: Remove redundant runtime PM calls

This patch is based on Laszlo Fiat's patch shared on
hadess/rtl8723bs#76 (comment)

hadess/rtl8723bs#76
hadess/rtl8723bs#33

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>

https://phabricator.endlessm.com/T14511
403e192
@jprvita jprvita added a commit to endlessm/linux that referenced this issue Dec 20, 2016
@jprvita jprvita sdhci-acpi: Disable runtime PM on ThinGlobal TGMPC01
This machine freezes if the SDIO wifi adapter RTL8732BS is put to sleep
while its IRQs are enabled. We don't fully understand why this happens
on this particular host controller and device combination, but this
problem has been seen by reported by others on different platforms as
well, as shown in the following bug reports:

hadess/rtl8723bs#33
hadess/rtl8723bs#76

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>

https://phabricator.endlessm.com/T14511
794e39b
@dsd dsd added a commit to endlessm/linux that referenced this issue Dec 23, 2016
@jprvita @dsd jprvita + dsd sdhci-acpi: Disable runtime PM on ThinGlobal TGMPC01
cc6ac8f
@braiam
braiam commented Jan 6, 2017

So, which patches are required? Do I only need to apply the Adrian Hunter patches referenced by @Laszlo-Fiat to the kernel and then build the module for kernel 4.8?

@marc-wx
marc-wx commented Jan 16, 2017

This is the patch - it is also working on 4.9.0 for my Baytrail tablet, a Lenovo Miix 3-830.

0_rt8723bs.txt

@dfiloni
dfiloni commented Jan 24, 2017

Patch is working also on 4.10 (Tested on a Lenovo Miix 310 - Cherry trail). Thank you!

@Tuxman2 Tuxman2 referenced this issue in burzumishi/linux-baytrail-flexx10 Jan 25, 2017
Open

Battery is not supported #11

@dsd dsd added a commit to endlessm/linux that referenced this issue Jan 27, 2017
@jprvita @dsd jprvita + dsd sdhci-acpi: Disable runtime PM on ThinGlobal TGMPC01
ba37ca5