FS#804 - mt7621: kernel errors - rcu_sched detected stalls on CPUs/tasks #5932
Comments
RedDwarf: It seems it has been happening for a while -> http://lists.infradead.org/pipermail/lede-dev/2017-February/006325.html I don't fully understand it, but I think it's related to these messages "cron.err crond[756]: time disparity of 1096 minutes detected". So it has real consequences, at the very least it can make crond sleep() for 18 hours without running any job. |
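As a quick check of the figure quoted above, 1096 minutes works out to a little over 18 hours. A minimal shell sketch of the conversion (`minutes_to_hm` is a hypothetical helper name used only for illustration, not a LEDE tool):

```shell
#!/bin/sh
# Convert a duration in minutes into hours and minutes.
# minutes_to_hm is a hypothetical helper, not part of crond or LEDE.
minutes_to_hm() {
    echo "$(( $1 / 60 ))h $(( $1 % 60 ))m"
}

minutes_to_hm 1096   # prints "18h 16m"
```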
RedDwarf: I have found this on a SK-WB8 (MT7621 too) with a slightly modified (not the kernel) 17.01.2. |
john: I pushed a patch to my staging tree that might fix this issue. |
camel: let me know when the patch is in lede trunk on "ZBT-WG3526" image, and i will test. |
bjonglez: You need to test the commit from john's staging tree before it gets merged into trunk. By the way, ramips has been switched to linux 4.9, so it can also be worth testing the latest trunk image to see if it changes anything. |
camel: well, if it would be in the trunk, we could test it ...eg: WG-3526 on which i can support for power testing :) |
lister-wrt: I'm having the same issue on Ubiquiti ERX. Can test patches but I don't know how to build it. |
bjonglez: I don't really like this, but here are all mt7621 images with lede-17.01 + john's patch (r3464+1-82b20d74cb): https://pub.polyno.me/lede-ramips-FS804/ Please only use these images for testing! |
lister-wrt: Thanks Baptiste, I'll try it out. I have a USB-TTL in case it goes horribly wrong. The only way I know of reliably reproducing this issue is with SQM (errors start ~5 minutes after install) and it's not in your build. Kernel 4.9.37 was merged just after you built this, so I won't be able to use the LEDE packages. Is there another way to test? |
dchard: Kernel 4.9.37 is also affected. I am testing John's patched build, and so far I was not able to reproduce this bug with hours of torturing the CPU. Previously it took only 5-10 minutes, so this is good news so far. |
Mushoz: This seems to be a duplicate or related to the following issue: https://bugs.lede-project.org/index.php?do=details&task_id=764 Unfortunately, during traffic shaping the Dir-860l still crashes with that patch applied. So it does not seem to be a complete fix. It does look like it takes longer for it to manifest, so I believe we're getting closer to the solution for our issues :) |
camel: @baptiste Jonglez |
camel: Hm, I tested current trunk (without traffic shaping packages installed) and got the same result ... it happens ~30 times per day. |
john: could you try if this still happens if remove |
camel: Sorry, can't build own image. |
pparent76: I have the same problem. I will try to compile without those patches tomorrow. Also, for me the bug is not as easily reproducible as camel says; at least without traffic shaping and without the mt7603e driver it's rare. Also, if you want a ZBT-WG3526 to be able to test yourself, I can send you one for free. |
camel: For me it is clearly related to 2.4 GHz WLAN. |
pparent76: No, it's probably not, because the problem happens without the 2.4 GHz driver running. The 2.4 GHz driver can make the problem happen more often, though, and the mt76 driver itself has some specific issues independent of this problem. Traffic shaping makes it happen more often even with the 2.4 GHz driver disabled. |
pparent76: I compiled the version without the patch. You can download it here: https://www.own-mailbox.com/lede/lede-ramips-mt7621-zbt-wg3526-16M-squashfs-sysupgrade.bin Please test if you can; I will test on my side when I have time. Edit: I have updated the firmware on my server with a few more packages, including kmod-mt76 and kmod-sched, at 8:40 GMT. md5sum: 9ac127a3d0bf49a8d452e51b2ff9b741 |
camel: I can't test, as I would need more packages. |
pparent76: I have updated the firmware on my server with a few more packages, including kmod-mt76 and kmod-sched, at 8:40 GMT. md5sum: 9ac127a3d0bf49a8d452e51b2ff9b741 What packages would you need that you cannot install with opkg? (Only kernel-related packages cannot be installed with opkg.) |
camel: It's too much, and I don't know exactly which packages are needed to match the kernel build ... but if you're interested, this is the list of packages I install ...
#disk & SD related stuff
opkg install block-mount # --force-reinstall
opkg install kmod-scsi-core kmod-usb-storage #--force-reinstall
opkg install kmod-fs-ext4 kmod-fs-vfat
opkg install kmod-nls-utf8 kmod-nls-cp437 kmod-nls-iso8859-1 #--force-reinstall
opkg install kmod-fs-nfs nfs-utils #--force-reinstall
opkg install kmod-fs-ext4 kmod-fs-vfat kmod-nls-utf8 kmod-nls-base kmod-nls-cp437 kmod-nls-iso8859-1 cfdisk e2fsprogs #--force-reinstall
opkg install kmod-fs-f2fs libf2fs f2fs-tools f2fsck mkf2fs
opkg install fdisk #--force-reinstall
opkg install rsync
|
camel: Well, if wanted, I can try to install it and let you know which kernel-related packages I would be missing. Which link for your build can I use? I tried it, but too much is missing to test it for longer: no LuCI packages, no modem driver, etc. I need to wait until it is in trunk. I guess it cannot be worse than it is now in current trunk ... |
azuwis: It's possible to force building the same kernel version as https://downloads.lede-project.org/snapshots/targets/ramips/mt7621/packages/kernel_4.9.37-1-7f0de30d5b73958cb146494d8e5b2ef4_mipsel_24kc.ipk
As long as you use the same code base and same kernel config, kmod from upstream snapshot should work fine. |
pparent76: I've started testing with QoS enabled but mt76 (the 2.4 GHz driver) disabled, with my image. For now it seems that I don't see any rcu_sched warnings anymore, but I would need confirmation, since for me it was always very random and not easily reproducible. I guess if there is no one to test, I don't know if anything will get to trunk soon, especially since this hack is about removing patches that affect all MIPS images. But john should know better than I do. Here is the latest image: The image builder: The Sdk: |
pparent76: @zhong Jianxin: will it work even with a modified kernel (since we change patches used in upstream)? |
camel: @pierre: It's possible to force building the same kernel version as upstream snapshot, e.g: $ make clean |
pparent76: I've updated the image+sdk+image-builder on my server compiled with the above command. The md5sum of the image is 3298cb86e8ff7737fcad8bc4065914ec. Please test, if you can install your packages with it. |
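Before flashing a test image it is worth comparing its checksum with the one posted. A minimal sketch, assuming a BusyBox or coreutils `md5sum` is available; `verify_md5` is a hypothetical helper name and `image.bin` is a placeholder for whatever file was downloaded:

```shell
#!/bin/sh
# Compare a file's MD5 checksum with an expected value.
# verify_md5 is a hypothetical helper, not part of LEDE.
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<hash>  <file>"; keep only the hash.
    actual="$(md5sum "$file" | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH: got $actual" >&2
        return 1
    fi
}

# Usage (placeholder file name, checksum as posted in the comment above):
#   verify_md5 image.bin 3298cb86e8ff7737fcad8bc4065914ec
```

A mismatch returns a non-zero status, so the check can gate a later `sysupgrade` call in a script.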
camel: OK, first try ... the rsync package seems to be completely missing now on the trunk snapshot; maybe the build failed on some new packages. I'll let you know as soon as I get it installed. Maybe some small issues; not sure if I can mount everything OK ... |
camel: hmm :( .. |
azuwis:
It depends on the modification, in this case, it should work. The reason why there are so many
Upstream snapshot build will select many packages, but it's probably not the case of custom build. Here is another way to build as close as upstream snapshot:
But it will take much longer. I just tested this: it built the same kernel version as the current upstream snapshot, without overriding LINUX_VERMAGIC. |
camel: @pierre: is it possible to prepare a new build with the current kernel magic ID? |
pparent76: I will not use the magic number technique again because: 1- It's useless, since modules that were not included in my build, and therefore not built, will not work, as @zhong Jianxin said. Those that are built I can include directly in the image. 2- It can lead to wrong diagnoses and wrong conclusions in our testing, since, as we saw, we can get kernel errors due to incompatibility between the kernel I built and the modules in upstream packages. 3- I will compile an image with traffic shaping in one hour, so that you can test with traffic shaping. |
camel: Hmm, not sure, if you can add the pptpd + pppd packages included, too. |
pparent76: @Camel can you please test with QOS/SQM enabled on the version I just sent to my server?
Anyways you should not use my images for anything else than testing.... |
camel: sure, i will try ... |
camel: pls give me the DL link, |
pparent76: Hum no it should not be, it corresponds to the image I compiled today, with luci-app-sqm sqm-scripts Md5sum: 71121b4a6a30abd6627595d01bd0374c |
camel: Sorry, my mistake ... I can't install the PPTP client + server packages ... root@LEDE: opkg install ppp-mod-pptp kmod-nf-nathelper-extra #if necessary via: opkg install ppp-mod-pptp kmod-nf-nathelper-extra #--force-depends --force-reinstall #If LuCI support is desired, additionally install the protocol package: #VPN PPTP server:
|
pparent76:
I did not forget; I purposely did not do it, for the reasons mentioned above. |
camel: Hmm, but I really need the PPTP stuff; otherwise I can't really test it in more detail.
Or, if the kernel magic is set to match the LEDE trunk snapshot, then I can remove packages and install the real ones I need ... as #OPENVPN: (ca. 1MB space needed) |
camel: Meanwhile I tested ... hmm ... it seems that I'm getting a memory issue ... maybe that was related to a "speedtest.py" which I run every 5 min to collect statistics about the router's internet speed ... Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.520000] luci invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0 For traffic shaping I would need PPTP in any case, as the VPNs are shaped ... |
camel: Meanwhile I tested with traffic shaping, produced a lot of traffic on: and got a few errors ....
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.520000] luci invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), nodemask=0, order=1, oom_score_adj=0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.530000] COMPACTION is disabled!!!
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.540000] CPU: 2 PID: 32685 Comm: luci Not tainted 4.9.37 #0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.540000] Stack : 00000000 00000000 80537b2a 00000032 803f4084 00000000 00000000 80530000
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.550000] 81fa462c 804d7da7 8046dff0 00000002 00007fad 80533824 00000001 00200000
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.560000] 00001321 80069890 00000000 800696b0 00000000 00000004 80472c00 82745c3c
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.570000] 00000000 800a5d98 00000000 00000000 80537b2a 00000000 82745d28 00745c3c
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.580000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.580000] ...
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.590000] Call Trace:
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.590000] [<8000f644>] show_stack+0x54/0x88
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.590000] [<801e5924>] dump_stack+0x84/0xc0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.600000] [<800ea424>] dump_header.isra.4+0x84/0x1b4
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.600000] [<800ac264>] oom_kill_process+0xd0/0x484
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.610000] [<800acb40>] out_of_memory+0x3bc/0x3fc
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.610000] [<800b01c4>] __alloc_pages_nodemask+0x5e4/0xa58
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.620000] [<800281b0>] copy_process.isra.8.part.9+0x10c/0x1300
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.620000] [<80029520>] _do_fork+0xcc/0x2d8
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.630000] [<800297dc>] SyS_clone+0x20/0x2c
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.630000] [<80016558>] syscall_common+0x34/0x58
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] Mem-Info:
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] active_anon:13977 inactive_anon:1643 isolated_anon:0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] active_file:805 inactive_file:2995 isolated_file:0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] unevictable:1 dirty:2 writeback:0 unstable:0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] slab_reclaimable:3926 slab_unreclaimable:35491
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] mapped:3433 shmem:8024 pagetables:227 bounce:0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.640000] free:49832 free_pcp:28 free_cma:0
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.670000] Node 0 active_anon:55908kB inactive_anon:6572kB active_file:3220kB inactive_file:11980kB unevictable:4kB isolated(anon):0kB isolated(file):0kB mapped:13732kB dirty:8kB writeback:0kB shmem:32096kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Tue Jul 25 12:50:04 2017 kern.warn kernel: [ 1065.700000] Normal free:20008kB min:16384kB low:20480kB high:24576kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:84kB unevictable:0kB writepending:8kB present:262144kB managed:251688kB mlocked:0kB slab_reclaimable:15704kB slab_unreclaimable:141964kB kernel_stack:56192kB pagetables:908kB bounce:0kB free_pcp:232kB local_pcp:0kB free_cma:0kB
Tue Jul 25 12:50:05 2017 kern.emerg kernel: lowmem_reserve[]: 0 2048 2048
|
pparent76: We need a better patch... |
john: i've done nothing all day but play around with this. i am unfortunately not able to reproduce this issue. i've just sent my latest version of the patch to someone for testing. lets hope for the best. |
pparent76: @john thanks a lot. Maybe if you send us your latest version of your patch we can test it too. |
dchard: @john: I agree with Pierre: if you can send us a build with your (latest) patches inside, we are happy to test. Like we did with previous versions :-) |
pparent76: If you don't have a build I can do the build as I did for last hint. |
john: drop this file into target/linux/ramips/patches-4.9/ on current trunk, ignoring all previous patches. i have had an iperf test run for 2 hours now with near gbit speed using SQM/cake/piece.of.cake setup to rate limit at 600Mbit and have not seen any oopses |
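The step john describes can be sketched roughly as follows. This is an outline under stated assumptions: `install_patch` is a hypothetical helper, the patch file name in the usage comment is a placeholder (the thread does not name the file), and the rebuild commands shown in the comment are the usual buildroot targets rather than anything john specified.

```shell
#!/bin/sh
# Copy a kernel patch into the ramips 4.9 patch directory of a LEDE tree.
# install_patch is a hypothetical helper, not part of the LEDE build system.
install_patch() {
    patch_file="$1"
    tree="$2"
    dest="$tree/target/linux/ramips/patches-4.9"
    if [ ! -d "$dest" ]; then
        echo "error: $dest not found (is $tree a LEDE source tree?)" >&2
        return 1
    fi
    cp "$patch_file" "$dest/"
}

# Usage (placeholder names), followed by a kernel rebuild:
#   install_patch ./example-fix.patch ~/lede
#   cd ~/lede && make target/linux/clean && make -j"$(nproc)"
```

Cleaning the kernel target first makes the build system re-apply the whole patch series, so the new file is actually picked up.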
camel: Please commit it ASAP. |
pparent76: Thanks, I will try to compile it tomorrow for all boards and I will try to add pptpd |
camel: Thx |
pparent76: Here are the images: https://www.own-mailbox.com/lede/ @Camel: in order to not run out of memory don't download files to /tmp/ during your tests but to /dev/null |
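On LEDE, /tmp is a RAM-backed tmpfs, so large downloads stored there eat into the memory the kernel needs. A minimal sketch of generating download traffic without storing the payload; `fetch_discard` and the `FETCH` override are hypothetical names introduced only for this example, and the URL in the usage comment is a placeholder:

```shell
#!/bin/sh
# Stream a download through a pipe and discard it, printing only the
# number of bytes received. Nothing is written to /tmp.
# FETCH is a hypothetical hook so the fetch command can be swapped out.
fetch_discard() {
    ${FETCH:-wget -q -O -} "$1" | wc -c
}

# Usage (placeholder URL):
#   fetch_discard http://example.com/large-file.bin
# Equivalent one-liner without the byte count:
#   wget -O /dev/null http://example.com/large-file.bin
```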
bjonglez: The fix has been pushed to master. |
pparent76: @baptiste Jonglez OK, great! (Not sure if it is included in the latest build yet, though, since it was committed 1 hour ago.) |
camel: As I can see: https://git.lede-project.org/?p=source.git;a=summary and the build is from: see: so we have to wait 1 day longer (if the build is done tonight) to use the trunk snapshot :) I'm very keen to test and can't wait any longer :) @Baptiste: Thx |
dchard: I have been testing your fixes (in trunk) for 5 days, and so far there has been no crash, warning, or any other indication of the problem. The kernel and system logs are also clean. How I tested is the following:
Previously it took only a few minutes to reproduce the errors above; now they seem to be gone completely. I will look into the logs every few days to see if anything happens. Thanks for your hard work! |
pparent76: Not sure if it is related, but today I got this with the latest version:
After that the router did not respond, even in UART until I rebooted it. |
bjonglez: This looks like an entirely different issue, please open a new bug report. |
camel:
current trunk
hardware: zbt3526 mt7621
These kernel bugs are occurring more and more often ...
(I did not see this many 2 months ago)
Could it be related to the newer kernel on trunk?
Thu May 25 18:20:04 2017 user.notice root: Subject: [router.xxx.com] KERNEL error/warnings issue - 2017-05-25:18:20:01
Thu May 25 18:20:39 2017 kern.err kernel: [ 4797.640000] INFO: rcu_sched detected stalls on CPUs/tasks:
Thu May 25 18:20:39 2017 kern.err kernel: [ 4797.640000] 2-...: (0 ticks this GP) idle=dc4/0/0 softirq=370963/370963 fqs=0
Thu May 25 18:20:39 2017 kern.err kernel: [ 4797.650000] (detected by 1, t=6003 jiffies, g=119392, c=119391, q=19565)
Thu May 25 18:20:39 2017 kern.info kernel: [ 4797.650000] Task dump for CPU 2:
Thu May 25 18:20:39 2017 kern.info kernel: [ 4797.660000] swapper/2 R running 0 0 1 0x00100000
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] Stack : 00000000 00003a99 00000000 77de22c0 00000000 00000000 804df2a4 80490000
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] 8048c75c 00000001 00000000 8048c5e0 8048c724 80490000 00000000 800135e4
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] 00000000 814a37e0 8fc72000 8fc73ec0 80490000 8005ec74 1100fc03 00000002
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] 00000000 80490000 804df2a4 8005ec6c 80490000 8001b1a8 1100fc03 00000000
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] 00000004 8048c4a0 000000a0 8001b1b0 e8c7e2d3 3a8bf07f 2cfde824 eeff5ebf
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.660000] ...
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.700000] Call Trace:
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.700000] [<8000be98>] __schedule+0x574/0x758
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.710000] [<800135e4>] r4k_wait_irqoff+0x0/0x20
Thu May 25 18:20:39 2017 kern.warn kernel: [ 4797.710000]
Thu May 25 18:20:39 2017 kern.err kernel: [ 4797.710000] rcu_sched kthread starved for 6009 jiffies! g119392 c119391 f0x0 s3 ->state=0x1
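To track how often the stall occurs (for example, the "~30 times per day" mentioned in the comments), the log can be filtered for the marker line. A minimal sketch, assuming the message text matches the log above; `count_stalls` is a hypothetical helper name, and `logread` is the standard LEDE log reader:

```shell
#!/bin/sh
# Count rcu_sched stall reports in log text read from stdin.
# count_stalls is a hypothetical helper, not part of LEDE.
count_stalls() {
    # grep -c exits non-zero when the count is 0; normalise that to success.
    grep -c 'rcu_sched detected stalls' || true
}

# Usage on the router:
#   logread | count_stalls
```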