kernel: do not inline MEMREAD/MEMWRITE ioctls #12472
A Linksys E8450 (mt7622) device running current master has recently started crashing [1]:

    [ 0.562900] mtk-ecc 1100e000.ecc: probed
    [ 0.570254] spi-nand spi2.0: Fidelix SPI NAND was found.
    [ 0.575576] spi-nand spi2.0: 128 MiB, block size: 128 KiB, page size: 2048, OOB size: 64
    [ 0.583780] mtk-snand 1100d000.spi: ECC strength: 4 bits per 512 bytes
    [ 0.682930] Insufficient stack space to handle exception!
    [ 0.682939] ESR: 0x0000000096000047 -- DABT (current EL)
    [ 0.682946] FAR: 0xffffffc008c47fe0
    [ 0.682948] Task stack: [0xffffffc008c48000..0xffffffc008c4c000]
    [ 0.682951] IRQ stack: [0xffffffc008008000..0xffffffc00800c000]
    [ 0.682954] Overflow stack: [0xffffff801feb00a0..0xffffff801feb10a0]
    [ 0.682959] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G S 5.15.107 #0
    [ 0.682966] Hardware name: Linksys E8450 (DT)
    [ 0.682969] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 0.682975] pc : dequeue_entity+0x0/0x250
    [ 0.682988] lr : dequeue_task_fair+0x98/0x290
    [ 0.682992] sp : ffffffc008c48030
    [ 0.682994] x29: ffffffc008c48030 x28: 0000000000000001 x27: ffffff801feb6380
    [ 0.683004] x26: 0000000000000001 x25: ffffff801feb6300 x24: ffffff8000068000
    [ 0.683011] x23: 0000000000000001 x22: 0000000000000009 x21: 0000000000000000
    [ 0.683017] x20: ffffff801feb6380 x19: ffffff8000068080 x18: 0000000017a740a6
    [ 0.683024] x17: ffffffc008bae748 x16: ffffffc008bae6d8 x15: ffffffffffffffff
    [ 0.683031] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000f00000101
    [ 0.683038] x11: 0000000000000449 x10: 0000000000000127 x9 : 0000000000000000
    [ 0.683044] x8 : 0000000000000125 x7 : 0000000000116da1 x6 : 0000000000116da1
    [ 0.683051] x5 : 00000000001165a1 x4 : ffffff801feb6e00 x3 : 0000000000000000
    [ 0.683058] x2 : 0000000000000009 x1 : ffffff8000068080 x0 : ffffff801feb6380
    [ 0.683066] Kernel panic - not syncing: kernel stack overflow
    [ 0.683069] SMP: stopping secondary CPUs
    [ 1.648361] SMP: failed to stop secondary CPUs 0-1
    [ 1.648366] Kernel Offset: disabled
    [ 1.648368] CPU features: 0x00003000,00000802
    [ 1.648372] Memory Limit: none

The last working revision was reportedly commit e11d00d ("ath79: create Aruba AP-105 APBoot compatible image") while the first tested revision that failed with the above message was commit 1416b9b ("tools/dwarves: update to 1.25").

The exact reason for why these kernel panics started happening has not been thoroughly investigated. However, since the crash happens right after snand driver initialization, commit fa4dc86 ("kernel: backport MEMREAD ioctl") is the most likely culprit.

The panic message quoted above includes mentions of a stack overflow. A pending upstream kernel patch [2] exists that prevents inlining mtdchar_read_ioctl() and mtdchar_write_ioctl() into mtdchar_ioctl() as the addition of the former triggered compiler warnings related to stack size on some platforms. Add a backport of that patch in hope of fixing the crashes described above.

[1] https://lists.openwrt.org/pipermail/openwrt-devel/2023-April/040872.html
[2] https://lists.infradead.org/pipermail/linux-mtd/2023-April/097912.html

Signed-off-by: Michał Kępień <openwrt@kempniu.pl>
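For readers unfamiliar with the technique, here is a minimal userspace sketch of why keeping helpers out of line can reduce the stack frame of a dispatcher function. It is not the upstream patch: all `demo_*` names, the buffer size, and the command numbers are hypothetical, and the kernel's `noinline` annotation is approximated here with the GCC/Clang attribute. The idea is that each helper owns a sizeable local buffer; if both were inlined, the dispatcher's frame might have to hold both buffers at once.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's noinline annotation (GCC/Clang). */
#define demo_noinline __attribute__((noinline))

/* Hypothetical values; the real ioctl numbers and request sizes differ. */
#define DEMO_REQ_SIZE  256
#define DEMO_CMD_READ  1
#define DEMO_CMD_WRITE 2

/* Helper with a large local buffer. Kept out of line, the buffer occupies
 * stack space only while this helper is actually running. */
static demo_noinline int demo_read_ioctl(const unsigned char *arg, size_t len)
{
    unsigned char req[DEMO_REQ_SIZE] = {0};
    int sum = 0;

    if (len > sizeof(req))
        return -1;
    memcpy(req, arg, len);
    for (size_t i = 0; i < len; i++)
        sum += req[i];
    return sum; /* sum of the copied bytes, just to use the buffer */
}

/* Second helper with its own large buffer. */
static demo_noinline int demo_write_ioctl(const unsigned char *arg, size_t len)
{
    unsigned char req[DEMO_REQ_SIZE] = {0};

    if (len > sizeof(req))
        return -1;
    memcpy(req, arg, len);
    return (int)len; /* number of bytes "written" */
}

/* Dispatcher: its own frame stays small because neither helper's buffer
 * is merged into it. */
int demo_ioctl(int cmd, const unsigned char *arg, size_t len)
{
    switch (cmd) {
    case DEMO_CMD_READ:
        return demo_read_ioctl(arg, len);
    case DEMO_CMD_WRITE:
        return demo_write_ioctl(arg, len);
    default:
        return -1;
    }
}
```

Whether a compiler would actually inline such helpers, and how much stack that would cost, depends on the compiler version and flags; building with GCC's `-fstack-usage` is one way to compare per-function frame sizes with and without the attribute.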
I'm now 99.9% sure, as reverting the commit fixes the problem.
Thanks a lot for the fix attempt, but this doesn't help.
It's been pointed out that the crash happens shortly after Looking at target/linux/mediatek/patches-5.15/120-12-v5.19-spi-add-driver-for-MTK-SPI-NAND-Flash-Interface.patch where this function is defined this is a backport from 5.19 that is specific to the mediatek target: could it simply be a matter of missing backported bits for this particular piece of code to play ball with MEMREAD/MEMWRITE?
@f00b4r0 I don't think so. MEMREAD/MEMWRITE are high-level abstractions that use internal interfaces provided by the Linux kernel for talking to MTDs.
@ynezz Thanks for checking. Is there any chance you could retest
That makes sense, however AIUI the driver above is not a flash driver: I've only given a quick glance at it but it appears to be a spi-mem glue that does clever things to interface a NAND device. If you look at the comment at the very beginning of this file, it seems in particular to handle the peculiarities of this platform wrt OOB. This might have some bearing here? In fact your argument cuts both ways: I think it's a clue that this bug seemingly only happens on this platform where this driver is used. Otherwise we would have seen tons of other breakages on OpenWrt devices by now :)
Indeed, the stack trace would show whether the above driver is really involved in this bug.
Thanks, @ynezz, that looks very useful. Note the recursion depth in that stack trace. That comes from
Not to mention that It is hard for me to confirm this with 100% certainty without getting my hands on the device in question, but it seems to me that the MEMREAD backport may have pushed this call stack over the limit. Unfortunately I also do not see any way of preventing
@kempniu this is sound analysis. What does bother me is that
Nothing obvious, at least not to me. Since this is the only lead I have for now and I do not have direct access to the affected device, I looked at the code I called out in my previous message and it seems to me that there is no real need for recursion there as the logic just scans flash blocks backwards. IMHO the current code layout only muddles what this code actually does. @ynezz, could you perhaps give kempniu/openwrt@2f248ab (the (Please make sure that
Agreed, the recursion is unnecessary.
In case this helps, the new code looks correct. Thanks for proposing this.
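As a general illustration of the point above (not the code from the referenced commit or from the driver), a backward scan over blocks can be written either recursively or as a loop. The sketch below assumes a hypothetical predicate that says whether a block should be skipped; all `demo_*` names are invented for this example. The recursive form may use one stack frame per skipped block unless the compiler turns the tail call into a jump, while the loop uses constant stack regardless of optimization settings.

```c
#include <stdbool.h>

/* Hypothetical predicate type: returns true if the block at 'index'
 * should be skipped during the scan. */
typedef bool (*demo_skip_fn)(int index);

/* Hypothetical sample predicate: skip every block above index 2. */
bool demo_skip_above_two(int index)
{
    return index > 2;
}

/* Recursive backward scan. Each skipped block adds a call; stack use
 * grows with the number of skipped blocks if the tail call is not
 * optimized away. Returns the first usable index, or -1 if none. */
int demo_find_block_recursive(int index, demo_skip_fn skip)
{
    if (index < 0)
        return -1;
    if (!skip(index))
        return index;
    return demo_find_block_recursive(index - 1, skip);
}

/* Iterative backward scan with the same result and constant stack use. */
int demo_find_block_iterative(int index, demo_skip_fn skip)
{
    for (; index >= 0; index--) {
        if (!skip(index))
            return index;
    }
    return -1;
}
```

Both functions return the same index for the same inputs; the loop simply makes the "scan backwards until a usable block is found" intent explicit and does not rely on tail-call optimization.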
Nice work, seems to be fixed, thanks!
Nice. However that seems to suggest that tail-call optimization is disabled here (although it's also possible the compiler failed to optimize). This is something to bear in mind if stack exhaustion happens again elsewhere.
The rework also looks good to me, thanks.
Indeed, then this explains that!
Cool, thanks everyone for taking a look. I will soon polish the commit message for the revised fix and prepare a proper (separate) merge request out of it.
See also #12225