lantiq: xway_nand: don't yield while holding spinlock (#9829 fix) #12265

tomjnixon · 2023-03-26T11:19:30Z

The nand driver normally yields while waiting for the device to become ready; this is normally fine, but xway_nand holds the ebu_lock spinlock, and this can cause lockups if other threads which use ebu_lock are interleaved. Fix this by waiting instead of polling.

This mainly showed up as crashes in ath9k_pci_owl_loader (see #9829), but turning on spinlock debugging shows this happening in other places too.

This doesn't seem to measurably impact boot time.

Tested on bt_homehub-v5a with 5.10 and 5.15.
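For readers who want to see the shape of the change, here is a rough user-space sketch, not the actual patch. The register read is simulated, all function names are hypothetical, and treating NAND_WAIT_RD as bit 0 is an assumption for illustration. It contrasts a ready callback that reports the bit once (so the caller polls and may yield) with one that spins until the bit is set.

```c
#include <stdint.h>

/* Assumption for illustration: the ready flag is bit 0 of the wait register. */
#define NAND_WAIT_RD 0x1u

/* Simulated device: stays busy for this many register reads. */
static uint32_t fake_polls_until_ready = 3;

/* Hypothetical stand-in for ltq_ebu_r32(EBU_NAND_WAIT). */
static uint32_t fake_ebu_read_wait(void)
{
	if (fake_polls_until_ready > 0) {
		fake_polls_until_ready--;
		return 0;
	}
	return NAND_WAIT_RD;
}

/* Polling shape: report the ready bit once; the caller decides how to wait
 * (and, in the kernel's generic wait path, may yield between polls). */
static int dev_ready_poll(void)
{
	return (fake_ebu_read_wait() & NAND_WAIT_RD) != 0;
}

/* Spinning shape: do not return until the device is ready, so the caller
 * never observes "busy" and has no reason to yield. */
static int dev_ready_spin(void)
{
	while ((fake_ebu_read_wait() & NAND_WAIT_RD) == 0)
		;
	return 1;
}
```

With the simulated device busy for three reads, one call to dev_ready_poll() reports busy, while dev_ready_spin() returns only once the ready bit is set.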

Random thoughts:

- More testing would be helpful, though I think this should be safe.
- Perhaps there's a better approach to solving this?
- This defeats timeouts in nand_wait/nand_wait_ready -- are these actually necessary? There's already a busy loop in xway_cmd_ctrl.
- This should be backported to 22.03. I haven't checked if this issue is present in 21.02, or if it just doesn't show up.

Thanks!

monnier · 2023-03-26T12:37:47Z

[ Kernel newbie here, commenting from the peanut gallery: ]
AFAIK the traditional way to fix problems where we yield while still holding a lock is to release the lock before yielding.
Can you comment about the tradeoff here between busy-waiting as you do, and releasing the lock instead?

tomjnixon · 2023-03-26T12:46:55Z

> [ Kernel newbie here, commenting from the peanut gallery: ]

me too tbh :)

> AFAIK the traditional way to fix problems where we yield while still holding a lock is to release the lock before yielding.

I don't think that would be sensible, because the lock isn't protecting a data structure, it's protecting the EBU hardware. It's unclear to me if it would be OK to have a PCI transaction happen when the flash CE pin is held high, for example.

> Can you comment about the tradeoff here between busy-waiting as you do, and releasing the lock instead?

The longer it waits, the worse it would be for performance. Given that this doesn't occur often I'd guess not long, but it may be worth testing. That's why I mentioned that the boot time doesn't change much -- it clearly doesn't make that much difference.

monnier · 2023-03-26T13:25:36Z

> I don't think that would be sensible, because the lock isn't protecting a data structure, it's protecting the EBU hardware. It's unclear to me if it would be OK to have a PCI transaction happen when the flash CE pin is held high, for example.

Ah, makes sense, thanks.
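The point that the lock guards the shared bus rather than a data structure can be sketched in user space as follows. This is only an illustration under stated assumptions: bus_lock stands in for ebu_lock, all function names are hypothetical, and a try-lock is used so that contention is observable in a single thread, whereas the real driver takes a spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for ebu_lock: one flag guarding the shared bus. */
static atomic_flag bus_lock = ATOMIC_FLAG_INIT;

static bool bus_trylock(void)
{
	/* test_and_set returns the previous value; false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&bus_lock, memory_order_acquire);
}

static void bus_unlock(void)
{
	atomic_flag_clear_explicit(&bus_lock, memory_order_release);
}

/* NAND side: the lock is held across the whole command-plus-wait sequence,
 * so releasing it in the middle would expose the bus mid-operation. */
static bool nand_begin(void)
{
	return bus_trylock();
}

static void nand_end(void)
{
	bus_unlock();
}

/* PCI side: an access goes ahead only when nobody else owns the bus. */
static bool pci_access(void)
{
	if (!bus_trylock())
		return false;
	bus_unlock();
	return true;
}
```

While a NAND operation is in progress (between nand_begin() and nand_end()), pci_access() is refused; once the operation ends, it succeeds. Dropping the lock during the wait would let the PCI access through in the middle of the NAND sequence, which is the concern raised above.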

abajk · 2023-03-26T20:56:10Z

@tomjnixon It would be great if you could send the patch upstream and discuss it with mtd maintainers.

tomjnixon · 2023-03-27T09:33:59Z

Good point, done.

Just4pLeisure · 2023-03-30T19:23:40Z

@tomjnixon Great work in narrowing down the cause of what looks like a race condition 👍

I'm no expert either but this looks like a workaround 'hack' rather than a fix proper. Functions like xway_dev_ready should be very simple and only return what is in this case the LSB of a register. It is up to whoever calls xway_dev_ready to deal with what it finds, that is the correct place to apply an eventual fix.

As an aside, the while loop in this patch is blocking and will peg the CPU/core at 100% until it exits. Adding a short sleep will alleviate this.

I really am quite the novice so only know enough to be dangerous 😆 but can build OpenWRT and am happy to assist with testing.

Sophie x

tomjnixon · 2023-03-31T08:32:02Z

> I'm no expert either but this looks like a workaround 'hack' rather than a fix proper. Functions like xway_dev_ready should be very simple and only return what is in this case the LSB of a register. It is up to whoever calls xway_dev_ready to deal with what it finds, that is the correct place to apply an eventual fix.

Quite, sorry if I didn't make that completely clear.

Reading around more, I think there are two ways to solve this legitimately:

- If locking is really required, then the generic nand driver should have an option to tell it not to sleep while waiting, or the ready function should be turned into a wait function.
- If locking is not actually required, then figure out how to remove it without causing more problems.

> As an aside, the while loop in this patch is blocking and will peg the CPU/core at 100% until it exits. Adding a short sleep will alleviate this.

Hmm, the sleep is kind of the problem. You could use udelay and friends, but those are tight loops too. That might reduce some bus activity, but I doubt that would ever be measurable.

> I really am quite the novice so only know enough to be dangerous 😆 but can build OpenWRT and am happy to assist with testing.

Well, more testing of this can at least confirm that the original issue was really just caused by this locking trouble, even though it's not the final fix. Thanks for having a look.
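One way to keep a timeout without sleeping is to cap the number of spins. The following is only a user-space sketch of that idea and is not part of this patch: the register read is simulated, every name is hypothetical, NAND_WAIT_RD as bit 0 is an assumption, and relax() is a compiler barrier standing in for the kernel's cpu_relax().

```c
#include <errno.h>
#include <stdint.h>

/* Assumption for illustration: the ready flag is bit 0 of the wait register. */
#define NAND_WAIT_RD 0x1u

/* Simulated device: number of register reads that still report "busy". */
static uint32_t busy_reads_left;

/* Hypothetical stand-in for reading the EBU wait register. */
static uint32_t read_wait_reg(void)
{
	if (busy_reads_left > 0) {
		busy_reads_left--;
		return 0;
	}
	return NAND_WAIT_RD;
}

/* User-space stand-in for cpu_relax(): a compiler barrier only. */
static inline void relax(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Spin without sleeping, but give up after max_spins reads.
 * Returns 0 once the device is ready, -ETIMEDOUT otherwise. */
static int wait_ready_bounded(unsigned long max_spins)
{
	while (max_spins--) {
		if (read_wait_reg() & NAND_WAIT_RD)
			return 0;
		relax();
	}
	return -ETIMEDOUT;
}
```

A cap expressed in iterations rather than time is crude, since how long it lasts depends on CPU and bus speed, but it never yields, so it would be usable while a spinlock is held.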

hauke · 2023-04-01T18:52:09Z

target/linux/lantiq/patches-5.10/0400-mtd-rawnand-xway-don-t-yield-while-holding-spinlock.patch

+	 * nand_wait_ready, which is a bad idea when we're holding ebu_lock
+	 */
+	while ((ltq_ebu_r32(EBU_NAND_WAIT) & NAND_WAIT_RD) == 0)
+		;


Please add cpu_relax() here:

Suggested change

-+		;
++		cpu_relax();

tomjnixon · 2023-04-01

Done. Could do the same in the spin waits in xway_cmd_ctrl for consistency, but it doesn't actually do anything on mips.

hauke · 2023-04-01T18:58:11Z

I agree that this is a hack but the lock is already a hack.

We have to make sure the NAND flash is not accessed at the same time as the PCI device. I am not even sure if the spin_lock_irqsave is sufficient to prevent PCI accesses, a device could do DMA in the background.

For me this pull request looks good.

tomjnixon · 2023-04-01T22:39:22Z

> We have to make sure the NAND flash is not accessed at the same time as the PCI device. I am not even sure if the spin_lock_irqsave is sufficient to prevent PCI accesses, a device could do DMA in the background.

Yeah I was thinking about that -- you'd think that flash/pci errors would be far more common if this actually caused issues.

Just4pLeisure · 2023-04-02T14:25:54Z

Thank you @tomjnixon and @hauke for your explanations and clarifications 😄

My router has never successfully booted with 'official' 22.03 builds before; it remains stuck in a bootloop seemingly forever. I've been testing with the updated patch with cpu_relax() as @hauke suggested. After power cycling and 'soft' rebooting my router many times I'm happy to report that I haven't experienced the bootloop once 👍

The nand driver normally yields while waiting for the device to become ready; this is normally fine, but xway_nand holds the ebu_lock spinlock, and this can cause lockups if other threads which use ebu_lock are interleaved. Fix this by waiting instead of polling.

This mainly showed up as crashes in ath9k_pci_owl_loader (see openwrt#9829), but turning on spinlock debugging shows this happening in other places too.

This doesn't seem to measurably impact boot time.

Tested on bt_homehub-v5a with 5.10 and 5.15.

Signed-off-by: Thomas Nixon <tom@tomn.co.uk>
[Add commit description into patch]
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>

github-actions bot added the target/lantiq label Mar 26, 2023

tomjnixon mentioned this pull request Mar 26, 2023

BT Home Hub 5A (HH5a) - 22.03.0-rc1 - crashes/reboots during boot up sequence #9829

Closed

tomjnixon force-pushed the lantiq_nand_lock_fix branch from afd3e40 to f4990c7 Compare March 26, 2023 11:34

hauke reviewed Apr 1, 2023

tomjnixon force-pushed the lantiq_nand_lock_fix branch from f4990c7 to 0660d10 Compare April 1, 2023 22:28

hauke force-pushed the lantiq_nand_lock_fix branch from 0660d10 to d3b4790 Compare April 2, 2023 16:23

openwrt-bot merged commit d3b4790 into openwrt:master Apr 2, 2023
13 checks passed
