New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lantiq: xway_nand: don't yield while holding spinlock (#9829 fix) #12265
Conversation
afd3e40
to
f4990c7
Compare
[ Kernel newbie here, commenting from the peanut gallery: ] |
me too tbh :)
I don't think that would be sensible, because the lock isn't protecting a data structure, it's protecting the EBU hardware. It's unclear to me if it would be OK to have a PCI transaction happen when the flash CE pin is held high, for example.
The longer it waits for the worse it would be for performance. Given that this doesn't occur often I'd guess not long, but it may be worth testing. That's why i mentioned that the boot time doesn't change much -- it clearly doesn't make that much difference. |
Ah, makes sense, thanks. |
@tomjnixon It would be great if you could send the patch upstream and discuss it with mtd maintainers. |
Good point, done. |
@tomjnixon Great work in narrowiwng down the cause of what looks like a race condition 👍 I'm no expert either but this looks like a workaround 'hack' rather than a fix proper. Functions like xway_dev_ready should be very simple and only return what is in this case the LSB of a register. It is up to whoever calls xway_dev_ready to deal with what it finds, that is the correct place to apply an eventual fix. As aside, the while loop in this patch is blocking and will peg the CPU/core at 100% until it exits. Adding a short sleep will alleviate this. I really am quite the novice so only know enough to be dangerous 😆 but can build OpenWRT and am happy to assist with testing. Sophie x |
Quite, sorry if I didn't make that completely clear. Reading around more, I think there are two ways to solve this legitimately:
Hmm, the sleep is kind of the problem. You could use udelay and friends, but those are tight loops too. That might reduce some bus activity, but I doubt that would ever be measurable.
Well, more testing of this can at least confirm that the original issue was really just caused by this locking trouble, even though it's not the final fix. Thanks for having a look. |
+ * nand_wait_ready, which is a bad idea when we're holding ebu_lock | ||
+ */ | ||
+ while ((ltq_ebu_r32(EBU_NAND_WAIT) & NAND_WAIT_RD) == 0) | ||
+ ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add cpu_relax() here:
+ ; | |
+ cpu_relax(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Could do the same in the spin waits in xway_cmd_ctrl
for consistency, but it doesn't actually do anything on mips.
I agree that this is a hack but the lock is already a hack. We have to make sure the NAND flash is not accessed at the same time as the PCI device. I am not even sure if the For me this pull request looks good. |
f4990c7
to
0660d10
Compare
Yeah I was thinking about that -- you'd think that flash/pci errors would be far more common if this actually caused issues. |
Thank you @tomjnixon and @hauke for your explanations and clarifications 😄 My router has never successfully booted with 'official' 22.03 builds before, it remains stuck in a bootloop seemingly forever. I've been testing with the updated patch with cpu_relax() as @hauke suggested. After power cycling and 'soft' rebooting my router many times I'm happy to report that I haven't experienced the bootloop once 👍 |
The nand driver normally while waiting for the device to become ready; this is normally fine, but xway_nand holds the ebu_lock spinlock, and this can cause lockups if other threads which use ebu_lock are interleaved. Fix this by waiting instead of polling. This mainly showed up as crashes in ath9k_pci_owl_loader (see openwrt#9829 ), but turning on spinlock debugging shows this happening in other places too. This doesn't seem to measurably impact boot time. Tested on bt_homehub-v5a with 5.10 and 5.15. Signed-off-by: Thomas Nixon <tom@tomn.co.uk> [Add commit description into patch] Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
0660d10
to
d3b4790
Compare
The nand driver normally yields while waiting for the device to become ready; this is normally fine, but xway_nand holds the ebu_lock spinlock, and this can cause lockups if other threads which use ebu_lock are interleaved. Fix this by waiting instead of polling.
This mainly showed up as crashes in ath9k_pci_owl_loader (see #9829 ), but turning on spinlock debugging shows this happening in other places too.
This doesn't seem to measurably impact boot time.
Tested on bt_homehub-v5a with 5.10 and 5.15.
Random thoughts:
nand_wait
/nand_wait_ready
-- are these actually necessary? there's already a busy loop inxway_cmd_ctrl
.Thanks!