Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lantiq: xway_nand: don't yield while holding spinlock (#9829 fix) #12265

Merged
merged 1 commit into from Apr 2, 2023

Conversation

tomjnixon
Copy link
Contributor

The nand driver normally yields while waiting for the device to become ready; this is normally fine, but xway_nand holds the ebu_lock spinlock, and this can cause lockups if other threads which use ebu_lock are interleaved. Fix this by waiting instead of polling.

This mainly showed up as crashes in ath9k_pci_owl_loader (see #9829 ), but turning on spinlock debugging shows this happening in other places too.

This doesn't seem to measurably impact boot time.

Tested on bt_homehub-v5a with 5.10 and 5.15.

Random thoughts:

  • more testing would be helpful, though i think this should be safe
  • perhaps there's a better approach to solving this?
  • this defeats timeouts in nand_wait/nand_wait_ready -- are these actually necessary? there's already a busy loop in xway_cmd_ctrl.
  • this should be backported to 22.03. I haven't checked if this issue is present in 21.02, or if it just doesn't show up.

Thanks!

@monnier
Copy link

monnier commented Mar 26, 2023

[ Kernel newbie here, commenting from the peanut gallery: ]
AFAIK the traditional way to fix problems where we yield while still holding a lock is to release the lock before yielding.
Can you comment about the tradeoff here between busy-waiting as you do, and releasing the lock instead?

@tomjnixon
Copy link
Contributor Author

[ Kernel newbie here, commenting from the peanut gallery: ]

me too tbh :)

AFAIK the traditional way to fix problems where we yield while still holding a lock is to release the lock before yielding.

I don't think that would be sensible, because the lock isn't protecting a data structure, it's protecting the EBU hardware. It's unclear to me if it would be OK to have a PCI transaction happen when the flash CE pin is held high, for example.

Can you comment about the tradeoff here between busy-waiting as you do, and releasing the lock instead?

The longer it waits for the worse it would be for performance. Given that this doesn't occur often I'd guess not long, but it may be worth testing. That's why i mentioned that the boot time doesn't change much -- it clearly doesn't make that much difference.

@monnier
Copy link

monnier commented Mar 26, 2023

I don't think that would be sensible, because the lock isn't protecting a data structure, it's protecting the EBU hardware. It's unclear to me if it would be OK to have a PCI transaction happen when the flash CE pin is held high, for example.

Ah, makes sense, thanks.

@abajk
Copy link
Contributor

abajk commented Mar 26, 2023

@tomjnixon It would be great if you could send the patch upstream and discuss it with mtd maintainers.

@tomjnixon
Copy link
Contributor Author

Good point, done.

@Just4pLeisure
Copy link

@tomjnixon Great work in narrowiwng down the cause of what looks like a race condition 👍

I'm no expert either but this looks like a workaround 'hack' rather than a fix proper. Functions like xway_dev_ready should be very simple and only return what is in this case the LSB of a register. It is up to whoever calls xway_dev_ready to deal with what it finds, that is the correct place to apply an eventual fix.

As aside, the while loop in this patch is blocking and will peg the CPU/core at 100% until it exits. Adding a short sleep will alleviate this.

I really am quite the novice so only know enough to be dangerous 😆 but can build OpenWRT and am happy to assist with testing.

Sophie x

@tomjnixon
Copy link
Contributor Author

tomjnixon commented Mar 31, 2023

I'm no expert either but this looks like a workaround 'hack' rather than a fix proper. Functions like xway_dev_ready should be very simple and only return what is in this case the LSB of a register. It is up to whoever calls xway_dev_ready to deal with what it finds, that is the correct place to apply an eventual fix.

Quite, sorry if I didn't make that completely clear.

Reading around more, I think there are two ways to solve this legitimately:

  • if locking is really required, then the generic nand driver should have an option to tell it not to sleep while waiting, or the ready function should be turned into a wait function
  • if locking is not actually required, then figure out how to remove it without causing more problems.

As aside, the while loop in this patch is blocking and will peg the CPU/core at 100% until it exits. Adding a short sleep will alleviate this.

Hmm, the sleep is kind of the problem. You could use udelay and friends, but those are tight loops too. That might reduce some bus activity, but I doubt that would ever be measurable.

I really am quite the novice so only know enough to be dangerous 😆 but can build OpenWRT and am happy to assist with testing.

Well, more testing of this can at least confirm that the original issue was really just caused by this locking trouble, even though it's not the final fix. Thanks for having a look.

+ * nand_wait_ready, which is a bad idea when we're holding ebu_lock
+ */
+ while ((ltq_ebu_r32(EBU_NAND_WAIT) & NAND_WAIT_RD) == 0)
+ ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add cpu_relax() here:

Suggested change
+ ;
+ cpu_relax();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Could do the same in the spin waits in xway_cmd_ctrl for consistency, but it doesn't actually do anything on mips.

@hauke
Copy link
Member

hauke commented Apr 1, 2023

I agree that this is a hack but the lock is already a hack.

We have to make sure the NAND flash is not accessed at the same time as the PCI device. I am not even sure if the spin_lock_irqsave is sufficient to prevent PCI accesses, a device could do DMA in the background.

For me this pull request looks good.

@tomjnixon
Copy link
Contributor Author

We have to make sure the NAND flash is not accessed at the same time as the PCI device. I am not even sure if the spin_lock_irqsave is sufficient to prevent PCI accesses, a device could do DMA in the background.

Yeah I was thinking about that -- you'd think that flash/pci errors would be far more common if this actually caused issues.

@Just4pLeisure
Copy link

Thank you @tomjnixon and @hauke for your explanations and clarifications 😄

My router has never successfully booted with 'official' 22.03 builds before, it remains stuck in a bootloop seemingly forever. I've been testing with the updated patch with cpu_relax() as @hauke suggested. After power cycling and 'soft' rebooting my router many times I'm happy to report that I haven't experienced the bootloop once 👍

The nand driver normally while waiting for the device to become ready;
this is normally fine, but xway_nand holds the ebu_lock spinlock, and
this can cause lockups if other threads which use ebu_lock are
interleaved. Fix this by waiting instead of polling.

This mainly showed up as crashes in ath9k_pci_owl_loader (see
openwrt#9829 ), but turning on
spinlock debugging shows this happening in other places too.

This doesn't seem to measurably impact boot time.

Tested on bt_homehub-v5a with 5.10 and 5.15.

Signed-off-by: Thomas Nixon <tom@tomn.co.uk>
[Add commit description into patch]
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
@openwrt-bot openwrt-bot merged commit d3b4790 into openwrt:master Apr 2, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
target/lantiq pull request/issue for lantiq target
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants