TL-WDR3600 v1.5 intermittently hangs at reboot #13043

Closed
1 task done
Shine- opened this issue Jul 3, 2023 · 15 comments
Labels
bug issue report with a confirmed bug

Comments

@Shine-

Shine- commented Jul 3, 2023

Describe the bug

This is a very old issue that apparently reappeared with 22.03 and affects every later version. It should also affect the TL-WDR4300 v1.7. If it really is the same recurring old issue, earlier hardware revisions of these two devices should not be affected.

After sysupgrading, the device resets, all LEDs turn off, then the Ethernet Link/Status LEDs start flashing again as normal, but the star-shaped status LED stays off. The device hangs indefinitely and needs to be power-cycled. After a power cycle, everything is back to normal.

The issue occurs intermittently, in my experience in around 5-50% of all sysupgrades, and even less frequently when rebooting normally.

I remember this issue from around (or more than?) 10 years ago. Iirc, after a lot of guesswork and tries from multiple people, it finally disappeared. I can't, however, find the commit that actually fixed it.

As versions 19.07/21.02 (ath79) aren't affected, I assume the issue doesn't have anything to do with the ar71xx to ath79 transition. Otherwise it would've likely reappeared in 19.07 already.

Testing this requires some patience, since 10-20 sysupgrades in a row can succeed without the issue occurring. Since I'm currently using a TL-WDR3600 v1.5 for some testing and am therefore resetting/sysupgrading it quite often, I can see this issue happen more or less frequently with my device.

The difference from the old issue of 10 years ago is that nowadays it seems to occur only after sysupgrading, not after a normal reboot.

Wild guess: might this be another sporadic case of missing cache invalidation on reboot, so that the CPU ends up in an endless loop, as was fixed for a number of ath79 devices before?

OpenWrt version

22.03 or later

OpenWrt target/subtarget

ath79 / generic

Device

TL-WDR3600 v1.5 (and likely also TL-WDR4300 v1.7)

Image kind

Official downloaded image

Steps to reproduce

  1. Install 22.03 or later
  2. sysupgrade or sysupgrade -n to any version 22.03 or later
  3. wait for flashing to end and the device to reset
  4. in case the device reboots normally, repeat from step (2)

In rare cases, a normal reboot without sysupgrade is sufficient to show the issue.

Actual behaviour

After 1 to X tries, the device will hang indefinitely after the reboot has been initiated by sysupgrade. The "star" shaped status LED will be off and the Ethernet Link/Status LEDs will flash as normal. The device requires power-cycling to work normally again.

Expected behaviour

Device comes up again reliably after any number of sysupgrade trials. The "star" shaped LED turns off after reset, then turns on (during U-Boot stage), then normal OpenWrt startup begins (flashing rapidly, then flashing slowly, then on steadily).

Additional info

To make sure this doesn't affect 21.02, I ran >10 sysupgrades in a row with 21.02.7 right before creating this issue. The reboot was always successful. There's no 100% certainty, though...
Fact is, I never saw this issue while this device was in production use with 19.07 or 21.02 based Gluon firmware. The issue started appearing with 22.03 based Gluon, therefore I took it out of production and started using it as a testing device (using vanilla OpenWrt, not Gluon).
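
The caveat about certainty can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that each reboot hangs independently with a fixed probability; the function names `p_clean` and `runs_needed` are made up here and are not part of any OpenWrt tooling.

```shell
#!/bin/sh
# Assumption: every reboot hangs independently with the same probability P.

# p_clean P N: probability that N consecutive reboots all succeed.
p_clean() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# runs_needed P C: smallest N for which at least one hang is expected
# to show up with confidence C, i.e. (1 - P)^N <= 1 - C.
runs_needed() {
    awk -v p="$1" -v c="$2" 'BEGIN {
        n = log(1 - c) / log(1 - p)
        r = int(n)
        if (r < n) r++      # round up to a whole number of runs
        print r
    }'
}

p_clean 0.05 10         # lower end of the reported 5-50% rate, prints 0.599
p_clean 0.50 10         # upper end, prints 0.001
runs_needed 0.05 0.95   # prints 59
```

Under that assumption, ten clean runs in a row are more likely than not at a 5% hang rate, so on the order of 59 clean runs would be needed before an unaffected version or a candidate fix can be called clean with 95% confidence.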

Diffconfig

No response

Terms

  • I am reporting an issue for OpenWrt, not an unsupported fork.
Shine- added the bug (issue report with a confirmed bug) label Jul 3, 2023
@brada4

brada4 commented Jul 4, 2023

  • Do you have any other device that failed to reboot?
  • A serial port connection to an affected device is the only chance to understand what happens before the fast blink, i.e. before the kernel has booted and the init script offers failsafe mode.

@Shine-
Author

Shine- commented Jul 5, 2023

If you read my description correctly, you will (I hope??) understand that the device doesn't even enter U-Boot, so it is still far away from loading the kernel.
Please do everyone a favor, refrain from cluttering this issue with any more unqualified comments and leave it to others to actually contribute.

@brada4

brada4 commented Jul 5, 2023

More likely, something left in memory by the kernel prevents U-Boot from completing, e.g. it expects a zeroed heap at some memory location but the memory is not zeroed. The serial output is likely to contain addresses, which would make it possible, for example, to exclude 16 kB at that physical location from Linux to ensure a clean reboot.

@Shine-
Author

Shine- commented Jul 6, 2023

> I remember this issue from around (or more than?) 10 years ago. Iirc, after a lot of guesswork and tries from multiple people, it finally disappeared. I can't, however, find the commit that actually fixed it.

For the record, I was able to dig out the old issue from 2014, and the proposed fix, as well as the accepted fix for it.

Here's the issue from the archived issue tracker, here the proposed fix from the mailing list. The accepted fix by nbd is fa3cb9f.

Note that neither of the two fixes changes the behavior described here, for me.

Also for the record, the last fix I know of for a cache invalidation or cache flush race condition is 26bc8f6, which didn't change the behavior I'm experiencing either.

I'll try to extract a serial log one of these days (may take a while, though), just to make sure that it's not a completely different problem I'm experiencing here.

In the meantime, I'd appreciate if others, who own a TL-WDR3600 or TL-WDR4300 could report here whether they're also seeing the reboot-after-sysupgrade issue or not, along with their exact HW version.

@grische
Contributor

grische commented Jan 9, 2024

I can reproduce the issue on a TP-Link WDR4300 using stock OpenWrt, with a simple reboot. No sysupgrade needed.

Summary

Device

TP-Link TL-WDR4300 v1

Description

Regression introduced with OpenWrt 22.03.

I have a TP-Link WDR4300 with a serial port installed and a serial console cable attached. Simply installing stock OpenWrt (a simple sysupgrade -n) and rebooting repeatedly will reproduce the issue.

Additional info

Reverting commit ebf0d8d did not seem to have any impact on the problem.

There is also an extensive discussion here: freifunk-gluon/gluon#2904

Steps to reproduce

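  1. Install OpenWrt (no configuration whatsoever needed)
  2. Use the following bash snippet to reboot repeatedly:
# The log directory must exist, otherwise the redirection below fails.
mkdir -p ~/reboot-tests
# Every 30 s, try to reboot the device and log each accepted reboot command.
while sleep 30
do
    ssh root@192.168.1.1 reboot && echo "Rebooted at $(date --iso=s)" >> ~/reboot-tests/reboot_"$(date --iso)".log
done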
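
The loop above only logs reboot commands that were accepted, so a hang has to be inferred from a gap in the log. A minimal sketch of an explicit check follows; `wait_for_up` is a hypothetical helper name, and the retry count, delay, and SSH timeout in the usage comment are arbitrary assumptions rather than tested values.

```shell
#!/bin/sh
# wait_for_up TRIES DELAY CMD...: run CMD up to TRIES times, DELAY seconds
# apart. Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_for_up() {
    tries=$1 delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Possible use against a device (left as comments so the block runs
# without hardware attached):
#   ssh root@192.168.1.1 reboot
#   sleep 20    # assumed time for the device to actually go down
#   wait_for_up 24 5 ssh -o ConnectTimeout=5 root@192.168.1.1 true \
#       || echo "Hang suspected at $(date --iso=s)"
```

Passing the probe command as arguments keeps the helper independent of SSH, so a ping or a serial-console check could be substituted without changing the function.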

@Shine-
Author

Shine- commented Jan 9, 2024

Thanks for confirming!

You're right, I'm also experiencing hangs during normal (non-sysupgrade) reboots of my WDR3600 @ 22.03+, but these are rare, so I wasn't aware of it yet when creating this issue.

I have a WDR4300 v1.7 as well by now, which didn't show the issue yet - though here as well, I can't be sure, since I only rebooted/sysupgraded it a handful of times so far.

Thanks for linking to the issue in the Gluon repo, I wasn't aware of that (I'm not following the Gluon repo). I already saw the "test/tp-link-wdr4300-hangs" branch in your site-ffm repo, though, which kind of confirmed to me that I'm not the only one experiencing this problem :-)

@DragonBluep
Contributor

@Shine- @grische Are you using the u-boot_mod bootloader? A few months ago, I encountered this problem by chance.

@Shine-
Copy link
Author

Shine- commented Jan 9, 2024

Stock TP-Link boot loader from latest official firmware, for me.

@DragonBluep
Contributor

My report. #12764 (comment)

when downgrade from the 6.1 kernel via LuCI, sometimes device won't restart
Re-plug power supply can recover the system.
Can not reproduce it now. 2023/08/03

I guess some reset functions were not called correctly. Unfortunately, I haven't encountered it again recently, so I can't debug it.

This is a generic problem, unrelated to the specific device. So I would suggest changing the title to something like "ath79: intermittently hangs at reboot after sysupgrade since 22.03".

@grische
Contributor

grische commented Jan 9, 2024

> My report. #12764 (comment)
>
> when downgrade from the 6.1 kernel via LuCI, sometimes device won't restart
> Re-plug power supply can recover the system.
> Can not reproduce it now. 2023/08/03
>
> I guess some reset functions were not called correctly. Unfortunately, I haven't encountered it again recently, so I can't debug it.
>
> This is a generic problem, unrelated to the specific device. So I would suggest changing the title to something like "ath79: intermittently hangs at reboot after sysupgrade since 22.03".

I am also for renaming the ticket, but only the WDR4300 and WDR3600 seem affected. It is also independent of sysupgrade; a simple reboot will do the trick.

EDIT: to clarify, we have a large set of different models and we only noticed the problems with these two models.

@DragonBluep
Contributor

If you can build the OpenWrt source code, I believe this patch has a chance of fixing this issue.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=d3115128bdafb62628ab41861a4f06f6d02ac320

@Shine-
Author

Shine- commented Jan 9, 2024

ath79_restart is already gone from both kernel 5.10 (22.03) and 5.15 (23.05).

Also, I have to agree with @grische that only my WDR3600 (and possibly my WDR4300 if I finally get around to using it) is affected for me, none of my many other ath79 based devices.

Shine- changed the title from "TL-WDR3600 v1.5 intermittently hangs at reboot after sysupgrade" to "TL-WDR3600 v1.5 intermittently hangs at reboot" Jan 9, 2024
grische pushed a commit to grische/openwrt that referenced this issue Jan 10, 2024
Add a cache-barrier after the reset-register write. This fixes spurious
reboot issues on TP-Link WDR3600 and WDR4300 devices with Zentel DDR2
DRAM chips.

This issue was fixed in the past, but switching to the reset-driver
specific implementation removed the cache barrier which was previously
implicitly added by reading back the register in question.

Link: freifunk-gluon/gluon#2904
Link: openwrt#13043
Link: https://dev.archive.openwrt.org/ticket/17839

Signed-off-by: David Bauer <mail@david-bauer.net>
blocktrron added a commit to blocktrron/openwrt that referenced this issue Jan 10, 2024
Read back the reset register in order to flush the cache. This fixes
spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel
DRAM chips.

This issue was fixed in the past, but switching to the reset-driver
specific implementation removed the cache barrier which was previously
implicitly added by reading back the register in question.

Link: freifunk-gluon/gluon#2904
Link: openwrt#13043
Link: https://dev.archive.openwrt.org/ticket/17839
Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart")

Signed-off-by: David Bauer <mail@david-bauer.net>
grische pushed a commit to grische/openwrt that referenced this issue Jan 10, 2024
@grische
Contributor

grische commented Jan 10, 2024

One addition:
My affected WDR4300 test device also has the Zentel A3R12E40CBF that was referenced in the original ticket.

With @blocktrron's patch #14378, I was able to boot both OpenWrt master and OpenWrt 23.05 successfully for several reboots. Thanks a lot! 🙏

openwrt-bot pushed a commit that referenced this issue Jan 11, 2024
openwrt-bot pushed a commit that referenced this issue Jan 11, 2024
(cherry picked from commit 2fe8ecd)
@blocktrron
Member

#14378 merged, closing

grische pushed a commit to grische/openwrt that referenced this issue Jan 11, 2024
openwrt-bot pushed a commit that referenced this issue Jan 11, 2024
(cherry picked from commit 2fe8ecd)
@grische
Contributor

grische commented Jan 12, 2024

For those interested, this fix was backported to 22.03 and 23.05 as well.

Vladdrako pushed a commit to Vladdrako/openwrt that referenced this issue Jan 14, 2024
db260179 pushed a commit to db260179/openwrt that referenced this issue Jan 31, 2024
sbeach92 pushed a commit to sbeach92/openwrt that referenced this issue Feb 16, 2024
rondoval pushed a commit to rondoval/openwrt that referenced this issue Feb 25, 2024
(cherry picked from commit 2fe8ecd)
davintagas pushed a commit to davintagas/ROOterSource2305 that referenced this issue Jun 26, 2024
(cherry picked from commit 2fe8ecd880396b5ae25fe9583aaa1d71be0b8468)