ramips/mt7621: SQUASHFS filesystem corruption #9085
Comments
crowston: I tried installing on a different router and after a few power cycles saw the same SQUASHFS errors, suggesting it's not just bad memory:
Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.569402] SQUASHFS error: xz decompression failed, data probably corrupt
But most of the time it seems to work fine. |
M95D: I have this exact problem with WRT1900ACv1, OpenWRT built from git master. It won't boot at all with the new firmware. |
M95D: More debugging: Apparently, the image is not correctly written to flash. Reading back the squashfs and trying to mount it on an x86 Gentoo Linux machine gives the same decompression errors. See attachment for details. |
M95D: Even more debugging: I extracted the squashfs from the original firmware image that was uploaded to the router. They are identical, except for some extra 0xFF at the end (ubifs read back from the router's mtd is larger, probably because it extends until the end of the erase block). So, it's not a flash write issue, and it's not a hardware defect. |
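The comparison described in the comment above (read-back image identical to the original except for trailing 0xFF filler) can be sketched as follows. This is a minimal illustration, not the procedure M95D actually used; the function name and the file names in the usage block are hypothetical.

```python
def same_image(original: bytes, readback: bytes) -> bool:
    """Return True if `readback` is `original` followed only by 0xFF bytes.

    Erased flash reads back as 0xFF, so a dump that runs to the end of an
    erase block is expected to be longer than the image that was written.
    """
    if not readback.startswith(original):
        return False
    tail = readback[len(original):]
    # An empty tail is fine; otherwise every extra byte must be 0xFF.
    return set(tail) <= {0xFF}


if __name__ == "__main__":
    # Hypothetical file names: the squashfs extracted from the uploaded
    # firmware image, and the dump read back from the router's flash.
    with open("rootfs-from-image.squashfs", "rb") as fh:
        original = fh.read()
    with open("rootfs-from-router.bin", "rb") as fh:
        readback = fh.read()
    print("identical apart from 0xFF padding:", same_image(original, readback))
```

If this reports a match, the bytes on flash are what was uploaded, which is the basis for ruling out a flash write problem.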
M95D: It seems that the ARM BCJ filter decoder is needed in the kernel, even on the desktop. Having only the x86 BCJ filter decoder won't help. Maybe a warning should be put somewhere to alert users who alter the default kernel config. |
brianmercer: My WD Mynet N750 is also unstable and also displays these same errors in the log. |
danak6jq: I am also seeing this on a WD MyNet N750, starting with 21.02.1. I made an attempt to build a kernel/image with ARM BCJ pinned to the kernel and it did not make a difference. |
I'm also seeing this issue on a WD MyNet N750, with a fresh download of 21.02.2 from https://firmware-selector.openwrt.org/?version=21.02.2&target=ath79%2Fgeneric&id=wd_mynet-n750 |
Someone found the true problem: |
I also ran into this issue after updating my WD MyNet N750 from a 19.x version to 21.02.2 r16495-bf0c965af0. After around five days I saw a high CPU load and the same read errors: kern.err kernel: [ 1177.557521] SQUASHFS error: Unable to read fragment cache entry [270732] I re-flashed that version and for the moment it works fine again. |
@EccoB have you power cycled it yet? I find it weird that it can run fine initially but that, at least in my experience, a power cycle triggers the issue. I never had that issue with OpenWRT 19.X |
@ShapeShifter499 Till now, I did not and there were no errors so far.
|
The router was in a broken state (see last post): LuCI reported that the password was not set (which shouldn't be the case), and there were lots of CRC errors.
Over the next days, I will monitor the behaviour and document if there are any issues. If there is something I can do for further investigation you may tell me. |
Hello, I recently went through the same issue with my edgerouter-x:
root@edgerouterx:~# cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='21.02.0'
DISTRIB_REVISION='r16279-5cc0535800'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 21.02.0 r16279-5cc0535800'
DISTRIB_TAINTS=''
Sorry, I reinstalled everything and did not take the time to save the logs; I will come back if it happens again. I did nothing special except disable the uhttpd service and reboot. Then I noticed that clients were not getting their IPs (DNS issue), and when I looked in the logs (dmesg) I saw a lot of SQUASHFS errors. |
Examples of various SQUASHFS, jffs2 errors from my N750, running the March 7 snapshot. I do not encounter any errors running 19.07.X
|
I'm seeing a similar thing on a Ubiquiti ER-X which has been stable and running 21.02.1 for many months.
I noticed today that LuCI and uhttpd are not running
|
Same here: I changed some config and rebooted. Now the ER-X is stuck in a boot loop after running stable for ~2 years
|
I am seeing the same sort of thing here on the GL.iNet GL-B1300 with 21.02.1 According to dmesg, my storage configuration looks like this:
My errors occur against the rootfs (mtdblock10). When this occurs to me, I start seeing errors similar to:
which then progress to:
Squashfs caches the read failures until the hardware is rebooted - whereupon everything is once again "fine"; I am able to perform read checks against the entire rootfs without encountering any obvious storage errors, after the reboot. The appearance of the read errors appears to be "random" - but, once squashfs caches them, only a reboot is able to resolve the situation. This is clearly some issue with the storage controller, that the caching in squashfs makes worse. |
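One way to approximate the "read checks against the entire rootfs" mentioned above is to read every regular file under the squashfs mount point and record which ones the kernel refuses to return. This is only a sketch under assumptions: the function name is made up for illustration, `/rom` is assumed to be where the read-only squashfs is mounted, and, as the comment above points out, squashfs caches failures, so a clean pass after a reboot does not prove the storage is healthy.

```python
import os
from typing import List


def find_unreadable_files(root: str, chunk_size: int = 65536) -> List[str]:
    """Read every regular file under `root`; return paths that raise OSError.

    A squashfs decompression failure normally surfaces to user space as an
    I/O error on read, which Python reports as OSError.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, device nodes, sockets and other special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):
                        pass
            except OSError:
                bad.append(path)
    return bad


if __name__ == "__main__":
    # Assumption: the squashfs rootfs is mounted read-only at /rom.
    for path in find_unreadable_files("/rom"):
        print("unreadable:", path)
```

Files already held in the page cache may be served without touching flash at all, so this check is most informative right after boot.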
@ynezz This issue is occurring for me as well, and not on hardware that is ramips-based (GL.iNet GL-B1300). Should I open a separate bug for the issue, for my hardware? |
The same happened to my Edgerouter X SFP on Monday, but I only got home today to investigate. A reboot didn't resolve it; I still got the same SQUASHFS errors. I then reflashed the same build and the router works again as it should, without SQUASHFS errors.
|
See if this helps: |
I agree, this issue should not be exclusively titled or tagged as a mt7621 platform issue. Frankly, I think the original title should be restored, as the issue also exists on Ath79/Atheros devices. |
I don't have your specific device, but I can share that I experienced multiple failures with my WD N750 and was able to restore a working image of stock and OpenWRT via bootloader/tftp on many occasions. I agree though...the errors are worrisome. |
I'm testing the latest OpenWRT Snapshot (r20029-3c06a344e9), running the 5.15.50 kernel via testing mode on a WD N750, but still seeing the same SquashFS errors. Update: Possible improvement! On initial boot after flashing I observed the SquashFS errors in the Kernel log. After a reboot, no errors and there have been no errors for past 7 days. No issues on second reboot after 7 day uptime either. In the past on OpenWRT 21 and 22 I would see a few SquashFS errors immediately upon reboot. I built the current firmware image from source using the July 7th snapshot, selected testing kernel and running 5.15.50. I'll keep monitoring and will report back. |
Happens on my TP-Link ArcherC6U. I don't know if this has already been observed, but in my case, the errors don't come immediately after a cold boot. After a cold boot, everything works fine for quite some time, unless I perform some particular tasks, in which case the errors come flooding in. Using snapshot r19971-416d4483e8. This started happening maybe a month ago. All snapshots up to that point were perfectly functional. It is a mt7621 device. |
Does anyone know a workaround (besides patching the kernel)? The corruption can be detected with
I'm thinking of adding some code to /etc/rc.local (assuming it can boot up to this) to automatically reflash and restore from a backup in case it breaks after a power loss. I'm not sure if it's possible to fully automate this, though. |
After a month of testing I’m sad to report the SquashFS corruption errors continue with my WD N750, even when running the 5.15.50 kernel on a July 2022 OpenWrt Snapshot. I’ve had no success with 21.02, 22.03, or Snapshot - all present errors after a matter of a few hours to a few weeks. |
@M95D, @NoTengoBattery, this is a very annoying and widespread bug. Can the patch be applied to mt7621 or to all architectures? Are you going to open a pull request? |
openwrt-bot commented Oct 20, 2021
crowston:
Supply the following if possible:
Western Digital My Net N750
openwrt-21.02.0
strongswan, dnscrypt-proxy2, avahi-utils, luci-app-ddns
I installed openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-sysupgrade.bin on a Western Digital My Net N750 that had been running openwrt-19.
The router seemed okay initially but after power cycling, it started reporting errors:
Oct 17 12:20:37 router2 kernel: [ 38.613970] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 12:20:37 router2 kernel: [ 38.621029] SQUASHFS error: squashfs_read_data failed to read block 0x23686e
Oct 17 12:20:37 router2 kernel: [ 38.628199] SQUASHFS error: Unable to read fragment cache entry [23686e]
Oct 17 12:20:37 router2 kernel: [ 38.635010] SQUASHFS error: Unable to read page, block 23686e, size 16b28
The filesystem problem would leave some random file damaged, so different services would fail. Over time, the router became less and less functional as various files became inaccessible and after a few cycles, wouldn't boot at all.
I wondered if there was a problem with my old configuration on the new release (though I'm not sure how that could damage the squashfs), so I reinstalled a few more times in different ways, e.g., doing a factory install (openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-factory.bin and then the upgrade) instead of just the upgrade, and configuring from scratch rather than from the backup. But each time I had the same problem with the router.
It wasn't the same block on different installs, I noticed, but it seemed to be consistent for a particular installation attempt.
Oct 17 16:11:14 router2 kernel: [ 53.182571] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:11:14 router2 kernel: [ 53.189582] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:11:14 router2 kernel: [ 53.196749] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:11:14 router2 kernel: [ 53.203559] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c
Once there were two blocks (I think this is a reboot of the install above):
Oct 17 16:29:04 router2 kernel: [ 78.505075] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:04 router2 kernel: [ 78.512103] SQUASHFS error: squashfs_read_data failed to read block 0x1e6e76
Oct 17 16:29:05 router2 kernel: [ 79.111366] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:05 router2 kernel: [ 79.118386] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:29:05 router2 kernel: [ 79.125565] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:29:05 router2 kernel: [ 79.132445] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c
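To check the observation that the failing block stays the same within one installation, the block offsets can be pulled out of the log excerpts and compared. The following is a small sketch; the helper names are made up for illustration, and the excerpts are the "failed to read block" lines copied from the logs above.

```python
import re
from collections import Counter

# Matches the "failed to read block" lines shown in the logs above.
BLOCK_RE = re.compile(r"squashfs_read_data failed to read block (0x[0-9a-fA-F]+)")


def failing_blocks(log_text):
    """Return the block offsets (as ints) reported in one log excerpt, in order."""
    return [int(m.group(1), 16) for m in BLOCK_RE.finditer(log_text)]


def recurring_blocks(excerpts):
    """Return the set of blocks that appear in more than one excerpt."""
    counts = Counter()
    for text in excerpts:
        counts.update(set(failing_blocks(text)))
    return {block for block, n in counts.items() if n > 1}


# One string per boot, reduced to the relevant lines from the report.
EXCERPTS = [
    "Oct 17 12:20:37 router2 kernel: [ 38.621029] SQUASHFS error: "
    "squashfs_read_data failed to read block 0x23686e",
    "Oct 17 16:11:14 router2 kernel: [ 53.189582] SQUASHFS error: "
    "squashfs_read_data failed to read block 0x21e9e6",
    "Oct 17 16:29:04 router2 kernel: [ 78.512103] SQUASHFS error: "
    "squashfs_read_data failed to read block 0x1e6e76\n"
    "Oct 17 16:29:05 router2 kernel: [ 79.118386] SQUASHFS error: "
    "squashfs_read_data failed to read block 0x21e9e6",
]

if __name__ == "__main__":
    for block in sorted(recurring_blocks(EXCERPTS)):
        print(hex(block))
```

On these excerpts only block 0x21e9e6 shows up in more than one boot, which matches the note that the second and third logs come from the same install.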
One time there was first a jffs error, followed by lots of squashfs errors. Sorry, I don't have the log for that one.
I now realize that I should have tried power cycling a clean install a few times to see if there were errors right away or if they only happened after files were installed/changed.
To check whether the router was just having a hardware problem, I reinstalled openwrt-19.07.8 and configured it the same. I have not seen any errors after a few power cycles, which points to a problem with the new release. I did not see any bug reports on this tracker that mention squashfs problems and googling, I did not find any useful discussions, hence this bug report.
I guess it could be that the new release uses a bad bit of memory that the earlier release managed to miss. I looked for but didn't find a memory test utility, so I don't know how to examine that possibility. Though the fact that it was different blocks each time makes it not sound like a hardware problem.