mediatek: various efforts to improve SNFI reliability #15112

Merged
openwrt-bot merged 4 commits into openwrt:main from snfi-reliability-fixes
Jun 5, 2024

dangowrt
Member

@dangowrt dangowrt commented Apr 8, 2024

The Linksys E8450 aka Belkin RT3200 suffers from a hard-to-catch issue with unreliable writes to the SPI-NAND flash, which is attached via the SNFI interface (i.e. it uses the SoC's BCH engine instead of the flash chip's on-die ECC).
Apply various changes which hopefully help to get rid of what has been dubbed the "OpenWrt Kiss of Death" (OKD) once we create an updated UBI installer for the device based on them.

@github-actions github-actions bot added the core packages and target/mediatek labels Apr 8, 2024
@aiamadeus
Contributor

For the kernel part changes, I can confirm this works on
MT7981 + DS35Q1GA (which uses on-die ECC).
Before this change:

[    0.713881] spi spi0.0: setup: ignoring unsupported mode bits a00
[    0.720304] spi-nand spi0.0: Fidelix SPI NAND was found.
[    0.725615] spi-nand spi0.0: 128 MiB, block size: 128 KiB, page size: 2048, OOB size: 64
[    2.057819] ubi0 warning: ubi_eba_init: cannot reserve enough PEBs for bad PEB handling, reserved 17, need 19
[    9.654826] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 11 bytes from PEB 244:7585, read only 11 bytes, retry
[    9.666820] UBIFS error (ubi0:2 pid 463): ubifs_lpt_init: invalid type (15) in LPT node type 2
[    9.743737] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" stops
[    9.750287] mount_root: failed to mount -t ubifs /dev/ubi0_2 /tmp/overlay: Invalid argument

After this change:

[    0.716792] spi spi0.0: setup: ignoring unsupported mode bits a00
[    0.723257] spi-nand spi0.0: Fidelix SPI NAND was found.
[    0.728569] spi-nand spi0.0: 128 MiB, block size: 128 KiB, page size: 2048, OOB size: 64
[    2.049643] ubi0 warning: ubi_eba_init: cannot reserve enough PEBs for bad PEB handling, reserved 17, need 19
[    9.770535] mount_root: overlay filesystem has not been fully initialized yet
[    9.777911] mount_root: switching to ubifs overlay
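
The difference between the two logs comes down to the `ubi_io_read` warning with error -74 (EBADMSG on Linux, reported by UBI as "ECC error"). As a minimal sketch, assuming the log text has been saved off the device, one could scan for that signature like this. The function name `find_ecc_errors` and the regular expression are illustrative only and are not part of this PR:

```python
import re

# UBI reports an uncorrectable read as "error -74 (ECC error)".
# The pattern is derived from the warning line quoted above.
ECC_ERROR = re.compile(
    r"ubi(?P<dev>\d+) warning: ubi_io_read: error -74 \(ECC error\) "
    r"while reading \d+ bytes from PEB (?P<peb>\d+):(?P<offset>\d+)"
)


def find_ecc_errors(log_text):
    """Return (ubi_device, peb, offset) for every ECC read warning found."""
    return [
        (int(m["dev"]), int(m["peb"]), int(m["offset"]))
        for m in ECC_ERROR.finditer(log_text)
    ]


# One line from each of the logs above.
before = (
    "[    9.654826] ubi0 warning: ubi_io_read: error -74 (ECC error) while "
    "reading 11 bytes from PEB 244:7585, read only 11 bytes, retry"
)
after = "[    9.770535] mount_root: overlay filesystem has not been fully initialized yet"

print(find_ecc_errors(before))
print(find_ecc_errors(after))
```

An empty result only means the signature did not show up in that particular log; it does not prove the flash contents are intact.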

@mrkiko
Contributor

mrkiko commented Apr 9, 2024

Is switching to the performance governor, or leaving cpufreq alone altogether, still considered important?
Might this be an occasion to do that?

@dangowrt
Member Author

dangowrt commented Apr 9, 2024

@mrkiko cpufreq does cause some issues: when the SoC is rebooted while running at too low voltage (or frequency?), it can hang inside DRAM calibration. However, I don't think that is related to the SPI-NAND corruption issues we are seeing. Maybe indirectly, as hanging in DRAM calibration for hours and hours could result in a lot of heat. I've never tried that, but it seems unlikely: the hang only happens while the CPU is running at a slow clock, so it should get less hot than in normal operation.

Resetting the CPU to run at full speed and voltage before entering DRAM calibration would be the best possible fix for that. Maybe open an issue on https://github.com/mtk-openwrt/arm-trusted-firmware/issues for that?

@rsalvaterra
Member

too low voltage (or frequency?)

It was the voltage. I reinstated the 300 MHz OPP in my tree at 1 V and it never failed on me (minus the hang I started having at reboot with Linux 6.6, but that's another issue, surely).

@mrkiko
Contributor

mrkiko commented May 23, 2024

What's the outlook for merging this PR?

@dangowrt
Member Author

Currently @onlyfly34 is working with some people on the forum with bricked devices to figure out how to address the root cause. I'm waiting for the result, and then would go ahead and merge this PR if it is deemed suitable.

Update ARM Trusted Firmware-A to the most recent release of MediaTek's
downstream patched version, released 2024-01-17.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Import pending patches to set pinconf settings for SPI-NAND pins on
MT7622 identical to what the old proprietary preloader did.

Should further increase the reliability of some SNFI-attached SPI-NAND
flash chips.

Link: mtk-openwrt/arm-trusted-firmware#7
Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Don't allow x2 read and cache read operations on the FM35Q1GA, as they
seem to be unstable. The Linux driver does not allow x2 ops either.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>

Prior to performing a PROGRAM LOAD RANDOM DATA operation, a WRITE
ENABLE (06h) command must be issued to change the contents of the
memory array. Following a WRITE ENABLE (06h) command, **first a PROGRAM
LOAD (02h or 32h) command must be issued to reset the cache**, then
issue a PROGRAM LOAD RANDOM DATA (84h or 34h) command.
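
To illustrate why the ordering matters, here is a toy Python model of the flash chip's cache register. It is a sketch under stated assumptions, not the actual firmware change: the `PageCache` class, the 16-byte page, and the 0xAA "stale" fill are all hypothetical, and WRITE ENABLE and PROGRAM EXECUTE are left out for brevity. The model only assumes what the text above states, namely that PROGRAM LOAD resets the cache while PROGRAM LOAD RANDOM DATA does not:

```python
PAGE_SIZE = 16  # tiny stand-in for a real 2048-byte page (assumption)


class PageCache:
    """Toy model of a SPI-NAND cache register (hypothetical, not driver code)."""

    def __init__(self):
        # Assume stale bytes are left over from an earlier operation.
        self.data = bytearray(b"\xAA" * PAGE_SIZE)

    def program_load(self, column, payload):
        # 02h / 32h: reset the whole cache to 0xFF, then store the payload.
        self.data = bytearray(b"\xFF" * PAGE_SIZE)
        self.data[column:column + len(payload)] = payload

    def program_load_random(self, column, payload):
        # 84h / 34h: overwrite only the addressed bytes, keep everything else.
        self.data[column:column + len(payload)] = payload


def load_random_only(chunks):
    """Load every chunk with PROGRAM LOAD RANDOM DATA (cache never reset)."""
    cache = PageCache()
    for column, payload in chunks:
        cache.program_load_random(column, payload)
    return bytes(cache.data)


def load_with_reset(chunks):
    """Load the first chunk with PROGRAM LOAD, the rest with RANDOM DATA."""
    cache = PageCache()
    first, *rest = chunks
    cache.program_load(*first)
    for column, payload in rest:
        cache.program_load_random(column, payload)
    return bytes(cache.data)


chunks = [(0, b"\x01\x02"), (8, b"\x03\x04")]
print(load_random_only(chunks).hex())  # stale 0xAA bytes survive between chunks
print(load_with_reset(chunks).hex())   # untouched bytes are 0xFF
```

In this model, the bytes that were never written keep their stale value unless the first load resets the cache, which is the kind of unintended page content the PROGRAM LOAD step is meant to rule out.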

This is a dirty fix provided to us by MediaTek engineer Sky Huang which
may resolve the "OpenWrt Kiss of Death" issue we've been seeing on the
Linksys E8450 aka Belkin RT3200. However, it means that everything has
to be re-written with that patch already applied, i.e. we need to rebuild
the installer once the fix is part of snapshot builds for it to have any
effect.

Users already on FIP-in-UBI layout are advised to re-write 'fip' UBI
volume and 'bl2' MTD partition manually once from within Linux after
this fix has been applied.
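
Before re-writing anything, it helps to confirm which MTD device actually carries the 'bl2' partition. The following read-only sketch parses text in the format of /proc/mtd. The helper `find_mtd` and the sample table are hypothetical (the sizes are illustrative, not taken from a real E8450), and the sketch deliberately performs no writes:

```python
import re

# Matches data lines of /proc/mtd: mtdN: <size hex> <erasesize hex> "name"
MTD_LINE = re.compile(
    r'^(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"(.*)"$'
)


def find_mtd(proc_mtd_text, name):
    """Return device, size and erase size of the named partition, or None."""
    for line in proc_mtd_text.splitlines():
        m = MTD_LINE.match(line.strip())
        if m and m.group(4) == name:
            return {
                "dev": m.group(1),
                "size": int(m.group(2), 16),
                "erasesize": int(m.group(3), 16),
            }
    return None


# Illustrative sample only; on a device, read the text from /proc/mtd instead.
sample = '''dev:    size   erasesize  name
mtd0: 00080000 00020000 "bl2"
mtd1: 07f80000 00020000 "ubi"
'''

print(find_mtd(sample, "bl2"))
print(find_mtd(sample, "fip"))
```

Note that 'fip' is a UBI volume rather than an MTD partition, so it would normally not be listed in /proc/mtd at all.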

A similar fix will also be required for U-Boot.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@openwrt-bot openwrt-bot merged commit 84a5274 into openwrt:main Jun 5, 2024
2 checks passed
@dangowrt dangowrt deleted the snfi-reliability-fixes branch June 5, 2024 20:07
Labels
core packages, target/mediatek
5 participants