-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mediatek: snfi: FM35Q1GA is x4-only and needs 12mA pin drive strength #7
base: mtksoc
Are you sure you want to change the base?
Conversation
Dont allow x2 read and cache read operations on FM35Q1GA. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Unfortunately it doesn't seem to fix the issue. After only a few reboots I now finally managed to see this happening on one of my own devices:
I've dumped the flash from within Linux and it looks all fine. Writing back the content identically fixes the issue which hints to some internal problem with this specific Fidelix SPI-NAND chip. Can it be that bl2 is trying to access the flash to quickly after reset? |
@dangowrt Can you rebuild bl2 with LOG_LEVEL=50 to see if there's any message from SNIF driver? |
While I wasn't able to deliberately brick my device again, I posted instructions to the forum which told people to load bl2 with LOG_LEVEL=50 via UART onto a bricked device. User _FailSafe has done so and posted the result: As you can see, the device now boots just fine. So maybe the additional delays caused by all the debug output fix the issue 😬 Edit: Or the fact that the SNFI wasn't previously accessed by the BootROM (as bl2 was loaded via UART...) |
Oh, another log, and that looks more interesting: My interpretation: |
Another quite similar log uploaded by another user:
|
@dangowrt So it seems to be the UBI data CRC failure, but no ECC error printed. May be the data error happened in NAND write phase? |
It's unlikely to have happened at write phase (of fip) because the device was booting dozens (and some other devices hundreds) of times without problems. Also now the problem does not occur on every boot, but once it happens it kind of persists. This bug has been present for at least 1 year, but it was very hard to catch and quite impossible to debug before https://github.com/981213/mtk_uartboot was available. Now with many users having bricked devices sitting on their shelf we could finally see what is going on. Note that many devices were running fine for years (!) without bl2 or fip being written -- I always thought we are simply dealing with new bitflips which at some point end up as no longer correctable ECC errors. However, from my observation I can tell that the probability for this to happen does not increase over time, so that speaks against that thesis. Also note that in the log, just a few lines above, the very same PEB is read successfully, which is very weird. |
So once it happens, it will persist in every further bootup? |
No, that would be too easy ;) If it happens once, chances that it happens again on the next boot are above 90%. And some (but not all) devices apparently never boot again once it happens. However, many users have reported that devices have magically recovered and it never happened again then -- believed dead and put away, connected again months later they would work again, every single boot. Just to then, a few weeks or months later, usually after a warm reboot, happen again -- and then also again not recover easily or not at all, but sometimes yet. One or two users also claim they managed to recover their devices by putting them into the freezer for 10 minutes (which I don't believe is a good idea, but nvm). Edit: important note: On multiple boots with the debug-output-enabled bl2 we could observe that the error always happens for the exact same EB, but then there is a chance that it doesn't happen. Reading and writing back MTD data in Linux significantly reduces the probability of the problem occurring again on the next boot, and there aren't any errors when reading in Linux or U-Boot, nor are there bit flips. Edit 2: I wonder if we could fix this in a dirty way by just having bl2 retry a failed read a couple of times... |
😨😨😨
We could have a try. But I'm curious did this happen before users flashing their devices from stock firmware to openwrt? |
I didnt see any report of this happening with old preloader and U-Boot used by the stock firmware, so I strongly assume it never happens there.
|
That's strange. |
I suppose you are aware of that: As the stock loader is quite fragile (drops into U-Boot shell in case of a broken image, for example) and stock firmware uses JFFS2 on NAND, I decided to build an "installer" which replaces it with TF-A and U-Boot built from source.
Edit: As the behavior is non-deterministic it would be weird if an incomplete write causes this kind of "instability" when reading a different place in the flash later on. Thinking about it more and also about the nature of NAND flash in general makes it seem likely to me that the fact that the very same EB is read multiple times within a very short time may cause a problems on not very well designed NAND chips. The goal should be to avoid that and after the initial (always successful!) read of the whole fip image to memory just use that instead of subsequently reading it from NAND again. Ie. after this
we should not later on again read all those very same blocks. They are in the memory already, supposedly. We should also avoid to read PEB 309 twice. Also that would obviously avoid ending up to do
as (in this case) PEB 311 is already in RAM. |
This seems to be caused by the ATF's FIP io layer...... I don't know if it's easy to modify it. |
I've read a bit of the code and traced back all those flash reads. The most obvious and also most controversial part is the continuous read of all LEBs of the However, the problem itself occurs also on when reading It hence makes sense to focus on the follow-up repeated reads of LEB 0 of the Reading more of the code I believe it won't be too hard to implement (very simple) generic IO block caching in RAM (ie. add another memory region just like Even a very naive implementation struct io_cache {
bool entry_in_use[4096];
uint64_t entry_address[4096];
char content[512][4096];
}; would take 2134016 bytes (2 MiB cache content, 36 kiB cache metadata) in RAM and prevent repeated reads for sure. Let me know if you think this is a good idea and I'll implement that tomorrow. |
worth a try |
LOL, turns out |
This was added by me🤣 |
Should improve performance/reliability with lots of mcast packets Signed-off-by: Felix Fietkau <nbd@nbd.name>
@hackpascal I'm still getting reports from users almost daily encountering this problem, before and after the switch to use FIP-in-UBI. Many users report this problem after sudden power loss -- however, it would never happen with the stock firmware even in this case. https://forum.openwrt.org/t/belkin-rt3200-linksys-e8450-wifi-ax-discussion/94302/4464?u=daniel https://forum.openwrt.org/t/belkin-rt3200-linksys-e8450-wifi-ax-discussion/94302/4360?u=daniel ... @981213 any ideas maybe? |
@schuettecarsten I assume you are using the device with OpenWrt's default settings which currently set the |
@dangowrt I downloaded a copy of E8450 GPL and found that u-boot sets something called DRIVING_E8 instead of DRIVING_E4 for this flash in GPIO register. (Driving current? I can't find its definition in MT7622 reference manual for banana-pi.) typedef struct
{
u8 id[SNAND_MAX_ID];
u8 id_length;
u16 totalsize;
u16 blocksize;
u16 pagesize;
u16 sparesize;
u32 SNF_DLY_CTL1;
u32 SNF_DLY_CTL2;
u32 SNF_DLY_CTL3;
u32 SNF_DLY_CTL4;
u32 SNF_MISC_CTL;
u32 SNF_DRIVING_E4;
u32 SNF_DRIVING_E8;
u8 devicename[30];
u32 advancedmode;
}snand_flashdev_info,*psnandflashdev_info;
static const snand_flashdev_info gen_snand_FlashTable[]={
/* ... */
{{0xE5, 0x71}, 2, 128, 128, 2048, 64, 0x00000000, 0x00000000, 0x00000014,
0x00000000, 0x0552000A, 0x0000, 0x3F00, "FM35X1GA", 0x00000000},
};
We don't set any drive strength in OpenWrt device tree for SNFI pins. BL2 also doesn't seem to set this. Maybe this could cause some data bits flipping on the SPI bus. |
@981213 very interesting. And at least in my mind it adds up to be depending on temperature, as resistance changes with temperature.... |
I've added debugging output of the relevant GPIO registers before and after the adjustment. Turns out that 4mA seems to be the default and hence changing it to 8mA means a change indeed. On my device (which has always been working fine, for years and probably thousands of reboots by now) it doesn't make any difference.
I've told some users with devices in semi-bricked condition to try if this helps, I hope we soon know if this was the jackpot. Edit: Tested on BPi-R64 which comes with W25N01GV and now also got the drive strength adjusted to 8mA supposedly (because old legacy U-Boot would have done that). We never received any reports of BPi-R64 having the same problem as the Linksys/Belkin device with Fidelix FM35Q1GA though, so I wonder if it is needed. However, it doesn't seem to hurt. |
User reported that at least reading is not fixed in that way;
However, it could be that the problem is writing to the flash with the (potentially) wrong pinconf setting... |
@981213 The Linux pinctrl driver does define the e4 and e8 ranges here And explains the meaning of e4 and e8 here: So diff --git a/target/linux/mediatek/dts/mt7622-linksys-e8450.dtsi b/target/linux/mediatek/dts/mt7622-linksys-e8450.dtsi
index be216d972c..c4324410ff 100644
--- a/target/linux/mediatek/dts/mt7622-linksys-e8450.dtsi
+++ b/target/linux/mediatek/dts/mt7622-linksys-e8450.dtsi
@@ -271,6 +271,11 @@
function = "flash";
groups = "snfi";
};
+
+ conf {
+ groups = "snfi";
+ drive-strength = <MTK_DRIVE_12mA>;
+ };
};
spic0_pins: spic0-pins { Edit: applied this in OpenWrt for now. Thinking about updating TF-A to v2.10 in OpenWrt with this PR applied as downstream patch. That will give it more testing and we'll see if it helps. Even more worrying is that a user in the forum claims that this could be a bug introduced between TF-A v2.4 and v2.9. |
I've dumped pinconf using the same Linux image, once running stock bootchain and once running current OpenWrt UBI bootchain, and there are more differences! With stock bootchain, nearly all pins got With our bootchain, all pins end up as So 1 Ohm pull-down sounds like close to shorting this pin to ground, I cannot believe this is the true value, the SoC would get very hot if this was true... 1 kiloOhm maybe, even that would be a low resistance for a pull-down bias. |
It does not fix the "lights out" issue in OpenWrt. And now, for my device also the "turn it off, unplug it and wait 30 minutes" recovery also does not work any longer. |
The change alone not making a difference for devices already in fragile or bricked condition was the expected result. If at all, then it could fix it if we also apply this patch for TF-A (this very PR) and then re-generate the installer including both. |
It seems like we might need to adjust the pin driver strength to 12mA for Fidelix SPI-NAND chip on MT7622 to avoid SPI data corruption on some devices. Signed-off-by: Daniel Golle <daniel@makrotopia.org>
In addition to FM35X1GA, also change the driver strength to 12mA for all chips where this is done by the old/legacy U-Boot: * Winbond 512Mb * Winbond 1Gb * Winbond 2Gb * GD5F4GQ4UBYIG * GD5F4GQ4UAYIG * GD5F1GQ4UX * GD5F1GQ4UE * GD5F2GQ4UX * GD5F2GQ4UE Signed-off-by: Daniel Golle <daniel@makrotopia.org>
@hackpascal Does the change suggested here generally look acceptable to you? Handling this "properly" in pinctrl/gpio driver looks a bit more complicated... |
Actually I want to add a simple pinctrl layer to make it "generic" :p |
@dangowrt I managed to buy a spi-nand flash DS35Q1GA. It seems to use exactly the same die that FM35Q1GA is using. |
@dangowrt @981213 @hackpascal if you are looking at driver strength on the SOC's I/O side, I would also look how it's configured on the SPI-NAND side (usually, by default it should be set to 100%). |
...and another brick with full debug log here: |
Hi @dangowrt
Since we use 0x84/0x34 for x1/x4 program load random data in Linux(https://elixir.bootlin.com/linux/latest/source/include/linux/mtd/spinand.h#L138), issues occur with such combination: FM35Q1GA(DS35Q1GA)+Uboot/Linux spi-mem framework. We observed the same thing on Foresee's F35SQA001G. I'm not 100% sure if the above issue has something to do with this ticket. But it indeed causes serious problems when we program NAND flash, and impacts further reading behavior. Here's dirty fix you may try (apply both to Uboot & Linux):
to this:
This is pretty dirty because it makes SPI controller unable to send 0x84/0x34. This is a big problem for SPI controllers which can't send data of page size(2K or 4K bytes) in one transfer. For example, if the SPI controller is capable to transfer only 512bytes with one command, it needs to issue 0x84/0x34 command four times to transfer 2K bytes data onto NAND flash's cache. |
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Summarizing all discoveries I've opened a PR for openwrt.git to add all of that, in the hope that we can rescue users from OKD asap: |
Will this potentially prevent OKD even for devices that haven’t been OKD’ed? |
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Import pending patches to set pinconf settings for SPI-NAND pins on MT7622 identical to what the old proprietary preloader did. Should further increase the reliability of some SNFI-attached SPI-NAND flash chips. Link: mtk-openwrt/arm-trusted-firmware#7 Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Dont allow x2 read and cache read operations on FM35Q1GA as they seem to be unstable. Also the Linux drivers does not allow x2 ops:
https://github.com/openwrt/openwrt/blob/main/target/linux/mediatek/patches-6.1/340-mtd-spinand-Add-support-for-the-Fidelix-FM35X1GA.patch#L49
We may need a similar fix for the U-Boot driver, though problems have not been observed there. However, we sporadically see bl2 failing to load fip for no apparent reason (ie. after the device was powered-off and working fine before, or after a reboot). The error then persists.
Also using UBI apparently didn't help
@hackpascal @nbd168