Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad block makes ubi_auto_attach read from wrong offset, fail to mount ubi #12895

Open
1 task done
chrdev opened this issue Jun 13, 2023 · 5 comments
Open
1 task done
Labels
bug issue report with a confirmed bug

Comments

@chrdev
Copy link

chrdev commented Jun 13, 2023

Describe the bug

There is an unmarked bad block at 0x520000, in partition "kernel". When creating MTD partitions, this bad block is detected, but instead of marking it, the kernel tries to erase it. nand_erase_nand() rejects the erasing.

Somehow this incident makes mtd_read() in ubi_auto_attach() read from a wrong offset, one-eraseblock-back specifically.

I used the following code to show magic[16], compared it to mtd dumps, then found out that magic[16] is actually read from 0x560000 instead of 0x580000. 0x560000 is the last eraseblock of partion "kernel".

drivers/mtd/ubi/build.c

static void __init ubi_auto_attach(void)
{
  ...
  char magic[16];
  ...
  err = mtd_read(mtd, offset, 16, &len, (void *) magic);
  print_16_bytes_hex(magic);
  ...
}

Boot log, timestamp removed

...
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1
nand: Macronix MX30LF1G18AC
nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
mt7621-nand 1e003000.nand: ECC strength adjusted to 4 bits
read_bbt: found bbt at block 1023
11 fixed-partitions partitions found on MTD device mt7621-nand
Creating 11 MTD partitions on "mt7621-nand":
0x000000000000-0x000000080000 : "boot"
0x000000080000-0x0000000c0000 : "u_env"
0x0000000c0000-0x000000100000 : "factory"
0x000000100000-0x000000140000 : "s_env"
0x000000140000-0x000000180000 : "devinfo"
0x000000180000-0x000000580000 : "kernel"
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40
0x000000580000-0x000002980000 : "ubi"
0x000002980000-0x000002d80000 : "alt_kernel"
0x000002d80000-0x000005180000 : "alt_rootfs"
0x000005180000-0x000005280000 : "sysdiag"
...
UBI error: no valid UBI magic found inside mtd6
...

OpenWrt version

22.03.5

OpenWrt target/subtarget

ramips/mt7621

Device

Linksys EA7500 v2

Image kind

Official downloaded image

Steps to reproduce

There is an unmarked bad block near the end of kernel partition. Flash the firmware, boot, fail.

Actual behaviour

ubi_auto_attach() reads from the last block of kernel partition and tries to find ubi magic.

Expected behaviour

ubi_auto_attach() reads from ubi partition and find ubi magic

Additional info

No response

Diffconfig

No response

Terms

  • I am reporting an issue for OpenWrt, not an unsupported fork.
@chrdev chrdev added the bug issue report with a confirmed bug label Jun 13, 2023
@chrdev
Copy link
Author

chrdev commented Jun 13, 2023

0x520000 read and write both alright, nandtest also shows no error, so why it's detected bad by the kernel is unknown to me.

@chrdev
Copy link
Author

chrdev commented Jun 14, 2023

Since none of nand_test, flash_erase, dd write and read, complains about offset 0x520000. I made a little test using this diff

--- a/drivers/mtd/ubi/build.c
+++ b/drivers/mtd/ubi/build.c
@@ -1200,6 +1200,12 @@ static struct mtd_info * __init open_mtd
        return mtd;
 }

+static void show_hex16(char const* m) {
+       for (int i = 0; i < 16; ++i) {
+               pr_cont(" %02hhX", *(m++));
+       }
+}
+
 /*
  * This function tries attaching mtd partitions named either "ubi" or "data"
  * during boot.
@@ -1210,7 +1216,7 @@ static void __init ubi_auto_attach(void)
        struct mtd_info *mtd;
        loff_t offset = 0;
        size_t len;
-       char magic[4];
+       char magic[16];

        /* try attaching mtd device named "ubi" or "data" */
        mtd = open_mtd_device("ubi");
@@ -1233,8 +1239,11 @@ static void __init ubi_auto_attach(void)
                }

        /* check if the read from flash was successful */
-       err = mtd_read(mtd, offset, 4, &len, (void *) magic);
-       if ((err && !mtd_is_bitflip(err)) || len != 4) {
+       err = mtd_read(mtd, offset, 16, &len, (void *) magic);
+       pr_cont("%s: read offset: 0x%llX, err: %d, data: ", __func__, offset, err);
+       show_hex16(magic);
+
+       if ((err && !mtd_is_bitflip(err)) || len != 16) {
                pr_err("UBI error: unable to read from mtd%d\n", mtd->index);
                goto cleanup;
        }
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1363,6 +1363,7 @@ int mtd_erase(struct mtd_info *mtd, stru
        instr->fail_addr = MTD_FAIL_ADDR_UNKNOWN;
        adjinstr = *instr;

+       pr_warn("mtd_erase: addr: 0x%llX, len: 0x%llX\n", instr->addr, instr->len);
        if (!mtd->erasesize || !master->_erase)
                return -ENOTSUPP;

@@ -2293,6 +2294,7 @@ EXPORT_SYMBOL_GPL(mtd_block_isreserved);
 int mtd_block_isbad(struct mtd_info *mtd, loff_t ofs)
 {
        struct mtd_info *master = mtd_get_master(mtd);
+       loff_t master_ofs;

        if (ofs < 0 || ofs >= mtd->size)
                return -EINVAL;
@@ -2302,7 +2304,12 @@ int mtd_block_isbad(struct mtd_info *mtd
        if (mtd->flags & MTD_SLC_ON_MLC_EMULATION)
                ofs = (loff_t)mtd_div_by_eb(ofs, mtd) * master->erasesize;

-       return master->_block_isbad(master, mtd_get_master_ofs(mtd, ofs));
+       master_ofs = mtd_get_master_ofs(mtd, ofs);
+       if (master_ofs == 0x520000) {
+               pr_warn("mtd_block_isbad: 0x520000 always good.\n");
+               return 0;
+       }
+       return master->_block_isbad(master, master_ofs);
 }
 EXPORT_SYMBOL_GPL(mtd_block_isbad);

--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -259,6 +259,12 @@ static int nand_block_bad(struct nand_ch
        first_page = (int)(ofs >> chip->page_shift) & chip->pagemask;
        page_offset = nand_bbm_get_next_page(chip, 0);

+       if (ofs == 0x520000) {
+               pr_info("nand_block_bad: 0x520000 always good. first_page: 0x%X, page_offset: 0x%X\n",
+                               first_page, page_offset);
+               return 0;
+       }
+
        while (page_offset >= 0) {
                res = chip->ecc.read_oob(chip, first_page + page_offset);
                if (res < 0)
@@ -4473,9 +4479,13 @@ int nand_erase_nand(struct nand_chip *ch
        int page, pages_per_block, ret, chipnr;
        loff_t len;

-       pr_debug("%s: start = 0x%012llx, len = %llu\n",
+       pr_info("%s: start = 0x%012llx, len = %llu\n",
                        __func__, (unsigned long long)instr->addr,
                        (unsigned long long)instr->len);
+       if (instr->addr == 0x520000) {
+               pr_warn("nand_erase_nand: attemp to erase 0x520000");
+               return -EIO;
+       }

        if (check_offs_len(chip, instr->addr, instr->len))
                return -EINVAL;

boot log excerpt

... omitted
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1
nand: Macronix MX30LF1G18AC
nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
mt7621-nand 1e003000.nand: ECC strength adjusted to 4 bits
read_bbt: found bbt at block 1023
11 fixed-partitions partitions found on MTD device mt7621-nand
Creating 11 MTD partitions on "mt7621-nand":
0x000000000000-0x000000080000 : "boot"
0x000000080000-0x0000000c0000 : "u_env"
0x0000000c0000-0x000000100000 : "factory"
0x000000100000-0x000000140000 : "s_env"
0x000000140000-0x000000180000 : "devinfo"
0x000000180000-0x000000580000 : "kernel"
mtd_block_isbad: 0x520000 always good.
nand: nand_block_bad: 0x520000 always good. first_page: 0xA40, page_offset: 0x0
0x000000580000-0x000002980000 : "ubi"
0x000002980000-0x000002d80000 : "alt_kernel"
0x000002d80000-0x000005180000 : "alt_rootfs"
0x000005180000-0x000005280000 : "sysdiag"
0x000005280000-0x000007f80000 : "syscfg"
mt7530-mdio mdio-bus:1f: MT7530 adapts as multi-chip module
... omitted
DSA: tree 0 setup
ubi_auto_attach: read offset: 0x0, err: 0, data:  30 78 33 45 30 30 30 30 40 4B 45 52 4E 45 4C 20
UBI error: no valid UBI magic found inside mtd6
/dev/root: Can't open blockdev
... omitted

The following string is manually written to mtd offset 0x560000. This sring is unique in mtd scope so that we can know ubi_auto_attach() actually reads from 0x560000 instead of ubi partition.
00560000 30 78 33 45 30 30 30 30 40 4b 45 52 4e 45 4c 20 |0x3E0000@KERNEL |

What I don't understand is:

  1. mtd_block_isbad() returns 0 for 0x520000, so why nand_block_bad() still called? by who?
  2. mtd_read() in ubi_auto_attach() still reads from 0x560000, even 0x520000 is treated as good block, why?

@chrdev
Copy link
Author

chrdev commented Jun 15, 2023

The bug is still there, but OpenWrt boots. Steps

  1. Try attaching ubi partition even if mtd_read() reads from a wrong offset.
drivers/mtd/ubi/build.c

static void __init ubi_auto_attach(void)
... omitted
/* check for a valid ubi magic */
if (strncmp(magic, "UBI#", 4)) {
        pr_err("UBI error: no valid UBI magic found inside mtd%d, but still try attaching...\n", mtd->index);
		/* goto cleanup; comment out to fall-through */
}
... omitted
  1. Once OpenWrt boots, it always boots. Reflash official builds, 22.3.05 for example, the unit still boots.

The culprit seems to be the discrepancy between BBT and BBM.

Mod kernel code to print some info

drivers/mtd/nand/mtk_bmt_bbt.c

/* new func */
static void
print_bmtd(void) {
        pr_info("tbl_size %u, pg_size %u, blk_size %u, pg_shift %hu, blk_shift %hu\n",
                        bmtd.table_size, bmtd.pg_size, bmtd.blk_size, bmtd.pg_shift, bmtd.blk_shift);
        pr_info("pool_lba %hu, pool_pba %hu, bb_max: %hu, total_blks %hu\n",
                        bmtd.pool_lba, bmtd.pool_pba, bmtd.bb_max, bmtd.total_blks);
        pr_info("bmt_blk_idx %hu, bmt_pgs %u, remap_range_len %d, oob_offset %hhu\n",
                        bmtd.bmt_blk_idx, bmtd.bmt_pgs, bmtd.remap_range_len, bmtd.oob_offset);
        print_hex_dump(KERN_INFO, "remap_range: ", DUMP_PREFIX_OFFSET, 16, 1, bmtd.remap_range,
                        4 * bmtd.remap_range_len, 0);
}

/* new func */
static void
print_bbt(void) {
	for (u32 i = 0; i < bmtd.total_blks - 4; ++i) {
		if (bbt_block_is_bad((u16)i))
			pr_info("bad block: #%u at 0x%X, mapping_index %d\n", i, i << bmtd.blk_shift, get_mapping_block_index_bbt((int)i));
	}

/* mod */
static int
mtk_bmt_init_bbt(struct device_node *np)
{
	... omitted
	print_bmtd();
	print_bbt();
	return 0;
}

Also mod nandtest from mtd-utils, delete read and write codes, only test if a block is bad. nandtest calls ioctl(fd, MEMGETBADBLOCK, &test_ofs) for that purpose.

ioctl(MEMGETBADBLOCK) output from stock FW

Linux LinksysN 2.6.36 #81 SMP Mon Apr 29 17:41:44 PDT 2019 mips GNU/Linux

ECC corrections: 0
ECC failures   : 0
Bad blocks     : 5
BBT blocks     : 0
Bad block at 0x02900000, block 328
Bad block at 0x02920000, block 329
Bad block at 0x02940000, block 330
Bad block at 0x02960000, block 331
Bad block at 0x06580000, block 812

ioctl(MEMGETBADBLOCK) output from OpenWrt 22.3.05

No "ALL", so test for earch partition, non-list parts have zero badblock
address and block translated into global scope.

mtd5 kernel
ECC corrections: 0
ECC failures   : 0
Bad blocks     : 1
BBT blocks     : 0
Bad block at 0x00560000, block 43

mtd6 ubi
ECC corrections: 0
ECC failures   : 0
Bad blocks     : 3
BBT blocks     : 0
Bad block at 0x02180000, block 268
Bad block at 0x02340000, block 282
Bad block at 0x02380000, block 284

mtd10 syscfg
ECC corrections: 0
ECC failures   : 0
Bad blocks     : 1
BBT blocks     : 0
Bad block at 0x06580000, block 812

And bootlog, using codes above

... omitted
nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1
nand: Macronix MX30LF1G18AC
nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
mt7621-nand 1e003000.nand: ECC strength adjusted to 4 bits
read_bbt: found bbt at block 1023
tbl_size 0, pg_size 2048, blk_size 131072, pg_shift 11, blk_shift 17
pool_lba 0, pool_pba 0, bb_max: 0, total_blks 1024
bmt_blk_idx 1023, bmt_pgs 64, remap_range_len 2, oob_offset 0
remap_range: 00000000: 00 18 00 00 00 58 00 00
bad block: #41 at 0x520000, mapping_index 42
bad block: #268 at 0x2180000, mapping_index 268
bad block: #282 at 0x2340000, mapping_index 282
bad block: #284 at 0x2380000, mapping_index 284
bad block: #812 at 0x6580000, mapping_index 812
11 fixed-partitions partitions found on MTD device mt7621-nand
Creating 11 MTD partitions on "mt7621-nand":
0x000000000000-0x000000080000 : "boot"
0x000000080000-0x0000000c0000 : "u_env"
0x0000000c0000-0x000000100000 : "factory"
0x000000100000-0x000000140000 : "s_env"
0x000000140000-0x000000180000 : "devinfo"
0x000000180000-0x000000580000 : "kernel"
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40 (offset 0x520000)
0x000000580000-0x000002980000 : "ubi"
0x000002980000-0x000002d80000 : "alt_kernel"
0x000002d80000-0x000005180000 : "alt_rootfs"
0x000005180000-0x000005280000 : "sysdiag"
0x000005280000-0x000007f80000 : "syscfg"
... omitted

As we can see, bad blocks via different means don't match. Furthermore, I don't think block 43 is really bad, I guess it's ubi-related code that marked block 43 bad, that's why once OpenWrt boots, it always boots even flashed to official OpenWrt.
And why nand_erase_nand attemt to erase 0x520000 11 times when attaching kernel partition?

The bug is still out there.

@DragonBluep
Copy link
Contributor

Did you try master branch? nand driver issue should be fixed recently.

@chrdev
Copy link
Author

chrdev commented Jun 16, 2023

The lastest master branch still showed this message 11 times. so maybe not fixed
nand: nand_erase_nand: attempt to erase a bad block at page 0x00000a40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug issue report with a confirmed bug
Projects
None yet
Development

No branches or pull requests

2 participants