FS#1242 - SATA broken on kernel 4.9 on mt7621 #6488

Closed
openwrt-bot opened this issue Dec 25, 2017 · 22 comments

openwrt-bot commented Dec 25, 2017

neheb:

Supply the following if possible:

  • Device problem occurs on - GnuBee Personal Cloud One
  • Software versions of LEDE release, packages, etc. - latest trunk
  • Steps to reproduce - Run trunk for a few hours and verify that the data on the hard drive is legitimate.

Basically, with kernel 4.9 there's some weird issue where after several hours (around 18), the SATA controller starts returning bad data. On 4.4, this is not a problem.

I've avoided reporting this problem to kernel.org since ramips is quite LEDE-specific. It could be a PCIe issue for all I know.

The data on the actual hard drive is fine. It's just bad data that's being returned. Maybe bit errors or something.

The way I test this is by using transmission with its Verify feature. Last I tested with adm + ext4, a torrent that verified at 100% verified at 91% 3 days later.

btrfs is more vocal since it reports silent data corruption and throws checksum mismatch errors in dmesg quite frequently after a few hours.

I currently work around the issue by running kernel 4.4, but this is not a long term solution.
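
For anyone without Transmission, the same kind of check can be approximated with a short script. The following is a minimal Python sketch, not something the reporters used: it hashes a file in fixed-size pieces so that a later pass can report what fraction of pieces still matches, similar in spirit to a torrent client's verify step. The function names and the 1 MiB piece size are assumptions chosen for illustration.

```python
import hashlib


def piece_hashes(path, piece_size=1 << 20):
    """Return one SHA-1 hex digest per fixed-size piece of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            piece = f.read(piece_size)
            if not piece:
                break
            hashes.append(hashlib.sha1(piece).hexdigest())
    return hashes


def verified_fraction(path, expected, piece_size=1 << 20):
    """Re-read the file and return the fraction of pieces matching `expected`."""
    if not expected:
        return 1.0
    current = piece_hashes(path, piece_size)
    good = sum(1 for old, new in zip(expected, current) if old == new)
    return good / len(expected)
```

The idea is to call `piece_hashes()` once while the data is known to be good, store the result, and call `verified_fraction()` again after the device has been up for many hours. A result below 1.0 for a file that nothing has written to suggests the reads are returning bad data.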

openwrt-bot commented Dec 27, 2017

valdi74:

I can confirm this bug on a Xiaomi Mi Router 3G with a USB SATA HDD. Large downloaded files (10 GB) are sometimes broken (20-30% of files): the md5sum doesn't match. There was no log entry when the error occurred. Tested with:

  • three different Xiaomi Mi 3G routers
  • two USB HDDs
  • four different LEDE snapshots from the last 7 weeks (kernel 4.9)
  • pyLoad and wget download managers

openwrt-bot commented Dec 27, 2017

neheb:

Let's see. Not a SATA issue. Not a PCIe issue (USB is not connected through PCIe). Sounds like a bug introduced in the port to 4.9. Maybe a CPU issue?

openwrt-bot commented Jan 11, 2018

HeadLessHUN:

Hi there!

I also hit this bug on a Xiaomi Mi Router 3G, on different HDDs with an ext4 filesystem.

I'm running OpenWrt SNAPSHOT r5629-23bba9c, which ships kernel 4.9.72.

The only kernel log entries that might be relevant appear when mysqld tries to access a block and can't read it:

[28372.317828] EXT4-fs warning: 10 callbacks suppressed
[28372.317846] EXT4-fs warning (device sdb3): htree_dirblock_to_tree:962: inode #872: lblock 0: comm mysqld: error -5 reading directory block
[28372.341743] EXT4-fs warning (device sdb3): htree_dirblock_to_tree:962: inode #872: lblock 0: comm mysqld: error -5 reading directory block
[28372.365581] EXT4-fs warning (device sdb3): dx_probe:742: inode #4312: lblock 0: comm mysqld: error -5 reading directory block

It is a very annoying bug; I hope it will be fixed ASAP.

openwrt-bot commented Jan 11, 2018

neheb:

Kernel 4.14 should be coming soon. Hopefully it fixes this issue. For all I know, the kernel config could be the issue. Testing is needed...

openwrt-bot commented Jan 15, 2018

openwrt-bot commented Jan 17, 2018

HeadLessHUN:

I'll try it out, but it shouldn't have any impact because it was added to the generic config in June (https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=b47fd7656336162360ebf66147326763ddae3f8d;hp=415c47de79ada7496c39f435df0b0523472aee58). Did you change anything else relative to the master branch?

openwrt-bot commented Jan 17, 2018

neheb:

Yeah I did a diff between config-4.4 and config-4.9 and removed newly introduced CONFIGs. It worked. I have firmware on 4.9 that does not show this issue. Unfortunately, I lost the exact config.

I'm currently testing a new one but unfortunately, this testing of bad kernels destroyed my btrfs array. Now I need to rebuild it...

diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index 0ea6798..b3c8afc 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -67,7 +67,6 @@ CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_IO=y
CONFIG_GENERIC_IRQ_CHIP=y
-CONFIG_GENERIC_IRQ_IPI=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_SCHED_CLOCK=y
@@ -105,7 +104,6 @@ CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_HAVE_IDE=y
-CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
@@ -127,7 +125,6 @@ CONFIG_I2C_MT7621=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_IRQCHIP=y
CONFIG_IRQ_DOMAIN=y
-CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_IRQ_MIPS_CPU=y
CONFIG_IRQ_WORK=y

and yes, I attributed the error to the wrong CONFIG.
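
The config-4.4 versus config-4.9 comparison mentioned in this comment can be approximated with a few lines of Python. This is only a sketch of one way to list symbols that appear in the newer fragment but not the older one; it is not the diff that was actually run, and the helper names are made up for illustration. It assumes the usual `CONFIG_FOO=y` and `# CONFIG_FOO is not set` line forms.

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG symbol to its value; unset symbols map to None."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            symbols[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            symbols[m.group(1)] = None
    return symbols


def newly_introduced(old_text, new_text):
    """Symbols present in the new config but absent from the old one."""
    old, new = parse_config(old_text), parse_config(new_text)
    return sorted(name for name in new if name not in old)


# Hypothetical usage, assuming both fragments exist in the tree:
#   old = open("target/linux/ramips/mt7621/config-4.4").read()
#   new = open("target/linux/ramips/mt7621/config-4.9").read()
#   print("\n".join(newly_introduced(old, new)))
```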

openwrt-bot commented Jan 21, 2018

HeadLessHUN:

I commented out these lines

CONFIG_GENERIC_IRQ_IPI=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y

and added this line to target/linux/ramips/mt7621/config-4.9:

CONFIG_SCHED_HRTICK=y

I built it, but the problem was not solved. There is still a lot of corruption after a few hours of uptime.

openwrt-bot commented Jan 21, 2018

neheb:

I got rid of a bunch of CONFIG settings in config-4.9, but after inspecting the actual generated .config file in the build directory, there's no difference. So it seems this is a dead end...

In other news, I seem not to have these issues anymore. I don't know why. The only answer I have is that it was fixed upstream. I can't see what would have done that though... I have working firmware from 4.9.75. I need to do more testing, but this seems to be gone.

Even if placebo, try this patch. It may work, may not...

diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index f9765ed..37c2e19 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -12,7 +12,6 @@ CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_ARCH_WANT_IPC_PARSE_VERSION=y
-CONFIG_BLK_MQ_PCI=y
CONFIG_BOARD_SCACHE=y
CONFIG_BOUNCE=y
CONFIG_CEVT_R4K=y
@@ -28,7 +27,6 @@ CONFIG_CMDLINE_BOOL=y
CONFIG_COMMON_CLK=y
CONFIG_CPU_GENERIC_DUMP_TLB=y
CONFIG_CPU_HAS_PREFETCH=y
-CONFIG_CPU_HAS_RIXI=y
CONFIG_CPU_HAS_SYNC=y
CONFIG_CPU_LITTLE_ENDIAN=y
CONFIG_CPU_MIPS32=y
@@ -45,14 +43,8 @@ CONFIG_CPU_SUPPORTS_32BIT_KERNEL=y
CONFIG_CPU_SUPPORTS_HIGHMEM=y
CONFIG_CPU_SUPPORTS_MSA=y
CONFIG_CRC16=y
-CONFIG_CRYPTO_AEAD=y
-CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_DEFLATE=y
-CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_LZO=y
-CONFIG_CRYPTO_MANAGER=y
-CONFIG_CRYPTO_MANAGER2=y
-CONFIG_CRYPTO_NULL2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CSRC_R4K=y
@@ -61,13 +53,11 @@ CONFIG_DMA_NONCOHERENT=y
CONFIG_DTB_RT_NONE=y
CONFIG_DTC=y
CONFIG_EARLY_PRINTK=y
-CONFIG_FIXED_PHY=y
CONFIG_GENERIC_ATOMIC64=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_IO=y
CONFIG_GENERIC_IRQ_CHIP=y
-CONFIG_GENERIC_IRQ_IPI=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_SCHED_CLOCK=y
@@ -77,7 +67,6 @@ CONFIG_GPIOLIB=y
CONFIG_GPIO_MT7621=y
# CONFIG_GPIO_RALINK is not set
CONFIG_GPIO_SYSFS=y
-CONFIG_HANDLE_DOMAIN_IRQ=y
CONFIG_HARDWARE_WATCHPOINTS=y
CONFIG_HAS_DMA=y
CONFIG_HAS_IOMEM=y
@@ -89,7 +78,6 @@ CONFIG_HAVE_ARCH_KGDB=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_HAVE_BOOTMEM_INFO_NODE is not set
-CONFIG_HAVE_CBPF_JIT=y
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
@@ -105,7 +93,6 @@ CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_HAVE_IDE=y
-CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
@@ -115,19 +102,16 @@ CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_HAVE_NET_DSA=y
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_PERF_EVENTS=y
-CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HIGHMEM=y
CONFIG_HW_HAS_PCI=y
CONFIG_HZ_PERIODIC=y
CONFIG_I2C=y
-CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_MT7621=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_IRQCHIP=y
CONFIG_IRQ_DOMAIN=y
-CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_IRQ_MIPS_CPU=y
CONFIG_IRQ_WORK=y
@@ -136,8 +120,6 @@ CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_MDIO_BOARDINFO=y
CONFIG_MIPS=y
-CONFIG_MIPS_ASID_BITS=8
-CONFIG_MIPS_ASID_SHIFT=0
CONFIG_MIPS_CLOCK_VSYSCALL=y
CONFIG_MIPS_CM=y
# CONFIG_MIPS_CMDLINE_BUILTIN_EXTEND is not set
@@ -204,11 +186,9 @@ CONFIG_OF_MDIO=y
CONFIG_OF_NET=y
CONFIG_OF_PCI=y
CONFIG_OF_PCI_IRQ=y
-CONFIG_PADATA=y
CONFIG_PCI=y
CONFIG_PCI_DISABLE_COMMON_QUIRKS=y
CONFIG_PCI_DOMAINS=y
-CONFIG_PCI_DRIVERS_LEGACY=y
CONFIG_PERF_USE_VMALLOC=y
CONFIG_PGTABLE_LEVELS=2
CONFIG_PHYLIB=y
@@ -223,16 +203,11 @@ CONFIG_RALINK=y
# CONFIG_RALINK_WDT is not set
CONFIG_RATIONAL=y
CONFIG_RCU_STALL_COMMON=y
-CONFIG_REGMAP=y
-CONFIG_REGMAP_I2C=y
-CONFIG_REGMAP_SPI=y
CONFIG_RESET_CONTROLLER=y
CONFIG_RFS_ACCEL=y
CONFIG_RPS=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_DRV_PCF8563=y
-CONFIG_RTC_I2C_AND_SPI=y
-CONFIG_RTC_MC146818_LIB=y
# CONFIG_SCHED_INFO is not set
CONFIG_SCHED_SMT=y
# CONFIG_SCSI_DMA is not set
@@ -254,7 +229,6 @@ CONFIG_SPI_MT7621=y
CONFIG_SRCU=y
CONFIG_SWCONFIG_LEDS=y
CONFIG_SWCONFIG=y
-CONFIG_SWPHY=y
CONFIG_SYNC_R4K=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_SYS_HAS_CPU_MIPS32_R1=y

openwrt-bot commented Jan 21, 2018

HeadLessHUN:

I'm on 4.9.77 (r5917-36f1978) and the issue is still there...

Aren't these removed config options needed by anything? For example, OpenVPN needs crypto support.

openwrt-bot commented Jan 21, 2018

neheb:

Like I said, this is a no-op, as all of those options end up in the resulting kernel .config anyway. But I tried it on one of my builds and it seems to have worked? If something breaks you'll instantly know.
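
Whether a fragment change really is a no-op can be checked mechanically. The sketch below is an illustration only, with hypothetical helper names: it collects the symbols a patch removes and reports which of them the generated .config still sets. It assumes the patch is a unified diff and that the generated file uses plain `CONFIG_FOO=value` lines.

```python
import re


def removed_symbols(patch_text):
    """Return CONFIG symbols on '-' lines of a unified diff (file headers skipped)."""
    names = []
    for line in patch_text.splitlines():
        if line.startswith("---"):
            continue
        m = re.match(r"^-(CONFIG_\w+)=", line)
        if m:
            names.append(m.group(1))
    return names


def still_enabled(patch_text, generated_config_text):
    """Symbols the patch removes that the generated .config sets anyway."""
    enabled = set(re.findall(r"^(CONFIG_\w+)=", generated_config_text, re.MULTILINE))
    return [name for name in removed_symbols(patch_text) if name in enabled]
```

If `still_enabled()` returns every symbol the patch removed, the patch did not change the kernel that was built, which matches the observation above.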

openwrt-bot commented Jan 23, 2018

neheb:

I gave up. What I did was probably placebo. Just gonna keep ramips at 4.4 in my tree.

Hoping 4.14 (which should come soon) fixes it, but I wouldn't hold my breath. If you can, run a ramips unit for several days and compare the output of "md5sum /dev/mtdblock[0123456]" over time to see if the checksums change. I bet they do. Unfortunately, I don't think anyone cares even though this is a potentially huge issue.

openwrt-bot commented Jan 23, 2018

HeadLessHUN:

I will run it for several days, save the md5sums of all the mtdblock devices, and share them with you. This bug should really get a higher priority...

But some of them should change anyway, for example because of the overlayfs. It should be tested on partitions that are not changing...
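
The mtdblock comparison discussed here could be scripted roughly as follows. This is a hedged Python sketch rather than anything that was run for this report: `snapshot()` and `changed()` are made-up names, and the device list in the usage comment simply mirrors the mtdblock0 to mtdblock6 range from the md5sum command quoted earlier.

```python
import hashlib


def md5_of(path, chunk_size=1 << 16):
    """MD5 hex digest of a file or block device, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(paths):
    """Map each path to its current MD5 digest."""
    return {p: md5_of(p) for p in paths}


def changed(before, after):
    """Paths whose digest differs between two snapshots."""
    return sorted(p for p in before if p in after and before[p] != after[p])


# Hypothetical usage on the router (device names are an assumption):
#   devices = ["/dev/mtdblock%d" % i for i in range(7)]
#   first = snapshot(devices)
#   ... several days later ...
#   print(changed(first, snapshot(devices)))
```

As noted above, only partitions that nothing writes to are meaningful here; a partition backing the overlay is expected to show up in `changed()` even on a healthy system.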

openwrt-bot commented Jan 23, 2018

HeadLessHUN:

Nah, it's getting corrupted (I mean my HDDs). Is it possible to build a snapshot image with the 4.4 kernel? Or is my only option to backport the device to LEDE 17.01 stable?

openwrt-bot commented Jan 23, 2018

neheb:

I'm using 4.4 with trunk. Just copy patches-4.4 and config-4.4 from 17.01 and change the Makefile to use 4.4.

openwrt-bot commented Jan 31, 2018

neheb:

@HeadLessHUN a little birdie told me that disabling CONFIG_HIGHMEM fixes this. Could be good to try out.

diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9
index f9765ed..7732443 100644
--- a/target/linux/ramips/mt7621/config-4.9
+++ b/target/linux/ramips/mt7621/config-4.9
@@ -118,7 +118,7 @@ CONFIG_HAVE_PERF_EVENTS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
-CONFIG_HIGHMEM=y
+# CONFIG_HIGHMEM is not set
CONFIG_HW_HAS_PCI=y
CONFIG_HZ_PERIODIC=y
CONFIG_I2C=y

openwrt-bot commented Feb 1, 2018

easyteacher:

@neheb Does disabling CONFIG_HIGHMEM really work? Have you tested it?

I found a new config introduced in kernel 4.5

CONFIG_IO_STRICT_DEVMEM: Filter I/O access to /dev/mem (https://cateee.net/lkddb/web-lkddb/IO_STRICT_DEVMEM.html)

And will enabling CONFIG_DM_VERITY help?

openwrt-bot commented Feb 1, 2018

neheb:

No idea. I've tried it on the 4.4 kernel and it seems to work well. I'm using it for the sd card though (the mmc driver breaks when using the HighMem zone). Could also help here since the issue for me happens after 15+ hours. Maybe when something else tries using the HighMem zone.

I don't think those two options have any impact.
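
To see whether a running kernel has a HighMem zone at all, /proc/meminfo can be inspected: the HighTotal field is only present when the kernel was built with CONFIG_HIGHMEM. The following is a small Python sketch with a made-up function name, assuming the usual "Name: value kB" layout of that file.

```python
def highmem_kib(meminfo_text):
    """Return HighTotal in KiB from /proc/meminfo text, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("HighTotal:"):
            return int(line.split()[1])
    return None


# Hypothetical usage on the device:
#   with open("/proc/meminfo") as f:
#       print(highmem_kib(f.read()))
```

A result of `None` would mean the kernel has no HighMem support compiled in, while a positive value means part of RAM is being managed as high memory.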

openwrt-bot commented Feb 3, 2018

easyteacher:

Detecting silent data corruptions and memory leaks using DMA Debug API (https://events.static.linuxfound.org/sites/events/files/slides/Shuah_Khan_dma_map_error.pdf)

I found a document possibly related to the bug. To debug, set CONFIG_DMA_API_DEBUG=y. Currently I have no idea how to use it.

openwrt-bot commented Feb 4, 2018

neheb:

It seems drivers must be manually modified to use it.

openwrt-bot commented Apr 27, 2018

valdi74:

Maybe this commit is the solution to our problem? https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=79126770868995faa8656f6687a88d385802e34b

openwrt-bot commented Apr 27, 2018

neheb:

Yes.
