
task l2arc_feed:1815 blocked for more than 120 seconds. #13435

Open
liyimeng opened this issue May 7, 2022 · 20 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@liyimeng

liyimeng commented May 7, 2022

System information

Type Version/Name
Distribution Name ubuntu
Distribution Version 20.04
Kernel Version 5.4.0.88
Architecture amd64
OpenZFS Version 2.1.0

Describe the problem you're observing

Every time I reboot my system, importing the zpool makes booting get stuck for a long time, and I observe a call trace like the one below. Is this something to worry about?

[  243.948884] INFO: task l2arc_feed:1815 blocked for more than 120 seconds.
[  243.948992]       Tainted: P           OE     5.4.0-72-generic #80
[  243.949088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.949220] l2arc_feed      D    0  1815      2 0x80004000
[  243.949321] Call Trace:
[  243.949422]  __schedule+0x2e3/0x740
[  243.949515]  schedule+0x42/0xb0
[  243.949606]  schedule_preempt_disabled+0xe/0x10
[  243.949704]  __mutex_lock.isra.0+0x182/0x4f0
[  243.949808]  ? sigprocmask+0x72/0xa0
[  243.949897]  __mutex_lock_slowpath+0x13/0x20
[  243.949988]  mutex_lock+0x2e/0x40
[  243.950166]  l2arc_feed_thread+0x116/0x1a70 [zfs]
[  243.950263]  ? __switch_to_asm+0x34/0x70
[  243.950353]  ? __switch_to_asm+0x40/0x70
[  243.950444]  ? __switch_to_asm+0x34/0x70
[  243.950534]  ? __switch_to_asm+0x40/0x70
[  243.950626]  ? __switch_to_asm+0x34/0x70
[  243.950719]  ? __switch_to_asm+0x40/0x70
[  243.950809]  ? __switch_to_asm+0x34/0x70
[  243.950899]  ? __switch_to_asm+0x40/0x70
[  243.950989]  ? __switch_to_asm+0x34/0x70
[  243.951080]  ? __switch_to_asm+0x40/0x70
[  243.951170]  ? __switch_to_asm+0x34/0x70
[  243.951262]  ? __switch_to_asm+0x40/0x70
[  243.951353]  ? __switch_to_asm+0x40/0x70
[  243.951452]  ? __switch_to+0x7f/0x470
[  243.951542]  ? __switch_to_asm+0x40/0x70
[  243.951632]  ? __switch_to_asm+0x34/0x70
[  243.951735]  ? spl_kmem_free+0x33/0x40 [spl]
[  243.951859]  ? l2arc_remove_vdev+0x220/0x220 [zfs]
[  243.951957]  thread_generic_wrapper+0x79/0x90 [spl]
[  243.952056]  kthread+0x104/0x140
[  243.952146]  ? __thread_exit+0x20/0x20 [spl]
[  243.952237]  ? kthread_park+0x90/0x90
[  243.952327]  ret_from_fork+0x35/0x40
[  364.780885] INFO: task l2arc_feed:1815 blocked for more than 241 seconds.
[  364.780992]       Tainted: P           OE     5.4.0-72-generic #80
[  364.781088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.781218] l2arc_feed      D    0  1815      2 0x80004000
[  364.781315] Call Trace:
[  364.781413]  __schedule+0x2e3/0x740
[  364.781504]  schedule+0x42/0xb0
[  364.781593]  schedule_preempt_disabled+0xe/0x10
[  364.781686]  __mutex_lock.isra.0+0x182/0x4f0
[  364.781783]  ? sigprocmask+0x72/0xa0
[  364.781872]  __mutex_lock_slowpath+0x13/0x20
[  364.781965]  mutex_lock+0x2e/0x40
[  364.782134]  l2arc_feed_thread+0x116/0x1a70 [zfs]
[  364.782229]  ? __switch_to_asm+0x34/0x70
[  364.782320]  ? __switch_to_asm+0x40/0x70
[  364.782410]  ? __switch_to_asm+0x34/0x70
[  364.782500]  ? __switch_to_asm+0x40/0x70
[  364.782591]  ? __switch_to_asm+0x34/0x70
[  364.782681]  ? __switch_to_asm+0x40/0x70
[  364.782772]  ? __switch_to_asm+0x34/0x70
[  364.782862]  ? __switch_to_asm+0x40/0x70
[  364.782952]  ? __switch_to_asm+0x34/0x70
[  364.783042]  ? __switch_to_asm+0x40/0x70
[  364.783135]  ? __switch_to_asm+0x34/0x70
[  364.783227]  ? __switch_to_asm+0x40/0x70
[  364.783317]  ? __switch_to_asm+0x40/0x70
[  364.783417]  ? __switch_to+0x7f/0x470
[  364.783507]  ? __switch_to_asm+0x40/0x70
[  364.783597]  ? __switch_to_asm+0x34/0x70
[  364.783699]  ? spl_kmem_free+0x33/0x40 [spl]
[  364.783826]  ? l2arc_remove_vdev+0x220/0x220 [zfs]
[  364.783923]  thread_generic_wrapper+0x79/0x90 [spl]
[  364.784020]  kthread+0x104/0x140
[  364.784111]  ? __thread_exit+0x20/0x20 [spl]
[  364.784202]  ? kthread_park+0x90/0x90
[  364.784292]  ret_from_fork+0x35/0x40
[  743.635861]  zd0: p1
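For reference, the 120-second threshold in the message above comes from the kernel's hung-task watchdog, and can be inspected or tuned via sysctl while debugging (a sketch only; this silences or delays the warning, it does not fix the underlying stall):

```shell
# Inspect the current hung-task watchdog threshold (needs /proc mounted)
sysctl kernel.hung_task_timeout_secs

# Raise the threshold to 600s while debugging, or set it to 0 to silence
# the warnings entirely, as the log message itself suggests. Needs root.
sudo sysctl -w kernel.hung_task_timeout_secs=600
```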

@liyimeng liyimeng added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 7, 2022
@rincebrain
Contributor

I'd suggest using 2.1.4, for one thing - there've been a lot of random fixes that might or might not make a difference. (Maybe #12365, for example, or another L2ARC removal race that I don't have the link for offhand.)

@liyimeng
Author

liyimeng commented May 8, 2022

@rincebrain Thanks a lot. I find this is a real issue. Sometimes it keeps printing this with no end, and I have to hard-reset the system; after repeating that a couple of times, it gets through. I will upgrade to 2.1.4 to see if it helps.

@liyimeng
Author

liyimeng commented May 8, 2022

Even when I am lucky enough to get the system to boot, it is almost unusable. It keeps freezing now and then, and a quick fio test shows a miserable result:

bash-5.0# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=960M --numjobs=1 --runtime=60 --time_based --end_fsync=1

Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=124080: Sun May  8 20:24:32 2022
  write: IOPS=8, BW=35.7KiB/s (36.5kB/s)(22.4MiB/643285msec); 0 zone resets
    slat (usec): min=45, max=4592, avg=76.80, stdev=98.31
    clat (msec): min=2, max=218, avg=10.38, stdev= 8.46
     lat (msec): min=2, max=218, avg=10.46, stdev= 8.46
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[   11], 80.00th=[   17], 90.00th=[   21], 95.00th=[   27],
     | 99.00th=[   40], 99.50th=[   42], 99.90th=[   52], 99.95th=[   54],
     | 99.99th=[  220]
   bw (  KiB/s): min=  120, max= 1464, per=100.00%, avg=384.47, stdev=273.18, samples=119
   iops        : min=   30, max=  366, avg=96.12, stdev=68.30, samples=119
  lat (msec)   : 4=20.55%, 10=46.18%, 20=22.02%, 50=11.11%, 100=0.10%
  lat (msec)   : 250=0.03%
  cpu          : usr=0.02%, sys=0.06%, ctx=11322, majf=0, minf=5748
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5736,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=35.7KiB/s (36.5kB/s), 35.7KiB/s-35.7KiB/s (36.5kB/s-36.5kB/s), io=22.4MiB (23.5MB), run=643285-643285msec

@rincebrain
Contributor

If it's flooding your system with errors about the l2arc, if 2.1.4 doesn't help, I'd try removing the l2arc from the pool? (If you can't get it to respond well enough to do that, you could try just offlining it then removing it if that works, or you can physically detach the l2arc and then import the pool.)

What's the pool layout? Was this working until recently or after some update, or is this a new setup?
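A sketch of the removal steps suggested above, assuming a pool named `data` with a cache device `nvme0n1p3` and a log device `nvme0n1p1` (substitute the names shown by `zpool status` on your system):

```shell
# Optionally take the L2ARC device out of service first
sudo zpool offline data nvme0n1p3

# Remove the cache device, and the log device if desired
sudo zpool remove data nvme0n1p3
sudo zpool remove data nvme0n1p1

# Confirm the devices are gone
zpool status data
```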

@liyimeng
Author

liyimeng commented May 9, 2022

I have removed both the log and l2arc devices from the pool; that makes no difference at all. I am struggling to upgrade to 2.1.4 because I have trouble getting everything compiled and installed.

The setup is pretty simple: a raid0 stripe with SSD partitions as the cache and log devices.


$ sudo zpool status 
  pool: data
 state: ONLINE
  scan: scrub in progress since Sat May  7 21:40:47 2022
	684G scanned at 1.09G/s, 4.01G issued at 6.53M/s, 753G total
	0B repaired, 0.53% done, no estimated completion time
remove: Removal of vdev 0 copied 17.8G in 0h3m, completed on Mon Jan 11 18:02:51 2021
	114K memory used for removed device mappings
config:

	NAME          STATE     READ WRITE CKSUM
	data          ONLINE       0     0     0
	  sdd         ONLINE       0     0     0
	  sde         ONLINE       0     0     0
	logs	
	  nvme0n1p1   ONLINE       0     0     0
	cache
	  nvme0n1p3   ONLINE       0     0     0

errors: No known data errors

It was working great at the beginning; then I started to see the filesystem freeze now and then. By now it has become unusable.

I thought it was a disk fault and ran a couple of scrubs, but found no errors.

@h1z1

h1z1 commented May 9, 2022

The thread is blocked, yes, but what are the other CPUs doing when that is printed? I'm guessing something else is borked.
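One way to see what the other CPUs are doing at that moment is the kernel's SysRq interface (a debugging sketch; requires root, and SysRq must be enabled in the kernel):

```shell
# Temporarily allow all SysRq functions
echo 1 | sudo tee /proc/sys/kernel/sysrq

# 'l' dumps a backtrace of all active CPUs; 'w' dumps blocked (D-state) tasks
echo l | sudo tee /proc/sysrq-trigger
echo w | sudo tee /proc/sysrq-trigger

# The resulting traces land in the kernel log
sudo dmesg | tail -n 100
```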

@jittygitty

@liyimeng I suspect an issue with one of the drives in your raid0-type stripe. I could be wrong; did smartctl show anything interesting?

@liyimeng
Author

My drives are behind a hardware RAID controller, which does not support passthrough. I therefore configured a single-disk raid0 for each disk, and the controller hides all the smartctl metrics away :(


/ # smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Intel
Product:              RMS25CB080
Revision:             3.19
Compliance:           SPC-3
User Capacity:        1,998,998,994,944 bytes [1.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6001e67881d5e0002a0ee2abcb7e9e21
Serial number:        00219e7ecbabe20e2a00e0d58178e601
Device type:          disk
Local Time is:        Thu May 12 21:24:10 2022 UTC
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging
/ # smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Intel
Product:              RMS25CB080
Revision:             3.19
Compliance:           SPC-3
User Capacity:        1,998,998,994,944 bytes [1.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6001e67881d5e0002a0ee2f7d00690fb
Serial number:        00fb9006d0f7e20e2a00e0d58178e601
Device type:          disk
Local Time is:        Thu May 12 21:24:54 2022 UTC
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

However, while I was hacking around, I accidentally screwed up the RAID configuration in the controller, making one of the disks drop out, so I had to destroy the pool and rebuild it with the same setup. Now it is back to normal again; e.g. I can get 170MB/s (vs 35KB/s) of 4k random writes now.

Let's see if the numbers drop after a while.

@jittygitty

jittygitty commented May 13, 2022

@liyimeng Depending on your RAID controller card, you might be able to easily "reflash" its firmware to put it into "IT" mode instead of RAID mode, if you're not using the RAID capabilities of the controller itself and are doing RAID in software.
It's possible this might even get smartctl working. (It's also possible that your recreation of the pool fixed the corruption, or merely bypassed it for now.)
(Example for LSI: https://www.ikus-soft.com/en/blog/2021-01-20-flash-lsi-controller-for-zfs/ )

@liyimeng
Author

@jittygitty Thanks for the hint! It is very tempting. I have this RAID card running on an old Intel server (https://ark.intel.com/content/www/us/en/ark/products/56253/intel-server-board-s2600gz.html); the model is LSI SAS 2208.

06:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS 2208 [Thunderbolt] (rev 05)
	Subsystem: Intel Corporation Integrated RAID Module RMS25CB080
	Physical Slot: 2-1

How risky is re-flashing the firmware? I am afraid that if I brick it, it will be difficult to find a replacement, since it is such an old card.

@jittygitty

@liyimeng Lol, I was gonna say just buy a cheap one on eBay first to experiment with flashing, but eBay is completely down:
"upstream connect error or disconnect/reset before headers. reset reason: overflow"

I've done it before on some Dell version of an LSI card; it was long ago, but I recall it was pretty easy, and I don't think it's very risky. But of course don't take my word for it, or blame me after you brick it :D

You can probably get help with that from the guys on the servethehome forum etc:
https://www.servethehome.com/lsi-sas-2208-raid-controller-information-listing/
https://forums.servethehome.com/index.php?threads/the-complete-lsi-avago-broadcom-sbr-download-megathread.33607/

@liyimeng
Author

@jittygitty Thanks, I need to think twice :) It is the only server I can perform work on at the moment. I will wait a while before trying that out.

@liyimeng
Author

@jittygitty According to https://forums.servethehome.com/index.php?threads/the-complete-lsi-avago-broadcom-sbr-download-megathread.33607/post-326204
It seems I am out of luck.

Meanwhile, I suspect your speculation may be right: I am observing the pool slowing down after just two days of use.

@jittygitty

jittygitty commented May 17, 2022

@liyimeng I think it's doable, since Intel ones are being sold on eBay already flashed to IT mode and specifically advertised for ZFS:
$45 (offers accepted) https://www.ebay.com/itm/144469452953?
These are cheap at $29 (or lower/offer) https://www.ebay.com/itm/165377309689
But some say they'll only work in an "INTEL" motherboard: https://forums.servethehome.com/index.php?threads/strange-intel-sas-raid-card.5030/

If you do recreate your stripe yet again, maybe to try pairing with a different drive, you could connect the disks to the onboard motherboard SATA ports and try to get some smartctl readings that way for now.

@liyimeng
Author

@jittygitty Thanks a lot. It is very nice of you. I tried this https://plone.lucidsolutions.co.nz/hardware/sas-controller/lsi-2208/configuring-jbod-with-a-lsi-2208-controller/view

The author warns that it is not stable; I don't know exactly what he means. In my case, I could not boot the OS from a disk inside the RAID, so I worked around it by booting from a disk outside the RAID instead.

Now I can run smartctl on the disks :D

I don't know how to read this output. Meanwhile, I am still working on upgrading to 2.1.4.


/ # smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-88-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile (SMR)
Device Model:     WDC WD20SPZX-22UA7T0
Serial Number:    WD-WXL2E107D1PX
LU WWN Device Id: 5 0014ee 2680ecfcd
Firmware Version: 01.01A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:33:59 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22740) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 365) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   220   219   021    Pre-fail  Always       -       1991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11786
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       793
194 Temperature_Celsius     0x0022   115   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/ # smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-88-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile (SMR)
Device Model:     WDC WD20SPZX-22UA7T0
Serial Number:    WD-WXA2E3076TCU
LU WWN Device Id: 5 0014ee 2680edb82
Firmware Version: 01.01A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:27:42 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22260) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 368) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   220   219   021    Pre-fail  Always       -       2000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11785
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       420
194 Temperature_Celsius     0x0022   113   111   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

@putnam

putnam commented Jun 21, 2022

Ever since adding a cache device, this happens to me on every boot when there is a long import going on (using 2.1.4).

e.g.:

[screenshot of console messages showing the blocked-task trace during import]

I assume it's harmless but I have no idea.

@imero

imero commented Oct 14, 2022

I have had the same problem for a long time: boot takes up to 20 minutes on a small server with an 8-disk raidz2 and 2x NVMe cache. I always update to the latest ZFS version, but had not seen noticeable changes up to and including 2.1.5.

[ 1088.601676] kernel: INFO: task l2arc_feed:1431 blocked for more than 966 seconds.
[ 1088.602210] kernel: Tainted: P OE 5.8.0-31-generic #33-Ubuntu
[ 1088.602700] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.603192] kernel: l2arc_feed D 0 1431 2 0x00004000

And the "a start job is running for import zfs pools by cache file" message.

Today I installed ZFS 2.1.6 and the problem is completely gone; boot now takes a few seconds!
I cannot tell from the release notes which fix is the factor, and I have also applied regular Ubuntu updates, so it is not a completely isolated test, but if anyone has the same problem, try it out. I use the JonathonF PPA for the ZFS install.
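For reference, installing from that PPA on Ubuntu looks roughly like this (a sketch; it assumes the PPA path is `ppa:jonathonf/zfs` and the standard Ubuntu ZFS package names):

```shell
sudo add-apt-repository ppa:jonathonf/zfs
sudo apt update

# zfs-dkms rebuilds the kernel modules for the running kernel;
# zfsutils-linux provides the zpool/zfs CLI tools
sudo apt install zfs-dkms zfsutils-linux
```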

@liyimeng
Author

liyimeng commented Oct 15, 2022

It is kind of random. I am still on 2.1.4, but it has disappeared for a while; I am not sure if it will come back. I seldom reboot my computer, maybe once every three months, so I am not sure whether the problem is still there.

@jay-to-the-dee

jay-to-the-dee commented Oct 31, 2022

I had this also on Ubuntu 22.04 with the default ZFS package (so ZFS 2.1.4), but after installing ZFS 2.1.6 from the JonathonF PPA the stack traces have gone away.

Importing a degraded RAID-Z3 pool (missing a single drive) still takes a few minutes, though, even though I have ensured the cachefile is up to date.

@morganchristiansson

I had this issue due to an extreme number of snapshots created by the containerd ZFS snapshotter plugin.

Reducing the number of snapshots made it fast again.
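To check whether snapshot count is the culprit, something like the following can help (a sketch; `tank/fs` and the snapshot names are placeholders, and the `%` range syntax for `zfs destroy` should always be dry-run first with `-n`):

```shell
# Count all snapshots on the system
zfs list -H -t snapshot -o name | wc -l

# Dry-run deleting a range of snapshots on one dataset (-n = no-op, -v = verbose).
# 'tank/fs@old%new' expands to every snapshot between the two, inclusive.
sudo zfs destroy -nv tank/fs@snap-2022-01-01%snap-2022-06-01
```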
