
task l2arc_feed:1815 blocked for more than 120 seconds. #13435

Open
liyimeng opened this issue May 7, 2022 · 20 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@liyimeng

liyimeng commented May 7, 2022

System information

Type Version/Name
Distribution Name ubuntu
Distribution Version 20.04
Kernel Version 5.4.0.88
Architecture amd64
OpenZFS Version 2.1.0

Describe the problem you're observing

Every time I reboot my system, importing the zpool makes booting get stuck for a long time, and I observe a call trace like the one below. Is this something to worry about?

[  243.948884] INFO: task l2arc_feed:1815 blocked for more than 120 seconds.
[  243.948992]       Tainted: P           OE     5.4.0-72-generic #80
[  243.949088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.949220] l2arc_feed      D    0  1815      2 0x80004000
[  243.949321] Call Trace:
[  243.949422]  __schedule+0x2e3/0x740
[  243.949515]  schedule+0x42/0xb0
[  243.949606]  schedule_preempt_disabled+0xe/0x10
[  243.949704]  __mutex_lock.isra.0+0x182/0x4f0
[  243.949808]  ? sigprocmask+0x72/0xa0
[  243.949897]  __mutex_lock_slowpath+0x13/0x20
[  243.949988]  mutex_lock+0x2e/0x40
[  243.950166]  l2arc_feed_thread+0x116/0x1a70 [zfs]
[  243.950263]  ? __switch_to_asm+0x34/0x70
[  243.950353]  ? __switch_to_asm+0x40/0x70
[  243.950444]  ? __switch_to_asm+0x34/0x70
[  243.950534]  ? __switch_to_asm+0x40/0x70
[  243.950626]  ? __switch_to_asm+0x34/0x70
[  243.950719]  ? __switch_to_asm+0x40/0x70
[  243.950809]  ? __switch_to_asm+0x34/0x70
[  243.950899]  ? __switch_to_asm+0x40/0x70
[  243.950989]  ? __switch_to_asm+0x34/0x70
[  243.951080]  ? __switch_to_asm+0x40/0x70
[  243.951170]  ? __switch_to_asm+0x34/0x70
[  243.951262]  ? __switch_to_asm+0x40/0x70
[  243.951353]  ? __switch_to_asm+0x40/0x70
[  243.951452]  ? __switch_to+0x7f/0x470
[  243.951542]  ? __switch_to_asm+0x40/0x70
[  243.951632]  ? __switch_to_asm+0x34/0x70
[  243.951735]  ? spl_kmem_free+0x33/0x40 [spl]
[  243.951859]  ? l2arc_remove_vdev+0x220/0x220 [zfs]
[  243.951957]  thread_generic_wrapper+0x79/0x90 [spl]
[  243.952056]  kthread+0x104/0x140
[  243.952146]  ? __thread_exit+0x20/0x20 [spl]
[  243.952237]  ? kthread_park+0x90/0x90
[  243.952327]  ret_from_fork+0x35/0x40
[  364.780885] INFO: task l2arc_feed:1815 blocked for more than 241 seconds.
[  364.780992]       Tainted: P           OE     5.4.0-72-generic #80
[  364.781088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  364.781218] l2arc_feed      D    0  1815      2 0x80004000
[  364.781315] Call Trace:
[  364.781413]  __schedule+0x2e3/0x740
[  364.781504]  schedule+0x42/0xb0
[  364.781593]  schedule_preempt_disabled+0xe/0x10
[  364.781686]  __mutex_lock.isra.0+0x182/0x4f0
[  364.781783]  ? sigprocmask+0x72/0xa0
[  364.781872]  __mutex_lock_slowpath+0x13/0x20
[  364.781965]  mutex_lock+0x2e/0x40
[  364.782134]  l2arc_feed_thread+0x116/0x1a70 [zfs]
[  364.782229]  ? __switch_to_asm+0x34/0x70
[  364.782320]  ? __switch_to_asm+0x40/0x70
[  364.782410]  ? __switch_to_asm+0x34/0x70
[  364.782500]  ? __switch_to_asm+0x40/0x70
[  364.782591]  ? __switch_to_asm+0x34/0x70
[  364.782681]  ? __switch_to_asm+0x40/0x70
[  364.782772]  ? __switch_to_asm+0x34/0x70
[  364.782862]  ? __switch_to_asm+0x40/0x70
[  364.782952]  ? __switch_to_asm+0x34/0x70
[  364.783042]  ? __switch_to_asm+0x40/0x70
[  364.783135]  ? __switch_to_asm+0x34/0x70
[  364.783227]  ? __switch_to_asm+0x40/0x70
[  364.783317]  ? __switch_to_asm+0x40/0x70
[  364.783417]  ? __switch_to+0x7f/0x470
[  364.783507]  ? __switch_to_asm+0x40/0x70
[  364.783597]  ? __switch_to_asm+0x34/0x70
[  364.783699]  ? spl_kmem_free+0x33/0x40 [spl]
[  364.783826]  ? l2arc_remove_vdev+0x220/0x220 [zfs]
[  364.783923]  thread_generic_wrapper+0x79/0x90 [spl]
[  364.784020]  kthread+0x104/0x140
[  364.784111]  ? __thread_exit+0x20/0x20 [spl]
[  364.784202]  ? kthread_park+0x90/0x90
[  364.784292]  ret_from_fork+0x35/0x40
[  743.635861]  zd0: p1
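For reference, the 120-second threshold in the message above comes from the kernel's hung-task watchdog, and can be inspected or tuned via sysctl while debugging (a sketch only; this silences or delays the warning, it does not fix the underlying stall):

```shell
# Inspect the current hung-task watchdog threshold (needs /proc mounted)
sysctl kernel.hung_task_timeout_secs

# Raise the threshold to 600s while debugging, or set it to 0 to silence
# the warnings entirely, as the log message itself suggests. Needs root.
sudo sysctl -w kernel.hung_task_timeout_secs=600
```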

@liyimeng liyimeng added the Type: Defect Incorrect behavior (e.g. crash, hang) label May 7, 2022
@rincebrain
Contributor

I'd suggest using 2.1.4, for one thing - there've been a lot of random fixes that might or might not make a difference. (Maybe #12365, for example, or another L2ARC removal race that I don't have the link for offhand.)

@liyimeng
Author

liyimeng commented May 8, 2022

@rincebrain Thanks a lot. I find this is a real issue. Sometimes it keeps printing this with no end, and I have to hard-reset the system; after repeating that a couple of times, it gets through. I will upgrade to 2.1.4 to see if it helps.

@liyimeng
Author

liyimeng commented May 8, 2022

Even when I am lucky enough to get the system to boot, it is almost unusable. It keeps freezing now and then, and a quick fio test shows a miserable result:

bash-5.0# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=960M --numjobs=1 --runtime=60 --time_based --end_fsync=1

Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=124080: Sun May  8 20:24:32 2022
  write: IOPS=8, BW=35.7KiB/s (36.5kB/s)(22.4MiB/643285msec); 0 zone resets
    slat (usec): min=45, max=4592, avg=76.80, stdev=98.31
    clat (msec): min=2, max=218, avg=10.38, stdev= 8.46
     lat (msec): min=2, max=218, avg=10.46, stdev= 8.46
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    6], 40.00th=[    8], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[   11], 80.00th=[   17], 90.00th=[   21], 95.00th=[   27],
     | 99.00th=[   40], 99.50th=[   42], 99.90th=[   52], 99.95th=[   54],
     | 99.99th=[  220]
   bw (  KiB/s): min=  120, max= 1464, per=100.00%, avg=384.47, stdev=273.18, samples=119
   iops        : min=   30, max=  366, avg=96.12, stdev=68.30, samples=119
  lat (msec)   : 4=20.55%, 10=46.18%, 20=22.02%, 50=11.11%, 100=0.10%
  lat (msec)   : 250=0.03%
  cpu          : usr=0.02%, sys=0.06%, ctx=11322, majf=0, minf=5748
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5736,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=35.7KiB/s (36.5kB/s), 35.7KiB/s-35.7KiB/s (36.5kB/s-36.5kB/s), io=22.4MiB (23.5MB), run=643285-643285msec

@rincebrain
Contributor

If it's flooding your system with errors about the l2arc, if 2.1.4 doesn't help, I'd try removing the l2arc from the pool? (If you can't get it to respond well enough to do that, you could try just offlining it then removing it if that works, or you can physically detach the l2arc and then import the pool.)

What's the pool layout? Was this working until recently or after some update, or is this a new setup?
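A sketch of the removal steps suggested above, assuming a pool named `data` with a cache device `nvme0n1p3` and a log device `nvme0n1p1` (substitute the names shown by `zpool status` on your system):

```shell
# Optionally take the L2ARC device out of service first
sudo zpool offline data nvme0n1p3

# Remove the cache device, and the log device if desired
sudo zpool remove data nvme0n1p3
sudo zpool remove data nvme0n1p1

# Confirm the devices are gone
zpool status data
```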

@liyimeng
Author

liyimeng commented May 9, 2022

I have removed both the log and l2arc devices from the pool; that makes no difference at all. I am struggling to upgrade to 2.1.4 because I have trouble getting everything compiled and installed.

The setup is pretty simple: a raid0 stripe with SSD partitions as the cache and log devices.


$ sudo zpool status 
  pool: data
 state: ONLINE
  scan: scrub in progress since Sat May  7 21:40:47 2022
	684G scanned at 1.09G/s, 4.01G issued at 6.53M/s, 753G total
	0B repaired, 0.53% done, no estimated completion time
remove: Removal of vdev 0 copied 17.8G in 0h3m, completed on Mon Jan 11 18:02:51 2021
	114K memory used for removed device mappings
config:

	NAME          STATE     READ WRITE CKSUM
	data          ONLINE       0     0     0
	  sdd         ONLINE       0     0     0
	  sde         ONLINE       0     0     0
	logs	
	  nvme0n1p1   ONLINE       0     0     0
	cache
	  nvme0n1p3   ONLINE       0     0     0

errors: No known data errors

It was working great at the beginning; then I started to see the filesystem freeze now and then. By now it has become unusable.

I thought it was a disk fault and ran a couple of scrubs, but found no errors.

@h1z1

h1z1 commented May 9, 2022

The thread is blocked, yes, but what are the other CPUs doing when that is printed? I'm guessing something else is borked.
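One way to see what the other CPUs are doing at that moment is the kernel's SysRq interface (a debugging sketch; requires root, and SysRq must be enabled in the kernel):

```shell
# Temporarily allow all SysRq functions
echo 1 | sudo tee /proc/sys/kernel/sysrq

# 'l' dumps a backtrace of all active CPUs; 'w' dumps blocked (D-state) tasks
echo l | sudo tee /proc/sysrq-trigger
echo w | sudo tee /proc/sysrq-trigger

# The resulting traces land in the kernel log
sudo dmesg | tail -n 100
```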

@jittygitty

@liyimeng I suspect an issue with one of the drives in your raid0-type stripe. I could be wrong; did smartctl show anything interesting?

@liyimeng
Author

My drives are behind a hardware RAID controller, which does not support passthrough. I therefore configured a single-disk raid0 for each disk, and the controller hides all the smartctl metrics away :(


/ # smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Intel
Product:              RMS25CB080
Revision:             3.19
Compliance:           SPC-3
User Capacity:        1,998,998,994,944 bytes [1.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6001e67881d5e0002a0ee2abcb7e9e21
Serial number:        00219e7ecbabe20e2a00e0d58178e601
Device type:          disk
Local Time is:        Thu May 12 21:24:10 2022 UTC
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging
/ # smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Intel
Product:              RMS25CB080
Revision:             3.19
Compliance:           SPC-3
User Capacity:        1,998,998,994,944 bytes [1.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x6001e67881d5e0002a0ee2f7d00690fb
Serial number:        00fb9006d0f7e20e2a00e0d58178e601
Device type:          disk
Local Time is:        Thu May 12 21:24:54 2022 UTC
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

However, while I was hacking around, I accidentally screwed up the RAID configuration in the controller, making one of the disks drop out, so I had to destroy the pool and rebuild it with the same setup. Now it is back to normal again; e.g. I can get 170MB/s (vs 35KB/s) of 4k random writes now.

Let's see if the numbers drop after a while.

@jittygitty

jittygitty commented May 13, 2022

@liyimeng Depending on your RAID controller card, you might be able to easily "reflash" its firmware to put it into "IT" mode instead of RAID mode, if you're not using the RAID capabilities of the controller itself and are doing RAID in software.
It's possible this might even get smartctl working. (It's also possible that your recreation of the pool fixed the corruption, or merely bypassed it for now.)
(Example for LSI: https://www.ikus-soft.com/en/blog/2021-01-20-flash-lsi-controller-for-zfs/ )

@liyimeng
Author

@jittygitty Thanks for the hint! It is very tempting. I have this RAID card running on an old Intel server (https://ark.intel.com/content/www/us/en/ark/products/56253/intel-server-board-s2600gz.html); the model is LSI SAS 2208.

06:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS 2208 [Thunderbolt] (rev 05)
	Subsystem: Intel Corporation Integrated RAID Module RMS25CB080
	Physical Slot: 2-1

How risky is re-flashing the firmware? I am afraid that if I brick it, it will be difficult to find a replacement, since it is such an old card.

@jittygitty

@liyimeng Lol, I was gonna say just buy a cheap one on eBay first to experiment with flashing, but eBay is completely down:
"upstream connect error or disconnect/reset before headers. reset reason: overflow"

I've done it before on some Dell version of an LSI card; it was long ago, but I recall it was pretty easy, and I don't think it's very risky. But of course don't take my word for it, or blame me after you brick it :D

You can probably get help with that from the guys on the servethehome forum etc:
https://www.servethehome.com/lsi-sas-2208-raid-controller-information-listing/
https://forums.servethehome.com/index.php?threads/the-complete-lsi-avago-broadcom-sbr-download-megathread.33607/

@liyimeng
Author

@jittygitty Thanks, I need to think twice :) It is the only server I can perform work on at the moment. I will wait a while before trying that out.

@liyimeng
Author

@jittygitty According to https://forums.servethehome.com/index.php?threads/the-complete-lsi-avago-broadcom-sbr-download-megathread.33607/post-326204
It seems I am out of luck.

Meanwhile, I suspect your speculation may be right: I am observing the pool slowing down after just two days of use.

@jittygitty

jittygitty commented May 17, 2022

@liyimeng I think it's doable, since Intel ones are being sold on eBay already flashed to IT mode and specifically advertised for ZFS:
$45 (offers accepted) https://www.ebay.com/itm/144469452953?
These are cheap at $29 (or lower/offer) https://www.ebay.com/itm/165377309689
But some say they'll only work in an "INTEL" motherboard: https://forums.servethehome.com/index.php?threads/strange-intel-sas-raid-card.5030/

If you do recreate your stripe yet again, maybe to try pairing with a different drive, you could connect the disks to the onboard motherboard SATA ports and try to get some smartctl readings that way for now.

@liyimeng
Author

@jittygitty Thanks a lot. It is very nice of you. I tried this https://plone.lucidsolutions.co.nz/hardware/sas-controller/lsi-2208/configuring-jbod-with-a-lsi-2208-controller/view

The author warns that it is not stable; I don't know exactly what he means. In my case, I could not boot the OS from a disk inside the RAID, so I worked around it by booting from a disk outside the RAID instead.

Now I can run smartctl on the disks :D

I don't know how to read this output. Meanwhile, I am still working on upgrading to 2.1.4.


/ # smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-88-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile (SMR)
Device Model:     WDC WD20SPZX-22UA7T0
Serial Number:    WD-WXL2E107D1PX
LU WWN Device Id: 5 0014ee 2680ecfcd
Firmware Version: 01.01A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:33:59 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22740) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 365) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   220   219   021    Pre-fail  Always       -       1991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11786
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       793
194 Temperature_Celsius     0x0022   115   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/ # smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.0-88-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile (SMR)
Device Model:     WDC WD20SPZX-22UA7T0
Serial Number:    WD-WXA2E3076TCU
LU WWN Device Id: 5 0014ee 2680edb82
Firmware Version: 01.01A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:27:42 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22260) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 368) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   220   219   021    Pre-fail  Always       -       2000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11785
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       420
194 Temperature_Celsius     0x0022   113   111   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

@putnam

putnam commented Jun 21, 2022

Ever since adding a cache device, this happens to me on every boot when there is a long import going on (using 2.1.4).

e.g.:

[screenshot of console messages showing the blocked-task trace during import]

I assume it's harmless but I have no idea.

@imero

imero commented Oct 14, 2022

I have had the same problem for a long time: boot takes up to 20 minutes on a small server with an 8-disk raidz2 and 2x NVMe cache. I always update to the latest ZFS version, but had not seen noticeable changes up to and including 2.1.5.

[ 1088.601676] kernel: INFO: task l2arc_feed:1431 blocked for more than 966 seconds.
[ 1088.602210] kernel: Tainted: P OE 5.8.0-31-generic #33-Ubuntu
[ 1088.602700] kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.603192] kernel: l2arc_feed D 0 1431 2 0x00004000

And the "a start job is running for import zfs pools by cache file" message.

Today I installed ZFS 2.1.6 and the problem is completely gone; boot now takes a few seconds!
I cannot tell from the release notes which fix is the factor, and I have also applied regular Ubuntu updates, so it is not a completely isolated test, but if anyone has the same problem, try it out. I use the JonathonF PPA for the ZFS install.
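For reference, installing from that PPA on Ubuntu looks roughly like this (a sketch; it assumes the PPA path is `ppa:jonathonf/zfs` and the standard Ubuntu ZFS package names):

```shell
sudo add-apt-repository ppa:jonathonf/zfs
sudo apt update

# zfs-dkms rebuilds the kernel modules for the running kernel;
# zfsutils-linux provides the zpool/zfs CLI tools
sudo apt install zfs-dkms zfsutils-linux
```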

@liyimeng
Author

liyimeng commented Oct 15, 2022

It is kind of random. I am still on 2.1.4, but it has disappeared for a while; I am not sure if it will come back. I seldom reboot my computer, maybe once every three months, so I am not sure whether the problem is still there.

@jay-to-the-dee

jay-to-the-dee commented Oct 31, 2022

I had this also on Ubuntu 22.04 with the default ZFS package (so ZFS 2.1.4), but after installing ZFS 2.1.6 from the JonathonF PPA the stack traces have gone away.

Importing a degraded RAID-Z3 pool (missing a single drive) still takes a few minutes, though, even though I have ensured the cachefile is up to date.

@morganchristiansson

I had this issue due to an extreme number of snapshots created by the containerd ZFS snapshotter plugin.

Reducing the number of snapshots made it fast again.
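To check whether snapshot count is the culprit, something like the following can help (a sketch; `tank/fs` and the snapshot names are placeholders, and the `%` range syntax for `zfs destroy` should always be dry-run first with `-n`):

```shell
# Count all snapshots on the system
zfs list -H -t snapshot -o name | wc -l

# Dry-run deleting a range of snapshots on one dataset (-n = no-op, -v = verbose).
# 'tank/fs@old%new' expands to every snapshot between the two, inclusive.
sudo zfs destroy -nv tank/fs@snap-2022-01-01%snap-2022-06-01
```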
