
L2ARC failed to rebuild after reboot #11787

Open
aaronjwood opened this issue Mar 23, 2021 · 85 comments
Labels
Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)


@aaronjwood

aaronjwood commented Mar 23, 2021

System information

Type Version/Name
Distribution Name Debian (Proxmox)
Distribution Version 10
Linux Kernel 5.4.103-1-pve
Architecture x64
ZFS Version 2.0.3-pve2
SPL Version 2.0.3-pve2

Describe the problem you're observing

Had a system up and running for months. L2ARC was a few hundred GB. Rebooted the system and found that the L2ARC was empty.

Describe how to reproduce the problem

Not sure about this. I've rebooted in the past and had my L2ARC preserved. I haven't been able to find any logs around this either. If anyone has suggestions on what I can check I'd be happy to post it here. FWIW /sys/module/zfs/parameters/l2arc_rebuild_enabled is set to 1.

One difference from the last reboot I did where it worked: I had a kernel update this time. Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

@aaronjwood aaronjwood added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Mar 23, 2021
@gamanakis
Contributor

gamanakis commented Mar 26, 2021

Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

No, that is not the case. Curious though what caused this. If it happens again could you post a "cat /proc/spl/kstat/zfs/arcstats" after it fails to rebuild the L2ARC contents? That might give us a clue.

You can also look in zpool history to see if there is anything L2ARC related.

@aaronjwood
Author

Is there some special handling that forces the L2ARC to not rebuild when a new kernel version is used or something?

No, that is not the case. Curious though what caused this. If it happens again could you post a "cat /proc/spl/kstat/zfs/arcstats" after it fails to rebuild the L2ARC contents? That might give us a clue.

You can also look in zpool history to see if there is anything L2ARC related.

Yeah, will do. Haven't been able to reproduce it yet. I don't see anything relevant in zpool history unfortunately.

@aaronjwood
Author

Still haven't reproduced this yet but saw something that I thought I should share:

...
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    1
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
...

Note the l2_rebuild_dh_errors field. I'm not clear on what this field means or if it's relevant to the issue I hit.
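A quick way to watch just this counter across reboots (sketched here against a pasted sample; on a live system the source would be /proc/spl/kstat/zfs/arcstats):

```shell
# Sample arcstats lines as pasted above; on a live system, substitute
#   awk '...' /proc/spl/kstat/zfs/arcstats
sample='l2_rebuild_success              4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    1'

# "dh" appears to stand for device header: this counter increments when the
# persistent L2ARC header cannot be read back and validated at pool import.
dh_errors=$(printf '%s\n' "$sample" | awk '$1 == "l2_rebuild_dh_errors" { print $3 }')
echo "dh_errors=$dh_errors"
```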

@gamanakis
Contributor

gamanakis commented Apr 26, 2021 via email

@aaronjwood
Author

This would cause the behavior I saw, right? Is there anything else within ZFS that I can look at that might indicate why this header got corrupted?

@aaronjwood
Author

I just had this happen to me again last night due to a power outage. Once my server came back I saw my L2ARC was empty again. FWIW l2_rebuild_dh_errors is now sitting at 3.

Anything else I can check here to get more information on why I keep failing to rebuild?

@gamanakis
Contributor

gamanakis commented May 4, 2021 via email

@aaronjwood
Author

zdb -lll /dev/nvme0n1
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

I can try removing and adding it back. It's interesting you say that because I had done that (for another reason) after upgrading to ZFS 2.0.0. I'll do it again right now and see if I can reproduce the issue again.

@aaronjwood
Author

Even after re-adding the device:

zdb -lll /dev/nvme0n1
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

Also tried with the ID which is how I actually add it to the pool:

zdb -lll /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

@gamanakis
Contributor

gamanakis commented May 4, 2021 via email

@gamanakis
Contributor

gamanakis commented May 4, 2021 via email

@aaronjwood
Author

aaronjwood commented May 4, 2021

Hah, wow! Guess I am lucky :) Some more info for context:

The drive seems to work fine as is, regardless of being used as an L2ARC or not. Here's more data about the drive:

=== START OF INFORMATION SECTION ===
Model Number:                       HP SSD EX950 2TB
Serial Number:                      HBSE59340600764
Firmware Version:                   SS0411B
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000000
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,073,818,112 [2.00 TB]
Namespace 1 Utilization:            27,793,907,712 [27.7 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue May  4 10:59:34 2021 PDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    100,762,616 [51.5 TB]
Data Units Written:                 81,736,109 [41.8 TB]
Host Read Commands:                 842,433,926
Host Write Commands:                1,132,098,062
Controller Busy Time:               21,608
Power Cycles:                       262
Power On Hours:                     7,994
Unsafe Shutdowns:                   103
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Disk /dev/nvme0n1: 1.8 TiB, 2000073818112 bytes, 3906394176 sectors
Disk model: HP SSD EX950 2TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 755545AB-39E5-8747-989A-8216460E3F6E

Device              Start        End    Sectors  Size Type
/dev/nvme0n1p1       2048 3906377727 3906375680  1.8T Solaris /usr & Apple ZFS
/dev/nvme0n1p9 3906377728 3906394111      16384    8M Solaris reserved 1

ZFS is the one managing the label, partition, etc. of the entire disk, right?

I'll try your suggestion above and post back in a bit.

@aaronjwood
Author

zpool remove vault nvme-HP_SSD_EX950_2TB_HBSE59340600764
zpool labelclear nvme-HP_SSD_EX950_2TB_HBSE59340600764
zpool add vault cache nvme-HP_SSD_EX950_2TB_HBSE59340600764
zdb -lll /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

Is ZFS corrupting this?

@aaronjwood
Author

FWIW this also fails:

zdb -lll /dev/nvme0n1p9

but this works:

zdb -lll /dev/nvme0n1p1

It dumps WAY too much info for me to post here. Let me know if you want me to attach it if you think it'd be useful/relevant.

@gamanakis
Contributor

gamanakis commented May 4, 2021

Could you do a zdb -l /dev/nvme0n1p1 and post here?

@aaronjwood
Author

For sure:

------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 6053299143396366987
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 8691708886196254037
    flags: 1
    start_lbps[0]: 31282282496
    start_lbps[1]: 31148523520
    log_blk_ent: 1022
    start: 4198400
    end: 2000063823872
    evict: 4198400
    lb_asize_refcount: 6148096
    lb_count_refcount: 315
    trim_action_time: 1620151866
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:	 315 with valid cksum
		 0 with invalid cksum
log_blk_asize:	 6148096

@gamanakis
Contributor

gamanakis commented May 4, 2021

So the device is still on its first pass (it has not been filled to the end) and has written about 31 GB out of a total capacity of about 2 TB.
You also seem to have TRIM for L2ARC enabled, right?
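The "about 31 GB" figure can be recovered from the header fields directly; a minimal sketch (values copied from the zdb output above, treating start_lbps[0] as the most recently written log-block offset, which holds while the device is on its first pass):

```shell
# Values from the zdb L2ARC device header above, in bytes.
start=4198400            # start: first usable offset on the cache device
end=2000063823872        # end: end of the usable region
lbps0=31282282496        # start_lbps[0]: most recent log block pointer

written=$((lbps0 - start))
capacity=$((end - start))
echo "written ~$((written / 1000000000)) GB of ~$((capacity / 1000000000)) GB"
```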

I suspect that the TRIM code may be interfering with the label (trimming it out), although the device is still on its first pass.
You also use Proxmox, so ZFS is installed from packages, correct?
Is the reported ZFS version (zfs -V) 2.0.3-pve2 as in the original post?

@aaronjwood
Author

Yeah, it was much more full before the power-loss event last night :) Since I've been removing/adding it today, it's pretty small right now, as you said.

I do have TRIM enabled. I noticed that a TRIM is kicked off whenever I add this drive to a pool; I'm guessing that's intended ZFS behavior. My conf for reference:

options zfs zfs_arc_max=40802189312
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=1074000000
options zfs l2arc_trim_ahead=100

Yes to your last question about Proxmox and ZFS. Since I reported the issue, a few patches have come in:

zfs-2.0.4-pve1
zfs-kmod-2.0.4-pve1

@gamanakis
Contributor

Yes, adding a cache device to a pool with l2arc_trim_ahead>0 triggers a TRIM of the whole device.

Notably your device reports trim_state=4 meaning that TRIM was successfully completed. My initial thought was that TRIM of the cache device was interrupted somehow leaving it in an incomplete state and leading to re-trimming every time the pool was imported. This doesn't seem to be the case here though.

@aaronjwood
Author

Interesting. Let me know if there's any other info you want to see, or if there's anything else you want me to try. I'm not quite sure what to do otherwise.

@aaronjwood
Author

This is a long shot, but there shouldn't be any weirdness in working with the reported (fake) sector size of 512/512, right?

@gamanakis
Contributor

I do not think so; the sector size is handled by core ZFS code, and the L2ARC code has barely anything to do with it.

@aaronjwood
Author

How come we do see the header when doing zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764 or zdb -l /dev/nvme0n1p1 but not zdb -l /dev/nvme0n1 or zdb -l /dev/disk/by-id/nvme-HP_SSD_EX950_2TB_HBSE59340600764? Should the header be present on both partitions here?

BTW I tried the process of removing the L2ARC device, labelclear'ing it, and adding it back all while l2arc_trim_ahead was set to 0. Still see the same behavior.

@aaronjwood
Author

aaronjwood commented May 21, 2021

One other thing I noticed: when adding the device to the pool with l2arc_trim_ahead set to 100 I see this until the trimming has completed:

zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 2538846175622296943
    labels = 0 1 2 3
L2ARC device header not found

Once it's finished it sees the header:

zdb -l nvme-HP_SSD_EX950_2TB_HBSE59340600764
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 2538846175622296943
    labels = 0 1 2 3
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 8691708886196254037
    flags: 1
    start_lbps[0]: 15741607936
    start_lbps[1]: 15607631872
    log_blk_ent: 1022
    start: 4198400
    end: 2000063823872
    evict: 4198400
    lb_asize_refcount: 2437120
    lb_count_refcount: 119
    trim_action_time: 1621575040
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   119 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2437120

Is this expected behavior? The L2ARC device shows as ONLINE from a zpool status as it is trimming (when the L2ARC header isn't present).

@gamanakis
Contributor

gamanakis commented May 21, 2021 via email

@aaronjwood
Author

Ok, thanks. Here's what I have:

options zfs zfs_arc_max=40802189312
options zfs l2arc_noprefetch=0
options zfs l2arc_write_max=1074000000
options zfs l2arc_trim_ahead=100

FWIW I've also tried upgrading to 5.11 yesterday (https://forum.proxmox.com/threads/kernel-5-11.86225/#post-378338) but the same behavior persists across all of the scenarios I've gone through above. You can assume going forward that I'm now on Linux server 5.11.17-1-pve #1 SMP PVE 5.11.17-1~bpo10 (Wed, 12 May 2021 12:45:37 +0200) x86_64 GNU/Linux

@gamanakis
Contributor

gamanakis commented May 21, 2021

I do not think kernel updates affect this. In your case the label is destroyed before the device is completely filled with data (i.e. l2ad_first=1, seen as flags: 1 in the zdb output), which excludes a good chunk of code.

Upon importing the pool the following happens:

  1. l2arc_add_vdev() creates the l2arc related pointer storing the properties of the device
  2. l2arc_rebuild_vdev() reads the device header; if it is faulty and l2arc_trim_ahead is set, the device is marked to be trimmed.

If the device is marked for trimming or for rebuild, no writes take place until those flags are cleared.
Let me take another look.

@gamanakis
Contributor

Do you get this problem only after power outage or does it also occur after a normal export of the pool or reboot?

@gamanakis
Contributor

Could you post a cat /proc/spl/kstat/zfs/arcstats?

@aaronjwood
Author

Sure

13 1 0x01 123 33456 3819723010 4877793644536
name                            type data
hits                            4    16758513
misses                          4    4839220
demand_data_hits                4    4983938
demand_data_misses              4    3700930
demand_metadata_hits            4    11762719
demand_metadata_misses          4    74750
prefetch_data_hits              4    7121
prefetch_data_misses            4    1054043
prefetch_metadata_hits          4    4735
prefetch_metadata_misses        4    9497
mru_hits                        4    6172365
mru_ghost_hits                  4    231877
mfu_hits                        4    10576654
mfu_ghost_hits                  4    132528
deleted                         4    3240807
mutex_miss                      4    884
access_skip                     4    1
evict_skip                      4    3093
evict_not_enough                4    1
evict_l2_cached                 4    31861516288
evict_l2_eligible               4    17000919552
evict_l2_eligible_mfu           4    5533679616
evict_l2_eligible_mru           4    11467239936
evict_l2_ineligible             4    47326208
evict_l2_skip                   4    0
hash_elements                   4    6754248
hash_elements_max               4    7488933
hash_collisions                 4    5733669
hash_chains                     4    1620259
hash_chain_max                  4    9
p                               4    29556899712
c                               4    34359738368
c_min                           4    2104755968
c_max                           4    34359738368
size                            4    34138237744
compressed_size                 4    31164290048
uncompressed_size               4    34767844352
overhead_size                   4    1050404864
hdr_size                        4    1607998064
data_size                       4    31300067840
metadata_size                   4    914627072
dbuf_size                       4    56927232
dnode_size                      4    68893408
bonus_size                      4    21324480
anon_size                       4    6993920
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    26566782976
mru_evictable_data              4    25178240512
mru_evictable_metadata          4    210531328
mru_ghost_size                  4    4929249792
mru_ghost_evictable_data        4    2124873728
mru_ghost_evictable_metadata    4    2804376064
mfu_size                        4    5640918016
mfu_evictable_data              4    4969264640
mfu_evictable_metadata          4    296652288
mfu_ghost_size                  4    9956258816
mfu_ghost_evictable_data        4    9364298240
mfu_ghost_evictable_metadata    4    591960576
l2_hits                         4    42920
l2_misses                       4    2693922
l2_prefetch_asize               4    3776512
l2_mru_asize                    4    32493990912
l2_mfu_asize                    4    7601672192
l2_bufc_data_asize              4    39115636736
l2_bufc_metadata_asize          4    983802880
l2_feeds                        4    4714
l2_rw_clash                     4    0
l2_read_bytes                   4    405105664
l2_write_bytes                  4    57483476480
l2_writes_sent                  4    1884
l2_writes_done                  4    1884
l2_writes_error                 4    0
l2_writes_lock_retry            4    196
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    292
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    43661074432
l2_asize                        4    40099439616
l2_hdr_size                     4    167972640
l2_log_blk_writes               4    6893
l2_log_blk_avg_asize            4    17398
l2_log_blk_asize                4    99170304
l2_log_blk_count                4    6893
l2_data_to_meta_ratio           4    584
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    9
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    67352190976
memory_free_bytes               4    21224951808
memory_available_bytes          3    18871056640
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    2837742896
arc_meta_limit                  4    25769803776
arc_dnode_limit                 4    2576980377
arc_meta_max                    4    3266657552
arc_meta_min                    4    16777216
async_upgrade_sync              4    26438
demand_hit_predictive_prefetch  4    889837
demand_hit_prescient_prefetch   4    0
arc_need_free                   4    0
arc_sys_free                    4    2353895168
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    427008

@gamanakis
Contributor

At this point I am sure the L2ARC fails to rebuild because it cannot physically read the device header from the device (i.e. it's not that it reads a header that is invalid; it actually fails to read from the device). Why, though, eludes me.

@h1z1

h1z1 commented Nov 18, 2021

Could it be something silly, like the device itself not being available yet at reboot? Are the reboots proper shutdowns or the result of something else (a power failure was mentioned)?

@aaronjwood
Author

I can answer the second part :) Other than the one time where I mentioned it was a power failure, everything else has been from me doing sudo reboot.

@gamanakis
Contributor

gamanakis commented Nov 19, 2021

I was mistaken: the IO error returned is actually 52, EBADE. In OpenZFS on Linux, EBADE is used to declare checksum errors, and zio_wait() was modified to actually perform a check against the checksum embedded in the header. So it's not a failure to physically read from the device, but more likely a checksum error (ZIO_CHECKSUM_LABEL). The question is why.

@gamanakis
Contributor

My assumption is that for some reason the header may be written incompletely while the reboot is happening, leading to corruption in corner cases. I also noticed that I had used ZIO_PRIORITY_ASYNC_WRITE as the priority in the write command.

@aaronjwood do you have the capacity to try the following patch:

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index 79e2d4381..dbe029ce7 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -10744,7 +10744,7 @@ l2arc_dev_hdr_update(l2arc_dev_t *dev)

        err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
            VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
-           NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));
+           NULL, ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));

        abd_free(abd);

@aaronjwood
Author

Sorry, unfortunately I am not able to deploy a custom build of ZFS on this environment :(

@alexdrl

alexdrl commented Apr 20, 2022

I was mistaken: the IO error returned is actually 52, EBADE. In OpenZFS on Linux, EBADE is used to declare checksum errors, and zio_wait() was modified to actually perform a check against the checksum embedded in the header. So it's not a failure to physically read from the device, but more likely a checksum error (ZIO_CHECKSUM_LABEL). The question is why.

I am having the exact same error IO 52 on TrueNAS Scale RC.2 after a reboot.

@devZer0

devZer0 commented Dec 7, 2022

On my system with ZFS 2.1.6, after a reboot the L2ARC shows "0" in the alloc/capacity column. Before the reboot, 1.5 GB was allocated there. secondarycache=metadata

root@pbs01:~# cat /proc/spl/kstat/zfs/arcstats |grep l2
evict_l2_cached                 4    0
evict_l2_eligible               4    349184
evict_l2_eligible_mfu           4    98304
evict_l2_eligible_mru           4    250880
evict_l2_ineligible             4    6656
evict_l2_skip                   4    0
l2_hits                         4    0
l2_misses                       4    770
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    227
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    2
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0

@shodanshok
Contributor

shodanshok commented Jan 12, 2023

I was just hit by this issue: after a reboot, the L2ARC was not rebuilt. Before the reboot it was at 870/380 GB (size/compressed), while now it is at 23.5/11.5 GB.

# arc_summary.py
L2ARC size (adaptive):                                          23.5 GiB
        Compressed:                                    49.0 %   11.5 GiB
# cat arcstats | grep l2_reb
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    6
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0

What does l2_rebuild_unsupported 4 6 mean?

EDIT: from here it seems that no correct L2ARC headers were found. This is the first time I have experienced the issue, and it happened at the first reboot after I manually trimmed the pool and then enabled autotrim.

My zfs module options are:
options zfs l2arc_noprefetch=0 l2arc_headroom=0 l2arc_write_max=268435456 l2arc_trim_ahead=100

Reading the previous posts seems to confirm we have some sort of "interference" between trim (manual and/or automatic) and persistent L2ARC. Running zdb -l on my cache disks:

# zdb -l /dev/disk/by-path/pci-0000\:01\:00.1-ata-3.0-part5 
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 11135912476541031070
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 11455676591258953768
    flags: 1
    start_lbps[0]: 7158409728
    start_lbps[1]: 7103791616
    log_blk_ent: 1022
    start: 4194816
    end: 451118891008
    evict: 4194816
    lb_asize_refcount: 2567168
    lb_count_refcount: 117
    trim_action_time: 1673546173
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   117 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2567168

# zdb -l /dev/disk/by-path/pci-0000\:01\:00.1-ata-4.0-part5 
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    state: 4
    guid: 16280120629556133767
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 11455676591258953768
    flags: 1
    start_lbps[0]: 7077579776
    start_lbps[1]: 7022830080
    log_blk_ent: 1022
    start: 4194816
    end: 451118891008
    evict: 4194816
    lb_asize_refcount: 2496000
    lb_count_refcount: 114
    trim_action_time: 1673546173
    trim_state: 4

------------------------------------
L2ARC device log blocks
------------------------------------
log_blk_count:   114 with valid cksum
                 0 with invalid cksum
log_blk_asize:   2496000

ZFS debug log show the following info related to my cache disks:

# cat dbgmsg | grep ata-[34]
1673546157   spa_history.c:307:spa_history_log_sync(): txg 42177271 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-3.0-part5 activated
1673546157   spa_history.c:307:spa_history_log_sync(): txg 42177271 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-4.0-part5 activated
1673546173   spa_history.c:307:spa_history_log_sync(): txg 42177278 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-4.0-part5 complete
1673546173   spa_history.c:307:spa_history_log_sync(): txg 42177278 trim vdev=/dev/disk/by-path/pci-0000:01:00.1-ata-3.0-part5 complete

I'm on Rocky Linux 9.1, kernel 5.14.0-162.6.1.el9_1.0.1.x86_64, ZFS 2.1.7-1

Regards.

@rokezu

rokezu commented Feb 27, 2023

I was having the same issue in a newly configured pool to which I added an L2ARC SSD a couple of days later. The L2ARC was listed in the zpool iostat output, but 0 bytes were allocated across reboots.

I'm using Arch Linux, so after adding the L2ARC I rebuilt the zpool.cache file with:

zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

Then I regenerated the init image with mkinitcpio (which, I assume, embeds that cache file in the image), rebooted, and now there are 26 GB allocated in the L2ARC. Arch uses mkinitcpio to build the initrd, as other systems do, so the zfs module is loaded through that and presumably uses that zpool.cache to import the pool. I haven't checked the internals of how it works.

It seems to me that upon rebooting, at least on Arch, the pool is imported from that cache file, and I hadn't added the L2ARC device to it until later. But now the cache is persistent across reboots (or so it seems after a couple of them).

Just a couple of cents for somebody stumbling upon this.

@clhedrick

clhedrick commented Mar 16, 2023

I had the same problem: a reboot after a crash left the L2ARC empty. This was the first reboot after adding it.

arcstat shows
l2_rebuild_dh_errors 4 6

I just rebuilt the cachefile, based on a comment here. However zdb -C doesn't show the cache devices, and there is no evidence of them in the cache file. Is this normal?

zdb -l /dev/sdbx1 shows labels.
zdb -l /dev/sdbx does not.
The cache appears to be working normally.

The log file shows

Mar 13 15:29:38 eternal.lcsr.rutgers.edu systemd[1]: Starting Import ZFS pools by cache file...
Mar 13 15:29:39 eternal.lcsr.rutgers.edu zpool[5184]: cannot import 'big': one or more devices is currently unavailable
Mar 13 15:30:24 eternal.lcsr.rutgers.edu zpool[5184]: cachefile import failed, retrying
Mar 13 15:30:24 eternal.lcsr.rutgers.edu systemd[1]: Finished Import ZFS pools by cache file.

I wonder if there's a timing issue. If so, it might be a good idea to add the cache devices to the cache file, so that import will wait until they are ready.

@richardm1

I've just hit this bug after removing my cache device (a partition on an NVMe drive) and re-adding it.

arc.c:10440:l2arc_dev_hdr_read(): L2ARC IO error (52) while reading device header, vdev guid:

zpool status shows a cache device:

NAME          STATE     READ WRITE CKSUM
 rustpool      ONLINE       0     0     0
   mirror-0    ONLINE       0     0     0
     sda       ONLINE       0     0     0
     sdb       ONLINE       0     0     0
   mirror-1    ONLINE       0     0     0
     sdc       ONLINE       0     0     0
     sdh       ONLINE       0     0     0
 logs
   nvme1n1p2   ONLINE       0     0     0
 cache
   nvme1n1p3   ONLINE       0     0     0

But arc_summary does not:


------------------------------------------------------------------------
ZFS Subsystem Report                            Thu May 18 21:36:37 2023
Linux 6.2.14-1-clear                                            2.1.11-1
Machine: rm-desktop (x86_64)                                    2.1.11-1

L2ARC not detected, skipping section

I've tried export/re-importing the pool, I've rebooted several times, and I've secure-wiped the partition and re-created it as a bare partition with no filesystem. I've turned off l2arc_rebuild_enabled. I'm running out of ideas for this. I guess I could try throwing the whole drive at it?

@shodanshok
Copy link
Contributor

If I am not mistaken, I have sometimes seen arc_summary report "L2ARC not detected" when the L2ARC is totally empty. Have you tried forcing some data into L2ARC (e.g. by reading some large file)? What does zpool iostat -v show?
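For reference, a minimal way to generate L2ARC traffic and then look for it (file and pool names are examples; requires a live pool):

```shell
# Read a large file a few times; buffers evicted from ARC become
# eligible to be written to the cache device.
for i in 1 2 3; do cat /tank/some-large-file > /dev/null; done

# The cache vdev should now show nonzero alloc and write activity:
zpool iostat -v tank

# And arc_summary should print an L2ARC section instead of
# "L2ARC not detected":
arc_summary -s l2arc
```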

@richardm1
Copy link

If I am not mistaken, I have sometimes seen arc_summary report "L2ARC not detected" when the L2ARC is totally empty. Have you tried forcing some data into L2ARC (e.g. by reading some large file)? What does zpool iostat -v show?

You're a genius. My arcstat alias with custom columns was erroring out:

No L2ARC Here
l2hit%
No L2ARC Here
l2hits
No L2ARC Here
l2miss
No L2ARC Here
l2data
No L2ARC Here
l2mfu
No L2ARC Here
l2mru
No L2ARC Here
l2size
No L2ARC Here
l2asize
No L2ARC Here
l2bytes

Then I fired up a VM and it started working a few seconds later:

hit%  hits  miss  phit  ph%  mh%   mfu   mru  mfug  mrug     c  size  free  avail  arcsz  l2hit%  l2hits  l2miss  l2data  l2mfu  l2mru  l2size  l2asize  l2bytes
   0     0     0     0    0    0     0     0     0     0  3.9G  429M  118G   114G   429M       0       0       0       0   5.5K      0     17K     5.5K        0
  78   370   100     0    0   97   252   117     0     0  3.9G  383M  117G   113G   383M       0       0     100    1.6M   2.1M      0    3.5M     2.1M        0
  62   341   209     0    0   92   171   170     0     0  3.9G  400M  115G   111G   400M       0       0     209    6.2M   7.3M      0     10M     7.3M        0

zpool iostat -v did show a few writes to the cache device beforehand -- apparently it wasn't enough.

@rokezu
Copy link

rokezu commented Jun 4, 2023

Was having the same issue in a newly configured pool to which I had recently added an L2ARC SSD (a couple of days later). The L2ARC was listed in the zpool iostat output, but showed 0 bytes alloc across reboots.

Recently I had to change the L2ARC SSD, and after 10 reboots or so it still refused to hold data (iostat alloc = 0).

I was using the l2arc_headroom=0 zfs module parameter beforehand, so I changed it back to the default and added an explicit rebuild with l2arc_headroom=2 l2arc_rebuild_enabled=1, and now it is working.

Also, before it started working again, arc_summary.py showed lines like these at zero (now it shows a few megabytes), which had me wondering whether the L2ARC, while being filled, wasn't actually being used:

L2ARC size (adaptive):                                         153.8 GiB
        Header size:                                  0.0 %    0.0 MiB

@andrei-trybukhouski
Copy link

I faced the same issue as richardm1

cat /proc/spl/kstat/zfs/dbgmsg 
L2ARC IO error (52) while reading device header
ws1 andrey # arc_summary -s l2arc

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Jun 09 12:45:07 2023
Linux 6.1.2zen1-v2                                2.1.5-1ubuntu6~22.04.1
Machine: ws1 (x86_64)                             2.1.5-1ubuntu6~22.04.1

L2ARC not detected, skipping section
s1 andrey # zpool status
  pool: pool0
 state: ONLINE
config:

	NAME         STATE     READ WRITE CKSUM
	pool0        ONLINE       0     0     0
	  sda1       ONLINE       0     0     0
	cache
	  nvme0n1p5  ONLINE       0     0     0

errors: No known data errors

I tried everything that was written here; the cache never survived a reboot. I didn't see this error before: it appeared after recreating the pool with ashift=12.

@shodanshok
Copy link
Contributor

@gamanakis Any ideas on what is failing here?

@gamanakis
Copy link
Contributor

gamanakis commented Jun 9, 2023

There seems to be a problem with the ashift of the L2ARC device not staying constant when the pool has an ashift other than 9; see issue #14313. This makes the L2ARC device header unreadable, so its contents cannot be rebuilt into ARC.
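To confirm you are hitting this, the header-read failure shows up in the ZFS debug log. A sketch (the `DBGMSG` override is only so the snippet can be pointed at a saved copy; `zfs_dbgmsg_enable=1` must be set for the log to be populated):

```shell
# Look for L2ARC messages in the ZFS debug log.
DBGMSG=${DBGMSG:-/proc/spl/kstat/zfs/dbgmsg}
grep -i 'l2arc' "$DBGMSG" 2>/dev/null || echo "no L2ARC messages logged"
```

An `L2ARC IO error (52) while reading device header` line here is the error reported earlier in this thread.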

@aaronjwood
Copy link
Author

Excellent find! Glad there is some idea of the root of the problem. The seemingly random loss of the l2arc on reboot has been killer :(

@gamanakis
Copy link
Contributor

Just opened #14963 for #14313. Anyone interested, feel free to give it a try.

@rokezu
Copy link

rokezu commented Jun 9, 2023

Thanks for looking into this, I can only share that indeed I use ashift=12 for my 4K drives, but didn't specify one for the L2ARC device (didn't catch that in the documentation?), that could be the reason indeed. Can't test it right now as my L2ARC is now working after several reboots and tries, not sure what triggered it to rebuild, but it's fine now!

@aaronjwood
Copy link
Author

Thanks for looking into this, I can only share that indeed I use ashift=12 for my 4K drives, but didn't specify one for the L2ARC device (didn't catch that in the documentation?), that could be the reason indeed. Can't test it right now as my L2ARC is now working after several reboots and tries, not sure what triggered it to rebuild, but it's fine now!

Wanted to share the same, I use an ashift of 12 in the pool that my cache device lives in.

FWIW since experiencing the issue long ago my cache device has changed from an HP EX950 to a set of 3 SATA SSDs (different brands and sizes) to now currently an Intel 670p. No matter the device(s) I always hit the issue in a seemingly random manner.

@andrei-trybukhouski
Copy link

Thanks @gamanakis, I applied #14963 on top of master, but unfortunately it didn't solve the problem.
What I did: rebuilt the code, rebooted, removed the old cache device, wiped it, added a new one with zpool add -o ashift=12 pool0 cache /dev/nvme0n1p5 (this is the same old pool with ashift=12), filled the cache, and rebooted.
I don't see the L2ARC error anymore, but after a reboot the pool doesn't mount.


ws1 andrey # zpool import pool0
ws1 andrey # zpool list -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool0          97G  31.9G  65.1G        -         -     0%    32%  1.00x    ONLINE  -
  sdb4       97.2G  31.9G  65.1G        -         -     0%  32.9%      -    ONLINE
cache            -      -      -        -         -      -      -      -         -
  nvme0n1p5     1G      0  1019M        -         -     0%  0.00%      -    ONLINE 
ws1 andrey # cat /proc/spl/kstat/zfs/dbgmsg
timestamp    message 
1686418516   spa.c:6371:spa_tryimport(): spa_tryimport: importing pool0
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): LOADING
1686418516   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/sdb4': best uberblock found for spa $import. txg 537
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=537
1686418516   spa.c:8514:spa_async_request(): spa=$import async request task=4096
1686418516   spa.c:8514:spa_async_request(): spa=$import async request task=2048
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): LOADED
1686418516   spa_misc.c:417:spa_load_note(): spa_load($import, config trusted): UNLOADING
1686418516   spa.c:6223:spa_import(): spa_import: importing pool0
1686418516   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): LOADING
1686418516   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/sdb4': best uberblock found for spa pool0. txg 537
1686418516   spa_misc.c:417:spa_load_note(): spa_load(pool0, config untrusted): using uberblock with txg=537
1686418516   spa.c:8514:spa_async_request(): spa=pool0 async request task=4096
1686418517   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): read 105 log space maps (105 total blocks - blksz = 131072 bytes) in 475 ms
1686418517   mmp.c:239:mmp_thread_start(): MMP thread started pool 'pool0' gethrtime 56674227960
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=1
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=2048
1686418517   spa_misc.c:417:spa_load_note(): spa_load(pool0, config trusted): LOADED
1686418517   spa.c:8514:spa_async_request(): spa=pool0 async request task=4096
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 1, smp_length 144, unflushed_allocs 159744, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56678 ms, loading_time 10 ms, ms_max_size 535977984, max size error 535977984, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 2, smp_length 136, unflushed_allocs 684032, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56688 ms, loading_time 11 ms, ms_max_size 534876160, max size error 534876160, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 3, smp_length 240, unflushed_allocs 139264, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56699 ms, loading_time 9 ms, ms_max_size 536326144, max size error 536326144, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 4, smp_length 224, unflushed_allocs 65536, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 56708 ms, loading_time 10 ms, ms_max_size 536334336, max size error 536334336, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 5, smp_length 312, unflushed_allocs 45056, unflushed_frees 8192, freed 0, defer 0 + 0, unloaded time 56719 ms, loading_time 6 ms, ms_max_size 536551424, max size error 536547328, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 6, smp_length 336, unflushed_allocs 65536, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56726 ms, loading_time 18 ms, ms_max_size 536588288, max size error 536584192, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 7, smp_length 200, unflushed_allocs 77824, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56745 ms, loading_time 8 ms, ms_max_size 536702976, max size error 536698880, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 8, smp_length 208, unflushed_allocs 237568, unflushed_frees 4096, freed 0, defer 0 + 0, unloaded time 56753 ms, loading_time 13 ms, ms_max_size 535990272, max size error 535986176, old_weight 700000000000001, new_weight 700000000000001
1686418517   metaslab.c:2449:metaslab_load_impl(): metaslab_load: txg 539, spa pool0, vdev_id 0, ms_id 9, smp_length 560, unflushed_allocs 69632, unflushed_frees 12288, freed 0, defer 0 + 0, unloaded time 56766 ms, loading_time 12 ms, ms_max_size 536260608, max size error 536256512, old_weight 700000000000001, new_weight 700000000000001
1686418517   spa_history.c:306:spa_history_log_sync(): txg 539 open pool version 5000; software version zfs-2.1.99-1994-gad9ba3a8a; uts ws1 6.1.2zen1-v2 #1 SMP PREEMPT_DYNAMIC Mon Jan  2 10:23:58 +03 2023 x86_64
... cut 
1686418523   spa_history.c:293:spa_history_log_sync(): command: zpool import pool0 
ws1 andrey # zdb -l nvme0n1p5
------------------------------------
LABEL 0 
------------------------------------
    version: 5000
    state: 4
    guid: 1555297785859709891
    labels = 0 1 2 3 
------------------------------------
L2ARC device header
------------------------------------
    magic: 6504978260106102853
    version: 1
    pool_guid: 400170765385537083
    flags: 1
    start_lbps[0]: 0
    start_lbps[1]: 0
    log_blk_ent: 0
    start: 4198400
    end: 1073217536
    evict: 4198400
    lb_asize_refcount: 0
    lb_count_refcount: 0
    trim_action_time: 0
    trim_state: 0

------------------------------------
L2ARC device log blocks
------------------------------------
No log blocks to read
ws1 andrey # sudo arc_summary -s l2arc

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Jun 10 20:51:21 2023
Linux 6.1.2zen1-v2                                2.1.99-1994_gad9ba3a8a
Machine: ws1 (x86_64)                             2.1.99-1994_gad9ba3a8a

L2ARC not detected, skipping section
ws1 andrey # cat /etc/modprobe.d/zfs.conf 
options zfs l2arc_rebuild_enabled=1
options zfs l2arc_headroom=2

I don't see any mention of nvme0n1p5 in the log. Is this normal?

@gamanakis
Copy link
Contributor

Your cache device is 1GB and for persistent L2ARC to work it needs to be greater than 1GB. Try again with a slightly bigger partition.

@yhaenggi
Copy link

or set this accordingly: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-rebuild-blocks-min-l2size
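For completeness, a sketch of adjusting that parameter (the 512 MiB value is just an example; requires root):

```shell
# Persistent L2ARC skips devices at or below this size (default 1 GiB).
# Lower the threshold at runtime (value in bytes):
echo 536870912 > /sys/module/zfs/parameters/l2arc_rebuild_blocks_min_l2size

# Or persistently, in /etc/modprobe.d/zfs.conf:
# options zfs l2arc_rebuild_blocks_min_l2size=536870912
```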

@andrei-trybukhouski
Copy link

Thanks a lot, this solved my issue!
It seems I can finally get rid of btrfs and bcache :)

@DavidFair
Copy link

I'm still seeing this issue with #14963 in v2.2.3, and have to work around it when relying on an implicit ashift value.

  • Omitting the ashift option during cache device creation does not persist the cache between reboots
  • -o ashift=0 persists between reboots
  • -o ashift=12 also persists between reboots

(Full command for reference zpool add -o ashift=n media-pool cache /dev/disk/by-id/nvme-<UUID>-part4)

I'm happy to add and remove the cache devices whilst testing, but can't recreate the pool if someone has a potential commit they want me to test.

Pool Details:
The pool and cache devices were created with v2.2.3; the pool was created first with ashift left as the default (0).
Cache devices were added/removed later.

zpool get ashift syncthing-pool
NAME        PROPERTY  VALUE   SOURCE
syncthing-pool  ashift    0       default

Running zdb gives the true ashift value:

MOS Configuration:
        version: 5000
        name: 'syncthing-pool'
        state: 0
        txg: 94840
        pool_guid: 11492548645590784878
        errata: 0
        hostid: 2360010168
        hostname: 'pve'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 11492548645590784878
            create_txg: 4
            com.klarasystems:vdev_zap_root: 129
            children[0]:
                type: 'raidz'
                id: 0
                guid: 17832893951284810316
                nparity: 1
                metaslab_array: 256
                metaslab_shift: 34
                ashift: 12

ZFS options:

options zfs l2arc_rebuild_enabled=1 zfs_txg_timeout=30 l2arc_trim_ahead=100

(Not sure how to get the cache vdev's ashift; I'm happy to edit and add it here.)
