zed excessive logging with class=config_sync #7132

Closed
sbonds opened this issue Feb 5, 2018 · 30 comments · Fixed by #13729
Assignees: tonyhutter
Labels: Component: ZED (ZFS Event Daemon) · good first issue (Indicates a good issue for first-time contributors) · Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@sbonds

sbonds commented Feb 5, 2018

This appears to be an exact duplicate of the closed issue #6667. I have no ability to re-open that issue, so based on this guidance (https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github) I'm opening a new issue and referencing the old one.

System information

Type Version/Name
Distribution Name CentOS 7
Distribution Version 7.4.1708 (Core)
Linux Kernel 3.10.0-693.11.6.el7.x86_64
Architecture x86_64
ZFS Version 0.7.5-1
SPL Version 0.7.5-1

Describe the problem you're observing

"zed" is spamming syslog with messages like:

Feb  5 03:39:03 hostname zed: eid=<incrementing number> class=config_sync pool_guid=0xCA08E3DA44C8B6D6

The total volume is pretty impressive; here's last week's count:

# grep -c zed messages-20180205
1154710

Describe how to reproduce the problem

It started after I rebooted the server on Jan 17. That was the first reboot after I had copied a number of ZFS filesystems from another server onto this one. It was also the first reboot after I had installed some HP hardware tools onto this DL360 server.

Jan 17 10:34:39 servername yum[15213]: Installed: hpsmh-7.6.0-11.x86_64
Jan 17 10:34:49 servername yum[15213]: Installed: hpdiags-10.60.2109-2176.x86_64
Jan 17 10:54:52 servername yum[20732]: Installed: hponcfg-4.6.0-0.x86_64

Include any warning/errors/backtraces from the system logs

No backtraces, but the errors are of the same form as previously reported. Here are the very first ones with some context during the boot:

Jan 17 10:56:43 hostname systemd: Started Initialize the iWARP/InfiniBand/RDMA stack in the kernel.
Jan 17 10:56:45 hostname kernel: hpdiagstc:21831 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: hpdiagstc:21848 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: hpdiagstc:21865 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:45 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:45 hostname kernel: sde: sde1 sde9
Jan 17 10:56:46 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:46 hostname kernel: sde: sde1 sde9
Jan 17 10:56:46 hostname kernel: hpdiagstc:21865 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Jan 17 10:56:47 hostname kernel: sdd: sdd1 sdd9
Jan 17 10:56:47 hostname kernel: sde: sde1 sde9
Jan 17 10:56:47 hostname zed: eid=42 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Jan 17 10:56:47 hostname zed: eid=43 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
@sbonds
Author

sbonds commented Feb 5, 2018

I traced the initiation of this error spam to starting the HP server diagnostics via the System Management Homepage. When started, I see the following in the logs:

Feb  5 15:41:41 hostname kernel: hpdiagstc:19405 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb  5 15:41:41 hostname kernel: hpdiagstc:19414 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb  5 15:41:42 hostname kernel: hpdiagstc:19439 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb  5 15:41:43 hostname kernel: hpdiagstc:19439 map pfn expected mapping type uncached-minus for [mem 0xdf7fe000-0xdf7fefff], got write-back
Feb  5 15:41:44 hostname zed: eid=17 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb  5 15:41:44 hostname zed: eid=18 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb  5 15:41:45 hostname zed: eid=19 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
... and so on ...

A reboot of the server seems to clear the problem, at least temporarily.

@behlendorf
Contributor

@sbonds thanks for creating a new issue for this. The config_sync event with an increasing eid suggests something on the system is forcing a configuration update. This should primarily happen during import or device addition/removal, not during normal operation. A few questions:

  • Are there other events mixed in with the config_sync ones?
  • Does stopping the ZED halt the creation of new events?

behlendorf added the "good first issue" label Feb 6, 2018
@sbonds
Author

sbonds commented Feb 6, 2018

Are there other events mixed in with the config_sync ones?

No.

Does stopping the ZED halt the creation of new events?

Yes. A restart also stops all future (non-queued) events.

Feb  5 18:55:25 hostname systemd: Started ZFS Event Daemon (zed).
Feb  5 18:55:25 hostname systemd: Starting ZFS Event Daemon (zed)...
Feb  5 18:55:25 hostname zed: ZFS Event Daemon 0.7.5-1 (PID 25510)
Feb  5 18:55:25 hostname zed: Processing events since eid=168
Feb  5 18:55:25 hostname zed: eid=169 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
Feb  5 18:55:25 hostname zed: eid=170 class=config_sync pool_guid=0xCA08E3DA44C8B6D6
... no additional log after those two, presumably queued, messages ...

I also did some poking around to determine if this was an HP tools bug or if the HP diags just kicked something off that created the zed cascade. It seems like the latter. I can also start the same cascade with a simple udevadm trigger command that doesn't involve the tools at all.
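For reference, a trigger of roughly this shape is what is meant; the exact flags used aren't stated above, so they are an assumption:

# Re-emit "change" uevents for block devices; with autoexpand=on and zed
# running, this alone was enough to kick off the config_sync cascade.
sudo udevadm trigger --subsystem-match=block --action=change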

@joharri9

joharri9 commented May 7, 2019

I have had the same issue. CentOS 7, zfs 0.7.13.

May 7 14:36:49 virtual zed: eid=66353 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:49 virtual zed: eid=66354 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66355 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66356 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66357 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:50 virtual zed: eid=66358 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:51 virtual zed: eid=66359 class=config_sync pool_guid=0xBC0B31523322D839
May 7 14:36:51 virtual zed: eid=66360 class=config_sync pool_guid=0xBC0B31523322D839

Restarting zed seems to have cleared it. If it reoccurs, I'll do a better job of capturing data.

@abclution

My server has been doing the same thing for over a year; restarting the zfs-zed service stops the nonsense for about an hour.
It pegs a CPU and uses a bunch of disk I/O.

I've scheduled an hourly "service zfs-zed restart", which alleviates but does not fix this.
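Such an hourly restart can be written as a root cron entry like the one below; the exact schedule line isn't shown above, so this is only an illustration:

# /etc/cron.d/zfs-zed-restart (hypothetical): restart zed at the top of every hour
0 * * * * root service zfs-zed restart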

behlendorf added the "Type: Defect" label Jun 5, 2019
behlendorf added this to "To do" in 0.8-release Jun 5, 2019
@idfxken

idfxken commented Mar 6, 2020

This issue is still present in Debian buster + Proxmox.
ZED version: zfs-0.8.3-pve1

@kdar

kdar commented Sep 24, 2020

Still present on Ubuntu, and it is also creating about a 20% CPU load on my machine. Restarting the service fixes it.
Zed version: zfs-zed/now 0.8.4-1~18.04.york0 amd64

@don-brady
Contributor

@kdar one workaround is to disable autoexpand on the pool.
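For example (the pool name is a placeholder):

zpool set autoexpand=off <pool>
zpool get autoexpand <pool>     # verify the property is now off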

@kdar

kdar commented Oct 2, 2020

@don-brady Disabled it and restarted the service. Still getting Oct 02 21:38:28 vault zed[4174495]: eid=9959921 class=config_sync pool_guid=0xB6D5D39105D5F808 messages.

behlendorf removed this from "To do" in 0.8-release Dec 9, 2020
@rnz

rnz commented Dec 11, 2020

Hi!
I caught one more state in which ZED starts spamming. In my case it happened when attaching a drive to a pool (single -> mirror).

ZED was also actively reading/writing on the drives in the mirror:

# zpool status -L
  pool: pool1
 state: ONLINE
  scan: resilvered 792K in 0h0m with 0 errors on Thu Dec 10 19:43:20 2020
config:

	NAME        STATE     READ WRITE CKSUM
	pool1       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0

errors: No known data errors

# dstat -anr --disk-util -D sda,sdb 10
----total-cpu-usage---- --dsk/sda-----dsk/sdb-- -net/total- ---paging-- ---system-- -net/total- ---io/sda------io/sdb-- sda--sdb-
usr sys idl wai hiq siq| read  writ: read  writ| recv  send|  in   out | int   csw | recv  send| read  writ: read  writ|util:util
  0   0 100   0   0   0|  65k  208k: 165k  208k|   0     0 |   0     0 |5770  4932 |   0     0 |0.96  7.67 :7.12  6.41 |0.66:0.76
  0   0  99   0   0   0|4828k   14M:  12M   14M|1337k 1204k|   0     0 |  36k   48k|1337k 1204k|71.8   597 : 531   486 |46.8:56.1
  0   1  99   0   0   0|4423k   13M:  11M   13M|1331k 1275k|   0     0 |  34k   44k|1331k 1275k|65.4   545 : 489   443 |51.1:60.2
...
  0   1  99   0   0   0|7552k   23M:  19M   23M|1147k 1086k|   0     0 |  51k   73k|1151k 1104k| 112   946 : 837   765 |38.9:46.0
  0   1  99   0   0   0|8132k   24M:  20M   24M|1290k 1261k|   0     0 |  51k   77k|1249k 1220k| 121   991 : 895   791 |37.6:46.9
  0   0 100   0   0   0|2818k 8450k:7223k 8450k|1187k 1143k|   0     0 |  22k   31k|1187k 1143k|41.3   343 : 312   274 |13.0:16.3
  0   0 100   0   0   0|   0     0 :   0     0 |1214k 1247k|   0     0 |5718  4746 |1214k 1247k|   0     0 :   0     0 |   0:   0
  0   0 100   0   0   0|   0     0 :   0     0 |1126k 1122k|   0     0 |5533  4605 |1126k 1122k|   0     0 :   0     0 |   0:   0
  0   0 100   0   0   0|   0     0 :   0     0 |1246k 1233k|   0     0 |5618  4711 |1246k 1233k|   0     0 :   0     0 |   0:   0

After restarting ZED, all those operations stopped (last three dstat samples).

@AssKoala

AssKoala commented Apr 8, 2021

Is there any sort of logging, as verbose as need be, that can be enabled so I can help track this down?

It seems to generally start up after any sort of "big" update, but it's not 100%, and occasionally it just decides to happen. It only started once I "upgraded" my zpool to enable all the features, because that seemed like a good idea at the time and I'm not planning on going "backwards" to SunOS or an older Debian. I've done nothing other than set up the pool and enable the various features.

I would say the rate is about once a week and the server is only rebooted when necessary for updates.

Restarting the zed service always fixes the problem.

My setup is as follows and extremely vanilla (ZFS pool + SMB share and PLEX server, single user):

Type Version/Name
Distribution Name Debian Stable (Buster + backports)
Distribution Version 10
Linux Kernel 5.10.0-0.bpo.4-amd64
Architecture x86_64
ZFS/ZED Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

zpool status:
  pool: tank0
state: ONLINE
scan: scrub repaired 0B in 01:47:51 with 0 errors on Sun Apr 4 05:47:53 2021
config:

    NAME                        STATE     READ WRITE CKSUM
    tank0                       ONLINE       0     0     0
      raidz3-0                  ONLINE       0     0     0
        wwn-0x500003987bc02157  ONLINE       0     0     0
        wwn-0x5000039a6b800076  ONLINE       0     0     0
        wwn-0x5000039a6b800079  ONLINE       0     0     0
        wwn-0x5000039a6b880065  ONLINE       0     0     0
        wwn-0x5000039a6ba00341  ONLINE       0     0     0
        wwn-0x5000039a5c70022f  ONLINE       0     0     0
        wwn-0x5000cca249d0b1de  ONLINE       0     0     0
        wwn-0x5000cca269d65227  ONLINE       0     0     0

errors: No known data errors

@mikagika

Also have seen this on Ubuntu 20.04. In our case it was a backup server that just receives hourly snapshots from a production pool, so most of the time the pools are idle. Except I noticed one of them wasn't: the pool is on AWS sc1 volumes, and their "burst balance" was declining when we weren't running anything. zpool iostat confirmed there was I/O to the pool, and iotop implicated zed. The messages in syslog led me here. "systemctl restart zed" stopped the I/O and the messages in this case.

It looks like it may have been triggered in this case after the pool was expanded when it ran out of space. I.e., the pool filled up, I expanded the EBS volumes, then (eventually; autoexpand doesn't seem to always work) got the pool to expand. The class=config_sync messages seem to have started around the time I was doing that work.
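For reference, when autoexpand does not pick up a resized device on its own, the expansion can usually be forced per device; the pool and device names below are placeholders:

zpool online -e <pool> <device>   # ask ZFS to expand the vdev to the device's new size
zpool list -v <pool>              # confirm SIZE/EXPANDSZ afterwards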

@sfatula

sfatula commented Oct 18, 2021

I just saw this yesterday on zfs 2.1.1 on Ubuntu 20.04. Every second I was getting the config_sync log messages. A small sample:

syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36362 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36363 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:10 Home-Server zed: eid=36364 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36365 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36366 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36367 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:11 Home-Server zed: eid=36368 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:12 Home-Server zed: eid=36369 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:12 Home-Server zed: eid=36370 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36371 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36372 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36373 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:13 Home-Server zed: eid=36374 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36375 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36376 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36377 class=config_sync pool='Raid'
syslog.1:Oct 16 23:32:14 Home-Server zed: eid=36378 class=config_sync pool='Raid'

Restarting zed stopped the issue for me. I've resorted to a cron.hourly job to restart it. Not using HP hardware, no events like attaching drives, etc.; just normal operation, and it (seemingly randomly) starts happening. I was not able to figure out what starts it, but it never goes away without a zed restart.

@arcasinky

arcasinky commented Dec 22, 2021

I just came back from vacation to find the same thing happening on my system. I can't tell when the problem started because zed messages have completely filled the journal to the point that only 2 days' worth of logs are present.

journalctl -l | grep zed | wc -l
368015

I have a weekly cron job that monitors the number of LBAs written to my SLOG SSD so that I can tell if it's approaching the rated TBW lifespan. In a typical idle week, this drive will write about 2 GB/day. Since 12/16, it looks like ~500 GB have been written, which is a huge amount given that the system should have been idle while I was on vacation.
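(The actual cron job isn't shown; one hypothetical way to read such a counter, with the device path as a placeholder and the attribute name depending on the drive vendor, would be:)

smartctl -A /dev/sdX | grep -i -E 'total_lbas_written|data units written'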

@derekakelly

I too am experiencing this issue.

journalctl -l | grep zed | wc -l
2639766

Ubuntu 21.10
Linux 5.13.0-27-generic on x86_64
zfs-2.0.6-1ubuntu2
zfs-kmod-2.0.6-1ubuntu2

@derekakelly

For me, this issue comes back on reboot, but restarting the zed service stops it indefinitely. However, I've found that whenever I open webmin, the issue immediately comes back.

@xgiovio

xgiovio commented Feb 6, 2022

I have the same issue, #13070
Feb 06 19:20:53 shareserver zed[6892]: eid=36 class=config_sync pool='xpool3'
Feb 06 19:20:53 shareserver zed[6891]: eid=37 class=config_sync pool='xpool3'
Feb 06 19:20:54 shareserver zed[7545]: eid=38 class=config_sync pool='xpool3'
Feb 06 19:20:54 shareserver zed[7559]: eid=39 class=config_sync pool='xpool3'
Feb 06 19:20:55 shareserver zed[7940]: eid=40 class=config_sync pool='xpool1'
Feb 06 19:20:55 shareserver zed[7947]: eid=41 class=config_sync pool='xpool1'
Feb 06 19:20:57 shareserver zed[8619]: eid=42 class=config_sync pool='xpool2'
Feb 06 19:20:57 shareserver zed[8620]: eid=43 class=config_sync pool='xpool2'
Feb 06 19:20:58 shareserver zed[8850]: eid=44 class=config_sync pool='xpool3'
Feb 06 19:20:58 shareserver zed[8851]: eid=45 class=config_sync pool='xpool3'
Feb 06 19:20:59 shareserver zed[9080]: eid=46 class=config_sync pool='xpool3'
Feb 06 19:20:59 shareserver zed[9081]: eid=47 class=config_sync pool='xpool3'

After startup, I see these messages for 10 minutes straight. After that, everything goes to 0 MB/s for the following hours. It happens only at startup.

@xgiovio

xgiovio commented Feb 6, 2022

@behlendorf can you please take a look at this? This strange behaviour seems very old.

@xgiovio

xgiovio commented Feb 6, 2022

#7366 (comment)

At this link there is the solution, at least for me.

Setting autoexpand to off on all pools resolved the problem.

journalctl | grep zed

Feb 06 22:10:52 shareserver systemd[1]: Started ZFS Event Daemon (zed).
Feb 06 22:10:52 shareserver zed[3386]: ZFS Event Daemon 2.1.2-1 (PID 3386)
Feb 06 22:10:52 shareserver zed[3386]: Processing events since eid=0
Feb 06 22:10:52 shareserver zed[3409]: eid=2 class=config_sync pool='xpool1'
Feb 06 22:10:52 shareserver zed[3412]: eid=3 class=pool_import pool='xpool1'
Feb 06 22:10:52 shareserver zed[3430]: eid=5 class=config_sync pool='xpool1'
Feb 06 22:10:52 shareserver zed[3439]: eid=9 class=config_sync pool='xpool2'
Feb 06 22:10:52 shareserver zed[3441]: eid=10 class=pool_import pool='xpool2'
Feb 06 22:10:52 shareserver zed[3454]: eid=12 class=config_sync pool='xpool2'
Feb 06 22:10:52 shareserver zed[3461]: eid=14 class=config_sync pool='xpool3'
Feb 06 22:10:52 shareserver zed[3468]: eid=15 class=pool_import pool='xpool3'
Feb 06 22:10:52 shareserver zed[3476]: eid=17 class=config_sync pool='xpool3'
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Diffie-Hellman initialized with 2048 bit key
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Outgoing Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
Feb 06 22:10:59 shareserver openvpn[4867]: Sun Feb 6 22:10:59 2022 Incoming Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
Feb 06 22:10:59 shareserver kernel: Guest personality initialized and is inactive
Feb 06 22:10:59 shareserver kernel: Initialized host personality

@xgiovio

xgiovio commented Feb 7, 2022

The problem still seems to be here. What is very strange is that after shutting down the system and starting it back up, the autoexpand feature was still on even though it had been disabled.

@derekakelly

@xgiovio I added the following line to my root crontab to automatically restart zed 30 seconds after rebooting. It's the best option I've found until there's a permanent fix. Plus, no need to disable autoexpand or anything else.

@reboot sleep 30 && systemctl restart zfs-zed

@xgiovio

xgiovio commented Feb 7, 2022

@derekakelly thanks, but I really think there is something related to the autoexpand setting and the import cache.
Yesterday I set autoexpand to off on all pools. I checked again with zpool get; they all had autoexpand off. I rebooted, the config_sync messages disappeared, and I verified again that autoexpand was off. ALL FINE.

Today I started the system and noticed these config_sync messages again. I checked the autoexpand property on the pools and it was still on for all of them. So I disabled autoexpand again, restarted zed, and disabled autoexpand once more to be sure.

I rebooted the server and started it again after a power-off, and now the autoexpand property is off. No zed config_sync errors.

From what I understand, there may be some bug involving zed, autoexpand, and importing from the cache file.

I hope some of the main devs can take a deep look into it.

@Zrin

Zrin commented Mar 6, 2022

I've seen a 25-minute flood of config_sync events after an autoexpand of a zpool.
Perhaps this is related; it is probably possible to reproduce it this way:

  • qemu/kvm VM, recently updated to Debian bullseye, (it is still using sysvinit-core)
  • backing block device is a LVM LV on the hypervisor
  • created 820 GiB zpool "archive" on the full device /dev/vde, zfs has created two partitions, vde1 and vde9
  • destroyed the zpool and re-created it with a different mountpoint
  • zfs set acltype=posixacl archive; zfs set xattr=sa archive
  • copied ~550 GiB file system to it
  • installed zfs-auto-snapshot, disabled frequent and hourly snapshots, created an initial snapshot

Proceeded to try online resize of the zpool:

  • checked autoexpand, it was off

  • zpool set autoexpand=on archive

  • lvextend-ed the LV on the hypervisor to 840 GiB

  • virsh blockresize ...

  • while watching dmesg on guest:
    [113829.223812] virtio_blk virtio6: [vde] new size: 1761607680 512-byte logical blocks (902 GB/840 GiB)
    [113829.224248] vde: detected capacity change from 880468295680 to 901943132160

  • zpool list
    NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
    archive 816G 526G 290G - - 0% 64% 1.00x ONLINE -

  • zpool online archive /dev/vde (took few seconds)

  • zpool list
    NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
    archive 832G 526G 306G - - 0% 63% 1.00x ONLINE -

  • the events start to appear in syslog and continue with frequency approx. 2/s
    zed[28996]: zed_disk_event:
    zed[28996]: #011class: EC_dev_status
    zed[28996]: #011subclass: dev_dle
    zed[28996]: #011dev_name: /dev/vde1
    zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
    zed[28996]: #011phys_path: pci-0000:00:09.0
    zed[28996]: #011dev_size: 901932646400
    zed[28996]: #011pool_guid: 14094888156864081133
    zed[28996]: #011vdev_guid: 14550771222278676910
    zed[28996]: zfs_slm_event: EC_dev_status.dev_dle
    zed[28996]: zfsdle_vdev_online: searching for '14550771222278676910' in 'archive'
    zed[28996]: Invoking "all-syslog.sh" eid=331 pid=6939
    zed: eid=331 class=config_sync pool='archive'
    zed[28996]: Finished "all-syslog.sh" eid=331 pid=6939 exit=0

Started rsync to compare the initial snapshot to the source FS:

  • rsync -aHAX --delete -i -n /mnt/src/ /srv/archive/.zfs/snapshot/snap_initial-2022-03-05-1533

A few minutes later I noticed the config_sync event flood with around 2 events per second; rsync was still running:
zed: eid=1337 class=config_sync pool='archive'

The event flood continued and lasted altogether around 25 minutes, then:
kernel: [115475.716128] vde: vde1 vde9
zed[28996]: zed_disk_event:
zed[28996]: #011class: EC_dev_remove
zed[28996]: #011subclass: disk
zed[28996]: #011dev_name: /dev/vde1
zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
zed[28996]: #011phys_path: pci-0000:00:09.0
zed[28996]: #011dev_size: 901932646400
zed[28996]: #011pool_guid: 14094888156864081133
zed[28996]: #011vdev_guid: 14550771222278676910
zed[28996]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
zed[28996]: zed_disk_event:
zed[28996]: #011class: EC_dev_add
zed[28996]: #011subclass: disk
zed[28996]: #011dev_name: /dev/vde1
zed[28996]: #011path: /devices/pci0000:00/0000:00:09.0/virtio6/block/vde/vde1
zed[28996]: #011phys_path: pci-0000:00:09.0
zed[28996]: #011dev_size: 901932646400
zed[28996]: #011pool_guid: 14094888156864081133
zed[28996]: #011vdev_guid: 14550771222278676910
zed[28996]: Invoking "all-syslog.sh" eid=2112 pid=31927
zed: eid=2112 class=config_sync pool='archive'
zed[28996]: Finished "all-syslog.sh" eid=2112 pid=31927 exit=0
zed[28996]: zfsdle_vdev_online: setting device 'vde' to ONLINE state in pool 'archive': 0
zed[28996]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
zed[28996]: Invoking "all-syslog.sh" eid=2113 pid=31929
zed[28996]: zfs_slm_event: EC_dev_add.disk
zed: eid=2113 class=config_sync pool='archive'
zed[28996]: Finished "all-syslog.sh" eid=2113 pid=31929 exit=0

It seems that at the end ZFS repartitioned the device, and then the flood stopped.

After the resize, I'm turning autoexpand off.

@joakimlemb

joakimlemb commented Apr 9, 2022

Just had the same issue; I had to reboot the server to make zed stop doing config_sync continuously.
From the daemon.log file:

Apr 8 23:15:03 server01 systemd[1]: Starting Snapshot ZFS Pool...
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:03 server01 sanoid[339261]: ignored key 'recursive' from 'zfs_scratch' template.
Apr 8 23:15:03 server01 sanoid[339261]: INFO: cache expired - updating from zfs list.
Apr 8 23:15:03 server01 sanoid[339261]: INFO: taking snapshots...
Apr 8 23:15:03 server01 systemd[1]: sanoid.service: Succeeded.
Apr 8 23:15:03 server01 systemd[1]: Finished Snapshot ZFS Pool.
Apr 8 23:15:04 server01 systemd[1]: Starting Cleanup ZFS Pool...
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'mergerfs' template.
Apr 8 23:15:04 server01 sanoid[339269]: ignored key 'recursive' from 'zfs_scratch' template.
Apr 8 23:15:04 server01 sanoid[339269]: INFO: pruning snapshots...
Apr 8 23:15:04 server01 systemd[1]: sanoid-prune.service: Succeeded.
Apr 8 23:15:04 server01 systemd[1]: Finished Cleanup ZFS Pool.
Apr 8 23:28:29 server01 zed: eid=163 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=164 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=165 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=166 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=167 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=168 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=169 class=config_sync pool='zfs-scratch'
Apr 8 23:28:29 server01 zed: eid=170 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=171 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=172 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=173 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=174 class=config_sync pool='zfs-scratch'
Apr 8 23:28:30 server01 zed: eid=175 class=config_sync pool='disk-hgst1'
Apr 8 23:28:30 server01 zed: eid=176 class=config_sync pool='disk-hgst1'
Apr 8 23:28:31 server01 zed: eid=177 class=config_sync pool='zfs-scratch'
Apr 8 23:28:31 server01 zed: eid=178 class=config_sync pool='zfs-scratch'
Apr 8 23:28:31 server01 zed: eid=179 class=config_sync pool='disk-stiron1'
Apr 8 23:28:31 server01 zed: eid=180 class=config_sync pool='disk-stiron1'
Apr 8 23:28:32 server01 zed: eid=181 class=config_sync pool='disk-stiron2'
Apr 8 23:28:32 server01 zed: eid=182 class=config_sync pool='disk-stiron2'
Apr 8 23:28:32 server01 zed: eid=183 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=184 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=185 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=186 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=187 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=188 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=189 class=config_sync pool='zfs-scratch'
Apr 8 23:28:32 server01 zed: eid=190 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=191 class=config_sync pool='disk-hgst1'
Apr 8 23:28:33 server01 zed: eid=192 class=config_sync pool='disk-hgst1'
Apr 8 23:28:33 server01 zed: eid=193 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=194 class=config_sync pool='zfs-scratch'
Apr 8 23:28:33 server01 zed: eid=195 class=config_sync pool='disk-stiron1'
Apr 8 23:28:33 server01 zed: eid=196 class=config_sync pool='disk-stiron1'
Apr 8 23:28:34 server01 zed: eid=197 class=config_sync pool='disk-stiron2'
Apr 8 23:28:34 server01 zed: eid=198 class=config_sync pool='disk-stiron2'
Apr 8 23:28:34 server01 zed: eid=199 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=200 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=201 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=202 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=203 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=204 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=205 class=config_sync pool='zfs-scratch'
Apr 8 23:28:34 server01 zed: eid=206 class=config_sync pool='zfs-scratch'
Apr 8 23:28:35 server01 zed: eid=207 class=config_sync pool='disk-hgst1'
Apr 8 23:28:35 server01 zed: eid=208 class=config_sync pool='disk-hgst1'
..... (truncated due to size)
Apr 9 10:52:23 server01 zed: eid=201335 class=config_sync pool='disk-stiron1'
Apr 9 10:52:23 server01 zed: eid=201336 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201337 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201338 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201339 class=config_sync pool='disk-stiron1'
Apr 9 10:52:24 server01 zed: eid=201340 class=config_sync pool='disk-stiron1'
Apr 9 10:52:25 server01 zed: eid=201341 class=config_sync pool='disk-stiron1'
Apr 9 10:52:25 server01 zed: eid=201342 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201343 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201344 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201345 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201346 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201347 class=config_sync pool='disk-stiron1'
Apr 9 10:52:26 server01 zed: eid=201348 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201349 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201350 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201351 class=config_sync pool='disk-stiron1'
Apr 9 10:52:27 server01 zed: eid=201352 class=config_sync pool='disk-stiron1'
Apr 9 10:52:28 server01 zed: eid=201353 class=config_sync pool='disk-stiron1'
Apr 9 10:52:28 server01 zed: eid=201354 class=config_sync pool='disk-stiron1'

It kept doing config_sync for hours until I had to reboot the server. This also produces a fair number of read operations on the pools: https://i.imgur.com/mykKkRH.png

@pbond630

pbond630 commented Jul 19, 2022

Same issue here.
Ubuntu 20.04.4 LTS
zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14

Extract from journalctl:
Jul 19 17:09:01 hockney zed[599246]: eid=384 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:01 hockney zed[599300]: eid=385 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:03 hockney zed[599456]: eid=386 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:03 hockney zed[599466]: eid=387 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:05 hockney zed[599612]: eid=388 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:05 hockney zed[599622]: eid=389 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:07 hockney zed[599765]: eid=390 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:07 hockney zed[599778]: eid=391 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:09 hockney zed[599906]: eid=392 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:09 hockney zed[599916]: eid=393 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:11 hockney zed[600090]: eid=394 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:11 hockney zed[600098]: eid=395 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:12 hockney zed[600264]: eid=396 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:13 hockney zed[600277]: eid=397 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:14 hockney zed[600407]: eid=398 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:14 hockney zed[600417]: eid=399 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:15 hockney zed[600548]: eid=400 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:16 hockney zed[600571]: eid=401 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:18 hockney zed[600720]: eid=402 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:18 hockney zed[600730]: eid=403 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:19 hockney zed[600877]: eid=404 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:20 hockney zed[600887]: eid=405 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:21 hockney zed[601034]: eid=406 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:21 hockney zed[601044]: eid=407 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:23 hockney zed[601201]: eid=408 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:23 hockney zed[601211]: eid=409 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:25 hockney zed[601356]: eid=410 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:25 hockney zed[601366]: eid=411 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:27 hockney zed[601515]: eid=412 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:27 hockney zed[601525]: eid=413 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:28 hockney zed[601676]: eid=414 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:28 hockney zed[601686]: eid=415 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:30 hockney zed[601836]: eid=416 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:30 hockney zed[601846]: eid=417 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:32 hockney zed[602004]: eid=418 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:32 hockney zed[602014]: eid=419 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:34 hockney zed[602163]: eid=420 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:34 hockney zed[602173]: eid=421 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:36 hockney zed[602310]: eid=422 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:36 hockney zed[602320]: eid=423 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:38 hockney zed[602466]: eid=424 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:38 hockney zed[602476]: eid=425 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:39 hockney zed[602631]: eid=426 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:40 hockney zed[602642]: eid=427 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:41 hockney zed[602764]: eid=428 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:41 hockney zed[602774]: eid=429 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:43 hockney zed[602925]: eid=430 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:43 hockney zed[602935]: eid=431 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:45 hockney zed[603090]: eid=432 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:45 hockney zed[603103]: eid=433 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:47 hockney zed[603246]: eid=434 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:47 hockney zed[603256]: eid=435 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:48 hockney zed[603395]: eid=436 class=config_sync pool_guid=0x86C368E087026C2F
Jul 19 17:09:49 hockney zed[603397]: eid=437 class=config_sync pool_guid=0x86C368E087026C2F
...

The issue began immediately after running sudo zpool set autoexpand=on pool1. journalctl spam every 30 minutes with multiple instances, as per the above.

After running sudo zpool set autoexpand=off pool1 and rebooting the server, the log spamming has stopped.
This issue seems to have been around since at least 2017, e.g. https://www.reddit.com/r/zfs/comments/71nw43/zfsol_zed_syslog_spam/

tonyhutter self-assigned this Jul 25, 2022
@tonyhutter
Contributor

As mentioned in #7366 (comment), the simplest reproducer is to call sg_logs -t <dev> on a vdev with autoexpand=on and zed running. If autoexpand=off then it won't work. Example:

$ sudo ./zpool status
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  sda       ONLINE       0     0     0

$ sudo sg_logs -t /dev/sda
    Linux     scsi_debug        0191
  Current temperature = 38 C
  Reference temperature = 65 C
$ sudo ./zpool events 
TIME                           CLASS
$ sudo ./zpool set autoexpand=on tank
$ sudo sg_logs -t /dev/sda
    Linux     scsi_debug        0191
  Current temperature = 38 C
  Reference temperature = 65 C
$ sudo ./zpool events 
TIME                           CLASS
Jul 29 2022 14:25:21.186928957 sysevent.fs.zfs.history_event
Jul 29 2022 14:25:25.703928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:25.764928957 sysevent.fs.zfs.config_sync

A minute or two later I noticed that running through these steps kicked off a "config_sync storm":

TIME                           CLASS
Jul 29 2022 14:25:21.186928957 sysevent.fs.zfs.history_event
Jul 29 2022 14:25:25.703928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:25.764928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:26.424928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:26.451928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.056928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.083928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.813928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:27.914928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.521928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.548928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:28.892928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.190928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.794928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:29.821928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:30.494928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:30.521928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.127928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.155928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.815928957 sysevent.fs.zfs.config_sync
Jul 29 2022 14:25:31.842928957 sysevent.fs.zfs.config_sync

zed was spamming these logs:

zfsdle_vdev_online: setting device 'sda' to ONLINE state in pool 'tank': 0
zfs_slm_event: EC_dev_status.dev_dle
Invoking "all-debug.sh" eid=709 pid=24785
Invoking "all-syslog.sh" eid=709 pid=24786
zfsdle_vdev_online: searching for '15968317179963686959' in 'tank'
Finished "all-syslog.sh" eid=709 pid=24786 time=0.009596s exit=0
Finished "all-debug.sh" eid=709 pid=24785 time=0.024603s exit=0
zed_udev_monitor: skip /dev/sda since it has a gpt partition already
zed_udev_monitor: /dev/sda9 sectors 16384 < 131072 (minimum)
zed_disk_event:
	class: EC_dev_status
	subclass: dev_dle
	dev_name: /dev/sda1
	path: /devices/pseudo_0/adapter0/host2/target2:0:0/2:0:0:0/block/sda/sda1
	devid: scsi-1Linux_scsi_debug_6000-part1
	phys_path: /dev/disk/by-uuid/16200864169707950020
	dev_size: 199229440
	pool_guid: 16200864169707950020
	vdev_guid: 15968317179963686959
Invoking "all-debug.sh" eid=710 pid=24841
Invoking "all-syslog.sh" eid=710 pid=24842
Finished "all-syslog.sh" eid=710 pid=24842 time=0.010857s exit=0
Finished "all-debug.sh" eid=710 pid=24841 time=0.025998s exit=0
zfsdle_vdev_online: setting device 'sda' to ONLINE state in pool 'tank': 0
Invoking "all-debug.sh" eid=711 pid=24853
zfs_slm_event: EC_dev_status.dev_dle
Invoking "all-syslog.sh" eid=711 pid=24854
zfsdle_vdev_online: searching for '15968317179963686959' in 'tank'
Finished "all-syslog.sh" eid=711 pid=24854 time=0.009782s exit=0
Finished "all-debug.sh" eid=711 pid=24853 time=0.024482s exit=0
zed_udev_monitor: skip /dev/sda since it has a gpt partition already
zed_udev_monitor: /dev/sda9 sectors 16384 < 131072 (minimum)
zed_disk_event:
	class: EC_dev_status
	subclass: dev_dle
	dev_name: /dev/sda1
	path: /devices/pseudo_0/adapter0/host2/target2:0:0/2:0:0:0/block/sda/sda1
	devid: scsi-1Linux_scsi_debug_6000-part1
	phys_path: /dev/disk/by-uuid/16200864169707950020
	dev_size: 199229440
	pool_guid: 16200864169707950020
	vdev_guid: 15968317179963686959
Invoking "all-debug.sh" eid=712 pid=24911
Invoking "all-syslog.sh" eid=712 pid=24912
Finished "all-syslog.sh" eid=712 pid=24912 time=0.011311s exit=0
Finished "all-debug.sh" eid=712 pid=24911 time=0.026689s exit=0

So maybe the "dev status change" event is causing zed or something else to do an operation on the vdev partition, which in turn kicks off a new "dev status change" udev event, causing the feedback loop to continue. For reference, I ran the reproducer on Fedora 36 running master.

Anyway, now that I have a reproducer I can look into a fix.

@tonyhutter
Contributor

So maybe the "dev status change" event is causing zed or something else to do an operation on the vdev partition, which in turn kicks off a new "dev status change" udev event, causing the feedback loop to continue

Yep, that was the case:

<disk change event>
zfsdle_vdev_online()
    zpool_vdev_online()
        zpool_relabel_disk()
    	    open(vdev)
    	    close(vdev) <--- generates new disk change event

The easiest fix is to compare the vdev's "original size" (vs->vs_pspace) to the new size after the change event. If the vdev is the same size, then just ignore the event.

tonyhutter added a commit to tonyhutter/zfs that referenced this issue Aug 3, 2022
Users were seeing floods of `config_sync` events when autoexpand was
enabled.  This happened because all "disk status change" udev events
invoke the autoexpand codepath, which calls zpool_relabel_disk(),
which in turn cause another "disk status change" event to happen,
in a feedback loop.  Note that "disk status change" happens every time
a user calls close() on a block device.

This commit breaks the feedback loop by only allowing an autoexpand
to happen if the disk actually changed size.

Fixes: openzfs#7132
Fixes: openzfs#7366

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Aug 3, 2022 (same commit message as above)
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Aug 9, 2022 (same commit message as above)
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Sep 6, 2022 (same commit message as above)
behlendorf pushed a commit that referenced this issue Sep 8, 2022
Users were seeing floods of `config_sync` events when autoexpand was
enabled.  This happened because all "disk status change" udev events
invoke the autoexpand codepath, which calls zpool_relabel_disk(),
which in turn cause another "disk status change" event to happen,
in a feedback loop.  Note that "disk status change" happens every time
a user calls close() on a block device.

This commit breaks the feedback loop by only allowing an autoexpand
to happen if the disk actually changed size.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #7132
Closes: #7366
Closes #13729
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Sep 15, 2022 (same commit message as above)
beren12 pushed a commit to beren12/zfs that referenced this issue Sep 19, 2022 (same commit message as above)
@Mimikoo

Mimikoo commented Apr 13, 2023

The problem persists.
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
zfs-2.0.3-9
zfs-kmod-2.0.3-9

After installing webmin or restarting webmin, the syslog is spammed with:

zed: eid=360 class=config_sync pool='zfsdata'

Interestingly, for the second pool there are no such entries.
After restarting zfs-zed, the messages stop appearing.

Setting autoexpand=off also fixes the problem.

@derekakelly

I can confirm this still happens with my system as well.

I have mitigated my issue at boot with a crontab.

@reboot sleep 30 && systemctl restart zfs-zed

However, whenever I open webmin, the issue persists until I first close webmin and then restart zed.

Ubuntu Linux 22.04.2

@versus167

Same here with:

root@vm-postgresql64:/home/user# uname -a
Linux vm-postgresql64 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@vm-postgresql64:/home/user# zfs version
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1
root@vm-postgresql64:/home/user# journalctl -l | grep zed | wc -l
584134

After 6 hours of "class=config_sync" events, the pool became unavailable:

Aug 24 05:12:06 vm-postgresql64 zed[391989]: eid=588210 class=config_sync pool='tank'
Aug 24 05:12:06 vm-postgresql64 kernel:  sdc: sdc1 sdc9
Aug 24 05:12:06 vm-postgresql64 zed[392006]: eid=588212 class=statechange pool='tank' vdev=sdc1 vdev_state=UNAVAIL
Aug 24 05:12:06 vm-postgresql64 zed[392013]: eid=588211 class=vdev.unknown pool='tank' vdev=sdc1
Aug 24 05:12:06 vm-postgresql64 kernel: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.
Aug 24 05:12:06 vm-postgresql64 zed[392051]: eid=588213 class=io_failure pool='tank'

After a reboot, all is running well...
