zed excessive logging with class=config_sync #7132
Comments
I traced the initiation of this error spam to starting the HP server diagnostics via the System Management Homepage. When started, I see the following in the logs:
A reboot of the server seems to clear the problem, at least temporarily.
@sbonds thanks for creating a new issue for this. The
No.
Yes. A restart also stops all future (non-queued) events.
I also did some poking around to determine if this was an HP tools bug or if the HP diags just kicked something off that created the zed cascade. It seems like the latter. I can also start the same cascade with a simple
I have had the same issue. CentOS 7, zfs 0.7.13.
Restarting zed seems to have cleared it. If it recurs, I'll do a better job of capturing data.
My server has been doing the same thing for over a year; restarting the zfs-zed service stops the nonsense for about an hour. I've scheduled an hourly "service zfs-zed restart", which alleviates but does not fix this.
This issue is still present in Debian Buster + Proxmox.
Still present on Ubuntu, and it is also creating about a 20% CPU load on my machine. Restarting the service fixes it.
@kdar one workaround is to disable autoexpand on the pool.
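A minimal sketch of that workaround, assuming a pool named tank (substitute your own pool name):

# turn off automatic expansion for the pool
zpool set autoexpand=off tank
# confirm the property value and its source
zpool get autoexpand tank

Most posters here also restart zed afterwards (the systemd unit is zfs-zed on Debian/Ubuntu derivatives) to stop the backlog of events.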
@don-brady Disabled it and restarted the service. Still getting the config_sync messages.
Hi! ZED was also actively reading and writing to the drives in the mirror:
After restarting ZED, all of those operations were gone (see the last three dstat samples).
Is there any sort of logging, as verbose as need be, that can be enabled so I can help track this down? It seems to generally start up after any sort of "big" update, but it's not 100% consistent and occasionally it just decides to happen. It only started once I "upgraded" my zpool to enable all the features, because that seemed like a good idea at the time, and I'm not planning on going "backwards" to SunOS or an older Debian. I've done nothing other than set up the pool and enable the various features. I would say the rate is about once a week, and the server is only rebooted when necessary for updates. Restarting the zed service always fixes the problem. My setup is as follows and extremely vanilla (ZFS pool + SMB share and Plex server, single user):
zpool status:
errors: No known data errors
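One option for more verbose zed output while tracking this down is the debug log setting in zed.rc; a sketch, assuming your build ships the ZED_DEBUG_LOG variable (it is normally commented out) and that the paths match your distribution:

# /etc/zfs/zed.d/zed.rc
ZED_DEBUG_LOG="/var/log/zed.debug.log"
# then restart the daemon so it re-reads the config
systemctl restart zfs-zed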
Also have seen this on Ubuntu 20.04. In our case it was a backup server that just receives hourly snapshots from a production pool, so most of the time the pools are idle. Except I noticed one of them wasn't, because the pool is on AWS sc1 volumes and their "burst balance" was declining when we weren't running anything. zpool iostat confirmed there was I/O to the pool and iotop implicated zed. The messages in syslog led me here. "systemctl restart zed" stopped the I/O and the messages in this case. It looks like it may have been triggered after the pool was expanded when it ran out of space, i.e. the pool filled up, I expanded the EBS volumes, then (eventually, since autoexpand doesn't seem to always work) got the pool to expand. The class=config_sync messages seem to have started around the time I was doing that work.
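A sketch of the checks described above, assuming a pool named tank:

# per-vdev I/O statistics every 5 seconds; an idle pool should sit near zero
zpool iostat -v tank 5
# batch-mode iotop, showing only processes that are actually doing I/O
iotop -o -b -n 3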
I just saw this yesterday with zfs 2.1.1 on Ubuntu 20.04. Every second I was getting the config_sync log messages. A small sample:
Restarting zed stopped the issue for me. I've resorted to a cron.hourly job to restart it. Not using HP hardware, no events like attaching devices et al., just normal operation, and it (seemingly randomly) starts happening. I was not able to figure out what starts it, but it never goes away without a zed restart.
I just came back from vacation to find the same thing happening on my system. I can't tell when the problem started because zed messages have completely filled the journal, to the point that only two days' worth of logs are present. journalctl -l | grep zed | wc -l I have a weekly cron job that monitors the number of LBAs written to my SLOG SSD so that I can tell if it's approaching the rated TBW lifespan. In a typical idle week, this drive writes about 2 GB/day. Since 12/16, it looks like ~500 GB have been written, which is a huge amount given the system should have been idle while I was on vacation.
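For reference, a hedged sketch of that kind of write-volume check; the device path is a placeholder and the SMART attribute name (Total_LBAs_Written here) varies by SSD vendor:

# lifetime host-write counter for a SATA SSD
smartctl -A /dev/sda | grep -i lbas_written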
I too am experiencing this issue. journalctl -l | grep zed | wc -l (Ubuntu 21.10)
For me, this issue comes back on reboot, but restarting the zed service stops it indefinitely. However, I've found that whenever I open webmin, the issue immediately comes back.
I have the same issue, #13070. After startup, I see these messages for 10 minutes straight; after that, everything goes to 0 MB/s for the following hours. It happens only at startup.
@behlendorf can you please take a look at it? This strange behaviour seems very old.
At this link there is the solution, at least for me: setting autoexpand to off on all pools resolved the problem. journalctl | grep zed: Feb 06 22:10:52 shareserver systemd[1]: Started ZFS Event Daemon (zed).
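To apply that to every imported pool at once, a sketch (review the pool list before running it):

# list pool names only, then disable autoexpand on each
for p in $(zpool list -H -o name); do
    zpool set autoexpand=off "$p"
done
# verify: prints the property for all pools
zpool get autoexpand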
The problem still seems to be here. What is very strange is that after shutting the system down and starting it again, the autoexpand feature was still on even though it had been disabled.
@xgiovio I added the following line to my root crontab to automatically restart zed 30 seconds after rebooting. It's the best option I've found until there's a permanent fix. Plus, no need to disable autoexpand or anything else.
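A hedged sketch of such a crontab entry, assuming the systemd unit is named zfs-zed (on some systems it is just zed):

# root crontab: restart zed 30 seconds after every boot
@reboot sleep 30 && systemctl restart zfs-zed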
@derekakelly thanks, but I really think there is something related to the autoexpand setting and the import cache. Today I started the system and noticed the config_sync messages again. I checked the autoexpand feature on the pools and it was still on for all of them. So I disabled autoexpand again, restarted zed, and disabled autoexpand once more to be sure. I rebooted the server, started it again after a power-off, and now the autoexpand feature is off and there are no zed config_sync errors. From what I understand, there may be some bug involving zed, autoexpand, and importing from the cache. I hope one of the main devs can take a deep look into it.
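One hedged way to check where the property value is coming from, and to force the pool's cachefile to be rewritten with the current configuration (assuming the default cachefile path; the pool name is a placeholder):

# the SOURCE column shows whether the value is local or default
zpool get autoexpand tank
# re-setting cachefile makes ZFS rewrite the cache file for this pool
zpool set cachefile=/etc/zfs/zpool.cache tank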
I've seen a 25-minute config_sync event flood after an autoexpand of a zpool.
I proceeded to try an online resize of the zpool:
I started rsync to compare the initial snapshot to the source FS:
A few minutes later I noticed the config_sync event flood, at around 2 events per second, while rsync was still running. The flood continued for about 25 minutes altogether; it seems that at the end ZFS repartitioned the device, and then the flood stopped. After the resize, I'm turning autoexpand off.
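For reference, an online-expansion sequence of the kind described above typically looks roughly like this (pool and device names are placeholders):

# after growing the backing device (cloud volume, LUN, partition),
# ask ZFS to use the new space on that vdev
zpool online -e tank /dev/sdb
# check the pool size and any remaining expandable space (EXPANDSZ column)
zpool list tank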
Just had the same issue; I had to reboot the server to make zed stop doing config_sync continuously. Apr 8 23:15:03 server01 systemd[1]: Starting Snapshot ZFS Pool... It kept doing config_sync for hours until I rebooted. This also produces a fair amount of read operations on the pools: https://i.imgur.com/mykKkRH.png
Same issue here. Extract from journalctl: The issue began immediately after running […]. After running […]
As mentioned in #7366 (comment), the simplest reproducer is to call
A minute or two later I noticed that running through these steps kicked off a "config_sync storm":
zed was spamming these logs:
So maybe the "dev status change" event is causing zed or something else to do an operation on the vdev partition, which in turn kicks off a new "dev status change" udev event, causing the feedback loop to continue. For reference, I ran the reproducer on Fedora 36 running Anyway, now that I have a reproducer I can look into a fix. |
Yep, that was the case:
The easiest fix is to compare the vdev's "original size" (…) with its newly reported size, and only autoexpand if it actually changed.
Users were seeing floods of `config_sync` events when autoexpand was enabled. This happened because all "disk status change" udev events invoke the autoexpand codepath, which calls zpool_relabel_disk(), which in turn causes another "disk status change" event to happen, in a feedback loop. Note that a "disk status change" happens every time a user calls close() on a block device. This commit breaks the feedback loop by only allowing an autoexpand to happen if the disk actually changed size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7132 Closes #7366 Closes #13729
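The size check described in the commit message can be approximated from userspace when debugging; a hedged sketch with placeholder device paths (inside zed itself the fix compares against the vdev's recorded size, per the commit above):

# whole-disk size versus the vdev partition size, in bytes
blockdev --getsize64 /dev/sdb
blockdev --getsize64 /dev/sdb1
# if the disk is not larger than the partition (plus label overhead),
# there is nothing for autoexpand to do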
The problem persists. After installing webmin or restarting webmin, syslog is spammed with: zed: eid=360 class=config_sync pool='zfsdata' Interestingly, for the second pool there are no such entries. Setting autoexpand=off also fixes the problem.
I can confirm this still happens on my system as well. I have mitigated the issue at boot with a crontab entry.
However, whenever I open webmin, the issue persists until I first close webmin and then restart zed. Ubuntu Linux 22.04.2
Same here with:
After 6 hours of "class=config_sync" events, the pool became unavailable:
After a reboot, all is running well...
This appears to be an exact duplicate of the closed issue #6667. I have no ability to re-open that issue so based on this guidance (https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github) I'm opening a new issue and referencing the old one.
System information
Describe the problem you're observing
"zed" is spamming syslog with messages like:
The total volume is pretty impressive, here's last week's count:
Describe how to reproduce the problem
It started after I rebooted the server on Jan 17. That was the first reboot after I had copied a number of ZFS filesystems from another server onto this one. It was also the first reboot after I had installed some HP hardware tools onto this DL360 server.
Include any warning/errors/backtraces from the system logs
No backtraces, but the errors are of the same form as previously reported. Here are the very first ones with some context during the boot: