Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Posting checksum events fails with ENOMEM on FreeBSD #14717

Closed
asomers opened this issue Apr 4, 2023 · 12 comments
Closed

Posting checksum events fails with ENOMEM on FreeBSD #14717

asomers opened this issue Apr 4, 2023 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@asomers
Copy link
Contributor

asomers commented Apr 4, 2023

System information

Type Version/Name
Distribution Name FreeBSD
Distribution Version 14.0-CURRENT
Kernel Version bef81bc0aef (main as of 3-April-2023)
Architecture amd64
OpenZFS Version zfs-2.1.99-FreeBSD_g57cfae4a2 zfs-kmod-2.1.99-FreeBSD_g431083f75

Describe the problem you're observing

On FreeBSD, devctl events must fit within a 1016 byte string buffer. If they are too large, they will be silently dropped and never sent to userland. ZFS's ereport.fs.zfs.checksum events are far too large: 1862 bytes in my test case. So they never get sent to userland. This means that zfsd is unable to degrade a vdev that produces too many checksum errors.

Describe how to reproduce the problem

vim /etc/kyua/kyua.conf # Set test_suites.FreeBSD.disks
cd /usr/tests/sys/cddl/zfs/tests/zfsd
sudo kyua debug zfsd_test:zfsd_degrade_001_pos

Optionally, run a dtrace script like this, which will show various ZFS events get queued for userland, but not any ereport.fs.zfs.checksum events.

sudo dtrace -i 'fbt:kernel:devctl_queue:entry {printf("%s", stringof(args[0]->dei_data));}'

Possible solutions

PR #14716 partially fixes the problem by making the report format less verbose. However, it isn't enough. We still need to trim another ~75 bytes from the report (the exact number depends on the size of the pool and vdev names). Here are some possibilities for how to fully fix it:

  • Combine the bad_cleared_histogram and bad_set_histogram into a single bad_bit_histogram. That would save about 150 bytes.
  • Widen the histogram buckets from one bit wide to two bits wide.
  • Simply remove the histogram fields. zpool-events(8) suggests that they are intended for use with IDE and parallel SCSI, which are obsolete.
  • Remove several other fields. But which ones? zio_err will probably always be zero. But I'm not sure which of the others can go.
  • Double the size of FreeBSD's buffer, at an uncertain performance cost

Current Report format

Here's an example of a current report, including the changes from PR #14716 .

 class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
@asomers asomers added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 4, 2023
@rincebrain
Copy link
Contributor

I personally have found the contents of the histogram buckets useful a couple of times, for telling it was always a certain bit mangled, or that all the bits were wrong (e.g. the result was noise). So I'd prefer to not lose them.

Could we change the interface moving forward so we chunk up the events, rather than either bloating the buffer per message with uncertain implications or dropping fields? It seems like it'd be more robust to if we ever had occasion to add more data to those again, and failing to pass too-large messages in the "old" interface would be status quo ante anyway...

@asomers
Copy link
Contributor Author

asomers commented Apr 5, 2023

What about removing all of the histogram stuff from the event, but adding a custom dtrace probe to get it? If checksum events are rare, then the histogram probably won't be useful anyway. And if checksum events are frequent, then one can start dtracing after noticing checksum events.

@rincebrain
Copy link
Contributor

Removing useful information because FreeBSD has a tiny buffer seems like a poor tradeoff, to me. What are you concerned about, with increasing the buffer limit? Is it a very hot path for some use cases? Is it fixed-size copies every time, so everyone gets a larger copy penalty?

@asomers
Copy link
Contributor Author

asomers commented Apr 5, 2023

Yes, it's a fixed size, so increasing it for one event increases it for all. And it's already pretty big, compared to what most events need (usually fewer than 200 bytes). How useful is the histogram, really? You said that you noticed a stuck bit sometimes. What kind of hardware was that on?

@Precidata
Copy link

If valid values goes only to 64, you could divide by two the length of the histogram messages by returning each 0-63 value as a Base64 character.

@rincebrain
Copy link
Contributor

The other "bad" idea that came to mind is, we have all these compression functions laying around...I wonder if any of them are sufficient to binpack this better, knowing a lot of them have fixed overheads...

@mcmilk
Copy link
Contributor

mcmilk commented Apr 13, 2023

@asomers - could this issue be closed - cause PR #14716 is in?

@asomers
Copy link
Contributor Author

asomers commented Apr 13, 2023

No, because that PR was only a partial solution. The messages are still too long.

@asomers
Copy link
Contributor Author

asomers commented Jul 10, 2023

@rincebrain this is still a problem. Would you be open to adding an off-by-default sysctl knob, something like vfs.zfs.checksum_event_verbose? When off, ZFS would omit the checksum histograms.

@rincebrain
Copy link
Contributor

Oh, interesting, I had assumed this went in back when 14716 did.

Having read the code more carefully, embarassingly, I'm not actually sure I mind this any more, because what I was using is bad_set_bits/bad_cleared_bits, not the histogram.

I'm so sorry for derailing this over my incorrect perception.

@asomers
Copy link
Contributor Author

asomers commented Jul 11, 2023

Great! I'll create a PR to remove the histogram fields. However, I now realize there's another problem too. For disks that aren't in RAIDZ arrays, ZFS adds a few other fields. They look like this cksum_actual=4d458e2ca9781347e3b80ea566e095a42df54a8fcb91eb058f365662cfb23c73 cksum_algorithm=sha256 cksum_expected=e0c327ff9d388f51695151ff6f51d253ff865d718d6a514980ff67658f8d4be1. And those extra fields also cause the event to overflow the buffer. May I eliminate those, too? cksum_actual, in particular, should almost never be useful. Apart from fletcher4, a single-bit change in the input will render the checksum completely different.

@rincebrain
Copy link
Contributor

SGTM. Hopefully if someone else is going to miss them, they'll speak up.

asomers added a commit to asomers/zfs that referenced this issue Jul 11, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools.

Sponsored-by:	Axcient
Signed-off-by:	Alan Somers <asomers@gmail.com>
asomers added a commit to asomers/zfs that referenced this issue Jul 11, 2023
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes openzfs#14717 for mirrored pools.

Sponsored-by:   Axcient
Signed-off-by:  Alan Somers <asomers@gmail.com>
asomers added a commit to asomers/zfs that referenced this issue Jul 20, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools.

Sponsored-by:	Axcient
Signed-off-by:	Alan Somers <asomers@gmail.com>
asomers added a commit to asomers/zfs that referenced this issue Jul 20, 2023
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes openzfs#14717 for mirrored pools.

Sponsored-by:   Axcient
Signed-off-by:  Alan Somers <asomers@gmail.com>
behlendorf pushed a commit that referenced this issue Jul 21, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue #14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15052
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 21, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes openzfs#15052
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jul 21, 2023
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes openzfs#14717 for mirrored pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes openzfs#14717
Closes openzfs#15052
behlendorf pushed a commit that referenced this issue Jul 21, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue #14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15052
behlendorf pushed a commit that referenced this issue Jul 21, 2023
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes #14717 for mirrored pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #14717
Closes #15052
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Dec 12, 2023
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes openzfs#15052
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Dec 12, 2023
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes openzfs#14717 for mirrored pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes openzfs#14717
Closes openzfs#15052
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants