WARNING: COMPLETELY BROKEN WITH KERNEL 5.16.X #13210

Closed
Gibson85 opened this issue Mar 13, 2022 · 47 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@Gibson85

Gibson85 commented Mar 13, 2022

Hi,

I am totally speechless. After waiting months for a ZFS release compatible with the 5.16 kernel, running unprotected against open security bugs in the meantime, I find that your newest release, 2.1.3, is totally broken. I installed this release yesterday. After that the system became totally unstable. Trying to copy data from my main ZFS pool to a freshly formatted external drive (also ZFS) via USB failed.

Dozens of the following messages appeared. The system froze, the monitor went black, and the user was logged out.
I tried this about 10 times and reformatted the external drive with ZFS each time, but nothing changed.

verification failed - discarding update (trying again)

Almost all applications, such as VLC, Firefox, Vivaldi, and VSCode, crashed immediately. It was not possible to copy data over USB, and there were audio problems. Then I installed a slightly newer NVIDIA driver and the system became a bit more stable. Applications now started.

But this morning I tried again to copy data with rsync to the external drive (reformatted again beforehand). And boom: black monitor, logged out again. I then ran two Memtest86+ passes without any errors and checked the BIOS settings again. Everything is normal. The PC also works normally under Windows.

OK, maybe the kernel upgrade was broken. So I decided to do a fresh Fedora install. And what can I say: after installing ZFS the system was unstable again. Horribly unstable, to be precise. The first import of my pool worked, but when I tried to open a folder the PC crashed. After a reboot I could access at least some folders without crashing.

Then I installed the NVIDIA driver again. Now the system was much more stable, and I could open all folders.
But the problems still persist. The system is completely unstable.

PLEASE INVESTIGATE THIS! THERE ARE HORRIBLE BUGS!

@Gibson85 Gibson85 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 13, 2022
@rincebrain
Contributor

rincebrain commented Mar 13, 2022

Could you say a bit more about the pool(s) you're using, exact kernel version, and the settings on them?

I've got Fedora 35 running 2.1.3 and 5.16.8-200.fc35.x86_64, and quickly flinging data around to and from the pool doesn't catch fire for me.

@dioni21
Contributor

dioni21 commented Mar 14, 2022

@Gibson85 we've had reports of ZFS trouble over USB. Can you redo the tests with the disks attached to a SATA bus?

I also run ZFS on Fedora 35, with both almost always on the latest version. ZFS from git master, in fact.

@Gibson85
Author

Gibson85 commented Mar 14, 2022

It is really hard to say what's going on. All applications are crashing randomly. I have never seen anything like this.

Main pool:

NAME     PROPERTY                       VALUE                          SOURCE
MYPOOL  size                           14.5T                          -
MYPOOL  capacity                       71%                            -
MYPOOL  altroot                        -                              default
MYPOOL  health                         ONLINE                         -
MYPOOL  guid                           XXXXXXXXXXXXXXXXXXXX           -
MYPOOL  version                        -                              default
MYPOOL  bootfs                         -                              default
MYPOOL  delegation                     on                             default
MYPOOL  autoreplace                    off                            default
MYPOOL  cachefile                      -                              default
MYPOOL  failmode                       wait                           default
MYPOOL  listsnapshots                  off                            default
MYPOOL  autoexpand                     off                            default
MYPOOL  dedupratio                     1.00x                          -
MYPOOL  free                           4.08T                          -
MYPOOL  allocated                      10.4T                          -
MYPOOL  readonly                       off                            -
MYPOOL  ashift                         12                             local
MYPOOL  comment                        -                              default
MYPOOL  expandsize                     -                              -
MYPOOL  freeing                        0                              -
MYPOOL  fragmentation                  1%                             -
MYPOOL  leaked                         0                              -
MYPOOL  multihost                      off                            default
MYPOOL  checkpoint                     -                              -
MYPOOL  load_guid                      XXXXXXXXXXXXXXXXXXX            -
MYPOOL  autotrim                       off                            default
MYPOOL  compatibility                  off                            default
MYPOOL  feature@async_destroy          enabled                        local
MYPOOL  feature@empty_bpobj            active                         local
MYPOOL  feature@lz4_compress           active                         local
MYPOOL  feature@multi_vdev_crash_dump  enabled                        local
MYPOOL  feature@spacemap_histogram     active                         local
MYPOOL  feature@enabled_txg            active                         local
MYPOOL  feature@hole_birth             active                         local
MYPOOL  feature@extensible_dataset     active                         local
MYPOOL  feature@embedded_data          active                         local
MYPOOL  feature@bookmarks              enabled                        local
MYPOOL  feature@filesystem_limits      enabled                        local
MYPOOL  feature@large_blocks           enabled                        local
MYPOOL  feature@large_dnode            enabled                        local
MYPOOL  feature@sha512                 enabled                        local
MYPOOL  feature@skein                  enabled                        local
MYPOOL  feature@edonr                  enabled                        local
MYPOOL  feature@userobj_accounting     active                         local
MYPOOL  feature@encryption             enabled                        local
MYPOOL  feature@project_quota          active                         local
MYPOOL  feature@device_removal         enabled                        local
MYPOOL  feature@obsolete_counts        enabled                        local
MYPOOL  feature@zpool_checkpoint       enabled                        local
MYPOOL  feature@spacemap_v2            active                         local
MYPOOL  feature@allocation_classes     enabled                        local
MYPOOL  feature@resilver_defer         enabled                        local
MYPOOL  feature@bookmark_v2            enabled                        local
MYPOOL  feature@redaction_bookmarks    enabled                        local
MYPOOL  feature@redacted_datasets      enabled                        local
MYPOOL  feature@bookmark_written       enabled                        local
MYPOOL  feature@log_spacemap           active                         local
MYPOOL  feature@livelist               enabled                        local
MYPOOL  feature@device_rebuild         enabled                        local
MYPOOL  feature@zstd_compress          enabled                        local
MYPOOL  feature@draid                  enabled                        local
NAME     PROPERTY              VALUE                 SOURCE
MYPOOL  type                  filesystem            -
MYPOOL  creation              Fr Mai  8 19:47 2020  -
MYPOOL  used                  10.4T                 -
MYPOOL  available             3.90T                 -
MYPOOL  referenced            96K                   -
MYPOOL  compressratio         1.01x                 -
MYPOOL  mounted               no                    -
MYPOOL  quota                 none                  default
MYPOOL  reservation           none                  default
MYPOOL  recordsize            128K                  default
MYPOOL  mountpoint            legacy                local
MYPOOL  sharenfs              off                   default
MYPOOL  checksum              on                    default
MYPOOL  compression           lz4                   local
MYPOOL  atime                 on                    local
MYPOOL  devices               on                    default
MYPOOL  exec                  on                    default
MYPOOL  setuid                on                    default
MYPOOL  readonly              off                   default
MYPOOL  zoned                 off                   default
MYPOOL  snapdir               hidden                default
MYPOOL  aclmode               discard               default
MYPOOL  aclinherit            passthrough           local
MYPOOL  createtxg             1                     -
MYPOOL  canmount              on                    default
MYPOOL  xattr                 sa                    local
MYPOOL  copies                1                     default
MYPOOL  version               5                     -
MYPOOL  utf8only              off                   -
MYPOOL  normalization         none                  -
MYPOOL  casesensitivity       sensitive             -
MYPOOL  vscan                 off                   default
MYPOOL  nbmand                off                   default
MYPOOL  sharesmb              off                   default
MYPOOL  refquota              none                  default
MYPOOL  refreservation        none                  default
MYPOOL  guid                  XXXXXXXXXXXXXXXXXX    -
MYPOOL  primarycache          all                   default
MYPOOL  secondarycache        all                   default
MYPOOL  usedbysnapshots       0B                    -
MYPOOL  usedbydataset         96K                   -
MYPOOL  usedbychildren        10.4T                 -
MYPOOL  usedbyrefreservation  0B                    -
MYPOOL  logbias               latency               default
MYPOOL  objsetid              51                    -
MYPOOL  dedup                 off                   default
MYPOOL  mlslabel              none                  default
MYPOOL  sync                  standard              default
MYPOOL  dnodesize             legacy                default
MYPOOL  refcompressratio      1.00x                 -
MYPOOL  written               96K                   -
MYPOOL  logicalused           10.6T                 -
MYPOOL  logicalreferenced     40K                   -
MYPOOL  volmode               default               default
MYPOOL  filesystem_limit      none                  default
MYPOOL  snapshot_limit        none                  default
MYPOOL  filesystem_count      none                  default
MYPOOL  snapshot_count        none                  default
MYPOOL  snapdev               hidden                default
MYPOOL  acltype               posix                 local
MYPOOL  context               none                  default
MYPOOL  fscontext             none                  default
MYPOOL  defcontext            none                  default
MYPOOL  rootcontext           none                  default
MYPOOL  relatime              on                    local
MYPOOL  redundant_metadata    all                   default
MYPOOL  overlay               on                    default
MYPOOL  encryption            off                   default
MYPOOL  keylocation           none                  default
MYPOOL  keyformat             none                  default
MYPOOL  pbkdf2iters           0                     default
MYPOOL  special_small_blocks  0                     default

External pool:

EXTPOOL  size                           2.72T                          -
EXTPOOL  capacity                       0%                             -
EXTPOOL  altroot                        -                              default
EXTPOOL  health                         ONLINE                         -
EXTPOOL  guid                           XXXXXXXXXXXXXXXXXXXX           -
EXTPOOL  version                        -                              default
EXTPOOL  bootfs                         -                              default
EXTPOOL  delegation                     on                             default
EXTPOOL  autoreplace                    off                            default
EXTPOOL  cachefile                      -                              default
EXTPOOL  failmode                       wait                           default
EXTPOOL  listsnapshots                  off                            default
EXTPOOL  autoexpand                     off                            default
EXTPOOL  dedupratio                     1.00x                          -
EXTPOOL  free                           2.72T                          -
EXTPOOL  allocated                      410M                           -
EXTPOOL  readonly                       off                            -
EXTPOOL  ashift                         12                             local
EXTPOOL  comment                        -                              default
EXTPOOL  expandsize                     -                              -
EXTPOOL  freeing                        0                              -
EXTPOOL  fragmentation                  0%                             -
EXTPOOL  leaked                         0                              -
EXTPOOL  multihost                      off                            default
EXTPOOL  checkpoint                     -                              -
EXTPOOL  load_guid                      XXXXXXXXXXXXXXXXXXXX           -
EXTPOOL  autotrim                       off                            default
EXTPOOL  compatibility                  off                            default
EXTPOOL  feature@async_destroy          enabled                        local
EXTPOOL  feature@empty_bpobj            active                         local
EXTPOOL  feature@lz4_compress           active                         local
EXTPOOL  feature@multi_vdev_crash_dump  enabled                        local
EXTPOOL  feature@spacemap_histogram     active                         local
EXTPOOL  feature@enabled_txg            active                         local
EXTPOOL  feature@hole_birth             active                         local
EXTPOOL  feature@extensible_dataset     active                         local
EXTPOOL  feature@embedded_data          active                         local
EXTPOOL  feature@bookmarks              enabled                        local
EXTPOOL  feature@filesystem_limits      enabled                        local
EXTPOOL  feature@large_blocks           enabled                        local
EXTPOOL  feature@large_dnode            enabled                        local
EXTPOOL  feature@sha512                 enabled                        local
EXTPOOL  feature@skein                  enabled                        local
EXTPOOL  feature@edonr                  enabled                        local
EXTPOOL  feature@userobj_accounting     active                         local
EXTPOOL  feature@encryption             active                         local
EXTPOOL  feature@project_quota          active                         local
EXTPOOL  feature@device_removal         enabled                        local
EXTPOOL  feature@obsolete_counts        enabled                        local
EXTPOOL  feature@zpool_checkpoint       enabled                        local
EXTPOOL  feature@spacemap_v2            active                         local
EXTPOOL  feature@allocation_classes     enabled                        local
EXTPOOL  feature@resilver_defer         enabled                        local
EXTPOOL  feature@bookmark_v2            enabled                        local
EXTPOOL  feature@redaction_bookmarks    enabled                        local
EXTPOOL  feature@redacted_datasets      enabled                        local
EXTPOOL  feature@bookmark_written       enabled                        local
EXTPOOL  feature@log_spacemap           active                         local
EXTPOOL  feature@livelist               enabled                        local
EXTPOOL  feature@device_rebuild         enabled                        local
EXTPOOL  feature@zstd_compress          active                         local
EXTPOOL  feature@draid                  enabled                        local
NAME       PROPERTY              VALUE                  SOURCE
EXTPOOL  type                  filesystem             -
EXTPOOL  creation              So Mär 13 16:58 2022  -
EXTPOOL  used                  410M                   -
EXTPOOL  available             2.63T                  -
EXTPOOL  referenced            192K                   -
EXTPOOL  compressratio         1.02x                  -
EXTPOOL  mounted               no                     -
EXTPOOL  quota                 none                   default
EXTPOOL  reservation           none                   default
EXTPOOL  recordsize            128K                   default
EXTPOOL  mountpoint            none                   local
EXTPOOL  sharenfs              off                    default
EXTPOOL  checksum              on                     default
EXTPOOL  compression           zstd                   local
EXTPOOL  atime                 on                     local
EXTPOOL  devices               on                     default
EXTPOOL  exec                  on                     default
EXTPOOL  setuid                on                     default
EXTPOOL  readonly              off                    default
EXTPOOL  zoned                 off                    default
EXTPOOL  snapdir               hidden                 default
EXTPOOL  aclmode               discard                default
EXTPOOL  aclinherit            passthrough            local
EXTPOOL  createtxg             1                      -
EXTPOOL  canmount              on                     default
EXTPOOL  xattr                 sa                     local
EXTPOOL  copies                1                      default
EXTPOOL  version               5                      -
EXTPOOL  utf8only              off                    -
EXTPOOL  normalization         none                   -
EXTPOOL  casesensitivity       sensitive              -
EXTPOOL  vscan                 off                    default
EXTPOOL  nbmand                off                    default
EXTPOOL  sharesmb              off                    default
EXTPOOL  refquota              none                   default
EXTPOOL  refreservation        none                   default
EXTPOOL  guid                  XXXXXXXXXXXXXXXXXXXX   -
EXTPOOL  primarycache          all                    default
EXTPOOL  secondarycache        all                    default
EXTPOOL  usedbysnapshots       0B                     -
EXTPOOL  usedbydataset         192K                   -
EXTPOOL  usedbychildren        410M                   -
EXTPOOL  usedbyrefreservation  0B                     -
EXTPOOL  logbias               latency                default
EXTPOOL  objsetid              54                     -
EXTPOOL  dedup                 off                    default
EXTPOOL  mlslabel              none                   default
EXTPOOL  sync                  standard               default
EXTPOOL  dnodesize             legacy                 default
EXTPOOL  refcompressratio      1.00x                  -
EXTPOOL  written               192K                   -
EXTPOOL  logicalused           414M                   -
EXTPOOL  logicalreferenced     69K                    -
EXTPOOL  volmode               default                default
EXTPOOL  filesystem_limit      none                   default
EXTPOOL  snapshot_limit        none                   default
EXTPOOL  filesystem_count      none                   default
EXTPOOL  snapshot_count        none                   default
EXTPOOL  snapdev               hidden                 default
EXTPOOL  acltype               posix                  local
EXTPOOL  context               none                   default
EXTPOOL  fscontext             none                   default
EXTPOOL  defcontext            none                   default
EXTPOOL  rootcontext           none                   default
EXTPOOL  relatime              on                     local
EXTPOOL  redundant_metadata    all                    default
EXTPOOL  overlay               on                     default
EXTPOOL  encryption            aes-256-gcm            -
EXTPOOL  keylocation           prompt                 local
EXTPOOL  keyformat             passphrase             -
EXTPOOL  pbkdf2iters           350000                 -
EXTPOOL  encryptionroot        EXTPOOL              -
EXTPOOL  keystatus             unavailable            -
EXTPOOL  special_small_blocks  0                      default

OS:

Kernel: 5.16.12-200.fc35.x86_64
NVIDIA Driver: 510.54

The biggest difference between these pools is that the main pool has no encryption, but the external one does.

I have an additional pool in my PC, but it is my only full backup. So no, I am definitely not doing anything on it with such a broken system :).

@Gibson85
Author

PS: I see there is a kernel update, 5.16.13-200, today. Maybe the kernel was broken. I'll try again this evening with the new kernel (if it is supported by ZFS ;)).

@rincebrain
Contributor

If you've got any backtraces from the crashes, or if anything turns up in dmesg, that'd be useful as well.
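
As a starting point for digging through dmesg, here is a minimal sketch of a filter that keeps only lines likely to matter for this report. The function name `kernel_log_hits` and the pattern list (segfaults, ZFS messages, machine-check lines, SCSI `DID_ERROR` results) are assumptions chosen for illustration, not an official triage list.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print only the
# lines matching a few patterns of interest. The pattern list is an
# assumption for illustration; extend it as needed.
kernel_log_hits() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'segfault|zfs|mce|DID_ERROR' || true
}
```

It could be fed with `dmesg | kernel_log_hits`, or with `journalctl -k -b -1 | kernel_log_hits` to look at the boot before a crash, assuming the journal is persistent.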

@Gibson85
Author

Gibson85 commented Mar 14, 2022

The problem also appears with kernel 5.16.13-200.fc35.x86_64. Copying data to the external USB HDD fails with verification errors. Maybe there is nothing you can do and the kernel is broken.

Overnight I scrubbed my main pool. Thank god it has no errors. I also ran a long SMART self-test on the external HDD. It has no errors either.

I am not very used to dmesg, but this is everything shown in red or related to ZFS that I could find:

...
[   29.605658] ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts with OpRegion 0x0000000000000800-0x000000000000084F (\PMRG) (20210930/utaddress-204)
[   29.605668] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[   29.605672] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x000000000000053F (\GPS0) (20210930/utaddress-204)
[   29.605676] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[   29.605678] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x000000000000053F (\GPS0) (20210930/utaddress-204)
[   29.605682] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[   29.605683] lpc_ich: Resource conflict(s) found affecting gpio_ich
[   29.618982] videodev: Linux video capture interface: v2.00
[   29.671617] ACPI Warning: SystemIO range 0x0000000000001000-0x000000000000101F conflicts with OpRegion 0x0000000000001000-0x000000000000100F (\SMRG) (20210930/utaddress-204)
[   29.671627] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[   29.979839] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
...
[   31.126692] nvidia-gpu 0000:03:00.3: i2c timeout error e0000000
[   31.126702] ucsi_ccg 0-0008: i2c_transfer failed -110
[   31.126708] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[   31.126713] ucsi_ccg: probe of 0-0008 failed with error -110
[   31.203535] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[   34.864040] kauditd_printk_skb: 96 callbacks suppressed
[   34.864044] audit: type=1130 audit(1647271911.943:124): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=zfs-import-cache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   34.955597] audit: type=1130 audit(1647271912.035:125): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=zfs-mount comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   34.968521] audit: type=1130 audit(1647271912.048:126): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   34.986710] audit: type=1130 audit(1647271912.066:127): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   35.008808] audit: type=1130 audit(1647271912.088:128): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=import-state comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   35.119253] audit: type=1130 audit(1647271912.199:129): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   35.129941] audit: type=1334 audit(1647271912.209:130): prog-id=32 op=LOAD
[   35.130119] audit: type=1334 audit(1647271912.210:131): prog-id=33 op=LOAD
[   35.130189] audit: type=1334 audit(1647271912.210:132): prog-id=34 op=LOAD
[   35.132788] audit: type=1334 audit(1647271912.212:133): prog-id=35 op=LOAD
...
[  430.438571] gnome-terminal-[3854]: segfault at 55cfe9027000 ip 00007facf8ee1abf sp 00007ffd0a1584a0 error 4 in libXi.so.6.1.0[7facf8edf000+b000]
[  430.438596] Code: 89 9d a0 00 00 00 85 d2 0f 84 da 02 00 00 8d 72 ff 31 c0 48 c1 e6 03 eb 0d 66 90 48 8b 9d a0 00 00 00 48 83 c0 08 66 0f ef c0 <f2> 0f 2a 04 07 f2 0f 11 04 03 8b 4c 07 04 66 0f ef c0 48 8b 95 a0
[  489.388141] sd 16:0:0:0: [sdd] Synchronizing SCSI cache
[  489.518217] sd 16:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  489.530072] usb 14-1.1: USB disconnect, device number 4
[  493.912037] usb 13-1: USB disconnect, device number 2
[  494.016109] usb 14-1: USB disconnect, device number 2

I tried to upload several crash reports, but the backtrace generation always failed at some point. There are lots of problems with Fedora, I see :).

@tonyhutter
Contributor

[  489.388141] sd 16:0:0:0: [sdd] Synchronizing SCSI cache
[  489.518217] sd 16:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

⬆️ there might be a problem with your drive

@Gibson85
Author

@tonyhutter
I think this happened when I disconnected it from USB.

I have now connected the drive internally via SATA, and it shows the same behaviour, with verification errors. So it is not related to USB.

@tonyhutter
Contributor

Does zpool status -s show any errors? Does zpool events show anything unusual? Are you seeing any additional kernel errors in dmesg?

@behlendorf
Contributor

@Gibson85 can you please try setting the following module options in an /etc/modprobe.d/zfs.conf file, then load the kernel modules? This will disable some processor-specific performance optimizations which may be the root cause.

cat /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_raidz_impl=scalar
options zcommon zfs_fletcher_4_impl=scalar
options icp icp_aes_impl=generic
options icp icp_gcm_impl=generic

You can verify they're set properly by checking:

cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
cat /sys/module/icp/parameters/icp_aes_impl
cat /sys/module/icp/parameters/icp_gcm_impl

Assuming this resolves the stability issue, can you post the output of cat /proc/cpuinfo so we can debug the issue?
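
The four checks above can be rolled into one pass. This is a minimal sketch, assuming the modules are already loaded and that, when a parameter lists several choices, the active one is the entry shown in square brackets (treat that display format as an assumption). The function name `check_zfs_impls` and its optional root-directory argument are hypothetical and not part of ZFS.

```shell
#!/bin/sh
# Hypothetical helper: compare each module parameter named above with the
# value requested in /etc/modprobe.d/zfs.conf. The optional first argument
# overrides the sysfs root so the function can be tried on a copy.
check_zfs_impls() {
    root="${1:-/sys/module}"
    rc=0
    for entry in \
        zfs/zfs_vdev_raidz_impl=scalar \
        zcommon/zfs_fletcher_4_impl=scalar \
        icp/icp_aes_impl=generic \
        icp/icp_gcm_impl=generic
    do
        module="${entry%%/*}"      # text before the first "/"
        rest="${entry#*/}"         # "param=wanted"
        param="${rest%%=*}"
        want="${rest#*=}"
        file="$root/$module/parameters/$param"
        if [ ! -r "$file" ]; then
            echo "MISSING $param"
            rc=1
            continue
        fi
        value=$(cat "$file")
        # Accept either the bare value or the value marked as "[selected]".
        case "$value" in
            "$want"|*"[$want]"*) echo "OK $param ($value)" ;;
            *) echo "MISMATCH $param: wanted $want, got $value"; rc=1 ;;
        esac
    done
    return "$rc"
}
```

Calling `check_zfs_impls` with no argument reads the live `/sys/module` tree. A non-zero exit status means at least one option was missing or did not take effect.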

@Gibson85
Author

Do you know what? I booted into Windows and formatted the USB drive with NTFS (I hope you feel a bit ashamed for forcing me to do so :)). I copied a 50 GB file to it without problems. A second USB drive also works normally.

When I booted back into Fedora I had a 20-pixel-wide bar of graphics errors across the screen. I had seen this once before, after the recent kernel update.

zpool status -s
zpool events

Neither shows any errors.

@behlendorf
I created this file with the content you mentioned. Then I ran "modprobe zfs" and rebooted. After that I copied data to the external drive again, with the same behaviour as before. This time I again had a bar of pixel errors, in the middle of the screen. Maybe I should also mention that black rectangles appear around all windows, with a margin of about 50 pixels. The system then hangs heavily and I have trouble stopping the terminal from going kamikaze.

@gamanakis
Contributor

@Gibson85 Could you also report what journalctl -k | grep -i mce says?

@dioni21
Contributor

dioni21 commented Mar 15, 2022 via email

@Gibson85
Author

@Gibson85 Could you also report what journalctl -k | grep -i mce says?

In fact nothing.

No. After the fresh install I installed ZFS first making the system unstable again. After that I installed the NVIDIA driver. So I don't think it is related to the NVIDIA driver.

@Gibson85
Author

It seems I am the only one having these problems. I have heard that Fedora 36 with kernel 5.17 should be released in a few days. Hopefully this solves the issues :(.

@behlendorf
Contributor

behlendorf commented Mar 15, 2022

@Gibson85 if you haven't already, I'd just double-check that those module options got set correctly by reading the /sys/module/ files I mentioned above. It's the only patch I see in the 2.1.2 -> 2.1.3 patch stack which seems like it could explain this.

One other thing you could try to help narrow this down would be to roll back to an older kernel and verify that the 2.1.2 release runs without issue.

@AttilaFueloep
Contributor

Yeah, I'd also suggest making sure that the settings mentioned are really enabled (as explained above). If they are, I'd suggest the following. Since it seems that you are not using root on ZFS and assuming that your /home isn't on ZFS either, you could export all your pools (zpool export -a) and unload the zfs modules (modprobe -r zfs). Then you could try to copy/rsync files to an external non-zfs disk. This way we would make sure that ZFS is the culprit.
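
The isolation test above can be sketched as follows. The helper `zfs_modules_loaded` is hypothetical, and it only looks for the module names that appear in this thread (zfs, zcommon, icp), which is an assumption; other ZFS-related modules may exist on a given system.

```shell
#!/bin/sh
# Hypothetical helper: print any of the modules named in this thread
# (zfs, zcommon, icp) that still appear in the kernel's module list.
# The optional argument overrides /proc/modules for a dry run.
zfs_modules_loaded() {
    awk '$1 == "zfs" || $1 == "zcommon" || $1 == "icp" { print $1 }' \
        "${1:-/proc/modules}"
}

# Sequence described above, to be run as root. It is left commented out
# so that sourcing this file changes nothing:
#   zpool export -a
#   modprobe -r zfs
#   if [ -z "$(zfs_modules_loaded)" ]; then
#       echo "ZFS is unloaded; repeat the copy to a non-ZFS disk now"
#   fi
```

If the helper still prints a module name after `modprobe -r zfs`, the copy test would not yet be running without ZFS in the kernel.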

@comicchang

comicchang commented Mar 17, 2022

I had the same problems.
I'm using Arch with zfs-dkms and zfs-utils from the AUR. After upgrading to 2.1.3, two things happened:

  1. rsync complains with messages such as "verification failed - discarding update (trying again)", "aborting due to invalid path from sender", and "ieof packet referred to nonexistent channel".
  2. Random segmentation faults in everything. (PS: my rootfs is not ZFS.)

After some attempts, reverting zfs-dkms and zfs-utils from 2.1.3 to 2.1.2 resolved all my problems.

Other info: the CPU is an Intel(R) Celeron(R) CPU J1900.

@AttilaFueloep
Contributor

This definitely looks like a problem with FPU state handling, which is kinda strange since my change should've only affected AVX/XSAVE capable CPUs. I'll try to have a look later today.

@comicchang You're using Linux 5.16, right? Are you using zfs kmod or dkms packages? Any chance you could try archzfs/zfs-linux-git or archzfs/zfs-dkms-git? That would be great!

@Gibson85 What is the CPU model you encountered this on?

@comicchang

@AttilaFueloep
Yes, I'm using Linux 5.16.14.arch1-1 and zfs dkms (from https://aur.archlinux.org/packages/zfs-dkms).

I'll try archzfs soon.

@AttilaFueloep
Contributor

Thanks for the confirmation. Well, using zfs from the archzfs repo or via the AUR shouldn't make a difference. What I was getting at is the -git suffix, since the code in question has been simplified in current master. This could narrow down where to look.

@Lalufu
Contributor

Lalufu commented Mar 17, 2022

If it's of any help, on my Fedora 5.16.13-200.fc35.x86_64/2.1.3 system, where I seem to be unable to reproduce this, the CPU is an Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz, with the following XSAVE abilities according to dmesg:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
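For readers decoding that output: the "Enabled xstate features 0x7" value is the bitwise OR of the three feature bits listed above it (0x001, 0x002, 0x004), and the context size of 832 bytes matches xstate_offset[2] + xstate_sizes[2] = 576 + 256. A minimal user-space sketch of that decoding follows. The struct and the helper name `decode_xfeatures` are hypothetical and only cover the three bits shown in this excerpt; this is not kernel code.

```c
#include <stddef.h>

/* One entry per feature bit, copied from the dmesg excerpt above. */
struct xfeature {
	unsigned int bit;
	const char *name;
};

static const struct xfeature known_xfeatures[] = {
	{ 0x001, "x87 floating point registers" },
	{ 0x002, "SSE registers" },
	{ 0x004, "AVX registers" },
};

/*
 * Hypothetical helper: stores the names of the features enabled in
 * `mask` into `out` (at most `max` entries) and returns how many
 * were stored. Bits not listed in known_xfeatures are ignored.
 */
static size_t
decode_xfeatures(unsigned int mask, const char **out, size_t max)
{
	size_t n = 0;
	size_t count = sizeof (known_xfeatures) / sizeof (known_xfeatures[0]);

	for (size_t i = 0; i < count; i++) {
		if ((mask & known_xfeatures[i].bit) && n < max)
			out[n++] = known_xfeatures[i].name;
	}
	return (n);
}
```

Passing 0x7 yields all three names. A CPU whose dmesg instead reports "x87 FPU will use FXSAVE" has no xstate mask of this kind to decode.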

@Gibson85
Author

Gibson85 commented Mar 17, 2022

@comicchang
I had already started to believe my hardware must have an issue somewhere :).

Today I removed all unnecessary hardware and, with the ZFS modules loaded, did the following:

  • Copied a 4GB iso to a USB NTFS drive at about 75 MB/s via the Gnome UI (no problems).
  • Copied a 64GB iso to a USB NTFS drive at about 40 MB/s via rsync.
    The speed was noticeably lower. After 19 minutes it suddenly dropped to 77 KB/s.
    After a minute I cancelled rsync. The drive was no longer accessible;
    it only came back after I powered the docking station off and on.
  • Copied the 64GB iso again to the USB NTFS drive via the Gnome UI.
    Again, after a few minutes I could hear that nothing was happening anymore,
    and the windows froze like before.

/proc/cpuinfo for the first core:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Core(TM) i7 CPU         950  @ 3.07GHz
stepping	: 5
microcode	: 0x1d
cpu MHz		: 3068.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips	: 6147.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

@Gibson85
Author

Gibson85 commented Mar 17, 2022

The dmesg output looks quite wild today also. Maybe this helps somehow.

[    0.000000] microcode: microcode updated early to revision 0x1d, date = 2018-05-11
[    0.000000] Linux version 5.16.14-200.fc35.x86_64 (mockbuild@bkernel01.iad2.fedoraproject.org) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9), GNU ld version 2.37-10.fc35) #1 SMP PREEMPT Fri Mar 11 20:31:18 UTC 2022
[    0.000000] x86/fpu: x87 FPU will use FXSAVE
...
[  365.516134]  sdd: sdd1
[  365.604636] sd 16:0:0:0: [sdd] Attached SCSI disk
[  366.020276] ntfs3: Max link count 4000
[  366.020280] ntfs3: Enabled Linux POSIX ACLs support
[  366.020280] ntfs3: Read-only LZX/Xpress compression included
[  366.020829] ntfs3: Unknown parameter 'windows_names'
[ 1103.991543] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1175.030842] perf: interrupt took too long (3166 > 3127), lowering kernel.perf_event_max_sample_rate to 63000
[ 1446.029790] sd 16:0:0:0: [sdd] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD 
...
[ 1617.924369] sd 16:0:0:0: [sdd] tag#28 uas_zap_pending 0 uas-tag 29 inflight: CMD 
[ 1617.924372] sd 16:0:0:0: [sdd] tag#28 CDB: Write(10) 2a 00 07 a5 3c 90 00 04 00 00
[ 1617.925527] scsi host16: uas_eh_device_reset_handler FAILED err -19
[ 1617.925537] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925542] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925545] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925549] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925553] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925556] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925560] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925563] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925566] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925569] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925572] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925575] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925578] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925581] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925584] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925587] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925591] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925594] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925597] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925600] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925604] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925607] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925610] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925613] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925616] sd 16:0:0:0: Device offlined - not ready after error recovery
[ 1617.925681] sd 16:0:0:0: [sdd] tag#27 timing out command, waited 180s
[ 1617.925688] sd 16:0:0:0: [sdd] tag#27 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.925693] sd 16:0:0:0: [sdd] tag#27 CDB: Write(10) 2a 00 07 a5 25 08 00 04 00 00
[ 1617.925696] I/O error, dev sdd, sector 128263432 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.925705] Buffer I/O error on dev sdd1, logical block 16032673, lost async page write
[ 1617.925718] Buffer I/O error on dev sdd1, logical block 16032674, lost async page write
[ 1617.925725] Buffer I/O error on dev sdd1, logical block 16032675, lost async page write
[ 1617.925731] Buffer I/O error on dev sdd1, logical block 16032676, lost async page write
[ 1617.925735] Buffer I/O error on dev sdd1, logical block 16032677, lost async page write
[ 1617.925740] Buffer I/O error on dev sdd1, logical block 16032678, lost async page write
[ 1617.925744] Buffer I/O error on dev sdd1, logical block 16032679, lost async page write
[ 1617.925749] Buffer I/O error on dev sdd1, logical block 16032680, lost async page write
[ 1617.925754] Buffer I/O error on dev sdd1, logical block 16032681, lost async page write
[ 1617.925759] Buffer I/O error on dev sdd1, logical block 16032682, lost async page write
[ 1617.926025] sd 16:0:0:0: rejecting I/O to offline device
[ 1617.926032] I/O error, dev sdd, sector 128268432 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926037] sd 16:0:0:0: [sdd] tag#26 timing out command, waited 180s
[ 1617.926045] sd 16:0:0:0: [sdd] tag#26 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.926051] sd 16:0:0:0: [sdd] tag#26 CDB: Write(10) 2a 00 07 a5 19 08 00 04 00 00
[ 1617.926054] I/O error, dev sdd, sector 128260360 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926244] I/O error, dev sdd, sector 6278296 op 0x1:(WRITE) flags 0x100000 phys_seg 9 prio class 0
[ 1617.926267] I/O error, dev sdd, sector 6295504 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[ 1617.926282] I/O error, dev sdd, sector 6295536 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
[ 1617.926295] I/O error, dev sdd, sector 128274576 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926337] sd 16:0:0:0: [sdd] tag#25 timing out command, waited 180s
[ 1617.926345] sd 16:0:0:0: [sdd] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.926351] sd 16:0:0:0: [sdd] tag#25 CDB: Write(10) 2a 00 07 a5 15 08 00 04 00 00
[ 1617.926354] I/O error, dev sdd, sector 128259336 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926632] sd 16:0:0:0: [sdd] tag#24 timing out command, waited 180s
[ 1617.926639] sd 16:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.926645] sd 16:0:0:0: [sdd] tag#24 CDB: Write(10) 2a 00 07 a5 21 08 00 04 00 00
[ 1617.926649] I/O error, dev sdd, sector 128262408 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926764] I/O error, dev sdd, sector 128275600 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1617.926946] sd 16:0:0:0: [sdd] tag#23 timing out command, waited 180s
[ 1617.926953] sd 16:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.926959] sd 16:0:0:0: [sdd] tag#23 CDB: Write(10) 2a 00 07 a5 1d 08 00 04 00 00
[ 1617.927229] sd 16:0:0:0: [sdd] tag#22 timing out command, waited 180s
[ 1617.927237] sd 16:0:0:0: [sdd] tag#22 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.927242] sd 16:0:0:0: [sdd] tag#22 CDB: Write(10) 2a 00 07 a5 11 08 00 04 00 00
[ 1617.927521] sd 16:0:0:0: [sdd] tag#21 timing out command, waited 180s
[ 1617.927529] sd 16:0:0:0: [sdd] tag#21 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.927534] sd 16:0:0:0: [sdd] tag#21 CDB: Write(10) 2a 00 07 a5 0d 08 00 04 00 00
[ 1617.927825] sd 16:0:0:0: [sdd] tag#20 timing out command, waited 180s
[ 1617.927832] sd 16:0:0:0: [sdd] tag#20 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.927838] sd 16:0:0:0: [sdd] tag#20 CDB: Write(10) 2a 00 07 a5 05 08 00 04 00 00
[ 1617.928133] sd 16:0:0:0: [sdd] tag#19 timing out command, waited 180s
[ 1617.928141] sd 16:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.928146] sd 16:0:0:0: [sdd] tag#19 CDB: Write(10) 2a 00 07 a5 01 08 00 04 00 00
[ 1617.928415] sd 16:0:0:0: [sdd] tag#18 timing out command, waited 180s
[ 1617.928422] sd 16:0:0:0: [sdd] tag#18 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK cmd_age=181s
[ 1617.928429] sd 16:0:0:0: [sdd] tag#18 CDB: Write(10) 2a 00 07 a4 fd 08 00 04 00 00
[ 1617.928716] sd 16:0:0:0: [sdd] tag#17 timing out command, waited 180s
[ 1617.929013] sd 16:0:0:0: [sdd] tag#16 timing out command, waited 180s
[ 1617.929377] sd 16:0:0:0: [sdd] tag#7 timing out command, waited 180s
[ 1617.929650] sd 16:0:0:0: [sdd] tag#6 timing out command, waited 180s
[ 1617.932581] sd 16:0:0:0: [sdd] tag#5 timing out command, waited 180s
[ 1617.932870] sd 16:0:0:0: [sdd] tag#4 timing out command, waited 180s
[ 1617.933153] sd 16:0:0:0: [sdd] tag#3 timing out command, waited 180s
[ 1617.933433] sd 16:0:0:0: [sdd] tag#2 timing out command, waited 180s
[ 1617.933723] sd 16:0:0:0: [sdd] tag#1 timing out command, waited 180s
[ 1617.934007] sd 16:0:0:0: [sdd] tag#0 timing out command, waited 180s
[ 1618.271082] usb 10-2: USB disconnect, device number 3
...
[ 1987.186115] I/O error, dev sdd, sector 142630552 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.186123] buffer_io_error: 188898 callbacks suppressed
[ 1987.186124] Buffer I/O error on dev sdd1, logical block 17828563, lost async page write
[ 1987.186136] Buffer I/O error on dev sdd1, logical block 17828564, lost async page write
[ 1987.186140] Buffer I/O error on dev sdd1, logical block 17828565, lost async page write
[ 1987.186143] Buffer I/O error on dev sdd1, logical block 17828566, lost async page write
[ 1987.186146] Buffer I/O error on dev sdd1, logical block 17828567, lost async page write
[ 1987.186149] Buffer I/O error on dev sdd1, logical block 17828568, lost async page write
[ 1987.186152] Buffer I/O error on dev sdd1, logical block 17828569, lost async page write
[ 1987.186155] Buffer I/O error on dev sdd1, logical block 17828570, lost async page write
[ 1987.186158] Buffer I/O error on dev sdd1, logical block 17828571, lost async page write
[ 1987.186161] Buffer I/O error on dev sdd1, logical block 17828572, lost async page write
[ 1987.186324] sd 16:0:0:0: [sdd] tag#4 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.186329] sd 16:0:0:0: [sdd] tag#4 CDB: Write(10) 2a 00 08 80 46 98 00 04 00 00
[ 1987.186331] I/O error, dev sdd, sector 142624408 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.186501] sd 16:0:0:0: [sdd] tag#5 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.186505] sd 16:0:0:0: [sdd] tag#5 CDB: Write(10) 2a 00 08 80 6a 98 00 04 00 00
[ 1987.186506] I/O error, dev sdd, sector 142633624 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.186694] sd 16:0:0:0: [sdd] tag#6 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.186698] sd 16:0:0:0: [sdd] tag#6 CDB: Write(10) 2a 00 08 80 66 98 00 04 00 00
[ 1987.186700] I/O error, dev sdd, sector 142632600 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.186862] sd 16:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.186866] sd 16:0:0:0: [sdd] tag#7 CDB: Write(10) 2a 00 08 80 62 98 00 04 00 00
[ 1987.186868] I/O error, dev sdd, sector 142631576 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.187033] sd 16:0:0:0: [sdd] tag#8 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.187036] sd 16:0:0:0: [sdd] tag#8 CDB: Write(10) 2a 00 08 80 56 98 00 04 00 00
[ 1987.187038] I/O error, dev sdd, sector 142628504 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.187207] sd 16:0:0:0: [sdd] tag#9 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.187211] sd 16:0:0:0: [sdd] tag#9 CDB: Write(10) 2a 00 08 80 42 98 00 04 00 00
[ 1987.187213] I/O error, dev sdd, sector 142623384 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.187374] sd 16:0:0:0: [sdd] tag#10 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.187377] sd 16:0:0:0: [sdd] tag#10 CDB: Write(10) 2a 00 08 80 b2 98 00 04 00 00
[ 1987.187379] I/O error, dev sdd, sector 142652056 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.187566] sd 16:0:0:0: [sdd] tag#11 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.187571] sd 16:0:0:0: [sdd] tag#11 CDB: Write(10) 2a 00 08 80 ae 98 00 04 00 00
[ 1987.187572] I/O error, dev sdd, sector 142651032 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1987.187750] sd 16:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=154s
[ 1987.187753] sd 16:0:0:0: [sdd] tag#12 CDB: Write(10) 2a 00 08 80 aa 98 00 04 00 00
[ 1987.187755] I/O error, dev sdd, sector 142650008 op 0x1:(WRITE) flags 0x104000 phys_seg 128 prio class 0
[ 1988.428746] sd 16:0:0:0: [sdd] Synchronizing SCSI cache
[ 1988.558544] sd 16:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

@Gibson85 Gibson85 changed the title WARNING: COMPLETELY BROKEN UNDER FEDORA 35 WARNING: COMPLETELY BROKEN WITH KERNEL 5.16.X Mar 18, 2022
@Gibson85
Author

So I think you are heading in the right direction to fix this issue. But I don't expect a fixed version soon, so can you please say which 5.16.x kernel version is safe/stable to use in the meantime?

@mjevans

mjevans commented Mar 18, 2022

I just upgraded to 2.1.3 and am currently in the process of shuffling block devices around; a full scrub followed by resilvering with each block device swap. So far at least I haven't seen any of these issues.

One of the more interesting Google results I ran across talks about how the Linux kernel (at least in 2015) disables use of XSAVES because it uses a 'compacted' format, while XSAVE and XSAVEOPT use a standard format. https://lore.kernel.org/lkml/tip-65ac2e9baa7deebe3e9588769d44d85555e05619@git.kernel.org/

Checking my CPU's flags, I see only xsave and xsaveopt.

tr '\040' '\n' < /proc/cpuinfo | grep xsave | sort -u

The code here, however, checks for XSAVES first and uses it if available, which might be where your system is tripping up? https://github.com/openzfs/zfs/blob/master/include/os/linux/kernel/linux/simd_x86.h
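One caveat with the flag check: a plain substring match for "xsave" also matches "xsaveopt", so deciding whether a CPU has XSAVE itself needs a whole-word comparison. Below is a small user-space sketch of such a check against the "flags" line of /proc/cpuinfo. The helper name `has_cpu_flag` is hypothetical and the flag strings in the usage note are shortened samples, not a full cpuinfo line.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `flag` occurs in `flags` as a
 * whole word (delimited by spaces, tabs, a newline, or the ends of
 * the string), and 0 otherwise.
 */
static int
has_cpu_flag(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	if (len == 0)
		return (0);

	while ((p = strstr(p, flag)) != NULL) {
		/* A match only counts if it is not part of a longer word. */
		int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[len] == '\0' || p[len] == ' ' ||
		    p[len] == '\t' || p[len] == '\n';

		if (starts && ends)
			return (1);
		p++;
	}
	return (0);
}
```

With this, `has_cpu_flag("fpu fxsr sse sse2", "xsave")` returns 0 while `has_cpu_flag("fpu fxsr sse sse2", "fxsr")` returns 1, which is the combination reported for the Core i7 950 earlier in the thread.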

@mjevans

mjevans commented Mar 18, 2022

I noticed Gibson85 posted one core of their /proc/cpuinfo earlier. They don't have XSAVE at all, but do have FXSR.

https://github.com/openzfs/zfs/blob/master/include/os/linux/kernel/linux/simd_x86.h#L382

Am I reading the current code correctly?

static inline void
kfpu_end(void)
{
	uint8_t  *state = zfs_kfpu_fpregs[smp_processor_id()];
#if defined(HAVE_XSAVES)
	if (static_cpu_has(X86_FEATURE_XSAVES)) {
		kfpu_do_xrstor("xrstors", state, ~0);
		goto out;
	}
#endif
	if (static_cpu_has(X86_FEATURE_XSAVE)) {
		kfpu_do_xrstor("xrstor", state, ~0);
	} else if (static_cpu_has(X86_FEATURE_FXSR)) {
		kfpu_save_fxsr(state);
	} else {
		kfpu_save_fsave(state);
	}
out:
	local_irq_enable();
	preempt_enable();

}

The non-XSAVE paths are trying to SAVE the state again rather than RESTORE it? E.g., on line 382, shouldn't this be kfpu_restore_fxsr, and on line 384 kfpu_restore_fsave?

@behlendorf
Contributor

You're reading that correctly, and that is the right fix. Let me open a PR with the needed change. I'm not sure how we missed that when reviewing the change, but thank you for calling it out.
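To illustrate why saving instead of restoring breaks user space, here is a toy user-space model of the begin/use/end cycle discussed above. It is only a sketch under stated assumptions: the "registers" are a plain byte array, the types and the functions `toy_save`, `toy_restore`, `kfpu_end_op` and `user_state_survives` are hypothetical stand-ins rather than the real kernel helpers, and the XSAVES branch is left out for brevity.

```c
#include <string.h>

/* Toy stand-ins for the FPU register file and the per-CPU save area. */
struct fpu_regs { unsigned char bytes[16]; };
struct cpu_caps { int has_xsave; int has_fxsr; };

enum end_op { OP_XRSTOR, OP_RESTORE_FXSR, OP_RESTORE_FSAVE };

/*
 * Dispatch for the end of a kernel FPU section with the correction
 * proposed in the thread: every branch restores, none saves.
 */
static enum end_op
kfpu_end_op(struct cpu_caps caps)
{
	if (caps.has_xsave)
		return (OP_XRSTOR);
	if (caps.has_fxsr)
		return (OP_RESTORE_FXSR);
	return (OP_RESTORE_FSAVE);
}

static void
toy_save(const struct fpu_regs *regs, struct fpu_regs *state)
{
	*state = *regs;
}

static void
toy_restore(struct fpu_regs *regs, const struct fpu_regs *state)
{
	*regs = *state;
}

/*
 * Runs one begin/use/end cycle and returns 1 if the user-space
 * register contents are intact afterwards, 0 if they were lost.
 */
static int
user_state_survives(int end_restores)
{
	struct fpu_regs regs, state, user;

	memset(&user, 0xAA, sizeof (user));	/* values owned by user space */
	regs = user;
	toy_save(&regs, &state);		/* begin: save the user state */
	memset(&regs, 0x55, sizeof (regs));	/* kernel SIMD code clobbers regs */

	if (end_restores)
		toy_restore(&regs, &state);	/* corrected end: restore */
	else
		toy_save(&regs, &state);	/* flawed end: saves again */

	return (memcmp(&regs, &user, sizeof (regs)) == 0);
}
```

In this model `user_state_survives(0)` returns 0 because the process resumes with the kernel's register contents, which is consistent with the random crashes reported on CPUs without XSAVE, while `user_state_survives(1)` returns 1.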

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 19, 2022
Commit 3b52ccd introduced a flaw where FSR and FSAVE are not restored
when using a Linux 5.16 kernel.  These instructions are only used when
XSAVE is not supported by the processor meaning only some systems will
encounter this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#13210
behlendorf added a commit that referenced this issue Mar 19, 2022
Commit 3b52ccd introduced a flaw where FSR and FSAVE are not restored
when using a Linux 5.16 kernel.  These instructions are only used when
XSAVE is not supported by the processor meaning only some systems will
encounter this issue.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13210
Closes #13236
@behlendorf
Contributor

@Gibson85 if you stick with the 5.14 kernel for now you should be able to use zfs-2.1.3 without encountering this issue. We'll get this fix applied to 2.1.4 and when it's released you'll be able to update the kernel.

@mjevans

mjevans commented Mar 19, 2022

@Gibson85 Behlendorf has also merged the patch, which is now in the current git (latest 'unstable') version of ZFS.

https://github.com/openzfs/zfs/blob/master/include/os/linux/kernel/linux/simd_x86.h

If for some reason you'd rather run the pre-release version, your OS distribution might have a package such as zfs-dkms-git or similar that will pull in both this fix and any other pending changes automatically.

The status of the zfs-2.1.4-staging git tag can be found here: #13235. You'll note some of its integration tests need to be re-run as well, though they also need some shepherding. When that pull request is completed you could also build zfs-2.1.4 from the source tag. (Though patching 2.1.3 yourself with the simple two-line change is going to be the fastest and safest solution.)

@Lalufu
Contributor

Lalufu commented Mar 20, 2022

Just out of curiosity, is 2.1.4 going to be released soon to deal with this issue, or is it going to be another three months?

@tonyhutter
Contributor

@Lalufu #13242

@colemickens

Thanks for the investigation and fix! And my condolences to other folks running under Hyper-V (which seems to lack XSAVE support), I was only mostly wrong to blame it for the corruption.

@tonyhutter
Contributor

The fix for this is included in zfs-2.1.4, which was just released:
https://github.com/openzfs/zfs/releases/tag/zfs-2.1.4

KenMacD added a commit to KenMacD/etc-nixos that referenced this issue Mar 23, 2022