Sometimes a delay of up to zfs_txg_timeout seconds when doing ftruncate #14290

Open
Safari77 opened this issue Dec 15, 2022 · 10 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@Safari77

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 36
Kernel Version 5.15.82
Architecture x86_64
OpenZFS Version 2.1.7

Describe the problem you're observing

Usually rpm commands execute quickly, but fairly often (in about 10% of cases) there is an extra delay of about 4 seconds:

16:22:55.694445 openat(AT_FDCWD, "/usr/lib/sysimage/rpm/rpmdb.sqlite-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6 <0.000086>
16:22:55.694563 newfstatat(6, "", {st_dev=makedev(0, 0x1e), st_ino=27231358, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072, st_blocks=18, st_size=32768, st_atime=1671114175 /* 2022-12-15T16:22:55.105109260+0200 */, st_atime_nsec=105109260, st_mtime=1671114175 /* 2022-12-15T16:22:55.106109257+0200 */, st_mtime_nsec=106109257, st_ctime=1671114175 /* 2022-12-15T16:22:55.106109257+0200 */, st_ctime_nsec=106109257}, AT_EMPTY_PATH) = 0 <0.000011>
16:22:55.694692 geteuid()       = 0 <0.000010>
16:22:55.694759 fchown(6, 0, 0) = 0 <0.000047>
16:22:55.694855 fcntl(6, F_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=128, l_len=1, l_pid=0}) = 0 <0.000010>
16:22:55.694909 fcntl(6, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=128, l_len=1}) = 0 <0.000009>
16:22:55.694953 ftruncate(6, 3) = 0 <4.372489>

I have ZFS as the root filesystem, running on top of LUKS2 on an NVMe Corsair MP510 (max random write, QD32 IOMeter: up to 440K IOPS), with practically zero other I/O when I run rpm.

Describe how to reproduce the problem

Just try rpm again.
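
For a standalone test outside rpm, here is a minimal sketch (not part of the original report) that mimics the syscall pattern in the strace above and times the ftruncate; the file path is hypothetical and should point at a file on the affected dataset:

/* Hypothetical reproducer: mirrors the strace above (open, byte-range lock,
 * ftruncate to 3 bytes) and reports how long the ftruncate() takes.
 * Build with: cc -o trunc-test trunc-test.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Hypothetical path; use any file on the affected dataset. */
    const char *path = argc > 1 ? argv[1] : "/tank/ftruncate-test.shm";
    char buf[32768];
    memset(buf, 0, sizeof(buf));

    int fd = open(path, O_RDWR | O_CREAT | O_NOFOLLOW | O_CLOEXEC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Pre-fill to 32 KiB, roughly the size of rpmdb.sqlite-shm above. */
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) { perror("pwrite"); return 1; }

    /* Take the same byte-range write lock seen in the strace. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 128, .l_len = 1 };
    if (fcntl(fd, F_SETLK, &fl) < 0) { perror("fcntl"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (ftruncate(fd, 3) < 0) { perror("ftruncate"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("ftruncate(fd, 3) took %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    close(fd);
    unlink(path);
    return 0;
}

Since the delay is intermittent, running this in a loop (or under strace -T, as above) should occasionally show a multi-second ftruncate.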

Include any warning/errors/backtraces from the system logs

zpool reports no errors, and the dataset properties are:

NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Wed Nov 10 15:40 2021  -
tank  used                  1.39T                  -
tank  available             296G                   -
tank  referenced            96K                    -
tank  compressratio         1.29x                  -
tank  mounted               no                     -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           zstd-3                 local
tank  atime                 on                     default
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclmode               discard                default
tank  aclinherit            restricted             default
tank  createtxg             1                      -
tank  canmount              off                    local
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              on                     -
tank  normalization         formD                  -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  guid                  8575689710526589949    -
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0B                     -
tank  usedbydataset         96K                    -
tank  usedbychildren        1.39T                  -
tank  usedbyrefreservation  0B                     -
tank  logbias               latency                default
tank  objsetid              54                     -
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  dnodesize             auto                   local
tank  refcompressratio      1.00x                  -
tank  written               0                      -
tank  logicalused           1.77T                  -
tank  logicalreferenced     42K                    -
tank  volmode               default                default
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               posix                  local
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              on                     local
tank  redundant_metadata    all                    default
tank  overlay               on                     default
tank  encryption            off                    default
tank  keylocation           none                   default
tank  keyformat             none                   default
tank  pbkdf2iters           0                      default
tank  special_small_blocks  0                      default
@amarshall

May I suggest the title and description of this issue be altered to make it more generic? It appears to not be specific to RPM, and is at least a more general issue with sqlite, and likely even more generally just ftruncate. There are a few linked issues above where other uses of sqlite on ZFS see this problem.

@farcaller

I can trivially reproduce this in my setup (ZFS over LUKS2 as well). Setting sync=disabled helps, but that's barely a workaround. OpenZFS 2.2.3 on kernel 6.6.28, x86_64.

@Safari77
Author

May I suggest the title and description of this issue be altered to make it more generic? It appears to not be specific to RPM, and is at least a more general issue with sqlite, and likely even more generally just ftruncate. There are a few linked issues above where other uses of sqlite on ZFS see this problem.

Would you say it is just ftruncate or also fsync?

@kbknapp

kbknapp commented May 10, 2024

I think it's only ftruncate, but I'm not 100% certain. The reason I believe that is that, while investigating a similar issue in atuin (atuinsh/atuin#952), the problem did not happen with fsync alone. For context, the issue only triggers when sqlite is using WAL mode and is set to "truncate on commit", which is when it emits the ftruncate.

Quoting myself from that issue:

[..] looking at the sqlite source it appears the WAL can be set to "truncate on commit", which issues ftruncate followed by osFcntl(walfd, F_FULLSYNC). I'm assuming it means "write transaction commit" which is what atuin history start is doing.
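
In syscall terms, that truncate-on-commit path can be approximated outside sqlite with a rough sketch like the one below (not sqlite's actual code; fsync() stands in for the sync step, and the path and loop counts are made up):

/* Rough approximation of sqlite's WAL "truncate on commit": append a few
 * "frames" to a scratch file, then on each simulated commit truncate it back
 * to zero and sync, timing the truncate. Loop so the intermittent stall has a
 * chance to appear. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tank/wal-test"; /* hypothetical: any file on the affected dataset */
    char page[4096];
    memset(page, 0xab, sizeof(page));

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 300; i++) {
        /* Simulate WAL frames written during a transaction. */
        for (int j = 0; j < 8; j++)
            pwrite(fd, page, sizeof(page), (off_t)j * sizeof(page));

        /* "Commit": truncate the WAL-like file back to empty, then sync. */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ftruncate(fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fsync(fd);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs > 0.1)
            printf("iteration %d: ftruncate took %.3f s\n", i, secs);
        sleep(1); /* space commits out so they land at different points in the txg cycle */
    }

    close(fd);
    unlink(path);
    return 0;
}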

@farcaller

FWIW, I was trying (albeit unsuccessfully) to repro this with sqlite alone, which is funny, given how easy it is to reproduce with atuin:
[screenshot: reproduction of the delay with atuin]

@kbknapp

kbknapp commented May 10, 2024

Not sure if it makes a difference, but atuin also sets PRAGMA synchronous

@kbknapp

kbknapp commented May 10, 2024

Oh also, atuin is doing a full transaction for that insert - just doing a single bare insert doesn't trigger the issue from what I can tell.

@farcaller

I wasn't yet able to trigger it with PRAGMA synchronous=FULL; PRAGMA journal_mode=WAL; PRAGMA wal_checkpoint(TRUNCATE); begin transaction; insert into a (i) values (1); select count(*) from a; commit;. I've spent enough time pressing up-arrow and enter that doing this with a script would be more efficient by now.

It's interesting that it comes in waves for atuin. Sometimes there are minutes without the issue, and then it triggers several times in a row. The load average of this machine is negligible (0.44 0.73 0.75 on a 12-core box), and RAM is mostly free. There's a VM slowly spinning (but mostly doing nothing) in the background. iotop shows mostly journald and zvol activity.
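
To replace the manual up-arrow loop, a sketch using the sqlite3 C API (hypothetical paths and thresholds; it simply repeats the statements quoted above and reports slow commits):

/* Hypothetical automation of the manual loop above: same pragmas, repeated
 * insert/commit cycle, printing any commit that takes suspiciously long.
 * Build with: cc -o sqlite-repro sqlite-repro.c -lsqlite3 */
#define _GNU_SOURCE
#include <sqlite3.h>
#include <stdio.h>
#include <time.h>

static void run(sqlite3 *db, const char *sql)
{
    char *err = NULL;
    if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "%s: %s\n", sql, err);
        sqlite3_free(err);
    }
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("/tank/repro.db", &db) != SQLITE_OK) /* hypothetical path */
        return 1;

    run(db, "PRAGMA synchronous=FULL;");
    run(db, "PRAGMA journal_mode=WAL;");
    run(db, "CREATE TABLE IF NOT EXISTS a (i INTEGER);");

    for (int n = 0; n < 1000; n++) {
        run(db, "PRAGMA wal_checkpoint(TRUNCATE);");

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run(db, "BEGIN; INSERT INTO a (i) VALUES (1); SELECT count(*) FROM a; COMMIT;");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs > 0.5)
            printf("iteration %d: commit took %.3f s\n", n, secs);
    }

    sqlite3_close(db);
    return 0;
}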

Safari77 changed the title from "Sometimes excessive delay truncating rpmdb.sqlite-shm" to "Sometimes a delay of several seconds when doing ftruncate" on May 10, 2024
@kbknapp

kbknapp commented May 10, 2024

I'm going to say some things I don't really understand in the hopes that it triggers something in someone who knows what they're doing...

I hear the ZIL has a default "flush" interval of 5s, set by the module param zfs_txg_timeout. It seems suspicious to me that the atuin issue can induce delays of up to around 5s. This also seems to track with other RPM/flatpak issues I've seen on ZFS that use sqlite internally (although I haven't investigated whether they're using WAL or doing any truncating... so this may just be anecdotal).

I'm wondering if maybe the ZIL must flush txgs before the call returns, so sometimes it completes in the low milliseconds if that timer was already about to fire, while other times, if the ftruncate happens just after a flush, it's nearly another 5s before it returns?

Looking at /proc/spl/kstat/zfs/<zpool>/txgs gives info about the txgs, but I'm not familiar enough with ZFS to know what I'm looking at.
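
One way to line a slow call up against that kstat (a sketch, not a known-good tool; the pool name "tank" and the file path are assumed) is to dump the file immediately before and after a timed ftruncate:

/* Hypothetical helper: print /proc/spl/kstat/zfs/<pool>/txgs right before and
 * after a timed ftruncate(), so a slow call can be correlated with txg
 * open/quiesce/sync activity. May truncate very long txg histories. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void dump_txgs(const char *tag)
{
    char buf[8192];
    int fd = open("/proc/spl/kstat/zfs/tank/txgs", O_RDONLY); /* pool name assumed */
    if (fd < 0) { perror("open txgs"); return; }
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n > 0) {
        buf[n] = '\0';
        printf("--- txgs %s ---\n%s", tag, buf);
    }
}

int main(void)
{
    int fd = open("/tank/ftruncate-test", O_RDWR | O_CREAT, 0644); /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    dump_txgs("before");

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ftruncate(fd, 3);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    dump_txgs("after");
    printf("ftruncate took %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    close(fd);
    return 0;
}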

@Nemo157

Nemo157 commented May 10, 2024

I hear the ZIL has a default "flush" interval of 5s, set by the module param zfs_txg_timeout. It seems suspicious to me that the atuin issue can induce delays of up to around 5s.

I previously tested this: changing that timeout to 20s made atuin's delays start ranging from 0 to 20s, so I definitely think it is related.
