Description
Problem
We tested OpenZFS versions 2.1.13-2.1.15 and 2.2.2-2.2.3 on CentOS 8 Stream
with kernel versions ranging from 4.18.0-408 to 4.18.0-547, on an
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with 8GB ECC RAM. During disk
usage stress testing we encountered a memory consumption issue that leads to a kernel panic.
Test setup
The zpools were created with several configurations:
- Prior to 2.2.3: compression disabled
- 2.2.3: compression set to default
- All versions: one variant with all defaults, and one with ashift=12 and autoexpand=on (planned for use), with all other properties left at defaults
Pools are non-mirrored and built from block devices of varying sizes, ranging
from 147GB to 6.7TB, with consistent speed and throughput.
The test involves running multiple writers to fill the disk with random-sized
files ranging from 1KB to 2GB. Once the disks are filled, all files are
removed, and the process is repeated.
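The exact test harness is not included in this report. As a rough illustration only, one fill/remove cycle could be sketched as below. The function names (`random_size`, `fill_dir`, `run_cycle`), the file naming scheme, and the default of 4 writers are all hypothetical; only the 1KB to 2GB size range comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of one fill/remove cycle; not the actual test harness.

MIN_BYTES=${MIN_BYTES:-1024}                 # 1KB lower bound from the report
MAX_BYTES=${MAX_BYTES:-$((2 * 1024 ** 3))}   # 2GB upper bound from the report

# Print a pseudo-random size in [MIN_BYTES, MAX_BYTES].
random_size() {
    local span=$((MAX_BYTES - MIN_BYTES + 1))
    # $RANDOM is only 15 bits wide, so combine three draws to cover 2GB.
    local r=$(( (RANDOM << 30) | (RANDOM << 15) | RANDOM ))
    echo $((MIN_BYTES + r % span))
}

# Write random-sized files into directory $1, tagged with writer id $2,
# until a write fails (e.g. disk full) or $3 files were written (0 = no cap).
# Prints the number of files written successfully.
fill_dir() {
    local dir=$1 tag=$2 max_files=${3:-0} n=0 size
    while [ "$max_files" -eq 0 ] || [ "$n" -lt "$max_files" ]; do
        size=$(random_size)
        head -c "$size" /dev/urandom > "$dir/file-$tag-$n" 2>/dev/null || break
        n=$((n + 1))
    done
    echo "$n"
}

# One cycle: $2 parallel writers fill directory $1, then all files are removed.
run_cycle() {
    local dir=$1 writers=${2:-4} max_files=${3:-0} i
    for ((i = 0; i < writers; i++)); do
        fill_dir "$dir" "w$i" "$max_files" > /dev/null &
    done
    wait
    rm -f "$dir"/file-*
}
```

Under these assumptions, `run_cycle /mnt/dir2 4` would run one cycle with four writers, and wrapping it in `while :; do ...; done` would repeat the process. Passing a third argument caps the number of files per writer, which is useful for a dry run on a scratch directory.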
Observed issue
In all tested versions, and most severely in versions prior to 2.2.3,
removing files causes memory usage to spike until all available memory is
consumed. The OOM killer then activates in an attempt to free memory, and the
kernel panics once there is nothing left for the OOM killer to release.
With 8GB RAM, the issue consistently occurs in every test instance before
version 2.2.3, with a decreased frequency in version 2.2.3 (5 out of 20 CentOS
test instances experienced kernel panics).
- Increasing RAM to 32GB sometimes mitigates the issue.
- Removal of approximately 500GB of small files consumes around 20GB of memory on a 32GB machine.
- Memory is predominantly consumed by the zio_buf_... and zio_cache slab caches.
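For anyone trying to reproduce these observations, a small sampler run in a second terminal during the `rm` can capture how memory evolves. The following is a minimal sketch, not the tooling used for this report: `meminfo_kb` and `sample_memory` are hypothetical helper names, the 5 second interval is arbitrary, and it assumes the zio caches are visible in `/proc/spl/kmem/slab` or `/proc/slabinfo` (which file lists them may depend on the build, and reading them may require root).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sampling memory while files are being removed.

# Print the value (in kB) of one /proc/meminfo field, e.g. MemAvailable.
# The optional second argument allows parsing a saved copy instead.
meminfo_kb() {
    local field=$1 file=${2:-/proc/meminfo}
    awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Log one timestamped sample per interval for pool $1 until interrupted.
sample_memory() {
    local pool=$1 interval=${2:-5}
    while :; do
        # "freeing" is the pool property for space still pending release.
        printf '%s MemAvailable=%skB SUnreclaim=%skB freeing=%s\n' \
            "$(date +%T)" \
            "$(meminfo_kb MemAvailable)" \
            "$(meminfo_kb SUnreclaim)" \
            "$(zpool get -Hp -o value freeing "$pool")"
        # Assumption: zio caches show up in one of these two files.
        grep -h '^zio_' /proc/spl/kmem/slab /proc/slabinfo 2>/dev/null
        sleep "$interval"
    done
}
```

For the pool shown in the logs below, this would be started with `sample_memory mnt 5 | tee /tmp/mem.log` before issuing the `rm`.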
Logs
Machine info
The instance below is the only one I still have available for testing right now:
# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"
# zfs --version
zfs-2.2.3-1
zfs-kmod-2.2.3-1
# dmidecode -t memory
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Unknown
Maximum Capacity: 8 GB
Error Information Handle: Not Provided
Number Of Devices: 1
Handle 0x0009, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x0008
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 8 GB
Form Factor: DIMM
Set: None
Locator: Not Specified
Bank Locator: Not Specified
Type: DDR4
Type Detail: Static Column Pseudo-static Synchronous Window DRAM
Speed: 2933 MT/s
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Rank: Unknown
Configured Memory Speed: Unknown
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 2999.998
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
# uname -srm
Linux 4.18.0-540.el8.x86_64 x86_64
Issue demo
total 295G
-rw-r--r--. 1 root root 2.0G Mar 20 02:45 Swordsman-13505
-rw-r--r--. 1 root root 2.0G Mar 20 03:37 Swordsman-13554
....
-rw-r--r--. 1 root root 2.0G Mar 20 19:32 Swordsman-14370
-rw-r--r--. 1 root root 884M Mar 20 19:33 Swordsman-14371
# ll | wc -l
138
# du -ch
295G .
295G total
# rm -f *
#
frclient_loop: send disconnect: Broken pipe
After reconnecting over SSH, the directory still contains all the files:
138
# du -ch /mnt/dir2
295G /mnt/dir2
295G total
zpool status
pool: mnt
state: ONLINE
remove: Removal of vdev 19 copied 28.6G in 0h3m, completed on Thu Mar 21 20:28:28 2024
14.7M memory used for removed device mappings
config:
NAME STATE READ WRITE CKSUM
mnt ONLINE 0 0 0
nvme4n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
errors: No known data errors
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
mnt 3.05T 2.52T 549G - - 0% 82% 1.00x ONLINE -
indirect-0 - - - - - - - - ONLINE
indirect-1 - - - - - - - - ONLINE
indirect-2 - - - - - - - - ONLINE
indirect-3 - - - - - - - - ONLINE
indirect-4 - - - - - - - - ONLINE
indirect-5 - - - - - - - - ONLINE
indirect-6 - - - - - - - - ONLINE
indirect-7 - - - - - - - - ONLINE
indirect-8 - - - - - - - - ONLINE
indirect-9 - - - - - - - - ONLINE
indirect-10 - - - - - - - - ONLINE
indirect-11 - - - - - - - - ONLINE
indirect-12 - - - - - - - - ONLINE
indirect-13 - - - - - - - - ONLINE
indirect-14 - - - - - - - - ONLINE
indirect-15 - - - - - - - - ONLINE
indirect-16 - - - - - - - - ONLINE
nvme4n1 2.93T 2.47T 469G - - 0% 84.3% - ONLINE
nvme3n1 25.0G 24.3G 169M - - 0% 99.3% - ONLINE
indirect-19 - - - - - - - - ONLINE
nvme1n1 25.0G 24.3G 232M - - 22% 99.1% - ONLINE
nvme2n1 80.0G 680K 79.5G - - 0% 0.00% - ONLINE
zpool config
NAME PROPERTY VALUE SOURCE
mnt size 3.05T -
mnt capacity 82% -
mnt altroot - default
mnt health ONLINE -
mnt guid 8946787721482689307 -
mnt version - default
mnt bootfs - default
mnt delegation on default
mnt autoreplace off default
mnt cachefile - default
mnt failmode wait default
mnt listsnapshots off default
mnt autoexpand on local
mnt dedupratio 1.00x -
mnt free 549G -
mnt allocated 2.52T -
mnt readonly off -
mnt ashift 12 local
mnt comment - default
mnt expandsize - -
mnt freeing 0 -
mnt fragmentation 0% -
mnt leaked 0 -
mnt multihost off default
mnt checkpoint - -
mnt load_guid 17249711793930708177 -
mnt autotrim off default
mnt compatibility off default
mnt bcloneused 0 -
mnt bclonesaved 0 -
mnt bcloneratio 1.00x -
mnt feature@async_destroy enabled local
mnt feature@empty_bpobj enabled local
mnt feature@lz4_compress active local
mnt feature@multi_vdev_crash_dump enabled local
mnt feature@spacemap_histogram active local
mnt feature@enabled_txg active local
mnt feature@hole_birth active local
mnt feature@extensible_dataset active local
mnt feature@embedded_data active local
mnt feature@bookmarks enabled local
mnt feature@filesystem_limits enabled local
mnt feature@large_blocks enabled local
mnt feature@large_dnode enabled local
mnt feature@sha512 enabled local
mnt feature@skein enabled local
mnt feature@edonr enabled local
mnt feature@userobj_accounting active local
mnt feature@encryption enabled local
mnt feature@project_quota active local
mnt feature@device_removal active local
mnt feature@obsolete_counts active local
mnt feature@zpool_checkpoint enabled local
mnt feature@spacemap_v2 active local
mnt feature@allocation_classes enabled local
mnt feature@resilver_defer enabled local
mnt feature@bookmark_v2 enabled local
mnt feature@redaction_bookmarks enabled local
mnt feature@redacted_datasets enabled local
mnt feature@bookmark_written enabled local
mnt feature@log_spacemap active local
mnt feature@livelist enabled local
mnt feature@device_rebuild enabled local
mnt feature@zstd_compress enabled local
mnt feature@draid enabled local
mnt feature@zilsaxattr enabled local
mnt feature@head_errlog active local
mnt feature@blake3 enabled local
mnt feature@block_cloning enabled local
mnt feature@vdev_zaps_v2 active local
zfs config
NAME PROPERTY VALUE SOURCE
mnt type filesystem -
mnt creation Sun Mar 10 13:54 2024 -
mnt used 2.52T -
mnt available 451G -
mnt referenced 2.52T -
mnt compressratio 1.00x -
mnt mounted yes -
mnt quota none default
mnt reservation none default
mnt recordsize 128K default
mnt mountpoint /mnt local
mnt sharenfs off default
mnt checksum on default
mnt compression on default
mnt atime on default
mnt devices on default
mnt exec on default
mnt setuid on default
mnt readonly off default
mnt zoned off default
mnt snapdir hidden default
mnt aclmode discard default
mnt aclinherit restricted default
mnt createtxg 1 -
mnt canmount on default
mnt xattr on default
mnt copies 1 default
mnt version 5 -
mnt utf8only off -
mnt normalization none -
mnt casesensitivity sensitive -
mnt vscan off default
mnt nbmand off default
mnt sharesmb off default
mnt refquota none default
mnt refreservation none default
mnt guid 11115806655719226472 -
mnt primarycache all default
mnt secondarycache all default
mnt usedbysnapshots 0B -
mnt usedbydataset 2.52T -
mnt usedbychildren 120M -
mnt usedbyrefreservation 0B -
mnt logbias latency default
mnt objsetid 54 -
mnt dedup off default
mnt mlslabel none default
mnt sync standard default
mnt dnodesize legacy default
mnt refcompressratio 1.00x -
mnt written 2.52T -
mnt logicalused 2.35T -
mnt logicalreferenced 2.35T -
mnt volmode default default
mnt filesystem_limit none default
mnt snapshot_limit none default
mnt filesystem_count none default
mnt snapshot_count none default
mnt snapdev hidden default
mnt acltype off default
mnt context none default
mnt fscontext none default
mnt defcontext none default
mnt rootcontext none default
mnt relatime on default
mnt redundant_metadata all default
mnt overlay on default
mnt encryption off default
mnt keylocation none default
mnt keyformat none default
mnt pbkdf2iters 0 default
mnt special_small_blocks 0 default