os: increase os.ReadDir() dirent buffer from 8KiB to 32KiB #64597
Labels
NeedsInvestigation
Performance
Proposal Details
Propose to increase the dirent buffer blockSize from 8KiB to 32KiB in src/os/dir_unix.go for os.ReadDir() to align with glibc [1] [2], as they both wrap the getdents64() syscall in Linux systems.
Background
Machine: 4 core 8 GiB VM via UTM on MacBook Pro, Apple M2 Pro arm64 CPU, native virtualization
OS: Fedora 38 (tried CentOS 9, Ubuntu 22), Linux 6.5.10 (tried V1 patch in [3]), 4 KiB page size (tried 16 KiB)
Go: go version go1.20.10 linux/arm64
I was trying out continuous profiling tools (pyroscope, inspektor-gadget) and found that os.ReadDir("/proc") sometimes dominates the CPU time while overall CPU utilization is low (<= 5%). Here is the flamegraph.
Observation
os.ReadDir() is a wrapper of getdents64(), like glibc. This can be confirmed by tracing (the 2nd column of the trace is latency).
The monitoring daemon calls os.ReadDir("/proc") every 1s to get all process IDs, and latency spikes (1ms to 2ms) can be observed. The dirent buffer size is 8KiB, so one os.ReadDir() results in 3 getdents64() calls to get all the PIDs (roughly 300).
Tracing ps shows how glibc behaves. glibc uses a 32KiB dirent buffer [1] [2], which results in 2 getdents64() calls, i.e. fewer syscalls. Latency is lower, probably because of directory cache (dcache) hits warmed up by the monitoring daemon every 1s; see the reproduce program traces below.
Reproduce
This os.ReadDir() hotspot can be reproduced without the monitoring system.
Run the reproduce program and the monitoring daemons at the same time.
Stop the monitoring daemons and run the program alone (the 8KiB buffer is sufficient when there are fewer threads).
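A minimal sketch of such a reproduce program, not the original one: it assumes the program simply times os.ReadDir("/proc") once per second, matching the 1s period of the monitoring daemon. The helpers direntRecLen and fillsNeeded are hypothetical names introduced here. They estimate how many getdents64() calls return data for a given buffer size, based on the linux_dirent64 record layout (19 bytes of fixed fields plus the NUL-terminated name, rounded up to 8 bytes). The number of rounds is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// direntRecLen returns the size of one linux_dirent64 record for name:
// 19 bytes of fixed fields (d_ino 8, d_off 8, d_reclen 2, d_type 1),
// the name and its NUL terminator, rounded up to a multiple of 8.
func direntRecLen(name string) int {
	const fixed = 19
	return (fixed + len(name) + 1 + 7) &^ 7
}

// fillsNeeded estimates how many getdents64() calls return data when the
// records for names are packed into a buffer of bufSize bytes. The final
// call that returns 0 (end of directory) is not counted.
func fillsNeeded(names []string, bufSize int) int {
	calls, used := 0, 0
	for _, n := range names {
		r := direntRecLen(n)
		if calls == 0 || used+r > bufSize {
			calls++ // start a new buffer fill
			used = 0
		}
		used += r
	}
	return calls
}

func main() {
	const (
		rounds   = 5           // assumed number of samples
		interval = time.Second // same 1s period as the monitoring daemon
	)
	for i := 0; i < rounds; i++ {
		start := time.Now()
		entries, err := os.ReadDir("/proc")
		elapsed := time.Since(start)
		if err != nil {
			fmt.Fprintln(os.Stderr, "ReadDir:", err)
			return
		}
		// os.ReadDir omits "." and "..", which the kernel still returns.
		names := []string{".", ".."}
		for _, e := range entries {
			names = append(names, e.Name())
		}
		fmt.Printf("entries=%d latency=%v fills(8KiB)=%d fills(32KiB)=%d\n",
			len(entries), elapsed,
			fillsNeeded(names, 8<<10), fillsNeeded(names, 32<<10))
		time.Sleep(interval)
	}
}
```

The fill counts are only an estimate from record packing. The kernel may return fewer records per call than would fit, and one more getdents64() call returning 0 is needed to detect the end of the directory. The actual syscall count should be checked with a tracer.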
Bottom-up analysis
System level
Some search results [3] [4] show that Linux procfs has a scalability issue in proc_pid_readdir() when there are many threads. Here it is a 4 core 8 GiB arm64 VM with 300 threads, and the system is idle without any workloads yet. I did some verifications:
On other VMs, os.ReadDir("/proc")'s max latency is lower than on the desktop VM: < 0.28ms on an AMD Epyc 7T83 VM and < 0.36ms on an ARM Neoverse N2 VM.
__d_lookup() becomes the most significant part of proc_pid_readdir(). Here is the flamegraph for it.
proc_pid_readdir() can take more than 50ms, and cond_resched() was added in it.
So os.ReadDir("/proc") can be slow in Linux systems, sometimes because of the processor, sometimes because of system load.
Standard library
I found that #24015 increased blockSize to 8KiB to fix a bug with missing files on CIFS. It is somewhat similar to the comment in glibc [2], so increasing to the same 32KiB lower limit as glibc may be good for both compatibility and the performance described above.
I checked the solaris/illumos libc as a reference: its filesystem-independent dirent buffer has a lower limit of 8KiB and an upper limit of 64KiB.
Application level
Monitoring software checks PIDs often. Some may switch to sysfs instead of procfs, for example the cgroup.procs file for container processes, where the Linux subsystem stores the PIDs in one place in memory, without frequent memory traversal in /proc.
Suggestion
Align with the glibc implementation and allocate a 32KiB dirent buffer for better performance and compatibility.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=4b962c9e859de23b461d61f860dbd3f21311e83a
[2] https://inbox.sourceware.org/libc-alpha/87lfml9gl7.fsf@mid.deneb.enyo.de/
[3] https://lore.kernel.org/linux-fsdevel/20220614180949.102914-1-bfoster@redhat.com/
[4] https://www.usenix.org/system/files/srecon22emea_slides_liku.pdf