reduce the number of open syscalls getting ENOENT from unexisting caches in sysfs #434

Open
bgoglin opened this issue Nov 18, 2020 · 3 comments

Comments

@bgoglin
Contributor

bgoglin commented Nov 18, 2020

We currently try to open /sys/devices/system/cpu/cpuX/cache/indexY/shared_cpu_map for every PU and Y between 0 and 9. That's usually 6 useless syscalls per PU since most CPUs have 4 caches per PU. That's almost 1ms per PU.

Linux numbers caches from 0 to N-1 internally, but some of them might get skipped when added to sysfs for some reason (see cache_add_dev() in drivers/base/cacheinfo.c). That means we have no easy way to break the loop when index4 is missing, as it usually is.
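
For illustration, the probing loop described above can be sketched as follows. This is a minimal sketch, not hwloc's actual code: `probe_cache_indexes`, its parameters, and the path buffer size are hypothetical names chosen for this example. It counts the failing opens and shows why the loop cannot stop at the first missing index.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of hwloc): try to open
 * <cache_dir>/indexY/shared_cpu_map for Y in [0, nr_indexes).
 * Returns the number of indexes found and stores the number of
 * failed open() calls (typically ENOENT) in *failed_opens. */
static int probe_cache_indexes(const char *cache_dir, unsigned nr_indexes,
                               unsigned *failed_opens)
{
    char path[512];
    int found = 0;

    *failed_opens = 0;
    for (unsigned i = 0; i < nr_indexes; i++) {
        snprintf(path, sizeof(path), "%s/index%u/shared_cpu_map", cache_dir, i);
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            /* We cannot break here: the kernel may skip an index in sysfs,
             * so a later index could still exist. */
            (*failed_opens)++;
            continue;
        }
        close(fd);
        found++;
    }
    return found;
}
```

With `nr_indexes` set to 10 on a machine exposing 4 caches per PU, this would report 6 failed opens per PU, which is the overhead discussed here. Lowering `nr_indexes` only reduces the number of wasted calls; it does not remove them.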

Doing stat on the parent directory might be a good way to find out the total number of indexY subdirectories. That would mean one syscall to avoid 6 syscalls. However btrfs (for fsroot regression tests) has some issues with nlink being wrong (see comments in topology-linux.c).
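
The stat() idea can be sketched as below. This is only a sketch under a stated assumption: the filesystem follows the traditional convention where a directory's link count is 2 plus its number of subdirectories. `count_subdirs_via_nlink` is a hypothetical name, and the function counts every subdirectory, not only the indexY ones.

```c
#include <sys/stat.h>

/* Hypothetical helper (not part of hwloc): estimate the number of
 * subdirectories of dir from st_nlink with a single stat() call.
 * Returns -1 if stat() fails, if dir is not a directory, or if the
 * link count cannot be trusted (some filesystems, such as btrfs,
 * report a link count below 2 for directories). */
static long count_subdirs_via_nlink(const char *dir)
{
    struct stat st;

    if (stat(dir, &st) < 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return -1;
    if (st.st_nlink < 2)
        return -1; /* nlink does not follow the "2 + subdirs" convention */
    return (long)st.st_nlink - 2;
}
```

A caller would need a fallback (for example the fixed-bound loop) whenever this returns -1, which is the case that matters for regression tests run from a btrfs copy of sysfs.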

Reducing to 5 instead of 9 is likely a good start for now. Most current CPUs have 4 caches in sysfs. There are some L4 out there but I have never seen those in sysfs since they are rather outside of the CPUs. Itanium had 5 caches (L2i and L2d) but it's dead. So 5 works fine and gives us one free slot in case newer CPUs bring an additional level.

@sthibaul
Contributor

Perhaps using opendir() to get the actual list could be more efficient, even if it adds an (n+1)th call? Even with a large directory, that ends up with only one getdents64() system call.
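
A minimal sketch of that directory-listing approach, assuming entries are named `index<number>`. `list_cache_indexes` is a hypothetical name for this example, not an hwloc function.

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper (not part of hwloc): list the "indexY" entries
 * of cache_dir. Stores up to max index numbers in indexes[] and
 * returns how many were stored, or -1 if the directory cannot be
 * opened. The order is whatever readdir() returns, so it is not
 * guaranteed to be sorted. */
static int list_cache_indexes(const char *cache_dir, unsigned *indexes, int max)
{
    DIR *dir = opendir(cache_dir);
    struct dirent *entry;
    int n = 0;

    if (!dir)
        return -1;
    while (n < max && (entry = readdir(dir)) != NULL) {
        unsigned idx;
        char extra;
        /* Accept "index<digits>" only: exactly one conversion must succeed,
         * so names with trailing characters are rejected. */
        if (sscanf(entry->d_name, "index%u%c", &idx, &extra) == 1)
            indexes[n++] = idx;
    }
    closedir(dir);
    return n;
}
```

The caller then opens only the indexes that exist. Whether this is a net win depends on how many system calls opendir(), readdir() and closedir() issue underneath, which has to be measured rather than assumed.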

@xWuWux

xWuWux commented Oct 19, 2023

The easiest solution would be to reduce the number of iterations, and using opendir() for efficient directory listing is a promising approach. It would reduce unnecessary syscalls and improve the performance of hwloc's cache information retrieval on Linux.

bgoglin added a commit to bgoglin/hwloc that referenced this issue Oct 20, 2023
Instead of trying to open all "index%u" from 0 to 9.

tests/hwloc/linux/gather must be updated to ignore obj ID/gp_index
because readdir() doesn't always get the caches in the expected
order when loading from the sysfs dump.

Refs open-mpi#434

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
@bgoglin
Contributor Author

bgoglin commented Oct 20, 2023

I did a quick test. We actually get more syscalls using opendir. Instead of having one useless openat() for each of the 6 non-existing caches (those failing openat() calls are likely very cheap), opendir+readdir+closedir uses 7 syscalls (openat + newfstatat + 2 fcntl + 2 getdents + close). That's for each core.

If you want to play with it, the code is in PR #629. There will be a tarball at https://ci.inria.fr/hwloc/job/basic/job/PR-629/ soon.

bgoglin added a commit to bgoglin/hwloc that referenced this issue Oct 20, 2023
Instead of trying to open all "index%u" from 0 to 9.

Refs open-mpi#434

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>