ZFS doesn't respect Linux kernel CPU isolation mechanisms #8908

Closed
sjuxax opened this issue Jun 14, 2019 · 11 comments
Labels: Status: Stale (No recent activity for issue) · Type: Defect (Incorrect behavior, e.g. crash, hang)

sjuxax commented Jun 14, 2019

System information

Distribution Name: ArchLinux
Distribution Version: Rolling
Linux Kernel: 4.19.48
Architecture: x86_64
ZFS Version: 0.8.0
SPL Version: 0.8.0

Describe the problem you're observing

module/spl/spl-taskq.c contains this code:

  tqt->tqt_thread = spl_kthread_create(taskq_thread, tqt,
      "%s", tq->tq_name);
  if (tqt->tqt_thread == NULL) {
    kmem_free(tqt, sizeof (taskq_thread_t));
    return (NULL);
  }

  if (spl_taskq_thread_bind) {
    last_used_cpu = (last_used_cpu + 1) % num_online_cpus();
    kthread_bind(tqt->tqt_thread, last_used_cpu);
  }

Thus, kthreads either spawn with the default cpumask or, if spl_taskq_thread_bind=1 is set at module load, are bound to CPUs without regard for their availability to the scheduler. This can be a substantial source of latency, which is not acceptable on the many systems that use the isolcpus boot parameter to set aside designated "real-time" cores.

While spl_taskq_thread_bind=1 prevents latency from threads migrating on and off the RT CPUs, it can make things substantially worse by locking threads to arbitrary cores in a way that can't be changed with taskset, leaving an RT CPU saddled with a kthread for the thread's full lifetime.

Ideally, the module's CPU selection would be replaced with something that uses the kernel's housekeeping API in include/linux/sched/isolation.h to get the cpumask of non-isolated CPUs, then uses kthread_create_on_cpu in spl_kthread_create and/or kthread_bind_mask to schedule and bind threads across non-RT cores only. Note, however, that this is an incomplete solution, because the kernel's interface for obtaining an isolcpus cpumask has changed several times across the kernel versions supported by ZFS.
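
As a concrete illustration, here is a minimal, untested sketch of what the bind branch above could look like, assuming a kernel where include/linux/sched/isolation.h provides housekeeping_cpumask() and HK_FLAG_DOMAIN (newer kernels have renamed these interfaces; the hk_mask variable is illustrative):

  #include <linux/sched/isolation.h>

  if (spl_taskq_thread_bind) {
    /* CPUs left to the scheduler for general work, i.e. everything */
    /* not carved out of the scheduler domains by isolcpus. */
    const struct cpumask *hk_mask = housekeeping_cpumask(HK_FLAG_DOMAIN);

    /* NB: kthread_bind_mask() is not exported to modules, see the */
    /* follow-up comment below for a workaround. */
    kthread_bind_mask(tqt->tqt_thread, hk_mask);
  }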

Various hacks can be done to try to prevent unbound kthreads from using isolated cores, and threads not bound with spl_taskq_thread_bind can be moved, but these solutions are iffy and incomplete at best. It would be great if ZFS respected isolcpus from the start.

Describe how to reproduce the problem

Boot with isolcpus, capture a trace of the RT CPUs with perf sched record or another tracing mechanism, and observe ZFS-spawned kthreads coming on and off the isolated cores. This is the primary remaining source of latency on my local system.

Include any warning/errors/backtraces from the system logs

sjuxax commented Jun 15, 2019

A minimal, quick-and-dirty patch that appears to work for me is here: sjuxax@7c2a896 .

It looks like kthread_bind_mask isn't exported, so I'm using cpumask_next_wrap instead. This still iterates through the CPUs round-robin, but only over the HK_FLAG_DOMAIN housekeeping cpumask, so isolated CPUs are skipped. I'm sure there are other places where the affinity needs to be set, but at a glance this appears to quiet things down a bit. 🤷‍♂️
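
Roughly, the idea looks like the following (a sketch of the approach described above, not the actual commit; it assumes the same housekeeping interfaces as the sketch in the issue description):

  if (spl_taskq_thread_bind) {
    const struct cpumask *hk_mask = housekeeping_cpumask(HK_FLAG_DOMAIN);

    /* Advance the round-robin across housekeeping CPUs only, so cores */
    /* isolated with isolcpus are never selected. */
    last_used_cpu = cpumask_next_wrap(last_used_cpu, hk_mask,
        nr_cpu_ids, false);
    if (last_used_cpu < nr_cpu_ids)
      kthread_bind(tqt->tqt_thread, last_used_cpu);
  }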

gamanakis commented Dec 16, 2019

@sjuxax your observations are correct. The other place you would have to do this is in __thread_create() in module/spl/spl-thread.c. You can see a very primitive example here:
gamanakis@b9bad20
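
For the dedicated threads, one possibility (an illustrative sketch only, not the linked commit; the tsk variable name stands in for whatever __thread_create() actually uses) would be to restrict the new thread's allowed CPUs right after spl_kthread_create() returns:

  #include <linux/sched/isolation.h>

  /* In __thread_create(), after the thread has been created: */
  if (tsk != NULL) {
    /* Keep dedicated SPL threads off cores isolated with isolcpus. */
    /* Note: set_cpus_allowed_ptr() is a GPL-only export, so symbol */
    /* licensing would need to be double-checked for the spl module. */
    (void) set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_FLAG_DOMAIN));
  }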

behlendorf commented

@sjuxax would you mind opening a PR with the proposed fix for taskqs and dedicated threads? Then we can get you some better feedback and we shouldn't lose track of this again.

behlendorf added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Dec 16, 2019
gamanakis commented

@behlendorf Would it also be worth having a cpulist as an SPL module parameter that would bind those threads to defined CPUs?
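
Something like the following fragment could work (purely a sketch; the parameter name and helper are hypothetical and do not exist in SPL today; cpulist_parse() accepts the usual "0-2,12-14" style list syntax):

  /* e.g. modprobe spl spl_taskq_bind_cpulist=0-2,12-14 */
  static char *spl_taskq_bind_cpulist = NULL;
  module_param(spl_taskq_bind_cpulist, charp, 0444);
  MODULE_PARM_DESC(spl_taskq_bind_cpulist,
      "List of CPUs that bound taskq threads may run on");

  static struct cpumask spl_taskq_bind_cpumask;

  /* Called once during module init; falls back to all online CPUs */
  /* when the parameter is absent or malformed. */
  static void
  spl_taskq_bind_cpumask_init(void)
  {
    if (spl_taskq_bind_cpulist == NULL ||
        cpulist_parse(spl_taskq_bind_cpulist, &spl_taskq_bind_cpumask) != 0)
      cpumask_copy(&spl_taskq_bind_cpumask, cpu_online_mask);
  }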

testdasi commented

Has there been any progress on fixing this defect please?

IvanVolosyuk commented

The CPU hotplug work changed the relevant code:
#11212
A fix for this issue will have to implement these changes in a hotplug-aware way.

stale bot commented Nov 18, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label on Nov 18, 2021
stale bot closed this as completed on Feb 18, 2022

Jauchi commented Mar 13, 2022

Has this been fixed?
On my system (zfs-2.1.2-1), using isolcpus together with spl_taskq_thread_bind set to either 0 or 1 has no effect (ZFS still uses a CPU that is excluded by isolcpus).

ipaqmaster commented Mar 13, 2022

Interesting,

I saw your reply via email and tried it myself to confirm. I am on Archlinux here using:

  • Kernel 5.16.13
  • zfs-2.1.2-1
  • zfs-kmod-2.1.2-1
  • AMD Ryzen 9 3900X
  • 32G DDR4 @ 3600MHz (2x F4-3600C16-16GTZNC)
  • 2TB Corsair MP600 nvme to read from as a test

My boot arguments were:

zfs=myPool/myRoot rw iommu=pt iommu=1 quiet acpi_enforce_resources=lax hugepagesz=1G hugepages=12 isolcpus=3-11,15-23 rcu_nocbs=3-11,15-23 nohz_full=3-11,15-23 systemd.unified_cgroup_hierarchy=0 rcu_nocb_poll irqaffinity=0,1,2,12,13,14

I opened htop on one screen and could already see that only cores 0,1,2 + 12,13,14 were given work by my host.

At this point I used pv /data/somelargefile.dat > /dev/null in another terminal and ZFS read it out at ~1.9 GB/s.

I could see the z_rd_init_0 (and similarly numbered) threads giving CPU threads 0,1,2,12,13,14 the workload of their lives, while the other cores were left 100% idle. This wasn't the case before.

I tried another pv of data from an encrypted dataset, and while the read speed was expectedly slower, it still only executed on the six CPU threads that were not isolated. I don't know why your situation is behaving differently.

Jauchi commented Mar 13, 2022

Hello, thanks for the quick and detailed reply!

I forgot to mention that I am running NixOS unstable.

I tried to adapt my system as far as possible to your kernel parameters, now I have the following cmdline (hashes and PCI IDs removed for readability):
BOOT_IMAGE=(hd0,gpt2)//kernels/[...]-linux-5.15.27-bzImage init=/nix/store/[...]/init vfio-pci.ids=[...] amd_iommu=on iommu=pt iommu=1 acpi_enforce_resources=lax isolcpus=7-15 rcu_nocbs=7-15 nohz_full=7-15 rcu_nocb_poll irqaffinity=0,1,2,3,4,5,6 spl_taskq_thread_bind=1 nohibernate zfs_force=1 systemd.unified_cgroup_hierarchy=0 loglevel=4

The spl_taskq_thread_bind=1 parameter does not seem to have any effect; I tried booting once with and once without it and it made no difference.
As soon as my system is booted, there is some (less than 2%) kernel activity on core #14, along with userspace activity on 0-6. All other cores are silent.

However, I took a look in htop and ZFS is not the only kernel thread using that core, so it is likely something wrong on my part.

So, this is clearly some sort of user error on my part. If you have any suggestions or ideas, I would of course be very thankful nonetheless.
If I find a solution, I will try to post it here too. Thanks!

Jauchi commented Mar 13, 2022

Okay, so I think I figured it out, although the reasons why it is the way it is are beyond my understanding.

To make a long story short: if I leave a CPU core between 8 and 15 for the kernel, it uses that core; otherwise it just assigns a random one at boot time and stays stuck with it.
So if I isolated all CPUs except 0 and 1, CPUs 2-7 would not get any kernel processes, while one random CPU in the 8-15 range would end up hosting them.
Basically, I followed the example set by @ipaqmaster and left some CPUs in the upper range un-isolated as well. Now I can use 2-7 and 10-15 exclusively, I have no more issues, and everything is nicely isolated.

Thank you!
