How to configure ECHO? #3

Closed
marioroy opened this issue Apr 2, 2024 · 12 comments

@marioroy

marioroy commented Apr 2, 2024

Hi, @hamadmarri

Thank you for ECHO. I captured results comparing EEVDF, BORE, and ECHO. I now realize you made an update, so I will run the tests again and report back. Testing was on a 32-core box (64 CPU threads); AMD Ryzen Threadripper 3970X; NVIDIA RTX 3070; XanMod Edge 6.8.2 kernel. Is bs_shared_quota a fixed value, or does it depend on the number of CPU threads?

ECHO tuning

/sys/kernel/debug/sched/base_slice_ns         6800
/proc/sys/kernel/sched_bs_shared_quota        81600 (6800 * 12)
/proc/sys/kernel/yield_type                   1
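
The quota above is written as a multiple of the base slice (6800 * 12). A minimal sketch of that arithmetic follows. The helper name `quota_from_slices` is hypothetical, and the multiplier 12 is just the value used in this comment, not a rule documented by ECHO.

```python
def quota_from_slices(base_slice_ns: int, slices: int) -> int:
    """Return a shared quota (in ns) large enough to hold `slices` base slices."""
    if base_slice_ns <= 0 or slices <= 0:
        raise ValueError("base_slice_ns and slices must be positive")
    return base_slice_ns * slices


# Values taken from the tuning listed above.
print(quota_from_slices(6800, 12))  # 81600
```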

I ran 4 tasks concurrently, twice (with and without idle policy for the compute job). Afterwards, I timed a kernel compile job.

chrt -f 10 Chromium Browser https://slowroads.io/
           Google Chrome    https://webglsamples.org/blob/blob.html
                            Number of blobs: 10   Resolution: 48^3

chrt -i 0 ./algorithm3.pl 2e12    (i)
          ./algorithm3.pl 2e12   (noi)
          ./schbench          (99.0th,max)

Compile:  time HZ=800 LOCALMODCONFIG=1 ./xm-build edge-preempt
          The compile job runs separately, no other jobs.
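
The policy prefixes used in this test setup can be generated rather than typed by hand. The sketch below only builds the argument list for `chrt` (`-f` for SCHED_FIFO, `-i` for SCHED_IDLE) and does not launch anything. The helper name `with_policy` is hypothetical.

```python
import shlex


def with_policy(cmd, policy=None, priority=0):
    """Prefix `cmd` (a list of arguments) with chrt for the given policy.

    policy=None returns the command unchanged (the "noi" case above).
    """
    flags = {"fifo": "-f", "idle": "-i"}
    if policy is None:
        return list(cmd)
    if policy not in flags:
        raise ValueError(f"unsupported policy: {policy!r}")
    return ["chrt", flags[policy], str(priority), *cmd]


burner = ["./algorithm3.pl", "2e12"]
print(shlex.join(with_policy(burner, "idle", 0)))  # chrt -i 0 ./algorithm3.pl 2e12
print(shlex.join(with_policy(burner)))             # ./algorithm3.pl 2e12
```

The resulting list can be handed to `subprocess.run` as-is. Note that `chrt -f` normally needs elevated privileges.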

Results

Scheduler     algo(i)  blob   sch99th  algo(noi) blob   sch99th   compile
                                max                       max
-----------   -------  -----  ------    -------  -----  ------    --------
EEVDF         41.279s  60fps  3716us    39.472s  60fps  4232us    100.682s
                              7735us                    8198us

BORE v5.0.3   41.510s  60fps  3660us    39.261s  60fps  4584us    100.269s
                              7149us                    8225us

ECHO v001     36.284s  56fps  1230us    34.692s  30fps  3404us     99.906s
                              4370us                    7909us
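
For the timing columns (algo and compile), lower is better. A small sketch that turns the table into percentage differences against EEVDF follows. The numbers are copied from the table above, and the helper name `percent_faster` is hypothetical.

```python
def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Percent reduction in wall-clock time relative to the baseline."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


# (algo with idle policy, algo without idle policy, kernel compile), in seconds
results = {
    "EEVDF": (41.279, 39.472, 100.682),
    "BORE v5.0.3": (41.510, 39.261, 100.269),
    "ECHO v001": (36.284, 34.692, 99.906),
}

baseline = results["EEVDF"]
for name, times in results.items():
    deltas = [percent_faster(b, t) for b, t in zip(baseline, times)]
    print(f"{name:12s}", "  ".join(f"{d:+6.2f}%" for d in deltas))
```

Positive values mean the scheduler finished sooner than EEVDF on that job.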

Observations

  1. The slowroads demo ran smoothest with ECHO under CPU load. Even though Chromium was running with the 'fifo' policy, there were micro-jitters with EEVDF/BORE, but it was quite smooth with ECHO.
  2. EEVDF/BORE have SCHED_AUTOGROUP enabled, so the Firefox window appears in 1 second when launched under CPU load. For ECHO, Firefox may take longer to appear, depending on whether the running background jobs use the 'idle' policy.
  3. Achieving blob 60fps is possible with ECHO algo(i,noi), simply by running Google Chrome with 'fifo' policy i.e. chrt -f 10.
  4. The idle policy ('chrt -i 0') has a nice effect with ECHO; optionally apply it to background jobs. That keeps the desktop environment responsive under ECHO.
  5. Possibilities: the 'fifo' and 'idle' policies are helpful.
    idle: gives background jobs less priority, keeping ECHO responsive
    fifo: gives more priority, if needed.

Blessings and grace.

@marioroy
Author

marioroy commented Apr 2, 2024

The README states, "All tasks in a CPU have a shared quota = 105us in which every task runs (105us / # of tasks)". If it is not a fixed value, how is the new ECHO-002 bs_shared_quota computed? For example: 50 base_slice_ns * 8 CPUs * 1.25 = 500?

I will try bs_shared_quota 4000, 5000, and 6000: 50 base_slice_ns * 64 CPUs * 1.25 = 4000.

/sys/kernel/debug/sched/base_slice_ns 50
/proc/sys/kernel/sched_bs_shared_quota 4000
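
The candidate value above comes from a guessed formula (base slice * number of CPUs * 1.25). A minimal sketch of that guess follows, so the other candidates can be reproduced. This is only the hypothesis posed in this comment, not a confirmed ECHO rule, and the helper name `guessed_quota` is hypothetical.

```python
def guessed_quota(base_slice_ns: int, cpus: int, factor: float = 1.25) -> int:
    """Guess from this comment: quota = base_slice_ns * cpus * factor (unconfirmed)."""
    return int(base_slice_ns * cpus * factor)


print(guessed_quota(50, 8))   # 500
print(guessed_quota(50, 64))  # 4000
```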

Edit: Interesting, 60fps for the WebGL blob demo, previously 56. I'm testing 6000 first.

@hamadmarri
Owner

hamadmarri commented Apr 2, 2024

Hello @marioroy

Sorry for the late response. The shared quota is per CPU: it is the maximum number of nanoseconds that the running tasks on a CPU share in one round. The smaller the value, the smoother the result, but the more context switches. The value should be no less than 2x base_slice_ns, assuming roughly two tasks running per CPU at a time; so if base_slice_ns == 500, the minimum shared quota is 1000. On my machine the sweet spot was bs_shared_quota = 500 ns, and I just hard-coded base_slice_ns to 50 ns.
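
As a rough illustration of the rule of thumb in this reply, the sketch below computes the minimum quota for a given base slice and the share each task gets when the quota is split evenly (as the README puts it, quota / # of tasks). The function names are hypothetical, and the integer division is an assumption; the kernel code may round differently.

```python
def min_shared_quota(base_slice_ns: int, tasks_per_cpu: int = 2) -> int:
    """Lower bound from the reply: at least tasks_per_cpu * base_slice_ns."""
    return base_slice_ns * tasks_per_cpu


def per_task_share(shared_quota_ns: int, running_tasks: int) -> int:
    """Even split of the per-CPU quota among the tasks running on that CPU."""
    if running_tasks < 1:
        raise ValueError("need at least one running task")
    return shared_quota_ns // running_tasks


print(min_shared_quota(500))       # 1000
print(per_task_share(105_000, 3))  # 35000 (README example quota of 105us, 3 tasks)
print(500 >= min_shared_quota(50)) # True: quota 500 with base slice 50 satisfies the bound
```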

Thank you for sharing the test results. If you don't mind, please share them here as well: https://github.com/hamadmarri/benchmarks

Could you please explain the results a bit, or just mention whether higher or lower is better?

Thank you

@marioroy
Author

marioroy commented Apr 3, 2024

> The shared quota is per cpu

I had wondered about bs_shared_quota since using ECHO. That is now clear.

> Could you please explain the results a bit, or just mention whether higher or lower is better?

I struggled to choose between BORE v5.0.3 and ECHO v001. BORE is more responsive under CPU load, for example when launching Firefox: the window appears in less than 1 second. That is possible with ECHO (~1 second) by running the background CPU burner with the idle policy, i.e. chrt -i 0. As for the WebGL blob demonstration, ECHO too can reach 60fps under CPU load by running Chrome with the 'fifo' policy, i.e. chrt -f 10.

ECHO completes the CPU burner job (counting prime numbers) in less time.

Your video is where I learned about the WebGL blob demonstration. Now there is ECHO v002. It will take some time to do various testing, including bs_shared_quota=500.

Thank you for the explanation, @hamadmarri.

marioroy closed this as completed Apr 3, 2024
@marioroy
Author

marioroy commented Apr 3, 2024

Maybe the sched_base_slice or bs_shared_quota v002 defaults are extreme. Try running steps 1 and 2 concurrently. Launching Firefox may freeze the entire desktop momentarily.

  1. Run Chrome or Chromium with fifo policy (chrt -f 10), go to https://slowroads.io/
  2. Count primes consuming all CPU cores.
  3. Launch Firefox. Check for micro-pauses in Chrome.

Repeat: quit Chrome and run it normally, without the fifo policy.

@marioroy
Author

marioroy commented Apr 3, 2024

The freeze issue is a problem. I experienced the freeze (1 ~ 2 seconds) two more times, using HZ_625 and again with HZ_800. I like HZ_800 for the improved interactivity. A higher base_slice_ns mitigates jitters. Thank you for allowing tuning.

/sys/kernel/debug/sched/base_slice_ns 3500
/proc/sys/kernel/sched_bs_shared_quota 35000
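
A small sketch for reading and writing these two tunables follows. The paths are the ones listed above; the helper names are hypothetical, and it is an assumption that each file holds a single integer. Writing normally requires root, and the quota file only exists on an ECHO kernel, so the demo just reports the files as unavailable elsewhere.

```python
from pathlib import Path

BASE_SLICE = Path("/sys/kernel/debug/sched/base_slice_ns")
SHARED_QUOTA = Path("/proc/sys/kernel/sched_bs_shared_quota")


def read_tunable(path) -> int:
    """Read a single integer value from a sysfs/procfs-style file."""
    return int(Path(path).read_text().strip())


def write_tunable(path, value: int) -> None:
    """Write an integer value to the file (needs sufficient privileges)."""
    Path(path).write_text(f"{value}\n")


def show(paths) -> None:
    """Print each tunable, or a note if it is missing or unreadable."""
    for p in paths:
        try:
            print(p, read_tunable(p))
        except (OSError, ValueError) as exc:
            print(p, "unavailable:", exc)


show([BASE_SLICE, SHARED_QUOTA])
# To apply the values from this comment (as root, on an ECHO kernel):
#   write_tunable(BASE_SLICE, 3500)
#   write_tunable(SHARED_QUOTA, 35000)
```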

I reverted the following v002 change back to RR_TIMESLICE (100 * HZ / 1000). Interestingly, I had no freezes before with ECHO v001.

+#ifdef CONFIG_ECHO_SCHED
+#define RR_TIMESLICE		(1)
+#else
 #define RR_TIMESLICE		(100 * HZ / 1000)
+#endif
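
To see how large this change is, the sketch below converts both RR_TIMESLICE definitions to milliseconds for the HZ values mentioned here. It assumes, as in the mainline kernel, that RR_TIMESLICE is expressed in jiffies and that one jiffy lasts 1000 / HZ milliseconds. The function names are hypothetical.

```python
def rr_timeslice_jiffies(hz: int) -> int:
    """Mirror of (100 * HZ / 1000) using C-style integer division."""
    return 100 * hz // 1000


def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Convert a jiffy count to milliseconds for a given HZ."""
    return jiffies * 1000.0 / hz


for hz in (625, 800):
    default = rr_timeslice_jiffies(hz)
    print(f"HZ={hz}: default {default} jiffies = {jiffies_to_ms(default, hz)} ms, "
          f"v002 1 jiffy = {jiffies_to_ms(1, hz)} ms")
```

At HZ=800 the default works out to 80 jiffies (100 ms), while a value of 1 is a single 1.25 ms jiffy.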

Edit: That did it. No freezes, and running HZ_800. As for base_slice_ns, I tried going lower, but the jitters came back when launching Firefox while watching the "slowroads" demo. Another test involves lots of memory: decreasing bs_shared_quota below 35000 causes "write stdout" to take 1.2 ~ 1.8 seconds, likely due to cache misses. So bs_shared_quota = 35000 it is for my machine.

$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties         5.910 secs
map to vector          0.879 secs
vector stable sort     1.132 secs
write stdout           0.970 secs    <--- here
total time             8.892 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

@hamadmarri
Owner

Interesting! I have to revert the RR_TIMESLICE changes and will think of new default values for both base_slice_ns and bs_shared_quota. Thank you so much for the testing and debugging.

@hamadmarri
Owner

hamadmarri commented Apr 3, 2024

Hello @marioroy

4a8cd2a

I have done some tests, and 35us is a better value on my machine too:
https://openbenchmarking.org/result/2404031-NE-DEFAULTVS23

Thank you so much 👍

@marioroy
Author

marioroy commented Apr 4, 2024

ECHO loves bs_shared_quota 35000. Go much higher, no good. Go much lower, no good. That seems to be spot on. I also tried tuning the base_slice_ns setting (default 6000).

/sys/kernel/debug/sched/base_slice_ns 4200
/proc/sys/kernel/sched_bs_shared_quota 35000

This looks mystical.

35000 / 4200 = 8.3(3)
35000 / 3 = 11666.6(6)

Does base_slice_ns 4200 work well on your system? That is the lowest one can safely go without causing jitters.

Hackbench wall clock time dropped from 40 seconds (base_slice_ns 6000) to 37.5 seconds (base_slice_ns 4200), while under CPU load (counting prime numbers) and running cyclictest concurrently.

@hamadmarri
Owner

Hi @marioroy

https://openbenchmarking.org/result/2404043-NE-DEFAULTVS48

For CPU-bound tasks, 4200 is the best so far (see the Rust Mandelbrot test). The overall interactivity is better.

Thank you for your efforts

@marioroy
Author

marioroy commented Apr 7, 2024

A Clear Linux user tried my ClearMod repository and compared the Vanilla native kernel (no preemption) and ECHO (XanMod + preemption + ECHO).

https://community.clearlinux.org/t/nvidia-and-xanmod-cl-updates/9299/32

Very cool.

@marioroy
Author

marioroy commented Apr 8, 2024

I completed testing a demo for the phmap author. Yet another surprise. :-)

image

@hamadmarri
Owner

Hi @marioroy

Thank you for sharing the results. I am pleased to see that ECHO has some performance advantages 👍
