lib: embedded ATQ #2

etsal · 2025-09-27T05:34:38Z

Add the necessary changes to the rbtree and ATQ APIs to enable no-allocation ATQ operations.
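For readers unfamiliar with the idea, a no-allocation queue operation can be illustrated with an intrusive node embedded in a caller-owned object. The sketch below is a userspace illustration only: every name in it (`demo_node`, `demo_task`, `demo_insert_node`, `demo_pop`) is hypothetical, and it uses a sorted linked list instead of a red-black tree to stay short. It is not the rbtree or ATQ API touched by this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intrusive node; the real rbnode layout may differ. */
struct demo_node {
	uint64_t key;
	struct demo_node *next;
};

/* The caller-owned object embeds the node, so queueing needs no allocation. */
struct demo_task {
	int pid;
	struct demo_node node;
};

struct demo_queue {
	struct demo_node *head;
};

/* Recover the enclosing object from a pointer to its embedded node. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Insert in ascending key order; equal keys keep FIFO order. No malloc. */
static void demo_insert_node(struct demo_queue *q, struct demo_node *n,
			     uint64_t key)
{
	struct demo_node **pp = &q->head;

	assert(q && n);
	n->key = key;
	while (*pp && (*pp)->key <= key)
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
}

/* Pop the entry with the smallest key, or NULL when empty. No free. */
static struct demo_task *demo_pop(struct demo_queue *q)
{
	struct demo_node *n = q->head;

	if (!n)
		return NULL;
	q->head = n->next;
	n->next = NULL;
	return demo_container_of(n, struct demo_task, node);
}
```

Because the node lives inside the object being queued, the insert and pop paths never allocate or free memory, which is the property the description refers to.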

NOTE: There is currently a verification failure due to the stack depth being exceeded. This is incidental to the patch and just requires some more debugging that I will be doing over the weekend.

Signed-off-by: Changwoo Min <changwoo@igalia.com>

Add a utility script, cgpath.sh, that takes a cgroup ID as a command-line argument and returns the full path of the cgroup. The cgroup ID is the inode number of the cgroup. This is for easy debugging of the cpu.max support. Signed-off-by: Changwoo Min <changwoo@igalia.com>

Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx) to support CPU bandwidth control (cpu.max) in cgroup v2. In addition, add an API skeleton for BPF schedulers. Signed-off-by: Changwoo Min <changwoo@igalia.com>

To avoid an -E2BIG error, tweak the code to reduce the BPF program size. Signed-off-by: Changwoo Min <changwoo@igalia.com>

Integrate the CPU bandwidth control with the scx_lavd scheduler. The library is initialized (scx_cgroup_bw_lib_init) when the scheduler is initialized. Also, ops.cgroup_init(), ops.cgroup_exit(), and ops.cgroup_move() are implemented; scx_cgroup_bw_reenqueue() is called at ops.dispatch(). A new option, `--enable-cpu-bw`, is added to enable the feature. Finally, replace __nr_cpu_ids with nr_cpu_ids defined in the scx library. Signed-off-by: Changwoo Min <changwoo@igalia.com>

scx_cgroup_bw_lib_init() first initializes the config and the replenish timer. Signed-off-by: Changwoo Min <changwoo@igalia.com>

When a cgroup is initialized, the cgroup's context and its LLC contexts are initialized. Also, its parent now becomes non-leaf. If its parent is not threaded, it cannot have tasks, so we delete its LLC contexts. Signed-off-by: Changwoo Min <changwoo@igalia.com>

When a cgroup's bandwidth is updated, we should update the nquota_lb of all its descendants too. Signed-off-by: Changwoo Min <changwoo@igalia.com>

Destroy the cgroup context and its LLC contexts, then drain and free the BTQs associated with the LLC contexts. Signed-off-by: Changwoo Min <changwoo@igalia.com>

We reserve the budget in a bottom-up manner: first at the LLC level, then at the cgroup level, and finally at the subroot cgroup level. Before requesting budget from the subroot cgroup level, update this cgroup's runtime_total_sloppy to avoid spending the budget over its upper bound (nquota_ub). We traverse the cgroup hierarchy in a post-order manner (left, right, then root) and check the cgroup's level to efficiently sum the runtime_total of all its descendants. Signed-off-by: Changwoo Min <changwoo@igalia.com>
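The bottom-up order described above can be sketched in plain userspace C as follows. This is a simplified illustration under stated assumptions: `demo_pool` and `demo_reserve` are hypothetical names, the sloppy runtime accounting and the nquota_ub check are left out, and recursion is used for brevity even though a BPF implementation would have to avoid it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified budget pool; field names are illustrative. */
struct demo_pool {
	int64_t budget;           /* remaining time in ns */
	struct demo_pool *parent; /* next level up, NULL at the subroot */
};

/*
 * Reserve `need` ns starting from the lowest pool (LLC level). When a pool
 * is short, only the shortfall is requested from its parent (cgroup level,
 * then subroot level). Returns 0 on success, or -EAGAIN when the whole
 * chain cannot cover the request; on failure no pool is modified.
 */
static int demo_reserve(struct demo_pool *pool, int64_t need)
{
	int64_t shortfall;
	int ret;

	assert(pool && need >= 0);
	if (pool->budget >= need) {
		pool->budget -= need;
		return 0;
	}
	if (!pool->parent)
		return -EAGAIN;

	shortfall = need - pool->budget;
	ret = demo_reserve(pool->parent, shortfall);
	if (ret)
		return ret;
	pool->budget = 0;
	return 0;
}
```

In this sketch a request touches the upper levels only when the lower level runs dry, which keeps the common case local to the LLC-level pool.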

After executing a task, we update runtime_total and compensate for budget_remaining_before by comparing the planned vs. actual time usage. Signed-off-by: Changwoo Min <changwoo@igalia.com>
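A minimal sketch of the planned-versus-actual compensation, assuming a single budget counter: the `demo_acct` struct, its `budget_remaining` field, and `demo_consume()` are hypothetical and do not mirror the library's actual fields or signatures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accounting state; not the library's real context layout. */
struct demo_acct {
	int64_t budget_remaining; /* ns still available in this period */
	uint64_t runtime_total;   /* ns actually consumed so far */
};

/*
 * Called after a task ran. `reserved` is the time deducted up front,
 * `consumed` is what the task really used. The difference is returned to
 * the budget (or charged against it when the task overran).
 */
static void demo_consume(struct demo_acct *a, int64_t reserved,
			 int64_t consumed)
{
	assert(a && reserved >= 0 && consumed >= 0);
	a->runtime_total += (uint64_t)consumed;
	a->budget_remaining += reserved - consumed;
}
```

With this shape, a task that uses less than it reserved gives the remainder back, while an overrun shows up as a smaller (possibly negative) remaining budget.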

When a cgroup is throttled (i.e., scx_cgroup_bw_reserve() returns -EAGAIN), a task that is in the ops.enqueue() path should be put aside to the BTQ of its associated LLC context. When the cgroup becomes unthrottled again, the registered enqueue_cb() will be called to re-enqueue the task for execution. Signed-off-by: Changwoo Min <changwoo@igalia.com>

We replenish the cgroup's budget every 100 ms (nperiod). Only subroot cgroups (level == 1) distribute the budget to their descendants; we only replenish the budget for subroot cgroups with a limited quota. Underused budget can be accumulated up to the specified burst. On the other hand, overused budget is charged over the following intervals. The replenish timer is split into two parts: the top half and the bottom half. The top half -- the actual BPF timer function (replenish_timerfn) -- runs the essential, critical part, such as refilling the time budget. The bottom half -- scx_cgroup_bw_reenqueue() -- runs in a BPF scheduler's ops.dispatch() and requeues the backlogged tasks to the proper DSQs. Signed-off-by: Changwoo Min <changwoo@igalia.com>
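The burst and overuse rules can be read as a small piece of arithmetic. The sketch below is one plausible interpretation, not the library's code: `demo_replenish()` is a hypothetical helper, and it assumes that unused budget carries over capped at the burst while a negative balance carries over in full as debt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the budget for the next period (all values in ns).
 *  - remaining > 0: unused budget, carried over but capped at `burst`.
 *  - remaining < 0: overuse, carried over in full, so a large debt is
 *    paid off across several periods.
 */
static int64_t demo_replenish(int64_t remaining, int64_t quota,
			      int64_t burst)
{
	int64_t carry = remaining;

	assert(quota >= 0 && burst >= 0);
	if (carry > burst)
		carry = burst;
	return quota + carry;
}
```

For example, with a quota of 100 and a burst of 20, a leftover of 30 yields a new budget of 120, while a debt of 250 yields -150, then -50, then 50 over three consecutive periods.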

Under CPU bandwidth control using cpu.max, we also need to report how much time was actually consumed compared to the reserved time. Signed-off-by: Changwoo Min <changwoo@igalia.com>

We updated ops.enqueue(), enqueue callback, and ops.dispatch() paths. Under CPU bandwidth control using cpu.max, we should first reserve time for execution. If we succeed in reserving the time, we will go ahead. Otherwise, we should set the task aside for later execution. When triggered by lavd_enqueue_cb(), we should still enqueue the task even if the time reservation fails (-EAGAIN). Note that we do not throttle the scheduler process itself to guarantee forward progress. Signed-off-by: Changwoo Min <changwoo@igalia.com>
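The decision described above can be summarized as a small pure function. This is a sketch with hypothetical names (`demo_action`, `demo_enqueue_decision`); it only models the choice between enqueueing and backlogging, and it assumes that any reservation result other than -EAGAIN lets the task proceed, which the commit message does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum demo_action {
	DEMO_ENQUEUE, /* put the task on a DSQ now */
	DEMO_BACKLOG, /* set the task aside until the cgroup is unthrottled */
};

/*
 * reserve_ret:       result of the time reservation (0 or a negative errno)
 * from_enqueue_cb:   true when called from the re-enqueue callback
 * is_scheduler_task: true for the scheduler process itself
 */
static enum demo_action demo_enqueue_decision(int reserve_ret,
					      bool from_enqueue_cb,
					      bool is_scheduler_task)
{
	assert(reserve_ret <= 0);

	/* Never throttle the scheduler itself, to guarantee forward progress. */
	if (is_scheduler_task)
		return DEMO_ENQUEUE;

	/* A throttled cgroup backlogs the task, except on the callback path. */
	if (reserve_ret == -EAGAIN && !from_enqueue_cb)
		return DEMO_BACKLOG;

	return DEMO_ENQUEUE;
}
```

Keeping the decision free of side effects makes the three special cases (successful reservation, callback path, scheduler task) easy to check in isolation.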

This reverts commit ae41fd9.

… attribute

etsal · 2025-09-27T05:35:26Z

Closing since the merge was accidentally pointed to main.

Changwoo Min and others added 28 commits September 26, 2025 10:03

XXX: Update rust/scx_utils/vmlinux.tar.zst for 6.17-rc4

8374304

Signed-off-by: Changwoo Min <changwoo@igalia.com>

XXX ATQ: rbtree -> minheap

ae41fd9

Signed-off-by: Changwoo Min <changwoo@igalia.com>

XXX: scx_lavd: 30 sec -> 3 sec timeout

e153852

Signed-off-by: Changwoo Min <changwoo@igalia.com>

lib: cgroup_bw: Add skeleton for CPU bandwidth control.

80efcce

Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx) to support CPU bandwidth control (cpu.max) in cgroup v2. In addition, add an API skeleton for BPF schedulers. Signed-off-by: Changwoo Min <changwoo@igalia.com>

scx_lavd: Reduce the BPF program size.

1751d8d

To avoid an -E2BIG error, tweak the code to reduce the BPF program size. Signed-off-by: Changwoo Min <changwoo@igalia.com>

scx_lavd: rename struct task_ctx to task_ctx

05f2218

scx_lavd: move task_ctx to arenas

6185835

scx_lavd: mark functions taking a task context with __arg_arena

c8e808f

lib: cgroup_bw: Implement scx_cgroup_bw_lib_init().

ec731c8

scx_cgroup_bw_lib_init() first initializes the config and the replenish timer. Signed-off-by: Changwoo Min <changwoo@igalia.com>

lib: cgroup_bw: Implement scx_cgroup_bw_set().

a521348

When a cgroup's bandwidth is updated, we should update the nquota_lb of all its descendants too. Signed-off-by: Changwoo Min <changwoo@igalia.com>

lib: cgroup_bw: Implement scx_cgroup_bw_exit().

abdd701

Destroy the cgroup context and its LLC contexts, then drain and free the BTQs associated with the LLC contexts. Signed-off-by: Changwoo Min <changwoo@igalia.com>

lib: cgroup_bw: Implement scx_cgroup_bw_consume().

1196ea0

After executing a task, we update runtime_total and compensate for budget_remaining_before by comparing the planned vs. actual time usage. Signed-off-by: Changwoo Min <changwoo@igalia.com>

scx_lavd: Support cpu.max at ops.stopping().

1ff27cf

Under CPU bandwidth control using cpu.max, we also need to report how much time was actually consumed compared to the reserved time. Signed-off-by: Changwoo Min <changwoo@igalia.com>

Revert "XXX ATQ: rbtree -> minheap"

f556742

This reverts commit ae41fd9.

scx_lavd: move pid to main task_ctx

1718fc6

lib/atq: factor out task insertion into scx_atq_insert_node

39bedb8

lib/rbtree: add noalloc/nofree variants of the API

62d3496

lib/rbtree: turn rbtree_insert_mode from a per-insert into a per-tree…

a092d41

… attribute

atq: only use embedded rbnodes on scx_atq_insert_*()

bd294e2

lib/cgroup_bw: move to rbnode-based ATQ API

e13ac08

[wip] stack depth exceeded debugging

41c852c

etsal closed this Sep 27, 2025
