etsal commented Sep 27, 2025

Add the necessary changes to the rbtree and ATQ APIs to enable no-allocation ATQ operations.

NOTE: There is currently a verification failure due to the stack depth being exceeded. This is incidental to the patch and just requires some more debugging that I will be doing over the weekend.

Changwoo Min and others added 28 commits September 26, 2025 10:03
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Add a utility script, cgpath.sh, that takes a cgroup ID as a
command-line argument and returns the full path of the cgroup.
The cgroup ID is the inode number of the cgroup. This is for easy
debugging of the cpu.max support.
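
cgpath.sh itself is not shown in this conversation. As a rough illustration of the lookup it describes, the userspace C sketch below walks a directory tree and returns the first directory whose inode number matches the given ID. The function name `find_dir_by_ino` and the 4096-byte path buffer are assumptions made for this example, not part of the patch.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: depth-first search under `dir` for a directory whose
 * inode number equals `ino`. On success the path is copied into `out` and 0
 * is returned; otherwise a negative errno value is returned.
 */
static int find_dir_by_ino(const char *dir, ino_t ino, char *out, size_t len)
{
    struct stat st;
    struct dirent *de;
    DIR *d;
    int ret = -ENOENT;

    assert(dir && out && len > 0);

    /* lstat() so that symlinks are never followed out of the tree. */
    if (lstat(dir, &st) != 0 || !S_ISDIR(st.st_mode))
        return -ENOENT;

    if (st.st_ino == ino) {
        if (strlen(dir) >= len)
            return -ENAMETOOLONG;
        strcpy(out, dir);
        return 0;
    }

    d = opendir(dir);
    if (!d)
        return -ENOENT;

    /* Recurse into each child until one of them reports a match. */
    while (ret != 0 && (de = readdir(d)) != NULL) {
        char child[4096];

        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        if (snprintf(child, sizeof(child), "%s/%s", dir, de->d_name) >=
            (int)sizeof(child))
            continue; /* path too long for the buffer: skip it */
        ret = find_dir_by_ino(child, ino, out, len);
    }
    closedir(d);
    return ret;
}
```

Assuming cgroup v2 is mounted at /sys/fs/cgroup, a call such as `find_dir_by_ino("/sys/fs/cgroup", id, buf, sizeof(buf))` would fill `buf` with the cgroup's path. Inode numbers are only unique within one filesystem, so the search root should be the cgroup mount itself.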

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Define key data structures (scx_cgroup_ctx, scx_cgroup_llc_ctx)
to support CPU bandwidth control (cpu.max) in cgroup v2. In addition,
add an API skeleton for BPF schedulers.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
To avoid an -E2BIG error, tweak the code to reduce the BPF program size.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Integrate the CPU bandwidth control with the scx_lavd scheduler.

The library is initialized (scx_cgroup_bw_lib_init) when the scheduler
is initialized. Also, ops.cgroup_init(), ops.cgroup_exit(), and
ops.cgroup_move() are implemented; scx_cgroup_bw_reenqueue() is called
at ops.dispatch(). A new option, `--enable-cpu-bw`, is added to enable
the feature. Finally, replace __nr_cpu_ids with nr_cpu_ids defined in the
scx library.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
scx_cgroup_bw_lib_init() first initializes the config and the replenish timer.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cgroup is initialized, the cgroup's context and its LLC contexts
are initialized. Also, its parent now becomes non-leaf. If its parent is
not threaded, it cannot have tasks, so we delete its LLC contexts.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cgroup's bandwidth is updated, we should update the nquota_lb
of all its descendants too.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Destroy the cgroup context and its LLC contexts, and drain and free the
BTQs associated with the LLC contexts.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
We reserve the budget in a bottom-up manner: first at the LLC level,
then at the cgroup level, and finally at the subroot cgroup level.
Before requesting budget from the subroot cgroup level, update this cgroup's
runtime_total_sloppy to avoid spending the budget over its upper bound
(nquota_ub). We traverse the cgroup hierarchy in a post-order manner
(left, right, then root) and check the cgroup's level to efficiently sum
the runtime_total of all its descendants.
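
The bottom-up order can be illustrated with a small userspace C sketch. This is one possible reading of the description, not the patch's code: `struct budget_pool`, `pool_take()`, and `reserve_bottom_up()` are hypothetical names, and rolling back a partial reservation on failure is an assumption. The post-order traversal and the runtime_total_sloppy update are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the budget kept at one level (LLC, cgroup, subroot). */
struct budget_pool {
    int64_t remaining; /* budget left in this pool, in ns */
};

/* Take up to `want` ns from a pool and return the amount actually taken. */
static int64_t pool_take(struct budget_pool *p, int64_t want)
{
    int64_t avail, got;

    assert(want >= 0);
    avail = p->remaining > 0 ? p->remaining : 0;
    got = want < avail ? want : avail;
    p->remaining -= got;
    return got;
}

/*
 * Reserve `slice` ns, trying the LLC level first, then the cgroup level,
 * then the subroot level. Returns 0 on success or -EAGAIN if the three
 * levels together cannot cover the slice (the throttled case).
 */
static int reserve_bottom_up(struct budget_pool *llc, struct budget_pool *cgrp,
                             struct budget_pool *subroot, int64_t slice)
{
    struct budget_pool *levels[3] = { llc, cgrp, subroot };
    int64_t taken[3] = { 0, 0, 0 };
    int64_t need = slice;
    int i;

    for (i = 0; i < 3 && need > 0; i++) {
        taken[i] = pool_take(levels[i], need);
        need -= taken[i];
    }
    if (need == 0)
        return 0;

    /* Assumption: a failed reservation gives back what it took. */
    for (i = 0; i < 3; i++)
        levels[i]->remaining += taken[i];
    return -EAGAIN;
}
```

With pools of 30, 50, and 100 ns, a 100 ns request drains the LLC and cgroup pools and takes the last 20 ns from the subroot pool. A second 100 ns request then fails with -EAGAIN and leaves all three pools unchanged.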

Signed-off-by: Changwoo Min <changwoo@igalia.com>
After executing a task, we update runtime_total and compensate for
budget_remaining_before by comparing the planned vs. actual time usage.
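
A minimal sketch of this accounting step in plain userspace C follows. The field names runtime_total and budget_remaining_before come from the commit message, but their exact semantics are assumed here: budget_remaining_before is treated as the pool the reservation was drawn from, and `struct llc_ctx_sketch` and `account_after_run()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified context; only the two fields named in the commit. */
struct llc_ctx_sketch {
    int64_t runtime_total;           /* ns actually consumed so far */
    int64_t budget_remaining_before; /* assumed: pool the reservation came from */
};

/*
 * After a task has run, record what it consumed and settle the difference
 * between the planned (reserved) and actual (consumed) time.
 */
static void account_after_run(struct llc_ctx_sketch *c, int64_t reserved_ns,
                              int64_t consumed_ns)
{
    assert(reserved_ns >= 0 && consumed_ns >= 0);

    c->runtime_total += consumed_ns;
    /* Positive difference: refund unused time. Negative: charge the overrun. */
    c->budget_remaining_before += reserved_ns - consumed_ns;
}
```

Under these assumptions, a task that reserved 20 ns and used 5 ns refunds 15 ns to the pool, while one that reserved 20 ns and used 30 ns is charged an extra 10 ns.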

Signed-off-by: Changwoo Min <changwoo@igalia.com>
When a cgroup is throttled (i.e., scx_cgroup_bw_reserve() returns -EAGAIN),
a task that is in the ops.enqueue() path should be put aside to the BTQ of
its associated LLC context. When the cgroup becomes unthrottled again, the
registered enqueue_cb() will be called to re-enqueue the task for execution.
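
The set-aside and re-enqueue flow might look roughly like the userspace C sketch below. The fixed-size FIFO `struct btq` is only a stand-in for the real BTQ, whose ordering and storage are not described here; `btq_push()`, `btq_drain()`, `enqueue_or_backlog()`, and `sum_cb()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define BTQ_CAP 8 /* arbitrary capacity for the sketch */

typedef void (*enqueue_cb_t)(uint64_t task_id, void *arg);

/* Stand-in for a BTQ: a small ring buffer of task IDs (FIFO is an assumption). */
struct btq {
    uint64_t tasks[BTQ_CAP];
    size_t head, len;
};

static int btq_push(struct btq *q, uint64_t task_id)
{
    if (q->len == BTQ_CAP)
        return -ENOSPC;
    q->tasks[(q->head + q->len) % BTQ_CAP] = task_id;
    q->len++;
    return 0;
}

/* On unthrottle: hand every backlogged task to the callback, oldest first. */
static size_t btq_drain(struct btq *q, enqueue_cb_t cb, void *arg)
{
    size_t n = 0;

    assert(cb != NULL);
    while (q->len > 0) {
        uint64_t task_id = q->tasks[q->head];

        q->head = (q->head + 1) % BTQ_CAP;
        q->len--;
        cb(task_id, arg);
        n++;
    }
    return n;
}

/*
 * Enqueue-path decision: returns 1 if the task was set aside, 0 if the
 * caller should go ahead and enqueue it, or a negative errno on failure.
 */
static int enqueue_or_backlog(struct btq *q, int reserve_ret, uint64_t task_id)
{
    int err;

    if (reserve_ret != -EAGAIN)
        return reserve_ret;
    err = btq_push(q, task_id);
    return err ? err : 1;
}

/* Example callback: adds the task ID to a running sum passed through `arg`. */
static void sum_cb(uint64_t task_id, void *arg)
{
    *(uint64_t *)arg += task_id;
}
```

Here `reserve_ret` plays the role of the value returned by scx_cgroup_bw_reserve(), and `sum_cb()` stands in for the scheduler's registered enqueue_cb().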

Signed-off-by: Changwoo Min <changwoo@igalia.com>
We replenish the cgroup's budget once per period (nperiod, 100 ms). Only
subroot cgroups (level == 1) distribute budget to their descendants; we
replenish the budget only for subroot cgroups with a limited quota. Unused
budget can accumulate up to the specified burst, while overused budget is
charged against the following intervals.
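
The burst and overuse rules can be sketched in a few lines of userspace C. This is an illustration under assumptions, not the patch's replenish code: `struct subroot_budget` and `replenish()` are hypothetical, and capping the budget at quota plus burst is one plausible interpretation of "accumulated by the burst specified".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-subroot budget state, all values in ns. */
struct subroot_budget {
    int64_t quota;  /* budget granted every period */
    int64_t burst;  /* extra unused budget that may be carried over */
    int64_t budget; /* current budget; negative means overuse (debt) */
};

/* Called once per period for a subroot cgroup with a limited quota. */
static void replenish(struct subroot_budget *b)
{
    int64_t cap;

    assert(b->quota >= 0 && b->burst >= 0);
    cap = b->quota + b->burst;

    /* Adding the quota to a negative budget pays the debt down first. */
    b->budget += b->quota;

    /* Unused budget carries over, but never beyond quota + burst. */
    if (b->budget > cap)
        b->budget = cap;
}
```

With a quota of 100 and a burst of 50, an untouched budget of 100 becomes 150 rather than 200 after the next period. A budget of -130 takes two periods to become positive again (-30, then 70).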

The replenish timer is split into two parts: the top half and the bottom
half. The top half -- the actual BPF timer function (replenish_timerfn)
-- runs the essential, critical part, such as refilling the time budget.
On the other hand, the bottom half -- scx_cgroup_bw_reenqueue() -- runs
on a BPF scheduler's ops.dispatch() and requeues the backlogged tasks to
proper DSQs.
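
The top-half/bottom-half split can be sketched with C11 atomics in userspace. This only shows the hand-off pattern. The real code runs in BPF with its own primitives, and `struct bw_state`, `bw_state_init()`, `replenish_top_half()`, and `reenqueue_bottom_half()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state between the timer and the dispatch path. */
struct bw_state {
    int64_t quota;                 /* budget granted per period, in ns */
    _Atomic int64_t budget;        /* budget currently available, in ns */
    atomic_bool reenqueue_pending; /* set by the top half, cleared by the bottom half */
};

static void bw_state_init(struct bw_state *s, int64_t quota)
{
    assert(quota >= 0);
    s->quota = quota;
    atomic_init(&s->budget, 0);
    atomic_init(&s->reenqueue_pending, false);
}

/* Top half (timer context): do only the critical refill and leave a flag. */
static void replenish_top_half(struct bw_state *s)
{
    atomic_store(&s->budget, s->quota);
    atomic_store(&s->reenqueue_pending, true);
}

/*
 * Bottom half (dispatch path): returns true exactly once per refill, telling
 * the caller that backlogged tasks should now be requeued.
 */
static bool reenqueue_bottom_half(struct bw_state *s)
{
    return atomic_exchange(&s->reenqueue_pending, false);
}
```

Keeping the timer function this short and deferring the requeue work to the dispatch path mirrors the split described above. The atomic exchange ensures that only one dispatcher picks up each refill.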

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Under CPU bandwidth control using cpu.max, we also need to report how much
time was actually consumed compared to the reserved time.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
We updated the ops.enqueue(), enqueue callback, and ops.dispatch() paths.

Under CPU bandwidth control using cpu.max, we should first reserve time
for execution. If we succeed in reserving the time, we will go ahead.
Otherwise, we should set the task aside for later execution. When triggered
by lavd_enqueue_cb(), we should still enqueue the task even if the time
reservation fails (-EAGAIN).

Note that we do not throttle the scheduler process itself to guarantee
forward progress.
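
The rules above reduce to a small decision function. The userspace C sketch below uses hypothetical names (`enum enq_action`, `decide_enqueue()`) and assumes the reservation result is either 0 or -EAGAIN; it is not the scx_lavd code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum enq_action {
    ENQ_DISPATCH, /* go ahead and enqueue the task for execution */
    ENQ_BACKLOG,  /* set the task aside until the cgroup is unthrottled */
};

/*
 * reserve_ret:   result of the time reservation (0 or -EAGAIN assumed)
 * from_cb:       true when called from the enqueue callback
 * is_sched_proc: true for the scheduler process itself
 */
static enum enq_action decide_enqueue(int reserve_ret, bool from_cb,
                                      bool is_sched_proc)
{
    assert(reserve_ret == 0 || reserve_ret == -EAGAIN);

    /* The scheduler process is never throttled, to guarantee forward progress. */
    if (is_sched_proc)
        return ENQ_DISPATCH;

    /* The reservation succeeded: proceed normally. */
    if (reserve_ret == 0)
        return ENQ_DISPATCH;

    /* The callback path enqueues even when the reservation fails. */
    if (from_cb)
        return ENQ_DISPATCH;

    return ENQ_BACKLOG;
}
```

Only a failed reservation on the regular enqueue path, for a task that is not the scheduler itself, results in the task being set aside.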

Signed-off-by: Changwoo Min <changwoo@igalia.com>
etsal commented Sep 27, 2025

Closing since the merge was accidentally pointed to main.

@etsal etsal closed this Sep 27, 2025