fix(BA-3308): Support multi-GPU fractional allocation in anti-fragmentation guard by seedspirit · Pull Request #10477 · lablup/backend.ai

seedspirit · 2026-03-24T09:01:09Z

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

Copilot

Pull request overview

This PR updates the agent’s fractional allocation anti-fragmentation guard to allow multi-device fractional GPU allocations (BA-3308 / issue #275), and adds unit tests intended to validate the revised behavior.

Changes:

Revise FractionAllocMap.ensure_slot_not_fragmented() to evaluate multi-device feasibility using per-device “density” and quantum rounding.
Add an extensive unit-test matrix for density quantization and expected device usage under FILL/EVENLY strategies (plus occupied-device scenarios and edge cases).
Add a changelog entry describing the fix.
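To make the idea of a multi-device feasibility check with quantum rounding concrete, here is a minimal sketch. It is not the PR's implementation and does not model the "density" term mentioned above. The helper names (`usable_amount`, `fits_single_device`, `fits_across_devices`) and the sample numbers are hypothetical, chosen only to show how a request that no single device can hold may still fit once per-device remainders are rounded down to the quantum and summed.

```python
from decimal import Decimal, ROUND_DOWN


def usable_amount(remaining: Decimal, quantum: Decimal) -> Decimal:
    """Largest multiple of `quantum` that fits into `remaining` (hypothetical helper)."""
    return (remaining / quantum).to_integral_value(rounding=ROUND_DOWN) * quantum


def fits_single_device(requested: Decimal, device_remaining: list[Decimal], quantum: Decimal) -> bool:
    """Single-device view: some one device must hold the whole request."""
    return any(usable_amount(r, quantum) >= requested for r in device_remaining)


def fits_across_devices(requested: Decimal, device_remaining: list[Decimal], quantum: Decimal) -> bool:
    """Multi-device view: the quantum-aligned remainders together must cover the request."""
    total = sum((usable_amount(r, quantum) for r in device_remaining), Decimal(0))
    return total >= requested


# Assumed sample data: three devices with 0.55 left each, quantum of 0.1.
remaining = [Decimal("0.55")] * 3
quantum = Decimal("0.1")

single_ok = fits_single_device(Decimal("1.5"), remaining, quantum)  # no device holds 1.5
multi_ok = fits_across_devices(Decimal("1.5"), remaining, quantum)  # 0.5 + 0.5 + 0.5 covers 1.5
too_big = fits_across_devices(Decimal("1.6"), remaining, quantum)   # raw sum is 1.65, usable sum is 1.5
```

Note that the raw remainders sum to 1.65, yet a request of 1.6 is rejected because only 0.5 per device is usable at a 0.1 quantum. The sketch takes a list of per-device remainders, so it makes no assumption that devices have equal capacity.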

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`src/ai/backend/agent/alloc_map.py`	Reworks the anti-fragmentation guard to support multi-device fractional allocations.
`tests/unit/agent/test_alloc_map.py`	Adds new test suites/cases covering defrag density math, strategies, occupied devices, and edge cases.
`changes/10477.fix.md`	Documents the bugfix in the changelog.

Review notes (blockers):

The new guard explicitly assumes homogeneous per-device capacity, but the codebase can produce heterogeneous DeviceSlotInfo.amount values (e.g., mock accelerator’s _get_share_raw() varies by device). This can cause false rejections of otherwise feasible allocations.
The guard’s “remainder/quantum” reasoning is described as matching distribute_evenly, but distribute_evenly operates at self.digits precision (0.01), while final rounding uses quantum_size (often 0.1 for CUDA shares). Because of this mismatch, the guard can admit allocations that round_down(..., quantum_size) later truncates, so the sum of the returned allocation no longer equals the requested amount. The newly added tests do not currently assert “sum allocated == requested” for the parametrized FILL/EVENLY cases.
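The precision mismatch in the second note can be reproduced with a small, self-contained sketch. The two functions below are hypothetical stand-ins written for illustration, not the project's distribute_evenly or round_down; the values 0.01 and 0.1 are the ones quoted in the note, and the request of 1.0 across three devices is an assumed example.

```python
from decimal import Decimal, ROUND_DOWN


def split_evenly_sketch(total: Decimal, n: int, digits: Decimal) -> list[Decimal]:
    """Hypothetical stand-in: split `total` into `n` shares at `digits` precision."""
    base = (total / n).quantize(digits, rounding=ROUND_DOWN)
    shares = [base] * n
    # Hand the leftover out one `digits` step at a time, starting from the first share.
    leftover_steps = int((total - base * n) / digits)
    for i in range(leftover_steps):
        shares[i] += digits
    return shares


def round_down_sketch(value: Decimal, quantum: Decimal) -> Decimal:
    """Hypothetical stand-in: truncate `value` to a multiple of `quantum`."""
    return (value / quantum).to_integral_value(rounding=ROUND_DOWN) * quantum


requested = Decimal("1.0")
shares = split_evenly_sketch(requested, 3, Decimal("0.01"))    # 0.34, 0.33, 0.33
rounded = [round_down_sketch(s, Decimal("0.1")) for s in shares]  # 0.3, 0.3, 0.3

sum_before = sum(shares, Decimal(0))   # still equals the request
sum_after = sum(rounded, Decimal(0))   # 0.1 short of the request
```

An assertion of the form `sum(allocated) == requested` in the parametrized tests would catch this kind of shortfall: it holds for `shares` here and fails for `rounded`.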

src/ai/backend/agent/alloc_map.py

- Replace indirect fixture pattern with explicit device_remaining list
- Split tests into separate FILL and EVENLY strategy methods
- Add strategy parameter to _make_map_with_remaining helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fregataa

This change does not address the original issue. (Please also update the PR description; it is not related to #275.)
It only changes the ensure_slot_not_fragmented() function, which checks whether a given agent has enough resources without fragmentation.
We have to update the _allocate_by_filling / _allocate_evenly allocator functions.

seedspirit added this to the 25.15 milestone Mar 24, 2026

seedspirit self-assigned this Mar 24, 2026

github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels Mar 24, 2026

seedspirit added a commit that referenced this pull request Mar 24, 2026

changelog: add news fragment for PR #10477

891ae25

seedspirit requested review from a team, achimnol and kyujin-cho March 24, 2026 09:16

seedspirit marked this pull request as ready for review March 24, 2026 09:16

Copilot AI review requested due to automatic review settings March 24, 2026 09:16

Copilot started reviewing on behalf of seedspirit March 24, 2026 09:16

Copilot AI reviewed Mar 24, 2026

src/ai/backend/agent/alloc_map.py (Outdated)

seedspirit and others added 7 commits March 25, 2026 15:23

fix(BA-3308): replace single-device fragmentation check with multi-de…

d2580f9

…vice aware guard

test(BA-3308): use power-of-2 GPU counts and quantum-aligned requests…

101d5ee

… in defrag tests

changelog: add news fragment for PR #10477

6bf0c64

fix: Rollback sorted_dev_allocs negative num check

c1c9e8f

fix(BA-3308): implement Algorithm 2 incremental-N anti-fragmentation …

2b85201

…guard

fix(BA-3308): restrict N-increment to multi-device requests only

4a342cc

seedspirit force-pushed the fix/BA-3308 branch from c798eb6 to 4a342cc Compare March 25, 2026 06:23

fregataa requested changes Mar 25, 2026

seedspirit wants to merge 7 commits into main from fix/BA-3308

3 participants