fix(BA-3308): Support multi-GPU fractional allocation in anti-fragmentation guard#10477

Open
seedspirit wants to merge 7 commits into main from fix/BA-3308
@seedspirit
Contributor

@seedspirit seedspirit commented Mar 24, 2026

resolves #275 (BA-3308)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@seedspirit seedspirit added this to the 25.15 milestone Mar 24, 2026
@seedspirit seedspirit self-assigned this Mar 24, 2026
@github-actions github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels Mar 24, 2026
seedspirit added a commit that referenced this pull request Mar 24, 2026
@seedspirit seedspirit requested review from a team, achimnol and kyujin-cho March 24, 2026 09:16
@seedspirit seedspirit marked this pull request as ready for review March 24, 2026 09:16
Copilot AI review requested due to automatic review settings March 24, 2026 09:16
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the agent’s fractional allocation anti-fragmentation guard to allow multi-device fractional GPU allocations (BA-3308 / issue #275), and adds unit tests intended to validate the revised behavior.

Changes:

  • Revise FractionAllocMap.ensure_slot_not_fragmented() to evaluate multi-device feasibility using per-device “density” and quantum rounding.
  • Add an extensive unit-test matrix for density quantization and expected device usage under FILL/EVENLY strategies (plus occupied-device scenarios and edge cases).
  • Add a changelog entry describing the fix.
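
To make the idea of quantum-aware multi-device feasibility concrete, here is a minimal sketch. It is not the PR's `ensure_slot_not_fragmented()` implementation, and the PR's "density" computation is not shown in this conversation. The helper names `usable` and `fits_across_devices` are hypothetical, and the only assumption is that each device can contribute no more than its remaining capacity rounded down to the quantum.

```python
from decimal import Decimal, ROUND_DOWN


def usable(remaining: Decimal, quantum: Decimal) -> Decimal:
    """Round a device's remaining capacity down to a multiple of quantum."""
    steps = (remaining / quantum).to_integral_value(rounding=ROUND_DOWN)
    return steps * quantum


def fits_across_devices(
    requested: Decimal,
    device_remaining: list[Decimal],
    quantum: Decimal,
) -> bool:
    """Hypothetical check: can the request be covered by quantum-sized shares?"""
    total = sum((usable(r, quantum) for r in device_remaining), Decimal(0))
    return requested <= total


# Three devices whose raw remaining capacity sums to 1.60.
remaining = [Decimal("0.55"), Decimal("0.55"), Decimal("0.50")]
quantum = Decimal("0.1")

fits_1_5 = fits_across_devices(Decimal("1.5"), remaining, quantum)
fits_1_6 = fits_across_devices(Decimal("1.6"), remaining, quantum)
```

In this example the raw capacities sum to 1.60, but only 0.5 per device is usable at a 0.1 quantum, so a request for 1.6 is not feasible while a request for 1.5 is.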

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/ai/backend/agent/alloc_map.py | Reworks the anti-fragmentation guard to support multi-device fractional allocations. |
| tests/unit/agent/test_alloc_map.py | Adds new test suites/cases covering defrag density math, strategies, occupied devices, and edge cases. |
| changes/10477.fix.md | Documents the bugfix in the changelog. |

Review notes (blockers):

  • The new guard explicitly assumes homogeneous per-device capacity, but the codebase can produce heterogeneous DeviceSlotInfo.amount values (e.g., mock accelerator’s _get_share_raw() varies by device). This can cause false rejections of otherwise feasible allocations.
  • The guard’s “remainder/quantum” reasoning is described as matching distribute_evenly, but distribute_evenly operates at self.digits precision (0.01), while final rounding uses quantum_size (often 0.1 for CUDA shares). Because of this mismatch, the guard can admit an allocation that round_down(..., quantum_size) later truncates, so the sum of the returned allocation no longer equals the requested amount. The newly added parametrized FILL/EVENLY tests do not currently assert “sum allocated == requested”.
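
The precision mismatch in the second note can be reproduced with plain `Decimal` arithmetic. The sketch below is an illustration only: `round_down` and `split_evenly` are hypothetical stand-ins written for this example, not the project's `round_down` or `distribute_evenly`, and the values 0.01 and 0.1 are taken from the note above.

```python
from decimal import Decimal, ROUND_DOWN


def round_down(value: Decimal, quantum: Decimal) -> Decimal:
    """Truncate value to a multiple of quantum (stand-in helper)."""
    return (value / quantum).to_integral_value(rounding=ROUND_DOWN) * quantum


def split_evenly(requested: Decimal, n_devices: int, digits: Decimal) -> list[Decimal]:
    """Split a request across devices at `digits` precision (stand-in helper).

    Any remainder left after truncation is handed out one `digits` step
    at a time, starting from the first device.
    """
    base = round_down(requested / n_devices, digits)
    shares = [base] * n_devices
    remainder = requested - base * n_devices
    i = 0
    while remainder > 0:
        shares[i] += digits
        remainder -= digits
        i = (i + 1) % n_devices
    return shares


requested = Decimal("1.5")
shares = split_evenly(requested, 2, Decimal("0.01"))          # split at 0.01 precision
truncated = [round_down(s, Decimal("0.1")) for s in shares]   # final rounding at 0.1

lost = requested - sum(truncated)
```

Here each device first receives 0.75, which truncates to 0.7 at a 0.1 quantum, so the final allocation sums to 1.4 rather than the requested 1.5. An assertion of the form `sum(allocated) == requested` in the parametrized tests would catch this class of case.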

seedspirit and others added 7 commits March 25, 2026 15:23
- Replace indirect fixture pattern with explicit device_remaining list
- Split tests into separate FILL and EVENLY strategy methods
- Add strategy parameter to _make_map_with_remaining helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member

@fregataa fregataa left a comment

This task does not address the original issue. (Please also update the PR description; it is not related to #275.)
It only changes the ensure_slot_not_fragmented() function, which checks whether a given agent has enough resources without fragmentation.
We have to update the _allocate_by_filling / _allocate_evenly allocator functions.

Labels

comp:agent Related to Agent component size:XL 500~ LoC

3 participants