
Enforce adapters cannot be loaded past --adapter-memory-fraction #306

Merged: 9 commits from no-max-adapters into main on Mar 7, 2024

Conversation

@tgaddair (Contributor) commented Mar 6, 2024

Fixes #53.

One of the main issues with our existing adapter loading and offloading strategy is that it relies on the user setting a fixed adapter limit via --max-active-adapters (default: 128). However, this doesn't account for the fact that adapter sizes can vary by orders of magnitude based on the rank. As such, the server might do well with 128 rank-8 adapters but fall over with just a handful of rank-128 adapters.

In #303, we introduced a new parameter --adapter-memory-fraction that allows setting aside a dedicated pool of GPU memory to account for adapter overhead. This prevents the KV cache from expanding past the reservation set aside for adapters, reducing memory pressure. However, because the LoRAX scheduler still works off of the max active adapters, this means that users can still blow up the GPU memory by setting max active adapters higher than what can be accommodated by the adapter memory fraction.

This PR reconciles the scheduler with the adapter memory fraction. Now, the LoRAX scheduler will look at the size of each adapter after download to determine whether it can be loaded safely on the GPU. If not, then the adapter will wait until enough space is freed up by other active adapters becoming idle before it can be moved to the GPU. This should ensure that no CUDA OOMs occur due to loading too many adapters.

Note that with this change, the max active adapters may no longer be needed, but we will keep it around for now to avoid backwards incompatible changes. However, the new default of 1024 should be sufficiently high that it won't be used in most cases.

The new default --adapter-memory-fraction will be 0.1, meaning 10% of GPU memory will be set aside for adapters. To go back to the previous behavior, users can set the following parameters:

--max-active-adapters 128 --adapter-memory-fraction 0.0

But in general, it is recommended to avoid modifying --max-active-adapters going forward and instead tune --adapter-memory-fraction to find the right balance between KV cache size and the number of concurrent adapters.
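As a rough sketch of the scheduling rule described above (a simplified Python illustration, not the router's actual Rust implementation; the class and method names are placeholders), the scheduler admits an adapter only while the summed costs of active adapters fit within the reservation:

class AdapterScheduler:
    def __init__(self):
        # Costs are expressed as fractions of the adapter memory pool,
        # so the pool itself is normalized to a capacity of 1.0.
        self.active_cost = 0.0

    def try_activate(self, adapter_cost: float) -> bool:
        # Admit the adapter only if it fits in the remaining reservation;
        # otherwise it waits until active adapters go idle and are offloaded.
        if self.active_cost + adapter_cost > 1.0:
            return False
        self.active_cost += adapter_cost
        return True

    def offload(self, adapter_cost: float) -> None:
        # Return the adapter's share of the pool once it is offloaded.
        self.active_cost = max(0.0, self.active_cost - adapter_cost)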

@@ -81,6 +81,9 @@ def info(self) -> InfoResponse:
@abstractmethod
def batch_type(self) -> Type[B]:
raise NotImplementedError

def adapter_memory_size(self) -> int:
@tgaddair (Contributor, author):

We do want to return 0 here for other model types (like Bloom) to ensure they still work, even though we won't be enforcing the adapter memory reservation for them.
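For context, a minimal sketch of the pattern being described (assumed names, not the verbatim diff): the base model returns 0 so non-LoRA model types opt out of enforcement, while a LoRA-capable model would override it with a real reservation.

import torch

ADAPTER_MEMORY_FRACTION = 0.1  # assumed to mirror --adapter-memory-fraction

class Model:
    def adapter_memory_size(self) -> int:
        # Default: no adapter reservation for model types that don't support
        # dynamic adapter loading (e.g. Bloom); returning 0 disables enforcement.
        return 0

class FlashModel(Model):
    def adapter_memory_size(self) -> int:
        # Hypothetical override for LoRA-capable models: reserve a fixed
        # fraction of the current CUDA device's total memory for adapters.
        total = torch.cuda.get_device_properties(torch.cuda.current_device()).total_memory
        return int(ADAPTER_MEMORY_FRACTION * total)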

// Add back cost for all offload adapters
for adapter in offload_adapters.iter() {
let queue = self.queue_map.get(adapter).unwrap().clone();
let cost = queue.cost().unwrap();
Reviewer (Contributor):

Does this need a None check?

@tgaddair (Contributor, author):

I had one originally, but we never add to the active set until the cost is non-None (see the check below). So if this condition is violated, it would be a programming error.
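Put differently, the invariant is roughly this (a Python paraphrase for illustration, not the router's Rust code; the names are made up):

from typing import Dict, Optional

class AdapterQueue:
    def __init__(self, cost: Optional[float] = None):
        # cost stays None until the adapter has been downloaded and sized.
        self.cost = cost

def activate(active_set: Dict[str, AdapterQueue], adapter_id: str, queue: AdapterQueue) -> None:
    # An adapter only enters the active set once its cost is known, so code
    # that later reads the cost of an active adapter can assume it is set.
    if queue.cost is None:
        raise RuntimeError("programming error: adapter activated without a known cost")
    active_set[adapter_id] = queue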


return generate_pb2.DownloadAdapterResponse(
downloaded=True,
memory_fraction=adapter_memory_fraction
Reviewer (Contributor):

Just curious - why convert to a fraction instead of passing the actual cost?

@tgaddair (Contributor, author):

That way I don't have to plumb the reservation amount through to the router; for simplicity, I just always work with 1 as the reservation amount.
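Concretely, the server-side computation might look something like this (a hedged sketch with assumed names, not the exact code in this PR):

def compute_memory_fraction(adapter_size_bytes: int, reservation_bytes: int) -> float:
    # Normalize the adapter's size by the server's adapter memory reservation,
    # so the router can treat the pool as having capacity 1.0 and never needs
    # to know the absolute reservation size in bytes.
    if reservation_bytes == 0:
        # No reservation enforced (model types whose adapter_memory_size() is 0):
        # report zero cost so such adapters are always admitted.
        return 0.0
    return adapter_size_bytes / reservation_bytes

For example, a 512 MB adapter against a 4 GiB reservation would report a fraction of 0.125, so at most eight adapters of that size could be active at once.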

@tgaddair merged commit 6dea404 into main on Mar 7, 2024
2 checks passed
@tgaddair deleted the no-max-adapters branch March 7, 2024 00:14
Development

Successfully merging this pull request may close these issues.

Unexpected CUDA out of memory errors