
Enforce adapters cannot be loaded past --adapter-memory-fraction #306

Merged: 9 commits from no-max-adapters into main on Mar 7, 2024

Conversation

@tgaddair (Contributor) commented Mar 6, 2024

Fixes #53.

One of the main issues with our existing adapter loading and offloading strategy is that it relies on the user setting a fixed adapter limit via --max-active-adapters (default: 128). However, this doesn't account for the fact that adapter sizes can vary by orders of magnitude based on the rank. As such, the server might do well with 128 rank-8 adapters but fall over with just a handful of rank-128 adapters.

In #303, we introduced a new parameter --adapter-memory-fraction that allows setting aside a dedicated pool of GPU memory to account for adapter overhead. This prevents the KV cache from expanding past the reservation set aside for adapters, reducing memory pressure. However, because the LoRAX scheduler still works off of the max active adapters, this means that users can still blow up the GPU memory by setting max active adapters higher than what can be accommodated by the adapter memory fraction.

This PR reconciles the scheduler with the adapter memory fraction. Now, the LoRAX scheduler will look at the size of each adapter after download to determine whether it can be loaded safely on the GPU. If not, then the adapter will wait until enough space is freed up by other active adapters becoming idle before it can be moved to the GPU. This should ensure that no CUDA OOMs occur due to loading too many adapters.

Note that with this change, the max active adapters may no longer be needed, but we will keep it around for now to avoid backwards incompatible changes. However, the new default of 1024 should be sufficiently high that it won't be used in most cases.

The new default --adapter-memory-fraction will be 0.1, meaning 10% of GPU memory will be set aside for adapters. To go back to the previous behavior, users can set the following parameters:

--max-active-adapters 128 --adapter-memory-fraction 0.0

But in general, it is recommended to avoid modifying --max-active-adapters going forward and instead tune --adapter-memory-fraction to find the right balance between KV cache size and the number of concurrent adapters.
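As a rough sketch of the scheduling rule described above (a simplified Python illustration, not the router's actual Rust implementation; the class and method names are placeholders), the scheduler admits an adapter only while the summed costs of active adapters fit within the reservation:

class AdapterScheduler:
    def __init__(self):
        # Costs are expressed as fractions of the adapter memory pool,
        # so the pool itself is normalized to a capacity of 1.0.
        self.active_cost = 0.0

    def try_activate(self, adapter_cost: float) -> bool:
        # Admit the adapter only if it fits in the remaining reservation;
        # otherwise it waits until active adapters go idle and are offloaded.
        if self.active_cost + adapter_cost > 1.0:
            return False
        self.active_cost += adapter_cost
        return True

    def offload(self, adapter_cost: float) -> None:
        # Return the adapter's share of the pool once it is offloaded.
        self.active_cost = max(0.0, self.active_cost - adapter_cost)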

@@ -81,6 +81,9 @@ def info(self) -> InfoResponse:
@abstractmethod
def batch_type(self) -> Type[B]:
raise NotImplementedError

def adapter_memory_size(self) -> int:
@tgaddair (Contributor, author):

We do want to return 0 here for other model types (like Bloom) to ensure they still work, even though we won't be enforcing the adapter memory reservation for them.
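For context, a minimal sketch of the pattern being described (assumed names, not the verbatim diff): the base model returns 0 so non-LoRA model types opt out of enforcement, while a LoRA-capable model would override it with a real reservation.

import torch

ADAPTER_MEMORY_FRACTION = 0.1  # assumed to mirror --adapter-memory-fraction

class Model:
    def adapter_memory_size(self) -> int:
        # Default: no adapter reservation for model types that don't support
        # dynamic adapter loading (e.g. Bloom); returning 0 disables enforcement.
        return 0

class FlashModel(Model):
    def adapter_memory_size(self) -> int:
        # Hypothetical override for LoRA-capable models: reserve a fixed
        # fraction of the current CUDA device's total memory for adapters.
        total = torch.cuda.get_device_properties(torch.cuda.current_device()).total_memory
        return int(ADAPTER_MEMORY_FRACTION * total)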

// Add back cost for all offload adapters
for adapter in offload_adapters.iter() {
let queue = self.queue_map.get(adapter).unwrap().clone();
let cost = queue.cost().unwrap();
Reviewer (Contributor):

Does this need a None check?

@tgaddair (Contributor, author):

I had one originally, but we never add to the active set until the cost is non-None (see the check below). So if this condition is violated, it would be a programming error.
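Put differently, the invariant is roughly this (a Python paraphrase for illustration, not the router's Rust code; the names are made up):

from typing import Dict, Optional

class AdapterQueue:
    def __init__(self, cost: Optional[float] = None):
        # cost stays None until the adapter has been downloaded and sized.
        self.cost = cost

def activate(active_set: Dict[str, AdapterQueue], adapter_id: str, queue: AdapterQueue) -> None:
    # An adapter only enters the active set once its cost is known, so code
    # that later reads the cost of an active adapter can assume it is set.
    if queue.cost is None:
        raise RuntimeError("programming error: adapter activated without a known cost")
    active_set[adapter_id] = queue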


return generate_pb2.DownloadAdapterResponse(
downloaded=True,
memory_fraction=adapter_memory_fraction
Reviewer (Contributor):

Just curious - why convert to a fraction instead of passing the actual cost?

@tgaddair (Contributor, author):

That way I don't have to plumb the reservation amount through to the router; for simplicity, I just always work with 1 as the reservation amount.
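Concretely, the server-side computation might look something like this (a hedged sketch with assumed names, not the exact code in this PR):

def compute_memory_fraction(adapter_size_bytes: int, reservation_bytes: int) -> float:
    # Normalize the adapter's size by the server's adapter memory reservation,
    # so the router can treat the pool as having capacity 1.0 and never needs
    # to know the absolute reservation size in bytes.
    if reservation_bytes == 0:
        # No reservation enforced (model types whose adapter_memory_size() is 0):
        # report zero cost so such adapters are always admitted.
        return 0.0
    return adapter_size_bytes / reservation_bytes

For example, a 512 MB adapter against a 4 GiB reservation would report a fraction of 0.125, so at most eight adapters of that size could be active at once.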

@tgaddair merged commit 6dea404 into main on Mar 7, 2024
2 checks passed
@tgaddair deleted the no-max-adapters branch March 7, 2024 00:14
Development

Successfully merging this pull request may close these issues.

Unexpected CUDA out of memory errors