How does this differ from S-LoRA? #90
Great question, @priyankat99! The short answer is that LoRAX is intended to be a production LLM serving system for handling many fine-tuned LLMs at once, while S-LoRA is a novel algorithm that can allow multi-adapter inference to scale well under heavy load. As an analogy, LoRAX is more like fully-featured inference servers like TGI, vLLM, etc., while S-LoRA is closer to specific algorithms used by those systems like Paged Attention or Flash Attention. In the S-LoRA repo, they do implement a proof of concept of a fully working serving system, but it has a lot of caveats at the moment. Compared to S-LoRA, LoRAX:
I do want to caveat this by saying that this isn't meant to disparage S-LoRA; the goals of the two projects are quite different. There doesn't seem to be a desire to fix these issues in the S-LoRA repo itself; rather, the authors have stated that their intention is to add S-LoRA's capabilities to the existing vLLM project. So if and when that happens, many of these issues may be resolved.

I'll also say that one thing S-LoRA does that LoRAX does not today is use an optimized multi-rank kernel (efficiently handling adapters of different ranks in a single batch). We support multi-rank batches as well, but their kernel is highly optimized for this use case, and our plan is to add their kernel to LoRAX in the near future.

Looking forward, we'll need to see how the integration with vLLM shakes out, though it's not clear what the timeline is on that. I will say, though, that we also considered adding these features to an existing LLM inference library, rather than forking TGI and building LoRAX as a separate project, but found that there were significant architecture changes that needed to be made to support dynamic adapter inference. Going forward, we plan to do a lot more with this idea, including supporting embedding model adapters, classification head adapters, etc. All of this means that I suspect LoRAX will continue to diverge from general-purpose LLM inference systems like vLLM over time, rather than these systems converging cleanly. But time will tell! Hope that answers your question.
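To make the multi-rank point concrete, here is a minimal pure-Python sketch (my own illustration, not the S-LoRA kernel or LoRAX internals) of serving adapters of different ranks within a single batch: do the shared base matmul once for all rows, then apply each adapter's low-rank update only to the rows that requested it.

```python
# Illustrative sketch only: not the S-LoRA kernel, just the idea of
# handling adapters of *different ranks* within a single batch.

def matmul(a, b):
    """Plain-Python matrix multiply, kept simple for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Shared base weight (2x2 identity) and two toy adapters of rank 1 and rank 2.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "rank1": ([[1.0], [0.0]], [[0.5, 0.5]]),                        # A: 2x1, B: 1x2
    "rank2": ([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.0], [0.0, 0.1]]),  # A: 2x2, B: 2x2
}

def batched_lora(x, adapter_ids):
    """x: batch of row vectors; adapter_ids: one adapter id per row."""
    y = matmul(x, W)                        # one shared base matmul for all rows
    for aid in set(adapter_ids):
        rows = [i for i, a in enumerate(adapter_ids) if a == aid]
        A, B = adapters[aid]
        delta = matmul(matmul([x[i] for i in rows], A), B)
        for r, d in zip(rows, delta):       # scatter low-rank updates back
            y[r] = [v + w for v, w in zip(y[r], d)]
    return y
```

The real kernels fuse this grouping into a single GPU launch; the point here is only that rows using adapters of different ranks can still share one base-model pass.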
Ah ok, thanks so much! This was very helpful and very interesting stuff!
Hi @priyankat99, yes, LoRAX is a standalone solution for serving. In some cases it could make sense to, for example, serve your base models with vLLM and your fine-tuned LoRAs with LoRAX, but LoRAX allows you to do both, and we include many of the main features of those libraries, such that LoRAX should be at or near performance parity with those systems (even without multi-LoRA serving):
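For anyone comparing, here is a rough sketch of what a LoRAX request looks like. The payload shape follows the TGI-style `/generate` endpoint that LoRAX inherits; the adapter name below is made up, and exact field names may vary by version, so check the docs.

```python
import json

def build_request(prompt, adapter_id=None, max_new_tokens=64):
    """Build a /generate-style payload; omitting adapter_id hits the base model."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id   # adapter is loaded dynamically per request
    return {"inputs": prompt, "parameters": parameters}

# Same prompt, once against the base model and once against a
# hypothetical fine-tuned adapter.
base = build_request("Explain LoRA in one sentence.")
tuned = build_request("Explain LoRA in one sentence.",
                      adapter_id="my-org/my-finetune")
print(json.dumps(tuned))
```

POSTing either payload to the same running server reuses the same base model weights; only the `adapter_id` differs between requests.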
Hi @tgaddair, just noticed this project in one of the TGI issues. It is amazing.
I would like to check if it is possible to decouple the adapters and the base model in LoRAX (two micro-services). Example: multiple LoRAX instances hosting multiple adapters, with each instance connected to a different shared base model (Llama, Mistral, etc., hosted via LoRAX, vLLM, HF, etc.). The use case is we have
Hi @lizzzcai, your use case sounds very similar to the one we have at Predibase. I'd definitely love to collaborate on this to make sure it's well supported for you. It's certainly possible to have one LoRAX instance per tenant, but I consider that pretty heavyweight. What we opt for instead is having the user provide an access token per request that uses an adapter, using the LoRAX parameter

One thing we could consider doing is allowing the user to pass their S3 credentials using this same
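For reference, the per-request token approach could look something like the sketch below. I am assuming the parameter is `api_token` (the name mentioned later in this thread); treat the field placement as illustrative, not authoritative, and the adapter id and token values as placeholders.

```python
def build_tenant_request(prompt, adapter_id, api_token):
    """Sketch: each tenant sends a scoped token alongside its private adapter id."""
    return {
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,   # tenant's private fine-tuned adapter
            "api_token": api_token,     # assumed parameter name; verify in the docs
        },
    }

# Hypothetical tenant request: both the adapter id and the token are placeholders.
req = build_tenant_request("Summarize this ticket.",
                           "tenant-a/support-adapter",
                           "hf_xxx")
```

The appeal of this design is that a single shared LoRAX deployment can enforce per-tenant access at request time, instead of running one instance per tenant.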
Thanks @tgaddair for your reply. I think the existing solution and this api_token approach already help a lot! I am happy to discuss it further; I will drop you an email. Thanks.
Could the user load more than one adapter at a time, so I could serve multiple tasks for different customers with multiple adapters? I just found that adapter_id's format is a str, not a list. There is a way to do this with merged adapters, but that might have interference problems; will the interference be obvious? Checking for interference might be a bit troublesome.
Hey @gjfdklfjd, yes, LoRAX can serve multiple tasks to different customers at the same time. When you say you want to provide multiple adapters per request, what is the expected behavior in that case? Do you want the adapters to be merged together?
Hi @tgaddair, thank you for your reply. Right now I need to serve multiple tasks to different customers at the same time. A merged adapter might be an alternative if independent adapters do not work. I just want to use one base model with multiple independent adapters, which together take up only about as much GPU memory as the base model alone. That way I could save GPU memory while serving different customers with their corresponding adapters. How can I do this? Thank you.
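One way to picture how a single base model can serve many independent adapters: the base weights stay resident once, while adapters are loaded on demand and evicted when a slot budget is hit. This is a toy illustration of that idea, not LoRAX's actual implementation; the class name and slot budget are invented for the example.

```python
from collections import OrderedDict

# Toy sketch (not LoRAX internals): one base model stays in GPU memory,
# while small adapters are loaded on demand and evicted LRU-style.
class AdapterCache:
    def __init__(self, max_loaded=2):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()   # adapter_id -> adapter weights (stand-in)

    def get(self, adapter_id):
        if adapter_id in self.loaded:
            self.loaded.move_to_end(adapter_id)        # mark as recently used
        else:
            if len(self.loaded) >= self.max_loaded:
                self.loaded.popitem(last=False)        # evict least recently used
            self.loaded[adapter_id] = f"weights:{adapter_id}"  # stand-in for loading
        return self.loaded[adapter_id]
```

Because each adapter is tiny relative to the base model, many customers can share one deployment; each request simply names its own adapter, and the cache keeps the hot ones resident.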
Really cool project! I'm wondering how it's different from S-LoRA? https://github.com/S-LoRA/S-LoRA